Python Web Scraping for Beginners: BeautifulSoup Tutorial
You need data from a website. You could copy-paste for hours. Or you could write 20 lines of Python that extract everything in seconds and can run automatically whenever you need updated data.
Web scraping transforms you from a data consumer to a data collector. Today you'll learn BeautifulSoup, Python's most beginner-friendly web scraping library.
What You'll Learn
- Web scraping fundamentals and ethics
- Installing and using BeautifulSoup
- Finding and extracting HTML elements
- Cleaning and organizing scraped data
- Handling common challenges and errors
- Building a complete scraping project
What is Web Scraping?
Web scraping is automatically extracting data from websites using code.
Common use cases:
- Price monitoring for products
- Job posting aggregation
- Real estate listing collection
- News article monitoring
- Research data collection
- Competitor analysis
When scraping is appropriate:
✅ Public data that's freely accessible
✅ For personal or research use
✅ When the site has no API
✅ When manual collection would take too long
When NOT to scrape:
❌ Site's terms of service prohibit it
❌ Personal or private user data
❌ Behind login/paywall (without permission)
❌ Causes server load or harm
Always check the site's robots.txt file (website.com/robots.txt) to see which paths it allows crawlers to access.
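For a quick look from Python, here's a minimal sketch that simply fetches and prints the file (it uses the requests library installed in the next section, with example.com as a placeholder site); a programmatic check with urllib.robotparser appears later under Best Practices:

```python
import requests

# Fetch a site's robots.txt and print it (example.com is a placeholder)
response = requests.get("https://example.com/robots.txt", timeout=10)
print(response.text)  # Read the User-agent / Disallow rules before you scrape
```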
Prerequisites
- Python 3.7 or higher
- Basic Python knowledge (variables, loops, functions)
- Basic HTML understanding (tags, attributes, structure)
Installing Required Libraries
```bash
pip install beautifulsoup4
pip install requests
pip install lxml
```
Libraries explained:
- requests: Fetches web pages (HTTP requests)
- beautifulsoup4: Parses HTML and extracts data
- lxml: Fast HTML parser (optional but recommended)
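As a rough sketch of how these fit together: requests downloads the raw HTML, and BeautifulSoup parses it with whichever parser you name. If lxml isn't installed, Python's built-in html.parser also works, just more slowly on large pages:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text

# Fast third-party parser (requires the lxml package)
soup_fast = BeautifulSoup(html, 'lxml')

# Built-in fallback: no extra install needed, but slower on large pages
soup_builtin = BeautifulSoup(html, 'html.parser')

print(soup_fast.title.string)
print(soup_builtin.title.string)  # Same result, different parser
```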
Understanding HTML Structure
Before scraping, understand basic HTML:
```html
<div class="product" id="item-123">
  <h2 class="title">Blue Widget</h2>
  <span class="price">$29.99</span>
  <p class="description">High-quality widget for all needs</p>
  <a href="/product/123">View Details</a>
</div>
```
Key concepts:
- Tags: `<div>`, `<h2>`, `<span>`, `<p>`, `<a>`
- Classes: `class="product"` (can be shared by multiple elements)
- IDs: `id="item-123"` (unique per page)
- Attributes: `href="/product/123"`
- Text content: "Blue Widget", "$29.99"
Your First Scraper
Step 1: Fetch a Web Page
```python
import requests
from bs4 import BeautifulSoup

# Fetch the webpage
url = "https://example.com"
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Page fetched successfully!")
    html_content = response.text
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Step 2: Parse with BeautifulSoup
```python
# Create BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Now you can navigate and search the HTML
print(soup.title)         # Get the page title
print(soup.title.string)  # Get just the text of the title
```
Step 3: Find Elements
```python
# Find first h1 tag
h1 = soup.find('h1')
print(h1.text)

# Find all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Find by class
products = soup.find_all('div', class_='product')
for product in products:
    print(product.text)

# Find by ID
item = soup.find('div', id='item-123')
print(item.text)
```
Finding Elements: Complete Guide
Basic Finding Methods
```python
# Find first matching element
soup.find('div')  # First <div> tag
soup.find('a')    # First <a> tag

# Find all matching elements
soup.find_all('div')  # All <div> tags
soup.find_all('p')    # All <p> tags

# Find by class
soup.find('div', class_='product')      # First element with class="product"
soup.find_all('div', class_='product')  # All elements

# Find by ID
soup.find('div', id='main-content')

# Find by multiple attributes
soup.find('a', {'href': '/products', 'class': 'link'})
```
CSS Selectors (More Powerful)
```python
# Find using CSS selectors
soup.select('div.product')           # All divs with class="product"
soup.select('#main-content')         # Element with id="main-content"
soup.select('div.product > h2')      # h2 that's a direct child of div.product
soup.select('a[href^="/products"]')  # Links starting with "/products"

# Select first match
soup.select_one('div.product')
```
CSS Selector Cheat Sheet:
```python
'.classname'               # By class
'#idname'                  # By ID
'tag'                      # By tag name
'tag.class'                # Tag with class
'parent > child'           # Direct child
'parent descendant'        # Any descendant
'[attribute]'              # Has attribute
'[attribute="value"]'      # Attribute equals value
'[attribute^="start"]'     # Attribute starts with
'[attribute$="end"]'       # Attribute ends with
'[attribute*="contains"]'  # Attribute contains
```
Extracting Data
Get Text Content
```python
# Get text from element
element = soup.find('h1')
text = element.text  # or element.get_text()
print(text)

# Get text, stripping extra whitespace
clean_text = element.get_text(strip=True)
```
Get Attribute Values
```python
# Get href from link
link = soup.find('a')
href = link.get('href')  # or link['href']
print(href)

# Get src from image
image = soup.find('img')
src = image.get('src')
alt = image.get('alt')
```
Navigate HTML Structure
```python
# Get parent element
child = soup.find('span', class_='price')
parent = child.parent

# Get children
parent = soup.find('div', class_='product')
children = parent.find_all(recursive=False)  # Direct children only

# Get siblings
element = soup.find('h2')
next_element = element.find_next_sibling()
previous_element = element.find_previous_sibling()
```
Complete Example: Scraping Product Listings
Let's scrape product information from a fictional e-commerce site:
```python
import requests
from bs4 import BeautifulSoup
import csv


def scrape_products(url):
    """
    Scrape product information from a product listing page.

    Args:
        url: URL of the product listing page

    Returns:
        List of dictionaries containing product data
    """

    # Add headers to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    # Fetch the page
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to fetch page: {response.status_code}")
        return []

    # Parse HTML
    soup = BeautifulSoup(response.text, 'lxml')

    # Find all product containers
    products = soup.find_all('div', class_='product-card')

    product_list = []

    for product in products:
        try:
            # Extract product data
            title = product.find('h3', class_='product-title').text.strip()

            # Price might have dollar sign, remove it
            price_text = product.find('span', class_='price').text.strip()
            price = float(price_text.replace('$', '').replace(',', ''))

            # Rating (might not exist for all products)
            rating_element = product.find('span', class_='rating')
            rating = float(rating_element.text.strip()) if rating_element else None

            # Product link
            link_element = product.find('a', class_='product-link')
            link = link_element.get('href') if link_element else None

            # Make link absolute if it's relative
            if link and not link.startswith('http'):
                link = f"https://example.com{link}"

            # Store product data
            product_data = {
                'title': title,
                'price': price,
                'rating': rating,
                'link': link
            }

            product_list.append(product_data)

        except AttributeError as e:
            # Handle missing elements
            print(f"Error parsing product: {e}")
            continue

    return product_list


def save_to_csv(products, filename='products.csv'):
    """Save product data to CSV file."""

    if not products:
        print("No products to save")
        return

    # Get field names from first product
    fieldnames = products[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)

    print(f"Saved {len(products)} products to {filename}")


def main():
    """Main execution function."""

    url = "https://example.com/products"

    print("Starting scraper...")
    products = scrape_products(url)

    if products:
        print(f"\nScraped {len(products)} products")

        # Display first few products
        print("\nFirst 3 products:")
        for product in products[:3]:
            print(f"- {product['title']}: ${product['price']}")

        # Save to CSV
        save_to_csv(products)
    else:
        print("No products found")


if __name__ == "__main__":
    main()
```
Handling Multiple Pages
Many sites paginate results. Scrape all pages:
```python
import time


def scrape_all_pages(base_url, max_pages=5):
    """Scrape multiple pages of products."""

    all_products = []

    for page_num in range(1, max_pages + 1):
        # Construct URL with page number
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}...")

        products = scrape_products(url)

        if not products:
            # No more products, stop
            print(f"No products found on page {page_num}. Stopping.")
            break

        all_products.extend(products)

        # Be polite: wait between requests
        time.sleep(2)  # Wait 2 seconds between pages

    return all_products


# Usage
products = scrape_all_pages("https://example.com/products", max_pages=10)
save_to_csv(products, 'all_products.csv')
```
Common Challenges and Solutions
Challenge 1: Dynamic Content (JavaScript)
Problem: The content is rendered by JavaScript after the page loads, so requests only receives the initial HTML and BeautifulSoup never sees the rendered data.
Solution: Use Selenium, which drives a real browser that executes the JavaScript:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)

# Wait for elements to load
wait = WebDriverWait(driver, 10)
products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product")))

# Now extract data from loaded elements
```
Challenge 2: Blocked Requests
Problem: Website blocks your scraper (returns 403, 429, or captcha).
Solutions:
```python
# 1. Add realistic headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.google.com/',
}

response = requests.get(url, headers=headers)

# 2. Add delays between requests
import time
time.sleep(2)  # Wait 2 seconds

# 3. Rotate user agents
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
    'Mozilla/5.0 (X11; Linux x86_64)...',
]

headers['User-Agent'] = random.choice(user_agents)

# 4. Use sessions to maintain cookies
session = requests.Session()
response = session.get(url)
```
Challenge 3: Handling Errors Gracefully
```python
import time  # Needed for the retry delays below


def safe_scrape(url, retries=3):
    """Scrape with error handling and retries."""

    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for bad status codes

            soup = BeautifulSoup(response.text, 'lxml')
            return soup

        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(2)

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            time.sleep(2)

    return None  # All retries failed
```
Challenge 4: Cleaning Messy Data
```python
import re


def clean_text(text):
    """Clean scraped text data."""
    if not text:
        return ""

    # Remove extra whitespace
    text = ' '.join(text.split())

    # Remove special characters if needed
    text = re.sub(r'[^\w\s$,.-]', '', text)

    return text.strip()


def clean_price(price_text):
    """Extract numeric price from text."""

    # Remove currency symbols and extra characters
    price_clean = re.sub(r'[^0-9.]', '', price_text)

    try:
        return float(price_clean)
    except ValueError:
        return None
```
Best Practices
1. Be Respectful
```python
# Add delays between requests
import time
time.sleep(1)  # Minimum 1 second between requests

# Respect robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    # OK to scrape
    scrape_page()
```
2. Handle Errors
```python
try:
    element = soup.find('div', class_='product')
    title = element.find('h2').text
except AttributeError:
    title = "Title not found"
except Exception as e:
    print(f"Unexpected error: {e}")
    title = None
```
3. Log Your Scraping
```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraper.log'
)

logging.info(f"Scraping started for {url}")
logging.info(f"Found {len(products)} products")
logging.error(f"Failed to scrape page: {error}")
```
4. Cache Results
```python
import json
from pathlib import Path

def cache_results(data, filename='cache.json'):
    """Save results to cache file."""
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)

def load_cache(filename='cache.json'):
    """Load cached results if available."""
    if Path(filename).exists():
        with open(filename) as f:
            return json.load(f)
    return None
```
Project: Job Posting Scraper
Complete project to scrape job postings:
```python
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
import time

class JobScraper:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.jobs = []

    def scrape_job_board(self, url, pages=3):
        """Scrape job postings from multiple pages."""

        for page in range(1, pages + 1):
            print(f"Scraping page {page}...")

            page_url = f"{url}&page={page}"
            soup = self.fetch_page(page_url)

            if not soup:
                continue

            jobs_on_page = self.parse_jobs(soup)
            self.jobs.extend(jobs_on_page)

            print(f"Found {len(jobs_on_page)} jobs on page {page}")
            time.sleep(2)  # Be polite

        return self.jobs

    def fetch_page(self, url):
        """Fetch and parse a single page."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'lxml')
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None

    def parse_jobs(self, soup):
        """Parse job listings from page."""
        jobs = []

        job_cards = soup.find_all('div', class_='job-listing')

        for card in job_cards:
            try:
                job = {
                    'title': card.find('h2', class_='job-title').text.strip(),
                    'company': card.find('span', class_='company-name').text.strip(),
                    'location': card.find('span', class_='location').text.strip(),
                    'posted_date': card.find('span', class_='posted-date').text.strip(),
                    'url': card.find('a', class_='apply-link')['href'],
                    'scraped_at': datetime.now().isoformat()
                }
                jobs.append(job)
            except Exception as e:
                print(f"Error parsing job: {e}")
                continue

        return jobs

    def save_to_csv(self, filename='jobs.csv'):
        """Save jobs to CSV."""
        if not self.jobs:
            print("No jobs to save")
            return

        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.jobs[0].keys())
            writer.writeheader()
            writer.writerows(self.jobs)

        print(f"Saved {len(self.jobs)} jobs to {filename}")

# Usage
scraper = JobScraper()
scraper.scrape_job_board("https://example.com/jobs?q=python", pages=5)
scraper.save_to_csv('python_jobs.csv')
```
Key Takeaways
- BeautifulSoup makes web scraping accessible to Python beginners
- Always check the site's terms of service and robots.txt
- Use CSS selectors for precise element targeting
- Handle errors gracefully with try/except blocks
- Be respectful: Add delays, don't overload servers
- Clean your data after scraping for better usability
Conclusion
Web scraping opens up massive data collection possibilities. What used to require hours of manual copying now runs automatically in minutes.
Start with simple projects, like scraping a single page. Then expand to multiple pages, add data cleaning, and schedule scripts to run automatically. Build a library of scrapers for common tasks.
The data you need is out there. Now you know how to collect it.
Related articles: Build a Price Monitoring Bot with Python, Extract Data from PDFs with Python
