Python Web Scraping for Beginners: BeautifulSoup Tutorial
You need data from a website. You could copy-paste for hours. Or you could write 20 lines of Python that extract everything in seconds and can run automatically whenever you need updated data.
Web scraping transforms you from a data consumer to a data collector. Today you'll learn BeautifulSoup, Python's most beginner-friendly web scraping library.
What You'll Learn
- Web scraping fundamentals and ethics
- Installing and using BeautifulSoup
- Finding and extracting HTML elements
- Cleaning and organizing scraped data
- Handling common challenges and errors
- Building a complete scraping project
What is Web Scraping?
Web scraping is automatically extracting data from websites using code.
Common use cases:
- Price monitoring for products
- Job posting aggregation
- Real estate listing collection
- News article monitoring
- Research data collection
- Competitor analysis
When scraping is appropriate:
✅ Public data that's freely accessible
✅ For personal or research use
✅ When the site has no API
✅ When manual collection would take too long
When NOT to scrape:
❌ Site's terms of service prohibit it
❌ Personal or private user data
❌ Behind login/paywall (without permission)
❌ Causes server load or harm
Always check the site's robots.txt file (website.com/robots.txt) to see which paths it allows crawlers to access.
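For a quick look from Python, here's a minimal sketch that simply fetches and prints the file (it uses the requests library installed in the next section, with example.com as a placeholder site); a programmatic check with urllib.robotparser appears later under Best Practices:

```python
import requests

# Fetch a site's robots.txt and print it (example.com is a placeholder)
response = requests.get("https://example.com/robots.txt", timeout=10)
print(response.text)  # Read the User-agent / Disallow rules before you scrape
```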
Prerequisites
- Python 3.7 or higher
- Basic Python knowledge (variables, loops, functions)
- Basic HTML understanding (tags, attributes, structure)
Installing Required Libraries
```bash
pip install beautifulsoup4
pip install requests
pip install lxml
```
Libraries explained:
- requests: Fetches web pages (HTTP requests)
- beautifulsoup4: Parses HTML and extracts data
- lxml: Fast HTML parser (optional but recommended)
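As a rough sketch of how these fit together: requests downloads the raw HTML, and BeautifulSoup parses it with whichever parser you name. If lxml isn't installed, Python's built-in html.parser also works, just more slowly on large pages:

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text

# Fast third-party parser (requires the lxml package)
soup_fast = BeautifulSoup(html, 'lxml')

# Built-in fallback: no extra install needed, but slower on large pages
soup_builtin = BeautifulSoup(html, 'html.parser')

print(soup_fast.title.string)
print(soup_builtin.title.string)  # Same result, different parser
```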
Understanding HTML Structure
Before scraping, understand basic HTML:
```html
<div class="product" id="item-123">
  <h2 class="title">Blue Widget</h2>
  <span class="price">$29.99</span>
  <p class="description">High-quality widget for all needs</p>
  <a href="/product/123">View Details</a>
</div>
```
Key concepts:
- Tags: `<div>`, `<h2>`, `<span>`, `<p>`, `<a>`
- Classes: `class="product"` (can be shared by multiple elements)
- IDs: `id="item-123"` (unique per page)
- Attributes: `href="/product/123"`
- Text content: "Blue Widget", "$29.99"
Your First Scraper
Step 1: Fetch a Web Page
```python
import requests
from bs4 import BeautifulSoup

# Fetch the webpage
url = "https://example.com"
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Page fetched successfully!")
    html_content = response.text
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Step 2: Parse with BeautifulSoup
```python
# Create BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Now you can navigate and search the HTML
print(soup.title)         # Get the page title
print(soup.title.string)  # Get just the text of the title
```
Step 3: Find Elements
```python
# Find first h1 tag
h1 = soup.find('h1')
print(h1.text)

# Find all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Find by class
products = soup.find_all('div', class_='product')
for product in products:
    print(product.text)

# Find by ID
item = soup.find('div', id='item-123')
print(item.text)
```
Finding Elements: Complete Guide
Basic Finding Methods
```python
# Find first matching element
soup.find('div')  # First <div> tag
soup.find('a')    # First <a> tag

# Find all matching elements
soup.find_all('div')  # All <div> tags
soup.find_all('p')    # All <p> tags

# Find by class
soup.find('div', class_='product')      # First element with class="product"
soup.find_all('div', class_='product')  # All elements

# Find by ID
soup.find('div', id='main-content')

# Find by multiple attributes
soup.find('a', {'href': '/products', 'class': 'link'})
```
CSS Selectors (More Powerful)
```python
# Find using CSS selectors
soup.select('div.product')           # All divs with class="product"
soup.select('#main-content')         # Element with id="main-content"
soup.select('div.product > h2')      # h2 that's a direct child of div.product
soup.select('a[href^="/products"]')  # Links starting with "/products"

# Select first match
soup.select_one('div.product')
```
CSS Selector Cheat Sheet:
```python
'.classname'               # By class
'#idname'                  # By ID
'tag'                      # By tag name
'tag.class'                # Tag with class
'parent > child'           # Direct child
'parent descendant'        # Any descendant
'[attribute]'              # Has attribute
'[attribute="value"]'      # Attribute equals value
'[attribute^="start"]'     # Attribute starts with
'[attribute$="end"]'       # Attribute ends with
'[attribute*="contains"]'  # Attribute contains
```
Extracting Data
Get Text Content
```python
# Get text from element
element = soup.find('h1')
text = element.text  # or element.get_text()
print(text)

# Get text, stripping extra whitespace
clean_text = element.get_text(strip=True)
```
Get Attribute Values
```python
# Get href from link
link = soup.find('a')
href = link.get('href')  # or link['href']
print(href)

# Get src from image
image = soup.find('img')
src = image.get('src')
alt = image.get('alt')
```
Navigate HTML Structure
```python
# Get parent element
child = soup.find('span', class_='price')
parent = child.parent

# Get children
parent = soup.find('div', class_='product')
children = parent.find_all(recursive=False)  # Direct children only

# Get siblings
element = soup.find('h2')
next_element = element.find_next_sibling()
previous_element = element.find_previous_sibling()
```
Complete Example: Scraping Product Listings
Let's scrape product information from a fictional e-commerce site:
```python
import requests
from bs4 import BeautifulSoup
import csv


def scrape_products(url):
    """
    Scrape product information from a product listing page.

    Args:
        url: URL of the product listing page

    Returns:
        List of dictionaries containing product data
    """

    # Add headers to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    # Fetch the page
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to fetch page: {response.status_code}")
        return []

    # Parse HTML
    soup = BeautifulSoup(response.text, 'lxml')

    # Find all product containers
    products = soup.find_all('div', class_='product-card')

    product_list = []

    for product in products:
        try:
            # Extract product data
            title = product.find('h3', class_='product-title').text.strip()

            # Price might have dollar sign, remove it
            price_text = product.find('span', class_='price').text.strip()
            price = float(price_text.replace('$', '').replace(',', ''))

            # Rating (might not exist for all products)
            rating_element = product.find('span', class_='rating')
            rating = float(rating_element.text.strip()) if rating_element else None

            # Product link
            link_element = product.find('a', class_='product-link')
            link = link_element.get('href') if link_element else None

            # Make link absolute if it's relative
            if link and not link.startswith('http'):
                link = f"https://example.com{link}"

            # Store product data
            product_data = {
                'title': title,
                'price': price,
                'rating': rating,
                'link': link
            }

            product_list.append(product_data)

        except AttributeError as e:
            # Handle missing elements
            print(f"Error parsing product: {e}")
            continue

    return product_list


def save_to_csv(products, filename='products.csv'):
    """Save product data to CSV file."""

    if not products:
        print("No products to save")
        return

    # Get field names from first product
    fieldnames = products[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)

    print(f"Saved {len(products)} products to {filename}")


def main():
    """Main execution function."""

    url = "https://example.com/products"

    print("Starting scraper...")
    products = scrape_products(url)

    if products:
        print(f"\nScraped {len(products)} products")

        # Display first few products
        print("\nFirst 3 products:")
        for product in products[:3]:
            print(f"- {product['title']}: ${product['price']}")

        # Save to CSV
        save_to_csv(products)
    else:
        print("No products found")


if __name__ == "__main__":
    main()
```
Handling Multiple Pages
Many sites paginate results. Scrape all pages:
```python
import time


def scrape_all_pages(base_url, max_pages=5):
    """Scrape multiple pages of products."""

    all_products = []

    for page_num in range(1, max_pages + 1):
        # Construct URL with page number
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}...")

        products = scrape_products(url)

        if not products:
            # No more products, stop
            print(f"No products found on page {page_num}. Stopping.")
            break

        all_products.extend(products)

        # Be polite: wait between requests
        time.sleep(2)  # Wait 2 seconds between pages

    return all_products


# Usage
products = scrape_all_pages("https://example.com/products", max_pages=10)
save_to_csv(products, 'all_products.csv')
```
Common Challenges and Solutions
Challenge 1: Dynamic Content (JavaScript)
Problem: The content is rendered by JavaScript after the page loads, so requests only receives the initial HTML and BeautifulSoup never sees the rendered data.
Solution: Use Selenium, which drives a real browser that executes the JavaScript:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)

# Wait for elements to load
wait = WebDriverWait(driver, 10)
products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product")))

# Now extract data from loaded elements
```
Challenge 2: Blocked Requests
Problem: Website blocks your scraper (returns 403, 429, or captcha).
Solutions:
```python
# 1. Add realistic headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.google.com/',
}

response = requests.get(url, headers=headers)

# 2. Add delays between requests
import time
time.sleep(2)  # Wait 2 seconds

# 3. Rotate user agents
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
    'Mozilla/5.0 (X11; Linux x86_64)...',
]

headers['User-Agent'] = random.choice(user_agents)

# 4. Use sessions to maintain cookies
session = requests.Session()
response = session.get(url)
```
Challenge 3: Handling Errors Gracefully
```python
import time  # Needed for the retry delays below


def safe_scrape(url, retries=3):
    """Scrape with error handling and retries."""

    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for bad status codes

            soup = BeautifulSoup(response.text, 'lxml')
            return soup

        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(2)

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            time.sleep(2)

    return None  # All retries failed
```
Challenge 4: Cleaning Messy Data
```python
import re


def clean_text(text):
    """Clean scraped text data."""
    if not text:
        return ""

    # Remove extra whitespace
    text = ' '.join(text.split())

    # Remove special characters if needed
    text = re.sub(r'[^\w\s$,.-]', '', text)

    return text.strip()


def clean_price(price_text):
    """Extract numeric price from text."""

    # Remove currency symbols and extra characters
    price_clean = re.sub(r'[^0-9.]', '', price_text)

    try:
        return float(price_clean)
    except ValueError:
        return None
```
Best Practices
1. Be Respectful
```python
# Add delays between requests
import time
time.sleep(1)  # Minimum 1 second between requests

# Respect robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    # OK to scrape
    scrape_page()
```
2. Handle Errors
```python
try:
    element = soup.find('div', class_='product')
    title = element.find('h2').text
except AttributeError:
    title = "Title not found"
except Exception as e:
    print(f"Unexpected error: {e}")
    title = None
```
3. Log Your Scraping
```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraper.log'
)

logging.info(f"Scraping started for {url}")
logging.info(f"Found {len(products)} products")
logging.error(f"Failed to scrape page: {error}")
```
4. Cache Results
```python
import json
from pathlib import Path

def cache_results(data, filename='cache.json'):
    """Save results to cache file."""
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)

def load_cache(filename='cache.json'):
    """Load cached results if available."""
    if Path(filename).exists():
        with open(filename) as f:
            return json.load(f)
    return None
```
Project: Job Posting Scraper
Complete project to scrape job postings:
```python
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
import time

class JobScraper:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.jobs = []

    def scrape_job_board(self, url, pages=3):
        """Scrape job postings from multiple pages."""

        for page in range(1, pages + 1):
            print(f"Scraping page {page}...")

            page_url = f"{url}&page={page}"
            soup = self.fetch_page(page_url)

            if not soup:
                continue

            jobs_on_page = self.parse_jobs(soup)
            self.jobs.extend(jobs_on_page)

            print(f"Found {len(jobs_on_page)} jobs on page {page}")
            time.sleep(2)  # Be polite

        return self.jobs

    def fetch_page(self, url):
        """Fetch and parse a single page."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'lxml')
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None

    def parse_jobs(self, soup):
        """Parse job listings from page."""
        jobs = []

        job_cards = soup.find_all('div', class_='job-listing')

        for card in job_cards:
            try:
                job = {
                    'title': card.find('h2', class_='job-title').text.strip(),
                    'company': card.find('span', class_='company-name').text.strip(),
                    'location': card.find('span', class_='location').text.strip(),
                    'posted_date': card.find('span', class_='posted-date').text.strip(),
                    'url': card.find('a', class_='apply-link')['href'],
                    'scraped_at': datetime.now().isoformat()
                }
                jobs.append(job)
            except Exception as e:
                print(f"Error parsing job: {e}")
                continue

        return jobs

    def save_to_csv(self, filename='jobs.csv'):
        """Save jobs to CSV."""
        if not self.jobs:
            print("No jobs to save")
            return

        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.jobs[0].keys())
            writer.writeheader()
            writer.writerows(self.jobs)

        print(f"Saved {len(self.jobs)} jobs to {filename}")

# Usage
scraper = JobScraper()
scraper.scrape_job_board("https://example.com/jobs?q=python", pages=5)
scraper.save_to_csv('python_jobs.csv')
```
Key Takeaways
- BeautifulSoup makes web scraping accessible to Python beginners
- Always check the site's terms of service and robots.txt
- Use CSS selectors for precise element targeting
- Handle errors gracefully with try/except blocks
- Be respectful: Add delays, don't overload servers
- Clean your data after scraping for better usability
Conclusion
Web scraping opens up massive data collection possibilities. What used to require hours of manual copying now runs automatically in minutes.
Start with simple projects, like scraping a single page. Then expand to multiple pages, add data cleaning, and schedule scripts to run automatically. Build a library of scrapers for common tasks.
The data you need is out there. Now you know how to collect it.
Related articles: Build a Price Monitoring Bot with Python, Extract Data from PDFs with Python
