
Web Scraping for Beginners: Extract Data from Any Website with Python

Alex Rodriguez · 15 min read

You need competitor pricing data. Or job listings. Or product information from multiple websites. Copying and pasting hundreds of items manually? That's not a good use of your time.

Web scraping lets Python visit websites for you, extract the data you need, and organize it into a usable format. Let's learn how.

What You'll Learn

  • How web pages are structured (HTML basics)
  • Using requests to fetch web pages
  • Parsing HTML with BeautifulSoup
  • Extracting specific data from pages
  • Handling common scraping challenges

Prerequisites

  • Python 3.8 or higher
  • requests library (pip install requests)
  • BeautifulSoup library (pip install beautifulsoup4)
  • Basic understanding of HTML (we'll cover the essentials)

The Problem

Manual data collection from websites is:

  • Tedious (clicking, copying, pasting, repeat)
  • Slow (limited by human speed)
  • Error-prone (typos, missed items)
  • Not scalable (10 items is fine, 10,000 is impossible)

The Solution

Web scraping automates this process:

  1. Request the web page (like your browser does)
  2. Parse the HTML structure
  3. Extract the specific data you need
  4. Save it to a file or database

Important: Web Scraping Ethics

Before we start, some important rules:

  • Check robots.txt: Visit website.com/robots.txt to see what's allowed (a programmatic check is sketched just after this list)
  • Read Terms of Service: Some sites prohibit scraping
  • Be respectful: Don't overload servers with too many requests
  • Personal use vs. commercial: Different rules may apply
  • Don't scrape personal data: Privacy laws like GDPR apply
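
You can automate the robots.txt check with Python's standard-library urllib.robotparser. A minimal sketch (the demo URL is just an example):

python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent='*'):
    """Check a site's robots.txt to see whether fetching this URL is permitted."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(user_agent, url)

# Example
print(is_allowed("https://books.toscrape.com/catalogue/page-1.html"))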

Step 1: Understanding HTML Structure

Web pages are built with HTML. Here's a simplified example:

html
<html>
  <body>
    <div class="product">
      <h2 class="title">Awesome Widget</h2>
      <span class="price">$29.99</span>
      <p class="description">The best widget ever made.</p>
    </div>
    <div class="product">
      <h2 class="title">Super Gadget</h2>
      <span class="price">$49.99</span>
      <p class="description">A gadget for everything.</p>
    </div>
  </body>
</html>

Key concepts:

  • Tags: elements like <div>, <h2>, and <span> mark up each piece of the page
  • Classes: class="product" identifies elements
  • Nesting: Tags can contain other tags

Step 2: Fetching a Web Page

Let's start by getting a page's HTML:

python
import requests

def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    # Add headers to look like a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)

        # Check if request was successful
        response.raise_for_status()

        return response.text

    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

# Example usage
html = fetch_page("https://example.com")
if html:
    print(html[:500])  # Print first 500 characters
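
If a fetch does not behave as expected, it helps to inspect the response object itself before parsing, for example:

python
import requests

response = requests.get("https://example.com", timeout=10)
print(response.status_code)                  # 200 means the request succeeded
print(response.headers.get('Content-Type'))  # e.g. text/html; charset=UTF-8
print(response.encoding)                     # how requests will decode response.text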

Step 3: Parsing HTML with BeautifulSoup

BeautifulSoup makes it easy to navigate and search HTML:

python
from bs4 import BeautifulSoup

def parse_html(html_content):
    """
    Parse HTML content into a BeautifulSoup object.

    Args:
        html_content: Raw HTML string

    Returns:
        BeautifulSoup object for querying
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

# Example: Find elements
soup = parse_html(html)

# Find by tag
all_divs = soup.find_all('div')

# Find by class
products = soup.find_all('div', class_='product')

# Find by ID
header = soup.find(id='main-header')

# Find first match
first_product = soup.find('div', class_='product')
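
If you already know a CSS selector from your browser's inspector, BeautifulSoup also supports select() and select_one(). The same lookups as above, written as selectors against the example product markup:

python
# CSS selector equivalents of the find/find_all calls above
products = soup.select('div.product')           # all elements matching the selector
first_product = soup.select_one('div.product')  # first match only

# Selectors can express nesting in one line
titles = soup.select('div.product h2.title')
prices = [tag.text for tag in soup.select('div.product span.price')]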

Step 4: Extracting Data

Let's extract product information:

python
def extract_products(soup):
    """
    Extract product data from parsed HTML.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with product data
    """
    products = []

    # Find all product containers
    product_divs = soup.find_all('div', class_='product')

    for div in product_divs:
        # Extract each piece of data
        title_tag = div.find('h2', class_='title')
        price_tag = div.find('span', class_='price')
        desc_tag = div.find('p', class_='description')

        product = {
            'title': title_tag.text.strip() if title_tag else None,
            'price': price_tag.text.strip() if price_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
        }

        products.append(product)

    return products
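
Putting Steps 2 through 4 together looks like this (the URL is a placeholder; swap in your real target):

python
# Fetch, parse, and extract in one pass
html = fetch_page("https://example.com/products")
if html:
    soup = parse_html(html)
    products = extract_products(soup)
    print(f"Found {len(products)} products")
    for product in products:
        print(f"{product['title']}: {product['price']}")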

Step 5: Handling Multiple Pages

Many websites have pagination. Let's handle that:

python
import time

def scrape_multiple_pages(base_url, max_pages=5):
    """
    Scrape data from multiple pages.

    Args:
        base_url: URL pattern with {page} placeholder
        max_pages: Maximum number of pages to scrape

    Returns:
        Combined list of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = base_url.format(page=page_num)
        print(f"Scraping page {page_num}: {url}")

        html = fetch_page(url)
        if not html:
            print(f"Failed to fetch page {page_num}, stopping.")
            break

        soup = parse_html(html)
        page_data = extract_products(soup)

        if not page_data:
            print(f"No data found on page {page_num}, stopping.")
            break

        all_data.extend(page_data)
        print(f"  Found {len(page_data)} items")

        # Be respectful - wait between requests
        time.sleep(1)

    return all_data

# Usage example
# data = scrape_multiple_pages("https://example.com/products?page={page}", max_pages=10)
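
Some sites don't expose page numbers in the URL at all and instead link each page to the next one. A sketch of that pattern, reusing fetch_page(), parse_html(), and extract_products() from above; the a.pager-next selector is an assumption you would replace after inspecting the real site:

python
import time
from urllib.parse import urljoin

def scrape_by_next_link(start_url, max_pages=5):
    """Follow a 'next page' link instead of formatting page numbers into the URL."""
    all_data = []
    url = start_url

    for _ in range(max_pages):
        html = fetch_page(url)
        if not html:
            break

        soup = parse_html(html)
        all_data.extend(extract_products(soup))

        # 'a.pager-next' is an assumed selector - inspect your target site for the real one
        next_link = soup.select_one('a.pager-next')
        if not next_link or not next_link.get('href'):
            break

        url = urljoin(url, next_link['href'])  # resolve relative links against the current page
        time.sleep(1)

    return all_data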

Step 6: Saving the Data

Save your scraped data to CSV or JSON:

python
import csv
import json

def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    # Get column names from first item
    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(data)} records to {filename}")
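
With the helpers above, saving what you extracted in Step 4 is then a single call each way:

python
# 'products' is the list returned by extract_products() in Step 4
save_to_csv(products, 'products.csv')
save_to_json(products, 'products.json')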

The Complete Script

python
#!/usr/bin/env python3
"""
Web Scraper - Extract data from websites automatically.
Author: Alex Rodriguez

This script demonstrates web scraping fundamentals using requests and BeautifulSoup.
Customize the extraction logic for your target website.
"""

import csv
import json
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


# Configuration
REQUEST_TIMEOUT = 10
DELAY_BETWEEN_REQUESTS = 1  # seconds
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'


def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    try:
        response = requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        return response.text

    except requests.Timeout:
        print(f"Timeout fetching {url}")
    except requests.HTTPError as e:
        print(f"HTTP error {e.response.status_code} for {url}")
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")

    return None


def parse_html(html_content):
    """Parse HTML content into a BeautifulSoup object."""
    return BeautifulSoup(html_content, 'html.parser')


def extract_data(soup):
    """
    Extract data from the parsed HTML.

    CUSTOMIZE THIS FUNCTION for your target website.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with extracted data
    """
    data = []

    # Example: Extract product information
    # Adjust selectors based on your target website's HTML structure

    items = soup.find_all('div', class_='product')

    for item in items:
        try:
            # Example extraction - customize for your target
            title_elem = item.find('h2', class_='title')
            price_elem = item.find('span', class_='price')
            link_elem = item.find('a', href=True)

            record = {
                'title': title_elem.text.strip() if title_elem else None,
                'price': price_elem.text.strip() if price_elem else None,
                'link': link_elem['href'] if link_elem else None,
                'scraped_at': datetime.now().isoformat(),
            }

            # Only add if we got meaningful data
            if record['title']:
                data.append(record)

        except Exception as e:
            print(f"Error extracting item: {e}")
            continue

    return data


def clean_price(price_string):
    """
    Clean a price string and convert to float.

    Examples:
        "$29.99" -> 29.99
        "€ 1,299.00" -> 1299.00
    """
    if not price_string:
        return None

    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[^\d.,]', '', price_string)

    # Handle European number format (1.299,00 -> 1299.00)
    if ',' in cleaned and '.' in cleaned:
        if cleaned.index('.') < cleaned.index(','):
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '.')

    try:
        return float(cleaned)
    except ValueError:
        return None


def scrape_website(start_url, max_pages=1):
    """
    Scrape data from a website.

    Args:
        start_url: URL to start scraping (use {page} for pagination)
        max_pages: Maximum number of pages to scrape

    Returns:
        List of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        # Build URL (replace {page} if present)
        if '{page}' in start_url:
            url = start_url.format(page=page_num)
        else:
            url = start_url

        print(f"\n📄 Scraping page {page_num}: {url}")

        # Fetch page
        html = fetch_page(url)
        if not html:
            print("  ❌ Failed to fetch page")
            break

        # Parse and extract
        soup = parse_html(html)
        page_data = extract_data(soup)

        if not page_data:
            print("  ⚠️ No data found on this page")
            if page_num == 1:
                print("  💡 Check your extraction selectors!")
            break

        all_data.extend(page_data)
        print(f"  ✓ Extracted {len(page_data)} items")

        # Respect the server - wait between requests
        if page_num < max_pages:
            time.sleep(DELAY_BETWEEN_REQUESTS)

    return all_data


def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    if not data:
        print("No data to save")
        return

    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def main():
    """Main entry point."""

    print("=" * 60)
    print("WEB SCRAPER")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR SCRAPING TARGET
    # ========================================

    # Target URL (use {page} for pagination)
    target_url = "https://books.toscrape.com/catalogue/page-{page}.html"

    # Maximum pages to scrape
    max_pages = 3

    # Output filename
    output_file = f"scraped_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # ========================================
    # RUN THE SCRAPER
    # ========================================

    data = scrape_website(target_url, max_pages)

    if data:
        print(f"\n✅ Total items scraped: {len(data)}")

        # Save in both formats
        save_to_csv(data, f"{output_file}.csv")
        save_to_json(data, f"{output_file}.json")

        # Preview first few items
        print("\n📋 Preview of scraped data:")
        for item in data[:3]:
            print(f"  - {item}")
    else:
        print("\n❌ No data was scraped")
        print("Check that:")
        print("  1. The URL is accessible")
        print("  2. Your extraction selectors match the page structure")
        print("  3. The website allows scraping (check robots.txt)")


if __name__ == "__main__":
    main()

How to Run This Script

  1. Install required libraries:

    bash
    pip install requests beautifulsoup4
  2. Save the script as web_scraper.py

  3. Customize the extract_data() function for your target website:

    • Open the website in your browser
    • Right-click on an element → "Inspect" to see HTML structure
    • Update the selectors (tag names, class names) to match (an example adapted for the demo site follows this list)
  4. Run the scraper:

    bash
    python web_scraper.py
  5. Expected output (once extract_data() has been adapted to the target site's markup):

    Prompt
    ============================================================
    WEB SCRAPER
    ============================================================
    
    📄 Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
      ✓ Extracted 20 items
    
    📄 Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
      ✓ Extracted 20 items
    
    📄 Scraping page 3: https://books.toscrape.com/catalogue/page-3.html
      ✓ Extracted 20 items
    
    ✅ Total items scraped: 60
    
    💾 Saved 60 records to scraped_data_20251110_143022.csv
    💾 Saved 60 records to scraped_data_20251110_143022.json
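
For reference, here is roughly what step 3 looks like for the books.toscrape.com demo site used in the script: a drop-in replacement for extract_data(), based on that site's markup at the time of writing (article.product_pod containers, the title in the link's title attribute, the price in p.price_color). Treat it as a sketch and verify the selectors in your own browser's inspector:

python
def extract_data(soup):
    """Extraction adapted for books.toscrape.com (verify selectors before relying on them)."""
    data = []

    # Each book listing sits in an <article class="product_pod"> container
    for item in soup.find_all('article', class_='product_pod'):
        title_elem = item.find('h3').find('a') if item.find('h3') else None
        price_elem = item.find('p', class_='price_color')

        record = {
            # The full title lives in the link's title attribute
            'title': title_elem['title'] if title_elem and title_elem.has_attr('title') else None,
            'price': price_elem.text.strip() if price_elem else None,
            'link': title_elem['href'] if title_elem else None,
            'scraped_at': datetime.now().isoformat(),
        }

        if record['title']:
            data.append(record)

    return data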

Customization Options

Extract from Tables

python
def extract_table_data(soup):
    """Extract data from HTML tables."""
    data = []

    table = soup.find('table', class_='data-table')
    if not table:
        return data

    # Get headers
    headers = []
    header_row = table.find('tr')
    for th in header_row.find_all(['th', 'td']):
        headers.append(th.text.strip())

    # Get data rows
    for row in table.find_all('tr')[1:]:  # Skip header
        cells = row.find_all('td')
        if len(cells) == len(headers):
            record = dict(zip(headers, [c.text.strip() for c in cells]))
            data.append(record)

    return data

Handle JavaScript-Rendered Content

Some sites load content with JavaScript. Use Selenium for these:

python
# pip install selenium webdriver-manager
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def fetch_javascript_page(url):
    """Fetch a page that requires JavaScript rendering."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run without visible browser

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )

    try:
        driver.get(url)
        time.sleep(3)  # Wait for JavaScript to load
        return driver.page_source
    finally:
        driver.quit()
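
A fixed time.sleep(3) is fragile: too short on slow pages, wasted time on fast ones. Selenium's explicit waits are usually more reliable; a sketch that waits for a div.product element (replace the selector with whatever marks your real content):

python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_content(driver, css_selector='div.product', timeout=10):
    """Block until at least one matching element is present, then return the page source."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    return driver.page_source

Inside fetch_javascript_page(), you could then return wait_for_content(driver) instead of sleeping.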

Add Proxy Support

python
import requests

def fetch_with_proxy(url, proxy):
    """Fetch through a proxy server."""
    proxies = {
        'http': proxy,
        'https': proxy,
    }

    response = requests.get(url, proxies=proxies, timeout=15)
    return response.text

Common Issues & Solutions

Issue              | Solution
-------------------|-------------------------------------------------------
403 Forbidden      | Add proper User-Agent header; site may block scrapers
404 Not Found      | Check URL is correct; pagination may use different format
No data extracted  | Inspect page HTML; selectors may not match
Getting blocked    | Add delays; rotate User-Agents; use proxies
JavaScript content | Use Selenium instead of requests
Encoding errors    | Specify encoding='utf-8' when saving files
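
For the "Getting blocked" row, a common first step is rotating the User-Agent per request. A minimal sketch; the header strings are only illustrative, so maintain your own list (or use a package such as fake-useragent):

python
import random
import requests

# Illustrative User-Agent strings - replace with a list you maintain
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

def fetch_with_random_agent(url):
    """Fetch a page using a randomly chosen User-Agent header."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text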

Taking It Further

Monitor Price Changes

python
def monitor_prices(url):
    """Monitor a page for price changes (run on a schedule, e.g. via cron or Task Scheduler)."""
    import json
    from datetime import datetime

    history_file = 'price_history.json'

    # Load previous prices
    try:
        with open(history_file) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    # Scrape current prices
    data = scrape_website(url, max_pages=1)

    # Compare and alert
    for item in data:
        item_id = item['title']
        current_price = clean_price(item['price'])
        if current_price is None:
            continue  # skip items whose price could not be parsed

        if item_id in history:
            old_price = history[item_id]['price']
            if old_price is not None and current_price < old_price:
                print(f"🔔 PRICE DROP: {item_id}")
                print(f"   Was: ${old_price:.2f} → Now: ${current_price:.2f}")

        # Update history
        history[item_id] = {
            'price': current_price,
            'last_checked': datetime.now().isoformat()
        }

    # Save updated history
    with open(history_file, 'w') as f:
        json.dump(history, f, indent=2)

Email Alerts

Combine with email automation to get notified of interesting findings.
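
A minimal sketch using Python's built-in smtplib, assuming SMTP credentials in environment variables and Gmail's SSL endpoint (the address, server, and variable names are placeholders to adapt):

python
import os
import smtplib
from email.message import EmailMessage

def send_alert(subject, body):
    """Send a plain-text email alert via SMTP over SSL."""
    # SMTP_USER / SMTP_PASSWORD are assumed environment variables - set them before running
    user = os.environ['SMTP_USER']
    password = os.environ['SMTP_PASSWORD']

    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = user
    msg['To'] = user  # send the alert to yourself
    msg.set_content(body)

    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
        server.login(user, password)
        server.send_message(msg)

You could pair this with monitor_prices() above by calling send_alert() where the price-drop message is currently printed.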

Database Storage

For large-scale scraping, save to a database:

python
import sqlite3

def save_to_database(data, db_path='scraped_data.db'):
    """Save scraped data to SQLite database."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create table if not exists
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price REAL,
            link TEXT,
            scraped_at TEXT
        )
    ''')

    # Insert data (clean_price() from the main script converts "$29.99" to 29.99 for the REAL column)
    for item in data:
        cursor.execute('''
            INSERT INTO products (title, price, link, scraped_at)
            VALUES (?, ?, ?, ?)
        ''', (item['title'], clean_price(item['price']), item['link'], item['scraped_at']))

    conn.commit()
    conn.close()
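
Reading the data back later is a standard sqlite3 query, for example:

python
import sqlite3

conn = sqlite3.connect('scraped_data.db')
for title, price in conn.execute('SELECT title, price FROM products ORDER BY scraped_at DESC LIMIT 10'):
    print(title, price)
conn.close()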

Conclusion

Web scraping opens up a world of data that would be impossible to collect manually. You can now extract product listings, job postings, news articles, research data—anything available on the web.

Remember the ethics: respect robots.txt, don't overload servers, and only scrape data you have the right to use. When in doubt, check the website's terms of service or contact the site owner.

Start with simple sites, get comfortable with HTML inspection and selector writing, then gradually tackle more complex scenarios. The data you need is out there—now you have the tools to get it.

The web is your data source.
