Web Scraping for Beginners: Extract Data from Any Website with Python
You need competitor pricing data. Or job listings. Or product information from multiple websites. Copying and pasting hundreds of items manually? That's not a good use of your time.
Web scraping lets Python visit websites for you, extract the data you need, and organize it into a usable format. Let's learn how.
What You'll Learn
- How web pages are structured (HTML basics)
- Using requests to fetch web pages
- Parsing HTML with BeautifulSoup
- Extracting specific data from pages
- Handling common scraping challenges
Prerequisites
- Python 3.8 or higher
- requests library (`pip install requests`)
- BeautifulSoup library (`pip install beautifulsoup4`)
- Basic understanding of HTML (we'll cover the essentials)
The Problem
Manual data collection from websites is:
- Tedious (clicking, copying, pasting, repeat)
- Slow (limited by human speed)
- Error-prone (typos, missed items)
- Not scalable (10 items is fine, 10,000 is impossible)
The Solution
Web scraping automates this process:
- Request the web page (like your browser does)
- Parse the HTML structure
- Extract the specific data you need
- Save it to a file or database
Important: Web Scraping Ethics
Before we start, some important rules:
- Check robots.txt: Visit `website.com/robots.txt` to see what's allowed (a quick programmatic check is sketched after this list)
- Read the Terms of Service: Some sites prohibit scraping
- Be respectful: Don't overload servers with too many requests
- Personal use vs. commercial: Different rules may apply
- Don't scrape personal data: Privacy laws like GDPR apply
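Python's standard library can check a URL against robots.txt for you. A minimal sketch using `urllib.robotparser` (the user-agent string and URLs here are just examples):

```python
from urllib import robotparser

# Download and parse the site's robots.txt
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Check whether a given user agent may fetch a specific path
allowed = parser.can_fetch("MyScraperBot/1.0", "https://example.com/products")
print("Allowed to scrape:", allowed)
```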
Step 1: Understanding HTML Structure
Web pages are built with HTML. Here's a simplified example:
```html
<html>
  <body>
    <div class="product">
      <h2 class="title">Awesome Widget</h2>
      <span class="price">$29.99</span>
      <p class="description">The best widget ever made.</p>
    </div>
    <div class="product">
      <h2 class="title">Super Gadget</h2>
      <span class="price">$49.99</span>
      <p class="description">A gadget for everything.</p>
    </div>
  </body>
</html>
```
Key concepts:
- Tags: `<div>`, `<h2>`, and `<span>` are tags
- Classes: `class="product"` identifies groups of elements
- Nesting: Tags can contain other tags
Step 2: Fetching a Web Page
Let's start by getting a page's HTML:
```python
import requests

def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    # Add headers to look like a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)

        # Check if request was successful
        response.raise_for_status()

        return response.text

    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


# Example usage
html = fetch_page("https://example.com")
if html:
    print(html[:500])  # Print first 500 characters
```
Step 3: Parsing HTML with BeautifulSoup
BeautifulSoup makes it easy to navigate and search HTML:
```python
from bs4 import BeautifulSoup

def parse_html(html_content):
    """
    Parse HTML content into a BeautifulSoup object.

    Args:
        html_content: Raw HTML string

    Returns:
        BeautifulSoup object for querying
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

# Example: Find elements
soup = parse_html(html)

# Find by tag
all_divs = soup.find_all('div')

# Find by class
products = soup.find_all('div', class_='product')

# Find by ID
header = soup.find(id='main-header')

# Find first match
first_product = soup.find('div', class_='product')
```
Step 4: Extracting Data
Let's extract product information:
```python
def extract_products(soup):
    """
    Extract product data from parsed HTML.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with product data
    """
    products = []

    # Find all product containers
    product_divs = soup.find_all('div', class_='product')

    for div in product_divs:
        # Extract each piece of data
        title_tag = div.find('h2', class_='title')
        price_tag = div.find('span', class_='price')
        desc_tag = div.find('p', class_='description')

        product = {
            'title': title_tag.text.strip() if title_tag else None,
            'price': price_tag.text.strip() if price_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
        }

        products.append(product)

    return products
```
Step 5: Handling Multiple Pages
Many websites have pagination. Let's handle that:
```python
import time

def scrape_multiple_pages(base_url, max_pages=5):
    """
    Scrape data from multiple pages.

    Args:
        base_url: URL pattern with {page} placeholder
        max_pages: Maximum number of pages to scrape

    Returns:
        Combined list of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = base_url.format(page=page_num)
        print(f"Scraping page {page_num}: {url}")

        html = fetch_page(url)
        if not html:
            print(f"Failed to fetch page {page_num}, stopping.")
            break

        soup = parse_html(html)
        page_data = extract_products(soup)

        if not page_data:
            print(f"No data found on page {page_num}, stopping.")
            break

        all_data.extend(page_data)
        print(f"  Found {len(page_data)} items")

        # Be respectful - wait between requests
        time.sleep(1)

    return all_data


# Usage example
# data = scrape_multiple_pages("https://example.com/products?page={page}", max_pages=10)
```
Step 6: Saving the Data
Save your scraped data to CSV or JSON:
```python
import csv
import json

def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    # Get column names from first item
    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(data)} records to {filename}")
```
The Complete Script
```python
#!/usr/bin/env python3
"""
Web Scraper - Extract data from websites automatically.
Author: Alex Rodriguez

This script demonstrates web scraping fundamentals using requests and BeautifulSoup.
Customize the extraction logic for your target website.
"""

import csv
import json
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


# Configuration
REQUEST_TIMEOUT = 10
DELAY_BETWEEN_REQUESTS = 1  # seconds
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'


def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    try:
        response = requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        return response.text

    except requests.Timeout:
        print(f"Timeout fetching {url}")
    except requests.HTTPError as e:
        print(f"HTTP error {e.response.status_code} for {url}")
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")

    return None


def parse_html(html_content):
    """Parse HTML content into a BeautifulSoup object."""
    return BeautifulSoup(html_content, 'html.parser')


def extract_data(soup):
    """
    Extract data from the parsed HTML.

    CUSTOMIZE THIS FUNCTION for your target website.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with extracted data
    """
    data = []

    # Example: Extract product information
    # Adjust selectors based on your target website's HTML structure

    items = soup.find_all('div', class_='product')

    for item in items:
        try:
            # Example extraction - customize for your target
            title_elem = item.find('h2', class_='title')
            price_elem = item.find('span', class_='price')
            link_elem = item.find('a', href=True)

            record = {
                'title': title_elem.text.strip() if title_elem else None,
                'price': price_elem.text.strip() if price_elem else None,
                'link': link_elem['href'] if link_elem else None,
                'scraped_at': datetime.now().isoformat(),
            }

            # Only add if we got meaningful data
            if record['title']:
                data.append(record)

        except Exception as e:
            print(f"Error extracting item: {e}")
            continue

    return data


def clean_price(price_string):
    """
    Clean a price string and convert to float.

    Examples:
        "$29.99" -> 29.99
        "€ 1,299.00" -> 1299.00
    """
    if not price_string:
        return None

    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[^\d.,]', '', price_string)

    # Handle European number format (1.299,00 -> 1299.00)
    if ',' in cleaned and '.' in cleaned:
        if cleaned.index('.') < cleaned.index(','):
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '.')

    try:
        return float(cleaned)
    except ValueError:
        return None


def scrape_website(start_url, max_pages=1):
    """
    Scrape data from a website.

    Args:
        start_url: URL to start scraping (use {page} for pagination)
        max_pages: Maximum number of pages to scrape

    Returns:
        List of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        # Build URL (replace {page} if present)
        if '{page}' in start_url:
            url = start_url.format(page=page_num)
        else:
            url = start_url

        print(f"\n📄 Scraping page {page_num}: {url}")

        # Fetch page
        html = fetch_page(url)
        if not html:
            print("   ❌ Failed to fetch page")
            break

        # Parse and extract
        soup = parse_html(html)
        page_data = extract_data(soup)

        if not page_data:
            print("   ⚠️ No data found on this page")
            if page_num == 1:
                print("   💡 Check your extraction selectors!")
            break

        all_data.extend(page_data)
        print(f"   ✓ Extracted {len(page_data)} items")

        # Respect the server - wait between requests
        if page_num < max_pages:
            time.sleep(DELAY_BETWEEN_REQUESTS)

    return all_data


def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    if not data:
        print("No data to save")
        return

    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def main():
    """Main entry point."""

    print("=" * 60)
    print("WEB SCRAPER")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR SCRAPING TARGET
    # ========================================

    # Target URL (use {page} for pagination)
    target_url = "https://books.toscrape.com/catalogue/page-{page}.html"

    # Maximum pages to scrape
    max_pages = 3

    # Output filename
    output_file = f"scraped_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # ========================================
    # RUN THE SCRAPER
    # ========================================

    data = scrape_website(target_url, max_pages)

    if data:
        print(f"\n✅ Total items scraped: {len(data)}")

        # Save in both formats
        save_to_csv(data, f"{output_file}.csv")
        save_to_json(data, f"{output_file}.json")

        # Preview first few items
        print("\n📋 Preview of scraped data:")
        for item in data[:3]:
            print(f"  - {item}")
    else:
        print("\n❌ No data was scraped")
        print("Check that:")
        print("  1. The URL is accessible")
        print("  2. Your extraction selectors match the page structure")
        print("  3. The website allows scraping (check robots.txt)")


if __name__ == "__main__":
    main()
```
How to Run This Script
1. Install the required libraries:

   ```bash
   pip install requests beautifulsoup4
   ```

2. Save the script as `web_scraper.py`.

3. Customize the `extract_data()` function for your target website (a customized version for the demo site is sketched after these steps):
   - Open the website in your browser
   - Right-click on an element → "Inspect" to see the HTML structure
   - Update the selectors (tag names, class names) to match

4. Run the scraper:

   ```bash
   python web_scraper.py
   ```

5. Expected output:

   ```
   ============================================================
   WEB SCRAPER
   ============================================================

   📄 Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
      ✓ Extracted 20 items

   📄 Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
      ✓ Extracted 20 items

   ✅ Total items scraped: 40

   💾 Saved 40 records to scraped_data_20251110_143022.csv
   💾 Saved 40 records to scraped_data_20251110_143022.json
   ```
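If you point the script at the books.toscrape.com demo URL in `main()`, note that the default `div.product` selectors won't match that site's markup, so `extract_data()` needs customizing. Here's a sketch of a drop-in replacement, assuming the site's structure at the time of writing (listings in `<article class="product_pod">`, titles in the `title` attribute of the `h3` link, prices in `<p class="price_color">`):

```python
def extract_data(soup):
    """Extract book listings from books.toscrape.com (markup as of writing)."""
    data = []

    for item in soup.find_all('article', class_='product_pod'):
        title_link = item.h3.find('a') if item.h3 else None
        price_elem = item.find('p', class_='price_color')

        record = {
            'title': title_link['title'] if title_link else None,
            'price': price_elem.text.strip() if price_elem else None,
            'link': title_link['href'] if title_link else None,
            'scraped_at': datetime.now().isoformat(),
        }

        if record['title']:
            data.append(record)

    return data
```

The prices come back as strings like "£51.77"; pass them through `clean_price()` if you want numeric values.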
Customization Options
Extract from Tables
```python
def extract_table_data(soup):
    """Extract data from HTML tables."""
    data = []

    table = soup.find('table', class_='data-table')
    if not table:
        return data

    # Get headers
    headers = []
    header_row = table.find('tr')
    for th in header_row.find_all(['th', 'td']):
        headers.append(th.text.strip())

    # Get data rows
    for row in table.find_all('tr')[1:]:  # Skip header
        cells = row.find_all('td')
        if len(cells) == len(headers):
            record = dict(zip(headers, [c.text.strip() for c in cells]))
            data.append(record)

    return data
```
Handle JavaScript-Rendered Content
Some sites load content with JavaScript. Use Selenium for these:
```python
# pip install selenium webdriver-manager
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def fetch_javascript_page(url):
    """Fetch a page that requires JavaScript rendering."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run without visible browser

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )

    try:
        driver.get(url)
        time.sleep(3)  # Wait for JavaScript to load
        return driver.page_source
    finally:
        driver.quit()
```
Add Proxy Support
```python
import requests

def fetch_with_proxy(url, proxy):
    """Fetch through a proxy server."""
    proxies = {
        'http': proxy,
        'https': proxy,
    }

    response = requests.get(url, proxies=proxies, timeout=15)
    return response.text
```
Common Issues & Solutions
| Issue | Solution |
|---|---|
| 403 Forbidden | Add proper User-Agent header; site may block scrapers |
| 404 Not Found | Check URL is correct; pagination may use different format |
| No data extracted | Inspect page HTML; selectors may not match |
| Getting blocked | Add delays; rotate User-Agents; use proxies (see the sketch below) |
| JavaScript content | Use Selenium instead of requests |
| Encoding errors | Specify encoding='utf-8' when saving files |
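For the "getting blocked" row, here is a minimal sketch that combines random delays with User-Agent rotation. The `polite_fetch` helper and the User-Agent strings are illustrative placeholders, not part of the script above:

```python
import random
import time

import requests

# Small pool of browser User-Agent strings to rotate through
# (example values -- substitute current ones for your use case)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def polite_fetch(url, delay_range=(1, 3)):
    """Fetch a URL with a random User-Agent and a random pause before the request."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    time.sleep(random.uniform(*delay_range))  # jitter between requests
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text
```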
Taking It Further
Monitor Price Changes
```python
import json
from datetime import datetime

def monitor_prices(url, check_interval_hours=24):
    """Monitor a page for price changes."""
    history_file = 'price_history.json'

    # Load previous prices
    try:
        with open(history_file) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    # Scrape current prices
    data = scrape_website(url, max_pages=1)

    # Compare and alert
    for item in data:
        item_id = item['title']
        current_price = clean_price(item['price'])

        # Skip items whose price couldn't be parsed
        if current_price is None:
            continue

        if item_id in history:
            old_price = history[item_id]['price']
            if old_price is not None and current_price < old_price:
                print(f"🔔 PRICE DROP: {item_id}")
                print(f"   Was: ${old_price:.2f} → Now: ${current_price:.2f}")

        # Update history
        history[item_id] = {
            'price': current_price,
            'last_checked': datetime.now().isoformat()
        }

    # Save updated history
    with open(history_file, 'w') as f:
        json.dump(history, f, indent=2)
```
Email Alerts
Combine with email automation to get notified of interesting findings.
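A minimal sketch using Python's built-in `smtplib` (the `send_alert` helper, the Gmail SMTP settings, and the app-password credential are illustrative assumptions, not part of the scraper above):

```python
import smtplib
from email.message import EmailMessage

def send_alert(subject, body, sender, recipient, app_password):
    """Send a plain-text email alert via an SMTP server (Gmail shown as an example)."""
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = recipient
    msg.set_content(body)

    # Gmail requires an app password when two-factor authentication is enabled
    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
        server.login(sender, app_password)
        server.send_message(msg)

# Example usage (hypothetical addresses and credentials):
# send_alert("Price drop!", "Widget is now $19.99", "me@example.com", "me@example.com", "app-password")
```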
Database Storage
For large-scale scraping, save to a database:
```python
import sqlite3

def save_to_database(data, db_path='scraped_data.db'):
    """Save scraped data to SQLite database."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create table if not exists
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price REAL,
            link TEXT,
            scraped_at TEXT
        )
    ''')

    # Insert data
    for item in data:
        cursor.execute('''
            INSERT INTO products (title, price, link, scraped_at)
            VALUES (?, ?, ?, ?)
        ''', (item['title'], item['price'], item['link'], item['scraped_at']))

    conn.commit()
    conn.close()
```
Conclusion
Web scraping opens up a world of data that would be impossible to collect manually. You can now extract product listings, job postings, news articles, research data—anything available on the web.
Remember the ethics: respect robots.txt, don't overload servers, and only scrape data you have the right to use. When in doubt, check the website's terms of service or contact the site owner.
Start with simple sites, get comfortable with HTML inspection and selector writing, then gradually tackle more complex scenarios. The data you need is out there—now you have the tools to get it.
The web is your data source.