Web Scraping for Beginners: Extract Data from Any Website with Python
You need competitor pricing data. Or job listings. Or product information from multiple websites. Copying and pasting hundreds of items manually? That's not a good use of your time.
Web scraping lets Python visit websites for you, extract the data you need, and organize it into a usable format. Let's learn how.
What You'll Learn
- How web pages are structured (HTML basics)
- Using requests to fetch web pages
- Parsing HTML with BeautifulSoup
- Extracting specific data from pages
- Handling common scraping challenges
Prerequisites
- Python 3.8 or higher
- requests library (`pip install requests`)
- BeautifulSoup library (`pip install beautifulsoup4`)
- Basic understanding of HTML (we'll cover the essentials)
The Problem
Manual data collection from websites is:
- Tedious (clicking, copying, pasting, repeat)
- Slow (limited by human speed)
- Error-prone (typos, missed items)
- Not scalable (10 items is fine, 10,000 is impossible)
The Solution
Web scraping automates this process (a quick preview sketch follows this list):
- Request the web page (like your browser does)
- Parse the HTML structure
- Extract the specific data you need
- Save it to a file or database
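Here's that whole pipeline compressed into a few lines, as a rough preview. It uses example.com and an `<h1>` selector purely for illustration; a real target needs its own URL and selectors, which the rest of this guide walks through.

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text          # 1. request
soup = BeautifulSoup(html, "html.parser")                             # 2. parse
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]      # 3. extract
print(headings)                                                       # 4. save (here: just print)
```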
Important: Web Scraping Ethics
Before we start, some important rules:
- Check robots.txt: Visit `website.com/robots.txt` to see what's allowed (a programmatic check is sketched after this list)
- Read Terms of Service: Some sites prohibit scraping
- Be respectful: Don't overload servers with too many requests
- Personal use vs. commercial: Different rules may apply
- Don't scrape personal data: Privacy laws like GDPR apply
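The standard library can do the robots.txt check for you. A minimal sketch, using placeholder URLs you'd swap for your target site:

```python
from urllib.robotparser import RobotFileParser

# Placeholder URLs - substitute the site you actually want to scrape
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/some/page"):
    print("robots.txt allows fetching this path")
else:
    print("robots.txt disallows this path - don't scrape it")
```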
Step 1: Understanding HTML Structure
Web pages are built with HTML. Here's a simplified example:
```html
<html>
  <body>
    <div class="product">
      <h2 class="title">Awesome Widget</h2>
      <span class="price">$29.99</span>
      <p class="description">The best widget ever made.</p>
    </div>
    <div class="product">
      <h2 class="title">Super Gadget</h2>
      <span class="price">$49.99</span>
      <p class="description">A gadget for everything.</p>
    </div>
  </body>
</html>
```
Key concepts:
- Tags: `<div>`, `<h2>`, and `<span>` are tags
- Classes: `class="product"` identifies elements
- Nesting: Tags can contain other tags
Step 2: Fetching a Web Page
Let's start by getting a page's HTML:
```python
import requests

def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    # Add headers to look like a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)

        # Check if request was successful
        response.raise_for_status()

        return response.text

    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

# Example usage
html = fetch_page("https://example.com")
if html:
    print(html[:500])  # Print first 500 characters
```
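One optional refinement: if you plan to fetch many pages from the same site, a `requests.Session` reuses the underlying connection and keeps shared headers in one place. A small sketch along the same lines as `fetch_page()`:

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# The session carries the headers (and any cookies) across requests
response = session.get("https://example.com", timeout=10)
response.raise_for_status()
html = response.text
```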
Step 3: Parsing HTML with BeautifulSoup
BeautifulSoup makes it easy to navigate and search HTML:
```python
from bs4 import BeautifulSoup

def parse_html(html_content):
    """
    Parse HTML content into a BeautifulSoup object.

    Args:
        html_content: Raw HTML string

    Returns:
        BeautifulSoup object for querying
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

# Example: Find elements
soup = parse_html(html)

# Find by tag
all_divs = soup.find_all('div')

# Find by class
products = soup.find_all('div', class_='product')

# Find by ID
header = soup.find(id='main-header')

# Find first match
first_product = soup.find('div', class_='product')
```
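BeautifulSoup also accepts CSS selectors via `select()` and `select_one()`, which can feel more natural if you're copying selectors from your browser's developer tools. The same kinds of queries, rewritten as a sketch:

```python
# All divs with class "product"
products = soup.select('div.product')

# First price inside a product div (returns None if nothing matches)
first_price = soup.select_one('div.product span.price')
if first_price:
    print(first_price.get_text(strip=True))

# Element with a specific ID
header = soup.select_one('#main-header')
```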
Step 4: Extracting Data
Let's extract product information:
```python
def extract_products(soup):
    """
    Extract product data from parsed HTML.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with product data
    """
    products = []

    # Find all product containers
    product_divs = soup.find_all('div', class_='product')

    for div in product_divs:
        # Extract each piece of data
        title_tag = div.find('h2', class_='title')
        price_tag = div.find('span', class_='price')
        desc_tag = div.find('p', class_='description')

        product = {
            'title': title_tag.text.strip() if title_tag else None,
            'price': price_tag.text.strip() if price_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
        }

        products.append(product)

    return products
```
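To sanity-check the extractor before pointing it at a live site, you can feed it the sample HTML from Step 1:

```python
sample_html = """
<div class="product">
  <h2 class="title">Awesome Widget</h2>
  <span class="price">$29.99</span>
  <p class="description">The best widget ever made.</p>
</div>
"""

soup = parse_html(sample_html)
print(extract_products(soup))
# [{'title': 'Awesome Widget', 'price': '$29.99', 'description': 'The best widget ever made.'}]
```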
Step 5: Handling Multiple Pages
Many websites have pagination. Let's handle that:
```python
import time

def scrape_multiple_pages(base_url, max_pages=5):
    """
    Scrape data from multiple pages.

    Args:
        base_url: URL pattern with {page} placeholder
        max_pages: Maximum number of pages to scrape

    Returns:
        Combined list of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        url = base_url.format(page=page_num)
        print(f"Scraping page {page_num}: {url}")

        html = fetch_page(url)
        if not html:
            print(f"Failed to fetch page {page_num}, stopping.")
            break

        soup = parse_html(html)
        page_data = extract_products(soup)

        if not page_data:
            print(f"No data found on page {page_num}, stopping.")
            break

        all_data.extend(page_data)
        print(f"  Found {len(page_data)} items")

        # Be respectful - wait between requests
        time.sleep(1)

    return all_data

# Usage example
# data = scrape_multiple_pages("https://example.com/products?page={page}", max_pages=10)
```
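Not every site uses numbered URLs; some only expose a "next" button. Here's a variant sketch that follows a next link instead, assuming the link carries `class="next"` (adjust to whatever your target actually uses):

```python
import time
from urllib.parse import urljoin

def scrape_by_next_link(start_url, max_pages=5):
    """Follow 'next' links until none is found or max_pages is reached."""
    all_data = []
    url = start_url

    for _ in range(max_pages):
        html = fetch_page(url)
        if not html:
            break

        soup = parse_html(html)
        all_data.extend(extract_products(soup))

        # Look for the pagination link (the class name is an assumption)
        next_link = soup.find('a', class_='next')
        if not next_link or not next_link.get('href'):
            break

        url = urljoin(url, next_link['href'])
        time.sleep(1)  # be polite between requests

    return all_data
```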
Step 6: Saving the Data
Save your scraped data to CSV or JSON:
```python
import csv
import json

def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    # Get column names from first item
    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Saved {len(data)} records to {filename}")
```
The Complete Script
```python
#!/usr/bin/env python3
"""
Web Scraper - Extract data from websites automatically.
Author: Alex Rodriguez

This script demonstrates web scraping fundamentals using requests and BeautifulSoup.
Customize the extraction logic for your target website.
"""

import csv
import json
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


# Configuration
REQUEST_TIMEOUT = 10
DELAY_BETWEEN_REQUESTS = 1  # seconds
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'


def fetch_page(url):
    """
    Fetch a web page and return its HTML content.

    Args:
        url: The URL to fetch

    Returns:
        HTML content as string, or None if failed
    """
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    try:
        response = requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        return response.text

    except requests.Timeout:
        print(f"Timeout fetching {url}")
    except requests.HTTPError as e:
        print(f"HTTP error {e.response.status_code} for {url}")
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")

    return None


def parse_html(html_content):
    """Parse HTML content into a BeautifulSoup object."""
    return BeautifulSoup(html_content, 'html.parser')


def extract_data(soup):
    """
    Extract data from the parsed HTML.

    CUSTOMIZE THIS FUNCTION for your target website.

    Args:
        soup: BeautifulSoup object

    Returns:
        List of dictionaries with extracted data
    """
    data = []

    # Example: Extract product information
    # Adjust selectors based on your target website's HTML structure

    items = soup.find_all('div', class_='product')

    for item in items:
        try:
            # Example extraction - customize for your target
            title_elem = item.find('h2', class_='title')
            price_elem = item.find('span', class_='price')
            link_elem = item.find('a', href=True)

            record = {
                'title': title_elem.text.strip() if title_elem else None,
                'price': price_elem.text.strip() if price_elem else None,
                'link': link_elem['href'] if link_elem else None,
                'scraped_at': datetime.now().isoformat(),
            }

            # Only add if we got meaningful data
            if record['title']:
                data.append(record)

        except Exception as e:
            print(f"Error extracting item: {e}")
            continue

    return data


def clean_price(price_string):
    """
    Clean a price string and convert to float.

    Examples:
        "$29.99" -> 29.99
        "€ 1,299.00" -> 1299.00
    """
    if not price_string:
        return None

    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[^\d.,]', '', price_string)

    # Handle European number format (1.299,00 -> 1299.00)
    if ',' in cleaned and '.' in cleaned:
        if cleaned.index('.') < cleaned.index(','):
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            cleaned = cleaned.replace(',', '')
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '.')

    try:
        return float(cleaned)
    except ValueError:
        return None


def scrape_website(start_url, max_pages=1):
    """
    Scrape data from a website.

    Args:
        start_url: URL to start scraping (use {page} for pagination)
        max_pages: Maximum number of pages to scrape

    Returns:
        List of all extracted data
    """
    all_data = []

    for page_num in range(1, max_pages + 1):
        # Build URL (replace {page} if present)
        if '{page}' in start_url:
            url = start_url.format(page=page_num)
        else:
            url = start_url

        print(f"\n📄 Scraping page {page_num}: {url}")

        # Fetch page
        html = fetch_page(url)
        if not html:
            print("  ❌ Failed to fetch page")
            break

        # Parse and extract
        soup = parse_html(html)
        page_data = extract_data(soup)

        if not page_data:
            print("  ⚠️ No data found on this page")
            if page_num == 1:
                print("  💡 Check your extraction selectors!")
            break

        all_data.extend(page_data)
        print(f"  ✓ Extracted {len(page_data)} items")

        # Respect the server - wait between requests
        if page_num < max_pages:
            time.sleep(DELAY_BETWEEN_REQUESTS)

    return all_data


def save_to_csv(data, filename):
    """Save data to a CSV file."""
    if not data:
        print("No data to save")
        return

    fieldnames = data[0].keys()

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def save_to_json(data, filename):
    """Save data to a JSON file."""
    if not data:
        print("No data to save")
        return

    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Saved {len(data)} records to {filename}")


def main():
    """Main entry point."""

    print("=" * 60)
    print("WEB SCRAPER")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR SCRAPING TARGET
    # ========================================

    # Target URL (use {page} for pagination)
    # Remember to update the selectors in extract_data() to match
    # this site's HTML before running.
    target_url = "https://books.toscrape.com/catalogue/page-{page}.html"

    # Maximum pages to scrape
    max_pages = 3

    # Output filename
    output_file = f"scraped_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # ========================================
    # RUN THE SCRAPER
    # ========================================

    data = scrape_website(target_url, max_pages)

    if data:
        print(f"\n✅ Total items scraped: {len(data)}")

        # Save in both formats
        save_to_csv(data, f"{output_file}.csv")
        save_to_json(data, f"{output_file}.json")

        # Preview first few items
        print("\n📋 Preview of scraped data:")
        for item in data[:3]:
            print(f"  - {item}")
    else:
        print("\n❌ No data was scraped")
        print("Check that:")
        print("  1. The URL is accessible")
        print("  2. Your extraction selectors match the page structure")
        print("  3. The website allows scraping (check robots.txt)")


if __name__ == "__main__":
    main()
```
How to Run This Script
1. Install required libraries:

   ```bash
   pip install requests beautifulsoup4
   ```

2. Save the script as `web_scraper.py`.

3. Customize the `extract_data()` function for your target website:
   - Open the website in your browser
   - Right-click on an element → "Inspect" to see its HTML structure
   - Update the selectors (tag names, class names) to match

4. Run the scraper:

   ```bash
   python web_scraper.py
   ```

5. Expected output (abridged; assumes `extract_data()` has been adapted to the target site):

   ```
   ============================================================
   WEB SCRAPER
   ============================================================

   📄 Scraping page 1: https://books.toscrape.com/catalogue/page-1.html
     ✓ Extracted 20 items

   📄 Scraping page 2: https://books.toscrape.com/catalogue/page-2.html
     ✓ Extracted 20 items

   ✅ Total items scraped: 40

   💾 Saved 40 records to scraped_data_20251110_143022.csv
   💾 Saved 40 records to scraped_data_20251110_143022.json
   ```
Customization Options
Extract from Tables
```python
def extract_table_data(soup):
    """Extract data from HTML tables."""
    data = []

    table = soup.find('table', class_='data-table')
    if not table:
        return data

    # Get headers from the first row
    headers = []
    header_row = table.find('tr')
    if not header_row:
        return data
    for th in header_row.find_all(['th', 'td']):
        headers.append(th.text.strip())

    # Get data rows
    for row in table.find_all('tr')[1:]:  # Skip header
        cells = row.find_all('td')
        if len(cells) == len(headers):
            record = dict(zip(headers, [c.text.strip() for c in cells]))
            data.append(record)

    return data
```
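For straightforward tables, `pandas.read_html()` is a handy shortcut that parses every `<table>` on a page into a DataFrame (it needs `pip install pandas lxml`). A quick sketch, assuming `html` holds the page source you already fetched:

```python
from io import StringIO

import pandas as pd

# Returns one DataFrame per <table> found in the HTML
tables = pd.read_html(StringIO(html))
if tables:
    df = tables[0]
    print(df.head())
    df.to_csv("table_data.csv", index=False)
```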
Handle JavaScript-Rendered Content
Some sites load content with JavaScript. Use Selenium for these:
```python
# pip install selenium webdriver-manager
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def fetch_javascript_page(url):
    """Fetch a page that requires JavaScript rendering."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run without a visible browser

    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=options
    )

    try:
        driver.get(url)
        time.sleep(3)  # Wait for JavaScript to load
        return driver.page_source
    finally:
        driver.quit()
```
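A fixed `time.sleep(3)` is the simplest approach, but it wastes time on fast pages and can be too short on slow ones. Selenium's explicit waits block only until a chosen element appears; the `div.product` selector below is an assumption you'd adapt to your target. You could call this in place of the sleep inside `fetch_javascript_page()`:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_content(driver, timeout=10):
    """Wait until at least one product element is present, then return the HTML."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product"))
    )
    return driver.page_source
```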
Add Proxy Support
```python
import requests

def fetch_with_proxy(url, proxy):
    """Fetch through a proxy server (proxy is a URL like 'http://host:port')."""
    proxies = {
        'http': proxy,
        'https': proxy,
    }

    response = requests.get(url, proxies=proxies, timeout=15)
    return response.text
```
Common Issues & Solutions
| Issue | Solution |
|---|---|
| 403 Forbidden | Add proper User-Agent header; site may block scrapers |
| 404 Not Found | Check URL is correct; pagination may use different format |
| No data extracted | Inspect page HTML; selectors may not match |
| Getting blocked | Add delays; rotate User-Agents; use proxies |
| JavaScript content | Use Selenium instead of requests |
| Encoding errors | Specify encoding='utf-8' when saving files |
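For the "getting blocked" and intermittent-failure cases, a simple retry loop with exponential backoff often helps. A sketch that reuses the `USER_AGENT` constant from the complete script:

```python
import time

import requests

def fetch_with_retries(url, max_retries=3):
    """Retry a GET request, doubling the wait after each failed attempt."""
    headers = {'User-Agent': USER_AGENT}

    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt < max_retries:
                time.sleep(2 ** attempt)  # wait 2s, then 4s, then 8s

    return None
```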
Taking It Further
Monitor Price Changes
```python
import json
from datetime import datetime

def monitor_prices(url, check_interval_hours=24):
    """Monitor a page for price changes.

    check_interval_hours is left for callers that schedule repeated runs.
    """
    history_file = 'price_history.json'

    # Load previous prices
    try:
        with open(history_file) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    # Scrape current prices
    data = scrape_website(url, max_pages=1)

    # Compare and alert
    for item in data:
        item_id = item['title']
        current_price = clean_price(item['price'])

        if current_price is None:
            continue  # Skip items whose price couldn't be parsed

        if item_id in history:
            old_price = history[item_id]['price']
            if old_price is not None and current_price < old_price:
                print(f"🔔 PRICE DROP: {item_id}")
                print(f"   Was: ${old_price:.2f} → Now: ${current_price:.2f}")

        # Update history
        history[item_id] = {
            'price': current_price,
            'last_checked': datetime.now().isoformat()
        }

    # Save updated history
    with open(history_file, 'w') as f:
        json.dump(history, f, indent=2)
```
Email Alerts
Combine with email automation to get notified of interesting findings.
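A minimal sketch using the standard library's `smtplib`; the SMTP host, addresses, and password below are placeholders you'd replace with your own provider's details (many providers require an app-specific password):

```python
import smtplib
from email.message import EmailMessage

def send_alert(subject, body, to_addr):
    """Send a plain-text alert email over SMTP with SSL."""
    msg = EmailMessage()
    msg['Subject'] = subject
    msg['From'] = 'scraper@example.com'   # placeholder sender
    msg['To'] = to_addr
    msg.set_content(body)

    # Placeholder host and credentials - use your email provider's settings
    with smtplib.SMTP_SSL('smtp.example.com', 465) as server:
        server.login('scraper@example.com', 'your-app-password')
        server.send_message(msg)

# Example: send_alert("Price drop!", "Awesome Widget is now $24.99", "you@example.com")
```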
Database Storage
For large-scale scraping, save to a database:
```python
import sqlite3

def save_to_database(data, db_path='scraped_data.db'):
    """Save scraped data to a SQLite database."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create table if not exists
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price REAL,
            link TEXT,
            scraped_at TEXT
        )
    ''')

    # Insert data (run prices through clean_price() first if you want
    # true numeric values in the REAL column)
    for item in data:
        cursor.execute('''
            INSERT INTO products (title, price, link, scraped_at)
            VALUES (?, ?, ?, ?)
        ''', (item['title'], item['price'], item['link'], item['scraped_at']))

    conn.commit()
    conn.close()
```
Conclusion
Web scraping opens up a world of data that would be impossible to collect manually. You can now extract product listings, job postings, news articles, research data—anything available on the web.
Remember the ethics: respect robots.txt, don't overload servers, and only scrape data you have the right to use. When in doubt, check the website's terms of service or contact the site owner.
Start with simple sites, get comfortable with HTML inspection and selector writing, then gradually tackle more complex scenarios. The data you need is out there—now you have the tools to get it.
The web is your data source.