Extract Data from PDFs with Python: Tables, Text, and More
PDFs are everywhere in business—invoices, reports, contracts, forms. They look great but are notoriously difficult to work with programmatically. Copy-pasting from PDFs into spreadsheets? A nightmare of broken formatting.
Python can crack open those PDFs and extract exactly what you need. Let's learn how.
What You'll Learn
- Extracting text from PDF files
- Pulling tables into structured data
- Handling multi-page documents
- Working with scanned PDFs (OCR)
- Saving extracted data to Excel or CSV
Prerequisites
- Python 3.8 or higher
- PyPDF2 library (`pip install pypdf2`)
- pdfplumber library (`pip install pdfplumber`)
- tabula-py for table extraction (`pip install tabula-py`)
- Java Runtime Environment (required for tabula-py)
The Problem
PDFs are designed for display, not data extraction:
- Text can be scattered across invisible boxes
- Tables are just visual arrangements, not data structures
- Scanned documents are images, not text
- Copy-paste mangles formatting
- Manual extraction takes hours
The Solution
We'll use Python libraries that understand PDF structure:
- PyPDF2: Basic text extraction
- pdfplumber: Better text extraction with position data
- tabula-py: Specialized table extraction
- pytesseract: OCR for scanned documents
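Once you have installed the libraries (Step 1 below), a quick sanity check confirms they all import cleanly. This helper is just a convenience for verifying your setup, not part of the extraction code itself:

```python
import importlib

def check_imports(names):
    """Return a dict mapping each module name to True if it imports."""
    available = {}
    for name in names:
        try:
            importlib.import_module(name)
            available[name] = True
        except ImportError:
            available[name] = False
    return available

# Note: tabula-py installs as the module "tabula"
status = check_imports(["PyPDF2", "pdfplumber", "tabula", "pandas"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'NOT INSTALLED'}")
```

If any library reports NOT INSTALLED, revisit Step 1 before continuing.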
Step 1: Installing the Libraries
```bash
pip install pypdf2 pdfplumber tabula-py pandas openpyxl
```

For tabula-py, you also need Java installed:
- Windows: download from java.com and install
- macOS: `brew install java`
- Linux: `sudo apt install default-jre`
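You can verify that Java is reachable from Python before running any tabula-py code. This small check is just a convenience (tabula-py also ships its own `tabula.environment_info()` helper for fuller diagnostics):

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH."""
    return shutil.which("java") is not None

if java_available():
    print("Java found: tabula-py should work")
else:
    print("Java not found: install it before using tabula-py")
```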
Step 2: Basic Text Extraction with PyPDF2
The simplest approach for text-based PDFs:
```python
from PyPDF2 import PdfReader

def extract_text_pypdf2(pdf_path):
    """
    Extract all text from a PDF using PyPDF2.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing all extracted text
    """
    reader = PdfReader(pdf_path)

    text = ""
    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text()
        if page_text:
            text += f"\n--- Page {page_num} ---\n"
            text += page_text

    return text

# Example usage
text = extract_text_pypdf2("report.pdf")
print(text)
```

Step 3: Better Text Extraction with pdfplumber
pdfplumber gives more control and better results:
```python
import pdfplumber

def extract_text_pdfplumber(pdf_path):
    """
    Extract text from PDF with better formatting preservation.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing extracted text
    """
    text = ""

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                text += f"\n--- Page {page_num} ---\n"
                text += page_text

    return text


def extract_text_by_area(pdf_path, page_num, bbox):
    """
    Extract text from a specific area of a page.

    Args:
        pdf_path: Path to the PDF file
        page_num: Page number (1-indexed)
        bbox: Bounding box (x0, y0, x1, y1) in points

    Returns:
        Extracted text from the specified area
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num - 1]  # Convert to 0-indexed
        cropped = page.crop(bbox)
        return cropped.extract_text()

# Example: Extract header area
header_text = extract_text_by_area("report.pdf", 1, (0, 0, 612, 100))
```

Step 4: Extracting Tables
This is where it gets powerful. tabula-py excels at table extraction:
```python
import pdfplumber
import tabula
import pandas as pd

def extract_tables_tabula(pdf_path, pages="all"):
    """
    Extract all tables from a PDF.

    Args:
        pdf_path: Path to the PDF file
        pages: Pages to extract from ('all', '1', '1-3', [1,2,3])

    Returns:
        List of pandas DataFrames, one per table
    """
    tables = tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        pandas_options={"header": None}  # Don't assume first row is header
    )

    return tables


def extract_tables_pdfplumber(pdf_path):
    """
    Extract tables using pdfplumber (alternative method).

    Args:
        pdf_path: Path to the PDF file

    Returns:
        List of tables (each table is a list of rows)
    """
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table_num, table in enumerate(tables, start=1):
                all_tables.append({
                    "page": page_num,
                    "table_num": table_num,
                    "data": table
                })

    return all_tables


# Example usage
tables = extract_tables_tabula("financial_report.pdf")

for i, df in enumerate(tables):
    print(f"\nTable {i + 1}:")
    print(df.head())
```

Step 5: Cleaning Extracted Data
Raw extracted data often needs cleaning:
```python
def clean_dataframe(df):
    """
    Clean a DataFrame extracted from PDF.

    Args:
        df: pandas DataFrame

    Returns:
        Cleaned DataFrame
    """
    # Remove empty rows and columns
    df = df.dropna(how='all')
    df = df.dropna(axis=1, how='all')

    # Strip whitespace from string columns
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.strip()

    # Reset index
    df = df.reset_index(drop=True)

    return df


def promote_header(df):
    """
    Promote the first row to column headers.

    Args:
        df: pandas DataFrame

    Returns:
        DataFrame with first row as headers
    """
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    return df
```

The Complete Script
```python
#!/usr/bin/env python3
"""
PDF Data Extractor - Extract text and tables from PDF files.
Author: Alex Rodriguez

This script extracts text and tables from PDF documents and saves
them in usable formats (CSV, Excel, or text files).
"""

import os
from pathlib import Path

import pandas as pd
import pdfplumber
import tabula


def extract_text(pdf_path):
    """
    Extract all text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with page numbers as keys and text as values
    """
    results = {}

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            results[i] = text if text else ""

    return results


def extract_tables(pdf_path, method="tabula"):
    """
    Extract all tables from a PDF file.

    Args:
        pdf_path: Path to the PDF file
        method: 'tabula' or 'pdfplumber'

    Returns:
        List of dictionaries with table metadata and DataFrames
    """
    tables = []

    if method == "tabula":
        try:
            raw_tables = tabula.read_pdf(
                pdf_path,
                pages="all",
                multiple_tables=True
            )

            for i, df in enumerate(raw_tables, start=1):
                if not df.empty:
                    tables.append({
                        "table_num": i,
                        "rows": len(df),
                        "columns": len(df.columns),
                        "data": df
                    })
        except Exception as e:
            print(f"Tabula extraction failed: {e}")
            print("Falling back to pdfplumber...")
            method = "pdfplumber"

    if method == "pdfplumber":
        with pdfplumber.open(pdf_path) as pdf:
            table_num = 0

            for page_num, page in enumerate(pdf.pages, start=1):
                page_tables = page.extract_tables()

                for table in page_tables:
                    if table:
                        table_num += 1
                        df = pd.DataFrame(table[1:], columns=table[0])
                        tables.append({
                            "table_num": table_num,
                            "page": page_num,
                            "rows": len(df),
                            "columns": len(df.columns),
                            "data": df
                        })

    return tables


def clean_table(df):
    """Clean and standardize a DataFrame."""
    # Remove completely empty rows and columns
    df = df.dropna(how='all').dropna(axis=1, how='all')

    # Clean string values
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].astype(str).str.strip()
            df[col] = df[col].replace(['nan', 'None', ''], pd.NA)

    # Remove rows that are all NA after cleaning
    df = df.dropna(how='all')

    return df.reset_index(drop=True)


def get_pdf_metadata(pdf_path):
    """
    Get metadata from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with PDF metadata
    """
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "path": str(pdf_path),
            "pages": len(pdf.pages),
            "metadata": pdf.metadata
        }


def save_text(text_dict, output_path):
    """Save extracted text to a file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for page_num, text in text_dict.items():
            f.write(f"\n{'='*50}\n")
            f.write(f"PAGE {page_num}\n")
            f.write(f"{'='*50}\n\n")
            f.write(text or "[No text extracted]")
            f.write("\n")

    print(f"✅ Text saved to: {output_path}")


def save_tables_to_excel(tables, output_path):
    """Save all extracted tables to an Excel file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for table_info in tables:
            sheet_name = f"Table_{table_info['table_num']}"
            df = clean_table(table_info['data'])
            df.to_excel(writer, sheet_name=sheet_name, index=False)

    print(f"✅ Tables saved to: {output_path}")


def save_tables_to_csv(tables, output_dir):
    """Save each table to a separate CSV file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    for table_info in tables:
        filename = f"table_{table_info['table_num']}.csv"
        filepath = output_dir / filename
        df = clean_table(table_info['data'])
        df.to_csv(filepath, index=False)

    print(f"✅ Tables saved to: {output_dir}/")


def process_pdf(pdf_path, output_dir=None, extract_text_flag=True,
                extract_tables_flag=True, table_format="excel"):
    """
    Process a PDF file and extract text and/or tables.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory for output files (default: same as PDF)
        extract_text_flag: Whether to extract text
        extract_tables_flag: Whether to extract tables
        table_format: 'excel' or 'csv'

    Returns:
        Dictionary with extraction results
    """
    pdf_path = Path(pdf_path)

    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Set output directory
    if output_dir is None:
        output_dir = pdf_path.parent
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    # Base name for output files
    base_name = pdf_path.stem

    print(f"\n{'='*60}")
    print(f"Processing: {pdf_path.name}")
    print(f"{'='*60}")

    # Get metadata
    metadata = get_pdf_metadata(pdf_path)
    print(f"Pages: {metadata['pages']}")

    results = {
        "pdf_path": str(pdf_path),
        "pages": metadata['pages'],
        "text_file": None,
        "tables_file": None,
        "tables_count": 0
    }

    # Extract text
    if extract_text_flag:
        print("\n📄 Extracting text...")
        text = extract_text(pdf_path)
        text_file = output_dir / f"{base_name}_text.txt"
        save_text(text, text_file)
        results["text_file"] = str(text_file)

    # Extract tables
    if extract_tables_flag:
        print("\n📊 Extracting tables...")
        tables = extract_tables(pdf_path)

        if tables:
            print(f"  Found {len(tables)} table(s)")
            results["tables_count"] = len(tables)

            if table_format == "excel":
                tables_file = output_dir / f"{base_name}_tables.xlsx"
                save_tables_to_excel(tables, tables_file)
                results["tables_file"] = str(tables_file)
            else:
                tables_dir = output_dir / f"{base_name}_tables"
                save_tables_to_csv(tables, tables_dir)
                results["tables_file"] = str(tables_dir)
        else:
            print("  No tables found")

    return results


def batch_process(pdf_folder, output_dir=None, **kwargs):
    """
    Process all PDFs in a folder.

    Args:
        pdf_folder: Folder containing PDF files
        output_dir: Output directory (default: subfolder in pdf_folder)
        **kwargs: Arguments to pass to process_pdf

    Returns:
        List of results for each PDF
    """
    pdf_folder = Path(pdf_folder)
    pdf_files = list(pdf_folder.glob("*.pdf"))

    if not pdf_files:
        print(f"No PDF files found in {pdf_folder}")
        return []

    if output_dir is None:
        output_dir = pdf_folder / "extracted"

    print(f"Found {len(pdf_files)} PDF file(s)")

    all_results = []
    for pdf_path in pdf_files:
        try:
            result = process_pdf(pdf_path, output_dir, **kwargs)
            all_results.append(result)
        except Exception as e:
            print(f"❌ Error processing {pdf_path.name}: {e}")
            all_results.append({"pdf_path": str(pdf_path), "error": str(e)})

    return all_results


def main():
    """Main entry point with example usage."""

    print("=" * 60)
    print("PDF DATA EXTRACTOR")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR EXTRACTION
    # ========================================

    # Single PDF
    pdf_path = "sample.pdf"

    # Or process a folder
    # pdf_folder = "/path/to/pdfs"

    # Output options
    output_dir = None  # Same folder as PDF
    extract_text = True
    extract_tables = True
    table_format = "excel"  # or "csv"

    # ========================================
    # RUN EXTRACTION
    # ========================================

    if os.path.isfile(pdf_path):
        result = process_pdf(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text,
            extract_tables_flag=extract_tables,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Text file: {result.get('text_file', 'Not extracted')}")
        print(f"Tables file: {result.get('tables_file', 'Not extracted')}")
        print(f"Tables found: {result.get('tables_count', 0)}")

    elif os.path.isdir(pdf_path):
        results = batch_process(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text,
            extract_tables_flag=extract_tables,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("BATCH EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Processed: {len(results)} files")

    else:
        print(f"❌ File or folder not found: {pdf_path}")
        print("\nTo use this script:")
        print("1. Update 'pdf_path' to point to your PDF file")
        print("2. Or set it to a folder path for batch processing")


if __name__ == "__main__":
    main()
```

How to Run This Script
- Install dependencies:

  ```bash
  pip install pypdf2 pdfplumber tabula-py pandas openpyxl
  ```

- Install Java (required for tabula-py)
- Save the script as `pdf_extractor.py`
- Update the `pdf_path` in the script to your PDF file
- Run the script:

  ```bash
  python pdf_extractor.py
  ```

- Expected output:

  ```
  ============================================================
  PDF DATA EXTRACTOR
  ============================================================

  ============================================================
  Processing: financial_report.pdf
  ============================================================
  Pages: 5

  📄 Extracting text...
  ✅ Text saved to: financial_report_text.txt

  📊 Extracting tables...
    Found 3 table(s)
  ✅ Tables saved to: financial_report_tables.xlsx

  ============================================================
  EXTRACTION COMPLETE
  ============================================================
  Text file: financial_report_text.txt
  Tables file: financial_report_tables.xlsx
  Tables found: 3
  ```
Customization Options
Handle Scanned PDFs (OCR)
For scanned documents that are actually images:
```bash
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
```

```python
from pdf2image import convert_from_path
import pytesseract

def extract_text_ocr(pdf_path):
    """Extract text from scanned PDF using OCR."""
    images = convert_from_path(pdf_path)
    text = ""

    for i, image in enumerate(images, start=1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n--- Page {i} ---\n{page_text}"

    return text
```

Extract Specific Fields
For structured documents like invoices:
```python
import re

def extract_invoice_data(text):
    """Extract common invoice fields from text."""
    data = {}

    # Invoice number
    match = re.search(r'Invoice\s*#?\s*:?\s*(\w+)', text, re.IGNORECASE)
    if match:
        data['invoice_number'] = match.group(1)

    # Date
    match = re.search(r'Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', text)
    if match:
        data['date'] = match.group(1)

    # Total amount
    match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', text, re.IGNORECASE)
    if match:
        data['total'] = match.group(1)

    return data
```

Merge Multiple Tables
```python
def merge_tables(tables, on_column=None):
    """Merge multiple extracted tables into one."""
    dfs = [clean_table(t['data']) for t in tables]

    if on_column:
        result = dfs[0]
        for df in dfs[1:]:
            result = result.merge(df, on=on_column, how='outer')
        return result
    else:
        return pd.concat(dfs, ignore_index=True)
```

Common Issues & Solutions
| Issue | Solution |
|---|---|
| "Java not found" | Install Java and add to PATH |
| Empty table extraction | Try pdfplumber instead of tabula |
| Garbled text | PDF may be scanned; use OCR |
| Table headers wrong | Use the `promote_header()` function |
| Merged cells broken | Manually clean the DataFrame |
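Two of those fixes can be shown on a mocked-up table. The rows below are invented, shaped like what pdfplumber's `extract_tables()` returns: the first row is really the header, and a merged cell came through as None on continuation rows.

```python
import pandas as pd

# Invented rows: a header row, plus a merged "Region" cell
# that arrived as None on the continuation row.
rows = [
    ["Region", "Product", "Sales"],
    ["North",  "Widgets", "120"],
    [None,     "Gadgets", "95"],
    ["South",  "Widgets", "80"],
]

df = pd.DataFrame(rows)

# Promote the first row to column headers (same idea as promote_header() in Step 5)
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)

# Forward-fill the blanks that merged cells left behind
df["Region"] = df["Region"].ffill()

print(df)
```

After this, `Region` reads North, North, South, which is what the merged cell meant visually.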
Taking It Further
Automate Invoice Processing
```python
def process_invoices(folder_path, output_csv):
    """Process all invoice PDFs and compile data."""
    all_data = []

    for pdf_file in Path(folder_path).glob("*.pdf"):
        text = extract_text(pdf_file)
        full_text = "\n".join(text.values())

        data = extract_invoice_data(full_text)
        data['source_file'] = pdf_file.name
        all_data.append(data)

    df = pd.DataFrame(all_data)
    df.to_csv(output_csv, index=False)
    print(f"Processed {len(all_data)} invoices")
```

Watch Folder for New PDFs
```python
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.src_path.endswith('.pdf'):
            print(f"New PDF detected: {event.src_path}")
            process_pdf(event.src_path)

# Usage:
# observer = Observer()
# observer.schedule(PDFHandler(), "/path/to/watch")
# observer.start()
```

Conclusion
PDF data extraction used to be a manual nightmare. Now you have tools that handle text, tables, and even scanned documents automatically.
Start with clean, text-based PDFs to get comfortable with the extraction process. Then tackle more complex scenarios—multi-page tables, scanned documents, structured field extraction.
The real power comes from combining this with your other automations. Extract data from PDFs, process it with pandas, generate reports with openpyxl, and email the results—all without opening a single file manually.
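As a small taste of that chaining, the per-file dictionaries returned by `batch_process()` in the complete script can be rolled straight into a pandas summary. This is a sketch; the results below are mocked up, so no PDFs are required to try it:

```python
import pandas as pd

def summarize_results(results):
    """Build a one-row-per-PDF summary from batch_process() result dicts."""
    df = pd.DataFrame(results)
    if "error" in df.columns:
        # Rows without an "error" key come through as NaN
        df["status"] = df["error"].isna().map({True: "ok", False: "failed"})
    else:
        df["status"] = "ok"
    return df

# Mocked-up results standing in for a real batch_process() run
fake_results = [
    {"pdf_path": "invoices/jan.pdf", "pages": 3, "tables_count": 2},
    {"pdf_path": "invoices/feb.pdf", "error": "Java not found"},
]
summary = summarize_results(fake_results)
print(summary[["pdf_path", "status"]])
```

From there it is one line to `summary.to_excel(...)` for a report.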
Your PDFs, finally accessible.