Extract Data from PDFs with Python: Tables, Text, and More
PDFs are everywhere in business—invoices, reports, contracts, forms. They look great but are notoriously difficult to work with programmatically. Copy-pasting from PDFs into spreadsheets? A nightmare of broken formatting.
Python can crack open those PDFs and extract exactly what you need. Let's learn how.
What You'll Learn
- Extracting text from PDF files
- Pulling tables into structured data
- Handling multi-page documents
- Working with scanned PDFs (OCR)
- Saving extracted data to Excel or CSV
Prerequisites
- Python 3.8 or higher
- PyPDF2 library (`pip install pypdf2`)
- pdfplumber library (`pip install pdfplumber`)
- tabula-py for table extraction (`pip install tabula-py`)
- Java Runtime Environment (required for tabula-py)
The Problem
PDFs are designed for display, not data extraction:
- Text can be scattered across invisible boxes
- Tables are just visual arrangements, not data structures
- Scanned documents are images, not text
- Copy-paste mangles formatting
- Manual extraction takes hours
The Solution
We'll use Python libraries that understand PDF structure:
- PyPDF2: Basic text extraction
- pdfplumber: Better text extraction with position data
- tabula-py: Specialized table extraction
- pytesseract: OCR for scanned documents
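Before reaching for OCR, it helps to know whether a PDF even has a text layer. The helper below is my own heuristic, not part of any of these libraries: given the per-page text from whichever extractor you use, it flags pages that came back empty or nearly empty — those are your scan candidates.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Given a {page_num: text} mapping from any extractor, return the
    pages whose text layer is missing or suspiciously short."""
    return [num for num, text in page_texts.items()
            if len((text or "").strip()) < min_chars]

# Pages 2 and 3 came back (almost) empty -- likely scanned images
extracted = {
    1: "Quarterly revenue summary for fiscal year 2024...",
    2: "",
    3: "  \n ",
    4: "Appendix A: methodology and data sources",
}
print(pages_needing_ocr(extracted))  # [2, 3]
```

Tune `min_chars` to your documents — headers and page numbers alone can produce a few characters even on a scanned page.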
Step 1: Installing the Libraries
```bash
pip install pypdf2 pdfplumber tabula-py pandas openpyxl
```
For tabula-py, you also need Java installed:
- Windows: Download from java.com and install
- macOS: `brew install java`
- Linux: `sudo apt install default-jre`
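Because tabula-py shells out to Java, it's worth confirming Java is actually on your PATH before kicking off a long batch run. A minimal sketch using only the standard library (my own helper, not part of tabula-py):

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH (needed by tabula-py)."""
    return shutil.which("java") is not None

if not java_available():
    print("Java not found - install a JRE before using tabula-py")
```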
Step 2: Basic Text Extraction with PyPDF2
The simplest approach for text-based PDFs:
```python
from PyPDF2 import PdfReader


def extract_text_pypdf2(pdf_path):
    """
    Extract all text from a PDF using PyPDF2.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing all extracted text
    """
    reader = PdfReader(pdf_path)

    text = ""
    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text()
        if page_text:
            text += f"\n--- Page {page_num} ---\n"
            text += page_text

    return text


# Example usage
text = extract_text_pypdf2("report.pdf")
print(text)
```
Step 3: Better Text Extraction with pdfplumber
pdfplumber gives more control and better results:
```python
import pdfplumber


def extract_text_pdfplumber(pdf_path):
    """
    Extract text from PDF with better formatting preservation.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing extracted text
    """
    text = ""

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                text += f"\n--- Page {page_num} ---\n"
                text += page_text

    return text


def extract_text_by_area(pdf_path, page_num, bbox):
    """
    Extract text from a specific area of a page.

    Args:
        pdf_path: Path to the PDF file
        page_num: Page number (1-indexed)
        bbox: Bounding box (x0, y0, x1, y1) in points

    Returns:
        Extracted text from the specified area
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num - 1]  # Convert to 0-indexed
        cropped = page.crop(bbox)
        return cropped.extract_text()


# Example: Extract header area
header_text = extract_text_by_area("report.pdf", 1, (0, 0, 612, 100))
```
Step 4: Extracting Tables
This is where it gets powerful. tabula-py excels at table extraction:
```python
import tabula
import pandas as pd
import pdfplumber


def extract_tables_tabula(pdf_path, pages="all"):
    """
    Extract all tables from a PDF.

    Args:
        pdf_path: Path to the PDF file
        pages: Pages to extract from ('all', '1', '1-3', [1, 2, 3])

    Returns:
        List of pandas DataFrames, one per table
    """
    tables = tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        pandas_options={"header": None}  # Don't assume first row is header
    )

    return tables


def extract_tables_pdfplumber(pdf_path):
    """
    Extract tables using pdfplumber (alternative method).

    Args:
        pdf_path: Path to the PDF file

    Returns:
        List of tables (each table is a list of rows)
    """
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table_num, table in enumerate(tables, start=1):
                all_tables.append({
                    "page": page_num,
                    "table_num": table_num,
                    "data": table
                })

    return all_tables


# Example usage
tables = extract_tables_tabula("financial_report.pdf")

for i, df in enumerate(tables):
    print(f"\nTable {i + 1}:")
    print(df.head())
```
Step 5: Cleaning Extracted Data
Raw extracted data often needs cleaning:
```python
def clean_dataframe(df):
    """
    Clean a DataFrame extracted from PDF.

    Args:
        df: pandas DataFrame

    Returns:
        Cleaned DataFrame
    """
    # Remove empty rows and columns
    df = df.dropna(how='all')
    df = df.dropna(axis=1, how='all')

    # Strip whitespace from string columns
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.strip()

    # Reset index
    df = df.reset_index(drop=True)

    return df


def promote_header(df):
    """
    Promote the first row to column headers.

    Args:
        df: pandas DataFrame

    Returns:
        DataFrame with first row as headers
    """
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    return df
```
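Numbers also tend to arrive as strings — currency symbols, thousands separators, stray whitespace. A small sketch of one way to coerce them (my own helper, assuming the hypothetical column names shown):

```python
import pandas as pd

def coerce_numeric(df, columns):
    """Strip currency symbols and thousands separators, then convert
    the given columns to real numbers (unparseable values become NaN)."""
    for col in columns:
        cleaned = (df[col].astype(str)
                          .str.replace(r'[$,]', '', regex=True)
                          .str.strip())
        df[col] = pd.to_numeric(cleaned, errors='coerce')
    return df

# Example: totals extracted as formatted strings
df = pd.DataFrame({"total": ["$1,234.50", "99", "n/a"]})
df = coerce_numeric(df, ["total"])
print(df["total"].tolist())
```

With numeric columns in place, sums and pivots work as expected instead of silently concatenating strings.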
The Complete Script
```python
#!/usr/bin/env python3
"""
PDF Data Extractor - Extract text and tables from PDF files.
Author: Alex Rodriguez

This script extracts text and tables from PDF documents and saves
them in usable formats (CSV, Excel, or text files).
"""

import os
from pathlib import Path

import pandas as pd
import pdfplumber
import tabula


def extract_text(pdf_path):
    """
    Extract all text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with page numbers as keys and text as values
    """
    results = {}

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            results[i] = text if text else ""

    return results


def extract_tables(pdf_path, method="tabula"):
    """
    Extract all tables from a PDF file.

    Args:
        pdf_path: Path to the PDF file
        method: 'tabula' or 'pdfplumber'

    Returns:
        List of dictionaries with table metadata and DataFrames
    """
    tables = []

    if method == "tabula":
        try:
            raw_tables = tabula.read_pdf(
                pdf_path,
                pages="all",
                multiple_tables=True
            )

            for i, df in enumerate(raw_tables, start=1):
                if not df.empty:
                    tables.append({
                        "table_num": i,
                        "rows": len(df),
                        "columns": len(df.columns),
                        "data": df
                    })
        except Exception as e:
            print(f"Tabula extraction failed: {e}")
            print("Falling back to pdfplumber...")
            method = "pdfplumber"

    if method == "pdfplumber":
        with pdfplumber.open(pdf_path) as pdf:
            table_num = 0

            for page_num, page in enumerate(pdf.pages, start=1):
                page_tables = page.extract_tables()

                for table in page_tables:
                    if table:
                        table_num += 1
                        df = pd.DataFrame(table[1:], columns=table[0])
                        tables.append({
                            "table_num": table_num,
                            "page": page_num,
                            "rows": len(df),
                            "columns": len(df.columns),
                            "data": df
                        })

    return tables


def clean_table(df):
    """Clean and standardize a DataFrame."""
    # Remove completely empty rows and columns
    df = df.dropna(how='all').dropna(axis=1, how='all')

    # Clean string values
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].astype(str).str.strip()
            df[col] = df[col].replace(['nan', 'None', ''], pd.NA)

    # Remove rows that are all NA after cleaning
    df = df.dropna(how='all')

    return df.reset_index(drop=True)


def get_pdf_metadata(pdf_path):
    """
    Get metadata from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with PDF metadata
    """
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "path": str(pdf_path),
            "pages": len(pdf.pages),
            "metadata": pdf.metadata
        }


def save_text(text_dict, output_path):
    """Save extracted text to a file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for page_num, text in text_dict.items():
            f.write(f"\n{'='*50}\n")
            f.write(f"PAGE {page_num}\n")
            f.write(f"{'='*50}\n\n")
            f.write(text or "[No text extracted]")
            f.write("\n")

    print(f"✅ Text saved to: {output_path}")


def save_tables_to_excel(tables, output_path):
    """Save all extracted tables to an Excel file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for table_info in tables:
            sheet_name = f"Table_{table_info['table_num']}"
            df = clean_table(table_info['data'])
            df.to_excel(writer, sheet_name=sheet_name, index=False)

    print(f"✅ Tables saved to: {output_path}")


def save_tables_to_csv(tables, output_dir):
    """Save each table to a separate CSV file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    for table_info in tables:
        filename = f"table_{table_info['table_num']}.csv"
        filepath = output_dir / filename
        df = clean_table(table_info['data'])
        df.to_csv(filepath, index=False)

    print(f"✅ Tables saved to: {output_dir}/")


def process_pdf(pdf_path, output_dir=None, extract_text_flag=True,
                extract_tables_flag=True, table_format="excel"):
    """
    Process a PDF file and extract text and/or tables.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory for output files (default: same as PDF)
        extract_text_flag: Whether to extract text
        extract_tables_flag: Whether to extract tables
        table_format: 'excel' or 'csv'

    Returns:
        Dictionary with extraction results
    """
    pdf_path = Path(pdf_path)

    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Set output directory
    if output_dir is None:
        output_dir = pdf_path.parent
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    # Base name for output files
    base_name = pdf_path.stem

    print(f"\n{'='*60}")
    print(f"Processing: {pdf_path.name}")
    print(f"{'='*60}")

    # Get metadata
    metadata = get_pdf_metadata(pdf_path)
    print(f"Pages: {metadata['pages']}")

    results = {
        "pdf_path": str(pdf_path),
        "pages": metadata['pages'],
        "text_file": None,
        "tables_file": None,
        "tables_count": 0
    }

    # Extract text
    if extract_text_flag:
        print("\n📄 Extracting text...")
        text = extract_text(pdf_path)
        text_file = output_dir / f"{base_name}_text.txt"
        save_text(text, text_file)
        results["text_file"] = str(text_file)

    # Extract tables
    if extract_tables_flag:
        print("\n📊 Extracting tables...")
        tables = extract_tables(pdf_path)

        if tables:
            print(f"   Found {len(tables)} table(s)")
            results["tables_count"] = len(tables)

            if table_format == "excel":
                tables_file = output_dir / f"{base_name}_tables.xlsx"
                save_tables_to_excel(tables, tables_file)
                results["tables_file"] = str(tables_file)
            else:
                tables_dir = output_dir / f"{base_name}_tables"
                save_tables_to_csv(tables, tables_dir)
                results["tables_file"] = str(tables_dir)
        else:
            print("   No tables found")

    return results


def batch_process(pdf_folder, output_dir=None, **kwargs):
    """
    Process all PDFs in a folder.

    Args:
        pdf_folder: Folder containing PDF files
        output_dir: Output directory (default: subfolder in pdf_folder)
        **kwargs: Arguments to pass to process_pdf

    Returns:
        List of results for each PDF
    """
    pdf_folder = Path(pdf_folder)
    pdf_files = list(pdf_folder.glob("*.pdf"))

    if not pdf_files:
        print(f"No PDF files found in {pdf_folder}")
        return []

    if output_dir is None:
        output_dir = pdf_folder / "extracted"

    print(f"Found {len(pdf_files)} PDF file(s)")

    all_results = []
    for pdf_path in pdf_files:
        try:
            result = process_pdf(pdf_path, output_dir, **kwargs)
            all_results.append(result)
        except Exception as e:
            print(f"❌ Error processing {pdf_path.name}: {e}")
            all_results.append({"pdf_path": str(pdf_path), "error": str(e)})

    return all_results


def main():
    """Main entry point with example usage."""

    print("=" * 60)
    print("PDF DATA EXTRACTOR")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR EXTRACTION
    # ========================================

    # Single PDF (or set this to a folder path for batch processing)
    pdf_path = "sample.pdf"

    # Output options
    output_dir = None  # Same folder as PDF
    extract_text_flag = True
    extract_tables_flag = True
    table_format = "excel"  # or "csv"

    # ========================================
    # RUN EXTRACTION
    # ========================================

    if os.path.isfile(pdf_path):
        result = process_pdf(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text_flag,
            extract_tables_flag=extract_tables_flag,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Text file: {result.get('text_file') or 'Not extracted'}")
        print(f"Tables file: {result.get('tables_file') or 'Not extracted'}")
        print(f"Tables found: {result.get('tables_count', 0)}")

    elif os.path.isdir(pdf_path):
        results = batch_process(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text_flag,
            extract_tables_flag=extract_tables_flag,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("BATCH EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Processed: {len(results)} files")

    else:
        print(f"❌ File or folder not found: {pdf_path}")
        print("\nTo use this script:")
        print("1. Update 'pdf_path' to point to your PDF file")
        print("2. Or set it to a folder path for batch processing")


if __name__ == "__main__":
    main()
```
How to Run This Script
1. Install dependencies:

```bash
pip install pypdf2 pdfplumber tabula-py pandas openpyxl
```

2. Install Java (required for tabula-py)
3. Save the script as `pdf_extractor.py`
4. Update `pdf_path` in the script to point to your PDF file
5. Run the script:

```bash
python pdf_extractor.py
```

6. Expected output:

```
============================================================
PDF DATA EXTRACTOR
============================================================

============================================================
Processing: financial_report.pdf
============================================================
Pages: 5

📄 Extracting text...
✅ Text saved to: financial_report_text.txt

📊 Extracting tables...
   Found 3 table(s)
✅ Tables saved to: financial_report_tables.xlsx

============================================================
EXTRACTION COMPLETE
============================================================
Text file: financial_report_text.txt
Tables file: financial_report_tables.xlsx
Tables found: 3
```
Customization Options
Handle Scanned PDFs (OCR)
For scanned documents that are actually images:
```bash
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
# pdf2image additionally requires Poppler on your system
```
```python
from pdf2image import convert_from_path
import pytesseract


def extract_text_ocr(pdf_path):
    """Extract text from scanned PDF using OCR."""
    images = convert_from_path(pdf_path)
    text = ""

    for i, image in enumerate(images, start=1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n--- Page {i} ---\n{page_text}"

    return text
```
Extract Specific Fields
For structured documents like invoices:
```python
import re


def extract_invoice_data(text):
    """Extract common invoice fields from text."""
    data = {}

    # Invoice number
    match = re.search(r'Invoice\s*#?\s*:?\s*(\w+)', text, re.IGNORECASE)
    if match:
        data['invoice_number'] = match.group(1)

    # Date
    match = re.search(r'Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', text)
    if match:
        data['date'] = match.group(1)

    # Total amount
    match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', text, re.IGNORECASE)
    if match:
        data['total'] = match.group(1)

    return data
```
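One pitfall with patterns like these: `re.search` returns the first match, so a case-insensitive `Total` pattern will happily match inside "Subtotal" if that line comes first. A quick demonstration, with a word boundary as one possible fix:

```python
import re

sample = "Subtotal: $1,100.00\nTotal: $1,234.50"

# The unanchored pattern matches inside "Subtotal" first
loose = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', sample, re.IGNORECASE)
print(loose.group(1))  # 1,100.00

# A word boundary (\b) pins the match to a standalone "Total"
strict = re.search(r'\bTotal\s*:?\s*\$?([\d,]+\.?\d*)', sample, re.IGNORECASE)
print(strict.group(1))  # 1,234.50
```

Always test your patterns against a few real documents before trusting extracted amounts.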
Merge Multiple Tables
```python
def merge_tables(tables, on_column=None):
    """Merge multiple extracted tables into one."""
    dfs = [clean_table(t['data']) for t in tables]

    if on_column:
        result = dfs[0]
        for df in dfs[1:]:
            result = result.merge(df, on=on_column, how='outer')
        return result
    else:
        return pd.concat(dfs, ignore_index=True)
```
Common Issues & Solutions
| Issue | Solution |
|---|---|
| "Java not found" | Install Java and add to PATH |
| Empty table extraction | Try pdfplumber instead of tabula |
| Garbled text | PDF may be scanned; use OCR |
| Table headers wrong | Use the `promote_header()` function |
| Merged cells broken | Manually clean the DataFrame |
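For the merged-cell case, "manually clean" often just means forward-filling the blanks: a cell merged across several rows usually extracts as a value in the first row followed by empty strings. A minimal sketch of that cleanup (my own helper, with illustrative column names):

```python
import pandas as pd

def fill_merged_cells(df, columns):
    """Forward-fill the blank values that vertically merged cells
    leave behind after extraction."""
    for col in columns:
        df[col] = df[col].replace('', pd.NA).ffill()
    return df

# Example: "region" was one merged cell spanning the first three rows
df = pd.DataFrame({"region": ["North", "", "", "South"],
                   "sales": [120, 95, 87, 140]})
df = fill_merged_cells(df, ["region"])
print(df["region"].tolist())  # ['North', 'North', 'North', 'South']
```

Only apply this to columns you know were merged — forward-filling a genuinely sparse column would fabricate data.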
Taking It Further
Automate Invoice Processing
```python
def process_invoices(folder_path, output_csv):
    """Process all invoice PDFs and compile data."""
    all_data = []

    for pdf_file in Path(folder_path).glob("*.pdf"):
        text = extract_text(pdf_file)
        full_text = "\n".join(text.values())

        data = extract_invoice_data(full_text)
        data['source_file'] = pdf_file.name
        all_data.append(data)

    df = pd.DataFrame(all_data)
    df.to_csv(output_csv, index=False)
    print(f"Processed {len(all_data)} invoices")
```
Watch Folder for New PDFs
```python
import time

# Requires: pip install watchdog
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.src_path.endswith('.pdf'):
            print(f"New PDF detected: {event.src_path}")
            process_pdf(event.src_path)


# Usage:
# observer = Observer()
# observer.schedule(PDFHandler(), "/path/to/watch")
# observer.start()
# try:
#     while True:  # Keep the script alive while the observer watches
#         time.sleep(1)
# except KeyboardInterrupt:
#     observer.stop()
# observer.join()
```
Conclusion
PDF data extraction used to be a manual nightmare. Now you have tools that handle text, tables, and even scanned documents automatically.
Start with clean, text-based PDFs to get comfortable with the extraction process. Then tackle more complex scenarios—multi-page tables, scanned documents, structured field extraction.
The real power comes from combining this with your other automations. Extract data from PDFs, process it with pandas, generate reports with openpyxl, and email the results—all without opening a single file manually.
Your PDFs, finally accessible.