
Extract Data from PDFs with Python: Tables, Text, and More

Alex Rodriguez · 12 min read


PDFs are everywhere in business—invoices, reports, contracts, forms. They look great but are notoriously difficult to work with programmatically. Copy-pasting from PDFs into spreadsheets? A nightmare of broken formatting.

Python can crack open those PDFs and extract exactly what you need. Let's learn how.

What You'll Learn

  • Extracting text from PDF files
  • Pulling tables into structured data
  • Handling multi-page documents
  • Working with scanned PDFs (OCR)
  • Saving extracted data to Excel or CSV

Prerequisites

  • Python 3.8 or higher
  • PyPDF2 library (pip install pypdf2; the project now continues under the name pypdf)
  • pdfplumber library (pip install pdfplumber)
  • tabula-py for table extraction (pip install tabula-py)
  • Java Runtime Environment (required for tabula-py)

The Problem

PDFs are designed for display, not data extraction:

  • Text can be scattered across invisible boxes
  • Tables are just visual arrangements, not data structures
  • Scanned documents are images, not text
  • Copy-paste mangles formatting
  • Manual extraction takes hours

The Solution

We'll use Python libraries that understand PDF structure:

  1. PyPDF2: Basic text extraction
  2. pdfplumber: Better text extraction with position data
  3. tabula-py: Specialized table extraction
  4. pytesseract: OCR for scanned documents

Step 1: Installing the Libraries

bash
pip install pypdf2 pdfplumber tabula-py pandas openpyxl

For tabula-py, you also need Java installed:

  • Windows: Download from java.com and install
  • macOS: brew install openjdk
  • Linux: sudo apt install default-jre

Step 2: Basic Text Extraction with PyPDF2

The simplest approach for text-based PDFs:

python
from PyPDF2 import PdfReader

def extract_text_pypdf2(pdf_path):
    """
    Extract all text from a PDF using PyPDF2.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing all extracted text
    """
    reader = PdfReader(pdf_path)

    text = ""
    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text()
        if page_text:
            text += f"\n--- Page {page_num} ---\n"
            text += page_text

    return text

# Example usage
text = extract_text_pypdf2("report.pdf")
print(text)

Step 3: Better Text Extraction with pdfplumber

pdfplumber gives more control and better results:

python
import pdfplumber

def extract_text_pdfplumber(pdf_path):
    """
    Extract text from PDF with better formatting preservation.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        String containing extracted text
    """
    text = ""

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                text += f"\n--- Page {page_num} ---\n"
                text += page_text

    return text


def extract_text_by_area(pdf_path, page_num, bbox):
    """
    Extract text from a specific area of a page.

    Args:
        pdf_path: Path to the PDF file
        page_num: Page number (1-indexed)
        bbox: Bounding box (x0, top, x1, bottom) in points

    Returns:
        Extracted text from the specified area
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num - 1]  # Convert to 0-indexed
        cropped = page.crop(bbox)
        return cropped.extract_text()

# Example: Extract header area
header_text = extract_text_by_area("report.pdf", 1, (0, 0, 612, 100))
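pdfplumber measures bounding boxes as (x0, top, x1, bottom) in points, with the origin at the top-left of the page. If you're unsure which coordinates to pass, it helps to derive them from the page size. A quick sketch, assuming a US Letter page (612 x 792 points); inside a with pdfplumber.open(...) block you would use page.width and page.height instead of these hard-coded values:

```python
# pdfplumber bboxes are (x0, top, x1, bottom) in points, origin at top-left.
# These US Letter dimensions are an assumption; real code should read
# page.width and page.height from the open page.
page_width, page_height = 612, 792

header_bbox = (0, 0, page_width, 100)                         # top strip
footer_bbox = (0, page_height - 50, page_width, page_height)  # bottom strip
left_column = (0, 0, page_width / 2, page_height)             # left half

print(header_bbox, footer_bbox, left_column)
```

Deriving regions this way keeps the crop coordinates valid even when pages aren't Letter-sized.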

Step 4: Extracting Tables

This is where it gets powerful. tabula-py excels at table extraction:

python
import pdfplumber
import tabula
import pandas as pd

def extract_tables_tabula(pdf_path, pages="all"):
    """
    Extract all tables from a PDF.

    Args:
        pdf_path: Path to the PDF file
        pages: Pages to extract from ('all', '1', '1-3', [1,2,3])

    Returns:
        List of pandas DataFrames, one per table
    """
    tables = tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        pandas_options={"header": None}  # Don't assume first row is header
    )

    return tables


def extract_tables_pdfplumber(pdf_path):
    """
    Extract tables using pdfplumber (alternative method).

    Args:
        pdf_path: Path to the PDF file

    Returns:
        List of tables (each table is a list of rows)
    """
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table_num, table in enumerate(tables, start=1):
                all_tables.append({
                    "page": page_num,
                    "table_num": table_num,
                    "data": table
                })

    return all_tables


# Example usage
tables = extract_tables_tabula("financial_report.pdf")

for i, df in enumerate(tables):
    print(f"\nTable {i + 1}:")
    print(df.head())

Step 5: Cleaning Extracted Data

Raw extracted data often needs cleaning:

python
def clean_dataframe(df):
    """
    Clean a DataFrame extracted from PDF.

    Args:
        df: pandas DataFrame

    Returns:
        Cleaned DataFrame
    """
    # Remove empty rows and columns
    df = df.dropna(how='all')
    df = df.dropna(axis=1, how='all')

    # Strip whitespace from string columns
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.strip()

    # Reset index
    df = df.reset_index(drop=True)

    return df


def promote_header(df):
    """
    Promote the first row to column headers.

    Args:
        df: pandas DataFrame

    Returns:
        DataFrame with first row as headers
    """
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    return df
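To see how these cleaning steps fit together, here's a self-contained sketch on a DataFrame that mimics raw extractor output (the sample values are made up). It inlines the same dropna and header-promotion operations rather than calling the helper functions:

```python
import pandas as pd

# Mimic raw extractor output: header stuck in row 0, plus an empty row
raw = pd.DataFrame([
    ["Item", "Qty", "Price"],
    ["Widget", "4", "9.99"],
    [None, None, None],
    ["Gadget", "2", "19.50"],
])

df = raw.dropna(how="all").reset_index(drop=True)  # drop the empty row
df.columns = df.iloc[0]                            # promote first row to header
df = df.iloc[1:].reset_index(drop=True)

print(df.columns.tolist())  # ['Item', 'Qty', 'Price']
print(len(df))              # 2
```

Running the cleanup before promoting the header matters: an empty first row would otherwise become your column names.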

The Complete Script

python
#!/usr/bin/env python3
"""
PDF Data Extractor - Extract text and tables from PDF files.
Author: Alex Rodriguez

This script extracts text and tables from PDF documents and saves
them in usable formats (CSV, Excel, or text files).
"""

import os
from pathlib import Path

import pandas as pd
import pdfplumber
import tabula


def extract_text(pdf_path):
    """
    Extract all text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with page numbers as keys and text as values
    """
    results = {}

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            results[i] = text if text else ""

    return results


def extract_tables(pdf_path, method="tabula"):
    """
    Extract all tables from a PDF file.

    Args:
        pdf_path: Path to the PDF file
        method: 'tabula' or 'pdfplumber'

    Returns:
        List of dictionaries with table metadata and DataFrames
    """
    tables = []

    if method == "tabula":
        try:
            raw_tables = tabula.read_pdf(
                pdf_path,
                pages="all",
                multiple_tables=True
            )

            for i, df in enumerate(raw_tables, start=1):
                if not df.empty:
                    tables.append({
                        "table_num": i,
                        "rows": len(df),
                        "columns": len(df.columns),
                        "data": df
                    })
        except Exception as e:
            print(f"Tabula extraction failed: {e}")
            print("Falling back to pdfplumber...")
            method = "pdfplumber"

    if method == "pdfplumber":
        with pdfplumber.open(pdf_path) as pdf:
            table_num = 0

            for page_num, page in enumerate(pdf.pages, start=1):
                page_tables = page.extract_tables()

                for table in page_tables:
                    if table:
                        table_num += 1
                        df = pd.DataFrame(table[1:], columns=table[0])
                        tables.append({
                            "table_num": table_num,
                            "page": page_num,
                            "rows": len(df),
                            "columns": len(df.columns),
                            "data": df
                        })

    return tables


def clean_table(df):
    """Clean and standardize a DataFrame."""
    # Remove completely empty rows and columns
    df = df.dropna(how='all').dropna(axis=1, how='all')

    # Clean string values
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].astype(str).str.strip()
            df[col] = df[col].replace(['nan', 'None', ''], pd.NA)

    # Remove rows that are all NA after cleaning
    df = df.dropna(how='all')

    return df.reset_index(drop=True)


def get_pdf_metadata(pdf_path):
    """
    Get metadata from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with PDF metadata
    """
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "path": str(pdf_path),
            "pages": len(pdf.pages),
            "metadata": pdf.metadata
        }


def save_text(text_dict, output_path):
    """Save extracted text to a file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for page_num, text in text_dict.items():
            f.write(f"\n{'='*50}\n")
            f.write(f"PAGE {page_num}\n")
            f.write(f"{'='*50}\n\n")
            f.write(text or "[No text extracted]")
            f.write("\n")

    print(f"✅ Text saved to: {output_path}")


def save_tables_to_excel(tables, output_path):
    """Save all extracted tables to an Excel file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for table_info in tables:
            sheet_name = f"Table_{table_info['table_num']}"
            df = clean_table(table_info['data'])
            df.to_excel(writer, sheet_name=sheet_name, index=False)

    print(f"✅ Tables saved to: {output_path}")


def save_tables_to_csv(tables, output_dir):
    """Save each table to a separate CSV file."""
    if not tables:
        print("⚠️ No tables to save")
        return

    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    for table_info in tables:
        filename = f"table_{table_info['table_num']}.csv"
        filepath = output_dir / filename
        df = clean_table(table_info['data'])
        df.to_csv(filepath, index=False)

    print(f"✅ Tables saved to: {output_dir}/")


def process_pdf(pdf_path, output_dir=None, extract_text_flag=True,
                extract_tables_flag=True, table_format="excel"):
    """
    Process a PDF file and extract text and/or tables.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory for output files (default: same as PDF)
        extract_text_flag: Whether to extract text
        extract_tables_flag: Whether to extract tables
        table_format: 'excel' or 'csv'

    Returns:
        Dictionary with extraction results
    """
    pdf_path = Path(pdf_path)

    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Set output directory
    if output_dir is None:
        output_dir = pdf_path.parent
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    # Base name for output files
    base_name = pdf_path.stem

    print(f"\n{'='*60}")
    print(f"Processing: {pdf_path.name}")
    print(f"{'='*60}")

    # Get metadata
    metadata = get_pdf_metadata(pdf_path)
    print(f"Pages: {metadata['pages']}")

    results = {
        "pdf_path": str(pdf_path),
        "pages": metadata['pages'],
        "text_file": None,
        "tables_file": None,
        "tables_count": 0
    }

    # Extract text
    if extract_text_flag:
        print("\n📄 Extracting text...")
        text = extract_text(pdf_path)
        text_file = output_dir / f"{base_name}_text.txt"
        save_text(text, text_file)
        results["text_file"] = str(text_file)

    # Extract tables
    if extract_tables_flag:
        print("\n📊 Extracting tables...")
        tables = extract_tables(pdf_path)

        if tables:
            print(f"   Found {len(tables)} table(s)")
            results["tables_count"] = len(tables)

            if table_format == "excel":
                tables_file = output_dir / f"{base_name}_tables.xlsx"
                save_tables_to_excel(tables, tables_file)
                results["tables_file"] = str(tables_file)
            else:
                tables_dir = output_dir / f"{base_name}_tables"
                save_tables_to_csv(tables, tables_dir)
                results["tables_file"] = str(tables_dir)
        else:
            print("   No tables found")

    return results


def batch_process(pdf_folder, output_dir=None, **kwargs):
    """
    Process all PDFs in a folder.

    Args:
        pdf_folder: Folder containing PDF files
        output_dir: Output directory (default: subfolder in pdf_folder)
        **kwargs: Arguments to pass to process_pdf

    Returns:
        List of results for each PDF
    """
    pdf_folder = Path(pdf_folder)
    pdf_files = list(pdf_folder.glob("*.pdf"))

    if not pdf_files:
        print(f"No PDF files found in {pdf_folder}")
        return []

    if output_dir is None:
        output_dir = pdf_folder / "extracted"

    print(f"Found {len(pdf_files)} PDF file(s)")

    all_results = []
    for pdf_path in pdf_files:
        try:
            result = process_pdf(pdf_path, output_dir, **kwargs)
            all_results.append(result)
        except Exception as e:
            print(f"❌ Error processing {pdf_path.name}: {e}")
            all_results.append({"pdf_path": str(pdf_path), "error": str(e)})

    return all_results


def main():
    """Main entry point with example usage."""

    print("=" * 60)
    print("PDF DATA EXTRACTOR")
    print("=" * 60)

    # ========================================
    # CONFIGURE YOUR EXTRACTION
    # ========================================

    # Single PDF
    pdf_path = "sample.pdf"

    # Or process a folder
    # pdf_folder = "/path/to/pdfs"

    # Output options
    output_dir = None  # Same folder as PDF
    extract_text = True
    extract_tables = True
    table_format = "excel"  # or "csv"

    # ========================================
    # RUN EXTRACTION
    # ========================================

    if os.path.isfile(pdf_path):
        result = process_pdf(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text,
            extract_tables_flag=extract_tables,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Text file: {result.get('text_file', 'Not extracted')}")
        print(f"Tables file: {result.get('tables_file', 'Not extracted')}")
        print(f"Tables found: {result.get('tables_count', 0)}")

    elif os.path.isdir(pdf_path):
        results = batch_process(
            pdf_path,
            output_dir=output_dir,
            extract_text_flag=extract_text,
            extract_tables_flag=extract_tables,
            table_format=table_format
        )

        print("\n" + "=" * 60)
        print("BATCH EXTRACTION COMPLETE")
        print("=" * 60)
        print(f"Processed: {len(results)} files")

    else:
        print(f"❌ File or folder not found: {pdf_path}")
        print("\nTo use this script:")
        print("1. Update 'pdf_path' to point to your PDF file")
        print("2. Or set it to a folder path for batch processing")


if __name__ == "__main__":
    main()

How to Run This Script

  1. Install dependencies:

    bash
    pip install pypdf2 pdfplumber tabula-py pandas openpyxl
  2. Install Java (required for tabula-py)

  3. Save the script as pdf_extractor.py

  4. Update the pdf_path in the script to your PDF file

  5. Run the script:

    bash
    python pdf_extractor.py
  6. Expected output:

    Output
    ============================================================
    PDF DATA EXTRACTOR
    ============================================================
    
    ============================================================
    Processing: financial_report.pdf
    ============================================================
    Pages: 5
    
    📄 Extracting text...
    ✅ Text saved to: financial_report_text.txt
    
    📊 Extracting tables...
       Found 3 table(s)
    ✅ Tables saved to: financial_report_tables.xlsx
    
    ============================================================
    EXTRACTION COMPLETE
    ============================================================
    Text file: financial_report_text.txt
    Tables file: financial_report_tables.xlsx
    Tables found: 3

Customization Options

Handle Scanned PDFs (OCR)

For scanned documents that are actually images:

bash
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
python
from pdf2image import convert_from_path
import pytesseract

def extract_text_ocr(pdf_path):
    """Extract text from scanned PDF using OCR."""
    images = convert_from_path(pdf_path)
    text = ""

    for i, image in enumerate(images, start=1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n--- Page {i} ---\n{page_text}"

    return text

Extract Specific Fields

For structured documents like invoices:

python
import re

def extract_invoice_data(text):
    """Extract common invoice fields from text."""
    data = {}

    # Invoice number
    match = re.search(r'Invoice\s*#?\s*:?\s*(\w+)', text, re.IGNORECASE)
    if match:
        data['invoice_number'] = match.group(1)

    # Date
    match = re.search(r'Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', text)
    if match:
        data['date'] = match.group(1)

    # Total amount
    match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', text, re.IGNORECASE)
    if match:
        data['total'] = match.group(1)

    return data
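A quick sanity check on a made-up invoice snippet (the company name and values below are invented for illustration) shows what those regexes capture:

```python
import re

# Invented invoice text for demonstration only
sample = """ACME Corp
Invoice #: INV1042
Date: 03/15/2024
Total: $1,249.50"""

patterns = {
    "invoice_number": (r'Invoice\s*#?\s*:?\s*(\w+)', re.IGNORECASE),
    "date": (r'Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', 0),
    "total": (r'Total\s*:?\s*\$?([\d,]+\.?\d*)', re.IGNORECASE),
}

data = {}
for field, (pattern, flags) in patterns.items():
    match = re.search(pattern, sample, flags)
    if match:
        data[field] = match.group(1)

print(data)  # {'invoice_number': 'INV1042', 'date': '03/15/2024', 'total': '1,249.50'}
```

Real invoices vary a lot, so treat these patterns as starting points and test them against your own documents.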

Merge Multiple Tables

python
import pandas as pd

def merge_tables(tables, on_column=None):
    """Merge multiple extracted tables into one."""
    dfs = [clean_table(t['data']) for t in tables]

    if on_column:
        result = dfs[0]
        for df in dfs[1:]:
            result = result.merge(df, on=on_column, how='outer')
        return result
    else:
        return pd.concat(dfs, ignore_index=True)
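The two branches correspond to two situations: fragments of one logical table split across pages get stacked with concat, while tables sharing a key column get joined with merge. A small sketch with made-up data, calling pandas directly rather than merge_tables:

```python
import pandas as pd

# Two fragments of one logical table, split across pages (made-up data)
page1 = pd.DataFrame({"SKU": ["A1", "B2"], "Qty": [3, 5]})
page2 = pd.DataFrame({"SKU": ["C3"], "Qty": [7]})
stacked = pd.concat([page1, page2], ignore_index=True)

# A second table sharing the SKU key column
prices = pd.DataFrame({"SKU": ["A1", "B2", "C3"], "Price": [9.99, 4.50, 12.00]})
joined = stacked.merge(prices, on="SKU", how="outer")

print(len(stacked))             # 3
print(joined.columns.tolist())  # ['SKU', 'Qty', 'Price']
```

The how='outer' join keeps rows that appear in only one of the tables, which is safer than the default inner join when extraction drops rows.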

Common Issues & Solutions

  • "Java not found": install Java and add it to your PATH
  • Empty table extraction: try pdfplumber instead of tabula
  • Garbled text: the PDF may be scanned; use OCR
  • Wrong table headers: use the promote_header() function
  • Broken merged cells: clean the DataFrame manually

Taking It Further

Automate Invoice Processing

python
from pathlib import Path

import pandas as pd

def process_invoices(folder_path, output_csv):
    """Process all invoice PDFs and compile data."""
    all_data = []

    for pdf_file in Path(folder_path).glob("*.pdf"):
        text = extract_text(pdf_file)
        full_text = "\n".join(text.values())

        data = extract_invoice_data(full_text)
        data['source_file'] = pdf_file.name
        all_data.append(data)

    df = pd.DataFrame(all_data)
    df.to_csv(output_csv, index=False)
    print(f"Processed {len(all_data)} invoices")

Watch Folder for New PDFs

python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.src_path.endswith('.pdf'):
            print(f"New PDF detected: {event.src_path}")
            process_pdf(event.src_path)

# Usage:
# observer = Observer()
# observer.schedule(PDFHandler(), "/path/to/watch")
# observer.start()

Conclusion

PDF data extraction used to be a manual nightmare. Now you have tools that handle text, tables, and even scanned documents automatically.

Start with clean, text-based PDFs to get comfortable with the extraction process. Then tackle more complex scenarios—multi-page tables, scanned documents, structured field extraction.

The real power comes from combining this with your other automations. Extract data from PDFs, process it with pandas, generate reports with openpyxl, and email the results—all without opening a single file manually.

Your PDFs, finally accessible.
