Automate Invoice Processing with Python: OCR & Data Extraction

Every month, your team receives dozens—maybe hundreds—of invoices via email. Someone has to open each PDF, manually type the vendor name, invoice number, date, and amount into a spreadsheet, then route for approval.

It's tedious. It's error-prone. And it's completely automatable.

Today, we're building a Python script that automatically reads invoices, extracts key data using OCR (Optical Character Recognition), validates the information, and creates a structured Excel report—all while you focus on work that actually requires a human.

What You'll Learn

Using Tesseract OCR to read text from invoice PDFs
Extracting specific data (vendor, invoice number, date, amount) with regex
Validating and cleaning extracted data
Exporting results to Excel with formatting
Error handling and logging for production use

The Problem with Manual Invoice Processing

Manual invoice entry creates multiple pain points:

Time-consuming: 3-5 minutes per invoice × 100 invoices = 8+ hours monthly
Error-prone: Typos in amounts can cause payment issues
Inconsistent: Different people format data differently
Not scalable: More invoices = need more people
Boring work: Demoralizes your team with repetitive tasks

The Solution: Automated OCR Pipeline

Our Python solution will:

Monitor a folder for new invoice PDFs
Convert PDF pages to images
Use OCR to extract text
Parse text to find invoice details using patterns
Validate extracted data
Export to Excel with formatting
Move processed invoices to archive
Log all activities for audit trail

Prerequisites

Before we start, you'll need:

Python 3.8 or higher
Tesseract OCR installed on your system
Basic understanding of Python and regular expressions

Step 1: Install Required Libraries

Install the necessary Python packages:

bash

1pip install pytesseract Pillow pdf2image openpyxl python-dateutil

Tesseract OCR Installation:

Windows:

bash

1# Download installer from:
2# https://github.com/UB-Mannheim/tesseract/wiki
3
4# Add Tesseract to PATH or specify in script

Mac:

bash

1brew install tesseract

Linux:

bash

1sudo apt-get install tesseract-ocr

Step 2: PDF to Image Conversion

First, convert PDF invoices to images for OCR processing:

python

1from pdf2image import convert_from_path
2import pytesseract
3from PIL import Image
4import os
5
6def pdf_to_images(pdf_path):
7    """
8    Convert PDF pages to PIL Image objects.
9    
10    Args:
11        pdf_path: Path to PDF file
12    
13    Returns:
14        List of PIL Image objects (one per page)
15    """
16    try:
17        images = convert_from_path(
18            pdf_path,
19            dpi=300,  # Higher DPI = better OCR accuracy
20            fmt='png'
21        )
22        return images
23    except Exception as e:
24        print(f"Error converting PDF {pdf_path}: {e}")
25        return []
26
27
28def ocr_image(image):
29    """
30    Extract text from image using Tesseract OCR.
31    
32    Args:
33        image: PIL Image object
34    
35    Returns:
36        Extracted text as string
37    """
38    # Optional: Specify tesseract path if not in PATH
39    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
40    
41    try:
42        text = pytesseract.image_to_string(
43            image,
44            config='--psm 6'  # Assume uniform block of text
45        )
46        return text
47    except Exception as e:
48        print(f"OCR error: {e}")
49        return ""
50
51
52def extract_text_from_pdf(pdf_path):
53    """
54    Complete pipeline: PDF -> Images -> OCR -> Text.
55    
56    Args:
57        pdf_path: Path to PDF invoice
58    
59    Returns:
60        Extracted text from all pages combined
61    """
62    images = pdf_to_images(pdf_path)
63    
64    all_text = []
65    for i, image in enumerate(images):
66        print(f"  Processing page {i+1}/{len(images)}...")
67        text = ocr_image(image)
68        all_text.append(text)
69    
70    return "\n".join(all_text)

Step 3: Data Extraction with Regex

Now extract structured data from the OCR text:

python

1import re
2from datetime import datetime
3from dateutil import parser
4
5class InvoiceParser:
6    """
7    Parse invoice text and extract key fields.
8    """
9    
10    def __init__(self, text):
11        self.text = text
12        self.data = {}
13    
14    def extract_invoice_number(self):
15        """Extract invoice number using common patterns."""
16        
17        patterns = [
18            r'Invoice\s*#?:?\s*([A-Z0-9-]+)',
19            r'Invoice\s+Number:?\s*([A-Z0-9-]+)',
20            r'INV-?\s*([A-Z0-9-]+)',
21            r'Bill\s+No\.?:?\s*([A-Z0-9-]+)',
22        ]
23        
24        for pattern in patterns:
25            match = re.search(pattern, self.text, re.IGNORECASE)
26            if match:
27                return match.group(1).strip()
28        
29        return "NOT_FOUND"
30    
31    def extract_date(self):
32        """Extract invoice date."""
33        
34        patterns = [
35            r'Invoice\s+Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
36            r'Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
37            r'Dated?:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
38            r'(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4})',
39        ]
40        
41        for pattern in patterns:
42            match = re.search(pattern, self.text, re.IGNORECASE)
43            if match:
44                date_str = match.group(1)
45                try:
46                    # Parse various date formats
47                    date_obj = parser.parse(date_str, fuzzy=True)
48                    return date_obj.strftime('%Y-%m-%d')
49                except:
50                    continue
51        
52        return "NOT_FOUND"
53    
54    def extract_vendor(self):
55        """Extract vendor/company name from top of invoice."""
56        
57        # Usually vendor name is in first few lines
58        lines = self.text.split('\n')[:10]
59        
60        # Look for lines with company indicators
61        company_indicators = ['LLC', 'Ltd', 'Inc', 'Corp', 'Limited', 'Corporation']
62        
63        for line in lines:
64            line = line.strip()
65            if len(line) > 3:  # Skip very short lines
66                for indicator in company_indicators:
67                    if indicator in line:
68                        # Clean and return
69                        return re.sub(r'[^A-Za-z0-9\s&\-.,]', '', line).strip()
70        
71        # Fallback: return first substantial line
72        for line in lines:
73            line = line.strip()
74            if len(line) > 5 and not re.match(r'^[\d\s\-/]+$', line):
75                return line[:50]  # Limit length
76        
77        return "NOT_FOUND"
78    
79    def extract_total_amount(self):
80        """Extract total amount due."""
81        
82        patterns = [
83            r'Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',
84            r'Amount\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',
85            r'Total\s+Amount:?\s*\$?\s*([\d,]+\.?\d{0,2})',
86            r'Grand\s+Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',
87            r'Balance\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',
88        ]
89        
90        # Try each pattern
91        for pattern in patterns:
92            matches = re.findall(pattern, self.text, re.IGNORECASE)
93            if matches:
94                # Return last occurrence (usually the final total)
95                amount_str = matches[-1].replace(',', '')
96                try:
97                    return float(amount_str)
98                except:
99                    continue
100        
101        return 0.0
102    
103    def extract_all(self):
104        """Extract all fields and return as dictionary."""
105        
106        self.data = {
107            'invoice_number': self.extract_invoice_number(),
108            'invoice_date': self.extract_date(),
109            'vendor_name': self.extract_vendor(),
110            'total_amount': self.extract_total_amount(),
111        }
112        
113        return self.data
114
115
116def parse_invoice(pdf_path):
117    """
118    Complete workflow: Read PDF, extract text, parse data.
119    
120    Args:
121        pdf_path: Path to invoice PDF
122    
123    Returns:
124        Dictionary with extracted invoice data
125    """
126    print(f"Processing: {os.path.basename(pdf_path)}")
127    
128    # Extract text via OCR
129    text = extract_text_from_pdf(pdf_path)
130    
131    if not text.strip():
132        print("  Warning: No text extracted from PDF")
133        return None
134    
135    # Parse extracted text
136    parser = InvoiceParser(text)
137    data = parser.extract_all()
138    
139    # Add metadata
140    data['file_name'] = os.path.basename(pdf_path)
141    data['processing_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
142    
143    print(f"  ✓ Extracted: {data['vendor_name']} | {data['invoice_number']} | ${data['total_amount']}")
144    
145    return data

Step 4: Data Validation

Add validation to catch errors early:

python

1class InvoiceValidator:
2    """
3    Validate extracted invoice data for quality and completeness.
4    """
5    
6    @staticmethod
7    def validate(data):
8        """
9        Validate invoice data and return validation results.
10        
11        Args:
12            data: Dictionary with invoice data
13        
14        Returns:
15            Dictionary with validation status and issues
16        """
17        issues = []
18        warnings = []
19        
20        # Check for missing fields
21        if data['invoice_number'] == "NOT_FOUND":
22            issues.append("Invoice number not found")
23        
24        if data['invoice_date'] == "NOT_FOUND":
25            issues.append("Invoice date not found")
26        
27        if data['vendor_name'] == "NOT_FOUND":
28            warnings.append("Vendor name not found")
29        
30        if data['total_amount'] == 0.0:
31            issues.append("Total amount is zero or not found")
32        
33        # Check for suspicious amounts
34        if data['total_amount'] > 100000:
35            warnings.append(f"Unusually large amount: ${data['total_amount']:,.2f}")
36        
37        # Check date is reasonable
38        if data['invoice_date'] != "NOT_FOUND":
39            try:
40                inv_date = datetime.strptime(data['invoice_date'], '%Y-%m-%d')
41                today = datetime.now()
42                
43                # Future date?
44                if inv_date > today:
45                    warnings.append("Invoice date is in the future")
46                
47                # Very old invoice?
48                days_old = (today - inv_date).days
49                if days_old > 365:
50                    warnings.append(f"Invoice is {days_old} days old")
51            except:
52                pass
53        
54        # Determine overall status
55        if issues:
56            status = "FAILED"
57        elif warnings:
58            status = "WARNING"
59        else:
60            status = "PASSED"
61        
62        return {
63            'status': status,
64            'issues': issues,
65            'warnings': warnings
66        }

Step 5: Excel Export with Formatting

Export validated data to Excel:

python

1from openpyxl import Workbook
2from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
3from openpyxl.utils import get_column_letter
4
5def export_to_excel(invoice_data_list, output_path):
6    """
7    Export processed invoices to Excel with formatting.
8    
9    Args:
10        invoice_data_list: List of invoice data dictionaries
11        output_path: Path for output Excel file
12    """
13    wb = Workbook()
14    ws = wb.active
15    ws.title = "Invoice Data"
16    
17    # Define headers
18    headers = [
19        'File Name',
20        'Vendor Name',
21        'Invoice Number',
22        'Invoice Date',
23        'Total Amount',
24        'Status',
25        'Issues/Warnings',
26        'Processing Date'
27    ]
28    
29    # Write headers with formatting
30    header_font = Font(bold=True, color='FFFFFF')
31    header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
32    
33    for col, header in enumerate(headers, start=1):
34        cell = ws.cell(row=1, column=col, value=header)
35        cell.font = header_font
36        cell.fill = header_fill
37        cell.alignment = Alignment(horizontal='center', vertical='center')
38    
39    # Write data rows
40    for row_idx, invoice in enumerate(invoice_data_list, start=2):
41        validation = InvoiceValidator.validate(invoice)
42        
43        # Combine issues and warnings
44        problems = validation['issues'] + validation['warnings']
45        problems_str = '; '.join(problems) if problems else 'None'
46        
47        row_data = [
48            invoice['file_name'],
49            invoice['vendor_name'],
50            invoice['invoice_number'],
51            invoice['invoice_date'],
52            invoice['total_amount'],
53            validation['status'],
54            problems_str,
55            invoice['processing_date']
56        ]
57        
58        for col_idx, value in enumerate(row_data, start=1):
59            cell = ws.cell(row=row_idx, column=col_idx, value=value)
60            
61            # Format amount column as currency
62            if col_idx == 5:  # Total Amount
63                cell.number_format = '$#,##0.00'
64            
65            # Color-code status
66            if col_idx == 6:  # Status
67                if validation['status'] == 'FAILED':
68                    cell.fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
69                elif validation['status'] == 'WARNING':
70                    cell.fill = PatternFill(start_color='FFEB9C', end_color='FFEB9C', fill_type='solid')
71                else:
72                    cell.fill = PatternFill(start_color='C6EFCE', end_color='C6EFCE', fill_type='solid')
73    
74    # Auto-adjust column widths
75    for column in ws.columns:
76        max_length = 0
77        column_letter = get_column_letter(column[0].column)
78        
79        for cell in column:
80            try:
81                if len(str(cell.value)) > max_length:
82                    max_length = len(str(cell.value))
83            except:
84                pass
85        
86        adjusted_width = min(max_length + 2, 50)
87        ws.column_dimensions[column_letter].width = adjusted_width
88    
89    # Add summary row
90    summary_row = len(invoice_data_list) + 3
91    ws.cell(row=summary_row, column=1, value="SUMMARY").font = Font(bold=True, size=12)
92    ws.cell(row=summary_row + 1, column=1, value="Total Invoices:")
93    ws.cell(row=summary_row + 1, column=2, value=len(invoice_data_list))
94    ws.cell(row=summary_row + 2, column=1, value="Total Amount:")
95    ws.cell(row=summary_row + 2, column=2, value=f"=SUM(E2:E{len(invoice_data_list)+1})")
96    ws.cell(row=summary_row + 2, column=2).number_format = '$#,##0.00'
97    
98    # Save workbook
99    wb.save(output_path)
100    print(f"\n✅ Excel report saved: {output_path}")

The Complete Automation Script

Here's the full script that ties everything together:

python

1#!/usr/bin/env python3
2"""
3Invoice Processing Automation
4Author: Alex Rodriguez
5
6Automatically extract data from invoice PDFs using OCR and export to Excel.
7"""
8
9import os
10import shutil
11from datetime import datetime
12import logging
13
14# Configure logging
15logging.basicConfig(
16    level=logging.INFO,
17    format='%(asctime)s - %(levelname)s - %(message)s',
18    handlers=[
19        logging.FileHandler('invoice_processing.log'),
20        logging.StreamHandler()
21    ]
22)
23
24logger = logging.getLogger(__name__)
25
26
27class InvoiceProcessor:
28    """
29    Main class for automated invoice processing workflow.
30    """
31    
32    def __init__(self, input_folder, output_folder, archive_folder):
33        self.input_folder = input_folder
34        self.output_folder = output_folder
35        self.archive_folder = archive_folder
36        
37        # Create folders if they don't exist
38        for folder in [input_folder, output_folder, archive_folder]:
39            os.makedirs(folder, exist_ok=True)
40    
41    def process_all_invoices(self):
42        """
43        Process all PDF invoices in input folder.
44        
45        Returns:
46            List of processed invoice data
47        """
48        logger.info(f"Scanning for invoices in: {self.input_folder}")
49        
50        # Find all PDF files
51        pdf_files = [
52            f for f in os.listdir(self.input_folder)
53            if f.lower().endswith('.pdf')
54        ]
55        
56        if not pdf_files:
57            logger.warning("No PDF files found to process")
58            return []
59        
60        logger.info(f"Found {len(pdf_files)} invoice(s) to process")
61        
62        results = []
63        
64        for pdf_file in pdf_files:
65            pdf_path = os.path.join(self.input_folder, pdf_file)
66            
67            try:
68                # Process invoice
69                data = parse_invoice(pdf_path)
70                
71                if data:
72                    results.append(data)
73                    
74                    # Move to archive
75                    archive_path = os.path.join(self.archive_folder, pdf_file)
76                    shutil.move(pdf_path, archive_path)
77                    logger.info(f"  Archived: {pdf_file}")
78                else:
79                    logger.error(f"  Failed to extract data from: {pdf_file}")
80            
81            except Exception as e:
82                logger.error(f"  Error processing {pdf_file}: {e}")
83        
84        return results
85    
86    def generate_report(self, invoice_data):
87        """
88        Generate Excel report from processed invoices.
89        
90        Args:
91            invoice_data: List of invoice dictionaries
92        """
93        if not invoice_data:
94            logger.warning("No invoice data to export")
95            return
96        
97        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
98        output_path = os.path.join(
99            self.output_folder,
100            f'invoice_report_{timestamp}.xlsx'
101        )
102        
103        export_to_excel(invoice_data, output_path)
104        
105        # Log summary
106        total_amount = sum(inv['total_amount'] for inv in invoice_data)
107        logger.info(f"\n{'='*50}")
108        logger.info(f"PROCESSING COMPLETE")
109        logger.info(f"{'='*50}")
110        logger.info(f"Invoices processed: {len(invoice_data)}")
111        logger.info(f"Total amount: ${total_amount:,.2f}")
112        logger.info(f"Report saved: {output_path}")
113        logger.info(f"{'='*50}\n")
114    
115    def run(self):
116        """Execute complete processing workflow."""
117        logger.info("Starting invoice processing automation...")
118        
119        # Process invoices
120        invoice_data = self.process_all_invoices()
121        
122        # Generate report
123        if invoice_data:
124            self.generate_report(invoice_data)
125        
126        logger.info("Invoice processing automation complete!")
127
128
129def main():
130    """
131    Main entry point for invoice processing script.
132    """
133    
134    # Configure folders
135    INPUT_FOLDER = "./invoices/inbox"
136    OUTPUT_FOLDER = "./invoices/reports"
137    ARCHIVE_FOLDER = "./invoices/processed"
138    
139    # Create processor and run
140    processor = InvoiceProcessor(
141        input_folder=INPUT_FOLDER,
142        output_folder=OUTPUT_FOLDER,
143        archive_folder=ARCHIVE_FOLDER
144    )
145    
146    processor.run()
147
148
149if __name__ == "__main__":
150    main()

How to Use This Script

Create folder structure:

bash

1mkdir -p invoices/inbox
2mkdir -p invoices/reports
3mkdir -p invoices/processed

Place invoice PDFs in invoices/inbox/
Run the script:
bash
```
1python invoice_processor.py
```
Check results:
- Excel report in invoices/reports/
- Processed invoices moved to invoices/processed/
- Log file: invoice_processing.log

Improving OCR Accuracy

Image Preprocessing

Better image quality = better OCR results:

python

1from PIL import ImageEnhance, ImageFilter
2
3def preprocess_image(image):
4    """
5    Enhance image for better OCR accuracy.
6    """
7    # Convert to grayscale
8    image = image.convert('L')
9    
10    # Increase contrast
11    enhancer = ImageEnhance.Contrast(image)
12    image = enhancer.enhance(2.0)
13    
14    # Sharpen
15    image = image.filter(ImageFilter.SHARPEN)
16    
17    return image

Multiple OCR Attempts

Try different OCR configurations:

python

1def ocr_with_fallback(image):
2    """
3    Try multiple OCR configurations for best results.
4    """
5    configs = [
6        '--psm 6',  # Uniform block of text
7        '--psm 4',  # Single column of text
8        '--psm 3',  # Fully automatic page segmentation
9    ]
10    
11    for config in configs:
12        text = pytesseract.image_to_string(image, config=config)
13        if len(text) > 100:  # Reasonable amount of text extracted
14            return text
15    
16    return ""

Taking It Further

Email Integration

Automatically fetch invoices from email:

python

1import imaplib
2import email
3
4def fetch_invoice_emails(email_address, password):
5    """
6    Download invoice attachments from email.
7    """
8    mail = imaplib.IMAP4_SSL('imap.gmail.com')
9    mail.login(email_address, password)
10    mail.select('inbox')
11    
12    # Search for emails with invoice subject
13    _, messages = mail.search(None, '(SUBJECT "invoice")')
14    
15    for num in messages[0].split():
16        _, msg = mail.fetch(num, '(RFC822)')
17        email_body = msg[0][1]
18        email_message = email.message_from_bytes(email_body)
19        
20        # Download PDF attachments
21        for part in email_message.walk():
22            if part.get_content_type() == 'application/pdf':
23                filename = part.get_filename()
24                # Save PDF to inbox folder
25                # ...

Database Integration

Store invoice data in database:

python

1import sqlite3
2
3def save_to_database(invoice_data):
4    """
5    Save invoice data to SQLite database.
6    """
7    conn = sqlite3.connect('invoices.db')
8    cursor = conn.cursor()
9    
10    cursor.execute('''
11        CREATE TABLE IF NOT EXISTS invoices (
12            id INTEGER PRIMARY KEY,
13            file_name TEXT,
14            vendor_name TEXT,
15            invoice_number TEXT UNIQUE,
16            invoice_date TEXT,
17            total_amount REAL,
18            processing_date TEXT
19        )
20    ''')
21    
22    cursor.execute('''
23        INSERT OR REPLACE INTO invoices 
24        VALUES (NULL, ?, ?, ?, ?, ?, ?)
25    ''', (
26        invoice_data['file_name'],
27        invoice_data['vendor_name'],
28        invoice_data['invoice_number'],
29        invoice_data['invoice_date'],
30        invoice_data['total_amount'],
31        invoice_data['processing_date']
32    ))
33    
34    conn.commit()
35    conn.close()

Schedule Automation

Run automatically on schedule:

Windows Task Scheduler or cron (Linux/Mac):

bash

1# Run every day at 9 AM
20 9 * * * /usr/bin/python3 /path/to/invoice_processor.py

Troubleshooting

Issue	Solution
OCR returns empty text	Check Tesseract installation; increase image DPI
Wrong data extracted	Adjust regex patterns for your invoice format
Poor OCR accuracy	Preprocess images; ensure high-quality PDFs
Script crashes on PDF	Add try/except error handling; check PDF isn't corrupted
Amounts incorrect	Review decimal and currency parsing logic

Conclusion

You've built a production-ready invoice processing automation system. What used to take hours of manual data entry now runs automatically, extracting data accurately and consistently.

The real power here is scalability: whether you process 10 invoices or 1,000, the script handles them the same way—fast, accurate, and without human error.

Start with your own invoices, tune the regex patterns to match your invoice formats, and watch hours of work disappear.

Frequently Asked Questions

What invoice formats does this work with? This script works with standard invoice PDFs. You may need to adjust regex patterns for specific vendors or invoice layouts. The OCR approach handles scanned invoices well.

How accurate is the OCR extraction? With clear, high-quality PDFs, accuracy is typically 95%+. Scanned or image-based invoices may have lower accuracy. Image preprocessing improves results significantly.

Can this handle invoices in multiple languages? Yes, but you'll need to install language packs for Tesseract and adjust extraction patterns. Run tesseract --list-langs to see available languages.

How do I handle invoices with multiple pages? The script already processes multi-page PDFs. Each page is converted to an image and processed separately, with text combined for parsing.

What if my invoice format is different? Adjust the regex patterns in the InvoiceParser class to match your specific invoice layout. Test with sample invoices and iterate on patterns.

Related articles: Automate Data Entry Tasks, Web Scraping with Python

Automate Invoice Processing with Python: OCR & Data Extraction

What You'll Learn

The Problem with Manual Invoice Processing

The Solution: Automated OCR Pipeline

Prerequisites

Step 1: Install Required Libraries

Step 2: PDF to Image Conversion

Step 3: Data Extraction with Regex

Step 4: Data Validation

Step 5: Excel Export with Formatting

The Complete Automation Script

How to Use This Script

Improving OCR Accuracy

Image Preprocessing

Multiple OCR Attempts

Taking It Further

Email Integration

Database Integration

Schedule Automation

Troubleshooting

Conclusion

Frequently Asked Questions

Share this article