Automate Invoice Processing with Python: OCR & Data Extraction
Every month, your team receives dozens—maybe hundreds—of invoices via email. Someone has to open each PDF, manually type the vendor name, invoice number, date, and amount into a spreadsheet, then route for approval.
It's tedious. It's error-prone. And it's completely automatable.
Today, we're building a Python script that automatically reads invoices, extracts key data using OCR (Optical Character Recognition), validates the information, and creates a structured Excel report—all while you focus on work that actually requires a human.
What You'll Learn
- Using Tesseract OCR to read text from invoice PDFs
- Extracting specific data (vendor, invoice number, date, amount) with regex
- Validating and cleaning extracted data
- Exporting results to Excel with formatting
- Error handling and logging for production use
The Problem with Manual Invoice Processing
Manual invoice entry creates multiple pain points:
- Time-consuming: 3-5 minutes per invoice × 100 invoices = 8+ hours monthly
- Error-prone: Typos in amounts can cause payment issues
- Inconsistent: Different people format data differently
- Not scalable: More invoices = need more people
- Boring work: Demoralizes your team with repetitive tasks
The Solution: Automated OCR Pipeline
Our Python solution will:
- Monitor a folder for new invoice PDFs
- Convert PDF pages to images
- Use OCR to extract text
- Parse text to find invoice details using patterns
- Validate extracted data
- Export to Excel with formatting
- Move processed invoices to archive
- Log all activities for audit trail
Prerequisites
Before we start, you'll need:
- Python 3.8 or higher
- Tesseract OCR installed on your system
- Basic understanding of Python and regular expressions
Step 1: Install Required Libraries
Install the necessary Python packages:
1pip install pytesseract Pillow pdf2image openpyxl python-dateutilTesseract OCR Installation:
Windows:
1# Download installer from:
2# https://github.com/UB-Mannheim/tesseract/wiki
3
4# Add Tesseract to PATH or specify in scriptMac:
1brew install tesseractLinux:
1sudo apt-get install tesseract-ocrStep 2: PDF to Image Conversion
First, convert PDF invoices to images for OCR processing:
1from pdf2image import convert_from_path
2import pytesseract
3from PIL import Image
4import os
5
6def pdf_to_images(pdf_path):
7 """
8 Convert PDF pages to PIL Image objects.
9
10 Args:
11 pdf_path: Path to PDF file
12
13 Returns:
14 List of PIL Image objects (one per page)
15 """
16 try:
17 images = convert_from_path(
18 pdf_path,
19 dpi=300, # Higher DPI = better OCR accuracy
20 fmt='png'
21 )
22 return images
23 except Exception as e:
24 print(f"Error converting PDF {pdf_path}: {e}")
25 return []
26
27
28def ocr_image(image):
29 """
30 Extract text from image using Tesseract OCR.
31
32 Args:
33 image: PIL Image object
34
35 Returns:
36 Extracted text as string
37 """
38 # Optional: Specify tesseract path if not in PATH
39 # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
40
41 try:
42 text = pytesseract.image_to_string(
43 image,
44 config='--psm 6' # Assume uniform block of text
45 )
46 return text
47 except Exception as e:
48 print(f"OCR error: {e}")
49 return ""
50
51
52def extract_text_from_pdf(pdf_path):
53 """
54 Complete pipeline: PDF -> Images -> OCR -> Text.
55
56 Args:
57 pdf_path: Path to PDF invoice
58
59 Returns:
60 Extracted text from all pages combined
61 """
62 images = pdf_to_images(pdf_path)
63
64 all_text = []
65 for i, image in enumerate(images):
66 print(f" Processing page {i+1}/{len(images)}...")
67 text = ocr_image(image)
68 all_text.append(text)
69
70 return "\n".join(all_text)Step 3: Data Extraction with Regex
Now extract structured data from the OCR text:
1import re
2from datetime import datetime
3from dateutil import parser
4
5class InvoiceParser:
6 """
7 Parse invoice text and extract key fields.
8 """
9
10 def __init__(self, text):
11 self.text = text
12 self.data = {}
13
14 def extract_invoice_number(self):
15 """Extract invoice number using common patterns."""
16
17 patterns = [
18 r'Invoice\s*#?:?\s*([A-Z0-9-]+)',
19 r'Invoice\s+Number:?\s*([A-Z0-9-]+)',
20 r'INV-?\s*([A-Z0-9-]+)',
21 r'Bill\s+No\.?:?\s*([A-Z0-9-]+)',
22 ]
23
24 for pattern in patterns:
25 match = re.search(pattern, self.text, re.IGNORECASE)
26 if match:
27 return match.group(1).strip()
28
29 return "NOT_FOUND"
30
31 def extract_date(self):
32 """Extract invoice date."""
33
34 patterns = [
35 r'Invoice\s+Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
36 r'Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
37 r'Dated?:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',
38 r'(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4})',
39 ]
40
41 for pattern in patterns:
42 match = re.search(pattern, self.text, re.IGNORECASE)
43 if match:
44 date_str = match.group(1)
45 try:
46 # Parse various date formats
47 date_obj = parser.parse(date_str, fuzzy=True)
48 return date_obj.strftime('%Y-%m-%d')
49 except:
50 continue
51
52 return "NOT_FOUND"
53
54 def extract_vendor(self):
55 """Extract vendor/company name from top of invoice."""
56
57 # Usually vendor name is in first few lines
58 lines = self.text.split('\n')[:10]
59
60 # Look for lines with company indicators
61 company_indicators = ['LLC', 'Ltd', 'Inc', 'Corp', 'Limited', 'Corporation']
62
63 for line in lines:
64 line = line.strip()
65 if len(line) > 3: # Skip very short lines
66 for indicator in company_indicators:
67 if indicator in line:
68 # Clean and return
69 return re.sub(r'[^A-Za-z0-9\s&\-.,]', '', line).strip()
70
71 # Fallback: return first substantial line
72 for line in lines:
73 line = line.strip()
74 if len(line) > 5 and not re.match(r'^[\d\s\-/]+$', line):
75 return line[:50] # Limit length
76
77 return "NOT_FOUND"
78
79 def extract_total_amount(self):
80 """Extract total amount due."""
81
82 patterns = [
83 r'Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',
84 r'Amount\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',
85 r'Total\s+Amount:?\s*\$?\s*([\d,]+\.?\d{0,2})',
86 r'Grand\s+Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',
87 r'Balance\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',
88 ]
89
90 # Try each pattern
91 for pattern in patterns:
92 matches = re.findall(pattern, self.text, re.IGNORECASE)
93 if matches:
94 # Return last occurrence (usually the final total)
95 amount_str = matches[-1].replace(',', '')
96 try:
97 return float(amount_str)
98 except:
99 continue
100
101 return 0.0
102
103 def extract_all(self):
104 """Extract all fields and return as dictionary."""
105
106 self.data = {
107 'invoice_number': self.extract_invoice_number(),
108 'invoice_date': self.extract_date(),
109 'vendor_name': self.extract_vendor(),
110 'total_amount': self.extract_total_amount(),
111 }
112
113 return self.data
114
115
116def parse_invoice(pdf_path):
117 """
118 Complete workflow: Read PDF, extract text, parse data.
119
120 Args:
121 pdf_path: Path to invoice PDF
122
123 Returns:
124 Dictionary with extracted invoice data
125 """
126 print(f"Processing: {os.path.basename(pdf_path)}")
127
128 # Extract text via OCR
129 text = extract_text_from_pdf(pdf_path)
130
131 if not text.strip():
132 print(" Warning: No text extracted from PDF")
133 return None
134
135 # Parse extracted text
136 parser = InvoiceParser(text)
137 data = parser.extract_all()
138
139 # Add metadata
140 data['file_name'] = os.path.basename(pdf_path)
141 data['processing_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
142
143 print(f" ✓ Extracted: {data['vendor_name']} | {data['invoice_number']} | ${data['total_amount']}")
144
145 return dataStep 4: Data Validation
Add validation to catch errors early:
1class InvoiceValidator:
2 """
3 Validate extracted invoice data for quality and completeness.
4 """
5
6 @staticmethod
7 def validate(data):
8 """
9 Validate invoice data and return validation results.
10
11 Args:
12 data: Dictionary with invoice data
13
14 Returns:
15 Dictionary with validation status and issues
16 """
17 issues = []
18 warnings = []
19
20 # Check for missing fields
21 if data['invoice_number'] == "NOT_FOUND":
22 issues.append("Invoice number not found")
23
24 if data['invoice_date'] == "NOT_FOUND":
25 issues.append("Invoice date not found")
26
27 if data['vendor_name'] == "NOT_FOUND":
28 warnings.append("Vendor name not found")
29
30 if data['total_amount'] == 0.0:
31 issues.append("Total amount is zero or not found")
32
33 # Check for suspicious amounts
34 if data['total_amount'] > 100000:
35 warnings.append(f"Unusually large amount: ${data['total_amount']:,.2f}")
36
37 # Check date is reasonable
38 if data['invoice_date'] != "NOT_FOUND":
39 try:
40 inv_date = datetime.strptime(data['invoice_date'], '%Y-%m-%d')
41 today = datetime.now()
42
43 # Future date?
44 if inv_date > today:
45 warnings.append("Invoice date is in the future")
46
47 # Very old invoice?
48 days_old = (today - inv_date).days
49 if days_old > 365:
50 warnings.append(f"Invoice is {days_old} days old")
51 except:
52 pass
53
54 # Determine overall status
55 if issues:
56 status = "FAILED"
57 elif warnings:
58 status = "WARNING"
59 else:
60 status = "PASSED"
61
62 return {
63 'status': status,
64 'issues': issues,
65 'warnings': warnings
66 }Step 5: Excel Export with Formatting
Export validated data to Excel:
1from openpyxl import Workbook
2from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
3from openpyxl.utils import get_column_letter
4
5def export_to_excel(invoice_data_list, output_path):
6 """
7 Export processed invoices to Excel with formatting.
8
9 Args:
10 invoice_data_list: List of invoice data dictionaries
11 output_path: Path for output Excel file
12 """
13 wb = Workbook()
14 ws = wb.active
15 ws.title = "Invoice Data"
16
17 # Define headers
18 headers = [
19 'File Name',
20 'Vendor Name',
21 'Invoice Number',
22 'Invoice Date',
23 'Total Amount',
24 'Status',
25 'Issues/Warnings',
26 'Processing Date'
27 ]
28
29 # Write headers with formatting
30 header_font = Font(bold=True, color='FFFFFF')
31 header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
32
33 for col, header in enumerate(headers, start=1):
34 cell = ws.cell(row=1, column=col, value=header)
35 cell.font = header_font
36 cell.fill = header_fill
37 cell.alignment = Alignment(horizontal='center', vertical='center')
38
39 # Write data rows
40 for row_idx, invoice in enumerate(invoice_data_list, start=2):
41 validation = InvoiceValidator.validate(invoice)
42
43 # Combine issues and warnings
44 problems = validation['issues'] + validation['warnings']
45 problems_str = '; '.join(problems) if problems else 'None'
46
47 row_data = [
48 invoice['file_name'],
49 invoice['vendor_name'],
50 invoice['invoice_number'],
51 invoice['invoice_date'],
52 invoice['total_amount'],
53 validation['status'],
54 problems_str,
55 invoice['processing_date']
56 ]
57
58 for col_idx, value in enumerate(row_data, start=1):
59 cell = ws.cell(row=row_idx, column=col_idx, value=value)
60
61 # Format amount column as currency
62 if col_idx == 5: # Total Amount
63 cell.number_format = '$#,##0.00'
64
65 # Color-code status
66 if col_idx == 6: # Status
67 if validation['status'] == 'FAILED':
68 cell.fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
69 elif validation['status'] == 'WARNING':
70 cell.fill = PatternFill(start_color='FFEB9C', end_color='FFEB9C', fill_type='solid')
71 else:
72 cell.fill = PatternFill(start_color='C6EFCE', end_color='C6EFCE', fill_type='solid')
73
74 # Auto-adjust column widths
75 for column in ws.columns:
76 max_length = 0
77 column_letter = get_column_letter(column[0].column)
78
79 for cell in column:
80 try:
81 if len(str(cell.value)) > max_length:
82 max_length = len(str(cell.value))
83 except:
84 pass
85
86 adjusted_width = min(max_length + 2, 50)
87 ws.column_dimensions[column_letter].width = adjusted_width
88
89 # Add summary row
90 summary_row = len(invoice_data_list) + 3
91 ws.cell(row=summary_row, column=1, value="SUMMARY").font = Font(bold=True, size=12)
92 ws.cell(row=summary_row + 1, column=1, value="Total Invoices:")
93 ws.cell(row=summary_row + 1, column=2, value=len(invoice_data_list))
94 ws.cell(row=summary_row + 2, column=1, value="Total Amount:")
95 ws.cell(row=summary_row + 2, column=2, value=f"=SUM(E2:E{len(invoice_data_list)+1})")
96 ws.cell(row=summary_row + 2, column=2).number_format = '$#,##0.00'
97
98 # Save workbook
99 wb.save(output_path)
100 print(f"\n✅ Excel report saved: {output_path}")The Complete Automation Script
Here's the full script that ties everything together:
1#!/usr/bin/env python3
2"""
3Invoice Processing Automation
4Author: Alex Rodriguez
5
6Automatically extract data from invoice PDFs using OCR and export to Excel.
7"""
8
9import os
10import shutil
11from datetime import datetime
12import logging
13
14# Configure logging
15logging.basicConfig(
16 level=logging.INFO,
17 format='%(asctime)s - %(levelname)s - %(message)s',
18 handlers=[
19 logging.FileHandler('invoice_processing.log'),
20 logging.StreamHandler()
21 ]
22)
23
24logger = logging.getLogger(__name__)
25
26
27class InvoiceProcessor:
28 """
29 Main class for automated invoice processing workflow.
30 """
31
32 def __init__(self, input_folder, output_folder, archive_folder):
33 self.input_folder = input_folder
34 self.output_folder = output_folder
35 self.archive_folder = archive_folder
36
37 # Create folders if they don't exist
38 for folder in [input_folder, output_folder, archive_folder]:
39 os.makedirs(folder, exist_ok=True)
40
41 def process_all_invoices(self):
42 """
43 Process all PDF invoices in input folder.
44
45 Returns:
46 List of processed invoice data
47 """
48 logger.info(f"Scanning for invoices in: {self.input_folder}")
49
50 # Find all PDF files
51 pdf_files = [
52 f for f in os.listdir(self.input_folder)
53 if f.lower().endswith('.pdf')
54 ]
55
56 if not pdf_files:
57 logger.warning("No PDF files found to process")
58 return []
59
60 logger.info(f"Found {len(pdf_files)} invoice(s) to process")
61
62 results = []
63
64 for pdf_file in pdf_files:
65 pdf_path = os.path.join(self.input_folder, pdf_file)
66
67 try:
68 # Process invoice
69 data = parse_invoice(pdf_path)
70
71 if data:
72 results.append(data)
73
74 # Move to archive
75 archive_path = os.path.join(self.archive_folder, pdf_file)
76 shutil.move(pdf_path, archive_path)
77 logger.info(f" Archived: {pdf_file}")
78 else:
79 logger.error(f" Failed to extract data from: {pdf_file}")
80
81 except Exception as e:
82 logger.error(f" Error processing {pdf_file}: {e}")
83
84 return results
85
86 def generate_report(self, invoice_data):
87 """
88 Generate Excel report from processed invoices.
89
90 Args:
91 invoice_data: List of invoice dictionaries
92 """
93 if not invoice_data:
94 logger.warning("No invoice data to export")
95 return
96
97 timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
98 output_path = os.path.join(
99 self.output_folder,
100 f'invoice_report_{timestamp}.xlsx'
101 )
102
103 export_to_excel(invoice_data, output_path)
104
105 # Log summary
106 total_amount = sum(inv['total_amount'] for inv in invoice_data)
107 logger.info(f"\n{'='*50}")
108 logger.info(f"PROCESSING COMPLETE")
109 logger.info(f"{'='*50}")
110 logger.info(f"Invoices processed: {len(invoice_data)}")
111 logger.info(f"Total amount: ${total_amount:,.2f}")
112 logger.info(f"Report saved: {output_path}")
113 logger.info(f"{'='*50}\n")
114
115 def run(self):
116 """Execute complete processing workflow."""
117 logger.info("Starting invoice processing automation...")
118
119 # Process invoices
120 invoice_data = self.process_all_invoices()
121
122 # Generate report
123 if invoice_data:
124 self.generate_report(invoice_data)
125
126 logger.info("Invoice processing automation complete!")
127
128
129def main():
130 """
131 Main entry point for invoice processing script.
132 """
133
134 # Configure folders
135 INPUT_FOLDER = "./invoices/inbox"
136 OUTPUT_FOLDER = "./invoices/reports"
137 ARCHIVE_FOLDER = "./invoices/processed"
138
139 # Create processor and run
140 processor = InvoiceProcessor(
141 input_folder=INPUT_FOLDER,
142 output_folder=OUTPUT_FOLDER,
143 archive_folder=ARCHIVE_FOLDER
144 )
145
146 processor.run()
147
148
149if __name__ == "__main__":
150 main()How to Use This Script
-
Create folder structure:
bash1mkdir -p invoices/inbox 2mkdir -p invoices/reports 3mkdir -p invoices/processed -
Place invoice PDFs in
invoices/inbox/ -
Run the script:
bash1python invoice_processor.py -
Check results:
- Excel report in
invoices/reports/ - Processed invoices moved to
invoices/processed/ - Log file:
invoice_processing.log
- Excel report in
Improving OCR Accuracy
Image Preprocessing
Better image quality = better OCR results:
1from PIL import ImageEnhance, ImageFilter
2
3def preprocess_image(image):
4 """
5 Enhance image for better OCR accuracy.
6 """
7 # Convert to grayscale
8 image = image.convert('L')
9
10 # Increase contrast
11 enhancer = ImageEnhance.Contrast(image)
12 image = enhancer.enhance(2.0)
13
14 # Sharpen
15 image = image.filter(ImageFilter.SHARPEN)
16
17 return imageMultiple OCR Attempts
Try different OCR configurations:
1def ocr_with_fallback(image):
2 """
3 Try multiple OCR configurations for best results.
4 """
5 configs = [
6 '--psm 6', # Uniform block of text
7 '--psm 4', # Single column of text
8 '--psm 3', # Fully automatic page segmentation
9 ]
10
11 for config in configs:
12 text = pytesseract.image_to_string(image, config=config)
13 if len(text) > 100: # Reasonable amount of text extracted
14 return text
15
16 return ""Taking It Further
Email Integration
Automatically fetch invoices from email:
1import imaplib
2import email
3
4def fetch_invoice_emails(email_address, password):
5 """
6 Download invoice attachments from email.
7 """
8 mail = imaplib.IMAP4_SSL('imap.gmail.com')
9 mail.login(email_address, password)
10 mail.select('inbox')
11
12 # Search for emails with invoice subject
13 _, messages = mail.search(None, '(SUBJECT "invoice")')
14
15 for num in messages[0].split():
16 _, msg = mail.fetch(num, '(RFC822)')
17 email_body = msg[0][1]
18 email_message = email.message_from_bytes(email_body)
19
20 # Download PDF attachments
21 for part in email_message.walk():
22 if part.get_content_type() == 'application/pdf':
23 filename = part.get_filename()
24 # Save PDF to inbox folder
25 # ...Database Integration
Store invoice data in database:
1import sqlite3
2
3def save_to_database(invoice_data):
4 """
5 Save invoice data to SQLite database.
6 """
7 conn = sqlite3.connect('invoices.db')
8 cursor = conn.cursor()
9
10 cursor.execute('''
11 CREATE TABLE IF NOT EXISTS invoices (
12 id INTEGER PRIMARY KEY,
13 file_name TEXT,
14 vendor_name TEXT,
15 invoice_number TEXT UNIQUE,
16 invoice_date TEXT,
17 total_amount REAL,
18 processing_date TEXT
19 )
20 ''')
21
22 cursor.execute('''
23 INSERT OR REPLACE INTO invoices
24 VALUES (NULL, ?, ?, ?, ?, ?, ?)
25 ''', (
26 invoice_data['file_name'],
27 invoice_data['vendor_name'],
28 invoice_data['invoice_number'],
29 invoice_data['invoice_date'],
30 invoice_data['total_amount'],
31 invoice_data['processing_date']
32 ))
33
34 conn.commit()
35 conn.close()Schedule Automation
Run automatically on schedule:
Windows Task Scheduler or cron (Linux/Mac):
1# Run every day at 9 AM
20 9 * * * /usr/bin/python3 /path/to/invoice_processor.pyTroubleshooting
| Issue | Solution |
|---|---|
| OCR returns empty text | Check Tesseract installation; increase image DPI |
| Wrong data extracted | Adjust regex patterns for your invoice format |
| Poor OCR accuracy | Preprocess images; ensure high-quality PDFs |
| Script crashes on PDF | Add try/except error handling; check PDF isn't corrupted |
| Amounts incorrect | Review decimal and currency parsing logic |
Conclusion
You've built a production-ready invoice processing automation system. What used to take hours of manual data entry now runs automatically, extracting data accurately and consistently.
The real power here is scalability: whether you process 10 invoices or 1,000, the script handles them the same way—fast, accurate, and without human error.
Start with your own invoices, tune the regex patterns to match your invoice formats, and watch hours of work disappear.
Frequently Asked Questions
What invoice formats does this work with? This script works with standard invoice PDFs. You may need to adjust regex patterns for specific vendors or invoice layouts. The OCR approach handles scanned invoices well.
How accurate is the OCR extraction? With clear, high-quality PDFs, accuracy is typically 95%+. Scanned or image-based invoices may have lower accuracy. Image preprocessing improves results significantly.
Can this handle invoices in multiple languages?
Yes, but you'll need to install language packs for Tesseract and adjust extraction patterns. Run tesseract --list-langs to see available languages.
How do I handle invoices with multiple pages? The script already processes multi-page PDFs. Each page is converted to an image and processed separately, with text combined for parsing.
What if my invoice format is different?
Adjust the regex patterns in the InvoiceParser class to match your specific invoice layout. Test with sample invoices and iterate on patterns.
Related articles: Automate Data Entry Tasks, Web Scraping with Python
Sponsored Content
Interested in advertising? Reach automation professionals through our platform.