Automate Invoice Processing with Python: OCR & Data Extraction
Every month, your team receives dozens—maybe hundreds—of invoices via email. Someone has to open each PDF, manually type the vendor name, invoice number, date, and amount into a spreadsheet, then route for approval.
It's tedious. It's error-prone. And it's completely automatable.
Today, we're building a Python script that automatically reads invoices, extracts key data using OCR (Optical Character Recognition), validates the information, and creates a structured Excel report—all while you focus on work that actually requires a human.
What You'll Learn
- Using Tesseract OCR to read text from invoice PDFs
- Extracting specific data (vendor, invoice number, date, amount) with regex
- Validating and cleaning extracted data
- Exporting results to Excel with formatting
- Error handling and logging for production use
The Problem with Manual Invoice Processing
Manual invoice entry creates multiple pain points:
- Time-consuming: 3-5 minutes per invoice × 100 invoices = 8+ hours monthly
- Error-prone: Typos in amounts can cause payment issues
- Inconsistent: Different people format data differently
- Not scalable: More invoices = need more people
- Boring work: Demoralizes your team with repetitive tasks
The Solution: Automated OCR Pipeline
Our Python solution will:
- Monitor a folder for new invoice PDFs
- Convert PDF pages to images
- Use OCR to extract text
- Parse text to find invoice details using patterns
- Validate extracted data
- Export to Excel with formatting
- Move processed invoices to archive
- Log all activities for audit trail
Prerequisites
Before we start, you'll need:
- Python 3.8 or higher
- Tesseract OCR installed on your system
- Basic understanding of Python and regular expressions
Step 1: Install Required Libraries
Install the necessary Python packages:
1pip install pytesseract Pillow pdf2image openpyxl python-dateutil
Tesseract OCR Installation:
Windows:
1# Download installer from:2# https://github.com/UB-Mannheim/tesseract/wiki34# Add Tesseract to PATH or specify in script
Mac:
1brew install tesseract
Linux:
1sudo apt-get install tesseract-ocr
Step 2: PDF to Image Conversion
First, convert PDF invoices to images for OCR processing:
1from pdf2image import convert_from_path2import pytesseract3from PIL import Image4import os56def pdf_to_images(pdf_path):7 """8 Convert PDF pages to PIL Image objects.910 Args:11 pdf_path: Path to PDF file1213 Returns:14 List of PIL Image objects (one per page)15 """16 try:17 images = convert_from_path(18 pdf_path,19 dpi=300, # Higher DPI = better OCR accuracy20 fmt='png'21 )22 return images23 except Exception as e:24 print(f"Error converting PDF {pdf_path}: {e}")25 return []262728def ocr_image(image):29 """30 Extract text from image using Tesseract OCR.3132 Args:33 image: PIL Image object3435 Returns:36 Extracted text as string37 """38 # Optional: Specify tesseract path if not in PATH39 # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'4041 try:42 text = pytesseract.image_to_string(43 image,44 config='--psm 6' # Assume uniform block of text45 )46 return text47 except Exception as e:48 print(f"OCR error: {e}")49 return ""505152def extract_text_from_pdf(pdf_path):53 """54 Complete pipeline: PDF -> Images -> OCR -> Text.5556 Args:57 pdf_path: Path to PDF invoice5859 Returns:60 Extracted text from all pages combined61 """62 images = pdf_to_images(pdf_path)6364 all_text = []65 for i, image in enumerate(images):66 print(f" Processing page {i+1}/{len(images)}...")67 text = ocr_image(image)68 all_text.append(text)6970 return "\n".join(all_text)
Step 3: Data Extraction with Regex
Now extract structured data from the OCR text:
1import re2from datetime import datetime3from dateutil import parser45class InvoiceParser:6 """7 Parse invoice text and extract key fields.8 """910 def __init__(self, text):11 self.text = text12 self.data = {}1314 def extract_invoice_number(self):15 """Extract invoice number using common patterns."""1617 patterns = [18 r'Invoice\s*#?:?\s*([A-Z0-9-]+)',19 r'Invoice\s+Number:?\s*([A-Z0-9-]+)',20 r'INV-?\s*([A-Z0-9-]+)',21 r'Bill\s+No\.?:?\s*([A-Z0-9-]+)',22 ]2324 for pattern in patterns:25 match = re.search(pattern, self.text, re.IGNORECASE)26 if match:27 return match.group(1).strip()2829 return "NOT_FOUND"3031 def extract_date(self):32 """Extract invoice date."""3334 patterns = [35 r'Invoice\s+Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',36 r'Date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',37 r'Dated?:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',38 r'(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4})',39 ]4041 for pattern in patterns:42 match = re.search(pattern, self.text, re.IGNORECASE)43 if match:44 date_str = match.group(1)45 try:46 # Parse various date formats47 date_obj = parser.parse(date_str, fuzzy=True)48 return date_obj.strftime('%Y-%m-%d')49 except:50 continue5152 return "NOT_FOUND"5354 def extract_vendor(self):55 """Extract vendor/company name from top of invoice."""5657 # Usually vendor name is in first few lines58 lines = self.text.split('\n')[:10]5960 # Look for lines with company indicators61 company_indicators = ['LLC', 'Ltd', 'Inc', 'Corp', 'Limited', 'Corporation']6263 for line in lines:64 line = line.strip()65 if len(line) > 3: # Skip very short lines66 for indicator in company_indicators:67 if indicator in line:68 # Clean and return69 return re.sub(r'[^A-Za-z0-9\s&\-.,]', '', line).strip()7071 # Fallback: return first substantial line72 for line in lines:73 line = line.strip()74 if len(line) > 5 and not re.match(r'^[\d\s\-/]+$', line):75 return line[:50] # Limit length7677 return "NOT_FOUND"7879 def extract_total_amount(self):80 """Extract total amount due."""8182 patterns = [83 r'Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',84 r'Amount\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',85 r'Total\s+Amount:?\s*\$?\s*([\d,]+\.?\d{0,2})',86 r'Grand\s+Total:?\s*\$?\s*([\d,]+\.?\d{0,2})',87 r'Balance\s+Due:?\s*\$?\s*([\d,]+\.?\d{0,2})',88 ]8990 # Try each pattern91 for pattern in patterns:92 matches = re.findall(pattern, self.text, re.IGNORECASE)93 if matches:94 # Return last occurrence (usually the final total)95 amount_str = matches[-1].replace(',', '')96 try:97 return float(amount_str)98 except:99 continue100101 return 0.0102103 def extract_all(self):104 """Extract all fields and return as dictionary."""105106 self.data = {107 'invoice_number': self.extract_invoice_number(),108 'invoice_date': self.extract_date(),109 'vendor_name': self.extract_vendor(),110 'total_amount': self.extract_total_amount(),111 }112113 return self.data114115116def parse_invoice(pdf_path):117 """118 Complete workflow: Read PDF, extract text, parse data.119120 Args:121 pdf_path: Path to invoice PDF122123 Returns:124 Dictionary with extracted invoice data125 """126 print(f"Processing: {os.path.basename(pdf_path)}")127128 # Extract text via OCR129 text = extract_text_from_pdf(pdf_path)130131 if not text.strip():132 print(" Warning: No text extracted from PDF")133 return None134135 # Parse extracted text136 parser = InvoiceParser(text)137 data = parser.extract_all()138139 # Add metadata140 data['file_name'] = os.path.basename(pdf_path)141 data['processing_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')142143 print(f" ✓ Extracted: {data['vendor_name']} | {data['invoice_number']} | ${data['total_amount']}")144145 return data
Step 4: Data Validation
Add validation to catch errors early:
1class InvoiceValidator:2 """3 Validate extracted invoice data for quality and completeness.4 """56 @staticmethod7 def validate(data):8 """9 Validate invoice data and return validation results.1011 Args:12 data: Dictionary with invoice data1314 Returns:15 Dictionary with validation status and issues16 """17 issues = []18 warnings = []1920 # Check for missing fields21 if data['invoice_number'] == "NOT_FOUND":22 issues.append("Invoice number not found")2324 if data['invoice_date'] == "NOT_FOUND":25 issues.append("Invoice date not found")2627 if data['vendor_name'] == "NOT_FOUND":28 warnings.append("Vendor name not found")2930 if data['total_amount'] == 0.0:31 issues.append("Total amount is zero or not found")3233 # Check for suspicious amounts34 if data['total_amount'] > 100000:35 warnings.append(f"Unusually large amount: ${data['total_amount']:,.2f}")3637 # Check date is reasonable38 if data['invoice_date'] != "NOT_FOUND":39 try:40 inv_date = datetime.strptime(data['invoice_date'], '%Y-%m-%d')41 today = datetime.now()4243 # Future date?44 if inv_date > today:45 warnings.append("Invoice date is in the future")4647 # Very old invoice?48 days_old = (today - inv_date).days49 if days_old > 365:50 warnings.append(f"Invoice is {days_old} days old")51 except:52 pass5354 # Determine overall status55 if issues:56 status = "FAILED"57 elif warnings:58 status = "WARNING"59 else:60 status = "PASSED"6162 return {63 'status': status,64 'issues': issues,65 'warnings': warnings66 }
Step 5: Excel Export with Formatting
Export validated data to Excel:
1from openpyxl import Workbook2from openpyxl.styles import Font, PatternFill, Alignment, Border, Side3from openpyxl.utils import get_column_letter45def export_to_excel(invoice_data_list, output_path):6 """7 Export processed invoices to Excel with formatting.89 Args:10 invoice_data_list: List of invoice data dictionaries11 output_path: Path for output Excel file12 """13 wb = Workbook()14 ws = wb.active15 ws.title = "Invoice Data"1617 # Define headers18 headers = [19 'File Name',20 'Vendor Name',21 'Invoice Number',22 'Invoice Date',23 'Total Amount',24 'Status',25 'Issues/Warnings',26 'Processing Date'27 ]2829 # Write headers with formatting30 header_font = Font(bold=True, color='FFFFFF')31 header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')3233 for col, header in enumerate(headers, start=1):34 cell = ws.cell(row=1, column=col, value=header)35 cell.font = header_font36 cell.fill = header_fill37 cell.alignment = Alignment(horizontal='center', vertical='center')3839 # Write data rows40 for row_idx, invoice in enumerate(invoice_data_list, start=2):41 validation = InvoiceValidator.validate(invoice)4243 # Combine issues and warnings44 problems = validation['issues'] + validation['warnings']45 problems_str = '; '.join(problems) if problems else 'None'4647 row_data = [48 invoice['file_name'],49 invoice['vendor_name'],50 invoice['invoice_number'],51 invoice['invoice_date'],52 invoice['total_amount'],53 validation['status'],54 problems_str,55 invoice['processing_date']56 ]5758 for col_idx, value in enumerate(row_data, start=1):59 cell = ws.cell(row=row_idx, column=col_idx, value=value)6061 # Format amount column as currency62 if col_idx == 5: # Total Amount63 cell.number_format = '$#,##0.00'6465 # Color-code status66 if col_idx == 6: # Status67 if validation['status'] == 'FAILED':68 cell.fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')69 elif validation['status'] == 'WARNING':70 cell.fill = PatternFill(start_color='FFEB9C', end_color='FFEB9C', fill_type='solid')71 else:72 cell.fill = PatternFill(start_color='C6EFCE', end_color='C6EFCE', fill_type='solid')7374 # Auto-adjust column widths75 for column in ws.columns:76 max_length = 077 column_letter = get_column_letter(column[0].column)7879 for cell in column:80 try:81 if len(str(cell.value)) > max_length:82 max_length = len(str(cell.value))83 except:84 pass8586 adjusted_width = min(max_length + 2, 50)87 ws.column_dimensions[column_letter].width = adjusted_width8889 # Add summary row90 summary_row = len(invoice_data_list) + 391 ws.cell(row=summary_row, column=1, value="SUMMARY").font = Font(bold=True, size=12)92 ws.cell(row=summary_row + 1, column=1, value="Total Invoices:")93 ws.cell(row=summary_row + 1, column=2, value=len(invoice_data_list))94 ws.cell(row=summary_row + 2, column=1, value="Total Amount:")95 ws.cell(row=summary_row + 2, column=2, value=f"=SUM(E2:E{len(invoice_data_list)+1})")96 ws.cell(row=summary_row + 2, column=2).number_format = '$#,##0.00'9798 # Save workbook99 wb.save(output_path)100 print(f"\n✅ Excel report saved: {output_path}")
The Complete Automation Script
Here's the full script that ties everything together:
1#!/usr/bin/env python32"""3Invoice Processing Automation4Author: Alex Rodriguez56Automatically extract data from invoice PDFs using OCR and export to Excel.7"""89import os10import shutil11from datetime import datetime12import logging1314# Configure logging15logging.basicConfig(16 level=logging.INFO,17 format='%(asctime)s - %(levelname)s - %(message)s',18 handlers=[19 logging.FileHandler('invoice_processing.log'),20 logging.StreamHandler()21 ]22)2324logger = logging.getLogger(__name__)252627class InvoiceProcessor:28 """29 Main class for automated invoice processing workflow.30 """3132 def __init__(self, input_folder, output_folder, archive_folder):33 self.input_folder = input_folder34 self.output_folder = output_folder35 self.archive_folder = archive_folder3637 # Create folders if they don't exist38 for folder in [input_folder, output_folder, archive_folder]:39 os.makedirs(folder, exist_ok=True)4041 def process_all_invoices(self):42 """43 Process all PDF invoices in input folder.4445 Returns:46 List of processed invoice data47 """48 logger.info(f"Scanning for invoices in: {self.input_folder}")4950 # Find all PDF files51 pdf_files = [52 f for f in os.listdir(self.input_folder)53 if f.lower().endswith('.pdf')54 ]5556 if not pdf_files:57 logger.warning("No PDF files found to process")58 return []5960 logger.info(f"Found {len(pdf_files)} invoice(s) to process")6162 results = []6364 for pdf_file in pdf_files:65 pdf_path = os.path.join(self.input_folder, pdf_file)6667 try:68 # Process invoice69 data = parse_invoice(pdf_path)7071 if data:72 results.append(data)7374 # Move to archive75 archive_path = os.path.join(self.archive_folder, pdf_file)76 shutil.move(pdf_path, archive_path)77 logger.info(f" Archived: {pdf_file}")78 else:79 logger.error(f" Failed to extract data from: {pdf_file}")8081 except Exception as e:82 logger.error(f" Error processing {pdf_file}: {e}")8384 return results8586 def generate_report(self, invoice_data):87 """88 Generate Excel report from processed invoices.8990 Args:91 invoice_data: List of invoice dictionaries92 """93 if not invoice_data:94 logger.warning("No invoice data to export")95 return9697 timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')98 output_path = os.path.join(99 self.output_folder,100 f'invoice_report_{timestamp}.xlsx'101 )102103 export_to_excel(invoice_data, output_path)104105 # Log summary106 total_amount = sum(inv['total_amount'] for inv in invoice_data)107 logger.info(f"\n{'='*50}")108 logger.info(f"PROCESSING COMPLETE")109 logger.info(f"{'='*50}")110 logger.info(f"Invoices processed: {len(invoice_data)}")111 logger.info(f"Total amount: ${total_amount:,.2f}")112 logger.info(f"Report saved: {output_path}")113 logger.info(f"{'='*50}\n")114115 def run(self):116 """Execute complete processing workflow."""117 logger.info("Starting invoice processing automation...")118119 # Process invoices120 invoice_data = self.process_all_invoices()121122 # Generate report123 if invoice_data:124 self.generate_report(invoice_data)125126 logger.info("Invoice processing automation complete!")127128129def main():130 """131 Main entry point for invoice processing script.132 """133134 # Configure folders135 INPUT_FOLDER = "./invoices/inbox"136 OUTPUT_FOLDER = "./invoices/reports"137 ARCHIVE_FOLDER = "./invoices/processed"138139 # Create processor and run140 processor = InvoiceProcessor(141 input_folder=INPUT_FOLDER,142 output_folder=OUTPUT_FOLDER,143 archive_folder=ARCHIVE_FOLDER144 )145146 processor.run()147148149if __name__ == "__main__":150 main()
How to Use This Script
-
Create folder structure:
bash1mkdir -p invoices/inbox2mkdir -p invoices/reports3mkdir -p invoices/processed -
Place invoice PDFs in
invoices/inbox/ -
Run the script:
bash1python invoice_processor.py -
Check results:
- Excel report in
invoices/reports/ - Processed invoices moved to
invoices/processed/ - Log file:
invoice_processing.log
- Excel report in
Improving OCR Accuracy
Image Preprocessing
Better image quality = better OCR results:
1from PIL import ImageEnhance, ImageFilter23def preprocess_image(image):4 """5 Enhance image for better OCR accuracy.6 """7 # Convert to grayscale8 image = image.convert('L')910 # Increase contrast11 enhancer = ImageEnhance.Contrast(image)12 image = enhancer.enhance(2.0)1314 # Sharpen15 image = image.filter(ImageFilter.SHARPEN)1617 return image
Multiple OCR Attempts
Try different OCR configurations:
1def ocr_with_fallback(image):2 """3 Try multiple OCR configurations for best results.4 """5 configs = [6 '--psm 6', # Uniform block of text7 '--psm 4', # Single column of text8 '--psm 3', # Fully automatic page segmentation9 ]1011 for config in configs:12 text = pytesseract.image_to_string(image, config=config)13 if len(text) > 100: # Reasonable amount of text extracted14 return text1516 return ""
Taking It Further
Email Integration
Automatically fetch invoices from email:
1import imaplib2import email34def fetch_invoice_emails(email_address, password):5 """6 Download invoice attachments from email.7 """8 mail = imaplib.IMAP4_SSL('imap.gmail.com')9 mail.login(email_address, password)10 mail.select('inbox')1112 # Search for emails with invoice subject13 _, messages = mail.search(None, '(SUBJECT "invoice")')1415 for num in messages[0].split():16 _, msg = mail.fetch(num, '(RFC822)')17 email_body = msg[0][1]18 email_message = email.message_from_bytes(email_body)1920 # Download PDF attachments21 for part in email_message.walk():22 if part.get_content_type() == 'application/pdf':23 filename = part.get_filename()24 # Save PDF to inbox folder25 # ...
Database Integration
Store invoice data in database:
1import sqlite323def save_to_database(invoice_data):4 """5 Save invoice data to SQLite database.6 """7 conn = sqlite3.connect('invoices.db')8 cursor = conn.cursor()910 cursor.execute('''11 CREATE TABLE IF NOT EXISTS invoices (12 id INTEGER PRIMARY KEY,13 file_name TEXT,14 vendor_name TEXT,15 invoice_number TEXT UNIQUE,16 invoice_date TEXT,17 total_amount REAL,18 processing_date TEXT19 )20 ''')2122 cursor.execute('''23 INSERT OR REPLACE INTO invoices24 VALUES (NULL, ?, ?, ?, ?, ?, ?)25 ''', (26 invoice_data['file_name'],27 invoice_data['vendor_name'],28 invoice_data['invoice_number'],29 invoice_data['invoice_date'],30 invoice_data['total_amount'],31 invoice_data['processing_date']32 ))3334 conn.commit()35 conn.close()
Schedule Automation
Run automatically on schedule:
Windows Task Scheduler or cron (Linux/Mac):
1# Run every day at 9 AM20 9 * * * /usr/bin/python3 /path/to/invoice_processor.py
Troubleshooting
| Issue | Solution |
|---|---|
| OCR returns empty text | Check Tesseract installation; increase image DPI |
| Wrong data extracted | Adjust regex patterns for your invoice format |
| Poor OCR accuracy | Preprocess images; ensure high-quality PDFs |
| Script crashes on PDF | Add try/except error handling; check PDF isn't corrupted |
| Amounts incorrect | Review decimal and currency parsing logic |
Conclusion
You've built a production-ready invoice processing automation system. What used to take hours of manual data entry now runs automatically, extracting data accurately and consistently.
The real power here is scalability: whether you process 10 invoices or 1,000, the script handles them the same way—fast, accurate, and without human error.
Start with your own invoices, tune the regex patterns to match your invoice formats, and watch hours of work disappear.
Frequently Asked Questions
What invoice formats does this work with? This script works with standard invoice PDFs. You may need to adjust regex patterns for specific vendors or invoice layouts. The OCR approach handles scanned invoices well.
How accurate is the OCR extraction? With clear, high-quality PDFs, accuracy is typically 95%+. Scanned or image-based invoices may have lower accuracy. Image preprocessing improves results significantly.
Can this handle invoices in multiple languages?
Yes, but you'll need to install language packs for Tesseract and adjust extraction patterns. Run tesseract --list-langs to see available languages.
How do I handle invoices with multiple pages? The script already processes multi-page PDFs. Each page is converted to an image and processed separately, with text combined for parsing.
What if my invoice format is different?
Adjust the regex patterns in the InvoiceParser class to match your specific invoice layout. Test with sample invoices and iterate on patterns.
Related articles: Automate Data Entry Tasks, Web Scraping with Python
Sponsored Content
Interested in advertising? Reach automation professionals through our platform.
