Python Expense Tracking: Automate Receipt Scanning and Categorization
You're drowning in receipts. Coffee shop receipts from client meetings. Lunch receipts. Gas receipts. Office supply receipts. They're stuffed in your wallet, scattered in your car, photographed on your phone. It's the last day of the month, and you need to submit your expense report by 5 PM.
You spend three hours manually entering each expense into a spreadsheet, trying to decipher faded thermal ink, matching dates to credit card statements, and categorizing everything correctly. By the time you're done, you've missed two important meetings and your eyes hurt.
There has to be a better way.
What if Python could scan your receipt photos, extract the key information automatically, categorize each expense intelligently, and generate your complete expense report in under 60 seconds?
Let's build exactly that system.
What You'll Build
By the end of this tutorial, you'll have a Python application that:
- Scans receipt images from a folder or email
- Extracts data using OCR (Optical Character Recognition)
- Parses key information: merchant name, date, total amount, items purchased
- Categorizes expenses automatically using rule-based logic and AI
- Detects duplicates to prevent double-entry
- Generates reports in Excel, CSV, or PDF format
- Tracks expenses over time with summaries and analytics
Time savings: 3 hours per month β 5 minutes per month
Prerequisites
- Python 3.8 or higher
- Basic Python knowledge (functions, loops, dictionaries)
- Smartphone or scanner to capture receipt images
- Tesseract OCR installed on your system
The Problem with Manual Expense Tracking
Manual expense tracking is painful because:
- Time-consuming: Entering 50+ receipts takes hours
- Error-prone: Typos, wrong categories, misread amounts
- Tedious: Repetitive data entry kills motivation
- Inconsistent: Different formats and categorization each month
- Delayed: You put it off until deadline day
According to a 2025 workplace productivity study, professionals spend an average of 4.7 hours per month on expense tracking and reportingβthat's 56 hours per year, or more than a full work week.
Step 1: Setting Up Your Environment
Install required libraries:
1pip install pytesseract opencv-python pillow pandas openpyxl python-dotenv openai
Install Tesseract OCR:
Windows:
1# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki2# Install, then add to PATH: C:\Program Files\Tesseract-OCR
Mac:
1brew install tesseract
Linux:
1sudo apt-get install tesseract-ocr
Create project structure:
expense_tracker/ βββ main.py βββ ocr_processor.py βββ expense_categorizer.py βββ report_generator.py βββ receipts/ β βββ (your receipt images) βββ processed/ β βββ (moved after processing) βββ output/ β βββ (generated reports) βββ expenses.db
Step 2: Building the OCR Receipt Scanner
Create ocr_processor.py:
1import pytesseract2from PIL import Image3import cv24import numpy as np5import re6from datetime import datetime7from pathlib import Path89class ReceiptOCR:10 def __init__(self, tesseract_path=None):11 if tesseract_path:12 pytesseract.pytesseract.tesseract_cmd = tesseract_path1314 def preprocess_image(self, image_path):15 """Enhance image quality for better OCR results"""16 # Read image17 img = cv2.imread(str(image_path))1819 # Convert to grayscale20 gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)2122 # Apply thresholding to make text clearer23 thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]2425 # Remove noise26 denoised = cv2.fastNlMeansDenoising(thresh, None, 10, 7, 21)2728 # Increase contrast29 contrast = cv2.convertScaleAbs(denoised, alpha=1.5, beta=0)3031 return contrast3233 def extract_text(self, image_path):34 """Extract text from receipt image using OCR"""35 try:36 # Preprocess for better accuracy37 processed_img = self.preprocess_image(image_path)3839 # Perform OCR40 text = pytesseract.image_to_string(41 processed_img,42 config='--psm 6' # Assume uniform block of text43 )4445 return text46 except Exception as e:47 print(f"Error processing {image_path}: {e}")48 return None4950 def parse_receipt(self, text, image_filename):51 """Extract structured data from OCR text"""52 if not text:53 return None5455 receipt_data = {56 'filename': image_filename,57 'raw_text': text,58 'merchant': self._extract_merchant(text),59 'date': self._extract_date(text),60 'total': self._extract_total(text),61 'items': self._extract_items(text),62 'payment_method': self._extract_payment_method(text)63 }6465 return receipt_data6667 def _extract_merchant(self, text):68 """Extract merchant/store name from receipt"""69 lines = text.split('\n')7071 # Usually merchant name is in first few lines72 # Look for common indicators73 for i, line in enumerate(lines[:5]):74 line_clean = line.strip()7576 # Skip very short lines or lines with just numbers77 if len(line_clean) < 3 or line_clean.isdigit():78 continue7980 # Common merchant indicators (customize for your region)81 common_merchants = [82 'STARBUCKS', 'WALMART', 'TARGET', 'AMAZON', 'SHELL',83 'MCDONALD', 'SUBWAY', 'WHOLE FOODS', 'TRADER JOE',84 'OFFICE DEPOT', 'STAPLES', 'COSTCO', 'HOME DEPOT'85 ]8687 line_upper = line_clean.upper()88 for merchant in common_merchants:89 if merchant in line_upper:90 return line_clean9192 # If no match, return first substantial line93 if i == 0 and len(line_clean) > 5:94 return line_clean9596 return "Unknown Merchant"9798 def _extract_date(self, text):99 """Extract transaction date from receipt"""100 # Common date patterns101 date_patterns = [102 r'\d{1,2}/\d{1,2}/\d{2,4}', # MM/DD/YYYY or M/D/YY103 r'\d{1,2}-\d{1,2}-\d{2,4}', # MM-DD-YYYY104 r'\d{4}-\d{1,2}-\d{1,2}', # YYYY-MM-DD105 r'[A-Z][a-z]{2}\s+\d{1,2},?\s+\d{4}', # Jan 15, 2026106 ]107108 for pattern in date_patterns:109 match = re.search(pattern, text)110 if match:111 date_str = match.group()112113 # Try to parse the date114 for fmt in ['%m/%d/%Y', '%m/%d/%y', '%m-%d-%Y', '%Y-%m-%d', '%b %d, %Y']:115 try:116 return datetime.strptime(date_str, fmt).strftime('%Y-%m-%d')117 except ValueError:118 continue119120 # Default to today if no date found121 return datetime.now().strftime('%Y-%m-%d')122123 def _extract_total(self, text):124 """Extract total amount from receipt"""125 # Look for total indicators126 total_keywords = ['TOTAL', 'AMOUNT', 'BALANCE', 'GRAND TOTAL', 'AMOUNT DUE']127128 lines = text.split('\n')129130 for i, line in enumerate(lines):131 line_upper = line.upper()132133 # Check if line contains total keyword134 if any(keyword in line_upper for keyword in total_keywords):135 # Extract amount from this line or next few lines136 search_lines = lines[i:i+3]137138 for search_line in search_lines:139 # Find dollar amounts140 amounts = re.findall(r'\$?\s*(\d+\.\d{2})', search_line)141142 if amounts:143 # Return the largest amount (likely the total)144 return max([float(amt) for amt in amounts])145146 # If no "TOTAL" keyword, look for largest dollar amount at end147 last_lines = lines[-10:]148 all_amounts = []149150 for line in last_lines:151 amounts = re.findall(r'\$?\s*(\d+\.\d{2})', line)152 all_amounts.extend([float(amt) for amt in amounts])153154 if all_amounts:155 return max(all_amounts)156157 return 0.0158159 def _extract_items(self, text):160 """Extract line items from receipt (optional, for detailed tracking)"""161 items = []162 lines = text.split('\n')163164 for line in lines:165 # Look for lines with price pattern166 if re.search(r'\d+\.\d{2}', line):167 # Extract item description and price168 match = re.search(r'(.+?)\s+(\d+\.\d{2})', line)169 if match:170 item_name = match.group(1).strip()171 item_price = float(match.group(2))172173 # Filter out total/subtotal lines174 if len(item_name) > 2 and 'TOTAL' not in item_name.upper():175 items.append({176 'description': item_name,177 'price': item_price178 })179180 return items181182 def _extract_payment_method(self, text):183 """Detect payment method from receipt"""184 text_upper = text.upper()185186 if 'VISA' in text_upper or '**** ****' in text:187 # Extract last 4 digits if present188 card_match = re.search(r'\*+\s*(\d{4})', text)189 if card_match:190 return f"VISA ending in {card_match.group(1)}"191 return "VISA"192 elif 'MASTERCARD' in text_upper or 'MC' in text_upper:193 return "Mastercard"194 elif 'AMEX' in text_upper or 'AMERICAN EXPRESS' in text_upper:195 return "American Express"196 elif 'CASH' in text_upper:197 return "Cash"198 elif 'DEBIT' in text_upper:199 return "Debit Card"200 else:201 return "Unknown"202203 def process_receipt_folder(self, folder_path):204 """Process all receipt images in a folder"""205 folder = Path(folder_path)206 receipts = []207208 # Supported image formats209 image_extensions = ['.jpg', '.jpeg', '.png', '.tiff', '.bmp']210211 for image_file in folder.iterdir():212 if image_file.suffix.lower() in image_extensions:213 print(f"Processing: {image_file.name}")214215 # Extract text216 text = self.extract_text(str(image_file))217218 # Parse into structured data219 receipt_data = self.parse_receipt(text, image_file.name)220221 if receipt_data:222 receipts.append(receipt_data)223 print(f" β Merchant: {receipt_data['merchant']}")224 print(f" β Date: {receipt_data['date']}")225 print(f" β Total: ${receipt_data['total']:.2f}")226 else:227 print(f" β Failed to process")228229 return receipts230231232# Example usage233if __name__ == "__main__":234 processor = ReceiptOCR()235236 # Process all receipts in folder237 receipts = processor.process_receipt_folder('receipts/')238239 # Display results240 for receipt in receipts:241 print(f"\n{receipt['merchant']} - {receipt['date']}: ${receipt['total']:.2f}")
Step 3: Building the Intelligent Categorizer
Create expense_categorizer.py:
1import openai2import os3from dotenv import load_dotenv45load_dotenv()67class ExpenseCategorizer:8 def __init__(self):9 self.categories = {10 'Meals & Entertainment': [11 'restaurant', 'cafe', 'starbucks', 'coffee', 'mcdonald',12 'dining', 'food', 'lunch', 'dinner', 'breakfast'13 ],14 'Transportation': [15 'uber', 'lyft', 'taxi', 'gas', 'shell', 'chevron',16 'parking', 'airline', 'rental car', 'transit'17 ],18 'Office Supplies': [19 'staples', 'office depot', 'amazon', 'supplies',20 'paper', 'toner', 'ink'21 ],22 'Software & Subscriptions': [23 'microsoft', 'adobe', 'google', 'software', 'saas',24 'subscription', 'license'25 ],26 'Travel & Lodging': [27 'hotel', 'motel', 'airbnb', 'marriott', 'hilton',28 'lodging', 'accommodation'29 ],30 'Utilities': [31 'electric', 'internet', 'phone', 'verizon', 'at&t',32 'utility', 'water', 'gas bill'33 ],34 'Professional Services': [35 'consultant', 'lawyer', 'accounting', 'legal',36 'professional fee', 'service'37 ],38 'Other': []39 }4041 # Optional: Use OpenAI for ambiguous cases42 self.use_ai = os.getenv('OPENAI_API_KEY') is not None43 if self.use_ai:44 openai.api_key = os.getenv('OPENAI_API_KEY')4546 def categorize_rule_based(self, merchant, items=None):47 """Categorize expense using rule-based matching"""48 merchant_lower = merchant.lower()4950 # Check merchant name against category keywords51 for category, keywords in self.categories.items():52 if category == 'Other':53 continue5455 for keyword in keywords:56 if keyword in merchant_lower:57 return category, 0.9 # High confidence for keyword match5859 # Check items if available60 if items:61 items_text = ' '.join([item['description'].lower() for item in items])6263 for category, keywords in self.categories.items():64 if category == 'Other':65 continue6667 matches = sum(1 for keyword in keywords if keyword in items_text)68 if matches > 0:69 confidence = min(matches * 0.3, 0.8)70 return category, confidence7172 return 'Other', 0.3 # Low confidence default7374 def categorize_with_ai(self, merchant, amount, items=None):75 """Use AI for intelligent categorization of ambiguous expenses"""76 if not self.use_ai:77 return self.categorize_rule_based(merchant, items)7879 # Build prompt80 context = f"Merchant: {merchant}\nAmount: ${amount:.2f}"8182 if items:83 items_list = '\n'.join([f"- {item['description']}: ${item['price']:.2f}"84 for item in items[:5]]) # Limit to first 5 items85 context += f"\n\nItems purchased:\n{items_list}"8687 prompt = f"""Categorize this business expense into one of these categories:88{', '.join([cat for cat in self.categories.keys() if cat != 'Other'])}8990{context}9192Respond with just the category name and confidence (0-1), formatted as:93Category: [category name]94Confidence: [0.0-1.0]95"""9697 try:98 response = openai.ChatCompletion.create(99 model="gpt-3.5-turbo",100 messages=[101 {"role": "system", "content": "You are an expense categorization expert."},102 {"role": "user", "content": prompt}103 ],104 temperature=0.3,105 max_tokens=50106 )107108 result = response.choices[0].message.content109110 # Parse response111 category = 'Other'112 confidence = 0.5113114 for line in result.split('\n'):115 if 'Category:' in line:116 category = line.split(':', 1)[1].strip()117 elif 'Confidence:' in line:118 try:119 confidence = float(line.split(':', 1)[1].strip())120 except ValueError:121 confidence = 0.5122123 return category, confidence124125 except Exception as e:126 print(f"AI categorization failed: {e}")127 return self.categorize_rule_based(merchant, items)128129 def categorize(self, receipt_data, use_ai=True):130 """Main categorization method"""131 merchant = receipt_data.get('merchant', 'Unknown')132 amount = receipt_data.get('total', 0.0)133 items = receipt_data.get('items', [])134135 if use_ai and self.use_ai:136 category, confidence = self.categorize_with_ai(merchant, amount, items)137 else:138 category, confidence = self.categorize_rule_based(merchant, items)139140 return {141 'category': category,142 'confidence': confidence,143 'needs_review': confidence < 0.6144 }145146147# Example usage148if __name__ == "__main__":149 categorizer = ExpenseCategorizer()150151 test_receipt = {152 'merchant': 'Starbucks',153 'total': 15.47,154 'items': [155 {'description': 'Grande Latte', 'price': 5.25},156 {'description': 'Blueberry Muffin', 'price': 3.95}157 ]158 }159160 result = categorizer.categorize(test_receipt)161 print(f"Category: {result['category']}")162 print(f"Confidence: {result['confidence']:.2f}")163 print(f"Needs Review: {result['needs_review']}")
Step 4: Generating Expense Reports
Create report_generator.py:
1import pandas as pd2from pathlib import Path3from datetime import datetime4import sqlite356class ExpenseReportGenerator:7 def __init__(self, db_path='expenses.db'):8 self.db_path = db_path9 self.conn = sqlite3.connect(db_path)10 self.setup_database()1112 def setup_database(self):13 """Create database schema"""14 cursor = self.conn.cursor()1516 cursor.execute('''17 CREATE TABLE IF NOT EXISTS expenses (18 id INTEGER PRIMARY KEY AUTOINCREMENT,19 date TEXT,20 merchant TEXT,21 amount REAL,22 category TEXT,23 payment_method TEXT,24 confidence REAL,25 needs_review BOOLEAN,26 receipt_filename TEXT,27 items_json TEXT,28 notes TEXT,29 created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP30 )31 ''')3233 self.conn.commit()3435 def save_expense(self, receipt_data, category_data):36 """Save parsed receipt to database"""37 cursor = self.conn.cursor()3839 cursor.execute('''40 INSERT INTO expenses (41 date, merchant, amount, category, payment_method,42 confidence, needs_review, receipt_filename, items_json43 ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)44 ''', (45 receipt_data['date'],46 receipt_data['merchant'],47 receipt_data['total'],48 category_data['category'],49 receipt_data.get('payment_method', 'Unknown'),50 category_data['confidence'],51 category_data['needs_review'],52 receipt_data['filename'],53 str(receipt_data.get('items', []))54 ))5556 self.conn.commit()57 return cursor.lastrowid5859 def get_expenses(self, start_date=None, end_date=None, category=None):60 """Retrieve expenses with optional filters"""61 query = "SELECT * FROM expenses WHERE 1=1"62 params = []6364 if start_date:65 query += " AND date >= ?"66 params.append(start_date)6768 if end_date:69 query += " AND date <= ?"70 params.append(end_date)7172 if category:73 query += " AND category = ?"74 params.append(category)7576 query += " ORDER BY date DESC"7778 return pd.read_sql_query(query, self.conn, params=params)7980 def generate_monthly_report(self, year, month, output_path='output/'):81 """Generate comprehensive monthly expense report"""82 start_date = f"{year}-{month:02d}-01"8384 # Calculate last day of month85 if month == 12:86 end_date = f"{year}-12-31"87 else:88 end_date = f"{year}-{month+1:02d}-01"8990 # Get expenses for the month91 df = self.get_expenses(start_date, end_date)9293 if df.empty:94 print(f"No expenses found for {year}-{month:02d}")95 return None9697 # Create output directory98 Path(output_path).mkdir(exist_ok=True)99100 # Generate Excel report with multiple sheets101 output_file = Path(output_path) / f"Expense_Report_{year}_{month:02d}.xlsx"102103 with pd.ExcelWriter(output_file, engine='openpyxl') as writer:104 # Sheet 1: Detailed expenses105 df_export = df[['date', 'merchant', 'amount', 'category',106 'payment_method', 'needs_review']].copy()107 df_export.columns = ['Date', 'Merchant', 'Amount', 'Category',108 'Payment Method', 'Needs Review']109 df_export.to_excel(writer, sheet_name='Detailed Expenses', index=False)110111 # Sheet 2: Category summary112 category_summary = df.groupby('category')['amount'].agg(['sum', 'count', 'mean'])113 category_summary.columns = ['Total', 'Count', 'Average']114 category_summary = category_summary.sort_values('Total', ascending=False)115 category_summary.to_excel(writer, sheet_name='Category Summary')116117 # Sheet 3: Daily summary118 daily_summary = df.groupby('date')['amount'].sum().reset_index()119 daily_summary.columns = ['Date', 'Total Amount']120 daily_summary.to_excel(writer, sheet_name='Daily Summary', index=False)121122 # Sheet 4: Needs review123 needs_review = df[df['needs_review'] == True]124 if not needs_review.empty:125 needs_review[['date', 'merchant', 'amount', 'category', 'confidence']].to_excel(126 writer, sheet_name='Needs Review', index=False127 )128129 print(f"β Report generated: {output_file}")130131 # Print summary to console132 self._print_summary(df, year, month)133134 return output_file135136 def _print_summary(self, df, year, month):137 """Print expense summary to console"""138 print(f"\n{'='*50}")139 print(f"Expense Report for {datetime(year, month, 1).strftime('%B %Y')}")140 print(f"{'='*50}\n")141142 total = df['amount'].sum()143 count = len(df)144 avg = df['amount'].mean()145146 print(f"Total Expenses: ${total:,.2f}")147 print(f"Number of Transactions: {count}")148 print(f"Average Transaction: ${avg:.2f}")149150 print(f"\n{'Category':<30} {'Amount':>15} {'%':>10}")151 print(f"{'-'*30} {'-'*15} {'-'*10}")152153 category_totals = df.groupby('category')['amount'].sum().sort_values(ascending=False)154155 for category, amount in category_totals.items():156 percentage = (amount / total) * 100157 print(f"{category:<30} ${amount:>14,.2f} {percentage:>9.1f}%")158159 needs_review_count = df['needs_review'].sum()160 if needs_review_count > 0:161 print(f"\nβ οΈ {needs_review_count} expenses need manual review")162163 print(f"\n{'='*50}\n")164165 def close(self):166 """Close database connection"""167 self.conn.close()168169170# Example usage171if __name__ == "__main__":172 generator = ExpenseReportGenerator()173174 # Generate report for current month175 now = datetime.now()176 generator.generate_monthly_report(now.year, now.month)177178 generator.close()
Step 5: Putting It All Together
Create main.py:
1from ocr_processor import ReceiptOCR2from expense_categorizer import ExpenseCategorizer3from report_generator import ExpenseReportGenerator4from pathlib import Path5import shutil6from datetime import datetime78def process_all_receipts(receipts_folder='receipts/',9 processed_folder='processed/'):10 """Main workflow to process all receipts"""1112 print("Starting expense tracking automation...\n")1314 # Initialize components15 ocr = ReceiptOCR()16 categorizer = ExpenseCategorizer()17 report_gen = ExpenseReportGenerator()1819 # Process receipts20 receipts_path = Path(receipts_folder)21 processed_path = Path(processed_folder)22 processed_path.mkdir(exist_ok=True)2324 print(f"Scanning folder: {receipts_path}")25 receipt_files = list(receipts_path.glob('*.jpg')) + \26 list(receipts_path.glob('*.jpeg')) + \27 list(receipts_path.glob('*.png'))2829 if not receipt_files:30 print("No receipt images found!")31 return3233 print(f"Found {len(receipt_files)} receipts to process\n")3435 processed_count = 036 failed_count = 03738 for receipt_file in receipt_files:39 try:40 print(f"Processing: {receipt_file.name}")4142 # Step 1: OCR extraction43 text = ocr.extract_text(str(receipt_file))44 receipt_data = ocr.parse_receipt(text, receipt_file.name)4546 if not receipt_data or receipt_data['total'] == 0:47 print(f" β Could not extract data")48 failed_count += 149 continue5051 # Step 2: Categorization52 category_result = categorizer.categorize(receipt_data)5354 # Step 3: Save to database55 expense_id = report_gen.save_expense(receipt_data, category_result)5657 print(f" β Merchant: {receipt_data['merchant']}")58 print(f" β Amount: ${receipt_data['total']:.2f}")59 print(f" β Category: {category_result['category']}")60 print(f" β Saved as expense #{expense_id}")6162 # Move to processed folder63 shutil.move(str(receipt_file), str(processed_path / receipt_file.name))6465 processed_count += 166 print()6768 except Exception as e:69 print(f" β Error: {e}")70 failed_count += 171 print()7273 print(f"\n{'='*50}")74 print(f"Processing complete!")75 print(f" Successfully processed: {processed_count}")76 print(f" Failed: {failed_count}")77 print(f"{'='*50}\n")7879 # Generate monthly report80 now = datetime.now()81 report_gen.generate_monthly_report(now.year, now.month)8283 report_gen.close()848586if __name__ == "__main__":87 process_all_receipts()
Step 6: Running Your Expense Tracker
Usage:
- Drop receipt photos into
receipts/folder - Run the script:
bash1python main.py
- Review the report in
output/folder - Check flagged items that need manual review
Example output:
Starting expense tracking automation... Scanning folder: receipts/ Found 15 receipts to process Processing: starbucks_20260110.jpg β Merchant: Starbucks β Amount: $15.47 β Category: Meals & Entertainment β Saved as expense #42 Processing: shell_gas_20260109.jpg β Merchant: Shell Gas Station β Amount: $52.00 β Category: Transportation β Saved as expense #43 ... ================================================== Processing complete! Successfully processed: 14 Failed: 1 ================================================== ================================================== Expense Report for January 2026 ================================================== Total Expenses: $1,247.35 Number of Transactions: 14 Average Transaction: $89.10 Category Amount % ------------------------------ --------------- ---------- Meals & Entertainment $425.50 34.1% Transportation $312.00 25.0% Office Supplies $189.75 15.2% Software & Subscriptions $180.00 14.4% Travel & Lodging $140.10 11.2% β Report generated: output/Expense_Report_2026_01.xlsx ==================================================
Advanced Features
Feature 1: Duplicate Detection
Add to ExpenseReportGenerator:
1def check_duplicate(self, merchant, amount, date, tolerance=0.01):2 """Check if expense already exists"""3 query = '''4 SELECT * FROM expenses5 WHERE merchant = ?6 AND ABS(amount - ?) < ?7 AND date = ?8 '''910 cursor = self.conn.cursor()11 cursor.execute(query, (merchant, amount, tolerance, date))1213 return cursor.fetchone() is not None
Feature 2: Email Integration
Process receipts sent to your email:
1import imaplib2import email3from email.header import decode_header45def download_receipt_attachments(email_user, email_pass, folder='receipts/'):6 """Download receipt images from email"""7 mail = imaplib.IMAP4_SSL('imap.gmail.com')8 mail.login(email_user, email_pass)9 mail.select('inbox')1011 # Search for emails with "receipt" in subject12 status, messages = mail.search(None, 'SUBJECT "receipt"')1314 for num in messages[0].split():15 status, msg_data = mail.fetch(num, '(RFC822)')16 email_body = msg_data[0][1]17 email_message = email.message_from_bytes(email_body)1819 # Download attachments20 for part in email_message.walk():21 if part.get_content_maintype() == 'image':22 filename = part.get_filename()23 if filename:24 filepath = Path(folder) / filename25 with open(filepath, 'wb') as f:26 f.write(part.get_payload(decode=True))2728 mail.close()29 mail.logout()
Feature 3: Mileage Tracking
Add GPS-based mileage logging:
1def log_mileage(start_location, end_location, purpose):2 """Calculate mileage and log as expense"""3 # Use Google Maps API or similar to calculate distance4 distance_miles = calculate_distance(start_location, end_location)56 # IRS standard mileage rate (update annually)7 rate_per_mile = 0.67 # 2026 rate89 amount = distance_miles * rate_per_mile1011 expense_data = {12 'date': datetime.now().strftime('%Y-%m-%d'),13 'merchant': f"Mileage: {start_location} to {end_location}",14 'total': amount,15 'category': 'Transportation',16 'payment_method': 'Mileage Reimbursement',17 'notes': f"{distance_miles:.1f} miles @ ${rate_per_mile}/mile. Purpose: {purpose}"18 }1920 return expense_data
Best Practices
1. Receipt Photo Quality
For best OCR results:
- Good lighting: No shadows or glare
- Flat surface: Avoid wrinkles or folds
- Full receipt: Capture entire receipt, especially total line
- Focus: Clear, not blurry
- Contrast: Dark text on light background
2. Consistent Naming
Name receipt files descriptively:
merchant_date.jpg starbucks_20260110.jpg shell_gas_20260109_52dollars.jpg
3. Regular Processing
Process receipts weekly, not monthly:
- Easier to remember context
- Catch errors early
- Less overwhelming
4. Review Flagged Items
Always review expenses with low confidence scores before submitting reports.
5. Backup Your Data
Regularly backup your expenses.db file and processed receipts.
Troubleshooting
OCR Returns Garbage Text
Problem: Tesseract extracts gibberish instead of text.
Solutions:
- Improve image quality (better lighting, flatter receipt)
- Adjust preprocessing parameters (threshold values)
- Try different OCR engines (Google Cloud Vision API, AWS Textract)
Wrong Total Amount
Problem: OCR extracts wrong total (subtotal, tax, etc.)
Solutions:
- Improve total detection logic (look for "TOTAL" keyword more carefully)
- Add manual review for amounts over certain threshold
- Use AI to verify extracted totals
Poor Categorization
Problem: Expenses categorized incorrectly.
Solutions:
- Expand keyword lists for each category
- Use AI categorization for ambiguous cases
- Add manual categorization review step
Frequently Asked Questions
Does this work with digital receipts (PDFs)?
Yes! Add PDF parsing with PyPDF2 or pdfplumber. Extract text directly without OCR for better accuracy.
Can I connect this to QuickBooks or Xero? Yes! Both have APIs. Export expenses from your database and import via their APIs.
What about sales tax tracking?
Extend the OCR parser to extract tax amounts separately. Add a tax field to the database schema.
Is Tesseract OCR accurate enough? For most printed receipts, yes (85-95% accuracy). For thermal receipts or handwritten notes, consider Google Cloud Vision API or AWS Textract (paid services with higher accuracy).
Can I use this for personal expenses too?
Absolutely! Add a personal flag to distinguish business from personal expenses.
How do I handle split payments (part personal, part business)?
Add a split_amount field and create two expense entries for split transactions.
Related articles: Automate PDF Invoice Processing with Python and OCR, Extract Data from PDFs with Python, Automate Data Entry: Eliminate Manual Work
Sponsored Content
Interested in advertising? Reach automation professionals through our platform.