Automate Data Validation with Python: Complete Pandas Tutorial
Last week, a colleague spent three hours debugging a report that showed impossible values: negative ages, future dates, and products with 250% tax rates. The root cause? No data validation on the incoming CSV files that fed the reporting pipeline.
Once the automated validation pipeline I'm about to show you was in place, similar issues were caught in seconds, before bad data could contaminate downstream reports, dashboards, and databases.
This tutorial walks through building a production-ready data validation system using Python and pandas that can:
- Validate data types, ranges, formats, and business rules
- Generate detailed validation reports
- Handle multiple file formats (CSV, Excel, JSON)
- Run on a schedule or as part of data pipelines
- Alert stakeholders when issues are found
Why Automate Data Validation?
Manual data checking doesn't scale. Here's what happens without automated validation:
The Problem Cascade:
- Bad data enters your system (wrong types, missing values, invalid formats)
- Downstream processes fail or produce incorrect results
- Business decisions are made on flawed data
- Hours or days later, someone discovers the issue
- You spend time debugging, fixing data, and re-running everything
The Solution: Automated validation catches issues at the point of entry, preventing the cascade entirely.
Prerequisites and Setup
You'll need Python 3.8+ with these packages:
```shell
# Install required packages
pip install pandas numpy openpyxl jsonschema python-dateutil
```
Create a new directory for this project:
```shell
mkdir data-validation-pipeline
cd data-validation-pipeline
```
The Data Validation Framework
We'll build a modular validation system with these components:
- Validator Classes: Reusable validation rules
- Validation Engine: Applies rules and collects results
- Report Generator: Creates human-readable validation reports
- Pipeline Orchestrator: Runs validation on multiple files
Step 1: Creating the Base Validator Class
Create validators.py:
```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import pandas as pd


class BaseValidator(ABC):
    """Base class for all validators"""

    def __init__(self, column_name: str):
        self.column_name = column_name
        self.errors = []

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Validate data and return DataFrame with validation results.
        Adds/updates a boolean column '{column_name}_valid'.
        """
        pass

    def init_result_column(self, df: pd.DataFrame) -> str:
        """
        Create the '{column_name}_valid' column if it doesn't exist yet.
        Initializing it only once preserves failures already recorded by
        earlier validators that target the same column.
        """
        result_col = f'{self.column_name}_valid'
        if result_col not in df.columns:
            df[result_col] = True
        return result_col

    def get_error_summary(self) -> Dict[str, Any]:
        """Return summary of validation errors"""
        return {
            'column': self.column_name,
            'validator': self.__class__.__name__,
            'error_count': len(self.errors),
            'errors': self.errors
        }


class DataTypeValidator(BaseValidator):
    """Validates data types and attempts type conversion"""

    def __init__(self, column_name: str, expected_type: str):
        super().__init__(column_name)
        self.expected_type = expected_type

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        """Validate and coerce data types"""
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found in DataFrame")
            return df

        try:
            if self.expected_type == 'numeric':
                # Try to convert to numeric, invalid values become NaN
                converted = pd.to_numeric(df[col], errors='coerce')
                invalid_mask = converted.isna() & df[col].notna()

            elif self.expected_type == 'integer':
                converted = pd.to_numeric(df[col], errors='coerce')
                # Invalid if unparseable, or parseable but not a whole number
                invalid_mask = (
                    (converted.isna() & df[col].notna()) |
                    (converted.notna() & (converted % 1 != 0))
                )

            elif self.expected_type == 'datetime':
                converted = pd.to_datetime(df[col], errors='coerce')
                invalid_mask = converted.isna() & df[col].notna()

            elif self.expected_type == 'string':
                # Strings are generally permissive
                invalid_mask = pd.Series(False, index=df.index)

            else:
                raise ValueError(f"Unknown type: {self.expected_type}")

            # Mark invalid rows
            df.loc[invalid_mask, result_col] = False

            # Log errors
            if invalid_mask.any():
                invalid_values = df.loc[invalid_mask, col].head(10).tolist()
                self.errors.append(
                    f"{invalid_mask.sum()} invalid values (expected {self.expected_type}). "
                    f"Examples: {invalid_values}"
                )

        except Exception as e:
            df[result_col] = False
            self.errors.append(f"Validation error: {str(e)}")

        return df


class RangeValidator(BaseValidator):
    """Validates numeric values are within acceptable ranges"""

    def __init__(self, column_name: str, min_value: float = None,
                 max_value: float = None, allow_null: bool = False):
        super().__init__(column_name)
        self.min_value = min_value
        self.max_value = max_value
        self.allow_null = allow_null

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to numeric
        numeric_col = pd.to_numeric(df[col], errors='coerce')

        # Check for nulls
        if not self.allow_null:
            null_mask = numeric_col.isna()
            if null_mask.any():
                df.loc[null_mask, result_col] = False
                self.errors.append(f"{null_mask.sum()} null values found (not allowed)")

        # Check minimum
        if self.min_value is not None:
            below_min = (numeric_col < self.min_value) & numeric_col.notna()
            if below_min.any():
                df.loc[below_min, result_col] = False
                invalid_values = df.loc[below_min, col].head(5).tolist()
                self.errors.append(
                    f"{below_min.sum()} values below minimum {self.min_value}. "
                    f"Examples: {invalid_values}"
                )

        # Check maximum
        if self.max_value is not None:
            above_max = (numeric_col > self.max_value) & numeric_col.notna()
            if above_max.any():
                df.loc[above_max, result_col] = False
                invalid_values = df.loc[above_max, col].head(5).tolist()
                self.errors.append(
                    f"{above_max.sum()} values above maximum {self.max_value}. "
                    f"Examples: {invalid_values}"
                )

        return df


class PatternValidator(BaseValidator):
    """Validates strings match expected patterns (e.g., email, phone, SKU)"""

    def __init__(self, column_name: str, pattern: str, pattern_name: str = None):
        super().__init__(column_name)
        self.pattern = pattern
        self.pattern_name = pattern_name or pattern

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to string and check pattern
        string_col = df[col].astype(str)
        matches_pattern = string_col.str.match(self.pattern, na=False)

        # Account for actual nulls (not string 'nan')
        is_null = df[col].isna()
        invalid_mask = ~matches_pattern & ~is_null

        if invalid_mask.any():
            df.loc[invalid_mask, result_col] = False
            invalid_values = df.loc[invalid_mask, col].head(5).tolist()
            self.errors.append(
                f"{invalid_mask.sum()} values don't match pattern '{self.pattern_name}'. "
                f"Examples: {invalid_values}"
            )

        return df


class UniquenessValidator(BaseValidator):
    """Validates column values are unique (for IDs, SKUs, etc.)"""

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Find duplicates (keep=False flags every occurrence)
        is_duplicate = df[col].duplicated(keep=False)

        if is_duplicate.any():
            df.loc[is_duplicate, result_col] = False
            duplicate_values = df.loc[is_duplicate, col].unique()[:5].tolist()
            self.errors.append(
                f"{is_duplicate.sum()} duplicate values found. "
                f"Examples: {duplicate_values}"
            )

        return df


class DateRangeValidator(BaseValidator):
    """Validates dates are within acceptable ranges"""

    def __init__(self, column_name: str, min_date: str = None,
                 max_date: str = None, allow_future: bool = True):
        super().__init__(column_name)
        self.min_date = pd.to_datetime(min_date) if min_date else None
        self.max_date = pd.to_datetime(max_date) if max_date else None
        self.allow_future = allow_future

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to datetime
        date_col = pd.to_datetime(df[col], errors='coerce')

        # Check for invalid dates
        invalid_dates = date_col.isna() & df[col].notna()
        if invalid_dates.any():
            df.loc[invalid_dates, result_col] = False
            self.errors.append(f"{invalid_dates.sum()} invalid date formats")

        # Check minimum date
        if self.min_date is not None:
            before_min = (date_col < self.min_date) & date_col.notna()
            if before_min.any():
                df.loc[before_min, result_col] = False
                self.errors.append(
                    f"{before_min.sum()} dates before {self.min_date.date()}"
                )

        # Check maximum date
        if self.max_date is not None:
            after_max = (date_col > self.max_date) & date_col.notna()
            if after_max.any():
                df.loc[after_max, result_col] = False
                self.errors.append(
                    f"{after_max.sum()} dates after {self.max_date.date()}"
                )

        # Check future dates
        if not self.allow_future:
            now = pd.Timestamp.now()
            future_dates = (date_col > now) & date_col.notna()
            if future_dates.any():
                df.loc[future_dates, result_col] = False
                self.errors.append(f"{future_dates.sum()} future dates found")

        return df


class RequiredFieldValidator(BaseValidator):
    """Validates required fields are not null/empty"""

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            df[result_col] = False
            return df

        # Check for nulls and empty strings
        is_missing = df[col].isna() | (df[col].astype(str).str.strip() == '')

        if is_missing.any():
            df.loc[is_missing, result_col] = False
            self.errors.append(f"{is_missing.sum()} missing/empty values")

        return df
```
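All of these validators lean on the same coerce-and-mask idiom: convert the column with `errors='coerce'`, then compare the result against the original to find values that failed conversion while leaving genuine nulls alone. Here is that idiom in isolation (the sample values are made up for illustration):

```python
import pandas as pd

# A column that should be numeric but contains junk and a real null
s = pd.Series(['10', '2.5', 'oops', None], name='price')

# Coerce: unparseable values become NaN; real nulls stay NaN
converted = pd.to_numeric(s, errors='coerce')

# Invalid = failed to convert AND was not null to begin with
invalid_mask = converted.isna() & s.notna()

print(invalid_mask.tolist())  # [False, False, True, False]
```

Only `'oops'` is flagged: the `None` entry is also `NaN` after coercion, but `s.notna()` excludes it, which is exactly how the validators distinguish "bad value" from "missing value".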
Step 2: Building the Validation Engine
Create validation_engine.py:
```python
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

import pandas as pd

from validators import BaseValidator


class ValidationEngine:
    """Orchestrates validation rules and generates reports"""

    def __init__(self, validators: List[BaseValidator]):
        self.validators = validators

    def validate_dataframe(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Run all validators on DataFrame and return results
        """
        validation_start = datetime.now()

        # Reset error lists so repeated calls (e.g. batch runs)
        # don't carry errors over from a previous file
        for validator in self.validators:
            validator.errors = []

        # Keep original DataFrame untouched
        df_validated = df.copy()

        # Track validation columns (deduplicated, order preserved)
        validation_columns = []

        # Run each validator
        for validator in self.validators:
            df_validated = validator.validate(df_validated)
            validation_col = f'{validator.column_name}_valid'
            if (validation_col in df_validated.columns
                    and validation_col not in validation_columns):
                validation_columns.append(validation_col)

        # Create master validity column (all validations must pass)
        if validation_columns:
            df_validated['is_valid'] = df_validated[validation_columns].all(axis=1)
        else:
            df_validated['is_valid'] = True

        # Count results
        total_rows = len(df_validated)
        valid_rows = int(df_validated['is_valid'].sum())
        invalid_rows = total_rows - valid_rows

        # Collect error summaries
        error_summaries = [v.get_error_summary() for v in self.validators]

        validation_duration = (datetime.now() - validation_start).total_seconds()

        return {
            'timestamp': datetime.now().isoformat(),
            'duration_seconds': validation_duration,
            'total_rows': total_rows,
            'valid_rows': valid_rows,
            'invalid_rows': invalid_rows,
            'pass_rate': round(valid_rows / total_rows * 100, 2) if total_rows > 0 else 0,
            'validation_details': error_summaries,
            'validated_dataframe': df_validated
        }

    def validate_file(self, file_path: str, **read_kwargs) -> Dict[str, Any]:
        """
        Load and validate a file (CSV, Excel, JSON)
        """
        # Detect file type and load
        if file_path.endswith('.csv'):
            df = pd.read_csv(file_path, **read_kwargs)
        elif file_path.endswith(('.xlsx', '.xls')):
            df = pd.read_excel(file_path, **read_kwargs)
        elif file_path.endswith('.json'):
            df = pd.read_json(file_path, **read_kwargs)
        else:
            raise ValueError(f"Unsupported file type: {file_path}")

        # Run validation
        results = self.validate_dataframe(df)
        results['file_path'] = file_path
        results['file_name'] = Path(file_path).name

        return results

    def generate_report(self, validation_results: Dict[str, Any],
                        output_format: str = 'text') -> str:
        """
        Generate human-readable validation report
        """
        if output_format == 'text':
            return self._generate_text_report(validation_results)
        elif output_format == 'html':
            return self._generate_html_report(validation_results)
        else:
            raise ValueError(f"Unsupported format: {output_format}")

    def _generate_text_report(self, results: Dict[str, Any]) -> str:
        """Generate plain text report"""
        report = []
        report.append("=" * 70)
        report.append("DATA VALIDATION REPORT")
        report.append("=" * 70)
        report.append(f"Timestamp: {results['timestamp']}")

        if 'file_name' in results:
            report.append(f"File: {results['file_name']}")

        report.append(f"Duration: {results['duration_seconds']:.2f} seconds")
        report.append("")

        report.append("SUMMARY")
        report.append("-" * 70)
        report.append(f"Total Rows: {results['total_rows']:,}")
        report.append(f"Valid Rows: {results['valid_rows']:,}")
        report.append(f"Invalid Rows: {results['invalid_rows']:,}")
        report.append(f"Pass Rate: {results['pass_rate']}%")
        report.append("")

        if results['invalid_rows'] > 0:
            report.append("VALIDATION DETAILS")
            report.append("-" * 70)

            for detail in results['validation_details']:
                if detail['error_count'] > 0:
                    report.append(f"\n{detail['validator']} - Column: {detail['column']}")
                    for error in detail['errors']:
                        report.append(f"  • {error}")
        else:
            report.append("✓ All validation rules passed successfully!")

        report.append("")
        report.append("=" * 70)

        return "\n".join(report)

    def _generate_html_report(self, results: Dict[str, Any]) -> str:
        """Minimal HTML report: the text report wrapped in <pre> tags"""
        text = self._generate_text_report(results)
        return f"<html><body><pre>{text}</pre></body></html>"

    def save_invalid_rows(self, validation_results: Dict[str, Any],
                          output_path: str) -> None:
        """
        Save rows that failed validation to a file for review
        """
        df = validation_results['validated_dataframe']
        invalid_df = df[~df['is_valid']].copy()

        # Remove validation columns for cleaner output
        # (note 'is_valid' itself ends with '_valid', so it's caught too)
        validation_cols = [col for col in invalid_df.columns if col.endswith('_valid')]
        invalid_df = invalid_df.drop(columns=validation_cols)

        if output_path.endswith('.csv'):
            invalid_df.to_csv(output_path, index=False)
        elif output_path.endswith(('.xlsx', '.xls')):
            invalid_df.to_excel(output_path, index=False)
        else:
            raise ValueError(f"Unsupported output format: {output_path}")

        print(f"Saved {len(invalid_df)} invalid rows to {output_path}")
```
Step 3: Creating a Real-World Example
Let's create a validation pipeline for a common business scenario: validating product inventory data.
Create validate_product_data.py:
```python
import pandas as pd

from validators import (
    DataTypeValidator, RangeValidator, PatternValidator,
    UniquenessValidator, RequiredFieldValidator, DateRangeValidator
)
from validation_engine import ValidationEngine


def create_product_validators():
    """
    Define validation rules for product inventory data
    """
    validators = [
        # Product ID must be unique and required
        RequiredFieldValidator('product_id'),
        UniquenessValidator('product_id'),
        PatternValidator('product_id', r'^PRD-\d{6}$', 'Product ID format (PRD-######)'),

        # Product name is required
        RequiredFieldValidator('product_name'),

        # Price validations
        DataTypeValidator('price', 'numeric'),
        RangeValidator('price', min_value=0, max_value=100000),

        # Quantity validations
        DataTypeValidator('quantity', 'integer'),
        RangeValidator('quantity', min_value=0, max_value=1000000),

        # Category is required
        RequiredFieldValidator('category'),

        # SKU format validation
        PatternValidator('sku', r'^[A-Z]{3}-\d{4}$', 'SKU format (ABC-1234)'),

        # Date validations
        DateRangeValidator('date_added', min_date='2020-01-01', allow_future=False),
        DateRangeValidator('expiry_date', allow_future=True),
    ]

    return validators


def main():
    """
    Main validation workflow
    """
    # Create sample data with intentional errors
    # (rows 2 and 4 share a product_id; row 1 is the only clean row)
    sample_data = pd.DataFrame({
        'product_id': ['PRD-000001', 'PRD-000002', 'INVALID', 'PRD-000002', 'PRD-000004'],
        'product_name': ['Widget A', 'Widget B', '', 'Widget D', 'Widget E'],
        'price': [29.99, -5.00, 150.50, 99.99, 'invalid'],
        'quantity': [100, 50, 25.5, 0, 200],
        'category': ['Electronics', 'Electronics', 'Home', None, 'Sports'],
        'sku': ['ELC-1001', 'ELC-1002', 'INVALID', 'HOM-2003', 'SPT-3004'],
        'date_added': ['2024-01-15', '2024-02-20', '2027-01-01', '2023-12-10', '2024-03-01'],
        'expiry_date': ['2026-01-15', '2026-06-20', '2026-12-31', '2025-12-10', '2027-03-01'],
    })

    print("Product Inventory Data Validation")
    print("=" * 70)
    print(f"\nValidating {len(sample_data)} products...\n")

    # Create validators
    validators = create_product_validators()

    # Create validation engine
    engine = ValidationEngine(validators)

    # Run validation
    results = engine.validate_dataframe(sample_data)

    # Generate and print report
    report = engine.generate_report(results)
    print(report)

    # Save invalid rows for review
    if results['invalid_rows'] > 0:
        engine.save_invalid_rows(results, 'invalid_products.csv')
        print("\nInvalid rows saved to 'invalid_products.csv' for review")

    # Optionally save valid rows for further processing
    df_validated = results['validated_dataframe']
    valid_data = df_validated[df_validated['is_valid']].copy()

    # Remove validation columns ('is_valid' also ends with '_valid')
    validation_cols = [col for col in valid_data.columns if col.endswith('_valid')]
    valid_data = valid_data.drop(columns=validation_cols)

    valid_data.to_csv('valid_products.csv', index=False)
    print(f"Valid rows saved to 'valid_products.csv' ({len(valid_data)} products)")


if __name__ == '__main__':
    main()
```
Step 4: Running the Validation
Run the script:
```shell
python validate_product_data.py
```
Expected Output:
```text
Product Inventory Data Validation
======================================================================

Validating 5 products...

======================================================================
DATA VALIDATION REPORT
======================================================================
Timestamp: 2026-02-11T10:30:45.123456
Duration: 0.12 seconds

SUMMARY
----------------------------------------------------------------------
Total Rows: 5
Valid Rows: 1
Invalid Rows: 4
Pass Rate: 20.0%

VALIDATION DETAILS
----------------------------------------------------------------------

UniquenessValidator - Column: product_id
  • 2 duplicate values found. Examples: ['PRD-000002']

PatternValidator - Column: product_id
  • 1 values don't match pattern 'Product ID format (PRD-######)'. Examples: ['INVALID']

RequiredFieldValidator - Column: product_name
  • 1 missing/empty values

DataTypeValidator - Column: price
  • 1 invalid values (expected numeric). Examples: ['invalid']

RangeValidator - Column: price
  • 1 null values found (not allowed)
  • 1 values below minimum 0. Examples: [-5.0]

DataTypeValidator - Column: quantity
  • 1 invalid values (expected integer). Examples: [25.5]

RequiredFieldValidator - Column: category
  • 1 missing/empty values

PatternValidator - Column: sku
  • 1 values don't match pattern 'SKU format (ABC-1234)'. Examples: ['INVALID']

DateRangeValidator - Column: date_added
  • 1 future dates found

======================================================================
Saved 4 invalid rows to invalid_products.csv

Invalid rows saved to 'invalid_products.csv' for review
Valid rows saved to 'valid_products.csv' (1 products)
```
Advanced Validation Patterns
Custom Business Rule Validators
For complex business logic, create custom validators:
```python
class PriceQuantityValidator(BaseValidator):
    """
    Validate business rule: high-value items (price > $1000)
    must have quantity < 100
    """

    def __init__(self, price_col: str = 'price', quantity_col: str = 'quantity'):
        super().__init__(f'{price_col}_{quantity_col}')
        self.price_col = price_col
        self.quantity_col = quantity_col

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        result_col = f'{self.column_name}_valid'
        df[result_col] = True

        # Convert to numeric
        prices = pd.to_numeric(df[self.price_col], errors='coerce')
        quantities = pd.to_numeric(df[self.quantity_col], errors='coerce')

        # Apply business rule
        high_value_items = prices > 1000
        high_quantity = quantities >= 100
        violation = high_value_items & high_quantity

        if violation.any():
            df.loc[violation, result_col] = False
            violating_items = df.loc[violation, [self.price_col, self.quantity_col]].head(5)
            self.errors.append(
                f"{violation.sum()} items violate rule: high-value items (>$1000) "
                f"must have quantity < 100. Examples:\n{violating_items.to_string()}"
            )

        return df
```
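The cross-column mask at the heart of this validator is worth seeing on its own: each condition produces a boolean Series, and `&` combines them row-wise. A standalone sketch with made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [25.0, 1500.0, 2000.0],
    'quantity': [500, 50, 300],
})

# Rule: price > 1000 requires quantity < 100.
# Only rows where BOTH conditions hold are violations.
violation = (
    (pd.to_numeric(df['price'], errors='coerce') > 1000)
    & (pd.to_numeric(df['quantity'], errors='coerce') >= 100)
)

print(violation.tolist())  # [False, False, True]
```

Row 2 is the only violation: row 0 is cheap, and row 1 is expensive but low-stock. To use the class above, simply append `PriceQuantityValidator()` to the list returned by your validator factory.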
Batch Validation for Multiple Files
Create batch_validator.py:
```python
from pathlib import Path
from typing import List

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators


def validate_directory(input_dir: str, output_dir: str, file_pattern: str = '*.csv'):
    """
    Validate all files in a directory
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Find matching files
    files = list(input_path.glob(file_pattern))

    print(f"Found {len(files)} files to validate in {input_dir}\n")

    # Track summary stats
    all_results = []

    for file in files:
        print(f"Validating {file.name}...")

        try:
            # Fresh validators per file so error lists don't carry over
            engine = ValidationEngine(create_product_validators())

            # Validate file
            results = engine.validate_file(str(file))
            all_results.append(results)

            # Generate report
            report = engine.generate_report(results)

            # Save report
            report_file = output_path / f"{file.stem}_validation_report.txt"
            with open(report_file, 'w') as f:
                f.write(report)

            # Save invalid rows if any
            if results['invalid_rows'] > 0:
                invalid_file = output_path / f"{file.stem}_invalid.csv"
                engine.save_invalid_rows(results, str(invalid_file))

            print(f"  ✓ Pass rate: {results['pass_rate']}%")
            print(f"  ✓ Report saved to {report_file}\n")

        except Exception as e:
            print(f"  ✗ Error validating {file.name}: {e}\n")

    # Generate summary report
    generate_summary_report(all_results, output_path / "summary_report.txt")


def generate_summary_report(all_results: List, output_file: Path):
    """Generate summary across all validated files"""
    with open(output_file, 'w') as f:
        f.write("=" * 70 + "\n")
        f.write("BATCH VALIDATION SUMMARY\n")
        f.write("=" * 70 + "\n\n")

        total_files = len(all_results)
        total_rows = sum(r['total_rows'] for r in all_results)
        total_valid = sum(r['valid_rows'] for r in all_results)
        total_invalid = sum(r['invalid_rows'] for r in all_results)
        avg_pass_rate = (sum(r['pass_rate'] for r in all_results) / total_files
                         if total_files > 0 else 0)

        f.write(f"Files Validated: {total_files}\n")
        f.write(f"Total Rows: {total_rows:,}\n")
        f.write(f"Valid Rows: {total_valid:,}\n")
        f.write(f"Invalid Rows: {total_invalid:,}\n")
        f.write(f"Average Pass Rate: {avg_pass_rate:.2f}%\n\n")

        f.write("INDIVIDUAL FILE RESULTS\n")
        f.write("-" * 70 + "\n")

        for result in all_results:
            f.write(f"\n{result['file_name']}:\n")
            f.write(f"  Rows: {result['total_rows']:,} | ")
            f.write(f"Valid: {result['valid_rows']:,} | ")
            f.write(f"Invalid: {result['invalid_rows']:,} | ")
            f.write(f"Pass Rate: {result['pass_rate']}%\n")

    print(f"Summary report saved to {output_file}")


if __name__ == '__main__':
    validate_directory(
        input_dir='./data/incoming',
        output_dir='./data/validation_reports'
    )
```
Scheduling Automated Validation
Option 1: Windows Task Scheduler (PowerShell)
Create schedule_validation.ps1:
```powershell
# Schedule daily validation at 8 AM
$action = New-ScheduledTaskAction -Execute "python" -Argument "C:\path\to\batch_validator.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
$principal = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount
$settings = New-ScheduledTaskSettingsSet -StartWhenAvailable

Register-ScheduledTask -TaskName "DailyDataValidation" -Action $action -Trigger $trigger -Principal $principal -Settings $settings
```
Option 2: Linux Cron Job
```shell
# Add to crontab (crontab -e)
# Run daily at 8 AM
0 8 * * * /usr/bin/python3 /path/to/batch_validator.py >> /var/log/validation.log 2>&1
```
Option 3: Python Schedule Library
Create scheduled_validator.py:
```python
# Requires: pip install schedule
import schedule
import time

from batch_validator import validate_directory


def job():
    """Validation job to run on schedule"""
    print(f"\n--- Starting scheduled validation at {time.strftime('%Y-%m-%d %H:%M:%S')} ---")
    validate_directory(
        input_dir='./data/incoming',
        output_dir='./data/validation_reports'
    )
    print("--- Validation complete ---\n")


# Schedule job every day at 8:00 AM
schedule.every().day.at("08:00").do(job)

# Or every hour
# schedule.every().hour.do(job)

# Or every 15 minutes
# schedule.every(15).minutes.do(job)

print("Validation scheduler started. Press Ctrl+C to exit.")

while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute
```
Email Alerts for Failed Validation
Add email notifications when validation fails:
```python
import os
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from typing import Dict, List


def send_validation_alert(results: Dict, recipients: List[str]):
    """Send email alert if validation fails"""

    if results['pass_rate'] < 100:  # Alert on any failures
        sender = "data-validation@company.com"
        subject = f"⚠️ Data Validation Failed: {results['file_name']}"

        # Create email body
        body = f"""
Data validation has detected issues requiring attention.

File: {results['file_name']}
Total Rows: {results['total_rows']:,}
Invalid Rows: {results['invalid_rows']:,}
Pass Rate: {results['pass_rate']}%

Please review the detailed validation report.

Validation Details:
"""

        for detail in results['validation_details']:
            if detail['error_count'] > 0:
                body += f"\n\n{detail['validator']} - {detail['column']}:"
                for error in detail['errors']:
                    body += f"\n  • {error}"

        # Create message
        msg = MIMEMultipart()
        msg['From'] = sender
        msg['To'] = ', '.join(recipients)
        msg['Subject'] = subject
        msg.attach(MIMEText(body, 'plain'))

        # Send email (keep credentials out of source code)
        try:
            with smtplib.SMTP('smtp.company.com', 587) as server:
                server.starttls()
                server.login(sender, os.environ['SMTP_PASSWORD'])
                server.send_message(msg)
            print(f"Alert email sent to {recipients}")
        except Exception as e:
            print(f"Failed to send email: {e}")
```
Best Practices for Production Use
1. Performance Optimization
For large files:
```python
import pandas as pd

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators


# Use chunking for files too large for memory
def validate_large_file(file_path: str, chunk_size: int = 10000):
    """Validate large files in chunks.

    Note: cross-row checks (e.g. UniquenessValidator) only see one
    chunk at a time, so they can miss duplicates that span chunks.
    """
    validators = create_product_validators()
    engine = ValidationEngine(validators)

    all_results = []
    chunk_num = 0

    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        chunk_num += 1
        print(f"Processing chunk {chunk_num}...")
        results = engine.validate_dataframe(chunk)
        all_results.append(results)

    # Combine results
    # (implementation depends on your needs)
    return all_results
```
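One simple way to combine the per-chunk result dictionaries is to sum the row counts and recompute the pass rate from the totals. A sketch, assuming each dict has the keys returned by `validate_dataframe`:

```python
def combine_chunk_results(all_results):
    """Aggregate per-chunk validation results into one summary."""
    total = sum(r['total_rows'] for r in all_results)
    valid = sum(r['valid_rows'] for r in all_results)
    return {
        'total_rows': total,
        'valid_rows': valid,
        'invalid_rows': total - valid,
        # Recompute from totals rather than averaging per-chunk rates,
        # so unevenly sized chunks are weighted correctly
        'pass_rate': round(valid / total * 100, 2) if total > 0 else 0,
    }


# Example with two made-up chunk summaries
chunks = [
    {'total_rows': 10000, 'valid_rows': 9900},
    {'total_rows': 10000, 'valid_rows': 9700},
]
print(combine_chunk_results(chunks)['pass_rate'])  # 98.0
```

Per-column error details could be merged the same way by concatenating each validator's `errors` lists across chunks.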
2. Logging
Add proper logging:
```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('validation.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
```
3. Configuration Files
Use YAML for validation rules:
```yaml
# validation_config.yaml
validators:
  - type: RequiredField
    column: product_id

  - type: Pattern
    column: product_id
    pattern: '^PRD-\d{6}$'
    pattern_name: 'Product ID format'

  - type: Range
    column: price
    min_value: 0
    max_value: 100000
```
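A config file needs a small loader to turn each entry into a validator instance. Here's a sketch, assuming PyYAML (`pip install pyyaml`); the `type_map` argument would map config type names to the classes from validators.py, e.g. `{'Range': RangeValidator, ...}`:

```python
import yaml


def load_validators(config_path: str, type_map: dict):
    """Build validator instances from a YAML config file.

    type_map maps each config 'type' name to a validator class
    whose first argument is the column name.
    """
    with open(config_path) as f:
        config = yaml.safe_load(f)

    validators = []
    for spec in config['validators']:
        spec = dict(spec)  # copy so we can pop keys
        cls = type_map[spec.pop('type')]
        column = spec.pop('column')
        # Any remaining keys (pattern, min_value, ...) become kwargs
        validators.append(cls(column, **spec))
    return validators
```

This works because each validator's extra settings (`pattern`, `min_value`, and so on) match its keyword arguments, so the leftover config keys can be passed straight through.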
Frequently Asked Questions
How do I validate data from APIs instead of files?
The ValidationEngine works with any pandas DataFrame. Fetch your API data, convert to DataFrame, then validate:
```python
import requests
import pandas as pd

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators

response = requests.get('https://api.example.com/products')
df = pd.DataFrame(response.json())

validators = create_product_validators()
engine = ValidationEngine(validators)
results = engine.validate_dataframe(df)
```
Can I validate data in a database?
Yes, load data from your database into a DataFrame:
```python
import pandas as pd
import sqlalchemy

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators

db_engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM products', db_engine)

# Validate as normal
validation_engine = ValidationEngine(create_product_validators())
results = validation_engine.validate_dataframe(df)
```
How do I handle different validation rules for different file types?
Create separate validator factories:
```python
def create_customer_validators():
    return [...]

def create_order_validators():
    return [...]

# Use appropriate validators based on file type
if 'customer' in filename:
    validators = create_customer_validators()
elif 'order' in filename:
    validators = create_order_validators()
```
What's the performance impact on large datasets?
For files under roughly 100K rows, validation like this typically finishes in a few seconds. For larger datasets, use chunking, or consider declarative libraries like Great Expectations or Pandera that are built for production-scale validation.