Automate Data Validation with Python: Complete Pandas Tutorial
Last week, a colleague spent three hours debugging a report that showed impossible values: negative ages, future dates, and products with 250% tax rates. The root cause? No data validation on the incoming CSV files that fed the reporting pipeline.
Once the automated validation pipeline I'm about to show you was in place, similar issues were caught in seconds, before bad data could contaminate downstream reports, dashboards, and databases.
This tutorial walks through building a production-ready data validation system using Python and pandas that can:
- Validate data types, ranges, formats, and business rules
- Generate detailed validation reports
- Handle multiple file formats (CSV, Excel, JSON)
- Run on a schedule or as part of data pipelines
- Alert stakeholders when issues are found
Why Automate Data Validation?
Manual data checking doesn't scale. Here's what happens without automated validation:
The Problem Cascade:
- Bad data enters your system (wrong types, missing values, invalid formats)
- Downstream processes fail or produce incorrect results
- Business decisions are made on flawed data
- Hours or days later, someone discovers the issue
- You spend time debugging, fixing data, and re-running everything
The Solution: Automated validation catches issues at the point of entry, preventing the cascade entirely.
Prerequisites and Setup
You'll need Python 3.8+ with these packages:
```shell
# Install required packages
pip install pandas numpy openpyxl jsonschema python-dateutil
```
Create a new directory for this project:
```shell
mkdir data-validation-pipeline
cd data-validation-pipeline
```
The Data Validation Framework
We'll build a modular validation system with these components:
- Validator Classes: Reusable validation rules
- Validation Engine: Applies rules and collects results
- Report Generator: Creates human-readable validation reports
- Pipeline Orchestrator: Runs validation on multiple files
Step 1: Creating the Base Validator Class
Create validators.py:
```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import pandas as pd


class BaseValidator(ABC):
    """Base class for all validators"""

    def __init__(self, column_name: str):
        self.column_name = column_name
        self.errors = []

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Validate data and return DataFrame with validation results.
        Adds/updates a boolean column '{column_name}_valid'.
        """
        pass

    def init_result_column(self, df: pd.DataFrame) -> str:
        """
        Create the '{column_name}_valid' column if it doesn't exist yet.
        Initializing it only once preserves failures already recorded by
        earlier validators that target the same column.
        """
        result_col = f'{self.column_name}_valid'
        if result_col not in df.columns:
            df[result_col] = True
        return result_col

    def get_error_summary(self) -> Dict[str, Any]:
        """Return summary of validation errors"""
        return {
            'column': self.column_name,
            'validator': self.__class__.__name__,
            'error_count': len(self.errors),
            'errors': self.errors
        }


class DataTypeValidator(BaseValidator):
    """Validates data types and attempts type conversion"""

    def __init__(self, column_name: str, expected_type: str):
        super().__init__(column_name)
        self.expected_type = expected_type

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        """Validate and coerce data types"""
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found in DataFrame")
            return df

        try:
            if self.expected_type == 'numeric':
                # Try to convert to numeric, invalid values become NaN
                converted = pd.to_numeric(df[col], errors='coerce')
                invalid_mask = converted.isna() & df[col].notna()

            elif self.expected_type == 'integer':
                converted = pd.to_numeric(df[col], errors='coerce')
                # Invalid if unparseable, or parseable but not a whole number
                invalid_mask = (
                    (converted.isna() & df[col].notna()) |
                    (converted.notna() & (converted % 1 != 0))
                )

            elif self.expected_type == 'datetime':
                converted = pd.to_datetime(df[col], errors='coerce')
                invalid_mask = converted.isna() & df[col].notna()

            elif self.expected_type == 'string':
                # Strings are generally permissive
                invalid_mask = pd.Series(False, index=df.index)

            else:
                raise ValueError(f"Unknown type: {self.expected_type}")

            # Mark invalid rows
            df.loc[invalid_mask, result_col] = False

            # Log errors
            if invalid_mask.any():
                invalid_values = df.loc[invalid_mask, col].head(10).tolist()
                self.errors.append(
                    f"{invalid_mask.sum()} invalid values (expected {self.expected_type}). "
                    f"Examples: {invalid_values}"
                )

        except Exception as e:
            df[result_col] = False
            self.errors.append(f"Validation error: {str(e)}")

        return df


class RangeValidator(BaseValidator):
    """Validates numeric values are within acceptable ranges"""

    def __init__(self, column_name: str, min_value: float = None,
                 max_value: float = None, allow_null: bool = False):
        super().__init__(column_name)
        self.min_value = min_value
        self.max_value = max_value
        self.allow_null = allow_null

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to numeric
        numeric_col = pd.to_numeric(df[col], errors='coerce')

        # Check for nulls
        if not self.allow_null:
            null_mask = numeric_col.isna()
            if null_mask.any():
                df.loc[null_mask, result_col] = False
                self.errors.append(f"{null_mask.sum()} null values found (not allowed)")

        # Check minimum
        if self.min_value is not None:
            below_min = (numeric_col < self.min_value) & numeric_col.notna()
            if below_min.any():
                df.loc[below_min, result_col] = False
                invalid_values = df.loc[below_min, col].head(5).tolist()
                self.errors.append(
                    f"{below_min.sum()} values below minimum {self.min_value}. "
                    f"Examples: {invalid_values}"
                )

        # Check maximum
        if self.max_value is not None:
            above_max = (numeric_col > self.max_value) & numeric_col.notna()
            if above_max.any():
                df.loc[above_max, result_col] = False
                invalid_values = df.loc[above_max, col].head(5).tolist()
                self.errors.append(
                    f"{above_max.sum()} values above maximum {self.max_value}. "
                    f"Examples: {invalid_values}"
                )

        return df


class PatternValidator(BaseValidator):
    """Validates strings match expected patterns (e.g., email, phone, SKU)"""

    def __init__(self, column_name: str, pattern: str, pattern_name: str = None):
        super().__init__(column_name)
        self.pattern = pattern
        self.pattern_name = pattern_name or pattern

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to string and check pattern
        string_col = df[col].astype(str)
        matches_pattern = string_col.str.match(self.pattern, na=False)

        # Account for actual nulls (not string 'nan')
        is_null = df[col].isna()
        invalid_mask = ~matches_pattern & ~is_null

        if invalid_mask.any():
            df.loc[invalid_mask, result_col] = False
            invalid_values = df.loc[invalid_mask, col].head(5).tolist()
            self.errors.append(
                f"{invalid_mask.sum()} values don't match pattern '{self.pattern_name}'. "
                f"Examples: {invalid_values}"
            )

        return df


class UniquenessValidator(BaseValidator):
    """Validates column values are unique (for IDs, SKUs, etc.)"""

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Find duplicates (keep=False flags every occurrence)
        is_duplicate = df[col].duplicated(keep=False)

        if is_duplicate.any():
            df.loc[is_duplicate, result_col] = False
            duplicate_values = df.loc[is_duplicate, col].unique()[:5].tolist()
            self.errors.append(
                f"{is_duplicate.sum()} duplicate values found. "
                f"Examples: {duplicate_values}"
            )

        return df


class DateRangeValidator(BaseValidator):
    """Validates dates are within acceptable ranges"""

    def __init__(self, column_name: str, min_date: str = None,
                 max_date: str = None, allow_future: bool = True):
        super().__init__(column_name)
        self.min_date = pd.to_datetime(min_date) if min_date else None
        self.max_date = pd.to_datetime(max_date) if max_date else None
        self.allow_future = allow_future

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            return df

        # Convert to datetime
        date_col = pd.to_datetime(df[col], errors='coerce')

        # Check for invalid dates
        invalid_dates = date_col.isna() & df[col].notna()
        if invalid_dates.any():
            df.loc[invalid_dates, result_col] = False
            self.errors.append(f"{invalid_dates.sum()} invalid date formats")

        # Check minimum date
        if self.min_date is not None:
            before_min = (date_col < self.min_date) & date_col.notna()
            if before_min.any():
                df.loc[before_min, result_col] = False
                self.errors.append(
                    f"{before_min.sum()} dates before {self.min_date.date()}"
                )

        # Check maximum date
        if self.max_date is not None:
            after_max = (date_col > self.max_date) & date_col.notna()
            if after_max.any():
                df.loc[after_max, result_col] = False
                self.errors.append(
                    f"{after_max.sum()} dates after {self.max_date.date()}"
                )

        # Check future dates
        if not self.allow_future:
            now = pd.Timestamp.now()
            future_dates = (date_col > now) & date_col.notna()
            if future_dates.any():
                df.loc[future_dates, result_col] = False
                self.errors.append(f"{future_dates.sum()} future dates found")

        return df


class RequiredFieldValidator(BaseValidator):
    """Validates required fields are not null/empty"""

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        col = self.column_name
        result_col = self.init_result_column(df)

        if col not in df.columns:
            self.errors.append(f"Column '{col}' not found")
            df[result_col] = False
            return df

        # Check for nulls and empty strings
        is_missing = df[col].isna() | (df[col].astype(str).str.strip() == '')

        if is_missing.any():
            df.loc[is_missing, result_col] = False
            self.errors.append(f"{is_missing.sum()} missing/empty values")

        return df
```
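All of these validators lean on the same coerce-and-mask idiom: convert the column with `errors='coerce'`, then compare the result against the original to find values that failed conversion while leaving genuine nulls alone. Here is that idiom in isolation (the sample values are made up for illustration):

```python
import pandas as pd

# A column that should be numeric but contains junk and a real null
s = pd.Series(['10', '2.5', 'oops', None], name='price')

# Coerce: unparseable values become NaN; real nulls stay NaN
converted = pd.to_numeric(s, errors='coerce')

# Invalid = failed to convert AND was not null to begin with
invalid_mask = converted.isna() & s.notna()

print(invalid_mask.tolist())  # [False, False, True, False]
```

Only `'oops'` is flagged: the `None` entry is also `NaN` after coercion, but `s.notna()` excludes it, which is exactly how the validators distinguish "bad value" from "missing value".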
Step 2: Building the Validation Engine
Create validation_engine.py:
```python
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any

import pandas as pd

from validators import BaseValidator


class ValidationEngine:
    """Orchestrates validation rules and generates reports"""

    def __init__(self, validators: List[BaseValidator]):
        self.validators = validators

    def validate_dataframe(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Run all validators on DataFrame and return results
        """
        validation_start = datetime.now()

        # Reset error lists so repeated calls (e.g. batch runs)
        # don't carry errors over from a previous file
        for validator in self.validators:
            validator.errors = []

        # Keep original DataFrame untouched
        df_validated = df.copy()

        # Track validation columns (deduplicated, order preserved)
        validation_columns = []

        # Run each validator
        for validator in self.validators:
            df_validated = validator.validate(df_validated)
            validation_col = f'{validator.column_name}_valid'
            if (validation_col in df_validated.columns
                    and validation_col not in validation_columns):
                validation_columns.append(validation_col)

        # Create master validity column (all validations must pass)
        if validation_columns:
            df_validated['is_valid'] = df_validated[validation_columns].all(axis=1)
        else:
            df_validated['is_valid'] = True

        # Count results
        total_rows = len(df_validated)
        valid_rows = int(df_validated['is_valid'].sum())
        invalid_rows = total_rows - valid_rows

        # Collect error summaries
        error_summaries = [v.get_error_summary() for v in self.validators]

        validation_duration = (datetime.now() - validation_start).total_seconds()

        return {
            'timestamp': datetime.now().isoformat(),
            'duration_seconds': validation_duration,
            'total_rows': total_rows,
            'valid_rows': valid_rows,
            'invalid_rows': invalid_rows,
            'pass_rate': round(valid_rows / total_rows * 100, 2) if total_rows > 0 else 0,
            'validation_details': error_summaries,
            'validated_dataframe': df_validated
        }

    def validate_file(self, file_path: str, **read_kwargs) -> Dict[str, Any]:
        """
        Load and validate a file (CSV, Excel, JSON)
        """
        # Detect file type and load
        if file_path.endswith('.csv'):
            df = pd.read_csv(file_path, **read_kwargs)
        elif file_path.endswith(('.xlsx', '.xls')):
            df = pd.read_excel(file_path, **read_kwargs)
        elif file_path.endswith('.json'):
            df = pd.read_json(file_path, **read_kwargs)
        else:
            raise ValueError(f"Unsupported file type: {file_path}")

        # Run validation
        results = self.validate_dataframe(df)
        results['file_path'] = file_path
        results['file_name'] = Path(file_path).name

        return results

    def generate_report(self, validation_results: Dict[str, Any],
                        output_format: str = 'text') -> str:
        """
        Generate human-readable validation report
        """
        if output_format == 'text':
            return self._generate_text_report(validation_results)
        elif output_format == 'html':
            return self._generate_html_report(validation_results)
        else:
            raise ValueError(f"Unsupported format: {output_format}")

    def _generate_text_report(self, results: Dict[str, Any]) -> str:
        """Generate plain text report"""
        report = []
        report.append("=" * 70)
        report.append("DATA VALIDATION REPORT")
        report.append("=" * 70)
        report.append(f"Timestamp: {results['timestamp']}")

        if 'file_name' in results:
            report.append(f"File: {results['file_name']}")

        report.append(f"Duration: {results['duration_seconds']:.2f} seconds")
        report.append("")

        report.append("SUMMARY")
        report.append("-" * 70)
        report.append(f"Total Rows: {results['total_rows']:,}")
        report.append(f"Valid Rows: {results['valid_rows']:,}")
        report.append(f"Invalid Rows: {results['invalid_rows']:,}")
        report.append(f"Pass Rate: {results['pass_rate']}%")
        report.append("")

        if results['invalid_rows'] > 0:
            report.append("VALIDATION DETAILS")
            report.append("-" * 70)

            for detail in results['validation_details']:
                if detail['error_count'] > 0:
                    report.append(f"\n{detail['validator']} - Column: {detail['column']}")
                    for error in detail['errors']:
                        report.append(f"  • {error}")
        else:
            report.append("✓ All validation rules passed successfully!")

        report.append("")
        report.append("=" * 70)

        return "\n".join(report)

    def _generate_html_report(self, results: Dict[str, Any]) -> str:
        """Minimal HTML report: the text report wrapped in <pre> tags"""
        text = self._generate_text_report(results)
        return f"<html><body><pre>{text}</pre></body></html>"

    def save_invalid_rows(self, validation_results: Dict[str, Any],
                          output_path: str) -> None:
        """
        Save rows that failed validation to a file for review
        """
        df = validation_results['validated_dataframe']
        invalid_df = df[~df['is_valid']].copy()

        # Remove validation columns for cleaner output
        # (note 'is_valid' itself ends with '_valid', so it's caught too)
        validation_cols = [col for col in invalid_df.columns if col.endswith('_valid')]
        invalid_df = invalid_df.drop(columns=validation_cols)

        if output_path.endswith('.csv'):
            invalid_df.to_csv(output_path, index=False)
        elif output_path.endswith(('.xlsx', '.xls')):
            invalid_df.to_excel(output_path, index=False)
        else:
            raise ValueError(f"Unsupported output format: {output_path}")

        print(f"Saved {len(invalid_df)} invalid rows to {output_path}")
```
Step 3: Creating a Real-World Example
Let's create a validation pipeline for a common business scenario: validating product inventory data.
Create validate_product_data.py:
```python
import pandas as pd

from validators import (
    DataTypeValidator, RangeValidator, PatternValidator,
    UniquenessValidator, RequiredFieldValidator, DateRangeValidator
)
from validation_engine import ValidationEngine


def create_product_validators():
    """
    Define validation rules for product inventory data
    """
    validators = [
        # Product ID must be unique and required
        RequiredFieldValidator('product_id'),
        UniquenessValidator('product_id'),
        PatternValidator('product_id', r'^PRD-\d{6}$', 'Product ID format (PRD-######)'),

        # Product name is required
        RequiredFieldValidator('product_name'),

        # Price validations
        DataTypeValidator('price', 'numeric'),
        RangeValidator('price', min_value=0, max_value=100000),

        # Quantity validations
        DataTypeValidator('quantity', 'integer'),
        RangeValidator('quantity', min_value=0, max_value=1000000),

        # Category is required
        RequiredFieldValidator('category'),

        # SKU format validation
        PatternValidator('sku', r'^[A-Z]{3}-\d{4}$', 'SKU format (ABC-1234)'),

        # Date validations
        DateRangeValidator('date_added', min_date='2020-01-01', allow_future=False),
        DateRangeValidator('expiry_date', allow_future=True),
    ]

    return validators


def main():
    """
    Main validation workflow
    """
    # Create sample data with intentional errors
    # (rows 2 and 4 share a product_id; row 1 is the only clean row)
    sample_data = pd.DataFrame({
        'product_id': ['PRD-000001', 'PRD-000002', 'INVALID', 'PRD-000002', 'PRD-000004'],
        'product_name': ['Widget A', 'Widget B', '', 'Widget D', 'Widget E'],
        'price': [29.99, -5.00, 150.50, 99.99, 'invalid'],
        'quantity': [100, 50, 25.5, 0, 200],
        'category': ['Electronics', 'Electronics', 'Home', None, 'Sports'],
        'sku': ['ELC-1001', 'ELC-1002', 'INVALID', 'HOM-2003', 'SPT-3004'],
        'date_added': ['2024-01-15', '2024-02-20', '2027-01-01', '2023-12-10', '2024-03-01'],
        'expiry_date': ['2026-01-15', '2026-06-20', '2026-12-31', '2025-12-10', '2027-03-01'],
    })

    print("Product Inventory Data Validation")
    print("=" * 70)
    print(f"\nValidating {len(sample_data)} products...\n")

    # Create validators
    validators = create_product_validators()

    # Create validation engine
    engine = ValidationEngine(validators)

    # Run validation
    results = engine.validate_dataframe(sample_data)

    # Generate and print report
    report = engine.generate_report(results)
    print(report)

    # Save invalid rows for review
    if results['invalid_rows'] > 0:
        engine.save_invalid_rows(results, 'invalid_products.csv')
        print("\nInvalid rows saved to 'invalid_products.csv' for review")

    # Optionally save valid rows for further processing
    df_validated = results['validated_dataframe']
    valid_data = df_validated[df_validated['is_valid']].copy()

    # Remove validation columns ('is_valid' also ends with '_valid')
    validation_cols = [col for col in valid_data.columns if col.endswith('_valid')]
    valid_data = valid_data.drop(columns=validation_cols)

    valid_data.to_csv('valid_products.csv', index=False)
    print(f"Valid rows saved to 'valid_products.csv' ({len(valid_data)} products)")


if __name__ == '__main__':
    main()
```
Step 4: Running the Validation
Run the script:
```shell
python validate_product_data.py
```
Expected Output:
```text
Product Inventory Data Validation
======================================================================

Validating 5 products...

======================================================================
DATA VALIDATION REPORT
======================================================================
Timestamp: 2026-02-11T10:30:45.123456
Duration: 0.12 seconds

SUMMARY
----------------------------------------------------------------------
Total Rows: 5
Valid Rows: 1
Invalid Rows: 4
Pass Rate: 20.0%

VALIDATION DETAILS
----------------------------------------------------------------------

UniquenessValidator - Column: product_id
  • 2 duplicate values found. Examples: ['PRD-000002']

PatternValidator - Column: product_id
  • 1 values don't match pattern 'Product ID format (PRD-######)'. Examples: ['INVALID']

RequiredFieldValidator - Column: product_name
  • 1 missing/empty values

DataTypeValidator - Column: price
  • 1 invalid values (expected numeric). Examples: ['invalid']

RangeValidator - Column: price
  • 1 null values found (not allowed)
  • 1 values below minimum 0. Examples: [-5.0]

DataTypeValidator - Column: quantity
  • 1 invalid values (expected integer). Examples: [25.5]

RequiredFieldValidator - Column: category
  • 1 missing/empty values

PatternValidator - Column: sku
  • 1 values don't match pattern 'SKU format (ABC-1234)'. Examples: ['INVALID']

DateRangeValidator - Column: date_added
  • 1 future dates found

======================================================================
Saved 4 invalid rows to invalid_products.csv

Invalid rows saved to 'invalid_products.csv' for review
Valid rows saved to 'valid_products.csv' (1 products)
```
Advanced Validation Patterns
Custom Business Rule Validators
For complex business logic, create custom validators:
```python
class PriceQuantityValidator(BaseValidator):
    """
    Validate business rule: high-value items (price > $1000)
    must have quantity < 100
    """

    def __init__(self, price_col: str = 'price', quantity_col: str = 'quantity'):
        super().__init__(f'{price_col}_{quantity_col}')
        self.price_col = price_col
        self.quantity_col = quantity_col

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        result_col = f'{self.column_name}_valid'
        df[result_col] = True

        # Convert to numeric
        prices = pd.to_numeric(df[self.price_col], errors='coerce')
        quantities = pd.to_numeric(df[self.quantity_col], errors='coerce')

        # Apply business rule
        high_value_items = prices > 1000
        high_quantity = quantities >= 100
        violation = high_value_items & high_quantity

        if violation.any():
            df.loc[violation, result_col] = False
            violating_items = df.loc[violation, [self.price_col, self.quantity_col]].head(5)
            self.errors.append(
                f"{violation.sum()} items violate rule: high-value items (>$1000) "
                f"must have quantity < 100. Examples:\n{violating_items.to_string()}"
            )

        return df
```
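The cross-column mask at the heart of this validator is worth seeing on its own: each condition produces a boolean Series, and `&` combines them row-wise. A standalone sketch with made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [25.0, 1500.0, 2000.0],
    'quantity': [500, 50, 300],
})

# Rule: price > 1000 requires quantity < 100.
# Only rows where BOTH conditions hold are violations.
violation = (
    (pd.to_numeric(df['price'], errors='coerce') > 1000)
    & (pd.to_numeric(df['quantity'], errors='coerce') >= 100)
)

print(violation.tolist())  # [False, False, True]
```

Row 2 is the only violation: row 0 is cheap, and row 1 is expensive but low-stock. To use the class above, simply append `PriceQuantityValidator()` to the list returned by your validator factory.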
Batch Validation for Multiple Files
Create batch_validator.py:
```python
from pathlib import Path
from typing import List

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators


def validate_directory(input_dir: str, output_dir: str, file_pattern: str = '*.csv'):
    """
    Validate all files in a directory
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Find matching files
    files = list(input_path.glob(file_pattern))

    print(f"Found {len(files)} files to validate in {input_dir}\n")

    # Track summary stats
    all_results = []

    for file in files:
        print(f"Validating {file.name}...")

        try:
            # Fresh validators per file so error lists don't carry over
            engine = ValidationEngine(create_product_validators())

            # Validate file
            results = engine.validate_file(str(file))
            all_results.append(results)

            # Generate report
            report = engine.generate_report(results)

            # Save report
            report_file = output_path / f"{file.stem}_validation_report.txt"
            with open(report_file, 'w') as f:
                f.write(report)

            # Save invalid rows if any
            if results['invalid_rows'] > 0:
                invalid_file = output_path / f"{file.stem}_invalid.csv"
                engine.save_invalid_rows(results, str(invalid_file))

            print(f"  ✓ Pass rate: {results['pass_rate']}%")
            print(f"  ✓ Report saved to {report_file}\n")

        except Exception as e:
            print(f"  ✗ Error validating {file.name}: {e}\n")

    # Generate summary report
    generate_summary_report(all_results, output_path / "summary_report.txt")


def generate_summary_report(all_results: List, output_file: Path):
    """Generate summary across all validated files"""
    with open(output_file, 'w') as f:
        f.write("=" * 70 + "\n")
        f.write("BATCH VALIDATION SUMMARY\n")
        f.write("=" * 70 + "\n\n")

        total_files = len(all_results)
        total_rows = sum(r['total_rows'] for r in all_results)
        total_valid = sum(r['valid_rows'] for r in all_results)
        total_invalid = sum(r['invalid_rows'] for r in all_results)
        avg_pass_rate = (sum(r['pass_rate'] for r in all_results) / total_files
                         if total_files > 0 else 0)

        f.write(f"Files Validated: {total_files}\n")
        f.write(f"Total Rows: {total_rows:,}\n")
        f.write(f"Valid Rows: {total_valid:,}\n")
        f.write(f"Invalid Rows: {total_invalid:,}\n")
        f.write(f"Average Pass Rate: {avg_pass_rate:.2f}%\n\n")

        f.write("INDIVIDUAL FILE RESULTS\n")
        f.write("-" * 70 + "\n")

        for result in all_results:
            f.write(f"\n{result['file_name']}:\n")
            f.write(f"  Rows: {result['total_rows']:,} | ")
            f.write(f"Valid: {result['valid_rows']:,} | ")
            f.write(f"Invalid: {result['invalid_rows']:,} | ")
            f.write(f"Pass Rate: {result['pass_rate']}%\n")

    print(f"Summary report saved to {output_file}")


if __name__ == '__main__':
    validate_directory(
        input_dir='./data/incoming',
        output_dir='./data/validation_reports'
    )
```
Scheduling Automated Validation
Option 1: Windows Task Scheduler (PowerShell)
Create schedule_validation.ps1:
```powershell
# Schedule daily validation at 8 AM
$action = New-ScheduledTaskAction -Execute "python" -Argument "C:\path\to\batch_validator.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
$principal = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount
$settings = New-ScheduledTaskSettingsSet -StartWhenAvailable

Register-ScheduledTask -TaskName "DailyDataValidation" -Action $action -Trigger $trigger -Principal $principal -Settings $settings
```
Option 2: Linux Cron Job
```shell
# Add to crontab (crontab -e)
# Run daily at 8 AM
0 8 * * * /usr/bin/python3 /path/to/batch_validator.py >> /var/log/validation.log 2>&1
```
Option 3: Python Schedule Library
Create scheduled_validator.py:
```python
# Requires: pip install schedule
import schedule
import time

from batch_validator import validate_directory


def job():
    """Validation job to run on schedule"""
    print(f"\n--- Starting scheduled validation at {time.strftime('%Y-%m-%d %H:%M:%S')} ---")
    validate_directory(
        input_dir='./data/incoming',
        output_dir='./data/validation_reports'
    )
    print("--- Validation complete ---\n")


# Schedule job every day at 8:00 AM
schedule.every().day.at("08:00").do(job)

# Or every hour
# schedule.every().hour.do(job)

# Or every 15 minutes
# schedule.every(15).minutes.do(job)

print("Validation scheduler started. Press Ctrl+C to exit.")

while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute
```
Email Alerts for Failed Validation
Add email notifications when validation fails:
```python
import os
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from typing import Dict, List


def send_validation_alert(results: Dict, recipients: List[str]):
    """Send email alert if validation fails"""

    if results['pass_rate'] < 100:  # Alert on any failures
        sender = "data-validation@company.com"
        subject = f"⚠️ Data Validation Failed: {results['file_name']}"

        # Create email body
        body = f"""
Data validation has detected issues requiring attention.

File: {results['file_name']}
Total Rows: {results['total_rows']:,}
Invalid Rows: {results['invalid_rows']:,}
Pass Rate: {results['pass_rate']}%

Please review the detailed validation report.

Validation Details:
"""

        for detail in results['validation_details']:
            if detail['error_count'] > 0:
                body += f"\n\n{detail['validator']} - {detail['column']}:"
                for error in detail['errors']:
                    body += f"\n  • {error}"

        # Create message
        msg = MIMEMultipart()
        msg['From'] = sender
        msg['To'] = ', '.join(recipients)
        msg['Subject'] = subject
        msg.attach(MIMEText(body, 'plain'))

        # Send email (keep credentials out of source code)
        try:
            with smtplib.SMTP('smtp.company.com', 587) as server:
                server.starttls()
                server.login(sender, os.environ['SMTP_PASSWORD'])
                server.send_message(msg)
            print(f"Alert email sent to {recipients}")
        except Exception as e:
            print(f"Failed to send email: {e}")
```
Best Practices for Production Use
1. Performance Optimization
For large files:
```python
import pandas as pd

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators


# Use chunking for files too large for memory
def validate_large_file(file_path: str, chunk_size: int = 10000):
    """Validate large files in chunks.

    Note: cross-row checks (e.g. UniquenessValidator) only see one
    chunk at a time, so they can miss duplicates that span chunks.
    """
    validators = create_product_validators()
    engine = ValidationEngine(validators)

    all_results = []
    chunk_num = 0

    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        chunk_num += 1
        print(f"Processing chunk {chunk_num}...")
        results = engine.validate_dataframe(chunk)
        all_results.append(results)

    # Combine results
    # (implementation depends on your needs)
    return all_results
```
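One simple way to combine the per-chunk result dictionaries is to sum the row counts and recompute the pass rate from the totals. A sketch, assuming each dict has the keys returned by `validate_dataframe`:

```python
def combine_chunk_results(all_results):
    """Aggregate per-chunk validation results into one summary."""
    total = sum(r['total_rows'] for r in all_results)
    valid = sum(r['valid_rows'] for r in all_results)
    return {
        'total_rows': total,
        'valid_rows': valid,
        'invalid_rows': total - valid,
        # Recompute from totals rather than averaging per-chunk rates,
        # so unevenly sized chunks are weighted correctly
        'pass_rate': round(valid / total * 100, 2) if total > 0 else 0,
    }


# Example with two made-up chunk summaries
chunks = [
    {'total_rows': 10000, 'valid_rows': 9900},
    {'total_rows': 10000, 'valid_rows': 9700},
]
print(combine_chunk_results(chunks)['pass_rate'])  # 98.0
```

Per-column error details could be merged the same way by concatenating each validator's `errors` lists across chunks.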
2. Logging
Add proper logging:
```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('validation.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
```
3. Configuration Files
Use YAML for validation rules:
```yaml
# validation_config.yaml
validators:
  - type: RequiredField
    column: product_id

  - type: Pattern
    column: product_id
    pattern: '^PRD-\d{6}$'
    pattern_name: 'Product ID format'

  - type: Range
    column: price
    min_value: 0
    max_value: 100000
```
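A config file needs a small loader to turn each entry into a validator instance. Here's a sketch, assuming PyYAML (`pip install pyyaml`); the `type_map` argument would map config type names to the classes from validators.py, e.g. `{'Range': RangeValidator, ...}`:

```python
import yaml


def load_validators(config_path: str, type_map: dict):
    """Build validator instances from a YAML config file.

    type_map maps each config 'type' name to a validator class
    whose first argument is the column name.
    """
    with open(config_path) as f:
        config = yaml.safe_load(f)

    validators = []
    for spec in config['validators']:
        spec = dict(spec)  # copy so we can pop keys
        cls = type_map[spec.pop('type')]
        column = spec.pop('column')
        # Any remaining keys (pattern, min_value, ...) become kwargs
        validators.append(cls(column, **spec))
    return validators
```

This works because each validator's extra settings (`pattern`, `min_value`, and so on) match its keyword arguments, so the leftover config keys can be passed straight through.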
Frequently Asked Questions
How do I validate data from APIs instead of files?
The ValidationEngine works with any pandas DataFrame. Fetch your API data, convert to DataFrame, then validate:
```python
import requests
import pandas as pd

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators

response = requests.get('https://api.example.com/products')
df = pd.DataFrame(response.json())

validators = create_product_validators()
engine = ValidationEngine(validators)
results = engine.validate_dataframe(df)
```
Can I validate data in a database?
Yes, load data from your database into a DataFrame:
```python
import pandas as pd
import sqlalchemy

from validation_engine import ValidationEngine
from validate_product_data import create_product_validators

db_engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM products', db_engine)

# Validate as normal
validation_engine = ValidationEngine(create_product_validators())
results = validation_engine.validate_dataframe(df)
```
How do I handle different validation rules for different file types?
Create separate validator factories:
```python
def create_customer_validators():
    return [...]

def create_order_validators():
    return [...]

# Use appropriate validators based on file type
if 'customer' in filename:
    validators = create_customer_validators()
elif 'order' in filename:
    validators = create_order_validators()
```
What's the performance impact on large datasets?
For files under roughly 100K rows, validation like this typically finishes in a few seconds. For larger datasets, use chunking, or consider declarative libraries like Great Expectations or Pandera that are built for production-scale validation.