- What: Cleaned & validated 10,000+ PAN records using PostgreSQL
- How: SQL-only data quality pipeline using CTEs, regex, and window functions
- Rules: Implemented strict, project-defined business validations in SQL
- Output: Audit-ready summary reports for valid vs invalid PANs
- Tech Stack: PostgreSQL | SQL | CTEs | Regex | Window Functions
In financial and compliance systems, invalid PAN numbers can lead to:
- Transaction failures
- Compliance violations
- Incorrect reporting
- Poor downstream analytics
Raw PAN data often contains missing values, duplicates, formatting issues, and structurally invalid entries.
👉 Challenge: Detect, clean, and validate PAN data reliably inside the database, without relying on external tools.
This project implements an end-to-end SQL-based data quality pipeline that:
- Cleans raw PAN data
- Applies strict format and rule-based validations
- Flags invalid records with clear logic
- Produces summary reports for decision-making
✨ The solution is implemented with set-based SQL logic, following production-style SQL practices.
This project performs the following tasks using SQL:
- NULLs and blanks — Remove empty and whitespace-only values
- Duplicates — Eliminate duplicate PAN entries
- Casing issues — Standardize all values to uppercase
- Normalization — Apply consistent formatting transformations
- Strict business validation rules — Enforce PAN format and structure requirements
- Adjacent repeated characters — Detect invalid consecutive character patterns
- Alphabetical sequences — Identify invalid ascending/descending letter patterns
- Numeric sequences — Detect invalid digit patterns (e.g., 1234, 9876)
- Total PANs before cleaning — Track original dataset size
- Missing PAN count — Report data quality metrics
- Valid vs invalid PAN counts — Categorize and summarize validation results
Processes datasets with 10,000+ records and produces consistent, rule-based validation results.
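The cleaning steps above can be sketched as a single set-based query (a sketch only; it assumes the `pan_numbers_dataset` table and `pan_number` column used in the import step later in this README):

```sql
-- Normalize and de-duplicate raw PAN values in one pass
SELECT DISTINCT UPPER(TRIM(pan_number)) AS pan_number
FROM pan_numbers_dataset
WHERE pan_number IS NOT NULL
  AND TRIM(pan_number) <> '';
```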
For readers unfamiliar with Indian tax identifiers:
PAN (Permanent Account Number) is a 10-character alphanumeric identifier issued by the Indian Income Tax Department. It is used for tax filing, financial transactions, banking, and regulatory compliance.
Official reference: https://www.incometax.gov.in/
Format: AAAAA1234A
- First 5: Alphabetic characters (A-Z)
- Next 4: Numeric digits (0-9)
- Last 1: Alphabetic character (A-Z)
- Total length: Exactly 10 characters
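This format maps directly to a single regex match in PostgreSQL (a sketch; table and column names follow the import step later in this README):

```sql
-- Flag each PAN as matching or violating the AAAAA1234A format
SELECT pan_number,
       pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]$' AS has_valid_format
FROM pan_numbers_dataset;
```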
Advanced Features Used:
- Common Table Expressions (CTEs) for stepwise processing
- Regular expressions for format validation
- Window functions for analysis
- PostgreSQL functions and views for reuse
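As one illustration of the window-function approach, duplicates can be flagged with `ROW_NUMBER()` (a sketch using the `pan_numbers_dataset` table described below; the project script may structure this differently):

```sql
-- Duplicates get rn > 1; downstream steps keep only rn = 1
SELECT pan_number,
       ROW_NUMBER() OVER (
           PARTITION BY UPPER(TRIM(pan_number))
           ORDER BY pan_number
       ) AS rn
FROM pan_numbers_dataset;
```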
- PostgreSQL (the project was developed and tested on PostgreSQL)
- A SQL client such as DBeaver or `psql`
- The CSV dataset: `PAN Number Validation Dataset.csv`
- Place the CSV file in a location accessible to your database server, or use your SQL client's import tool.
- (Optional) Create the database and table manually, or run the create statement in `cleaning_script.sql`:

```sql
CREATE DATABASE pan_number_valdation_project;
-- then connect to the database and run the rest of the script
```

- Import the CSV into the `pan_numbers_dataset` table. Example using `COPY` in PostgreSQL (run as a DB superuser or ensure file access):

```sql
-- adjust path to where the CSV lives on the server
COPY pan_numbers_dataset(pan_number)
FROM '/path/to/PAN Number Validation Dataset.csv'
WITH (FORMAT csv, HEADER true);
```

Alternatively, use DBeaver's data import wizard to load the CSV into `pan_numbers_dataset`.
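If you create the table manually, a minimal definition compatible with the `COPY` command above could look like this (a sketch; the actual definition lives in `cleaning_script.sql`):

```sql
-- Single-column staging table for the raw PAN values
CREATE TABLE IF NOT EXISTS pan_numbers_dataset (
    pan_number TEXT
);
```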
- Run the cleaning and validation queries in `cleaning_script.sql` (it contains step-by-step queries and a final summary). Example with `psql`:

```bash
psql -h <host> -U <user> -d pan_number_valdation_project -f cleaning_script.sql
```

```
PAN_Number_Validation/
│
├── dataset/
│   └── PAN_Number_Validation_Dataset.csv
│       # Raw input dataset containing PAN values
│
├── docs/
│   ├── PAN_Number_Validation_and_Data_Quality_Analysis.pdf
│   │   # Final project presentation (PDF)
│   │
│   └── PAN_Number_Validation_and_Data_Quality_Analysis.pptx
│       # Editable PowerPoint version of the presentation
│
├── snapshots/
│   ├── banner.png
│   │   # Project banner image used in README
│   │
│   └── summary_report.jpg
│       # Screenshot of final data quality summary output
│
├── cleaning_script.sql
│   # Production-style SQL script for cleaning, validating,
│   # and reporting PAN data using PostgreSQL
│
└── README.md
    # Project documentation and overview
```
Raw CSV Data
↓
Import to PostgreSQL Table
↓
Data Cleaning (TRIM, UPPER, DISTINCT)
↓
PAN Validation (Business Rules)
↓
Data Quality Summary Report
The script performs the following operations in sequence:
- Cleaning – removes NULLs, blanks, duplicates, and normalizes values
- Format Validation – enforces 10-character PAN structure
- Business Rules – applies sequence and repetition checks
- Reporting – generates valid/invalid summary counts
- Processed: 10,000 rows (example)
- Valid PANs: 3,186
- Invalid PANs: 5,839 (flagged by business rules)
- Missing / Incomplete: 967
- Artifacts: `summary_report.jpg` (visual), `cleaning_script.sql` (SQL pipeline)
ℹ️ Some rules are project-defined to demonstrate advanced data-quality logic and may not reflect official PAN issuance rules.
The following business and quality rules are implemented:
Must strictly match the regex pattern:
`^[A-Z]{5}[0-9]{4}[A-Z]$`
- First 5 characters: Uppercase letters only (A-Z)
- Next 4 characters: Numeric digits only (0-9)
- Last character: Uppercase letter only (A-Z)
No two consecutive identical letters allowed in the alphabetic section.
❌ Invalid Examples: `AABCD1234F` (AA - repeated A), `XXYZA1234F` (XX - repeated X)

✅ Valid Examples: `AXBCD1234F`, `ABCXD1234F`
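One way to express this check in PostgreSQL is a regex backreference (a sketch; `cleaning_script.sql` may implement it differently, e.g. with `SUBSTRING` comparisons):

```sql
-- Matches when the five-letter prefix contains two identical adjacent letters
SELECT pan_number
FROM pan_numbers_dataset
WHERE SUBSTRING(pan_number FROM 1 FOR 5) ~ '([A-Z])\1';
```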
First 5 letters must not form a strictly ascending ASCII sequence.
❌ Invalid Examples: `ABCDE1234F` (A→B→C→D→E ascending), `BCDEF1234F` (B→C→D→E→F ascending)

✅ Valid Examples: `ABCDX1234F`, `AXBCD1234F`
Implemented using ASCII arithmetic and generate_series() for efficient sequence detection.
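A possible shape for that helper, assuming a hypothetical function name `fn_check_sequence` (the real implementation in `cleaning_script.sql` may differ):

```sql
-- TRUE when every character is exactly one ASCII code point above its predecessor
CREATE OR REPLACE FUNCTION fn_check_sequence(p_str TEXT)
RETURNS BOOLEAN
LANGUAGE sql
AS $$
    SELECT BOOL_AND(
        ASCII(SUBSTRING(p_str FROM i + 1 FOR 1))
      - ASCII(SUBSTRING(p_str FROM i     FOR 1)) = 1
    )
    FROM generate_series(1, LENGTH(p_str) - 1) AS i;
$$;
```

For example, `fn_check_sequence('ABCDE')` is true while `fn_check_sequence('ABCDX')` is false; single-character input yields NULL, so callers should coalesce if needed.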
No two consecutive identical digits allowed in the 4-digit block.
❌ Invalid Examples: `ABCDE1123F` (11 - repeated 1), `ABCDE5556F` (55 - repeated 5)

✅ Valid Examples: `ABCDE1923F`, `ABCDE5678F`
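The same backreference idea applies to the digit block (a sketch; the script may use a different formulation):

```sql
-- Matches when the four-digit block contains identical adjacent digits
SELECT pan_number
FROM pan_numbers_dataset
WHERE SUBSTRING(pan_number FROM 6 FOR 4) ~ '([0-9])\1';
```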
The 4-digit numeric block must not be a strictly ascending sequence.
❌ Invalid Sequential Patterns: `1234` (1→2→3→4), `2345` (2→3→4→5), `3456` (3→4→5→6), `4567`, `5678`, `6789`

✅ Valid Examples: `1923` (non-sequential), `5432` (descending), `1357` (irregular)

Validated by explicitly excluding known sequences: `('0123','1234','2345','3456','4567','5678','6789')`
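In query form, that exclusion might look like the following (a sketch against the `pan_numbers_dataset` table):

```sql
-- The digit block must not be one of the known ascending runs
SELECT pan_number
FROM pan_numbers_dataset
WHERE SUBSTRING(pan_number FROM 6 FOR 4)
      NOT IN ('0123','1234','2345','3456','4567','5678','6789');
```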
All rules are applied in sequence to ensure high-confidence data quality and compliance with the project's PAN business rules. Invalid records are flagged and excluded from the final validated dataset.
The script produces:
- A cleaned set of PANs (trimmed, uppercased, distinct)
- Lists of rows that match or fail validation rules
- A summary table with counts grouped by `Valid PAN` / `Invalid PAN`
Sample summary report generated by the pipeline
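A simplified version of the summary query (only the format rule is shown here; the full script layers all business rules before categorizing):

```sql
-- Categorize every record and count each bucket
WITH classified AS (
    SELECT pan_number,
           CASE WHEN pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]$'
                THEN 'Valid PAN'
                ELSE 'Invalid PAN'
           END AS status
    FROM pan_numbers_dataset
)
SELECT status, COUNT(*) AS total
FROM classified
GROUP BY status;
```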
- Implemented using set-based SQL, avoiding RBAR (row-by-row) processing
- Designed for predictable scaling as data volume increases
- Validation logic is database-driven, enabling efficient bulk execution
Optional optimizations:
- Functional index on normalized PAN values
- Materialized view for reused cleaned datasets
- `VACUUM ANALYZE` after bulk loads
- Measure execution cost using `EXPLAIN (ANALYZE, BUFFERS)`
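For example, the functional index could be created as follows (the index name `idx_pan_normalized` is illustrative, not part of the project script):

```sql
-- Lets lookups on the normalized value use the index
-- instead of recomputing UPPER(TRIM(...)) per row
CREATE INDEX idx_pan_normalized
    ON pan_numbers_dataset (UPPER(TRIM(pan_number)));
```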
Future enhancements:
- Automated unit tests for validation logic
- Additional checksum or pattern-based rules
- Export validation results for downstream analytics
- Complex SQL: CTEs, window functions, and set-based data processing
- Data Validation: Implementing comprehensive business rules in SQL
- Regular Expressions: Pattern matching for data format validation
- Database Performance: Writing efficient queries for large datasets
- Data Quality Assurance: Comprehensive validation rule implementation
This project demonstrates the ability to:
- ✅ PostgreSQL Expertise: Advanced SQL development for complex data validation
- ✅ Data Quality Assurance: Implement comprehensive validation rules
- ✅ Financial Compliance: Handle sensitive government ID (PAN) validation
- ✅ Scalable Solutions: Process thousands of records efficiently using database optimization
- ✅ Report Generation: Generate comprehensive summary reports with validation statistics
- ✅ Database Design: Create normalized schemas with proper constraints and views
Manohar K
Data Analyst & SQL Developer | Data Quality & Validation Specialist
Passionate about building scalable SQL-based solutions for complex data validation and quality assurance. Specialized in PostgreSQL development, business rule implementation, and producing audit-ready reports.
For questions or changes, update the repository or contact the project owner.

