ManoharManu1999/SQL-PAN-Number-Validation-Project
PAN Number Validation Project Banner

🪪 PAN Number Validation & Data Quality Analysis (SQL)

TL;DR

  • What: Cleaned & validated 10,000+ PAN records using PostgreSQL
  • How: SQL-only data quality pipeline using CTEs, regex, and window functions
  • Rules: Implemented strict, project-defined business validations in SQL
  • Output: Audit-ready summary reports for valid vs invalid PANs
  • Tech Stack: PostgreSQL | SQL | CTEs | Regex | Window Functions

📌 Business Problem

In financial and compliance systems, invalid PAN numbers can lead to:

  • Transaction failures
  • Compliance violations
  • Incorrect reporting
  • Poor downstream analytics

Raw PAN data often contains missing values, duplicates, formatting issues, and structurally invalid entries.

👉 Challenge: Detect, clean, and validate PAN data reliably inside the database, without relying on external tools.

✅ Solution Overview

This project implements an end-to-end SQL-based data quality pipeline that:

  • Cleans raw PAN data
  • Applies strict format and rule-based validations
  • Flags invalid records with clear logic
  • Produces summary reports for decision-making

The solution is implemented using set-based SQL logic, following production-style SQL practices.

🎯 What This Project Does

This project performs the following tasks using SQL:

Data Cleaning

  • NULLs and blanks — Remove empty and whitespace-only values
  • Duplicates — Eliminate duplicate PAN entries
  • Casing issues — Standardize all values to uppercase
  • Normalization — Apply consistent formatting transformations
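The cleaning steps above can be sketched as a single set-based query. The table and column names (`pan_numbers_dataset.pan_number`) follow the import example later in this README; the view name `cleaned_pan` is illustrative:

```sql
-- Minimal cleaning sketch: drop NULLs and whitespace-only values,
-- uppercase and trim, and deduplicate in one pass.
CREATE OR REPLACE VIEW cleaned_pan AS
SELECT DISTINCT UPPER(TRIM(pan_number)) AS pan_number
FROM pan_numbers_dataset
WHERE pan_number IS NOT NULL
  AND TRIM(pan_number) <> '';
```

Doing the normalization once in a view keeps every downstream validation query reading from the same cleaned relation.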

Validation & Pattern Detection

  • Strict business validation rules — Enforce PAN format and structure requirements
  • Adjacent repeated characters — Detect invalid consecutive character patterns
  • Alphabetical sequences — Identify invalid ascending/descending letter patterns
  • Numeric sequences — Detect invalid digit patterns (e.g., 1234, 9876)

Data Quality Reporting

  • Total PANs before cleaning — Track original dataset size
  • Missing PAN count — Report data quality metrics
  • Valid vs invalid PAN counts — Categorize and summarize validation results

Processes datasets with 10,000+ records and produces consistent, rule-based validation results.


📋 PAN Background (Context)

For readers unfamiliar with Indian tax identifiers:

PAN (Permanent Account Number) is a 10-character alphanumeric identifier issued by the Indian Income Tax Department. It is used for tax filing, financial transactions, banking, and regulatory compliance.

Official reference: https://www.incometax.gov.in/

📋 PAN Format & Structure

Format: AAAAA1234A

  • First 5: Alphabetic characters (A-Z)
  • Next 4: Numeric digits (0-9)
  • Last 1: Alphabetic character (A-Z)
  • Total length: Exactly 10 characters

🛠️ Implementation Details

🐘 Key PostgreSQL Features Used

Advanced Features Used:

  • Common Table Expressions (CTEs) for stepwise processing
  • Regular expressions for format validation
  • Window functions for analysis
  • PostgreSQL functions and views for reuse

Prerequisites

  • PostgreSQL (the script relies on PostgreSQL-specific features such as regex operators and generate_series())
  • A SQL client such as DBeaver or psql
  • The CSV dataset: PAN Number Validation Dataset.csv

How to Use

  1. Place the CSV file in a location accessible to your database server or use your SQL client import tool.

  2. (Optional) Create the database and table manually, or run the create statement in cleaning_script.sql:

CREATE DATABASE pan_number_valdation_project;
-- then connect to the database and run the rest of the script
  3. Import CSV into the pan_numbers_dataset table. Example using COPY in PostgreSQL (run as a DB superuser or ensure file access):
-- adjust path to where the CSV lives on the server
COPY pan_numbers_dataset(pan_number)
FROM '/path/to/PAN Number Validation Dataset.csv'
WITH (FORMAT csv, HEADER true);

Alternatively, use DBeaver's data import wizard to load the CSV into pan_numbers_dataset.

  4. Run the cleaning and validation queries in cleaning_script.sql (it contains step-by-step queries and a final summary). Example with psql:
psql -h <host> -U <user> -d pan_number_valdation_project -f cleaning_script.sql

📁 Project Structure

PAN_Number_Validation/
│
├── dataset/
│   └── PAN_Number_Validation_Dataset.csv
│      # Raw input dataset containing PAN values
│
├── docs/
│   ├── PAN_Number_Validation_and_Data_Quality_Analysis.pdf
│   │   # Final project presentation (PDF)
│   │
│   └── PAN_Number_Validation_and_Data_Quality_Analysis.pptx
│       # Editable PowerPoint version of the presentation
│
├── snapshots/
│   ├── banner.png
│   │   # Project banner image used in README
│   │
│   └── summary_report.jpg
│       # Screenshot of final data quality summary output
│
├── cleaning_script.sql
│   # Production-style SQL script for cleaning, validating,
│   # and reporting PAN data using PostgreSQL
│
└── README.md
    # Project documentation and overview

🔍 SQL Data Processing Flow

Raw CSV Data
   ↓
Import to PostgreSQL Table
   ↓
Data Cleaning (TRIM, UPPER, DISTINCT)
   ↓
PAN Validation (Business Rules)
   ↓
Data Quality Summary Report

📄 What cleaning_script.sql Does

The script performs the following operations in sequence:

  1. Cleaning – removes NULLs, blanks, duplicates, and normalizes values
  2. Format Validation – enforces 10-character PAN structure
  3. Business Rules – applies sequence and repetition checks
  4. Reporting – generates valid/invalid summary counts

Outcome

  • Processed: 10,000 rows (example)
  • Valid PANs: 3,186
  • Invalid PANs: 5,839 (flagged by business rules)
  • Missing / Incomplete: 967
  • Artifacts: summary_report.jpg (visual), cleaning_script.sql (SQL pipeline)

🎯 Validation Rules Enforced

ℹ️ Some rules are project-defined to demonstrate advanced data-quality logic and may not reflect official PAN issuance rules.

The following business and quality rules are implemented:

1️⃣ Format Validation

Must strictly match the regex pattern:

^[A-Z]{5}[0-9]{4}[A-Z]$
  • First 5 characters: Uppercase letters only (A-Z)
  • Next 4 characters: Numeric digits only (0-9)
  • Last character: Uppercase letter only (A-Z)
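In PostgreSQL the format check maps directly onto the `~` regex-match operator. A sketch, assuming a `cleaned_pan` relation holding the cleaned values:

```sql
-- Keep only values that strictly match the PAN structure:
-- 5 uppercase letters, 4 digits, 1 uppercase letter.
SELECT pan_number
FROM cleaned_pan                       -- illustrative relation name
WHERE pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]$';
```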

2️⃣ No Adjacent Alphabet Repetition (First 5 Characters)

No two consecutive identical letters allowed in the alphabetic section.

Invalid Examples:

  • AABCD1234F (AA - repeated A)
  • XXYZA1234F (XX - repeated X)

Valid Examples:

  • AXBCD1234F
  • ABCXD1234F
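One way to express this check is a small SQL function built on `generate_series()`; the function name here is illustrative, not necessarily what cleaning_script.sql uses:

```sql
-- TRUE when any two adjacent characters in the string are identical.
CREATE OR REPLACE FUNCTION has_adjacent_repeat(p_str TEXT)
RETURNS BOOLEAN LANGUAGE sql AS $$
  SELECT bool_or(SUBSTRING(p_str, i, 1) = SUBSTRING(p_str, i + 1, 1))
  FROM generate_series(1, LENGTH(p_str) - 1) AS i;
$$;
```

Applied to `SUBSTRING(pan_number, 1, 5)`, this flags rule 2 violations such as AABCD.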

3️⃣ No Alphabetical Sequences (First 5 Characters)

First 5 letters must not form a strictly ascending ASCII sequence.

Invalid Examples:

  • ABCDE1234F (A→B→C→D→E ascending)
  • BCDEF1234F (B→C→D→E→F ascending)

Valid Examples:

  • ABCDX1234F
  • AXBCD1234F

Implemented using ASCII arithmetic and generate_series() for efficient sequence detection.
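A sketch of that ASCII-arithmetic approach, again with an illustrative function name:

```sql
-- TRUE when the string is a strictly ascending run of consecutive
-- characters (ASCII difference of exactly 1 between each pair),
-- e.g. 'ABCDE'; 'ABCDX' is not flagged.
CREATE OR REPLACE FUNCTION is_sequential(p_str TEXT)
RETURNS BOOLEAN LANGUAGE sql AS $$
  SELECT bool_and(ASCII(SUBSTRING(p_str, i + 1, 1))
                - ASCII(SUBSTRING(p_str, i, 1)) = 1)
  FROM generate_series(1, LENGTH(p_str) - 1) AS i;
$$;
```

Because it works on ASCII codes, the same function also detects ascending digit runs in the numeric block.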

4️⃣ No Adjacent Digit Repetition (Numeric Block)

No two consecutive identical digits allowed in the 4-digit block.

Invalid Examples:

  • ABCDE1123F (11 - repeated 1)
  • ABCDE5556F (55 - repeated 5)

Valid Examples:

  • ABCDE1923F
  • ABCDE5678F
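For the digit block, a regex backreference is a compact alternative to character-by-character comparison. A sketch, assuming a `cleaned_pan` relation:

```sql
-- Flag PANs whose 4-digit block (positions 6-9) contains two
-- identical adjacent digits, e.g. '1123' or '5556'.
SELECT pan_number
FROM cleaned_pan                       -- illustrative relation name
WHERE SUBSTRING(pan_number, 6, 4) ~ '(\d)\1';
```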

5️⃣ No Sequential Digits

The 4-digit numeric block must not be a strictly ascending sequence.

Invalid Sequential Patterns:

  • 1234 (1→2→3→4)
  • 2345 (2→3→4→5)
  • 3456 (3→4→5→6)
  • 4567, 5678, 6789

Valid Examples:

  • 1923 (non-sequential)
  • 5432 (descending)
  • 1357 (irregular)

Validated by explicitly excluding known sequences: ('0123','1234','2345','3456','4567','5678','6789')
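The exclusion list translates directly into a NOT IN predicate over the extracted digit block:

```sql
-- Keep only PANs whose digit block is not one of the
-- explicitly enumerated ascending runs.
SELECT pan_number
FROM cleaned_pan                       -- illustrative relation name
WHERE SUBSTRING(pan_number, 6, 4)
      NOT IN ('0123','1234','2345','3456','4567','5678','6789');
```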

Summary of Validations

All rules are applied in sequence to ensure high-confidence data quality and compliance with the project-defined PAN business rules. Invalid records are flagged and excluded from the final validated dataset.

📊 Example Output

The script produces:

  • A cleaned set of PANs (trimmed, uppercased, distinct)
  • Lists of rows that match or fail validation rules
  • A summary table with counts grouped by Valid PAN / Invalid PAN
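The summary table can be produced with a single grouped CASE expression. This sketch applies only the format rule for brevity; the real script combines all five rules:

```sql
-- Categorize each cleaned PAN and count per category.
SELECT CASE WHEN pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]$'
            THEN 'Valid PAN'
            ELSE 'Invalid PAN'
       END        AS status,
       COUNT(*)   AS total
FROM cleaned_pan                       -- illustrative relation name
GROUP BY 1;
```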

Example Visual Summary

Summary Report

Sample summary report generated by the pipeline

📈 Performance, Scalability & Enhancements

Performance & Scalability

  • Implemented using set-based SQL, avoiding RBAR (row-by-row) processing
  • Designed for predictable scaling as data volume increases
  • Validation logic is database-driven, enabling efficient bulk execution

Optimizations & Next Steps

Optional optimizations:

  • Functional index on normalized PAN values
  • Materialized view for reused cleaned datasets
  • VACUUM ANALYZE after bulk loads
  • Measure execution cost using EXPLAIN (ANALYZE, BUFFERS)
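The optimizations above might look like the following; object names are illustrative:

```sql
-- Functional index so lookups on the normalized value avoid re-computation.
CREATE INDEX idx_pan_normalized
    ON pan_numbers_dataset (UPPER(TRIM(pan_number)));

-- Materialized view to persist the cleaned dataset for repeated reuse.
CREATE MATERIALIZED VIEW cleaned_pan_mv AS
SELECT DISTINCT UPPER(TRIM(pan_number)) AS pan_number
FROM pan_numbers_dataset
WHERE pan_number IS NOT NULL AND TRIM(pan_number) <> '';

-- Refresh planner statistics after a bulk load, then inspect cost.
VACUUM ANALYZE pan_numbers_dataset;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM cleaned_pan_mv;
```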

Future enhancements:

  • Automated unit tests for validation logic
  • Additional checksum or pattern-based rules
  • Export validation results for downstream analytics

💡 What I Learned

  • Complex SQL: CTEs, window functions, and set-based data processing
  • Data Validation: Implementing comprehensive business rules in SQL
  • Regular Expressions: Pattern matching for data format validation
  • Database Performance: Writing efficient queries for large datasets
  • Data Quality Assurance: Comprehensive validation rule implementation

🎓 Business Value

This project demonstrates:

  • PostgreSQL Expertise: Advanced SQL development for complex data validation
  • Data Quality Assurance: Implement comprehensive validation rules
  • Financial Compliance: Handle sensitive government ID (PAN) validation
  • Scalable Solutions: Process thousands of records efficiently using database optimization
  • Report Generation: Generate comprehensive summary reports with validation statistics
  • Database Design: Create normalized schemas with proper constraints and views

👤 Author

Manohar K
Data Analyst & SQL Developer | Data Quality & Validation Specialist

Passionate about building scalable SQL-based solutions for complex data validation and quality assurance. Specialized in PostgreSQL development, business rule implementation, and producing audit-ready reports.

Contact

For questions or suggested changes, open an issue on the repository or contact the project owner.

LinkedIn Email GitHub

About

SQL-based PAN data validation & data quality analysis using PostgreSQL. Implements cleaning, business-rule validation, and audit-ready reporting.
