A comprehensive, enterprise-grade toolkit for analyzing, profiling, and migrating Hospital Information System (HIS) databases. This centralized dashboard provides end-to-end capabilities from database analysis and profiling to schema mapping and configuration generation with live database connectivity.
- Multi-Database Support: Analyze MySQL, PostgreSQL, and MSSQL databases
- Deep Data Profiling: Column-level statistics, data quality metrics, and composition analysis
- Schema Analysis: Automatic DDL extraction with schema namespace support
- Smart Sampling: Intelligent data sampling with NULL and empty-string filtering
- Auto-Dependency Management: Automatic installation of required database clients
- HTML Reports: Interactive reports with DataTables integration
- Configuration Generator: Export migration configs in TypeScript/JSON format
- Datasource Management: Centralized database connection profiles with SQLite storage
- Connection Pooling: Singleton pattern for efficient connection reuse across requests
- Enhanced Schema Mapper: Dual-mode source selection (Run ID or Live Datasource)
- Live Schema Discovery: Dynamic table and column loading from connected databases
- Smart Column Suggestions: Auto-suggest target columns from the actual database schema
- Configuration Repository: Save and load mapping configurations from the project database
- Configuration History: Version tracking with comparison and rollback capabilities
- Migration Engine: Production-ready ETL execution with batch processing and logging
- AI-Powered Mapping: Semantic column matching using ML transformers and healthcare dictionaries
- Architecture
- Requirements
- Installation
- Quick Start
- Configuration
- Usage
- Workflow
- Advanced Features
- Troubleshooting
- Contributing
- License
```
his-analyzer/
├── app.py                        # Main Streamlit Dashboard Application
├── requirements.txt              # Python Dependencies
├── README.md                     # Documentation
├── migration_tool.db             # SQLite database (datasources, configs, config_histories)
│
├── views/                        # Streamlit Pages
│   ├── schema_mapper.py          # Schema mapping interface (v8.0 enhanced)
│   │                             #   • AI auto-mapping with ML models
│   │                             #   • Dual-mode: Run ID or Live Datasource
│   │                             #   • Configuration save/load with versioning
│   ├── migration_engine.py       # Migration execution engine (v8.0)
│   │                             #   • Batch processing with streaming
│   │                             #   • Data transformation pipeline
│   │                             #   • Real-time logging and progress tracking
│   └── settings.py               # Datasource & config management (v8.0)
│                                 #   • Datasource CRUD operations
│                                 #   • Configuration history viewer
│                                 #   • Version comparison and rollback
│
├── services/                     # Business Logic
│   ├── db_connector.py           # Database connection pool (v8.0 refactored)
│   │                             #   • Singleton pattern for connection reuse
│   │                             #   • SQLAlchemy engine creation
│   ├── transformers.py           # Data transformation functions (v8.0)
│   │                             #   • BUDDHIST_TO_ISO, TRIM, JSON parsing
│   │                             #   • Batch transformer application
│   └── ml_mapper.py              # AI-powered mapping service (v8.0 NEW)
│                                 #   • Sentence Transformers ML model
│                                 #   • Healthcare domain dictionary
│                                 #   • Sample data pattern analysis
│
├── database.py                   # SQLite operations (v8.0 enhanced)
│                                 #   • Datasources CRUD
│                                 #   • Configs CRUD with versioning
│                                 #   • config_histories table management
│                                 #   • Version comparison utilities
│
├── config.py                     # Application configuration
├── utils/                        # Utility functions
│   └── helpers.py                # Common helper functions
│
├── analysis_report/              # Database Analysis Engine (Shell Script)
│   ├── config.json               # Database connection configuration
│   ├── unified_db_analyzer.sh    # Core analysis script (Bash v7.1)
│   │                             #   • Multi-database support (MySQL, PG, MSSQL)
│   │                             #   • Smart sampling (NOT NULL, NOT EMPTY)
│   │                             #   • Deep analysis mode
│   │                             #   • Auto-dependency installation
│   ├── csv_to_html.py            # HTML report generator
│   └── migration_report/         # Analysis output directory
│       └── YYYYMMDD_HHMM/        # Timestamped report folders
│           ├── ddl_schema/       # DDL export (schema.sql)
│           ├── data_profile/     # CSV and HTML reports
│           └── process.log       # Execution logs
│
├── migration_logs/               # Migration execution logs (v8.0 NEW)
│   └── migration_NAME_TIMESTAMP.log  # Timestamped ETL logs
│
└── mini_his/                     # Mock Data Generator
    ├── gen_mini_his.py           # Python data generator
    └── full_his_mockup.sql       # Base SQL schema
```
```
┌──────────────────────────────────────────────────────────────────┐
│                       ANALYSIS PHASE (Bash)                      │
├──────────────────────────────────────────────────────────────────┤
│  Source DB → unified_db_analyzer.sh → CSV/HTML Reports           │
│    • Profile data quality                                        │
│    • Extract schema (DDL)                                        │
│    • Calculate statistics                                        │
└──────────────────────────────────────────────────────────────────┘
                                ↓
┌──────────────────────────────────────────────────────────────────┐
│                    MAPPING PHASE (Python + AI)                   │
├──────────────────────────────────────────────────────────────────┤
│  Schema Mapper (Streamlit)                                       │
│  ├── Load: CSV Report OR Live Datasource                         │
│  ├── AI Auto-Map: ML Model suggests column mappings              │
│  ├── Manual Review: User confirms/modifies                       │
│  ├── Transformer Selection: Date conv, trim, etc.                │
│  └── Save: Config → SQLite (with versioning)                     │
└──────────────────────────────────────────────────────────────────┘
                                ↓
┌──────────────────────────────────────────────────────────────────┐
│                    MIGRATION PHASE (Python ETL)                  │
├──────────────────────────────────────────────────────────────────┤
│  Migration Engine                                                │
│  ├── Load Config: From SQLite or JSON file                       │
│  ├── Connect: Source & Target via datasource profiles            │
│  ├── Extract: Batch streaming (pandas + SQLAlchemy)              │
│  ├── Transform: Apply transformers to each batch                 │
│  ├── Load: Bulk insert to target (to_sql)                        │
│  └── Log: Real-time progress + persistent logs                   │
└──────────────────────────────────────────────────────────────────┘
                                ↓
┌──────────────────────────────────────────────────────────────────┐
│                    VERSIONING & AUDIT (SQLite)                   │
├──────────────────────────────────────────────────────────────────┤
│  migration_tool.db                                               │
│  ├── datasources: Connection profiles                            │
│  ├── configs: Current mapping configurations                     │
│  └── config_histories: All versions with timestamps              │
│        • Version comparison & rollback support                   │
└──────────────────────────────────────────────────────────────────┘
```
- Operating System: Linux, macOS, Windows (via WSL2)
- Shell: Bash 4.0+ (auto-switch on macOS)
- Python: 3.8 or higher
- RAM: 4GB minimum, 8GB+ recommended for large databases
The toolkit requires database-specific clients:
- MySQL: `mysql-client`
- PostgreSQL: `libpq` (PostgreSQL client)
- MSSQL: `mssql-tools18` (with ODBC driver)
Note: On macOS with Homebrew, these dependencies are auto-installed when missing.
- `streamlit >= 1.30.0` - Web dashboard framework
- `pandas >= 2.0.0` - Data manipulation
- `jq` - JSON processor (system package)
Using a virtual environment prevents version conflicts with system Python packages.
```bash
# 1. Clone the repository
git clone https://github.com/yourusername/his-analyzer.git
cd his-analyzer

# 2. Create virtual environment
python3 -m venv venv

# 3. Activate virtual environment
source venv/bin/activate      # macOS/Linux
# venv\Scripts\activate       # Windows

# 4. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

```bash
# Install Python dependencies
pip3 install -r requirements.txt

# Install system dependencies (macOS with Homebrew)
brew install jq

# Install database clients as needed
brew install mysql-client
brew install libpq
brew tap microsoft/mssql-release && brew install mssql-tools18
```

On first run, the application automatically creates the `migration_tool.db` SQLite database:
```bash
# Start the Streamlit application
streamlit run app.py

# Or with hot reload
streamlit run app.py --server.runOnSave true
```

The database is created automatically with the following tables:

- `datasources` - Stores database connection profiles
- `configs` - Stores schema mapping configurations
No manual setup required! The database initialization happens on application startup.
Navigate to Settings page in the Streamlit interface to manage datasources:
- Click "Settings" in the sidebar
- Select "Datasources" tab
- Click "Add New Datasource"
- Fill in connection details:
- Name (unique identifier)
- Database Type (MySQL, PostgreSQL, MSSQL)
- Host, Port, Database Name
- Username, Password
- Test connection
- Save datasource
Datasources are stored in SQLite and reused across:
- Schema Mapper (source & target selection)
- Migration Engine (connection profiles)
- All database operations (via connection pool)
Edit analysis_report/config.json:
```json
{
  "database": {
    "type": "mysql",
    "host": "localhost",
    "port": "3306",
    "name": "hospital_db",
    "user": "root",
    "password": "your_password",
    "schema": "",
    "tables": []
  },
  "sampling": {
    "default_limit": 10,
    "max_text_length": 300,
    "deep_analysis": true,
    "exceptions": []
  }
}
```

```bash
cd analysis_report
./unified_db_analyzer.sh
```

Output: Creates a timestamped report in `migration_report/YYYYMMDD_HHMM/`
Note: In v8.0, you can also connect directly to datasources in Schema Mapper, bypassing the need for analysis reports.
| Field | Description | Example |
|---|---|---|
| `type` | Database type | `mysql`, `postgresql`, `mssql` |
| `host` | Database host | `localhost`, `192.168.1.100` |
| `port` | Database port | `3306`, `5432`, `1433` |
| `name` | Database name | `hospital_db` |
| `user` | Username | `admin` |
| `password` | Password | `secure_password` |
| `schema` | Schema name (optional) | `public`, `dbo` |
| `tables` | Specific tables (optional) | `["patients", "visits"]` |
Specify database schema for PostgreSQL and MSSQL:
```json
{
  "database": {
    "type": "postgresql",
    "schema": "public",
    ...
  }
}
```

Defaults:

- PostgreSQL: `public`
- MSSQL: `dbo`
- MySQL: Not applicable
| Parameter | Description | Default |
|---|---|---|
| `default_limit` | Number of sample rows | `10` |
| `max_text_length` | Max characters for text fields | `300` |
| `deep_analysis` | Enable detailed statistics | `true` |
| `exceptions` | Per-column overrides | `[]` |
Override sampling limits for specific columns:
```json
{
  "sampling": {
    "exceptions": [
      { "table": "patients", "column": "notes", "limit": 3 },
      { "table": "visits", "column": "diagnosis", "limit": 5 }
    ]
  }
}
```

This toolkit uses a pure Bash shell script (`unified_db_analyzer.sh`) for database profiling instead of Python ETL frameworks or commercial tools. Here's why:
| Aspect | Shell Script Approach | Python/Tools Alternative |
|---|---|---|
| Dependencies | Minimal: `bash`, `jq`, native DB clients | Heavy: pandas, SQLAlchemy, various libraries |
| Portability | Runs anywhere with Bash 4.0+ | Requires Python environment setup |
| Performance | Direct database access, minimal overhead | Abstraction layers slow down queries |
| Security | No code execution risks, simple audit | Complex dependency chains, supply-chain risks |
| Maintenance | Single 600-line script, easy to debug | Multiple packages, version conflicts |
| Installation | Auto-installs missing DB clients via Homebrew | Manual pip installs, virtual environments |
1. Zero-Setup Profiling
```bash
# No Python, no pip install, no virtual env - just run
cd analysis_report
./unified_db_analyzer.sh
```

2. Multi-Database Native Support

- Directly uses `mysql`, `psql`, and `sqlcmd` for optimal performance
- Schema-aware profiling (PostgreSQL `public`, MSSQL `dbo`)
- Handles database-specific quirks (MSSQL SSL certs, NULL warnings)
3. Production-Ready Features
- Smart Sampling: Filters NULL/empty values automatically
- Deep Analysis Mode: Min/Max, Top-5 frequencies, data composition
- Exception Rules: Per-column sampling limits
- Table Size Calculation: Actual disk usage in MB
- DDL Export: Complete schema with indexes and constraints
4. Migration-Friendly Output
- CSV Format: Universal, works with any ETL tool
- HTML Reports: Interactive DataTables for business users
- Timestamped Runs: Tracks profiling history (`YYYYMMDD_HHMM/`)
- Process Logs: Complete audit trail for compliance
5. Real-World Migration Use Cases
```bash
# Before migration: Profile source database
./unified_db_analyzer.sh   # Analyzes source system

# Review data quality, identify issues
open migration_report/20251130_1523/data_profile/data_profile.html

# Load into Streamlit for schema mapping
# Use profiling data to design transformations

# Execute migration with confidence,
# knowing exact data types, null counts, value ranges
```

6. Shell Script Advantages for Migration
- Repeatable: Run daily to track data changes over time
- Scriptable: Integrate into CI/CD pipelines
- Offline: Profile production DB, analyze on laptop (CSV export)
- Auditable: Single script = complete transparency
- Fast: No Python overhead, direct SQL execution
```bash
cd analysis_report
./unified_db_analyzer.sh
```

Features:

- Auto-detects database type from `config.json`
- Checks and installs missing dependencies (macOS with Homebrew)
- Exports DDL schema to `schema.sql`
- Generates CSV data profile with smart NULL/empty filtering
- Creates interactive HTML report with DataTables
- Logs all operations to `process.log`

Output Structure:

```
migration_report/20251124_0023/
├── ddl_schema/
│   └── schema.sql            # Complete DDL export
├── data_profile/
│   ├── data_profile.csv      # Raw profiling data
│   └── data_profile.html     # Interactive report
└── process.log               # Execution log
```
The dashboard provides several interfaces:
Dual Source Mode:
- Run ID Mode: Load from CSV analysis reports (legacy)
- Datasource Mode: Connect directly to live database (new!)
Features:
- View table and column statistics
- Map source to target fields with live schema discovery
- Smart target column suggestions from actual database
- Select data transformers and validators
- Save/load configurations from project database
- Generate TypeScript/JSON configurations
- Export configurations as downloadable files
Workflow:
- Source Configuration: Choose Run ID or Datasource
- Run ID: Select from analysis report folders
- Datasource: Select datasource → Choose table (auto-loads schema)
- Target Configuration: Select target datasource and table
- Field Mapping: Map source columns to target with suggestions
- Save Configuration: Store in SQLite for reuse
- Export: Download as JSON for migration tools
Datasources Tab:
- Add/Edit/Delete datasource profiles
- Test database connections
- View all configured datasources
- Secure credential storage in SQLite
Saved Configs Tab:
- View all saved schema mapping configurations
- Load configurations for editing
- Delete unused configurations
- Export configurations
- Select source and target from datasource profiles
- Load saved configurations from project database
- Upload configuration files
- Execute data migration (simulation mode)
- Browse database schema
- Click tables to view CREATE statements
- Navigate foreign key relationships
- Generate test data for migration testing
- Configurable data volumes
- Realistic HIS data patterns
```mermaid
flowchart TD
    A[Configure config.json] --> B{Select Database Type}
    B -->|MySQL| C1[MySQL Client]
    B -->|PostgreSQL| C2[PostgreSQL Client]
    B -->|MSSQL| C3[MSSQL Client + SSL]
    C1 --> D[unified_db_analyzer.sh]
    C2 --> D
    C3 --> D
    D --> E{Check Dependencies}
    E -->|Missing| F[Auto-Install via Homebrew]
    E -->|Available| G
    F --> G[Start Analysis]
    G --> H1[Table Size & Row Count]
    G --> H2[Column Profiling]
    G --> H3[DDL Export]
    H2 --> I{Deep Analysis?}
    I -->|true| J[Min/Max/Top5/Composition]
    I -->|false| K[Basic Stats Only]
    J --> L[Smart Sample<br/>NOT NULL & NOT EMPTY]
    K --> L
    L --> M[Export to CSV]
    M --> N[Generate HTML Report]
    N --> O[migration_report/YYYYMMDD_HHMM/]
    O --> P[Open in Streamlit Dashboard]
    P --> Q[Schema Mapping & Config Generation]

    style D fill:#4CAF50,color:#fff
    style L fill:#FF9800,color:#fff
    style O fill:#2196F3,color:#fff
    style Q fill:#9C27B0,color:#fff
```
1. Database Analysis
   - Configure `config.json` with source database credentials
   - Run `./unified_db_analyzer.sh`
   - Review generated reports

2. Schema Mapping
   - Launch Streamlit dashboard
   - Navigate to Schema Mapper
   - Load analysis report
   - Map source fields to target schema
   - Select transformers (e.g., date format converters, string normalizers)

3. Configuration Export
   - Generate TypeScript/JSON configuration
   - Integrate with migration pipeline
   - Test with mock data if needed

4. Migration Execution
   - Use generated config with your ETL tool
   - Monitor data quality metrics
   - Validate migrated data
The toolkit uses a singleton connection pool pattern for efficient database operations:
Benefits:
- Reuses connections across multiple requests
- Automatic health checks and reconnection
- Significant performance improvement for repeated operations
- Thread-safe connection management
How it works:
```python
# First call - creates connection
get_tables_from_datasource(...)   # Creates new connection

# Second call - reuses connection (no overhead!)
get_columns_from_table(...)       # Reuses existing connection

# Connection stays alive for future requests
```

Connection Management:
- Connections are identified by unique hash (host, port, db, user)
- Dead connections are automatically detected and recreated
- All functions use autocommit mode for stability
- Connections persist across Streamlit reruns
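The pooling behavior described above can be sketched as a module-level cache keyed by a hash of the connection parameters. This is an illustrative simplification, not the actual `services/db_connector.py` code (which builds SQLAlchemy engines and runs health checks); the `factory` parameter is a hypothetical stand-in for engine creation:

```python
import hashlib

_pool = {}  # module-level cache shared across calls (singleton state)

def _conn_key(db_type, host, port, db_name, user):
    """Identify a connection by a unique hash of its parameters."""
    raw = f"{db_type}:{host}:{port}:{db_name}:{user}"
    return hashlib.sha256(raw.encode()).hexdigest()

def get_connection(db_type, host, port, db_name, user, factory):
    """Return the cached connection, creating it via `factory` on first use."""
    key = _conn_key(db_type, host, port, db_name, user)
    if key not in _pool:
        _pool[key] = factory()  # e.g. sqlalchemy.create_engine(...)
    return _pool[key]

def close_connection(db_type, host, port, db_name, user):
    """Drop a specific connection from the pool."""
    _pool.pop(_conn_key(db_type, host, port, db_name, user), None)
```

Because `_pool` lives at module level, every caller that asks for the same (type, host, port, db, user) tuple receives the same object, which is what makes repeated Streamlit reruns cheap.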
Manual Control:
```python
from services.db_connector import close_connection, close_all_connections

# Close specific connection
close_connection(db_type, host, port, db_name, user)

# Close all connections (useful for cleanup)
close_all_connections()
```

SQLite Storage: `migration_tool.db`

Tables:

- `datasources` - Database connection profiles

  ```sql
  CREATE TABLE datasources (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      name TEXT UNIQUE,
      db_type TEXT,
      host TEXT,
      port TEXT,
      dbname TEXT,
      username TEXT,
      password TEXT
  )
  ```

- `configs` - Schema mapping configurations

  ```sql
  CREATE TABLE configs (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      config_name TEXT UNIQUE,
      table_name TEXT,
      json_data TEXT,
      updated_at TIMESTAMP
  )
  ```
Automatic Initialization:
- Database created on first application run
- No manual SQL scripts required
- Handles migrations automatically
Mode 1: Run ID (Traditional)
- Uses CSV analysis reports
- Offline operation
- Historical data analysis
- Best for: Initial exploration, documented analysis
Mode 2: Datasource (New)
- Connects directly to live database
- Real-time schema discovery
- Auto-loads tables and columns
- Best for: Active development, latest schema
Switching between modes:
- Open Schema Mapper
- Select source mode (Run ID / Datasource)
- Choose source accordingly
- Schema Mapper adapts automatically
Enable comprehensive data profiling:
```json
{
  "sampling": {
    "deep_analysis": true
  }
}
```

Metrics Collected:
| Metric | Basic Mode | Deep Mode |
|---|---|---|
| Row Count | ✅ | ✅ |
| Null Count | ✅ | ✅ |
| Distinct Values | ✅ | ✅ |
| Min/Max Values | ❌ | ✅ |
| Top 5 Frequency | ❌ | ✅ |
| Data Composition | ❌ | ✅ (Valid/Null/Empty/Zero) |
| Sample Data | ✅ | ✅ (Smart filtered) |
Performance Considerations:
- Basic Mode: Fast, suitable for large tables (millions of rows)
- Deep Mode: Slower, recommended for detailed migration planning
Automatically filters sample data to show only meaningful values:
Filtering Rules:
- Excludes `NULL` values
- Excludes empty strings (`''`)
- Shows actual representative data

Implementation (MySQL example):

```sql
SELECT DISTINCT column_name
FROM table_name
WHERE column_name IS NOT NULL
  AND CAST(column_name AS CHAR) <> ''
LIMIT 10;
```

On macOS with Homebrew, missing database clients are automatically installed:
```
# Example: Installing MSSQL tools
Error: Command 'sqlcmd' not found
Homebrew detected...
Install 'mssql-tools18' now? (y/N): y
Installing mssql-tools18...
-> Tapping microsoft/mssql-release...
-> Installing packages...
Installation successful!
```

Generated HTML reports include:
- Overview Tab: Table-level metrics with sortable DataTable
- Column Detail Tab: Comprehensive column-level statistics
- Formulas & Docs Tab: Data quality score explanations
- Process Log Tab: Complete execution logs
Features:
- Responsive design with Bootstrap 5
- Interactive tables with search/filter/sort
- Data quality visualizations
- Exportable to Excel/CSV/PDF
Track every change to your schema mapping configurations with built-in version control.
How It Works:
Every time you save a configuration, the system automatically:
- Creates a new version entry in the `config_histories` table
- Preserves a complete JSON snapshot with timestamp
- Increments version number (v1, v2, v3...)
- Links to parent configuration via foreign key
Database Schema:
```sql
-- Main configuration table
CREATE TABLE configs (
    id TEXT PRIMARY KEY,          -- UUID for relationships
    config_name TEXT UNIQUE,      -- User-facing name
    table_name TEXT,              -- Source table
    json_data TEXT,               -- Current config JSON
    updated_at TIMESTAMP          -- Last modification
);

-- Version history table
CREATE TABLE config_histories (
    id TEXT PRIMARY KEY,          -- Unique history entry ID
    config_id TEXT,               -- Links to parent config
    version INTEGER,              -- Sequential version number
    json_data TEXT,               -- Config snapshot at this version
    created_at TIMESTAMP,         -- When this version was created
    FOREIGN KEY(config_id) REFERENCES configs(id) ON DELETE CASCADE
);
```

Key Features:
1. Automatic Versioning
   - No manual intervention needed
   - Every save creates a new version
   - Original version preserved forever

2. Version Comparison

   ```python
   # Compare two versions to see what changed
   diff = db.compare_config_versions("PatientMigration", version1=1, version2=3)
   # Returns:
   # {
   #     'mappings_added': [...],     # New column mappings
   #     'mappings_removed': [...],   # Deleted mappings
   #     'mappings_modified': [...]   # Changed transformers/targets
   # }
   ```

3. Rollback Support
   - View all historical versions in the Settings page
   - Load any previous version
   - Restore deleted configurations from history

4. Audit Trail
   - Complete history of configuration changes
   - Timestamp for every modification
   - Useful for compliance and troubleshooting
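The save-and-version flow can be sketched against the two tables above. This is a hypothetical illustration (function names like `save_config` are assumptions, not the actual `database.py` API): upsert the current config, then append an immutable snapshot with the next version number.

```python
import json
import sqlite3
import uuid

def save_config(conn, config_name, table_name, mapping):
    """Upsert a config row and append an immutable version snapshot."""
    payload = json.dumps(mapping)
    row = conn.execute(
        "SELECT id FROM configs WHERE config_name = ?", (config_name,)
    ).fetchone()
    if row:
        config_id = row[0]
        conn.execute(
            "UPDATE configs SET json_data = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?",
            (payload, config_id),
        )
    else:
        config_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO configs (id, config_name, table_name, json_data, updated_at) "
            "VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)",
            (config_id, config_name, table_name, payload),
        )
    # Next sequential version number for this config (v1, v2, v3, ...)
    (version,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM config_histories WHERE config_id = ?",
        (config_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO config_histories (id, config_id, version, json_data, created_at) "
        "VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)",
        (str(uuid.uuid4()), config_id, version, payload),
    )
    conn.commit()
    return config_id, version
```

Because the history insert happens on every save, no explicit "create version" step is needed, and old snapshots are never overwritten.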
Use Cases:
- Migration Testing: Try different mapping strategies, rollback if needed
- Team Collaboration: Track who changed what and when
- Production Safety: Restore last-known-good configuration quickly
- Documentation: Historical record of migration decisions
Production-ready ETL execution engine with enterprise features.
Architecture:
```
┌─────────────────┐       ┌──────────────────┐       ┌─────────────────┐
│    Source DB    │──────▶│    Migration     │──────▶│    Target DB    │
│  (via Profile)  │       │      Engine      │       │  (via Profile)  │
└─────────────────┘       └────────┬─────────┘       └─────────────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │   Transformers   │
                          │   - Date Conv.   │
                          │   - Trim/Clean   │
                          │   - JSON Parse   │
                          └──────────────────┘
                                   ▲
                                   │
                          ┌──────────────────┐
                          │   Config JSON    │
                          │    (Mappings)    │
                          └──────────────────┘
```
Key Features:
1. Batch Processing
- Configurable batch size (default: 1000 rows)
- Streaming execution - handles millions of rows
- Memory-efficient: processes one chunk at a time
- Progress tracking with visual progress bar
2. Smart Query Generation
```sql
-- Only selects mapped columns - reduces network overhead
SELECT "hn", "fname", "lname", "dob" FROM patients
-- Instead of SELECT * (which transfers unused data)
```

3. Data Transformation Pipeline
```python
# 1. Stream the source table in batches of 1000 rows
for df_batch in pd.read_sql(query, source_engine, chunksize=1000):
    # 2. Apply transformers (in-memory)
    df_batch = DataTransformer.apply_transformers_to_batch(df_batch, config)

    # 3. Rename columns to match target schema
    df_batch.rename(columns=rename_map, inplace=True)

    # 4. Bulk insert to target
    df_batch.to_sql(target_table, target_engine, if_exists='append', index=False)
```

4. Comprehensive Logging
- Real-time log viewer in UI
- Persistent log files: `migration_logs/migration_NAME_TIMESTAMP.log`
- Audit trail: timestamps, row counts, errors
- Downloadable after completion
5. Test Mode
- Process only 1 batch (configurable limit)
- Validate mappings without full migration
- Dry-run capability for safety
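Test mode as described above amounts to capping the batch loop. A minimal sketch (names like `run_migration` and `process_batch` are illustrative, not the engine's actual API):

```python
from itertools import islice

def run_migration(batches, process_batch, max_batches=None):
    """Process streamed batches; cap the count in test mode (e.g. max_batches=1)."""
    limited = islice(batches, max_batches) if max_batches else batches
    total_rows = 0
    for batch in limited:
        process_batch(batch)   # transform + bulk insert
        total_rows += len(batch)
    return total_rows
```

Calling `run_migration(chunks, insert_fn, max_batches=1)` validates the mapping end-to-end on a single batch before committing to a full run.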
6. Error Handling
- Transaction-safe batch commits
- Stops on first error (prevents data corruption)
- Detailed error messages with context
- Rollback support (database-dependent)
Execution Workflow:
```
Step 1: Select Configuration
├── Load from Project Database (saved configs)
└── Upload JSON File (external configs)

Step 2: Test Connections
├── Select Source Datasource
├── Select Target Datasource
├── Verify connectivity
└── Health checks

Step 3: Review & Settings
├── View configuration JSON
├── Set batch size
├── Enable/disable test mode
└── Confirm execution

Step 4: Execute Migration
├── Connect to databases (SQLAlchemy engines)
├── Generate optimized SELECT query
├── Stream data in batches
├── Apply transformations
├── Bulk insert to target
└── Log everything
```

Performance Characteristics:
| Dataset Size | Batch Size | Approx. Time | Memory Usage |
|---|---|---|---|
| 10K rows | 1000 | ~10 seconds | < 50 MB |
| 100K rows | 1000 | ~1-2 minutes | < 200 MB |
| 1M rows | 1000 | ~10-15 min | < 500 MB |
| 10M+ rows | 5000 | ~1-2 hours | < 1 GB |
Use Cases:
- One-time Migrations: Legacy system to new platform
- Continuous Sync: Nightly data transfers
- Data Warehouse ETL: OLTP → OLAP transformations
- Multi-tenant Migrations: Hospital A → Hospital B
- Testing: Validate transformations with test mode
Intelligent column matching using machine learning and healthcare domain knowledge.
Technology Stack:
- Model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- Framework: Sentence Transformers (Hugging Face)
- Similarity: Cosine similarity on semantic embeddings
- Domain: Healthcare Information Systems (HIS)
How It Works:
1. Dual-Strategy Matching
```python
# Strategy 1: Rule-Based Dictionary (Priority)
his_dictionary = {
    "hn": ["hn", "hospital_number", "mrn", "patient_code"],
    "cid": ["cid", "national_id", "card_id", "citizen_id"],
    "vn": ["vn", "visit_number", "visit_no"],
    # ... 30+ healthcare terms
}

# Strategy 2: Semantic AI Matching (Fallback)
# Encodes column names into 384-dimensional vectors
# Compares similarity using cosine distance
source_embedding = model.encode("patient_firstname")
target_embeddings = model.encode(["fname", "first_name", "given_name"])
best_match = argmax(cosine_similarity(source_embedding, target_embeddings))
```

2. Confidence Scoring
- Exact Match: 1.0 (100% confidence)
- Dictionary Match: 0.9 (90% confidence)
- Semantic Match: 0.4-0.9 (threshold-based)
- No Match: 0.0 (suggests manual review)
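The tiers above might be combined roughly as follows. This is a sketch of the scoring logic, not the actual `ml_mapper` implementation; the 0.4 threshold comes from the semantic-match range stated above:

```python
SEMANTIC_THRESHOLD = 0.4

def confidence_score(source_col, target_col, his_dictionary, semantic_similarity):
    """Rank a candidate mapping: exact > dictionary > semantic > no match."""
    if source_col == target_col:
        return 1.0                              # exact match
    if target_col in his_dictionary.get(source_col, []):
        return 0.9                              # dictionary match
    if semantic_similarity >= SEMANTIC_THRESHOLD:
        return round(semantic_similarity, 2)    # semantic match (0.4-0.9)
    return 0.0                                  # below threshold: manual review
```

Checking the cheap, high-precision strategies first means the ML model only decides when the dictionary has no opinion.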
3. Sample Data Analysis
Analyzes actual column values to suggest transformers:
```python
# Example: Detects Thai Buddhist year dates
sample_values = ["2566-01-15", "2567-03-20", "2565-12-01"]

analysis = ml_mapper.analyze_column_with_sample(
    source_col="admit_date",
    target_col="admission_date",
    sample_values=sample_values
)

# Returns:
# {
#     "confidence_score": 0.9,
#     "transformers": ["BUDDHIST_TO_ISO"],  # Auto-suggested
#     "reason": "Detected Thai Buddhist year (25xx) in 3/3 samples"
# }
```

4. Pattern Detection
| Pattern | Detection Logic | Suggested Transformer |
|---|---|---|
| Thai Buddhist Year | `25[5-9]\d` in >50% of samples | `BUDDHIST_TO_ISO` |
| Whitespace Issues | Leading/trailing spaces | `TRIM` |
| JSON Structures | `{...}` or `[...]` | `PARSE_JSON` |
| Float IDs | `123.0` pattern | `FLOAT_TO_INT` |
| Leading Zeros | ID with `0` prefix | Keep as string |
| All NULL/Empty | No valid data | Mark as `IGNORE` |
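The `BUDDHIST_TO_ISO` transformer referenced here converts Thai Buddhist-era dates to Gregorian (BE year = CE year + 543). A minimal sketch, not the actual `services/transformers.py` implementation:

```python
import re

# 'YYYY-MM-DD' where the year is in the Buddhist-era 25xx range
BUDDHIST_DATE = re.compile(r'^(25\d{2})-(\d{2})-(\d{2})$')

def buddhist_to_iso(value):
    """Convert a 'YYYY-MM-DD' Buddhist-era date string to Gregorian (BE - 543)."""
    match = BUDDHIST_DATE.match(value)
    if not match:
        return value  # pass through values that don't look Buddhist-era
    year, month, day = match.groups()
    return f"{int(year) - 543:04d}-{month}-{day}"
```

Passing non-matching values through unchanged keeps the transformer safe to apply to mixed columns.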
5. Healthcare-Specific Validation
```python
# Hospital Number (HN) validation
if "hn" in source_column:
    hn_pattern = r'^\d{6,10}$'  # 6-10 digits
    valid_count = count_matches(samples, hn_pattern)
    confidence = valid_count / total_samples

# National ID (CID) validation
if "cid" in source_column:
    cid_pattern = r'^\d{13}$'  # Exactly 13 digits
    validate_thai_national_id_checksum(samples)
```

User Interface:
In Schema Mapper page:
- Click "π€ AI Auto-Map" button
- AI analyzes source columns vs target schema
- Displays suggestions with confidence scores
- User reviews and confirms/modifies mappings
- AI also suggests transformers based on sample data
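The Thai national ID checksum used in the healthcare-specific validation above is the standard mod-11 check digit: weight the first 12 digits by 13 down to 2, then compute `(11 - sum % 11) % 10`. A sketch (assumed to match what `validate_thai_national_id_checksum` does; the function name here is illustrative):

```python
def is_valid_thai_cid(cid):
    """Check the mod-11 check digit of a 13-digit Thai national ID string."""
    if len(cid) != 13 or not cid.isdigit():
        return False
    # Weights 13, 12, ..., 2 applied to the first 12 digits
    weighted = sum(int(d) * w for d, w in zip(cid[:12], range(13, 1, -1)))
    check_digit = (11 - weighted % 11) % 10
    return check_digit == int(cid[12])
```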
Benefits:
- Time Savings: Auto-map 100 columns in seconds vs hours
- Accuracy: Semantic understanding, not just string matching
- Learning: Improves with healthcare-specific dictionary
- Transparency: Shows confidence scores for manual review
- Flexibility: Suggestions, not forced decisions
Limitations:
- Requires internet for first model download (~100 MB)
- Best for English/Thai column names (multilingual model)
- Suggestions need human validation
- Not trained on your specific schema (generic model)
Example Session:
```
Source Columns      Target Columns       AI Suggestion    Confidence
--------------      ---------------      -------------    ----------
hn             →    hospital_number      ✅ Matched        100%
patient_name   →    full_name            ⚠️ Maybe          65%
admit_dt       →    admission_date       ✅ Matched        85%
                      • Transformer: BUDDHIST_TO_ISO
blood_press    →    bp_systolic          ⚠️ Uncertain      45%
old_id         →    [No Match]           ❌ Manual         0%
```
Cause: Outdated Streamlit version (< 1.30.0)
Solution:
```bash
# Option 1: Use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install --upgrade streamlit

# Option 2: Force reinstall
pip uninstall streamlit -y
pip install --upgrade --force-reinstall streamlit
```

Cause: Self-signed or untrusted SSL certificate
Solution: The toolkit automatically adds the `-C` flag to trust server certificates:

```bash
sqlcmd -S host,port -C -U user -P password ...
```

Cause: T-SQL variable scope in dynamic SQL
Solution: Already handled in v7.1+ with proper variable injection
Cause: Incorrect schema name (e.g., using default public for MSSQL)
Solution: Specify correct schema in config.json:
```json
{
  "database": {
    "type": "mssql",
    "schema": "dbo"
  }
}
```

- Check the `process.log` in the report folder
- Review error messages in the terminal output
- Verify database connectivity with native clients:

  ```bash
  mysql -h host -u user -p
  psql -h host -U user -d database
  sqlcmd -S host,port -U user -P password
  ```
- Open an issue on GitHub with:
- Error message
- Database type and version
- Operating system
- Relevant log excerpts
```bash
cd mini_his
python gen_mini_his.py
```

```bash
cd analysis_report
# Edit config.json to point to test database
./unified_db_analyzer.sh

# Verify output
ls -lh migration_report/*/data_profile/data_profile.csv
```

Approximate analysis times (single table):
| Rows | Columns | Basic Mode | Deep Mode |
|---|---|---|---|
| 10K | 20 | ~2s | ~5s |
| 100K | 50 | ~10s | ~30s |
| 1M | 100 | ~30s | ~2min |
| 10M+ | 200+ | ~2min | ~10min+ |
Optimization Tips:
- Use the `tables` filter to analyze specific tables only
- Disable `deep_analysis` for initial exploration
- Adjust `default_limit` for faster sampling
Contributions are welcome! Please follow these guidelines:
- Use GitHub Issues
- Include error messages and logs
- Provide reproduction steps
- Specify environment details
- Open a GitHub Discussion
- Describe use case and benefits
- Provide examples if possible
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Code Standards:
- Bash scripts: Follow ShellCheck recommendations
- Python: PEP 8 style guide
- Add comments for complex logic
- Update documentation for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for healthcare professionals managing HIS migrations
- Inspired by enterprise database migration challenges
- Community feedback and contributions welcome
- Documentation: This README and inline code comments
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Datasource management with SQLite storage (v8.0)
- Connection pooling and reuse (v8.0)
- Dual-mode schema mapper (Run ID / Datasource) (v8.0)
- Live schema discovery (v8.0)
- Configuration repository (v8.0)
- Smart column suggestions (v8.0)
- Configuration history and version control (v8.0)
- Version comparison and rollback (v8.0)
- Production-ready migration engine (v8.0)
- Batch processing with streaming (v8.0)
- AI-powered column mapping (v8.0)
- Healthcare-specific ML dictionary (v8.0)
- Automatic transformer suggestions (v8.0)
- Support for Oracle Database
- REST API for programmatic access
- Docker containerization
- CI/CD pipeline integration
- Data anonymization features
- Real-time migration progress metrics
- Database-level transaction rollback
- Data validation dashboard with anomaly detection
- Scheduled migration jobs (cron-like)
- Multi-datasource data lineage tracking
- Custom AI model training for organization-specific schemas
- Incremental/delta migration support
- Data quality scoring engine
Made with ❤️ for the HIS migration community