A robust AI-powered system for handling structured and unstructured customer data from many sources (2000+). It processes diverse data types, evolves the unified schema dynamically, and builds complete customer profiles through AI-powered field extraction and profile matching.
Purpose: Intelligently process structured & unstructured data with dynamic schema evolution
Input Types:
- Structured: CSV/Excel/JSON files with column headers
- Unstructured: Free-text data (customer notes, descriptions, etc.)
Key Features:
- ✅ Intelligent Data Type Detection: Auto-detects structured vs unstructured data
- ✅ Dynamic Schema Evolution: Extends the schema when new relevant fields are discovered
- ✅ LLM-Powered Text Extraction: Extracts structured fields from free-text
- ✅ Smart Name Handling: Manages first_name, last_name, and full_name intelligently
- ✅ Comprehensive Logging: Tracks unmapped fields with confidence scores
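Two of the features above can be illustrated with a minimal sketch. It is not the project's implementation: the function names `detect_data_type` and `fill_name_fields`, the word-count threshold, and the rule that the last token of a full name is the family name are all assumptions made for illustration.

```python
import csv
import io


def detect_data_type(csv_text, min_avg_words=5):
    """Classify CSV content as 'structured' or 'unstructured'.

    Hypothetical heuristic: a file with a single column whose cells average
    at least `min_avg_words` words is treated as free text.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("empty input")
    header, body = rows[0], rows[1:]
    if len(header) > 1:
        return "structured"
    cells = [row[0] for row in body if row]  # skip blank lines
    if not cells:
        return "structured"
    avg_words = sum(len(cell.split()) for cell in cells) / len(cells)
    return "unstructured" if avg_words >= min_avg_words else "structured"


def fill_name_fields(record):
    """Return a copy of `record` with first_name, last_name, full_name filled in."""
    out = dict(record)
    first = (out.get("first_name") or "").strip()
    last = (out.get("last_name") or "").strip()
    full = (out.get("full_name") or "").strip()
    if full and not (first or last):
        # Assumption: the last whitespace-separated token is the family name.
        parts = full.split()
        first = " ".join(parts[:-1]) or parts[0]
        last = parts[-1] if len(parts) > 1 else ""
    elif not full:
        full = " ".join(part for part in (first, last) if part)
    out.update(first_name=first, last_name=last, full_name=full)
    return out
```

A real detector would likely also look at the file extension and at the column header itself. The threshold here only separates short codes from sentence-like text.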
Output Files:
- processed_data/*.csv - Unified data format
- schema_mappings/*_map.json - Field mapping details
- unified_schema.json - Dynamic schema definition
- unmapped_fields.json - Unmapped fields log
Purpose: Query and stitch related customer data using anchor attributes
Input: Unified data from Enhanced Agent 1 + customer query
Output: Complete customer profiles with all related data
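Stitching on anchor attributes could look roughly like the sketch below. It assumes that `email` and `phone` are the anchor attributes and that records are plain dictionaries. The function name `stitch_profile` and the "first non-empty value wins" merge rule are illustrative choices, not the behaviour of profile_matcher.py.

```python
ANCHOR_FIELDS = ("email", "phone")  # assumed anchor attributes


def _anchors(record):
    """Normalised (field, value) pairs for the anchor fields present in a record."""
    return {
        (field, str(record[field]).strip().lower())
        for field in ANCHOR_FIELDS
        if record.get(field)
    }


def stitch_profile(records, query):
    """Collect every record linked to `query` through shared anchor values.

    Matching is transitive: a record found via email can contribute a phone
    number that then pulls in further records.
    """
    known = _anchors(query)
    matched, remaining = [], list(records)
    changed = True
    while changed:
        changed = False
        still_remaining = []
        for record in remaining:
            anchors = _anchors(record)
            if anchors & known:
                known |= anchors
                matched.append(record)
                changed = True
            else:
                still_remaining.append(record)
        remaining = still_remaining

    # Merge matched records; the first non-empty value for a field wins.
    profile = {}
    for record in matched:
        for field, value in record.items():
            if value and field not in profile:
                profile[field] = value
    return profile, matched


records = [
    {"source": "crm_system", "email": "ada@example.com", "phone": "", "first_name": "Ada"},
    {"source": "ecommerce_platform", "email": "ADA@example.com", "phone": "555-0100", "last_name": "Lovelace"},
    {"source": "legacy_database", "email": "", "phone": "555-0100", "city": "London"},
    {"source": "crm_system", "email": "bob@example.com", "phone": "555-0199", "first_name": "Bob"},
]
profile, matched = stitch_profile(records, {"email": "ada@example.com"})
```

In this example the legacy record has no email and is reached only through the phone number contributed by the e-commerce record, which is why the loop repeats until no new records match.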
- Python 3.7+
- pip package manager
- Clone/Download the project:
    # Extract or navigate to the project directory
    cd Agent1
- Install dependencies:
    pip install -r requirements.txt
- Set up Gemini API Key:
    cd utils
    echo "GOOGLE_API_KEY=your_gemini_api_key_here" > .env
- Add your data sources to the data_sources/ directory:
    Structured Data (CSV/Excel/JSON with columns):
    data_sources/
    ├── crm_system.csv          # fname, lname, email, etc.
    ├── ecommerce_platform.csv  # given_name, family_name, contact, etc.
    └── legacy_database.csv     # first_nm, last_nm, email_addr, etc.
    Unstructured Data (Single text column):
    data_sources/
    └── customer_notes.csv      # Free-text customer information
- Run Enhanced Agent 1 (Schema Identification):
    python main.py
- Run Agent 2 (Profile Matching):
    python profile_matcher.py
Agent1/
├── agents/
│ ├── __init__.py
│ └── schema_identification_agent.py # Main Agent 1 implementation
├── utils/
│ ├── __init__.py
│ ├── config.py # Configuration & unified schema
│ ├── llm_service.py # LLM service for AI mapping
│ └── data_loader.py # Data loading utilities
├── data_sources/ # Sample data with inconsistent schemas
├── schema_mappings/ # Generated mapping files
├── output/ # Transformed data output
├── main.py # Main implementation
├── requirements.txt # Python dependencies
└── README.md # This file
Main agent class with key methods:
- discover_data_sources() - Find and catalog source files
- extract_schema_from_source(source_name) - Extract column names
- generate_schema_mapping(source_name) - Create AI-powered mappings
- save_schema_mapping(source_name) - Persist mappings to JSON
- map_to_unified_schema(df, schema_map) - Transform data
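The transformation step can be sketched as follows. It uses plain dictionaries instead of a DataFrame so that it runs without third-party packages. The function name `map_rows_to_unified_schema`, the three-field `UNIFIED_FIELDS` list, and the `{source_column: unified_field}` shape of the schema map are assumptions, not the agent's actual data structures.

```python
UNIFIED_FIELDS = ["first_name", "last_name", "email"]  # assumed subset of the unified schema


def map_rows_to_unified_schema(rows, schema_map):
    """Rename source columns to unified fields and report columns left unmapped.

    `schema_map` is assumed to look like {"fname": "first_name", ...}.
    """
    unified_rows = []
    unmapped = set()
    for row in rows:
        out = {field: None for field in UNIFIED_FIELDS}
        for column, value in row.items():
            target = schema_map.get(column)
            if target in out:
                out[target] = value
            else:
                # No mapping, or a mapping to a field outside the unified schema.
                unmapped.add(column)
        unified_rows.append(out)
    return unified_rows, sorted(unmapped)


rows = [
    {"fname": "Ada", "lname": "Lovelace", "email_addr": "ada@example.com", "loyalty_tier": "gold"},
]
schema_map = {"fname": "first_name", "lname": "last_name", "email_addr": "email"}
unified_rows, unmapped = map_rows_to_unified_schema(rows, schema_map)
```

Returning the unmapped columns alongside the data gives the caller something to log or to feed into a schema-extension step.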
AI service for intelligent field mapping using Google Gemini API.
Utility for loading and working with data:
- Supports CSV, Excel, and JSON formats
- Schema comparison between sources
- Data export in multiple formats
- Source metadata extraction
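Schema comparison between two sources can be as simple as set arithmetic on their headers. The sketch below assumes CSV input. The helper names `read_columns` and `compare_schemas` are hypothetical and are not taken from data_loader.py.

```python
import csv
import io


def read_columns(csv_text):
    """Return the header row of CSV content, or an empty list if there is none."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def compare_schemas(columns_a, columns_b):
    """Report which column names two sources share and which are unique to each."""
    a, b = set(columns_a), set(columns_b)
    return {
        "shared": sorted(a & b),
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
    }


crm_columns = read_columns("fname,lname,email\nAda,Lovelace,ada@example.com\n")
legacy_columns = read_columns("first_nm,last_nm,email_addr\nAda,Lovelace,ada@example.com\n")
diff = compare_schemas(crm_columns, legacy_columns)
```

For these two sample headers the comparison finds no shared names even though the columns hold the same information, which is the gap the LLM-based mapping is meant to close.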
The system supports AI-powered schema mapping using Google Gemini API only.
- Get API Key: Visit https://aistudio.google.com/app/apikey
- Set Environment Variable:
    # Windows
    set GOOGLE_API_KEY=your_api_key_here
    # Linux/Mac
    export GOOGLE_API_KEY=your_api_key_here
- Install Gemini Library:
    pip install google-generativeai
- The system uses the Gemini LLM to map source fields to the unified schema. All mapping is performed by the LLM, leveraging context and schema definitions.
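One way such an LLM mapping call might be wired up is sketched below. The prompt wording, the model name `gemini-1.5-flash`, the `UNIFIED_FIELDS` list, and all three function names are assumptions. The actual logic in llm_service.py may differ. The prompt builder and the response parser are kept separate from the network call so they can be exercised without an API key.

```python
import json
import os

UNIFIED_FIELDS = ["first_name", "last_name", "email"]  # assumed subset of the unified schema


def build_mapping_prompt(source_columns, unified_fields=UNIFIED_FIELDS):
    """Compose a prompt asking the model for a JSON column mapping."""
    return (
        "Map each source column to one unified schema field, or to null if none fits.\n"
        f"Unified fields: {json.dumps(list(unified_fields))}\n"
        f"Source columns: {json.dumps(list(source_columns))}\n"
        "Reply with a single JSON object of the form "
        "{\"source_column\": \"unified_field\"}."
    )


def parse_mapping_response(text, source_columns, unified_fields=UNIFIED_FIELDS):
    """Parse the model reply and keep only mappings that reference known names."""
    fence = "`" * 3  # models often wrap JSON in a Markdown code fence
    cleaned = text.strip()
    if cleaned.startswith(fence):
        # Drop the opening fence line (with any language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(fence, 1)[0]
    raw = json.loads(cleaned)
    return {
        column: target
        for column, target in raw.items()
        if column in source_columns and target in unified_fields
    }


def generate_mapping(source_columns, model_name="gemini-1.5-flash"):
    """Ask Gemini for a mapping. Requires GOOGLE_API_KEY and google-generativeai."""
    import google.generativeai as genai  # imported lazily so the helpers work offline

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(build_mapping_prompt(source_columns))
    return parse_mapping_response(response.text, source_columns)
```

Filtering the parsed reply against the known column and field names guards against the model inventing targets. Anything dropped here is a candidate for the unmapped-fields log.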
This project is for demonstration purposes. Extend and modify as needed for your use case.
This is Agent 1 of a larger system. Future agents will handle:
- Agent 2: Customer profile integration and deduplication
- Agent 3: Data quality assessment and cleansing
- Agent 4: Real-time data synchronization