██████╗ █████╗ ██████╗ ████████╗██╗ ██╗ ████████╗██╗██╗ ██╗ █████╗ ██████╗ ██╗
██╔══██╗██╔══██╗██╔══██╗╚══██╔══╝██║ ██║ ╚══██╔══╝██║██║ ██║██╔══██╗██╔══██╗██║
██████╔╝███████║██████╔╝ ██║ ███████║ ██║ ██║██║ █╗ ██║███████║██████╔╝██║
██╔═══╝ ██╔══██║██╔══██╗ ██║ ██╔══██║ ██║ ██║██║███╗██║██╔══██║██╔══██╗██║
██║ ██║ ██║██║ ██║ ██║ ██║ ██║ ██║ ██║╚███╔███╔╝██║ ██║██║ ██║██║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝
$ initializing parth_tiwari.profile ...
[✓] identity → AI Systems Engineer
[✓] location → Bengaluru, India
[✓] status → open to the right problem
[✓] philosophy → first principles, not tutorials
[✓] vibe-coding → NOT DETECTED
[✓] evaluation → ACTIVE
[✓] systems deployed → 3 (running right now, not on my laptop)
[✓] slides shipped → 0
[READY] parth_tiwari.profile loaded successfully.

Most profiles show you the wins. Here's what actually happened.
Building a fraud engine. Backtesting revealed this:
train ROC-AUC → 0.895 ← model looked great
production ROC-AUC → 0.60 ← system was lying to itself the whole time
cause: temporal features bled future signal into past training windows
fix: 20+ leakage validation tests, point-in-time enforcement, rebuilt from scratch
result: 0.895 ROC-AUC that's actually trustworthy
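The core of a point-in-time check is simple to state: no feature row may be built from data newer than its own event. A minimal sketch of one such leakage test, with illustrative names (the real suite's row format and 20+ tests are not shown here):

```python
from datetime import datetime

def assert_point_in_time(rows):
    """Fail if any feature row was computed from data after its own event time.

    Each row is (event_ts, feature_max_source_ts): the transaction time and the
    newest source timestamp that fed the row's features. Names are illustrative.
    """
    leaked = [r for r in rows if r[1] > r[0]]
    if leaked:
        raise AssertionError(f"{len(leaked)} rows use future data")

rows = [
    (datetime(2024, 1, 2), datetime(2024, 1, 1)),  # clean: features predate the event
    (datetime(2024, 1, 2), datetime(2024, 1, 3)),  # leak: a feature saw the future
]
try:
    assert_point_in_time(rows)
except AssertionError as e:
    print(e)  # → 1 rows use future data
```

Run against every training window, a check like this turns "the backtest looks fine" into a hard invariant.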
Shipped a Text-to-SQL agent. Hallucination detector reported 100% hallucination:
hallucination_rate → 100% ← every query hallucinating?
actual rate → 0% ← the metric was wrong, not the system
cause: schema_tables_used returned ["schema_dict", "tables"] — dict keys, not table names
fix: one-line patch
lesson: I found this because I wrote a hallucination detector in the first place
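The bug class is easy to reconstruct in miniature (hypothetical names, not the actual QueryPilot code): the detector compared SQL tables against the *wrapper dict's* keys instead of the table names nested inside it, so every real table looked invented.

```python
def known_tables_buggy(schema):
    # BUG: returns the wrapper's keys ("schema_dict", "tables"), not table names
    return list(schema.keys())

def known_tables_fixed(schema):
    # FIX: one line — read the actual table list out of the wrapper
    return list(schema["tables"].keys())

def hallucinated(sql_tables, known):
    """Tables referenced in the SQL that don't exist in the schema."""
    return [t for t in sql_tables if t not in known]

schema = {"schema_dict": "v1", "tables": {"orders": ["id", "amount"], "customers": ["id", "name"]}}
sql_tables = ["orders", "customers"]

print(hallucinated(sql_tables, known_tables_buggy(schema)))  # → ['orders', 'customers'] — "100% hallucination"
print(hallucinated(sql_tables, known_tables_fixed(schema)))  # → [] — actual rate 0%
```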
Deployed to Render. LLM mixed up two different databases:
question → "what is the total revenue?" (ecommerce schema)
sql → SELECT SUM(amount) FROM fines (library schema — wrong database entirely)
cause: both schemas lived in the same Chroma collection, embeddings leaked cross-schema
fix: prompt isolation + schema-scoped retrieval + re-evaluated full 82-query benchmark
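Schema-scoped retrieval reduces to one rule: only chunks tagged with the active schema are candidates. A toy sketch of that scoping (a term-overlap scorer stands in for the real embedding search, and the store layout is illustrative):

```python
def retrieve(store, schema_id, query_terms, k=3):
    """Schema-scoped retrieval: filter to the active schema *before* scoring,
    so an ecommerce question can never surface library-schema chunks."""
    candidates = [c for c in store if c["schema"] == schema_id]
    scored = sorted(
        candidates,
        key=lambda c: len(set(query_terms) & set(c["text"].split())),
        reverse=True,
    )
    return scored[:k]

store = [
    {"schema": "ecommerce", "text": "orders amount total revenue"},
    {"schema": "library",   "text": "fines amount overdue books"},
]
hits = retrieve(store, "ecommerce", ["total", "revenue"])
print([h["schema"] for h in hits])  # → ['ecommerce']
```

The same effect can be had in Chroma by keeping one collection per schema, or by filtering on schema metadata at query time.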
The pattern: I find these things because I build evaluation harnesses before I trust results.
- "it works on my machine" → ship it
+ measure → break it intentionally → fix it → measure again → then ship it

model_id : parth-tiwari-v1
type : AI Systems Engineer (fresher)
architecture : first-principles → build → evaluate → break → fix → deploy
training_data : production constraints, real failure modes, measurable outcomes
benchmarks:
text_to_sql_execution_success : 95.7% # 82-query ecommerce benchmark
cross_schema_generalization : 100% # zero-shot on unseen library schema
syntactic_hallucination_rate : 0.0% # schema-grounded generation
fraud_roc_auc : 0.895 # 590K transactions, temporal integrity enforced
fraud_precision_in_budget : 92% # ≤0.5% daily alert constraint
rag_faithfulness_score : 0.80 # RAGAS evaluated, 20-query medical benchmark
rag_overall_score : 0.71 # holistic, not cherry-picked
serving:
fraud_p95_latency : < 500ms
sql_agent_p50 : ~2.3s
deployment : Docker · Render · Neon PostgreSQL · Streamlit · HuggingFace
known_limitations : still a fresher · experience will come with time · until then: fresh perspective, rapid iteration, high ownership

Every link below is live. Not a demo. Not a notebook. A running service.
⚡ QUERYPILOT · Self-Correcting Text-to-SQL Agent
Natural Language
│
▼
Schema-Aware RAG ──► SQL Generator
│
Static Validator
│
┌───────────────┼───────────────┐
Regex Repair LLM Fix Executor
└───────────────┴───────────────┘
Self-Correction Loop
(max 3 attempts)
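The loop in the diagram above can be sketched as follows. The stages are injected as plain callables so the sketch stays model-agnostic; all names here are illustrative, not QueryPilot's real API:

```python
def run_with_self_correction(question, generate, validate, repairs, execute, max_attempts=3):
    """Generate SQL, then loop: validate, try each repair stage in order
    (e.g. regex repair, then LLM fix), and execute once validation passes."""
    sql = generate(question)
    for _ in range(max_attempts):
        ok, error = validate(sql)
        if ok:
            return execute(sql)
        for repair in repairs:
            fixed = repair(sql, error)
            if fixed != sql:      # a stage made progress; re-validate
                sql = fixed
                break
    raise RuntimeError("no valid SQL within attempt budget")

# toy stages: the generator emits a typo, the validator catches it, regex repair fixes it
generate = lambda q: "SELEC * FROM orders"
validate = lambda s: (s.startswith("SELECT"), None if s.startswith("SELECT") else "syntax error")
regex_repair = lambda s, e: s.replace("SELEC ", "SELECT ")

result = run_with_self_correction("total revenue?", generate, validate, [regex_repair], lambda s: s)
print(result)  # → SELECT * FROM orders
```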
| Metric | Result | Context |
|---|---|---|
| First-attempt success | 90.0% | No correction, cold generation |
| After self-correction | 95.7% | 3-stage loop on 82-query benchmark |
| Hallucination rate | 0.0% | Zero invented tables or columns |
| Cross-schema generalization | 100% | Library schema, zero domain tuning |
| Cold-start reduction | ~400ms | Per-schema agent caching |
Python LangGraph FastAPI ChromaDB PostgreSQL Docker GitHub Actions
🛡 UPI FRAUD ENGINE · Real-Time Fraud Decision System
HARD CONSTRAINTS (non-negotiable):
├── score transaction at T using only pre-T features (no future leakage)
├── ≤ 0.5% daily alert budget (precision is everything)
└── simulate delayed fraud labels (real-world label lag)
590K transactions → 480+ point-in-time features → DuckDB feature store
day-by-day backtest surfaced 0.895→0.60 train/serve drift → rebuilt
A/B: XGBoost vs two-stage ensemble → winner selected under real budget
| Metric | Result | Context |
|---|---|---|
| ROC-AUC | 0.895 | Offline, leakage-validated |
| Precision in alert budget | 92% | Only flags what matters |
| P95 latency | < 500ms | Production SLA |
| Leakage tests | 20+ | Temporal integrity proven |
| Modeled fraud savings | ₹21.6 Cr/yr | Stakes were real |
Python XGBoost FastAPI DuckDB Great Expectations Docker
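"Precision in alert budget" implies a thresholding step: pick the score cutoff so at most 0.5% of the day's transactions get alerted, then measure precision only inside that budget. A minimal sketch of that cutoff selection (the real system's scoring and volume handling are not shown):

```python
def budget_threshold(scores, budget=0.005):
    """Score cutoff such that at most `budget` (fraction) of transactions alert."""
    ranked = sorted(scores, reverse=True)
    n_alerts = max(1, int(len(ranked) * budget))
    return ranked[n_alerts - 1]        # lowest score that still fits in the budget

scores = [i / 1000 for i in range(1000)]   # 1,000 toy scores in [0, 0.999]
t = budget_threshold(scores)               # top 0.5% → 5 alerts
print(t, sum(s >= t for s in scores))      # → 0.995 5
```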
🧬 EVIDENCE-BOUND DRUG RAG · Medical Knowledge Retrieval
HARD CONSTRAINT: medical domain — hallucination is patient harm
├── every claim needs a citation source
├── adversarial queries must trigger refusal, not a guess
└── faithfulness is measured, not assumed
FDA + NICE PDFs → 853 semantic chunks → hybrid retrieval (vector + BM25)
RAGAS benchmark: 20 structured queries → faithfulness 0.80, overall 0.71
zero-score failure cases logged → retrieval + refusal logic refined
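One common way to merge vector and BM25 result lists is reciprocal-rank fusion; a sketch of that merge, assuming the project combines the two rankings in roughly this fashion (its exact weighting is not shown here):

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: each ranked list contributes 1/(k + rank + 1)
    per document; k=60 is the conventional RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_12", "chunk_7", "chunk_3"]   # from the embedding index
bm25_hits   = ["chunk_12", "chunk_9", "chunk_7"]   # from keyword search
print(rrf([vector_hits, bm25_hits])[:2])  # → ['chunk_12', 'chunk_7']
```

Fusing ranks rather than raw scores sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.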
| Metric | Result | Context |
|---|---|---|
| RAGAS Faithfulness | 0.80 | Claims grounded in source |
| Overall RAGAS Score | 0.71 | Holistic, 20-query eval |
| Eval cost | $0.168 | Cost-aware, not burning tokens |
| Refusal behavior | controlled | Hallucination suppressed by design |
Python FastAPI ChromaDB SentenceTransformers LangChain RAGAS Streamlit
step 1 → define what "working" means before writing a single line
step 2 → build the evaluation harness
step 3 → write the system
step 4 → break it intentionally (adversarial inputs, edge cases, drift simulation)
step 5 → fix what breaks
step 6 → measure again
step 7 → deploy with monitoring hooks
step 8 → repeat when production proves you wrong
This is how 0.895 ROC-AUC became trustworthy instead of suspicious. This is how "100% hallucination" turned out to be a metric bug, not a model bug. This is how a 3-stage correction loop beat a bigger model with a better prompt.
$ ./parth --shutdown
[saving state] ✓ 3 systems deployed
[saving state] ✓ all evaluation harnesses active
[saving state] ✓ no hallucinations in production
[saving state] ✓ open to the right problem
[goodbye] see you on the other side of the next PR.