██████╗ █████╗ ██████╗ ████████╗██╗ ██╗ ████████╗██╗██╗ ██╗ █████╗ ██████╗ ██╗
██╔══██╗██╔══██╗██╔══██╗╚══██╔══╝██║ ██║ ╚══██╔══╝██║██║ ██║██╔══██╗██╔══██╗██║
██████╔╝███████║██████╔╝ ██║ ███████║ ██║ ██║██║ █╗ ██║███████║██████╔╝██║
██╔═══╝ ██╔══██║██╔══██╗ ██║ ██╔══██║ ██║ ██║██║███╗██║██╔══██║██╔══██╗██║
██║ ██║ ██║██║ ██║ ██║ ██║ ██║ ██║ ██║╚███╔███╔╝██║ ██║██║ ██║██║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝
$ initializing parth_tiwari.profile ...
[✓] identity → AI Systems Engineer
[✓] location → Bengaluru, India
[✓] status → open to the right problem
[✓] philosophy → first principles, not tutorials
[✓] vibe-coding → NOT DETECTED
[✓] evaluation → ACTIVE
[✓] systems deployed → 3 (running right now, not on my laptop)
[✓] slides shipped → 0
[READY] parth_tiwari.profile loaded successfully.

Most profiles show you the wins. Here's what actually happened.
Building a fraud engine. Backtesting revealed this:
train ROC-AUC → 0.895 ← model looked great
production ROC-AUC → 0.60 ← system was lying to itself the whole time
cause: temporal features bled future signal into past training windows
fix: 20+ leakage validation tests, point-in-time enforcement, rebuilt from scratch
result: 0.895 ROC-AUC that's actually trustworthy
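The core of a point-in-time check is simple to state: no feature row may be built from data newer than its own event. A minimal sketch of one such leakage test, with illustrative names (the real suite's row format and 20+ tests are not shown here):

```python
from datetime import datetime

def assert_point_in_time(rows):
    """Fail if any feature row was computed from data after its own event time.

    Each row is (event_ts, feature_max_source_ts): the transaction time and the
    newest source timestamp that fed the row's features. Names are illustrative.
    """
    leaked = [r for r in rows if r[1] > r[0]]
    if leaked:
        raise AssertionError(f"{len(leaked)} rows use future data")

rows = [
    (datetime(2024, 1, 2), datetime(2024, 1, 1)),  # clean: features predate the event
    (datetime(2024, 1, 2), datetime(2024, 1, 3)),  # leak: a feature saw the future
]
try:
    assert_point_in_time(rows)
except AssertionError as e:
    print(e)  # → 1 rows use future data
```

Run against every training window, a check like this turns "the backtest looks fine" into a hard invariant.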
Shipped a Text-to-SQL agent. Hallucination detector reported 100% hallucination:
hallucination_rate → 100% ← every query hallucinating?
actual rate → 0% ← the metric was wrong, not the system
cause: schema_tables_used returned ["schema_dict", "tables"] — dict keys, not table names
fix: one-line patch
lesson: I found this because I wrote a hallucination detector in the first place
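The bug class is easy to reconstruct in miniature (hypothetical names, not the actual QueryPilot code): the detector compared SQL tables against the *wrapper dict's* keys instead of the table names nested inside it, so every real table looked invented.

```python
def known_tables_buggy(schema):
    # BUG: returns the wrapper's keys ("schema_dict", "tables"), not table names
    return list(schema.keys())

def known_tables_fixed(schema):
    # FIX: one line — read the actual table list out of the wrapper
    return list(schema["tables"].keys())

def hallucinated(sql_tables, known):
    """Tables referenced in the SQL that don't exist in the schema."""
    return [t for t in sql_tables if t not in known]

schema = {"schema_dict": "v1", "tables": {"orders": ["id", "amount"], "customers": ["id", "name"]}}
sql_tables = ["orders", "customers"]

print(hallucinated(sql_tables, known_tables_buggy(schema)))  # → ['orders', 'customers'] — "100% hallucination"
print(hallucinated(sql_tables, known_tables_fixed(schema)))  # → [] — actual rate 0%
```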
Deployed to Render. LLM mixed up two different databases:
question → "what is the total revenue?" (ecommerce schema)
sql → SELECT SUM(amount) FROM fines (library schema — wrong database entirely)
cause: both schemas lived in the same Chroma collection, embeddings leaked cross-schema
fix: prompt isolation + schema-scoped retrieval + re-evaluated full 82-query benchmark
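Schema-scoped retrieval reduces to one rule: only chunks tagged with the active schema are candidates. A toy sketch of that scoping (a term-overlap scorer stands in for the real embedding search, and the store layout is illustrative):

```python
def retrieve(store, schema_id, query_terms, k=3):
    """Schema-scoped retrieval: filter to the active schema *before* scoring,
    so an ecommerce question can never surface library-schema chunks."""
    candidates = [c for c in store if c["schema"] == schema_id]
    scored = sorted(
        candidates,
        key=lambda c: len(set(query_terms) & set(c["text"].split())),
        reverse=True,
    )
    return scored[:k]

store = [
    {"schema": "ecommerce", "text": "orders amount total revenue"},
    {"schema": "library",   "text": "fines amount overdue books"},
]
hits = retrieve(store, "ecommerce", ["total", "revenue"])
print([h["schema"] for h in hits])  # → ['ecommerce']
```

The same effect can be had in Chroma by keeping one collection per schema, or by filtering on schema metadata at query time.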
The pattern: I find these things because I build evaluation harnesses before I trust results.
- "it works on my machine" → ship it
+ measure → break it intentionally → fix it → measure again → then ship it

model_id : parth-tiwari-v1
type : AI Systems Engineer (fresher)
architecture : first-principles → build → evaluate → break → fix → deploy
training_data : production constraints, real failure modes, measurable outcomes
benchmarks:
text_to_sql_execution_success : 95.7% # 82-query ecommerce benchmark
cross_schema_generalization : 100% # zero-shot on unseen library schema
syntactic_hallucination_rate : 0.0% # schema-grounded generation
fraud_roc_auc : 0.895 # 590K transactions, temporal integrity enforced
fraud_precision_in_budget : 92% # ≤0.5% daily alert constraint
rag_faithfulness_score : 0.80 # RAGAS evaluated, 20-query medical benchmark
rag_overall_score : 0.71 # holistic, not cherry-picked
serving:
fraud_p95_latency : < 500ms
sql_agent_p50 : ~2.3s
deployment : Docker · Render · Neon PostgreSQL · Streamlit · HuggingFace
known_limitations : still a fresher · experience will come with time · until then: fresh perspective, rapid iteration, high ownership

Every link below is live. Not a demo. Not a notebook. A running service.
⚡ QUERYPILOT · Self-Correcting Text-to-SQL Agent
Natural Language
│
▼
Schema-Aware RAG ──► SQL Generator
│
Static Validator
│
┌───────────────┼───────────────┐
Regex Repair LLM Fix Executor
└───────────────┴───────────────┘
Self-Correction Loop
(max 3 attempts)
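The loop in the diagram above can be sketched as follows. The stages are injected as plain callables so the sketch stays model-agnostic; all names here are illustrative, not QueryPilot's real API:

```python
def run_with_self_correction(question, generate, validate, repairs, execute, max_attempts=3):
    """Generate SQL, then loop: validate, try each repair stage in order
    (e.g. regex repair, then LLM fix), and execute once validation passes."""
    sql = generate(question)
    for _ in range(max_attempts):
        ok, error = validate(sql)
        if ok:
            return execute(sql)
        for repair in repairs:
            fixed = repair(sql, error)
            if fixed != sql:      # a stage made progress; re-validate
                sql = fixed
                break
    raise RuntimeError("no valid SQL within attempt budget")

# toy stages: the generator emits a typo, the validator catches it, regex repair fixes it
generate = lambda q: "SELEC * FROM orders"
validate = lambda s: (s.startswith("SELECT"), None if s.startswith("SELECT") else "syntax error")
regex_repair = lambda s, e: s.replace("SELEC ", "SELECT ")

result = run_with_self_correction("total revenue?", generate, validate, [regex_repair], lambda s: s)
print(result)  # → SELECT * FROM orders
```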
| Metric | Result | Context |
|---|---|---|
| First-attempt success | 90.0% | No correction, cold generation |
| After self-correction | 95.7% | 3-stage loop on 82-query benchmark |
| Hallucination rate | 0.0% | Zero invented tables or columns |
| Cross-schema generalization | 100% | Library schema, zero domain tuning |
| Cold-start reduction | ~400ms | Per-schema agent caching |
Python LangGraph FastAPI ChromaDB PostgreSQL Docker GitHub Actions
🛡 UPI FRAUD ENGINE · Real-Time Fraud Decision System
HARD CONSTRAINTS (non-negotiable):
├── score transaction at T using only pre-T features (no future leakage)
├── ≤ 0.5% daily alert budget (precision is everything)
└── simulate delayed fraud labels (real-world label lag)
590K transactions → 480+ point-in-time features → DuckDB feature store
day-by-day backtest surfaced 0.895→0.60 train/serve drift → rebuilt
A/B: XGBoost vs two-stage ensemble → winner selected under real budget
| Metric | Result | Context |
|---|---|---|
| ROC-AUC | 0.895 | Offline, leakage-validated |
| Precision in alert budget | 92% | Only flags what matters |
| P95 latency | < 500ms | Production SLA |
| Leakage tests | 20+ | Temporal integrity proven |
| Modeled fraud savings | ₹21.6 Cr/yr | Stakes were real |
Python XGBoost FastAPI DuckDB Great Expectations Docker
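"Precision in alert budget" implies a thresholding step: pick the score cutoff so at most 0.5% of the day's transactions get alerted, then measure precision only inside that budget. A minimal sketch of that cutoff selection (the real system's scoring and volume handling are not shown):

```python
def budget_threshold(scores, budget=0.005):
    """Score cutoff such that at most `budget` (fraction) of transactions alert."""
    ranked = sorted(scores, reverse=True)
    n_alerts = max(1, int(len(ranked) * budget))
    return ranked[n_alerts - 1]        # lowest score that still fits in the budget

scores = [i / 1000 for i in range(1000)]   # 1,000 toy scores in [0, 0.999]
t = budget_threshold(scores)               # top 0.5% → 5 alerts
print(t, sum(s >= t for s in scores))      # → 0.995 5
```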
🧬 EVIDENCE-BOUND DRUG RAG · Medical Knowledge Retrieval
HARD CONSTRAINT: medical domain — hallucination is patient harm
├── every claim needs a citation source
├── adversarial queries must trigger refusal, not a guess
└── faithfulness is measured, not assumed
FDA + NICE PDFs → 853 semantic chunks → hybrid retrieval (vector + BM25)
RAGAS benchmark: 20 structured queries → faithfulness 0.80, overall 0.71
zero-score failure cases logged → retrieval + refusal logic refined
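One common way to merge vector and BM25 result lists is reciprocal-rank fusion; a sketch of that merge, assuming the project combines the two rankings in roughly this fashion (its exact weighting is not shown here):

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: each ranked list contributes 1/(k + rank + 1)
    per document; k=60 is the conventional RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_12", "chunk_7", "chunk_3"]   # from the embedding index
bm25_hits   = ["chunk_12", "chunk_9", "chunk_7"]   # from keyword search
print(rrf([vector_hits, bm25_hits])[:2])  # → ['chunk_12', 'chunk_7']
```

Fusing ranks rather than raw scores sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.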
| Metric | Result | Context |
|---|---|---|
| RAGAS Faithfulness | 0.80 | Claims grounded in source |
| Overall RAGAS Score | 0.71 | Holistic, 20-query eval |
| Eval cost | $0.168 | Cost-aware, not burning tokens |
| Refusal behavior | controlled | Hallucination suppressed by design |
Python FastAPI ChromaDB SentenceTransformers LangChain RAGAS Streamlit
step 1 → define what "working" means before writing a single line
step 2 → build the evaluation harness
step 3 → write the system
step 4 → break it intentionally (adversarial inputs, edge cases, drift simulation)
step 5 → fix what breaks
step 6 → measure again
step 7 → deploy with monitoring hooks
step 8 → repeat when production proves you wrong
This is how 0.895 ROC-AUC became trustworthy instead of suspicious. This is how "100% hallucination" turned out to be a metric bug, not a model bug. This is how a 3-stage correction loop beat a bigger model with a better prompt.
$ ./parth --shutdown
[saving state] ✓ 3 systems deployed
[saving state] ✓ all evaluation harnesses active
[saving state] ✓ no hallucinations in production
[saving state] ✓ open to the right problem
[goodbye] see you on the other side of the next PR.