Evaluation infrastructure testing LLMs on financial reasoning, portfolio construction, and risk decomposition.

| Repo | Description |
| --- | --- |
| fin-reasoning-eval | 306 finance problems: valuation, accounting, credit, portfolio math. Difficulty grading, multi-model leaderboard, AI vendor assessment framework. |
| investment-workflow-evals | Scoring rubrics for the full investment workflow: thesis → catalysts → sizing → risk → monitoring → post-mortem. RLHF Studio, IC memo templates, institutional-to-retail research translator. |
| excel-model-eval | Both structural auditing and construction of financial models: DCF builder, comps table, and operating model, plus dependency graphs, circular-reference detection, and balance sheet consistency checks. |
| institutional-investor-casebook | PM-level case studies for scoring quantized local models, with Likert ratings and a CLI pipeline. |
| judgment-under-uncertainty-eval | Healthcare investing judgment quality — adversarial testing and calibration. |
| redflag-ex1-analyst | Scans analyst research for MNPI, tipping, regulatory arbitrage, and construction traps. Returns a PASS / PM_REVIEW / AUTO_REJECT verdict in under 60 seconds. |
| multi-agent-investment-committee | Four-agent IC with adversarial debate and RL-ready T-signal. 200+ tests, 6 LLM providers, Bloomberg Terminal/IBKR adapters. |
| investment-research-rag | Full RAG pipeline for investment research — SEC filings, earnings transcripts, equity research, Excel models. 4 doc-type-specific chunkers, hybrid retrieval, reranking, 3 LLM/embedding providers. 255 tests. |
| ls-portfolio-lab | L/S risk workbench — 40+ metrics, trade simulator, PM scorecard. Streamlit + Polars + Plotly. |
| backtest-lab | Event-driven backtesting with execution realism and bias prevention. 322 tests. |
| fund-tracker-13f | 13F filing analyzer — 52 hedge funds, position changes, consensus trades, crowding. |
- 20+ years as a buy-side equity PM (SAC Capital/Point72, BAM, WRC), covering global healthcare across all six GICS health care industries. More generalist in recent years. CFA, MBA.
- Built and systematized investment processes. Hired, trained, and developed analyst teams. Taught valuation and idea pitching internally across firms.
- Building LLM evaluation frameworks and agentic tools for investment research and probability assignment — scoring rubrics, difficulty grading, multi-model leaderboards, adversarial red teams.
- The failure modes in investment decisions — anchoring, false precision, narrative over data, footnote blindness — appear in LLM outputs too. These repos measure that.
- Eval-first, adversarial by default, open source.
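
False precision, one of the failure modes listed above, can be made measurable by scoring stated probabilities against realized outcomes. The following is a minimal sketch, assuming the Brier score as the calibration metric; nothing here says which metric these repos actually use, and `Forecast`, `brier_score`, and the sample data are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    """One probability assigned to a binary event, plus the realized outcome."""
    probability: float  # stated probability that the event occurs, in [0, 1]
    occurred: bool      # whether the event actually happened


def brier_score(forecasts: list[Forecast]) -> float:
    """Mean squared gap between stated probabilities and outcomes (lower is better)."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    for f in forecasts:
        if not 0.0 <= f.probability <= 1.0:
            raise ValueError(f"probability out of range: {f.probability}")
    return sum((f.probability - float(f.occurred)) ** 2 for f in forecasts) / len(forecasts)


# Hypothetical data: an overconfident model vs. a hedged one on the same four events.
outcomes = [True, False, False, True]
overconfident = [Forecast(p, o) for p, o in zip([0.99, 0.95, 0.90, 0.99], outcomes)]
hedged = [Forecast(p, o) for p, o in zip([0.70, 0.40, 0.30, 0.60], outcomes)]

print(f"overconfident: {brier_score(overconfident):.4f}")
print(f"hedged:        {brier_score(hedged):.4f}")
```

In this made-up sample the overconfident model scores worse because its two confident misses are penalized quadratically, which is the behavior a false-precision check is meant to surface.
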
Curiosity compounds. Rigor endures.
