pm-to-ai bdschi1

Evaluation infrastructure testing LLMs on financial reasoning, portfolio construction, and risk decomposition.


fin-reasoning-eval	306 finance problems — valuation, accounting, credit, portfolio math. Difficulty grading, multi-model leaderboard, AI vendor assessment framework.
investment-workflow-evals	Scoring rubrics for the full investment workflow: thesis → catalysts → sizing → risk → monitoring → post-mortem. RLHF Studio, IC memo templates, institutional-to-retail research translator.
excel-model-eval	Structural auditing and construction of financial models: DCF builder, comps table, and operating model, plus dependency graphs, circular-reference detection, and balance sheet consistency checks.
institutional-investor-casebook	PM-level case studies scored against quantized local models. Likert ratings, CLI pipeline.
judgment-under-uncertainty-eval	Healthcare investing judgment quality — adversarial testing and calibration.
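
The circular-reference detection listed for excel-model-eval can be pictured as cycle detection on a cell dependency graph. The sketch below is illustrative only and is not taken from the repo: the function name `find_circular_reference`, the dict-of-lists input format, and the sample cells are all assumptions.

```python
from typing import Dict, List, Optional


def find_circular_reference(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of cell references, or None if the graph is acyclic.

    `deps` maps each cell to the cells its formula reads (hypothetical format).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(cell: str) -> Optional[List[str]]:
        state[cell] = GREY
        path.append(cell)
        for ref in deps.get(cell, []):
            seen = state.get(ref, WHITE)
            if seen == GREY:
                # ref is already on the current path: slice out the loop
                return path[path.index(ref):] + [ref]
            if seen == WHITE:
                found = visit(ref)
                if found:
                    return found
        path.pop()
        state[cell] = BLACK
        return None

    for cell in deps:
        if state.get(cell, WHITE) == WHITE:
            found = visit(cell)
            if found:
                return found
    return None


# Hypothetical model: net income -> interest -> debt -> net income
model = {
    "B5": ["B3", "B4"],  # net income reads EBIT and interest
    "B4": ["B6"],        # interest reads debt
    "B6": ["B5"],        # debt reads net income (the circularity)
    "B3": [],            # EBIT is a hard-coded input
}
cycle = find_circular_reference(model)
print(cycle)
```

A depth-first search with three node states reports the actual loop rather than a yes/no answer, which is usually what a model auditor wants to see.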


redflag-ex1-analyst	Scans analyst research for MNPI, tipping, regulatory arbitrage, and construction traps. Returns a PASS / PM_REVIEW / AUTO_REJECT verdict in under 60 seconds.


multi-agent-investment-committee	Four-agent IC with adversarial debate and RL-ready T-signal. 200+ tests, 6 LLM providers, Bloomberg Terminal/IBKR adapters.


investment-research-rag	Full RAG pipeline for investment research — SEC filings, earnings transcripts, equity research, Excel models. 4 doc-type-specific chunkers, hybrid retrieval, reranking, 3 LLM/embedding providers. 255 tests.
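
Hybrid retrieval generally means merging a keyword ranking with a vector-similarity ranking. One common way to merge them is reciprocal rank fusion (RRF), sketched below. This is an assumption about the general technique, not a description of how investment-research-rag fuses results; the function name, the constant `k = 60`, and the document IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical hits from a keyword index and a vector index
keyword_hits = ["10K_2023_risk", "Q4_call_guidance", "10K_2023_mdna"]
vector_hits = ["Q4_call_guidance", "broker_note_margin", "10K_2023_risk"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF needs only ranks, not raw scores, so it avoids having to normalize keyword and embedding scores onto a common scale before a reranker sees the candidates.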


ls-portfolio-lab	L/S risk workbench — 40+ metrics, trade simulator, PM scorecard. Streamlit + Polars + Plotly.
backtest-lab	Event-driven backtesting with execution realism and bias prevention. 322 tests.
fund-tracker-13f	13F filing analyzer — 52 hedge funds, position changes, consensus trades, crowding.
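
Crowding and consensus trades from 13F data reduce to simple counting across funds. The sketch below illustrates that idea under stated assumptions; it does not reflect fund-tracker-13f's actual code. The function names, the `min_funds` threshold, and the fund and ticker labels are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Holdings = Dict[str, Set[str]]  # fund name -> tickers held at quarter end


def crowding(curr: Holdings) -> Counter:
    """Count how many funds hold each ticker in the current quarter."""
    counts: Counter = Counter()
    for tickers in curr.values():
        counts.update(tickers)
    return counts


def consensus_buys(prev: Holdings, curr: Holdings, min_funds: int = 2) -> List[Tuple[str, int]]:
    """Tickers newly initiated by at least `min_funds` funds since the prior quarter."""
    new_positions: Counter = Counter()
    for fund, tickers in curr.items():
        # A new position is a ticker held now but absent from the prior filing
        new_positions.update(tickers - prev.get(fund, set()))
    hits = [(t, n) for t, n in new_positions.items() if n >= min_funds]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Placeholder data, not real filings
prev_q = {"Fund A": {"AAA", "BBB"}, "Fund B": {"BBB"}, "Fund C": {"CCC"}}
curr_q = {
    "Fund A": {"AAA", "BBB", "DDD"},
    "Fund B": {"BBB", "DDD"},
    "Fund C": {"CCC", "DDD", "AAA"},
}

crowd = crowding(curr_q)
buys = consensus_buys(prev_q, curr_q)
print(crowd.most_common(1), buys)
```

A fuller version would presumably weight by position size or share of portfolio rather than counting holders, since 13F filings report values as well as names.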

20+ years as a buy-side equity PM (SAC Capital/Point72, BAM, WRC), covering global healthcare across all six GICS industries. More generalist in recent years. CFA, MBA.
Built and systematized investment processes. Hired, trained, and developed analyst teams. Taught valuation and idea pitching internally across firms.
Building LLM evaluation frameworks and agentic tools for investment research and probability assignment — scoring rubrics, difficulty grading, multi-model leaderboards, adversarial red teams.
The failure modes in investment decisions — anchoring, false precision, narrative over data, footnote blindness — appear in LLM outputs too. These repos measure that.
Eval-first, adversarial by default, open source.

Curiosity compounds. Rigor endures.