A containerized research system that ingests earnings-call transcripts and SEC filings, runs a multi-stage LangGraph pipeline powered by DigitalOcean Serverless Inference, generates structured research theses with falsifiable predictions, and backtests those predictions against subsequent price action.
Disclaimer: Research only. Not financial advice. Not a recommendation to buy or sell any security.
graph TD
UI[React Dashboard] -->|REST| API[FastAPI :8000]
API -->|enqueue| Worker[Celery Worker]
Worker -->|run| Pipeline[LangGraph Pipeline]
Pipeline -->|LLM calls| DO[DigitalOcean Serverless Inference]
Pipeline -->|embeddings| DO
Pipeline -->|read/write| DB[(Postgres + pgvector)]
Pipeline -->|SEC filings| EDGAR[SEC EDGAR API]
Pipeline -->|price data| Market[yfinance / sample fallback]
Pipeline -->|transcripts| Transcripts[sample_data/transcripts/]
API -->|read| DB
UI -->|view| DB
subgraph LangGraph Stages
S1[fetch_sources] --> S2[segment_transcript]
S2 --> S3[extract_guidance]
S3 --> S4[extract_analyst_pushback]
S4 --> S5[retrieve_prior_context RAG]
S5 --> S6[detect_sentiment_shift]
S6 -->|lowered/vague guidance| DD1[deep_dive_guidance_risk]
S6 -->|pushback >= 3| DD2[deep_dive_pushback]
S6 -->|shift <= -2| DD3[deep_dive_negative_shift]
S6 -->|skip| S7[generate_thesis]
DD1 --> DD2 --> DD3 --> S7
S7 --> S8[score_thesis]
S8 --> S9[backtest_outcome]
end
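The conditional edges leaving detect_sentiment_shift can be read as a small routing function. The sketch below is one plausible reading of the diagram, not the project's actual code: the state field names (`guidance_direction`, `pushback_count`, `sentiment_shift`) and the priority order of the checks are assumptions. In LangGraph, a function like this would typically be passed to `StateGraph.add_conditional_edges`.

```python
from typing import TypedDict


class ShiftState(TypedDict, total=False):
    # Hypothetical state fields; the real PipelineState may use other names.
    guidance_direction: str  # e.g. "raised", "maintained", "lowered", "vague"
    pushback_count: int      # number of analyst pushback exchanges found
    sentiment_shift: int     # signed shift versus prior quarters


def route_after_sentiment(state: ShiftState) -> str:
    """Return the name of the next node, following the thresholds in the diagram."""
    if state.get("guidance_direction") in {"lowered", "vague"}:
        return "deep_dive_guidance_risk"
    if state.get("pushback_count", 0) >= 3:
        return "deep_dive_pushback"
    if state.get("sentiment_shift", 0) <= -2:
        return "deep_dive_negative_shift"
    # No trigger fired: skip the deep dives entirely.
    return "generate_thesis"
```

Checking the triggers in a fixed order keeps the routing deterministic when more than one condition holds.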
| Layer | Technology |
|---|---|
| LLM & Embeddings | DigitalOcean Serverless Inference (OpenAI-compatible) |
| Orchestration | LangGraph 0.2 |
| Backend | FastAPI + Uvicorn |
| Background Jobs | Celery + Redis |
| Database | Postgres 16 + pgvector |
| Vector Search | NumPy cosine similarity (pgvector-ready) |
| Market Data | yfinance (deterministic fallback if unavailable) |
| SEC Filings | EDGAR REST API |
| Frontend | React 18 + Vite |
| Container | Docker + Docker Compose |
- Log in to DigitalOcean Cloud
- Navigate to AI & ML → Serverless Inference
- Click Create Access Key
- Copy the key — it is shown only once
Run the check script (after setting your key) to see available models:
python scripts/check_do_inference.py

Common models available on DigitalOcean Serverless Inference:

- meta-llama-3-70b-instruct: good balance of speed and quality
- meta-llama-3.1-405b-instruct: highest quality
- mistral-7b-instruct: fastest
- Check /v1/models for the current list
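For reference, a minimal sketch of querying the model list directly with the standard library. It assumes the endpoint returns the usual OpenAI-style shape (`{"data": [{"id": ...}, ...]}`) and accepts a bearer token; neither detail is confirmed by this README beyond "OpenAI-compatible". The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://inference.do-ai.run/v1"  # default base URL listed in this README


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style list response (assumed shape)."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)


def list_models(api_key: str, base_url: str = BASE_URL) -> list[str]:
    """Call GET {base_url}/models and return the sorted model IDs."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_model_ids(json.load(response))
```

Calling `list_models(os.environ["DIGITALOCEAN_INFERENCE_API_KEY"])` would then print the IDs that can be used for DIGITALOCEAN_LLM_MODEL.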
Copy .env.example to .env and fill in:
cp .env.example .env

DIGITALOCEAN_INFERENCE_API_KEY=dop_v1_xxxx...
DIGITALOCEAN_LLM_MODEL=meta-llama-3-70b-instruct
DIGITALOCEAN_EMBEDDING_MODEL=text-embedding-ada-002 # if available, else leave blank

If DIGITALOCEAN_EMBEDDING_MODEL is blank, the pipeline uses zero-vectors. The pipeline still runs, but RAG retrieval carries no ranking signal, so the retrieved context is effectively arbitrary.
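To see why zero-vectors break retrieval, consider a plain cosine-similarity ranking. This is a pure-Python illustration, not the project's store (which uses NumPy); the zero-norm guard that returns 0.0 is an assumption about how such a store might avoid division by zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity with a guard for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # mathematically undefined; treated as "no signal" here
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunks: list[tuple[str, list[float]]]):
    """Return (text, score) pairs sorted by descending similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# With real embeddings, the closest chunk ranks first.
ranked = rank_chunks([1.0, 0.0], [("a", [0.0, 1.0]), ("b", [1.0, 0.0])])

# With zero-vectors every score ties at 0.0, so the order is just insertion order.
ranked_zero = rank_chunks([0.0, 0.0], [("a", [0.0, 0.0]), ("b", [0.0, 0.0])])
```

Because every score ties, whichever chunks happen to be stored first are returned, regardless of relevance.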
# 1. Clone and configure
cp .env.example .env
# Edit .env — set DIGITALOCEAN_INFERENCE_API_KEY and DIGITALOCEAN_LLM_MODEL
# 2. Build and start
docker compose up --build
# 3. Frontend: http://localhost:3000
# 4. API docs: http://localhost:8000/docs

# Backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # fill in your keys
# Start Postgres locally (or set DATABASE_URL to SQLite for dev)
# DATABASE_URL=sqlite:///./dev.db
uvicorn app.main:app --reload
# Frontend (separate terminal)
cd frontend
npm install
npm run dev # http://localhost:5173

python scripts/check_do_inference.py

Expected output:
✅ All checks passed. DigitalOcean Inference is ready.
python scripts/seed_demo.py

Ingests 3 bundled transcripts: AAPL 2024Q4, NVDA 2024Q3, MSFT 2024Q2.

python scripts/run_demo_analysis.py --ticker AAPL --period 2024Q4

Output includes: thesis, bull/bear case, falsifiable predictions, backtest results.
- Open http://localhost:3000
- Select ticker + period, click Start Analysis
- Watch the stage progress bar fill as LangGraph executes
- Click View Full Thesis for evidence, predictions, and backtest table
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/health | App health |
| GET | /api/v1/health/llm | DigitalOcean Inference connectivity + model check |
| GET | /api/v1/models | List available DO inference models |
| POST | /api/v1/ingest/transcript | Upload transcript text for RAG |
| POST | /api/v1/runs | Start analysis pipeline |
| GET | /api/v1/runs/{run_id} | Poll run status + stage outputs |
| GET | /api/v1/theses | List generated theses |
| GET | /api/v1/theses/{id} | Full thesis detail |
| POST | /api/v1/backtest/{run_id} | (Re)run backtest for a run |
| GET | /api/v1/backtests | List all backtest results |
Interactive docs: http://localhost:8000/docs
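Since runs execute in the background, a client starts one with POST /api/v1/runs and then polls GET /api/v1/runs/{run_id}. The sketch below shows only the polling loop, with the HTTP call injected as a callable so it stays independent of any client library. The `status` field and its terminal values (`completed`, `failed`) are assumptions; check the interactive docs for the actual response schema.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # hypothetical status values


def poll_run(
    fetch_status: Callable[[str], dict],
    run_id: str,
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run reaches a terminal status or the poll budget runs out.

    fetch_status is expected to wrap GET /api/v1/runs/{run_id} and return
    the decoded JSON body.
    """
    for _ in range(max_polls):
        run = fetch_status(run_id)
        if run.get("status") in TERMINAL_STATUSES:
            return run
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

Bounding the number of polls keeps a stuck worker from hanging the client indefinitely.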
| Variable | Required | Description |
|---|---|---|
| DIGITALOCEAN_INFERENCE_API_KEY | Yes | DO model access key |
| DIGITALOCEAN_LLM_MODEL | Yes | Chat model ID |
| DIGITALOCEAN_EMBEDDING_MODEL | No | Embedding model ID (blank = zero vectors) |
| DIGITALOCEAN_INFERENCE_BASE_URL | No | Defaults to https://inference.do-ai.run/v1 |
| DATABASE_URL | Yes | Postgres or SQLite URL |
| REDIS_URL | No | Celery broker (defaults to redis://localhost:6379/0) |
| SEC_USER_AGENT | Yes | Name email@example.com for EDGAR headers |
| MARKET_DATA_PROVIDER | No | yfinance (default); falls back to sample data |
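The table above can be read as a small settings loader. The project itself uses pydantic-settings in app/config.py; the version below is a standard-library stand-in that only illustrates the required/optional split and the documented defaults. The class and field names are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = (
    "DIGITALOCEAN_INFERENCE_API_KEY",
    "DIGITALOCEAN_LLM_MODEL",
    "DATABASE_URL",
    "SEC_USER_AGENT",
)


@dataclass(frozen=True)
class Settings:
    api_key: str
    llm_model: str
    database_url: str
    sec_user_agent: str
    embedding_model: str = ""  # blank means zero-vector embeddings
    base_url: str = "https://inference.do-ai.run/v1"
    redis_url: str = "redis://localhost:6379/0"
    market_data_provider: str = "yfinance"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on missing keys."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    return Settings(
        api_key=env["DIGITALOCEAN_INFERENCE_API_KEY"],
        llm_model=env["DIGITALOCEAN_LLM_MODEL"],
        database_url=env["DATABASE_URL"],
        sec_user_agent=env["SEC_USER_AGENT"],
        embedding_model=env.get("DIGITALOCEAN_EMBEDDING_MODEL", ""),
        base_url=env.get("DIGITALOCEAN_INFERENCE_BASE_URL", Settings.base_url),
        redis_url=env.get("REDIS_URL", Settings.redis_url),
        market_data_provider=env.get("MARKET_DATA_PROVIDER", Settings.market_data_provider),
    )
```

Failing at startup when a required variable is absent is usually easier to debug than a failure deep inside a pipeline stage.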
AAPL, MSFT, NVDA, AMZN, META
Sample transcripts bundled:
- sample_data/transcripts/AAPL_2024Q4.txt: Apple Q4 2024 (Oct 31, 2024)
- sample_data/transcripts/NVDA_2024Q3.txt: NVIDIA Q3 FY2025 (Nov 20, 2024)
- sample_data/transcripts/MSFT_2024Q2.txt: Microsoft Q2 FY2025 (Jan 29, 2025)
pip install -r requirements.txt
pytest -v

Test coverage:

- test_do_provider.py: DigitalOcean provider (mocked HTTP)
- test_chunker.py: RAG text chunker
- test_backtest.py: Return calculations + deterministic sample data
- test_schema_validation.py: Pydantic schemas
- test_transcript_segmentation.py: Transcript parser utilities
- test_rag_retrieval.py: Vector store ingest + cosine retrieval (SQLite)
- test_pipeline_smoke.py: Full LangGraph graph with mocked LLM
- RAG is cold on first run. Prior context retrieval only works after multiple transcripts have been ingested. Run seed_demo.py first.
- Embeddings require a model. If DIGITALOCEAN_EMBEDDING_MODEL is blank, all embeddings are zero-vectors and RAG ranking is non-functional (pipeline still runs).
- Backtest uses close-to-close returns. Intraday dynamics, bid-ask spread, and transaction costs are not modeled.
- Market data falls back to deterministic synthetic data if yfinance is unavailable or the date range has no data. The synthetic data is clearly labeled is_sample_data=true.
- SEC filing text fetch is best-effort. EDGAR HTML parsing may fail for older filings; the pipeline gracefully continues without filing text.
- Concurrency is limited. The FastAPI background task runner and Celery worker are single-threaded by default. For production, scale Celery workers.
- No authentication. This is an MVP. Add OAuth2 / API key auth before exposing publicly.
- China/macro context and true multi-quarter trend analysis require more historical data than the MVP seeds.
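As an illustration of the close-to-close limitation above, a return over a holding horizon can be computed from two closing prices alone. This is a minimal sketch, not the project's backtest code; the function name and the convention of counting the horizon in rows of the price series (trading days) are assumptions.

```python
def close_to_close_return(closes: list[float], entry_index: int, horizon_days: int) -> float:
    """Simple return from the close at entry_index to the close horizon_days rows later.

    Ignores intraday moves, bid-ask spread, and transaction costs.
    """
    exit_index = entry_index + horizon_days
    if entry_index < 0 or exit_index >= len(closes):
        raise IndexError("not enough price history for the requested horizon")
    entry_price = closes[entry_index]
    exit_price = closes[exit_index]
    return exit_price / entry_price - 1.0


# Entering at a close of 100 and exiting three rows later at 110 gives roughly +10%.
example_return = close_to_close_return([100.0, 102.0, 105.0, 110.0], entry_index=0, horizon_days=3)
```

A prediction such as "the stock falls over the next N trading days" can then be checked against the sign of this return.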
- Add pgvector vector column type for indexed, in-database similarity search
- Wire real news API (Benzinga, Alpha Vantage) into app/tools/news.py
- Add earnings date auto-discovery via EDGAR filing dates
- Multi-quarter trend charting in the frontend
- Streaming pipeline stage updates via WebSocket
- Add authentication (JWT / API key)
- Expand to 20+ tickers
- Add PDF filing ingestion (10-K, 10-Q full text)
- Add evaluation framework comparing thesis quality across models
- Deploy to DigitalOcean App Platform with managed Postgres
earnings-thesis-eval/
├── app/
│ ├── config.py # Settings (pydantic-settings)
│ ├── database.py # SQLAlchemy engine + session
│ ├── main.py # FastAPI app entry point
│ ├── models/
│ │ └── core.py # DB models: Company, Thesis, Backtest, etc.
│ ├── llm/
│ │ ├── digitalocean.py # DO Inference provider (chat, embed, models)
│ │ └── prompts.py # All LLM prompt templates
│ ├── tools/
│ │ ├── sec_edgar.py # SEC EDGAR MCP-style tool
│ │ ├── market_data.py # Price data + deterministic fallback
│ │ ├── transcripts.py # Transcript loader
│ │ └── news.py # News stub
│ ├── rag/
│ │ ├── chunker.py # Text chunking
│ │ └── store.py # Vector store (cosine sim / pgvector-ready)
│ ├── pipeline/
│ │ ├── state.py # LangGraph PipelineState TypedDict
│ │ ├── nodes.py # All pipeline node implementations
│ │ ├── graph.py # LangGraph StateGraph construction
│ │ └── runner.py # High-level run_pipeline() entry point
│ ├── api/
│ │ ├── schemas.py # Pydantic request/response schemas
│ │ └── routes.py # FastAPI route handlers
│ └── workers/
│ ├── celery_app.py # Celery configuration
│ └── tasks.py # Celery tasks
├── frontend/
│ ├── src/
│ │ ├── App.jsx # Router + nav
│ │ ├── api.js # Axios API client
│ │ ├── components/ # Card, Badge
│ │ └── pages/ # Dashboard, RunPage, ThesisPage, BacktestsPage, HealthPage
│ ├── index.html
│ ├── package.json
│ └── vite.config.js
├── sample_data/transcripts/ # Bundled earnings call transcripts
├── scripts/
│ ├── check_do_inference.py # DO connectivity smoke test
│ ├── seed_demo.py # Database seed
│ └── run_demo_analysis.py # CLI pipeline runner
├── tests/ # pytest test suite
├── Dockerfile.backend
├── Dockerfile.worker
├── Dockerfile.frontend
├── docker-compose.yml
├── nginx.conf
├── requirements.txt
├── pytest.ini
└── .env.example