An automated financial data pipeline that collects stock financial statements (income statements, cash flow, balance sheets, and financials) from Yahoo Finance for 106,000+ tickers worldwide and stores them in a cloud-hosted PostgreSQL (Neon) database in a normalized long-format schema.
Built using the stockdex package for Yahoo Finance data extraction. Database schema and migrations are managed in the database-version-control repository.
- 106K+ Tickers: Validates and tracks active tickers across global exchanges
- 4 Financial Statement Types: Income Statement, Cash Flow, Balance Sheet, Financials
- Parallel Processing: Multi-threaded HTTP calls (configurable threads) for high throughput
- Incremental Updates: Prioritizes new tickers, then refreshes oldest data
- Idempotent Jobs: Safe to re-run; uses upsert/delete+insert patterns
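The incremental update order in the list above (new tickers first, then the oldest data) can be sketched as follows. This is an illustration only, not the pipeline's actual code: the function name `pick_next_batch` and the `last_fetched` mapping are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional


def pick_next_batch(
    last_fetched: Dict[str, Optional[datetime]],
    batch_size: int,
) -> List[str]:
    """Return the tickers to process next.

    `last_fetched` (hypothetical structure) maps each ticker to the time of
    its last successful fetch, or None if it has never been fetched.
    """
    # Tickers that were never fetched come first, in a stable (sorted) order.
    new = sorted(t for t, ts in last_fetched.items() if ts is None)
    # Remaining tickers are ordered by age of their data, oldest first.
    stale = sorted(
        (t for t, ts in last_fetched.items() if ts is not None),
        key=lambda t: last_fetched[t],
    )
    return (new + stale)[:batch_size]


if __name__ == "__main__":
    state = {
        "AAPL": datetime(2024, 3, 1),
        "MSFT": None,
        "SAP.DE": datetime(2024, 1, 15),
    }
    print(pick_next_batch(state, batch_size=2))  # ['MSFT', 'SAP.DE']
```

Because the selection depends only on stored fetch times, re-running a job after an interruption picks up where the previous run stopped.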
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Ticker Excel    │────▶│ Active Tickers   │────▶│ Financial       │
│ (106K tickers)  │     │ Check Job        │     │ Data ETL Jobs   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                        ┌──────────────────────────────────────┐
                        │ Neon PostgreSQL (finance schema)     │
                        │                                      │
                        │  • active_tickers (106K rows)        │
                        │  • income_stmt (long format)         │
                        │  • cash_flow (long format)           │
                        │  • balance_sheet (long format)       │
                        │  • financials (long format)          │
                        └──────────────────────────────────────┘
# Install dependencies
pip install -r finance/requirements.txt
# Run active tickers check (validates tickers against Yahoo API)
python -m finance.src.run_active_tickers_check --mode single --threads 30
# Run financial data ETL (for any of the 4 tables)
python -m finance.src.run_financial_etl --table income_stmt --max-batches 5
python -m finance.src.run_financial_etl --table cash_flow --max-batches 5
python -m finance.src.run_financial_etl --table balance_sheet --max-batches 5
python -m finance.src.run_financial_etl --table financials --max-batches 5

All job parameters are defined in config.py:
| Parameter | Default | Description |
|---|---|---|
| ACTIVE_TICKERS_BATCH_SIZE | 100 | Tickers per batch for validity check |
| ACTIVE_TICKERS_THREADS | 30 | Concurrent threads for ticker validation |
| ETL_BATCH_SIZE | 50 | Tickers per batch for financial data fetch |
| ETL_THREADS | 10 | Concurrent threads for data fetching |
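To show how a batch size and a thread count typically interact, here is a minimal sketch using only the standard library. The two constants reuse the ETL defaults from the table; `chunked`, `run_batches`, and the `fetch_one` callable are hypothetical names, and `max_batches` only mirrors the `--max-batches` flag under the assumption that it caps the number of batches per run.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Optional

# Defaults taken from the parameter table.
ETL_BATCH_SIZE = 50
ETL_THREADS = 10


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def _safe(fetch_one: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap a fetch function so one failing ticker does not abort the batch."""
    def wrapper(ticker: str) -> object:
        try:
            return fetch_one(ticker)
        except Exception:
            return None  # failure is recorded as None for this ticker
    return wrapper


def run_batches(
    tickers: List[str],
    fetch_one: Callable[[str], object],
    batch_size: int = ETL_BATCH_SIZE,
    threads: int = ETL_THREADS,
    max_batches: Optional[int] = None,
) -> Dict[str, object]:
    """Fetch tickers batch by batch, with `threads` concurrent calls per batch."""
    results: Dict[str, object] = {}
    for index, batch in enumerate(chunked(tickers, batch_size)):
        if max_batches is not None and index >= max_batches:
            break
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # pool.map preserves input order, so zip pairs each ticker
            # with its own result.
            for ticker, data in zip(batch, pool.map(_safe(fetch_one), batch)):
                results[ticker] = data
    return results
```

Threads suit this workload because the time is spent waiting on HTTP responses rather than on CPU work.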
All financial tables use a long (melted) format:
| Column | Type | Description |
|---|---|---|
| ticker | VARCHAR | Stock ticker symbol |
| frequency | VARCHAR | "annual" or "quarterly" |
| report_date | DATE | Financial report date |
| metric | VARCHAR | Metric name (e.g., "annualTotalRevenue") |
| value | FLOAT | Numeric value |
| insert_datetime | TIMESTAMP | When the row was inserted |
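As an illustration of the long (melted) format, the sketch below turns one wide statement into rows matching the six columns above. It assumes, hypothetically, that a statement arrives as a mapping of report date to `{metric: value}`; the function name `melt_statement` and the choice to skip missing values are assumptions, not the pipeline's confirmed behavior.

```python
from datetime import date, datetime, timezone
from typing import Dict, List, Optional, Tuple

# (ticker, frequency, report_date, metric, value, insert_datetime)
Row = Tuple[str, str, date, str, float, datetime]


def melt_statement(
    ticker: str,
    frequency: str,
    wide: Dict[date, Dict[str, Optional[float]]],
    now: Optional[datetime] = None,
) -> List[Row]:
    """Flatten a wide statement into one row per (report_date, metric)."""
    inserted_at = now or datetime.now(timezone.utc)
    rows: List[Row] = []
    for report_date, metrics in sorted(wide.items()):
        for metric, value in sorted(metrics.items()):
            if value is None:
                continue  # assumption: metrics without a value are not stored
            rows.append(
                (ticker, frequency, report_date, metric, float(value), inserted_at)
            )
    return rows


if __name__ == "__main__":
    # Placeholder numbers, not real financial data.
    wide = {
        date(2023, 12, 31): {"annualTotalRevenue": 120.0, "annualNetIncome": None},
        date(2022, 12, 31): {"annualTotalRevenue": 100.0, "annualNetIncome": 10.0},
    }
    for row in melt_statement("DEMO", "annual", wide):
        print(row[:5])
```

One practical benefit of this layout is that a metric appearing for the first time becomes new rows rather than a new column, so the table definition does not need to change.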
Yahoo Finance provides only the last 4 quarters/years of financial data. This pipeline keeps what each run fetches, so the database accumulates a growing history beyond that window.
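The accumulation effect can be demonstrated with an upsert keyed on the row's identity. The sketch below uses Python's built-in `sqlite3` purely as a local stand-in for Neon PostgreSQL (the `INSERT ... ON CONFLICT ... DO UPDATE` form exists in both). The unique key `(ticker, frequency, report_date, metric)` is an assumption about the schema, and all values are placeholders.

```python
import sqlite3

DDL = """
CREATE TABLE income_stmt (
    ticker          TEXT NOT NULL,
    frequency       TEXT NOT NULL,
    report_date     TEXT NOT NULL,
    metric          TEXT NOT NULL,
    value           REAL,
    insert_datetime TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (ticker, frequency, report_date, metric)  -- assumed key
)
"""

UPSERT = """
INSERT INTO income_stmt (ticker, frequency, report_date, metric, value)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (ticker, frequency, report_date, metric)
DO UPDATE SET value = excluded.value
"""


def load(conn: sqlite3.Connection, rows: list) -> None:
    """Insert rows, overwriting any row that has the same key."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    metric = "annualTotalRevenue"
    # First run sees report years 2020-2023; a later run sees 2021-2024.
    run1 = [("DEMO", "annual", f"{y}-12-31", metric, 1.0) for y in range(2020, 2024)]
    run2 = [("DEMO", "annual", f"{y}-12-31", metric, 2.0) for y in range(2021, 2025)]
    load(conn, run1)
    load(conn, run2)
    load(conn, run2)  # re-running the same load changes nothing
    query = "SELECT report_date FROM income_stmt ORDER BY report_date"
    return [r[0] for r in conn.execute(query)]


if __name__ == "__main__":
    print(demo())  # five dates: 2020-12-31 through 2024-12-31
```

The 2020 row survives even though the later run no longer sees it, which is how the stored history grows past the 4-period window.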
Instead of scraping Yahoo Finance manually for each company, query thousands of companies at once with SQL.
Filter and rank companies across any financial metric (revenue, net income, EPS, etc.) using standard SQL queries.
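A ranking query over the long format might look like the sketch below, which lists the top companies by a metric using each company's most recent annual report. It again runs against an in-memory `sqlite3` table as a stand-in; on the real database the table would presumably be qualified with the finance schema. The tickers and values are placeholders, and `top_by_metric` is a hypothetical helper.

```python
import sqlite3

RANK_BY_METRIC = """
SELECT ticker, report_date, value
FROM income_stmt
WHERE frequency = 'annual'
  AND metric = ?
  AND report_date = (
      SELECT MAX(latest.report_date)
      FROM income_stmt AS latest
      WHERE latest.ticker = income_stmt.ticker
        AND latest.frequency = income_stmt.frequency
        AND latest.metric = income_stmt.metric
  )
ORDER BY value DESC
LIMIT ?
"""


def top_by_metric(conn: sqlite3.Connection, metric: str, limit: int) -> list:
    """Return (ticker, report_date, value) for the highest latest-annual values."""
    return conn.execute(RANK_BY_METRIC, (metric, limit)).fetchall()


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE income_stmt ("
        "ticker TEXT, frequency TEXT, report_date TEXT, metric TEXT, value REAL)"
    )
    # Placeholder rows, not real financial data.
    conn.executemany(
        "INSERT INTO income_stmt VALUES (?, ?, ?, ?, ?)",
        [
            ("AAA", "annual", "2022-12-31", "annualTotalRevenue", 50.0),
            ("AAA", "annual", "2023-12-31", "annualTotalRevenue", 70.0),
            ("BBB", "annual", "2023-06-30", "annualTotalRevenue", 90.0),
            ("CCC", "annual", "2023-12-31", "annualTotalRevenue", 20.0),
            ("AAA", "quarterly", "2024-03-31", "quarterlyTotalRevenue", 999.0),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in top_by_metric(build_demo_db(), "annualTotalRevenue", 2):
        print(row)
```

Since the metric name is a query parameter, the same statement ranks companies by any metric stored in the table.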
MIT License - see LICENSE for details.