Bulk data access: single-file format for full-universe analysis #2

@deepentropy

Description

Problem

Calling yfd.get(sym, 'price') for each of the ~7,700 symbols takes over five minutes even when everything is fully cached. Each call reads a separate parquet file from disk, deserializes it, and returns a DataFrame. For screener/scanner workflows that need OHLCV for the entire universe, this load step is the bottleneck.

Current workflow

for sym in yfd.symbols():       # 7,679 iterations
    df = yfd.get(sym, 'price')  # one parquet read per symbol
    # compute indicators...

That is ~5 minutes just for the load step; the actual computation takes only seconds.

Suggestion

Provide a bulk access path for price data, similar to how screener and info data are already stored as single bulk parquet files.

Options (any would help):

  1. yfd.get_all('price') → single DataFrame with a symbol column, stored as one partitioned parquet file. One read, one deserialize.

  2. yfd.get_all('price') → dict[str, DataFrame] loaded from a single concatenated parquet, split in memory.

  3. Pre-built universe parquet at ~/.cache/yfd/price_all.parquet generated by yfd.sync(), containing all symbols stacked. Refreshed on sync.

The screener bulk format already proves this pattern works. Price data is the most common access pattern for quantitative workflows and would benefit the most from bulk loading.

Context

Building weekly relative strength scanners that rank all US stocks. The scanner needs Close + Volume for the full universe on every run. Current per-symbol access makes iteration slow despite data being local.
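To illustrate why a bulk frame fits this workload: with all symbols in one stacked DataFrame, the entire ranking step reduces to a pivot plus vectorized math, with no per-symbol I/O. A rough sketch with made-up column names and toy data (not yfd's actual schema):

```python
import pandas as pd

# Hypothetical stacked universe frame, as a bulk load would return it.
universe = pd.DataFrame({
    "symbol": ["AAPL"] * 3 + ["MSFT"] * 3,
    "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"] * 2),
    "close": [180.0, 185.0, 190.0, 370.0, 365.0, 380.0],
})

# Wide Close matrix: one column per symbol, built in a single pivot.
close = universe.pivot(index="date", columns="symbol", values="close")

# Relative strength as the trailing return over the window, ranked
# across the whole universe at once.
rs = close.iloc[-1] / close.iloc[0] - 1.0
ranked = rs.rank(ascending=False)
print(ranked.sort_values().index.tolist())  # ['AAPL', 'MSFT']
```

The same pivot-then-rank shape works for Volume filters, which is why Close + Volume in one bulk file covers the scanner's full run.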
