
Add --stats flag for quick data profiling #169

@vmvarela

Description

Add a --stats (alias --profile) flag that computes per-column statistics after loading input and prints them as a formatted table.

sql-pipe sales.csv --stats
# | column | type    | non-null | min   | max     | mean   |
# |--------|---------|----------|-------|---------|--------|
# | id     | INTEGER | 1000     | 1     | 1000    | 500.5  |
# | amount | REAL    | 1000     | 0.50  | 9999.99 | 512.34 |
# | region | TEXT    | 1000     | East  | West    |        |

Motivation

Profiling columns is typically one of the first things an analyst does with a new dataset. Currently users must write SELECT MIN(x), MAX(x), AVG(x), COUNT(*) FROM t WHERE x IS NOT NULL by hand for each column. A --stats mode automates this common profiling query and gives a quick view of data shape, completeness, and value ranges. This is comparable to csvstat (csvkit) and DuckDB's SUMMARIZE statement.

Acceptance Criteria

  • --stats flag is parsed in args.zig
  • After loading tables, compute per-column: type, non-null count, min, max, mean (for numeric), distinct count
  • For TEXT columns: show min/max as string values, skip mean
  • For INTEGER/REAL columns: show all stats including mean
  • Output formatted as a table (reuse existing table formatter)
  • --stats is mutually exclusive with --columns, --validate, --sample, --schema, --explain, and a query argument
  • Works with multiple files (show stats per table)
  • Integration tests cover the new mode
  • Help text updated
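
The mutual-exclusivity criterion above could be checked along these lines. This is a minimal Python sketch of the rule only, not the Zig code in args.zig; the function name `stats_conflict`, the pre-parsed `flags` set, and the error wording are all hypothetical.

```python
# Hypothetical sketch: flag names come from the acceptance criteria,
# everything else (function name, messages) is an assumption.
STATS_ALIASES = frozenset({"--stats", "--profile"})
CONFLICTS = ("--columns", "--validate", "--sample", "--schema", "--explain")


def stats_conflict(flags, query=None):
    """Return an error message if --stats is combined with a conflicting
    flag or a query argument, otherwise None."""
    flags = set(flags)
    if not (STATS_ALIASES & flags):
        return None  # --stats not requested, nothing to check
    for flag in CONFLICTS:
        if flag in flags:
            return f"--stats cannot be combined with {flag}"
    if query is not None:
        return "--stats cannot be combined with a query argument"
    return None


print(stats_conflict({"--profile", "--schema"}))
```

Treating --profile as an alias up front keeps the conflict check in one place, so both spellings report the same error.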

Implementation Notes

  • Use PRAGMA table_info(t) to get column names and types (function getTableColumns at src/sqlite.zig:277 already does this)
  • Build one aggregate SELECT per column and combine them with UNION ALL
  • Add StatsArgs to src/args.zig following the pattern of ColumnsArgs/ValidateArgs
  • Handle empty result sets gracefully (e.g. a table with zero rows, where MIN, MAX, and AVG return NULL)
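
To make the notes above concrete, here is a rough sketch of the query shape, written in Python with the standard sqlite3 module rather than Zig. It assumes the declared types are exactly INTEGER, REAL, or TEXT as in the example output; the helper names (`quote_ident`, `build_stats_query`, `column_stats`) and the result column aliases are illustrative, not part of sql-pipe.

```python
import sqlite3

NUMERIC_TYPES = {"INTEGER", "REAL"}


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_stats_query(conn, table):
    """Return (sql, params) with one SELECT per column joined by UNION ALL,
    or None when the table has no columns (e.g. it does not exist)."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info({quote_ident(table)})").fetchall()
    if not columns:
        return None
    parts, params = [], []
    for _cid, name, decl_type, *_rest in columns:
        col = quote_ident(name)
        # Mean is only computed for numeric columns; TEXT columns get NULL.
        numeric = (decl_type or "").upper() in NUMERIC_TYPES
        mean = f"AVG({col})" if numeric else "NULL"
        parts.append(
            f"SELECT ? AS column_name, ? AS decl_type, "
            f"COUNT({col}) AS non_null, COUNT(DISTINCT {col}) AS distinct_count, "
            f"MIN({col}) AS min_value, MAX({col}) AS max_value, {mean} AS mean_value "
            f"FROM {quote_ident(table)}"
        )
        params.extend([name, decl_type])
    return " UNION ALL ".join(parts), params


def column_stats(conn, table):
    """Run the stats query; returns one tuple per column, or [] if no columns."""
    built = build_stats_query(conn, table)
    if built is None:
        return []
    sql, params = built
    return conn.execute(sql, params).fetchall()


# Small demo on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 0.5, "East"), (2, 1.5, "West"), (3, None, "East")],
)
for row in column_stats(conn, "sales"):
    print(row)
```

Because each SELECT is an aggregate without GROUP BY, a table with zero rows still yields one row per column, with a count of 0 and NULL for min, max, and mean, so the empty case needs no special query.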

Metadata

    Labels

      • priority:high (Must be in the next sprint)
      • size:m (Medium, 4 to 8 hours)
      • status:ready (Refined and ready for sprint selection)
      • type:feature (New functionality)
