Olly is a data observability framework that monitors data warehouses for schema changes, volume anomalies, freshness issues, cross-source integrity, cost spikes, unused tables, and contract violations. Currently supports DuckDB, Postgres, BigQuery, and Snowflake via Ibis.
By default, the schema checks are metadata-only, making it lightweight and low-cost.
🚨 This should be considered experimental and subject to change.
Install the core package, or install it together with an adapter for your warehouse:

```shell
uv add olly-core
uv add "olly-core[duckdb]"   # or postgres, bigquery, snowflake
```

Then:

```shell
olly init       # interactive setup → creates olly.toml
olly snapshot   # capture current warehouse state
olly check      # detect changes since last snapshot
```

That's it. Olly compares consecutive snapshots and reports schema changes, volume anomalies, and stale tables out of the box. You can add more specific configuration for high-value objects.
| Check | What it detects | Severity |
|---|---|---|
| Schema | Added/removed tables, added/removed/changed columns, nullability changes | error or warning |
| Volume | Row count anomalies using EWMA (default) or z-score over snapshot history | warning |
| Freshness | Tables not updated within the configured threshold | warning |
| Integrity | Row count or hash mismatches between source and target databases | error |
| Contracts | Schema violations against declared Python contracts | error or warning |
| dbt integration | Test failures, execution-time changes, and DDL changes ingested from run_results.json | error or warning |
| Usage | Tables not queried within a lookback window (BigQuery only) | error or warning |
| Cost | Query cost spike (BigQuery only) | warning |
```text
Findings
┏━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Check    ┃ Severity ┃ Table            ┃ Description                                                        ┃
┡━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ schema   │ error    │ main.customers   │ Column removed: main.customers.email                               │
│ schema   │ warning  │ main.orders      │ New column: main.orders.status                                     │
│ freshness│ warning  │ main.payments    │ Stale data: main.payments — last update 30.0h ago (threshold: 24.0h)│
│ volume   │ warning  │ main.orders      │ Row count anomaly (increase): main.orders (12,345 rows, z-score: +3.20) │
└──────────┴──────────┴──────────────────┴────────────────────────────────────────────────────────────────────┘
1 error(s), 3 warning(s)
```
Use --json for machine-readable output.
```text
olly init                Run the interactive setup wizard

olly snapshot            Capture current warehouse state
  --verbose              Print detailed progress
  --connection NAME      Only snapshot this connection

olly check               Run data quality checks
  --json                 Machine-readable JSON output
  --verbose              Print detailed progress
  --write-results        Persist findings to ~/.olly/findings.json (default: true)
  --no-write-results     Disable writing findings to disk
  --connection NAME      Only check this connection

olly plan                Show resolved config for each table
  --connection NAME      Only explain this connection

olly unused              Show unused and stale tables
  --json                 Machine-readable JSON output
  --verbose              Print detailed progress
  --connection NAME      Only check this connection

olly debug               Test connectivity to the configured warehouse
  --connection NAME      Only test this connection

olly clean               Delete the local state database
  --yes                  Skip confirmation prompt

olly create-state        Create warehouse state schema and tables
  --connection NAME      Only create state for this connection

olly serve               Start the web dashboard
  --host HOST            Bind address (default: 127.0.0.1)
  --port PORT            Port (default: 8000)
```
Olly is configured via olly.toml in your project root. Run olly init to generate one interactively, or create it by hand.
Each named connection under [connections.<name>] specifies a warehouse backend via type plus backend-specific fields. You can monitor multiple warehouses from a single config:
```toml
# DuckDB
[connections.primary]
type = "duckdb"
path = "warehouse.duckdb"   # omit for in-memory

# Postgres (second connection)
[connections.analytics]
type = "postgres"
url = "${DATABASE_URL}"

# BigQuery
[connections.warehouse]
type = "bigquery"
project = "my-project"
dataset = "analytics"                      # optional
use_information_schema_row_counts = true   # optional, default true

# Snowflake
[connections.snowflake]
type = "snowflake"
account = "my-account"
database = "my_db"          # optional
use_account_usage = false   # optional, default false
user = "my-user"            # optional, forwarded to Ibis
role = "ANALYST"            # optional, forwarded to Ibis
warehouse = "COMPUTE_WH"    # optional, forwarded to Ibis
```

Any extra keys beyond the standard fields are forwarded as keyword arguments to the underlying Ibis connect() call. This lets you pass adapter-specific options (e.g. Snowflake user, role, warehouse, or DuckDB read_only) without Olly needing to know about them.
For BigQuery, set credentials via:

```shell
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

Each connection has its own selection. By default, all schemas are monitored (except information_schema) and all tables are included. Use glob patterns (*) to filter:
```toml
[connections.primary.selection]
include_schemas = ["*"]
exclude_schemas = ["information_schema", "scratch", "dev"]
include_tables = ["*.*"]
exclude_tables = ["main.tmp_*", "main.staging_*"]
```

Global defaults for check thresholds:
```toml
[settings]
freshness_threshold_hours = 24.0   # max age before flagging stale
volume_zscore_threshold = 3.0      # z-score cutoff for volume anomalies
volume_method = "ewma"             # "ewma" (default) or "zscore"
history_depth = 30                 # snapshots to keep for trend analysis
min_history_for_anomaly = 5        # minimum snapshots before volume checks run
write_results = true               # persist findings to ~/.olly/findings.json
state_schema = "olly_state"        # optional: store state in warehouse instead of local SQLite
```

EWMA (Exponentially Weighted Moving Average) is the default volume method. It weights recent observations more heavily, making it better at handling trending tables without false positives. Use "zscore" for stationary tables where equal weighting of all history is preferred. Both methods can be overridden per table.
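To make the difference concrete, here is a rough sketch of both methods. The smoothing factor and the exact EWMA mean/variance recursion are assumptions for illustration, not Olly's implementation or defaults:

```python
# Rough sketch of the two volume methods; alpha and the EWMA variance
# recursion are illustrative assumptions, not Olly's code.
from statistics import mean, stdev

def zscore_anomaly(history, current, threshold=3.0):
    """Equal-weighted: flag if current deviates more than `threshold`
    standard deviations from the mean of all history."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) / sigma > threshold

def ewma_anomaly(history, current, alpha=0.3, threshold=3.0):
    """Recency-weighted: track an exponentially weighted mean and variance,
    so a steadily trending table shifts the baseline instead of alerting."""
    ew_mean, ew_var = history[0], 0.0
    for x in history[1:]:
        diff = x - ew_mean
        ew_mean += alpha * diff
        ew_var = (1 - alpha) * (ew_var + alpha * diff * diff)
    sigma = ew_var ** 0.5
    return sigma > 0 and abs(current - ew_mean) / sigma > threshold

rows = [1000, 1100, 1210, 1330, 1460, 1610]  # roughly 10% growth per snapshot
print(zscore_anomaly(rows, 1980))  # True: equal weighting treats the trend as a spike
print(ewma_anomaly(rows, 1980))    # False: the recency-weighted baseline absorbs it
```

On a steadily growing table, the plain z-score eventually flags the trend itself, while the EWMA baseline follows it.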
Override settings for specific tables or patterns, scoped per connection. More specific matches win (schema < pattern < exact table):

```toml
[[connections.primary.overrides]]
match = "main.orders"
freshness_threshold_hours = 168
volume_zscore_threshold = 5.0

[[connections.primary.overrides]]
match = "main.*"
freshness_column = "updated_at"
```

Use `olly plan` to see how overrides resolve for each object.
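The precedence rule can be pictured with a small sketch. `resolve_overrides` is a hypothetical helper, not Olly's API; it merges every matching override, applying the most specific last so it wins:

```python
# Hypothetical illustration of "more specific matches win"; not Olly's code.
from fnmatch import fnmatch

def resolve_overrides(table: str, overrides: list[dict]) -> dict:
    """Merge all matching overrides, applying the most specific last."""
    def specificity(match: str) -> int:
        if match == table:
            return 2              # exact table name
        if fnmatch(table, match):
            return 1              # glob pattern
        return -1                 # does not match
    resolved: dict = {}
    for o in sorted(overrides, key=lambda o: specificity(o["match"])):
        if specificity(o["match"]) >= 0:
            resolved.update({k: v for k, v in o.items() if k != "match"})
    return resolved

overrides = [
    {"match": "main.orders", "freshness_threshold_hours": 168},
    {"match": "main.*", "freshness_column": "updated_at", "freshness_threshold_hours": 48},
]
print(resolve_overrides("main.orders", overrides))
# exact match wins the threshold; the pattern still supplies freshness_column
```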
Compare data between two connections. Syncs reference named connections from your [connections.*] config:

```toml
[connections.warehouse]
type = "duckdb"
path = "warehouse.duckdb"

[connections.replica]
type = "postgres"
url = "postgresql://${PGUSER}:${PGPASSWORD}@replica:5432/db"

[integrity]
module = "integrity_syncs.py"   # file path or dotted module name
```

```python
# integrity_syncs.py
from olly.models import IntegrityMethod, Sync, WindowOp, WindowSpec

syncs = [
    Sync(
        name="orders_count",
        source="warehouse",            # references [connections.warehouse]
        target="replica",              # references [connections.replica]
        source_table="main.orders",
        target_table="public.orders",
        method=IntegrityMethod.COUNT,
        watermark="updated_at",
        window=WindowSpec(op=WindowOp.GT_NOW, duration="2h"),
    ),
]
```

Methods:
| Method | What it compares |
|---|---|
| count | Row counts between source and target |
| count_distinct | Distinct values of a key column |
| pk | Primary key sets (finds missing/extra rows) |
| hash | Row-level hash of specified columns |
Pipelines support time windows, WHERE filters, tolerance thresholds, and watermark columns for incremental checks.
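To make the hash method concrete, here is a toy sketch of the idea: hash the compared columns per primary key on each side, then diff the two maps. This is illustrative only, not how Olly computes hashes:

```python
# Toy illustration of a row-level hash comparison; not Olly's implementation.
import hashlib

def row_hashes(rows, key, columns):
    """Map each row's primary key to a digest of the compared columns."""
    return {
        str(r[key]): hashlib.sha256(
            "|".join(str(r[c]) for c in columns).encode()
        ).hexdigest()
        for r in rows
    }

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
target = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]

src = row_hashes(source, "id", ["amount"])
tgt = row_hashes(target, "id", ["amount"])
mismatched = sorted(k for k in src if src[k] != tgt.get(k))
print(mismatched)  # ['2']: row 2's hashed columns differ between the sides
```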
Define expected table schemas as Python classes:
```python
# contracts.py
from datetime import datetime

from olly.contracts import TableContract

class Orders(TableContract):
    __table__ = "orders"
    __schema__ = "main"
    __connection__ = "primary"   # only check against this connection (optional)
    __strict__ = True            # flag unexpected columns

    id: int
    amount: float
    created_at: datetime
    customer_name: str | None    # nullable
```

When `__connection__` is set, the contract is only validated against that named connection. When omitted, it runs against all connections.
Point your config at the contracts module:

```toml
[contracts]
module = "contracts.py"   # file path or dotted module name
```

Olly validates the warehouse schema against your contracts on every `olly check`.
Parse dbt's run_results.json to surface failures:

```toml
[dbt]
run_results_path = "target/run_results.json"
performance_threshold = 3.0    # z-score threshold for execution time anomalies
min_history_for_anomaly = 5    # minimum runs before performance checks activate
```

Detect tables that haven't been queried recently. Currently supported on BigQuery only (uses INFORMATION_SCHEMA.JOBS_BY_PROJECT).
```toml
[usage]
enabled = true
lookback_days = 90            # how far back to scan query history
unused_threshold_days = 30    # days without queries before flagging
bigquery_region = "us"
```

Tables with no queries in the lookback window are flagged as errors. Tables queried, but not within the unused threshold, are warnings.
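The error/warning split above amounts to a simple rule, sketched here with a hypothetical helper (not Olly's code):

```python
# Hypothetical sketch of the usage flagging rule described above.
from datetime import datetime, timedelta

def classify_usage(last_queried, now, lookback_days=90, unused_threshold_days=30):
    """Return 'error' for no queries in the lookback window,
    'warning' for stale-but-seen tables, None for active tables."""
    if last_queried is None or now - last_queried > timedelta(days=lookback_days):
        return "error"
    if now - last_queried > timedelta(days=unused_threshold_days):
        return "warning"
    return None

now = datetime(2024, 6, 1)
print(classify_usage(None, now))                   # error: never queried
print(classify_usage(datetime(2024, 4, 15), now))  # warning: 47 days ago
print(classify_usage(datetime(2024, 5, 28), now))  # None: actively used
```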
Track BigQuery query costs and detect spending spikes. Uses z-score anomaly detection over cost history from previous snapshots. Cost fields live in the [usage] section:

```toml
[usage]
enabled = true
cost_enabled = true
cost_lookback_days = 30    # days of query history to aggregate
price_per_tb_usd = 6.25    # on-demand pricing rate
spike_threshold = 3.0      # z-score threshold for cost spike alerts
```

A legacy [cost] section is still accepted for backward compatibility, but new configs should use [usage].
When olly check runs, it shows a cost summary with top tables and top users by spend. Cost spikes are flagged as findings when the current period's total exceeds the historical mean by more than spike_threshold standard deviations.
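The spike rule described above amounts to a one-sided z-score. A sketch with made-up numbers (not Olly's code):

```python
# Sketch of a one-sided z-score cost spike check; the numbers are invented.
from statistics import mean, stdev

def cost_spike(history_usd, current_usd, spike_threshold=3.0):
    """Flag when the current period exceeds the historical mean by more
    than spike_threshold standard deviations."""
    mu, sigma = mean(history_usd), stdev(history_usd)
    return sigma > 0 and (current_usd - mu) / sigma > spike_threshold

daily_costs = [40.0, 42.0, 39.0, 41.0, 43.0, 38.0]
print(cost_spike(daily_costs, 44.0))   # False: within normal variance
print(cost_spike(daily_costs, 120.0))  # True: far above the historical mean
```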
Send findings to a Slack channel via an incoming webhook:

```toml
[slack]
webhook_url = "https://hooks.slack.com/services/T00/B00/xxxx"
on_error = true      # send on errors (default: true)
on_warning = false   # send on warnings (default: false)
```

When `olly check` produces qualifying findings, a summary is posted to the configured webhook.
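For reference, a Slack incoming webhook accepts a JSON body with a `text` field. The snippet below sketches what such a post involves; Olly handles this itself, and the summary wording here is invented:

```python
# Illustrative: posting a findings summary to a Slack incoming webhook.
# Olly does this for you; the message format below is made up.
import json
import urllib.request

def build_payload(errors: int, warnings: int) -> dict:
    return {"text": f"olly check: {errors} error(s), {warnings} warning(s)"}

def post_summary(webhook_url: str, errors: int, warnings: int) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_payload(errors, warnings)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # network call; Slack replies "ok" on success

print(build_payload(1, 3))  # {'text': 'olly check: 1 error(s), 3 warning(s)'}
```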
All functionality is available as importable Python modules:
```python
from olly.checker import run_checks
from olly.cli.snapshot import take_snapshot
from olly.config import (
    ConnectionConfig,
    NamedConnection,
    OllyConfig,
    Selection,
    Settings,
)

nc = NamedConnection(
    name="primary",
    connection=ConnectionConfig(type="duckdb", path="warehouse.duckdb"),
    selection=Selection(include_schemas=["main"]),
)
config = OllyConfig(
    connections={"primary": nc},
    settings=Settings(),
)

take_snapshot(config)
findings, dbt_findings, cost_records = run_checks(config)
for finding in findings:
    print(f"[{finding.connection_name}] {finding.check_type}: {finding.description}")
```

Install the dashboard extra and start the server:

```shell
uv add "olly-core[dashboard]"
olly serve
```

The dashboard reads from a state database (written by `olly check`) and displays findings in a web UI at http://127.0.0.1:8000.
Olly uses a snapshot-and-diff model:

1. `olly snapshot` connects to your warehouse via Ibis, introspects schemas and tables, and stores schema info and row counts in a local SQLite database (~/.olly/state.db).
2. `olly check` compares the two most recent snapshots. It runs schema diffs, volume anomaly detection, freshness checks, and any configured integrity/contract/dbt checks.
3. Findings are printed to the terminal and written to the local SQLite database, or to the configured warehouse state schema.

By default, state is fully local: Olly only reads from your warehouse and writes to the ~/.olly/ directory. Optionally, set state_schema in [settings] and run `olly create-state` to store state in your warehouse instead.
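The schema-diff step can be pictured with a toy example, where plain dicts and sets stand in for Olly's snapshot storage (this is an illustration of the model, not Olly's internals):

```python
# Toy illustration of the snapshot-and-diff model; not Olly's code.
def diff_snapshots(prev, curr):
    """Compare two schema snapshots (table -> set of column names)."""
    findings = []
    for table in prev.keys() - curr.keys():
        findings.append(f"Table removed: {table}")
    for table in curr.keys() - prev.keys():
        findings.append(f"New table: {table}")
    for table in prev.keys() & curr.keys():
        for col in prev[table] - curr[table]:
            findings.append(f"Column removed: {table}.{col}")
        for col in curr[table] - prev[table]:
            findings.append(f"New column: {table}.{col}")
    return findings

prev = {"main.customers": {"id", "email"}, "main.orders": {"id"}}
curr = {"main.customers": {"id"}, "main.orders": {"id", "status"}}
print(sorted(diff_snapshots(prev, curr)))
# ['Column removed: main.customers.email', 'New column: main.orders.status']
```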
```shell
uv sync --group dev
uv run pytest tests/
uv run ruff check
uv run ty check
```