One small pattern that made the analytics SQL easier to maintain.
The analytics projects often accumulate binning logic like this:
case
when amount < 2 then '0-2'
when amount < 6 then '2-6'
when amount < 7 then '6-7'
else '7+'
endThere is nothing wrong with this query by itself.
The problem is that the same business logic often ends up copied across multiple models, dashboards, notebooks, and reports.
Then a threshold changes.
0-2
2-6
6-7
7+
becomes
0-2
2-5
5-10
10+
Someone now has to find every CASE WHEN statement and update it correctly.
The work is repetitive, difficult to review, and surprisingly easy to get wrong.
Instead of storing the rules inside many CASE WHEN statements, store the thresholds as data.
For example:
threshold
---------
0
2
6
7
This can be a small table or a dbt seed.
Once the thresholds exist as data, the bin definitions can be generated automatically.
To keep the core concept easy to understand, we’ll first look at a simplified version that handles basic inner boundaries. Later, we'll see how the actual
dbt-binningpackage automatically handles open-ended edges like <0 or 7+ under the hood.
create table thresholds (
threshold int primary key
);
insert into thresholds values
(0),
(2),
(6),
(7);The table only stores boundaries.
Using a window function:
create view bins AS
select
threshold as bin_start,
lead(threshold) over (
order by threshold
) as bin_end,
... as label
from thresholdsWhich becomes:
| bin_start | bin_end | label |
|---|---|---|
| 0 | 2 | 0-2 |
| 2 | 6 | 2-6 |
| 6 | 7 | 6-7 |
| 7 | null | 7+ |
From there, labels can be generated automatically.
The SQL for this step rarely changes.
Once you choose an interval convention, it can be reused for many different binning problems.
Instead of repeating CASE WHEN logic:
select
u.*,
b.label
from users u
left join bins_with_labels b
on u.amount >= b.bin_start
and (
b.bin_end is null
or u.amount < b.bin_end
)Now threshold changes only require updating a small configuration table.
The join logic stays the same.
A few things become easier:
- Thresholds are visible in one place
- Changes are easier to review in Git
- Labels stay consistent
- The same logic can be reused across models
- There is less SQL to maintain
Most importantly, this shifts your workflow from hardcoding business rules in SQL to managing configuration as data.
If you use dbt for your analytics data pipeline, I packaged it into a small dbt package called dbt-binning, it:
- Store thresholds as data
- Generate bins automatically
- Reduce repetitive CASE WHEN / JOIN maintenance
-- models/amount_bins.sql
{{ generate_bins(ref('amount_thresholds')) }}amount_bins is intentionally materialized as a view.
Threshold tables are usually small, and keeping generated ranges as a view makes changes easy to inspect.
| bin_start | bin_end | label |
|---|---|---|
| null | 0 | <0 |
| 0 | 2 | 0-2 |
| 2 | 6 | 2-6 |
| 6 | 7 | 6-7 |
| 7 | null | 7+ |
Finite labels use continuous boundaries: 0-2 means [0, 2), including the start value and excluding the end value. Open-ended labels like 7+ (or <0) include all values greater than or equal to the start (or smaller than the end).
The package provides a bin_join macro to handles the join condition automatically:
-- models/orders_with_amount_bins.sql
select
orders.order_id,
orders.amount,
amount_bins.label as amount_bin
from {{ ref('orders') }} as orders
{{ bin_join(
value='orders.amount',
bins=ref('amount_bins'),
bins_alias='bins'
) }}Note: bins_alias is optional and defaults to bins.
The macro expands to a standard SQL left join, replacing manual boundary join:
left join {{ ref('amount_bins') }} as bins
on (
bins.bin_start is null
and orders.amount < bins.bin_end
)
or (
bins.bin_start is not null
and orders.amount >= bins.bin_start
and (
orders.amount < bins.bin_end
or bins.bin_end is null
)
)This replaces a repeated case when amount ... then ... end block with one threshold seed, one generated bins model, and ordinary SQL joins wherever the bin is needed.
Why not use
BETWEEN? > > Usingbetween b.bin_start and b.bin_endcreates overlapping boundaries for continuous data (e.g., is2.0in0-2or2-6?). This inequality join guarantees half-open intervals[start, end), ensuring each value falls into exactly one bin.
version: 2
seeds:
- name: amount_thresholds
tests:
- thresholds_not_nullThe package expects the seed/table to contain a column named threshold. The included thresholds_not_null test ensures no boundary values are missing.
Duplicate thresholds are ignored when bins are generated, so repeated values do not create invalid zero-width ranges.