Conversation

@MartinquaXD (Contributor) commented Jan 16, 2026

Description

In the RDS UI you can show the DB's active sessions grouped by specific queries.
[Screenshot 2026-01-16 at 18 12 17]
It turns out the purple, dark blue, and one other color all belong to the query fetching the yet-to-be-refunded ethflow orders.
To confirm this, I ran the query on the read replica; it took ~130s in total. Luckily the query could be fixed with a few simple changes.

Changes

  1. Rather than selecting from the orders table and LEFT JOINing the ethflow_orders table onto it, we can simply select from the ethflow_orders table and unconditionally JOIN the orders table onto it. That's because every row in the ethflow_orders table has a corresponding row in the orders table, but not the other way around.
  2. The JOIN clauses now join on ethflow_orders.uid instead of orders.uid, so there is no longer a dependency on the orders join happening first.
  3. Some WHERE filters were moved into the JOIN conditions to eliminate as many rows as early as possible.
  4. We now use the WHERE NOT EXISTS (SELECT 1 WHERE ...) pattern instead of joining a bunch of tables and filtering with regular WHERE clauses (see the sketch after this list).
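
A rough before/after sketch of the restructuring (illustrative only, not the exact refundable_orders query in this PR; table and column names are taken from the execution plans below and the filter details are simplified):

-- Old shape (simplified): select from orders, LEFT JOIN the ethflow tables,
-- and do all the filtering in the WHERE clause afterwards.
SELECT o.uid
FROM orders o
LEFT JOIN ethflow_orders eo ON eo.uid = o.uid
LEFT JOIN ethflow_refunds r ON r.order_uid = o.uid
WHERE NOT o.partially_fillable
  AND eo.valid_to < extract(epoch FROM now())
  AND r.tx_hash IS NULL;

-- New shape (simplified): select from ethflow_orders (every ethflow order has a
-- matching orders row), push filters into the JOIN condition, and express
-- "not yet refunded / invalidated / traded" as NOT EXISTS anti-joins.
SELECT o.uid
FROM ethflow_orders eo
JOIN orders o ON o.uid = eo.uid AND NOT o.partially_fillable
WHERE eo.valid_to < extract(epoch FROM now())
  AND NOT EXISTS (SELECT 1 FROM ethflow_refunds r WHERE r.order_uid = eo.uid)
  AND NOT EXISTS (SELECT 1 FROM onchain_order_invalidations i WHERE i.uid = eo.uid)
  AND NOT EXISTS (SELECT 1 FROM trades t WHERE t.order_uid = eo.uid);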

How to test

Compared the queries on the read replica.
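
For context, a generic way to make such a comparison (an assumed methodology sketch, not necessarily the exact commands used here) is to wrap each variant in EXPLAIN ANALYZE and compare the reported plans and timings:

-- Run on the read replica; ANALYZE actually executes the statement.
-- BUFFERS additionally shows how much data is read from disk vs. cache, which
-- matters here because a cold cache makes the old query much slower.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM ethflow_orders eo
JOIN orders o ON o.uid = eo.uid;  -- placeholder for the old/new query variant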

The old one always takes at least 2 minutes (this run even took ~2.5 minutes):

Original execution plan:
"Nested Loop Anti Join  (cost=3187.87..866030.26 rows=1 width=65) (actual time=5992.536..149371.780 rows=4 loops=1)"
"  ->  Nested Loop Anti Join  (cost=3187.46..866028.94 rows=1 width=122) (actual time=17.215..149300.538 rows=14988 loops=1)"
"        ->  Nested Loop  (cost=3186.90..866027.74 rows=1 width=122) (actual time=17.206..138886.327 rows=400615 loops=1)"
"              ->  Nested Loop  (cost=3186.48..866026.40 rows=1 width=122) (actual time=17.189..125819.381 rows=4210642 loops=1)"
"                    ->  Gather  (cost=3185.92..866023.88 rows=1 width=86) (actual time=17.162..1130.021 rows=7268330 loops=1)"
"                          Workers Planned: 2"
"                          Workers Launched: 2"
"                          ->  Hash Left Join  (cost=2185.92..865023.78 rows=1 width=86) (actual time=17.352..4576.083 rows=2422777 loops=3)"
"                                Hash Cond: (o.uid = o_ref.order_uid)"
"                                Filter: (o_ref.tx_hash IS NULL)"
"                                Rows Removed by Filter: 18169"
"                                ->  Parallel Seq Scan on orders o  (cost=0.00..854817.17 rows=3055503 width=86) (actual time=0.017..3533.277 rows=2440945 loops=3)"
"                                      Filter: (NOT partially_fillable)"
"                                      Rows Removed by Filter: 2330197"
"                                ->  Hash  (cost=1515.41..1515.41 rows=53641 width=90) (actual time=17.183..17.184 rows=54518 loops=3)"
"                                      Buckets: 65536  Batches: 1  Memory Usage: 7008kB"
"                                      ->  Seq Scan on ethflow_refunds o_ref  (cost=0.00..1515.41 rows=53641 width=90) (actual time=0.024..6.696 rows=54518 loops=3)"
"                    ->  Index Scan using order_quotes_pkey on order_quotes oq  (cost=0.56..2.50 rows=1 width=77) (actual time=0.017..0.017 rows=1 loops=7268330)"
"                          Index Cond: (order_uid = o.uid)"
"                          Filter: ((sell_amount = o.sell_amount) AND ((1.0 - (o.buy_amount / GREATEST(buy_amount, '1'::numeric))) >= '0'::numeric))"
"                          Rows Removed by Filter: 0"
"              ->  Index Scan using ethflow_orders_pkey on ethflow_orders eo  (cost=0.42..1.33 rows=1 width=65) (actual time=0.003..0.003 rows=0 loops=4210642)"
"                    Index Cond: (uid = o.uid)"
"                    Filter: ((valid_to < 1768583318) AND ((valid_to - (EXTRACT(epoch FROM o.creation_timestamp))::integer) > 0))"
"                    Rows Removed by Filter: 0"
"        ->  Index Only Scan using trade_order_uid on trades t  (cost=0.56..1.15 rows=2 width=57) (actual time=0.026..0.026 rows=1 loops=400615)"
"              Index Cond: (order_uid = o.uid)"
"              Heap Fetches: 5393"
"  ->  Index Only Scan using onchain_order_invalidations_pkey on onchain_order_invalidations o_inv  (cost=0.41..0.87 rows=1 width=57) (actual time=0.004..0.004 rows=1 loops=14988)"
"        Index Cond: (uid = o.uid)"
"        Heap Fetches: 462"
"Planning Time: 1.210 ms"
"Execution Time: 149371.847 ms"

The new query takes ~1.5s. A query I previously thought was this fast was actually only that fast with a warm cache. So far I have not been able to make this query run significantly slower, but just to be sure I'll run another test tomorrow.
After some more testing it looks like there might be some pathological table-locking behavior or something like that. For runs that take a long time to finish, the execution plan still reports ~1.5s of execution time, so I think those runs are only slow because the tables are locked for an extended period, which does not get accounted for in the execution time.
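
One way to check whether those slow wall-clock runs are spent waiting on locks rather than executing (a generic PostgreSQL diagnostic, not something from this PR) is to look at pg_stat_activity while the query is running:

-- Sessions currently blocked waiting on a lock, and what they are running.
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';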

New execution plan:
"Nested Loop Anti Join  (cost=16738.84..141777.45 rows=1 width=65) (actual time=736.206..1355.895 rows=4 loops=1)"
"  ->  Nested Loop  (cost=16738.42..141776.12 rows=1 width=65) (actual time=594.485..1232.277 rows=53538 loops=1)"
"        Join Filter: ((eo.uid = o.uid) AND ((eo.valid_to - (EXTRACT(epoch FROM o.creation_timestamp))::integer) > 0))"
"        Rows Removed by Join Filter: 1"
"        ->  Nested Loop  (cost=16737.86..141773.92 rows=1 width=142) (actual time=594.461..930.174 rows=53618 loops=1)"
"              ->  Gather  (cost=16737.30..141768.17 rows=1 width=65) (actual time=594.432..669.259 rows=53875 loops=1)"
"                    Workers Planned: 2"
"                    Workers Launched: 2"
"                    ->  Parallel Hash Right Anti Join  (cost=15737.30..140768.07 rows=1 width=65) (actual time=591.433..727.315 rows=17958 loops=3)"
"                          Hash Cond: (t.order_uid = eo.uid)"
"                          ->  Parallel Seq Scan on trades t  (cost=0.00..81835.30 rows=1558630 width=57) (actual time=0.004..155.961 rows=1325521 loops=3)"
"                          ->  Parallel Hash  (cost=11468.06..11468.06 rows=176259 width=65) (actual time=92.350..92.352 rows=146132 loops=3)"
"                                Buckets: 131072  Batches: 8  Memory Usage: 6656kB"
"                                ->  Hash Anti Join  (cost=606.95..11468.06 rows=176259 width=65) (actual time=4.350..54.449 rows=146132 loops=3)"
"                                      Hash Cond: (eo.uid = o_inv.uid)"
"                                      ->  Parallel Seq Scan on ethflow_orders eo  (cost=0.00..8609.66 rows=183243 width=65) (actual time=0.006..17.417 rows=151756 loops=3)"
"                                            Filter: (valid_to < 1768583318)"
"                                            Rows Removed by Filter: 17"
"                                      ->  Hash  (cost=395.31..395.31 rows=16931 width=57) (actual time=4.301..4.302 rows=16898 loops=3)"
"                                            Buckets: 32768  Batches: 1  Memory Usage: 1725kB"
"                                            ->  Seq Scan on onchain_order_invalidations o_inv  (cost=0.00..395.31 rows=16931 width=57) (actual time=0.007..1.699 rows=16898 loops=3)"
"              ->  Index Scan using order_quotes_pkey on order_quotes oq  (cost=0.56..5.75 rows=1 width=77) (actual time=0.004..0.004 rows=1 loops=53875)"
"                    Index Cond: (order_uid = eo.uid)"
"        ->  Index Scan using orders_pkey on orders o  (cost=0.56..2.18 rows=1 width=86) (actual time=0.005..0.005 rows=1 loops=53618)"
"              Index Cond: (uid = oq.order_uid)"
"              Filter: ((NOT partially_fillable) AND (oq.sell_amount = sell_amount) AND ((1.0 - (buy_amount / GREATEST(oq.buy_amount, '1'::numeric))) >= '0'::numeric))"
"              Rows Removed by Filter: 0"
"  ->  Index Only Scan using ethflow_refunds_pkey on ethflow_refunds o_ref  (cost=0.41..0.87 rows=1 width=57) (actual time=0.002..0.002 rows=1 loops=53538)"
"        Index Cond: (order_uid = eo.uid)"
"        Heap Fetches: 359"
"Planning Time: 1.152 ms"
"Execution Time: 1355.953 ms"

@MartinquaXD requested a review from a team as a code owner January 16, 2026 20:49
@gemini-code-assist (bot) left a comment

Code Review

This pull request provides an excellent optimization for the refundable_orders query. The changes made are sound and follow best practices for improving SQL performance in PostgreSQL. Switching the query to start from ethflow_orders, replacing LEFT JOIN with NOT EXISTS for checking non-existence, and moving filter conditions into the ON clauses of the JOINs are all effective strategies that have clearly paid off, as shown by the dramatic reduction in execution time. The resulting query is not only more performant but also more readable and maintainable. This is a high-quality contribution.

@MartinquaXD (Contributor, Author) commented

I experimented a bit more, and the biggest culprit appears to be that the query does a sequential scan over the entire trades table. I think the most pragmatic way forward is to limit the lookback window to 1 week or so. This reliably results in ~300ms runtimes on the prod DB while still being very generous.
In other words, this would mean trying to refund an order for up to 1 week and giving up if it still doesn't work. IMO this seems reasonable. (Also, any orders that failed to get refunded within 1 week can still be refunded manually by running the same query with the lookback-window filter removed; a sketch of the filter follows below.)
Do you agree that the lookback window is okay? Are you fine with just hardcoding the 1 week in the query, or do you see value in making this configurable?
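
For illustration, the lookback window could be a filter on the order's creation timestamp (a hypothetical sketch of the idea, not the exact filter added to the query; creation_timestamp is taken from the execution plans above):

SELECT o.uid
FROM ethflow_orders eo
JOIN orders o ON o.uid = eo.uid
  -- hypothetical lookback window: only consider orders created in the last week,
  -- so the planner no longer has to scan old trades/orders rows
  AND o.creation_timestamp > now() - interval '1 week'
WHERE eo.valid_to < extract(epoch FROM now());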

@MartinquaXD (Contributor, Author) commented Jan 18, 2026

These are the changes in DB metrics after temporarily deploying this + the 1-week lookback window + #4062. The deployment happened only on mainnet (yellow bars in the second graph), and since all network DBs run on the same Postgres instance, the improvement will be even more pronounced when the fix gets rolled out to all networks. While I assumed the change would reduce the CPU usage of the DB, it actually resolved an IO bottleneck: due to constant full table scans the DB had to page memory in and out of disk a ton.
According to Claude, these graphs mean:

  • fewer requests are queued up waiting for IO
  • IO is no longer being throttled
  • more headroom for traffic spikes

On the services side this appeared to lower the variance in latency of some DB queries (but not all).
[Screenshot 2026-01-18 at 08 37 16]
[Screenshot 2026-01-18 at 08 39 12]
[Screenshot 2026-01-18 at 08 46 00]

@squadgazzz (Contributor) left a comment

Great improvement!

@MartinquaXD added this pull request to the merge queue Jan 19, 2026
Merged via the queue into main with commit d0eecdd Jan 19, 2026
19 checks passed
@MartinquaXD deleted the speed-up-refundable-orders branch January 19, 2026 14:44
@github-actions (bot) locked and limited conversation to collaborators Jan 19, 2026