Skip to content

handle non-deterministic queries in ClickBench test #458

@Rich-T-kid

Description

@Rich-T-kid

Context
#451 & discussion on #452.
Click bench queries [17,21,22,24,31,32] run non-deterministic queries, this makes it difficult to compare record batches produced by these queries between vanilla DataFusion and distributed DataFusion.
Why? the Limit operator can return any X rows it wants, this means that running the query twice can produce different but equally correct results.
initially in #452 I added in a flag to do a schema check & row count comparison but skip the row by row comparison as it will not give an accurate comparison of the queries. This solution is too loose, as it does no comparison.
A better approach may be to run DataFusion's vanilla query and take the X rows returned. We would then slightly alter the distributed query to run the exact same query but without the limit operator, and wrap it in an in clause. This way, we can verify that the aggregations and filters line up.

example flow for clickbench q17
q17 -> SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;

  1. run q17 on vanilla DataFusion, you'll get back 10 tuples of 3 elements
  2. take q17 and remove the limit -> SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ; this will be called q17_altered
  3. restructure the query so the results are re-usable > WITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" )
  4. pass in the rows that vanilla DataFusion returned in step 1 as a in clause. if the results match up then the query ran correctly. distributed Datafusion runs WITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" ) SELECT * FROM agg WHERE ("UserID", "SearchPhrase") IN ( (123, 'phrase_a'), (456, 'phrase_b'), -- ... from vanilla )

With this approach, we validate that the exact same results exist in both engines after aggregations and filters are run. It will be slower than the current implementation, but I think that's appropriate since these tests are meant to determine that the results are correct, as opposed to benchmarking performance.
cc @gabotechs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions