handle non-deterministic queries in ClickBench test

**Context**
#451 & discussion on #452.
Click bench queries [17,21,22,24,31,32] run non-deterministic queries, this makes it difficult to compare record batches produced by these queries between vanilla DataFusion and distributed DataFusion. 
Why? the `Limit` operator can return any **X** rows it wants, this means that running the query twice can produce different but equally correct results.
 initially in  #452 I added in a flag to do a schema check & row count comparison but skip the row by row comparison as it will not give an accurate comparison of the queries. This solution is [too loose](https://github.com/datafusion-contrib/datafusion-distributed/pull/452#discussion_r3252663747), as it does no comparison. 
A better approach may be to run DataFusion's vanilla query and take the **X** rows returned. We would then slightly alter the distributed query to run the exact same query but without the limit operator, and wrap it in an in clause. This way, we can verify that the aggregations and filters line up.

example flow for clickbench q17
q17 -> `SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;`

1. run q17 on vanilla DataFusion, you'll get back 10 tuples of 3 elements
2. take q17 and remove the limit -> `SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ;` this will be called q17_altered
3. restructure the query so the results are re-usable > `WITH agg AS (
    SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt
    FROM hits
    GROUP BY "UserID", "SearchPhrase"
)`
4.  pass in the rows that vanilla DataFusion returned in step 1 as a `in` clause. if the results match up then the query ran correctly. distributed Datafusion  runs  `
WITH agg AS (
    SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt
    FROM hits
    GROUP BY "UserID", "SearchPhrase"
)
SELECT *
FROM agg
WHERE ("UserID", "SearchPhrase") IN (
    (123, 'phrase_a'),
    (456, 'phrase_b'),
    -- ...  from vanilla
)
`


With this approach, we validate that the exact same results exist in both engines after aggregations and filters are run. It will be slower than the current implementation, but I think that's appropriate since these tests are meant to determine that the results are correct, as opposed to benchmarking performance.
cc @gabotechs 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle non-deterministic queries in ClickBench test #458

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

handle non-deterministic queries in ClickBench test #458

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions