Context
#451 & discussion on #452.
Click bench queries [17,21,22,24,31,32] run non-deterministic queries, this makes it difficult to compare record batches produced by these queries between vanilla DataFusion and distributed DataFusion.
Why? the Limit operator can return any X rows it wants, this means that running the query twice can produce different but equally correct results.
initially in #452 I added in a flag to do a schema check & row count comparison but skip the row by row comparison as it will not give an accurate comparison of the queries. This solution is too loose, as it does no comparison.
A better approach may be to run DataFusion's vanilla query and take the X rows returned. We would then slightly alter the distributed query to run the exact same query but without the limit operator, and wrap it in an in clause. This way, we can verify that the aggregations and filters line up.
example flow for clickbench q17
q17 -> SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
- run q17 on vanilla DataFusion, you'll get back 10 tuples of 3 elements
- take q17 and remove the limit ->
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ; this will be called q17_altered
- restructure the query so the results are re-usable >
WITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" )
- pass in the rows that vanilla DataFusion returned in step 1 as a
in clause. if the results match up then the query ran correctly. distributed Datafusion runs WITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" ) SELECT * FROM agg WHERE ("UserID", "SearchPhrase") IN ( (123, 'phrase_a'), (456, 'phrase_b'), -- ... from vanilla )
With this approach, we validate that the exact same results exist in both engines after aggregations and filters are run. It will be slower than the current implementation, but I think that's appropriate since these tests are meant to determine that the results are correct, as opposed to benchmarking performance.
cc @gabotechs
Context
#451 & discussion on #452.
Click bench queries [17,21,22,24,31,32] run non-deterministic queries, this makes it difficult to compare record batches produced by these queries between vanilla DataFusion and distributed DataFusion.
Why? the
Limitoperator can return any X rows it wants, this means that running the query twice can produce different but equally correct results.initially in #452 I added in a flag to do a schema check & row count comparison but skip the row by row comparison as it will not give an accurate comparison of the queries. This solution is too loose, as it does no comparison.
A better approach may be to run DataFusion's vanilla query and take the X rows returned. We would then slightly alter the distributed query to run the exact same query but without the limit operator, and wrap it in an in clause. This way, we can verify that the aggregations and filters line up.
example flow for clickbench q17
q17 ->
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ;this will be called q17_alteredWITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" )inclause. if the results match up then the query ran correctly. distributed Datafusion runsWITH agg AS ( SELECT "UserID", "SearchPhrase", COUNT(*) AS cnt FROM hits GROUP BY "UserID", "SearchPhrase" ) SELECT * FROM agg WHERE ("UserID", "SearchPhrase") IN ( (123, 'phrase_a'), (456, 'phrase_b'), -- ... from vanilla )With this approach, we validate that the exact same results exist in both engines after aggregations and filters are run. It will be slower than the current implementation, but I think that's appropriate since these tests are meant to determine that the results are correct, as opposed to benchmarking performance.
cc @gabotechs