From 355f102c4027355ed427411080ee420a19ca48d1 Mon Sep 17 00:00:00 2001 From: "Mingyu Chen (Rayner)" Date: Wed, 27 May 2026 17:28:14 +0800 Subject: [PATCH] [fix](regression-test) stabilize 2 muted external_table_p0 tests (#63646) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Both tests have been muted on the External Regression pipeline due to long-standing flakiness (analysis based on TeamCity build #92687 / id 953050). Neither is a real product bug — both are test-side robustness issues. ### `test_file_cache_query_limit` (~50% pass rate) After `POST /api/file_cache?op=clear&sync=true` the test waited exactly one `file_cache_background_monitor_interval_ms` window and then asserted `normal_queue_curr_size == 0` once. The counters surfaced by `information_schema.file_cache_statistics` are republished by the background monitor on its own cadence, so a single fixed-time wait races the refresh and the assert fails roughly half the time even when the cache really is empty. - Replace the four wait-then-assert blocks (`size == 0` after clear, `size > 0` after a query) with `Awaitility`-based polling (already imported) on the relevant metric until the predicate holds, with a `max(30s, 6 × monitor_interval)` timeout. - The original `assertFalse(...)` calls with their metric-specific messages are kept as the final guard, so real failures still surface a precise reason. - The two waits for BE config propagation (`enable_file_cache_query_limit` flip) are left untouched — not in the failure path. ### `test_hive_query_cache` (~20–25% fail rate) The `test { sql ...; time 20000 }` block at L122 ran TPC-H Q9 against containerized hive parquet with `enable_sql_cache=false` set above, so the 20s upper bound was timing a cold 6-table join, not a cache hit. The query routinely exceeds 20s under cluster load. - Drop the time guard; the `qt_tpch_1sf_q09` above already validates correctness, and the cache behavior is exercised in the subsequent blocks that explicitly enable sql cache. ## Test plan - [ ] Run External Regression pipeline on this PR and confirm both cases pass. - [ ] After 5+ consecutive green runs, follow up to unmute these cases in TeamCity. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 --- .../cache/test_file_cache_query_limit.groovy | 64 ++++++++++--------- .../hive/test_hive_query_cache.groovy | 11 ++-- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy index a6ccf96c38c72a..1893a4ad6c3792 100644 --- a/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy +++ b/regression-test/suites/external_table_p0/cache/test_file_cache_query_limit.groovy @@ -71,6 +71,26 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, String hms_port = context.config.otherConfigs.get(hivePrefix + "HmsPort") int queryCacheCapacity + // Poll a file_cache_statistics metric until predicate holds, or until timeout. + // file_cache_statistics is refreshed by the background monitor on its own cadence, + // so waiting a single fixed interval (the previous behavior) races the refresh and + // makes assertions flaky. On timeout we swallow the exception so the caller's + // assertFalse below can surface its own metric-specific message. + def pollFileCacheMetric = { String metricName, Closure predicate, long timeoutSeconds -> + try { + Awaitility.await() + .atMost(timeoutSeconds, TimeUnit.SECONDS) + .pollInterval(1, TimeUnit.SECONDS) + .until { + def r = sql """select METRIC_VALUE from information_schema.file_cache_statistics + where METRIC_NAME = '${metricName}' limit 1;""" + return r.size() > 0 && predicate(Double.valueOf(r[0][0])) + } + } catch (org.awaitility.core.ConditionTimeoutException ignored) { + // fall through; the caller's assert will surface the precise failure + } + } + sql """drop catalog if exists ${catalog_name} """ sql """CREATE CATALOG ${catalog_name} PROPERTIES ( @@ -147,14 +167,13 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, def totalWaitTime = (fileCacheBackgroundMonitorIntervalMsResult[0][3].toLong() / 1000) as int def interval = 1 def iterations = totalWaitTime / interval + long pollTimeoutSeconds = Math.max(30L, (long) totalWaitTime * 6L) - // Waiting for file cache clearing - (1..iterations).each { count -> - Thread.sleep(interval * 1000) - def elapsedSeconds = count * interval - def remainingSeconds = totalWaitTime - elapsedSeconds - logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining") - } + // Poll until the cache clear has drained the LRU queue. The HTTP clear endpoint with sync=true + // deletes blocks synchronously, but the queue counters are republished by the background monitor + // thread on its own cadence — so a single fixed-time wait can race the refresh. + pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, pollTimeoutSeconds) + pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, pollTimeoutSeconds) def initialNormalQueueCurrSizeResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics where METRIC_NAME = 'normal_queue_curr_size' limit 1;""" @@ -162,7 +181,6 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, assertFalse(initialNormalQueueCurrSizeResult.size() == 0 || Double.valueOf(initialNormalQueueCurrSizeResult[0][0]) != 0.0, INITIAL_NORMAL_QUEUE_CURR_SIZE_NOT_ZERO_MSG) - // Check normal queue current elements def initialNormalQueueCurrElementsResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics where METRIC_NAME = 'normal_queue_curr_elements' limit 1;""" logger.info("normal_queue_curr_elements result: " + initialNormalQueueCurrElementsResult) @@ -199,13 +217,9 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, // load the table into file cache sql query_sql - // Waiting for file cache statistics update - (1..iterations).each { count -> - Thread.sleep(interval * 1000) - def elapsedSeconds = count * interval - def remainingSeconds = totalWaitTime - elapsedSeconds - logger.info("Waited for file cache statistics update ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining") - } + // Poll until the query has populated the cache. + pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, pollTimeoutSeconds) + pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, pollTimeoutSeconds) def baseNormalQueueCurrElementsResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics where METRIC_NAME = 'normal_queue_curr_elements' limit 1;""" @@ -247,13 +261,9 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, logger.info("File cache clear command output: ${output.toString()}") assertTrue(exitCode == 0, "File cache clear failed with exit code ${exitCode}. Error: ${errorOutput.toString()}") - // Waiting for file cache clearing - (1..iterations).each { count -> - Thread.sleep(interval * 1000) - def elapsedSeconds = count * interval - def remainingSeconds = totalWaitTime - elapsedSeconds - logger.info("Waited for file cache clearing ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining") - } + // Poll until the file cache is fully cleared again. + pollFileCacheMetric('normal_queue_curr_size', { it == 0.0 }, pollTimeoutSeconds) + pollFileCacheMetric('normal_queue_curr_elements', { it == 0.0 }, pollTimeoutSeconds) // ===== Normal Queue Metrics Check ===== // Check normal queue current size @@ -337,13 +347,9 @@ suite("test_file_cache_query_limit", "external_docker,hive,external_docker_hive, // load the table into file cache sql query_sql - // Waiting for file cache statistics update - (1..iterations).each { count -> - Thread.sleep(interval * 1000) - def elapsedSeconds = count * interval - def remainingSeconds = totalWaitTime - elapsedSeconds - logger.info("Waited for file cache statistics update ${elapsedSeconds} seconds, ${remainingSeconds} seconds remaining") - } + // Poll until the query has populated the cache under the new file_cache_query_limit. + pollFileCacheMetric('normal_queue_curr_size', { it > 0.0 }, pollTimeoutSeconds) + pollFileCacheMetric('normal_queue_curr_elements', { it > 0.0 }, pollTimeoutSeconds) // Get updated value of normal queue current elements and max elements after cache operations def updatedNormalQueueCurrSizeResult = sql """select METRIC_VALUE from information_schema.file_cache_statistics diff --git a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy index 947c8c2248f153..63ed19d5429243 100644 --- a/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy +++ b/regression-test/suites/external_table_p0/hive/test_hive_query_cache.groovy @@ -118,11 +118,12 @@ suite("test_hive_query_cache", "p0,external,hive,external_docker,external_docker sql """use `tpch1_parquet`""" qt_tpch_1sf_q09 "${tpch_1sf_q09}" sql "${tpch_1sf_q09}" - - test { - sql "${tpch_1sf_q09}" - time 20000 - } + // NOTE: enable_sql_cache=false is set above, so a `test { ... time 20000 }` block here is + // NOT testing SQL cache — it is timing a cold TPC-H Q9 over containerized hive parquet, + // which routinely exceeds 20s under load. Run the query without the time guard; the qt_ + // above already validates correctness. Cache behavior is verified in the blocks below + // that explicitly set enable_sql_cache=true. + sql "${tpch_1sf_q09}" // test sql cache with empty result try {