25.8 Antalya backport of #90825: Add role-based access to Glue catalog (#1428)
zvonand merged 9 commits into antalya-25.8
Conversation
Add role-based access to Glue catalog
Force-pushed from cef9232 to fba0893
Force-pushed from fba0893 to c96e9ea
Integration Test Failure (all 12 tests)

Root Cause: The new call site passes 4-element tuples:

```python
start_mock_servers(
    started_cluster, script_dir,
    [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", args)],  # 4 elements
)
```

But the loop unpacks only three:

```python
for server_name, container, port in mocks:  # ValueError: too many values to unpack
```

The upstream repo has an updated version of this helper.

Traceback

Impact: This is directly caused by this PR. Since the fixture crashes, all 12 tests fail.
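A minimal sketch of the unpacking mismatch and one way to fix it, using illustrative stand-in names (these functions mimic the discussion above, not the actual integration-test harness):

```python
# Illustrative stand-ins for the integration-test fixture code; names
# mirror the discussion above but this is not the actual harness.

# The new call site passes 4-element tuples (script, host, port, extra args):
mocks = [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", ["--foo"])]

def start_mock_servers_old(mocks):
    # Old loop unpacks exactly three fields, so a 4-tuple raises
    # "ValueError: too many values to unpack (expected 3)".
    for server_name, container, port in mocks:
        print(server_name, container, port)

def start_mock_servers_fixed(mocks):
    # Tolerant loop: any optional trailing element is collected into `rest`,
    # so both 3-tuples and 4-tuples unpack cleanly.
    started = []
    for server_name, container, port, *rest in mocks:
        args = rest[0] if rest else []
        started.append((server_name, container, port, args))
    return started
```

With this shape, 3-tuples from older call sites and the new 4-tuples both work, which is presumably what the updated upstream helper accomplishes.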
@Selfeer I saw that, I'll fix it. The problem is that the regression run fails.
Regression aarch64 swarms - Failure Analysis

PR: #1428 | Workflow: 22222536503 | Version:
Reason: Not related to this specific PR
Report: report.html

Summary

Failed Scenarios & Historical Flakiness
Root Cause Analysis

All 6 failures originate from
Conclusion
Appendix: Queries Used

All queries below were run against the `gh-data`.clickhouse_regression_results table.

1. Historical pass/fail ratios per scenario (all-time)

This query computes the historical pass/fail percentage for each of the failed scenarios across all recorded runs.

```sql
SELECT
    test_name,
    result,
    count() AS cnt,
    round(count() * 100.0 / sum(count()) OVER (PARTITION BY test_name), 1) AS pct
FROM `gh-data`.clickhouse_regression_results
WHERE test_name IN (
    '/swarms/feature/node failure/check restart clickhouse on swarm node',
    '/swarms/feature/node failure/check restart swarm node',
    '/swarms/feature/node failure/cpu overload',
    '/swarms/feature/node failure/initiator out of disk space',
    '/swarms/feature/node failure/network failure',
    '/swarms/feature/node failure/swarm out of disk space',
    '/swarms/feature/swarm joins',
    '/swarms/feature/swarm union'
)
GROUP BY test_name, result
ORDER BY test_name, result;
```

Result:
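The window-function percentage in the query above can be mirrored in plain Python to show what it computes; the rows below are made-up illustrative data, not actual results from the regression table:

```python
from collections import Counter

def pass_fail_pct(rows):
    # Mirrors: count() * 100.0 / sum(count()) OVER (PARTITION BY test_name),
    # i.e. each (test, result) count as a percentage of that test's total runs.
    counts = Counter(rows)
    totals = Counter(test for test, _ in rows)
    return {
        (test, result): round(cnt * 100.0 / totals[test], 1)
        for (test, result), cnt in counts.items()
    }

# Made-up sample data for illustration only.
rows = [
    ("/swarms/feature/swarm joins", "OK"),
    ("/swarms/feature/swarm joins", "OK"),
    ("/swarms/feature/swarm joins", "Fail"),
    ("/swarms/feature/swarm union", "OK"),
]
```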
2. Recent timeline for disk space scenarios

This query shows the recent pass/fail timeline for the disk-space and node-failure swarm scenarios.

```sql
SELECT
    test_name,
    result,
    start_time,
    clickhouse_version,
    job_url
FROM `gh-data`.clickhouse_regression_results
WHERE test_name LIKE '%swarm%'
  AND (
       test_name LIKE '%kill swarm%'
    OR test_name LIKE '%out of disk%'
    OR test_name LIKE '%initiator%'
    OR test_name LIKE '%kill initiator%'
    OR test_name LIKE '%node failure%feature%'
  )
ORDER BY start_time DESC
LIMIT 200;
```

Key observation: The most recent ~30 entries for both
'/tiered storage/with s3gcs/background move/max move factor': https://github.com/Altinity/ClickHouse/actions/runs/22222536503/job/64284577377 is also not related and looks like an infra failure: the test queries
There is a swarms failure that doesn't seem to be related to the changes in this PR; in fact, these are the same failures described here: #1428 (comment). @alsugiliazova can we mark this as verified regardless of the issue that was raised? ClickHouse#97858
I think it is OK (as long as this behavior matches upstream and they are OK with that). It also does not look like a problem to me at all.
This PR does not include the S3 credentials cache (it was added in 25.11 and 25.12 and is not in the scope of this backport), so performance may differ from the latest upstream. Backporting the cache logic would be quite heavy. The code in this PR is actually somewhat similar to what upstream had before 91706 (which is related, but its backport was not requested).
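For context, the credentials cache mentioned above amounts to memoizing temporary credentials until shortly before they expire, avoiding a fresh assume-role round trip on every request. A generic sketch under that assumption (names and structure are illustrative, not the actual upstream implementation):

```python
import time

class CredentialsCache:
    """Generic TTL cache for temporary credentials. Illustrative only --
    not the upstream ClickHouse S3 credentials cache."""

    def __init__(self, fetch, refresh_margin_s=60.0):
        self._fetch = fetch            # callable returning (creds, expiry_ts)
        self._margin = refresh_margin_s
        self._creds = None
        self._expiry = 0.0

    def get(self):
        # Refresh only when missing or within the margin of expiry;
        # otherwise reuse the cached credentials.
        now = time.time()
        if self._creds is None or now >= self._expiry - self._margin:
            self._creds, self._expiry = self._fetch()
        return self._creds
```

Without such a cache, each query path may re-fetch credentials, which is the likely source of the performance difference noted above.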
25.8 does not include
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add role-based access to Glue catalog. Use settings `aws_role_arn` and, optionally, `aws_role_session_name`. (ClickHouse#90825 by @antonio2368)

CI/CD Options
Exclude tests:
Regression jobs to run: