Skip to content

Better bucket regexes#259

Merged
TheTechromancer merged 100 commits intodevfrom
better-bucket-regexes
May 4, 2026
Merged

Better bucket regexes#259
TheTechromancer merged 100 commits intodevfrom
better-bucket-regexes

Conversation

@TheTechromancer
Copy link
Copy Markdown
Collaborator

@TheTechromancer TheTechromancer commented May 4, 2026

Summary

Switch storage bucket regexes from positional capture groups to named groups, and surface region info where the hostname carries it.

Previously, every STORAGE_BUCKET_HOSTNAME regex used a fixed contract: group 1 = bucket name, group 2 = trailing domain. That breaks down once some providers expose region in the URL and others don't — consumers can't tell which group means what without per-provider logic. With named groups, consumers just call match.groupdict() and read name / region (the latter only present when meaningful).

This is a breaking change for downstream consumers indexing by group number, so the artifact is published as cloud_providers_v3.json and the package version is bumped to 10.0.0. cloud_providers_v2.json is left in place, frozen, so legacy consumers keep resolving until they migrate.

Changes

  • 6 providers updated to use (?P<name>…) (and (?P<region>…) where applicable):
    • Amazon — split into 3 patterns covering legacy global (bucket.s3.amazonaws.com), dash-style (bucket.s3-us-west-2.amazonaws.com), and dot-style (bucket.s3.eu-central-1.amazonaws.com).
    • DigitalOcean — region (nyc3, sfo2, …) broken out of the trailing domain.
    • Hetzner — rewritten to capture region (fsn1, nbg1, hel1); the previous regex matched the wrong URL shape.
    • Google / Microsoft / Cloudflarename-only (no region in their bucket URLs).
  • Artifact rename: cloud_providers_v2.jsoncloud_providers_v3.json across cloudcheck_update, src/lib.rs (signature URL + cache path), scripts/update_readme_table.py, the daily-update workflow, and the README.
  • Version bump: pyproject.toml and Cargo.toml10.0.0.
  • Tests: parametrized coverage in test_cloudcheck.py exercising real hostnames per provider and verifying groupdict(), plus a sweep that asserts every provider's regexes compile and expose a name group.

Test plan

  • uv run pytest test_cloudcheck.py -k "regex or import_provider" — the new + existing regex tests pass.
  • uv run python -m cloudcheck_update.cli — regenerates cloud_providers_v3.json with the new regex strings round-tripped intact.
  • After merge, the daily-update workflow publishes cloud_providers_v3.json to stable; the network-dependent test_lookup_* tests resume passing once the file is live at the new URL.

Relevant:

TheTechromancer and others added 30 commits January 25, 2026 21:43
Readme updates + daily sig update
@TheTechromancer TheTechromancer requested a review from liquidsec May 4, 2026 18:43
@TheTechromancer TheTechromancer self-assigned this May 4, 2026
@TheTechromancer TheTechromancer merged commit 5ff4043 into dev May 4, 2026
29 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants