
fix(backup): per-schedule scope + control-plane toggle + R2 tagging tolerance#92

Merged
dviejokfs merged 22 commits into main from fix/backup-schedule-scope-and-r2
May 22, 2026

@dviejokfs dviejokfs commented May 19, 2026

Summary

Two fixes that came out of a hands-on session on a Cloudflare R2-backed
control plane, plus a process-docs update:

  1. Operators control what a schedule backs up. Schedules used to fan
    out to every external service plus an unavoidable control-plane row.
    New target_all_services + include_control_plane flags + an
    explicit backup_schedule_services join table give granular scope.
    Defaults preserve current behaviour on upgrade.
  2. Backups no longer fail on R2 with "service error". The AWS SDK
    collapses every failure into a bare "service error", which hid that R2
    returns 501 NotImplemented on both x-amz-tagging upload headers and
    standalone PutObjectTagging calls. Tagging is now best-effort, so
    backups complete, and failures show the real HTTP status / service
    code / request id / response body.
  3. Process docs. New AGENTS.md with the short list of conventions
    future agents need (always CHANGELOG, use the generated SDK,
    pre-commit-hook timing). CHANGELOG updated for both commits in this
    PR.

What landed (by commit)

  • a6b20a67 — Per-schedule service scope + control-plane toggle + R2
    fixes (one commit because pre-commit hooks take ~3-5min and I'd rather
    run them once than four times):
    • Migrations: backup_schedule_services join table,
      target_all_services flag, include_control_plane flag.
    • BackupService: attach/detach/list-for-schedule/list-for-service
      + validators that reject empty fan-out + fan-out branches on both
      flags + clears membership when flipping target_all_services to
      true.
    • 4 new HTTP endpoints with audit logging + OpenAPI.
    • UI: ScheduleServicesSelector component + "All databases /
      Specific databases" radio + "Also back up the Temps control plane"
      switch on Create & Edit pages; ScheduleDetail surfaces both flags.
    • describe_sdk_error helper that extracts HTTP status, service
      code, request_id, x-amz-id-2, truncated body from every SdkError
      variant; used at every upload site.
    • apply_object_tags calls PutObjectTagging after upload and treats
      failures matching is_unsupported_error as best-effort (warn only).
    • Drops web/src/lib/backup-schedules.ts (hand-rolled fetch shim);
      EditBackupSchedule migrated to generated SDK.
    • CLI auth + CliLogin work bundled in to clear the dirty state.
  • daf99a13 — CHANGELOG entries for the above + new AGENTS.md.

Tests

  • 6 unit tests on BackupService (MockDatabase, Docker-skip) covering
    attach/detach/list error paths.
  • 3 integration tests against TestDatabase:
    integration_attach_list_detach_round_trip,
    integration_flip_to_all_clears_membership,
    integration_fan_out_skips_control_plane_when_flag_off.
  • 2 regression tests pinning the exact R2 error strings so a future
    SDK upgrade can't silently regress the matcher.
  • cargo check --lib clean across the workspace.

Test plan

  • Run cargo test --lib -p temps-backup; all new tests pass.
  • Apply migrations on a dev DB; verify existing schedules backfill
    to target_all_services=true / include_control_plane=true.
  • In the UI: create a schedule with "Specific databases" + control
    plane off, attach one DB, "Run now" — assert one backup row (no
    control_plane sidecar).
  • In the UI: edit a schedule; flip "All databases" → "Specific",
    attach two services, save; flip back to "All", save; flip to
    "Specific" again — membership must be empty (cleared on the
    first flip-to-all).
  • Trigger a backup against the Cloudflare R2 bucket the user
    tested with (backups-test) — backup should succeed, with a
    warn log under target temps_backup::tagging rather than a
    failure.
  • Force a real S3 error (e.g. wrong bucket name) and confirm the
    backups.error_message column now contains HTTP status + code +
    request_id + body instead of "service error".

dviejokfs added 2 commits May 19, 2026 22:37
…fixes

Three changes in one commit so pre-commit hooks only run once.

1) Per-schedule service membership + scope flags

Schedules previously fanned out to every external service on the host
with no operator control. New shape:

  - backup_schedule_services join table (migration m20260519_000001)
    so a schedule can target a specific list of databases.
  - target_all_services flag on backup_schedules (migration
    m20260519_000002, defaults TRUE) for the common case "back up
    every database -- including ones I add later". When false, fan-out
    reads the explicit join table.
  - include_control_plane flag (migration m20260519_000003, defaults
    TRUE). Previously every fan-out unconditionally produced a
    control_plane backup; operators using Temps purely to orchestrate
    external DB backups can now opt out.

Service layer (BackupService):
  - attach/detach/list-services-for-schedule + list-schedules-for-service
  - create/update validators reject a state that would have nothing to
    back up (control plane off + target_all_services off + no attached
    services).
  - When flipping target_all_services -> true, the explicit membership
    rows are cleared ("all means all").
  - Fan-out (enqueue_scheduled_run) branches on both flags.
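
The scope rules above can be sketched in a few lines. This is an illustration only, not the BackupService code: ScheduleScope, BackupTarget, set_target_all and fan_out are hypothetical names, service ids are assumed to be i32, and the ordering of the control-plane target is an assumption.

```rust
/// Hypothetical, simplified view of a schedule's scope (not the real model).
#[derive(Debug, Clone, PartialEq)]
pub struct ScheduleScope {
    pub target_all_services: bool,
    pub include_control_plane: bool,
    /// Explicit membership; stands in for rows of backup_schedule_services.
    pub attached_service_ids: Vec<i32>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum BackupTarget {
    ControlPlane,
    Service(i32),
}

impl ScheduleScope {
    /// Reject a state that would have nothing to back up:
    /// control plane off + target_all_services off + no attached services.
    pub fn validate(&self) -> Result<(), String> {
        if !self.include_control_plane
            && !self.target_all_services
            && self.attached_service_ids.is_empty()
        {
            return Err("schedule has nothing to back up".to_string());
        }
        Ok(())
    }

    /// Flipping to "all" clears the explicit membership ("all means all").
    pub fn set_target_all(&mut self, all: bool) {
        if all {
            self.attached_service_ids.clear();
        }
        self.target_all_services = all;
    }

    /// Branch on both flags to decide what a scheduled run enqueues.
    pub fn fan_out(&self, all_service_ids: &[i32]) -> Vec<BackupTarget> {
        let mut targets = Vec::new();
        if self.include_control_plane {
            targets.push(BackupTarget::ControlPlane);
        }
        let ids: &[i32] = if self.target_all_services {
            all_service_ids
        } else {
            self.attached_service_ids.as_slice()
        };
        targets.extend(ids.iter().copied().map(BackupTarget::Service));
        targets
    }
}
```

Because membership is cleared on the flip to "all", flipping back to "specific" starts from an empty list, which is the behaviour the UI test plan checks.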

Handlers: 4 new endpoints (GET/POST /backups/schedules/{id}/services,
DELETE /backups/schedules/{id}/services/{service_id},
GET /backups/external-services/{service_id}/schedules) with audit
logging + OpenAPI registration.

UI:
  - ScheduleServicesSelector reusable component (checkbox list +
    "Select all" with indeterminate state, exclude-already-attached).
  - CreateBackupSchedule + EditBackupSchedule: "All databases" /
    "Specific databases" radio + "Also back up the Temps control plane"
    Switch.
  - ScheduleDetail: surfaces both flags in the Schedule Configuration
    card; the per-service attach/detach card only renders in 'specific'
    mode.

Tests: 6 unit tests (MockDatabase, Docker-skip) + 3 integration tests
covering attach/detach round-trip, flip-to-all-clears-membership, and
fan-out skips control plane when the flag is off.

2) Real S3 errors + R2 tagging tolerance

Every SDK call site used to swallow rich SdkError data behind
format!("...: {}", e), rendering as "service error" for any 4xx/5xx.
Added describe_sdk_error in v2_common: pattern-matches on
SdkError::{ConstructionFailure,TimeoutError,DispatchFailure,
ResponseError,ServiceError} and extracts HTTP status, service code,
request_id, x-amz-id-2, and a truncated response body. All upload
sites (single-part, create/upload/complete multipart, metadata
companion, head_bucket) plus the three From<SdkError> impls in
services/backup.rs now use it.
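
For readers unfamiliar with the shape of the helper, here is a rough sketch of the idea. It does not use the real AWS SDK types: FakeSdkError is a stand-in enum that only mirrors the five variant names above, and describe_error, truncate_body, MAX_BODY_CHARS and the output format are all assumptions made for illustration.

```rust
/// Stand-in for the AWS SDK's SdkError. This is NOT the real type; it only
/// mirrors the five variant names listed in the commit message.
pub enum FakeSdkError {
    ConstructionFailure(String),
    TimeoutError(String),
    DispatchFailure(String),
    ResponseError { status: u16, body: String },
    ServiceError {
        status: u16,
        code: Option<String>,
        request_id: Option<String>,
        extended_request_id: Option<String>, // value of the x-amz-id-2 header
        body: String,
    },
}

/// Assumed cap; the real truncation length is not stated in the PR.
pub const MAX_BODY_CHARS: usize = 512;

/// Keep at most MAX_BODY_CHARS characters and mark longer bodies as truncated.
pub fn truncate_body(body: &str) -> String {
    let mut out: String = body.chars().take(MAX_BODY_CHARS).collect();
    if body.chars().count() > MAX_BODY_CHARS {
        out.push_str("...[truncated]");
    }
    out
}

/// Turn every variant into a message that keeps the diagnostic fields.
pub fn describe_error(err: &FakeSdkError) -> String {
    match err {
        FakeSdkError::ConstructionFailure(m) => format!("request construction failed: {m}"),
        FakeSdkError::TimeoutError(m) => format!("request timed out: {m}"),
        FakeSdkError::DispatchFailure(m) => format!("dispatch failed: {m}"),
        FakeSdkError::ResponseError { status, body } => {
            format!("unparseable response: status={status} body={}", truncate_body(body))
        }
        FakeSdkError::ServiceError { status, code, request_id, extended_request_id, body } => format!(
            "status={status} code={} request_id={} x-amz-id-2={} body={}",
            code.as_deref().unwrap_or("-"),
            request_id.as_deref().unwrap_or("-"),
            extended_request_id.as_deref().unwrap_or("-"),
            truncate_body(body),
        ),
    }
}
```

The point of matching on every variant, rather than formatting the error with "{}", is that the status, code and request id survive into backups.error_message.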

Cloudflare R2 returns 501 NotImplemented on both the x-amz-tagging
upload header and the standalone PutObjectTagging call. Fix: tags
are still applied via PutObjectTagging after every successful upload,
but apply_object_tags now treats failures matching
is_unsupported_error as best-effort -- logs a warn under target
temps_backup::tagging and returns Ok. AWS S3 / MinIO / compliant
stores still tag normally. Re-exposed is_unsupported_error as
pub(crate) so upload + lifecycle reconciler share the matcher.
Replaced legacy to_tagging_string with to_tag_pairs (header form is
unused now). Two regression tests pin the exact R2 error shapes.
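
A minimal sketch of the best-effort pattern, under stated assumptions: TagError is a hypothetical error type, the matcher here keys on status 501 / code "NotImplemented" (the real is_unsupported_error pins the exact R2 error strings, which are not reproduced in this PR text), the tagging call is passed in as a closure, and the bool return value is added only to make the two Ok paths visible.

```rust
/// Hypothetical error shape for a failed PutObjectTagging call.
#[derive(Debug, Clone, PartialEq)]
pub struct TagError {
    pub status: u16,
    pub code: String,
}

/// Assumed matcher: treats 501 / NotImplemented as "provider lacks tagging".
pub fn is_unsupported_error(err: &TagError) -> bool {
    err.status == 501 || err.code == "NotImplemented"
}

/// Run the tagging call; downgrade "unsupported" failures to a warning.
/// Returns Ok(true) if tags were applied, Ok(false) if they were skipped.
pub fn apply_object_tags<F>(put_tags: F) -> Result<bool, TagError>
where
    F: FnOnce() -> Result<(), TagError>,
{
    match put_tags() {
        Ok(()) => Ok(true),
        Err(e) if is_unsupported_error(&e) => {
            // Stand-in for a warn-level log under target temps_backup::tagging.
            eprintln!(
                "warn [temps_backup::tagging]: tagging unsupported ({} {}); continuing",
                e.status, e.code
            );
            Ok(false)
        }
        // Any other failure (e.g. 403 AccessDenied) is still a real error.
        Err(e) => Err(e),
    }
}
```

Only the unsupported case is swallowed, so AWS S3 / MinIO misconfigurations still surface as failures.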

Operational note for R2: tag-driven bucket lifecycle is unavailable;
app-side BackupService::enforce_retention is the retention source of
truth on tag-less providers.

3) Drop web/src/lib/backup-schedules.ts

The hand-rolled fetch helper had a TODO(sdk-regen) comment from
before the PATCH endpoint was in the OpenAPI surface. Migrated
EditBackupSchedule to the generated updateBackupScheduleMutation +
UpdateBackupScheduleRequest type. Includes regenerated SDK
artefacts (types.gen.ts, sdk.gen.ts, react-query.gen.ts).

Other: CLI auth + CliLogin work that was already in the working tree
is bundled in to clear the dirty state -- unrelated to backups.

Adds CHANGELOG.md entries under [Unreleased] covering the work in
a6b20a6: per-schedule service scope + control-plane toggle (Added),
the SDK-vs-shim migration on EditBackupSchedule (Changed), and the
R2 tagging tolerance + describe_sdk_error error-message overhaul
(Fixed).

Creates AGENTS.md to document process conventions that bit me this
session:
- Always update CHANGELOG.md in the same commit as the code change.
- Use the generated OpenAPI SDK in web/; no hand-rolled fetch shims.
- Restart the server + regenerate the SDK when the OpenAPI surface
  changes.
- Pre-commit hooks run cargo fmt + clippy; plan for the wall-clock
  cost by preferring one commit over many.
- Don't sweep unrelated dirty files into focused commits without
  confirming with the user.

CLAUDE.md continues to own the detailed engineering rules; AGENTS.md
is the short list of process conventions that go around them.

@dviejokfs dviejokfs changed the title from "feat(backup): per-schedule service scope + control-plane toggle + R2 fixes" to "fix(backup): per-schedule scope + control-plane toggle + R2 tagging tolerance" on May 19, 2026

dviejokfs added 20 commits May 19, 2026 23:52
The gateway endpoints (/ai/v1/chat/completions, /models, /embeddings)
were registered via configure_public_routes, which lands on the no-auth
router. But the handlers use RequireAuth, which depends on the
AuthContext injected by auth_middleware -- that middleware only runs on
the authenticated router. The mismatch produced an instant 401
('Authentication Required') before the API key was ever validated, so
no 'API key auth failed' diagnostic was ever logged.

Move configure_gateway_routes() into configure_routes() alongside the
admin/usage/pricing routes so the whole AI Gateway sits on the
authenticated surface. Valid tk_ API keys now authenticate and the
AiGatewayExecute permission check runs as intended.

mark_deployment_complete flipped current_deployment_id and fired the PG
NOTIFY route reload before writing static_dir_location and image_name,
which load_routes() reads to build an environment's backend. For static
deployments the NOTIFY fired while static_dir_location was still NULL,
so the proxy built a route with no static directory and the folder
wasn't served until a later, unrelated route reload.

Add a Phase 0 step that writes the routing-relevant deployment fields
first, so the route table sees a consistent record the moment the
NOTIFY fires.

Refine what counts as an engaged session: require either >=10s of
measured wall-clock time or a genuine interaction event. Auto-fired view
events (page_view, page_leave, *_viewed) no longer mark a session
engaged on their own -- they trigger from intersection observers for
bots too, inflating engagement.

Also exclude zero-duration session replays (previously NULL OR >0 was
allowed) since never-finalized single-burst sessions have nothing to
play back, and expand user-agent bot detection in the events pipeline.
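
The engagement rule reads roughly like the sketch below. The function names are hypothetical, and it assumes that any event which is not an auto-fired view event counts as a genuine interaction; the real pipeline may use an explicit allow-list instead.

```rust
/// Auto-fired view events named in the commit: page_view, page_leave, *_viewed.
pub fn is_auto_view_event(name: &str) -> bool {
    name == "page_view" || name == "page_leave" || name.ends_with("_viewed")
}

/// A session is engaged if it lasted at least 10 seconds of measured
/// wall-clock time, or contains at least one non-auto (interaction) event.
pub fn is_engaged(duration_secs: u64, event_names: &[&str]) -> bool {
    const MIN_ENGAGED_SECS: u64 = 10;
    duration_secs >= MIN_ENGAGED_SECS
        || event_names.iter().any(|name| !is_auto_view_event(name))
}
```

Under this rule a 3-second session with only page_view and hero_viewed events is not engaged, while a short session containing a click is.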
Add --manual, --source-type, --image, and --port flags to the projects
create command so users can create Docker-image and static-files
projects without a git repository. Bumps @temps-sdk/cli to 0.1.23.

Replaces the legacy blue rocket illustration with the black "t" mark
used on temps-landing. Updates the in-app icon (sidebar + login), the
served favicon, and the committed favicon/icon PNG sets.

KFS-13

The weekly digest email rendered with collapsed, colliding metrics
("Visitors: 18Page Views: 26") because the project-activity rows used
`display:flex` + `gap`, and the metric grid used CSS `grid`. Gmail,
Outlook, and most mobile clients strip or ignore modern CSS layout, so
flex/grid children run into each other.

Rewrite templates.rs to be fully table-based with critical styles
inlined on each element (many clients drop the <head><style> block):
- Each metric and each project stat is its own <td>, so cells cannot
  collide the way inline <span>s did.
- New data tables for top pages, top countries, and most-common errors.
- Funnel cards for the conversion section.
- Trend values rendered as proper colored pill badges.
- All user-controlled content (project names, error types, page paths)
  is HTML-escaped.

Adds a regression test asserting the rendered output never contains the
run-together strings, plus tests for duration/number formatting and
escaping.
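
As a rough illustration of the table-based approach (not the actual templates.rs code), the sketch below renders one metric per <td> with inline styles and escapes labels. The function names, the exact styles and the "label: value" cell format are assumptions.

```rust
/// Escape the five HTML-significant characters in user-controlled text.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render each metric as its own table cell with styles inlined, so clients
/// that ignore flex/grid or drop <style> blocks still keep values apart.
pub fn render_metric_row(metrics: &[(&str, u64)]) -> String {
    let mut html = String::from(
        "<table role=\"presentation\" cellpadding=\"0\" cellspacing=\"0\" style=\"width:100%;\"><tr>",
    );
    for (label, value) in metrics {
        html.push_str(&format!(
            "<td style=\"padding:8px;font-family:Arial,sans-serif;\">{}: {}</td>",
            escape_html(label),
            value
        ));
    }
    html.push_str("</tr></table>");
    html
}
```

A regression test in the spirit of the one described can then assert that the output never contains the run-together string "Visitors: 18Page Views: 26".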
The weekly digest previously shipped fabricated data: ErrorData was
entirely hardcoded (total_errors: 0, uptime_percentage: 99.9 regardless
of reality), FunnelData was always empty, and PerformanceData returned
zeros for session duration, bounce rate, top pages, and geo. For a
recurring observability email this is worse than a stub — it misleads.

Implement real aggregation:
- PerformanceData: average session duration (first-to-last event span),
  bounce rate (sessions flagged is_bounce), top 5 pages, top 5 countries
  via ip_geolocations join, and a daily visitor trend.
- ErrorData: total errors and new error types from error_events /
  error_groups, distinct affected visitors, most-common errors, daily
  error trend. Uptime is computed from external_service_health_checks
  (operational vs degraded/down); with no checks recorded it reports
  100%, never the old fabricated 99.9%. error_rate is errors per 1k
  page views.
- FunnelData: per active funnel, count sessions that fired the first
  step's event (entries) vs. also the last step's event (completions),
  with conversion rate and week-over-week change.

Aggregation uses parameterized raw SQL for the GROUP BY queries; each
detail query degrades to a safe default on error so one failing query
never blanks the whole digest. Adds 6 integration tests against
TestDatabase covering empty and populated states for all three.
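
The derived ratios in the digest reduce to simple arithmetic. The sketch below shows one way to express them; the function names are hypothetical, and the zero-denominator defaults for error rate and conversion rate (0.0) are assumptions, since the commit only states the no-checks default for uptime.

```rust
/// Uptime from health checks: operational / total, as a percentage.
/// With no checks recorded this reports 100%, as the commit describes.
pub fn uptime_percentage(operational_checks: u64, total_checks: u64) -> f64 {
    if total_checks == 0 {
        return 100.0;
    }
    operational_checks as f64 * 100.0 / total_checks as f64
}

/// Errors per 1,000 page views. Returning 0.0 for zero page views is an
/// assumption, chosen here to avoid a division by zero.
pub fn error_rate_per_1k(total_errors: u64, page_views: u64) -> f64 {
    if page_views == 0 {
        return 0.0;
    }
    total_errors as f64 * 1000.0 / page_views as f64
}

/// Funnel conversion: sessions that fired the last step (completions) over
/// sessions that fired the first step (entries), as a percentage.
pub fn conversion_rate(entries: u64, completions: u64) -> f64 {
    if entries == 0 {
        return 0.0;
    }
    completions as f64 * 100.0 / entries as f64
}
```

For example, 9 operational checks out of 10 gives 90% uptime, and 5 errors over 2,000 page views gives an error rate of 2.5 per 1k.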
The OTel ingest rate limit is already configurable via the
`TEMPS_OTEL_RATE_LIMIT` env var, but `check_rate_limit` hardcoded
`limit: 1000` in the OtelError::RateLimitExceeded it returned. An
operator who lowered or raised the limit got an error that contradicted
their configuration.

Add a `RateLimiter::max_requests()` getter and use it so the error
reports the limiter's actual configured value. Strengthens the existing
over-limit test to assert the reported limit matches the configured
one.
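
The shape of the fix, in miniature. This is a sketch and not the temps OTel code: the limiter below is a plain counter with no time window, try_acquire is a hypothetical method, and the OtelError enum is reduced to the one variant mentioned above.

```rust
#[derive(Debug, PartialEq)]
pub enum OtelError {
    RateLimitExceeded { limit: u32 },
}

/// Simplified limiter: counts requests against a configured maximum.
/// (A real limiter would also reset the count per time window.)
pub struct RateLimiter {
    max_requests: u32,
    used: u32,
}

impl RateLimiter {
    pub fn new(max_requests: u32) -> Self {
        Self { max_requests, used: 0 }
    }

    /// Getter for the configured limit, so callers never hardcode it.
    pub fn max_requests(&self) -> u32 {
        self.max_requests
    }

    /// Hypothetical helper: consume one slot if any remain.
    pub fn try_acquire(&mut self) -> bool {
        if self.used < self.max_requests {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

/// Report the limiter's actual configured value instead of a literal 1000.
pub fn check_rate_limit(limiter: &mut RateLimiter) -> Result<(), OtelError> {
    if limiter.try_acquire() {
        Ok(())
    } else {
        Err(OtelError::RateLimitExceeded { limit: limiter.max_requests() })
    }
}
```

With a limiter configured for 2 requests, the third call fails with limit: 2, which is what the strengthened over-limit test asserts in spirit.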
RequestLoggerImpl, the RequestLogger trait, RequestLogData, and
LoggingConfig were a legacy request-logging path superseded by the
CreateProxyLogRequest batch-writer path. Every RequestLoggerImpl::new
call site was inside services.rs's own test module -- nothing in
production wired it up.

The dead code carried stale TODOs for unpopulated proxy_logs columns
(query_string, container_id, upstream_host, request/response sizes),
which is misleading: the live CreateProxyLogRequest path in proxy.rs
already populates all of those from the proxy context. Investigating
the TODOs surfaced that the whole path was unused.

Remove ~500 lines: the struct + trait impls, the trait and DTO, the
config struct, 5 tests that exercised only the dead logger, and a
now-orphaned test helper.

test_proxy_visitor_management failed ("Failed to get or create
visitor") because it called get_or_create_visitor with a None project
context. The visitor table has non-nullable project_id/environment_id,
and the service correctly rejects a contextless call -- the test was
asserting an impossible scenario. Its .map_err also swallowed the real
DB error, hiding the cause.

test_proxy_session_management was #[ignore]'d with a TODO about a
foreign-key constraint: it passed a fabricated Visitor { id: 123 } that
had no DB row, so the request_sessions.visitor_id FK failed.

Both tests now build the real project -> environment -> (visitor) chain
and pass proper context. test_proxy_visitor_management also adds an
assertion that a contextless call still fails (no orphan visitors), and
both tests surface the real error instead of swallowing it. The
#[ignore] is removed per the project rule against ignored tests.

254 proxy tests pass, 0 failed (was 252 passed + 1 failed).

Dependabot medium: idna < 3.15 allows bypassing the CVE-2024-3651 fix
via specially crafted inputs to idna.encode(). Bump the transitive
dependency from 3.11 to 3.15 in sdks/python/uv.lock.

Bollard 0.20 no longer re-exports RestartPolicyNameEnum through the
`secret` module (it is private there); the generated enum lives in
`bollard::models`. Update the import path so temps-import-docker
compiles against the resolved bollard 0.20.2.

hickory-proto 0.24/0.25 carries two open advisories: an NSEC3
closest-encloser unbounded loop (high) and O(n^2) name-compression CPU
exhaustion on message encoding (medium). 0.26.1 fixes both.

Bumps hickory-resolver / hickory-proto / hickory-server / hickory-client
to 0.26 across the four DNS-using crates and migrates them to the 0.26
API:

- temps-dns-resolver (the worker-node DNS server): hickory_server's
  `authority` module is renamed `zone_handler`; ServerFuture -> Server;
  RequestHandler::handle_request gains a second `T: Time` type param;
  RequestInfo.header -> .metadata; Header is now Metadata (plain public
  fields, no set_* methods). authority.rs / upstream.rs / handle.rs
  updated accordingly.
- temps-domains, temps-infra, temps-email: TokioAsyncResolver /
  TokioConnectionProvider -> Resolver + net::runtime::TokioRuntimeProvider;
  ResolverConfig::cloudflare()/::new() removed -> udp_and_tcp(&CLOUDFLARE)
  / ::default(); .build() now returns Result; lookups yield a generic
  Lookup whose .answers() records carry typed RData (extract MX/TXT/A/
  AAAA/CNAME by variant); NameServerConfig::new(SocketAddr, Protocol) ->
  ::udp/::udp_and_tcp(IpAddr).

The unused hickory-client dev-dependency in temps-dns-resolver is
dropped (no stable 0.26 release exists and nothing referenced it).

Note: mongodb 3.6.0 still pins hickory 0.25 transitively for its
`mongodb+srv` SRV stub resolver; that copy remains until mongodb
upstream moves.

Replace the check-if-email-exists dependency with a native validation
engine. check-if-email-exists is AGPL-3.0 (a licensing hazard for a
commercially-distributed product) and pinned hickory 0.24, blocking the
CVE-fixing 0.26 upgrade.

New temps-email/services/validation/ module, four stages:
- syntax: pragmatic RFC 5321/5322 local-part + domain parsing, with a
  typo-domain 'did you mean' suggestion.
- mx: hickory-resolver 0.26 MX lookup against Cloudflare DNS.
- misc: disposable-provider, role-account, and B2C-provider detection
  from curated lists; Gravatar URL via md5.
- smtp: TCP (or SOCKS5 via tokio-socks) connect to the MX, EHLO/MAIL
  FROM/RCPT TO probe -- never sends DATA, never delivers mail. Reply
  codes classify deliverable / disabled / full-inbox; a random-localpart
  RCPT probe detects catch-all domains.
Stages combine into a Safe/Risky/Invalid/Unknown reachability verdict.
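
To make the last two steps concrete, here is a sketch of how RCPT TO replies might be classified and how stage results might combine into a verdict. Everything in it is an assumption for illustration: the enum and function names, the reply-code mapping (loosely based on RFC 5321 code meanings), the "disabled" text match, and the combination rules. The actual module may weigh the stages differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RcptOutcome {
    Deliverable,
    Disabled,
    FullInbox,
    Rejected,
    Unknown,
}

/// Assumed mapping from an SMTP RCPT TO reply to an outcome.
pub fn classify_rcpt_reply(code: u16, text: &str) -> RcptOutcome {
    let text = text.to_ascii_lowercase();
    match code {
        250 | 251 => RcptOutcome::Deliverable,
        550 | 551 | 553 if text.contains("disabled") => RcptOutcome::Disabled,
        452 | 552 => RcptOutcome::FullInbox, // storage exceeded / insufficient
        550 | 551 | 553 => RcptOutcome::Rejected,
        _ => RcptOutcome::Unknown,
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Reachability {
    Safe,
    Risky,
    Invalid,
    Unknown,
}

/// Assumed combination rules for the four stages.
pub fn verdict(
    syntax_ok: bool,
    has_mx: bool,
    is_disposable: bool,
    is_catch_all: bool,
    rcpt: RcptOutcome,
) -> Reachability {
    if !syntax_ok || !has_mx {
        return Reachability::Invalid;
    }
    match rcpt {
        RcptOutcome::Rejected | RcptOutcome::Disabled => Reachability::Invalid,
        RcptOutcome::Unknown => Reachability::Unknown,
        RcptOutcome::FullInbox => Reachability::Risky,
        // A catch-all domain accepts any local part, so acceptance proves little.
        RcptOutcome::Deliverable if is_disposable || is_catch_all => Reachability::Risky,
        RcptOutcome::Deliverable => Reachability::Safe,
    }
}
```

In this sketch the catch-all flag would come from the random-localpart probe: if the server accepts an address that cannot plausibly exist, a 250 for the real address is downgraded to Risky.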

ValidationService keeps its exact public API (ValidateEmailResponse and
the per-stage result structs) so handlers, the plugin, and the OpenAPI
surface are unchanged -- only the engine behind it is swapped.

temps-cli no longer depends on check-if-email-exists for rustls
crypto-provider setup; install_crypto_provider() installs the ring
provider directly.

New deps: tokio-socks (SOCKS5), md5 (Gravatar). 32 unit tests cover
syntax/misc/smtp-classification/reachability; live MX/SMTP tests are
gated behind TEMPS_NETWORK_TESTS.

The standalone temps-mcp MCP server is no longer used -- agent tooling
has moved to skills + bunx. It was already commented out of temps-cli's
dependencies; only a stale workspace-member entry and a leftover
`temps_mcp` log-filter directive still referenced it.

Removing the crate also drops rmcp 0.6.x from the dependency tree
entirely, which resolves the rmcp Streamable-HTTP DNS-rebinding
advisory (Dependabot high) — nothing else in the workspace depends on
rmcp. oauth2 / rmcp-macros / sse-stream / schemars_derive are pruned
from Cargo.lock as a result.

temps-agents-mcp-proxy is a separate, still-used crate and is not
affected.

CI's workspace check runs --all-targets, which compiles test code that
`cargo check --lib` skips. Two test surfaces still used the hickory
0.24/0.25 API:

- tests/end_to_end.rs (integration test): proto::xfer::Protocol,
  name_server::TokioConnectionProvider, ResolverConfig::new(), and the
  2-arg NameServerConfig::new(SocketAddr, Protocol) no longer exist.
  Rebuilt the test DNS client on the 0.26 API — NameServerConfig::udp
  with the connection's port set to the test resolver's random port,
  TokioRuntimeProvider, and .build()'s Result handled.
- authority.rs unit tests: Record's .ttl()/.data() accessors —
  switched to the public `ttl` / `data` fields, which resolve without
  the RecordData bound the methods require.

Also applies `cargo fmt` across the hickory-touched files. Verified
with `cargo check --all-targets --workspace` (exit 0) and clippy.

Add offset pagination and a UsageLogPage response (entries + total) to
the /ai/usage/recent endpoint so the AI Gateway usage tab no longer
renders an unbounded list. Page size is user-configurable up to 50.

Add provider, status, cost, and total-token filters to UsageFilter and
the recent-requests query. Cost and token bounds support gte/gt/lte/lt
comparisons; cost is expressed in microcents.
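
An in-memory sketch of the paging and bound semantics, for illustration only. The real endpoint presumably filters and paginates in SQL; here UsageEntry's fields, Cmp, Bound and recent_page are hypothetical, and only the cost bound is shown. The page-size cap of 50 and the entries + total response shape come from the description above.

```rust
/// Page-size cap from the commit message.
pub const MAX_PAGE_SIZE: usize = 50;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Cmp {
    Gte,
    Gt,
    Lte,
    Lt,
}

/// One comparison bound, e.g. "cost >= 100 microcents".
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bound {
    pub op: Cmp,
    pub value: u64,
}

impl Bound {
    pub fn matches(&self, x: u64) -> bool {
        match self.op {
            Cmp::Gte => x >= self.value,
            Cmp::Gt => x > self.value,
            Cmp::Lte => x <= self.value,
            Cmp::Lt => x < self.value,
        }
    }
}

/// Hypothetical, reduced usage-log row.
#[derive(Debug, Clone, PartialEq)]
pub struct UsageEntry {
    pub provider: String,
    pub cost_microcents: u64,
    pub total_tokens: u64,
}

#[derive(Debug, PartialEq)]
pub struct UsageLogPage {
    pub entries: Vec<UsageEntry>,
    /// Count of all rows matching the filter, not just this page.
    pub total: usize,
}

pub fn recent_page(
    all: &[UsageEntry],
    cost: Option<Bound>,
    offset: usize,
    limit: usize,
) -> UsageLogPage {
    let limit = limit.clamp(1, MAX_PAGE_SIZE);
    let filtered: Vec<&UsageEntry> = all
        .iter()
        .filter(|e| cost.map_or(true, |b| b.matches(e.cost_microcents)))
        .collect();
    let total = filtered.len();
    let entries = filtered.into_iter().skip(offset).take(limit).cloned().collect();
    UsageLogPage { entries, total }
}
```

Returning total alongside the page lets the UI render page controls without a second count request.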

The web filter row is collapsed behind a Filters toggle (with an active
count badge) and only shown on demand. The provider dropdown is sourced
from the static supported-provider registry (openai, anthropic, xai,
gemini) so it is not constrained by the analytics time window.

@dviejokfs dviejokfs merged commit 302313f into main May 22, 2026
27 checks passed