fix(backup): per-schedule scope + control-plane toggle + R2 tagging tolerance by dviejokfs · Pull Request #92 · gotempsh/temps

dviejokfs · 2026-05-19T20:38:57Z

Summary

Two fixes that came out of a hands-on session on a Cloudflare R2-backed
control plane:

- Operators control what a schedule backs up. Schedules used to fan
  out to every external service plus an unavoidable control-plane row.
  New target_all_services + include_control_plane flags + an
  explicit backup_schedule_services join table give granular scope.
  Defaults preserve current behaviour on upgrade.
- Backups no longer fail on R2 with "service error". The AWS SDK
  collapses failures into a generic "service error" string, which hid
  that R2 returns 501 NotImplemented on both x-amz-tagging upload
  headers and standalone PutObjectTagging calls. Tagging is now
  best-effort; backups complete; failures show the real HTTP status /
  service code / request id / response body.
- Process docs. New AGENTS.md with the short list of conventions
  future agents need (always CHANGELOG, use the generated SDK,
  pre-commit-hook timing). CHANGELOG updated for both commits in this
  PR.
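
The scope rules in the first item can be sketched as follows. This is a minimal illustration, not the code in BackupService: the ScheduleScope and BackupTarget types and both function names are hypothetical, and only the two flags and the "nothing to back up" rejection come from this PR.

```rust
/// Hypothetical, simplified view of one schedule's scope settings.
#[derive(Debug, Clone)]
pub struct ScheduleScope {
    pub target_all_services: bool,
    pub include_control_plane: bool,
    /// Service ids attached through the join table; only consulted
    /// when `target_all_services` is false.
    pub attached_service_ids: Vec<i32>,
}

/// Hypothetical description of one backup produced by a scheduled run.
#[derive(Debug, PartialEq, Eq)]
pub enum BackupTarget {
    ControlPlane,
    Service(i32),
}

/// Rejects a scope that would leave the schedule with nothing to back up:
/// control plane off, "all services" off, and no attached services.
pub fn validate_scope(scope: &ScheduleScope) -> Result<(), String> {
    let nothing_selected = !scope.include_control_plane
        && !scope.target_all_services
        && scope.attached_service_ids.is_empty();
    if nothing_selected {
        return Err("schedule has nothing to back up".to_string());
    }
    Ok(())
}

/// Resolves the targets for one run by branching on both flags.
pub fn resolve_targets(scope: &ScheduleScope, all_service_ids: &[i32]) -> Vec<BackupTarget> {
    let mut targets = Vec::new();
    if scope.include_control_plane {
        targets.push(BackupTarget::ControlPlane);
    }
    let ids: &[i32] = if scope.target_all_services {
        all_service_ids
    } else {
        &scope.attached_service_ids
    };
    targets.extend(ids.iter().copied().map(BackupTarget::Service));
    targets
}
```

With both flags left at their upgrade defaults (true / true), resolve_targets returns the control plane plus every service, which matches the previous behaviour.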

What landed (by commit)

a6b20a67 — Per-schedule service scope + control-plane toggle + R2
fixes (one commit because pre-commit hooks take ~3-5min and I'd rather
run them once than four times):
- Migrations: backup_schedule_services join table,
  target_all_services flag, include_control_plane flag.
- BackupService: attach/detach/list-for-schedule/list-for-service
  + validators that reject empty fan-out + fan-out branches on both
  flags + clears membership when flipping target_all_services →
  true.
- 4 new HTTP endpoints with audit logging + OpenAPI.
- UI: ScheduleServicesSelector component + "All databases /
  Specific databases" radio + "Also back up the Temps control plane"
  switch on Create & Edit pages; ScheduleDetail surfaces both flags.
- describe_sdk_error helper that extracts HTTP status, service
  code, request_id, x-amz-id-2, truncated body from every SdkError
  variant; used at every upload site.
- apply_object_tags calls PutObjectTagging after upload, treats
  is_unsupported_error matches as best-effort warn.
- Drops web/src/lib/backup-schedules.ts (hand-rolled fetch shim);
  EditBackupSchedule migrated to generated SDK.
- CLI auth + CliLogin work bundled in to clear the dirty state.
daf99a13 — CHANGELOG entries for the above + new AGENTS.md.
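
The idea behind the describe_sdk_error helper listed above can be sketched without the AWS SDK. The UploadError enum below is a stand-in that only mirrors the variant names mentioned in this PR; it is not the SDK's SdkError type, and the field layout, the output format and the 512-character truncation limit are all assumptions made for illustration.

```rust
/// Stand-in for the parts of an HTTP response the helper reports.
pub struct RawResponse {
    pub status: u16,
    pub request_id: Option<String>,
    /// Value of the x-amz-id-2 header, when present.
    pub extended_id: Option<String>,
    pub body: String,
}

/// Stand-in error type mirroring the variant names from the PR.
/// This is NOT the AWS SDK's SdkError.
pub enum UploadError {
    ConstructionFailure(String),
    TimeoutError(String),
    DispatchFailure(String),
    ResponseError(RawResponse),
    ServiceError { code: Option<String>, raw: RawResponse },
}

/// Assumed truncation limit for the response body.
const MAX_BODY_CHARS: usize = 512;

fn describe_response(raw: &RawResponse) -> String {
    // Truncate on character boundaries so multi-byte bodies stay valid.
    let body: String = raw.body.chars().take(MAX_BODY_CHARS).collect();
    format!(
        "status={} request_id={} x-amz-id-2={} body={}",
        raw.status,
        raw.request_id.as_deref().unwrap_or("-"),
        raw.extended_id.as_deref().unwrap_or("-"),
        body
    )
}

/// Produces a diagnostic string for every variant instead of a bare
/// "service error".
pub fn describe_error(err: &UploadError) -> String {
    match err {
        UploadError::ConstructionFailure(m) => format!("construction failure: {m}"),
        UploadError::TimeoutError(m) => format!("timeout: {m}"),
        UploadError::DispatchFailure(m) => format!("dispatch failure: {m}"),
        UploadError::ResponseError(raw) => {
            format!("response error: {}", describe_response(raw))
        }
        UploadError::ServiceError { code, raw } => format!(
            "service error: code={} {}",
            code.as_deref().unwrap_or("-"),
            describe_response(raw)
        ),
    }
}
```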

Tests

- 6 unit tests on BackupService (MockDatabase, Docker-skip) covering
  attach/detach/list error paths.
- 3 integration tests against TestDatabase:
  integration_attach_list_detach_round_trip,
  integration_flip_to_all_clears_membership,
  integration_fan_out_skips_control_plane_when_flag_off.
- 2 regression tests pinning the exact R2 error strings so a future
  SDK upgrade can't silently regress the matcher.
- cargo check --lib clean across the workspace.

Test plan

- Run cargo test --lib -p temps-backup — all 10 new tests pass.
- Apply migrations on a dev DB; verify existing schedules backfill
  to target_all_services=true / include_control_plane=true.
- In the UI: create a schedule with "Specific databases" + control
  plane off, attach one DB, "Run now" — assert one backup row (no
  control_plane sidecar).
- In the UI: edit a schedule; flip "All databases" → "Specific",
  attach two services, save; flip back to "All", save; flip to
  "Specific" again — membership must be empty (cleared on the
  first flip-to-all).
- Trigger a backup against the Cloudflare R2 bucket the user
  tested with (backups-test) — backup should succeed, with a
  warn log under target temps_backup::tagging rather than a
  failure.
- Force a real S3 error (e.g. wrong bucket name) and confirm the
  backups.error_message column now contains HTTP status + code +
  request_id + body instead of "service error".

…fixes

Three changes in one commit so pre-commit hooks only run once.

1) Per-schedule service membership + scope flags

Schedules previously fanned out to every external service on the host with no operator control. New shape:
- backup_schedule_services join table (migration m20260519_000001) so a schedule can target a specific list of databases.
- target_all_services flag on backup_schedules (migration m20260519_000002, defaults TRUE) for the common case "back up every database -- including ones I add later". When false, fan-out reads the explicit join table.
- include_control_plane flag (migration m20260519_000003, defaults TRUE). Previously every fan-out unconditionally produced a control_plane backup; operators using Temps purely to orchestrate external DB backups can now opt out.

Service layer (BackupService):
- attach/detach/list-services-for-schedule + list-schedules-for-service
- create/update validators reject a state that would have nothing to back up (control plane off + target_all_services off + no attached services).
- When flipping target_all_services -> true, the explicit membership rows are cleared ("all means all").
- Fan-out (enqueue_scheduled_run) branches on both flags.

Handlers: 4 new endpoints (GET/POST /backups/schedules/{id}/services, DELETE /backups/schedules/{id}/services/{service_id}, GET /backups/external-services/{service_id}/schedules) with audit logging + OpenAPI registration.

UI:
- ScheduleServicesSelector reusable component (checkbox list + "Select all" with indeterminate state, exclude-already-attached).
- CreateBackupSchedule + EditBackupSchedule: "All databases" / "Specific databases" radio + "Also back up the Temps control plane" Switch.
- ScheduleDetail: surfaces both flags in the Schedule Configuration card; the per-service attach/detach card only renders in 'specific' mode.
Tests: 6 unit tests (MockDatabase, Docker-skip) + 3 integration tests covering attach/detach round-trip, flip-to-all-clears-membership, and fan-out skips control plane when the flag is off.

2) Real S3 errors + R2 tagging tolerance

Every SDK call site used to swallow rich SdkError data behind format!("...: {}", e), rendering as "service error" for any 4xx/5xx. Added describe_sdk_error in v2_common: pattern-matches on SdkError::{ConstructionFailure,TimeoutError,DispatchFailure,ResponseError,ServiceError} and extracts HTTP status, service code, request_id, x-amz-id-2, and a truncated response body. All upload sites (single-part, create/upload/complete multipart, metadata companion, head_bucket) plus the three From<SdkError> impls in services/backup.rs now use it.

Cloudflare R2 returns 501 NotImplemented on both the x-amz-tagging upload header and the standalone PutObjectTagging call. Fix: tags are still applied via PutObjectTagging after every successful upload, but apply_object_tags now treats failures matching is_unsupported_error as best-effort -- logs a warn under target temps_backup::tagging and returns Ok. AWS S3 / MinIO / compliant stores still tag normally. Re-exposed is_unsupported_error as pub(crate) so upload + lifecycle reconciler share the matcher. Replaced legacy to_tagging_string with to_tag_pairs (header form is unused now). Two regression tests pin the exact R2 error shapes.

Operational note for R2: tag-driven bucket lifecycle is unavailable; app-side BackupService::enforce_retention is the retention source of truth on tag-less providers.

3) Drop web/src/lib/backup-schedules.ts

The hand-rolled fetch helper had a TODO(sdk-regen) comment from before the PATCH endpoint was in the OpenAPI surface. Migrated EditBackupSchedule to the generated updateBackupScheduleMutation + UpdateBackupScheduleRequest type. Includes regenerated SDK artefacts (types.gen.ts, sdk.gen.ts, react-query.gen.ts).
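
The best-effort tagging behaviour from section 2 can be sketched roughly as below. The names apply_object_tags and is_unsupported_error are taken from the commit message, but the signatures, the TagError and TagOutcome types and the matching rule (status 501 or code NotImplemented) are assumptions for illustration; the real matcher pins exact R2 error shapes that are not shown on this page.

```rust
/// Hypothetical error for a tagging call: HTTP status plus service code.
#[derive(Debug)]
pub struct TagError {
    pub status: u16,
    pub code: String,
}

/// Assumed matcher: 501 / NotImplemented means the provider does not
/// support object tagging at all.
pub fn is_unsupported_error(err: &TagError) -> bool {
    err.status == 501 || err.code == "NotImplemented"
}

/// Illustrative outcome type; the commit message only says the function
/// returns Ok on unsupported providers.
#[derive(Debug, PartialEq, Eq)]
pub enum TagOutcome {
    Applied,
    SkippedUnsupported,
}

/// Runs the tagging call after a successful upload. "Unsupported"
/// failures are downgraded to a warning; every other failure propagates.
pub fn apply_object_tags<F>(put_tagging: F) -> Result<TagOutcome, TagError>
where
    F: FnOnce() -> Result<(), TagError>,
{
    match put_tagging() {
        Ok(()) => Ok(TagOutcome::Applied),
        Err(e) if is_unsupported_error(&e) => {
            // Stand-in for the warn log under target temps_backup::tagging.
            eprintln!("warn: object tagging unsupported, continuing: {e:?}");
            Ok(TagOutcome::SkippedUnsupported)
        }
        Err(e) => Err(e),
    }
}
```

Keeping the matcher narrow matters here: a permissions error such as 403 AccessDenied should still fail loudly, so only the "not implemented" shape is tolerated.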
Other: CLI auth + CliLogin work that was already in the working tree is bundled in to clear the dirty state -- unrelated to backups.

Adds CHANGELOG.md entries under [Unreleased] covering the work in a6b20a6: per-schedule service scope + control-plane toggle (Added), the SDK-vs-shim migration on EditBackupSchedule (Changed), and the R2 tagging tolerance + describe_sdk_error error-message overhaul (Fixed).

Creates AGENTS.md to document process conventions that bit me this session:
- Always update CHANGELOG.md in the same commit as the code change.
- Use the generated OpenAPI SDK in web/; no hand-rolled fetch shims.
- Restart the server + regenerate the SDK when the OpenAPI surface changes.
- Pre-commit hooks run cargo fmt + clippy; plan for the wall-clock cost by preferring one commit over many.
- Don't sweep unrelated dirty files into focused commits without confirming with the user.

CLAUDE.md continues to own the detailed engineering rules; AGENTS.md is the short list of process conventions that go around them.

The gateway endpoints (/ai/v1/chat/completions, /models, /embeddings) were registered via configure_public_routes, which lands on the no-auth router. But the handlers use RequireAuth, which depends on the AuthContext injected by auth_middleware -- that middleware only runs on the authenticated router. The mismatch produced an instant 401 ('Authentication Required') before the API key was ever validated, so no 'API key auth failed' diagnostic was ever logged. Move configure_gateway_routes() into configure_routes() alongside the admin/usage/pricing routes so the whole AI Gateway sits on the authenticated surface. Valid tk_ API keys now authenticate and the AiGatewayExecute permission check runs as intended.

mark_deployment_complete flipped current_deployment_id and fired the PG NOTIFY route reload before writing static_dir_location and image_name, which load_routes() reads to build an environment's backend. For static deployments the NOTIFY fired while static_dir_location was still NULL, so the proxy built a route with no static directory and the folder wasn't served until a later, unrelated route reload. Add a Phase 0 step that writes the routing-relevant deployment fields first, so the route table sees a consistent record the moment the NOTIFY fires.

Refine what counts as an engaged session: require either >=10s of measured wall-clock time or a genuine interaction event. Auto-fired view events (page_view, page_leave, *_viewed) no longer mark a session engaged on their own -- they trigger from intersection observers for bots too, inflating engagement. Also exclude zero-duration session replays (previously NULL OR >0 was allowed) since never-finalized single-burst sessions have nothing to play back, and expand user-agent bot detection in the events pipeline.
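
The engagement rule above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and "genuine interaction" is approximated as any event that is not one of the auto-fired view events named in the commit message.

```rust
/// Minimum measured wall-clock time for a session to count as engaged.
const MIN_ENGAGED_SECONDS: u64 = 10;

/// Auto-fired view events that no longer mark a session engaged on their own.
fn is_auto_view_event(name: &str) -> bool {
    name == "page_view" || name == "page_leave" || name.ends_with("_viewed")
}

/// A session is engaged if it lasted at least 10 seconds or contains at
/// least one event that is not an auto-fired view event (assumption: every
/// such event counts as a genuine interaction).
pub fn is_engaged(duration_secs: u64, event_names: &[&str]) -> bool {
    duration_secs >= MIN_ENGAGED_SECONDS
        || event_names.iter().any(|name| !is_auto_view_event(name))
}
```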

Add --manual, --source-type, --image, and --port flags to the projects create command so users can create Docker-image and static-files projects without a git repository. Bumps @temps-sdk/cli to 0.1.23.

Replaces the legacy blue rocket illustration with the black "t" mark used on temps-landing. Updates the in-app icon (sidebar + login), the served favicon, and the committed favicon/icon PNG sets. KFS-13

The weekly digest email rendered with collapsed, colliding metrics ("Visitors: 18Page Views: 26") because the project-activity rows used `display:flex` + `gap`, and the metric grid used CSS `grid`. Gmail, Outlook, and most mobile clients strip or ignore modern CSS layout, so flex/grid children run into each other.

Rewrite templates.rs to be fully table-based with critical styles inlined on each element (many clients drop the <head><style> block):
- Each metric and each project stat is its own <td>, so cells cannot collide the way inline <span>s did.
- New data tables for top pages, top countries, and most-common errors.
- Funnel cards for the conversion section.
- Trend values rendered as proper colored pill badges.
- All user-controlled content (project names, error types, page paths) is HTML-escaped.

Adds a regression test asserting the rendered output never contains the run-together strings, plus tests for duration/number formatting and escaping.
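
A minimal sketch of the two techniques named above, one <td> per metric with inline styles plus HTML escaping of labels. The function names and the exact inline style are hypothetical; the real templates.rs is not shown on this page.

```rust
/// Escapes the five characters that are unsafe in HTML text and attributes.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Renders each metric as its own table cell with the style inlined on the
/// element, so clients that drop <style> blocks or ignore flex/grid still
/// keep the values apart.
pub fn metric_row(metrics: &[(&str, u64)]) -> String {
    let cells: String = metrics
        .iter()
        .map(|(label, value)| {
            format!(
                "<td style=\"padding:8px 16px;\">{}: {}</td>",
                escape_html(label),
                value
            )
        })
        .collect();
    format!("<table role=\"presentation\"><tr>{cells}</tr></table>")
}
```

A regression check in the spirit of the one described would assert that the output of metric_row(&[("Visitors", 18), ("Page Views", 26)]) never contains the run-together string "18Page Views".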

The weekly digest previously shipped fabricated data: ErrorData was entirely hardcoded (total_errors: 0, uptime_percentage: 99.9 regardless of reality), FunnelData was always empty, and PerformanceData returned zeros for session duration, bounce rate, top pages, and geo. For a recurring observability email this is worse than a stub: it misleads.

Implement real aggregation:
- PerformanceData: average session duration (first-to-last event span), bounce rate (sessions flagged is_bounce), top 5 pages, top 5 countries via ip_geolocations join, and a daily visitor trend.
- ErrorData: total errors and new error types from error_events / error_groups, distinct affected visitors, most-common errors, daily error trend. Uptime is computed from external_service_health_checks (operational vs degraded/down); with no checks recorded it reports 100%, never the old fabricated 99.9%. error_rate is errors per 1k page views.
- FunnelData: per active funnel, count sessions that fired the first step's event (entries) vs. also the last step's event (completions), with conversion rate and week-over-week change.

Aggregation uses parameterized raw SQL for the GROUP BY queries; each detail query degrades to a safe default on error so one failing query never blanks the whole digest. Adds 6 integration tests against TestDatabase covering empty and populated states for all three.
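
The derived ratios described above reduce to small formulas. The sketch below uses hypothetical function names and works on already-aggregated counts; the zero-denominator defaults for error rate and conversion rate are assumptions (only the 100% uptime default with no checks is stated in the commit message).

```rust
/// Uptime as the share of operational health checks; with no checks
/// recorded it reports 100%.
pub fn uptime_percentage(operational_checks: u64, total_checks: u64) -> f64 {
    if total_checks == 0 {
        return 100.0;
    }
    operational_checks as f64 / total_checks as f64 * 100.0
}

/// Errors per 1,000 page views. Returning 0.0 when there are no page
/// views is an assumption.
pub fn error_rate_per_1k(total_errors: u64, page_views: u64) -> f64 {
    if page_views == 0 {
        return 0.0;
    }
    total_errors as f64 * 1000.0 / page_views as f64
}

/// Funnel conversion: sessions that reached the last step divided by
/// sessions that fired the first step, as a percentage. Returning 0.0
/// for an empty funnel is an assumption.
pub fn conversion_rate(entries: u64, completions: u64) -> f64 {
    if entries == 0 {
        return 0.0;
    }
    completions as f64 / entries as f64 * 100.0
}
```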

The OTel ingest rate limit is already configurable via the `TEMPS_OTEL_RATE_LIMIT` env var, but `check_rate_limit` hardcoded `limit: 1000` in the OtelError::RateLimitExceeded it returned. An operator who lowered or raised the limit got an error that contradicted their configuration. Add a `RateLimiter::max_requests()` getter and use it so the error reports the limiter's actual configured value. Strengthens the existing over-limit test to assert the reported limit matches the configured one.
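
In outline, the fix reads the limit from the limiter instead of a literal. The sketch below reuses the names from the commit message (RateLimiter::max_requests, check_rate_limit, OtelError::RateLimitExceeded), but the limiter itself is a simplified stand-in: a plain counter with no time window, which the real one presumably has.

```rust
/// Simplified stand-in limiter: a counter with no time window.
pub struct RateLimiter {
    max_requests: u32,
    used: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum OtelError {
    RateLimitExceeded { limit: u32 },
}

impl RateLimiter {
    pub fn new(max_requests: u32) -> Self {
        Self { max_requests, used: 0 }
    }

    /// Getter for the configured limit.
    pub fn max_requests(&self) -> u32 {
        self.max_requests
    }

    fn try_acquire(&mut self) -> bool {
        if self.used < self.max_requests {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

/// Reports the limiter's configured value rather than a hardcoded 1000.
pub fn check_rate_limit(limiter: &mut RateLimiter) -> Result<(), OtelError> {
    if limiter.try_acquire() {
        Ok(())
    } else {
        Err(OtelError::RateLimitExceeded {
            limit: limiter.max_requests(),
        })
    }
}
```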

RequestLoggerImpl, the RequestLogger trait, RequestLogData, and LoggingConfig were a legacy request-logging path superseded by the CreateProxyLogRequest batch-writer path. Every RequestLoggerImpl::new call site was inside services.rs's own test module -- nothing in production wired it up. The dead code carried stale TODOs for unpopulated proxy_logs columns (query_string, container_id, upstream_host, request/response sizes), which is misleading: the live CreateProxyLogRequest path in proxy.rs already populates all of those from the proxy context. Investigating the TODOs surfaced that the whole path was unused. Remove ~500 lines: the struct + trait impls, the trait and DTO, the config struct, 5 tests that exercised only the dead logger, and a now-orphaned test helper.

test_proxy_visitor_management failed ("Failed to get or create visitor") because it called get_or_create_visitor with a None project context. The visitor table has non-nullable project_id/environment_id, and the service correctly rejects a contextless call -- the test was asserting an impossible scenario. Its .map_err also swallowed the real DB error, hiding the cause. test_proxy_session_management was #[ignore]'d with a TODO about a foreign-key constraint: it passed a fabricated Visitor { id: 123 } that had no DB row, so the request_sessions.visitor_id FK failed. Both tests now build the real project -> environment -> (visitor) chain and pass proper context. test_proxy_visitor_management also adds an assertion that a contextless call still fails (no orphan visitors), and both tests surface the real error instead of swallowing it. The #[ignore] is removed per the project rule against ignored tests. 254 proxy tests pass, 0 failed (was 252 passed + 1 failed).

Dependabot medium: idna < 3.15 allows bypassing the CVE-2024-3651 fix via specially crafted inputs to idna.encode(). Bump the transitive dependency from 3.11 to 3.15 in sdks/python/uv.lock.

Bollard 0.20 no longer re-exports RestartPolicyNameEnum through the `secret` module (it is private there); the generated enum lives in `bollard::models`. Update the import path so temps-import-docker compiles against the resolved bollard 0.20.2.

hickory-proto 0.24/0.25 carries two open advisories: an NSEC3 closest-encloser unbounded loop (high) and O(n^2) name-compression CPU exhaustion on message encoding (medium). 0.26.1 fixes both.

Bumps hickory-resolver / hickory-proto / hickory-server / hickory-client to 0.26 across the four DNS-using crates and migrates them to the 0.26 API:
- temps-dns-resolver (the worker-node DNS server): hickory_server's `authority` module is renamed `zone_handler`; ServerFuture -> Server; RequestHandler::handle_request gains a second `T: Time` type param; RequestInfo.header -> .metadata; Header is now Metadata (plain public fields, no set_* methods). authority.rs / upstream.rs / handle.rs updated accordingly.
- temps-domains, temps-infra, temps-email: TokioAsyncResolver / TokioConnectionProvider -> Resolver + net::runtime::TokioRuntimeProvider; ResolverConfig::cloudflare()/::new() removed -> udp_and_tcp(&CLOUDFLARE) / ::default(); .build() now returns Result; lookups yield a generic Lookup whose .answers() records carry typed RData (extract MX/TXT/A/AAAA/CNAME by variant); NameServerConfig::new(SocketAddr, Protocol) -> ::udp/::udp_and_tcp(IpAddr).

The unused hickory-client dev-dependency in temps-dns-resolver is dropped (no stable 0.26 release exists and nothing referenced it).

Note: mongodb 3.6.0 still pins hickory 0.25 transitively for its `mongodb+srv` SRV stub resolver; that copy remains until mongodb upstream moves.

Replace the check-if-email-exists dependency with a native validation engine. check-if-email-exists is AGPL-3.0 (a licensing hazard for a commercially-distributed product) and pinned hickory 0.24, blocking the CVE-fixing 0.26 upgrade.

New temps-email/services/validation/ module, four stages:
- syntax: pragmatic RFC 5321/5322 local-part + domain parsing, with a typo-domain 'did you mean' suggestion.
- mx: hickory-resolver 0.26 MX lookup against Cloudflare DNS.
- misc: disposable-provider, role-account, and B2C-provider detection from curated lists; Gravatar URL via md5.
- smtp: TCP (or SOCKS5 via tokio-socks) connect to the MX, EHLO/MAIL FROM/RCPT TO probe -- never sends DATA, never delivers mail. Reply codes classify deliverable / disabled / full-inbox; a random-localpart RCPT probe detects catch-all domains.

Stages combine into a Safe/Risky/Invalid/Unknown reachability verdict.

ValidationService keeps its exact public API (ValidateEmailResponse and the per-stage result structs) so handlers, the plugin, and the OpenAPI surface are unchanged -- only the engine behind it is swapped. temps-cli no longer depends on check-if-email-exists for rustls crypto-provider setup; install_crypto_provider() installs the ring provider directly.

New deps: tokio-socks (SOCKS5), md5 (Gravatar). 32 unit tests cover syntax/misc/smtp-classification/reachability; live MX/SMTP tests are gated behind TEMPS_NETWORK_TESTS.
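
For the smtp stage, the shape of a non-delivering probe can be sketched as below. This is an illustration only: both function names and the ReplyClass type are hypothetical, the trailing QUIT is an assumption (the commit message lists only EHLO / MAIL FROM / RCPT TO), and the reply grouping is the generic first-digit classification from RFC 5321, not the module's actual deliverable / disabled / full-inbox mapping.

```rust
/// Builds the command sequence for a probe that stops after RCPT TO.
/// No DATA command is ever issued, so no mail is delivered.
pub fn probe_commands(helo_domain: &str, mail_from: &str, rcpt_to: &str) -> Vec<String> {
    vec![
        format!("EHLO {helo_domain}\r\n"),
        format!("MAIL FROM:<{mail_from}>\r\n"),
        format!("RCPT TO:<{rcpt_to}>\r\n"),
        // Assumption: close the session politely instead of dropping TCP.
        "QUIT\r\n".to_string(),
    ]
}

/// Generic SMTP reply grouping by first digit (RFC 5321).
#[derive(Debug, PartialEq, Eq)]
pub enum ReplyClass {
    Positive,
    TransientFailure,
    PermanentFailure,
    Other,
}

pub fn reply_class(code: u16) -> ReplyClass {
    match code {
        200..=299 => ReplyClass::Positive,
        400..=499 => ReplyClass::TransientFailure,
        500..=599 => ReplyClass::PermanentFailure,
        _ => ReplyClass::Other,
    }
}
```

A catch-all check in this style would run the same sequence a second time with a random local part at the same domain: if that RCPT TO also gets a positive reply, the server accepts everything and the first result says little about the real mailbox.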

The standalone temps-mcp MCP server is no longer used — agent tooling has moved to skills + bunx. It was already commented out of temps-cli's dependencies; only a stale workspace-member entry and a leftover `temps_mcp` log-filter directive still referenced it. Removing the crate also drops rmcp 0.6.x from the dependency tree entirely, which resolves the rmcp Streamable-HTTP DNS-rebinding advisory (Dependabot high) — nothing else in the workspace depends on rmcp. oauth2 / rmcp-macros / sse-stream / schemars_derive are pruned from Cargo.lock as a result. temps-agents-mcp-proxy is a separate, still-used crate and is not affected.

CI's workspace check runs --all-targets, which compiles test code that `cargo check --lib` skips. Two test surfaces still used the hickory 0.24/0.25 API:
- tests/end_to_end.rs (integration test): proto::xfer::Protocol, name_server::TokioConnectionProvider, ResolverConfig::new(), and the 2-arg NameServerConfig::new(SocketAddr, Protocol) no longer exist. Rebuilt the test DNS client on the 0.26 API: NameServerConfig::udp with the connection's port set to the test resolver's random port, TokioRuntimeProvider, and .build()'s Result handled.
- authority.rs unit tests: Record's .ttl()/.data() accessors were switched to the public `ttl` / `data` fields, which resolve without the RecordData bound the methods require.

Also applies `cargo fmt` across the hickory-touched files. Verified with `cargo check --all-targets --workspace` (exit 0) and clippy.

Add offset pagination and a UsageLogPage response (entries + total) to the /ai/usage/recent endpoint so the AI Gateway usage tab no longer renders an unbounded list. Page size is user-configurable up to 50. Add provider, status, cost, and total-token filters to UsageFilter and the recent-requests query. Cost and token bounds support gte/gt/lte/lt comparisons; cost is expressed in microcents. The web filter row is collapsed behind a Filters toggle (with an active count badge) and only shown on demand. The provider dropdown is sourced from the static supported-provider registry (openai, anthropic, xai, gemini) so it is not constrained by the analytics time window.
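
The pagination and comparison filters above can be illustrated in memory. The real endpoint presumably does this in SQL; here UsageLogPage (entries + total), the 50-row cap and the gte/gt/lte/lt operators come from the commit message, while UsageEntry, Cmp, recent_usage and the lower clamp of 1 are hypothetical.

```rust
/// Upper bound on the user-configurable page size.
pub const MAX_PAGE_SIZE: usize = 50;

/// Comparison against a bound, e.g. a cost expressed in microcents.
#[derive(Debug, Clone, Copy)]
pub enum Cmp {
    Gte(i64),
    Gt(i64),
    Lte(i64),
    Lt(i64),
}

impl Cmp {
    pub fn matches(self, value: i64) -> bool {
        match self {
            Cmp::Gte(bound) => value >= bound,
            Cmp::Gt(bound) => value > bound,
            Cmp::Lte(bound) => value <= bound,
            Cmp::Lt(bound) => value < bound,
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UsageEntry {
    pub provider: String,
    pub cost_microcents: i64,
}

/// One page of results plus the total number of rows matching the filter.
#[derive(Debug)]
pub struct UsageLogPage {
    pub entries: Vec<UsageEntry>,
    pub total: usize,
}

/// Filters, counts, then slices with offset + capped page size.
pub fn recent_usage(
    all: &[UsageEntry],
    cost: Option<Cmp>,
    offset: usize,
    page_size: usize,
) -> UsageLogPage {
    let limit = page_size.clamp(1, MAX_PAGE_SIZE);
    let filtered: Vec<&UsageEntry> = all
        .iter()
        .filter(|e| cost.map_or(true, |c| c.matches(e.cost_microcents)))
        .collect();
    let total = filtered.len();
    let entries = filtered.into_iter().skip(offset).take(limit).cloned().collect();
    UsageLogPage { entries, total }
}
```

Returning total alongside the page lets the UI render page controls without a second request; the count is taken after filtering but before the offset is applied.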

dviejokfs added 2 commits May 19, 2026 22:37

dviejokfs changed the title ~~feat(backup): per-schedule service scope + control-plane toggle + R2 fixes~~ fix(backup): per-schedule scope + control-plane toggle + R2 tagging tolerance May 19, 2026

dviejokfs added 20 commits May 19, 2026 23:52

chore: bump version to v0.1.0-beta.18

ab21206

feat(temps-cli): support manual (non-git) project creation

e2cef7b

chore: bump version to v0.1.0-beta.19

c1ff771

style(deployments): cargo fmt routing-inputs block

92a4189

feat(web): change platform logo and favicon to the "t" lettermark

6638714

fix(deps): bump idna to 3.15 in Python SDK (CVE-2024-3651 bypass)

be99a7a

dviejokfs merged commit 302313f into main May 22, 2026
27 checks passed

dviejokfs merged 22 commits into main from fix/backup-schedule-scope-and-r2
