fix(backup): per-schedule scope + control-plane toggle + R2 tagging tolerance#92
Merged
Conversation
…fixes
Three changes in one commit so pre-commit hooks only run once.
1) Per-schedule service membership + scope flags
Schedules previously fanned out to every external service on the host
with no operator control. New shape:
- backup_schedule_services join table (migration m20260519_000001)
so a schedule can target a specific list of databases.
- target_all_services flag on backup_schedules (migration
m20260519_000002, defaults TRUE) for the common case "back up
every database -- including ones I add later". When false, fan-out
reads the explicit join table.
- include_control_plane flag (migration m20260519_000003, defaults
TRUE). Previously every fan-out unconditionally produced a
control_plane backup; operators using Temps purely to orchestrate
external DB backups can now opt out.
Service layer (BackupService):
- attach/detach/list-services-for-schedule + list-schedules-for-service
- create/update validators reject a state that would have nothing to
back up (control plane off + target_all_services off + no attached
services).
- When flipping target_all_services -> true, the explicit membership
rows are cleared ("all means all").
- Fan-out (enqueue_scheduled_run) branches on both flags.
Handlers: 4 new endpoints (GET/POST /backups/schedules/{id}/services,
DELETE /backups/schedules/{id}/services/{service_id},
GET /backups/external-services/{service_id}/schedules) with audit
logging + OpenAPI registration.
UI:
- ScheduleServicesSelector reusable component (checkbox list +
"Select all" with indeterminate state, exclude-already-attached).
- CreateBackupSchedule + EditBackupSchedule: "All databases" /
"Specific databases" radio + "Also back up the Temps control plane"
Switch.
- ScheduleDetail: surfaces both flags in the Schedule Configuration
card; the per-service attach/detach card only renders in 'specific'
mode.
Tests: 6 unit tests (MockDatabase, Docker-skip) + 3 integration tests
covering attach/detach round-trip, flip-to-all-clears-membership, and
fan-out skips control plane when the flag is off.
2) Real S3 errors + R2 tagging tolerance
Every SDK call site used to swallow rich SdkError data behind
format!("...: {}", e), rendering as "service error" for any 4xx/5xx.
Added describe_sdk_error in v2_common: pattern-matches on
SdkError::{ConstructionFailure,TimeoutError,DispatchFailure,
ResponseError,ServiceError} and extracts HTTP status, service code,
request_id, x-amz-id-2, and a truncated response body. All upload
sites (single-part, create/upload/complete multipart, metadata
companion, head_bucket) plus the three From<SdkError> impls in
services/backup.rs now use it.
Cloudflare R2 returns 501 NotImplemented on both the x-amz-tagging
upload header and the standalone PutObjectTagging call. Fix: tags
are still applied via PutObjectTagging after every successful upload,
but apply_object_tags now treats failures matching
is_unsupported_error as best-effort -- logs a warn under target
temps_backup::tagging and returns Ok. AWS S3 / MinIO / compliant
stores still tag normally. Re-exposed is_unsupported_error as
pub(crate) so upload + lifecycle reconciler share the matcher.
Replaced legacy to_tagging_string with to_tag_pairs (header form is
unused now). Two regression tests pin the exact R2 error shapes.
Operational note for R2: tag-driven bucket lifecycle is unavailable;
app-side BackupService::enforce_retention is the retention source of
truth on tag-less providers.
3) Drop web/src/lib/backup-schedules.ts
The hand-rolled fetch helper had a TODO(sdk-regen) comment from
before the PATCH endpoint was in the OpenAPI surface. Migrated
EditBackupSchedule to the generated updateBackupScheduleMutation +
UpdateBackupScheduleRequest type. Includes regenerated SDK
artefacts (types.gen.ts, sdk.gen.ts, react-query.gen.ts).
Other: CLI auth + CliLogin work that was already in the working tree
is bundled in to clear the dirty state -- unrelated to backups.
Adds CHANGELOG.md entries under [Unreleased] covering the work in a6b20a6: per-schedule service scope + control-plane toggle (Added), the SDK-vs-shim migration on EditBackupSchedule (Changed), and the R2 tagging tolerance + describe_sdk_error error-message overhaul (Fixed). Creates AGENTS.md to document process conventions that bit me this session: - Always update CHANGELOG.md in the same commit as the code change. - Use the generated OpenAPI SDK in web/; no hand-rolled fetch shims. - Restart the server + regenerate the SDK when the OpenAPI surface changes. - Pre-commit hooks run cargo fmt + clippy; plan for the wall-clock cost by preferring one commit over many. - Don't sweep unrelated dirty files into focused commits without confirming with the user. CLAUDE.md continues to own the detailed engineering rules; AGENTS.md is the short list of process conventions that go around them.
The gateway endpoints (/ai/v1/chat/completions, /models, /embeddings)
were registered via configure_public_routes, which lands on the no-auth
router. But the handlers use RequireAuth, which depends on the
AuthContext injected by auth_middleware -- that middleware only runs on
the authenticated router. The mismatch produced an instant 401
('Authentication Required') before the API key was ever validated, so
no 'API key auth failed' diagnostic was ever logged.
Move configure_gateway_routes() into configure_routes() alongside the
admin/usage/pricing routes so the whole AI Gateway sits on the
authenticated surface. Valid tk_ API keys now authenticate and the
AiGatewayExecute permission check runs as intended.
mark_deployment_complete flipped current_deployment_id and fired the PG NOTIFY route reload before writing static_dir_location and image_name, which load_routes() reads to build an environment's backend. For static deployments the NOTIFY fired while static_dir_location was still NULL, so the proxy built a route with no static directory and the folder wasn't served until a later, unrelated route reload. Add a Phase 0 step that writes the routing-relevant deployment fields first, so the route table sees a consistent record the moment the NOTIFY fires.
Refine what counts as an engaged session: require either >=10s of measured wall-clock time or a genuine interaction event. Auto-fired view events (page_view, page_leave, *_viewed) no longer mark a session engaged on their own -- they trigger from intersection observers for bots too, inflating engagement. Also exclude zero-duration session replays (previously NULL OR >0 was allowed) since never-finalized single-burst sessions have nothing to play back, and expand user-agent bot detection in the events pipeline.
Add --manual, --source-type, --image, and --port flags to the projects create command so users can create Docker-image and static-files projects without a git repository. Bumps @temps-sdk/cli to 0.1.23.
Replaces the legacy blue rocket illustration with the black "t" mark used on temps-landing. Updates the in-app icon (sidebar + login), the served favicon, and the committed favicon/icon PNG sets. KFS-13
The weekly digest email rendered with collapsed, colliding metrics
("Visitors: 18Page Views: 26") because the project-activity rows used
`display:flex` + `gap`, and the metric grid used CSS `grid`. Gmail,
Outlook, and most mobile clients strip or ignore modern CSS layout, so
flex/grid children run into each other.
Rewrite templates.rs to be fully table-based with critical styles
inlined on each element (many clients drop the <head><style> block):
- Each metric and each project stat is its own <td>, so cells cannot
collide the way inline <span>s did.
- New data tables for top pages, top countries, and most-common errors.
- Funnel cards for the conversion section.
- Trend values rendered as proper colored pill badges.
- All user-controlled content (project names, error types, page paths)
is HTML-escaped.
Adds a regression test asserting the rendered output never contains the
run-together strings, plus tests for duration/number formatting and
escaping.
The weekly digest previously shipped fabricated data: ErrorData was entirely hardcoded (total_errors: 0, uptime_percentage: 99.9 regardless of reality), FunnelData was always empty, and PerformanceData returned zeros for session duration, bounce rate, top pages, and geo. For a recurring observability email this is worse than a stub — it misleads. Implement real aggregation: - PerformanceData: average session duration (first-to-last event span), bounce rate (sessions flagged is_bounce), top 5 pages, top 5 countries via ip_geolocations join, and a daily visitor trend. - ErrorData: total errors and new error types from error_events / error_groups, distinct affected visitors, most-common errors, daily error trend. Uptime is computed from external_service_health_checks (operational vs degraded/down); with no checks recorded it reports 100%, never the old fabricated 99.9%. error_rate is errors per 1k page views. - FunnelData: per active funnel, count sessions that fired the first step's event (entries) vs. also the last step's event (completions), with conversion rate and week-over-week change. Aggregation uses parameterized raw SQL for the GROUP BY queries; each detail query degrades to a safe default on error so one failing query never blanks the whole digest. Adds 6 integration tests against TestDatabase covering empty and populated states for all three.
The OTel ingest rate limit is already configurable via the `TEMPS_OTEL_RATE_LIMIT` env var, but `check_rate_limit` hardcoded `limit: 1000` in the OtelError::RateLimitExceeded it returned. An operator who lowered or raised the limit got an error that contradicted their configuration. Add a `RateLimiter::max_requests()` getter and use it so the error reports the limiter's actual configured value. Strengthens the existing over-limit test to assert the reported limit matches the configured one.
RequestLoggerImpl, the RequestLogger trait, RequestLogData, and LoggingConfig were a legacy request-logging path superseded by the CreateProxyLogRequest batch-writer path. Every RequestLoggerImpl::new call site was inside services.rs's own test module -- nothing in production wired it up. The dead code carried stale TODOs for unpopulated proxy_logs columns (query_string, container_id, upstream_host, request/response sizes), which is misleading: the live CreateProxyLogRequest path in proxy.rs already populates all of those from the proxy context. Investigating the TODOs surfaced that the whole path was unused. Remove ~500 lines: the struct + trait impls, the trait and DTO, the config struct, 5 tests that exercised only the dead logger, and a now-orphaned test helper.
test_proxy_visitor_management failed ("Failed to get or create
visitor") because it called get_or_create_visitor with a None project
context. The visitor table has non-nullable project_id/environment_id,
and the service correctly rejects a contextless call -- the test was
asserting an impossible scenario. Its .map_err also swallowed the real
DB error, hiding the cause.
test_proxy_session_management was #[ignore]'d with a TODO about a
foreign-key constraint: it passed a fabricated Visitor { id: 123 } that
had no DB row, so the request_sessions.visitor_id FK failed.
Both tests now build the real project -> environment -> (visitor) chain
and pass proper context. test_proxy_visitor_management also adds an
assertion that a contextless call still fails (no orphan visitors), and
both tests surface the real error instead of swallowing it. The
#[ignore] is removed per the project rule against ignored tests.
254 proxy tests pass, 0 failed (was 252 passed + 1 failed).
Dependabot medium: idna < 3.15 allows bypassing the CVE-2024-3651 fix via specially crafted inputs to idna.encode(). Bump the transitive dependency from 3.11 to 3.15 in sdks/python/uv.lock.
Bollard 0.20 no longer re-exports RestartPolicyNameEnum through the `secret` module (it is private there); the generated enum lives in `bollard::models`. Update the import path so temps-import-docker compiles against the resolved bollard 0.20.2.
hickory-proto 0.24/0.25 carries two open advisories: an NSEC3 closest-encloser unbounded loop (high) and O(n^2) name-compression CPU exhaustion on message encoding (medium). 0.26.1 fixes both. Bumps hickory-resolver / hickory-proto / hickory-server / hickory-client to 0.26 across the four DNS-using crates and migrates them to the 0.26 API: - temps-dns-resolver (the worker-node DNS server): hickory_server's `authority` module is renamed `zone_handler`; ServerFuture -> Server; RequestHandler::handle_request gains a second `T: Time` type param; RequestInfo.header -> .metadata; Header is now Metadata (plain public fields, no set_* methods). authority.rs / upstream.rs / handle.rs updated accordingly. - temps-domains, temps-infra, temps-email: TokioAsyncResolver / TokioConnectionProvider -> Resolver + net::runtime::TokioRuntimeProvider; ResolverConfig::cloudflare()/::new() removed -> udp_and_tcp(&CLOUDFLARE) / ::default(); .build() now returns Result; lookups yield a generic Lookup whose .answers() records carry typed RData (extract MX/TXT/A/ AAAA/CNAME by variant); NameServerConfig::new(SocketAddr, Protocol) -> ::udp/::udp_and_tcp(IpAddr). The unused hickory-client dev-dependency in temps-dns-resolver is dropped (no stable 0.26 release exists and nothing referenced it). Note: mongodb 3.6.0 still pins hickory 0.25 transitively for its `mongodb+srv` SRV stub resolver; that copy remains until mongodb upstream moves.
Replace the check-if-email-exists dependency with a native validation engine. check-if-email-exists is AGPL-3.0 (a licensing hazard for a commercially-distributed product) and pinned hickory 0.24, blocking the CVE-fixing 0.26 upgrade. New temps-email/services/validation/ module, four stages: - syntax: pragmatic RFC 5321/5322 local-part + domain parsing, with a typo-domain 'did you mean' suggestion. - mx: hickory-resolver 0.26 MX lookup against Cloudflare DNS. - misc: disposable-provider, role-account, and B2C-provider detection from curated lists; Gravatar URL via md5. - smtp: TCP (or SOCKS5 via tokio-socks) connect to the MX, EHLO/MAIL FROM/RCPT TO probe -- never sends DATA, never delivers mail. Reply codes classify deliverable / disabled / full-inbox; a random-localpart RCPT probe detects catch-all domains. Stages combine into a Safe/Risky/Invalid/Unknown reachability verdict. ValidationService keeps its exact public API (ValidateEmailResponse and the per-stage result structs) so handlers, the plugin, and the OpenAPI surface are unchanged -- only the engine behind it is swapped. temps-cli no longer depends on check-if-email-exists for rustls crypto-provider setup; install_crypto_provider() installs the ring provider directly. New deps: tokio-socks (SOCKS5), md5 (Gravatar). 32 unit tests cover syntax/misc/smtp-classification/reachability; live MX/SMTP tests are gated behind TEMPS_NETWORK_TESTS.
The standalone temps-mcp MCP server is no longer used — agent tooling has moved to skills + bunx. It was already commented out of temps-cli's dependencies; only a stale workspace-member entry and a leftover `temps_mcp` log-filter directive still referenced it. Removing the crate also drops rmcp 0.6.x from the dependency tree entirely, which resolves the rmcp Streamable-HTTP DNS-rebinding advisory (Dependabot high) — nothing else in the workspace depends on rmcp. oauth2 / rmcp-macros / sse-stream / schemars_derive are pruned from Cargo.lock as a result. temps-agents-mcp-proxy is a separate, still-used crate and is not affected.
CI's workspace check runs --all-targets, which compiles test code that `cargo check --lib` skips. Two test surfaces still used the hickory 0.24/0.25 API: - tests/end_to_end.rs (integration test): proto::xfer::Protocol, name_server::TokioConnectionProvider, ResolverConfig::new(), and the 2-arg NameServerConfig::new(SocketAddr, Protocol) no longer exist. Rebuilt the test DNS client on the 0.26 API — NameServerConfig::udp with the connection's port set to the test resolver's random port, TokioRuntimeProvider, and .build()'s Result handled. - authority.rs unit tests: Record's .ttl()/.data() accessors — switched to the public `ttl` / `data` fields, which resolve without the RecordData bound the methods require. Also applies `cargo fmt` across the hickory-touched files. Verified with `cargo check --all-targets --workspace` (exit 0) and clippy.
Add offset pagination and a UsageLogPage response (entries + total) to the /ai/usage/recent endpoint so the AI Gateway usage tab no longer renders an unbounded list. Page size is user-configurable up to 50. Add provider, status, cost, and total-token filters to UsageFilter and the recent-requests query. Cost and token bounds support gte/gt/lte/lt comparisons; cost is expressed in microcents. The web filter row is collapsed behind a Filters toggle (with an active count badge) and only shown on demand. The provider dropdown is sourced from the static supported-provider registry (openai, anthropic, xai, gemini) so it is not constrained by the analytics time window.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two fixes that came out of a hands-on session on a Cloudflare R2-backed
control plane:
out to every external service plus an unavoidable control-plane row.
New
target_all_services+include_control_planeflags + anexplicit
backup_schedule_servicesjoin table give granular scope.Defaults preserve current behaviour on upgrade.
service errorcollapse hid that R2 returns501 NotImplementedonboth
x-amz-taggingupload headers and standalonePutObjectTaggingcalls. Tagging is now best-effort; backups complete; failures show
the real HTTP status / service code / request id / response body.
AGENTS.mdwith the short list of conventionsfuture agents need (always CHANGELOG, use the generated SDK,
pre-commit-hook timing). CHANGELOG updated for both commits in this
PR.
What landed (by commit)
a6b20a67— Per-schedule service scope + control-plane toggle + R2fixes (one commit because pre-commit hooks take ~3-5min and I'd rather
run them once than four times):
backup_schedule_servicesjoin table,target_all_servicesflag,include_control_planeflag.BackupService: attach/detach/list-for-schedule/list-for-serviceflags + clears membership when flipping
target_all_services→true.ScheduleServicesSelectorcomponent + "All databases /Specific databases" radio + "Also back up the Temps control plane"
switch on Create & Edit pages;
ScheduleDetailsurfaces both flags.describe_sdk_errorhelper that extracts HTTP status, servicecode, request_id, x-amz-id-2, truncated body from every
SdkErrorvariant; used at every upload site.
apply_object_tagscallsPutObjectTaggingafter upload, treatsis_unsupported_errormatches as best-effort warn.web/src/lib/backup-schedules.ts(hand-rolled fetch shim);EditBackupSchedulemigrated to generated SDK.daf99a13— CHANGELOG entries for the above + newAGENTS.md.Tests
BackupService(MockDatabase, Docker-skip) coveringattach/detach/list error paths.
integration_attach_list_detach_round_trip,integration_flip_to_all_clears_membership,integration_fan_out_skips_control_plane_when_flag_off.SDK upgrade can't silently regress the matcher.
cargo check --libclean across the workspace.Test plan
cargo test --lib -p temps-backup— all 10 new tests pass.to
target_all_services=true/include_control_plane=true.plane off, attach one DB, "Run now" — assert one backup row (no
control_planesidecar).attach two services, save; flip back to "All", save; flip to
"Specific" again — membership must be empty (cleared on the
first flip-to-all).
tested with (
backups-test) — backup should succeed, with awarnlog under targettemps_backup::taggingrather than afailure.
backups.error_messagecolumn now contains HTTP status + code +request_id + body instead of "service error".