Skip to content

fix(BA-6035): split RouteHealthRecord into ReplicaProbeTarget + ReplicaHealthStatus#11632

Merged
HyeockJinKim merged 13 commits into
mainfrom
fix/BA-6035-split-route-health-records
May 15, 2026
Merged

fix(BA-6035): split RouteHealthRecord into ReplicaProbeTarget + ReplicaHealthStatus#11632
HyeockJinKim merged 13 commits into
mainfrom
fix/BA-6035-split-route-health-records

Conversation

@HyeockJinKim
Copy link
Copy Markdown
Collaborator

Summary

  • Split monolithic RouteHealthRecord Valkey type into two focused types:
    • ReplicaProbeTarget (probe config, 3600s TTL) — written once at WARMING_UP entry
    • ReplicaHealthStatus (health result, 120s TTL) — written by observer after each probe; key expiry signals DEGRADED
  • Replace revision-per-cycle DB queries with health_check: PydanticColumn(ModelHealthCheck) on RoutingRow, backfilled via Alembic migration
  • Refactor DeploymentInfo to use direct current_revision/deploying_revision fields instead of model_revisions list
  • Add ReplicaProbeTargetSyncHandler (RUNNING + HEALTHY/UNHEALTHY/DEGRADED, 60s cycle) for TTL refresh and Valkey recovery
  • Fix AppProxySyncRouteHandler to target RUNNING + ACTIVE only (health filter removed; LB handles routing)
  • Remove RouteHealthRecord class and 8 legacy Valkey methods

Test plan

  • Unit tests for ReplicaProbeTarget / ReplicaHealthStatus serialization
  • Unit tests for _register_route_probe_targets, sync_route_probe_targets
  • Unit tests for check_warming_up_health (DB-based timeout via get_db_now())
  • Unit tests for check_route_health (TTL 3-way classification: None=DEGRADED)
  • Unit tests for RouteHealthObserver (batch write, no within_initial_delay)
  • Alembic migration verified against local DB with representative test data

Resolves BA-6035

Copilot AI review requested due to automatic review settings May 15, 2026 06:31
@HyeockJinKim HyeockJinKim requested a review from a team as a code owner May 15, 2026 06:31
@github-actions github-actions Bot added the size:XL 500~ LoC label May 15, 2026
@github-actions github-actions Bot added comp:manager Related to Manager component comp:common Related to Common component require:db-migration Automatically set when alembic migrations are added or updated labels May 15, 2026
HyeockJinKim added a commit that referenced this pull request May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors route health tracking by splitting the legacy Valkey RouteHealthRecord into separate probe-target and probe-status objects, while also persisting health-check configuration on DB routings rows to avoid per-cycle revision queries. In parallel, it refactors DeploymentInfo to carry direct current_revision / deploying_revision objects rather than a model_revisions list, and updates Sokovan handlers/strategies accordingly.

Changes:

  • Introduce ReplicaProbeTarget + ReplicaHealthStatus Valkey types/client APIs; update route health observer/executor to use TTL-based classification.
  • Add health_check (Pydantic JSONB) column to routings with an Alembic backfill, and plumb it through route creation and DB source mapping.
  • Replace DeploymentInfo.{current_revision_id,deploying_revision_id,model_revisions} with DeploymentInfo.{current_revision,deploying_revision} and update rolling-update/scaling/service code + tests; add a periodic probe-target sync handler and adjust AppProxy sync to RUNNING+ACTIVE.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/manager/sokovan/deployment/test_coordinator_history.py Update fixtures for DeploymentInfo revision field changes.
tests/unit/manager/sokovan/deployment/strategy/test_rolling_update.py Adapt rolling-update tests to current_revision/deploying_revision objects.
tests/unit/manager/sokovan/deployment/route/test_coordinator_history.py Include health_check field in RouteData test fixtures.
tests/unit/manager/sokovan/deployment/route/handlers/test_terminating_handler.py Update RouteData fixtures for new health_check attribute.
tests/unit/manager/sokovan/deployment/route/handlers/test_health_check_handler.py Update RouteData fixtures for new health_check attribute.
tests/unit/manager/sokovan/deployment/route/executor/test_terminate_routes_drain.py Update RouteData fixtures for new health_check attribute.
tests/unit/manager/sokovan/deployment/route/executor/test_route_executor.py Update health-check classification tests to use ReplicaHealthStatus.
tests/unit/manager/sokovan/deployment/route/executor/test_register_unregister_routes.py Update RouteData fixtures for new health_check attribute.
tests/unit/manager/sokovan/deployment/route/executor/test_initial_delay.py Replace initial-delay/record tests with probe-target registration + observer write tests.
tests/unit/manager/sokovan/deployment/route/executor/test_check_route_health_register.py Update “newly healthy triggers register” tests to use ReplicaHealthStatus.
tests/unit/manager/sokovan/deployment/route/executor/conftest.py Adjust ValkeyScheduleClient mock to new batch APIs and route/deployment fixtures.
tests/unit/manager/sokovan/deployment/handlers/test_deploying_handler.py Update deploying handler tests for deploying_revision object.
tests/unit/manager/sokovan/deployment/executor/conftest.py Update executor fixtures for current_revision object + health_check on routes.
tests/unit/manager/services/deployment/test_deployment_service.py Update service conversion tests to new deployment revision fields.
tests/unit/manager/services/deployment/test_deployment_crud_actions.py Update CRUD tests for DeploymentInfo schema changes.
tests/unit/manager/repositories/deployment/test_deployment_repository.py Update route spec fixtures to include health_check.
tests/unit/common/clients/valkey_client/test_valkey_schedule_types.py New unit tests for Valkey type serialization (ReplicaProbeTarget, ReplicaHealthStatus).
tests/unit/common/clients/valkey_client/test_valkey_schedule_client.py Add integration tests for new ValkeyScheduleClient probe/health-status methods.
src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py Switch to deploying_revision object and attach health_check into route creation specs.
src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py Update deploying revision ID collection to use deploying_revision.id.
src/ai/backend/manager/sokovan/deployment/route/types.py Add PROBE_TARGET_SYNC lifecycle type.
src/ai/backend/manager/sokovan/deployment/route/handlers/probe_target_sync.py New handler to periodically refresh/recover ReplicaProbeTarget entries in Valkey.
src/ai/backend/manager/sokovan/deployment/route/handlers/observer/health_check.py Observer now reads probe targets and writes health statuses in batch.
src/ai/backend/manager/sokovan/deployment/route/handlers/appproxy_sync.py Change AppProxy sync targeting from health-filtered to RUNNING+ACTIVE.
src/ai/backend/manager/sokovan/deployment/route/handlers/init.py Export the new ReplicaProbeTargetSyncHandler.
src/ai/backend/manager/sokovan/deployment/route/executor.py Implement probe-target registration/sync and TTL-based health status classification; refactor revision-field usages.
src/ai/backend/manager/sokovan/deployment/route/coordinator.py Register the new probe-target sync lifecycle and schedule it.
src/ai/backend/manager/sokovan/deployment/handlers/replica.py Update scaling eligibility check to use current_revision object presence.
src/ai/backend/manager/sokovan/deployment/handlers/deploying.py Update endpoint registration gating to use deploying_revision object presence.
src/ai/backend/manager/sokovan/deployment/executor.py Refactor scaling/endpoint building to use current_revision/deploying_revision objects and attach health_check to new routes.
src/ai/backend/manager/sokovan/deployment/deployment_controller.py Use current_revision object directly for validation and equality checks.
src/ai/backend/manager/services/deployment/service.py Simplify deployment DTO conversion using current_revision/deploying_revision.
src/ai/backend/manager/repositories/deployment/types/endpoint.py Add `health_check: ModelHealthCheck
src/ai/backend/manager/repositories/deployment/db_source/db_source.py Stop eager-loading full revision lists; load current/deploying revision rows and map route health_check.
src/ai/backend/manager/repositories/deployment/creators/route.py Add health_check to RouteCreatorSpec and persist to RoutingRow.
src/ai/backend/manager/models/routing/row.py Add health_check JSONB PydanticColumn to the routings table model.
src/ai/backend/manager/models/endpoint/row.py Add relationships for current/deploying revision rows and build DeploymentInfo from them.
src/ai/backend/manager/models/alembic/versions/c3d4e5f6a7b8_add_health_check_to_routings.py New migration to add/backfill/drop routings.health_check.
src/ai/backend/manager/data/deployment/types.py Refactor DeploymentInfo schema to store revision objects; remove resolve_revision_data.
src/ai/backend/manager/api/rest/service/handler.py Resolve target revision using direct current_revision/deploying_revision.
src/ai/backend/common/clients/valkey_client/valkey_schedule/types.py New Valkey DTOs for probe targets and health statuses/results.
src/ai/backend/common/clients/valkey_client/valkey_schedule/client.py Remove RouteHealthRecord, add probe-target + health-status batch read/write APIs with TTL constants.
src/ai/backend/common/clients/valkey_client/valkey_schedule/init.py Export new Valkey types and stop exporting RouteHealthRecord.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +406 to +409
statuses = await self._valkey_schedule.get_route_health_statuses_batch([
route.route_id for route in routes
])
now = await self._deployment_repo.get_db_now()
Comment on lines +476 to +477
if status is None:
# Key absent or TTL expired → DEGRADED
Comment on lines +395 to +397
log.debug("Synced {} ReplicaProbeTargets to Valkey", len(targets))

return RouteExecutionResult(successes=[], errors=[])
Comment on lines 57 to 62
@classmethod
def target_statuses(cls) -> RouteTargetStatuses:
return RouteTargetStatuses(
lifecycle=[RouteStatus.RUNNING],
health=[RouteHealthStatus.HEALTHY],
traffic=[RouteTrafficStatus.ACTIVE],
)
deploying_revision = deployment.deploying_revision
if deploying_revision is None:
raise InvalidEndpointState(
f"Deployment {deployment.id} has DEPLOYING lifecycle but deploying_revision_id is None. "
HyeockJinKim and others added 12 commits May 15, 2026 15:38
…ypes

Split route health Valkey data into two separate types:
- RouteProbeTarget: probe config (health_path, inference_port, replica_host)
  stored at route_probe:{replica_id}, TTL 3600s, written by coordinator
- RouteHealthStatus: check result (healthy, last_check)
  stored at route_health:{replica_id}, TTL 120s, written by observer

Both types use ReplicaID typed identifier for replica_id field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ethods with tests

Add four new ValkeyScheduleClient methods for the split route health design:
- register_route_probe_targets_batch: writes probe config at route_probe:{id}, TTL 3600s
- get_route_probe_targets_batch: batch-reads probe targets
- record_route_health_status: writes health result at route_health:{id}, TTL 120s (expiry = DEGRADED)
- get_route_health_statuses_batch: batch-reads health statuses

Also rename RouteProbeTarget.route_id → replica_id for consistency with RouteHealthStatus.

Tests: pure serialization tests in test_valkey_schedule_types.py, Redis integration
tests for all four methods in test_valkey_schedule_client.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Info revisions

Key changes:
- Add health_check: PydanticColumn(ModelHealthCheck) to RoutingRow (nullable);
  set from revision at route creation, eliminating per-cycle revision fetches
  in _register_route_probe_targets and sync_route_probe_targets
- Add health_check field to RouteCreatorSpec (required) and RouteData (required,
  no default); both callers now resolve health config from revision at creation
- Refactor DeploymentInfo: replace model_revisions list + current_revision_id +
  deploying_revision_id with direct current_revision and deploying_revision fields
- Add current_revision_row / deploying_revision_row 1:1 relationships on
  EndpointRow; to_deployment_info() uses these instead of iterating revisions
- Replace all selectinload(EndpointRow.revisions) in db_source with the two 1:1
  relationships; remove unnecessary image_row eager loads and duplicates
- Remove redundant get_revision() call in deployment_controller.update_deployment
- Remove unnecessary ReplicaID() casts on route.route_id (already ReplicaID)
- Alembic migration: c3d4e5f6a7b8 adds health_check JSONB column to routings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…efinition

Existing routes have health_check=NULL after the column was added.
Backfill by reading health_check config from deployment_revisions.model_definition
and applying ModelHealthCheck defaults for optional fields.
Only populates when path (required field) exists in the draft config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_route_health

- check_warming_up_health: use get_route_health_statuses_batch + DB-based
  timeout via _deployment_repo.get_db_now() instead of Redis time + revision fetch
- check_route_health: TTL 3-way classification — None=DEGRADED, healthy=success,
  unhealthy=error (removes within_initial_delay, is_stale, get_redis_time calls)
- Update TestCheckRouteHealth to use ValkeyRouteHealthStatus fixtures
- Rewrite test_check_route_health_register.py to use get_route_health_statuses_batch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…batch health recording

- Add ReplicaHealthResult dataclass (write input; last_check stamped by client)
- Add record_route_health_statuses_batch: single Redis time fetch + pipeline
- Rename Valkey-layer types Route* → Replica*
  (RouteProbeTarget/RouteHealthStatus/RouteHealthResult → Replica*)
- Observer: get_route_probe_targets_batch + record_route_health_statuses_batch;
  within_initial_delay guard removed, TTL handles DEGRADED
- Extract _build_probe_target() static method; skip routes without health_check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… target statuses

- Add RouteLifecycleType.PROBE_TARGET_SYNC
- Add ReplicaProbeTargetSyncHandler: targets RUNNING + [HEALTHY, UNHEALTHY, DEGRADED]
  (NOT_CHECKED excluded — probe targets registered at WARMING_UP entry)
- Register handler + task spec in coordinator (long_interval=60s, initial_delay=30s)
- AppProxySyncRouteHandler: target RUNNING + ACTIVE only (remove health=HEALTHY filter)
  Initial registration waits for HEALTHY via check_route_health; sync handler
  converges all active routes regardless of health

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove RouteHealthRecord class (replaced by ReplicaProbeTarget + ReplicaHealthStatus)
- Remove legacy methods: mark_route_running_at, get_route_running_at_batch,
  refresh_route_health_ttl, update_route_manager_health,
  initialize_route_health_records_batch, get_route_health_record,
  get_route_health_records_batch, record_route_health_status (single)
- Remove RouteHealthRecord from __init__.py exports
- Update test_valkey_schedule_client.py to use record_route_health_statuses_batch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update tests broken by the DeploymentInfo refactor (model_revisions →
current_revision/deploying_revision) and RouteCreatorSpec health_check addition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ry test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@HyeockJinKim HyeockJinKim force-pushed the fix/BA-6035-split-route-health-records branch from 768694a to 6f50520 Compare May 15, 2026 06:42
Comment thread changes/11632.enhance.md
@@ -0,0 +1 @@
Split route health Valkey record into ReplicaProbeTarget (probe config) and ReplicaHealthStatus (TTL-based result), removing initial_delay from Valkey and switching to DB-based timeout in check_warming_up_health.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title says "fix," but the news fragment says "enhance," so I think we need to standardize this.

),
network_access=info.network,
revision_history_ids=[info.current_revision_id] if info.current_revision_id else [],
revision_history_ids=[info.current_revision.id]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you currently using this field? It seems you should check whether you’ll be affected by the change in its meaning. If it’s not being used, it might be a good idea to delete it.

…ing update

- Add health_check: ModelHealthCheck | None to RouteInfo (populated from RoutingRow)
- Rolling update _classify_routes: RUNNING routes without health check count
  as new_healthy_count so DEPLOYING→READY transition can complete
- DB health_status stays DEGRADED (no probe data — correct behaviour)
- Add TestNoHealthCheck unit tests
- Fix CLI revision add --auto-activate flag (nest under options dict)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added comp:client Related to Client component comp:cli Related to CLI component labels May 15, 2026
@HyeockJinKim HyeockJinKim merged commit 8dbee20 into main May 15, 2026
36 checks passed
@HyeockJinKim HyeockJinKim deleted the fix/BA-6035-split-route-health-records branch May 15, 2026 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:cli Related to CLI component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants