Add is_alive liveness-probe API across SDKs (decouple liveness from typed ping deserialization) #1354

@tclem

Summary

The Copilot CLI host (github/github-app) had to work around a deserialization failure in our SDK's typed ping() API on the session-resume / warm-CLI-pool / liveness-probe paths. See github/github-app#5461 for the workaround.

Their fix introduces a ping_cli_compat(&client) helper that drops to the raw client.call("ping", json!({})) so the result body is never deserialized — they only care whether the JSON-RPC round trip succeeded.

We should give them (and everyone doing liveness checks) a first-class primitive instead of forcing them to bypass our typed API.

Root cause

Across all five SDKs, ping() deserializes the response into a typed body.

The authoritative PingResult schema (@github/copilot/schemas/api.schema.json) requires message, timestamp (date-time string), and protocolVersion (integer > 0). The Rust hand-written PingResponse softens this with #[serde(default)] and Option<u32>, but that only helps with missing fields; wrong types (e.g. a null timestamp or integer-shape drift) still fail. The same brittleness exists in the other SDKs.

ping is used by hosts as a liveness check (warm-pool reuse, resumed-session aliveness, retrier "is the cached client still alive" probes). These paths straddle CLI version boundaries — a resumed older CLI process can answer with a slightly different ping body shape, and the entire liveness check fails even though the RPC itself succeeded.

This is exactly the wrong failure mode for a health check: the consumer asked "is the CLI reachable?" and we answered "no" because of a body-shape mismatch.
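
To make the failure mode concrete, here is a minimal Python sketch (stdlib only) of a strict ping parse next to an envelope-only check. It is not the SDK's actual code: PingResult, parse_ping_result, rpc_succeeded, and the sample reply are all hypothetical names and data invented for illustration, and the field rules are only those stated above (message, timestamp as a date-time string, protocolVersion as an integer > 0).

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PingResult:
    message: str
    timestamp: datetime
    protocol_version: int


def parse_ping_result(body: dict) -> PingResult:
    """Strict parse: raises on missing or wrongly typed fields."""
    message = body["message"]
    timestamp = body["timestamp"]
    version = body["protocolVersion"]
    if not isinstance(message, str):
        raise TypeError("message must be a string")
    if not isinstance(timestamp, str):
        raise TypeError("timestamp must be a date-time string")
    if isinstance(version, bool) or not isinstance(version, int) or version <= 0:
        raise TypeError("protocolVersion must be an integer > 0")
    return PingResult(message, datetime.fromisoformat(timestamp), version)


def rpc_succeeded(response: dict) -> bool:
    """Envelope-only check: a JSON-RPC 2.0 success has 'result' and no 'error'."""
    return "result" in response and "error" not in response


# Hypothetical reply from an older CLI: a valid JSON-RPC success whose
# body has drifted (null timestamp).
raw = (
    '{"jsonrpc": "2.0", "id": 1, '
    '"result": {"message": "pong", "timestamp": null, "protocolVersion": 3}}'
)
response = json.loads(raw)

reachable = rpc_succeeded(response)  # the round trip itself worked
try:
    parse_ping_result(response["result"])
    typed_ok = True
except (KeyError, TypeError, ValueError):
    typed_ok = False  # the strict parse rejects the drifted body
```

The two answers diverge for the same reply: the envelope says the CLI is reachable, while the typed parse reports a failure. A liveness check built on the typed parse inherits the second answer.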

Proposed fix

Add a dedicated liveness-probe API to every SDK Client:

  • Sends the ping JSON-RPC call.
  • Returns success based solely on JSON-RPC success — never deserializes the result body.
  • Composable with caller-supplied timeouts — no baked-in timeout, so each host (startup probe, warm-pool, resume probe, background keepalive) sets its own budget.
  • ping() and generated rpc.ping() stay strict and schema-faithful. Callers who actually want the typed data keep getting it; schema drift continues to surface there as a real error.
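
The properties listed above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the Client wrapper, the raw call(method, params) coroutine, the RpcError type, and the probe helper are hypothetical, and the sketch assumes an asyncio-based transport, which the real Python SDK may not use.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical raw transport: sends a JSON-RPC request and returns the
# undecoded result, or raises RpcError if the call fails.
RawCall = Callable[[str, dict], Awaitable[Any]]


class RpcError(Exception):
    """Hypothetical error raised when the JSON-RPC call itself fails."""


class Client:
    def __init__(self, call: RawCall) -> None:
        self._call = call

    async def is_alive(self) -> bool:
        """True if the ping round trip succeeded; the body is never inspected."""
        try:
            await self._call("ping", {})
        except RpcError:
            return False
        return True


async def probe(client: Client, budget_seconds: float) -> bool:
    """Caller-side timeout: is_alive itself has no baked-in budget."""
    try:
        return await asyncio.wait_for(client.is_alive(), timeout=budget_seconds)
    except asyncio.TimeoutError:
        return False


async def _demo() -> tuple:
    async def drifted(method: str, params: dict) -> Any:
        return {"message": "pong", "timestamp": None}  # drifted body, RPC ok

    async def dead(method: str, params: dict) -> Any:
        raise RpcError("connection closed")

    async def hung(method: str, params: dict) -> Any:
        await asyncio.sleep(10)  # never answers within the budget

    return (
        await probe(Client(drifted), 1.0),
        await probe(Client(dead), 1.0),
        await probe(Client(hung), 0.05),
    )


results = asyncio.run(_demo())
```

Keeping the timeout in probe rather than in is_alive is what lets a startup probe and a background keepalive use different budgets against the same client.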

Per-language names:

SDK     API
Rust    Client::is_alive(&self) -> bool
Node    client.isAlive(): Promise<boolean>
Python  client.is_alive() -> bool
Go      client.IsAlive(ctx context.Context) bool
.NET    client.IsAliveAsync(ct) : Task<bool>

After this lands, the github/github-app workaround helper goes away and call sites become e.g. existing.is_alive().await.

Rejected alternatives

  • Loosen ping() itself. Considered and rejected. ping() is a typed schema-backed API; if it silently swallows malformed bodies, the contract becomes ambiguous (did the caller want liveness, or the ping data?). It also masks real CLI/schema drift in the API most likely to catch it.
  • Loosen the generated rpc.ping(PingRequest). Same reasoning, more so — generated APIs must remain schema-faithful. is_alive is the explicit escape hatch.
  • "Fix it only in the CLI." The CLI should still honor the schema, and we should investigate any drift. But liveness checks fundamentally straddle version boundaries, so the SDK needs a body-agnostic primitive regardless.

Acceptance criteria

  • New is_alive / isAlive / IsAlive / IsAliveAsync method on the Rust, Node, Python, Go, and .NET Client types, named as in the table above.
  • Method calls the ping JSON-RPC method and returns success purely on RPC success, ignoring the response body.
  • Existing ping() / rpc.ping() typed APIs unchanged.
  • Docs in each SDK clearly distinguish "use is_alive for liveness/warm-pool/resume checks" from "use ping() when you want the typed response."
  • Tests cover the case where the CLI returns a ping body that fails strict deserialization but is otherwise a valid JSON-RPC success — is_alive returns true, ping() returns an error.
  • Once shipped, the workaround in github/github-app#5461 is removed.
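
The test criterion above might be expressed along these lines in Python. This is a sketch under stated assumptions, not the SDK's test suite: FakeDriftedCli, SchemaError, and this stripped-down Client are hypothetical stand-ins, and the strict check covers only the timestamp field to keep the example short.

```python
import unittest


class SchemaError(Exception):
    """Hypothetical error for a ping body that fails strict validation."""


class FakeDriftedCli:
    """Hypothetical transport: a JSON-RPC success with a null timestamp."""

    async def call(self, method: str, params: dict) -> dict:
        return {"message": "pong", "timestamp": None, "protocolVersion": 3}


class Client:
    def __init__(self, transport) -> None:
        self._transport = transport

    async def ping(self) -> dict:
        # Typed path: stays strict and surfaces schema drift as an error.
        body = await self._transport.call("ping", {})
        if not isinstance(body.get("timestamp"), str):
            raise SchemaError("timestamp must be a date-time string")
        return body

    async def is_alive(self) -> bool:
        # Liveness path: only the success of the round trip matters.
        try:
            await self._transport.call("ping", {})
        except Exception:
            return False
        return True


class LivenessVsTypedPing(unittest.IsolatedAsyncioTestCase):
    async def test_is_alive_ignores_body_drift(self):
        self.assertTrue(await Client(FakeDriftedCli()).is_alive())

    async def test_ping_stays_strict(self):
        with self.assertRaises(SchemaError):
            await Client(FakeDriftedCli()).ping()


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LivenessVsTypedPing)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests run against the same fake reply, which pins down the intended split: the drifted body is invisible to is_alive and an error for ping().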

Design review

Reviewed and agreed with GPT-5.5; consensus on:

  • Separate is_alive API (not loosening ping).
  • Keep generated rpc.ping strict.
  • No baked-in timeout; caller composes timeouts.
  • Name is_alive clearly communicates intent ("RPC round trip succeeded") and beats ping_raw / ping_check.
