docs(blog): incident report for Prisma reconnect freezing event loop (LIT-2614) #234

Merged
mubashir1osmani merged 1 commit into BerriAI:main from oss-agent-shin:shin/lit-2614-prisma-reconnect-postmortem
May 27, 2026
@oss-agent-shin
Contributor

Summary

Adds an incident-report blog post for the FLock-reported Prisma reconnect bug. The fix already landed in BerriAI/litellm#26225 (merged April 29, 2026); this is the customer-facing post-mortem requested in LIT-2614.

What the post-mortem covers

  • The failure mode: await self.db.disconnect() invokes prisma-client-py's synchronous subprocess.Popen.wait() on the Rust query engine. That call freezes the asyncio event loop for 30–120+ seconds during DB outages, so /health/liveliness stops responding and Kubernetes misreads the pod as dead.
  • Why asyncio.wait_for() did not help: the blocking call has no await points, so the watchdog timeout could not fire.
  • The fix: replace disconnect() with a direct SIGTERM → await asyncio.sleep(0.5) → SIGKILL on the engine subprocess, so the event loop keeps running and reads from recreate_prisma_client make forward progress.
  • Verification table from the PR (/health/liveliness max latency drops from 10006 ms → 52.7 ms under the same injected slow-close).
  • Operator guidance on affected versions and remediation.
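
The wait_for point in the list above can be reproduced in isolation. This is a minimal sketch, not code from litellm or prisma-client-py: blocking_disconnect is a hypothetical stand-in that uses time.sleep to mimic a synchronous wait with no await points, and the durations are arbitrary.

```python
import asyncio
import time


async def blocking_disconnect(seconds):
    # Hypothetical stand-in: a coroutine that blocks synchronously.
    # There is no await inside, so the event loop never regains control.
    time.sleep(seconds)


async def cooperative_disconnect(seconds):
    # Same duration, but it yields to the event loop while waiting.
    await asyncio.sleep(seconds)


async def timed(coro_fn, seconds, timeout):
    """Run coro_fn under wait_for; return (timed_out, elapsed_seconds)."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(coro_fn(seconds), timeout=timeout)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, time.monotonic() - start


async def main():
    # Both calls ask for a 0.05 s timeout around a 0.5 s operation.
    blocking = await timed(blocking_disconnect, 0.5, 0.05)
    cooperative = await timed(cooperative_disconnect, 0.5, 0.05)
    return blocking, cooperative


if __name__ == "__main__":
    for label, (timed_out, elapsed) in zip(("blocking", "cooperative"), asyncio.run(main())):
        print(f"{label}: timed_out={timed_out} elapsed={elapsed:.2f}s")
```

In the blocking case the timeout never fires and the call runs for its full duration, because the loop has no chance to run the timer callback. Only the cooperative variant is cut short.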

Files

  • New: blog/prisma_reconnect_blocking_incident/index.md, a single Docusaurus blog entry with slug prisma-reconnect-blocking-incident. It matches the format of the existing blog/httpx_cache_eviction_incident/ and blog/vllm_embeddings_incident/ entries.

Evidence

This is a docs-only PR with no executable surface, so there is no runtime evidence to capture.

I cross-checked every code path I describe against the merged fix on BerriAI/litellm main:

  • _get_engine_pid and _kill_engine_process live in litellm/proxy/db/prisma_client.py (lines 123 and 137 on current main).
  • recreate_prisma_client (same file, line 315) calls _kill_engine_process and then constructs a fresh Prisma() + connect() — matches the post-mortem's flow diagram.
  • The verification latency table is reproduced verbatim from the testing section of PR #26225.
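
The kill step that recreate_prisma_client relies on can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: kill_process and demo are hypothetical names, the child is a plain sleeping Python process rather than the Prisma query engine, and the code is POSIX-only because it uses SIGKILL.

```python
import asyncio
import os
import signal
import subprocess
import sys


async def kill_process(pid, grace_seconds=0.5):
    """Hypothetical helper: SIGTERM, yield to the event loop, then SIGKILL."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # process is already gone
    # Non-blocking wait: other tasks (e.g. health checks) keep running.
    await asyncio.sleep(grace_seconds)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # process exited and was reaped during the grace period


async def demo():
    # Stand-in for the engine subprocess (assumption for this sketch).
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    ticks = 0

    async def heartbeat():
        # Counts how often the loop gets to run while the kill is in flight.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.02)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    await kill_process(proc.pid, grace_seconds=0.2)
    hb.cancel()
    # Reap the child; it has already been signalled, so this returns promptly.
    returncode = proc.wait(timeout=5)
    return returncode, ticks


if __name__ == "__main__":
    code, ticks = asyncio.run(demo())
    print(f"child exit code: {code}, heartbeat ticks during kill: {ticks}")
```

The heartbeat task keeps ticking through the grace period, which is the property the fix depends on: the only wait in the kill path is an awaited sleep, so a liveness handler scheduled on the same loop would still be served.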

I also pulled the wording of /health/liveliness semantics from the existing proxy docs to keep it consistent.

Notes for reviewers

  • This PR was pushed via the GitHub Contents API (PUT /repos/{owner}/{repo}/contents/{path}) rather than git push because the current GITHUB_TOKEN lacks the repo scope. There is one commit on the branch corresponding to the new file.
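
For reviewers unfamiliar with that endpoint, the request shape looks roughly like this. It is a sketch only: build_contents_put is a hypothetical helper, the argument values in the usage lines are placeholders, and the request is constructed but never sent.

```python
import base64
import json
import urllib.request


def build_contents_put(owner, repo, path, text, message, branch, token):
    """Build (but do not send) a PUT request that creates a file on a branch."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    body = {
        "message": message,  # commit message for the single resulting commit
        # The Contents API expects the file body base64-encoded.
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "branch": branch,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


if __name__ == "__main__":
    # Placeholder values, not the ones used for this PR.
    req = build_contents_put(
        "example-owner", "example-repo", "blog/example/index.md",
        "# Example post\n", "docs: add example post", "example-branch", "TOKEN",
    )
    print(req.get_method(), req.full_url)
```

Each such call produces one commit per file, which is consistent with the single commit on this branch.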

@mubashir1osmani mubashir1osmani merged commit 2749ebe into BerriAI:main May 27, 2026
2 checks passed