docs(blog): incident report for Prisma reconnect freezing event loop (LIT-2614) by oss-agent-shin · Pull Request #234 · BerriAI/litellm-docs

oss-agent-shin · 2026-05-27T04:52:55Z

Summary

Adds an incident-report blog post for the FLock-reported Prisma reconnect bug. The fix already landed in BerriAI/litellm#26225 (merged April 29, 2026); this is the customer-facing post-mortem requested in LIT-2614.

What the post-mortem covers

The failure mode: await self.db.disconnect() invokes prisma-client-py's synchronous subprocess.Popen.wait() on the Rust query engine, freezing the asyncio event loop for 30–120+ seconds during DB outages and breaking /health/liveliness — which Kubernetes then misreads as the pod being dead.
Why asyncio.wait_for() did not help: the blocking call has no await points, so the watchdog timeout could not fire.
The fix: replace disconnect() with a direct SIGTERM → await asyncio.sleep(0.5) → SIGKILL on the engine subprocess, so the event loop keeps running and reads from recreate_prisma_client make forward progress.
Verification table from the PR: /health/liveliness max latency drops from 10006 ms to 52.7 ms under the same injected slow-close.
Operator guidance for affected versions + remediation.
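The point about asyncio.wait_for() can be reproduced in isolation. The sketch below is not code from litellm or prisma-client-py: `blocking_disconnect` and `timed_call` are hypothetical names, and `time.sleep()` stands in for the synchronous `subprocess.Popen.wait()` call. It shows that a coroutine with no await points runs to completion regardless of the timeout requested around it.

```python
import asyncio
import time


async def blocking_disconnect(block_seconds: float) -> None:
    """Hypothetical stand-in for a coroutine whose body blocks synchronously."""
    # No await here: while this sleeps, the event loop cannot run timers,
    # health-check handlers, or the timeout callback scheduled by wait_for().
    time.sleep(block_seconds)


async def timed_call(block_seconds: float, timeout: float) -> float:
    """Return how long the wait_for()-wrapped call actually took, in seconds."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(blocking_disconnect(block_seconds), timeout=timeout)
    except asyncio.TimeoutError:
        # If a timeout is reported at all, it is only after the block has ended.
        pass
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = asyncio.run(timed_call(block_seconds=0.3, timeout=0.05))
    print(f"requested timeout: 0.05 s, actual wait: {elapsed:.2f} s")
```

The measured wait tracks the blocking duration rather than the timeout, which is the same shape as the 30–120+ second freeze described above, only scaled down.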

Files

New: blog/prisma_reconnect_blocking_incident/index.md — single docusaurus blog entry, slug prisma-reconnect-blocking-incident. Matches the format of the existing blog/httpx_cache_eviction_incident/ and blog/vllm_embeddings_incident/ entries.

Evidence

This is a docs-only PR — no executable surface, so no runtime evidence to capture.

I cross-checked every code path I describe against the merged fix on BerriAI/litellm main:

_get_engine_pid and _kill_engine_process live in litellm/proxy/db/prisma_client.py (lines 123 and 137 on current main).
recreate_prisma_client (same file, line 315) calls _kill_engine_process and then constructs a fresh Prisma() + connect() — matches the post-mortem's flow diagram.
The verification latency table is reproduced verbatim from the testing section of PR #26225.
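For readers who have not opened the merged fix, the SIGTERM → sleep → SIGKILL pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `_kill_engine_process`: the names `terminate_then_kill` and `demo` are hypothetical, a sleeping Python child stands in for the Rust query engine, and the code is POSIX-only because it relies on `signal.SIGKILL`.

```python
import asyncio
import os
import signal
import subprocess
import sys


async def terminate_then_kill(pid: int, grace_seconds: float = 0.5) -> None:
    """Hypothetical helper: SIGTERM, yield to the event loop, then SIGKILL."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # process already gone
    # asyncio.sleep() is an await point, so other tasks keep running here.
    await asyncio.sleep(grace_seconds)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # exited and was reaped during the grace period


async def demo() -> tuple:
    """Kill a stand-in child while a heartbeat task proves the loop is alive."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.05)

    hb = asyncio.create_task(heartbeat())
    await terminate_then_kill(child.pid, grace_seconds=0.3)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    # Reap the child in a worker thread so the wait itself does not block the loop.
    returncode = await asyncio.to_thread(child.wait, 5)
    return ticks, returncode


if __name__ == "__main__":
    ticks, returncode = asyncio.run(demo())
    print(f"heartbeat ticks during shutdown: {ticks}, child return code: {returncode}")
```

The heartbeat counter advancing during the grace period is the property that matters: the only wait in the kill path is an awaitable sleep, so a liveliness handler scheduled on the same loop would still get a turn.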

I also pulled the wording of /health/liveliness semantics from the existing proxy docs to keep it consistent.

Notes for reviewers

This PR was pushed via the GitHub Contents API (PUT /repos/{owner}/{repo}/contents/{path}) rather than git push because the current GITHUB_TOKEN lacks the repo scope. There is one commit on the branch corresponding to the new file.

vercel · 2026-05-27T04:53:01Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
litellm	Ready	Preview, Comment	May 27, 2026 4:58am

docs(blog): add incident report for Prisma reconnect freezing event loop (LIT-2614)

b8134e7

vercel Bot deployed to Preview May 27, 2026 04:58 View deployment

mubashir1osmani approved these changes May 27, 2026

mubashir1osmani merged commit 2749ebe into BerriAI:main May 27, 2026
2 checks passed

mubashir1osmani merged 1 commit into BerriAI:main from oss-agent-shin:shin/lit-2614-prisma-reconnect-postmortem
