Description
The PostgreSQL configuration in db-configmap.yaml ships defaults that cause cascading failures in Kubernetes environments where pods are frequently restarted, rescheduled, or deleted. We hit all of these in production with an external CNPG database, but they apply equally to the chart's built-in StatefulSet.
Related: The deadlock loop described here is the operational consequence of the zombie connections that finding 1e in issue #1 (readOnlyRootFilesystem) also exposes — both issues compound each other.
Environment
- KASM Workspaces: 1.18.1
- Database: CloudNativePG PostgreSQL 18.1 (external) and chart's StatefulSet config
- Kubernetes: RKE2 v1.33
Findings
1. statement_timeout = 0 (unlimited)
Line 94 of db-configmap.yaml sets statement_timeout = 0. This allows any query — including DDL blocked on locks — to wait indefinitely. In our deployment, an Alembic ALTER TABLE blocked for over an hour waiting on a lock held by a zombie connection, causing the manager to deadlock.
Suggested fix:
-statement_timeout = 0
+statement_timeout = 120000 # 2 minutes; override per-session for long-running migrations
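A bounded default still leaves room for legitimately slow DDL, because the limit can be raised for the migration's own session only. A minimal sketch, assuming the migration runs plain SQL against PostgreSQL (the table name and the 30-minute value are illustrative, not anything the chart or KASM sets):

-- Raise the limit only for this session; the global 120s default
-- stays in force for every other connection.
SET statement_timeout = '30min';                           -- illustrative value for a known-long step
ALTER TABLE example_table ADD COLUMN example_col text;     -- hypothetical DDL
RESET statement_timeout;                                   -- back to the configured default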
2. TCP keepalives: 2+ hour zombie detection
Lines 68-70:
tcp_keepalives_idle = 7200
tcp_keepalives_interval = 75
tcp_keepalives_count = 9
When a KASM pod is deleted or OOMKilled, PostgreSQL does not detect the broken TCP connection for over 2 hours (7200s idle + 9 × 75s of failed probes = 7875s). The zombie connection keeps holding its AccessShareLocks, which block the db-init-job's Alembic migrations that need AccessExclusiveLock for DDL operations. The failure cascades: the migration hangs, the manager can't read from the locked tables, healthchecks fail, pods restart, and each restart creates more zombie connections.
Suggested fix:
-tcp_keepalives_idle = 7200
-tcp_keepalives_interval = 75
-tcp_keepalives_count = 9
+tcp_keepalives_idle = 60 # Start probing after 60s idle
+tcp_keepalives_interval = 10 # Probe every 10s
+tcp_keepalives_count = 3 # 3 failed probes = dead connection
Detection time: ~90 seconds instead of 2+ hours.
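Until the tighter keepalives are in place, zombie backends can at least be spotted by hand. A hedged sketch of the kind of query we used, joining pg_stat_activity with pg_locks; the 10-minute idle threshold is illustrative:

-- Backends idle for a long time that still hold locks: the signature
-- of a connection whose pod no longer exists.
SELECT a.pid, a.usename, a.client_addr,
       now() - a.state_change AS idle_for,
       l.mode, l.relation::regclass AS locked_relation
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE a.state = 'idle'
  AND now() - a.state_change > interval '10 minutes'
ORDER BY idle_for DESC;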
3. No idle_in_transaction_session_timeout
The chart does not set idle_in_transaction_session_timeout. KASM application connections frequently enter idle in transaction state (visible in pg_stat_activity). These sessions hold shared locks that block DDL migrations indefinitely.
Suggested fix — add to db-configmap.yaml:
idle_in_transaction_session_timeout = 30000 # 30 seconds
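To confirm whether a given deployment needs this, the offending sessions are easy to list. A minimal sketch against pg_stat_activity; nothing here is KASM-specific:

-- Sessions sitting in an open transaction without running anything.
SELECT pid, usename, client_addr,
       now() - state_change AS idle_in_tx_for,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_in_tx_for DESC;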
4. Connection/disconnection logging disabled
Lines 88-89:
log_connections = off
log_disconnections = off
Zombie connections are invisible in PostgreSQL logs, making debugging extremely difficult. We spent significant time diagnosing the deadlock loop because we couldn't see when connections were established or dropped.
Suggested fix:
-log_connections = off
-log_disconnections = off
+log_connections = on
+log_disconnections = on
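With logging off, the only live view of connection churn is pg_stat_activity. A hedged sketch that groups current backends by client address so connections from long-gone pods stand out (the grouping choice is ours, not something the chart provides):

-- Current backends per client; a high count or a very old backend_start
-- from an address that no longer maps to a running pod is suspicious.
SELECT client_addr, count(*) AS connections,
       min(backend_start) AS oldest_backend
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY client_addr
ORDER BY connections DESC;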
Impact
The combination of these defaults creates a repeatable failure loop:
- Pod restart/delete leaves zombie TCP connection holding locks
- db-init-job Alembic migration blocks on lock (indefinitely — no statement_timeout)
- Manager/API queries block behind the migration's pending exclusive lock
- Healthchecks fail, Kubernetes restarts pods, creating more zombies
- Only manual pg_terminate_backend() intervention resolves the deadlock (see the sketch after this list)
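For completeness, a sketch of that manual intervention; the pid is a placeholder and pg_blocking_pids() requires PostgreSQL 9.6 or later:

-- Find what the stuck Alembic migration is waiting on ...
SELECT pid, wait_event_type, pg_blocking_pids(pid) AS blocked_by,
       left(query, 60) AS current_query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- ... then terminate the blocking backend (12345 is a placeholder pid).
SELECT pg_terminate_backend(12345);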
All four fixes together prevent this loop entirely. We verified this in production — after applying aggressive TCP keepalives (60/10/3), idle_in_transaction_session_timeout = 30s, and statement_timeout = 120s, the deadlock loop has not recurred.
Workaround
We use an external CNPG PostgreSQL cluster with hardened parameters. For users of the chart's built-in StatefulSet, patching db-configmap.yaml with the values above resolves the issue.
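Whichever path is taken, the live values can be confirmed from any psql session; a small sketch:

-- Verify the hardened settings on the running cluster.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('statement_timeout',
               'idle_in_transaction_session_timeout',
               'tcp_keepalives_idle',
               'tcp_keepalives_interval',
               'tcp_keepalives_count',
               'log_connections',
               'log_disconnections');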