Description
The PostgreSQL configuration in db-configmap.yaml ships defaults that cause cascading failures in Kubernetes environments where pods are frequently restarted, rescheduled, or deleted. We hit all of these in production with an external CNPG database, but they apply equally to the chart's built-in StatefulSet.
Related: The deadlock loop described here is the operational consequence of the zombie connections that finding 1e in issue #1 (readOnlyRootFilesystem) also exposes — both issues compound each other.
Environment
- KASM Workspaces: 1.18.1
- Database: CloudNativePG PostgreSQL 18.1 (external) and chart's StatefulSet config
- Kubernetes: RKE2 v1.33
Findings
1. statement_timeout = 0 (unlimited)
Line 94 of db-configmap.yaml sets statement_timeout = 0. This allows any query — including DDL blocked on locks — to wait indefinitely. In our deployment, an Alembic ALTER TABLE blocked for over an hour waiting on a lock held by a zombie connection, causing the manager to deadlock.
Suggested fix:
-statement_timeout = 0
+statement_timeout = 120000 # 2 minutes; override per-session for long-running migrations
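A bounded default still leaves room for legitimately slow DDL, because the limit can be raised for the migration's own session only. A minimal sketch, assuming the migration runs plain SQL against PostgreSQL (the table name and the 30-minute value are illustrative, not anything the chart or KASM sets):

-- Raise the limit only for this session; the global 120s default
-- stays in force for every other connection.
SET statement_timeout = '30min';                           -- illustrative value for a known-long step
ALTER TABLE example_table ADD COLUMN example_col text;     -- hypothetical DDL
RESET statement_timeout;                                   -- back to the configured default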
2. TCP keepalives: 2+ hour zombie detection
Lines 68-70:
tcp_keepalives_idle = 7200
tcp_keepalives_interval = 75
tcp_keepalives_count = 9
When a KASM pod is deleted or OOMKilled, PostgreSQL does not detect the broken TCP connection for over 2 hours (7200s idle + 9 × 75s of failed probes = 7875s). The zombie connection keeps holding its AccessShareLocks, which block the db-init-job's Alembic migrations that need AccessExclusiveLock for DDL operations. The failure cascades: the migration hangs, the manager can't read from the locked tables, healthchecks fail, pods restart, and each restart creates more zombie connections.
Suggested fix:
-tcp_keepalives_idle = 7200
-tcp_keepalives_interval = 75
-tcp_keepalives_count = 9
+tcp_keepalives_idle = 60 # Start probing after 60s idle
+tcp_keepalives_interval = 10 # Probe every 10s
+tcp_keepalives_count = 3 # 3 failed probes = dead connection
Detection time: ~90 seconds instead of 2+ hours.
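Until the tighter keepalives are in place, zombie backends can at least be spotted by hand. A hedged sketch of the kind of query we used, joining pg_stat_activity with pg_locks; the 10-minute idle threshold is illustrative:

-- Backends idle for a long time that still hold locks: the signature
-- of a connection whose pod no longer exists.
SELECT a.pid, a.usename, a.client_addr,
       now() - a.state_change AS idle_for,
       l.mode, l.relation::regclass AS locked_relation
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE a.state = 'idle'
  AND now() - a.state_change > interval '10 minutes'
ORDER BY idle_for DESC;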
3. No idle_in_transaction_session_timeout
The chart does not set idle_in_transaction_session_timeout. KASM application connections frequently enter idle in transaction state (visible in pg_stat_activity). These sessions hold shared locks that block DDL migrations indefinitely.
Suggested fix — add to db-configmap.yaml:
idle_in_transaction_session_timeout = 30000 # 30 seconds
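To confirm whether a given deployment needs this, the offending sessions are easy to list. A minimal sketch against pg_stat_activity; nothing here is KASM-specific:

-- Sessions sitting in an open transaction without running anything.
SELECT pid, usename, client_addr,
       now() - state_change AS idle_in_tx_for,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_in_tx_for DESC;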
4. Connection/disconnection logging disabled
Lines 88-89:
log_connections = off
log_disconnections = off
Zombie connections are invisible in PostgreSQL logs, making debugging extremely difficult. We spent significant time diagnosing the deadlock loop because we couldn't see when connections were established or dropped.
Suggested fix:
-log_connections = off
-log_disconnections = off
+log_connections = on
+log_disconnections = on
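With logging off, the only live view of connection churn is pg_stat_activity. A hedged sketch that groups current backends by client address so connections from long-gone pods stand out (the grouping choice is ours, not something the chart provides):

-- Current backends per client; a high count or a very old backend_start
-- from an address that no longer maps to a running pod is suspicious.
SELECT client_addr, count(*) AS connections,
       min(backend_start) AS oldest_backend
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY client_addr
ORDER BY connections DESC;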
Impact
The combination of these defaults creates a repeatable failure loop:
- Pod restart/delete leaves zombie TCP connection holding locks
- db-init-job Alembic migration blocks on lock (indefinitely — no statement_timeout)
- Manager/API queries block behind the migration's pending exclusive lock
- Healthchecks fail, Kubernetes restarts pods, creating more zombies
- Only manual pg_terminate_backend() intervention resolves the deadlock (see the sketch after this list)
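For completeness, a sketch of that manual intervention; the pid is a placeholder and pg_blocking_pids() requires PostgreSQL 9.6 or later:

-- Find what the stuck Alembic migration is waiting on ...
SELECT pid, wait_event_type, pg_blocking_pids(pid) AS blocked_by,
       left(query, 60) AS current_query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- ... then terminate the blocking backend (12345 is a placeholder pid).
SELECT pg_terminate_backend(12345);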
All four fixes together prevent this loop entirely. We verified this in production — after applying aggressive TCP keepalives (60/10/3), idle_in_transaction_session_timeout = 30s, and statement_timeout = 120s, the deadlock loop has not recurred.
Workaround
We use an external CNPG PostgreSQL cluster with hardened parameters. For users of the chart's built-in StatefulSet, patching db-configmap.yaml with the values above resolves the issue.
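Whichever path is taken, the live values can be confirmed from any psql session; a small sketch:

-- Verify the hardened settings on the running cluster.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('statement_timeout',
               'idle_in_transaction_session_timeout',
               'tcp_keepalives_idle',
               'tcp_keepalives_interval',
               'tcp_keepalives_count',
               'log_connections',
               'log_disconnections');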