Skip to content

feat: Add KV store API for automation state persistence#69

Draft
jpshackelford wants to merge 51 commits into
mainfrom
feature/kv-store-api
Draft

feat: Add KV store API for automation state persistence#69
jpshackelford wants to merge 51 commits into
mainfrom
feature/kv-store-api

Conversation

@jpshackelford
Copy link
Copy Markdown
Contributor

@jpshackelford jpshackelford commented Apr 24, 2026

Summary

This PR implements the KV store API for automation state persistence, including both the design document and the service-side implementation.

Problem

One of the use cases for automations is implementing integrations. Some integrations are stateless—webhook responders that receive an event, do work, and complete. But other jobs require small amounts of data storage to work effectively.

For example, an automation that summarizes Slack messages might store the last processed timestamp, then on the next run fetch only items since that date. But where should it store this data?

A GitHub repo is not the right tool. External services (JSONBin, Redis Cloud, etc.) work but require users to provision and manage infrastructure. If external systems are required for such a common use case, that erodes the simplicity of a batteries-included platform.

Solution

A built-in key-value store API scoped per-automation:

  • Easy — simple GET/SET, familiar Redis-like semantics
  • Flexible — JSON values, counters, lists, nested paths
  • Secure — application-level encryption, strict isolation between automations
  • Concurrent — batch operations with optimistic concurrency control

We do not need high performance—just simplicity and security for occasional state persistence.

Changes

Design Document

  • docs/kv-store-design.md - Full design specification for the KV store feature

New Files

  • automation/kv_router.py - KV API endpoints including batch operations
  • automation/kv_schemas.py - Request/response schemas including batch operation types
  • automation/kv_helpers.py - Validation and encryption helpers
  • automation/utils/kv.py - JWT and encryption utilities
  • migrations/versions/005_add_kv_store.py - Database migration
  • tests/test_kv_router.py - Integration test coverage
  • tests/test_kv_batch.py - Unit tests for batch operations
  • tests/test_kv_helpers.py - Unit tests for helpers

Modified Files

  • automation/models.py - Added AutomationKV model and enable_kv_store flag on Automation
  • automation/schemas.py - Added enable_kv_store to request/response schemas
  • automation/router.py - Include enable_kv_store in automation creation
  • automation/dispatcher.py - Generate AUTOMATION_KV_TOKEN for runs with KV enabled
  • automation/config.py - Added kv_secret setting
  • automation/app.py - Mount KV router
  • pyproject.toml - Added pyjwt and jwcrypto dependencies

API Endpoints

Endpoint Method Description
/v1/kv GET List all keys
/v1/kv/batch POST Atomic batch operations with optimistic concurrency
/v1/kv/{key} GET Get value (supports ?path= and ?meta=true for version)
/v1/kv/{key} PUT Set value (supports ?nx, ?xx, ?if_version)
/v1/kv/{key} PATCH Update nested path (supports ?if_version)
/v1/kv/{key} DELETE Delete key (supports ?if_version)
/v1/kv/{key}/incr POST Atomic increment
/v1/kv/{key}/decr POST Atomic decrement
/v1/kv/{key}/lpush POST Push to list front
/v1/kv/{key}/rpush POST Push to list back
/v1/kv/{key}/lpop POST Pop from list front
/v1/kv/{key}/rpop POST Pop from list back
/v1/kv/{key}/len GET Get list length

Concurrency Controls

Single-Document Model

All state for an automation is stored in a single encrypted JSON document. This eliminates multi-key deadlock scenarios—all operations serialize through a single row lock.

Lock Timeout (Deadlock Prevention)

  • 5-second lock timeout prevents stuck transactions from cascading
  • Returns HTTP 409 Conflict so clients can retry with backoff

Optimistic Concurrency ($version)

State includes a $version counter that auto-increments on every write. Use if_version for safe read-modify-write patterns.

On individual endpoints (PUT, PATCH, DELETE):

PUT /v1/kv/state?if_version=4
→ 200 OK (version was 4, now 5)
→ 409 Conflict {"error": "version_mismatch", "expected_version": 4, "actual_version": 6}

On batch endpoint:

POST /v1/kv/batch
{
  "if_version": 4,
  "operations": [
    {"op": "set", "key": "state", "value": {...}}
  ]
}
// → 200: {"version": 5, "results": [...]}
// → 409: {"error": "version_mismatch", "expected_version": 4, "actual_version": 6}

Get current version with ?meta=true:

GET /v1/kv/state?meta=true
// → {"key": "state", "value": {...}, "version": 5, "created_at": "...", "updated_at": "..."}

Batch Operations

Execute multiple operations atomically in a single transaction:

POST /v1/kv/batch
{
  "operations": [
    {"op": "incr", "key": "counter"},
    {"op": "rpush", "key": "log", "value": {"event": "sync"}}
  ]
}
// → {"version": 5, "results": [...]}

Benefits:

  • Single decrypt → N modifications → single encrypt → commit
  • Reduces crypto overhead for multiple updates
  • All operations succeed or none do (atomic)

Reserved Keys

Keys starting with $ are reserved for system use (e.g., $version). Client attempts to set/delete $-prefixed keys are rejected with HTTP 400.

When to Use Version Checks

Operation Version Check Needed? Why
PUT (set) Yes Read-modify-write pattern
PATCH Yes Read-modify-write pattern
DELETE Optional Can be useful for safe cleanup
incr/decr No Atomic, conflict-free
lpush/rpush No Atomic, conflict-free
lpop/rpop No Atomic, conflict-free

Deployment

Configuration

The KV store requires the AUTOMATION_KV_SECRET environment variable for JWT signing and value encryption.

Chart PR: OpenHands/OpenHands-Cloud#576 wires this up by reusing the existing jwt-secret — no new secrets needed!

Out of Scope

As noted in the design doc comment, admin debugging tooling for KV data is out of scope for this PR. The encrypted data will be secure at rest; admin tools can be added in a follow-up issue once the permission model is clarified with the platform team.

Related


This PR was updated by an AI agent (OpenHands) on behalf of @jpshackelford.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

🚀 Deploy Preview PR Created/Updated

A deploy preview has been created/updated for this PR.

Deploy PR: https://github.com/OpenHands/deploy/pull/3920
Automation SHA: 4fc3f05e245b7c3fc24f749c53cee39d0676e5d6
Last updated: May 19, 2026, 12:39:19 PM ET

Once the deploy PR's CI passes, the automation service will be deployed to the feature environment.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

Coverage

Copy link
Copy Markdown
Contributor Author

🔗 Chart PR Linked for Preview Deploy

The deploy preview PR has been updated to use the preview chart from OpenHands-Cloud PR #576, which adds the AUTOMATION_KV_SECRET environment variable configuration.

Deploy PR: https://github.com/OpenHands/deploy/pull/3920
Chart Version: 0.4.13-alpha.576

Once the deploy PR's CI passes, the feature environment will include both:

  • The KV store API code changes from this PR
  • The chart changes that wire up the KV secret

This comment was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Contributor Author

Out of Scope: Admin Debugging Tooling

We are calling out of scope for this implementation:

  • An automation-specific role/permission that would allow an engineer to obtain a key to list, retrieve, and manipulate KV data stored for an automation
  • Admin CLI or API endpoints for debugging encrypted KV store contents

Context: Automations are linked to organizations (org_id), but the automation service delegates role/permission management to the main OpenHands platform. Adding admin debugging capabilities would require coordination with the platform team to define appropriate permissions (e.g., automation:kv:admin).

For now: All KV data is encrypted at rest. Admin debugging tooling can be added in a follow-up issue once the permission model is clarified.


This comment was migrated from PR #68 (design document PR) which has been merged into this PR.

@jpshackelford
Copy link
Copy Markdown
Contributor Author

jpshackelford commented Apr 25, 2026

✅ KV Store E2E Tests Passed (Thorough Mode)

Tested against jps01.r9.all-hands.dev with PR-69 image:

============================================================
KV STORE E2E TEST (THOROUGH MODE)
Running 26 tests
============================================================

[TEST] SET and GET
  PUT /test_key: 201
  GET /test_key: 200
  PASS

[TEST] DELETE
  DELETE /to_delete: 200
  GET after delete: 404
  PASS

[TEST] INCR and DECR
  INCR by 5: 200, value=15
  DECR by 3: 200, value=12
  PASS

[TEST] List operations
  RPUSH 'a': 200, length=1
  LPUSH 'z': 200, length=4
  LPOP: 200, value=z
  RPOP: 200, value=c
  LEN: 200, length=2
  PASS

[TEST] Nested path operations
  PATCH database.port=5433: 200
  GET with path: 200, value=5433
  PASS

[TEST] Conditional SET (nx)
  PUT with nx=true (new): 201
  PUT with nx=true (exists): 409
  PASS

[TEST] List keys
  GET /kv: 200, count=7
  PASS

[TEST] GET with metadata
  GET with meta=true: 200
  created_at: 2026-04-25T07:03:03.895583+00:00
  PASS

[TEST] GET non-existent key
  GET /nonexistent: 404
  PASS

[TEST] GET non-existent nested path
  GET with invalid path: 404
  PASS

[TEST] DELETE non-existent key
  DELETE /nonexistent: 200, deleted=False
  PASS

[TEST] Conditional SET (xx)
  PUT with xx=true (missing): 409
  PUT with xx=true (exists): 200
  PASS

[TEST] PATCH non-existent key
  PATCH /nonexistent: 404
  PASS

[TEST] INCR new key
  INCR new key: 200, value=1
  PASS

[TEST] DECR new key
  DECR new key: 200, value=-1
  PASS

[TEST] INCR non-numeric
  INCR string value: 400
  PASS

[TEST] LPOP empty list
  LPOP empty: 200, value=None
  PASS

[TEST] LPOP non-existent key
  LPOP nonexistent: 200, value=None
  PASS

[TEST] RPUSH to non-list
  RPUSH to dict: 400
  PASS

[TEST] LEN non-existent key
  LEN nonexistent: 404
  PASS

[TEST] Special characters in key
  PUT /test-key_123: 201
  PASS

[TEST] Store null value
  PUT null: 422
  PASS

[TEST] Various JSON types
  string_type: OK
  number_int: OK
  number_float: OK
  boolean_true: OK
  boolean_false: OK
  array_type: OK
  nested_obj: OK
  PASS

[TEST] Auth - missing token
  GET without token: 422
  PASS

[TEST] Auth - invalid token
  GET with invalid token: 401
  PASS

[TEST] Invalid JSON body
  PUT invalid JSON: 422
  PASS

============================================================
RESULTS (THOROUGH): 26 passed, 0 failed
============================================================

KV_STORE_ALL_TESTS_PASSED

All 26 test cases verified:

  • ✅ Basic SET/GET with proper status codes (201 create, 200 read)
  • ✅ DELETE with 404 on subsequent GET
  • ✅ INCR/DECR numeric operations
  • ✅ List operations (RPUSH, LPUSH, LPOP, RPOP, LEN)
  • ✅ Nested path PATCH and GET
  • ✅ Conditional SET with nx=true and xx=true
  • ✅ List all keys
  • ✅ Metadata retrieval with meta=true
  • ✅ Error handling (404 for missing keys, 400 for type errors, 409 for conflicts)
  • ✅ Auth validation (401 invalid token, 422 missing token)
  • ✅ Input validation (422 for null body, invalid JSON)
  • ✅ Various JSON types (string, int, float, bool, array, object)

This comment was created by an AI agent (OpenHands) on behalf of @jpshackelford.

@jpshackelford
Copy link
Copy Markdown
Contributor Author

Still doing some clean-up in response to OpenHands review feedback.

openhands-agent and others added 21 commits April 25, 2026 23:56
Implements a built-in key-value store API scoped per-automation for
state persistence between runs, as specified in issue #67.

Features:
- New AutomationKV model with encrypted value storage
- enable_kv_store flag on Automation model (opt-in)
- Per-run JWT tokens for KV authentication
- JWE encryption for all stored values
- Full Redis-like API: GET, SET, DELETE, INCR/DECR, LPUSH/RPUSH/LPOP/RPOP
- Support for nested paths (dot notation)
- Conditional SET operations (nx=true, xx=true)
- Database migration for new table

Endpoints:
- GET /v1/kv - List all keys
- GET /v1/kv/{key} - Get value (with optional path and meta params)
- PUT /v1/kv/{key} - Set value (with optional nx/xx params)
- PATCH /v1/kv/{key} - Update nested path
- DELETE /v1/kv/{key} - Delete key
- POST /v1/kv/{key}/incr - Atomic increment
- POST /v1/kv/{key}/decr - Atomic decrement
- POST /v1/kv/{key}/lpush - Push to list front
- POST /v1/kv/{key}/rpush - Push to list back
- POST /v1/kv/{key}/lpop - Pop from list front
- POST /v1/kv/{key}/rpop - Pop from list back
- GET /v1/kv/{key}/len - Get list length

Related: #67

Co-authored-by: openhands <openhands@all-hands.dev>
- Add type: ignore comments for jwcrypto library calls (missing stubs)
- Fix pyright type error for rowcount access on delete result
- Apply ruff formatting to migration file
- Make automation_with_kv fixture autouse=True so KV tests have parent
  Automation record for foreign key constraint
- Remove unused create_kv_token from top-level test imports

Co-authored-by: openhands <openhands@all-hands.dev>
- Set AUTOMATION_KV_SECRET env var in kv_client test fixture to match
  the test encryption key
- Use Annotated[Any, Body()] for the set_value endpoint body parameter
  to properly parse arbitrary JSON body

Co-authored-by: openhands <openhands@all-hands.dev>
Design for a built-in key-value store API for automation state persistence.

Key features:
- Scoped per-automation with strict isolation
- Redis-like API semantics (GET, SET, INCR, list operations)
- Application-level encryption (JWE) for all values
- JWT-based authentication per automation run

Closes #67

Co-authored-by: openhands <openhands@all-hands.dev>
Add enable_kv_store parameter to CreatePromptAutomationRequest and
CreatePluginAutomationRequest schemas, and pass it through when
creating Automation records.

This allows users to enable the KV store via preset endpoints
(/v1/preset/prompt and /v1/preset/plugin), making state persistence
available for prompt-based and plugin-based automations.

Co-authored-by: openhands <openhands@all-hands.dev>
Adds scripts/test_kv_e2e.py which tests the KV store API end-to-end:

1. Creates a real automation via API (with enable_kv_store=true)
2. Generates a KV token for that automation
3. Uses run_automation() for blocking execution with stdout/stderr capture
4. Runs comprehensive test suite covering all KV operations:
   - SET/GET, INCR/DECR, list operations (RPUSH/LPUSH/LPOP/RPOP/LEN)
   - Nested path operations (PATCH, GET with path)
   - Conditional SET (nx, xx flags)
   - List keys, DELETE, GET with metadata
5. Cleans up the automation after test

Usage:
    export OPENHANDS_API_KEY="sk-oh-..."
    export AUTOMATION_KV_SECRET="<same-as-staging>"
    uv run python scripts/test_kv_e2e.py

Co-authored-by: openhands <openhands@all-hands.dev>
- Refactor test_kv_e2e.py with @Quick and @Thorough decorators
- Quick mode: 8 smoke tests covering core operations (~30s)
- Thorough mode: 26 tests including edge cases and error paths (~2min)
- Add tests for: 404 handling, auth failures, type mismatches,
  empty lists, null values, various JSON types
- Tests always run to completion (no early exit on failure)
- Add kv-store-test-plan.md documenting manual test cases

Usage:
  python scripts/test_kv_e2e.py          # quick (default)
  python scripts/test_kv_e2e.py --thorough  # full suite
- Reduce from 1543 to 167 lines (~90% reduction)
- Remove individual TC-* test cases (now in test_kv_e2e.py)
- Remove hardcoded API key (security)
- Keep: critical bug docs, debugging workflow, manual-only tests
- Keep: cross-automation isolation & persistence tests (not yet automated)
- Keep: quick reference commands
Adds logging at:
1. App startup - logs whether kv_secret is present and its length
2. Dispatcher run - logs enable_kv_store flag and kv_secret presence before token generation

This will help debug why KV tokens aren't being injected into sandbox runs.

Co-authored-by: openhands <openhands@all-hands.dev>
The file had corrupted/duplicate content where lines from the embedded
KV_TEST_SCRIPT and duplicate function definitions got accidentally
inserted mid-file. This removes:

1. Corrupted line 939 where text was merged incorrectly
2. Duplicate test functions (test_list_keys, test_delete, etc) that
   were already defined inside KV_TEST_SCRIPT
3. Duplicate create_automation, delete_automation, and main() functions
   (kept the version with quick/thorough mode support)

File reduced from 1231 to 943 lines.

Co-authored-by: openhands <openhands@all-hands.dev>
- 201 Created: when a new key is created (including nx=true success)
- 200 OK: when an existing key is updated
- 409 Conflict: when nx=true but key already exists

This aligns with REST semantics and fixes the E2E test expectation.

Co-authored-by: openhands <openhands@all-hands.dev>
- test_set_new_value: expect 201 Created (new key)
- test_set_update_existing: expect 200 OK (update) + assert created=False
- test_set_nx_creates_new: expect 201 Created (new key via nx)
- test_set_nx_fails_if_exists: expect 409 Conflict (nx but key exists)

Also fixes trailing whitespace formatting in test_kv_e2e.py.

Co-authored-by: openhands <openhands@all-hands.dev>
- Add blank line after inline import in execution.py
- Split long import line in kv_router.py
- Remove unused json import from test_kv_e2e.py
- Fix long lines in test_kv_e2e.py

Co-authored-by: openhands <openhands@all-hands.dev>
- Check if key exists before upsert to determine created status
- Use func.now() to properly update the timestamp on conflict
- This ensures 200 for updates and 201 for inserts

Co-authored-by: openhands <openhands@all-hands.dev>
- Remove trailing whitespace from test_kv_e2e.py
- Add noqa: E402 for necessary path manipulation imports
- Fix log_extra to accept **kwargs for additional structured logging fields

Co-authored-by: openhands <openhands@all-hands.dev>
Make it clear that 403 on cleanup means the API key lacks
manage_automations permission, not a test failure.

Co-authored-by: openhands <openhands@all-hands.dev>
- xx=true returns 409 Conflict (not 404/412) when key doesn't exist
- null body returns 422 (FastAPI validation rejects empty body)
- Missing token returns 422 (header validation before auth check)

Co-authored-by: openhands <openhands@all-hands.dev>
Move all request/response schemas from kv_router.py to a dedicated
kv_schemas.py module for better separation of concerns.

- Creates automation/kv_schemas.py with 12 Pydantic models
- Updates kv_router.py to import from kv_schemas
- Reduces kv_router.py from 969 to 883 lines

Co-authored-by: openhands <openhands@all-hands.dev>
Move path parsing and nested value manipulation functions to a
dedicated kv_helpers.py module:

- parse_path(): Parse dot/bracket notation paths
- get_nested_value(): Traverse nested dicts/lists
- set_nested_value(): Set values at nested paths

Reduces kv_router.py from 883 to 790 lines.

Co-authored-by: openhands <openhands@all-hands.dev>
- dispatcher.py: Remove verbose 'KV store check' info log that runs
  for every automation. Keep single debug log when KV is enabled.
- app.py: Consolidate startup KV log into service startup message.
  Remove kv_secret_len as it provides no useful info and could
  theoretically leak entropy information.

Co-authored-by: openhands <openhands@all-hands.dev>
Add test coverage for parse_path, get_nested_value, and set_nested_value:

- TestParsePath: 14 tests covering dot notation, bracket notation,
  edge cases (empty, trailing dots, unclosed brackets)
- TestGetNestedValue: 13 tests for dict/list traversal and error cases
- TestSetNestedValue: 7 tests for value setting and intermediate creation

All 32 tests pass locally.

Co-authored-by: openhands <openhands@all-hands.dev>
Adds AUTOMATION_KV_MAX_VALUE_SIZE setting (default 64KB) to prevent
abuse by limiting the size of values stored in the KV store.

Size validation is applied to all write operations:
- PUT /kv/{key} (set_value) - validates the incoming body
- PATCH /kv/{key} (patch_value) - validates after applying the patch
- POST /kv/{key}/lpush - validates the resulting list
- POST /kv/{key}/rpush - validates the resulting list

Returns HTTP 413 Payload Too Large when limit is exceeded.

Includes tests using a 1KB limit to verify enforcement on all endpoints.

Co-authored-by: openhands <openhands@all-hands.dev>
Replace JWE (JSON Web Encryption) with raw AES-256-GCM and store
encrypted values as BYTEA instead of TEXT. This reduces storage
overhead significantly:

Before (JWE + TEXT):
- 14-byte plaintext → 100 bytes stored (7x overhead)
- Base64 encoding adds ~33% to all values

After (AES-GCM + BYTEA):
- 14-byte plaintext → 42 bytes stored (3x overhead)
- Fixed 28-byte overhead (12-byte nonce + 16-byte auth tag)
- For larger values, overhead approaches 0%

Changes:
- automation/utils/kv.py: Replace jwcrypto with cryptography library,
  comprehensive documentation of design decisions, PostgreSQL TOAST
  considerations, wire format, and key derivation rationale
- automation/models.py: Change value_encrypted from Text to LargeBinary
- migrations/005: Update column type to LargeBinary (BYTEA)
- pyproject.toml: Replace jwcrypto dependency with cryptography

The module docstring now documents:
- Three encryption approaches evaluated and why we chose raw AES-GCM
- PostgreSQL TOAST performance implications (<2KB optimal, 2-8KB ok, >8KB chunked)
- Key derivation approach and potential HKDF improvement
- Wire format diagram (nonce || ciphertext || auth_tag)

Co-authored-by: openhands <openhands@all-hands.dev>
Add section to kv.py module docstring explaining that JSONB can't be
used because we encrypt values at the application layer, producing
opaque binary data rather than valid JSON.

Co-authored-by: openhands <openhands@all-hands.dev>
Include a table showing PostgreSQL TOAST behavior at different size
limits (64KB-512KB), helping operators make informed decisions about
the tradeoff between flexibility and read performance.

Co-authored-by: openhands <openhands@all-hands.dev>
- SET STORAGE EXTERNAL on value_encrypted column to skip futile
  compression attempts on encrypted (high-entropy) data
- Add COMMENT ON TABLE/COLUMN for schema-level documentation
  visible to DBAs and database tools

Expanded migration docstring explains:
- Why BYTEA over TEXT/JSONB (encrypted data is raw bytes)
- Why EXTERNAL over EXTENDED (encrypted data won't compress)
- Why schema comments (self-documenting for DB inspection)

Co-authored-by: openhands <openhands@all-hands.dev>
Enforce RFC 8259 compliant JSON to prevent interoperability issues:

Validation rules:
- Reject NaN, Infinity, -Infinity (not valid JSON, causes issues
  with JavaScript JSON.parse and other strict parsers)
- Maximum nesting depth of 32 levels (prevents stack overflow DoS)
- Only JSON-serializable types accepted

Implementation:
- automation/utils/kv.py: Add _validate_json_value() with allow_nan=False,
  _check_nesting_depth() for DoS prevention, new KVValueError exception
- automation/kv_helpers.py: Add safe_encrypt() helper that converts
  KVValueError to HTTP 400, KVEncryptionError to HTTP 500
- automation/kv_router.py: Replace all encrypt_value try/except blocks
  with safe_encrypt() for cleaner code and consistent error handling

Error responses:
- 400 Bad Request for invalid values (client's fault)
- 500 Internal Server Error for encryption failures (our fault)

Tests added for:
- NaN, Infinity, -Infinity rejection
- Deep nesting rejection (35 levels)
- Valid nesting acceptance (10 levels)
- All standard JSON types acceptance

Co-authored-by: openhands <openhands@all-hands.dev>
Extract repeated patterns into reusable helpers in kv_helpers.py:

New helpers:
- safe_decrypt(): Wraps decrypt_value with HTTP 500 error handling
  (matches safe_encrypt pattern, replaces 9 try/except blocks)
- require_dict(): Type check for dict values (HTTP 400 on failure)
- require_list(): Type check for list values (HTTP 400 on failure)
- require_numeric(): Type check for int/float values (HTTP 400 on failure)

Benefits:
- Reduced boilerplate in kv_router.py (~100 lines removed)
- Consistent error handling across all endpoints
- Single point of change for error messages/logging
- Cleaner, more readable endpoint implementations

Router cleanup:
- Removed KVEncryptionError and decrypt_value imports (now internal)
- Replaced 9 decrypt try/except blocks with safe_decrypt()
- Replaced 8 isinstance type checks with require_* helpers

Co-authored-by: openhands <openhands@all-hands.dev>
This commit adds comprehensive validation to protect against accidental,
malicious, and ignorant clients:

Key Validation (validate_key):
- Reject empty and whitespace-only keys
- Enforce max key length (255 chars, matching DB constraint)
- Reject control characters (null bytes, newlines, tabs, etc.)
- Provide helpful error messages with character position

Type Validation Improvements:
- require_numeric: Now rejects booleans (True/False) explicitly
  - Prevents confusing behavior where True becomes 2 after increment
- require_int: New function for incr/decr operations
  - Rejects floats to prevent silent precision loss
  - Rejects booleans for same reason as require_numeric

Path Validation:
- Added _MAX_PATH_DEPTH (32) limit to parse_path()
- Prevents DoS via extremely long dot-notation paths

Router Changes:
- Added ValidatedKey type alias using Depends for consistent key validation
- All endpoints now validate key parameter automatically
- incr/decr now use require_int instead of require_numeric
- Fixed incr/decr to use integer arithmetic (was doing int(value + by))

Tests:
- Added 54+ new tests for key validation edge cases
- Added tests for boolean rejection in require_numeric/require_int
- Added tests for float rejection in require_int
- Added tests for path depth limiting
- Added API-level tests for incr/decr integer-only behavior

Co-authored-by: openhands <openhands@all-hands.dev>
- Fix line too long errors in kv_helpers.py error messages
- Add explicit type annotations in test_kv_router.py for nested dict tests
- Remove unused require_numeric import from kv_router.py
- Apply ruff formatting to migration file

Co-authored-by: openhands <openhands@all-hands.dev>
…evention

## Problem
The original implementation stored each KV 'key' as a separate row in the
automation_kv table. This created deadlock risk when:
- Concurrent requests modified different keys for the same automation
- Client code acquired locks on multiple keys in different orders
- The FOR UPDATE locking strategy serialized on N separate rows

## Solution: Single-Document Design
Changed to store all KV state in ONE encrypted JSON document per automation.
API 'keys' (e.g., /kv/config, /kv/counter) are now top-level fields within
this single document.

Benefits:
- Only ONE row per automation to lock
- All operations serialize through that single lock
- No multi-key ordering issues possible
- Eliminates deadlock risk between keys

Trade-off: Every operation reads/writes the entire state blob. This is
acceptable because automation state is intended to be small (cursors,
counters, configs) and access is infrequent (scheduled runs).

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Apply ruff formatting fixes
- Add assert state is not None before subscripting in tests

Co-authored-by: openhands <openhands@all-hands.dev>
…rotection

Add bounded lock wait times and connection pool timeout to prevent a
single slow KV operation from cascading into system-wide instability:

1. Lock Timeout (kv_router.py):
   - Add SET LOCAL lock_timeout = '5000ms' before FOR UPDATE queries
   - Prevents indefinite waits when another transaction holds the row lock
   - If lock times out, return HTTP 409 Conflict so clients can retry
   - SET LOCAL scopes timeout to current transaction only

2. Pool Exhaustion Protection (db.py, config.py):
   - Add db_pool_timeout config setting (default 30s)
   - Apply pool_timeout to both standard and GCP Cloud SQL engines
   - Surfaces pool exhaustion as errors instead of indefinite hangs

Why these protections matter:
Without lock timeout, a slow/stuck transaction (e.g., network issue during
commit, slow crypto) causes all concurrent KV operations on that automation
to queue indefinitely. With enough concurrent requests, this exhausts the
connection pool and degrades the entire service—not just one automation.

The single-document design already eliminates multi-row deadlock scenarios.
These timeouts provide defense-in-depth against operational issues.

Co-authored-by: openhands <openhands@all-hands.dev>
- Add POST /v1/kv/batch endpoint for atomic batch operations
- Implement $version meta key that auto-increments on every write
- Support if_version parameter for optimistic concurrency (409 on mismatch)
- Reserve $-prefixed keys for system use (reject client writes)
- Filter $-prefixed keys from list_keys response
- Include version in ?meta=true responses
- Add _has_user_keys() to properly handle delete-last-key case

Operations supported in batch:
- set, delete, incr, decr, lpush, rpush, lpop, rpop, patch

Tests: 700 pass, ruff and pyright clean
- Add if_version query param to PUT /v1/kv/{key} for optimistic concurrency
- Add if_version query param to PATCH /v1/kv/{key} for optimistic concurrency
- Add if_version query param to DELETE /v1/kv/{key} for optimistic concurrency
- Add integration tests for if_version on individual endpoints
- Fix pyright type errors in tests

The atomic operations (incr, decr, lpush, rpush, lpop, rpop) don't need
version checks because they are conflict-free by design.

Tests: 707 pass, pre-commit clean
Update kv_router.py to use get_config().service instead of the deprecated
get_settings() function. This aligns with the configuration centralization
done in PR #73.

Also add **kwargs support to the centralized log_extra() function to support
additional context fields needed by the KV store logging.

Co-authored-by: openhands <openhands@all-hands.dev>
Add Configuration section documenting:
- The composed AppConfig structure with typed sections
- Proper usage of get_config().service vs deprecated get_settings()
- Where to find environment variable documentation
- Protocol constants location

Co-authored-by: openhands <openhands@all-hands.dev>
…ings().cache_clear()

The get_settings() function is now a deprecated wrapper that doesn't have
a cache_clear() method. Use clear_config_cache() from automation.config
which properly clears the lru_cache on get_config().

Co-authored-by: openhands <openhands@all-hands.dev>
Implements follow-up tasks from PR #69 for KV store concurrency:

1. Statement Timeout (Safety Net)
   - Added statement_timeout = 2x lock_timeout to catch runaway operations
   - Detects PostgreSQL error codes 55P03 (lock) and 57014 (statement)

2. Retry-After Header
   - All 409 responses include Retry-After: 1 header
   - New _raise_version_conflict() helper for consistent version mismatch errors

3. Configurable Lock Timeout Per-Automation
   - Added kv_lock_timeout_ms field to Automation model (100-30000ms)
   - Schema validation in CreateAutomationRequest/UpdateAutomationRequest
   - Migration 006_add_kv_lock_timeout.py
   - Embedded in JWT token claims for fast access (no extra DB query)
   - KVTokenClaims class with automation_id and lock_timeout_ms
   - Backward compatible with old tokens (uses default 5000ms)

4. Metrics and Observability
   - New kv_metrics.py module with Prometheus metrics:
     - kv_operation_duration_seconds (histogram)
     - kv_lock_wait_duration_seconds (histogram)
     - kv_conflict_total (counter by reason)
     - kv_state_size_bytes (histogram)
   - Integrated into router: lock waits, conflicts, state size

5. Client Documentation
   - Comprehensive docs/kv-store-client-guide.md
   - Covers basic/advanced operations, concurrency patterns
   - Best practices, error handling, debugging tips
   - Lock timeout configuration guidance

Testing:
   - 27 new tests in test_kv_concurrency.py
   - All 168 KV tests pass

Co-authored-by: openhands <openhands@all-hands.dev>
Following the centralized configuration pattern from PR #73, this moves
KV store settings from ServiceSettings to a dedicated KVSettings section:

- Add KVSettings class with kv_secret, kv_max_value_size, enabled property
- Add config.kv property to AppConfig
- Update kv_router.py to use get_config().kv for all KV settings
- Update dispatcher.py to use get_config().kv for KV token generation
- Update AGENTS.md with KV config section in documentation

This keeps feature-specific configuration organized and follows the
existing pattern of http, sandbox, storage, and log sections.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Add get_config import to app.py for kv.enabled check
- Use CronTrigger instead of dict in validation tests
- Use cast() for HTTPException.detail dict access
- Fix import ordering in migration file

Co-authored-by: openhands <openhands@all-hands.dev>
Sort imports alphabetically (CreateAutomationRequest before CronTrigger)
to satisfy ruff linting.

Co-authored-by: openhands <openhands@all-hands.dev>
The endpoints now use get_token_claims which returns KVTokenClaims
instead of get_automation_id_from_token which returns UUID. This update:

- Import get_token_claims instead of get_automation_id_from_token
- Import KVTokenClaims and DEFAULT_LOCK_TIMEOUT_MS from utils.kv
- Update kv_client fixture to override get_token_claims with KVTokenClaims
- Fix test_create_and_verify_token to compare result.automation_id

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Contributor

@tofarr tofarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • There is a lot of code related to wheterh kv is enabled or whether it is enabled for a automation - I think it should just always be enabled - something that each automation gets by default
  • Do we need to update the skill for automations to let the agent know that this facility is available?
  • We have a lot of code related to encryption. Could we use the Cipher class from the sdk to simplify / remove some of this?

Resolves conflicts from the automation/ -> openhands/automation/ directory
rename and from the SQLite/local-mode and pyproject.toml changes on main.

- Move dispatcher KV-token injection into the new dispatcher.py
- Rewrite all KV module imports to use openhands.automation.*
- Renumber KV migrations to 006_add_kv_store and 007_add_kv_lock_timeout
  so they sit after 005_add_bash_command_id

Co-authored-by: openhands <openhands@all-hands.dev>
…Cipher

- Drop enable_kv_store / kv_lock_timeout_ms per-automation fields. KV is
  always available; the dispatcher injects AUTOMATION_KV_TOKEN whenever
  the service has AUTOMATION_KV_SECRET configured.
- Move kv_lock_timeout_ms to KVSettings as a service-wide config knob
  (AUTOMATION_KV_LOCK_TIMEOUT_MS, default 5000).
- Replace the custom AES-256-GCM JWE code with the SDK's Cipher helper
  (Fernet), dropping the cryptography>=42 dependency.
- Switch state_encrypted from BYTEA/LargeBinary to TEXT (Fernet tokens
  are URL-safe base64 strings).
- Skip the new COMMENT ON TABLE statements when running migrations on
  SQLite so dev/test setups still work.
- Update docs (design, client guide, test plan) and the kv_router
  module docstring to reflect the new model.

Co-authored-by: openhands <openhands@all-hands.dev>
@tofarr
Copy link
Copy Markdown
Contributor

tofarr commented May 19, 2026

Picking this up on John's behalf since it's unblocking some work on my end. Addressed @tofarr's review in 4fc3f05:

  1. "a lot of code related to whether kv is enabled / per-automation" — Removed entirely. There is no longer an enable_kv_store field on Automation, no enable_kv_store knob in the schemas / router / preset router, and no per-automation lock-timeout override. The dispatcher injects AUTOMATION_KV_TOKEN for every run whenever the service has AUTOMATION_KV_SECRET configured, and the row-lock timeout is now a single service-wide config value (AUTOMATION_KV_LOCK_TIMEOUT_MS, default 5000ms).

  2. "update the skill for automations" — Deferring to a follow-up PR; tracked in the design doc's implementation checklist. Wanted to keep this PR scoped to the service-side cleanup so it's easier to review.

  3. "could we use the Cipher class from the SDK" — Done. utils/kv.py now uses openhands.sdk.utils.cipher.Cipher for value encryption (Fernet under the hood). Dropped the cryptography>=42 direct dependency and the custom AES-256-GCM JWE wrapper code. Since Fernet emits URL-safe base64 text rather than raw bytes, automation_kv.state_encrypted is now TEXT (was BYTEA/LargeBinary); the migration was updated in place and the old 007_add_kv_lock_timeout migration is gone (its only purpose was the per-automation timeout column).

Also:

  • Skipped the new COMMENT ON TABLE statements when running migrations on SQLite so the existing test_db.py::TestSqliteMigrations tests still pass.
  • Updated docs/kv-store-design.md, docs/kv-store-client-guide.md, and docs/kv-store-test-plan.md to match (removed all references to enable_kv_store, per-automation kv_lock_timeout_ms, AUTOMATION_JWT_SECRET, jwcrypto, JWE/A256GCM, and LargeBinary).

KV tests (test_kv_helpers, test_kv_batch, test_kv_concurrency) and test_db.py all pass locally. Remaining test ERRORs are testcontainers/Docker-related and reproduce on main too, so they're not from this change.

This comment was written by an AI agent (OpenHands) on behalf of @jpshackelford.

@tofarr
Copy link
Copy Markdown
Contributor

tofarr commented May 19, 2026

Dropping by as a near-term consumer — I'm building slack-channel-listener (a skill + plugin that lets users wire any Slack channel up as an OpenHands trigger), and this PR is exactly the primitive I need. Before I saw it I was about to invent a SLACK_STATE_DIR=/automation/storage/state convention with a bundled SQLite file, which would have been the wrong layer to solve it at.

I've gone ahead and built against the API as specced in docs/kv-store-client-guide.mdstate.py in extensions#245 has a pluggable Store interface with two backends, KVApiStore (preferred, activates when AUTOMATION_KV_TOKEN is injected) and SQLiteStore (fallback, keeps the skill usable on runtimes that don't ship this PR yet). The KV implementation uses the single-document model with ?if_version=N and jittered exponential backoff on 409 per the client guide.

A few pieces of consumer feedback while it's fresh:

Strong points

  • The single-document model + row-lock + $version story is the right call for this workload. Concurrent contention is rare (cron is single-instance by default) and when it does happen, optimistic retry is dead simple to implement client-side.
  • Per-automation isolation + encryption-at-rest means I don't have to think about secrets management for state — this would have been the most annoying part of rolling my own.
  • AUTOMATION_KV_TOKEN + AUTOMATION_ENABLE_KV_STORE env-var injection in dispatcher.py makes the client-side detection trivial: I just check for the token's presence in get_store().
  • Batch operations are over-spec for my use case but I can see them being load-bearing for skills that need to update several counters atomically (e.g., "increment per-user mention count AND push to recent-mentions list").

Things I bumped into while implementing

  1. No TTL primitive. For my use case I need to drop entries older than 2 × poll_lookback on every run to stay under the 64 KB cap. I implemented it client-side as a prune_older_than(cutoff_iso) that does an RMW and filters the dict — fine, but a server-side ?ttl=seconds on PUT (or a periodic-prune query) would generalise nicely for any skill that accumulates per-key timestamps. Not a blocker, but file under "obvious follow-up".
  2. 64 KB ceiling. For my {<channel>:<ts>: {...}} shape that's ~600–1000 entries. Plenty with pruning, but 413 with no per-key delete fallback would be a bad surprise for callers who haven't read the guide carefully. Right now I handle 413 by aggressively pruning to last-24h and retrying once, but it's not a great UX — the script either has to know its own retention policy or guess. Two things would help:
    • The 413 response body could include current_size_bytes so clients can decide how much to prune.
    • Document somewhere prominent that 64 KB is JSON-encoded after encryption-overhead, not raw value bytes, so callers can size accurately.
  3. AUTOMATION_API_URL is implicit. The client guide uses it freely but it's an existing env var, not introduced by this PR. My factory ends up checking both AUTOMATION_API_URL and (fallback) OPENHANDS_CLOUD_API_URL for resilience. Worth being explicit in the client guide that this env var pre-dates the KV store and is contributed by the dispatcher in general, not by enable_kv_store: true.
  4. Python helper would be high-leverage. I ended up writing ~60 lines of urllib-based RMW-with-retry logic that essentially recapitulates the safe_update() example in the client guide. Every skill that uses KV will rewrite some version of this. Worth considering whether a tiny openhands_automation_kv PyPI package (or, simpler, an automation-kv plugin in OpenHands/extensions) should land alongside this PR with claim() / txn() / incr() helpers and built-in backoff. I'm happy to take the first cut at that if you want — extensions#245 has working code I can extract.
  5. reactions:read is not the right concurrency primitive for everything Slack. Just confirming explicitly that this PR removes the temptation to use it — my earlier draft of the slack-channel-listener was using Slack reactions as a state machine (claim by adding 👀, finish with ✅), and it's clever but it leaks implementation detail into the user-facing channel and is fragile to scope changes. The KV store is unambiguously the right layer. I've now demoted reactions in extensions#245 to opt-in visual UX only.

Question I couldn't answer from the docs

  • What's the lifetime of AUTOMATION_KV_TOKEN? For sub-10-minute cron runs it doesn't matter, but I'd like to know what "long" means before recommending the KV path for batch-mode automations.

Looking forward to this landing. Happy to be the first dogfood consumer once it does — I'll flip extensions#245 from draft to ready-for-review the moment a Cloud build with this PR is reachable.

This comment was posted by an AI agent (OpenHands) on behalf of @tofarr.

@jpshackelford
Copy link
Copy Markdown
Contributor Author

@tofarr Awesome feedback thank you. Will incorporate your feedback or you are of course welcome to do so.

Beyond that, things I was thinking of doing still include:

  • Building the client library and a CLI to make it easy to build with this and to allow developers to debug their work given the that DB values are encrypted. My preferred way to do that would be to make automation repo a mono repo so that it cloud include the client library and cli packages. I recognize that that could be disruptive and happy to fallback to a separate repo if needed.

  • I had a few more tests I wanted to add to test suite.

  • Commits against this PR by @tofarr or other OpenHands/automation maintainers welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Key-Value Store for Automation State Persistence

3 participants