feat: Add KV store API for automation state persistence by jpshackelford · Pull Request #69 · OpenHands/automation

jpshackelford · 2026-04-24T22:19:10Z

Summary

This PR implements the KV store API for automation state persistence, including both the design document and the service-side implementation.

Problem

One of the use cases for automations is implementing integrations. Some integrations are stateless—webhook responders that receive an event, do work, and complete. But other jobs require small amounts of data storage to work effectively.

For example, an automation that summarizes Slack messages might store the last processed timestamp, then on the next run fetch only items since that date. But where should it store this data?

A GitHub repo is not the right tool. External services (JSONBin, Redis Cloud, etc.) work but require users to provision and manage infrastructure. If external systems are required for such a common use case, that erodes the simplicity of a batteries-included platform.

Solution

A built-in key-value store API scoped per-automation:

Easy — simple GET/SET, familiar Redis-like semantics
Flexible — JSON values, counters, lists, nested paths
Secure — application-level encryption, strict isolation between automations
Concurrent — batch operations with optimistic concurrency control

We do not need high performance—just simplicity and security for occasional state persistence.

Changes

Design Document

docs/kv-store-design.md - Full design specification for the KV store feature

New Files

automation/kv_router.py - KV API endpoints including batch operations
automation/kv_schemas.py - Request/response schemas including batch operation types
automation/kv_helpers.py - Validation and encryption helpers
automation/utils/kv.py - JWT and encryption utilities
migrations/versions/005_add_kv_store.py - Database migration
tests/test_kv_router.py - Integration test coverage
tests/test_kv_batch.py - Unit tests for batch operations
tests/test_kv_helpers.py - Unit tests for helpers

Modified Files

automation/models.py - Added AutomationKV model and enable_kv_store flag on Automation
automation/schemas.py - Added enable_kv_store to request/response schemas
automation/router.py - Include enable_kv_store in automation creation
automation/dispatcher.py - Generate AUTOMATION_KV_TOKEN for runs with KV enabled
automation/config.py - Added kv_secret setting
automation/app.py - Mount KV router
pyproject.toml - Added pyjwt and jwcrypto dependencies

API Endpoints

Endpoint	Method	Description
`/v1/kv`	GET	List all keys
`/v1/kv/batch`	POST	Atomic batch operations with optimistic concurrency
`/v1/kv/{key}`	GET	Get value (supports `?path=` and `?meta=true` for version)
`/v1/kv/{key}`	PUT	Set value (supports `?nx`, `?xx`, `?if_version`)
`/v1/kv/{key}`	PATCH	Update nested path (supports `?if_version`)
`/v1/kv/{key}`	DELETE	Delete key (supports `?if_version`)
`/v1/kv/{key}/incr`	POST	Atomic increment
`/v1/kv/{key}/decr`	POST	Atomic decrement
`/v1/kv/{key}/lpush`	POST	Push to list front
`/v1/kv/{key}/rpush`	POST	Push to list back
`/v1/kv/{key}/lpop`	POST	Pop from list front
`/v1/kv/{key}/rpop`	POST	Pop from list back
`/v1/kv/{key}/len`	GET	Get list length

Concurrency Controls

Single-Document Model

All state for an automation is stored in a single encrypted JSON document. This eliminates multi-key deadlock scenarios—all operations serialize through a single row lock.

Lock Timeout (Deadlock Prevention)

5-second lock timeout prevents stuck transactions from cascading
Returns HTTP 409 Conflict so clients can retry with backoff

Optimistic Concurrency (`$version`)

State includes a $version counter that auto-increments on every write. Use if_version for safe read-modify-write patterns.

On individual endpoints (PUT, PATCH, DELETE):

PUT /v1/kv/state?if_version=4
→ 200 OK (version was 4, now 5)
→ 409 Conflict {"error": "version_mismatch", "expected_version": 4, "actual_version": 6}

On batch endpoint:

POST /v1/kv/batch
{
  "if_version": 4,
  "operations": [
    {"op": "set", "key": "state", "value": {...}}
  ]
}
// → 200: {"version": 5, "results": [...]}
// → 409: {"error": "version_mismatch", "expected_version": 4, "actual_version": 6}

Get current version with ?meta=true:

GET /v1/kv/state?meta=true
// → {"key": "state", "value": {...}, "version": 5, "created_at": "...", "updated_at": "..."}

Batch Operations

Execute multiple operations atomically in a single transaction:

POST /v1/kv/batch
{
  "operations": [
    {"op": "incr", "key": "counter"},
    {"op": "rpush", "key": "log", "value": {"event": "sync"}}
  ]
}
// → {"version": 5, "results": [...]}

Benefits:

Single decrypt → N modifications → single encrypt → commit
Reduces crypto overhead for multiple updates
All operations succeed or none do (atomic)

Reserved Keys

Keys starting with $ are reserved for system use (e.g., $version). Client attempts to set/delete $-prefixed keys are rejected with HTTP 400.

When to Use Version Checks

Operation	Version Check Needed?	Why
`PUT` (set)	Yes	Read-modify-write pattern
`PATCH`	Yes	Read-modify-write pattern
`DELETE`	Optional	Can be useful for safe cleanup
`incr/decr`	No	Atomic, conflict-free
`lpush/rpush`	No	Atomic, conflict-free
`lpop/rpop`	No	Atomic, conflict-free

Deployment

Configuration

The KV store requires the AUTOMATION_KV_SECRET environment variable for JWT signing and value encryption.

Chart PR: OpenHands/OpenHands-Cloud#576 wires this up by reusing the existing jwt-secret — no new secrets needed!

Out of Scope

As noted in the design doc comment, admin debugging tooling for KV data is out of scope for this PR. The encrypted data will be secure at rest; admin tools can be added in a follow-up issue once the permission model is clarified with the platform team.

Closes Feature: Key-Value Store for Automation State Persistence #67
Related: Implement hierarchical concurrency limits for automation runs #72 (concurrency limits), Implement event queue management with backpressure strategies #74 (backpressure strategies)
Chart changes: feat(automation): Add KV store secret configuration OpenHands-Cloud#576

This PR was updated by an AI agent (OpenHands) on behalf of @jpshackelford.

github-actions · 2026-04-24T22:20:56Z

🚀 Deploy Preview PR Created/Updated

A deploy preview has been created/updated for this PR.

Deploy PR: https://github.com/OpenHands/deploy/pull/3920
Automation SHA: 4fc3f05e245b7c3fc24f749c53cee39d0676e5d6
Last updated: May 19, 2026, 12:39:19 PM ET

Once the deploy PR's CI passes, the automation service will be deployed to the feature environment.

github-actions · 2026-04-24T22:21:16Z

jpshackelford · 2026-04-24T23:53:17Z

🔗 Chart PR Linked for Preview Deploy

The deploy preview PR has been updated to use the preview chart from OpenHands-Cloud PR #576, which adds the AUTOMATION_KV_SECRET environment variable configuration.

Deploy PR: https://github.com/OpenHands/deploy/pull/3920
Chart Version: 0.4.13-alpha.576

Once the deploy PR's CI passes, the feature environment will include both:

The KV store API code changes from this PR
The chart changes that wire up the KV secret

This comment was created by an AI agent (OpenHands) on behalf of the user.

jpshackelford · 2026-04-25T02:07:06Z

Out of Scope: Admin Debugging Tooling

We are calling out of scope for this implementation:

An automation-specific role/permission that would allow an engineer to obtain a key to list, retrieve, and manipulate KV data stored for an automation
Admin CLI or API endpoints for debugging encrypted KV store contents

Context: Automations are linked to organizations (org_id), but the automation service delegates role/permission management to the main OpenHands platform. Adding admin debugging capabilities would require coordination with the platform team to define appropriate permissions (e.g., automation:kv:admin).

For now: All KV data is encrypted at rest. Admin debugging tooling can be added in a follow-up issue once the permission model is clarified.

This comment was migrated from PR #68 (design document PR) which has been merged into this PR.

jpshackelford · 2026-04-25T07:00:14Z

✅ KV Store E2E Tests Passed (Thorough Mode)

Tested against jps01.r9.all-hands.dev with PR-69 image:

============================================================
KV STORE E2E TEST (THOROUGH MODE)
Running 26 tests
============================================================

[TEST] SET and GET
  PUT /test_key: 201
  GET /test_key: 200
  PASS

[TEST] DELETE
  DELETE /to_delete: 200
  GET after delete: 404
  PASS

[TEST] INCR and DECR
  INCR by 5: 200, value=15
  DECR by 3: 200, value=12
  PASS

[TEST] List operations
  RPUSH 'a': 200, length=1
  LPUSH 'z': 200, length=4
  LPOP: 200, value=z
  RPOP: 200, value=c
  LEN: 200, length=2
  PASS

[TEST] Nested path operations
  PATCH database.port=5433: 200
  GET with path: 200, value=5433
  PASS

[TEST] Conditional SET (nx)
  PUT with nx=true (new): 201
  PUT with nx=true (exists): 409
  PASS

[TEST] List keys
  GET /kv: 200, count=7
  PASS

[TEST] GET with metadata
  GET with meta=true: 200
  created_at: 2026-04-25T07:03:03.895583+00:00
  PASS

[TEST] GET non-existent key
  GET /nonexistent: 404
  PASS

[TEST] GET non-existent nested path
  GET with invalid path: 404
  PASS

[TEST] DELETE non-existent key
  DELETE /nonexistent: 200, deleted=False
  PASS

[TEST] Conditional SET (xx)
  PUT with xx=true (missing): 409
  PUT with xx=true (exists): 200
  PASS

[TEST] PATCH non-existent key
  PATCH /nonexistent: 404
  PASS

[TEST] INCR new key
  INCR new key: 200, value=1
  PASS

[TEST] DECR new key
  DECR new key: 200, value=-1
  PASS

[TEST] INCR non-numeric
  INCR string value: 400
  PASS

[TEST] LPOP empty list
  LPOP empty: 200, value=None
  PASS

[TEST] LPOP non-existent key
  LPOP nonexistent: 200, value=None
  PASS

[TEST] RPUSH to non-list
  RPUSH to dict: 400
  PASS

[TEST] LEN non-existent key
  LEN nonexistent: 404
  PASS

[TEST] Special characters in key
  PUT /test-key_123: 201
  PASS

[TEST] Store null value
  PUT null: 422
  PASS

[TEST] Various JSON types
  string_type: OK
  number_int: OK
  number_float: OK
  boolean_true: OK
  boolean_false: OK
  array_type: OK
  nested_obj: OK
  PASS

[TEST] Auth - missing token
  GET without token: 422
  PASS

[TEST] Auth - invalid token
  GET with invalid token: 401
  PASS

[TEST] Invalid JSON body
  PUT invalid JSON: 422
  PASS

============================================================
RESULTS (THOROUGH): 26 passed, 0 failed
============================================================

KV_STORE_ALL_TESTS_PASSED

All 26 test cases verified:

✅ Basic SET/GET with proper status codes (201 create, 200 read)
✅ DELETE with 404 on subsequent GET
✅ INCR/DECR numeric operations
✅ List operations (RPUSH, LPUSH, LPOP, RPOP, LEN)
✅ Nested path PATCH and GET
✅ Conditional SET with nx=true and xx=true
✅ List all keys
✅ Metadata retrieval with meta=true
✅ Error handling (404 for missing keys, 400 for type errors, 409 for conflicts)
✅ Auth validation (401 invalid token, 422 missing token)
✅ Input validation (422 for null body, invalid JSON)
✅ Various JSON types (string, int, float, bool, array, object)

This comment was created by an AI agent (OpenHands) on behalf of @jpshackelford.

jpshackelford · 2026-04-25T07:58:41Z

Still doing some clean-up in response to OpenHands review feedback.

Implements a built-in key-value store API scoped per-automation for state persistence between runs, as specified in issue #67. Features: - New AutomationKV model with encrypted value storage - enable_kv_store flag on Automation model (opt-in) - Per-run JWT tokens for KV authentication - JWE encryption for all stored values - Full Redis-like API: GET, SET, DELETE, INCR/DECR, LPUSH/RPUSH/LPOP/RPOP - Support for nested paths (dot notation) - Conditional SET operations (nx=true, xx=true) - Database migration for new table Endpoints: - GET /v1/kv - List all keys - GET /v1/kv/{key} - Get value (with optional path and meta params) - PUT /v1/kv/{key} - Set value (with optional nx/xx params) - PATCH /v1/kv/{key} - Update nested path - DELETE /v1/kv/{key} - Delete key - POST /v1/kv/{key}/incr - Atomic increment - POST /v1/kv/{key}/decr - Atomic decrement - POST /v1/kv/{key}/lpush - Push to list front - POST /v1/kv/{key}/rpush - Push to list back - POST /v1/kv/{key}/lpop - Pop from list front - POST /v1/kv/{key}/rpop - Pop from list back - GET /v1/kv/{key}/len - Get list length Related: #67 Co-authored-by: openhands <openhands@all-hands.dev>

- Add type: ignore comments for jwcrypto library calls (missing stubs) - Fix pyright type error for rowcount access on delete result - Apply ruff formatting to migration file - Make automation_with_kv fixture autouse=True so KV tests have parent Automation record for foreign key constraint - Remove unused create_kv_token from top-level test imports Co-authored-by: openhands <openhands@all-hands.dev>

- Set AUTOMATION_KV_SECRET env var in kv_client test fixture to match the test encryption key - Use Annotated[Any, Body()] for the set_value endpoint body parameter to properly parse arbitrary JSON body Co-authored-by: openhands <openhands@all-hands.dev>

Design for a built-in key-value store API for automation state persistence. Key features: - Scoped per-automation with strict isolation - Redis-like API semantics (GET, SET, INCR, list operations) - Application-level encryption (JWE) for all values - JWT-based authentication per automation run Closes #67 Co-authored-by: openhands <openhands@all-hands.dev>

Add enable_kv_store parameter to CreatePromptAutomationRequest and CreatePluginAutomationRequest schemas, and pass it through when creating Automation records. This allows users to enable the KV store via preset endpoints (/v1/preset/prompt and /v1/preset/plugin), making state persistence available for prompt-based and plugin-based automations. Co-authored-by: openhands <openhands@all-hands.dev>

Adds scripts/test_kv_e2e.py which tests the KV store API end-to-end: 1. Creates a real automation via API (with enable_kv_store=true) 2. Generates a KV token for that automation 3. Uses run_automation() for blocking execution with stdout/stderr capture 4. Runs comprehensive test suite covering all KV operations: - SET/GET, INCR/DECR, list operations (RPUSH/LPUSH/LPOP/RPOP/LEN) - Nested path operations (PATCH, GET with path) - Conditional SET (nx, xx flags) - List keys, DELETE, GET with metadata 5. Cleans up the automation after test Usage: export OPENHANDS_API_KEY="sk-oh-..." export AUTOMATION_KV_SECRET="<same-as-staging>" uv run python scripts/test_kv_e2e.py Co-authored-by: openhands <openhands@all-hands.dev>

@Thorough

- Refactor test_kv_e2e.py with @Quick and @Thorough decorators - Quick mode: 8 smoke tests covering core operations (~30s) - Thorough mode: 26 tests including edge cases and error paths (~2min) - Add tests for: 404 handling, auth failures, type mismatches, empty lists, null values, various JSON types - Tests always run to completion (no early exit on failure) - Add kv-store-test-plan.md documenting manual test cases Usage: python scripts/test_kv_e2e.py # quick (default) python scripts/test_kv_e2e.py --thorough # full suite

- Reduce from 1543 to 167 lines (~90% reduction) - Remove individual TC-* test cases (now in test_kv_e2e.py) - Remove hardcoded API key (security) - Keep: critical bug docs, debugging workflow, manual-only tests - Keep: cross-automation isolation & persistence tests (not yet automated) - Keep: quick reference commands

Adds logging at: 1. App startup - logs whether kv_secret is present and its length 2. Dispatcher run - logs enable_kv_store flag and kv_secret presence before token generation This will help debug why KV tokens aren't being injected into sandbox runs. Co-authored-by: openhands <openhands@all-hands.dev>

The file had corrupted/duplicate content where lines from the embedded KV_TEST_SCRIPT and duplicate function definitions got accidentally inserted mid-file. This removes: 1. Corrupted line 939 where text was merged incorrectly 2. Duplicate test functions (test_list_keys, test_delete, etc) that were already defined inside KV_TEST_SCRIPT 3. Duplicate create_automation, delete_automation, and main() functions (kept the version with quick/thorough mode support) File reduced from 1231 to 943 lines. Co-authored-by: openhands <openhands@all-hands.dev>

- 201 Created: when a new key is created (including nx=true success) - 200 OK: when an existing key is updated - 409 Conflict: when nx=true but key already exists This aligns with REST semantics and fixes the E2E test expectation. Co-authored-by: openhands <openhands@all-hands.dev>

- test_set_new_value: expect 201 Created (new key) - test_set_update_existing: expect 200 OK (update) + assert created=False - test_set_nx_creates_new: expect 201 Created (new key via nx) - test_set_nx_fails_if_exists: expect 409 Conflict (nx but key exists) Also fixes trailing whitespace formatting in test_kv_e2e.py. Co-authored-by: openhands <openhands@all-hands.dev>

- Add blank line after inline import in execution.py - Split long import line in kv_router.py - Remove unused json import from test_kv_e2e.py - Fix long lines in test_kv_e2e.py Co-authored-by: openhands <openhands@all-hands.dev>

- Check if key exists before upsert to determine created status - Use func.now() to properly update the timestamp on conflict - This ensures 200 for updates and 201 for inserts Co-authored-by: openhands <openhands@all-hands.dev>

- Remove trailing whitespace from test_kv_e2e.py - Add noqa: E402 for necessary path manipulation imports - Fix log_extra to accept **kwargs for additional structured logging fields Co-authored-by: openhands <openhands@all-hands.dev>

Make it clear that 403 on cleanup means the API key lacks manage_automations permission, not a test failure. Co-authored-by: openhands <openhands@all-hands.dev>

- xx=true returns 409 Conflict (not 404/412) when key doesn't exist - null body returns 422 (FastAPI validation rejects empty body) - Missing token returns 422 (header validation before auth check) Co-authored-by: openhands <openhands@all-hands.dev>

Move all request/response schemas from kv_router.py to a dedicated kv_schemas.py module for better separation of concerns. - Creates automation/kv_schemas.py with 12 Pydantic models - Updates kv_router.py to import from kv_schemas - Reduces kv_router.py from 969 to 883 lines Co-authored-by: openhands <openhands@all-hands.dev>

Move path parsing and nested value manipulation functions to a dedicated kv_helpers.py module: - parse_path(): Parse dot/bracket notation paths - get_nested_value(): Traverse nested dicts/lists - set_nested_value(): Set values at nested paths Reduces kv_router.py from 883 to 790 lines. Co-authored-by: openhands <openhands@all-hands.dev>

- dispatcher.py: Remove verbose 'KV store check' info log that runs for every automation. Keep single debug log when KV is enabled. - app.py: Consolidate startup KV log into service startup message. Remove kv_secret_len as it provides no useful info and could theoretically leak entropy information. Co-authored-by: openhands <openhands@all-hands.dev>

Add test coverage for parse_path, get_nested_value, and set_nested_value: - TestParsePath: 14 tests covering dot notation, bracket notation, edge cases (empty, trailing dots, unclosed brackets) - TestGetNestedValue: 13 tests for dict/list traversal and error cases - TestSetNestedValue: 7 tests for value setting and intermediate creation All 32 tests pass locally. Co-authored-by: openhands <openhands@all-hands.dev>

Adds AUTOMATION_KV_MAX_VALUE_SIZE setting (default 64KB) to prevent abuse by limiting the size of values stored in the KV store. Size validation is applied to all write operations: - PUT /kv/{key} (set_value) - validates the incoming body - PATCH /kv/{key} (patch_value) - validates after applying the patch - POST /kv/{key}/lpush - validates the resulting list - POST /kv/{key}/rpush - validates the resulting list Returns HTTP 413 Payload Too Large when limit is exceeded. Includes tests using a 1KB limit to verify enforcement on all endpoints. Co-authored-by: openhands <openhands@all-hands.dev>

Replace JWE (JSON Web Encryption) with raw AES-256-GCM and store encrypted values as BYTEA instead of TEXT. This reduces storage overhead significantly: Before (JWE + TEXT): - 14-byte plaintext → 100 bytes stored (7x overhead) - Base64 encoding adds ~33% to all values After (AES-GCM + BYTEA): - 14-byte plaintext → 42 bytes stored (3x overhead) - Fixed 28-byte overhead (12-byte nonce + 16-byte auth tag) - For larger values, overhead approaches 0% Changes: - automation/utils/kv.py: Replace jwcrypto with cryptography library, comprehensive documentation of design decisions, PostgreSQL TOAST considerations, wire format, and key derivation rationale - automation/models.py: Change value_encrypted from Text to LargeBinary - migrations/005: Update column type to LargeBinary (BYTEA) - pyproject.toml: Replace jwcrypto dependency with cryptography The module docstring now documents: - Three encryption approaches evaluated and why we chose raw AES-GCM - PostgreSQL TOAST performance implications (<2KB optimal, 2-8KB ok, >8KB chunked) - Key derivation approach and potential HKDF improvement - Wire format diagram (nonce || ciphertext || auth_tag) Co-authored-by: openhands <openhands@all-hands.dev>

Add section to kv.py module docstring explaining that JSONB can't be used because we encrypt values at the application layer, producing opaque binary data rather than valid JSON. Co-authored-by: openhands <openhands@all-hands.dev>

Include a table showing PostgreSQL TOAST behavior at different size limits (64KB-512KB), helping operators make informed decisions about the tradeoff between flexibility and read performance. Co-authored-by: openhands <openhands@all-hands.dev>

- SET STORAGE EXTERNAL on value_encrypted column to skip futile compression attempts on encrypted (high-entropy) data - Add COMMENT ON TABLE/COLUMN for schema-level documentation visible to DBAs and database tools Expanded migration docstring explains: - Why BYTEA over TEXT/JSONB (encrypted data is raw bytes) - Why EXTERNAL over EXTENDED (encrypted data won't compress) - Why schema comments (self-documenting for DB inspection) Co-authored-by: openhands <openhands@all-hands.dev>

Enforce RFC 8259 compliant JSON to prevent interoperability issues: Validation rules: - Reject NaN, Infinity, -Infinity (not valid JSON, causes issues with JavaScript JSON.parse and other strict parsers) - Maximum nesting depth of 32 levels (prevents stack overflow DoS) - Only JSON-serializable types accepted Implementation: - automation/utils/kv.py: Add _validate_json_value() with allow_nan=False, _check_nesting_depth() for DoS prevention, new KVValueError exception - automation/kv_helpers.py: Add safe_encrypt() helper that converts KVValueError to HTTP 400, KVEncryptionError to HTTP 500 - automation/kv_router.py: Replace all encrypt_value try/except blocks with safe_encrypt() for cleaner code and consistent error handling Error responses: - 400 Bad Request for invalid values (client's fault) - 500 Internal Server Error for encryption failures (our fault) Tests added for: - NaN, Infinity, -Infinity rejection - Deep nesting rejection (35 levels) - Valid nesting acceptance (10 levels) - All standard JSON types acceptance Co-authored-by: openhands <openhands@all-hands.dev>

Extract repeated patterns into reusable helpers in kv_helpers.py: New helpers: - safe_decrypt(): Wraps decrypt_value with HTTP 500 error handling (matches safe_encrypt pattern, replaces 9 try/except blocks) - require_dict(): Type check for dict values (HTTP 400 on failure) - require_list(): Type check for list values (HTTP 400 on failure) - require_numeric(): Type check for int/float values (HTTP 400 on failure) Benefits: - Reduced boilerplate in kv_router.py (~100 lines removed) - Consistent error handling across all endpoints - Single point of change for error messages/logging - Cleaner, more readable endpoint implementations Router cleanup: - Removed KVEncryptionError and decrypt_value imports (now internal) - Replaced 9 decrypt try/except blocks with safe_decrypt() - Replaced 8 isinstance type checks with require_* helpers Co-authored-by: openhands <openhands@all-hands.dev>

This commit adds comprehensive validation to protect against accidental, malicious, and ignorant clients: Key Validation (validate_key): - Reject empty and whitespace-only keys - Enforce max key length (255 chars, matching DB constraint) - Reject control characters (null bytes, newlines, tabs, etc.) - Provide helpful error messages with character position Type Validation Improvements: - require_numeric: Now rejects booleans (True/False) explicitly - Prevents confusing behavior where True becomes 2 after increment - require_int: New function for incr/decr operations - Rejects floats to prevent silent precision loss - Rejects booleans for same reason as require_numeric Path Validation: - Added _MAX_PATH_DEPTH (32) limit to parse_path() - Prevents DoS via extremely long dot-notation paths Router Changes: - Added ValidatedKey type alias using Depends for consistent key validation - All endpoints now validate key parameter automatically - incr/decr now use require_int instead of require_numeric - Fixed incr/decr to use integer arithmetic (was doing int(value + by)) Tests: - Added 54+ new tests for key validation edge cases - Added tests for boolean rejection in require_numeric/require_int - Added tests for float rejection in require_int - Added tests for path depth limiting - Added API-level tests for incr/decr integer-only behavior Co-authored-by: openhands <openhands@all-hands.dev>

- Fix line too long errors in kv_helpers.py error messages - Add explicit type annotations in test_kv_router.py for nested dict tests - Remove unused require_numeric import from kv_router.py - Apply ruff formatting to migration file Co-authored-by: openhands <openhands@all-hands.dev>

…evention ## Problem The original implementation stored each KV 'key' as a separate row in the automation_kv table. This created deadlock risk when: - Concurrent requests modified different keys for the same automation - Client code acquired locks on multiple keys in different orders - The FOR UPDATE locking strategy serialized on N separate rows ## Solution: Single-Document Design Changed to store all KV state in ONE encrypted JSON document per automation. API 'keys' (e.g., /kv/config, /kv/counter) are now top-level fields within this single document. Benefits: - Only ONE row per automation to lock - All operations serialize through that single lock - No multi-key ordering issues possible - Eliminates deadlock risk between keys Trade-off: Every operation reads/writes the entire state blob. This is acceptable because automation state is intended to be small (cursors, counters, configs) and access is infrequent (scheduled runs). Co-authored-by: openhands <openhands@all-hands.dev>

Co-authored-by: openhands <openhands@all-hands.dev>

- Apply ruff formatting fixes - Add assert state is not None before subscripting in tests Co-authored-by: openhands <openhands@all-hands.dev>

…rotection Add bounded lock wait times and connection pool timeout to prevent a single slow KV operation from cascading into system-wide instability: 1. Lock Timeout (kv_router.py): - Add SET LOCAL lock_timeout = '5000ms' before FOR UPDATE queries - Prevents indefinite waits when another transaction holds the row lock - If lock times out, return HTTP 409 Conflict so clients can retry - SET LOCAL scopes timeout to current transaction only 2. Pool Exhaustion Protection (db.py, config.py): - Add db_pool_timeout config setting (default 30s) - Apply pool_timeout to both standard and GCP Cloud SQL engines - Surfaces pool exhaustion as errors instead of indefinite hangs Why these protections matter: Without lock timeout, a slow/stuck transaction (e.g., network issue during commit, slow crypto) causes all concurrent KV operations on that automation to queue indefinitely. With enough concurrent requests, this exhausts the connection pool and degrades the entire service—not just one automation. The single-document design already eliminates multi-row deadlock scenarios. These timeouts provide defense-in-depth against operational issues. Co-authored-by: openhands <openhands@all-hands.dev>

- Add POST /v1/kv/batch endpoint for atomic batch operations - Implement $version meta key that auto-increments on every write - Support if_version parameter for optimistic concurrency (409 on mismatch) - Reserve $-prefixed keys for system use (reject client writes) - Filter $-prefixed keys from list_keys response - Include version in ?meta=true responses - Add _has_user_keys() to properly handle delete-last-key case Operations supported in batch: - set, delete, incr, decr, lpush, rpush, lpop, rpop, patch Tests: 700 pass, ruff and pyright clean

- Add if_version query param to PUT /v1/kv/{key} for optimistic concurrency - Add if_version query param to PATCH /v1/kv/{key} for optimistic concurrency - Add if_version query param to DELETE /v1/kv/{key} for optimistic concurrency - Add integration tests for if_version on individual endpoints - Fix pyright type errors in tests The atomic operations (incr, decr, lpush, rpush, lpop, rpop) don't need version checks because they are conflict-free by design. Tests: 707 pass, pre-commit clean

Update kv_router.py to use get_config().service instead of the deprecated get_settings() function. This aligns with the configuration centralization done in PR #73. Also add **kwargs support to the centralized log_extra() function to support additional context fields needed by the KV store logging. Co-authored-by: openhands <openhands@all-hands.dev>

Add Configuration section documenting: - The composed AppConfig structure with typed sections - Proper usage of get_config().service vs deprecated get_settings() - Where to find environment variable documentation - Protocol constants location Co-authored-by: openhands <openhands@all-hands.dev>

…ings().cache_clear() The get_settings() function is now a deprecated wrapper that doesn't have a cache_clear() method. Use clear_config_cache() from automation.config which properly clears the lru_cache on get_config(). Co-authored-by: openhands <openhands@all-hands.dev>

Implements follow-up tasks from PR #69 for KV store concurrency: 1. Statement Timeout (Safety Net) - Added statement_timeout = 2x lock_timeout to catch runaway operations - Detects PostgreSQL error codes 55P03 (lock) and 57014 (statement) 2. Retry-After Header - All 409 responses include Retry-After: 1 header - New _raise_version_conflict() helper for consistent version mismatch errors 3. Configurable Lock Timeout Per-Automation - Added kv_lock_timeout_ms field to Automation model (100-30000ms) - Schema validation in CreateAutomationRequest/UpdateAutomationRequest - Migration 006_add_kv_lock_timeout.py - Embedded in JWT token claims for fast access (no extra DB query) - KVTokenClaims class with automation_id and lock_timeout_ms - Backward compatible with old tokens (uses default 5000ms) 4. Metrics and Observability - New kv_metrics.py module with Prometheus metrics: - kv_operation_duration_seconds (histogram) - kv_lock_wait_duration_seconds (histogram) - kv_conflict_total (counter by reason) - kv_state_size_bytes (histogram) - Integrated into router: lock waits, conflicts, state size 5. Client Documentation - Comprehensive docs/kv-store-client-guide.md - Covers basic/advanced operations, concurrency patterns - Best practices, error handling, debugging tips - Lock timeout configuration guidance Testing: - 27 new tests in test_kv_concurrency.py - All 168 KV tests pass Co-authored-by: openhands <openhands@all-hands.dev>

Following the centralized configuration pattern from PR #73, this moves KV store settings from ServiceSettings to a dedicated KVSettings section: - Add KVSettings class with kv_secret, kv_max_value_size, enabled property - Add config.kv property to AppConfig - Update kv_router.py to use get_config().kv for all KV settings - Update dispatcher.py to use get_config().kv for KV token generation - Update AGENTS.md with KV config section in documentation This keeps feature-specific configuration organized and follows the existing pattern of http, sandbox, storage, and log sections. Co-authored-by: openhands <openhands@all-hands.dev>

Co-authored-by: openhands <openhands@all-hands.dev>

- Add get_config import to app.py for kv.enabled check - Use CronTrigger instead of dict in validation tests - Use cast() for HTTPException.detail dict access - Fix import ordering in migration file Co-authored-by: openhands <openhands@all-hands.dev>

Sort imports alphabetically (CreateAutomationRequest before CronTrigger) to satisfy ruff linting. Co-authored-by: openhands <openhands@all-hands.dev>

The endpoints now use get_token_claims which returns KVTokenClaims instead of get_automation_id_from_token which returns UUID. This update: - Import get_token_claims instead of get_automation_id_from_token - Import KVTokenClaims and DEFAULT_LOCK_TIMEOUT_MS from utils.kv - Update kv_client fixture to override get_token_claims with KVTokenClaims - Fix test_create_and_verify_token to compare result.automation_id Co-authored-by: openhands <openhands@all-hands.dev>

tofarr

There is a lot of code related to wheterh kv is enabled or whether it is enabled for a automation - I think it should just always be enabled - something that each automation gets by default
Do we need to update the skill for automations to let the agent know that this facility is available?
We have a lot of code related to encryption. Could we use the Cipher class from the sdk to simplify / remove some of this?

Resolves conflicts from the automation/ -> openhands/automation/ directory rename and from the SQLite/local-mode and pyproject.toml changes on main. - Move dispatcher KV-token injection into the new dispatcher.py - Rewrite all KV module imports to use openhands.automation.* - Renumber KV migrations to 006_add_kv_store and 007_add_kv_lock_timeout so they sit after 005_add_bash_command_id Co-authored-by: openhands <openhands@all-hands.dev>

…Cipher - Drop enable_kv_store / kv_lock_timeout_ms per-automation fields. KV is always available; the dispatcher injects AUTOMATION_KV_TOKEN whenever the service has AUTOMATION_KV_SECRET configured. - Move kv_lock_timeout_ms to KVSettings as a service-wide config knob (AUTOMATION_KV_LOCK_TIMEOUT_MS, default 5000). - Replace the custom AES-256-GCM JWE code with the SDK's Cipher helper (Fernet), dropping the cryptography>=42 dependency. - Switch state_encrypted from BYTEA/LargeBinary to TEXT (Fernet tokens are URL-safe base64 strings). - Skip the new COMMENT ON TABLE statements when running migrations on SQLite so dev/test setups still work. - Update docs (design, client guide, test plan) and the kv_router module docstring to reflect the new model. Co-authored-by: openhands <openhands@all-hands.dev>

tofarr · 2026-05-19T16:38:52Z

Picking this up on John's behalf since it's unblocking some work on my end. Addressed @tofarr's review in 4fc3f05:

"a lot of code related to whether kv is enabled / per-automation" — Removed entirely. There is no longer an enable_kv_store field on Automation, no enable_kv_store knob in the schemas / router / preset router, and no per-automation lock-timeout override. The dispatcher injects AUTOMATION_KV_TOKEN for every run whenever the service has AUTOMATION_KV_SECRET configured, and the row-lock timeout is now a single service-wide config value (AUTOMATION_KV_LOCK_TIMEOUT_MS, default 5000ms).
"update the skill for automations" — Deferring to a follow-up PR; tracked in the design doc's implementation checklist. Wanted to keep this PR scoped to the service-side cleanup so it's easier to review.
"could we use the Cipher class from the SDK" — Done. utils/kv.py now uses openhands.sdk.utils.cipher.Cipher for value encryption (Fernet under the hood). Dropped the cryptography>=42 direct dependency and the custom AES-256-GCM JWE wrapper code. Since Fernet emits URL-safe base64 text rather than raw bytes, automation_kv.state_encrypted is now TEXT (was BYTEA/LargeBinary); the migration was updated in place and the old 007_add_kv_lock_timeout migration is gone (its only purpose was the per-automation timeout column).

Also:

Skipped the new COMMENT ON TABLE statements when running migrations on SQLite so the existing test_db.py::TestSqliteMigrations tests still pass.
Updated docs/kv-store-design.md, docs/kv-store-client-guide.md, and docs/kv-store-test-plan.md to match (removed all references to enable_kv_store, per-automation kv_lock_timeout_ms, AUTOMATION_JWT_SECRET, jwcrypto, JWE/A256GCM, and LargeBinary).

KV tests (test_kv_helpers, test_kv_batch, test_kv_concurrency) and test_db.py all pass locally. Remaining test ERRORs are testcontainers/Docker-related and reproduce on main too, so they're not from this change.

This comment was written by an AI agent (OpenHands) on behalf of @jpshackelford.

tofarr · 2026-05-19T17:34:42Z

Dropping by as a near-term consumer — I'm building slack-channel-listener (a skill + plugin that lets users wire any Slack channel up as an OpenHands trigger), and this PR is exactly the primitive I need. Before I saw it I was about to invent a SLACK_STATE_DIR=/automation/storage/state convention with a bundled SQLite file, which would have been the wrong layer to solve it at.

I've gone ahead and built against the API as specced in docs/kv-store-client-guide.md — state.py in extensions#245 has a pluggable Store interface with two backends, KVApiStore (preferred, activates when AUTOMATION_KV_TOKEN is injected) and SQLiteStore (fallback, keeps the skill usable on runtimes that don't ship this PR yet). The KV implementation uses the single-document model with ?if_version=N and jittered exponential backoff on 409 per the client guide.

A few pieces of consumer feedback while it's fresh:

Strong points

The single-document model + row-lock + $version story is the right call for this workload. Concurrent contention is rare (cron is single-instance by default) and when it does happen, optimistic retry is dead simple to implement client-side.
Per-automation isolation + encryption-at-rest means I don't have to think about secrets management for state — this would have been the most annoying part of rolling my own.
AUTOMATION_KV_TOKEN + AUTOMATION_ENABLE_KV_STORE env-var injection in dispatcher.py makes the client-side detection trivial: I just check for the token's presence in get_store().
Batch operations are over-spec for my use case but I can see them being load-bearing for skills that need to update several counters atomically (e.g., "increment per-user mention count AND push to recent-mentions list").

Things I bumped into while implementing

No TTL primitive. For my use case I need to drop entries older than 2 × poll_lookback on every run to stay under the 64 KB cap. I implemented it client-side as a prune_older_than(cutoff_iso) that does an RMW and filters the dict — fine, but a server-side ?ttl=seconds on PUT (or a periodic-prune query) would generalise nicely for any skill that accumulates per-key timestamps. Not a blocker, but file under "obvious follow-up".
64 KB ceiling. For my {<channel>:<ts>: {...}} shape that's ~600–1000 entries. Plenty with pruning, but 413 with no per-key delete fallback would be a bad surprise for callers who haven't read the guide carefully. Right now I handle 413 by aggressively pruning to last-24h and retrying once, but it's not a great UX — the script either has to know its own retention policy or guess. Two things would help:
- The 413 response body could include current_size_bytes so clients can decide how much to prune.
- Document somewhere prominent that 64 KB is JSON-encoded after encryption-overhead, not raw value bytes, so callers can size accurately.
AUTOMATION_API_URL is implicit. The client guide uses it freely but it's an existing env var, not introduced by this PR. My factory ends up checking both AUTOMATION_API_URL and (fallback) OPENHANDS_CLOUD_API_URL for resilience. Worth being explicit in the client guide that this env var pre-dates the KV store and is contributed by the dispatcher in general, not by enable_kv_store: true.
Python helper would be high-leverage. I ended up writing ~60 lines of urllib-based RMW-with-retry logic that essentially recapitulates the safe_update() example in the client guide. Every skill that uses KV will rewrite some version of this. Worth considering whether a tiny openhands_automation_kv PyPI package (or, simpler, an automation-kv plugin in OpenHands/extensions) should land alongside this PR with claim() / txn() / incr() helpers and built-in backoff. I'm happy to take the first cut at that if you want — extensions#245 has working code I can extract.
reactions:read is not the right concurrency primitive for everything Slack. Just confirming explicitly that this PR removes the temptation to use it — my earlier draft of the slack-channel-listener was using Slack reactions as a state machine (claim by adding 👀, finish with ✅), and it's clever but it leaks implementation detail into the user-facing channel and is fragile to scope changes. The KV store is unambiguously the right layer. I've now demoted reactions in extensions#245 to opt-in visual UX only.

Question I couldn't answer from the docs

What's the lifetime of AUTOMATION_KV_TOKEN? For sub-10-minute cron runs it doesn't matter, but I'd like to know what "long" means before recommending the KV path for batch-mode automations.

Looking forward to this landing. Happy to be the first dogfood consumer once it does — I'll flip extensions#245 from draft to ready-for-review the moment a Cloud build with this PR is reachable.

This comment was posted by an AI agent (OpenHands) on behalf of @tofarr.

jpshackelford · 2026-05-19T22:04:51Z

@tofarr Awesome feedback thank you. Will incorporate your feedback or you are of course welcome to do so.

Beyond that, things I was thinking of doing still include:

Building the client library and a CLI to make it easy to build with this and to allow developers to debug their work given the that DB values are encrypted. My preferred way to do that would be to make automation repo a mono repo so that it cloud include the client library and cli packages. I recognize that that could be disruptive and happy to fallback to a separate repo if needed.
I had a few more tests I wanted to add to test suite.
Commits against this PR by @tofarr or other OpenHands/automation maintainers welcome!

jpshackelford mentioned this pull request Apr 24, 2026

feat(automation): Add KV store secret configuration OpenHands/OpenHands-Cloud#576

Open

jpshackelford mentioned this pull request Apr 25, 2026

Design: Key-Value Store for Automation State Persistence #68

Closed

jpshackelford force-pushed the feature/kv-store-api branch from a9987ed to 7d76068 Compare April 25, 2026 16:53

openhands-agent and others added 21 commits April 25, 2026 23:56

fix: Format code to pass ruff checks

bf5e9c6

- Add blank line after inline import in execution.py - Split long import line in kv_router.py - Remove unused json import from test_kv_e2e.py - Fix long lines in test_kv_e2e.py Co-authored-by: openhands <openhands@all-hands.dev>

fix: Properly detect insert vs update in KV SET

a9d8ec0

- Check if key exists before upsert to determine created status - Use func.now() to properly update the timestamp on conflict - This ensures 200 for updates and 201 for inserts Co-authored-by: openhands <openhands@all-hands.dev>

fix: Fix pre-commit issues

3b11400

- Remove trailing whitespace from test_kv_e2e.py - Add noqa: E402 for necessary path manipulation imports - Fix log_extra to accept **kwargs for additional structured logging fields Co-authored-by: openhands <openhands@all-hands.dev>

fix: Improve cleanup error message for 403

2f81408

Make it clear that 403 on cleanup means the API key lacks manage_automations permission, not a test failure. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-agent added 24 commits April 25, 2026 23:57

docs: Explain why JSONB is not used for KV storage

c56f6f8

Add section to kv.py module docstring explaining that JSONB can't be used because we encrypt values at the application layer, producing opaque binary data rather than valid JSON. Co-authored-by: openhands <openhands@all-hands.dev>

docs: update design doc for single-document KV storage

5af8932

Co-authored-by: openhands <openhands@all-hands.dev>

fix: address lint and type errors

ecdf0e9

- Apply ruff formatting fixes - Add assert state is not None before subscripting in tests Co-authored-by: openhands <openhands@all-hands.dev>

style: Fix import ordering in test_kv_concurrency.py

46e7ee1

Co-authored-by: openhands <openhands@all-hands.dev>

style: Fix import ordering in test_kv_concurrency.py

89a449e

Sort imports alphabetically (CreateAutomationRequest before CronTrigger) to satisfy ruff linting. Co-authored-by: openhands <openhands@all-hands.dev>

tofarr requested changes May 19, 2026

View reviewed changes

openhands-agent added 2 commits May 19, 2026 16:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add KV store API for automation state persistence#69

feat: Add KV store API for automation state persistence#69
jpshackelford wants to merge 51 commits into
mainfrom
feature/kv-store-api

jpshackelford commented Apr 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

jpshackelford commented Apr 24, 2026

Uh oh!

jpshackelford commented Apr 25, 2026

Uh oh!

jpshackelford commented Apr 25, 2026 •

edited

Loading

Uh oh!

jpshackelford commented Apr 25, 2026

Uh oh!

tofarr left a comment

Uh oh!

tofarr commented May 19, 2026

Uh oh!

tofarr commented May 19, 2026

Uh oh!

jpshackelford commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jpshackelford commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Design Document

New Files

Modified Files

API Endpoints

Concurrency Controls

Single-Document Model

Lock Timeout (Deadlock Prevention)

Optimistic Concurrency ($version)

Batch Operations

Reserved Keys

When to Use Version Checks

Deployment

Configuration

Out of Scope

Related

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jpshackelford commented Apr 24, 2026

Uh oh!

jpshackelford commented Apr 25, 2026

Out of Scope: Admin Debugging Tooling

Uh oh!

jpshackelford commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ KV Store E2E Tests Passed (Thorough Mode)

Uh oh!

jpshackelford commented Apr 25, 2026

Uh oh!

tofarr left a comment

Choose a reason for hiding this comment

Uh oh!

tofarr commented May 19, 2026

Uh oh!

tofarr commented May 19, 2026

Uh oh!

jpshackelford commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jpshackelford commented Apr 24, 2026 •

edited

Loading

Optimistic Concurrency (`$version`)

github-actions Bot commented Apr 24, 2026 •

edited

Loading

github-actions Bot commented Apr 24, 2026 •

edited

Loading

jpshackelford commented Apr 25, 2026 •

edited

Loading