feat: Add KV store API for automation state persistence#69
feat: Add KV store API for automation state persistence#69jpshackelford wants to merge 51 commits into
Conversation
|
🚀 Deploy Preview PR Created/Updated A deploy preview has been created/updated for this PR. Deploy PR: https://github.com/OpenHands/deploy/pull/3920 Once the deploy PR's CI passes, the automation service will be deployed to the feature environment. |
|
🔗 Chart PR Linked for Preview Deploy The deploy preview PR has been updated to use the preview chart from OpenHands-Cloud PR #576, which adds the Deploy PR: https://github.com/OpenHands/deploy/pull/3920 Once the deploy PR's CI passes, the feature environment will include both:
This comment was created by an AI agent (OpenHands) on behalf of the user. |
Out of Scope: Admin Debugging ToolingWe are calling out of scope for this implementation:
Context: Automations are linked to organizations ( For now: All KV data is encrypted at rest. Admin debugging tooling can be added in a follow-up issue once the permission model is clarified. This comment was migrated from PR #68 (design document PR) which has been merged into this PR. |
✅ KV Store E2E Tests Passed (Thorough Mode)Tested against All 26 test cases verified:
This comment was created by an AI agent (OpenHands) on behalf of @jpshackelford. |
|
Still doing some clean-up in response to OpenHands review feedback. |
a9987ed to
7d76068
Compare
Implements a built-in key-value store API scoped per-automation for state persistence between runs, as specified in issue #67. Features: - New AutomationKV model with encrypted value storage - enable_kv_store flag on Automation model (opt-in) - Per-run JWT tokens for KV authentication - JWE encryption for all stored values - Full Redis-like API: GET, SET, DELETE, INCR/DECR, LPUSH/RPUSH/LPOP/RPOP - Support for nested paths (dot notation) - Conditional SET operations (nx=true, xx=true) - Database migration for new table Endpoints: - GET /v1/kv - List all keys - GET /v1/kv/{key} - Get value (with optional path and meta params) - PUT /v1/kv/{key} - Set value (with optional nx/xx params) - PATCH /v1/kv/{key} - Update nested path - DELETE /v1/kv/{key} - Delete key - POST /v1/kv/{key}/incr - Atomic increment - POST /v1/kv/{key}/decr - Atomic decrement - POST /v1/kv/{key}/lpush - Push to list front - POST /v1/kv/{key}/rpush - Push to list back - POST /v1/kv/{key}/lpop - Pop from list front - POST /v1/kv/{key}/rpop - Pop from list back - GET /v1/kv/{key}/len - Get list length Related: #67 Co-authored-by: openhands <openhands@all-hands.dev>
- Add type: ignore comments for jwcrypto library calls (missing stubs) - Fix pyright type error for rowcount access on delete result - Apply ruff formatting to migration file - Make automation_with_kv fixture autouse=True so KV tests have parent Automation record for foreign key constraint - Remove unused create_kv_token from top-level test imports Co-authored-by: openhands <openhands@all-hands.dev>
- Set AUTOMATION_KV_SECRET env var in kv_client test fixture to match the test encryption key - Use Annotated[Any, Body()] for the set_value endpoint body parameter to properly parse arbitrary JSON body Co-authored-by: openhands <openhands@all-hands.dev>
Design for a built-in key-value store API for automation state persistence. Key features: - Scoped per-automation with strict isolation - Redis-like API semantics (GET, SET, INCR, list operations) - Application-level encryption (JWE) for all values - JWT-based authentication per automation run Closes #67 Co-authored-by: openhands <openhands@all-hands.dev>
Add enable_kv_store parameter to CreatePromptAutomationRequest and CreatePluginAutomationRequest schemas, and pass it through when creating Automation records. This allows users to enable the KV store via preset endpoints (/v1/preset/prompt and /v1/preset/plugin), making state persistence available for prompt-based and plugin-based automations. Co-authored-by: openhands <openhands@all-hands.dev>
Adds scripts/test_kv_e2e.py which tests the KV store API end-to-end:
1. Creates a real automation via API (with enable_kv_store=true)
2. Generates a KV token for that automation
3. Uses run_automation() for blocking execution with stdout/stderr capture
4. Runs comprehensive test suite covering all KV operations:
- SET/GET, INCR/DECR, list operations (RPUSH/LPUSH/LPOP/RPOP/LEN)
- Nested path operations (PATCH, GET with path)
- Conditional SET (nx, xx flags)
- List keys, DELETE, GET with metadata
5. Cleans up the automation after test
Usage:
export OPENHANDS_API_KEY="sk-oh-..."
export AUTOMATION_KV_SECRET="<same-as-staging>"
uv run python scripts/test_kv_e2e.py
Co-authored-by: openhands <openhands@all-hands.dev>
- Refactor test_kv_e2e.py with @Quick and @Thorough decorators - Quick mode: 8 smoke tests covering core operations (~30s) - Thorough mode: 26 tests including edge cases and error paths (~2min) - Add tests for: 404 handling, auth failures, type mismatches, empty lists, null values, various JSON types - Tests always run to completion (no early exit on failure) - Add kv-store-test-plan.md documenting manual test cases Usage: python scripts/test_kv_e2e.py # quick (default) python scripts/test_kv_e2e.py --thorough # full suite
- Reduce from 1543 to 167 lines (~90% reduction) - Remove individual TC-* test cases (now in test_kv_e2e.py) - Remove hardcoded API key (security) - Keep: critical bug docs, debugging workflow, manual-only tests - Keep: cross-automation isolation & persistence tests (not yet automated) - Keep: quick reference commands
Adds logging at: 1. App startup - logs whether kv_secret is present and its length 2. Dispatcher run - logs enable_kv_store flag and kv_secret presence before token generation This will help debug why KV tokens aren't being injected into sandbox runs. Co-authored-by: openhands <openhands@all-hands.dev>
The file had corrupted/duplicate content where lines from the embedded KV_TEST_SCRIPT and duplicate function definitions got accidentally inserted mid-file. This removes: 1. Corrupted line 939 where text was merged incorrectly 2. Duplicate test functions (test_list_keys, test_delete, etc) that were already defined inside KV_TEST_SCRIPT 3. Duplicate create_automation, delete_automation, and main() functions (kept the version with quick/thorough mode support) File reduced from 1231 to 943 lines. Co-authored-by: openhands <openhands@all-hands.dev>
- 201 Created: when a new key is created (including nx=true success) - 200 OK: when an existing key is updated - 409 Conflict: when nx=true but key already exists This aligns with REST semantics and fixes the E2E test expectation. Co-authored-by: openhands <openhands@all-hands.dev>
- test_set_new_value: expect 201 Created (new key) - test_set_update_existing: expect 200 OK (update) + assert created=False - test_set_nx_creates_new: expect 201 Created (new key via nx) - test_set_nx_fails_if_exists: expect 409 Conflict (nx but key exists) Also fixes trailing whitespace formatting in test_kv_e2e.py. Co-authored-by: openhands <openhands@all-hands.dev>
- Add blank line after inline import in execution.py - Split long import line in kv_router.py - Remove unused json import from test_kv_e2e.py - Fix long lines in test_kv_e2e.py Co-authored-by: openhands <openhands@all-hands.dev>
- Check if key exists before upsert to determine created status - Use func.now() to properly update the timestamp on conflict - This ensures 200 for updates and 201 for inserts Co-authored-by: openhands <openhands@all-hands.dev>
- Remove trailing whitespace from test_kv_e2e.py - Add noqa: E402 for necessary path manipulation imports - Fix log_extra to accept **kwargs for additional structured logging fields Co-authored-by: openhands <openhands@all-hands.dev>
Make it clear that 403 on cleanup means the API key lacks manage_automations permission, not a test failure. Co-authored-by: openhands <openhands@all-hands.dev>
- xx=true returns 409 Conflict (not 404/412) when key doesn't exist - null body returns 422 (FastAPI validation rejects empty body) - Missing token returns 422 (header validation before auth check) Co-authored-by: openhands <openhands@all-hands.dev>
Move all request/response schemas from kv_router.py to a dedicated kv_schemas.py module for better separation of concerns. - Creates automation/kv_schemas.py with 12 Pydantic models - Updates kv_router.py to import from kv_schemas - Reduces kv_router.py from 969 to 883 lines Co-authored-by: openhands <openhands@all-hands.dev>
Move path parsing and nested value manipulation functions to a dedicated kv_helpers.py module: - parse_path(): Parse dot/bracket notation paths - get_nested_value(): Traverse nested dicts/lists - set_nested_value(): Set values at nested paths Reduces kv_router.py from 883 to 790 lines. Co-authored-by: openhands <openhands@all-hands.dev>
- dispatcher.py: Remove verbose 'KV store check' info log that runs for every automation. Keep single debug log when KV is enabled. - app.py: Consolidate startup KV log into service startup message. Remove kv_secret_len as it provides no useful info and could theoretically leak entropy information. Co-authored-by: openhands <openhands@all-hands.dev>
Add test coverage for parse_path, get_nested_value, and set_nested_value: - TestParsePath: 14 tests covering dot notation, bracket notation, edge cases (empty, trailing dots, unclosed brackets) - TestGetNestedValue: 13 tests for dict/list traversal and error cases - TestSetNestedValue: 7 tests for value setting and intermediate creation All 32 tests pass locally. Co-authored-by: openhands <openhands@all-hands.dev>
Adds AUTOMATION_KV_MAX_VALUE_SIZE setting (default 64KB) to prevent
abuse by limiting the size of values stored in the KV store.
Size validation is applied to all write operations:
- PUT /kv/{key} (set_value) - validates the incoming body
- PATCH /kv/{key} (patch_value) - validates after applying the patch
- POST /kv/{key}/lpush - validates the resulting list
- POST /kv/{key}/rpush - validates the resulting list
Returns HTTP 413 Payload Too Large when limit is exceeded.
Includes tests using a 1KB limit to verify enforcement on all endpoints.
Co-authored-by: openhands <openhands@all-hands.dev>
Replace JWE (JSON Web Encryption) with raw AES-256-GCM and store encrypted values as BYTEA instead of TEXT. This reduces storage overhead significantly: Before (JWE + TEXT): - 14-byte plaintext → 100 bytes stored (7x overhead) - Base64 encoding adds ~33% to all values After (AES-GCM + BYTEA): - 14-byte plaintext → 42 bytes stored (3x overhead) - Fixed 28-byte overhead (12-byte nonce + 16-byte auth tag) - For larger values, overhead approaches 0% Changes: - automation/utils/kv.py: Replace jwcrypto with cryptography library, comprehensive documentation of design decisions, PostgreSQL TOAST considerations, wire format, and key derivation rationale - automation/models.py: Change value_encrypted from Text to LargeBinary - migrations/005: Update column type to LargeBinary (BYTEA) - pyproject.toml: Replace jwcrypto dependency with cryptography The module docstring now documents: - Three encryption approaches evaluated and why we chose raw AES-GCM - PostgreSQL TOAST performance implications (<2KB optimal, 2-8KB ok, >8KB chunked) - Key derivation approach and potential HKDF improvement - Wire format diagram (nonce || ciphertext || auth_tag) Co-authored-by: openhands <openhands@all-hands.dev>
Add section to kv.py module docstring explaining that JSONB can't be used because we encrypt values at the application layer, producing opaque binary data rather than valid JSON. Co-authored-by: openhands <openhands@all-hands.dev>
Include a table showing PostgreSQL TOAST behavior at different size limits (64KB-512KB), helping operators make informed decisions about the tradeoff between flexibility and read performance. Co-authored-by: openhands <openhands@all-hands.dev>
- SET STORAGE EXTERNAL on value_encrypted column to skip futile compression attempts on encrypted (high-entropy) data - Add COMMENT ON TABLE/COLUMN for schema-level documentation visible to DBAs and database tools Expanded migration docstring explains: - Why BYTEA over TEXT/JSONB (encrypted data is raw bytes) - Why EXTERNAL over EXTENDED (encrypted data won't compress) - Why schema comments (self-documenting for DB inspection) Co-authored-by: openhands <openhands@all-hands.dev>
Enforce RFC 8259 compliant JSON to prevent interoperability issues: Validation rules: - Reject NaN, Infinity, -Infinity (not valid JSON, causes issues with JavaScript JSON.parse and other strict parsers) - Maximum nesting depth of 32 levels (prevents stack overflow DoS) - Only JSON-serializable types accepted Implementation: - automation/utils/kv.py: Add _validate_json_value() with allow_nan=False, _check_nesting_depth() for DoS prevention, new KVValueError exception - automation/kv_helpers.py: Add safe_encrypt() helper that converts KVValueError to HTTP 400, KVEncryptionError to HTTP 500 - automation/kv_router.py: Replace all encrypt_value try/except blocks with safe_encrypt() for cleaner code and consistent error handling Error responses: - 400 Bad Request for invalid values (client's fault) - 500 Internal Server Error for encryption failures (our fault) Tests added for: - NaN, Infinity, -Infinity rejection - Deep nesting rejection (35 levels) - Valid nesting acceptance (10 levels) - All standard JSON types acceptance Co-authored-by: openhands <openhands@all-hands.dev>
Extract repeated patterns into reusable helpers in kv_helpers.py: New helpers: - safe_decrypt(): Wraps decrypt_value with HTTP 500 error handling (matches safe_encrypt pattern, replaces 9 try/except blocks) - require_dict(): Type check for dict values (HTTP 400 on failure) - require_list(): Type check for list values (HTTP 400 on failure) - require_numeric(): Type check for int/float values (HTTP 400 on failure) Benefits: - Reduced boilerplate in kv_router.py (~100 lines removed) - Consistent error handling across all endpoints - Single point of change for error messages/logging - Cleaner, more readable endpoint implementations Router cleanup: - Removed KVEncryptionError and decrypt_value imports (now internal) - Replaced 9 decrypt try/except blocks with safe_decrypt() - Replaced 8 isinstance type checks with require_* helpers Co-authored-by: openhands <openhands@all-hands.dev>
This commit adds comprehensive validation to protect against accidental, malicious, and ignorant clients: Key Validation (validate_key): - Reject empty and whitespace-only keys - Enforce max key length (255 chars, matching DB constraint) - Reject control characters (null bytes, newlines, tabs, etc.) - Provide helpful error messages with character position Type Validation Improvements: - require_numeric: Now rejects booleans (True/False) explicitly - Prevents confusing behavior where True becomes 2 after increment - require_int: New function for incr/decr operations - Rejects floats to prevent silent precision loss - Rejects booleans for same reason as require_numeric Path Validation: - Added _MAX_PATH_DEPTH (32) limit to parse_path() - Prevents DoS via extremely long dot-notation paths Router Changes: - Added ValidatedKey type alias using Depends for consistent key validation - All endpoints now validate key parameter automatically - incr/decr now use require_int instead of require_numeric - Fixed incr/decr to use integer arithmetic (was doing int(value + by)) Tests: - Added 54+ new tests for key validation edge cases - Added tests for boolean rejection in require_numeric/require_int - Added tests for float rejection in require_int - Added tests for path depth limiting - Added API-level tests for incr/decr integer-only behavior Co-authored-by: openhands <openhands@all-hands.dev>
- Fix line too long errors in kv_helpers.py error messages - Add explicit type annotations in test_kv_router.py for nested dict tests - Remove unused require_numeric import from kv_router.py - Apply ruff formatting to migration file Co-authored-by: openhands <openhands@all-hands.dev>
…evention ## Problem The original implementation stored each KV 'key' as a separate row in the automation_kv table. This created deadlock risk when: - Concurrent requests modified different keys for the same automation - Client code acquired locks on multiple keys in different orders - The FOR UPDATE locking strategy serialized on N separate rows ## Solution: Single-Document Design Changed to store all KV state in ONE encrypted JSON document per automation. API 'keys' (e.g., /kv/config, /kv/counter) are now top-level fields within this single document. Benefits: - Only ONE row per automation to lock - All operations serialize through that single lock - No multi-key ordering issues possible - Eliminates deadlock risk between keys Trade-off: Every operation reads/writes the entire state blob. This is acceptable because automation state is intended to be small (cursors, counters, configs) and access is infrequent (scheduled runs). Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Apply ruff formatting fixes - Add assert state is not None before subscripting in tests Co-authored-by: openhands <openhands@all-hands.dev>
…rotection Add bounded lock wait times and connection pool timeout to prevent a single slow KV operation from cascading into system-wide instability: 1. Lock Timeout (kv_router.py): - Add SET LOCAL lock_timeout = '5000ms' before FOR UPDATE queries - Prevents indefinite waits when another transaction holds the row lock - If lock times out, return HTTP 409 Conflict so clients can retry - SET LOCAL scopes timeout to current transaction only 2. Pool Exhaustion Protection (db.py, config.py): - Add db_pool_timeout config setting (default 30s) - Apply pool_timeout to both standard and GCP Cloud SQL engines - Surfaces pool exhaustion as errors instead of indefinite hangs Why these protections matter: Without lock timeout, a slow/stuck transaction (e.g., network issue during commit, slow crypto) causes all concurrent KV operations on that automation to queue indefinitely. With enough concurrent requests, this exhausts the connection pool and degrades the entire service—not just one automation. The single-document design already eliminates multi-row deadlock scenarios. These timeouts provide defense-in-depth against operational issues. Co-authored-by: openhands <openhands@all-hands.dev>
- Add POST /v1/kv/batch endpoint for atomic batch operations - Implement $version meta key that auto-increments on every write - Support if_version parameter for optimistic concurrency (409 on mismatch) - Reserve $-prefixed keys for system use (reject client writes) - Filter $-prefixed keys from list_keys response - Include version in ?meta=true responses - Add _has_user_keys() to properly handle delete-last-key case Operations supported in batch: - set, delete, incr, decr, lpush, rpush, lpop, rpop, patch Tests: 700 pass, ruff and pyright clean
- Add if_version query param to PUT /v1/kv/{key} for optimistic concurrency
- Add if_version query param to PATCH /v1/kv/{key} for optimistic concurrency
- Add if_version query param to DELETE /v1/kv/{key} for optimistic concurrency
- Add integration tests for if_version on individual endpoints
- Fix pyright type errors in tests
The atomic operations (incr, decr, lpush, rpush, lpop, rpop) don't need
version checks because they are conflict-free by design.
Tests: 707 pass, pre-commit clean
Update kv_router.py to use get_config().service instead of the deprecated get_settings() function. This aligns with the configuration centralization done in PR #73. Also add **kwargs support to the centralized log_extra() function to support additional context fields needed by the KV store logging. Co-authored-by: openhands <openhands@all-hands.dev>
Add Configuration section documenting: - The composed AppConfig structure with typed sections - Proper usage of get_config().service vs deprecated get_settings() - Where to find environment variable documentation - Protocol constants location Co-authored-by: openhands <openhands@all-hands.dev>
…ings().cache_clear() The get_settings() function is now a deprecated wrapper that doesn't have a cache_clear() method. Use clear_config_cache() from automation.config which properly clears the lru_cache on get_config(). Co-authored-by: openhands <openhands@all-hands.dev>
Implements follow-up tasks from PR #69 for KV store concurrency: 1. Statement Timeout (Safety Net) - Added statement_timeout = 2x lock_timeout to catch runaway operations - Detects PostgreSQL error codes 55P03 (lock) and 57014 (statement) 2. Retry-After Header - All 409 responses include Retry-After: 1 header - New _raise_version_conflict() helper for consistent version mismatch errors 3. Configurable Lock Timeout Per-Automation - Added kv_lock_timeout_ms field to Automation model (100-30000ms) - Schema validation in CreateAutomationRequest/UpdateAutomationRequest - Migration 006_add_kv_lock_timeout.py - Embedded in JWT token claims for fast access (no extra DB query) - KVTokenClaims class with automation_id and lock_timeout_ms - Backward compatible with old tokens (uses default 5000ms) 4. Metrics and Observability - New kv_metrics.py module with Prometheus metrics: - kv_operation_duration_seconds (histogram) - kv_lock_wait_duration_seconds (histogram) - kv_conflict_total (counter by reason) - kv_state_size_bytes (histogram) - Integrated into router: lock waits, conflicts, state size 5. Client Documentation - Comprehensive docs/kv-store-client-guide.md - Covers basic/advanced operations, concurrency patterns - Best practices, error handling, debugging tips - Lock timeout configuration guidance Testing: - 27 new tests in test_kv_concurrency.py - All 168 KV tests pass Co-authored-by: openhands <openhands@all-hands.dev>
Following the centralized configuration pattern from PR #73, this moves KV store settings from ServiceSettings to a dedicated KVSettings section: - Add KVSettings class with kv_secret, kv_max_value_size, enabled property - Add config.kv property to AppConfig - Update kv_router.py to use get_config().kv for all KV settings - Update dispatcher.py to use get_config().kv for KV token generation - Update AGENTS.md with KV config section in documentation This keeps feature-specific configuration organized and follows the existing pattern of http, sandbox, storage, and log sections. Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
- Add get_config import to app.py for kv.enabled check - Use CronTrigger instead of dict in validation tests - Use cast() for HTTPException.detail dict access - Fix import ordering in migration file Co-authored-by: openhands <openhands@all-hands.dev>
Sort imports alphabetically (CreateAutomationRequest before CronTrigger) to satisfy ruff linting. Co-authored-by: openhands <openhands@all-hands.dev>
The endpoints now use get_token_claims which returns KVTokenClaims instead of get_automation_id_from_token which returns UUID. This update: - Import get_token_claims instead of get_automation_id_from_token - Import KVTokenClaims and DEFAULT_LOCK_TIMEOUT_MS from utils.kv - Update kv_client fixture to override get_token_claims with KVTokenClaims - Fix test_create_and_verify_token to compare result.automation_id Co-authored-by: openhands <openhands@all-hands.dev>
tofarr
left a comment
There was a problem hiding this comment.
- There is a lot of code related to wheterh kv is enabled or whether it is enabled for a automation - I think it should just always be enabled - something that each automation gets by default
- Do we need to update the skill for automations to let the agent know that this facility is available?
- We have a lot of code related to encryption. Could we use the Cipher class from the sdk to simplify / remove some of this?
Resolves conflicts from the automation/ -> openhands/automation/ directory rename and from the SQLite/local-mode and pyproject.toml changes on main. - Move dispatcher KV-token injection into the new dispatcher.py - Rewrite all KV module imports to use openhands.automation.* - Renumber KV migrations to 006_add_kv_store and 007_add_kv_lock_timeout so they sit after 005_add_bash_command_id Co-authored-by: openhands <openhands@all-hands.dev>
…Cipher - Drop enable_kv_store / kv_lock_timeout_ms per-automation fields. KV is always available; the dispatcher injects AUTOMATION_KV_TOKEN whenever the service has AUTOMATION_KV_SECRET configured. - Move kv_lock_timeout_ms to KVSettings as a service-wide config knob (AUTOMATION_KV_LOCK_TIMEOUT_MS, default 5000). - Replace the custom AES-256-GCM JWE code with the SDK's Cipher helper (Fernet), dropping the cryptography>=42 dependency. - Switch state_encrypted from BYTEA/LargeBinary to TEXT (Fernet tokens are URL-safe base64 strings). - Skip the new COMMENT ON TABLE statements when running migrations on SQLite so dev/test setups still work. - Update docs (design, client guide, test plan) and the kv_router module docstring to reflect the new model. Co-authored-by: openhands <openhands@all-hands.dev>
|
Picking this up on John's behalf since it's unblocking some work on my end. Addressed @tofarr's review in 4fc3f05:
Also:
KV tests ( This comment was written by an AI agent (OpenHands) on behalf of @jpshackelford. |
|
Dropping by as a near-term consumer — I'm building I've gone ahead and built against the API as specced in A few pieces of consumer feedback while it's fresh: Strong points
Things I bumped into while implementing
Question I couldn't answer from the docs
Looking forward to this landing. Happy to be the first dogfood consumer once it does — I'll flip extensions#245 from draft to ready-for-review the moment a Cloud build with this PR is reachable. This comment was posted by an AI agent (OpenHands) on behalf of @tofarr. |
|
@tofarr Awesome feedback thank you. Will incorporate your feedback or you are of course welcome to do so. Beyond that, things I was thinking of doing still include:
|
Summary
This PR implements the KV store API for automation state persistence, including both the design document and the service-side implementation.
Problem
One of the use cases for automations is implementing integrations. Some integrations are stateless—webhook responders that receive an event, do work, and complete. But other jobs require small amounts of data storage to work effectively.
For example, an automation that summarizes Slack messages might store the last processed timestamp, then on the next run fetch only items since that date. But where should it store this data?
A GitHub repo is not the right tool. External services (JSONBin, Redis Cloud, etc.) work but require users to provision and manage infrastructure. If external systems are required for such a common use case, that erodes the simplicity of a batteries-included platform.
Solution
A built-in key-value store API scoped per-automation:
We do not need high performance—just simplicity and security for occasional state persistence.
Changes
Design Document
docs/kv-store-design.md- Full design specification for the KV store featureNew Files
automation/kv_router.py- KV API endpoints including batch operationsautomation/kv_schemas.py- Request/response schemas including batch operation typesautomation/kv_helpers.py- Validation and encryption helpersautomation/utils/kv.py- JWT and encryption utilitiesmigrations/versions/005_add_kv_store.py- Database migrationtests/test_kv_router.py- Integration test coveragetests/test_kv_batch.py- Unit tests for batch operationstests/test_kv_helpers.py- Unit tests for helpersModified Files
automation/models.py- AddedAutomationKVmodel andenable_kv_storeflag onAutomationautomation/schemas.py- Addedenable_kv_storeto request/response schemasautomation/router.py- Includeenable_kv_storein automation creationautomation/dispatcher.py- GenerateAUTOMATION_KV_TOKENfor runs with KV enabledautomation/config.py- Addedkv_secretsettingautomation/app.py- Mount KV routerpyproject.toml- AddedpyjwtandjwcryptodependenciesAPI Endpoints
/v1/kv/v1/kv/batch/v1/kv/{key}?path=and?meta=truefor version)/v1/kv/{key}?nx,?xx,?if_version)/v1/kv/{key}?if_version)/v1/kv/{key}?if_version)/v1/kv/{key}/incr/v1/kv/{key}/decr/v1/kv/{key}/lpush/v1/kv/{key}/rpush/v1/kv/{key}/lpop/v1/kv/{key}/rpop/v1/kv/{key}/lenConcurrency Controls
Single-Document Model
All state for an automation is stored in a single encrypted JSON document. This eliminates multi-key deadlock scenarios—all operations serialize through a single row lock.
Lock Timeout (Deadlock Prevention)
Optimistic Concurrency (
$version)State includes a
$versioncounter that auto-increments on every write. Useif_versionfor safe read-modify-write patterns.On individual endpoints (PUT, PATCH, DELETE):
On batch endpoint:
Get current version with
?meta=true:Batch Operations
Execute multiple operations atomically in a single transaction:
Benefits:
Reserved Keys
Keys starting with
$are reserved for system use (e.g.,$version). Client attempts to set/delete$-prefixed keys are rejected with HTTP 400.When to Use Version Checks
PUT(set)PATCHDELETEincr/decrlpush/rpushlpop/rpopDeployment
Configuration
The KV store requires the
AUTOMATION_KV_SECRETenvironment variable for JWT signing and value encryption.Chart PR: OpenHands/OpenHands-Cloud#576 wires this up by reusing the existing
jwt-secret— no new secrets needed!Out of Scope
As noted in the design doc comment, admin debugging tooling for KV data is out of scope for this PR. The encrypted data will be secure at rest; admin tools can be added in a follow-up issue once the permission model is clarified with the platform team.
Related
This PR was updated by an AI agent (OpenHands) on behalf of @jpshackelford.