v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability by Muizzkolapo · Pull Request #624 · Muizzkolapo/agent-actions

Muizzkolapo · 2026-05-23T01:14:30Z

Summary

25 PRs from a comprehensive quality audit — 33 automated audit agents across 7 passes identified 236+ issues, prioritized into P0-P4 tiers, and executed surgically across 4 rounds with 6 parallel clones.

P0 — Critical (data integrity)

fix: clear prior conflicting dispositions before writing terminal #594 Clear prior conflicting dispositions before writing terminal
fix: propagate status save and disposition clear errors #595 Propagate status save and disposition clear errors
fix: --fresh clears source_data, batch recovery state, and batch registry #596 --fresh clears source_data, batch recovery state, and batch registry
fix: narrow preflight key verifier to fail on infrastructure errors #597 Narrow preflight key verifier to fail on infrastructure errors
fix: schema migration uses ALTER TABLE, not DROP TABLE #598 Schema migration uses ALTER TABLE, not DROP TABLE
fix: propagate storage errors in completion status and version correlation #599 Propagate storage errors in completion status and version correlation

P1 — High (performance, crashes, LLM reliability)

fix: break zero-output rerun cycle and scope workflow status to execution_order #601 Break zero-output rerun cycle + scope workflow status to execution_order
fix: retry manifest checkpoint prevents Ctrl+C disposition loss #602 Retry manifest checkpoint prevents Ctrl+C disposition loss
fix: detect config changes and invalidate stale completed actions #603 Detect config changes and invalidate stale completed actions
fix: add non-dict type guard in EnrichmentPipeline.enrich() #604 Non-dict type guard in EnrichmentPipeline.enrich()
fix: harden LLM client layer (P1-5 + P2-6) #605 Harden LLM client layer (token overflow, prompt injection, parse error parity)
perf: batch SQLite disposition commits #606 Batch SQLite disposition commits (10-50x perf improvement)

P2 — Medium (concurrency, scale, UX, security)

fix: use uuid4 for source_guid to prevent silent collision #607 Use uuid4 for source_guid to prevent silent collision
fix: file I/O hardening (P2-5) #608 File I/O hardening (path containment, name sanitization, binary json.dumps)
fix: provider parity — remove Groq temperature default, raise Anthropic max_tokens #609 Provider parity (Groq temperature default, Anthropic max_tokens)
fix: add thread safety to ActionStateManager and atomic_json_write #610 Thread safety for ActionStateManager and atomic_json_write
feat: config UX improvements (P2-4) #611 Config UX (unknown key warnings, fuzzy match suggestions, validation detail)

P3 — Cleanup (dead code, tests, docs, editor)

refactor: dead code cleanup (P3-1) #612 Dead code cleanup (12 functions, 11 aliases, 6 vestigial params, ghost dir)
fix: VS Code/LSP fixes — 6 independent editor issues #613 VS Code/LSP fixes (6 editor issues)
feat: close observability gaps across enrichment, LLM, batch, and storage #614 Observability gaps (enrichment, LLM, batch, storage logging)
perf: replace O(N^2) lineage lookups with dict index #615 O(n²) lineage lookup → dict index
docs: fix 6 documentation issues (P3-3) #616 Documentation fixes (README, CHANGELOG, manifests)
test: add 33 tests covering 7 critical untested paths (P3-5) #617 33 tests covering 7 critical untested paths
fix: eliminate false-positive tests (P3-4) #618 Eliminate 15 false-positive tests

P4 — Polish

feat: add SQLite database maintenance operations (P4-2) #619 DB maintenance (WAL checkpoint, disposition cleanup, prompt trace TTL)

Verification

All 6 example workflows validate successfully against this branch
Full test suite passes (7419 tests)
Each PR was written from a reviewed spec with adversarial testing and 3rd-degree code relationship graphs

🤖 Generated with Claude Code

The disposition table's UNIQUE constraint is (action_name, record_id, disposition), not (action_name, record_id). Writing SUCCESS for a record that previously had FAILED resulted in BOTH rows coexisting — phantom failures that _resolve_completion_status would report. Added DELETE for (action_name, record_id) before INSERT in set_disposition, both in the same transaction. This ensures only one disposition per record per action at any time.

…tion_order (#601) _check_prior_output now checks for node-level terminal dispositions (FILTERED, PASSTHROUGH, SUCCESS, UNPROCESSED) when no target files exist. This prevents infinite rerun cycles for actions that legitimately produce no output (e.g., all records guard-filtered). is_workflow_complete, is_workflow_done, has_any_failed, and get_summary now iterate execution_order instead of action_status.values(), so orphan entries from removed actions no longer block workflow completion or inflate failure counts.

… items (#604) LLM can return malformed data (strings, lists, None, integers) that crashes every enricher's .get() calls with AttributeError. The guard filters non-dict items at the pipeline entry point before any enricher runs. Partial non-dict batches continue with valid items; all-non-dict batches are marked FAILED and return early without running enrichers.

* fix: detect prompt/model/schema/guard changes and invalidate stale completed actions _maybe_invalidate_completed_status only checked record_limit and file_limit. If a user changed the prompt, model, schema, or guard, actions stayed COMPLETED from the prior run and the new config was never applied. Add _compute_action_config_hash() which produces a stable SHA-256 fingerprint of semantically-meaningful fields (prompt, model, schema, guard clause+behavior). Store the hash in status metadata on completion. On re-run, compare stored hash to current — if they differ, reset to PENDING. - Guard normalization handles string, dict, and None forms - Missing stored hash (first run / upgrade) does NOT invalidate - Cosmetic fields (description, tags, limits) excluded from hash - All _limit_metadata call sites replaced with _completion_metadata - 20 new tests + 2 existing tests updated for new config_hash kwarg * style: clean up json import alias and redundant ensure_ascii param - import json as _json → import json (aligns with 135+ other files) - Remove ensure_ascii=True (already the default for json.dumps)

…#602) Before this change, pressing Ctrl+C (or any crash) between disposition clearing and workflow.run() completion would silently lose track of failed records. The next retry invocation would see no failures and report 'Nothing to retry.' Now a JSON manifest is written BEFORE clearing dispositions. If the process is interrupted, the next retry invocation detects the manifest, restores the cleared dispositions, and proceeds normally. - Manifest stored at agent_io/store/{workflow}/_retry_manifest.json - Contains full disposition rows (action, record, status, reason, detail) - Written before clear, deleted after successful completion - Corrupt/stale manifests handled gracefully - 16 new tests covering write-before-clear ordering, interrupt survival, restore-on-next-invocation, write-failure abort, and no-signal-handler

* fix: harden LLM client layer (P1-5 + P2-6) Five LLM-layer defects fixed plus template None finalize: 1. Token overflow pre-flight guard: MessageBuilder.build() estimates token count (chars/4 heuristic) and raises PromptTooLargeError before the API call when prompt exceeds model context window. 2. OpenAI JSON mode prompt injection: user-supplied context_data is now sent in a separate user message instead of being embedded in the system message. Only affects TAGGED style (OpenAI); Groq's TAGGED_GROQ keeps its single system message. 3. Skipped-dependency template tolerance: when an upstream action was skipped, its missing namespace triggers a warning and empty _PermissiveNamespace substitution instead of crashing. Genuine typos still raise TemplateVariableError. 4. Anthropic empty response parity: call_json() now returns the same _parse_error sentinel as OpenAI instead of raising VendorAPIError on empty/content-less responses. 5. JSON parse failure with reprompt: when reprompt is configured and JSON parsing fails, OpenAIClient raises LLMResponseParseError so the retry/reprompt machinery can intercept. Without reprompt, the existing silent _parse_error pass-through is preserved. 6. Template None finalize (P2-6): Jinja2 Environment in service.py now uses finalize=lambda x: "" if x is None else x, so None values render as empty string instead of literal "None". 30 new tests, 7 existing tests updated. * refactor: simplify review — fix import ordering, test tautology, deferred imports - Move logger definition after all imports in anthropic/client.py - Replace tautological assertion in test_multiple_skipped_deps_tolerated with concrete expected output check - Move PromptTooLargeError and LLMResponseParseError to top-level imports instead of deferred function-body imports (consistent with existing VendorAPIError imports in the same files)

* perf: batch SQLite disposition commits — one transaction per action instead of N Add set_dispositions_batch() to StorageBackend (default loops over set_disposition) and override in SQLiteBackend with executemany for single-commit batch writes. Update collect_results_from_processing_results to accumulate dispositions and flush once per action instead of N per-record commits. For a 1000-record action this reduces disposition writes from 1000 fsyncs to 1. Existing per-record set_disposition remains for retry, node-level signals, and other single-record callers. Key design decisions: - Validate all records before opening transaction (no partial writes) - DELETE-before-INSERT in batch (matching P0-5 phantom prevention) - Rollback on sqlite3.Error (atomic batch) - Disposition errors caught at flush (telemetry, not authoritative) - Base class default loops for non-SQLite backends * refactor: simplify batch disposition code — extract shared helpers, deduplicate - Add DispositionRow TypeAlias to backend.py (replaces 4 repeated 7-tuple types) - Extract _validate_disposition_fields() in SQLiteBackend (deduplicates validation between set_disposition and set_dispositions_batch) - Extract wire_batch_disposition_delegate() to tests/conftest.py (replaces 10 copy-pasted _batch_delegate closures across 7 test files) - Net -102 lines

…ary json.dumps) (#608) - Add assert_path_contained() utility that resolves symlinks before checking containment, preventing path traversal via ../ or symlinks - Add sanitize_path_component() for long action names (truncate + hash) - Wire containment check into FileWriter.write_target() - Fix json.dumps(item) crash on bytes values in saver.py (default=str) - 17 new tests, 1 updated test

* fix: use uuid4 for source_guid to prevent silent collision on duplicate/expanded records source_guid was generated as uuid5(content_hash), causing identical input rows to silently collapse via UNIQUE constraint and 1-to-N expansion children to overwrite each other in storage. - Add generate_source_guid() (uuid4) for unique record identity - Rename old method to generate_content_hash() for explicit dedup use - Update all 4 sites in initial_pipeline.py to use uuid4 - Update task_preparer.py and online_llm.py callers - Give expansion children unique source_guids in LineageEnricher - Keep deprecated wrapper for backward compatibility * refactor: use IDGenerator.generate_source_guid() in batch paths for consistency Replace raw str(uuid.uuid4()) with IDGenerator.generate_source_guid() in _prepare_text_chunks_batch and _add_batch_metadata so all source_guid generation goes through the centralized method. * refactor: consolidate 4 local IDGenerator imports into single top-level import No circular dependency justifies the local imports — move to top of file.

…alidation error detail (#611) - Warn when defaults config contains unknown keys by comparing raw input against DefaultsConfig.model_fields (keeps extra="ignore" unchanged) - Add difflib.get_close_matches to UnknownKeysDetector for "did you mean?" suggestions on action key typos - Surface Pydantic field-level errors in ConfigurationError.context as structured validation_errors list (capped at 10)

) ActionStateManager had no lock protecting the read-modify-write cycle in update_status, allowing parallel executor threads to lose metadata. Added threading.Lock to guard all mutating methods (update_status, reset, mark_running_as_failed, reset_retryable). atomic_json_write used a deterministic temp path (path.with_suffix(".json.tmp")) which caused concurrent writes to the same file to clobber each other's temp file. Replaced with tempfile.mkstemp for unique temp paths per call, with proper fd cleanup if os.fdopen fails. Extracted _bulk_transition helper to deduplicate mark_running_as_failed and reset_retryable. Updated existing tests that asserted on old .json.tmp suffix. Added concurrency tests for both fixes.

…rams, ghost dir (#612) Dead functions (zero callers confirmed by grep): - path_utils.py: check_path_exists, create_mirror_source_path, validate_path_permissions, find_files_by_extension, safe_path_join, create_agent_directory_structure (6) - udf_management/tooling.py: _split_udf_name (1) - ollama/failure_injection.py: is_online_injection_enabled, is_batch_injection_enabled, get_injection_status (3) Backward-compat aliases (zero external consumers): - lsp/server.py: 11 _private = public re-exports removed Vestigial parameters (accepted but never read): - handlers.py: get_prompt_symbols(file_path) - diagnostics.py: _action_guard_variables(index) - builder.py: create_dynamic_agent(udf) - runner.py: setup_directories(idx) - service.py: _build_llm_context(contents) - anthropic/client.py: _call_api(mode) Ghost directory: - llm/providers/mistral/ — only __pycache__ artifacts, no .py source All callers and tests updated. 7305 tests pass, ruff clean.

…o 4096 (#609) Groq's call_non_json injected temperature=0.7 via setdefault — no other provider does this. Removed the default so Groq uses the API's own default, matching parity with OpenAI/Anthropic/Cohere. max_tokens default (1000) is preserved since Groq's API requires it. Anthropic's max_tokens default of 1024 was too low for most use cases. Raised to 4096 to better match typical usage patterns. User-specified values still take precedence via setdefault. Issue 1 (Anthropic empty response crash) was already fixed by PR #605 which added VendorAPIError catch in call_json returning _parse_error sentinel.

- Remove vestigial /tmp cleanup fixture (conftest.py) — cleaned paths no test creates - Strengthen 3 bare assert_called_once() to assert_called_once_with() verifying actual arguments (test_runner, test_stale_completion) - Replace 4 weak is-not-None-only assertions with value checks (test_schema_array_conversion, test_factory, test_retry_reprompt) - Remove 3 unnecessary time.sleep() calls from thread safety tests; replace sleep-based race with Event.wait() (test_processor_utils)

Build a {source_guid: item} dict once before the per-item loop in LineageEnricher.enrich() and pass it to _get_parent_item, replacing O(N^2) linear scans with O(1) dict lookups. Index only built when use_per_item_parent_lookup is True (avoids wasted work on single-call paths). Add optional index parameter to get_content_by_source_guid. Linear scan fallback preserved for callers that don't pass the index. 13 new tests including performance validation.

…and storage (#614) - Add record identity (action_name, source_guid) to enrichment pipeline exception handler - Add LLM call duration logging (time.perf_counter) to ClientInvocationService for all provider paths (tool, hitl, standard) - Add progress context (completed, failed, total) to batch timeout warning - Upgrade batch abandoned-records log from INFO to WARNING with error context - Add action_name to batch context_map load failure warning - Upgrade disposition clearing from DEBUG to INFO when records are actually deleted, with record_id context - Add action_name to enrichment source_mapping out-of-bounds warnings

* docs: fix README quick-start syntax and replace pip with uv The init --example flag is actually a subcommand (agac init example). The Contributing section referenced pip install but the project uses uv. * docs: remove Mistral provider references The Mistral provider directory was deleted but docs still listed it as a supported provider. Removed all standalone Mistral references from README, docs site, skills references, and configuration guides. Ollama can still run Mistral models — those references are preserved. * docs: add support_resolution to README examples table The example exists in examples/ and in the docs site but was missing from the README examples table. * docs: fix RELEASING.md to use changie for version bumps The version is defined in both pyproject.toml and agent_actions/__version__.py. changie batch updates both via .changie.yaml replacements. The manual pyproject.toml-only instructions would leave __version__.py out of sync. * docs: reconcile cli/_MANIFEST.md with actual CLI surface Added missing modules: dispositions, example, retry, workflow_loader. Split inspect entry into actual source files: inspect_base, inspect_deps, inspect_graph, inspect_action. These commands have existed since the batch/retry/disposition features were added but the manifest was not updated. * chore: add changie entry for docs cleanup * docs: fix review findings — cohere in vendor comment, Ollama batch status, stale manifest entry - tutorials/index.md: add missing cohere to vendor list comment - README.md: Ollama batch support is Yes (OllamaBatchClient exists), not Online only - cli/_MANIFEST.md: remove project_paths_factory.py entry (file was deleted)

* test: add 33 tests covering 7 critical untested paths (P3-5) Coverage for error/edge paths in core pipeline code: - _fail_abandoned_records: context_map load failure (5 tests) - wait_for_batch_completion: timeout path (5 tests) - submit_retry_batch: submission failure (5 tests) - parallel action executor: async exception handling (5 tests) - ValidationStrategy: malformed config (7 tests) - prepare_correlated_input: iteration handoff (6 tests) 7388 tests pass, ruff clean. * refactor: simplify tests — parametrize, fix naming, remove noise - Merge redundant timeout tests into parametrized terminal status test - Parametrize exception type tests in abandoned records and malformed config - Fix misleading test name: test_key_error → test_missing_namespace_raises - Move DataValidationError import to top level - Remove FakeExecutionMetrics (unused), simplify FakeActionResult - Remove unnecessary comments restating assertions - Use _make_batch_result consistently in malformed config tests

* fix(vscode): create missing prompt-blocks TextMate grammar file package.json references ./syntaxes/prompt-blocks.tmLanguage.json for markdown prompt block injection but the file did not exist, causing a VS Code activation warning. Creates a minimal grammar that highlights template variables ({{ }}) and prompt variables (${...}) in markdown. * fix(vscode): reset disposed flag on re-activate The module-level disposed flag was set to true on deactivate but never cleared when the extension re-activated. This prevented the LSP server from restarting after a deactivate/reactivate cycle. * fix(lsp): search all workspace files for context_scope completions build_context_scope_completions only searched actions defined in the current file. Now iterates all file_actions in the ProjectIndex so cross-file action references appear in context_scope completions. Deduplicates by action name to avoid duplicate entries. * fix(lsp): add textDocument/didChange handler The LSP server registered didOpen and didSave but not didChange. Without didChange, the in-memory document store used by pygls was not updated as the user typed, causing stale completions and diagnostics until the file was saved. * fix(vscode): add LSP crash recovery with restart and backoff If the LSP server process crashes, the LanguageClient now automatically restarts it up to 3 times (MAX_RESTART_COUNT). Uses vscode-languageclient built-in ErrorAction/CloseAction handlers. Skips restart if the extension is disposed. * fix(vscode): escape Mermaid node labels to prevent syntax injection Node labels from YAML were interpolated into Mermaid source with only partial escaping (quotes, ], backslash). A label containing [, {, (, |, #, &, ;, or other Mermaid-significant characters could break the diagram syntax. Replaces the inline regex with a dedicated escapeMermaidLabel method that handles all special characters. * chore: add changie entry for P3-2 VS Code/LSP fixes * refactor: remove unused file_path param and trim redundant docstring build_context_scope_completions no longer uses file_path since it searches all workspace files. Removed the unused parameter and updated the call site in server.py. Trimmed verbose docstring on escapeMermaidLabel.

* feat: add SQLite database maintenance operations (P4-2) Add post-workflow maintenance to prevent unbounded DB growth: - WAL checkpoint (TRUNCATE) after every workflow completion - Stale disposition cleanup: removes old FAILED/EXHAUSTED records when a newer SUCCESS exists for the same (action, record) - Prompt trace retention: configurable N-run retention window (default: 10 runs), deletes older traces at workflow end - Source data TTL: optional age-based cleanup (default: disabled) Config keys (in workflow YAML under storage.*): prompt_trace_retention_runs (default: 10) source_data_ttl_days (default: null = keep forever) All operations are idempotent, non-fatal on error, and logged. Wired into both sequential and async coordinator completion paths. 20 new tests covering all maintenance operations. * refactor: address review feedback on maintenance operations - Use DISPOSITION_SUCCESS/FAILED/EXHAUSTED constants instead of raw strings in disposition cleanup SQL (prevents schema drift) - Use StorageDefaults constants in perform_maintenance() signatures instead of duplicated magic numbers - Replace DATE(created_at) with direct created_at comparison in prompt trace retention DELETE to enable index usage

… note

1. BLOCKING: Remove OpenAI raise LLMResponseParseError on reprompt path. Parse failures now always return _parse_error sentinel dict, matching Anthropic behavior. The reprompt pipeline reads the sentinel; the exception was uncaught and crashed the pipeline. 2. Fix retention docstring: _enforce_prompt_trace_retention uses DATE(created_at) as boundary, so retention_runs effectively means calendar days. Updated docstring to reflect actual semantics. 3. Remove unused prompt_content parameter from _compute_action_config_hash. No call site passes it, so it was dead code creating false confidence that prompt body changes would invalidate cached results. 4. Remove deprecated generate_deterministic_source_guid wrapper and its test. Only the test called it — backward-compat shim per RULES.md.

When _get_source_content fails to match by source_guid (staging records have no guid), the fallback uses the record's content dict. This dict contains ALL namespaces (source, summarize_page_content, etc.), not just the original source fields. References like source.page_content would resolve as None because page_content lives under content.source, not at the content top level. Now SourceNamespaceBuilder.build detects when source_content is a content dict with a 'source' sub-key and extracts it, so downstream observe references resolve correctly.

Handles wrapped ({content: {source: {...}}}) and unwrapped ({source: {...}}) content dicts. Both cases now extract the source sub-namespace instead of using the full content dict with all action namespaces as the source namespace.

…ation When version-merged or FILE-mode actions produce duplicate (action_name, record_id, disposition) tuples in the same batch, the INSERT fails. Deduplicate within the batch before writing.

Muizzkolapo and others added 29 commits May 22, 2026 21:13

fix: resolve mypy errors in executor.py and retry.py

8870210

chore: consolidate changie entries into single v0.2.0 quality release…

14b9242

… note

fix: add missing time.time() mock value in batch timeout test

2654ab1

fix: remove test for deleted prompt_content parameter

5c7a7e1

chore: remove verbose comment

767ff0f

fix: deduplicate batch dispositions to prevent UNIQUE constraint viol…

8132230

…ation When version-merged or FILE-mode actions produce duplicate (action_name, record_id, disposition) tuples in the same batch, the INSERT fails. Deduplicate within the batch before writing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624

v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624
Muizzkolapo wants to merge 29 commits into
mainfrom
integration/v0.2.0-quality-clean

Muizzkolapo commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Muizzkolapo commented May 23, 2026

Summary

P0 — Critical (data integrity)

P1 — High (performance, crashes, LLM reliability)

P2 — Medium (concurrency, scale, UX, security)

P3 — Cleanup (dead code, tests, docs, editor)

P4 — Polish

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant