v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624
Open
Muizzkolapo wants to merge 29 commits into
Open
v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624Muizzkolapo wants to merge 29 commits into
Muizzkolapo wants to merge 29 commits into
Conversation
The disposition table's UNIQUE constraint is (action_name, record_id, disposition), not (action_name, record_id). Writing SUCCESS for a record that previously had FAILED resulted in BOTH rows coexisting — phantom failures that _resolve_completion_status would report. Added DELETE for (action_name, record_id) before INSERT in set_disposition, both in the same transaction. This ensures only one disposition per record per action at any time.
…tion_order (#601) _check_prior_output now checks for node-level terminal dispositions (FILTERED, PASSTHROUGH, SUCCESS, UNPROCESSED) when no target files exist. This prevents infinite rerun cycles for actions that legitimately produce no output (e.g., all records guard-filtered). is_workflow_complete, is_workflow_done, has_any_failed, and get_summary now iterate execution_order instead of action_status.values(), so orphan entries from removed actions no longer block workflow completion or inflate failure counts.
… items (#604) LLM can return malformed data (strings, lists, None, integers) that crashes every enricher's .get() calls with AttributeError. The guard filters non-dict items at the pipeline entry point before any enricher runs. Partial non-dict batches continue with valid items; all-non-dict batches are marked FAILED and return early without running enrichers.
* fix: detect prompt/model/schema/guard changes and invalidate stale completed actions _maybe_invalidate_completed_status only checked record_limit and file_limit. If a user changed the prompt, model, schema, or guard, actions stayed COMPLETED from the prior run and the new config was never applied. Add _compute_action_config_hash() which produces a stable SHA-256 fingerprint of semantically-meaningful fields (prompt, model, schema, guard clause+behavior). Store the hash in status metadata on completion. On re-run, compare stored hash to current — if they differ, reset to PENDING. - Guard normalization handles string, dict, and None forms - Missing stored hash (first run / upgrade) does NOT invalidate - Cosmetic fields (description, tags, limits) excluded from hash - All _limit_metadata call sites replaced with _completion_metadata - 20 new tests + 2 existing tests updated for new config_hash kwarg * style: clean up json import alias and redundant ensure_ascii param - import json as _json → import json (aligns with 135+ other files) - Remove ensure_ascii=True (already the default for json.dumps)
…#602) Before this change, pressing Ctrl+C (or any crash) between disposition clearing and workflow.run() completion would silently lose track of failed records. The next retry invocation would see no failures and report 'Nothing to retry.' Now a JSON manifest is written BEFORE clearing dispositions. If the process is interrupted, the next retry invocation detects the manifest, restores the cleared dispositions, and proceeds normally. - Manifest stored at agent_io/store/{workflow}/_retry_manifest.json - Contains full disposition rows (action, record, status, reason, detail) - Written before clear, deleted after successful completion - Corrupt/stale manifests handled gracefully - 16 new tests covering write-before-clear ordering, interrupt survival, restore-on-next-invocation, write-failure abort, and no-signal-handler
* fix: harden LLM client layer (P1-5 + P2-6) Five LLM-layer defects fixed plus template None finalize: 1. Token overflow pre-flight guard: MessageBuilder.build() estimates token count (chars/4 heuristic) and raises PromptTooLargeError before the API call when prompt exceeds model context window. 2. OpenAI JSON mode prompt injection: user-supplied context_data is now sent in a separate user message instead of being embedded in the system message. Only affects TAGGED style (OpenAI); Groq's TAGGED_GROQ keeps its single system message. 3. Skipped-dependency template tolerance: when an upstream action was skipped, its missing namespace triggers a warning and empty _PermissiveNamespace substitution instead of crashing. Genuine typos still raise TemplateVariableError. 4. Anthropic empty response parity: call_json() now returns the same _parse_error sentinel as OpenAI instead of raising VendorAPIError on empty/content-less responses. 5. JSON parse failure with reprompt: when reprompt is configured and JSON parsing fails, OpenAIClient raises LLMResponseParseError so the retry/reprompt machinery can intercept. Without reprompt, the existing silent _parse_error pass-through is preserved. 6. Template None finalize (P2-6): Jinja2 Environment in service.py now uses finalize=lambda x: "" if x is None else x, so None values render as empty string instead of literal "None". 30 new tests, 7 existing tests updated. * refactor: simplify review — fix import ordering, test tautology, deferred imports - Move logger definition after all imports in anthropic/client.py - Replace tautological assertion in test_multiple_skipped_deps_tolerated with concrete expected output check - Move PromptTooLargeError and LLMResponseParseError to top-level imports instead of deferred function-body imports (consistent with existing VendorAPIError imports in the same files)
* perf: batch SQLite disposition commits — one transaction per action instead of N Add set_dispositions_batch() to StorageBackend (default loops over set_disposition) and override in SQLiteBackend with executemany for single-commit batch writes. Update collect_results_from_processing_results to accumulate dispositions and flush once per action instead of N per-record commits. For a 1000-record action this reduces disposition writes from 1000 fsyncs to 1. Existing per-record set_disposition remains for retry, node-level signals, and other single-record callers. Key design decisions: - Validate all records before opening transaction (no partial writes) - DELETE-before-INSERT in batch (matching P0-5 phantom prevention) - Rollback on sqlite3.Error (atomic batch) - Disposition errors caught at flush (telemetry, not authoritative) - Base class default loops for non-SQLite backends * refactor: simplify batch disposition code — extract shared helpers, deduplicate - Add DispositionRow TypeAlias to backend.py (replaces 4 repeated 7-tuple types) - Extract _validate_disposition_fields() in SQLiteBackend (deduplicates validation between set_disposition and set_dispositions_batch) - Extract wire_batch_disposition_delegate() to tests/conftest.py (replaces 10 copy-pasted _batch_delegate closures across 7 test files) - Net -102 lines
…ary json.dumps) (#608) - Add assert_path_contained() utility that resolves symlinks before checking containment, preventing path traversal via ../ or symlinks - Add sanitize_path_component() for long action names (truncate + hash) - Wire containment check into FileWriter.write_target() - Fix json.dumps(item) crash on bytes values in saver.py (default=str) - 17 new tests, 1 updated test
* fix: use uuid4 for source_guid to prevent silent collision on duplicate/expanded records source_guid was generated as uuid5(content_hash), causing identical input rows to silently collapse via UNIQUE constraint and 1-to-N expansion children to overwrite each other in storage. - Add generate_source_guid() (uuid4) for unique record identity - Rename old method to generate_content_hash() for explicit dedup use - Update all 4 sites in initial_pipeline.py to use uuid4 - Update task_preparer.py and online_llm.py callers - Give expansion children unique source_guids in LineageEnricher - Keep deprecated wrapper for backward compatibility * refactor: use IDGenerator.generate_source_guid() in batch paths for consistency Replace raw str(uuid.uuid4()) with IDGenerator.generate_source_guid() in _prepare_text_chunks_batch and _add_batch_metadata so all source_guid generation goes through the centralized method. * refactor: consolidate 4 local IDGenerator imports into single top-level import No circular dependency justifies the local imports — move to top of file.
…alidation error detail (#611) - Warn when defaults config contains unknown keys by comparing raw input against DefaultsConfig.model_fields (keeps extra="ignore" unchanged) - Add difflib.get_close_matches to UnknownKeysDetector for "did you mean?" suggestions on action key typos - Surface Pydantic field-level errors in ConfigurationError.context as structured validation_errors list (capped at 10)
) ActionStateManager had no lock protecting the read-modify-write cycle in update_status, allowing parallel executor threads to lose metadata. Added threading.Lock to guard all mutating methods (update_status, reset, mark_running_as_failed, reset_retryable). atomic_json_write used a deterministic temp path (path.with_suffix(".json.tmp")) which caused concurrent writes to the same file to clobber each other's temp file. Replaced with tempfile.mkstemp for unique temp paths per call, with proper fd cleanup if os.fdopen fails. Extracted _bulk_transition helper to deduplicate mark_running_as_failed and reset_retryable. Updated existing tests that asserted on old .json.tmp suffix. Added concurrency tests for both fixes.
…rams, ghost dir (#612) Dead functions (zero callers confirmed by grep): - path_utils.py: check_path_exists, create_mirror_source_path, validate_path_permissions, find_files_by_extension, safe_path_join, create_agent_directory_structure (6) - udf_management/tooling.py: _split_udf_name (1) - ollama/failure_injection.py: is_online_injection_enabled, is_batch_injection_enabled, get_injection_status (3) Backward-compat aliases (zero external consumers): - lsp/server.py: 11 _private = public re-exports removed Vestigial parameters (accepted but never read): - handlers.py: get_prompt_symbols(file_path) - diagnostics.py: _action_guard_variables(index) - builder.py: create_dynamic_agent(udf) - runner.py: setup_directories(idx) - service.py: _build_llm_context(contents) - anthropic/client.py: _call_api(mode) Ghost directory: - llm/providers/mistral/ — only __pycache__ artifacts, no .py source All callers and tests updated. 7305 tests pass, ruff clean.
…o 4096 (#609) Groq's call_non_json injected temperature=0.7 via setdefault — no other provider does this. Removed the default so Groq uses the API's own default, matching parity with OpenAI/Anthropic/Cohere. max_tokens default (1000) is preserved since Groq's API requires it. Anthropic's max_tokens default of 1024 was too low for most use cases. Raised to 4096 to better match typical usage patterns. User-specified values still take precedence via setdefault. Issue 1 (Anthropic empty response crash) was already fixed by PR #605 which added VendorAPIError catch in call_json returning _parse_error sentinel.
- Remove vestigial /tmp cleanup fixture (conftest.py) — cleaned paths no test creates - Strengthen 3 bare assert_called_once() to assert_called_once_with() verifying actual arguments (test_runner, test_stale_completion) - Replace 4 weak is-not-None-only assertions with value checks (test_schema_array_conversion, test_factory, test_retry_reprompt) - Remove 3 unnecessary time.sleep() calls from thread safety tests; replace sleep-based race with Event.wait() (test_processor_utils)
Build a {source_guid: item} dict once before the per-item loop in
LineageEnricher.enrich() and pass it to _get_parent_item, replacing
O(N^2) linear scans with O(1) dict lookups. Index only built when
use_per_item_parent_lookup is True (avoids wasted work on single-call
paths). Add optional index parameter to get_content_by_source_guid.
Linear scan fallback preserved for callers that don't pass the index.
13 new tests including performance validation.
…and storage (#614) - Add record identity (action_name, source_guid) to enrichment pipeline exception handler - Add LLM call duration logging (time.perf_counter) to ClientInvocationService for all provider paths (tool, hitl, standard) - Add progress context (completed, failed, total) to batch timeout warning - Upgrade batch abandoned-records log from INFO to WARNING with error context - Add action_name to batch context_map load failure warning - Upgrade disposition clearing from DEBUG to INFO when records are actually deleted, with record_id context - Add action_name to enrichment source_mapping out-of-bounds warnings
* docs: fix README quick-start syntax and replace pip with uv The init --example flag is actually a subcommand (agac init example). The Contributing section referenced pip install but the project uses uv. * docs: remove Mistral provider references The Mistral provider directory was deleted but docs still listed it as a supported provider. Removed all standalone Mistral references from README, docs site, skills references, and configuration guides. Ollama can still run Mistral models — those references are preserved. * docs: add support_resolution to README examples table The example exists in examples/ and in the docs site but was missing from the README examples table. * docs: fix RELEASING.md to use changie for version bumps The version is defined in both pyproject.toml and agent_actions/__version__.py. changie batch updates both via .changie.yaml replacements. The manual pyproject.toml-only instructions would leave __version__.py out of sync. * docs: reconcile cli/_MANIFEST.md with actual CLI surface Added missing modules: dispositions, example, retry, workflow_loader. Split inspect entry into actual source files: inspect_base, inspect_deps, inspect_graph, inspect_action. These commands have existed since the batch/retry/disposition features were added but the manifest was not updated. * chore: add changie entry for docs cleanup * docs: fix review findings — cohere in vendor comment, Ollama batch status, stale manifest entry - tutorials/index.md: add missing cohere to vendor list comment - README.md: Ollama batch support is Yes (OllamaBatchClient exists), not Online only - cli/_MANIFEST.md: remove project_paths_factory.py entry (file was deleted)
* test: add 33 tests covering 7 critical untested paths (P3-5) Coverage for error/edge paths in core pipeline code: - _fail_abandoned_records: context_map load failure (5 tests) - wait_for_batch_completion: timeout path (5 tests) - submit_retry_batch: submission failure (5 tests) - parallel action executor: async exception handling (5 tests) - ValidationStrategy: malformed config (7 tests) - prepare_correlated_input: iteration handoff (6 tests) 7388 tests pass, ruff clean. * refactor: simplify tests — parametrize, fix naming, remove noise - Merge redundant timeout tests into parametrized terminal status test - Parametrize exception type tests in abandoned records and malformed config - Fix misleading test name: test_key_error → test_missing_namespace_raises - Move DataValidationError import to top level - Remove FakeExecutionMetrics (unused), simplify FakeActionResult - Remove unnecessary comments restating assertions - Use _make_batch_result consistently in malformed config tests
* fix(vscode): create missing prompt-blocks TextMate grammar file
package.json references ./syntaxes/prompt-blocks.tmLanguage.json for
markdown prompt block injection but the file did not exist, causing a
VS Code activation warning. Creates a minimal grammar that highlights
template variables ({{ }}) and prompt variables (${...}) in markdown.
* fix(vscode): reset disposed flag on re-activate
The module-level disposed flag was set to true on deactivate but never
cleared when the extension re-activated. This prevented the LSP server
from restarting after a deactivate/reactivate cycle.
* fix(lsp): search all workspace files for context_scope completions
build_context_scope_completions only searched actions defined in the
current file. Now iterates all file_actions in the ProjectIndex so
cross-file action references appear in context_scope completions.
Deduplicates by action name to avoid duplicate entries.
* fix(lsp): add textDocument/didChange handler
The LSP server registered didOpen and didSave but not didChange.
Without didChange, the in-memory document store used by pygls was not
updated as the user typed, causing stale completions and diagnostics
until the file was saved.
* fix(vscode): add LSP crash recovery with restart and backoff
If the LSP server process crashes, the LanguageClient now automatically
restarts it up to 3 times (MAX_RESTART_COUNT). Uses vscode-languageclient
built-in ErrorAction/CloseAction handlers. Skips restart if the
extension is disposed.
* fix(vscode): escape Mermaid node labels to prevent syntax injection
Node labels from YAML were interpolated into Mermaid source with only
partial escaping (quotes, ], backslash). A label containing [, {, (,
|, #, &, ;, or other Mermaid-significant characters could break the
diagram syntax. Replaces the inline regex with a dedicated
escapeMermaidLabel method that handles all special characters.
* chore: add changie entry for P3-2 VS Code/LSP fixes
* refactor: remove unused file_path param and trim redundant docstring
build_context_scope_completions no longer uses file_path since it
searches all workspace files. Removed the unused parameter and updated
the call site in server.py. Trimmed verbose docstring on
escapeMermaidLabel.
* feat: add SQLite database maintenance operations (P4-2) Add post-workflow maintenance to prevent unbounded DB growth: - WAL checkpoint (TRUNCATE) after every workflow completion - Stale disposition cleanup: removes old FAILED/EXHAUSTED records when a newer SUCCESS exists for the same (action, record) - Prompt trace retention: configurable N-run retention window (default: 10 runs), deletes older traces at workflow end - Source data TTL: optional age-based cleanup (default: disabled) Config keys (in workflow YAML under storage.*): prompt_trace_retention_runs (default: 10) source_data_ttl_days (default: null = keep forever) All operations are idempotent, non-fatal on error, and logged. Wired into both sequential and async coordinator completion paths. 20 new tests covering all maintenance operations. * refactor: address review feedback on maintenance operations - Use DISPOSITION_SUCCESS/FAILED/EXHAUSTED constants instead of raw strings in disposition cleanup SQL (prevents schema drift) - Use StorageDefaults constants in perform_maintenance() signatures instead of duplicated magic numbers - Replace DATE(created_at) with direct created_at comparison in prompt trace retention DELETE to enable index usage
1. BLOCKING: Remove OpenAI raise LLMResponseParseError on reprompt path. Parse failures now always return _parse_error sentinel dict, matching Anthropic behavior. The reprompt pipeline reads the sentinel; the exception was uncaught and crashed the pipeline. 2. Fix retention docstring: _enforce_prompt_trace_retention uses DATE(created_at) as boundary, so retention_runs effectively means calendar days. Updated docstring to reflect actual semantics. 3. Remove unused prompt_content parameter from _compute_action_config_hash. No call site passes it, so it was dead code creating false confidence that prompt body changes would invalidate cached results. 4. Remove deprecated generate_deterministic_source_guid wrapper and its test. Only the test called it — backward-compat shim per RULES.md.
When _get_source_content fails to match by source_guid (staging records have no guid), the fallback uses the record's content dict. This dict contains ALL namespaces (source, summarize_page_content, etc.), not just the original source fields. References like source.page_content would resolve as None because page_content lives under content.source, not at the content top level. Now SourceNamespaceBuilder.build detects when source_content is a content dict with a 'source' sub-key and extracts it, so downstream observe references resolve correctly.
Handles wrapped ({content: {source: {...}}}) and unwrapped
({source: {...}}) content dicts. Both cases now extract the source
sub-namespace instead of using the full content dict with all action
namespaces as the source namespace.
…ation When version-merged or FILE-mode actions produce duplicate (action_name, record_id, disposition) tuples in the same batch, the INSERT fails. Deduplicate within the batch before writing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
25 PRs from a comprehensive quality audit — 33 automated audit agents across 7 passes identified 236+ issues, prioritized into P0-P4 tiers, and executed surgically across 4 rounds with 6 parallel clones.
P0 — Critical (data integrity)
P1 — High (performance, crashes, LLM reliability)
P2 — Medium (concurrency, scale, UX, security)
P3 — Cleanup (dead code, tests, docs, editor)
P4 — Polish
Verification
🤖 Generated with Claude Code