Skip to content

v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624

Open
Muizzkolapo wants to merge 29 commits into
mainfrom
integration/v0.2.0-quality-clean
Open

v0.2.0 Quality Hardening — 25 fixes across storage, LLM, config, performance, and observability#624
Muizzkolapo wants to merge 29 commits into
mainfrom
integration/v0.2.0-quality-clean

Conversation

@Muizzkolapo
Copy link
Copy Markdown
Owner

Summary

25 PRs from a comprehensive quality audit — 33 automated audit agents across 7 passes identified 236+ issues, prioritized into P0-P4 tiers, and executed surgically across 4 rounds with 6 parallel clones.

P0 — Critical (data integrity)

P1 — High (performance, crashes, LLM reliability)

P2 — Medium (concurrency, scale, UX, security)

P3 — Cleanup (dead code, tests, docs, editor)

P4 — Polish

Verification

  • All 6 example workflows validate successfully against this branch
  • Full test suite passes (7419 tests)
  • Each PR was written from a reviewed spec with adversarial testing and 3rd-degree code relationship graphs

🤖 Generated with Claude Code

Muizzkolapo and others added 29 commits May 22, 2026 21:13
The disposition table's UNIQUE constraint is (action_name, record_id,
disposition), not (action_name, record_id). Writing SUCCESS for a record
that previously had FAILED resulted in BOTH rows coexisting — phantom
failures that _resolve_completion_status would report.

Added DELETE for (action_name, record_id) before INSERT in set_disposition,
both in the same transaction. This ensures only one disposition per record
per action at any time.
…tion_order (#601)

_check_prior_output now checks for node-level terminal dispositions
(FILTERED, PASSTHROUGH, SUCCESS, UNPROCESSED) when no target files
exist. This prevents infinite rerun cycles for actions that legitimately
produce no output (e.g., all records guard-filtered).

is_workflow_complete, is_workflow_done, has_any_failed, and get_summary
now iterate execution_order instead of action_status.values(), so orphan
entries from removed actions no longer block workflow completion or
inflate failure counts.
… items (#604)

LLM can return malformed data (strings, lists, None, integers) that
crashes every enricher's .get() calls with AttributeError. The guard
filters non-dict items at the pipeline entry point before any enricher
runs. Partial non-dict batches continue with valid items; all-non-dict
batches are marked FAILED and return early without running enrichers.
* fix: detect prompt/model/schema/guard changes and invalidate stale completed actions

_maybe_invalidate_completed_status only checked record_limit and file_limit.
If a user changed the prompt, model, schema, or guard, actions stayed COMPLETED
from the prior run and the new config was never applied.

Add _compute_action_config_hash() which produces a stable SHA-256 fingerprint
of semantically-meaningful fields (prompt, model, schema, guard clause+behavior).
Store the hash in status metadata on completion. On re-run, compare stored hash
to current — if they differ, reset to PENDING.

- Guard normalization handles string, dict, and None forms
- Missing stored hash (first run / upgrade) does NOT invalidate
- Cosmetic fields (description, tags, limits) excluded from hash
- All _limit_metadata call sites replaced with _completion_metadata
- 20 new tests + 2 existing tests updated for new config_hash kwarg

* style: clean up json import alias and redundant ensure_ascii param

- import json as _json → import json (aligns with 135+ other files)
- Remove ensure_ascii=True (already the default for json.dumps)
…#602)

Before this change, pressing Ctrl+C (or any crash) between disposition
clearing and workflow.run() completion would silently lose track of
failed records. The next retry invocation would see no failures and
report 'Nothing to retry.'

Now a JSON manifest is written BEFORE clearing dispositions. If the
process is interrupted, the next retry invocation detects the manifest,
restores the cleared dispositions, and proceeds normally.

- Manifest stored at agent_io/store/{workflow}/_retry_manifest.json
- Contains full disposition rows (action, record, status, reason, detail)
- Written before clear, deleted after successful completion
- Corrupt/stale manifests handled gracefully
- 16 new tests covering write-before-clear ordering, interrupt survival,
  restore-on-next-invocation, write-failure abort, and no-signal-handler
* fix: harden LLM client layer (P1-5 + P2-6)

Five LLM-layer defects fixed plus template None finalize:

1. Token overflow pre-flight guard: MessageBuilder.build() estimates
   token count (chars/4 heuristic) and raises PromptTooLargeError
   before the API call when prompt exceeds model context window.

2. OpenAI JSON mode prompt injection: user-supplied context_data is
   now sent in a separate user message instead of being embedded in
   the system message. Only affects TAGGED style (OpenAI); Groq's
   TAGGED_GROQ keeps its single system message.

3. Skipped-dependency template tolerance: when an upstream action was
   skipped, its missing namespace triggers a warning and empty
   _PermissiveNamespace substitution instead of crashing. Genuine
   typos still raise TemplateVariableError.

4. Anthropic empty response parity: call_json() now returns the same
   _parse_error sentinel as OpenAI instead of raising VendorAPIError
   on empty/content-less responses.

5. JSON parse failure with reprompt: when reprompt is configured and
   JSON parsing fails, OpenAIClient raises LLMResponseParseError so
   the retry/reprompt machinery can intercept. Without reprompt, the
   existing silent _parse_error pass-through is preserved.

6. Template None finalize (P2-6): Jinja2 Environment in service.py
   now uses finalize=lambda x: "" if x is None else x, so None
   values render as empty string instead of literal "None".

30 new tests, 7 existing tests updated.

* refactor: simplify review — fix import ordering, test tautology, deferred imports

- Move logger definition after all imports in anthropic/client.py
- Replace tautological assertion in test_multiple_skipped_deps_tolerated
  with concrete expected output check
- Move PromptTooLargeError and LLMResponseParseError to top-level imports
  instead of deferred function-body imports (consistent with existing
  VendorAPIError imports in the same files)
* perf: batch SQLite disposition commits — one transaction per action instead of N

Add set_dispositions_batch() to StorageBackend (default loops over
set_disposition) and override in SQLiteBackend with executemany for
single-commit batch writes. Update collect_results_from_processing_results
to accumulate dispositions and flush once per action instead of N per-record
commits.

For a 1000-record action this reduces disposition writes from 1000 fsyncs
to 1. Existing per-record set_disposition remains for retry, node-level
signals, and other single-record callers.

Key design decisions:
- Validate all records before opening transaction (no partial writes)
- DELETE-before-INSERT in batch (matching P0-5 phantom prevention)
- Rollback on sqlite3.Error (atomic batch)
- Disposition errors caught at flush (telemetry, not authoritative)
- Base class default loops for non-SQLite backends

* refactor: simplify batch disposition code — extract shared helpers, deduplicate

- Add DispositionRow TypeAlias to backend.py (replaces 4 repeated 7-tuple types)
- Extract _validate_disposition_fields() in SQLiteBackend (deduplicates
  validation between set_disposition and set_dispositions_batch)
- Extract wire_batch_disposition_delegate() to tests/conftest.py (replaces
  10 copy-pasted _batch_delegate closures across 7 test files)
- Net -102 lines
…ary json.dumps) (#608)

- Add assert_path_contained() utility that resolves symlinks before
  checking containment, preventing path traversal via ../ or symlinks
- Add sanitize_path_component() for long action names (truncate + hash)
- Wire containment check into FileWriter.write_target()
- Fix json.dumps(item) crash on bytes values in saver.py (default=str)
- 17 new tests, 1 updated test
* fix: use uuid4 for source_guid to prevent silent collision on duplicate/expanded records

source_guid was generated as uuid5(content_hash), causing identical input
rows to silently collapse via UNIQUE constraint and 1-to-N expansion
children to overwrite each other in storage.

- Add generate_source_guid() (uuid4) for unique record identity
- Rename old method to generate_content_hash() for explicit dedup use
- Update all 4 sites in initial_pipeline.py to use uuid4
- Update task_preparer.py and online_llm.py callers
- Give expansion children unique source_guids in LineageEnricher
- Keep deprecated wrapper for backward compatibility

* refactor: use IDGenerator.generate_source_guid() in batch paths for consistency

Replace raw str(uuid.uuid4()) with IDGenerator.generate_source_guid()
in _prepare_text_chunks_batch and _add_batch_metadata so all source_guid
generation goes through the centralized method.

* refactor: consolidate 4 local IDGenerator imports into single top-level import

No circular dependency justifies the local imports — move to top of file.
…alidation error detail (#611)

- Warn when defaults config contains unknown keys by comparing raw input
  against DefaultsConfig.model_fields (keeps extra="ignore" unchanged)
- Add difflib.get_close_matches to UnknownKeysDetector for "did you mean?"
  suggestions on action key typos
- Surface Pydantic field-level errors in ConfigurationError.context as
  structured validation_errors list (capped at 10)
)

ActionStateManager had no lock protecting the read-modify-write cycle
in update_status, allowing parallel executor threads to lose metadata.
Added threading.Lock to guard all mutating methods (update_status, reset,
mark_running_as_failed, reset_retryable).

atomic_json_write used a deterministic temp path (path.with_suffix(".json.tmp"))
which caused concurrent writes to the same file to clobber each other's temp
file. Replaced with tempfile.mkstemp for unique temp paths per call, with
proper fd cleanup if os.fdopen fails.

Extracted _bulk_transition helper to deduplicate mark_running_as_failed and
reset_retryable. Updated existing tests that asserted on old .json.tmp suffix.
Added concurrency tests for both fixes.
…rams, ghost dir (#612)

Dead functions (zero callers confirmed by grep):
- path_utils.py: check_path_exists, create_mirror_source_path,
  validate_path_permissions, find_files_by_extension, safe_path_join,
  create_agent_directory_structure (6)
- udf_management/tooling.py: _split_udf_name (1)
- ollama/failure_injection.py: is_online_injection_enabled,
  is_batch_injection_enabled, get_injection_status (3)

Backward-compat aliases (zero external consumers):
- lsp/server.py: 11 _private = public re-exports removed

Vestigial parameters (accepted but never read):
- handlers.py: get_prompt_symbols(file_path)
- diagnostics.py: _action_guard_variables(index)
- builder.py: create_dynamic_agent(udf)
- runner.py: setup_directories(idx)
- service.py: _build_llm_context(contents)
- anthropic/client.py: _call_api(mode)

Ghost directory:
- llm/providers/mistral/ — only __pycache__ artifacts, no .py source

All callers and tests updated. 7305 tests pass, ruff clean.
…o 4096 (#609)

Groq's call_non_json injected temperature=0.7 via setdefault — no other
provider does this. Removed the default so Groq uses the API's own default,
matching parity with OpenAI/Anthropic/Cohere. max_tokens default (1000)
is preserved since Groq's API requires it.

Anthropic's max_tokens default of 1024 was too low for most use cases.
Raised to 4096 to better match typical usage patterns. User-specified
values still take precedence via setdefault.

Issue 1 (Anthropic empty response crash) was already fixed by PR #605
which added VendorAPIError catch in call_json returning _parse_error sentinel.
- Remove vestigial /tmp cleanup fixture (conftest.py) — cleaned paths
  no test creates
- Strengthen 3 bare assert_called_once() to assert_called_once_with()
  verifying actual arguments (test_runner, test_stale_completion)
- Replace 4 weak is-not-None-only assertions with value checks
  (test_schema_array_conversion, test_factory, test_retry_reprompt)
- Remove 3 unnecessary time.sleep() calls from thread safety tests;
  replace sleep-based race with Event.wait() (test_processor_utils)
Build a {source_guid: item} dict once before the per-item loop in
LineageEnricher.enrich() and pass it to _get_parent_item, replacing
O(N^2) linear scans with O(1) dict lookups. Index only built when
use_per_item_parent_lookup is True (avoids wasted work on single-call
paths). Add optional index parameter to get_content_by_source_guid.

Linear scan fallback preserved for callers that don't pass the index.
13 new tests including performance validation.
…and storage (#614)

- Add record identity (action_name, source_guid) to enrichment pipeline
  exception handler
- Add LLM call duration logging (time.perf_counter) to ClientInvocationService
  for all provider paths (tool, hitl, standard)
- Add progress context (completed, failed, total) to batch timeout warning
- Upgrade batch abandoned-records log from INFO to WARNING with error context
- Add action_name to batch context_map load failure warning
- Upgrade disposition clearing from DEBUG to INFO when records are actually
  deleted, with record_id context
- Add action_name to enrichment source_mapping out-of-bounds warnings
* docs: fix README quick-start syntax and replace pip with uv

The init --example flag is actually a subcommand (agac init example).
The Contributing section referenced pip install but the project uses uv.

* docs: remove Mistral provider references

The Mistral provider directory was deleted but docs still listed it
as a supported provider. Removed all standalone Mistral references
from README, docs site, skills references, and configuration guides.
Ollama can still run Mistral models — those references are preserved.

* docs: add support_resolution to README examples table

The example exists in examples/ and in the docs site but was missing
from the README examples table.

* docs: fix RELEASING.md to use changie for version bumps

The version is defined in both pyproject.toml and agent_actions/__version__.py.
changie batch updates both via .changie.yaml replacements. The manual
pyproject.toml-only instructions would leave __version__.py out of sync.

* docs: reconcile cli/_MANIFEST.md with actual CLI surface

Added missing modules: dispositions, example, retry, workflow_loader.
Split inspect entry into actual source files: inspect_base, inspect_deps,
inspect_graph, inspect_action. These commands have existed since the
batch/retry/disposition features were added but the manifest was not updated.

* chore: add changie entry for docs cleanup

* docs: fix review findings — cohere in vendor comment, Ollama batch status, stale manifest entry

- tutorials/index.md: add missing cohere to vendor list comment
- README.md: Ollama batch support is Yes (OllamaBatchClient exists), not Online only
- cli/_MANIFEST.md: remove project_paths_factory.py entry (file was deleted)
* test: add 33 tests covering 7 critical untested paths (P3-5)

Coverage for error/edge paths in core pipeline code:
- _fail_abandoned_records: context_map load failure (5 tests)
- wait_for_batch_completion: timeout path (5 tests)
- submit_retry_batch: submission failure (5 tests)
- parallel action executor: async exception handling (5 tests)
- ValidationStrategy: malformed config (7 tests)
- prepare_correlated_input: iteration handoff (6 tests)

7388 tests pass, ruff clean.

* refactor: simplify tests — parametrize, fix naming, remove noise

- Merge redundant timeout tests into parametrized terminal status test
- Parametrize exception type tests in abandoned records and malformed config
- Fix misleading test name: test_key_error → test_missing_namespace_raises
- Move DataValidationError import to top level
- Remove FakeExecutionMetrics (unused), simplify FakeActionResult
- Remove unnecessary comments restating assertions
- Use _make_batch_result consistently in malformed config tests
* fix(vscode): create missing prompt-blocks TextMate grammar file

package.json references ./syntaxes/prompt-blocks.tmLanguage.json for
markdown prompt block injection but the file did not exist, causing a
VS Code activation warning. Creates a minimal grammar that highlights
template variables ({{ }}) and prompt variables (${...}) in markdown.

* fix(vscode): reset disposed flag on re-activate

The module-level disposed flag was set to true on deactivate but never
cleared when the extension re-activated. This prevented the LSP server
from restarting after a deactivate/reactivate cycle.

* fix(lsp): search all workspace files for context_scope completions

build_context_scope_completions only searched actions defined in the
current file. Now iterates all file_actions in the ProjectIndex so
cross-file action references appear in context_scope completions.
Deduplicates by action name to avoid duplicate entries.

* fix(lsp): add textDocument/didChange handler

The LSP server registered didOpen and didSave but not didChange.
Without didChange, the in-memory document store used by pygls was not
updated as the user typed, causing stale completions and diagnostics
until the file was saved.

* fix(vscode): add LSP crash recovery with restart and backoff

If the LSP server process crashes, the LanguageClient now automatically
restarts it up to 3 times (MAX_RESTART_COUNT). Uses vscode-languageclient
built-in ErrorAction/CloseAction handlers. Skips restart if the
extension is disposed.

* fix(vscode): escape Mermaid node labels to prevent syntax injection

Node labels from YAML were interpolated into Mermaid source with only
partial escaping (quotes, ], backslash). A label containing [, {, (,
|, #, &, ;, or other Mermaid-significant characters could break the
diagram syntax. Replaces the inline regex with a dedicated
escapeMermaidLabel method that handles all special characters.

* chore: add changie entry for P3-2 VS Code/LSP fixes

* refactor: remove unused file_path param and trim redundant docstring

build_context_scope_completions no longer uses file_path since it
searches all workspace files. Removed the unused parameter and updated
the call site in server.py. Trimmed verbose docstring on
escapeMermaidLabel.
* feat: add SQLite database maintenance operations (P4-2)

Add post-workflow maintenance to prevent unbounded DB growth:

- WAL checkpoint (TRUNCATE) after every workflow completion
- Stale disposition cleanup: removes old FAILED/EXHAUSTED records
  when a newer SUCCESS exists for the same (action, record)
- Prompt trace retention: configurable N-run retention window
  (default: 10 runs), deletes older traces at workflow end
- Source data TTL: optional age-based cleanup (default: disabled)

Config keys (in workflow YAML under storage.*):
  prompt_trace_retention_runs (default: 10)
  source_data_ttl_days (default: null = keep forever)

All operations are idempotent, non-fatal on error, and logged.
Wired into both sequential and async coordinator completion paths.

20 new tests covering all maintenance operations.

* refactor: address review feedback on maintenance operations

- Use DISPOSITION_SUCCESS/FAILED/EXHAUSTED constants instead of raw
  strings in disposition cleanup SQL (prevents schema drift)
- Use StorageDefaults constants in perform_maintenance() signatures
  instead of duplicated magic numbers
- Replace DATE(created_at) with direct created_at comparison in
  prompt trace retention DELETE to enable index usage
1. BLOCKING: Remove OpenAI raise LLMResponseParseError on reprompt path.
   Parse failures now always return _parse_error sentinel dict, matching
   Anthropic behavior. The reprompt pipeline reads the sentinel; the
   exception was uncaught and crashed the pipeline.

2. Fix retention docstring: _enforce_prompt_trace_retention uses
   DATE(created_at) as boundary, so retention_runs effectively means
   calendar days. Updated docstring to reflect actual semantics.

3. Remove unused prompt_content parameter from _compute_action_config_hash.
   No call site passes it, so it was dead code creating false confidence
   that prompt body changes would invalidate cached results.

4. Remove deprecated generate_deterministic_source_guid wrapper and its
   test. Only the test called it — backward-compat shim per RULES.md.
When _get_source_content fails to match by source_guid (staging records
have no guid), the fallback uses the record's content dict. This dict
contains ALL namespaces (source, summarize_page_content, etc.), not just
the original source fields. References like source.page_content would
resolve as None because page_content lives under content.source, not at
the content top level.

Now SourceNamespaceBuilder.build detects when source_content is a
content dict with a 'source' sub-key and extracts it, so downstream
observe references resolve correctly.
Handles wrapped ({content: {source: {...}}}) and unwrapped
({source: {...}}) content dicts. Both cases now extract the source
sub-namespace instead of using the full content dict with all action
namespaces as the source namespace.
…ation

When version-merged or FILE-mode actions produce duplicate
(action_name, record_id, disposition) tuples in the same batch,
the INSERT fails. Deduplicate within the batch before writing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant