Improve EverCore Python quality gates#233
Open
hylarucoder wants to merge 24 commits into
Open
Conversation
- ruff with FastAPI-style rule set (E/W/F/B/C4/ASYNC); `make ruff` and `make ruff-fix` targets. Keeps black/isort during transition. - ty as the default typechecker (~10x faster than pyright); pyright retained as `make typecheck-pyright`. [tool.ty.rules] enables possibly-unresolved-reference, which catches real bugs surfaced in the audit (milvus_start, user_id_list). - pytest-asyncio: pin asyncio_mode = "strict" explicitly so the current default doesn't shift under us on upstream release. - load_env.py: use os.pathsep instead of hardcoded ":" in PYTHONPATH manipulation so the env loader works on Windows.
- memory_manager.get_vector_search_results: initialize milvus_start before the try so the except branch can always compute the stage duration. Previously a failure before line 729 would raise UnboundLocalError, masking the real exception. - mem_memorize._trigger_profile_extraction: initialize user_id_list before the try so the recovery branch can iterate participants safely on early failure. - mem_memorize: `request == None` -> `request is None` (E711). - mem_memorize / memory_manager / mem_db_operations: remove duplicate imports of traceback, dataclass, RawDataType, MemorizeConfig, DEFAULT_MEMORIZE_CONFIG, Any, MemoryType (F811). - retrieval_utils: bare `except:` -> `except Exception:` to stop swallowing KeyboardInterrupt / SystemExit (E722). - i18n_tool: declare OpenAIProvider under TYPE_CHECKING so the forward-reference annotations resolve without forcing memory_layer to be importable at module load.
Two classes named SearchMemoriesResponse were defined in api_specs/dtos/memory.py: a GET-endpoint variant at line 774 (BaseApiResponse[RetrieveMemResponse]) and a POST-endpoint variant at line 1697 (BaseApiResponse[SearchMemoriesResponseData]). The second definition silently shadowed the first, making the GET variant dead code. The POST variant is the one wired to memory_search_controller (it instantiates `SearchMemoriesResponse(data=response_data)` where response_data is SearchMemoriesResponseData), so keep that one and remove the shadowed GET variant + its example schema.
Same "assigned inside try, used in except / after-try" pattern as the earlier milvus_start / user_id_list fixes: - mem_memorize._trigger_profile_extraction: initialize profile_repo to None before the try, and skip the last_updated_ts advance when the DI lookup never ran (failure during lazy imports). - memory_layer.memory_manager: initialize display_name to None before the if/else so the else-branch loop falling through (no matching sender_id) doesn't leave it unbound at the downstream call site. - milvus_collection_base: pull `new_coll = Collection(...)` outside the index-creation try so an index-warning path can't shadow the collection-creation success with a NameError on `return new_coll`. - hmac_signature_middleware: bind timestamp_header / nonce_header / signature_header to None before the try, so the except branch's error logging works even if request.headers itself raises. - redis_length_cache_manager, redis_windows_cache_manager: bind `message = None` at the top of each loop iteration so the except branch can format its warning without a NameError when indexing raises before the assignment. Incidentally includes ruff F541 cleanup (f-strings without placeholders -> plain strings) in mem_memorize.py from the same auto-fix pass.
Mechanical cleanup from `ruff check --fix --select W291,W293,F541, C420,B010,B009`: - W293 blank-line-with-whitespace: removed trailing whitespace on otherwise-empty lines (~126 occurrences). - W291 trailing-whitespace: trimmed line-end whitespace. - F541 f-string-missing-placeholders: rewrote `f"..."` literals with no interpolation as plain strings (~22 occurrences). - C420 unnecessary-dict-comprehension-for-iterable, B010 set-attr- with-constant, B009 get-attr-with-constant: simplified a handful of unnecessary dict comprehensions and setattr/getattr-with- constant-attribute calls into direct attribute access. No behavior changes. The remaining 144 W293 / 9 W291 occurrences are inside docstrings or string literals and require --unsafe-fixes, which is intentionally not applied.
Mechanical cleanup via `ruff check --select F401 --fix --unsafe-fixes`. 138 of 139 unused imports removed across 67 files. Audited removals before committing: - DI decorators (`from core.di.decorators import service / component / repository`) were imported in a few adapter modules but never used as `@service` etc. `core.di.decorators` is a pure decorator factory module with no module-load side effects, so removing the dangling imports is safe. - stdlib (`os`, `time`, `json`, `random`, `uuid`, `numpy`, `jieba`, ...) and typing helpers (`Optional`, `List`, `Tuple`, ...) form the bulk of removals. - `__init__.py` files are intentionally exempt (per the `[tool.ruff.lint.per-file-ignores]` rule allowing re-exports). The one remaining F401 (`pyinstrument` in profile_middleware.py) is an availability-probe import — `try: import pyinstrument` — and is suppressed with `# noqa: F401` since the import is the side effect. Net impact: ruff baseline 387 -> 244 errors. ty count unchanged at 891 (F401 doesn't affect type inference). All src/*.py files compile successfully.
Fixes 5 ruff ASYNC violations. The first is a real bug; the others are dev-script cleanup for consistency now that aiofiles is already a project dependency. - core/oxm/es/migration/utils.py:177,182: `wait_for_task_completion` polls Elasticsearch task status inside an async function but used `time.sleep(5)` and `time.sleep(10)` on error, freezing the entire event loop on each iteration. ES rebuilds can run for minutes-to-hours, so this would stall every other coroutine in the process (FastAPI lifespan, background tasks, etc.). Switch to `await asyncio.sleep(...)`. - devops_scripts/i18n/i18n_tool.py:663,701 and devops_scripts/sensitive_info/sensitive_info_tool.py:477: Replace synchronous `with open(...)` / `f.read()` / `f.write()` with `async with aiofiles.open(...)` / `await f.read()` / `await f.write()`. These are dev-tool scripts where blocking IO impact is negligible, but the conversion is mechanical and removes the noise from future lint reports.
Mechanical cleanup of 190 ruff violations across 39 files. Manually
audited the auto-fixes; the noteworthy decisions:
- W293/W291 inside src/memory_layer/prompts/* was REVERTED. Those
triple-quoted strings are LLM prompt templates and the trailing
whitespace is part of the literal sent to the model. Three files
restored to HEAD: prompts/__init__.py, en/atomic_fact_prompts.py,
zh/conv_prompts.py.
- F841 "unused variable" auto-fix mostly deleted assignments where
the RHS is pure (literals, len(), sorted(), str()). For cases
where ruff conservatively kept the RHS as an orphan expression
(`get_tenant_aware_collection_name(origin_alias_name)` etc.),
manually verified those functions are pure and deleted the
expression lines too:
- test_di_scanner.py:284 (`len(container._named_beans)`)
- tenant_aware_collection_with_suffix.py:350
- user_profile_milvus_converter.py:52
The `job = await pool.enqueue_job(...)` case correctly kept the
awaited call -- only the unused name binding was removed.
- E731 lambda-to-def conversion in milvus_rebuild_collection.py
produced wrong inner indentation; fixed to PEP 8 4-space.
- C408 dict(...) -> {...}, B905 zip(..., strict=False), E401
multiple-imports-on-one-line, and the remaining W291/W293 outside
prompts were applied verbatim.
Result: ruff 239 -> 56 errors. All src/ files compile. ty 891 -> 839
(F841 deletions remove type tracking on those vars).
The previous --unsafe-fixes commit (dabc3c4) auto-removed assignments that ruff classified as unused locals. Several of those were not safe to drop on a purely-syntactic read: - agentic_layer/memory_manager.py: three `memory_type = (...)` bindings in `retrieve_mem_keyword / _vector / _hybrid`. These were paired with `start_time = time.perf_counter()` in the original metrics-labelling commit (b3958cb). The paired start_time was removed during a later metrics refactor that pushed recording into the inner `get_*_search_results` helpers; the memory_type binding got orphaned but the value is still a useful debugging breadcrumb at the wrapper layer. - core/asynctasks/task_manager.py: `job = await pool.enqueue_job(...)` — keeping the name binding aids debugging and any future follow-up that needs the job handle. - memory_manager.py `extend_data = fields['extend_data']`, agent_skill_milvus_converter `content = source_doc.content or ""`, episode_memory_extractor `default_title = "Conversation Episode"`, test_di_scanner `initial_bean_count = len(...)`, tenant_aware_collection_with_suffix `tenant_aware_alias_name = ...`, user_profile_milvus_converter `doc_id = str(source_doc.id) ...`: same story — likely vestiges of in-flight work or intentional reads kept for clarity. Restoring them is cheap; auto-removing was the wrong default. Add F841 to the ruff `ignore` list in pyproject.toml so future `--unsafe-fixes` runs won't reintroduce the same destructive cleanup. F841 can still be surfaced explicitly via `ruff check --select F841` when a human wants to triage them; we just don't auto-fix.
Claude Code loads CLAUDE.local.md alongside CLAUDE.md and lets it take precedence (closer-to-cwd loads last). Use it for personal workflow rules that shouldn't bind other contributors, while CLAUDE.md / AGENTS.md keep the shared team conventions.
Two new dev_docs articles drafted from a project-wide code review: - exception_handling_analysis.md: full audit of 781 try blocks and 17 hand-written retry loops. Catalogues 7 anti-patterns and 6 retry types with code references, performance/correctness analysis, and a sprint-organised action plan. - code_quality_roadmap.md: 8-week three-phase execution plan (observability -> tests -> try-catch cleanup), with per-phase acceptance criteria and rollback paths.
K8s-style liveness vs readiness separation, per the observability
roadmap (Phase 1, T1.1).
- /livez: process heartbeat, always 200. Suitable for liveness probes;
a failure means the process should be restarted.
- /readyz: runs four downstream probes (Redis, MongoDB, Elasticsearch,
Milvus) in parallel under a 2 s timeout. Returns 200 only when all
succeed; 503 with a per-dependency breakdown otherwise. Suitable for
readiness probes and load-balancer drain checks.
- /health: legacy alias for /livez, preserved for backward compatibility.
Each probe also publishes the evercore_dependency_healthy{name="..."}
Prometheus gauge so alerts can fire without polling /readyz externally.
Probe definitions live in a sibling probes.py module; adding a new
downstream is one async function plus an entry in default_probes().
prometheus_middleware skips /livez and /readyz so probe traffic does
not dominate request-rate dashboards.
Smoke-tested locally: happy path returns 200 with all four deps
healthy; pointing REDIS_PORT at a dead port correctly drives /readyz
to 503 with redis=False detail, /livez stays 200, and the gauge for
redis drops to 0 while others remain 1.
Per Phase 1, T1.3 of the observability roadmap. Before this change, four independent classifiers (memorize/retrieve/rerank/vectorize) each did string-matching on str(error).lower() and fell through to the literal "unknown" (or "error") bucket. Cross-service alerts could not group failures the same way, and any unfamiliar exception class was invisible behind the catch-all label. The new core/observation/error_classification.py is the single source of truth: it still emits the existing semantic labels (timeout, rate_limit, validation_error, connection_error, not_found) when those match, and otherwise returns the exception class name in snake_case. The four call sites now delegate to it. Effect on metrics: - error_type='unknown' / error_type='error' goes away entirely. - A previously-anonymous KeyError or pydantic.ValidationError now shows up as error_type='key_error' / 'validation_error' and can be alerted on. - Vectorize-specific failures previously bucketed as 'api_error' now surface as 'vectorize_error', which is more actionable. Unit-tested with 9 representative exception shapes; ruff + compile clean. No call-site behaviour change beyond the label string.
Per Phase 1, T1.2 of the observability roadmap.
The logger module now selects its output format from the LOG_FORMAT
env var:
- LOG_FORMAT=text (default) keeps the existing human-readable format
used by local development and tail-based debugging.
- LOG_FORMAT=json switches to one JSON object per line via the new
JsonFormatter. Each record carries ts (ISO 8601 ms + Z), level,
logger, msg, file, request_id, and any extra= mapping the caller
attached. Tracebacks live in a structured exc_info field instead of
being concatenated into msg.
ContextFilter (formerly RequestIdFilter) now also forwards tenant_id,
user_id, group_id and session_id from the app_info contextvar onto
the LogRecord, so JSON output captures them automatically when they
are populated by middleware.
The change is fully backward compatible: all 247 existing call sites
keep working unchanged in both formats. Structured fields can be
adopted incrementally via logger.info(msg, extra={...}) without
touching the rest.
Implemented with zero new dependencies (stdlib logging + json). A
future migration to structlog can layer on top without churning
callers again.
Smoke-tested locally: with LOG_FORMAT=json, /readyz emits per-probe
JSON lines that share the same request_id across the Redis / MongoDB /
Elasticsearch / Milvus checks, which is exactly the cross-call
correlation the roadmap calls out as Phase 1's headline win.
Closes Phase 1 T1.5 of the code-quality roadmap. - docs/dev_docs/slo_definitions.md: four load-bearing SLOs (memorize success rate, retrieve p95 latency, LLM error budget, dependency availability) wired to existing metric names. Includes PromQL, severity guidance, and an error-budget policy. - docs/grafana/evercore_slo_overview.json: importable dashboard with one stat per SLO plus 5-min burn-rate timeseries panels.
Closes Phase 1 T1.4 of the code-quality roadmap. - core/observation/tracing/otel.py: soft-imports the OTel SDK; init_tracing is a no-op unless OTEL_EXPORTER_OTLP_ENDPOINT is set AND the deps are installed. Installs FastAPI/httpx/redis/pymongo instrumentations when available. Default sampler is ParentBased(TraceIdRatioBased(0.1)). - trace_logger decorator now opens a real span when tracing is active; unchanged behavior when inactive (zero perf cost). - base_app wires init_tracing(app) at the end of create_base_app. - pyproject adds an optional `otel` dependency group. Activate with `uv sync --group otel`; default sync does NOT pull these in.
…heck Closes Phase 2 T2.1, T2.2, T2.3, T2.5 of the code-quality roadmap. - evercore-smoke.yml: runs unit-marker pytest with coverage XML upload and a non-blocking ty typecheck step. Both `continue-on-error: true` until tests are explicitly tagged per T2.4. - Makefile: adds test-unit / test-integration / test-e2e / test-cov targets. - tests/conftest.py: auto-tags tests by file path (tests/integration/** → integration; *_e2e.py → e2e; rest → unit). Explicit markers still win. - pytest.ini: fixes section header from [tool:pytest] (legacy setup.cfg form) to [pytest], so pytest actually reads markers and other settings. Registers e2e marker. - pyproject.toml: removes duplicate [tool.pytest.ini_options] section; pytest >= 8 silently ignored it whenever pytest.ini existed.
Closes Phase 3 W5 of the code-quality roadmap. Phase 3 W6 (retry refactor) and W7 (catch-all overhaul + custom exception hierarchy) are intentionally not in this commit — both require 24-hour staging soak per roadmap §4.2. Substantive changes - Replace 20 `traceback.print_exc()` call sites with `logger.exception(...)`; remove now-unused `import traceback`. Affected: bootstrap, biz_layer/mem_memorize, mem_db_operations, core/di/scanner, core/component/mongodb_client_factory, core/request/timeout_background, core/oxm/es/migration/utils, and 7 devops_scripts/data_fix scripts. - Consolidate duplicate `RetryConfig` classes. The canonical implementation lives in core/longjob/interfaces (richer API: jitter, backoff_multiplier, retry_on_fatal). The dataclass in core/asynctasks/task_manager is removed and the module re-imports from interfaces. No external consumers, safe merge. Ruff baseline - Enable G (flake8-logging-format) and BLE (flake8-blind-except). - 930 `# noqa: G004` / `# noqa: BLE001` markers applied to existing violations via `ruff check --add-noqa`. New code is held to the rule; baseline shrinks naturally as files are touched. Roadmap status - docs/dev_docs/code_quality_roadmap.md §0 updated to reflect Phase 1 done, Phase 2 partial, Phase 3 W5 done; explicit hand-off note for the W6/W7 boundary.
Adds opt-in performance benchmarks for the two hot paths called out in roadmap §3.2: hybrid retrieval and end-to-end memorize. Per roadmap guidance, benchmarks are NOT wired into CI yet — the first ~3 months are baseline collection only, with thresholds added later. - tests/benchmarks/test_retrieve_hybrid_benchmark.py: measures MemoryManager.retrieve_mem_hybrid; tied to the retrieve_p95_latency SLO (target p95 < 500 ms). - tests/benchmarks/test_memorize_benchmark.py: measures end-to-end memorize for a single message. - Both skip cleanly when RUN_BENCHMARKS is unset or when the DI container / DTOs can't be imported — collection-only is safe in CI. - Adds optional `bench` dependency group (pytest-benchmark). Default `uv sync` does not pull it in; activate with `uv sync --group bench`. - `make benchmark` target sets RUN_BENCHMARKS=1 and prints the min/mean/median/p95/max columns.
Records that T2.6 (benchmark scaffolding) landed and names T2.4 (unit-test backfill) as the next-up item gating Phase 3 W6/W7.
…nch group Picked up by uv during T2.6 verification. No behavior change for the default sync — these only activate under `uv sync --group bench`.
… layer Closes Phase 2 T2.4 of the code-quality roadmap. agent_skill_extractor was already covered by 193 existing tests, so the gap was in two modules: - tests/test_agentic_memory_manager.py (14 tests) Pins the swallow-and-return-empty contract on retrieve_mem_hybrid / retrieve_mem so Phase 3 W7's intentional change is a deliberate, reviewed diff. Also covers _classify_retrieve_error for the most common failure modes. Surfaces one xfail: the audit-flagged possibly-unbound bug where retrieve_mem(None) enters the fallback path but crashes inside QueryMetadata.from_request(None). Strict-xfail so Phase 3 W7's fix flips it naturally to xpass. - tests/test_biz_mem_memorize.py (15 tests) Covers if_memorize, _is_agent_case_quality_sufficient, _should_skip_atomic_fact_for_agent, _clone_episodes_for_users, and the _save_agent_case catch-all (returns 0 on save / conversion failure). Pure-logic tests; no DB or LLM fixtures required. Both files pass `ruff check`, `ty check`, and run cleanly under `pytest -m unit` without docker.
Two W7-subset items that are safely landable without staging soak:
1. core/errors/ — typed exception hierarchy.
EverCoreError is the common root so controllers can install a single
except EverCoreError fallback while pattern-matching subtypes
higher in the chain. Subclasses cover:
- MemorizeError / MemoryCellPersistFailed — write-path failures
- ExtractionError / LLMOutputFormatError — LLM-side failures, with
LLMOutputFormatError carved out for the Type-B "feedback retry"
pattern (parse failure feeds into next prompt)
- RetrieveError / SearchBackendUnavailable — read-path failures
- ProviderUnavailable — root-level (both paths can hit a provider
outage), carries an explicit `provider` field for metric/label
attribution
Every class carries `tenant_id`, `cause`, and arbitrary `context`
kwargs so structured logging and Sentry grouping work without
parsing message strings. Purely additive — no callers migrated.
15 tests pin the hierarchy shape and attribute contract.
2. Fix audit-flagged possibly-unbound bug in MemoryManager.retrieve_mem.
When the request was None, the validation ValueError was caught by
the broad fallback, but the fallback then crashed inside
QueryMetadata.from_request(None). Short-circuited None at the top
so the swallow-and-empty contract is honored. The xfail in
tests/test_agentic_memory_manager.py::test_no_request_returns_empty_response
flips to xpass and is no longer marked.
Roadmap §0 status updated. W6 retry refactor remains deferred —
openai_provider._execute_with_retry's key-rotation + status-code
dispatch + per-error metric labels aren't a tenacity drop-in and need
behavior-equivalence tests plus 24h staging soak per §4.2.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pyproject.tomlconfiguration.uv, added the missingpsutiltest dependency, and refresheduv.lockfor the new dev tooling.Area
Verification
make ruffpassed.make typecheckpassed.make typecheck-pyrightcompleted with 0 errors and 8 existing warnings.make testnow runs under the uv-managed Python environment, but the full local suite is not green in this checkout: 1574 passed, 6 skipped, 164 failed, and 160 errored. The failures are concentrated in integration/repository/E2E suites that require local Mongo/Redis/Elasticsearch/Milvus/provider configuration; for example, the delete API integration suite usednone:27017andMONGO_DB=None.Checklist
.envfiles, dependency folders, or generated output.Notes for Reviewers
The main reviewer focus should be the initial rule baseline. Ruff is now passing, while ty is intentionally limited to
possibly-unresolved-referenceso this first rollout does not fail on the current broader type-checking baseline.By submitting this pull request, I agree that my contribution is licensed under
the Apache License 2.0.