linkedin · mkuchenbecker · May 23, 2026
diff --git a/apps/optimizer-analyzer/docs/architecture.md b/apps/optimizer-analyzer/docs/architecture.md
@@ -0,0 +1,70 @@
+# Optimizer Analyzer — Architecture
+
+The Optimizer Analyzer is a batch process that decides which tables need a maintenance operation scheduled. It is **not** a continuously running service — it is a Spring Boot `CommandLineRunner` invoked per process invocation (typically on a cron). One pass evaluates every table opted in to a given operation type and writes a `PENDING` row to the optimizer DB for each table that needs work. The Scheduler picks those rows up on its own cadence.
+
+This document describes the components, dependencies, and the per-pass call flow.
+
+## Component view
+
+![Optimizer Analyzer component view](diagrams/component.png)
+
+| Component | Role |
+|---|---|
+| `AnalyzerApplication` | Spring Boot entry point. `@SpringBootApplication` + `CommandLineRunner` that invokes `AnalyzerRunner.analyze(operationType)` once per registered `OperationAnalyzer` per process. |
+| `AnalyzerRunner` | The per-pass orchestrator. Resolves the matching analyzer for an operation type, fans out across databases, and per database pre-loads three intermediate maps before evaluating each table. |
+| `OperationAnalyzer` (interface) | Strategy interface. Each implementation declares (a) which `OperationType` it handles, (b) `isEnabled(table)` — the per-table opt-in check, and (c) `shouldSchedule(table, currentOp?, latestHistory?)` — the per-table decision. |
+| `CadenceBasedOrphanFilesDeletionAnalyzer` | First implementation. Opts in via the `maintenance.optimizer.ofd.enabled` table property. Delegates the cadence decision to `CadencePolicy`. |
+| `CadencePolicy` | Time-based scheduling policy. Stays out of any table that already has a non-`CANCELED` active operation; for others, decides re-scheduling eligibility from the most recent completed-history entry using configurable success/failure retry intervals. |
+| `TableOperationsRepository`, `TableOperationsHistoryRepository`, `TableStatsRepository` | Defined in `services/optimizer` and shared with the analyzer / scheduler apps via the `apps/optimizer` shared module. The analyzer reads from all three and writes only to `TableOperationsRepository`. |
+
+The analyzer never talks to the Scheduler directly. The contract between them is the `table_operations` table in MySQL — the analyzer inserts `PENDING` rows; the scheduler claims them.
+
+## Sequence view
+
+The sequence below covers **one operation type, one analyzer pass**, fanned out across all databases. The `CommandLineRunner` repeats this loop once per registered `OperationAnalyzer` per process startup.
+
+![Optimizer Analyzer sequence view](diagrams/sequence.png)
+
+### Phases
+
+**Per-process startup.** `AnalyzerRunner.analyze(operationType)` resolves the matching analyzer (throws `IllegalStateException` if none is registered for the given type) and calls `statsRepo.findDistinctDatabaseNames()` for the fan-out list. If `analyze(...)` is called with an explicit `databaseName` filter, the fan-out is a singleton list and `findDistinctDatabaseNames` is not called.
+
+**Per-database fan-out.** Each database is processed in its own `@Transactional` `analyzeDatabase(...)` call. The transaction boundary spans the three pre-load reads and the per-table save loop, giving a consistent snapshot of (current operations, latest history, tables) across the iteration. The working set is bounded by tables-per-db rather than tables-total.
+
+**Pre-load: three reads.** Inside the transaction, three queries load the intermediate maps once per database:
+
+1. `operationsRepo.find(opType, *, *, db, *, *, *, page)` → `Map<tableUuid, TableOperationDto>` — current active operations for this op type in this database. The collector uses `TableOperationDto::mostRecent` as the merge function so duplicate rows (which the scheduler dedups separately) resolve to the most recent.
+2. `historyRepo.findLatest(opType, page)` → `Map<tableUuid, TableOperationsHistoryDto>` — latest completed-history entry per table. Collector merge uses `TableOperationsHistoryDto::after` (which trusts the service-set `completedAt` invariant).
+3. `statsRepo.find(db, *, *, page)` → `List<TableDto>` — the candidate tables in this database.
+
+**Per-table evaluation.** For each table:
+
+1. `analyzer.isEnabled(table)` — skip if the table is not opted in to this operation type.
+2. `analyzer.shouldSchedule(table, currentOp?, latestHistory?)` — for the cadence-based analyzer this delegates to `CadencePolicy.shouldSchedule`, which has three branches:
+   - **Active non-`CANCELED` `currentOp`:** return `false` — the scheduler owns this row.
+   - **No active (or `CANCELED`) `currentOp` + no history:** return `true` — first-time / post-cancel; schedule.
+   - **No active (or `CANCELED`) `currentOp` + history present:** return `true` if `now − completedAt > intervalFor(status)` (where `intervalFor` is `successRetryInterval` on `SUCCESS`, `failureRetryInterval` on `FAILED`; a switch with throwing default surfaces any new `HistoryStatusDto` value as a runtime error rather than silently bucketing it).
+3. On `true`, save a new `PENDING` row via `operationsRepo.save(...)`. The save is wrapped in `try { ... } catch (RuntimeException) { log.error; continue; }` — a single bad row never aborts the rest of the database's iteration. Per-row outcomes increment `created` or `failed`, logged as a single aggregate `INFO` line per database at the end.
+
+**Per-table logs are at `DEBUG` level.** The per-database aggregate is at `INFO`. This keeps INFO-level output bounded by `databases × operation_types`, not `tables × operation_types`.
+
+## Concurrent-instance contract
+
+Two analyzer instances running concurrently against the same MySQL **may** both insert a `PENDING` row for the same `(tableUuid, operationType)` — there is no uniqueness constraint on `table_operations`, and multiple PENDING/SCHEDULING/SCHEDULED rows for the same table are intentionally allowed. The dedup mechanism is `SchedulerRunner.cancelDuplicates`, which runs per scheduling cycle. The analyzer's own logic does not coordinate with itself or with the scheduler beyond the read snapshot.
+
+## Configuration
+
+| Property | Default | Effect |
+|---|---|---|
+| `ofd.success-retry-hours` | `16` | Hours to wait after a `SUCCESS` history entry before re-evaluating. Configured below 24h so at least one re-evaluation is guaranteed in any rolling 24-hour window regardless of when the prior run landed. |
+| `ofd.failure-retry-hours` | `1` | Hours to wait after a `FAILED` history entry before retrying. Shorter than success so transient failures recover quickly. |
+| `optimizer.repo.default-limit` | `10000` | Per-query LIMIT on the three pre-load reads. `Pageable` cascades to `LIMIT n` in SQL. |
+| `maintenance.optimizer.ofd.enabled` (table property, not app config) | (absent) | Per-table opt-in for OFD. Must equal the string `"true"` for `isEnabled` to return `true`. |
+
+## Source of truth
+
+- Renderable PlantUML sources: [`diagrams/component.puml`](diagrams/component.puml), [`diagrams/sequence.puml`](diagrams/sequence.puml).
+- The diagrams describe the analyzer as implemented in `apps/optimizer-analyzer` (introduced in PR #533).
+- Open scale-test work: [BDP-102738](https://linkedin.atlassian.net/browse/BDP-102738).
+- Open OTel-metrics work for both analyzer + scheduler: [BDP-102737](https://linkedin.atlassian.net/browse/BDP-102737).
+- Open MySQL TX-validation work: [BDP-102739](https://linkedin.atlassian.net/browse/BDP-102739).
diff --git a/apps/optimizer-analyzer/docs/diagrams/component.png b/apps/optimizer-analyzer/docs/diagrams/component.png
diff --git a/apps/optimizer-analyzer/docs/diagrams/component.puml b/apps/optimizer-analyzer/docs/diagrams/component.puml
@@ -0,0 +1,45 @@
+@startuml component
+title Optimizer Analyzer — Component View
+
+skinparam componentStyle rectangle
+skinparam shadowing false
+skinparam defaultFontName Helvetica
+
+package "apps/optimizer-analyzer" {
+  component "AnalyzerApplication\n@SpringBootApplication\nCommandLineRunner" as App
+  component "AnalyzerRunner" as Runner
+
+  interface "OperationAnalyzer\n(strategy)" as IfaceAnalyzer
+  component "CadenceBasedOrphanFiles\nDeletionAnalyzer" as OFD
+  component "CadencePolicy" as Policy
+}
+
+package "services/optimizer\n(shared module: db + repository)" {
+  component "TableOperationsRepository" as OpsRepo
+  component "TableOperationsHistoryRepository" as HistRepo
+  component "TableStatsRepository" as StatsRepo
+}
+
+database "MySQL\n(optimizer schema)" as DB
+
+App ..> Runner : invokes once per\nregistered analyzer
+Runner --> IfaceAnalyzer : isEnabled / shouldSchedule
+OFD ..|> IfaceAnalyzer
+OFD --> Policy : delegate cadence decision
+
+Runner --> StatsRepo : findDistinctDatabaseNames,\nfind(filters)
+Runner --> OpsRepo : read currentOps,\nsave PENDING
+Runner --> HistRepo : read latestHistory
+
+OpsRepo --> DB
+HistRepo --> DB
+StatsRepo --> DB
+
+note right of DB
+  Scheduler reads PENDING rows
+  from here on its own cadence;
+  the analyzer never talks to the
+  scheduler directly.
+end note
+
+@enduml
diff --git a/apps/optimizer-analyzer/docs/diagrams/sequence.png b/apps/optimizer-analyzer/docs/diagrams/sequence.png
diff --git a/apps/optimizer-analyzer/docs/diagrams/sequence.puml b/apps/optimizer-analyzer/docs/diagrams/sequence.puml
@@ -0,0 +1,79 @@
+@startuml sequence
+title Optimizer Analyzer — Sequence (one OperationType, one analyzer pass)
+
+skinparam shadowing false
+skinparam defaultFontName Helvetica
+skinparam sequenceMessageAlign center
+
+actor "Spring Boot\nCommandLineRunner" as CLR
+participant AnalyzerRunner as Runner
+participant "OperationAnalyzer\n(CadenceBased OFD)" as Analyzer
+participant CadencePolicy as Policy
+participant TableOperationsRepository as OpsRepo
+participant TableOperationsHistoryRepository as HistRepo
+participant TableStatsRepository as StatsRepo
+database MySQL
+
+== Per-process startup ==
+CLR -> Runner : analyze(operationType)
+activate Runner
+Runner -> Runner : lookup analyzer by operationType\n(IllegalStateException if none)
+Runner -> StatsRepo : findDistinctDatabaseNames()
+StatsRepo -> MySQL : SELECT DISTINCT database_name\nFROM table_stats
+StatsRepo --> Runner : List<String> dbs
+
+== Per-database fan-out ==
+loop for each database
+  Runner -> Runner : analyzeDatabase(...)\n[@Transactional]
+  activate Runner #LightBlue
+
+  group Pre-load: 3 reads bounded by tables-per-db
+    Runner -> OpsRepo : find(opType, *, *, db, *, *, *, page)
+    OpsRepo -> MySQL : SELECT * FROM table_operations\nWHERE operation_type=? AND database_name=?
+    OpsRepo --> Runner : currentOps : Map<uuid, TableOperationDto>
+
+    Runner -> HistRepo : findLatest(opType, page)
+    HistRepo -> MySQL : SELECT latest-per-table\nFROM table_operations_history
+    HistRepo --> Runner : latestHistory : Map<uuid, TableOperationsHistoryDto>
+
+    Runner -> StatsRepo : find(db, *, *, page)
+    StatsRepo -> MySQL : SELECT * FROM table_stats\nWHERE database_name=?
+    StatsRepo --> Runner : tables : List<TableDto>
+  end
+
+  group Per-table evaluation (continue on per-row failure)
+    loop for each table in this db
+      Runner -> Analyzer : isEnabled(table)
+      Analyzer --> Runner : enabled?
+      alt not enabled
+        note over Runner : continue
+      else enabled
+        Runner -> Analyzer : shouldSchedule(table, currentOp?, latestHistory?)
+        Analyzer -> Policy : shouldSchedule(currentOp?, latestHistory?)
+        alt active non-CANCELED currentOp
+          Policy --> Analyzer : false\n(scheduler owns this row)
+        else no currentOp (or CANCELED), no history
+          Policy --> Analyzer : true\n(first-time / post-cancel)
+        else no currentOp (or CANCELED), has history
+          Policy -> Policy : readyAfterHistoryEntry:\nnow - completedAt > intervalFor(status)
+          Policy --> Analyzer : true if elapsed,\nfalse otherwise
+        end
+        Analyzer --> Runner : decision
+        alt shouldSchedule == true
+          Runner -> OpsRepo : save(TableOperationDto.pending(...).toRow())
+          OpsRepo -> MySQL : INSERT INTO table_operations\n(status=PENDING, ...)
+          OpsRepo --> Runner : saved row
+          note right of Runner : try/catch RuntimeException —\nlog.error + continue;\nincrement {created} or {failed}
+        end
+      end
+    end
+  end
+
+  Runner -> Runner : log.info aggregate count\n(db, created, failed)
+  deactivate Runner
+end
+
+Runner --> CLR : analyze() returns
+deactivate Runner
+
+@enduml