Problem
get_beat_schedule() in boost_collector_runner/schedule_config.py returns an empty dict when the YAML schedule file is missing or unparseable, logging only a warning. The system starts, serves health checks, and runs zero collections. More critically, when a collector within a group fails, the runner exits the entire group — downstream collectors in the same group are silently skipped for that cycle with no log entry. In a domain where silent data gaps are the primary operational fear, this fail-open behavior means an operator may not realize collections have stopped until someone queries a report and finds missing data.
Acceptance Criteria
- Change get_beat_schedule() to raise a clear error (or emit log.error and set an unhealthy flag) when the YAML file is missing in production mode, rather than silently returning {}
- Add per-collector outcome logging in the group runner: when a collector is skipped due to a predecessor's failure, log a warning with the skipped collector's name and the reason
- Add a startup health check that verifies the schedule YAML is present and parseable, and includes the loaded schedule summary in startup logs
- Add tests: (a) missing YAML raises/logs appropriately; (b) group-exit logs the skipped collectors
- Consider adding a --strict flag to run_scheduled_collectors that fails hard on any schedule misconfiguration
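One possible shape for the stricter loading behavior is sketched below. It is an illustration only, not the current code: load_beat_schedule, ScheduleConfigError, and the strict and parse parameters are hypothetical names, and the sketch assumes the schedule file is a YAML mapping read with PyYAML's yaml.safe_load.

```python
import logging
from pathlib import Path
from typing import Any, Callable

log = logging.getLogger(__name__)


class ScheduleConfigError(RuntimeError):
    """Hypothetical error type for a missing, unparseable, or empty schedule."""


def _parse_yaml(text: str) -> Any:
    # Imported lazily so the sketch can be exercised with a different parser.
    import yaml  # PyYAML

    return yaml.safe_load(text)


def load_beat_schedule(
    path: Path,
    strict: bool,
    parse: Callable[[str], Any] = _parse_yaml,
) -> dict:
    """Return the schedule mapping; in strict mode, raise instead of returning {}."""
    try:
        schedule = parse(path.read_text())
        if not isinstance(schedule, dict) or not schedule:
            raise ValueError("schedule is empty or not a mapping")
    except Exception as exc:  # missing file, parse error, or wrong shape
        if strict:
            raise ScheduleConfigError(
                f"cannot load schedule from {path}: {exc}"
            ) from exc
        # Non-strict mode keeps the old return value but logs at error level.
        log.error("cannot load schedule from %s: %s; no collections will run", path, exc)
        return {}
    # Startup summary so operators can see what was actually loaded.
    log.info("loaded %d schedule entries: %s", len(schedule), sorted(schedule))
    return schedule
```

In this sketch, production startup (or a --strict run) would pass strict=True so that a bad schedule stops the process, while development runs could pass strict=False and still get an error-level log line.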
Implementation Notes
The get_beat_schedule() function is at schedule_config.py:381-404. The group execution logic is in the collector runner's task dispatch. The simplest fix for the group-exit issue is to wrap each collector invocation in its own try/except block and continue to the next collector on failure, logging the error. This changes the semantics from "fail-fast group" to "best-effort group" — document the behavioral change. If fail-fast is intentional for some groups, add a fail_strategy: fast|continue option per group in the YAML schema.
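The per-collector try/except and the fail_strategy option could look roughly like the sketch below. It is a minimal illustration under assumptions: run_group, its signature, and the outcome strings are hypothetical, and collectors are modeled as zero-argument callables rather than the runner's real task objects.

```python
import logging
from typing import Callable, Dict, Mapping, Optional

log = logging.getLogger(__name__)


def run_group(
    group_name: str,
    collectors: Mapping[str, Callable[[], None]],
    fail_strategy: str = "continue",
) -> Dict[str, str]:
    """Run collectors in order and return a per-collector outcome map."""
    if fail_strategy not in ("fast", "continue"):
        raise ValueError(f"unknown fail_strategy: {fail_strategy!r}")

    outcomes: Dict[str, str] = {}
    first_failure: Optional[str] = None

    for name, collect in collectors.items():
        # Fail-fast groups stop running collectors, but every skip is logged.
        if first_failure is not None and fail_strategy == "fast":
            log.warning(
                "group %s: skipping %s because %s failed",
                group_name, name, first_failure,
            )
            outcomes[name] = "skipped"
            continue
        try:
            collect()
        except Exception:
            # Log with traceback, then let the loop decide what happens next.
            log.exception("group %s: collector %s failed", group_name, name)
            outcomes[name] = "failed"
            if first_failure is None:
                first_failure = name
        else:
            outcomes[name] = "ok"

    return outcomes
```

With fail_strategy="continue" a failure in one collector no longer prevents the rest of the group from running; with fail_strategy="fast" the old behavior is kept, but each skipped collector gets a warning that names it and the predecessor that failed.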