Problem
The Vector aggregator deployment gets stuck and never reconnects to the NATS server after the NATS server enters lame duck mode. The error message observed is:
Impact
- Activity processing pipeline stops working
- Audit logs and events are not processed until Vector pods are manually restarted
- Requires manual intervention to restore functionality
Expected Behavior
Vector should automatically reconnect to NATS once the server exits lame duck mode or the cluster switches to a new leader, without requiring a pod restart.
Investigation Areas
- Vector NATS source configuration - reconnection settings
- NATS client library behavior in lame duck mode
- Kubernetes deployment/liveness probe configuration
- Potential need for connection retry logic or pod restart policy
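
As a starting point for the retry-logic item above, here is a minimal sketch of reconnecting with capped exponential backoff. It is not Vector's or the NATS client's actual implementation. The names `reconnect_with_backoff`, `backoff_delays`, and the `connect` callable are hypothetical, and the default delays are illustrative assumptions rather than recommended values.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, attempts: int) -> Iterator[float]:
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def reconnect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[float], float] = lambda d: random.uniform(0, d),
) -> T:
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    `connect` is a hypothetical callable that raises ConnectionError while
    the server is unavailable (for example, while it is draining clients).
    """
    last_error: Optional[Exception] = None
    delays = list(backoff_delays(base_delay, max_delay, max_attempts))
    for attempt, delay in enumerate(delays):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(jitter(delay))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

The jitter spreads reconnect attempts out so that many clients evicted at the same time do not all retry in lockstep. If the retries are exhausted, the raised error could be allowed to crash the process, so that the Kubernetes restart policy takes over instead of leaving the pod stuck.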
Reproduction
Occurs when the NATS server enters lame duck mode (e.g., during rolling updates or node drains).