fix: Enhance cassandra code to have self heal functionality #1201
pallakartheekreddy merged 6 commits into develop
SonarCloud Analysis Results 🔍
Quality Gate Results for Services:
Please review the analysis results for each service. Ensure all quality gates are passing before merging.
All catch blocks in CassandraStore now log via TelemetryManager.error() instead of printing stack traces, so errors surface in the platform telemetry pipeline rather than being swallowed by stdout. Also improves the upsert validation message from "Invalid Identifier to read" to "Invalid request to upsert." for accuracy.
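The logging change in this commit can be illustrated with a minimal sketch. `ErrorSink` and `Write` are hypothetical stand-ins introduced only for this example (the real `TelemetryManager.error()` signature is not shown in this PR), and rethrowing after logging is an assumption about the desired behavior rather than a description of the merged code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class UpsertLoggingSketch {
    /** Hypothetical stand-in for the telemetry logger; not the real TelemetryManager API. */
    interface ErrorSink {
        void error(String message, Throwable cause);
    }

    /** Hypothetical write operation, e.g. executing a bound statement. */
    interface Write {
        void run(Map<String, Object> request);
    }

    static void upsertRecord(Map<String, Object> request, Write write, ErrorSink telemetry) {
        // Validation uses the message introduced by this commit.
        if (request == null || request.isEmpty()) {
            throw new IllegalArgumentException("Invalid request to upsert.");
        }
        try {
            write.run(request);
        } catch (RuntimeException e) {
            // Report through the telemetry sink instead of e.printStackTrace().
            telemetry.error("Exception while upserting record: " + e.getMessage(), e);
            throw e; // assumption: the caller still needs to see the failure
        }
    }

    public static void main(String[] args) {
        List<String> logged = new ArrayList<>();
        ErrorSink sink = (msg, cause) -> logged.add(msg);
        try {
            upsertRecord(Map.of("id", "do_123"),
                    r -> { throw new IllegalStateException("write timeout"); },
                    sink);
        } catch (IllegalStateException expected) {
            System.out.println("logged: " + logged);
        }
    }
}
```

Passing the sink as a parameter keeps the sketch testable without the platform's telemetry classes.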
…draConnector
- Replace HashMap with ConcurrentHashMap for sessionMap; add clusterMap (ConcurrentHashMap<String, Cluster>) so the full Cluster lifecycle is tracked and can be properly closed, not just the Session.
- Use AtomicBoolean for shutdownHookRegistered (thread-safe CAS).
- getSession(): add double-checked locking so concurrent callers do not all race to reconnect; only one thread rebuilds the session inside the synchronized block while others wait.
- close(): close Cluster objects instead of Sessions; closing a Cluster releases its Session, connection pools, and driver background threads.
- prepareSession(): consolidate the two if/else builder paths into one (level == null guard). Close any previous Cluster for the same key to prevent resource leaks on reconnect.
- Add ExponentialReconnectionPolicy(1s..60s) and DefaultRetryPolicy so the driver automatically handles transient network blips at the connection level without application-layer intervention.
- Remove explicit ProtocolVersion.V4 pin so the driver auto-negotiates the highest protocol version supported by the server.
- Clean up getConnectionInfo() (add default: case, use hasPath guard) and getSocketAddress() (trim whitespace around host/port tokens).
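The double-checked locking and the CAS-guarded shutdown hook described in this commit can be sketched roughly as follows. This is not the connector's code: `FakeSession` is a hypothetical stand-in for the driver's Session, and `connectCount` exists only so the example can show that a single reconnect happens.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class SessionRegistry {
    /** Hypothetical stand-in for a driver Session; not the Cassandra driver API. */
    static final class FakeSession {
        private volatile boolean closed;
        boolean isClosed() { return closed; }
        void close() { closed = true; }
    }

    private final Map<String, FakeSession> sessionMap = new ConcurrentHashMap<>();
    private final AtomicBoolean shutdownHookRegistered = new AtomicBoolean(false);
    final AtomicInteger connectCount = new AtomicInteger(); // instrumentation for the demo only

    FakeSession getSession(String key) {
        // First check, without a lock: the common case is a healthy cached session.
        FakeSession session = sessionMap.get(key);
        if (session != null && !session.isClosed()) {
            return session;
        }
        synchronized (this) {
            // Second check, under the lock: another thread may have reconnected already.
            session = sessionMap.get(key);
            if (session == null || session.isClosed()) {
                connectCount.incrementAndGet();
                session = new FakeSession();
                sessionMap.put(key, session);
                registerShutdownHookOnce();
            }
            return session;
        }
    }

    private void registerShutdownHookOnce() {
        // compareAndSet succeeds for exactly one caller, so the hook is added once.
        if (shutdownHookRegistered.compareAndSet(false, true)) {
            Runtime.getRuntime().addShutdownHook(new Thread(this::close));
        }
    }

    synchronized void close() {
        sessionMap.values().forEach(FakeSession::close);
        sessionMap.clear();
    }

    public static void main(String[] args) {
        SessionRegistry registry = new SessionRegistry();
        FakeSession first = registry.getSession("lp");
        first.close(); // simulate a dropped session
        FakeSession second = registry.getSession("lp");
        System.out.println("reconnected: " + (first != second));
    }
}
```

Locking on the registry as a whole keeps the sketch short; a per-key lock would reduce contention if many keys reconnect at once.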
…lization
When the service starts before Cassandra is ready (common in Docker /
Kubernetes deployments), the connector now retries the initial connection
up to 30 times instead of failing immediately.
- prepareSessionWithRetry(): retry loop used only at JVM startup (static
initialiser). Uses exponential backoff with full jitter:
sleep = random(0, min(cap, 30s)), cap doubles from 2s up to 30s.
- prepareSessionOnce(): thin wrapper used at runtime (inside getSession's
synchronized block) — logs success/failure but does not throw, so a
single failed reconnect does not propagate an exception to the caller.
- MAX_STARTUP_RETRIES = 30, RETRY_BASE_MS = 2s, RETRY_MAX_MS = 30s
(all tunable via these constants).
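A minimal sketch of the full-jitter backoff described in this commit, assuming the constants named above. `withRetry` and `capForAttempt` are illustrative names, and `IllegalStateException` stands in for the project's ServerException, which is not available in a standalone example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

final class StartupRetry {
    static final int MAX_STARTUP_RETRIES = 30;
    static final long RETRY_BASE_MS = 2_000L;
    static final long RETRY_MAX_MS = 30_000L;

    /** Upper bound of the sleep before retrying after attempt N (1-based): base doubles, then is capped. */
    static long capForAttempt(int attempt, long baseMs, long maxMs) {
        int shift = Math.min(Math.max(attempt - 1, 0), 20); // bounded shift avoids overflow
        return Math.min(maxMs, baseMs << shift);
    }

    /** Full jitter: a uniform random sleep in [0, cap]. */
    static long fullJitterSleepMs(int attempt, long baseMs, long maxMs) {
        long cap = capForAttempt(attempt, baseMs, maxMs);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    static <T> T withRetry(Supplier<T> connect, int maxRetries, long baseMs, long maxMs) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return connect.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxRetries) {
                    break; // no point sleeping after the final attempt
                }
                try {
                    Thread.sleep(fullJitterSleepMs(attempt, baseMs, maxMs));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted during startup retry", ie);
                }
            }
        }
        throw new IllegalStateException("Connection failed after " + maxRetries + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Short delays so the demo finishes quickly; startup would use the constants above.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("Cassandra not ready");
            return "connected";
        }, MAX_STARTUP_RETRIES, 10L, 50L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With the 2s base and 30s cap, the upper bound reaches the cap on the fifth attempt (2s, 4s, 8s, 16s, then 30s), so later retries all draw from the same 0 to 30s range.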
2a99ee8 to 549a889
Pull request overview
Enhances the Cassandra connector to be more resilient by adding startup retry/backoff and runtime reconnect behavior, while improving CassandraStore error logging.
Changes:
- Added startup connection retry with exponential backoff + jitter and runtime reconnect logic guarded to avoid reconnect storms.
- Introduced cluster lifecycle tracking/cleanup (clusterMap) and improved thread-safety (ConcurrentHashMap/AtomicBoolean).
- Replaced printStackTrace() usages in CassandraStore with structured TelemetryManager.error() logging and adjusted one upsert validation message.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraConnector.java | Adds retry/backoff, reconnect logic, reconnection/retry policies, and tracks/closes Cluster instances. |
| platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraStore.java | Replaces console stack traces with structured telemetry logging; adjusts upsert validation message. |
- Throw ServerException on interrupted startup retry instead of silent return
- Close Cluster on connect() failure to prevent driver thread leaks
- Trim and filter blank entries from connection config to avoid parse errors
- Move upsertRecord validation outside try/catch to preserve error message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
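The trimming and blank-entry filtering of the connection config might look like the sketch below. The comma-separated `host:port` format and the `ContactPoints.parse` name are assumptions made for illustration; the default port of 9042 comes from the PR description. Addresses are created unresolved so the example does not perform DNS lookups, which may differ from the connector's behavior, and IPv6 literals are not handled.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

final class ContactPoints {
    static final int DEFAULT_PORT = 9042;

    /** Parses "host[:port]" entries separated by commas, skipping blanks and trimming tokens. */
    static List<InetSocketAddress> parse(String connectionInfo) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        if (connectionInfo == null) {
            return addresses;
        }
        for (String entry : connectionInfo.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // blank entry, e.g. from a trailing comma
            }
            String[] parts = trimmed.split(":");
            String host = parts[0].trim();
            int port = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : DEFAULT_PORT;
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }

    public static void main(String[] args) {
        for (InetSocketAddress address : parse(" 10.0.0.1:9042, ,10.0.0.2 , cass-3 : 9142 ")) {
            System.out.println(address.getHostString() + " -> " + address.getPort());
        }
    }
}
```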



Summary
Enhances the Cassandra connector with self-healing capabilities so services can recover from transient Cassandra outages without requiring a restart.
Changes
CassandraConnector.java: Reconnect & Resilience
- prepareSessionWithRetry() retries up to 30 times (configurable via cassandra.max.startup.retries) with full-jitter exponential backoff (2s base, 30s cap), allowing the service to tolerate Cassandra starting after the JVM.
- getSession() now detects closed sessions/clusters and re-establishes the connection using double-checked locking to avoid concurrent reconnect storms.
- Added ExponentialReconnectionPolicy (1s–60s) and DefaultRetryPolicy on the Cluster builder so the driver handles transient failures internally.
- Replaced HashMap with ConcurrentHashMap for session/cluster maps and boolean with AtomicBoolean for the shutdown hook guard.
- Added clusterMap to properly track and close Cluster objects (not just sessions), which prevents resource leaks on reconnect and shutdown.
- Throws ServerException if all startup retries are exhausted instead of silently starting without Cassandra.
- getSocketAddress() defaults to port 9042 when the port is omitted from the connection string.
CassandraStore.java: Logging
- Replaced e.printStackTrace() calls with TelemetryManager.error() for structured logging.
- Corrected the validation message in upsertRecord(): "Invalid Identifier to read" → "Invalid request to upsert."
Notes
- ProtocolVersion.V4 was previously hardcoded; the connector now uses driver auto-negotiation. All target environments run Cassandra 3.x+, which supports protocol v4+ natively.
- The close() method is now synchronized to prevent races with concurrent getSession() reconnects during shutdown.
Type of change
Configuration
- cassandra.max.startup.retries: 30
- service.db.cassandra.enabled: true
How Has This Been Tested?
mvn clean install -DskipTests)
Test Configuration:
Checklist