fix: Enhance cassandra code to have self heal functionality by aimansharief · Pull Request #1201 · Sunbird-Knowlg/knowledge-platform

aimansharief · 2026-03-13T07:04:24Z

Summary

Enhances the Cassandra connector with self-healing capabilities so services can recover from transient Cassandra outages without requiring a restart.

Changes

CassandraConnector.java — Reconnect & Resilience

Startup retry with exponential backoff: New prepareSessionWithRetry() retries up to 30 times (configurable via cassandra.max.startup.retries) with full-jitter exponential backoff (2s base, 30s cap), allowing the service to tolerate Cassandra starting after the JVM.
Runtime reconnect: getSession() now detects closed sessions/clusters and re-establishes the connection using double-checked locking to avoid concurrent reconnect storms.
Driver-level resilience: Configured ExponentialReconnectionPolicy (1s–60s) and DefaultRetryPolicy on the Cluster builder so the driver handles transient failures internally.
Thread safety: Replaced HashMap with ConcurrentHashMap for session/cluster maps and boolean with AtomicBoolean for the shutdown hook guard.
Cluster lifecycle tracking: Added clusterMap to properly track and close Cluster objects (not just sessions) — prevents resource leaks on reconnect and shutdown.
Fail-fast on exhaustion: Throws ServerException if all startup retries are exhausted instead of silently starting without Cassandra.
Robust address parsing: getSocketAddress() defaults to port 9042 when the port is omitted from the connection string.
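The address-parsing item above can be illustrated roughly as follows. This is a minimal sketch, not the code from this PR: the class name `AddressParsing` and the method `parse` are hypothetical, it assumes entries of the form `host` or `host:port`, and it does not handle IPv6 literals. `InetSocketAddress.createUnresolved` is used so the sketch performs no DNS lookup.

```java
import java.net.InetSocketAddress;

// Hypothetical helper, for illustration only; not the PR's getSocketAddress().
final class AddressParsing {
    static final int DEFAULT_PORT = 9042; // Cassandra native-protocol default port

    // Parses "host" or "host:port", tolerating whitespace around both tokens.
    static InetSocketAddress parse(String hostPort) {
        String[] parts = hostPort.trim().split(":");
        String host = parts[0].trim();
        int port = parts.length > 1
                ? Integer.parseInt(parts[1].trim())
                : DEFAULT_PORT; // port omitted: fall back to the default
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

With this sketch, `AddressParsing.parse("10.0.0.5")` yields port 9042, while `AddressParsing.parse("cassandra-1 : 9043")` keeps the explicit port.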

CassandraStore.java — Logging

Replaced all e.printStackTrace() calls with TelemetryManager.error() for structured logging.
Fixed misleading error message in upsertRecord(): "Invalid Identifier to read" → "Invalid request to upsert."

Notes

ProtocolVersion.V4 was previously hardcoded; now uses driver auto-negotiation. All target environments run Cassandra 3.x+ which supports protocol v4+ natively.
The close() method is now synchronized to prevent races with concurrent getSession() reconnects during shutdown.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Configuration

Property	Default	Description
`cassandra.max.startup.retries`	`30`	Max connection attempts during startup
`service.db.cassandra.enabled`	`true`	Existing flag — skip Cassandra init when false

How Has This Been Tested?

Verified compilation across all modules (mvn clean install -DskipTests)
Tested startup with Cassandra unavailable — retries with backoff, then fails fast
Tested runtime reconnect after Cassandra restart — session re-established transparently

Test Configuration:

Java 11, Scala 2.13, Play 3.0.5
Docker Cassandra 4.x

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

coderabbitai · 2026-03-13T07:04:35Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


github-actions · 2026-03-13T07:13:07Z

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

All catch blocks in CassandraStore now log via TelemetryManager.error() instead of printing stack traces, so errors surface in the platform telemetry pipeline rather than being swallowed by stdout. Also improves the upsert validation message from "Invalid Identifier to read" to "Invalid request to upsert." for accuracy.

…draConnector
- Replace HashMap with ConcurrentHashMap for sessionMap; add clusterMap (ConcurrentHashMap<String, Cluster>) so the full Cluster lifecycle is tracked and can be properly closed, not just the Session.
- Use AtomicBoolean for shutdownHookRegistered (thread-safe CAS).
- getSession(): add double-checked locking so concurrent callers do not all race to reconnect; only one thread rebuilds the session inside the synchronized block while others wait.
- close(): close Cluster objects instead of Sessions; closing a Cluster releases its Session, connection pools, and driver background threads.
- prepareSession(): consolidate the two if/else builder paths into one (level == null guard). Close any previous Cluster for the same key to prevent resource leaks on reconnect.
- Add ExponentialReconnectionPolicy(1s..60s) and DefaultRetryPolicy so the driver automatically handles transient network blips at the connection level without application-layer intervention.
- Remove explicit ProtocolVersion.V4 pin so the driver auto-negotiates the highest protocol version supported by the server.
- Clean up getConnectionInfo() (add default: case, use hasPath guard) and getSocketAddress() (trim whitespace around host/port tokens).
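The double-checked locking described for getSession() could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `SessionRegistry`, `Conn`, and the `connector` function are hypothetical stand-ins for the connector class, the driver's Session, and the reconnect routine. Following the commit notes, a failed reconnect is modelled as returning null rather than throwing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a reconnecting session lookup; names are illustrative.
final class SessionRegistry {
    /** Stand-in for the driver's Session; only what the sketch needs. */
    interface Conn { boolean isClosed(); }

    private final Map<String, Conn> sessions = new ConcurrentHashMap<>();
    private final Function<String, Conn> connector; // may return null on failure

    SessionRegistry(Function<String, Conn> connector) {
        this.connector = connector;
    }

    Conn getSession(String key) {
        // First check, without the lock: the common case is a healthy session.
        Conn current = sessions.get(key);
        if (current != null && !current.isClosed()) {
            return current;
        }
        synchronized (this) {
            // Second check, under the lock: another thread may have reconnected
            // while this one was waiting.
            current = sessions.get(key);
            if (current != null && !current.isClosed()) {
                return current;
            }
            Conn fresh = connector.apply(key);
            if (fresh != null) {
                sessions.put(key, fresh); // ConcurrentHashMap rejects null values
            }
            return fresh;
        }
    }
}
```

Only the thread that finds a missing or closed session under the lock pays for the reconnect; callers that arrive afterwards see the fresh entry on the second check and return it without reconnecting again.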

…lization
When the service starts before Cassandra is ready (common in Docker / Kubernetes deployments), the connector now retries the initial connection up to 30 times instead of failing immediately.
- prepareSessionWithRetry(): retry loop used only at JVM startup (static initialiser). Uses exponential backoff with full jitter: sleep = random(0, min(cap, 30s)), cap doubles from 2s up to 30s.
- prepareSessionOnce(): thin wrapper used at runtime (inside getSession's synchronized block); logs success/failure but does not throw, so a single failed reconnect does not propagate an exception to the caller.
- MAX_STARTUP_RETRIES = 30, RETRY_BASE_MS = 2s, RETRY_MAX_MS = 30s (all tunable via these constants).
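The full-jitter backoff described above can be sketched as follows. This is an assumption-laden illustration rather than the PR's prepareSessionWithRetry(): the class `StartupRetry` and its method names are hypothetical, and a generic `Callable` stands in for the connection attempt. The delay ceiling starts at the base, doubles per attempt, and is capped; the actual sleep is drawn uniformly from 0 up to that ceiling.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of startup retry with full-jitter exponential backoff.
final class StartupRetry {
    // Ceiling for the given 1-based attempt: base, 2*base, 4*base, ... capped at maxMs.
    static long ceilingMs(int attempt, long baseMs, long maxMs) {
        long ceiling = baseMs;
        for (int i = 1; i < attempt && ceiling < maxMs; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, maxMs);
    }

    // Full jitter: a uniformly random delay in [0, ceiling].
    static long jitteredDelayMs(int attempt, long baseMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt, baseMs, maxMs) + 1);
    }

    // Calls connect until it succeeds or maxAttempts is reached, then fails fast.
    static <T> T retry(Callable<T> connect, int maxAttempts, long baseMs, long maxMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // An InterruptedException here propagates and ends the loop.
                    Thread.sleep(jitteredDelayMs(attempt, baseMs, maxMs));
                }
            }
        }
        throw new IllegalStateException("All " + maxAttempts + " attempts failed", last);
    }
}
```

With the values quoted in the commit message (2s base, 30s cap), the ceilings run 2s, 4s, 8s, 16s, 30s, 30s, and so on. The randomised sleep spreads out reconnect attempts from service instances that all started at the same moment.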

github-actions · 2026-04-01T06:27:31Z

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

github-actions · 2026-04-01T06:59:16Z

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

Copilot

Pull request overview

Enhances the Cassandra connector to be more resilient by adding startup retry/backoff and runtime reconnect behavior, while improving CassandraStore error logging.

Changes:

Added startup connection retry with exponential backoff + jitter and runtime reconnect logic guarded to avoid reconnect storms.
Introduced cluster lifecycle tracking/cleanup (clusterMap) and improved thread-safety (ConcurrentHashMap/AtomicBoolean).
Replaced printStackTrace() usages in CassandraStore with structured TelemetryManager.error() logging and adjusted one upsert validation message.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraConnector.java	Adds retry/backoff, reconnect logic, reconnection/retry policies, and tracks/closes Cluster instances.
platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraStore.java	Replaces console stack traces with structured telemetry logging; adjusts upsert validation message.


- Throw ServerException on interrupted startup retry instead of silent return
- Close Cluster on connect() failure to prevent driver thread leaks
- Trim and filter blank entries from connection config to avoid parse errors
- Move upsertRecord validation outside try/catch to preserve error message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
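The "close Cluster on connect() failure" fix can be sketched as below. This is a hedged illustration, not the PR's code: `SafeConnect` and `ClusterLike` are hypothetical names, with `ClusterLike` standing in for the driver's Cluster so the example needs no driver dependency. The point is that a resource which owns background threads is closed when connecting through it fails, and the original failure is still rethrown.

```java
// Hypothetical sketch: release a cluster-like resource when connect() fails.
final class SafeConnect {
    /** Stand-in for the driver's Cluster; illustrative only. */
    interface ClusterLike extends AutoCloseable {
        Object connect();        // may throw a RuntimeException
        @Override void close();  // narrowed: no checked exception
    }

    static Object connectOrClose(ClusterLike cluster) {
        try {
            return cluster.connect();
        } catch (RuntimeException e) {
            try {
                cluster.close(); // free pools and threads held by the failed cluster
            } catch (RuntimeException closeError) {
                e.addSuppressed(closeError); // keep the original failure primary
            }
            throw e;
        }
    }
}
```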

sonarqubecloud · 2026-04-01T10:32:34Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-04-01T10:32:40Z

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

aimansharief added 4 commits April 1, 2026 11:16

fix: Added cassandra reconnect functionality

549a889

aimansharief force-pushed the cassandra-self-heal branch from 2a99ee8 to 549a889 Compare April 1, 2026 06:19

fix: Optimised cassandra reconnect functionality

dbf418b

pallakartheekreddy requested a review from Copilot April 1, 2026 10:04

Copilot started reviewing on behalf of pallakartheekreddy April 1, 2026 10:05 View session

Copilot AI reviewed Apr 1, 2026


pallakartheekreddy approved these changes Apr 1, 2026


pallakartheekreddy merged commit 3569fe3 into develop Apr 1, 2026
14 checks passed

pallakartheekreddy deleted the cassandra-self-heal branch April 1, 2026 10:46

pallakartheekreddy merged 6 commits into develop from cassandra-self-heal
