fix: Enhance cassandra code to have self heal functionality #1201

Merged
pallakartheekreddy merged 6 commits into develop from cassandra-self-heal
Apr 1, 2026
Conversation

@aimansharief aimansharief commented Mar 13, 2026

Summary

Enhances the Cassandra connector with self-healing capabilities so services can recover from transient Cassandra outages without requiring a restart.

Changes

CassandraConnector.java — Reconnect & Resilience

  • Startup retry with exponential backoff: New prepareSessionWithRetry() retries up to 30 times (configurable via cassandra.max.startup.retries) with full-jitter exponential backoff (2s base, 30s cap), allowing the service to tolerate Cassandra starting after the JVM.
  • Runtime reconnect: getSession() now detects closed sessions/clusters and re-establishes the connection using double-checked locking to avoid concurrent reconnect storms.
  • Driver-level resilience: Configured ExponentialReconnectionPolicy (1s–60s) and DefaultRetryPolicy on the Cluster builder so the driver handles transient failures internally.
  • Thread safety: Replaced HashMap with ConcurrentHashMap for session/cluster maps and boolean with AtomicBoolean for the shutdown hook guard.
  • Cluster lifecycle tracking: Added clusterMap to properly track and close Cluster objects (not just sessions) — prevents resource leaks on reconnect and shutdown.
  • Fail-fast on exhaustion: Throws ServerException if all startup retries are exhausted instead of silently starting without Cassandra.
  • Robust address parsing: getSocketAddress() defaults to port 9042 when the port is omitted from the connection string.
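
The address-parsing rule in the last bullet can be sketched as follows. This is a minimal illustration, not the connector's actual getSocketAddress(): the class name AddressParsingSketch, the method parse(), and the constant DEFAULT_CQL_PORT are hypothetical, and IPv6 literals are not handled.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the real getSocketAddress() in CassandraConnector may differ.
final class AddressParsingSketch {
    // Default CQL native-transport port, used when an entry omits the port.
    static final int DEFAULT_CQL_PORT = 9042;

    // Parses a comma-separated list such as "host1:9042, host2" into socket addresses.
    static List<InetSocketAddress> parse(String connectionString) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        for (String entry : connectionString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank entries such as a trailing comma
            }
            String[] parts = trimmed.split(":");
            String host = parts[0].trim();
            int port = parts.length > 1
                    ? Integer.parseInt(parts[1].trim())
                    : DEFAULT_CQL_PORT;
            // createUnresolved avoids a DNS lookup in this sketch.
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }
}
```

For example, parse("db1:9142, db2") yields db1 on port 9142 and db2 on the default port 9042.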

CassandraStore.java — Logging

  • Replaced all e.printStackTrace() calls with TelemetryManager.error() for structured logging.
  • Fixed misleading error message in upsertRecord(): "Invalid Identifier to read" → "Invalid request to upsert."

Notes

  • ProtocolVersion.V4 was previously hardcoded; the driver now auto-negotiates the protocol version. All target environments run Cassandra 3.x or later, which supports protocol v4+ natively.
  • The close() method is now synchronized to prevent races with concurrent getSession() reconnects during shutdown.
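
A rough sketch of how a synchronized close() and an AtomicBoolean shutdown-hook guard can fit together is shown below. It assumes nothing about the real CassandraConnector beyond what this description states: ShutdownSketch, track(), and registerShutdownHookOnce() are hypothetical names, and AutoCloseable stands in for the driver's Cluster type.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; AutoCloseable stands in for the driver's Cluster.
final class ShutdownSketch {
    private static final Map<String, AutoCloseable> CLUSTERS = new ConcurrentHashMap<>();
    private static final AtomicBoolean SHUTDOWN_HOOK_REGISTERED = new AtomicBoolean(false);

    // Remembers a cluster so it can be closed later, and makes sure the hook exists.
    static void track(String key, AutoCloseable cluster) {
        CLUSTERS.put(key, cluster);
        registerShutdownHookOnce();
    }

    // compareAndSet lets exactly one caller register the hook, even under contention.
    static boolean registerShutdownHookOnce() {
        if (SHUTDOWN_HOOK_REGISTERED.compareAndSet(false, true)) {
            Runtime.getRuntime().addShutdownHook(new Thread(ShutdownSketch::close));
            return true;
        }
        return false;
    }

    // synchronized so a concurrent caller holding the same lock cannot interleave with shutdown.
    static synchronized int close() {
        int closed = 0;
        for (Map.Entry<String, AutoCloseable> entry : CLUSTERS.entrySet()) {
            try {
                entry.getValue().close();
                closed++;
            } catch (Exception e) {
                System.err.println("Failed to close " + entry.getKey() + ": " + e.getMessage());
            }
        }
        CLUSTERS.clear();
        return closed;
    }
}
```

In this sketch a reconnect path would need to take the same class-level lock for the race protection to hold.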

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Configuration

| Property | Default | Description |
| --- | --- | --- |
| cassandra.max.startup.retries | 30 | Max connection attempts during startup |
| service.db.cassandra.enabled | true | Existing flag: skip Cassandra init when false |

How Has This Been Tested?

  • Verified compilation across all modules (mvn clean install -DskipTests)
  • Tested startup with Cassandra unavailable — retries with backoff, then fails fast
  • Tested runtime reconnect after Cassandra restart — session re-established transparently

Test Configuration:

  • Java 11, Scala 2.13, Play 3.0.5
  • Docker Cassandra 4.x

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

coderabbitai Bot commented Mar 13, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions

SonarCloud Analysis Results 🔍

Quality Gate Results for Services:

Please review the analysis results for each service. Ensure all quality gates are passing before merging.

All catch blocks in CassandraStore now log via TelemetryManager.error()
instead of printing stack traces, so errors surface in the platform
telemetry pipeline rather than being swallowed by stdout.

Also improves the upsert validation message from "Invalid Identifier to
read" to "Invalid request to upsert." for accuracy.
…draConnector

- Replace HashMap with ConcurrentHashMap for sessionMap; add clusterMap
  (ConcurrentHashMap<String, Cluster>) so the full Cluster lifecycle is
  tracked and can be properly closed, not just the Session.

- Use AtomicBoolean for shutdownHookRegistered (thread-safe CAS).

- getSession(): add double-checked locking so concurrent callers do not
  all race to reconnect; only one thread rebuilds the session inside the
  synchronized block while others wait.

- close(): close Cluster objects instead of Sessions; closing a Cluster
  releases its Session, connection pools, and driver background threads.

- prepareSession(): consolidate the two if/else builder paths into one
  (level == null guard). Close any previous Cluster for the same key to
  prevent resource leaks on reconnect.

- Add ExponentialReconnectionPolicy(1s..60s) and DefaultRetryPolicy so
  the driver automatically handles transient network blips at the
  connection level without application-layer intervention.

- Remove explicit ProtocolVersion.V4 pin so the driver auto-negotiates
  the highest protocol version supported by the server.

- Clean up getConnectionInfo() (add default: case, use hasPath guard)
  and getSocketAddress() (trim whitespace around host/port tokens).
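
The double-checked locking described for getSession() could look roughly like the sketch below. It is generic on purpose so it runs without the Cassandra driver: ReconnectingCache, connect, and isClosed are hypothetical names, and the type parameter S plays the role of the driver Session.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for the connector's session cache; S plays the role of Session.
final class ReconnectingCache<S> {
    private final ConcurrentMap<String, S> sessions = new ConcurrentHashMap<>();
    private final Function<String, S> connect; // builds a new session for a key
    private final Predicate<S> isClosed;       // reports whether a cached session is unusable

    ReconnectingCache(Function<String, S> connect, Predicate<S> isClosed) {
        this.connect = connect;
        this.isClosed = isClosed;
    }

    S get(String key) {
        // First check, without the lock: the common case is a healthy cached session.
        S current = sessions.get(key);
        if (current != null && !isClosed.test(current)) {
            return current;
        }
        synchronized (this) {
            // Second check, under the lock: another thread may have reconnected already.
            current = sessions.get(key);
            if (current == null || isClosed.test(current)) {
                current = connect.apply(key);
                if (current != null) {
                    sessions.put(key, current);
                }
            }
            return current;
        }
    }
}
```

Only the thread that wins the lock rebuilds the session; the others block briefly and then pick up the fresh one on the second check.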
…lization

When the service starts before Cassandra is ready (common in Docker /
Kubernetes deployments), the connector now retries the initial connection
up to 30 times instead of failing immediately.

- prepareSessionWithRetry(): retry loop used only at JVM startup (static
  initialiser). Uses exponential backoff with full jitter:
    sleep = random(0, min(cap, 30s)),  cap doubles from 2s up to 30s.

- prepareSessionOnce(): thin wrapper used at runtime (inside getSession's
  synchronized block) — logs success/failure but does not throw, so a
  single failed reconnect does not propagate an exception to the caller.

- MAX_STARTUP_RETRIES = 30, RETRY_BASE_MS = 2s, RETRY_MAX_MS = 30s
  (all tunable via these constants).
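
The retry loop with full-jitter backoff can be sketched as below, using the constants named in this commit message. This is an illustration under assumptions, not the body of prepareSessionWithRetry(): StartupRetrySketch, capForAttempt(), and retry() are hypothetical names, and IllegalStateException stands in for the project's ServerException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a startup retry loop with full-jitter exponential backoff.
final class StartupRetrySketch {
    static final int MAX_STARTUP_RETRIES = 30;
    static final long RETRY_BASE_MS = 2_000L;
    static final long RETRY_MAX_MS = 30_000L;

    // Upper bound of the jitter window after the given 1-based attempt:
    // starts at baseMs and doubles per attempt until it reaches maxMs.
    static long capForAttempt(int attempt, long baseMs, long maxMs) {
        long cap = baseMs;
        for (int i = 1; i < attempt && cap < maxMs; i++) {
            cap = Math.min(maxMs, cap * 2);
        }
        return Math.min(cap, maxMs);
    }

    static <T> T retry(Callable<T> connect, int maxAttempts, long baseMs, long maxMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the backoff
            }
            if (attempt == maxAttempts) {
                break; // no point sleeping after the final attempt
            }
            // Full jitter: sleep a uniformly random time in [0, cap].
            long cap = capForAttempt(attempt, baseMs, maxMs);
            long sleepMs = ThreadLocalRandom.current().nextLong(cap + 1);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting to retry", ie);
            }
        }
        // Fail fast once the budget is exhausted (ServerException in the real code).
        throw new IllegalStateException("All " + maxAttempts + " attempts failed", last);
    }
}
```

With a 2s base and a 30s cap, the jitter window grows 2s, 4s, 8s, 16s and then stays at 30s, so later attempts from many services spread out instead of retrying in lockstep.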
@aimansharief aimansharief force-pushed the cassandra-self-heal branch from 2a99ee8 to 549a889 Compare April 1, 2026 06:19

Copilot AI left a comment

Pull request overview

Enhances the Cassandra connector to be more resilient by adding startup retry/backoff and runtime reconnect behavior, while improving CassandraStore error logging.

Changes:

  • Added startup connection retry with exponential backoff + jitter and runtime reconnect logic guarded to avoid reconnect storms.
  • Introduced cluster lifecycle tracking/cleanup (clusterMap) and improved thread-safety (ConcurrentHashMap/AtomicBoolean).
  • Replaced printStackTrace() usages in CassandraStore with structured TelemetryManager.error() logging and adjusted one upsert validation message.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraConnector.java | Adds retry/backoff, reconnect logic, reconnection/retry policies, and tracks/closes Cluster instances. |
| platform-core/cassandra-connector/src/main/java/org/sunbird/cassandra/CassandraStore.java | Replaces console stack traces with structured telemetry logging; adjusts upsert validation message. |

- Throw ServerException on interrupted startup retry instead of silent return
- Close Cluster on connect() failure to prevent driver thread leaks
- Trim and filter blank entries from connection config to avoid parse errors
- Move upsertRecord validation outside try/catch to preserve error message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
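
Why moving the upsertRecord validation outside the try/catch preserves the message can be seen in a small sketch. The names below (UpsertValidationSketch, ClientError, the wrapper message) are hypothetical and do not reflect the actual CassandraStore code; only the validation message is taken from this PR.

```java
import java.util.Map;

// Hypothetical sketch; ClientError stands in for the platform's client-error exception.
final class UpsertValidationSketch {
    static final class ClientError extends RuntimeException {
        ClientError(String message) {
            super(message);
        }
    }

    static void upsert(Map<String, Object> request, Runnable write) {
        // Validate before entering the try block, so the catch-all below
        // cannot intercept this exception and replace its message.
        if (request == null || request.isEmpty()) {
            throw new ClientError("Invalid request to upsert.");
        }
        try {
            write.run();
        } catch (Exception e) {
            // Only genuine write failures are wrapped (message is illustrative).
            throw new IllegalStateException("Error while upserting record", e);
        }
    }
}
```

If the validation sat inside the try block, the catch clause would wrap it and the caller would see the generic write-failure message instead.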

@pallakartheekreddy pallakartheekreddy merged commit 3569fe3 into develop Apr 1, 2026
14 checks passed
@pallakartheekreddy pallakartheekreddy deleted the cassandra-self-heal branch April 1, 2026 10:46