S26-Distributed-Capstone · MaxFdev · May 20, 2026
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -76,6 +76,9 @@ sequenceDiagram
 - The user may request all versions of a secret, which will return a map of version to secret value
 - Receiving node requests k - 1 shards from nodes, and gets one from itself (minimum threshold to reconstruct)
 - Node reconstructs the plaintext secret in memory using Shamir's algorithm (requires k of n shards)
+- For latest-version reads, the node counts how many shards were actually available. If reconstruction succeeds but the cluster returned between `k` and `k + repairTriggerBuffer` shards (at or just above the reconstruction threshold), the node performs best-effort read repair before returning the value.
+- Read repair re-splits the reconstructed plaintext in memory and republishes shards for the same version through the internal prepare + Kafka commit flow. It does **not** create a new version.
+- Explicit historical version reads and all-version reads do not trigger repair.
 - **Plaintext exists only in memory during reconstruction, never written to disk**
 - Node returns the secret value to the client and clears it from memory
 
@@ -92,6 +95,12 @@ sequenceDiagram
     Node->>Cluster: Request k-1 additional shards via ScaleCube
     Cluster-->>Node: Return shards (encrypted in transit)
     Node->>Node: Reconstruct plaintext in memory<br/>using Shamir's algorithm (k of n shards)
+    opt Latest read has only k or k+buffer shards
+        Node->>Node: Re-split plaintext into same version shards
+        Node->>Cluster: Prepare repair shards
+        Cluster-->>Node: Repair ACKs
+        Node->>Kafka: Publish repair commit
+    end
     Node-->>Ingress: Return secret value
     Ingress-->>User: Secret value
 ```
@@ -211,3 +220,5 @@ graph LR
 - If failure occurs in the **ordering phase**, no shard writes are committed and the request fails.
 - If failure occurs in the **writing phase**, partially written shards are rolled back.
 - Recovered nodes rejoin automatically via ScaleCube and synchronize state from Kafka and peers.
+- Latest-version reads can also repair degraded shard placement when at least `k` shards remain. This read repair is best-effort: a successful GET still returns the reconstructed value even if repair cannot reach quorum.
+- Repair uses `ActionType.REPAIR`, stages replacement shards in memory, and commits through Kafka like other internal mutations. The committed shard keeps the existing version number.
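
The best-effort rule above (a GET still succeeds when repair fails) can be sketched as follows. This is a minimal illustration, not the project's implementation: `BestEffortReadRepair`, `RepairAction`, and `getWithRepair` are hypothetical names, and the `repair` callback stands in for the prepare + Kafka commit flow.

```java
import java.util.function.Supplier;

final class BestEffortReadRepair {

    /** Hypothetical stand-in for "prepare repair shards, then publish the Kafka commit". */
    interface RepairAction {
        void run(byte[] plaintext) throws Exception;
    }

    /**
     * Reconstructs the value, attempts repair if the read was degraded,
     * and returns the value regardless of whether repair succeeded.
     */
    static byte[] getWithRepair(Supplier<byte[]> reconstruct, boolean degraded, RepairAction repair) {
        byte[] plaintext = reconstruct.get();
        if (degraded) {
            try {
                repair.run(plaintext);
            } catch (Exception e) {
                // Assumption: a repair failure (for example, no quorum) is only logged
                // and never turns a successful read into an error.
                System.err.println("read repair skipped: " + e.getMessage());
            }
        }
        return plaintext;
    }
}
```

Keeping the repair call inside its own `try` block is what makes it best-effort: the read path has exactly one failure mode (reconstruction), and repair cannot add another.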
diff --git a/docs/challenges.md b/docs/challenges.md
@@ -6,34 +6,40 @@
 2. **Quorum-Based Reconstruction**  
    Reads collect at least `k` shards and reconstruct only in memory. If fewer than `k` shards are available, the read fails deterministically instead of returning partial or stale data.
 
-3. **Create vs Update Under Concurrency**  
+3. **Read Repair Under Degraded Replication**  
+   Latest-version reads also count how many shards were actually available. If a GET can reconstruct the value but received between `k` and `k + repairTriggerBuffer` shards, the coordinating node performs best-effort read repair before returning. Repair re-splits the reconstructed plaintext in memory and redistributes shards for the same version through the existing prepare + Kafka commit path. It does not create a new version, and it does not apply to explicit historical reads.
+
+4. **Create vs Update Under Concurrency**  
    Create requires non-existent key; update requires existing key. Both use the same Kafka-based two-phase write flow. This keeps write ordering consistent while preserving operation-specific preconditions.
 
-4. **Versioning and Time Metadata**  
+5. **Versioning and Time Metadata**  
    The DSV Worker attaches request timestamp metadata. Versions are committed in per-key Kafka order. This avoids relying on a global clock source while maintaining monotonic per-key history.
 
-5. **History and Validity Intervals**  
+6. **History and Validity Intervals**  
    Each version is independently stored and retrievable. `valid_from`/`valid_to` define active intervals. Intervals are updated during commits so historical reads can be served without ambiguity.
 
-6. **Replication of Authoritative State**  
+7. **Replication of Authoritative State**  
    Shards replicate through write quorum. Metadata converges through commit propagation and gossip. Any node can therefore answer existence/version queries from local replicated metadata.
 
-7. **Retries and Idempotency**  
+8. **Retries and Idempotency**  
    Safe retries return existing committed outcomes. Duplicate create returns `409`; duplicate identical update is idempotent. This lets clients retry on timeout without risking duplicate state transitions.
 
-8. **Namespace Isolation**  
+9. **Namespace Isolation**  
    Secrets are separated into logical namespaces (`user:key:version`) allowing different groups to reuse key names. Pre-condition checks are enforced on every request path before shard access.
 
-9. **Deterministic Failure Semantics**  
+10. **Deterministic Failure Semantics**  
    Precondition failures are stable (`409` for duplicate create, `404` for missing update/retrieve/delete). Equivalent requests against equivalent cluster state produce the same status code.
 
-10. **`.env` Batch Semantics**  
+11. **`.env` Batch Semantics**  
     `enc(NAME)` and `secret(NAME)` processing is all-or-nothing; failures roll back staged writes. Callers receive either a fully transformed file or a single error response.
 
-11. **Failure Phases for Writes**  
+12. **Failure Phases for Writes**  
     - **Ordering phase failure**: Kafka commit log write failed; no intent published.  
     - **Writing phase failure**: intent published but write quorum fails; partial writes roll back.
     Phase separation makes recovery behavior explicit and prevents ambiguous outcomes for in-flight writes.
 
-12. **Recovery and Availability**  
-    Nodes recover from durable storage, and rejoin automatically when healthy. Quorum rules determine whether reads/writes continue or fail fast during degraded periods.
+13. **Recovery and Availability**  
+    Nodes recover from durable storage and rejoin automatically when healthy. Quorum rules determine whether reads/writes continue or fail fast during degraded periods. Read repair improves availability after partial failures by restoring shard redundancy while reads are still reconstructable.
+
+14. **Repair vs Concurrent Mutation**  
+    Read repair follows snapshot-style GET semantics. If a GET reconstructs a value, it may return that value even if a PUT or DELETE commits immediately afterward. Repair is version-preserving, so a concurrent PUT creates a newer version rather than being overwritten by repair. The node does not recheck for a concurrent DELETE before returning the already-reconstructed value.
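
The version-preserving property in item 14 can be illustrated with a small in-memory model. This is a sketch under assumptions, not the project's storage layer: `VersionedShardStore` and its methods are hypothetical, a `String` stands in for a full shard set, and skipping repair for a version that no longer exists is an assumed behavior that the documents above do not specify.

```java
import java.util.TreeMap;

final class VersionedShardStore {
    // version number -> identifier of the shard set currently stored for that version
    private final TreeMap<Integer, String> shardSetByVersion = new TreeMap<>();

    /** A PUT always appends a new version; it never rewrites an existing one. */
    synchronized int put(String shardSet) {
        int next = shardSetByVersion.isEmpty() ? 1 : shardSetByVersion.lastKey() + 1;
        shardSetByVersion.put(next, shardSet);
        return next;
    }

    /** Repair replaces the shard set of an existing version and leaves all others alone. */
    synchronized boolean repair(int version, String freshShardSet) {
        if (!shardSetByVersion.containsKey(version)) {
            return false; // assumption: nothing to repair if the version is gone
        }
        shardSetByVersion.put(version, freshShardSet);
        return true;
    }

    synchronized int latestVersion() {
        return shardSetByVersion.lastKey();
    }

    synchronized String shardSet(int version) {
        return shardSetByVersion.get(version);
    }
}
```

If a PUT commits version 2 while a repair of version 1 is still in flight, the late repair only touches version 1, so the latest version and its shards are unaffected.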
diff --git a/docs/scope.md b/docs/scope.md
@@ -19,6 +19,7 @@ It will:
 - Reject update requests for secrets that do not exist
 - Accept secret retrieval requests for previously stored secrets
 - Accept requests to retrieve the version history of a secret
+- Repair latest-version shard placement during retrieval when the system can still reconstruct the value but available shards are close to the recovery threshold
 - Accept secret **delete** requests that remove enough stored shards to make reconstruction impossible
 - Reject delete requests for secrets that do not exist
 - Accept `.env` file content and:
@@ -48,6 +49,7 @@ It will:
 - Shard secret pieces to peer vault instances
 - Serve retrieval, history, and delete requests based on authoritative state
 - Remain available under partial failure (within range of accepted Shamir's recovery threshold)
+- Perform best-effort read repair for latest-version reads when between `k` and `k + repairTriggerBuffer` shards are available
 - Recover state on restart
 
 ---
@@ -71,6 +73,7 @@ It defines:
 - How secret deletion is defined and when a secret is considered non-reconstructable
 - What identifiers are used to reference secrets
 - How retries and concurrent requests are handled
+- How read repair behaves under concurrent updates and deletes
 - What duplicate and _not found_ errors mean
 
 The model must be documented and observable in practice.
@@ -87,6 +90,9 @@ It defines:
 - Shamir's Secret Keeping behavior:
   - A password is separated on a single node into n (configured by user) parts
   - The data is sent out to n - 1 other nodes, with 1 piece staying local to the machine that received the request directly
+- Read repair behavior:
+  - Latest-version GET requests that barely meet the reconstruction threshold may re-split the reconstructed value and restore shards for the same version
+  - Repair does not create a new version and does not apply to historical version reads
 - no master key required, any node can take requests to decode
 - The rule that plaintext secret bytes are never written to durable storage or passed to other nodes
 
@@ -103,6 +109,7 @@ It will:
 - Define delete request and response behavior, including threshold-based deletion success criteria
 - Specify duplicate and _not found_ error behavior
 - Describe durability and replication guarantees
+- Describe best-effort read repair when shard availability is degraded but still reconstructable
 - Describe secret-keeping and spreading behavior and failure behavior when referenced secrets cannot be resolved
 - Describe secret history retrieval semantics, including version ordering and validity timestamps
 - Describe `.env` encryption and expansion semantics, including secret creation and all-or-nothing failure
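
To make the "re-split the reconstructed value" step concrete, here is a minimal Shamir split/reconstruct sketch over a prime field. It is illustrative only: `ShamirSketch`, `Share`, and the choice of the Mersenne prime 2^127 - 1 are assumptions, the secret is modeled as a single field element rather than arbitrary bytes, and the coefficient sampling has a slight modulo bias that a production implementation would avoid.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

final class ShamirSketch {
    // 2^127 - 1, a Mersenne prime; the secret must be smaller than this value.
    static final BigInteger PRIME = BigInteger.valueOf(2).pow(127).subtract(BigInteger.ONE);

    static final class Share {
        final BigInteger x;
        final BigInteger y;
        Share(BigInteger x, BigInteger y) { this.x = x; this.y = y; }
    }

    /** Splits the secret into n shares, any k of which reconstruct it. */
    static List<Share> split(BigInteger secret, int k, int n, SecureRandom rng) {
        if (k < 1 || n < k) throw new IllegalArgumentException("require 1 <= k <= n");
        if (secret.signum() < 0 || secret.compareTo(PRIME) >= 0) {
            throw new IllegalArgumentException("secret out of field range");
        }
        // Random polynomial of degree k-1 whose constant term is the secret.
        BigInteger[] coeffs = new BigInteger[k];
        coeffs[0] = secret;
        for (int i = 1; i < k; i++) {
            coeffs[i] = new BigInteger(PRIME.bitLength(), rng).mod(PRIME);
        }
        List<Share> shares = new ArrayList<>();
        for (int x = 1; x <= n; x++) {
            BigInteger bx = BigInteger.valueOf(x);
            BigInteger y = BigInteger.ZERO;
            for (int i = k - 1; i >= 0; i--) { // Horner evaluation mod PRIME
                y = y.multiply(bx).add(coeffs[i]).mod(PRIME);
            }
            shares.add(new Share(bx, y));
        }
        return shares;
    }

    /** Lagrange interpolation at x = 0 over the given shares (at least k, distinct x). */
    static BigInteger reconstruct(List<Share> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (Share si : shares) {
            BigInteger num = BigInteger.ONE;
            BigInteger den = BigInteger.ONE;
            for (Share sj : shares) {
                if (si == sj) continue;
                num = num.multiply(sj.x.negate()).mod(PRIME);
                den = den.multiply(si.x.subtract(sj.x)).mod(PRIME);
            }
            BigInteger term = si.y.multiply(num).multiply(den.modInverse(PRIME)).mod(PRIME);
            secret = secret.add(term).mod(PRIME);
        }
        return secret;
    }
}
```

Each call to `split` draws a fresh random polynomial, so shares from a re-split are not interchangeable with shares from the original split. In this model, repair therefore has to republish a complete shard set for the version rather than patch in individual missing shards.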

diff --git a/src/main/java/edu/yu/capstone/DistributedSecretsVault/config/ClusterConfig.java b/src/main/java/edu/yu/capstone/DistributedSecretsVault/config/ClusterConfig.java
@@ -14,4 +14,6 @@ public class ClusterConfig {
     private int quorumM;
     private long lockTimeoutMillis;
     private long writeTimeoutMillis;
+    private boolean repairEnabled = true;
+    private int repairTriggerBuffer = 1;
 }
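
One way the two new settings could drive the repair decision is sketched below. The predicate is an assumption based on the documentation in this change, read as "repair when the shard count is at least `k` and at most `k + repairTriggerBuffer`". `RepairTrigger` and `shouldRepair` are hypothetical names, not classes from this repository.

```java
final class RepairTrigger {

    /**
     * Returns true when a read should attempt best-effort repair.
     * Assumed rule: repair is enabled, the read targets the latest version,
     * reconstruction is possible (available >= k), and the shard count is
     * within repairTriggerBuffer of the threshold k.
     */
    static boolean shouldRepair(boolean repairEnabled,
                                boolean latestVersionRead,
                                int availableShards,
                                int k,
                                int repairTriggerBuffer) {
        if (!repairEnabled || !latestVersionRead) {
            return false; // historical and all-version reads never trigger repair
        }
        if (availableShards < k) {
            return false; // cannot reconstruct, so there is nothing to re-split
        }
        return availableShards <= k + repairTriggerBuffer;
    }
}
```

With the defaults above (`repairEnabled = true`, `repairTriggerBuffer = 1`) and `k = 3`, this rule would trigger repair at 3 or 4 available shards and skip it at 5.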
diff --git a/src/main/java/edu/yu/capstone/DistributedSecretsVault/controller/InternalController.java b/src/main/java/edu/yu/capstone/DistributedSecretsVault/controller/InternalController.java
@@ -5,10 +5,12 @@
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.DeletePrepareRequest;
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.PostPrepareRequest;
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.PutPrepareRequest;
+import edu.yu.capstone.DistributedSecretsVault.dto.internal.RepairPrepareRequest;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.DeletePrepareHandler;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.InternalGetService;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.PostPrepareHandler;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.PutPrepareHandler;
+import edu.yu.capstone.DistributedSecretsVault.service.internal.RepairPrepareHandler;
 
 import java.util.Map;
 import java.util.UUID;
@@ -32,15 +34,18 @@ public class InternalController {
     private final PostPrepareHandler postPrepareHandler;
     private final PutPrepareHandler putPrepareHandler;
     private final DeletePrepareHandler deletePrepareHandler;
+    private final RepairPrepareHandler repairPrepareHandler;
 
     public InternalController(InternalGetService internalGetService,
             PostPrepareHandler postPrepareHandler,
             PutPrepareHandler putPrepareHandler,
-            DeletePrepareHandler deletePrepareHandler) {
+            DeletePrepareHandler deletePrepareHandler,
+            RepairPrepareHandler repairPrepareHandler) {
         this.internalGetService = internalGetService;
         this.postPrepareHandler = postPrepareHandler;
         this.putPrepareHandler = putPrepareHandler;
         this.deletePrepareHandler = deletePrepareHandler;
+        this.repairPrepareHandler = repairPrepareHandler;
     }
 
     @GetMapping("/{id}")
@@ -68,6 +73,12 @@ public ResponseEntity<Void> preparePut(@RequestBody PutPrepareRequest request) {
         return ResponseEntity.noContent().build();
     }
 
+    @PostMapping("/repair/prepare")
+    public ResponseEntity<Void> prepareRepair(@RequestBody RepairPrepareRequest request) {
+        repairPrepareHandler.handle(request);
+        return ResponseEntity.noContent().build();
+    }
+
     @DeleteMapping("/prepare")
     public ResponseEntity<Void> prepareDelete(
             @RequestParam("originatorNodeId") String originatorNodeId,

diff --git a/src/main/java/edu/yu/capstone/DistributedSecretsVault/dto/internal/RepairCommitRequest.java b/src/main/java/edu/yu/capstone/DistributedSecretsVault/dto/internal/RepairCommitRequest.java
@@ -0,0 +1,16 @@
+package edu.yu.capstone.DistributedSecretsVault.dto.internal;
+
+import java.util.UUID;
+
+import edu.yu.capstone.DistributedSecretsVault.domain.model.SecretKey;
+import lombok.AllArgsConstructor;
+import lombok.Data;
+import lombok.NoArgsConstructor;
+
+@Data
+@NoArgsConstructor
+@AllArgsConstructor
+public class RepairCommitRequest {
+    private UUID operationId;
+    private SecretKey secretKey;
+}
diff --git a/src/main/java/edu/yu/capstone/DistributedSecretsVault/dto/internal/RepairPrepareRequest.java b/src/main/java/edu/yu/capstone/DistributedSecretsVault/dto/internal/RepairPrepareRequest.java
@@ -0,0 +1,16 @@
+package edu.yu.capstone.DistributedSecretsVault.dto.internal;
+
+import java.util.UUID;
+
+import lombok.AllArgsConstructor;
+import lombok.Data;
+import lombok.NoArgsConstructor;
+
+@Data
+@NoArgsConstructor
+@AllArgsConstructor
+public class RepairPrepareRequest {
+    private String originatorNodeId;
+    private UUID operationId;
+    private SecretPartMessage secretPartMessage;
+}
diff --git a/.../java/edu/yu/capstone/DistributedSecretsVault/service/communication/CommitDispatcher.java b/.../java/edu/yu/capstone/DistributedSecretsVault/service/communication/CommitDispatcher.java
@@ -8,11 +8,13 @@
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.DeleteCommitRequest;
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.PostCommitRequest;
 import edu.yu.capstone.DistributedSecretsVault.dto.internal.PutCommitRequest;
+import edu.yu.capstone.DistributedSecretsVault.dto.internal.RepairCommitRequest;
 import edu.yu.capstone.DistributedSecretsVault.exceptions.InternalOperationConflictException;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.ActionType;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.DeleteCommitHandler;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.PostCommitHandler;
 import edu.yu.capstone.DistributedSecretsVault.service.internal.PutCommitHandler;
+import edu.yu.capstone.DistributedSecretsVault.service.internal.RepairCommitHandler;
 
 @Service
 public class CommitDispatcher {
@@ -21,13 +23,16 @@ public class CommitDispatcher {
     private final DeleteCommitHandler deleteCommitHandler;
     private final PostCommitHandler postCommitHandler;
     private final PutCommitHandler putCommitHandler;
+    private final RepairCommitHandler repairCommitHandler;
 
     public CommitDispatcher(DeleteCommitHandler deleteCommitHandler,
             PostCommitHandler postCommitHandler,
-            PutCommitHandler putCommitHandler) {
+            PutCommitHandler putCommitHandler,
+            RepairCommitHandler repairCommitHandler) {
         this.deleteCommitHandler = deleteCommitHandler;
         this.postCommitHandler = postCommitHandler;
         this.putCommitHandler = putCommitHandler;
+        this.repairCommitHandler = repairCommitHandler;
     }
 
     public void dispatch(CommitMessage message) {
@@ -39,6 +44,8 @@ public void dispatch(CommitMessage message) {
                 postCommitHandler.handle(new PostCommitRequest(message.getOperationId(), message.getSecretKey()));
             } else if (message.getActionType() == ActionType.PUT) {
                 putCommitHandler.handle(new PutCommitRequest(message.getOperationId(), message.getSecretKey()));
+            } else if (message.getActionType() == ActionType.REPAIR) {
+                repairCommitHandler.handle(new RepairCommitRequest(message.getOperationId(), message.getSecretKey()));
             } else {
                 log.warn("Ignoring unsupported commit action type: operationId={}, actionType={}",
                         message.getOperationId(), message.getActionType());
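
As the `if`/`else if` chain grows with each new `ActionType`, a table-driven dispatcher is one possible alternative. The sketch below is not a proposed change to `CommitDispatcher`: `DispatchSketch` is hypothetical, it redeclares a local `ActionType` so it compiles on its own, and a plain `String` operation id stands in for the typed commit requests.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

final class DispatchSketch {
    // Local copy of the enum so the example is self-contained.
    enum ActionType { POST, DELETE, PUT, REPAIR }

    private final Map<ActionType, Consumer<String>> handlers = new EnumMap<>(ActionType.class);

    void register(ActionType type, Consumer<String> handler) {
        handlers.put(type, handler);
    }

    /** Routes the operation to its handler; returns false for unsupported action types. */
    boolean dispatch(ActionType type, String operationId) {
        Consumer<String> handler = handlers.get(type);
        if (handler == null) {
            System.err.println("Ignoring unsupported commit action type: operationId="
                    + operationId + ", actionType=" + type);
            return false;
        }
        handler.accept(operationId);
        return true;
    }
}
```

The trade-off is that a lookup table loses the per-type request classes (`PostCommitRequest`, `RepairCommitRequest`, and so on) unless each handler builds its own, so the explicit chain in the diff may still be the clearer choice at four action types.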

diff --git a/src/main/java/edu/yu/capstone/DistributedSecretsVault/service/internal/ActionType.java b/src/main/java/edu/yu/capstone/DistributedSecretsVault/service/internal/ActionType.java
@@ -7,5 +7,6 @@
 public enum ActionType {
     POST,
     DELETE,
-    PUT
+    PUT,
+    REPAIR
 }