Merged
11 changes: 11 additions & 0 deletions docs/architecture.md
@@ -76,6 +76,9 @@ sequenceDiagram
- The user may request all versions of a secret, which will return a map of version to secret value
- Receiving node requests k - 1 shards from nodes, and gets one from itself (minimum threshold to reconstruct)
- Node reconstructs the plaintext secret in memory using Shamir's algorithm (requires k of n shards)
- For latest-version reads, the node checks how many shards were actually available. If reconstruction succeeds but the cluster only returned `k` or `k + repairTriggerBuffer` shards, the node performs best-effort read repair before returning the value.
- Read repair re-splits the reconstructed plaintext in memory and republishes shards for the same version through the internal prepare + Kafka commit flow. It does **not** create a new version.
- Explicit historical version reads and all-version reads do not trigger repair.
- **Plaintext exists only in memory during reconstruction, never written to disk**
- Node returns the secret value to the client and clears it from memory
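
The repair trigger described in the bullets above can be sketched as a small predicate. This is a minimal sketch, not the PR's code: the class and method names are hypothetical, and it assumes that "only `k` or `k + repairTriggerBuffer` shards" means any available count from `k` up to `k + repairTriggerBuffer` inclusive.

```java
// Hypothetical helper; names are illustrative and do not come from the PR.
final class RepairTriggerSketch {
    private RepairTriggerSketch() {}

    /**
     * Decides whether a read should attempt best-effort repair.
     * Assumption: repair fires when k <= availableShards <= k + repairTriggerBuffer.
     */
    static boolean shouldRepair(boolean repairEnabled,
                                boolean latestVersionRead,
                                int availableShards,
                                int k,
                                int repairTriggerBuffer) {
        // Historical and all-version reads never trigger repair.
        if (!repairEnabled || !latestVersionRead) {
            return false;
        }
        // Below k the secret cannot be reconstructed, so there is nothing to re-split.
        if (availableShards < k) {
            return false;
        }
        // At or near the threshold: redundancy is degraded, so attempt repair.
        return availableShards <= k + repairTriggerBuffer;
    }
}
```

With `k = 3` and the default `repairTriggerBuffer = 1`, a latest-version read that found 3 or 4 shards would attempt repair under this reading, while one that found 5 would not.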

@@ -92,6 +95,12 @@ sequenceDiagram
Node->>Cluster: Request k-1 additional shards via ScaleCube
Cluster-->>Node: Return shards (encrypted in transit)
Node->>Node: Reconstruct plaintext in memory<br/>using Shamir's algorithm (k of n shards)
opt Latest read has only k or k+buffer shards
Node->>Node: Re-split plaintext into same version shards
Node->>Cluster: Prepare repair shards
Cluster-->>Node: Repair ACKs
Node->>Kafka: Publish repair commit
end
Node-->>Ingress: Return secret value
Ingress-->>User: Secret value
```
@@ -211,3 +220,5 @@ graph LR
- If failure occurs in the **ordering phase**, no shard writes are committed and the request fails.
- If failure occurs in the **writing phase**, partially written shards are rolled back.
- Recovered nodes rejoin automatically via ScaleCube and synchronize state from Kafka and peers.
- Latest-version reads can also repair degraded shard placement when at least `k` shards remain. This read repair is best-effort: a successful GET still returns the reconstructed value even if repair cannot reach quorum.
- Repair uses `ActionType.REPAIR`, stages replacement shards in memory, and commits through Kafka like other internal mutations. The committed shard keeps the existing version number.
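
The best-effort rule in the last two bullets (a GET still returns the value when repair cannot reach quorum) can be sketched as follows. This is a minimal sketch under stated assumptions: `BestEffortReadSketch`, `ReadResult`, and `readWithRepair` are hypothetical names, and reconstruction and repair are passed in as callbacks rather than modeled on the real services.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the PR.
final class BestEffortReadSketch {
    /** Outcome of a GET: the reconstructed value and whether repair completed. */
    static final class ReadResult {
        final byte[] value;
        final boolean repaired;

        ReadResult(byte[] value, boolean repaired) {
            this.value = value;
            this.repaired = repaired;
        }
    }

    private BestEffortReadSketch() {}

    static ReadResult readWithRepair(Supplier<byte[]> reconstruct,
                                     Runnable repair,
                                     boolean repairNeeded) {
        // A reconstruction failure (fewer than k shards) propagates to the caller.
        byte[] value = reconstruct.get();
        boolean repaired = false;
        if (repairNeeded) {
            try {
                repair.run();
                repaired = true;
            } catch (RuntimeException e) {
                // Best-effort: a failed repair must not fail the read itself.
            }
        }
        return new ReadResult(value, repaired);
    }
}
```

The design choice illustrated here is that only the reconstruction step can fail the request; the repair step is isolated so that its failure changes nothing the client observes.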
28 changes: 17 additions & 11 deletions docs/challenges.md
@@ -6,34 +6,40 @@
 2. **Quorum-Based Reconstruction**
 Reads collect at least `k` shards and reconstruct only in memory. If fewer than `k` shards are available, the read fails deterministically instead of returning partial or stale data.

-3. **Create vs Update Under Concurrency**
+3. **Read Repair Under Degraded Replication**
+Latest-version reads also measure how many shards were actually available. If a GET can reconstruct the value but only has `k` or `k + repairTriggerBuffer` shards, the coordinating node performs best-effort read repair before returning. Repair re-splits the reconstructed plaintext in memory and redistributes shards for the same version through the existing prepare + Kafka commit path. It does not create a new version, and it does not apply to explicit historical reads.
+
+4. **Create vs Update Under Concurrency**
 Create requires non-existent key; update requires existing key. Both use the same Kafka-based two-phase write flow. This keeps write ordering consistent while preserving operation-specific preconditions.

-4. **Versioning and Time Metadata**
+5. **Versioning and Time Metadata**
 The DSV Worker attaches request timestamp metadata. Versions are committed in per-key Kafka order. This avoids relying on a global clock source while maintaining monotonic per-key history.

-5. **History and Validity Intervals**
+6. **History and Validity Intervals**
 Each version is independently stored and retrievable. `valid_from`/`valid_to` define active intervals. Intervals are updated during commits so historical reads can be served without ambiguity.

-6. **Replication of Authoritative State**
+7. **Replication of Authoritative State**
 Shards replicate through write quorum. Metadata converges through commit propagation and gossip. Any node can therefore answer existence/version queries from local replicated metadata.

-7. **Retries and Idempotency**
+8. **Retries and Idempotency**
 Safe retries return existing committed outcomes. Duplicate create returns `409`; duplicate identical update is idempotent. This lets clients retry on timeout without risking duplicate state transitions.

-8. **Namespace Isolation**
+9. **Namespace Isolation**
 Secrets are separated into logical namespaces (`user:key:version`) allowing different groups to reuse key names. Pre-condition checks are enforced on every request path before shard access.

-9. **Deterministic Failure Semantics**
+10. **Deterministic Failure Semantics**
 Precondition failures are stable (`409` for duplicate create, `404` for missing update/retrieve/delete). Equivalent requests against equivalent cluster state produce the same status code.

-10. **`.env` Batch Semantics**
+11. **`.env` Batch Semantics**
 `enc(NAME)` and `secret(NAME)` processing is all-or-nothing; failures roll back staged writes. Callers receive either a fully transformed file or a single error response.

-11. **Failure Phases for Writes**
+12. **Failure Phases for Writes**
 - **Ordering phase failure**: Kafka commit log write failed; no intent published.
 - **Writing phase failure**: intent published but write quorum fails; partial writes roll back.
 Phase separation makes recovery behavior explicit and prevents ambiguous outcomes for in-flight writes.

-12. **Recovery and Availability**
-Nodes recover from durable storage, and rejoin automatically when healthy. Quorum rules determine whether reads/writes continue or fail fast during degraded periods.
+13. **Recovery and Availability**
+Nodes recover from durable storage, and rejoin automatically when healthy. Quorum rules determine whether reads/writes continue or fail fast during degraded periods. Read repair improves availability after partial failures by restoring shard redundancy while reads are still reconstructable.
+
+14. **Repair vs Concurrent Mutation**
+Read repair follows snapshot-style GET semantics. If a GET reconstructs a value, it may return that value even if a PUT or DELETE commits immediately afterward. Repair is version-preserving, so a concurrent PUT creates a newer version rather than being overwritten by repair. A concurrent DELETE is not rechecked before returning the already reconstructed GET result.
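
The version-preserving property described under "Repair vs Concurrent Mutation" can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the project's storage layer: `VersionedShardStoreSketch` and its methods are hypothetical, shards are modeled as plain strings, and the choice to make repair a no-op for a version that no longer exists is an assumption of the sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy store for a single key; not the PR's storage code.
final class VersionedShardStoreSketch {
    // version number -> shards stored for that version
    private final TreeMap<Integer, List<String>> versions = new TreeMap<>();

    /** A PUT always appends a new version and returns its number. */
    int put(List<String> shards) {
        int next = versions.isEmpty() ? 1 : versions.lastKey() + 1;
        versions.put(next, new ArrayList<>(shards));
        return next;
    }

    /**
     * Repair replaces the shards of an existing version in place.
     * It never allocates a version number, so it cannot shadow a newer PUT.
     */
    boolean repair(int version, List<String> freshShards) {
        if (!versions.containsKey(version)) {
            return false; // assumption: nothing to repair if the version is gone
        }
        versions.put(version, new ArrayList<>(freshShards));
        return true;
    }

    /** Returns the highest version number, or 0 if nothing is stored. */
    int latestVersion() {
        return versions.isEmpty() ? 0 : versions.lastKey();
    }

    List<String> shards(int version) {
        return versions.get(version);
    }
}
```

If a GET snapshots version 1, a PUT then commits version 2, and the repair for version 1 lands afterward, the latest version is still 2 and its shards are untouched.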
7 changes: 7 additions & 0 deletions docs/scope.md
@@ -19,6 +19,7 @@ It will:
- Reject update requests for secrets that do not exist
- Accept secret retrieval requests for previously stored secrets
- Accept requests to retrieve the version history of a secret
- Repair latest-version shard placement during retrieval when the system can still reconstruct the value but available shards are close to the recovery threshold
- Accept secret **delete** requests that remove enough stored shards to make reconstruction impossible
- Reject delete requests for secrets that do not exist
- Accept `.env` file content and:
@@ -48,6 +49,7 @@ It will:
- Shard secret pieces to peer vault instances
- Serve retrieval, history, and delete requests based on authoritative state
- Remain available under partial failure (within range of accepted Shamir's recovery threshold)
- Perform best-effort read repair for latest-version reads when only `k` or `k + repairTriggerBuffer` shards are available
- Recover state on restart

---
@@ -71,6 +73,7 @@ It defines:
- How secret deletion is defined and when a secret is considered non-reconstructable
- What identifiers are used to reference secrets
- How retries and concurrent requests are handled
- How read repair behaves under concurrent updates and deletes
- What duplicate and _not found_ errors mean

The model must be documented and observable in practice.
@@ -87,6 +90,9 @@ It defines:
- Shamir's Secret Keeping behavior:
- A password is separated on a single node into n (configured by user) parts
- The data is sent out to n - 1 other nodes, with 1 piece staying local to the machine that received the request directly
- Read repair behavior:
- Latest-version GET requests that barely meet the reconstruction threshold may re-split the reconstructed value and restore shards for the same version
- Repair does not create a new version and does not apply to historical version reads
- no master key required, any node can take requests to decode
- The rule that plaintext secret bytes are never written to durable storage or passed to other nodes
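
The split, reconstruct, and re-split cycle described in the list above can be sketched with a small Shamir implementation over a prime field. This is an illustrative sketch, not the project's implementation: the class `ShamirSketch`, the choice of the prime `2^127 - 1`, and the representation of the secret as a single `BigInteger` below that prime are all assumptions made for brevity, and the code is not hardened for production use.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of k-of-n Shamir sharing; not the PR's code.
final class ShamirSketch {
    // Field modulus: the Mersenne prime 2^127 - 1. Secrets must be smaller than this.
    static final BigInteger P = BigInteger.valueOf(2).pow(127).subtract(BigInteger.ONE);
    private static final SecureRandom RNG = new SecureRandom();

    static final class Share {
        final BigInteger x;
        final BigInteger y;

        Share(BigInteger x, BigInteger y) {
            this.x = x;
            this.y = y;
        }
    }

    private ShamirSketch() {}

    /** Splits the secret into n shares, any k of which reconstruct it. */
    static List<Share> split(BigInteger secret, int k, int n) {
        if (k < 1 || n < k) throw new IllegalArgumentException("need 1 <= k <= n");
        if (secret.signum() < 0 || secret.compareTo(P) >= 0) {
            throw new IllegalArgumentException("secret out of field range");
        }
        // Random polynomial of degree k - 1 whose constant term is the secret.
        BigInteger[] coeffs = new BigInteger[k];
        coeffs[0] = secret;
        for (int i = 1; i < k; i++) {
            coeffs[i] = new BigInteger(P.bitLength(), RNG).mod(P);
        }
        List<Share> shares = new ArrayList<>();
        for (int x = 1; x <= n; x++) {
            BigInteger bx = BigInteger.valueOf(x);
            BigInteger y = BigInteger.ZERO;
            for (int i = k - 1; i >= 0; i--) {
                y = y.multiply(bx).add(coeffs[i]).mod(P); // Horner evaluation
            }
            shares.add(new Share(bx, y));
        }
        return shares;
    }

    /** Lagrange interpolation at x = 0 over the given shares (needs at least k). */
    static BigInteger reconstruct(List<Share> shares) {
        BigInteger secret = BigInteger.ZERO;
        for (Share si : shares) {
            BigInteger num = BigInteger.ONE;
            BigInteger den = BigInteger.ONE;
            for (Share sj : shares) {
                if (si == sj) continue;
                num = num.multiply(sj.x.negate()).mod(P);
                den = den.multiply(si.x.subtract(sj.x)).mod(P);
            }
            BigInteger term = si.y.multiply(num).multiply(den.modInverse(P));
            secret = secret.add(term).mod(P);
        }
        return secret;
    }
}
```

In terms of this sketch, read repair amounts to calling `reconstruct` on the `k` shares that were reachable and then calling `split` again with the same `k` and `n`. The fresh shares belong to a new random polynomial, so they should replace the old set for that version as a whole rather than be mixed with surviving old shards.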

@@ -103,6 +109,7 @@ It will:
- Define delete request and response behavior, including threshold-based deletion success criteria
- Specify duplicate and _not found_ error behavior
- Describe durability and replication guarantees
- Describe best-effort read repair when shard availability is degraded but still reconstructable
- Describe secret-keeping and spreading behavior and failure behavior when referenced secrets cannot be resolved
- Describe secret history retrieval semantics, including version ordering and validity timestamps
- Describe `.env` encryption and expansion semantics, including secret creation and all-or-nothing failure
@@ -14,4 +14,6 @@ public class ClusterConfig {
private int quorumM;
private long lockTimeoutMillis;
private long writeTimeoutMillis;
private boolean repairEnabled = true;
private int repairTriggerBuffer = 1;
}
@@ -5,10 +5,12 @@
import edu.yu.capstone.DistributedSecretsVault.dto.internal.DeletePrepareRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.PostPrepareRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.PutPrepareRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.RepairPrepareRequest;
import edu.yu.capstone.DistributedSecretsVault.service.internal.DeletePrepareHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.InternalGetService;
import edu.yu.capstone.DistributedSecretsVault.service.internal.PostPrepareHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.PutPrepareHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.RepairPrepareHandler;

import java.util.Map;
import java.util.UUID;
@@ -32,15 +34,18 @@ public class InternalController {
private final PostPrepareHandler postPrepareHandler;
private final PutPrepareHandler putPrepareHandler;
private final DeletePrepareHandler deletePrepareHandler;
private final RepairPrepareHandler repairPrepareHandler;

public InternalController(InternalGetService internalGetService,
PostPrepareHandler postPrepareHandler,
PutPrepareHandler putPrepareHandler,
-DeletePrepareHandler deletePrepareHandler) {
+DeletePrepareHandler deletePrepareHandler,
+RepairPrepareHandler repairPrepareHandler) {
this.internalGetService = internalGetService;
this.postPrepareHandler = postPrepareHandler;
this.putPrepareHandler = putPrepareHandler;
this.deletePrepareHandler = deletePrepareHandler;
this.repairPrepareHandler = repairPrepareHandler;
}

@GetMapping("/{id}")
@@ -68,6 +73,12 @@ public ResponseEntity<Void> preparePut(@RequestBody PutPrepareRequest request) {
return ResponseEntity.noContent().build();
}

@PostMapping("/repair/prepare")
public ResponseEntity<Void> prepareRepair(@RequestBody RepairPrepareRequest request) {
repairPrepareHandler.handle(request);
return ResponseEntity.noContent().build();
}

@DeleteMapping("/prepare")
public ResponseEntity<Void> prepareDelete(
@RequestParam("originatorNodeId") String originatorNodeId,
@@ -0,0 +1,16 @@
package edu.yu.capstone.DistributedSecretsVault.dto.internal;

import java.util.UUID;

import edu.yu.capstone.DistributedSecretsVault.domain.model.SecretKey;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class RepairCommitRequest {
private UUID operationId;
private SecretKey secretKey;
}
@@ -0,0 +1,16 @@
package edu.yu.capstone.DistributedSecretsVault.dto.internal;

import java.util.UUID;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class RepairPrepareRequest {
private String originatorNodeId;
private UUID operationId;
private SecretPartMessage secretPartMessage;
}
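
The two request DTOs above suggest a prepare step keyed by `operationId` followed by a commit that names the secret. One way such staging could work is sketched below. This is a guess at the shape of the flow, not the PR's handlers: `RepairStagingSketch` is hypothetical, a `String` stands in for `SecretKey`, and a raw `byte[]` stands in for the payload of `SecretPartMessage`.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of in-memory staging for repair shards; not the PR's code.
final class RepairStagingSketch {
    // Shards received via prepare, waiting for the matching commit.
    private final Map<UUID, byte[]> staged = new ConcurrentHashMap<>();
    // Shards made visible by a commit, keyed by a stand-in for SecretKey.
    private final Map<String, byte[]> committed = new ConcurrentHashMap<>();

    /** Prepare: hold the replacement shard in memory without exposing it. */
    void prepare(UUID operationId, byte[] shard) {
        staged.put(operationId, shard.clone());
    }

    /**
     * Commit: promote the staged shard for this operation.
     * Returns false if nothing was staged, for example on a duplicate commit.
     */
    boolean commit(UUID operationId, String secretKey) {
        byte[] shard = staged.remove(operationId);
        if (shard == null) {
            return false;
        }
        committed.put(secretKey, shard);
        return true;
    }

    byte[] committedShard(String secretKey) {
        byte[] shard = committed.get(secretKey);
        return shard == null ? null : shard.clone();
    }
}
```

Removing the staged entry during commit makes a replayed commit message a no-op in this sketch, which is one plausible way to keep commits idempotent.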
@@ -8,11 +8,13 @@
import edu.yu.capstone.DistributedSecretsVault.dto.internal.DeleteCommitRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.PostCommitRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.PutCommitRequest;
import edu.yu.capstone.DistributedSecretsVault.dto.internal.RepairCommitRequest;
import edu.yu.capstone.DistributedSecretsVault.exceptions.InternalOperationConflictException;
import edu.yu.capstone.DistributedSecretsVault.service.internal.ActionType;
import edu.yu.capstone.DistributedSecretsVault.service.internal.DeleteCommitHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.PostCommitHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.PutCommitHandler;
import edu.yu.capstone.DistributedSecretsVault.service.internal.RepairCommitHandler;

@Service
public class CommitDispatcher {
@@ -21,13 +23,16 @@ public class CommitDispatcher {
private final DeleteCommitHandler deleteCommitHandler;
private final PostCommitHandler postCommitHandler;
private final PutCommitHandler putCommitHandler;
private final RepairCommitHandler repairCommitHandler;

public CommitDispatcher(DeleteCommitHandler deleteCommitHandler,
PostCommitHandler postCommitHandler,
-PutCommitHandler putCommitHandler) {
+PutCommitHandler putCommitHandler,
+RepairCommitHandler repairCommitHandler) {
this.deleteCommitHandler = deleteCommitHandler;
this.postCommitHandler = postCommitHandler;
this.putCommitHandler = putCommitHandler;
this.repairCommitHandler = repairCommitHandler;
}

public void dispatch(CommitMessage message) {
@@ -39,6 +44,8 @@ public void dispatch(CommitMessage message) {
postCommitHandler.handle(new PostCommitRequest(message.getOperationId(), message.getSecretKey()));
} else if (message.getActionType() == ActionType.PUT) {
putCommitHandler.handle(new PutCommitRequest(message.getOperationId(), message.getSecretKey()));
} else if (message.getActionType() == ActionType.REPAIR) {
repairCommitHandler.handle(new RepairCommitRequest(message.getOperationId(), message.getSecretKey()));
} else {
log.warn("Ignoring unsupported commit action type: operationId={}, actionType={}",
message.getOperationId(), message.getActionType());
@@ -7,5 +7,6 @@
public enum ActionType {
POST,
DELETE,
-PUT
+PUT,
+REPAIR
}