fix(compactor): stop compaction at TXID gap to prevent repeated failures #1155

Open

corylanou wants to merge 1 commit into main from issue-1151-failed-upload-breaks-compaction

Conversation

@corylanou (Collaborator)

Summary

  • Add contiguity check in Compact() that stops file collection at the first TXID gap, compacting only the contiguous prefix
  • When a gap is detected, the compactor logs a warning and breaks out of the collection loop, so ltx.Compactor never receives non-contiguous input
  • The monitor's existing recovery logic (calcPos()) re-uploads the missing file on its next sync cycle, and remaining files compact in a future pass
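The contiguity check described above can be sketched as follows. This is only an illustration of the technique, not the actual Litestream code: the `ltxFile` type and `contiguousPrefix` function are hypothetical stand-ins, and the real `Compact()` works over replica storage rather than a slice.

```go
package main

import "fmt"

// ltxFile is a hypothetical stand-in for an L0 LTX file's TXID range;
// the real Litestream/ltx types differ.
type ltxFile struct {
	MinTXID, MaxTXID uint64
}

// contiguousPrefix returns the longest prefix of files whose TXID ranges
// are contiguous, mirroring the expectedMinTXID tracking the PR describes:
// collection stops at the first gap, and only the prefix is compacted.
func contiguousPrefix(files []ltxFile) []ltxFile {
	if len(files) == 0 {
		return files
	}
	expectedMinTXID := files[0].MinTXID
	for i, f := range files {
		if f.MinTXID != expectedMinTXID {
			// Gap detected: compact only the files before it.
			return files[:i]
		}
		expectedMinTXID = f.MaxTXID + 1
	}
	return files
}

func main() {
	// L0 files 1,2,3,5,6 — a gap at TXID 4, as in the test plan below.
	files := []ltxFile{{1, 1}, {2, 2}, {3, 3}, {5, 5}, {6, 6}}
	for _, f := range contiguousPrefix(files) {
		fmt.Println(f.MinTXID)
	}
}
```

With the gap at 4, only TXIDs 1–3 are selected; once the missing file is recovered, a later pass sees a fully contiguous set and compacts the rest.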

Fixes #1151

Test Plan

  • TestCompactor_Compact/L0GapStopsAtGap — creates L0 files 1,2,3,5,6 (gap at 4), verifies compaction produces L1 with range 1-3, then fills gap and verifies 4-6 compacts
  • TestCompactor_Compact/L0GapAtStart — creates L0 files 3,4,5 with no prior L1, verifies compactor compacts the available contiguous set (3-5)
  • All existing compactor tests pass
  • Full test suite passes with -race

@corylanou force-pushed the issue-1151-failed-upload-breaks-compaction branch from 1ccf74b to 0e5a589 on February 23, 2026 at 16:27
When an S3 upload fails after retries, a gap appears in L0 LTX files.
Compaction then collects all L0 files including those after the gap and
passes them to ltx.Compactor, which rejects non-contiguous inputs. This
error repeats indefinitely because the gap persists.

Add a contiguity check in Compact() that tracks expectedMinTXID as files
are collected. When a gap is detected, compaction stops and only compacts
the contiguous prefix before the gap. The monitor recovers the missing
file on its next sync cycle, and remaining files compact in a future pass.

Fixes #1151
@corylanou force-pushed the issue-1151-failed-upload-breaks-compaction branch from 0e5a589 to 456a2d3 on February 23, 2026 at 18:21
@benbjohnson (Owner) left a comment

I don't think this fixes #1151. The problem in that issue is that we have gaps in L0 as you can see from this compaction to L1:

time=2026-02-20T03:12:00.987+01:00 level=ERROR msg="compaction failed" level=1 error="write ltx file: extract timestamp from LTX header: non-contiguous transaction ids in input files: (000000000004b1c6,000000000004b1c6) -> (000000000004b1c8,000000000004b1c8)"

Before that, it looks like 000000000004b1c7 failed during upload:

time=2026-02-20T03:10:20.673+01:00 level=ERROR msg="monitor error" db=state.db replica=s3 error="calc pos: max ltx file: operation error S3: ListObjectsV2, https response error StatusCode: 504, RequestID: N/A, HostID: N/A, api error GatewayTimeout: The server did not respond in time.\nfailed to get rate limit token, retry quota exceeded, 0 available, 5 requested" consecutive_errors=2 backoff=2s

And then it failed again when trying to recalculate the remote replica L0 position:

time=2026-02-20T03:11:30.851+01:00 level=ERROR msg="compaction failed" level=1 error="write ltx file: extract timestamp from LTX header: non-contiguous transaction ids in input files: (000000000004b1c6,000000000004b1c6) -> (000000000004b1c8,000000000004b1c8)"

This shouldn't happen as Litestream should re-upload based on the latest version in S3 once it clears its position and recalculates. That code is in Replica.Sync().

I just pushed a PR to get some more info on replica sync and uploads: #1182

Successfully merging this pull request may close these issues.

Failed upload breaks compaction