Skip to content

Fix sql logfill#5807

Open
markhannum wants to merge 3 commits intobloomberg:mainfrom
markhannum:fix_sql_logfill
Open

Fix sql logfill#5807
markhannum wants to merge 3 commits intobloomberg:mainfrom
markhannum:fix_sql_logfill

Conversation

@markhannum
Copy link
Contributor

Fix bug in sql-logfill where the logfill thread won't apply a record which has a generation which is less than the node's current generation, as this assertion is only true if we explicitly set the generation during recovery by reading a commit or a checkpoint record which contains it. Additionally, this provides the sqllogfill_reset_gen test which reproduces the issue and exercises the fix.

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
maxtable [core dumped]
sc_swapfields
consumer_non_atomic_default_consumer_generated
sc_truncate_lockorder_generated
reco-ddlk-sql

@markhannum markhannum force-pushed the fix_sql_logfill branch 3 times, most recently from 4cbd8b6 to ae2d5ef Compare March 16, 2026 17:46
Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
sc_resume_logicalsc_generated
consumer_non_atomic_default_consumer_generated
remsql_locks_rte_connect_generated
remsql_locks
reco-ddlk-sql

Copy link
Contributor

@SChakravorti21 SChakravorti21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing down my understanding so far...

We are introducing a tunable, gbl_emit_gen_commits, so that we can turn off the logic for emitting cluster generation in commit records. By disabling this logic, a replicant can't possibly pick up its previous cluster generation from the transaction log. In other words, it helps us test how logfill behaves when a node restarts and its cluster generation resets.

If we haven't enabled elect_highest_committed_gen or match_on_ckp, then our gen could foreseeably be much lower (defaults to 0) than that of the current master. We want logfill to be able to continue making progress in this case, hence the changes to gen_okay.

The reset_gen test makes sure the logfill makes progress even when our gen resets on restart, instead of refusing to apply log records.

The noresetgen test ensures that with elect_highest_committed_gen, match_on_ckp, etc. enabled, our gen is always fetched from logs on restart instead of resetting.

Questions:

  • In gen_okay, do we need to check for retrieve_gen_from_ckp in addition to match_on_ckp? Or does match_on_ckp imply retrieve_gen_from_ckp? How does match_on_ckp help guarantee that we have retrieved our most recent gen from the transaction logs?

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
scindex [core dumped]
consumer_non_atomic_default_consumer_generated
remsql_locks_rte_connect_generated
remsql_locks
reco-ddlk-sql

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-17T19:01:21EDT [141605] sc_resume [db unavailable at finish]
2026-03-17T19:01:21EDT [141605] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-17T19:01:21EDT [141605] remsql_locks_rte_connect_generated **quarantined**
2026-03-17T19:01:21EDT [141605] remsql_locks **quarantined**
2026-03-17T19:01:21EDT [141605] reco-ddlk-sql [timeout] **quarantined**

@markhannum markhannum force-pushed the fix_sql_logfill branch 3 times, most recently from 66e3712 to b7861f4 Compare March 18, 2026 12:44
Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-18T10:46:47EDT [275862] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T10:46:47EDT [275862] remsql_locks_rte_connect_generated **quarantined**
2026-03-18T10:46:47EDT [275862] remsql_locks **quarantined**
2026-03-18T10:46:47EDT [275862] truncatesc_offline_generated [timeout] **quarantined**
2026-03-18T10:46:47EDT [275862] reco-ddlk-sql [timeout] **quarantined**

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-18T10:51:38EDT [24607] sc_resume_logicalsc_generated **quarantined**
2026-03-18T10:51:38EDT [24607] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T10:51:38EDT [24607] sc_transactional_rowlocks_generated **quarantined**
2026-03-18T10:51:38EDT [24607] remsql_locks **quarantined**
2026-03-18T10:51:38EDT [24607] remsql_locks_rte_connect_generated **quarantined**
2026-03-18T10:51:38EDT [24607] truncatesc_offline_generated [timeout] **quarantined**

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-18T13:01:37EDT [253639] reco-ddlk-sql **quarantined**
2026-03-18T13:01:37EDT [253639] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T13:01:37EDT [253639] remsql_locks_rte_connect_generated **quarantined**
2026-03-18T13:01:37EDT [253639] remsql_locks **quarantined**
2026-03-18T13:01:37EDT [253639] noresetgen [timeout]

@markhannum markhannum force-pushed the fix_sql_logfill branch 2 times, most recently from 57d193a to ad792d2 Compare March 18, 2026 19:28
Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
2026-03-18T15:46:20EDT [9195] incremental_backup_usenames
2026-03-18T15:46:20EDT [9195] scindex_logicalsc_generated
2026-03-18T15:46:20EDT [9195] consumer_non_atomic_default_consumer_generated **quarantined**
2026-03-18T15:46:20EDT [9195] remsql_locks_rte_connect_generated **quarantined**
2026-03-18T15:46:20EDT [9195] remsql_locks **quarantined**
2026-03-18T15:46:20EDT [9195] sqllogfill_reset_gen [timeout]
2026-03-18T15:46:20EDT [9195] reco-ddlk-sql [timeout] **quarantined**

Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
analyze_partial_index_off_generated **quarantined**
consumer_non_atomic_default_consumer_generated **quarantined**
remsql_locks_rte_connect_generated **quarantined**
remsql_locks **quarantined**
reco-ddlk-sql [timeout] **quarantined**

Signed-off-by: mhannum <mhannum@bloomberg.net>
Signed-off-by: mhannum <mhannum@bloomberg.net>
Signed-off-by: mhannum <mhannum@bloomberg.net>
Copy link

@roborivers roborivers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cbuild submission: Success ✓.
Regression testing: Success ✓.

The first 10 failing tests are:
sc_resume_logicalsc_generated **quarantined**
consumer_non_atomic_default_consumer_generated **quarantined**
remsql_locks_rte_connect_generated **quarantined**
remsql_locks **quarantined**
truncatesc_offline_generated [timeout] **quarantined**
sqllogfill_reset_gen [timeout]
reco-ddlk-sql [timeout] **quarantined**
guid [timeout]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants