DAOS-0000 rebuild: skip RECLAIM after a successful exclude-only rebuild #17713

Draft
wangshilong wants to merge 1 commit into master from shilongw/skip_reclaim

@wangshilong
Contributor

In large systems, a full object scan can take hours. Under the current placement model, when the pool map changes due to target failures, only the failed targets are remapped to spare targets. After a successful rebuild there are no stale copies left on any surviving target, so scheduling a follow-up RB_OP_RECLAIM is unnecessary. All other rebuild triggers (drain, reintegration, extend, upgrade) still require RECLAIM because they can leave stale data behind.

To distinguish the root cause of each rebuild, a new rebuild_cause bitmask is introduced in ds_rebuild_schedule() and stored as dst_rebuild_cause in struct rebuild_task. Four cause flags are defined: RB_CAUSE_EXCLUDE, RB_CAUSE_DRAIN, RB_CAUSE_REINT, and RB_CAUSE_EXTEND. When multiple rebuild tasks are merged, their cause bitmasks are OR-ed together so that no information is lost.

On rebuild completion, RB_OP_RECLAIM is skipped only when the combined cause is exactly RB_CAUSE_EXCLUDE (i.e. the task was triggered solely by an exclude operation and no other cause was merged in). For any other cause, or when the cause is unknown (e.g. the task was regenerated after a pool-service leader switch), RECLAIM is still scheduled as the conservative default.
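
The completion-time decision can be sketched as a single equality test on the combined mask. This is an illustration under stated assumptions, not the code in the patch: the bit values are hypothetical, `rebuild_needs_reclaim()` is a made-up helper name, and an unknown cause is assumed to be encoded as 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the description names the flags but not their encoding. */
#define RB_CAUSE_EXCLUDE (1u << 0)
#define RB_CAUSE_DRAIN   (1u << 1)
#define RB_CAUSE_REINT   (1u << 2)
#define RB_CAUSE_EXTEND  (1u << 3)

/*
 * Hypothetical helper: returns true when RB_OP_RECLAIM should still be
 * scheduled. Only a mask equal to RB_CAUSE_EXCLUDE skips it; a mixed
 * mask or an unknown cause (assumed here to be 0) keeps RECLAIM.
 */
static bool
rebuild_needs_reclaim(uint32_t cause)
{
	return cause != RB_CAUSE_EXCLUDE;
}

/* Usage examples expressed as assertions. */
static void
rebuild_needs_reclaim_examples(void)
{
	assert(!rebuild_needs_reclaim(RB_CAUSE_EXCLUDE));
	assert(rebuild_needs_reclaim(RB_CAUSE_EXCLUDE | RB_CAUSE_DRAIN));
	assert(rebuild_needs_reclaim(0));
}
```

Comparing for equality rather than testing a single bit is what makes the check conservative: a bit test on RB_CAUSE_EXCLUDE would also pass for a task that had a drain or reintegration merged in.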

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-0000

@daosbuild3
Collaborator

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17713/1/execution/node/282/log

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong force-pushed the shilongw/skip_reclaim branch from 2a1284b to fcf747b on March 17, 2026 06:42