DAOS-0000 rebuild: skip RECLAIM after a successful exclude-only rebuild by wangshilong · Pull Request #17713 · daos-stack/daos

wangshilong · 2026-03-16T15:10:34Z

In large systems, a full object scan can take hours. Under the current placement model, when the pool map changes due to target failures, only the failed targets are remapped to spare targets. After a successful rebuild there are no stale copies left on any surviving target, so scheduling a follow-up RB_OP_RECLAIM is unnecessary. All other rebuild triggers (drain, reintegration, extend, upgrade) still require RECLAIM because they can leave stale data behind.

To distinguish the root cause of each rebuild, a new rebuild_cause bitmask is introduced in ds_rebuild_schedule() and stored as dst_rebuild_cause in struct rebuild_task. Four cause flags are defined: RB_CAUSE_EXCLUDE, RB_CAUSE_DRAIN, RB_CAUSE_REINT, and RB_CAUSE_EXTEND. When multiple rebuild tasks are merged, their cause bitmasks are OR-ed together so that no information is lost.
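
The cause tracking described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the flag names and the dst_rebuild_cause field come from the description, while the bit values, the reduced struct sketch_rebuild_task, and the helper sketch_merge_cause() are hypothetical.

```c
#include <stdint.h>

/* Cause flags named in the PR description; the bit values are assumed. */
enum {
	RB_CAUSE_EXCLUDE = 1 << 0,
	RB_CAUSE_DRAIN   = 1 << 1,
	RB_CAUSE_REINT   = 1 << 2,
	RB_CAUSE_EXTEND  = 1 << 3,
};

/* Hypothetical stand-in for struct rebuild_task; only the cause field is modeled. */
struct sketch_rebuild_task {
	uint32_t dst_rebuild_cause;
};

/*
 * Hypothetical helper: when task 'src' is merged into task 'dst',
 * OR the bitmasks so every contributing cause stays recorded.
 */
static void
sketch_merge_cause(struct sketch_rebuild_task *dst,
		   const struct sketch_rebuild_task *src)
{
	dst->dst_rebuild_cause |= src->dst_rebuild_cause;
}
```

With this shape, merging a drain-triggered task into an exclude-triggered one leaves both bits set, so the merged task no longer looks like an exclude-only rebuild.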

On rebuild completion, RB_OP_RECLAIM is skipped only when the combined cause is RB_CAUSE_EXCLUDE (i.e. the task was triggered solely by an exclude operation and no other cause was merged in). For any other cause, or when the cause is unknown (e.g. the task was regenerated after a pool-service leader switch), RECLAIM is still scheduled conservatively.
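
A minimal sketch of that completion-time decision, under stated assumptions: the flag names are from the description, but the bit values, the use of 0 to represent an unknown cause, and the function sketch_need_reclaim() are hypothetical and may not match the actual change. It models only the cause check, not whether the rebuild itself succeeded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cause flags named in the PR description; the bit values are assumed. */
enum {
	RB_CAUSE_EXCLUDE = 1 << 0,
	RB_CAUSE_DRAIN   = 1 << 1,
	RB_CAUSE_REINT   = 1 << 2,
	RB_CAUSE_EXTEND  = 1 << 3,
};

/* Assumption: a task whose cause was lost (e.g. regenerated) carries no bits. */
#define SKETCH_RB_CAUSE_UNKNOWN 0u

/*
 * Hypothetical predicate: RECLAIM may be skipped only when EXCLUDE is the
 * sole bit set. An exact equality test (not a bit test) ensures that any
 * merged-in cause, or an unknown cause, still schedules RECLAIM.
 */
static bool
sketch_need_reclaim(uint32_t cause)
{
	return cause != RB_CAUSE_EXCLUDE;
}
```

The equality comparison is the point of the design: testing `cause & RB_CAUSE_EXCLUDE` instead would wrongly skip RECLAIM for a merged exclude-plus-drain task.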

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2026-03-16T15:22:31Z

Errors: Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-0000

daosbuild3 · 2026-03-16T15:24:55Z

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17713/1/execution/node/282/log

daosbuild3 · 2026-03-16T15:25:56Z

Test stage Build on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17713/1/execution/node/290/log

daosbuild3 · 2026-03-16T15:26:50Z

Test stage Build on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17713/1/execution/node/320/log

daosbuild3 · 2026-03-16T15:29:22Z

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17713/1/execution/node/408/log

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

wangshilong force-pushed the shilongw/skip_reclaim branch from 2a1284b to fcf747b on March 17, 2026 06:42

wangshilong wants to merge 1 commit into master from shilongw/skip_reclaim
