[fix](ann-index) Fix ivf pq recall zero.#63757

Draft
kaka11chen wants to merge 2 commits into apache:master from kaka11chen:fix_ivf_pq_recall

@kaka11chen
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#63757

Problem Summary: ANN index building previously trained on a prefix-only sample and spooled every vector through the common writer path, deferring the FAISS add until finish. As a result, indexes that need no training paid unnecessary spool overhead, and very large segments trained on a large prefix that was less representative than a sample drawn across the whole segment. This change lets indexes that do not require training add vectors directly, keeps small train-required segments in memory, spills full vectors only once the in-memory chunk threshold is exceeded, and replaces the prefix training data with a bounded reservoir sample. A new `ann_index_build_max_train_rows` config caps the training sample size while still keeping at least the FAISS-required minimum number of training rows. IVF_ON_DISK is also treated as train-required when computing the minimum training rows.
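
For readers unfamiliar with the technique, here is a minimal sketch of bounded reservoir sampling (Algorithm R) with a capped sample size. It is not the code in this PR: `ReservoirSampler`, `effective_train_rows`, and the fixed seed are hypothetical names chosen for illustration, and the rule "use the configured cap, but never go below the minimum the index requires" is an assumption read from the summary above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: the number of training rows to keep.
// Assumption: the configured cap applies unless it is smaller than the
// minimum number of rows the index needs for training.
inline std::size_t effective_train_rows(std::size_t min_required_rows,
                                        std::size_t configured_max_rows) {
    return std::max(min_required_rows, configured_max_rows);
}

// Hypothetical bounded reservoir over fixed-dimension float vectors.
// After N rows have been offered, each row is in the sample with
// probability capacity / N, regardless of its position in the segment.
class ReservoirSampler {
public:
    ReservoirSampler(std::size_t capacity, std::size_t dim, std::uint64_t seed)
            : _capacity(capacity), _dim(dim), _rng(seed) {
        assert(capacity > 0 && dim > 0);
        _sample.reserve(capacity * dim);
    }

    // Offer one vector of `dim` floats to the reservoir.
    void add(const float* vec) {
        ++_seen;
        if (_seen <= _capacity) {
            // Reservoir not full yet: always keep the row.
            _sample.insert(_sample.end(), vec, vec + _dim);
            return;
        }
        // Draw a slot in [0, seen); the row replaces an existing one
        // only if the slot falls inside the reservoir.
        std::uniform_int_distribution<std::size_t> pick(0, _seen - 1);
        const std::size_t slot = pick(_rng);
        if (slot < _capacity) {
            std::copy(vec, vec + _dim, _sample.data() + slot * _dim);
        }
    }

    std::size_t rows_seen() const { return _seen; }
    std::size_t rows_sampled() const { return _sample.size() / _dim; }
    const std::vector<float>& data() const { return _sample; }

private:
    std::size_t _capacity;
    std::size_t _dim;
    std::size_t _seen = 0;
    std::mt19937_64 _rng;
    std::vector<float> _sample; // row-major, rows_sampled() * dim floats
};
```

Memory stays bounded by `capacity * dim` floats however many rows the segment holds, which is the property a prefix sample of the same size also has, but without the bias toward the start of the segment.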

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Passed: git diff --check
    - Attempted: build-support/clang-format.sh on changed C++ files, blocked because clang-format is not installed in this environment
    - Attempted: DORIS_TOOLCHAIN=gcc ./run-be-ut.sh --run --filter=AnnIndexWriterTest.* -j 8, blocked by unrelated existing GCC compile error in be/src/format/parquet/parquet_column_convert.h:596: std::powf is not a member of std
- Behavior changed: Yes (ANN build training sampling and spool behavior changed internally)
- Does this need documentation: No
