Clarification on Multi-hop Inference Pipeline, Re-routing Overhead, SFT Details, and Naming Convention #4

@xiaohongrsx


Hi, thanks for the great work! After carefully reading the paper, I have several questions regarding the inference pipeline, training details, and the naming of the method. I'd appreciate any clarification.


1. Clarification on the Multi-hop Inference Pipeline

Based on my reading of Section 3.5 and Figure 3, I reconstructed the following inference pipeline for Memory Interleave. Could you confirm whether this understanding is correct?

loop:
    1. Encode the current query context through the model to obtain Q^R
    2. Route Q^R against all cached routing keys K̄^R → select Top-16 documents
    3. Load selected documents' compressed K̄, V̄ from CPU to GPU
    4. Autoregressively generate tokens with attention context = [Top-16 compressed KV ; local KV]
    5. If the model generates [doc_id]<|object_ref_end|>:
         → Fetch the original text of the referenced document
         → Append the original text to the current query context
         → Go back to step 1 (re-encode, re-route)
    6. If the model generates <End-of-Retrieve>:
         → Transition to final answer generation
         → Exit loop

Specific sub-questions:

  • Is the pipeline above identical for both single-hop and multi-hop queries (i.e., a single unified pipeline where single-hop queries simply exit the loop after one iteration)?
  • When appending original document text at step 5, does the system re-encode only the newly appended text (reusing the KV cache from the previous iteration for earlier tokens), or does it re-encode the entire expanded context from scratch?
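To make my reconstruction concrete, here is the loop as a self-contained toy simulation. Everything in it — the bag-of-words `encode`, the tiny `DOCS` store, the rule that stands in for the model emitting `[doc_id]<|object_ref_end|>` or `<End-of-Retrieve>` — is a hypothetical placeholder; only the control flow is meant to mirror my reading of Section 3.5 / Figure 3:

```python
import numpy as np

# Toy simulation of my reconstructed Memory Interleave loop.
# All names and the routing/"generation" logic are invented stand-ins;
# only the route -> generate -> append -> re-route control flow is the point.

DOCS = {
    0: "paris is the capital of france",
    1: "france borders germany",
    2: "berlin is the capital of germany",
}
VOCAB = sorted({w for t in DOCS.values() for w in t.split()})

def encode(text):
    # Placeholder for the model forward pass that produces Q^R:
    # here, just a normalized bag-of-words vector.
    v = np.array([text.split().count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

ROUTING_KEYS = {i: encode(t) for i, t in DOCS.items()}  # cached routing keys

def route_topk(q_r, k=2):
    # Step 2: cosine similarity of Q^R against all cached routing keys.
    sims = {i: float(q_r @ key) for i, key in ROUTING_KEYS.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def memory_interleave(query, max_hops=3):
    context = query
    fetched = []
    for _ in range(max_hops):
        q_r = encode(context)            # step 1: (re-)encode current context
        top = route_topk(q_r)            # step 2: route, select top-k docs
        # steps 3-4 (loading compressed KV, generation) are elided; here
        # the "model" simply references the best not-yet-seen document.
        new = [d for d in top if d not in fetched]
        if not new:                      # emulates <End-of-Retrieve>
            break
        fetched.append(new[0])           # emulates [doc_id]<|object_ref_end|>
        context += " " + DOCS[new[0]]    # append original text, re-loop
    return fetched

print(memory_interleave("capital of france"))  # → [0, 2]
```

The second sub-question above is exactly where this sketch is ambiguous: at `context += ...`, does `encode` on the next iteration reuse the prior iteration's KV cache as a prefix, or recompute everything?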

2. Re-routing Overhead in Multi-hop Scenarios

Each iteration of the Memory Interleave loop requires:

  1. Re-encoding the appended original document text through the full model forward pass
  2. Re-routing Q^R against all ~1.56M routing key entries (for 100M tokens) across 18 layers
  3. Re-loading potentially different Top-16 documents' content KV from CPU

For complex multi-hop queries that may require 3-5 iterations, this overhead compounds. Have you measured the per-hop latency breakdown? Specifically:

  • What is the latency of the routing step alone (cosine similarity against all routing key entries) at the 100M token scale?
  • How does the end-to-end multi-hop inference latency compare to an equivalent iterative RAG pipeline (e.g., multi-turn RAG with reranking)?
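For reference, a back-of-the-envelope sketch of the routing step alone suggests it should be cheap in FLOP terms. The `route` below is my guess at the computation (cosine similarity as a dot product over pre-normalized keys, then Top-16 via partial selection); the key dimension `D = 128` is an assumption, and the demo runs at a reduced `N` to stay cheap:

```python
import numpy as np

# Back-of-the-envelope cost of the routing step: one cosine-similarity
# pass of Q^R against all cached routing keys, then Top-16 selection.
# The paper's scale is ~1.56M entries across 18 layers at 100M tokens;
# the key dimension D = 128 is my assumption, not from the paper.

N_PAPER, D, N_LAYERS, TOP_K = 1_560_000, 128, 18, 16

def route(q, keys, k=TOP_K):
    # keys are pre-normalized, so a dot product is cosine similarity
    q = q / np.linalg.norm(q)
    sims = keys @ q                              # O(N * d)
    idx = np.argpartition(sims, -k)[-k:]         # unordered Top-k, O(N)
    return idx[np.argsort(-sims[idx])]           # sort only the k survivors

rng = np.random.default_rng(0)
keys = rng.standard_normal((100_000, D)).astype(np.float32)  # reduced-N demo
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
top = route(rng.standard_normal(D).astype(np.float32), keys)

# At full scale: 2 * N * d ≈ 0.4 GFLOPs per routing layer, ~7.2 GFLOPs
# across 18 layers per hop — likely small next to re-encoding the
# appended document text, which is why I suspect steps 1 and 3 (forward
# pass and CPU->GPU KV transfer) dominate per-hop latency.
print(len(top), 2 * N_PAPER * D * N_LAYERS / 1e9)  # → 16 7.18848
```

If this estimate is roughly right, the interesting latency numbers would be the re-encoding and KV-transfer components rather than the similarity search itself — hence the questions above.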

3. SFT Data Construction and Loss Computation

The paper mentions a two-stage SFT curriculum (Section 3.3.2) but provides limited details:

  • Stage 1: SFT on QA tasks with 8K context length
  • Stage 2: Extended to 64K context with data cleaning

I have the following questions:

  1. Data construction: Could you provide more details on how the SFT training data was constructed? Specifically:

    • How were the multi-hop retrieval chains decomposed into individual training samples (as mentioned: "each retrieval chain is divided into multiple training samples")?
    • Were the document IDs and <End-of-Retrieve> / <|object_ref_end|> tokens manually annotated in the training data, or generated through some automated pipeline?
  2. Loss computation: During SFT, what loss function was used?

    • Is it the standard next-token prediction loss (cross-entropy) only on the response tokens?
    • Was L_aux (the contrastive routing loss from pre-training) still active during SFT, or was it dropped?
    • Was the loss computed over the generated document IDs as well, or only over the final answer tokens?
  3. Potential data leakage: Since the SFT data presumably includes specific document IDs paired with specific queries, does this create a dependency on the document corpus used during training? In other words, how does the model generalize to entirely new document collections not seen during SFT?
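To pin down what I mean in question 2, here is the standard SFT loss-masking convention I am asking about: next-token cross-entropy computed only on response tokens (prompt positions excluded via a `-100` label), with `L_aux` optionally added as a weighted term. This is the usual recipe, not a claim about what the paper actually does:

```python
import numpy as np

# Standard SFT loss-masking convention (my reference point, NOT the
# paper's confirmed recipe): cross-entropy only over response tokens,
# with the contrastive routing loss L_aux optionally kept as an
# additive term. Whether doc-ID tokens fall inside the unmasked region
# is exactly what question 2 asks.

def sft_loss(logits, input_ids, prompt_len, aux_loss=0.0, aux_weight=0.0):
    labels = input_ids.copy()
    labels[:prompt_len] = -100                 # exclude prompt/query tokens
    logits, labels = logits[:-1], labels[1:]   # position t predicts token t+1
    # log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    mask = labels != -100
    nll = -logp[np.arange(len(labels))[mask], labels[mask]]
    return nll.mean() + aux_weight * aux_loss

logits = np.zeros((5, 4))                      # a uniform "model", vocab size 4
ids = np.array([1, 2, 3, 0, 1])
print(round(float(sft_loss(logits, ids, prompt_len=2)), 4))  # uniform CE = ln(4) ≈ 1.3863
```

Concretely: are the `[doc_id]` and `<End-of-Retrieve>` tokens inside the unmasked region (so the model is directly supervised to emit them), and is `aux_weight` nonzero during SFT?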

4. Naming: "Memory Sparse Attention" vs. "Sparse Retrieval with Attention-based Fusion"

The name "Memory Sparse Attention" implies a modification to the attention mechanism itself that introduces sparsity (similar to Longformer, BigBird, or NSA). However, from my understanding, MSA does not modify the internal attention computation — the standard dense attention is preserved. The "sparsity" in MSA refers to selecting a sparse subset of external documents via a separate router projector, and then fusing their compressed KV caches into the standard attention context.

| Aspect | Traditional Sparse Attention | MSA |
| --- | --- | --- |
| Sparsity scope | Within a single sequence | Across an external document bank |
| Sparsity granularity | Token-level | Document-level |
| Selection mechanism | Attention scores / fixed patterns | Separate router projector + cosine similarity |
| Operates on | Full-resolution token representations | Compressed (mean-pooled) KV cache |

Given these differences, would it be more accurate to characterize MSA as "Sparse Retrieval with Attention-based Fusion" rather than a sparse attention mechanism? I'd be interested to hear the authors' perspective on how MSA relates to the sparse attention lineage versus the retrieval-augmented generation lineage.
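To illustrate the distinction I am drawing, here is a minimal contrast between the two senses of "sparsity". Both functions are illustrative stand-ins (a Longformer-style sliding window versus my reading of MSA's router), not code from any of the cited methods:

```python
import numpy as np

# Toy contrast between the two notions of "sparsity" discussed above.
# Both functions are illustrative stand-ins, not actual code from
# Longformer/BigBird/NSA or from MSA.

def local_sparse_attention_mask(seq_len, window=2):
    # Token-level sparsity WITHIN one sequence: position i may attend
    # only to positions j with |i - j| <= window. The attention
    # computation itself is restricted.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.abs(i - j) <= window

def document_level_selection(q_r, routing_keys, k=2):
    # Document-level sparsity ACROSS an external bank: cosine similarity
    # of Q^R against per-document routing keys selects which documents'
    # (compressed) KV enters the otherwise-dense attention context.
    q = q_r / np.linalg.norm(q_r)
    keys = routing_keys / np.linalg.norm(routing_keys, axis=1, keepdims=True)
    sims = keys @ q
    return np.argsort(-sims)[:k]

mask = local_sparse_attention_mask(6, window=1)
print(mask.sum())  # → 16 attended pairs instead of 36
```

In the first case the attention matrix itself is sparsified; in the second, attention stays dense and the sparsity lives entirely in the selection stage — which is what motivates my naming question.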


Thanks for your time! Looking forward to your response.
