[Questions] Clarification on Multi-hop Inference Pipeline, Re-routing Overhead, SFT Details, and Naming Convention
Hi, thanks for the great work! After carefully reading the paper, I have several questions regarding the inference pipeline, training details, and the naming of the method. I'd appreciate any clarification.
1. Clarification on the Multi-hop Inference Pipeline
Based on my reading of Section 3.5 and Figure 3, I reconstructed the following inference pipeline for Memory Interleave. Could you confirm whether this understanding is correct?
```
loop:
  1. Encode the current query context through the model to obtain Q^R
  2. Route Q^R against all cached routing keys K̄^R → select Top-16 documents
  3. Load the selected documents' compressed K̄, V̄ from CPU to GPU
  4. Autoregressively generate tokens with attention context = [Top-16 compressed KV ; local KV]
  5. If the model generates [doc_id]<|object_ref_end|>:
       → Fetch the original text of the referenced document
       → Append the original text to the current query context
       → Go back to step 1 (re-encode, re-route)
  6. If the model generates <End-of-Retrieve>:
       → Transition to final answer generation
       → Exit loop
```
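To make my reading concrete, here is the control flow above as a runnable toy. Everything here is hypothetical — the stub "model", the helper names, and the document bank are mine, not from the released implementation; only the loop structure reflects my understanding of Section 3.5:

```python
# Toy sketch of the Memory Interleave loop as I understand it from Sec. 3.5 / Fig. 3.
# All helpers are stand-ins; only the control flow matters.

DOCS = {0: "alpha facts", 1: "bravo facts", 2: "charlie facts"}

def encode(context):
    # Step 1: stand-in for the forward pass that produces Q^R.
    return len(context)

def route(q_r, top_k=2):
    # Steps 2-3: stand-in for cosine routing over cached routing keys
    # plus loading the selected documents' compressed KV to GPU.
    return sorted(DOCS)[:top_k]

def generate(context, kv):
    # Step 4: toy decoder — requests one retrieval, then terminates.
    if "alpha" not in context:
        return "[0]<|object_ref_end|>"
    return "<End-of-Retrieve>"

def memory_interleave(query, max_hops=5):
    context, hops = query, 0
    while hops < max_hops:
        q_r = encode(context)                # 1. encode current context
        kv = route(q_r)                      # 2.-3. Top-k docs, load compressed KV
        out = generate(context, kv)          # 4. decode with [compressed KV ; local KV]
        if out == "<End-of-Retrieve>":       # 6. exit to final answer generation
            return context, hops
        doc_id = int(out.split("]")[0][1:])  # 5. parse [doc_id]<|object_ref_end|>
        context += " " + DOCS[doc_id]        #    append original text, re-route next pass
        hops += 1
    return context, hops
```

Under this reading, a single-hop query is just the case where the first decode emits `<End-of-Retrieve>` and the loop exits after one iteration.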
Specific sub-questions:
- Is the pipeline above identical for both single-hop and multi-hop queries (i.e., a single unified pipeline where single-hop queries simply exit the loop after one iteration)?
- When appending original document text at step 5, does the system re-encode only the newly appended text (reusing the KV cache from the previous iteration for earlier tokens), or does it re-encode the entire expanded context from scratch?
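The second sub-question matters for cost. A crude cost model (my framing, not from the paper) shows why: full re-encoding pays roughly quadratic attention cost over the whole expanded context at every hop, while incremental encoding with a reused prefix KV cache pays only for the newly appended tokens attending to the cached prefix:

```python
# Toy cost model for the two re-encoding options in step 5 (hypothetical framing).
# "Cost" here is a unitless proxy for attention FLOPs, not a measured latency.

def full_reencode_cost(prefix_len, appended_lens):
    # Re-run attention over the entire expanded context at each hop.
    cost, total = 0, prefix_len
    for n in appended_lens:
        total += n
        cost += total * total
    return cost

def incremental_cost(prefix_len, appended_lens):
    # Only the newly appended tokens attend to the cached prefix + themselves.
    cost, total = 0, prefix_len
    for n in appended_lens:
        cost += n * (total + n)
        total += n
    return cost
```

With a 1K-token query and two 500-token appended documents, the incremental variant is already several times cheaper in this model, and the gap widens with more hops.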
2. Re-routing Overhead in Multi-hop Scenarios
Each iteration of the Memory Interleave loop requires:
- Re-encoding the appended original document text through the full model forward pass
- Re-routing `Q^R` against all ~1.56M routing key entries (for 100M tokens) across 18 layers
- Re-loading potentially different Top-16 documents' content KV from CPU
For complex multi-hop queries that may require 3-5 iterations, this overhead compounds. Have you measured the per-hop latency breakdown? Specifically:
- What is the latency of the routing step alone (cosine similarity against all routing key entries) at the 100M token scale?
- How does the end-to-end multi-hop inference latency compare to an equivalent iterative RAG pipeline (e.g., multi-turn RAG with reranking)?
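For context on the first sub-question, the routing step alone looks like a single matrix-vector product plus a Top-16 selection. A minimal sketch of what I assume that computation to be (the key dimension and pre-normalization are my guesses, not stated in the paper):

```python
# Back-of-envelope for the routing step: cosine similarity of Q^R against all
# cached routing keys, then Top-k selection. Dimensions here are assumptions.
import numpy as np

def route_top_k(q, keys, k=16):
    """Top-k document indices by cosine similarity (keys assumed pre-normalized)."""
    q = q / np.linalg.norm(q)
    scores = keys @ q                         # one matvec per layer per hop
    top = np.argpartition(scores, -k)[-k:]    # O(N) partial selection, not a full sort
    return top[np.argsort(scores[top])[::-1]] # order the k winners by score
```

At the 100M-token scale this is a ~1.56M x d matvec — cheap in isolation on a GPU, but repeated across 18 layers per hop, which is why a per-hop breakdown would be informative.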
3. SFT Data Construction and Loss Computation
The paper mentions a two-stage SFT curriculum (Section 3.3.2) but provides limited details:
- Stage 1: SFT on QA tasks with 8K context length
- Stage 2: Extended to 64K context with data cleaning
I have the following questions:
- Data construction: Could you provide more details on how the SFT training data was constructed? Specifically:
- How were the multi-hop retrieval chains decomposed into individual training samples (as mentioned: "each retrieval chain is divided into multiple training samples")?
  - Were the document IDs and `<End-of-Retrieve>`/`<|object_ref_end|>` tokens manually annotated in the training data, or generated through some automated pipeline?
- Loss computation: During SFT, what loss function was used?
- Is it the standard next-token prediction loss (cross-entropy) only on the response tokens?
  - Was `L_aux` (the contrastive routing loss from pre-training) still active during SFT, or was it dropped?
  - Was the loss computed over the generated document IDs as well, or only over the final answer tokens?
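To clarify what I mean by "loss only on the response tokens", here is the standard SFT masking I have in mind — a sketch of the conventional recipe, not a claim about the paper's actual setup. If the loss also covers the generated document IDs and `<End-of-Retrieve>`, those positions would simply stay unmasked here:

```python
# Conventional response-only SFT loss: mean NLL over unmasked positions.
# The mask values below are illustrative, not from the paper's data.
import math

def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask == 1."""
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(kept) / len(kept)

# Prompt tokens masked out (0); response tokens kept (1).
lps  = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]
loss = masked_nll(lps, mask)
```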
- Potential data leakage: Since the SFT data presumably includes specific document IDs paired with specific queries, does this create a dependency on the document corpus used during training? In other words, how does the model generalize to entirely new document collections not seen during SFT?
4. Naming: "Memory Sparse Attention" vs. "Sparse Retrieval with Attention-based Fusion"
The name "Memory Sparse Attention" implies a modification to the attention mechanism itself that introduces sparsity (similar to Longformer, BigBird, or NSA). However, from my understanding, MSA does not modify the internal attention computation — the standard dense attention is preserved. The "sparsity" in MSA refers to selecting a sparse subset of external documents via a separate router projector, and then fusing their compressed KV caches into the standard attention context.
| Aspect | Traditional Sparse Attention | MSA |
|---|---|---|
| Sparsity scope | Within a single sequence | Across an external document bank |
| Sparsity granularity | Token-level | Document-level |
| Selection mechanism | Attention scores / fixed patterns | Separate router projector + cosine similarity |
| Operates on | Full-resolution token representations | Compressed (mean-pooled) KV cache |
Given these differences, would it be more accurate to characterize MSA as "Sparse Retrieval with Attention-based Fusion" rather than a sparse attention mechanism? I'd be interested to hear the authors' perspective on how MSA relates to the sparse attention lineage versus the retrieval-augmented generation lineage.
Thanks for your time! Looking forward to your response.