[Questions] Clarification on Multi-hop Inference Pipeline, Re-routing Overhead, SFT Details, and Naming Convention
Hi, thanks for the great work! After carefully reading the paper, I have several questions regarding the inference pipeline, training details, and the naming of the method. I'd appreciate any clarification.
1. Clarification on the Multi-hop Inference Pipeline
Based on my reading of Section 3.5 and Figure 3, I reconstructed the following inference pipeline for Memory Interleave. Could you confirm whether this understanding is correct?
```
loop:
  1. Encode the current query context through the model to obtain Q^R
  2. Route Q^R against all cached routing keys K̄^R → select Top-16 documents
  3. Load the selected documents' compressed K̄, V̄ from CPU to GPU
  4. Autoregressively generate tokens with attention context = [Top-16 compressed KV ; local KV]
  5. If the model generates [doc_id]<|object_ref_end|>:
       → Fetch the original text of the referenced document
       → Append the original text to the current query context
       → Go back to step 1 (re-encode, re-route)
  6. If the model generates <End-of-Retrieve>:
       → Transition to final answer generation
       → Exit loop
```
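To make my reading concrete, here is the control flow above as a runnable toy. Everything here is hypothetical — the stub "model", the helper names, and the document bank are mine, not from the released implementation; only the loop structure reflects my understanding of Section 3.5:

```python
# Toy sketch of the Memory Interleave loop as I understand it from Sec. 3.5 / Fig. 3.
# All helpers are stand-ins; only the control flow matters.

DOCS = {0: "alpha facts", 1: "bravo facts", 2: "charlie facts"}

def encode(context):
    # Step 1: stand-in for the forward pass that produces Q^R.
    return len(context)

def route(q_r, top_k=2):
    # Steps 2-3: stand-in for cosine routing over cached routing keys
    # plus loading the selected documents' compressed KV to GPU.
    return sorted(DOCS)[:top_k]

def generate(context, kv):
    # Step 4: toy decoder — requests one retrieval, then terminates.
    if "alpha" not in context:
        return "[0]<|object_ref_end|>"
    return "<End-of-Retrieve>"

def memory_interleave(query, max_hops=5):
    context, hops = query, 0
    while hops < max_hops:
        q_r = encode(context)                # 1. encode current context
        kv = route(q_r)                      # 2.-3. Top-k docs, load compressed KV
        out = generate(context, kv)          # 4. decode with [compressed KV ; local KV]
        if out == "<End-of-Retrieve>":       # 6. exit to final answer generation
            return context, hops
        doc_id = int(out.split("]")[0][1:])  # 5. parse [doc_id]<|object_ref_end|>
        context += " " + DOCS[doc_id]        #    append original text, re-route next pass
        hops += 1
    return context, hops
```

Under this reading, a single-hop query is just the case where the first decode emits `<End-of-Retrieve>` and the loop exits after one iteration.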
Specific sub-questions:
- Is the pipeline above identical for both single-hop and multi-hop queries (i.e., a single unified pipeline where single-hop queries simply exit the loop after one iteration)?
- When appending original document text at step 5, does the system re-encode only the newly appended text (reusing the KV cache from the previous iteration for earlier tokens), or does it re-encode the entire expanded context from scratch?
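The second sub-question matters for cost. A crude cost model (my framing, not from the paper) shows why: full re-encoding pays roughly quadratic attention cost over the whole expanded context at every hop, while incremental encoding with a reused prefix KV cache pays only for the newly appended tokens attending to the cached prefix:

```python
# Toy cost model for the two re-encoding options in step 5 (hypothetical framing).
# "Cost" here is a unitless proxy for attention FLOPs, not a measured latency.

def full_reencode_cost(prefix_len, appended_lens):
    # Re-run attention over the entire expanded context at each hop.
    cost, total = 0, prefix_len
    for n in appended_lens:
        total += n
        cost += total * total
    return cost

def incremental_cost(prefix_len, appended_lens):
    # Only the newly appended tokens attend to the cached prefix + themselves.
    cost, total = 0, prefix_len
    for n in appended_lens:
        cost += n * (total + n)
        total += n
    return cost
```

With a 1K-token query and two 500-token appended documents, the incremental variant is already several times cheaper in this model, and the gap widens with more hops.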
2. Re-routing Overhead in Multi-hop Scenarios
Each iteration of the Memory Interleave loop requires:
- Re-encoding the appended original document text through the full model forward pass
- Re-routing `Q^R` against all ~1.56M routing key entries (for 100M tokens) across 18 layers
- Re-loading potentially different Top-16 documents' content KV from CPU
For complex multi-hop queries that may require 3-5 iterations, this overhead compounds. Have you measured the per-hop latency breakdown? Specifically:
- What is the latency of the routing step alone (cosine similarity against all routing key entries) at the 100M token scale?
- How does the end-to-end multi-hop inference latency compare to an equivalent iterative RAG pipeline (e.g., multi-turn RAG with reranking)?
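For context on the first sub-question, the routing step alone looks like a single matrix-vector product plus a Top-16 selection. A minimal sketch of what I assume that computation to be (the key dimension and pre-normalization are my guesses, not stated in the paper):

```python
# Back-of-envelope for the routing step: cosine similarity of Q^R against all
# cached routing keys, then Top-k selection. Dimensions here are assumptions.
import numpy as np

def route_top_k(q, keys, k=16):
    """Top-k document indices by cosine similarity (keys assumed pre-normalized)."""
    q = q / np.linalg.norm(q)
    scores = keys @ q                         # one matvec per layer per hop
    top = np.argpartition(scores, -k)[-k:]    # O(N) partial selection, not a full sort
    return top[np.argsort(scores[top])[::-1]] # order the k winners by score
```

At the 100M-token scale this is a ~1.56M x d matvec — cheap in isolation on a GPU, but repeated across 18 layers per hop, which is why a per-hop breakdown would be informative.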
3. SFT Data Construction and Loss Computation
The paper mentions a two-stage SFT curriculum (Section 3.3.2) but provides limited details:
- Stage 1: SFT on QA tasks with 8K context length
- Stage 2: Extended to 64K context with data cleaning
I have the following questions:
- Data construction: Could you provide more details on how the SFT training data was constructed? Specifically:
- How were the multi-hop retrieval chains decomposed into individual training samples (as mentioned: "each retrieval chain is divided into multiple training samples")?
  - Were the document IDs and `<End-of-Retrieve>`/`<|object_ref_end|>` tokens manually annotated in the training data, or generated through some automated pipeline?
- Loss computation: During SFT, what loss function was used?
- Is it the standard next-token prediction loss (cross-entropy) only on the response tokens?
  - Was `L_aux` (the contrastive routing loss from pre-training) still active during SFT, or was it dropped?
  - Was the loss computed over the generated document IDs as well, or only over the final answer tokens?
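To clarify what I mean by "loss only on the response tokens", here is the standard SFT masking I have in mind — a sketch of the conventional recipe, not a claim about the paper's actual setup. If the loss also covers the generated document IDs and `<End-of-Retrieve>`, those positions would simply stay unmasked here:

```python
# Conventional response-only SFT loss: mean NLL over unmasked positions.
# The mask values below are illustrative, not from the paper's data.
import math

def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask == 1."""
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(kept) / len(kept)

# Prompt tokens masked out (0); response tokens kept (1).
lps  = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]
loss = masked_nll(lps, mask)
```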
- Potential data leakage: Since the SFT data presumably includes specific document IDs paired with specific queries, does this create a dependency on the document corpus used during training? In other words, how does the model generalize to entirely new document collections not seen during SFT?
4. Naming: "Memory Sparse Attention" vs. "Sparse Retrieval with Attention-based Fusion"
The name "Memory Sparse Attention" implies a modification to the attention mechanism itself that introduces sparsity (similar to Longformer, BigBird, or NSA). However, from my understanding, MSA does not modify the internal attention computation — the standard dense attention is preserved. The "sparsity" in MSA refers to selecting a sparse subset of external documents via a separate router projector, and then fusing their compressed KV caches into the standard attention context.
| Aspect | Traditional Sparse Attention | MSA |
|---|---|---|
| Sparsity scope | Within a single sequence | Across an external document bank |
| Sparsity granularity | Token-level | Document-level |
| Selection mechanism | Attention scores / fixed patterns | Separate router projector + cosine similarity |
| Operates on | Full-resolution token representations | Compressed (mean-pooled) KV cache |
Given these differences, would it be more accurate to characterize MSA as "Sparse Retrieval with Attention-based Fusion" rather than a sparse attention mechanism? I'd be interested to hear the authors' perspective on how MSA relates to the sparse attention lineage versus the retrieval-augmented generation lineage.
Thanks for your time! Looking forward to your response.