A minimal PyTorch experiment exploring an alternative way to handle positional information in attention mechanisms.
This notebook introduces Attention Positional Encoding (APE) — a simple idea:
Instead of injecting positional information into token embeddings, add it directly to the attention matrix.
In standard Transformer architectures:
- Positional encodings are added to embeddings
- This mixes content (token meaning) with position (sequence order)
APE proposes:
- Keep embeddings pure (only semantic information)
- Inject positional bias at the attention level
APE generates a positional bias matrix based on relative distances between tokens:
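```python
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]   # pairwise offsets j - i
    ape = 1 - abs(position) / seq_len        # 1 on the diagonal, decaying linearly with distance
    return ape * scale
```

With the default arguments (`seq_len=5`, `scale=0.1`) this returns:

```
tensor([
[0.10, 0.08, 0.06, 0.04, 0.02],
[0.08, 0.10, 0.08, 0.06, 0.04],
[0.06, 0.08, 0.10, 0.08, 0.06],
[0.04, 0.06, 0.08, 0.10, 0.08],
[0.02, 0.04, 0.06, 0.08, 0.10]
])
```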
- Diagonal = largest bias (self-focus)
- Nearby tokens = higher weight
- Distant tokens = lower weight
- Smooth linear decay based on distance
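The properties listed above can be checked directly. The following is a minimal sketch that restates the function so it runs on its own; the assertions are illustrative checks and are not part of the original notebook.

```python
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]      # signed offset j - i
    return (1 - position.abs() / seq_len) * scale

ape = attention_positional_encoding(seq_len=5, scale=0.1)

# Symmetric: the bias depends only on |i - j|, not on direction.
assert torch.equal(ape, ape.T)
# The diagonal carries the full scale.
assert torch.allclose(ape.diagonal(), torch.full((5,), 0.1))
# Each step away from the diagonal lowers the bias by scale / seq_len.
first_row_steps = ape[0, :-1] - ape[0, 1:]
assert torch.allclose(first_row_steps, torch.full((4,), 0.1 / 5))
```

Because the matrix is symmetric, this bias encodes how far apart two tokens are but not which one comes first.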
Add APE directly to attention logits:
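```python
attention_scores = (Q @ K.T) / d ** 0.5    # scaled dot-product logits; d is the key dimension
attention_scores = attention_scores + ape  # positional bias is added before the softmax
```

Instead of using separate linear projections for queries, keys, and values:

```python
qkv = W_qkv @ x_embed
```

skip the projection entirely and learn Q, K, and V directly in the embedding table:

```python
# vocab_size and embed_dim are assumed to be defined earlier in the notebook
embedding = nn.Embedding(vocab_size, embed_dim * 3)  # each row stores [q | k | v] for one token
x = torch.tensor([1, 2, 3, 4, 0, 0, 0, 0, 0, 0])     # token ids
q, k, v = torch.chunk(embedding(x), 3, dim=-1)       # each has shape (seq_len, embed_dim)
```

- Simpler architecture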
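- No projection layers (the embedding table becomes three times wider instead, so the total parameter count only drops when `vocab_size` is smaller than roughly 1.5 × `embed_dim`)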
- Embedding layer directly learns Q, K, V representations
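Putting the pieces together, a single-head forward pass might look like the sketch below. The sizes `vocab_size = 16` and `embed_dim = 8` are hypothetical values chosen for illustration, and no masking is applied to the trailing `0` ids, whatever they represent in the notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, embed_dim = 16, 8           # hypothetical sizes, chosen for illustration
embedding = nn.Embedding(vocab_size, embed_dim * 3)

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]
    return (1 - position.abs() / seq_len) * scale

x = torch.tensor([1, 2, 3, 4, 0, 0, 0, 0, 0, 0])      # token ids from the example above
q, k, v = torch.chunk(embedding(x), 3, dim=-1)        # each (seq_len, embed_dim)

scores = (q @ k.T) / embed_dim ** 0.5                 # scaled dot-product logits
scores = scores + attention_positional_encoding(seq_len=x.shape[0])
weights = torch.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ v                                     # (seq_len, embed_dim)

print(out.shape)  # torch.Size([10, 8])
```

Note that the embeddings themselves never see a position: the only place sequence order enters is the bias added to `scores`.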
| Component | Responsibility |
|---|---|
| Embedding | Token semantics only |
| APE | Positional bias |
| Attention | Interaction between tokens |

Compared with a standard Transformer:

| Feature | Standard Transformer | APE Approach |
|---|---|---|
| Positional Encoding | Added to embeddings | Added to attention |
| QKV | Linear projections | Directly from embeddings |
| Embedding Role | Mixed (token + position) | Pure token representation |
| Architectural complexity | Higher | Lower |
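The complexity row refers to the number of components, which is not the same as the number of parameters. As a rough check, the sketch below counts parameters for both layouts under assumed sizes (`vocab_size = 1000`, `embed_dim = 64`, bias-free projections); `n_params` is a hypothetical helper, not part of the notebook.

```python
from torch import nn

def n_params(*modules):
    # Total number of learnable scalars across the given modules.
    return sum(p.numel() for m in modules for p in m.parameters())

vocab_size, embed_dim = 1000, 64   # hypothetical sizes

# Standard: one embedding table plus three bias-free d x d projections.
standard = n_params(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, embed_dim, bias=False),
    nn.Linear(embed_dim, embed_dim, bias=False),
    nn.Linear(embed_dim, embed_dim, bias=False),
)
# Fused: a single table three times as wide, no projections.
fused = n_params(nn.Embedding(vocab_size, embed_dim * 3))

print(standard, fused)  # 76288 192000
```

With these sizes the fused table is the larger of the two, so the simplification is architectural; whether it also saves parameters depends on the ratio of vocabulary size to embedding width.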
- Avoid entangling positional and semantic information
- Explore whether attention alone can handle structure with minimal bias
- Reduce architectural overhead
Feel free to experiment, fork, and improve the idea.