ab-bhorania/Attention-Positional-Encoding

Attention Positional Encoding (APE)

A minimal PyTorch experiment exploring an alternative way to handle positional information in attention mechanisms.

🚀 Overview

This notebook introduces Attention Positional Encoding (APE), which rests on one simple idea:

Instead of injecting positional information into token embeddings, add it directly to the attention matrix.

Why?

In standard Transformer architectures:

  • Positional encodings are added to embeddings
  • This mixes content (token meaning) with position (sequence order)

APE proposes:

  • Keep embeddings pure (only semantic information)
  • Inject positional bias at the attention level

🧠 Key Concepts

1. Attention Positional Encoding (APE)

APE generates a positional bias matrix based on relative distances between tokens:

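import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    # position[i, j] = j - i, the signed offset from query i to key j
    position = seq[None, :] - seq[:, None]
    # Linear decay: 1 on the diagonal, 1/seq_len for the farthest pair
    ape = 1 - position.abs() / seq_len
    return ape * scale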

Example Output

tensor([
 [0.10, 0.08, 0.06, 0.04, 0.02],
 [0.08, 0.10, 0.08, 0.06, 0.04],
 [0.06, 0.08, 0.10, 0.08, 0.06],
 [0.04, 0.06, 0.08, 0.10, 0.08],
 [0.02, 0.04, 0.06, 0.08, 0.10]
])

Interpretation

  • Diagonal = largest bias (each token is nudged toward itself)
  • Nearby tokens = larger bias
  • Distant tokens = smaller bias
  • The bias decays linearly with distance
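These properties can be checked directly. The snippet below is a minimal sketch that repeats the function from above so it runs on its own; the variable names (`diagonal_is_max`, `is_symmetric`, `step`) are only illustrative. It also shows one consequence of using the absolute distance: the matrix is symmetric, so the bias by itself does not distinguish a token to the left from a token to the right.

```python
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]
    return (1 - position.abs() / seq_len) * scale

ape = attention_positional_encoding(seq_len=5, scale=0.1)

# Every row peaks on the diagonal, i.e. at the token's own position.
diagonal_is_max = bool((ape.argmax(dim=-1) == torch.arange(5)).all())

# The bias depends only on |i - j|, so the matrix equals its transpose.
is_symmetric = torch.equal(ape, ape.T)

# Moving one position away lowers the bias by scale / seq_len (about 0.02 here).
step = (ape[0, 0] - ape[0, 1]).item()
```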

Usage

Add APE directly to attention logits:

import math
attention_scores = (Q @ K.T) / math.sqrt(d)  # Q, K: (seq_len, d)
attention_scores += ape  # bias is added before the softmax
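To see what the bias does once it reaches the softmax, here is a small sketch of single-head attention with APE added to the logits. The helper name `attention_with_ape` is hypothetical, and the softmax and value multiplication are the standard scaled dot-product steps rather than something taken from the notebook. With all-zero queries and keys the content term vanishes, so the attention weights are determined by the bias alone.

```python
import math
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]
    return (1 - position.abs() / seq_len) * scale

def attention_with_ape(q, k, v, scale=0.1):
    """Scaled dot-product attention with the APE bias added to the logits."""
    seq_len, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores + attention_positional_encoding(seq_len, scale)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Zero queries and keys: the logits equal the bias, nothing else.
q = k = torch.zeros(5, 8)
v = torch.randn(5, 8)
out, weights = attention_with_ape(q, k, v, scale=1.0)

# Each row of `weights` sums to 1 and is largest at its own position.
```

A larger `scale` makes the positional preference sharper relative to the content term; the default of 0.1 keeps it a mild nudge.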

2. Embedded QKV (No Projection Layer)

Instead of using separate linear projections for queries, keys, and values:

Standard Approach

qkv = W_qkv @ x_embed

This Notebook's Approach

Skip projection entirely and learn QKV directly in embeddings:

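import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 16  # illustrative sizes, not from the notebook
embedding = nn.Embedding(vocab_size, embed_dim * 3)

x = torch.tensor([1, 2, 3, 4, 0, 0, 0, 0, 0, 0])
q, k, v = torch.chunk(embedding(x), 3, dim=-1)  # each (10, embed_dim)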
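Putting the two ideas together, a single attention layer might look like the sketch below. This is an assumption about how the pieces combine, not code from the notebook: the class name `EmbeddedQKVAttention`, the `ape_scale` argument, and the sizes are all hypothetical. The trailing zeros in `x` may be padding; this sketch does not mask them and attends to them like any other token.

```python
import math
import torch
import torch.nn as nn

class EmbeddedQKVAttention(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_dim, ape_scale=0.1):
        super().__init__()
        self.embed_dim = embed_dim
        self.ape_scale = ape_scale
        # One table holds Q, K and V side by side; there is no projection layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim * 3)

    def forward(self, token_ids):
        q, k, v = torch.chunk(self.embedding(token_ids), 3, dim=-1)
        seq_len = token_ids.shape[-1]
        seq = torch.arange(seq_len, device=token_ids.device)
        distance = (seq[None, :] - seq[:, None]).abs()
        ape = (1 - distance / seq_len) * self.ape_scale
        # Position enters only here, as an additive bias on the logits.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.embed_dim) + ape
        return torch.softmax(scores, dim=-1) @ v

model = EmbeddedQKVAttention(vocab_size=10, embed_dim=16)
x = torch.tensor([1, 2, 3, 4, 0, 0, 0, 0, 0, 0])
out = model(x)  # shape (10, 16)
```

Because the bias broadcasts over leading dimensions, the same module accepts a batch of shape `(batch, seq_len)` without changes.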

Benefits

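  • Simpler architecture
  • No projection layers (the embedding table is three times wider, so the total parameter count drops only when vocab_size is below roughly 1.5 × embed_dim)
  • Embedding layer directly learns Q, K, V representations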
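How the parameter count compares depends on vocabulary size, because the embedding table becomes three times wider. The arithmetic below is a rough sketch under stated assumptions: one attention layer, a single bias-free QKV projection in the standard setup, and hypothetical helper names and sizes.

```python
def standard_params(vocab_size, embed_dim):
    # Embedding table plus one bias-free projection to 3 * embed_dim.
    return vocab_size * embed_dim + embed_dim * (3 * embed_dim)

def embedded_qkv_params(vocab_size, embed_dim):
    # A single embedding table that is three times wider.
    return vocab_size * (3 * embed_dim)

def compare(vocab_size, embed_dim):
    return (standard_params(vocab_size, embed_dim),
            embedded_qkv_params(vocab_size, embed_dim))

small_vocab = compare(vocab_size=10, embed_dim=64)      # (12928, 1920)
large_vocab = compare(vocab_size=30000, embed_dim=64)   # (1932288, 5760000)
```

With a toy vocabulary the embedded-QKV layout is smaller; with a vocabulary in the tens of thousands the wider table outweighs the removed projection.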

🔍 Design Philosophy

Separation of Concerns

| Component | Responsibility |
| --- | --- |
| Embedding | Token semantics only |
| APE | Positional bias |
| Attention | Interaction between tokens |

⚖️ Comparison with Standard Transformers

| Feature | Standard Transformer | APE Approach |
| --- | --- | --- |
| Positional Encoding | Added to embeddings | Added to attention logits |
| QKV | Linear projections | Directly from embeddings |
| Embedding Role | Mixed (token + position) | Pure token representation |
| Complexity | Higher | Lower |

🧪 Motivation

  • Avoid entangling positional and semantic information
  • Explore whether attention alone can handle structure with minimal bias
  • Reduce architectural overhead

🤝 Contributing

Feel free to experiment, fork, and improve the idea.

