Attention Positional Encoding (APE)

A minimal PyTorch experiment exploring an alternative way to handle positional information in attention mechanisms.

🚀 Overview

This notebook introduces Attention Positional Encoding (APE) — a simple idea:

Instead of injecting positional information into token embeddings, add it directly to the attention matrix.

Why?

In standard Transformer architectures:

Positional encodings are added to embeddings
This mixes content (token meaning) with position (sequence order)

APE proposes:

Keep embeddings pure (only semantic information)
Inject positional bias at the attention level
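
As a rough illustration of the difference, the sketch below feeds the same token in at two positions. The learned `pos_emb` table and the toy sizes are assumptions made for this example (fixed sinusoidal encodings are another common choice); they are not taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_len, embed_dim = 10, 16, 8       # toy sizes (assumed)
token_emb = nn.Embedding(vocab_size, embed_dim)
pos_emb = nn.Embedding(max_len, embed_dim)       # learned positions, one common variant (assumed)

x = torch.tensor([3, 7, 3])                      # token 3 appears at positions 0 and 2
positions = torch.arange(x.shape[0])

standard_input = token_emb(x) + pos_emb(positions)   # content and position summed together
ape_input = token_emb(x)                              # content only

print(torch.equal(standard_input[0], standard_input[2]))  # False: position changed the vector
print(torch.equal(ape_input[0], ape_input[2]))            # True: same token, same vector
```

With APE, the order information has to enter somewhere else, which is what the attention-level bias is for.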

🧠 Key Concepts

1. Attention Positional Encoding (APE)

APE generates a positional bias matrix from the relative distance between token positions i and j, so that entry (i, j) equals scale * (1 - |i - j| / seq_len):

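import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]   # signed offset j - i, shape (seq_len, seq_len)
    ape = 1 - position.abs() / seq_len       # 1 on the diagonal, linear decay with distance
    return ape * scale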

Example Output (seq_len=5, scale=0.1)

tensor([
 [0.10, 0.08, 0.06, 0.04, 0.02],
 [0.08, 0.10, 0.08, 0.06, 0.04],
 [0.06, 0.08, 0.10, 0.08, 0.06],
 [0.04, 0.06, 0.08, 0.10, 0.08],
 [0.02, 0.04, 0.06, 0.08, 0.10]
])

Interpretation

Diagonal = largest bias (each token is nudged toward itself)
Nearby tokens = larger bias
Distant tokens = smaller bias
Smooth linear decay based on distance
Symmetric: the bias depends on distance only, not on direction
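
The properties listed above can be checked directly. This is a small sketch that repeats the function so it runs on its own; the assertions are illustrative checks, not part of the notebook.

```python
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]      # signed offset j - i
    return (1 - position.abs() / seq_len) * scale

ape = attention_positional_encoding(seq_len=5, scale=0.1)

# Symmetric: only the distance |i - j| matters, not the direction.
assert torch.equal(ape, ape.T)
# The diagonal holds the largest value in every row.
assert torch.equal(ape.argmax(dim=1), torch.arange(5))
# Each step away from the diagonal lowers the bias by scale / seq_len.
steps = ape[0, :-1] - ape[0, 1:]
assert torch.allclose(steps, torch.full((4,), 0.1 / 5))
```

Because the matrix is symmetric, this bias alone cannot tell "two tokens before" from "two tokens after".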

Usage

Add APE directly to attention logits:

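import math
attention_scores = (Q @ K.T) / math.sqrt(d)   # (seq_len, seq_len) logits; d is the key dimension
attention_scores = attention_scores + ape     # add the positional bias before the softmax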
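
A fuller, runnable version of this step might look like the sketch below. The helper name `attention_with_ape`, the single-head layout, and the random toy inputs are assumptions for illustration; the notebook may organize this differently.

```python
import math
import torch

def attention_positional_encoding(seq_len=5, scale=0.1):
    seq = torch.arange(seq_len)
    position = seq[None, :] - seq[:, None]
    return (1 - position.abs() / seq_len) * scale

def attention_with_ape(q, k, v, scale=0.1):
    """Single-head attention with the APE bias added to the logits (hypothetical helper)."""
    seq_len, d = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)   # (..., seq_len, seq_len)
    scores = scores + attention_positional_encoding(seq_len, scale)
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
q, k, v = torch.randn(3, 5, 8).unbind(0)   # toy inputs: seq_len=5, d=8
out, weights = attention_with_ape(q, k, v)
print(out.shape)            # torch.Size([5, 8])
print(weights.sum(dim=-1))  # all ones, up to rounding
```

Using `k.transpose(-2, -1)` instead of `K.T` keeps the same code working if a leading batch dimension is added.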

2. Embedded QKV (No Projection Layer)

Queries, keys, and values normally come from linear projections of the token embeddings. This notebook reads them directly from the embedding table instead.

Standard Approach

qkv = W_qkv @ x_embed

This Notebook's Approach

Skip projection entirely and learn QKV directly in embeddings:

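import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 8  # example values (assumed; not given in the original)
embedding = nn.Embedding(vocab_size, embed_dim * 3)  # each row stores [q | k | v]

x = torch.tensor([1, 2, 3, 4, 0, 0, 0, 0, 0, 0])  # token ids (trailing zeros presumably padding)
q, k, v = torch.chunk(embedding(x), 3, dim=-1)  # each has shape (10, embed_dim)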

Benefits

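Simpler architecture
No projection layers (the embedding table is three times wider instead, so the total parameter count drops only when vocab_size < 1.5 * embed_dim, ignoring projection biases)
Embedding layer directly learns Q, K, V representations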
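
Whether the parameter count actually goes down depends on the sizes involved. The sketch below counts parameters for both layouts; the helper names and the two size pairs are made up for illustration, and the standard layout is assumed to use one fused, bias-free QKV projection.

```python
import torch.nn as nn

def count_params(*modules):
    return sum(p.numel() for m in modules for p in m.parameters())

def standard_params(vocab_size, embed_dim):
    # Embedding table plus one fused QKV projection (bias omitted for simplicity).
    return count_params(nn.Embedding(vocab_size, embed_dim),
                        nn.Linear(embed_dim, 3 * embed_dim, bias=False))

def embedded_qkv_params(vocab_size, embed_dim):
    # A single table whose rows already hold [q | k | v].
    return count_params(nn.Embedding(vocab_size, 3 * embed_dim))

for vocab_size, embed_dim in [(10, 8), (1000, 8)]:
    print(vocab_size, embed_dim,
          standard_params(vocab_size, embed_dim),
          embedded_qkv_params(vocab_size, embed_dim))
# 10 8 272 240
# 1000 8 8192 24000
```

With a tiny vocabulary the embedded layout is smaller; with a large one the wider table dominates.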

🔍 Design Philosophy

Separation of Concerns

Component	Responsibility
Embedding	Token semantics only
APE	Positional bias
Attention	Interaction between tokens
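
Put together, the three responsibilities could be wired up roughly as follows. `APEAttention` is a hypothetical class written for this README, assuming a single head and no padding mask; it is not the notebook's code.

```python
import math
import torch
import torch.nn as nn

class APEAttention(nn.Module):
    """Hypothetical single-head layer: embedded QKV plus an APE bias."""

    def __init__(self, vocab_size, embed_dim, scale=0.1):
        super().__init__()
        self.embed_dim = embed_dim
        self.scale = scale
        # Embedding: token semantics only, stored as [q | k | v].
        self.qkv_embedding = nn.Embedding(vocab_size, embed_dim * 3)

    def ape(self, seq_len, device):
        # APE: fixed positional bias, no learned parameters.
        seq = torch.arange(seq_len, device=device)
        position = seq[None, :] - seq[:, None]
        return (1 - position.abs() / seq_len) * self.scale

    def forward(self, token_ids):                      # (batch, seq_len)
        q, k, v = torch.chunk(self.qkv_embedding(token_ids), 3, dim=-1)
        # Attention: token interaction, with the bias added to the logits.
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.embed_dim)
        scores = scores + self.ape(token_ids.shape[-1], token_ids.device)
        return torch.softmax(scores, dim=-1) @ v       # (batch, seq_len, embed_dim)

layer = APEAttention(vocab_size=10, embed_dim=8)
x = torch.tensor([[1, 2, 3, 4, 0, 0, 0, 0, 0, 0]])
print(layer(x).shape)  # torch.Size([1, 10, 8])
```

The only learned parameters live in the embedding table. Note that padding positions are attended to like any other token in this sketch.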

⚖️ Comparison with Standard Transformers

Feature	Standard Transformer	APE Approach
Positional Encoding	Added to embeddings	Added to attention
QKV	Linear projections	Directly from embeddings
Embedding Role	Mixed (token + position)	Pure token representation
Complexity	More components (separate QKV projections)	Fewer components (no projection layers)

🧪 Motivation

Avoid entangling positional and semantic information
Explore whether attention alone can handle structure with minimal bias
Reduce architectural overhead

🤝 Contributing

Feel free to experiment, fork, and improve the idea.

🚀 Attention Positional Encoding (APE)

Instead of adding position to embeddings, APE injects it directly into attention scores.

✅ Keeps embeddings purely semantic ✅ Adds smooth distance-based bias ✅ Simple + minimal PyTorch

Worth exploring 👇
