Attention mechanisms are the core reason large language models work. Here's a clear, technical breakdown of how they actually function.
Muunsparks
2026-03-13
Every large language model you've used — GPT, Claude, Gemini — runs on the same foundational idea: the attention mechanism. Not fine-tuning, not RLHF, not scale. Attention is the thing that made modern AI possible, and most explanations of it are either too vague or too dense to be useful.
This is the one that gets it right.
Before attention models existed, sequence-to-sequence tasks (translation, summarization, question answering) were handled by RNNs — recurrent neural networks. The setup was simple in theory: read a sentence word by word, compress the entire meaning into a single fixed-size vector, then decode that vector into an output.
The problem is that last part — fixed-size vector. You're asking a single blob of numbers to represent an entire sentence, or paragraph, or document. The longer the input, the more gets lost. Translate "The cat sat on the mat" and you're fine. Try to translate a legal contract and early RNNs would forget what the subject was by the time they got to the verb.
This is the long-range dependency problem. Meaning in language doesn't travel in a straight line. The word "it" in "The trophy didn't fit in the suitcase because it was too big" refers to the trophy, not the suitcase — and figuring that out requires connecting two pieces of information that sit several words apart. RNNs struggled with this badly.
Researchers tried patching it. LSTMs (Long Short-Term Memory networks) added explicit memory gates to help information survive across longer sequences. They helped. They weren't enough. The compression bottleneck remained, and training LSTMs on long sequences was painfully slow because each step depended on the previous one — no parallelization.
The field needed something structurally different. In 2017, a team at Google published "Attention Is All You Need" and that paper did exactly what its title promised.
Here's the core intuition: instead of compressing everything into one vector, what if the model could look back at every single input token when producing each output token — and decide, dynamically, which ones matter most?
That's attention. At each step, the model pays different amounts of "attention" to different parts of the input. Some words get high weight. Some get almost none. The weights are learned and context-dependent.
The attention mechanism is built on three components — query, key, and value — which sound abstract until you think of them like a search engine: the query is what a token is looking for, the keys describe what each token has to offer, and the values are the information actually retrieved.
For each query, the model computes a similarity score against every key. High similarity = high attention weight. Low similarity = ignored. Then it retrieves a weighted sum of the values.
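To make the analogy concrete, here's a toy contrast (the names and numbers are purely illustrative): a Python dict does a hard, exact-match lookup, while attention does a soft lookup that blends every value by how well its key matches the query.

import torch
import torch.nn.functional as F

# Hard lookup: the query must match a key exactly; you get back exactly one value.
table = {"cat": 1.0, "mat": 2.0}
hard_result = table["cat"]                   # -> 1.0

# Soft lookup: score the query against every key, turn the scores into weights,
# and return a weighted blend of all the values.
query = torch.tensor([1.0, 0.0])             # what we're looking for
keys = torch.tensor([[0.9, 0.1],             # key similar to the query
                     [0.1, 0.9]])            # key dissimilar to the query
values = torch.tensor([[1.0], [2.0]])        # value attached to each key

weights = F.softmax(keys @ query, dim=-1)    # similarities -> probabilities
soft_result = weights @ values               # weighted sum of values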
Mathematically, attention is softmax(QKᵀ / √d_k) · V, a similarity-weighted average of the values. In PyTorch:
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: tensors of shape (batch, seq_len, d_k)
    Returns: output and attention weights
    """
    d_k = Q.shape[-1]

    # Step 1: Compute similarity scores between each query and all keys
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Step 2: Convert scores to probabilities (soft weights that sum to 1)
    weights = F.softmax(scores, dim=-1)

    # Step 3: Weighted sum of values
    output = torch.matmul(weights, V)
    return output, weights
The / (d_k ** 0.5) scaling prevents the dot products from getting so large that softmax saturates and gradients vanish — a subtle but critical detail.
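A quick sanity check of the function above, with made-up shapes (batch of 2, sequence length 5, d_k of 64 are arbitrary choices for the sketch): each query's attention weights should sum to 1, and dropping the scaling shows the saturation problem directly.

Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)             # torch.Size([2, 5, 64])
print(weights.sum(dim=-1))      # every row of weights sums to ~1.0

# Without the 1/sqrt(d_k) scaling, raw dot products grow with d_k,
# so softmax concentrates nearly all of its mass on a single token.
raw = torch.matmul(Q, K.transpose(-2, -1))
print(F.softmax(raw, dim=-1).max(dim=-1).values.mean())              # typically close to 1
print(F.softmax(raw / 64 ** 0.5, dim=-1).max(dim=-1).values.mean())  # noticeably more spread out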
One attention pass looks at the input through a single lens. But a sentence can have multiple relationships worth tracking simultaneously — subject-verb agreement, pronoun resolution, tonal context, syntactic structure.
Multi-head attention runs several attention operations in parallel, each with different learned Q/K/V projections. Think of it as asking multiple questions about the same sentence at once and concatenating the answers.
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Projections for queries, keys, and values (all heads combined; split per head below)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # Output projection

    def split_heads(self, x, batch_size):
        # Reshape to (batch, heads, seq_len, d_k)
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q, K, V):
        batch_size = Q.size(0)

        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)

        attn_output, _ = scaled_dot_product_attention(Q, K, V)

        # Recombine heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
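Here's a rough usage sketch for self-attention; d_model of 512 and 8 heads mirror the original Transformer paper, and the input tensor is random placeholder data.

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)       # (batch, seq_len, d_model)

# Self-attention: the same sequence supplies queries, keys, and values
out = mha(x, x, x)
print(out.shape)                  # torch.Size([2, 10, 512])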
GPT-3 uses 96 attention heads per layer, stacked across 96 layers. Each one specializes — not by design, but by training. Researchers have found heads that track syntactic dependencies, heads that handle coreference, heads that focus on recent tokens. The specialization emerges; it isn't programmed.
Two variants show up constantly:
Self-attention: The query, key, and value all come from the same sequence. The model attends to its own input — each token in a sentence looking at all other tokens to build context. This is how LLMs process a prompt.
Cross-attention: The query comes from one sequence, but the keys and values come from another. Used in encoder-decoder models (like early translation systems), where the decoder generating an output word attends back to the full encoded input.
Modern decoder-only models (GPT-style) use self-attention exclusively. Encoder-decoder models (T5, original Transformer) use both.
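In terms of the module above, the difference is only where Q, K, and V come from. A sketch, reusing the mha module from the previous example with placeholder decoder and encoder tensors:

decoder_states = torch.randn(2, 7, 512)    # the sequence being generated
encoder_states = torch.randn(2, 12, 512)   # the encoded input sequence

# Self-attention: queries, keys, and values all come from the same sequence
self_attn_out = mha(decoder_states, decoder_states, decoder_states)

# Cross-attention: queries from the decoder, keys and values from the encoder
cross_attn_out = mha(decoder_states, encoder_states, encoder_states)
print(cross_attn_out.shape)                # torch.Size([2, 7, 512]) — one output per decoder position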
The structural implication of attention is easy to miss: it enables full parallel processing of input sequences.
RNNs were sequential by design — you couldn't process token 5 until you'd processed tokens 1 through 4. Attention has no such constraint. Every token attends to every other token simultaneously. That means training on a sequence of length 1,000 takes roughly the same number of sequential steps as training on a sequence of length 10. The work is parallelized across hardware.
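A rough illustration of the contrast (toy code, not real model internals): the RNN cell has to be stepped through the sequence in a Python loop, one position at a time, while the attention scores for every pair of positions come out of a single matrix multiply.

x = torch.randn(1, 1000, 512)     # (batch, seq_len, d_model)

# RNN-style: 1,000 sequential steps; step t cannot start until step t-1 finishes
rnn = nn.RNNCell(512, 512)
h = torch.zeros(1, 512)
for t in range(x.size(1)):
    h = rnn(x[0, t].unsqueeze(0), h)

# Attention-style: all 1,000 x 1,000 pairwise scores in one parallel operation
scores = torch.matmul(x, x.transpose(-2, -1)) / 512 ** 0.5   # (1, 1000, 1000)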
This unlocked scale. When you can train on hundreds of billions of tokens with modern GPUs, you can build models with emergent capabilities that smaller, sequential architectures couldn't develop. The LLM attention mechanism isn't just a better algorithm — it's what made it economically feasible to build at scale.
There's also a qualitative shift in what models can learn. Because attention weights are computed dynamically from context, the same word gets represented differently depending on its surroundings. "Bank" in "river bank" and "bank" in "investment bank" produce different attention patterns and different internal representations. This context-sensitivity is the foundation of genuine language understanding — or at least, something that looks convincingly like it.
Attention isn't free. A few constraints matter in practice:
Quadratic scaling. Standard self-attention computes pairwise scores between all tokens — if your sequence is length N, that's N² comparisons. Double the context window and you quadruple the compute. This is why long-context models are expensive and why sparse attention variants and sliding window approaches have been active research areas.
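You can see the N² growth directly in the size of the score matrix; the sequence lengths below are arbitrary:

for n in (1_000, 2_000, 4_000):
    q = torch.randn(1, n, 64)
    scores = torch.matmul(q, q.transpose(-2, -1))   # shape (1, n, n)
    print(n, scores.numel())   # 1,000,000 -> 4,000,000 -> 16,000,000 entries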
No inherent positional awareness. Attention treats input as a set, not a sequence. The model has no idea that token 3 comes before token 7 unless you explicitly encode position. This is handled by positional encodings — either learned embeddings added to token embeddings, or more recent approaches like RoPE (Rotary Position Embedding) used in LLaMA and others.
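For reference, a minimal sketch of the original Transformer's sinusoidal encoding, which gets added to the token embeddings before attention is applied (RoPE works differently, rotating Q and K inside the attention computation, and isn't shown here):

import math

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = torch.arange(seq_len).unsqueeze(1).float()             # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))             # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions get cosine
    return pe

embeddings = torch.randn(10, 512)                   # placeholder token embeddings
embeddings = embeddings + sinusoidal_positional_encoding(10, 512)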
Attention ≠ understanding. Attention weights tell you which tokens the model weighted heavily — they don't always map cleanly to "what the model was thinking." Interpretability research has shown that attention patterns can be misleading. Treat attention visualization as a useful approximation, not a ground truth.
Tags: attention mechanism, attention models, transformers, llm attention mechanism, deep learning