Everyone knows attention scales quadratically. Almost nobody talks about why that's a memory problem, not a math problem — and why it matters.
Muunsparks
2026-03-30
The standard story is that attention is expensive because of its O(n²) complexity. That's true, but it misidentifies the bottleneck. The real cost isn't multiplying matrices — it's moving data between your GPU's compute cores and its memory.
When engineers talk about attention being slow, they usually mean the sequence-length scaling problem: double your context, quadruple your memory footprint. For a sequence of length n, the attention matrix is n × n. At n = 8,192, that's 67 million entries. At n = 128,000 (Gemini-class context), you're looking at 16 billion — per layer, per head.
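The quadratic growth is easy to verify with a few lines of arithmetic. This is only a sanity check of the numbers above; the helper name is made up for illustration.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Number of entries in one n x n attention score matrix (one head, one layer)."""
    return seq_len * seq_len


for n in (4_096, 8_192, 128_000):
    print(f"n = {n:>7,}: {attention_matrix_entries(n):>18,} entries")
```

Doubling `seq_len` always multiplies the result by four, which is the whole scaling problem in one line.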
This framing leads to a natural set of solutions: sparse attention patterns, linear attention approximations, sliding window attention. Reduce the number of entries in the matrix, reduce the cost. And those approaches do work. But they're attacking the wrong bottleneck.
Here's the part that rarely gets mentioned: modern GPUs are extraordinarily good at matrix multiplication. The A100, for example, can do roughly 312 TFLOPS of BF16 tensor ops. What it's much slower at is reading and writing data. High-bandwidth memory (HBM) on the A100 tops out around 2 TB/s of bandwidth — which sounds fast until you realize your compute throughput expects to be fed at a rate that HBM structurally cannot keep up with.
The ratio of compute to memory bandwidth — the arithmetic intensity threshold — determines which operations are compute-bound and which are memory-bound. Attention, as implemented naively, sits firmly in memory-bound territory. You're spending more time on data transfers than on actual math.
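A back-of-envelope version of that threshold, using the A100 figures quoted above. The FLOP count per softmax element is an assumed rough figure, and `is_memory_bound` is a hypothetical helper, so treat this as a sketch rather than a profiler result.

```python
PEAK_FLOPS = 312e12      # A100 BF16 tensor-core peak in FLOP/s (figure quoted above)
HBM_BANDWIDTH = 2e12     # A100 HBM bandwidth in bytes/s (figure quoted above)

# Ridge point of the roofline: FLOPs the GPU can perform per byte it moves.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH


def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """An operation is memory-bound when its arithmetic intensity is below the ridge point."""
    return flops / bytes_moved < ridge_point


# Softmax over an n x n FP16 matrix: read 2 bytes and write 2 bytes per element,
# for an assumed ~5 FLOPs per element (max, subtract, exp, sum, divide).
n = 4096
softmax_flops = 5 * n * n
softmax_bytes = (2 + 2) * n * n

print(f"ridge point:       {ridge_point:.0f} FLOPs/byte")
print(f"softmax intensity: {softmax_flops / softmax_bytes:.2f} FLOPs/byte")
print("softmax memory-bound:", is_memory_bound(softmax_flops, softmax_bytes))
```

Under these assumptions the softmax step does about one FLOP per byte moved, two orders of magnitude below what the hardware needs to stay busy.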
Consider the standard scaled dot-product attention computation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
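A naive GPU implementation does this in discrete steps:
1. Compute S = QKᵀ / √d_k and write the n × n score matrix to HBM.
2. Read S back from HBM, apply softmax row by row, and write the n × n probability matrix P to HBM.
3. Read P back from HBM, multiply it by V, and write the output.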
Four round trips to HBM for the attention matrix alone. At n = 4,096 and FP16, that matrix weighs in at ~32MB. Four trips means ~128MB of memory traffic for a single head in a single layer. Across 32 layers and 32 heads, that's over 100GB of memory movement per forward pass, most of it moving data that's just intermediate scratch space.
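Here is that arithmetic spelled out, as a rough model rather than a measurement. It assumes batch size 1 and counts only the n × n intermediates; the function name and its parameters are illustrative.

```python
def naive_attention_traffic_bytes(seq_len, bytes_per_elem=2, round_trips=4,
                                  layers=1, heads=1):
    """HBM traffic spent on the n x n intermediate alone, under the four-trip model."""
    matrix_bytes = seq_len * seq_len * bytes_per_elem
    return matrix_bytes * round_trips * layers * heads


MiB = 1024 ** 2
GiB = 1024 ** 3

one_matrix = naive_attention_traffic_bytes(4096, round_trips=1)
one_head = naive_attention_traffic_bytes(4096)
full_model = naive_attention_traffic_bytes(4096, layers=32, heads=32)

print(f"one n x n matrix:     {one_matrix / MiB:.0f} MiB")
print(f"one head, four trips: {one_head / MiB:.0f} MiB")
print(f"32 layers x 32 heads: {full_model / GiB:.0f} GiB")
```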
This is the actual performance story. The bottleneck isn't the matrix multiply. It's the reading and writing of intermediate results.
FlashAttention, introduced by Dao et al. in 2022 and refined in subsequent versions, solves this not by reducing the number of operations but by restructuring where they happen.
The key insight: GPU SRAM (on-chip memory, fast, tiny: 192KB per streaming multiprocessor on an A100, roughly 20MB across the whole chip) is about an order of magnitude faster to access than HBM. If you can keep your working set in SRAM and avoid materializing the full attention matrix in HBM, you win.
FlashAttention does this with a technique called tiled computation with online softmax. Instead of computing the full n × n matrix and then applying softmax globally, it processes blocks of Q, K, and V at a time — entirely within SRAM — and incrementally updates the softmax normalization as it goes.
    # Conceptual sketch, not production code
    import math
    import torch

    # Standard attention: O(n²) HBM reads/writes
    def naive_attention(Q, K, V):
        d_k = Q.shape[-1]
        S = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # n×n written to HBM
        P = torch.softmax(S, dim=-1)                   # n×n read and written again
        return P @ V                                   # n×n read again

    # FlashAttention: tiles fit in SRAM, no n×n materialization
    def flash_attention(Q, K, V, block_size=64):
        # Process in blocks: Q[i], K[j], V[j] tiles stay on-chip
        # Maintain running softmax denominator without storing full matrix
        # O(n²) compute, but only O(n) extra memory and far fewer HBM accesses
        ...
The math works out because softmax can be computed stably in an online fashion. If you've seen the "log-sum-exp trick" for numerical stability, FlashAttention essentially applies a streaming version of that across blocks.
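To make the streaming idea concrete, here is a minimal pure-Python sketch of online softmax over one row of scores. It is not FlashAttention's kernel, only the running-max and running-denominator bookkeeping; the function names and example scores are made up.

```python
import math


def softmax_two_pass(scores):
    """Reference softmax: needs the whole row at once to find the max."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(blocks):
    """Stream over blocks of scores, keeping only a running max and denominator."""
    running_max = float("-inf")
    running_sum = 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # Rescale the old denominator to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(s - new_max) for s in block
        )
        running_max = new_max
    return running_max, running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
blocks = [scores[0:2], scores[2:4], scores[4:6]]

m, denom = online_softmax_stats(blocks)
streamed = [math.exp(s - m) / denom for s in scores]
reference = softmax_two_pass(scores)
print(all(math.isclose(a, b) for a, b in zip(streamed, reference)))
```

The rescaling factor `exp(running_max - new_max)` is the only extra work: whenever a later block raises the max, everything accumulated so far is shrunk to match.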
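The result: the same output as standard attention up to floating-point rounding (no approximation), with extra memory reduced from O(n²) to O(n) and far fewer HBM reads and writes, because the n × n matrix is never written out. HBM accesses are still quadratic in n, but Dao et al. put the count at O(n²d²/M) for SRAM size M, versus O(nd + n²) for the naive version. On an A100, FlashAttention runs 2–4× faster than the PyTorch baseline for typical sequence lengths, and the speedup grows with sequence length.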
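The exactness claim can be checked on the CPU with a small NumPy sketch of blockwise attention. This only reproduces the arithmetic, not the memory behavior or the real kernel's scheduling, and it handles a single head with no masking; all names are illustrative.

```python
import numpy as np


def naive_attention(Q, K, V):
    """Reference: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V


def tiled_attention(Q, K, V, block_size=64):
    """Blockwise attention with online softmax; never builds the n x n matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[-1]))
    for i in range(0, n, block_size):
        q = Q[i:i + block_size]
        row_max = np.full((q.shape[0], 1), -np.inf)  # running max per query row
        row_sum = np.zeros((q.shape[0], 1))          # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[-1]))    # unnormalized output
        for j in range(0, K.shape[0], block_size):
            s = q @ K[j:j + block_size].T * scale    # one tile of scores
            new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
            correction = np.exp(row_max - new_max)   # rescale earlier partial results
            p = np.exp(s - new_max)
            row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
            acc = acc * correction + p @ V[j:j + block_size]
            row_max = new_max
        out[i:i + block_size] = acc / row_sum
    return out


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```

The largest temporary in `tiled_attention` is one `block_size` × `block_size` tile, which is the property that lets the real kernel keep its working set in SRAM.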
The original version left performance on the table due to how it partitioned work across GPU thread blocks. FlashAttention-2 restructured the inner loop to improve parallelism across the sequence dimension, squeezing out another ~2× improvement.
FlashAttention-3 (targeting H100-class hardware) exploits the asynchrony of the H100's Tensor Cores and Tensor Memory Accelerator (TMA) to overlap GEMM and softmax work. The core idea remains the same; it's the implementation that's being optimized.
Practically, if you're using PyTorch >= 2.0, torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention kernel when the conditions are right (a supported CUDA GPU, FP16/BF16 inputs, causal or non-causal, no custom attention masking). When they aren't, it falls back to another implementation. You don't need to install anything extra; it's there.
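    import torch
    import torch.nn.functional as F

    # Example inputs (sizes are illustrative): (batch, heads, seq_len, head_dim)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    query = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
    key = torch.randn_like(query)
    value = torch.randn_like(query)

    # PyTorch 2.0+ can dispatch to FlashAttention automatically
    # when running on compatible hardware with FP16/BF16
    output = F.scaled_dot_product_attention(
        query,
        key,
        value,
        attn_mask=None,
        dropout_p=0.0,
        is_causal=True,  # enables causal masking without materializing the mask
    )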
The is_causal=True flag is worth highlighting: FlashAttention computes causal masking without ever storing the mask matrix, which is another O(n²) memory saving.
If the bottleneck is memory bandwidth rather than compute, then the right optimization levers are different.
Precision matters more than you might think. Going from FP32 to BF16 halves the size of every tensor moving through HBM. That's a 2× reduction in memory traffic, which for a memory-bound operation translates to up to a 2× speedup, even before counting the compute benefits. If you're still running attention in FP32 for "stability" reasons, you're probably paying a steep tax for marginal gain.
Sequence packing is a first-class optimization. If you're training on variable-length sequences and padding to the max length, your attention is doing real work on padding tokens — and all that work is memory traffic. Packing multiple short sequences into a single attention block (with masking to prevent cross-contamination) can cut memory traffic substantially for real-world datasets with skewed length distributions.
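A rough way to see how much is at stake is to count score entries before and after packing. The sketch below uses a made-up length distribution and a simple greedy first-fit packer; real data loaders use their own strategies, and whether a kernel actually skips the masked-out blocks depends on the implementation.

```python
def padded_attention_entries(lengths):
    """Score entries computed when every sequence is padded to the longest one."""
    max_len = max(lengths)
    return len(lengths) * max_len * max_len


def packed_attention_entries(lengths):
    """Score entries that matter when each sequence only attends to itself."""
    return sum(n * n for n in lengths)


def first_fit_pack(lengths, capacity):
    """Greedy first-fit-decreasing packing of sequences into fixed-size rows."""
    bins = []
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= capacity:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins


lengths = [512, 64, 128, 32, 256, 16, 8, 8]  # made-up, skewed length distribution
bins = first_fit_pack(lengths, capacity=512)

print("rows before packing:", len(lengths), "after:", len(bins))
print("padded entries:", padded_attention_entries(lengths))
print("useful entries:", packed_attention_entries(lengths))
```

For this toy batch, eight padded rows collapse into two packed rows, and only about a sixth of the padded score entries correspond to real token pairs.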
Custom attention patterns don't come cheap. If you're implementing a custom attention variant — say, cross-attention with a non-standard mask, or a retrieval-augmented attention pattern — and you're materializing intermediate tensors at n × n, you're back to the memory-bound baseline. This is why libraries like xformers exist: they provide memory-efficient implementations of common custom patterns. Reach for them before rolling your own.
The context length wars are partly a memory engineering problem. When you see a model announce "1M context," the engineering question to ask is: how are they handling the attention memory footprint? Some use sparse attention (Longformer-style sliding windows), some use linear approximations (Mamba's SSM approach sidesteps attention entirely), and some are betting that hardware evolution and FlashAttention-class optimizations will make quadratic tractable at longer sequences. The answer tells you a lot about what tradeoffs they've made.
FlashAttention is exact — it produces the same output as naive attention. But it doesn't change the asymptotic complexity. For very long sequences (say, 100k+ tokens with large d_model), you eventually hit a wall that no amount of I/O optimization can move.
Linear attention methods — which approximate the softmax kernel and restructure computation to be O(n) — are genuinely interesting for this regime. The challenge is that the quality gap between exact and approximate attention isn't zero, and it's unevenly distributed across tasks. Long-range dependencies — the exact case where you'd most want linear attention — also tend to be where the approximation quality degrades most.
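The restructuring is just matrix associativity. The NumPy sketch below uses the elu(x) + 1 feature map from the linear-transformer literature as one assumed choice; it shows that the O(n) ordering gives the same result as the explicit n × n version of the same kernelized attention. It does not reproduce softmax attention, which is exactly where the quality gap comes from.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a positive feature map, one common choice (assumed here)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def kernel_attention_quadratic(Q, K, V):
    """Kernelized attention via an explicit n x n similarity matrix."""
    A = feature_map(Q) @ feature_map(K).T           # n x n
    return (A @ V) / A.sum(axis=-1, keepdims=True)


def kernel_attention_linear(Q, K, V):
    """Same result, reassociated: build d x d summaries first, so cost is linear in n."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                                # d x d_v summary of keys and values
    k_sum = phi_k.sum(axis=0)                       # d-vector used for normalization
    return (phi_q @ kv) / (phi_q @ k_sum)[:, None]


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 16)) for _ in range(3))
print(np.allclose(kernel_attention_quadratic(Q, K, V),
                  kernel_attention_linear(Q, K, V)))
```

This is the non-causal form; a causal variant would need running prefix sums of `kv` and `k_sum` instead of a single global summary.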
The honest state of affairs: for most production workloads at most sequence lengths, exact attention with FlashAttention is the right call. For research into very long context or on-device inference, linear attention and related architectures (Mamba, RWKV, RetNet) are worth understanding.
scaled_dot_product_attention in PyTorch 2.0+: It dispatches to FlashAttention automatically for standard use cases. There's almost no reason to implement attention by hand.
Tags: transformers, attention, flashattention, performance, cuda