Attention Is Memory-Bound, Not Compute-Bound

Everyone knows attention scales quadratically. Almost nobody talks about why that's a memory problem, not a math problem — and why it matters.

Muunsparks

2026-03-30

8 min read

The standard story is that attention is expensive because of its O(n²) complexity. That's true, but it misidentifies the bottleneck. The real cost isn't multiplying matrices — it's moving data between your GPU's compute cores and its memory.

The Quadratic Story Is Incomplete

When engineers talk about attention being slow, they usually mean the sequence-length scaling problem: double your context, quadruple your memory footprint. For a sequence of length n, the attention matrix is n × n. At n = 8,192, that's 67 million entries. At n = 128,000 (Gemini-class context), you're looking at 16 billion — per layer, per head.

This framing leads to a natural set of solutions: sparse attention patterns, linear attention approximations, sliding window attention. Reduce the number of entries in the matrix, reduce the cost. And those approaches do work. But they're attacking the wrong bottleneck.

Here's the part that rarely gets mentioned: modern GPUs are extraordinarily good at matrix multiplication. The A100, for example, can do roughly 312 TFLOPS of BF16 tensor ops. What it's much slower at is reading and writing data. High-bandwidth memory (HBM) on the A100 tops out around 2 TB/s of bandwidth — which sounds fast until you realize your compute throughput expects to be fed at a rate that HBM structurally cannot keep up with.

The ratio of compute to memory bandwidth — the arithmetic intensity threshold — determines which operations are compute-bound and which are memory-bound. Attention, as implemented naively, sits firmly in memory-bound territory. You're spending more time on data transfers than on actual math.
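To put rough numbers on that threshold, here is a back-of-the-envelope roofline check using the A100 figures quoted above. It is a sketch, not a measurement: the helper names are illustrative, and the per-element FLOP count for softmax is an assumed ballpark.

```python
# Back-of-the-envelope roofline check (illustrative; all figures approximate).
PEAK_FLOPS = 312e12      # A100 BF16 tensor-core peak in FLOP/s (from the text)
HBM_BANDWIDTH = 2e12     # A100 HBM bandwidth in bytes/s (from the text)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs a kernel must perform per byte moved to keep the compute units busy."""
    return peak_flops / bandwidth


def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """A kernel is memory-bound when its arithmetic intensity is below the ridge."""
    return flops / bytes_moved < ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)


n = 4096
bytes_per_element = 2                           # FP16/BF16
# Row-wise softmax over an n x n matrix: read it once, write it once.
softmax_bytes = 2 * n * n * bytes_per_element
softmax_flops = 5 * n * n                       # ASSUMPTION: ~5 FLOPs per element

print(ridge_point(PEAK_FLOPS, HBM_BANDWIDTH))         # 156.0 FLOPs per byte
print(softmax_flops / softmax_bytes)                  # 1.25 FLOPs per byte
print(is_memory_bound(softmax_flops, softmax_bytes))  # True
```

Softmax does not run on tensor cores, so the relevant compute peak is lower than 312 TFLOPS, but the gap between roughly 1 FLOP per byte and the ridge point is wide enough that the conclusion does not change.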

What "Memory-Bound" Actually Means in Practice

Consider the standard scaled dot-product attention computation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

A naive GPU implementation does this in discrete steps:

Compute QKᵀ → write the full n × n matrix to HBM
Scale by 1/√d_k → read it back, write it back
Apply softmax row-wise → read it back, write it back
Multiply by V → read it back, produce output

Four passes over the attention matrix in HBM (six transfers if you count every read and write separately). At n = 4,096 and FP16, that matrix weighs in at ~32MB. Four passes means ~128MB of memory traffic for a single head in a single layer. Across 32 layers and 32 heads, that's over 100GB of memory movement per forward pass, most of it moving data that's just intermediate scratch space.
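The arithmetic above can be reproduced in a few lines. This sketch assumes batch size 1 and counts only the attention matrix, ignoring Q, K, V and the output; the function names are illustrative.

```python
def attention_matrix_bytes(n: int, bytes_per_element: int = 2) -> int:
    """Size of one n x n attention matrix for a single head."""
    return n * n * bytes_per_element


def naive_traffic_bytes(n, passes=4, layers=32, heads=32, bytes_per_element=2):
    """HBM traffic spent shuffling the attention matrix in one naive forward pass."""
    return attention_matrix_bytes(n, bytes_per_element) * passes * layers * heads


MiB, GiB = 2**20, 2**30
print(attention_matrix_bytes(4096) / MiB)                    # 32.0
print(naive_traffic_bytes(4096) / GiB)                       # 128.0
print(naive_traffic_bytes(4096, passes=6) / GiB)             # 192.0, reads and writes counted separately
print(naive_traffic_bytes(4096, bytes_per_element=4) / GiB)  # 256.0, same model in FP32
```

Whichever way the passes are counted, the total stays above 100GB, and it doubles again in FP32.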

This is the actual performance story. The bottleneck isn't the matrix multiply. It's the reading and writing of intermediate results.

FlashAttention: Fusion, Not Approximation

FlashAttention, introduced by Dao et al. in 2022 and refined in subsequent versions, solves this not by reducing the number of operations but by restructuring where they happen.

The key insight: GPU SRAM (on-chip memory, fast but tiny: 192KB per streaming multiprocessor on an A100, roughly 20MB across the whole chip) is about an order of magnitude faster to access than HBM. If you can keep your working set in SRAM and avoid materializing the full attention matrix in HBM, you win.

FlashAttention does this with a technique called tiled computation with online softmax. Instead of computing the full n × n matrix and then applying softmax globally, it processes blocks of Q, K, and V at a time — entirely within SRAM — and incrementally updates the softmax normalization as it goes.

# Conceptual sketch, not production code
import math
import torch

# Standard attention: O(n²) HBM reads/writes
def naive_attention(Q, K, V):
    d_k = Q.shape[-1]                               # head dimension
    S = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # n×n written to HBM
    P = torch.softmax(S, dim=-1)                    # n×n read and written again
    return P @ V                                    # n×n read again

# FlashAttention: tiles fit in SRAM, no n×n materialization
def flash_attention(Q, K, V, block_size=64):
    # Process in blocks: Q[i], K[j], V[j] tiles stay on-chip
    # Maintain a running softmax max and denominator without storing the full matrix
    # O(n²) compute, but only O(n) extra memory and far fewer HBM accesses
    ...

The math works out because softmax can be computed stably in an online fashion. If you've seen the "log-sum-exp trick" for numerical stability, FlashAttention essentially applies a streaming version of that across blocks.
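To make the streaming idea concrete, here is a minimal pure-Python sketch for a single query row. It consumes K and V one block at a time, keeps only a running max, a running denominator, and an unnormalized output accumulator, and rescales them whenever a larger score appears. This illustrates the online-softmax recurrence only; it is not how the fused GPU kernel is written, and the function names are made up for this example.

```python
import math


def naive_attention_row(q, K, V):
    """Reference: softmax(q . K^T / sqrt(d)) . V for one query, all scores at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))]


def online_attention_row(q, K, V, block_size=2):
    """Same result, but K and V are consumed block by block."""
    d = len(q)
    running_max = -math.inf        # largest score seen so far
    running_sum = 0.0              # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])        # unnormalized output, relative to running_max
    for start in range(0, len(K), block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, V_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [x + w * vc for x, vc in zip(acc, v)]
        running_max = new_max
    return [x / running_sum for x in acc]


q = [0.5, -1.0, 2.0]
K = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.1, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention_row(q, K, V)
streamed = online_attention_row(q, K, V, block_size=2)
print(all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(reference, streamed)))
```

The block size only changes how much data is live at once, not the result, which is why the tile size can be chosen to fit on-chip memory.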

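The result: the same output as standard attention (no approximation, up to floating-point rounding), with the extra memory for attention scores cut from O(n²) to O(n) and far fewer HBM accesses, because the n × n matrix is never written out. HBM traffic is not literally linear in n: the FlashAttention paper puts it at O(n²d²/M) for head dimension d and SRAM size M, versus O(nd + n²) for the standard implementation, which works out to several times fewer accesses at typical sizes. On an A100, FlashAttention runs 2–4× faster than the PyTorch baseline for typical sequence lengths, and the speedup grows with sequence length.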

FlashAttention-2 and Beyond

The original version left performance on the table due to how it partitioned work across GPU thread blocks. FlashAttention-2 restructured the inner loop to improve parallelism across the sequence dimension, squeezing out another ~2× improvement.

FlashAttention-3 (targeting H100-class hardware) uses asynchronous execution to overlap GEMM and softmax work, taking advantage of Hopper's Tensor Cores and Tensor Memory Accelerator (TMA), and adds FP8 support. The core idea remains the same; it's the implementation that's being optimized.

Practically, if you're using PyTorch >= 2.0, torch.nn.functional.scaled_dot_product_attention can use FlashAttention under the hood when the conditions are right (CUDA GPU, FP16/BF16 inputs, causal or non-causal, no custom attention mask). When they aren't, it falls back to a different backend rather than failing. You don't need to install anything extra.

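import torch
import torch.nn.functional as F

# Example inputs with shape (batch, heads, seq_len, head_dim)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
query, key, value = (
    torch.randn(2, 8, 1024, 64, device=device, dtype=dtype) for _ in range(3)
)

# PyTorch 2.0+ dispatches to FlashAttention automatically
# when running on compatible hardware with FP16/BF16
output = F.scaled_dot_product_attention(
    query,
    key,
    value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True,  # enables causal masking without materializing the mask
)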

The is_causal=True flag is worth highlighting: FlashAttention computes causal masking without ever storing the mask matrix, which is another O(n²) memory saving.

Why This Changes How You Should Think About Attention

If the bottleneck is memory bandwidth rather than compute, then the right optimization levers are different.

Precision matters more than you might think. Going from FP32 to BF16 halves the size of every tensor moving through HBM. That is a 2× reduction in memory traffic, which for a memory-bound operation translates to a speedup of up to roughly 2×, even ignoring the compute benefits. If you're still running attention in FP32 for "stability" reasons, you're probably paying a steep tax for marginal gain.

Sequence packing is a first-class optimization. If you're training on variable-length sequences and padding to the max length, your attention is doing real work on padding tokens — and all that work is memory traffic. Packing multiple short sequences into a single attention block (with masking to prevent cross-contamination) can cut memory traffic substantially for real-world datasets with skewed length distributions.
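A rough way to see the payoff is to count attention-matrix entries with and without packing. The sketch below is illustrative: the helper names are hypothetical, the packing strategy is a simple greedy first-fit rather than anything a particular library prescribes, and the savings assume a kernel that actually skips cross-sequence blocks instead of computing and then masking them.

```python
def padded_attention_entries(lengths):
    """Attention-matrix entries when every sequence is padded to the longest one."""
    max_len = max(lengths)
    return len(lengths) * max_len * max_len


def packed_attention_entries(lengths):
    """Entries needed when each sequence attends only to itself (block-diagonal)."""
    return sum(n * n for n in lengths)


def pack_first_fit(lengths, capacity):
    """Greedy first-fit-decreasing packing of sequences into rows of `capacity` tokens."""
    rows = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= capacity:
                row.append(n)
                break
        else:
            rows.append([n])   # no existing row had room
    return rows


def same_sequence_mask(row):
    """Block-diagonal mask for one packed row: True where query and key share a sequence."""
    ids = [seq_id for seq_id, n in enumerate(row) for _ in range(n)]
    return [[qi == kj for kj in ids] for qi in ids]


lengths = [512, 64, 128, 32, 256, 32]
print(padded_attention_entries(lengths))      # 1572864
print(packed_attention_entries(lengths))      # 350208
print(pack_first_fit(lengths, capacity=512))  # [[512], [256, 128, 64, 32, 32]]
```

For this made-up length distribution, padding computes about 4.5× more attention entries than the sequences need, and six padded rows collapse into two packed ones.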

Custom attention patterns don't come cheap. If you're implementing a custom attention variant — say, cross-attention with a non-standard mask, or a retrieval-augmented attention pattern — and you're materializing intermediate tensors at n × n, you're back to the memory-bound baseline. This is why libraries like xformers exist: they provide memory-efficient implementations of common custom patterns. Reach for them before rolling your own.

The context length wars are partly a memory engineering problem. When you see a model announce "1M context," the engineering question to ask is: how are they handling the attention memory footprint? Some use sparse attention (Longformer-style sliding windows), some replace attention with linear-time alternatives (Mamba's SSM approach sidesteps attention entirely), and some are betting that hardware evolution and FlashAttention-class optimizations will make quadratic tractable at longer sequences. The answer tells you a lot about what tradeoffs they've made.

The Counterpoint: Approximations Still Have Their Place

FlashAttention is exact: it produces the same output as naive attention. But it doesn't change the asymptotic compute cost, which is still O(n²). For very long sequences (say, 100k+ tokens with large d_model), you eventually hit a wall that no amount of I/O optimization can move.

Linear attention methods — which approximate the softmax kernel and restructure computation to be O(n) — are genuinely interesting for this regime. The challenge is that the quality gap between exact and approximate attention isn't zero, and it's unevenly distributed across tasks. Long-range dependencies — the exact case where you'd most want linear attention — also tend to be where the approximation quality degrades most.
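The restructuring is easiest to see in code. The sketch below uses the non-causal kernelized form with an elu(x) + 1 feature map, which is one published choice among several and is picked here purely as an assumption for illustration. Because the similarity is a plain dot product of feature maps, the sum over keys can be computed once and reused for every query, so the cost is linear in n. Note that this replaces the softmax kernel rather than reproducing it, which is where the quality gap comes from.

```python
import math
import random


def phi(x):
    """Feature map elu(x) + 1: positive everywhere, so weights stay positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """O(n) form: accumulate phi(K)^T V and sum(phi(K)) once, reuse per query."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]   # sum_j phi(k_j) v_j^T, shape d x dv
    z = [0.0] * d                         # sum_j phi(k_j), shape d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    """Same kernel, computed the O(n^2) way by forming every query-key weight."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out


random.seed(0)
n, d, dv = 6, 4, 3
Q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
K = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
V = [[random.uniform(-1, 1) for _ in range(dv)] for _ in range(n)]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
          for ra, rb in zip(fast, slow) for a, b in zip(ra, rb)))
```

The two functions agree with each other, but neither matches softmax attention; the speedup comes from changing the similarity function, not from a cheaper way to compute the original one.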

The honest state of affairs: for most production workloads at most sequence lengths, exact attention with FlashAttention is the right call. For research into very long context or on-device inference, linear attention and related architectures (Mamba, RWKV, RetNet) are worth understanding.

The Takeaway

Attention is memory-bound: The bottleneck in naive attention is HBM read/write traffic, not floating-point operations. O(n²) entries get materialized and shuffled multiple times.
FlashAttention is exact, not approximate: It achieves 2–4× speedups by computing attention in SRAM tiles, eliminating HBM round-trips for intermediate results — with identical numerical output.
Use scaled_dot_product_attention in PyTorch 2.0+: It dispatches to FlashAttention automatically for standard use cases. There's almost no reason to implement attention by hand.
Precision and sequence packing are high-leverage: For memory-bound ops, halving tensor size halves memory traffic. Sequence packing eliminates real work on padding tokens.
Know when approximations are justified: FlashAttention removes the memory-traffic bottleneck but not the O(n²) compute, so its benefit eventually runs out at extreme context lengths. For 100k+ token regimes, linear attention architectures deserve a serious look, but benchmark carefully, because quality tradeoffs are real.

Tags: transformers, attention, flashattention, performance, cuda
