From a Spiral to a Sink: How Positional Encoding Shapes Attention Sinks in Diffusion Transformers

Under Review
1 Zhejiang University    2 University of Toronto

Attention maps at the selected denoising step. Bright vertical columns are attention sinks.

Abstract

We study the attention sink phenomenon, in which a few tokens monopolize attention regardless of content, in Diffusion Transformers (DiTs). Attention sinks have been widely studied in large language models but remain underexplored in DiTs, whose architectural differences prevent existing explanations from transferring directly.

01 Universal: Sinks appear across six DiT architectures
Sinks form early in denoising, persist throughout generation, and concentrate in the deeper half of the network.
02 Mechanism: RoPE and AbsPE choose different empty anchors
In RoPE-based models, sink tokens concentrate Key energy in low-frequency channels that do not decay with distance (Frequency-Aware Concentration). In AbsPE models, learned embeddings at corner positions directly create sinks.
03 Practical fix: Comma padding redistributes wasted attention
In FLUX.1, padding tokens become sinks and absorb over half of total attention. Replacing padding with commas redistributes the budget and improves compositional generation without retraining.

Models Studied

Six DiT architectures spanning three PE types and two attention designs.

Model        PE Type    Sink Region          Peak Ratio    Unique Sinks
PixArt-α     AbsPE      Spatial corners      85×           20
FLUX.1       2D RoPE    Text <PAD>           214×          7
Qwen-Image   2D RoPE    Text tokens          240×          7
Z-Image      3D RoPE    Text <PAD>           212×          5
Wan2.1       3D RoPE    1st-frame patches    477×          48
LTX Video    3D RoPE    1st-frame patches    65×           91

Peak = max sink-to-random μk ratio across all layers. Unique = number of distinct top-1 sink tokens across ≥100 prompts (low = deterministic, high = stochastic).
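The two table metrics can be sketched as follows. The page does not give a formula for μk, so this sketch assumes μk is the attention a key receives, averaged over queries of a row-normalized attention map. The "random" baseline is taken to be the mean μk of all other keys, which is the expected value for a uniformly drawn non-sink key. The function names are illustrative, not the authors' code.

```python
def key_importance(attn):
    """Mean attention each key receives, averaged over queries.

    attn: list of rows, one per query; each row is a softmax
    distribution over keys (assumed definition of mu_k).
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def sink_to_random_ratio(mu):
    """Return (top-1 key index, its mu_k divided by the mean mu_k of the other keys)."""
    top = max(range(len(mu)), key=mu.__getitem__)
    others = [m for i, m in enumerate(mu) if i != top]
    return top, mu[top] / (sum(others) / len(others))


def unique_sinks(top_indices):
    """Number of distinct top-1 sink tokens across prompts."""
    return len(set(top_indices))


# Toy attention map: 3 queries, 4 keys; key 0 absorbs most of the attention.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]
mu = key_importance(attn)
top, ratio = sink_to_random_ratio(mu)
print(top, round(ratio, 2))
```

Under these assumptions, "Peak" would be the maximum of `ratio` over layers, and "Unique" would be `unique_sinks` applied to the `top` index collected for each prompt.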

Sink Dynamics

Key Importance Ratio

Sink-to-random μk ratio across network depth for representative DiTs.

Key Findings

When and where do sinks emerge?

We track sink strength along two axes: denoising progress and network depth. All three representative models begin developing sinks within the first few steps: FLUX.1 shows high Key Importance from step 1, PixArt-α locks onto a fixed sink around step 6, and Wan2.1's sink strength grows exponentially over the first 20 steps.

Across layers, a sharp transition occurs in the second half of the network. Shallow layers show uniform attention, whereas deep layers concentrate attention onto a few tokens, with peak sink-to-random ratios of 85× (PixArt-α), 214× (FLUX.1), and 477× (Wan2.1).

Read: (left) sinks appear early in denoising; (right) the same sink strength is concentrated in deep layers.
When and where sinks emerge — Key Importance over denoising steps and layer depth
Figure 2. (a) Top-1 Key Importance at each denoising step: all models develop sinks early. Shaded bands = ±1 std across 100 prompts. (b) Sink-to-random ratio at each layer: shallow layers show no sink; deep layers reach up to 477× concentration.
Sinks are intrinsic to model weights, not driven by any particular input. Once formed, they persist regardless of prompt or noise seed.

Which tokens become sinks?

Sink identity is determined by the positional encoding type. In AbsPE models (PixArt-α), sinks are corner tokens whose learned embedding has extreme coordinate values. In 2D RoPE models (FLUX.1, Qwen-Image), sinks land on text tokens. Z-Image uses 3D (3-axis) RoPE and also selects text <PAD> tokens, while the 3D RoPE video models (Wan2.1, LTX Video) cluster sinks on first-frame patches.

Across all PE types, sink tokens share two properties: they occupy stable positions (present in every forward pass) and carry minimal content (not directly supervised by the training loss).

Read: The tallest bars identify which token type becomes the sink for each architecture.
Key Importance per token for four DiTs — showing which tokens become sinks
Figure 3. Key Importance μk at the final denoising step for four DiTs. Tall bars = tokens receiving 100–200× more attention. PixArt: corner token. FLUX.1: <PAD> tokens. Wan2.1: first-frame patches. LTX Video: similar first-frame pattern.
The emptier the content, the stronger the sink. DiTs repurpose whichever tokens the positional encoding makes available as implicit registers.

Can massive activations explain sinks?

In LLMs, attention sinks co-occur with massive activations — outlier hidden-state dimensions that appear in the same tokens. We observe the same co-occurrence in DiTs: sink tokens exhibit activation spikes ~50× above average.

However, DiTs typically apply QK-Normalization, which rescales all Key vectors to essentially the same ℓ2-norm (about 15.9 in Figure 4). After normalization, sink and non-sink tokens have matching Key magnitudes, so magnitude cannot explain why sinks dominate attention. The cause must lie in the direction of the Key vectors.
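A minimal sketch of the magnitude argument, assuming QK-Normalization behaves like an RMS normalization without a learned gain (real implementations usually include a per-channel gain, which is omitted here). The key values are made up for illustration.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


# Hypothetical pre-norm keys: the "sink" key is about 50x larger in magnitude.
normal_key = [0.5, -1.0, 0.25, 2.0]
sink_key = [50.0 * x for x in [1.0, 0.5, -0.5, 0.25]]

print(l2_norm(normal_key), l2_norm(sink_key))                      # very different
print(l2_norm(rms_norm(normal_key)), l2_norm(rms_norm(sink_key)))  # both close to sqrt(4)
```

After normalization both keys have a norm of about sqrt(d), here 2 for d = 4, so any remaining difference in attention logits has to come from direction.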

Read: Hidden states spike, but normalized Keys have comparable magnitude.
3D line plots showing massive activations in sink tokens vs normal tokens
Figure 4. Layer input (left): sink tokens show 50× activation spikes. Key vectors (right): after QK-Norm, all norms are equal (~15.9).
Read: The activation spike happens first; attention concentration rises later.
Co-occurrence of massive activations and attention sinks across layers
Figure 5. Massive activations (red) peak near depth 0.3; attention sinks (blue) emerge afterward in deeper layers.
QK-Normalization equalizes Key norms — the cause of attention dominance lies in Key direction, not magnitude.

How does RoPE create sinks?

RoPE is the only operation between QK-Normalization and the dot product. We compute the mean ⟨q, k⟩ for a sink key versus a random image key, before and after RoPE:

Pre-RoPE: both keys score similarly (123 vs 134) — no sink would form. Post-RoPE: the sink key retains its score (112) while the random key drops to −62. RoPE does not boost the sink; it suppresses non-sink keys by rotating their channels in ways that reduce dot products with distant queries.

Read: Before RoPE, sink and random keys score similarly; after RoPE, the random key collapses.
DiT attention block diagram showing pre/post RoPE dot products
Figure 6. Left: signal flow in a DiT attention block. Right: mean QK dot product for a sink key vs. a random key, before and after RoPE (FLUX.1, layer 28, step 27). Pre-RoPE scores are comparable; post-RoPE, only the sink survives.
RoPE does not boost the sink — it suppresses non-sink keys. Sink keys survive because their energy distribution is immune to the rotation.

What is Frequency-Aware Concentration?

RoPE assigns each channel pair a rotation frequency θc. High-frequency channels rotate rapidly, so their contributions cancel out for distant tokens. Low-frequency channels barely rotate: cos(θc·Δ) ≈ 1 for a token offset Δ at essentially any distance in the sequence, which makes them a "Safe Harbor".
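The cancellation argument can be written out per channel pair. This is a sketch using the standard complex-pair formulation of RoPE; the symbols q_c, k_c, m, n, Δ, and φ_c are notation introduced here for illustration and do not appear on the page.

```latex
% Channel pair c of the query and key, written as complex numbers q_c and k_c.
% RoPE at positions m and n multiplies them by e^{i m \theta_c} and e^{i n \theta_c}.
\begin{aligned}
\langle \mathrm{RoPE}(q, m),\, \mathrm{RoPE}(k, n) \rangle
  &= \sum_c \operatorname{Re}\!\left[ q_c\, \overline{k_c}\; e^{i (m - n)\theta_c} \right] \\
  &= \sum_c |q_c|\,|k_c| \cos\!\left(\theta_c \Delta + \varphi_c\right),
  \qquad \Delta = m - n,\quad \varphi_c = \arg q_c - \arg k_c .
\end{aligned}
```

When θc·Δ is small, the cosine stays close to cos(φc) and the channel's contribution is nearly independent of the offset. When θc is large, the cosine oscillates with Δ and its contribution averages toward zero over many offsets.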

Sink keys concentrate the majority of their energy in these low-frequency channels, while normal keys spread energy uniformly and lose it to high-frequency cancellation. We call this Frequency-Aware Concentration (FAC). It deepens with layer depth, reaching a >200× attention gap in the deepest layers. On the Value side, sink tokens carry near-zero magnitudes: they absorb attention without adding signal.
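A toy illustration of this effect, under simplifying assumptions that are not from the page: 1D RoPE instead of 2D/3D, the common base of 10000, 32 channel pairs, and queries and keys with aligned phases, so that channel pair c contributes energy × cos(θc·Δ). It is not meant to reproduce the reported numbers.

```python
import math


def rope_frequencies(n_pairs, base=10000.0):
    """Common RoPE schedule: theta_c = base^(-c / n_pairs), from high to low frequency."""
    return [base ** (-c / n_pairs) for c in range(n_pairs)]


def mean_score(energy, thetas, offsets):
    """Average post-RoPE dot product over token offsets.

    energy[c] stands for |q_c||k_c| in channel pair c; with aligned phases,
    pair c contributes energy[c] * cos(theta_c * offset).
    """
    total = 0.0
    for d in offsets:
        total += sum(e * math.cos(t * d) for e, t in zip(energy, thetas))
    return total / len(offsets)


n_pairs = 32
thetas = rope_frequencies(n_pairs)
offsets = range(1, 513)

# Both keys carry the same total energy (1.0); only its distribution differs.
uniform_energy = [1.0 / n_pairs] * n_pairs              # "normal" key
low_freq_energy = [0.0] * (n_pairs - 4) + [0.25] * 4    # "sink" key: 4 lowest frequencies

normal_score = mean_score(uniform_energy, thetas, offsets)
sink_score = mean_score(low_freq_energy, thetas, offsets)
print(round(normal_score, 3), round(sink_score, 3))
```

In this toy setup the low-frequency key keeps almost all of its score across offsets, while the uniform key loses the share of its energy that sits in fast-rotating channels.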

Read: Low-frequency channels stay aligned across distance; sink keys place their energy there.
FAC mechanism: cosine heatmap and per-channel Key energy
Figure 7. (a) Per-channel cos(θc·Δ): high-freq channels (top) decay rapidly; low-freq "Safe Harbor" channels (bottom) stay near 1. (b) Per-channel Key energy: sink keys (pink) concentrate in low-freq; normal keys (gray) spread uniformly.
Read: (top row) Key energy moves into low-frequency channels; (bottom row) sink Values carry almost no signal.
2x3 heatmap showing FAC deepening across layers and near-zero sink Values
Figure 8. FAC deepens with layer depth. Top row: Sink Key energy shifts to low-freq channels; attention gap grows to >200×. Bottom row: Sink Value channels go to near-zero — sinks absorb attention without adding signal ("no-op" targets).
Sink keys survive by placing energy in "Safe Harbor" channels where rotation barely changes. They win by surviving, not by dominating.

How does PE type determine sink selection?

FAC explains how a token wins attention, but which tokens acquire this advantage depends on the PE type:

RoPE models: semantically empty tokens (<PAD>, <EOS>) become sinks. In FLUX.1, <EOS> receives 1165× the uniform attention level, <PAD> 117×, and commas only 29×.

AbsPE models: the learned positional embedding itself controls sink identity. We verify this causally: zeroing the PE at the corner position and copying the original corner PE to 20 positions arranged in a smiley-face pattern redirects attention to those positions (8.6× higher μk).
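The AbsPE intervention can be sketched on a plain table of positional embeddings. This assumes the original corner vector is saved before the corner slot is zeroed; `relocate_corner_pe` and the toy table are hypothetical and not PixArt-α's actual code or layout.

```python
def relocate_corner_pe(pos_embed, corner_index, target_indices):
    """Return an edited copy: corner PE zeroed, original corner PE copied to targets.

    pos_embed: list of per-position embedding vectors (lists of floats).
    The input table is left unmodified.
    """
    corner_vec = list(pos_embed[corner_index])      # save before zeroing
    edited = [list(vec) for vec in pos_embed]
    edited[corner_index] = [0.0] * len(corner_vec)
    for idx in target_indices:
        edited[idx] = list(corner_vec)
    return edited


# Toy table: 6 positions, 2 channels; position 0 plays the role of the corner.
toy_pe = [[1.0 + i, 0.5 * i] for i in range(6)]
toy_edited = relocate_corner_pe(toy_pe, corner_index=0, target_indices=[2, 4])
print(toy_edited)
```

In the experiment described above, `target_indices` would hold the 20 positions that form the smiley-face pattern.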

Read: Empty text tokens attract more attention than image or comma tokens.
FLUX.1 attention by token type: EOS > PAD > comma > image
Figure 9a. FLUX.1 (RoPE): semantically empty tokens attract far more attention. Same positions, same seed — only token content changes.
Read: Copying one positional embedding relocates the sink pattern.
PixArt PE manipulation — smiley face pattern in attention
Figure 9b. PixArt (AbsPE): copying the corner PE to a smiley-face pattern redirects attention. The positional embedding alone controls which tokens become sinks.
In RoPE models, content determines sink strength. In AbsPE models, position determines it. Both types share the same principle: stable, empty positions become sinks.

Application: Redistributing Attention via Comma Padding

FLUX.1 uses a T5 encoder that packs every prompt into a fixed 512-token sequence. A typical short prompt fills ~17 slots; the remaining ~494 are <PAD> tokens. These empty tokens satisfy the FAC condition and become strong sinks: <PAD> tokens collectively absorb 55% of the total attention budget, while the 4,096 image tokens (89% of the sequence) receive only 22%.
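The sequence shares quoted above can be checked with a few lines of arithmetic. All inputs come from the paragraph; the per-share ratios at the end are a derived illustration, and they treat every non-prompt text slot as empty.

```python
text_len = 512        # fixed T5 sequence length
image_tokens = 4096   # image tokens
prompt_tokens = 17    # approximate length of a short prompt

total = text_len + image_tokens
image_share_of_sequence = image_tokens / total
empty_share_of_sequence = (text_len - prompt_tokens) / total

pad_attention = 0.55    # reported share of attention absorbed by <PAD>
image_attention = 0.22  # reported share of attention reaching image tokens

# Attention received relative to what a uniform split over the sequence would give.
empty_over_uniform = pad_attention / empty_share_of_sequence
image_over_uniform = image_attention / image_share_of_sequence

print(round(image_share_of_sequence * 100))   # share of the sequence that is image tokens, in %
print(round(empty_over_uniform, 1), round(image_over_uniform, 2))
```

On average, an empty text slot receives several times its uniform share of attention, while an image token receives about a quarter of its uniform share.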

Since comma tokens attract only 29× uniform attention versus 117× for <PAD>, simply appending 200 commas to the prompt displaces most padding tokens and redistributes attention toward image tokens. No retraining or architectural change is needed.
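A minimal sketch of the prompt edit, assuming the commas are appended as plain text separated by single spaces. The page does not state the exact suffix format or how the tokenizer splits it, so `add_comma_padding` and its `separator` parameter are illustrative.

```python
def add_comma_padding(prompt, n_commas=200, separator=" "):
    """Append n_commas commas to a prompt so they occupy slots that would be <PAD>."""
    if n_commas <= 0:
        return prompt
    suffix = separator.join([","] * n_commas)
    return prompt.rstrip() + separator + suffix


padded = add_comma_padding("a red cube on top of a blue sphere")
print(padded[:50])
print(padded.count(","))
```

The padded string is then passed as the prompt in place of the original. It is worth confirming with the actual tokenizer that each comma becomes its own token and that the padded prompt still fits within the 512-token limit.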

Read: Most of the top attention receivers are empty padding tokens; comma padding replaces that wasted budget.
Attention budget of FLUX.1 — top-32 tokens ranked by Key Importance
Figure 10. Attention budget of FLUX.1: Top-32 tokens ranked by mean μk (N=100 prompts). <EOS> dominates; 22 of 32 are <PAD>. Empty tokens absorb 55% of total attention; image tokens (89% of sequence) get only 22%.

Quantitative Results

Evaluated on FLUX.1-dev (12 steps, 1024², guidance 3.5). Comma padding improves compositional categories while maintaining general quality (TIIF overall: 66.8% → 67.7%).

Benchmark        Category               Baseline   +Comma   Diff
Concept Mixing   Conflicting concepts   3.65       4.55     +0.90
Concept Mixing   Hybrid fusion          4.37       5.71     +1.34
TIIF-Bench       Diff. + texture        43.0%      50.5%    +7.5%
TIIF-Bench       Comparison             55.6%      61.6%    +6.0%
TIIF-Bench       Action + 2D            80.4%      86.0%    +5.6%

Concept Mixing (CM) = GPT-scored alignment (0–10). TIIF-Bench (TIIF) = VQA accuracy; its Diff values are percentage-point differences. Trade-off: aesthetic scores decrease by ~19% on Concept Mixing (comma padding shifts the alignment–aesthetics frontier).

Qualitative Comparison

Same model weights, same seeds; only the prompt suffix differs.

BibTeX

@misc{yang2026spiral2sink,
  title     = {From a Spiral to a Sink: How Positional Encoding
               Shapes Attention Sinks in Diffusion Transformers},
  author    = {Yang, Yuanbo and Shao, Jiahao and Gao, Jun and Liao, Yiyi},
  note      = {Manuscript under review},
  year      = {2026}
}