From a Spiral to a Sink — How Positional Encoding Shapes Attention Sinks in Diffusion Transformers

Abstract

We study the attention sink phenomenon, in which a few tokens monopolize attention regardless of content, in Diffusion Transformers (DiTs). The phenomenon has been widely studied in large language models but remains underexplored in DiTs, whose architectural differences prevent existing explanations from transferring directly.

01 Universal: Sinks appear across six DiT architectures

Sinks form early in denoising, persist throughout generation, and concentrate in the deeper half of the network.

02 Mechanism: RoPE and AbsPE choose different empty anchors

In RoPE-based models, sink tokens concentrate Key energy in low-frequency channels that do not decay with distance (Frequency-Aware Concentration). In AbsPE models, learned embeddings at corner positions directly create sinks.

03 Practical fix: Comma padding redistributes wasted attention

In FLUX.1, padding tokens become sinks and absorb over half of total attention. Replacing padding with commas redistributes the budget and improves compositional generation without retraining.

Model	PE Type	Sink Region	Peak Ratio	Unique Sinks
PixArt-α	`AbsPE`	Spatial corners	85×	20
FLUX.1	`2D RoPE`	Text `<PAD>`	214×	7
Qwen-Image	`2D RoPE`	Text tokens	240×	7
Z-Image	`3D RoPE`	Text `<PAD>`	212×	5
Wan2.1	`3D RoPE`	1st-frame patches	477×	48
LTX Video	`3D RoPE`	1st-frame patches	65×	91

Key Findings

When and where do sinks emerge?

We track sink strength along two axes: denoising progress and network depth. All three representative models (FLUX.1, PixArt-α, Wan2.1) develop sinks within the first few steps: FLUX.1 shows high Key Importance from step 1, PixArt-α locks onto a fixed sink around step 6, and sink strength in Wan2.1 grows exponentially over the first 20 steps.

Across layers, a sharp transition occurs in the second half of the network: attention concentrates onto a few tokens, with the sink-to-random ratio peaking at 85× (PixArt-α), 214× (FLUX.1), and 477× (Wan2.1). Shallow layers show roughly uniform attention.

**Figure 2.** Key Importance over denoising steps and layer depth. (a) Top-1 Key Importance at each denoising step: all models develop sinks early. Shaded bands = ±1 std across 100 prompts. (b) Sink-to-random ratio at each layer: shallow layers show no sink; deep layers reach up to 477× concentration.
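As a rough sketch of the two measurements, the snippet below assumes that Key Importance μ_k is the mean attention a key receives across all queries, and that the sink-to-random ratio compares the top key against the average of the remaining keys. The page does not spell out either definition, so both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def key_importance(attn):
    """Mean attention each key receives; attn is (num_queries, num_keys) with rows summing to 1."""
    return attn.mean(axis=0)

def sink_to_random_ratio(attn):
    """Top-1 Key Importance divided by the average importance of all other keys."""
    mu = key_importance(attn)
    top = int(np.argmax(mu))
    others = np.delete(mu, top)
    return top, mu[top] / others.mean()

# Toy attention map: 8 queries, 8 keys, with key 0 acting as a sink
logits = np.zeros((8, 8))
logits[:, 0] = 4.0
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

sink_index, ratio = sink_to_random_ratio(attn)
print(sink_index, round(float(ratio), 1))
```

In a real model the same reduction would be applied per layer and per denoising step, then averaged over heads and prompts.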

Sinks are intrinsic to model weights, not driven by any particular input. Once formed, they persist regardless of prompt or noise seed.

Which tokens become sinks?

Sink identity is determined by the positional encoding type. In AbsPE models (PixArt-α), sinks are corner tokens whose learned embedding has extreme coordinate values. In 2D RoPE models (FLUX.1, Qwen-Image), sinks land on text tokens. Z-Image, an image model with 3D (3-axis) RoPE, also selects text <PAD> tokens, while the 3D RoPE video models (Wan2.1, LTX Video) cluster sinks on first-frame patches.

Across all PE types, sink tokens share two properties: they occupy stable positions (present in every forward pass) and carry minimal content (not directly supervised by the training loss).

**Figure 3.** Key Importance μ_k at the final denoising step for four DiTs, showing which tokens become sinks. Tall bars = tokens receiving 100–200× more attention. **PixArt:** corner token. **FLUX.1:** <PAD> tokens. **Wan2.1:** first-frame patches. **LTX Video:** similar first-frame pattern.

The emptier the content, the stronger the sink. DiTs repurpose whichever tokens the positional encoding makes available as implicit registers.

Can massive activations explain sinks?

In LLMs, attention sinks co-occur with massive activations — outlier hidden-state dimensions that appear in the same tokens. We observe the same co-occurrence in DiTs: sink tokens exhibit activation spikes ~50× above average.

However, DiTs typically apply QK-Normalization, which projects all Key vectors to the same ℓ₂-norm. After normalization, sink and non-sink tokens have identical Key magnitudes. The cause must lie in the direction of the Key vectors rather than their magnitude.
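A toy check of this argument, assuming QK-Normalization is an RMS normalization over the head dimension with unit gain (real models typically add a learned per-channel scale, omitted here). The head size and the 50× outlier channel are illustrative values, not measurements.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS-normalize along the last axis (unit gain; learned scale omitted)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
head_dim = 128                      # hypothetical head size
normal_key = rng.normal(size=head_dim)
sink_key = rng.normal(size=head_dim)
sink_key[3] = 50.0                  # one massive channel, mimicking an activation spike

keys = np.stack([normal_key, sink_key])
norms_before = np.linalg.norm(keys, axis=-1)            # sink norm is several times larger
norms_after = np.linalg.norm(rms_norm(keys), axis=-1)   # both close to sqrt(head_dim)
print(norms_before, norms_after)
```

After normalization the two keys differ only in direction, so any remaining attention gap has to come from how that direction interacts with the query.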

**Figure 4.** Massive activations in sink tokens vs. normal tokens (3D line plots). Layer input (left): sink tokens show 50× activation spikes. Key vectors (right): after QK-Norm, all norms are equal (~15.9).

**Figure 5.** Co-occurrence of massive activations and attention sinks across layers. Massive activations (red) peak near depth 0.3; attention sinks (blue) emerge afterward in deeper layers.

QK-Normalization equalizes Key norms — the cause of attention dominance lies in Key direction, not magnitude.

How does RoPE create sinks?

RoPE is the only operation between QK-Normalization and the dot product. We compute the mean ⟨q, k⟩ for a sink key versus a random image key, before and after RoPE:

Pre-RoPE: both keys score similarly (123 vs 134) — no sink would form. Post-RoPE: the sink key retains its score (112) while the random key drops to −62. RoPE does not boost the sink; it suppresses non-sink keys by rotating their channels in ways that reduce dot products with distant queries.

**Figure 6.** Left: signal flow in a DiT attention block. Right: mean QK dot product for a sink key vs. a random key, before and after RoPE (FLUX.1, layer 28, step 27). Pre-RoPE scores are comparable; post-RoPE, only the sink survives.

RoPE does not boost the sink — it suppresses non-sink keys. Sink keys survive because their energy distribution is immune to the rotation.

What is Frequency-Aware Concentration?

RoPE assigns each channel pair a rotation frequency θ_c. High-frequency channels rotate rapidly and their contributions cancel out for distant tokens. Low-frequency channels barely rotate: cos(Δθ_c) ≈ 1 regardless of distance — a "Safe Harbor".
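The cancellation argument can be made explicit with the standard RoPE identity. The complex-number notation is an assumption for illustration rather than the source's: each channel pair of the query and key is written as a complex number q_c or k_c, the query sits at position m, the key at position n, and Δ = m − n.

```latex
\langle R_m q,\; R_n k \rangle
  = \sum_{c} \operatorname{Re}\!\left( q_c \,\overline{k_c}\, e^{\,i\,\theta_c \Delta} \right)
  = \sum_{c} |q_c|\,|k_c| \cos\!\left( \theta_c \Delta + \phi_c \right),
\qquad
\phi_c = \arg\!\left( q_c \,\overline{k_c} \right)
```

When θ_c·Δ ≈ 0, a channel keeps its pre-RoPE contribution |q_c||k_c| cos φ_c. When θ_c is large, the cosine sweeps through many cycles as Δ varies, so its average over distant positions tends toward zero. This is the 1D case; multi-axis RoPE typically applies the same rotation per axis on separate channel groups.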

Sink keys concentrate the majority of their energy into these low-frequency channels, while normal keys spread energy uniformly and lose it to high-frequency cancellation. We call this Frequency-Aware Concentration (FAC). It deepens with layer depth, reaching a >200× attention gap in the deepest layers. On the Value side, sink tokens carry near-zero magnitudes: they absorb attention without adding signal.
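The effect can be reproduced in a small 1D simulation. This is a sketch under stated assumptions, not the paper's experiment: the standard RoPE schedule with base 10000, 32 channel pairs, a query with equal energy in every pair, and two unit-norm synthetic keys. All names and sizes are illustrative.

```python
import numpy as np

def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE schedule: pair 0 rotates fastest, the last pair slowest."""
    return base ** (-np.arange(num_pairs) / num_pairs)

def mean_post_rope_score(q, k, thetas, distances):
    """Average post-RoPE dot product over query-key offsets; pairs are stored as complex numbers."""
    phases = np.exp(1j * np.outer(distances, thetas))   # (num_distances, num_pairs)
    return np.real(phases * (q * np.conj(k))).sum(axis=1).mean()

num_pairs = 32
thetas = rope_frequencies(num_pairs)
distances = np.arange(64, 1024)                 # "distant" query-key offsets

q = np.ones(num_pairs, dtype=complex)           # query with equal energy in every pair
normal_key = np.ones(num_pairs, dtype=complex) / np.sqrt(num_pairs)  # uniform energy
sink_key = np.zeros(num_pairs, dtype=complex)
sink_key[-4:] = 0.5                             # all energy in the 4 slowest pairs

retained = {}
for name, k in [("normal", normal_key), ("sink", sink_key)]:
    pre = np.real(q * np.conj(k)).sum()         # pre-RoPE dot product
    post = mean_post_rope_score(q, k, thetas, distances)
    retained[name] = post / pre                 # fraction of the score that survives rotation
    print(f"{name}: pre={pre:.2f} post={post:.2f} retained={retained[name]:.2f}")
```

In this toy setting the concentrated key keeps almost all of its pre-RoPE score, while the uniform key keeps only a fraction. The rotation penalizes the uniform key rather than boosting the concentrated one.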

**Figure 7.** The FAC mechanism. (a) Per-channel cos(θ_c·Δ): high-freq channels (top) decay rapidly; low-freq "Safe Harbor" channels (bottom) stay near 1. (b) Per-channel Key energy: sink keys (pink) concentrate in low-freq; normal keys (gray) spread uniformly.

**Figure 8.** FAC deepens with layer depth (2×3 heatmap grid). Top row: Sink Key energy shifts to low-freq channels; attention gap grows to >200×. Bottom row: Sink Value channels go to near-zero, so sinks absorb attention without adding signal ("no-op" targets).

Sink keys survive by placing energy in "Safe Harbor" channels where rotation barely changes. They win by surviving, not by dominating.

How does PE type determine sink selection?

FAC explains how a token wins attention, but which tokens acquire this advantage depends on the PE type:

RoPE models: semantically empty tokens (<PAD>, <EOS>) that carry no content become sinks. In FLUX.1, <EOS> receives 1165× the uniform-attention baseline, <PAD> 117×, and commas only 29×.

AbsPE models: the learned positional embedding itself controls sink identity. We verify this causally: zeroing the PE at the corner and copying the original corner PE to 20 positions arranged in a smiley-face pattern redirects attention to those positions (8.6× higher μ_k).
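The AbsPE intervention can be sketched as a simple edit of the positional-embedding table. The reading here (copy the corner embedding to the target positions, then zero the corner) is an assumption about the procedure, and the grid size, table contents, and target indices are all hypothetical stand-ins.

```python
import numpy as np

def transplant_corner_pe(pos_embed, corner_index, target_indices):
    """Copy the corner embedding to the target positions, then zero the corner.

    pos_embed has shape (num_tokens, dim); a modified copy is returned.
    """
    edited = pos_embed.copy()
    corner = pos_embed[corner_index].copy()
    edited[target_indices] = corner
    edited[corner_index] = 0.0
    return edited

# Hypothetical 32x32 token grid with a random stand-in for the embedding table
grid, dim = 32, 16
rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(grid * grid, dim))

corner_index = 0                                   # top-left corner (assumed sink position)
targets = np.array([5 * grid + 10, 5 * grid + 21, 20 * grid + 16])  # illustrative pattern cells
edited = transplant_corner_pe(pos_embed, corner_index, targets)
```

Running the model with `edited` in place of the original table and recomputing μ_k at the target positions would be the corresponding test of whether attention follows the embedding.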

**Figure 9a.** FLUX.1 (RoPE): attention by token type, ordered <EOS> > <PAD> > comma > image. Semantically empty tokens attract far more attention. Same positions, same seed; only token content changes.

**Figure 9b.** PixArt (AbsPE): copying the corner PE to a smiley-face pattern redirects attention. The positional embedding alone controls which tokens become sinks.

In RoPE models, content determines sink strength. In AbsPE models, position determines it. Both types share the same principle: stable, empty positions become sinks.

Application: Redistributing Attention via Comma Padding

FLUX.1 uses a T5 encoder that packs every prompt into a fixed 512-token sequence. A typical short prompt fills ~17 slots; the remaining ~494 are <PAD> tokens. These empty tokens satisfy the FAC condition and become strong sinks: <PAD> tokens collectively absorb 55% of the total attention budget, while the 4,096 image tokens (89% of the sequence) receive only 22%.
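The imbalance is easier to read as a ratio of attention share to sequence share. The arithmetic below assumes the joint sequence is exactly the 512 text slots plus the 4,096 image tokens, which is consistent with the stated 89%; the helper name is hypothetical.

```python
text_tokens = 512        # fixed T5 sequence length (from the text)
image_tokens = 4096      # image tokens (from the text)
pad_tokens = 494         # approximate <PAD> count for a short prompt (from the text)
total = text_tokens + image_tokens   # assumes no other tokens in the joint sequence

def relative_to_uniform(attention_share, token_count):
    """Attention share of a token group divided by its share of the sequence."""
    return attention_share / (token_count / total)

image_seq_share = image_tokens / total                  # about 0.889
pad_ratio = relative_to_uniform(0.55, pad_tokens)       # about 5.1
image_ratio = relative_to_uniform(0.22, image_tokens)   # about 0.25
print(round(image_seq_share, 3), round(pad_ratio, 2), round(image_ratio, 2))
```

Under these assumptions an average <PAD> token receives roughly five times its uniform share, while an average image token receives about a quarter of its share. These are group averages over the whole budget, a different quantity from the per-token multiples such as 117× reported for individual token types.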

Since comma tokens attract only 29× the uniform-attention baseline versus 117× for <PAD>, simply appending 200 commas to the prompt displaces most padding tokens and redistributes attention toward image tokens. No retraining or architectural change is needed.
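A minimal sketch of the prompt edit, assuming the commas are appended as a space-separated run. The exact string format is not given here, and how a tokenizer splits a run of commas depends on the tokenizer, so the `separator` parameter and the function name are hypothetical.

```python
def comma_pad(prompt, num_commas=200, separator=" "):
    """Append `num_commas` commas so they occupy slots that would otherwise hold <PAD>."""
    if num_commas <= 0:
        return prompt
    padding = separator.join([","] * num_commas)
    return f"{prompt}{separator}{padding}"

padded = comma_pad("a red cube on top of a blue sphere")
print(padded[:60], "...", padded.count(","))
```

The padded string is then passed to the text encoder in place of the original prompt. It is worth checking the tokenized length to confirm that the commas really fill the intended number of slots.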

**Figure 10.** Attention budget of FLUX.1: top-32 tokens ranked by mean μ_k (N=100 prompts). <EOS> dominates; 22 of 32 are **<PAD>**. Empty tokens absorb 55% of total attention; image tokens (89% of sequence) get only 22%.

Quantitative Results

All results use FLUX.1-dev (12 steps, 1024×1024 resolution, guidance 3.5). Comma padding improves compositional categories while maintaining general quality (TIIF overall: 66.8% → 67.7%).

Benchmark	Category	Baseline	+Comma	Diff
Concept Mixing	Conflicting concepts	3.65	4.55	+0.90
Concept Mixing	Hybrid fusion	4.37	5.71	+1.34
TIIF-Bench	Diff. + texture	43.0%	50.5%	+7.5 pts
TIIF-Bench	Comparison	55.6%	61.6%	+6.0 pts
TIIF-Bench	Action + 2D	80.4%	86.0%	+5.6 pts

Concept Mixing (CM) = GPT-scored alignment (0–10). TIIF-Bench = VQA accuracy. Trade-off: aesthetic scores decrease by ~19% on Concept Mixing, as comma padding shifts the alignment–aesthetics frontier.