FP8 ViT Encoder Attention

For visual understanding workloads with large images (e.g. QHD, 4K) and relatively short text prompts/generation, the ViT encoder attention can become a significant bottleneck, especially when the text model is quantized (e.g. NVFP4). vLLM supports optional FP8 quantization for the ViT encoder attention via the FlashInfer cuDNN backend. Q/K/V are quantized on-the-fly to FP8 before the cuDNN attention call.

Note

  • Currently supports Qwen3-VL family models only (qwen3_vl, qwen3_vl_moe, qwen3_5, qwen3_5_moe, and other models using Qwen3 ViT).
  • Dynamic scaling is not compatible with ViT full CUDA graphs.
  • Performance gains are mostly visible at QHD/4K resolutions or multi-image requests. Smaller images may see no speedup due to quantization overhead (3 quantization kernel launches + un-padding).
  • FP8 tensor-core speedup is more pronounced on GB300 than GB200.

Requirements

  • FlashInfer cuDNN backend with cuDNN >= 9.17.1.

Usage

Enable FP8 ViT attention by passing --mm-encoder-attn-dtype fp8 together with --mm-encoder-attn-backend FLASHINFER:

vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8

By default (no scale file), dynamic scaling is used: a 16-entry circular buffer of observed Q/K/V amax values drives per-forward scale updates. This matches BF16 accuracy without any calibration but adds a small per-forward overhead.
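
For intuition, the following is a minimal Python sketch of amax-history-based dynamic scaling (class and helper names are hypothetical, not vLLM internals): each forward pass records the tensor's amax in a fixed-size circular buffer, and the FP8 scale is derived from the maximum over that buffer.

import torch

FP8_E4M3_MAX = 448.0   # largest magnitude representable in float8_e4m3fn
HISTORY_LEN = 16       # circular buffer length, as described above

class DynamicFP8Scaler:
    # Hypothetical per-tensor helper, for illustration only.

    def __init__(self):
        self.amax_history = [0.0] * HISTORY_LEN
        self.idx = 0

    def update_and_quantize(self, x: torch.Tensor):
        # Record this forward pass's amax in the circular buffer.
        self.amax_history[self.idx] = x.detach().abs().max().item()
        self.idx = (self.idx + 1) % HISTORY_LEN

        # Derive the scale from the largest amax observed in the window.
        amax = max(max(self.amax_history), 1e-12)
        scale = amax / FP8_E4M3_MAX

        # Quantize; the scale is also handed to the attention kernel as a
        # descale factor so the output stays in the original range.
        x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return x_fp8, scale

# One scaler per tensor (Q, K, V) per attention layer.
q_scaler, k_scaler, v_scaler = (DynamicFP8Scaler() for _ in range(3))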

For production, calibrate static scales on a representative dataset once and reuse them to avoid the dynamic overhead:

# Step 1: calibrate and save scales (runs dynamic scaling for 16 passes,
# then dumps the learned scales to JSON).
vllm bench mm-processor \
    --model $MODEL --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-save-path /path/to/scales.json \
    --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 100

# Step 2: serve with static scales (no dynamic overhead).
vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-path /path/to/scales.json

Saved scales are multiplied by --mm-encoder-fp8-scale-save-margin (default 1.5) to leave headroom against activation outliers not present in the calibration set. The default has been validated to generalize across datasets (e.g. VisionArena-Chat calibration maintains BF16 accuracy on ChartQA).
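
As a rough illustration of the save step (the values and output file name here are hypothetical; the actual dump is produced by the benchmark command above), the margin is a plain multiplier applied to every calibrated value before it is written:

import json

margin = 1.5  # value of --mm-encoder-fp8-scale-save-margin

# Hypothetical per-layer values observed during calibration; the key layout
# follows the scale file format shown in the next section.
calibrated = {
    "visual.blocks.0.attn.attn": {"q": 149.3, "k": 132.0, "v": 140.0},
}

# Apply the margin so activations slightly larger than anything seen during
# calibration still fit into the FP8 range.
with_margin = {
    layer: {name: value * margin for name, value in entry.items()}
    for layer, entry in calibrated.items()
}

with open("scales.json", "w") as f:
    json.dump(with_margin, f, indent=4)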

Scale File Format

{
    "visual.blocks.0.attn.attn": {"q": 224.0, "k": 198.0, "v": 210.0},
    "visual.blocks.1.attn.attn": {"q": 218.0, "k": 195.0, "v": 207.0}
}

Keys q_scale / k_scale / v_scale are accepted as aliases.
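
A minimal sketch of parsing such a file while accepting either key spelling (the helper name is hypothetical; actual loading is handled inside vLLM):

import json

def load_attn_scales(path):
    # Accepts both the short keys (q/k/v) and the *_scale aliases.
    with open(path) as f:
        raw = json.load(f)
    scales = {}
    for layer, entry in raw.items():
        scales[layer] = {
            name: float(entry.get(name, entry.get(f"{name}_scale")))
            for name in ("q", "k", "v")
        }
    return scales

scales = load_attn_scales("/path/to/scales.json")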

Performance

Time for the core cuDNN attention kernel, measured with the PyTorch profiler (kernel cudnn_generated_fort_native_sdpa_sm100_flash_fprop, head_dim=128, seq_len=8192):

Hardware    BF16      FP8       Speedup
GB200       350 us    312 us    1.12x
GB300       300 us    211 us    1.42x

End-to-end encoder forward time (Qwen3-VL-30B-A3B-Instruct on GB200, 3 images/request):

Resolution            BF16 median    FP8 median    Speedup
HD (720x1280)         31.77 ms       36.39 ms      0.87x
FullHD (1080x1920)    57.99 ms       58.73 ms      ~same
QHD (1440x2560)       131.83 ms      122.30 ms     1.08x
4K (2160x3840)        543.44 ms      460.31 ms     1.18x

The crossover point is around FullHD at 3 images/request; at QHD and above, FP8 is consistently faster.

Accuracy

ChartQA, Qwen3-VL-8B-Instruct, 500 samples. FP8 static uses scales calibrated on VisionArena-Chat (with default 1.5x margin):

Metric               BF16     FP8 dynamic    FP8 static
relaxed_accuracy     0.780    0.776          0.780
anywhere_accuracy    0.806    0.816          0.814
exact_match          0.584    0.582          0.578

All three configurations match within statistical noise, confirming that static scales calibrated on one dataset generalize to another.