FP8 ViT Encoder Attention

For visual understanding workloads with large images (e.g. QHD, 4K) and relatively short text prompts/generation, the ViT encoder attention can become a significant bottleneck, especially when the text model is quantized (e.g. NVFP4). vLLM supports optional FP8 quantization for the ViT encoder attention via the FlashInfer cuDNN backend. Q/K/V are quantized on-the-fly to FP8 before the cuDNN attention call.

Note

  • Currently supports Qwen3-VL family models only (qwen3_vl, qwen3_vl_moe, qwen3_5, qwen3_5_moe, and other models using Qwen3 ViT).
  • Dynamic scaling is not compatible with ViT full CUDA graphs.
  • Performance gains are mostly visible at QHD/4K resolutions or multi-image requests. Smaller images may see no speedup due to quantization overhead (3 quantization kernel launches + un-padding).
  • FP8 tensor-core speedup is more pronounced on GB300 than GB200.

Requirements

  • FlashInfer cuDNN backend with cuDNN >= 9.17.1.

Usage

Enable FP8 ViT attention by passing --mm-encoder-attn-dtype fp8 together with --mm-encoder-attn-backend FLASHINFER:

vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8

By default (no scale file), dynamic scaling is used: a 16-entry circular buffer of observed Q/K/V amax values drives per-forward scale updates. This matches BF16 accuracy without any calibration but adds a small per-forward overhead.
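
For intuition, the following is a minimal Python sketch of amax-history-based dynamic scaling (class and helper names are hypothetical, not vLLM internals): each forward pass records the tensor's amax in a fixed-size circular buffer, and the FP8 scale is derived from the maximum over that buffer.

import torch

FP8_E4M3_MAX = 448.0   # largest magnitude representable in float8_e4m3fn
HISTORY_LEN = 16       # circular buffer length, as described above

class DynamicFP8Scaler:
    # Hypothetical per-tensor helper, for illustration only.

    def __init__(self):
        self.amax_history = [0.0] * HISTORY_LEN
        self.idx = 0

    def update_and_quantize(self, x: torch.Tensor):
        # Record this forward pass's amax in the circular buffer.
        self.amax_history[self.idx] = x.detach().abs().max().item()
        self.idx = (self.idx + 1) % HISTORY_LEN

        # Derive the scale from the largest amax observed in the window.
        amax = max(max(self.amax_history), 1e-12)
        scale = amax / FP8_E4M3_MAX

        # Quantize; the scale is also handed to the attention kernel as a
        # descale factor so the output stays in the original range.
        x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return x_fp8, scale

# One scaler per tensor (Q, K, V) per attention layer.
q_scaler, k_scaler, v_scaler = (DynamicFP8Scaler() for _ in range(3))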

For production, calibrate static scales on a representative dataset once and reuse them to avoid the dynamic overhead:

# Step 1: calibrate and save scales (runs dynamic scaling for 16 passes,
# then dumps the learned scales to JSON).
vllm bench mm-processor \
    --model $MODEL --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-save-path /path/to/scales.json \
    --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 100

# Step 2: serve with static scales (no dynamic overhead).
vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-path /path/to/scales.json

Saved scales are multiplied by --mm-encoder-fp8-scale-save-margin (default 1.5) to leave headroom against activation outliers not present in the calibration set. The default has been validated to generalize across datasets (e.g. VisionArena-Chat calibration maintains BF16 accuracy on ChartQA).
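
As a rough illustration of the save step (the values and output file name here are hypothetical; the actual dump is produced by the benchmark command above), the margin is a plain multiplier applied to every calibrated value before it is written:

import json

margin = 1.5  # value of --mm-encoder-fp8-scale-save-margin

# Hypothetical per-layer values observed during calibration; the key layout
# follows the scale file format shown in the next section.
calibrated = {
    "visual.blocks.0.attn.attn": {"q": 149.3, "k": 132.0, "v": 140.0},
}

# Apply the margin so activations slightly larger than anything seen during
# calibration still fit into the FP8 range.
with_margin = {
    layer: {name: value * margin for name, value in entry.items()}
    for layer, entry in calibrated.items()
}

with open("scales.json", "w") as f:
    json.dump(with_margin, f, indent=4)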

Scale File Format

{
    "visual.blocks.0.attn.attn": {"q": 224.0, "k": 198.0, "v": 210.0},
    "visual.blocks.1.attn.attn": {"q": 218.0, "k": 195.0, "v": 207.0}
}

Keys q_scale / k_scale / v_scale are accepted as aliases.
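
A minimal sketch of parsing such a file while accepting either key spelling (the helper name is hypothetical; actual loading is handled inside vLLM):

import json

def load_attn_scales(path):
    # Accepts both the short keys (q/k/v) and the *_scale aliases.
    with open(path) as f:
        raw = json.load(f)
    scales = {}
    for layer, entry in raw.items():
        scales[layer] = {
            name: float(entry.get(name, entry.get(f"{name}_scale")))
            for name in ("q", "k", "v")
        }
    return scales

scales = load_attn_scales("/path/to/scales.json")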

Performance

Time for the core cuDNN attention kernel, measured with the PyTorch profiler (kernel cudnn_generated_fort_native_sdpa_sm100_flash_fprop, head_dim=128, seq_len=8192):

Hardware    BF16      FP8       Speedup
GB200       350 us    312 us    1.12x
GB300       300 us    211 us    1.42x

End-to-end encoder forward time (Qwen3-VL-30B-A3B-Instruct on GB200, 3 images/request):

Resolution            BF16 median    FP8 median    Speedup
HD (720x1280)         31.77 ms       36.39 ms      0.87x
FullHD (1080x1920)    57.99 ms       58.73 ms      ~same
QHD (1440x2560)       131.83 ms      122.30 ms     1.08x
4K (2160x3840)        543.44 ms      460.31 ms     1.18x

The crossover point is around FullHD at 3 images/request; at QHD and above, FP8 is consistently faster.

Accuracy

ChartQA, Qwen3-VL-8B-Instruct, 500 samples. FP8 static uses scales calibrated on VisionArena-Chat (with default 1.5x margin):

Metric               BF16     FP8 dynamic    FP8 static
relaxed_accuracy     0.780    0.776          0.780
anywhere_accuracy    0.806    0.816          0.814
exact_match          0.584    0.582          0.578

All three configurations match within statistical noise, confirming that static scales calibrated on one dataset generalize to another.