vllm.kernels.triton.qkv_padded_fp8_quant ¶
Stride-aware FP8 quantization with head_dim padding for ViT attention.
Reads directly from non-contiguous QKV views using 3D strides and pads head_dim to a multiple of 16 for cuDNN compatibility.
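For intuition, the padding rule is plain round-up-to-16 alignment. A minimal sketch of that rule (the `padded_head_dim` helper below is hypothetical, for illustration only, not part of this module):

```python
# Round head_dim up to the next multiple of 16 so the quantized output
# satisfies cuDNN's alignment requirement. Hypothetical helper for
# illustration; the kernel applies the same rule internally.
def padded_head_dim(head_dim: int, align: int = 16) -> int:
    return ((head_dim + align - 1) // align) * align

assert padded_head_dim(72) == 80  # ViT-style head_dim gets padded
assert padded_head_dim(64) == 64  # already aligned, no padding needed
```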
quantize_fp8_maybe_pad_head_dim ¶
```python
quantize_fp8_maybe_pad_head_dim(
    tensor: Tensor,
    scale: Tensor,
    fp8_quant: QuantFP8,
    skip_scale: bool = False,
) -> Tensor
```
Quantize a 3D/4D tensor to FP8, padding head_dim to a multiple of 16 only when needed.
Accepts (S, H, D) or (B, S, H, D) input. Uses fp8_quant (a QuantFP8 CustomOp) when head_dim is already aligned to 16 (no padding needed); otherwise falls back to a stride-aware Triton kernel that pads head_dim to a multiple of 16.
Source code in vllm/kernels/triton/qkv_padded_fp8_quant.py
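A minimal usage sketch of the dispatching wrapper. The `QuantFP8`/`GroupShape` import paths and constructor arguments below are assumptions that may vary across vLLM versions; shapes are illustrative:

```python
import torch
from vllm.kernels.triton.qkv_padded_fp8_quant import (
    quantize_fp8_maybe_pad_head_dim,
)
# Assumption: QuantFP8 and GroupShape live at these paths and QuantFP8
# takes these constructor arguments; check your vLLM version.
from vllm.model_executor.layers.quantization.input_quant_fp8 import QuantFP8
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape

fp8_quant = QuantFP8(static=True, group_shape=GroupShape.PER_TENSOR)

# Interleaved QKV buffer viewed as (S, 3, H, D); each Q/K/V slice is a
# non-contiguous (S, H, D) view into the same storage.
S, H, D = 1024, 16, 72
qkv = torch.randn(S, 3, H, D, device="cuda", dtype=torch.bfloat16)
q = qkv[:, 0]
scale = torch.ones(1, device="cuda", dtype=torch.float32)

# D = 72 is not 16-aligned, so the call falls back to the Triton kernel
# and returns a fresh contiguous FP8 tensor of shape (S, H, 80).
q_fp8 = quantize_fp8_maybe_pad_head_dim(q, scale, fp8_quant)
```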
quantize_fp8_pad_head_dim_triton ¶
```python
quantize_fp8_pad_head_dim_triton(
    tensor: Tensor,
    scale: Tensor,
    skip_scale: bool = False,
    block_m: int | None = None,
    block_n: int | None = None,
    num_warps: int | None = None,
) -> Tensor
```
Quantize a 3D/4D tensor to FP8, padding head_dim to a multiple of 16.
Reads directly from the input using its 3D strides, so non-contiguous views (e.g. Q/K/V slices from an interleaved QKV buffer) are handled without an extra copy. Output is always a fresh contiguous tensor with shape (S, H, padded_D).
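A hedged example of calling the Triton path directly on a non-contiguous view (device, dtype, and shapes are illustrative):

```python
import torch
from vllm.kernels.triton.qkv_padded_fp8_quant import (
    quantize_fp8_pad_head_dim_triton,
)

S, H, D = 2048, 12, 72  # head_dim 72 is padded to 80
qkv = torch.randn(S, 3, H, D, device="cuda", dtype=torch.half)
k = qkv[:, 1]  # non-contiguous (S, H, D) view of the K slice
scale = torch.tensor([0.5], device="cuda", dtype=torch.float32)

# The kernel reads through the view's strides; no .contiguous() copy
# of k is needed before the call.
k_fp8 = quantize_fp8_pad_head_dim_triton(k, scale)
assert k_fp8.shape == (S, H, 80) and k_fp8.is_contiguous()
```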