vllm.model_executor.kernels.linear.mxfp4 ¶
Modules:
| Name | Description |
|---|---|
| base | |
| flashinfer | |
MxFp4LinearKernel ¶
Bases: ABC
Base class for MXFP4 quantized linear kernels.
Each subclass implements a specific GEMM backend (CUTLASS, Marlin, etc.). The kernel selection mechanism iterates over registered subclasses in priority order, calling is_supported and can_implement on each to find the best match for the current hardware.
Source code in vllm/model_executor/kernels/linear/mxfp4/base.py
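The selection loop described above can be sketched as follows. This is a minimal illustration, not vLLM's implementation: ToyConfig, ToyKernel, ToyFastKernel, ToyFallbackKernel, and choose_kernel are hypothetical names, the multiple-of-128 constraint is invented for the example, and is_supported is assumed to return a plain bool because its signature is not shown on this page. Only the tuple[bool, str | None] return shape of can_implement is taken from the documentation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional, Tuple, Type


@dataclass
class ToyConfig:
    # Hypothetical stand-in for MxFp4LinearLayerConfig; the real fields are not shown here.
    in_features: int
    out_features: int


class ToyKernel(ABC):
    # Hypothetical stand-in for MxFp4LinearKernel.
    @classmethod
    @abstractmethod
    def is_supported(cls) -> bool:
        """Assumed signature: can this kernel run on the current platform?"""

    @classmethod
    @abstractmethod
    def can_implement(cls, config: ToyConfig) -> Tuple[bool, Optional[str]]:
        """Return (True, None) or (False, reason), mirroring the documented return type."""


class ToyFastKernel(ToyKernel):
    @classmethod
    def is_supported(cls) -> bool:
        return True

    @classmethod
    def can_implement(cls, config: ToyConfig) -> Tuple[bool, Optional[str]]:
        # Invented constraint, for illustration only.
        if config.in_features % 128 != 0:
            return False, "in_features must be a multiple of 128"
        return True, None


class ToyFallbackKernel(ToyKernel):
    @classmethod
    def is_supported(cls) -> bool:
        return True

    @classmethod
    def can_implement(cls, config: ToyConfig) -> Tuple[bool, Optional[str]]:
        return True, None


def choose_kernel(config: ToyConfig, candidates: List[Type[ToyKernel]]) -> Type[ToyKernel]:
    """Return the first candidate, in priority order, that passes both checks."""
    reasons = []
    for kernel in candidates:
        if not kernel.is_supported():
            reasons.append(f"{kernel.__name__}: not supported on this platform")
            continue
        ok, reason = kernel.can_implement(config)
        if ok:
            return kernel
        reasons.append(f"{kernel.__name__}: {reason}")
    raise ValueError("no usable kernel: " + "; ".join(reasons))
```

Collecting the rejection reasons makes the final error message explain why each candidate was skipped, which is the practical benefit of having can_implement return a reason string alongside the boolean.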
apply_weights abstractmethod ¶
Run the quantized GEMM.
can_implement abstractmethod classmethod ¶
can_implement(
config: MxFp4LinearLayerConfig,
) -> tuple[bool, str | None]
Return whether this kernel can handle config.
is_supported abstractmethod classmethod ¶
Return whether this kernel can run on the current platform.
process_weights_after_loading abstractmethod ¶
process_weights_after_loading(layer: Module) -> None
Transform weights into the format required by this kernel.
Called once after checkpoint weights have been loaded onto the device. Implementations should repack / swizzle / pad weights and scales in-place on layer.
Source code in vllm/model_executor/kernels/linear/mxfp4/base.py
MxFp4LinearLayerConfig dataclass ¶
Configuration for an MXFP4 linear layer.
All MXFP4 layers share the same structure: packed uint8 weights (2 FP4 values per byte) and per-block weight scales (group size 32).
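To make the storage layout concrete, here is a small pure-Python sketch of unpacking and dequantizing one group of weights. It assumes the element and scale encodings from the OCP Microscaling (MX) format definition (E2M1 elements, E8M0 power-of-two scales), which this page does not spell out. The nibble order (low nibble first) and all function names are assumptions for illustration; real kernels may repack the data differently.

```python
# Magnitudes representable by the 3 low bits of an E2M1 code; bit 3 is the sign.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 32  # elements sharing one scale, as stated above


def decode_fp4(code: int) -> float:
    """Decode a 4-bit E2M1 code into a float."""
    magnitude = FP4_E2M1_VALUES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_byte(byte: int) -> tuple:
    """Split one uint8 into two 4-bit codes (assumed order: low nibble first)."""
    return byte & 0xF, byte >> 4


def dequantize_row(packed: bytes, scale_exponents: list) -> list:
    """Dequantize one row: packed holds K/2 bytes, scale_exponents holds K/32 E8M0 codes."""
    out = []
    for i, byte in enumerate(packed):
        group = (2 * i) // GROUP_SIZE  # both nibbles of a byte fall in the same group
        scale = 2.0 ** (scale_exponents[group] - 127)  # E8M0: biased power of two
        for code in unpack_byte(byte):
            out.append(decode_fp4(code) * scale)
    return out


def packed_shapes(out_features: int, in_features: int) -> tuple:
    """Shapes of the packed weight and scale tensors for a logical (N, K) weight."""
    assert in_features % GROUP_SIZE == 0
    return (out_features, in_features // 2), (out_features, in_features // GROUP_SIZE)
```

Under these assumptions, a logical weight of shape (N, K) occupies N x K/2 bytes plus N x K/32 scale bytes, so the scales add one byte for every 16 bytes of packed weights.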