vllm.model_executor.layers.quantization.kernels.scaled_mm
Modules:

Name | Description |
---|---|
`ScaledMMLinearKernel` | |
`aiter` | |
`cutlass` | |
`triton` | |
`xla` | |
_POSSIBLE_KERNELS (module-attribute)
```python
_POSSIBLE_KERNELS: dict[
    PlatformEnum, list[type[ScaledMMLinearKernel]]
] = {
    CPU: [CutlassScaledMMLinearKernel],
    CUDA: [CutlassScaledMMLinearKernel],
    ROCM: [
        AiterScaledMMLinearKernel,
        TritonScaledMMLinearKernel,
    ],
    TPU: [XLAScaledMMLinearKernel],
}
```
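`_POSSIBLE_KERNELS` maps each `PlatformEnum` to an ordered list of candidate kernels for that platform. Below is a minimal sketch of how such a platform-keyed registry is typically consulted, assuming each kernel exposes a `can_implement(config)` classmethod returning a success flag plus a failure reason; the names and signatures here are illustrative stand-ins, not the exact vLLM API:

```python
# Illustrative sketch only: a toy platform-keyed kernel registry and a
# first-compatible-kernel selection loop. The can_implement() signature is
# an assumption for illustration, not a guaranteed part of the vLLM API.
from typing import Optional


class ToyKernel:
    """Stand-in for a ScaledMMLinearKernel subclass."""

    @classmethod
    def can_implement(cls, config: dict) -> tuple[bool, Optional[str]]:
        # Real kernels inspect the config (channelwise scales, static vs.
        # dynamic input quantization, ...) and return a reason string when
        # they cannot support it.
        return True, None


TOY_REGISTRY: dict[str, list[type[ToyKernel]]] = {"cuda": [ToyKernel]}


def pick_kernel(platform: str, config: dict) -> type[ToyKernel]:
    failure_reasons = []
    for kernel_cls in TOY_REGISTRY.get(platform, []):
        ok, reason = kernel_cls.can_implement(config)
        if ok:
            return kernel_cls  # first compatible kernel wins
        failure_reasons.append(f"{kernel_cls.__name__}: {reason}")
    raise ValueError(f"No kernel can implement the config: {failure_reasons}")


print(pick_kernel("cuda", {}).__name__)  # -> ToyKernel
```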
choose_scaled_mm_linear_kernel
```python
choose_scaled_mm_linear_kernel(
    config: ScaledMMLinearLayerConfig,
    compute_capability: Optional[int] = None,
) -> type[ScaledMMLinearKernel]
```
Choose a ScaledMMLinearKernel that can implement the given config for the given compute capability. Attempts to choose the best-performing kernel.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`config` | `ScaledMMLinearLayerConfig` | Description of the linear layer to be implemented. | required |
`compute_capability` | `Optional[int]` | The compute capability of the target device; if None, the capability of the current platform is used. | `None` |
Raises:

Type | Description |
---|---|
`ValueError` | If no kernel can implement the given config. |
Returns:

Type | Description |
---|---|
`type[ScaledMMLinearKernel]` | Chosen kernel. |
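A hedged usage sketch follows. The `ScaledMMLinearLayerConfig` constructor fields shown (`is_channelwise`, `is_static_input_scheme`, `input_symmetric`) and the import of the config class from the `ScaledMMLinearKernel` submodule are assumptions for illustration; verify them against the installed vLLM version:

```python
# Assumed imports and field names; check the actual dataclass definition.
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    choose_scaled_mm_linear_kernel,
)
from vllm.model_executor.layers.quantization.kernels.scaled_mm.ScaledMMLinearKernel import (
    ScaledMMLinearLayerConfig,
)

config = ScaledMMLinearLayerConfig(
    is_channelwise=True,           # per-output-channel weight scales (assumed field)
    is_static_input_scheme=False,  # dynamic activation quantization (assumed field)
    input_symmetric=True,          # symmetric activation quantization (assumed field)
)

# With compute_capability=None the current device's capability is used;
# a ValueError is raised if no registered kernel supports the config.
kernel_cls = choose_scaled_mm_linear_kernel(config)
print(kernel_cls.__name__)
```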