vllm.model_executor.layers.fused_moe.oracle.w4a8_int8
_get_priority_backends
_get_priority_backends(
moe_config: FusedMoEConfig,
) -> list[W4A8Int8MoeBackend]
Get available backends in priority order based on platform and config.
Currently, only the CPU INT4 backend is available for W4A8 INT8 MoE.
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
backend_to_kernel_cls
backend_to_kernel_cls(
backend: W4A8Int8MoeBackend,
) -> list[type[FusedMoEExperts]]
Map a W4A8Int8MoeBackend to its kernel classes.
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
convert_to_w4a8_int8_moe_format
convert_to_w4a8_int8_moe_format(
w13_weight: Tensor,
w2_weight: Tensor,
w13_weight_scale: Tensor,
w2_weight_scale: Tensor,
group_size: int,
w13_bias: Tensor | None = None,
w2_bias: Tensor | None = None,
) -> tuple[
Tensor,
Tensor,
Tensor,
Tensor,
Tensor | None,
Tensor | None,
]
Pack INT4 MoE weights to KleidiAI format.
This function packs the INT4 weights (stored as int8 values) into the format expected by the KleidiAI dynamic_4bit_int_moe kernel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| w13_weight | Tensor | [E, 2*IN, H] int8 tensor (int4 values in [-8, 7]) | required |
| w2_weight | Tensor | [E, H, IN] int8 tensor (int4 values in [-8, 7]) | required |
| w13_weight_scale | Tensor | [E, 2*IN, H/g or 1] scale tensor | required |
| w2_weight_scale | Tensor | [E, H, IN/g or 1] scale tensor | required |
| group_size | int | Quantization group size (-1 for channel-wise) | required |
| w13_bias | Tensor \| None | Optional [E, 2*IN] bias tensor | None |
| w2_bias | Tensor \| None | Optional [E, H] bias tensor | None |
Returns:
| Type | Description |
|---|---|
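| tuple[Tensor, Tensor, Tensor, Tensor, Tensor \| None, Tensor \| None] | Tuple beginning with the packed tensors (w13_packed, w2_packed). The return annotation lists four further elements (two Tensor, two Tensor \| None) that this description does not name. |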
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
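The shape conventions in the parameter table can be sanity-checked before calling the converter. The following is a minimal, framework-free sketch: `expected_moe_shapes` is a hypothetical helper that is not part of vLLM, and it assumes that E is the number of experts, IN the intermediate size, H the hidden size, and g the group size, and that a group-wise `group_size` divides both H and IN.

```python
def expected_moe_shapes(num_experts, intermediate, hidden, group_size):
    """Return the input shapes listed in the parameter table.

    Hypothetical helper for illustration only; not part of vLLM.
    group_size == -1 is treated as channel-wise (a single scale per row).
    """
    if group_size == -1:
        # Channel-wise: the trailing scale dimension collapses to 1.
        w13_groups = 1
        w2_groups = 1
    else:
        if group_size <= 0:
            raise ValueError("group_size must be positive or -1")
        if hidden % group_size or intermediate % group_size:
            raise ValueError("group_size must divide both H and IN")
        w13_groups = hidden // group_size        # H / g
        w2_groups = intermediate // group_size   # IN / g

    return {
        "w13_weight": (num_experts, 2 * intermediate, hidden),
        "w2_weight": (num_experts, hidden, intermediate),
        "w13_weight_scale": (num_experts, 2 * intermediate, w13_groups),
        "w2_weight_scale": (num_experts, hidden, w2_groups),
        "w13_bias": (num_experts, 2 * intermediate),
        "w2_bias": (num_experts, hidden),
    }


shapes = expected_moe_shapes(num_experts=8, intermediate=1024, hidden=512, group_size=128)
```

Comparing these tuples against `tensor.shape` for each argument is one way to catch a transposed or mis-grouped tensor before it reaches the packing step.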
make_w4a8_int8_moe_kernel
make_w4a8_int8_moe_kernel(
moe_quant_config: FusedMoEQuantConfig,
moe_config: FusedMoEConfig,
experts_cls: type[FusedMoEExperts],
routing_tables: tuple[Tensor, Tensor, Tensor] | None = None,
) -> FusedMoEKernel
Create FusedMoEKernel for W4A8 Int8 MoE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| moe_quant_config | FusedMoEQuantConfig | Quantization configuration | required |
| moe_config | FusedMoEConfig | MoE configuration | required |
| experts_cls | type[FusedMoEExperts] | Expert kernel class (should be CPUExpertsInt4) | required |
| routing_tables | tuple[Tensor, Tensor, Tensor] \| None | Optional routing tables for expert parallelism | None |
Returns:
| Type | Description |
|---|---|
| FusedMoEKernel | Configured FusedMoEKernel instance |
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
make_w4a8_int8_moe_quant_config
make_w4a8_int8_moe_quant_config(
block_shape: tuple[int, int] | None = None,
) -> FusedMoEQuantConfig
Create FusedMoEQuantConfig for W4A8 Int8 MoE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| block_shape | tuple[int, int] \| None | Quantization block shape (row, col). For channel-wise: (-1, 1) or None. For group-wise: (1, group_size). | None |
Returns:
| Type | Description |
|---|---|
| FusedMoEQuantConfig | FusedMoEQuantConfig with appropriate settings for W4A8 Int8 |
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
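The `block_shape` convention described above can be derived from the same `group_size` value used elsewhere in this module. The sketch below is an illustration only: `block_shape_for_group_size` is a hypothetical helper, not a vLLM function, and it chooses `None` for the channel-wise case, although the table also lists `(-1, 1)`.

```python
def block_shape_for_group_size(group_size):
    """Translate a quantization group size into a block_shape value.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if group_size == -1:
        # Channel-wise: the parameter table accepts None or (-1, 1).
        return None
    if group_size <= 0:
        raise ValueError("group_size must be positive or -1")
    # Group-wise: one row, group_size columns per block.
    return (1, group_size)


group_wise = block_shape_for_group_size(128)
channel_wise = block_shape_for_group_size(-1)

# With vLLM installed, the result would be passed on as documented above:
# make_w4a8_int8_moe_quant_config(block_shape=group_wise)
```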
map_w4a8_int8_backend
Map the user's MoEBackend to a W4A8Int8MoeBackend.
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
pack_int4_weights_for_kleidi
pack_int4_weights_for_kleidi(
int4_as_int8: Tensor,
scales: Tensor,
bias: Tensor | None,
in_features: int,
out_features: int,
group_size: int,
) -> Tensor
Pack INT4 weights (stored as int8 in [-8,7]) to KleidiAI format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| int4_as_int8 | Tensor | [out, in] int8 tensor with values in [-8, 7] | required |
| scales | Tensor | [out, in//group_size] or [out, 1] for channel-wise | required |
| bias | Tensor \| None | [out] optional bias | required |
| in_features | int | Input dimension | required |
| out_features | int | Output dimension | required |
| group_size | int | Quantization group size (-1 for channel-wise) | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Packed weight tensor in KleidiAI format |
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
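To make the input convention concrete, the sketch below shows how one row of weights could end up as int4 values in [-8, 7] with one scale per group, and how two such values fit into a single byte. It is a generic illustration under stated assumptions: the symmetric max-abs quantization scheme and the low-nibble-first byte layout are assumptions chosen for simplicity, and neither is claimed to match vLLM's quantizer or the actual KleidiAI packed layout. All function names are hypothetical.

```python
def quantize_row_int4(row, group_size):
    """Quantize a list of floats to ints in [-8, 7], one scale per group.

    Hypothetical sketch (symmetric max-abs scaling); group_size == -1
    means channel-wise, i.e. a single scale for the whole row.
    """
    n = len(row)
    g = n if group_size == -1 else group_size
    if g <= 0 or n % g:
        raise ValueError("group_size must divide the row length")
    values, scales = [], []
    for start in range(0, n, g):
        group = row[start:start + g]
        max_abs = max(abs(x) for x in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for x in group:
            # Clamp to the signed 4-bit range after rounding.
            values.append(max(-8, min(7, round(x / scale))))
    return values, scales


def pack_nibbles(values):
    """Pack pairs of int4 values into bytes, low nibble first (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("values must lie in [-8, 7]")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover signed int4 values."""
    def signed(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    values = []
    for byte in packed:
        values.append(signed(byte & 0xF))
        values.append(signed(byte >> 4))
    return values


values, scales = quantize_row_int4([7.0, -7.0, 3.5, 0.0], group_size=2)
packed = pack_nibbles(values)
```

With `group_size=2` the four-element row yields two scales, matching the `[out, in//group_size]` shape in the table for a single output row, and the packed form occupies half as many bytes as the int8 storage.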
select_w4a8_int8_moe_backend
select_w4a8_int8_moe_backend(
config: FusedMoEConfig,
weight_key: QuantKey | None,
activation_key: QuantKey | None,
) -> tuple[W4A8Int8MoeBackend, type[FusedMoEExperts]]
Select the primary W4A8 Int8 MoE backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | FusedMoEConfig | MoE configuration | required |
| weight_key | QuantKey \| None | Weight quantization key (should be one of kInt4W4A8Static*) | required |
| activation_key | QuantKey \| None | Activation quantization key (currently unused for W4A8) | required |
Returns:
| Type | Description |
|---|---|
| tuple[W4A8Int8MoeBackend, type[FusedMoEExperts]] | Tuple of (backend, kernel_class) |
Source code in vllm/model_executor/layers/fused_moe/oracle/w4a8_int8.py
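The functions in this module suggest a common selection pattern: walk the backends in priority order, look up the kernel classes for each, and return the first pair that is usable. The sketch below illustrates that pattern only. Every name in it (`DemoBackend`, `DemoCpuInt4Experts`, `DemoOtherExperts`, `demo_select_backend`, and the `is_supported` check) is hypothetical, the second backend exists purely to show the fall-through, and the actual vLLM selection logic may differ.

```python
from enum import Enum


class DemoBackend(Enum):
    # Hypothetical stand-ins; not the members of W4A8Int8MoeBackend.
    OTHER = "other"
    CPU_INT4 = "cpu_int4"


class DemoCpuInt4Experts:
    @staticmethod
    def is_supported(platform):
        return platform == "cpu"


class DemoOtherExperts:
    @staticmethod
    def is_supported(platform):
        return platform == "other"


# Priority order and backend-to-kernel mapping (both assumed for the demo).
DEMO_PRIORITY = [DemoBackend.OTHER, DemoBackend.CPU_INT4]
DEMO_KERNELS = {
    DemoBackend.OTHER: [DemoOtherExperts],
    DemoBackend.CPU_INT4: [DemoCpuInt4Experts],
}


def demo_select_backend(platform):
    """Return the first (backend, kernel_class) pair usable on `platform`."""
    for backend in DEMO_PRIORITY:
        for kernel_cls in DEMO_KERNELS[backend]:
            if kernel_cls.is_supported(platform):
                return backend, kernel_cls
    raise ValueError(f"no supported backend for platform {platform!r}")


backend, kernel_cls = demo_select_backend("cpu")
```

Returning the kernel class alongside the backend mirrors the documented `(backend, kernel_class)` return value, so a caller can pass the class straight to a kernel factory.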