vllm.model_executor.layers.fused_moe.utils
_fp8_perm
A permutation routine that works on fp8 types.
Source code in vllm/model_executor/layers/fused_moe/utils.py
_fp8_quantize
_fp8_quantize(
    A: Tensor,
    A_scale: Optional[Tensor],
    per_act_token: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[Tensor, Tensor]
Perform fp8 quantization on the inputs. If a block_shape is provided, quantization is applied block-wise, with one scale per block.
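The combination of per_act_token and block_shape determines the shape of the returned scale tensor. A minimal NumPy sketch of the scale math (the function name, the e4m3 max of 448.0, and the K-blocking rule are assumptions here; the real helper returns fp8-typed torch tensors and runs on GPU):

```python
import numpy as np

FP8_MAX = 448.0  # largest finite value of float8 e4m3 (assumption)

def fp8_quantize_sketch(A, per_act_token, block_shape=None):
    """Illustrative scale math only; not the vLLM kernel."""
    if block_shape is not None:
        _, block_k = block_shape              # activations block along K
        M, K = A.shape
        tiles = A.reshape(M, K // block_k, block_k)
        scale = np.abs(tiles).max(axis=-1, keepdims=True) / FP8_MAX
        q = np.clip(tiles / scale, -FP8_MAX, FP8_MAX).reshape(M, K)
        return q, scale.reshape(M, K // block_k)  # one scale per block
    if per_act_token:
        scale = np.abs(A).max(axis=-1, keepdims=True) / FP8_MAX  # per row
    else:
        scale = np.abs(A).max(keepdims=True) / FP8_MAX           # per tensor
    return np.clip(A / scale, -FP8_MAX, FP8_MAX), scale
```

Per-token quantization yields an (M, 1) scale, blocked quantization an (M, K // block_k) scale, and per-tensor a single (1, 1) scale.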
_int8_quantize
_int8_quantize(
    A: Tensor,
    A_scale: Optional[Tensor],
    per_act_token: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[Tensor, Tensor]
Perform int8 quantization on the inputs. If a block_shape is provided, quantization is applied block-wise, with one scale per block.
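Unlike the fp8 path, int8 quantization rounds to an integer grid. A hedged sketch of symmetric int8 quantization (the 127 range and round-to-nearest rule are standard for symmetric int8, but this is an illustration, not the vLLM kernel):

```python
import numpy as np

def int8_quantize_sketch(A, per_act_token):
    """Symmetric int8 sketch: map max |value| to 127, then round."""
    if per_act_token:
        amax = np.abs(A).max(axis=-1, keepdims=True)  # one scale per token
    else:
        amax = np.abs(A).max(keepdims=True)           # single tensor scale
    scale = amax / 127.0
    q = np.clip(np.rint(A / scale), -128, 127).astype(np.int8)
    return q, scale
```

Dequantizing with q * scale recovers A to within half a quantization step.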
_resize_cache
Shrink the given tensor and apply the given view to it. This is used to resize the intermediate fused_moe caches.
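The point of resizing rather than allocating is that one oversized scratch buffer can back many differently shaped intermediates. A NumPy sketch of the assumed behavior (reuse a prefix of the flat cache as a view, no copy):

```python
import numpy as np

def resize_cache_sketch(x, view_shape):
    """Assumed behavior: view the first prod(view_shape) elements of the
    flat cache with the requested shape; storage is reused, not copied."""
    n = int(np.prod(view_shape))
    flat = x.reshape(-1)
    assert n <= flat.size, "cache too small for requested view"
    return flat[:n].reshape(view_shape)

workspace = np.empty(1024, dtype=np.float32)  # one shared scratch buffer
a = resize_cache_sketch(workspace, (8, 16))   # 128 elems of the same storage
```

Because every view aliases the same memory, a later view must not be read while an earlier one still holds live data.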
_validate_scale_shape
_validate_scale_shape(
    a: Tensor,
    a_scale: Optional[Tensor],
    per_act_token_quant: bool,
    block_shape: Optional[list[int]],
) -> None
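A validator like this checks that the scale tensor's shape matches the quantization mode of the activations. The shape rules below are assumptions for illustration, mirroring the scale shapes produced by per-token, per-tensor, and blocked quantization; they are not taken from the vLLM source:

```python
def validate_scale_shape_sketch(a_shape, scale_shape,
                                per_act_token_quant, block_shape):
    """Assumed rules: per-token scales are (M, 1), per-tensor scales are
    (1, 1), blocked scales carry one value per K-block."""
    M, K = a_shape
    if block_shape is not None:
        _, block_k = block_shape
        expected = (M, (K + block_k - 1) // block_k)  # ceil division
    elif per_act_token_quant:
        expected = (M, 1)
    else:
        expected = (1, 1)
    if scale_shape != expected:
        raise ValueError(f"scale shape {scale_shape}, expected {expected}")
```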
moe_kernel_quantize_input
moe_kernel_quantize_input(
    A: Tensor,
    A_scale: Optional[Tensor],
    quant_dtype: Optional[dtype],
    per_act_token_quant: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[Tensor, Optional[Tensor]]
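The Optional quant_dtype and Optional return scale suggest a dispatcher: pick a quantizer by dtype, or pass the input through when no quantization is requested. A hypothetical NumPy sketch of that dispatch (the string dtype tags and per-tensor/per-token scale logic are assumptions, not the vLLM implementation):

```python
import numpy as np

def moe_quantize_input_sketch(A, A_scale, quant_dtype, per_act_token_quant):
    """Hypothetical dispatcher: quant_dtype selects the quantizer,
    None means the input passes through unchanged."""
    if quant_dtype is None:
        return A, A_scale                       # no quantization requested
    limit = 448.0 if quant_dtype == "fp8" else 127.0
    if per_act_token_quant:
        scale = np.abs(A).max(axis=-1, keepdims=True) / limit
    else:
        scale = np.abs(A).max(keepdims=True) / limit
    q = np.clip(A / scale, -limit, limit)
    if quant_dtype == "int8":
        q = np.rint(q).astype(np.int8)          # int8 rounds to the grid
    return q, scale
```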