vllm.model_executor.layers.quantization.ptpc_fp8
PTPCFp8Config
Bases: Fp8Config
Config class for per-token, per-channel (PTPC) dynamic FP8 quantization.
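A minimal usage sketch, assuming this method is registered under the name "ptpc_fp8" (see get_name below) and running on a ROCm build, since the limitation noted under PTPCFp8LinearMethod ties it to float8_e4m3fnuz:

```python
from vllm import LLM

# Hypothetical usage: the model name is a placeholder for any BF16 checkpoint;
# dtype="bfloat16" matches the BF16-only loading requirement documented below.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="ptpc_fp8",
    dtype="bfloat16",
)
outputs = llm.generate("Hello, world!")
print(outputs[0].outputs[0].text)
```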
__init__
from_config
classmethod
from_config(config: dict[str, Any]) -> PTPCFp8Config
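A hedged construction sketch; the "activation_scheme" key is an assumption carried over from the parent Fp8Config's checkpoint schema:

```python
# Hypothetical: build the config from a checkpoint's quantization_config dict.
# The "activation_scheme" key is assumed from the parent Fp8Config's schema.
quant_config = PTPCFp8Config.from_config({"activation_scheme": "dynamic"})
```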
get_name
classmethod
get_name() -> QuantizationMethods
get_quant_method
get_quant_method(layer: Module, prefix: str) -> Optional[QuantizeMethodBase]
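The body is not shown here; as a rough sketch of the dispatch one would expect, mirroring the parent Fp8Config (the branching and the ignored_layers check are assumptions, not the actual implementation):

```python
# Hypothetical dispatch sketch; the real body lives in
# vllm/model_executor/layers/quantization/ptpc_fp8.py.
from vllm.model_executor.layers.linear import (LinearBase,
                                               UnquantizedLinearMethod)

def get_quant_method(self, layer, prefix):
    if isinstance(layer, LinearBase):
        if prefix in (self.ignored_layers or []):
            return UnquantizedLinearMethod()  # layer excluded by the config
        return PTPCFp8LinearMethod(self)      # quantize this linear layer
    return None                               # leave other layers untouched
```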
PTPCFp8LinearMethod
Bases: Fp8LinearMethod
Linear method for per-token and per-channel FP8 quantization. Only supports loading BF16 model checkpoints, which are quantized with dynamic activation scaling. To load FP16 model checkpoints, the user must specify that the FP16 weights be converted to BF16 during loading. The weight scaling factors are initialized after the model weights are loaded.
Limitations:
1. Only supports the float8_e4m3fnuz data type, due to a limitation of torch._scaled_mm (https://github.com/ROCm/pytorch/blob/8c0504d7f3fb0ee4c278c096a5c3caedb01129fa/aten/src/ATen/native/cuda/Blas.cpp#L1041).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
quant_config | PTPCFp8Config | The quantization config. | required |
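To make the per-token/per-channel scheme above concrete, here is an illustrative sketch, not vLLM's implementation: activations get one dynamic scale per token (row), weights one scale per output channel, and the product is dequantized by both scale vectors. float8_e4m3fn is used for portability; the ROCm path described above requires float8_e4m3fnuz.

```python
import torch

FP8 = torch.float8_e4m3fn  # the ROCm path would use torch.float8_e4m3fnuz
FP8_MAX = torch.finfo(FP8).max

def quantize_per_token(x: torch.Tensor):
    # One dynamic scale per token (row of the activation matrix).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(FP8), scale

def quantize_per_channel(w: torch.Tensor):
    # One scale per output channel (row of the [out, in] weight matrix).
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(FP8), scale

x = torch.randn(4, 64, dtype=torch.bfloat16)    # [tokens, in_features]
w = torch.randn(128, 64, dtype=torch.bfloat16)  # [out_features, in_features]
xq, xs = quantize_per_token(x)
wq, ws = quantize_per_channel(w)
# Reference dequantized matmul; a fused kernel (e.g. torch._scaled_mm) would
# apply both scale vectors without materializing the fp32 intermediates.
y = (xq.to(torch.float32) * xs.float()) @ (wq.to(torch.float32) * ws.float()).t()
print(y.shape)  # torch.Size([4, 128])
```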
fp8_linear
instance-attribute
fp8_linear = Fp8LinearOp(
    cutlass_fp8_supported=False,
    use_per_token_if_dynamic=True,
)
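Both defaults follow from the design above: cutlass_fp8_supported=False presumably steers the GEMM away from CUTLASS kernels (which target NVIDIA hardware) and onto the torch._scaled_mm path tied to float8_e4m3fnuz, while use_per_token_if_dynamic=True selects per-token scales whenever activation scales are computed dynamically.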
__init__
__init__(quant_config: PTPCFp8Config)
apply
apply(layer: Module, x: Tensor, bias: Optional[Tensor] = None) -> Tensor
process_weights_after_loading
process_weights_after_loading(layer: Module) -> None
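The docstring above notes that weight scaling factors are initialized after the weights are loaded; a hedged sketch of that post-load step (hypothetical body; the real one is in ptpc_fp8.py) could look like:

```python
import torch

def process_weights_after_loading(layer: torch.nn.Module) -> None:
    # Quantize the loaded BF16 weight once, per output channel, and replace
    # the parameter with the FP8 tensor plus its scale. Names are illustrative.
    w = layer.weight.data.to(torch.float32)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    layer.weight = torch.nn.Parameter((w / scale).to(torch.float8_e4m3fn),
                                      requires_grad=False)
    layer.weight_scale = torch.nn.Parameter(scale, requires_grad=False)
```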