vllm.model_executor.layers.quantization.deepspeedfp
DeepSpeedFPConfig
Bases: QuantizationConfig
Config for the DeepSpeed FP quantizer. It supports fp6 and fp8.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`weight_bits` | `int` | The target quantization bits, 6 or 8. | `8` |
`group_size` | `int` | Group size for quantization. | `512` |
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
__init__
from_config classmethod
from_config(config: dict[str, Any]) -> DeepSpeedFPConfig
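The shape of `from_config` follows the usual `QuantizationConfig` pattern: pull the fields this quantizer cares about out of a checkpoint's quantization dict. A minimal sketch, assuming the keys `bits` and `group_size` (the real key names and parsing helpers in vLLM may differ):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DeepSpeedFPConfigSketch:
    """Stand-in for DeepSpeedFPConfig; defaults mirror the table above."""
    weight_bits: int = 8
    group_size: int = 512

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "DeepSpeedFPConfigSketch":
        # Pull only the fields this quantizer cares about; key names
        # ("bits", "group_size") are assumptions for this sketch.
        return cls(
            weight_bits=int(config.get("bits", 8)),
            group_size=int(config.get("group_size", 512)),
        )


cfg = DeepSpeedFPConfigSketch.from_config({"bits": 6, "group_size": 128})
```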
get_config_filenames staticmethod
get_linear_method
get_linear_method() -> DeepSpeedFPLinearMethod
get_name classmethod
get_name() -> QuantizationMethods
get_quant_method
get_quant_method(
layer: Module, prefix: str
) -> Optional[DeepSpeedFPLinearMethod]
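`get_quant_method` dispatches per layer: linear layers receive the DeepSpeedFP linear method, everything else gets `None` and runs unquantized. A hedged sketch with stand-in classes (vLLM's actual type checks and class hierarchy differ):

```python
from typing import Optional


# Stand-in layer types for illustration only.
class LinearBase: ...
class VocabParallelEmbedding: ...


class DeepSpeedFPLinearMethodSketch:
    """Placeholder for the linear quantization method."""


def get_quant_method(
    layer: object, prefix: str = ""
) -> Optional[DeepSpeedFPLinearMethodSketch]:
    # Only linear layers are quantized; other layer kinds fall back
    # to their unquantized implementation.
    if isinstance(layer, LinearBase):
        return DeepSpeedFPLinearMethodSketch()
    return None
```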
DeepSpeedFPLinearMethod
Bases: LinearMethodBase
Linear method for DeepSpeedFP quantizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`quant_config` | `DeepSpeedFPConfig` | The DeepSpeedFP quantization config. | required |
__init__
__init__(quant_config: DeepSpeedFPConfig)
apply
create_weights
create_weights(
layer: Module,
input_size_per_partition: int,
output_partition_sizes: list[int],
input_size: int,
output_size: int,
params_dtype: dtype,
weight_loader=None,
**extra_weight_attrs,
)
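`create_weights` allocates quantized storage rather than a full-precision weight tensor. As a back-of-envelope sketch of the memory saving, assuming `weight_bits` per weight plus one fp16 scale (2 bytes) per group of `group_size` weights — an assumption for illustration, not DeepSpeedFP's actual storage layout:

```python
def quantized_storage_bytes(num_weights: int, weight_bits: int = 6,
                            group_size: int = 512) -> int:
    """Approximate bytes to store a quantized weight tensor (sketch)."""
    # Packed weight payload: weight_bits per weight, rounded up to bytes.
    data_bytes = (num_weights * weight_bits + 7) // 8
    # One fp16 scale (2 bytes) per quantization group (assumed layout).
    num_groups = (num_weights + group_size - 1) // group_size
    return data_bytes + num_groups * 2


# A 4096x4096 weight at fp6 vs. its fp16 original (33,554,432 bytes).
fp6_bytes = quantized_storage_bytes(4096 * 4096, weight_bits=6)
```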
DeepSpeedFPParameter
Bases: Parameter
DeepSpeedFP quantized parameter class that implements fp6/fp8 quantization with DeepSpeed. Weights are stored in quantized form on GPUs and can be dequantized on the fly when needed by the model.
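The scheme below illustrates the general idea behind group-wise quantization with on-the-fly dequantization, in plain Python with symmetric integer levels rather than DeepSpeed's actual FP6/FP8 packing:

```python
def quantize_groups(values, group_size=4, levels=127):
    """Quantize values in groups, keeping one scale per group (sketch)."""
    groups = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        # Per-group scale chosen so the largest magnitude maps to `levels`;
        # the `or 1.0` guards against an all-zero group.
        scale = max(abs(v) for v in group) / levels or 1.0
        q = [round(v / scale) for v in group]
        groups.append((scale, q))
    return groups


def dequantize_groups(groups):
    """Reconstruct approximate values from (scale, ints) groups."""
    out = []
    for scale, q in groups:
        out.extend(v * scale for v in q)
    return out


weights = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0]
packed = quantize_groups(weights, group_size=4)
restored = dequantize_groups(packed)
```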
__new__
__new__(
orig_shape: Size,
params_dtype: dtype,
quant_config: DeepSpeedFPConfig,
)
ds_dequantize
ds_dequantize(fp_out=None) -> Tensor
Return a tensor containing the dequantized weights of this parameter.
ds_quantize_
ds_quantize_(tensor: Tensor)
ds_selective_dequantize
ds_selective_dequantize(indices, fp_out=None) -> Tensor
Return a tensor where only the weights at the given indices are dequantized (to save HBM -> SRAM bandwidth).
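The bandwidth-saving idea can be illustrated in plain Python: reconstruct only the rows named in `indices` instead of dequantizing the whole tensor. Per-row scales and the row-wise layout here are assumptions for the sketch, not DeepSpeedFP's actual format:

```python
def selective_dequantize(q_rows, scales, indices):
    """Dequantize only the requested rows (sketch of the idea).

    q_rows: list of integer rows; scales: one scale factor per row.
    Only rows listed in `indices` are touched, so unrequested rows
    never leave their compact quantized representation.
    """
    return {i: [v * scales[i] for v in q_rows[i]] for i in indices}


q_rows = [[10, 20], [30, 40], [50, 60]]
scales = [0.5, 0.25, 0.125]
partial = selective_dequantize(q_rows, scales, [0, 2])
```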