vllm.model_executor.layers.quantization
Modules:
Name | Description |
---|---|
aqlm | |
auto_round | |
awq | |
awq_marlin | |
awq_triton | |
base_config | |
bitblas | |
bitsandbytes | |
compressed_tensors | |
deepgemm | |
deepspeedfp | |
experts_int8 | |
fbgemm_fp8 | |
fp8 | |
gguf | |
gptq | |
gptq_bitblas | |
gptq_marlin | |
gptq_marlin_24 | |
hqq_marlin | |
ipex_quant | |
kernels | |
kv_cache | |
marlin | |
modelopt | |
moe_wna16 | |
neuron_quant | |
ptpc_fp8 | |
qqq | |
quark | |
rtn | |
schema | This file contains the Pydantic schemas for various quantization-related parameters. |
torchao | |
tpu_int8 | |
utils | |
QUANTIZATION_METHODS (module-attribute)

```python
QUANTIZATION_METHODS: list[str] = list(
    get_args(QuantizationMethods)
)
```
QuantizationMethods (module-attribute)

```python
QuantizationMethods = Literal[
    "aqlm",
    "awq",
    "deepspeedfp",
    "tpu_int8",
    "fp8",
    "ptpc_fp8",
    "fbgemm_fp8",
    "modelopt",
    "modelopt_fp4",
    "marlin",
    "bitblas",
    "gguf",
    "gptq_marlin_24",
    "gptq_marlin",
    "gptq_bitblas",
    "awq_marlin",
    "gptq",
    "compressed-tensors",
    "bitsandbytes",
    "qqq",
    "hqq",
    "experts_int8",
    "neuron_quant",
    "ipex",
    "quark",
    "moe_wna16",
    "torchao",
    "auto-round",
    "rtn",
]
```
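For example, QUANTIZATION_METHODS can serve as a validation list for user input. A minimal sketch, where validate_method is a hypothetical helper and not part of vLLM:

```python
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS


def validate_method(name: str) -> str:
    # Reject unknown quantization method names before model loading.
    if name not in QUANTIZATION_METHODS:
        raise ValueError(
            f"Unknown quantization method {name!r}; "
            f"supported methods: {', '.join(QUANTIZATION_METHODS)}")
    return name


validate_method("awq")       # ok
validate_method("awq_4bit")  # raises ValueError
```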
__all__ (module-attribute)

```python
__all__ = [
    "QuantizationConfig",
    "QuantizationMethods",
    "get_quantization_config",
    "QUANTIZATION_METHODS",
]
```
QuantizationConfig

Bases: ABC

Base class for quantization configs.

Source code in vllm/model_executor/layers/quantization/base_config.py
__init__
apply_vllm_mapper

```python
apply_vllm_mapper(hf_to_vllm_mapper: WeightsMapper)
```

Interface for models to update module names referenced in quantization configs, in order to reflect the vLLM model structure.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
hf_to_vllm_mapper | WeightsMapper | Maps from the HF model structure (the structure the quantization config assumes) to the vLLM model structure. | required |

Source code in vllm/model_executor/layers/quantization/base_config.py
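A minimal sketch of the intended call pattern, assuming WeightsMapper's orig_to_new_prefix mapping from vLLM's model utilities and a hypothetical quant_config instance:

```python
from vllm.model_executor.models.utils import WeightsMapper

# Map HF-style module prefixes (the structure the quantization config
# was written against) onto the vLLM model structure.
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={"model.": "language_model.model."})

# quant_config is a hypothetical QuantizationConfig instance; after this
# call its layer references should follow the vLLM naming.
quant_config.apply_vllm_mapper(hf_to_vllm_mapper)
```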
from_config (abstractmethod, classmethod)

```python
from_config(config: dict[str, Any]) -> QuantizationConfig
```

Create a config class from the model's quantization config.
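A minimal sketch of a subclass implementation, assuming hypothetical checkpoint keys bits and group_size; the remaining abstract methods are omitted for brevity:

```python
from typing import Any

from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig)


class MyQuantConfig(QuantizationConfig):
    # Remaining abstract methods omitted for brevity.

    def __init__(self, weight_bits: int, group_size: int) -> None:
        super().__init__()
        self.weight_bits = weight_bits
        self.group_size = group_size

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "MyQuantConfig":
        # Read values out of the checkpoint's quantization config dict.
        weight_bits = cls.get_from_keys(config, ["bits"])
        group_size = cls.get_from_keys(config, ["group_size"])
        return cls(weight_bits, group_size)
```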
get_cache_scale
get_config_filenames (abstractmethod, staticmethod)
get_from_keys (staticmethod)

Get a value from the model's quantization config.

Source code in vllm/model_executor/layers/quantization/base_config.py
get_from_keys_or (staticmethod)

Get an optional value from the model's quantization config.

Source code in vllm/model_executor/layers/quantization/base_config.py
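Both helpers scan a list of candidate keys and return the first match; get_from_keys_or falls back to a default instead of raising. A short sketch with a hypothetical config dict:

```python
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig)

# Hypothetical HF-style quantization config.
cfg = {"bits": 4, "group_size": 128}

bits = QuantizationConfig.get_from_keys(cfg, ["bits", "w_bit"])  # -> 4
sym = QuantizationConfig.get_from_keys_or(cfg, ["sym"], True)    # -> True
```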
get_min_capability (abstractmethod, classmethod)

```python
get_min_capability() -> int
```

Minimum GPU capability to support the quantization method.

E.g., 70 for Volta, 75 for Turing, 80 for Ampere. This requirement is due to the custom CUDA kernels used by the quantization method.

Source code in vllm/model_executor/layers/quantization/base_config.py
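A minimal sketch of the corresponding runtime check, comparing the current device against a method's requirement; the major * 10 + minor encoding follows the examples above:

```python
import torch

from vllm.model_executor.layers.quantization import get_quantization_config

major, minor = torch.cuda.get_device_capability()
capability = major * 10 + minor  # e.g. (8, 0) -> 80 for Ampere

fp8_cls = get_quantization_config("fp8")
if capability < fp8_cls.get_min_capability():
    raise RuntimeError("This GPU is too old for fp8 quantization")
```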
get_name (abstractmethod)

```python
get_name() -> QuantizationMethods
```
get_quant_method (abstractmethod)

```python
get_quant_method(
    layer: Module, prefix: str
) -> Optional[QuantizeMethodBase]
```

Get the quantize method to use for the quantized layer.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
layer | Module | The layer for the quant method. | required |
prefix | str | The full name of the layer in the state dict. | required |

Returns: The quantize method, or None if the given layer does not support a quant method.

Source code in vllm/model_executor/layers/quantization/base_config.py
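A minimal sketch of a subclass implementation, continuing the MyQuantConfig example above and assuming a hypothetical MyLinearMethod built on QuantizeMethodBase:

```python
from typing import Optional

import torch

from vllm.model_executor.layers.linear import LinearBase
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class MyQuantConfig(QuantizationConfig):  # continued from above
    def get_quant_method(
            self, layer: torch.nn.Module,
            prefix: str) -> Optional[QuantizeMethodBase]:
        # Quantize only linear layers in this sketch; returning None
        # makes vLLM fall back to the unquantized implementation.
        if isinstance(layer, LinearBase):
            return MyLinearMethod(self)  # hypothetical method class
        return None
```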
get_supported_act_dtypes (abstractmethod)
override_quantization_method (classmethod)

```python
override_quantization_method(
    hf_quant_cfg, user_quant
) -> Optional[QuantizationMethods]
```

Detects whether this quantization method can support a given checkpoint format by overriding the user-specified quantization method. This method should only be overridden by subclasses in exceptional circumstances.

Source code in vllm/model_executor/layers/quantization/base_config.py
get_quantization_config

```python
get_quantization_config(
    quantization: str,
) -> type[QuantizationConfig]
```

Source code in vllm/model_executor/layers/quantization/__init__.py
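For example, resolving the config class registered under a method name; the concrete class name and capability value in the comments are assumptions for illustration:

```python
from vllm.model_executor.layers.quantization import get_quantization_config

quant_cls = get_quantization_config("awq")
print(quant_cls.__name__)              # e.g. "AWQConfig"
print(quant_cls.get_min_capability())  # minimum compute capability, e.g. 75
```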
register_quantization_config

```python
register_quantization_config(quantization: str)
```

Register a customized vLLM quantization config.

When a quantization method is not supported by vLLM, you can register a customized quantization config to support it.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
quantization | str | The quantization method name. | required |
Examples:

```python
>>> from vllm.model_executor.layers.quantization import register_quantization_config
>>> from vllm.model_executor.layers.quantization import get_quantization_config
>>> from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
>>>
>>> @register_quantization_config("my_quant")
... class MyQuantConfig(QuantizationConfig):
...     pass
>>>
>>> get_quantization_config("my_quant")
<class 'MyQuantConfig'>
```