# vllm.model_executor.layers.quantization.quark.schemes

Modules:

| Name | Description |
| --- | --- |
| `quark_scheme` | |
| `quark_w4a4_mxfp4` | |
| `quark_w8a8_fp8` | |
| `quark_w8a8_int8` | |
## `__all__` (module-attribute)
## QuarkScheme

Bases: `ABC`

Abstract class used to describe the weight creation and forward pass of different quantization schemes supported by Quark.

Source code in `vllm/model_executor/layers/quantization/quark/schemes/quark_scheme.py`
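The contract this abstract class describes can be sketched in plain Python. This is an illustrative stand-in, not vLLM's actual implementation: the real scheme operates on `torch.nn.Module` layers and registers `torch` parameters, while the sketch below uses a dict and nested lists so it runs without any dependencies.

```python
from abc import ABC, abstractmethod


class QuarkSchemeSketch(ABC):
    """Plain-Python stand-in for the QuarkScheme interface described above."""

    @abstractmethod
    def create_weights(self, layer, output_partition_sizes,
                       input_size_per_partition, params_dtype,
                       weight_loader, **kwargs):
        """Register the (possibly quantized) weight parameters on `layer`."""

    @abstractmethod
    def apply_weights(self, layer, x, bias=None):
        """Run the forward pass, applying scheme-specific quant/dequant."""

    def process_weights_after_loading(self, layer):
        """Hook to repack/rescale weights once the checkpoint is loaded.

        (In the sketch this is a no-op default; the real class may require
        subclasses to implement it.)
        """


class IdentityScheme(QuarkSchemeSketch):
    """Trivial 'no quantization' scheme, only to show the contract."""

    def create_weights(self, layer, output_partition_sizes,
                       input_size_per_partition, params_dtype,
                       weight_loader, **kwargs):
        # All-ones weight: rows = sum of fused output partitions.
        rows = sum(output_partition_sizes)
        layer["weight"] = [[1.0] * input_size_per_partition
                           for _ in range(rows)]

    def apply_weights(self, layer, x, bias=None):
        # Plain matrix-vector product; a real scheme would call a
        # quantized GEMM kernel here.
        out = [sum(w * v for w, v in zip(row, x))
               for row in layer["weight"]]
        if bias is not None:
            out = [o + b for o, b in zip(out, bias)]
        return out
```

A concrete scheme such as `QuarkW8A8Fp8` fills in these same three hooks with its quantized kernels.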
### apply_weights (abstractmethod)

Run the forward pass for the particular scheme. This is where scheme-specific dequantization/quantization steps or kernels should be applied.

Parameters:

- `layer`: `torch.nn.Module` with the registered weights and other parameters relevant to the particular scheme.
- `x`: input to the layer.
- `bias`: bias parameter.
### create_weights (abstractmethod)

Weight creation for the particular scheme. Inputs to this function
## QuarkW4A4MXFP4

Bases: `QuarkScheme`

Source code in `vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py`
### __init__

### apply_weights
### create_weights

```python
create_weights(
    layer: Module,
    output_partition_sizes: list[int],
    input_size_per_partition: int,
    params_dtype: dtype,
    weight_loader: Callable,
    **kwargs,
)
```
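A note on how these parameters relate: `output_partition_sizes` lists the output sizes of the sub-projections fused into one linear layer (e.g. the Q, K, and V projections of a fused QKV layer), so the created weight matrix has `sum(output_partition_sizes)` output rows, while `input_size_per_partition` is this tensor-parallel shard's slice of the input dimension. A minimal sketch (the helper name is illustrative, not a vLLM function):

```python
def weight_shape(output_partition_sizes: list[int],
                 input_size_per_partition: int) -> tuple[int, int]:
    """Shape of the weight created for one tensor-parallel shard.

    Rows: sum over the fused sub-projections' output sizes.
    Cols: this shard's slice of the input dimension.
    """
    return (sum(output_partition_sizes), input_size_per_partition)
```

For example, a fused QKV projection with partitions `[1024, 256, 256]` on a shard with input size 512 yields a `(1536, 512)` weight.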
### process_weights_after_loading

```python
process_weights_after_loading(layer: Module) -> None
```
## QuarkW8A8Fp8

Bases: `QuarkScheme`

Source code in `vllm/model_executor/layers/quantization/quark/schemes/quark_w8a8_fp8.py`
### fp8_linear (instance-attribute)

```python
fp8_linear = Fp8LinearOp(
    use_per_token_if_dynamic=use_per_token_if_dynamic
)
```

### use_per_token_if_dynamic (instance-attribute)
### __init__

### apply_weights
### create_weights

```python
create_weights(
    layer: Module,
    output_partition_sizes: list[int],
    input_size_per_partition: int,
    params_dtype: dtype,
    weight_loader: Callable,
    **kwargs,
)
```
### process_weights_after_loading
## QuarkW8A8Int8

Bases: `QuarkScheme`

Source code in `vllm/model_executor/layers/quantization/quark/schemes/quark_w8a8_int8.py`
### _kernel_backends_being_used (class-attribute, instance-attribute)

### __init__
### apply_weights

### create_weights

```python
create_weights(
    layer: Module,
    output_partition_sizes: list[int],
    input_size_per_partition: int,
    params_dtype: dtype,
    weight_loader: Callable,
    **kwargs,
)
```
### process_weights_after_loading

```python
process_weights_after_loading(layer: Module) -> None
```
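As background for the W8A8-INT8 scheme, symmetric per-output-channel weight quantization can be sketched in plain Python. This is illustrative only (vLLM's kernels operate on torch tensors, and each row here is assumed to contain at least one nonzero value):

```python
INT8_MAX = 127  # symmetric int8 quantization uses the range [-127, 127]


def quantize_per_channel(w: list[list[float]]):
    """Symmetric per-output-channel int8 quantization: q = round(w / scale)."""
    scales = [max(abs(v) for v in row) / INT8_MAX for row in w]
    q = [[round(v / s) for v in row] for row, s in zip(w, scales)]
    return q, scales


def dequantize(q: list[list[int]], scales: list[float]) -> list[list[float]]:
    """Recover an approximation of the original weights: w ~= q * scale."""
    return [[v * s for v in row] for row, s in zip(q, scales)]
```

Each output channel (row) gets its own scale, so a channel with small weights is not crushed by a large outlier in another channel; the activation side of W8A8 applies an analogous quantization to `x` at runtime.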