vllm.model_executor.layers.quantization.turboquant ¶
TurboQuant: KV-cache quantization for vLLM.
Hadamard rotation + per-coordinate Lloyd-Max scalar quantization for keys, uniform quantization for values.
The technique implemented here is the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression appears in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
Modules:
| Name | Description |
|---|---|
centroids | Lloyd-Max optimal scalar quantizer for TurboQuant. |
config | TurboQuant configuration. |
quantizer | TurboQuant quantizer utilities. |
TurboQuantConfig dataclass ¶
Configuration for TurboQuant KV-cache quantization.
Applies Hadamard rotation followed by per-coordinate Lloyd-Max scalar quantization for keys, and uniform quantization for values.
Historical note: this is the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper.
QJL is intentionally omitted — community consensus (5+ independent groups) found it hurts attention quality by amplifying variance through softmax.
Named presets (use via --kv-cache-dtype):

| Preset | Keys | Values | NC | Compression | PPL impact |
|---|---|---|---|---|---|
| turboquant_k8v4 | FP8 | 4-bit | no | 2.6x | +1.17% |
| turboquant_4bit_nc | 4-bit MSE | 4-bit | yes | 3.8x | +2.71% |
| turboquant_k3v4_nc | 3-bit MSE | 4-bit | yes | ~3.5x | +10.63% |
| turboquant_3bit_nc | 3-bit MSE | 3-bit | yes | 4.9x | +20.59% |
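As an illustration, the presets above can be read as (key bits, value bits, norm correction) triples. The sketch below is a hypothetical stand-alone mapping, not the actual vLLM preset parser:

```python
# Illustrative sketch (not vLLM's actual parser): each named preset maps
# to a (key_quant_bits, value_quant_bits, norm_correction) triple as
# described in the table above.
PRESETS = {
    "turboquant_k8v4":    (8, 4, False),  # FP8 keys + 4-bit values
    "turboquant_4bit_nc": (4, 4, True),   # 4-bit MSE keys + NC
    "turboquant_k3v4_nc": (3, 4, True),   # 3-bit keys, 4-bit values + NC
    "turboquant_3bit_nc": (3, 3, True),   # 3-bit keys and values + NC
}


def parse_preset(name: str) -> tuple[int, int, bool]:
    """Return (key_bits, value_bits, norm_correction) for a preset name."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"unknown TurboQuant preset: {name}") from None
```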
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
head_dim | int | Attention head dimension (e.g. 64, 96, 128). | 128 |
key_quant_bits | int | Bits for key quantization. 8 = FP8 keys (no rotation/MSE). 3-4 = Lloyd-Max MSE quantized keys. | 3 |
value_quant_bits | int | Bits per value dimension for uniform quantization. 3 = 8 levels, 4 = 16 levels (default). | 4 |
norm_correction | bool | Re-normalize centroid vectors to unit norm before inverse rotation during dequant. Fixes quantization-induced norm distortion, improving PPL by ~0.8% at 4-bit. | False |
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
effective_value_quant_bits property ¶
effective_value_quant_bits: int
Actual bits used for value storage.
key_mse_bits property ¶
key_mse_bits: int
MSE bits actually used for key quantization (0 if FP8 keys).
key_packed_size property ¶
key_packed_size: int
Packed bytes for a single KEY vector.
FP8 mode (key_quant_bits=8): head_dim bytes (1 byte per element, no overhead).
TQ mode (key_quant_bits < 8):
- MSE indices: ceil(head_dim * key_mse_bits / 8) bytes
- vec_norm: 2 bytes (float16)
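The two cases above can be sketched as a small helper (an illustrative re-derivation of the stated rule, not the property's actual source):

```python
import math


def key_packed_size(head_dim: int, key_quant_bits: int) -> int:
    """Packed bytes for one key vector, per the rule above."""
    if key_quant_bits == 8:
        # FP8 mode: 1 byte per element, no overhead.
        return head_dim
    # TQ mode: packed MSE indices plus a 2-byte float16 vec_norm.
    return math.ceil(head_dim * key_quant_bits / 8) + 2
```

For head_dim=128, FP8 keys take 128 bytes, while 3-bit MSE keys take ceil(128 * 3 / 8) + 2 = 50 bytes.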
mse_bits property ¶
mse_bits: int
MSE quantizer bit-width (determines centroid count: 2^mse_bits).
For MSE key modes, equals key_quant_bits. For FP8 key mode, falls back to value_quant_bits (centroids are still needed for continuation-prefill dequant and decode kernel params).
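A minimal sketch of this fallback rule (illustrative only, not the property's source):

```python
def mse_bits(key_quant_bits: int, value_quant_bits: int) -> int:
    """Centroid-table bit-width (2**mse_bits centroids), per the rule above."""
    # FP8 keys carry no MSE indices, but centroids are still needed for
    # continuation-prefill dequant and decode kernel params, so fall back
    # to the value bit-width.
    if key_quant_bits == 8:
        return value_quant_bits
    return key_quant_bits
```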
slot_size property ¶
slot_size: int
Total packed bytes per head per position (key + value combined).
Layout: [key_packed | value_packed]
slot_size_aligned property ¶
slot_size_aligned: int
Slot size rounded up to next even number.
An even number is required so that effective_head_size = slot_size_aligned // 2 is integral.
value_packed_size property ¶
value_packed_size: int
Packed bytes for a single VALUE vector.
Uniform quantization: ceil(head_dim * bits / 8) + 4 bytes (scale + zero fp16).
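The value packing, slot layout, and even-alignment rules described in the three properties above can be sketched together (an illustrative re-derivation, not the actual source):

```python
import math


def value_packed_size(head_dim: int, value_quant_bits: int) -> int:
    """Packed bytes for one value vector: indices + fp16 scale + fp16 zero."""
    return math.ceil(head_dim * value_quant_bits / 8) + 4


def slot_size(key_bytes: int, value_bytes: int) -> int:
    """Total packed bytes per head per position: [key_packed | value_packed]."""
    return key_bytes + value_bytes


def slot_size_aligned(size: int) -> int:
    """Round up to the next even number so size // 2 is an integral
    effective_head_size."""
    return size + (size % 2)
```

For head_dim=128 with 4-bit values, value_packed_size is ceil(128 * 4 / 8) + 4 = 68 bytes; with 3-bit MSE keys (50 bytes) the slot is 118 bytes, already even.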
from_cache_dtype staticmethod ¶
from_cache_dtype(
cache_dtype: str, head_dim: int
) -> TurboQuantConfig
Create config from a named preset.
Valid presets: turboquant_k8v4, turboquant_4bit_nc, etc.
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
get_boundary_skip_layers staticmethod ¶
Get layer indices to skip TQ compression (boundary protection).
Returns first N and last N layer indices as strings, suitable for kv_cache_dtype_skip_layers.
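A hypothetical re-implementation of this boundary-protection rule (the real helper lives in vllm/model_executor/layers/quantization/turboquant/config.py; the default n=2 is an assumption for illustration):

```python
def boundary_skip_layers(num_layers: int, n: int = 2) -> list[str]:
    """First n and last n layer indices as strings, suitable for a
    skip-layers list (sketch; n=2 default is assumed)."""
    first = range(min(n, num_layers))
    last = range(max(num_layers - n, 0), num_layers)
    # Deduplicate in case the ranges overlap for shallow models.
    return [str(i) for i in sorted(set(first) | set(last))]
```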