IndexCache¶
IndexCache reduces redundant top-k computation in DeepSeek-V3.2 (DSA) models by caching and reusing top-k indices across layers.
Background¶
DeepSeek-V3.2 uses a DeepSeek Sparse Attention (DSA) mechanism where top-k token selection is computed per layer. For deep models with many layers, this computation can be expensive. IndexCache allows skipping redundant top-k computations by reusing indices from previous layers.
See: IndexCache Paper
Usage¶
CLI¶
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...
Configuration Reference¶
| Parameter | Type | Default | Description |
|---|---|---|---|
use_index_cache | bool | false | Enable IndexCache. Must be set to true to use this feature |
index_topk_freq | int | 1 | Frequency (in layers) at which top-k is computed. 1 = compute on every layer (disabled), 4 = compute on 1/4 of layers |
index_topk_pattern | str | null | Per-layer F/S pattern. Overrides index_topk_freq if set. Each character maps to one DSA layer: F = Full, S = Shared |
Configuration Examples¶
Using index_topk_freq (compute every N layers):
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...
Using index_topk_pattern (explicit per-layer control):
# custom pattern for 61 layers: F = compute, S = reuse
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSF"}'
How It Works¶
- When IndexCache is enabled, layers marked with
"F"(Full) calculate and store top-k indices - Subsequent layers marked with
"S"(Shared) receive the cached indices from the previous layer instead of recomputing - The cached indices are passed through the layer stack, reducing total computation
Requirements¶
- DeepSeek-V3.2 or compatible DSA model
use_index_cache: truevia--hf-overrides