vllm.v1.kv_cache_interface
AttentionSpec
dataclass
¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
FullAttentionSpec
dataclass
¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
sliding_window
class-attribute
instance-attribute
¶
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are regarded as full attention in the KV cache manager (blocks are allocated for all tokens), while they are computed as sliding window attention in the model runner. In this case, we use FullAttentionSpec and record the sliding window size. Defaults to None when sliding window attention is not used.
__init__
¶
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: Optional[int] = None,
) -> None
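A minimal construction sketch based on the signature above; the concrete values and the torch.float16 dtype are illustrative assumptions, not defaults from vLLM.

```python
import torch
from vllm.v1.kv_cache_interface import FullAttentionSpec

# Hybrid allocator disabled: a sliding-window layer is tracked as full
# attention, but the window size is recorded so the model runner can still
# compute sliding-window attention. All values here are illustrative.
spec = FullAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    use_mla=False,
    sliding_window=4096,
)
```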
max_memory_usage_bytes
¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
merge
classmethod
¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
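A hedged sketch of merge based only on the classmethod description above: it collapses the per-layer specs collected from a model's attention layers into a single spec.

```python
import torch
from vllm.v1.kv_cache_interface import FullAttentionSpec

# Hedged sketch (illustrative values): four full-attention layers with the
# same KV cache layout collapse into one FullAttentionSpec.
layer_specs = [
    FullAttentionSpec(block_size=16, num_kv_heads=8, head_size=128,
                      dtype=torch.float16, use_mla=False)
    for _ in range(4)
]
merged = FullAttentionSpec.merge(layer_specs)
assert merged.block_size == 16
```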
KVCacheConfig
dataclass
¶
The KV cache configuration of a model.
Source code in vllm/v1/kv_cache_interface.py
kv_cache_groups
instance-attribute
¶
kv_cache_groups: list[KVCacheGroupSpec]
The KV cache groups of the model.
For models with only one type of attention, there is only one group that
contains all layers.
For models with multiple types of attention, there will be multiple groups;
see _get_kv_cache_config_uniform_page_size
for more details.
kv_cache_tensors
instance-attribute
¶
kv_cache_tensors: list[KVCacheTensor]
How the model runner should initialize the KV cache tensors for each layer.
num_blocks
instance-attribute
¶
num_blocks: int
The number of KV cache blocks.
__init__
¶
__init__(
num_blocks: int,
kv_cache_tensors: list[KVCacheTensor],
kv_cache_groups: list[KVCacheGroupSpec],
) -> None
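An end-to-end sketch of wiring these pieces together. Only KVCacheConfig.__init__ is documented on this page; the KVCacheGroupSpec and KVCacheTensor constructor arguments used below (layer_names, kv_cache_spec, size, shared_by) are assumptions for illustration.

```python
import torch
from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig,
                                        KVCacheGroupSpec, KVCacheTensor)

spec = FullAttentionSpec(block_size=16, num_kv_heads=8, head_size=128,
                         dtype=torch.float16, use_mla=False)
# Assumed fields: one group means all listed layers share one block table.
group = KVCacheGroupSpec(layer_names=["model.layers.0.self_attn.attn"],
                         kv_cache_spec=spec)
config = KVCacheConfig(
    num_blocks=1024,
    # Assumed fields: one backing buffer, used by the listed layer.
    kv_cache_tensors=[KVCacheTensor(size=1 << 30,
                                    shared_by=["model.layers.0.self_attn.attn"])],
    kv_cache_groups=[group],
)
```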
KVCacheGroupSpec
dataclass
¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
Source code in vllm/v1/kv_cache_interface.py
KVCacheSpec
dataclass
¶
A base class for specifying the KV cache format of one layer.
Source code in vllm/v1/kv_cache_interface.py
type_id
property
¶
type_id: str
The type identifier of this KV cache. Returns different strings for layers with different KV cache types (e.g., a different number of cached tokens, as in full attention vs. sliding window attention, or a different KV cache size per token, as in layers with different numbers of heads).
Returns:
| Type | Description |
| --- | --- |
| str | The type identifier of this KV cache. |
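As a hedged illustration, two specs that cache different numbers of tokens compare unequal on type_id. The exact string format is an implementation detail; only equality matters for grouping layers.

```python
import torch
from vllm.v1.kv_cache_interface import FullAttentionSpec, SlidingWindowSpec

common = dict(block_size=16, num_kv_heads=8, head_size=128,
              dtype=torch.float16, use_mla=False)
full = FullAttentionSpec(**common)
swa = SlidingWindowSpec(**common, sliding_window=4096)
assert full.type_id != swa.type_id  # different KV cache types, separate groups
```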
max_memory_usage_bytes
¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
| --- | --- |
| int | The KV cache size in bytes. |
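A back-of-envelope sketch of the quantity this bound represents for a typical full-attention layer. The factor of 2 (keys plus values) and the concrete sizes are assumptions; real code should call the method with the model's VllmConfig rather than compute this by hand.

```python
# Rough worst-case estimate for one full-attention layer (illustrative numbers).
max_model_len = 8192                    # longest sequence the engine allows
bytes_per_token = 2 * 8 * 128 * 2       # (K and V) * num_kv_heads * head_size * fp16 bytes
print(max_model_len * bytes_per_token)  # 33554432 bytes, i.e. 32 MiB per layer
```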
merge
classmethod
¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheTensor
dataclass
¶
A class for specifying how the workers should initialize the KV cache.
Source code in vllm/v1/kv_cache_interface.py
MambaSpec
dataclass
¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowSpec
dataclass
¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__
¶
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: int,
) -> None
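A minimal construction sketch from the signature above; note that sliding_window is required here, unlike FullAttentionSpec, where it defaults to None. The values are illustrative.

```python
import torch
from vllm.v1.kv_cache_interface import SlidingWindowSpec

swa_spec = SlidingWindowSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    use_mla=False,
    sliding_window=1024,  # required: number of tokens kept in the attention window
)
```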
__post_init__
¶
max_memory_usage_bytes
¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int