vllm.distributed.kv_transfer.kv_connector.base
KVConnectorBase Class for Distributed KV Cache & Hidden State communication
The class provides two primary abstract methods:
1. send_kv_caches_and_hidden_states(): Send KV caches and hidden states
2. recv_kv_caches_and_hidden_states(): Recv KV caches and hidden states
KVConnectorBaseType
module-attribute
KVConnectorBaseType = Union[KVConnectorBase, KVConnectorBase_V1]
KVConnectorBase
Bases: ABC
Abstract base class for a KV connector.
The class provides two primary abstract methods:
1. send_kv_caches_and_hidden_states(): Send KV caches and hidden states
2. recv_kv_caches_and_hidden_states(): Recv KV caches and hidden states
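Because both methods are abstract, a subclass that omits either one cannot be instantiated. The sketch below illustrates that behavior with a hypothetical stand-in class, `ConnectorSketch`, that mirrors only the two method names listed above. It is not vLLM's `KVConnectorBase`, and its simplified `*args, **kwargs` signatures are an assumption made to keep the example free of vLLM imports.

```python
from abc import ABC, abstractmethod


class ConnectorSketch(ABC):
    """Hypothetical stand-in mirroring the two documented method names.

    NOT vLLM's KVConnectorBase; the real methods take vLLM-specific types.
    """

    @abstractmethod
    def send_kv_caches_and_hidden_states(self, *args, **kwargs) -> None:
        ...

    @abstractmethod
    def recv_kv_caches_and_hidden_states(self, *args, **kwargs):
        ...


class IncompleteConnector(ConnectorSketch):
    # Implements only the send side, so the class is still abstract.
    def send_kv_caches_and_hidden_states(self, *args, **kwargs) -> None:
        pass


def try_instantiate(cls) -> str:
    """Return "ok" if cls can be instantiated, else the TypeError message."""
    try:
        cls()
        return "ok"
    except TypeError as exc:
        return str(exc)
```

Calling `try_instantiate(IncompleteConnector)` returns an error message that names the missing `recv_kv_caches_and_hidden_states` method.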
Source code in vllm/distributed/kv_transfer/kv_connector/base.py
__init__
abstractmethod
__init__(rank: int, local_rank: int, config: VllmConfig)
close
abstractmethod
Close the buffer and release resources.
This method is responsible for cleaning up resources related to the connector when it is no longer needed.
Raises:
Type | Description |
---|---|
NotImplementedError | This method must be implemented in subclasses. |
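One way to make sure `close()` always runs is to wrap the connector in `contextlib.closing` from the standard library. The sketch below uses a hypothetical `ToyConnector` that exposes only `close()`; it stands in for a real connector subclass and is not part of vLLM.

```python
from contextlib import closing


class ToyConnector:
    """Hypothetical stand-in exposing only close(); not a vLLM class."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real connector would release buffers or network handles here.
        self.closed = True


def run_with_connector() -> ToyConnector:
    connector = ToyConnector()
    # contextlib.closing calls connector.close() when the block exits,
    # including when the body raises an exception.
    with closing(connector):
        pass  # transfer work would go here
    return connector
```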
recv_kv_caches_and_hidden_states
abstractmethod
recv_kv_caches_and_hidden_states(
    model_executable: Module,
    model_input: ModelInputForGPUWithSamplingMetadata,
    kv_caches: list[Tensor],
) -> tuple[
    Union[Tensor, IntermediateTensors],
    bool,
    ModelInputForGPUWithSamplingMetadata,
]
Receive KV caches and hidden states from the connector.
This method attempts to retrieve KV caches and hidden states for the input tokens. If all required KV caches and hidden states are received, model execution is bypassed; otherwise, execution falls back to normal vLLM model forwarding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_executable | Module | The model executable from vLLM modelrunner. | required |
model_input | ModelInputForGPUWithSamplingMetadata | The model input from vLLM modelrunner. | required |
kv_caches | list[Tensor] | List of KV caches for each layer. | required |
Returns:
Type | Description |
---|---|
Union[Tensor, IntermediateTensors] | Concatenated hidden states if all required data is retrieved, otherwise … |
bool | |
ModelInputForGPUWithSamplingMetadata | |
tuple[Union[Tensor, IntermediateTensors], bool, ModelInputForGPUWithSamplingMetadata] | |
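The bypass-or-fall-back behavior described for this method can be sketched from the caller's side. The code below is an illustration only: `AlwaysMissConnector`, `AlwaysHitConnector`, and `execute_step` are hypothetical names, plain Python lists stand in for tensors, and it assumes that the returned `bool` is `True` when the forward pass can be skipped. Using `None` as the placeholder on a miss is likewise an assumption of this sketch.

```python
from typing import Callable, List, Optional, Tuple

HiddenStates = List[float]  # stand-in for a tensor


class AlwaysMissConnector:
    """Hypothetical connector that never has the requested data."""

    def recv_kv_caches_and_hidden_states(
        self, model_executable, model_input, kv_caches
    ) -> Tuple[Optional[HiddenStates], bool, object]:
        # ASSUMPTION: False means "could not bypass; run the model".
        return None, False, model_input


class AlwaysHitConnector:
    """Hypothetical connector that always returns precomputed states."""

    def recv_kv_caches_and_hidden_states(
        self, model_executable, model_input, kv_caches
    ) -> Tuple[Optional[HiddenStates], bool, object]:
        return [1.0, 2.0], True, model_input


def execute_step(
    connector, model_executable: Callable, model_input, kv_caches
) -> HiddenStates:
    hidden, bypass, model_input = connector.recv_kv_caches_and_hidden_states(
        model_executable, model_input, kv_caches
    )
    if not bypass:
        # Fall back to a normal forward pass with the returned input.
        hidden = model_executable(model_input)
    return hidden
```

With `AlwaysHitConnector`, `execute_step` returns the received states without calling the model; with `AlwaysMissConnector`, it calls `model_executable` on the returned input.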
send_kv_caches_and_hidden_states
abstractmethod
send_kv_caches_and_hidden_states(
    model_executable: Module,
    model_input: ModelInputForGPUWithSamplingMetadata,
    kv_caches: list[Tensor],
    hidden_or_intermediate_states: Union[
        Tensor, IntermediateTensors
    ],
) -> None
Send KV caches and hidden states to the connector.
This method processes the input tokens, KV caches, and hidden/intermediate states for a given model and sends the data to the decode instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_executable | Module | The model executable containing start and end layer information. | required |
model_input | ModelInputForGPUWithSamplingMetadata | The input metadata from vLLM. | required |
kv_caches | list[Tensor] | List of KV caches (keys and values) for each layer. | required |
hidden_or_intermediate_states | Union[Tensor, IntermediateTensors] | The hidden or intermediate states associated with the tokens. | required |
Returns:
Type | Description |
---|---|
None | None |
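To show how the send and receive sides fit together, here is a toy round trip through an in-memory store. Everything in it is a simplification for illustration: `InMemoryConnector` is a hypothetical class that does not inherit from the vLLM base class, lists stand in for tensors, `model_input` is assumed to be a tuple of token IDs used as the lookup key, and the meaning of the returned `bool` (`True` when data was found) is an assumption. A real connector would move tensors between a prefill instance and a decode instance.

```python
from typing import Dict, List, Tuple

Tokens = Tuple[int, ...]


class InMemoryConnector:
    """Hypothetical dict-backed connector; not a vLLM class."""

    def __init__(self) -> None:
        self._store: Dict[Tokens, Tuple[List[list], list]] = {}

    def send_kv_caches_and_hidden_states(
        self, model_executable, model_input, kv_caches,
        hidden_or_intermediate_states,
    ) -> None:
        # ASSUMPTION: model_input is a tuple of token ids used as the key.
        self._store[tuple(model_input)] = (
            [list(layer) for layer in kv_caches],
            list(hidden_or_intermediate_states),
        )

    def recv_kv_caches_and_hidden_states(
        self, model_executable, model_input, kv_caches
    ):
        entry = self._store.get(tuple(model_input))
        if entry is None:
            # Nothing stored for these tokens: signal "do not bypass".
            return None, False, model_input
        saved_kv, hidden = entry
        for layer, saved in zip(kv_caches, saved_kv):
            layer[:] = saved  # fill the caller's per-layer cache in place
        return hidden, True, model_input

    def close(self) -> None:
        self._store.clear()
```

A possible usage: the sending side calls `send_kv_caches_and_hidden_states(None, (1, 2, 3), [[0.1], [0.2]], [9.0])`, and the receiving side then calls `recv_kv_caches_and_hidden_states(None, (1, 2, 3), [[], []])` to have its per-layer lists filled in and the stored hidden states returned.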