vllm.distributed.kv_transfer.kv_connector.lmcache_connector

LMCache KV Cache Connector for Distributed Machine Learning Inference

The LMCacheConnector can (1) transfer KV caches between a prefill vLLM worker (the KV cache producer) and a decode vLLM worker (the KV cache consumer) using LMCache, and (2) offload and share KV caches.
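The producer/consumer flow described above can be sketched with a toy in-memory store standing in for the LMCache backend. All names here (`InMemoryKVStore`, `put`, `get`) are illustrative stand-ins, not LMCache's actual API, and plain Python lists stand in for the GPU tensors the real connector moves:

```python
class InMemoryKVStore:
    """Toy stand-in for an LMCache-like KV store, keyed by token prefix."""

    def __init__(self):
        self._store = {}

    def put(self, token_ids, kv_blob):
        # Producer side: publish KV caches computed during prefill.
        self._store[tuple(token_ids)] = kv_blob

    def get(self, token_ids):
        # Consumer side: return the blob for the longest cached prefix, if any.
        key = tuple(token_ids)
        best = None
        for cached_key, blob in self._store.items():
            if key[:len(cached_key)] == cached_key:
                if best is None or len(cached_key) > len(best[0]):
                    best = (cached_key, blob)
        return best  # (matched_prefix, blob) or None


# Prefill worker (KV cache producer): computes and publishes KV caches.
store = InMemoryKVStore()
prompt = [101, 7592, 2088]                      # token ids
kv_from_prefill = ["kv-layer-0", "kv-layer-1"]  # stand-in per-layer tensors
store.put(prompt, kv_from_prefill)

# Decode worker (KV cache consumer): reuses the cached prefix rather than
# recomputing it, even though its request has one extra token.
hit = store.get(prompt + [2003])
matched_prefix, kv_blob = hit
```

The same producer/consumer split also covers the offload-and-share case: any worker that later sees a matching prefix can fetch the blob instead of recomputing it.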
LMCacheConnector

Bases: KVConnectorBase

Source code in vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py

engine (instance-attribute)
__init__

__init__(rank: int, local_rank: int, config: VllmConfig)
close

recv_kv_caches_and_hidden_states

recv_kv_caches_and_hidden_states(
    model_executable: Module,
    model_input: ModelInputForGPUWithSamplingMetadata,
    kv_caches: list[Tensor],
) -> tuple[
    Union[Tensor, IntermediateTensors],
    bool,
    ModelInputForGPUWithSamplingMetadata,
]
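Judging from the signature, the returned bool plausibly tells the caller whether model execution can be skipped entirely (a full cache hit recovered both KV caches and hidden states). A minimal sketch of that control flow, with stubbed functions in place of the real connector and model (names here are assumptions, not vLLM's internals):

```python
def run_model_with_kv_reuse(recv_fn, model_fn, model_input, kv_caches):
    # recv_fn mirrors recv_kv_caches_and_hidden_states' return shape:
    # (hidden_or_intermediate_states, bypass_flag, possibly-updated input).
    hidden_states, bypass_model_exec, model_input = recv_fn(
        model_input, kv_caches)
    if bypass_model_exec:
        # Full cache hit: reuse the received hidden states directly.
        return hidden_states
    # Partial or no hit: fall back to running the model.
    return model_fn(model_input, kv_caches)


# Stubbed demo: one recv reporting a full hit, one reporting a miss.
def fake_recv_hit(model_input, kv_caches):
    return "cached-hidden-states", True, model_input

def fake_recv_miss(model_input, kv_caches):
    return None, False, model_input

def fake_model(model_input, kv_caches):
    return "computed-hidden-states"

out_hit = run_model_with_kv_reuse(fake_recv_hit, fake_model, {"tokens": [1, 2]}, [])
out_miss = run_model_with_kv_reuse(fake_recv_miss, fake_model, {"tokens": [1, 2]}, [])
```

The returned model input being part of the tuple suggests the connector may rewrite it (e.g. trimming already-cached prefix tokens) before the fallback forward pass.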
send_kv_caches_and_hidden_states

send_kv_caches_and_hidden_states(
    model_executable: Module,
    model_input: ModelInputForGPUWithSamplingMetadata,
    kv_caches: list[Tensor],
    hidden_or_intermediate_states: Union[
        Tensor, IntermediateTensors
    ],
) -> None

Source code in vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py
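The producer-side responsibility this method covers can be sketched as: after the prefill forward pass, persist each request's per-layer KV entries along with the final hidden states. The flat-dict storage layout below is a toy assumption, not LMCache's real interface:

```python
def send_kv_and_hidden(store, request_tokens, kv_caches, hidden_states):
    """Toy producer-side send: one KV entry per decoder layer, plus the
    hidden states, keyed by the request's token prefix (illustrative)."""
    for layer_idx, layer_kv in enumerate(kv_caches):
        store[(tuple(request_tokens), layer_idx)] = layer_kv
    store[(tuple(request_tokens), "hidden")] = hidden_states
    # Returns nothing, mirroring the connector's `-> None` signature.

store = {}
send_kv_and_hidden(store, [5, 6, 7], ["kv-layer-0", "kv-layer-1"], "hidden")
```

Shipping the hidden (or intermediate) states alongside the KV caches is what lets a consumer with a full hit skip the forward pass entirely rather than merely skipping attention recomputation.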