vllm.model_executor.models.mamba_cache
MambaCacheManager
Bases: ConstantSizeCache
Source code in vllm/model_executor/models/mamba_cache.py
__init__
__init__(
    vllm_config: VllmConfig,
    dtype: dtype,
    num_mamba_layers: int,
    conv_state_shape: tuple[int, int],
    temporal_state_shape: tuple[int, int],
)
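As a usage illustration, here is a minimal construction sketch. It assumes a VllmConfig instance is already available from the surrounding engine; the dtype and the shape values below are illustrative placeholders rather than values taken from a real checkpoint.

    import torch

    from vllm.model_executor.models.mamba_cache import MambaCacheManager

    # Allocate constant-size conv and SSM state buffers for 32 Mamba layers.
    # conv_state_shape and temporal_state_shape are per-layer state shapes;
    # the numbers here are placeholders for illustration only.
    mamba_cache = MambaCacheManager(
        vllm_config=vllm_config,          # assumed to be supplied by the engine
        dtype=torch.float16,
        num_mamba_layers=32,
        conv_state_shape=(5120, 3),       # e.g. (intermediate size, conv width - 1)
        temporal_state_shape=(5120, 16),  # e.g. (intermediate size, SSM state size)
    )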
_copy_cache
current_run_tensors
current_run_tensors(**kwargs) -> MambaCacheParams
Return the tensors for the current run's conv and ssm state.
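In a model's forward pass this is typically called once per run to fetch the shared state buffers. The sketch below assumes the returned MambaCacheParams can be sliced per layer with an at_layer_idx helper and that each layer accepts a mamba_cache_params keyword; both are assumptions for illustration.

    def forward(self, input_ids, positions, **kwargs):
        # Fetch the conv/SSM state buffers for the sequences in this run.
        mamba_cache_params = self.mamba_cache.current_run_tensors(**kwargs)
        hidden_states = self.embed_tokens(input_ids)
        for i, layer in enumerate(self.layers):
            # Hand each layer its slice of the shared cache (hypothetical
            # at_layer_idx helper and layer call signature).
            hidden_states = layer(
                hidden_states,
                mamba_cache_params=mamba_cache_params.at_layer_idx(i),
            )
        return hidden_states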
get_seqlen_agnostic_capture_inputs
get_seqlen_agnostic_capture_inputs(batch_size: int)
Provide the CUDA graph capture runs with a buffer of the adjusted batch size. The buffer is used to maintain the Mamba cache during the CUDA graph replay runs.
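Models integrating with vLLM's CUDA graph machinery can delegate this hook straight to the cache manager so that graph replays always see the same persistent buffer. The sketch below shows that delegation pattern; the copy_inputs_before_cuda_graphs companion hook is named here as an assumption inherited from ConstantSizeCache.

    # Sketch: methods on a Mamba-style model that forward the CUDA-graph
    # hooks to self.mamba_cache (a MambaCacheManager).
    def get_seqlen_agnostic_capture_inputs(self, batch_size: int):
        # Reuse the cache manager's persistent buffer so CUDA graph
        # replays operate on a stable address.
        return self.mamba_cache.get_seqlen_agnostic_capture_inputs(batch_size)

    def copy_inputs_before_cuda_graphs(self, input_buffers, **kwargs):
        # Assumed companion hook: copy the current run's state indices
        # into the captured buffer before each graph replay.
        return self.mamba_cache.copy_inputs_before_cuda_graphs(
            input_buffers, **kwargs)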