vllm.model_executor.models.constant_size_cache
ConstantSizeCache ¶
Bases: ABC
Abstract base class for managing constant-size caches such as those used by Mamba and Minimax models.
Source code in vllm/model_executor/models/constant_size_cache.py
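Subclasses supply the actual storage and the slot-to-slot copy. A minimal sketch of a concrete subclass follows; `ToyStateCache`, its tensor shape, and the `cache` property body are illustrative assumptions, not the real Mamba cache (the base class declares an abstract `cache` property in the vLLM source, though it is not listed on this page):

```python
import torch

from vllm.model_executor.models.constant_size_cache import ConstantSizeCache


class ToyStateCache(ConstantSizeCache):
    """Hypothetical subclass keeping one fixed-size state row per slot."""

    def __init__(self, max_batch_size: int, state_dim: int = 16):
        super().__init__(max_batch_size)
        # One constant-size state row per batch slot; the shape is illustrative.
        self._state = torch.zeros(max_batch_size, state_dim)

    @property
    def cache(self):
        # Expose the backing tensor through the abstract `cache` property.
        return self._state

    def _copy_cache(self, from_index: int, to_index: int):
        # Duplicate one slot's state into another slot in place.
        self._state[to_index].copy_(self._state[from_index])
```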
__init__ ¶
__init__(max_batch_size: int)
Source code in vllm/model_executor/models/constant_size_cache.py
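For illustration, instantiating the hypothetical `ToyStateCache` from the sketch above shows what `max_batch_size` controls: the fixed number of cache slots allocated up front.

```python
# max_batch_size fixes the number of constant-size cache slots up front.
cache = ToyStateCache(max_batch_size=4)
assert cache.cache.shape[0] == 4  # one slot per potential batch entry
```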
_assign_seq_id_to_cache_index ¶
Assign a (req_id, seq_id) pair to a destination_index; if that index is already occupied, move the occupying entry to a free index.
Source code in vllm/model_executor/models/constant_size_cache.py
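A sketch of the allocation rule the docstring describes, written as a free function over a free-slot pool and a per-request mapping (the function name, parameter names, and the `PAD_SLOT_ID` sentinel are illustrative assumptions):

```python
PAD_SLOT_ID = -1  # assumed sentinel for sequences that need no slot


def assign_slot(req_id: str, seq_id: int, finished: list[str],
                mapping: dict[str, dict[int, int]], free_slots: list[int],
                copy_cache) -> int:
    if req_id in finished:
        # Finished requests get a pad slot and allocate nothing.
        return PAD_SLOT_ID
    if req_id not in mapping:
        # First sequence of a new request: take a free slot.
        slot = free_slots.pop()
        mapping[req_id] = {seq_id: slot}
        return slot
    seq_map = mapping[req_id]
    if seq_id not in seq_map:
        # Sibling sequence (e.g. parallel sampling): clone an existing
        # slot's state into a freshly allocated slot.
        slot = free_slots.pop()
        copy_cache(next(iter(seq_map.values())), slot)
        seq_map[seq_id] = slot
        return slot
    # Already assigned: reuse the existing slot.
    return seq_map[seq_id]
```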
_copy_cache abstractmethod ¶
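For an illustrative implementation of this abstract hook, see the hypothetical `ToyStateCache._copy_cache` sketch near the top of this page.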
_prepare_current_run_cache ¶
_prepare_current_run_cache(
request_ids_to_seq_ids: dict[str, list[int]],
finished_requests_ids: list[str],
) -> list[int]
Source code in vllm/model_executor/models/constant_size_cache.py
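In effect this flattens the request-to-sequence mapping into one cache-slot index per running sequence, in batch order. A usage sketch with the hypothetical `ToyStateCache` from above (the printed indices are illustrative):

```python
cache = ToyStateCache(max_batch_size=4)
indices = cache._prepare_current_run_cache(
    request_ids_to_seq_ids={"req-a": [0], "req-b": [0]},
    finished_requests_ids=[],
)
print(indices)  # e.g. [3, 2] -- one cache slot per running sequence
```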
_release_finished_requests ¶
Source code in vllm/model_executor/models/constant_size_cache.py
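A sketch of the release step, assuming the same free-slot pool and per-request mapping as in the allocation sketch above (all names are illustrative):

```python
def release_finished(finished_req_ids: list[str],
                     mapping: dict[str, dict[int, int]],
                     free_slots: list[int]) -> None:
    # Return every slot held by a finished request to the free pool.
    for req_id in finished_req_ids:
        seq_map = mapping.pop(req_id, None)
        if seq_map is not None:
            free_slots.extend(seq_map.values())
```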
copy_inputs_before_cuda_graphs ¶
Copy the relevant state_indices into the CUDA graph input buffer.
Source code in vllm/model_executor/models/constant_size_cache.py
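A sketch of the copy step, assuming the capture buffer has a fixed length and unused rows are marked with a pad sentinel (the helper name and `PAD_SLOT_ID` are illustrative assumptions):

```python
import torch

PAD_SLOT_ID = -1  # assumed padding sentinel


def fill_capture_buffer(state_indices: list[int],
                        index_buffer: torch.Tensor) -> None:
    # Pad the index list to the buffer's fixed capture size, then copy it
    # in place so the captured CUDA graph reads the fresh indices.
    pad_len = index_buffer.shape[0] - len(state_indices)
    padded = state_indices + [PAD_SLOT_ID] * pad_len
    index_buffer.copy_(torch.as_tensor(padded, dtype=index_buffer.dtype))
```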
current_run_tensors ¶
current_run_tensors(**kwargs) -> tuple
Return the tensors for the current run's conv and ssm state.
Source code in vllm/model_executor/models/constant_size_cache.py
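A usage sketch for an eager (non-captured) run; the kwarg names follow the `_prepare_current_run_cache` signature documented above, though `current_run_tensors` itself only advertises `**kwargs`:

```python
cache = ToyStateCache(max_batch_size=4)  # hypothetical subclass from above
cache_tensors, state_indices = cache.current_run_tensors(
    request_ids_to_seq_ids={"req-a": [0]},
    finished_requests_ids=[],
)
```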
get_seqlen_agnostic_capture_inputs ¶
get_seqlen_agnostic_capture_inputs(batch_size: int)
Provide the CUDA graph capture runs with a buffer sized for the given batch size. The buffer is used to maintain the cache during CUDA graph replay runs.
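A usage sketch tying capture and replay together; passing the returned buffers back through a `seqlen_agnostic_capture_inputs` kwarg is an assumption drawn from the method name, not spelled out on this page:

```python
# Before capturing a CUDA graph at a given batch size, fetch the
# constant-size buffers the replayed graph will keep reading from.
capture_inputs = cache.get_seqlen_agnostic_capture_inputs(batch_size=4)

# During replay-style runs, hand the same buffers back so the graph
# keeps operating on a stable allocation.
cache_tensors, state_indices = cache.current_run_tensors(
    seqlen_agnostic_capture_inputs=capture_inputs)
```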