vllm.v1.core.encoder_cache_manager
EncoderCacheManager
Manages caching of encoder outputs for multimodal models in vLLM V1.
The EncoderCacheManager handles the lifecycle of multimodal encoder outputs (such as vision embeddings from images) during request processing. It provides memory-aware caching to avoid recomputing encoder outputs when the same multimodal inputs appear in different stages of request processing.
This manager is particularly important for:

- Vision-language models (e.g., LLaVA) where image encoder outputs are cached
- Any multimodal model where encoder computation is expensive and cacheable
The cache operates at the granularity of individual multimodal input items within requests, allowing for fine-grained memory management and enabling chunked processing of multimodal inputs.
Note that no caching is shared between requests at this time. If the same input is used across multiple requests, it will be reprocessed for each request.
Parameters:

Name | Type | Description | Default
---|---|---|---
cache_size | int | Limit the size of the cache, measured by the number of tokens from the input sequence. | required
Attributes:

Name | Type | Description
---|---|---
cache_size | int | Total cache capacity in encoder tokens
num_free_slots | int | Current available cache capacity in encoder tokens
cached | dict[str, set[int]] | Mapping from request_id to set of cached input_ids for that request
freed | list[tuple[str, int]] | List of (request_id, input_id) pairs that were recently freed. This is cleared after every call to get_freed_ids().
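To make the bookkeeping above concrete, here is a minimal, self-contained sketch of a token-budgeted cache with the same four attributes. It is not the vLLM implementation: unlike the real manager, which stores only a set of input_ids per request as documented above, this toy records the token cost per input so that freeing can return capacity on its own.

```python
# Minimal sketch of the bookkeeping described above (not the vLLM source).
class TinyEncoderCache:
    def __init__(self, cache_size: int) -> None:
        self.cache_size = cache_size            # total capacity in encoder tokens
        self.num_free_slots = cache_size        # remaining capacity
        self.cached: dict[str, dict[int, int]] = {}  # request_id -> {input_id: num_tokens}
        self.freed: list[tuple[str, int]] = []

    def can_allocate(self, num_tokens: int) -> bool:
        return num_tokens <= self.num_free_slots

    def allocate(self, request_id: str, input_id: int, num_tokens: int) -> None:
        self.cached.setdefault(request_id, {})[input_id] = num_tokens
        self.num_free_slots -= num_tokens

    def free_input(self, request_id: str, input_id: int) -> None:
        self.num_free_slots += self.cached[request_id].pop(input_id)
        self.freed.append((request_id, input_id))

    def get_freed_ids(self) -> list[tuple[str, int]]:
        freed, self.freed = self.freed, []
        return freed
```

Because nothing is shared across requests, the same image attached to two different requests occupies two independent entries here, matching the per-request behavior noted above.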
allocate
Allocate cache space for a multimodal input's encoder output.
This method reserves cache space for storing the encoder output of the specified multimodal input. The actual encoder output storage happens in the model runner, but this method ensures the cache manager tracks the allocation.
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request containing the multimodal input | required
input_id | int | Index of the multimodal input within the request | required
Note
This method assumes can_allocate() returned True for the same request and input_id. It will reduce available cache space.
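A hedged sketch of the guard this note implies, written as a free-standing helper; manager, request, and input_id are hypothetical stand-ins for objects the scheduler already holds, not vLLM scheduler code.

```python
# Hypothetical helper illustrating the contract above: only call allocate()
# after can_allocate() has approved the same (request, input_id) pair.
def try_reserve_encoder_output(manager, request, input_id: int) -> bool:
    if manager.has_cache(request, input_id):
        return True   # encoder output already cached for this input
    if not manager.can_allocate(request, input_id):
        return False  # not enough free encoder-token slots; try again later
    manager.allocate(request, input_id)  # reserve space; the model runner stores the output
    return True
```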
can_allocate
Check if there's sufficient cache space for a multimodal input.
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request containing the multimodal input | required
input_id | int | Index of the multimodal input within the request | required
Returns:

Type | Description
---|---
bool | True if there's enough free cache space to store the encoder output for this multimodal input
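In terms of the attributes documented above, this check reduces to a capacity comparison in encoder tokens. The sketch below is an assumption about that arithmetic, with num_encoder_tokens standing in for the token cost of the multimodal input.

```python
# Assumed shape of the capacity check, in encoder tokens: an input fits only
# if its encoder output can be held within the remaining free slots.
def fits_in_cache(num_free_slots: int, num_encoder_tokens: int) -> bool:
    return num_encoder_tokens <= num_free_slots

# e.g. a 576-token image embedding fits in a cache with 2048 free slots
assert fits_in_cache(2048, 576)
```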
free
free(request: Request) -> None
Free all cached encoder outputs for a request.
This method is typically called when a request is finished, cancelled, or aborted, and all its encoder outputs should be freed from cache.
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request whose encoder outputs should be freed | required
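A hedged sketch of the cleanup path described above; manager and request are hypothetical stand-ins supplied by the scheduler, and the hook name is illustrative.

```python
# Hypothetical cleanup hook: when a request finishes, is cancelled, or is
# aborted, release every encoder output it still has cached.
def on_request_finished(manager, request) -> None:
    manager.free(request)
    # The freed (request_id, input_id) pairs become visible to workers via
    # manager.get_freed_ids() on a later scheduling step.
```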
free_encoder_input
Free cache space for a single multimodal input's encoder output.
This method is called when:

- The encoder output has been fully consumed by the decoder and is no longer needed (e.g., in vision-language models after image tokens are processed)
- A request is being cancelled or aborted
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request containing the multimodal input | required
input_id | int | Index of the multimodal input to free from cache | required
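A hedged sketch of the first case above. How "fully consumed" is tracked is an assumption here; tokens_remaining is a hypothetical counter maintained by the caller, not a vLLM field.

```python
# Hypothetical sketch: once the decoder has consumed all placeholder tokens of
# one multimodal input, drop just that input's encoder output.
def maybe_release_input(manager, request, input_id: int, tokens_remaining: int) -> None:
    if tokens_remaining == 0 and manager.has_cache(request, input_id):
        manager.free_encoder_input(request, input_id)
```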
get_cached_input_ids
Get all cached multimodal input IDs for a request.
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request to query | required
Returns:

Type | Description
---|---
set[int] | Set of input_ids that have cached encoder outputs for this request. Returns empty set if no inputs are cached for this request.
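A hedged usage sketch: the returned set can be used to decide which inputs still need an encoder pass. manager and request are hypothetical, and num_mm_inputs is assumed to be the number of multimodal items attached to the request.

```python
# Hypothetical sketch: determine which multimodal inputs of a request still
# need an encoder pass.
def pending_encoder_inputs(manager, request, num_mm_inputs: int) -> list[int]:
    cached = manager.get_cached_input_ids(request)   # set[int], possibly empty
    return [i for i in range(num_mm_inputs) if i not in cached]
```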
get_freed_ids
Get and clear the list of recently freed encoder cache entries.
This method returns all encoder cache entries that were freed since the last call to this method. It's used by the scheduler to notify workers about which encoder outputs can be removed from their caches.
Returns:

Type | Description
---|---
list[tuple[str, int]] | List of (request_id, input_id) tuples that were freed since the last call. The internal freed list is cleared after this call.
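A hedged sketch of the scheduler-to-worker hand-off described above. workers and drop_encoder_outputs are illustrative names, not vLLM APIs; the real wiring lives in the scheduler.

```python
# Hypothetical wiring: drain the freed list once per step and forward it to
# workers so they can drop the corresponding encoder outputs.
def broadcast_freed_encoder_entries(manager, workers) -> None:
    freed = manager.get_freed_ids()   # also clears the internal freed list
    if not freed:
        return
    for worker in workers:
        worker.drop_encoder_outputs(freed)   # list of (request_id, input_id)
```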
has_cache
Check if encoder output for a specific multimodal input is cached.
Parameters:

Name | Type | Description | Default
---|---|---|---
request | Request | The request containing the multimodal input | required
input_id | int | Index of the multimodal input within the request | required
Returns:

Type | Description
---|---
bool | True if the encoder output for this input is already cached
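A hedged sketch of the typical use: skip scheduling encoder work for an input whose output is already cached within the same request (for example, when processing continues across later stages of that request). manager and request are hypothetical stand-ins.

```python
# Hypothetical sketch: only schedule an encoder run when the output is not
# already cached for this (request, input_id) pair.
def needs_encoder_run(manager, request, input_id: int) -> bool:
    return not manager.has_cache(request, input_id)
```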
_compute_encoder_budget_multimodal
_compute_encoder_budget_multimodal(
    model_config: ModelConfig,
    scheduler_config: SchedulerConfig,
    mm_registry: MultiModalRegistry,
) -> tuple[int, int]
Compute the encoder cache budget based on the model and scheduler configurations for a multimodal model.
Parameters:

Name | Type | Description | Default
---|---|---|---
model_config | ModelConfig | Model configuration. | required
scheduler_config | SchedulerConfig | Scheduler configuration. | required
mm_registry | MultiModalRegistry | Provides information about the token cost. | required
Returns:

Type | Description
---|---
int |
int |
compute_encoder_budget
compute_encoder_budget(
    model_config: ModelConfig,
    scheduler_config: SchedulerConfig,
    mm_registry: MultiModalRegistry,
) -> tuple[int, int]
Compute the encoder cache budget based on the model and scheduler configurations.
Parameters:

Name | Type | Description | Default
---|---|---|---
model_config | ModelConfig | Model configuration. | required
scheduler_config | SchedulerConfig | Scheduler configuration. | required
mm_registry | MultiModalRegistry | Provides information about the token cost. | required
Returns:

Type | Description
---|---
int |
int |
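A hedged sketch of how the returned pair might feed the cache manager documented above. Interpreting the two integers as a per-step compute budget and a cache size, both in encoder tokens, is an assumption on our part; this page does not spell out the tuple's order.

```python
from vllm.v1.core.encoder_cache_manager import (EncoderCacheManager,
                                                compute_encoder_budget)

def build_encoder_cache(model_config, scheduler_config, mm_registry):
    # Assumed tuple order: (per-step compute budget, cache size), both counted
    # in encoder tokens.
    encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
        model_config, scheduler_config, mm_registry)
    # The cache size feeds the documented cache_size parameter of
    # EncoderCacheManager.
    return encoder_compute_budget, EncoderCacheManager(cache_size=encoder_cache_size)
```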