vllm.worker.model_runner_base
BroadcastableModelInput
¶
Bases: ABC
Source code in vllm/worker/model_runner_base.py
as_broadcastable_tensor_dict
abstractmethod
¶
Extract broadcastable fields. Override for fields that require some custom deserialization.
from_broadcasted_tensor_dict
abstractmethod
classmethod
¶
from_broadcasted_tensor_dict(
tensor_dict: Dict[str, Any],
attn_backend: Optional[AttentionBackend] = None,
) -> T
Pop fields from the given tensor_dict and populate a new instance of BroadcastableModelInput.
Source code in vllm/worker/model_runner_base.py
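The two methods above form a round trip: as_broadcastable_tensor_dict flattens an input into a plain dict for broadcast, and from_broadcasted_tensor_dict pops those fields back out to rebuild an instance. A minimal sketch of that contract, using a hypothetical MyModelInput stand-in rather than vLLM's real classes:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical stand-in for a BroadcastableModelInput subclass; the real
# classes live in vllm/worker/model_runner_base.py.
@dataclass
class MyModelInput:
    input_tokens: List[int] = field(default_factory=list)
    virtual_engine: int = 0

    def as_broadcastable_tensor_dict(self) -> Dict[str, Any]:
        # Extract only the fields that can be broadcast to other workers.
        return {
            "input_tokens": self.input_tokens,
            "virtual_engine": self.virtual_engine,
        }

    @classmethod
    def from_broadcasted_tensor_dict(
            cls, tensor_dict: Dict[str, Any]) -> "MyModelInput":
        # Pop fields from the given tensor_dict and populate a new instance.
        return cls(
            input_tokens=tensor_dict.pop("input_tokens"),
            virtual_engine=tensor_dict.pop("virtual_engine"),
        )

original = MyModelInput(input_tokens=[1, 2, 3], virtual_engine=1)
restored = MyModelInput.from_broadcasted_tensor_dict(
    original.as_broadcastable_tensor_dict())
```

Fields needing custom deserialization (e.g. attention metadata tied to a backend) are the reason from_broadcasted_tensor_dict is a classmethod taking an optional attn_backend.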
InputProcessingError
¶
Bases: Exception
This exception is raised when an error occurs preparing the inputs for a single sequence group. This allows the engine to gracefully handle errors with a single sequence group without having to fail the entire batch.
Source code in vllm/worker/model_runner_base.py
__init__
¶
request_id is the ID of the offending sequence group.
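The error-isolation pattern this exception enables can be sketched as follows; the InputProcessingError and prepare_batch shown here are simplified illustrations, not vLLM's actual implementations:

```python
from typing import List, Optional, Tuple

class InputProcessingError(Exception):
    """Raised when preparing inputs for one sequence group fails."""

    def __init__(self, request_id: str, message: str) -> None:
        self.request_id = request_id
        super().__init__(
            f"Failed to prepare inputs for request {request_id}: {message}")

def prepare_batch(
    requests: List[Tuple[str, Optional[List[int]]]],
) -> Tuple[List[Tuple[str, List[int]]], List[str]]:
    """Prepare each sequence group, dropping only the ones that fail."""
    prepared, failed = [], []
    for request_id, tokens in requests:
        try:
            if tokens is None:
                raise InputProcessingError(request_id, "empty token list")
            prepared.append((request_id, tokens))
        except InputProcessingError as e:
            # Record the offending request id; the rest of the batch survives.
            failed.append(e.request_id)
    return prepared, failed

prepared, failed = prepare_batch([("a", [1]), ("b", None), ("c", [2])])
```

Because the exception carries request_id, the engine can report a per-request failure upstream instead of aborting the whole batch.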
ModelRunnerBase
¶
Model runner interface that abstracts a particular hardware and/or type of model. Model execution may communicate data with model runners in other processes, but it should not include control plane metadata communication.
Each ModelRunnerBase subclass should define a corresponding ModelRunnerInputBase subclass.
Source code in vllm/worker/model_runner_base.py
__init__
¶
__init__(vllm_config: VllmConfig) -> None
Source code in vllm/worker/model_runner_base.py
execute_model
¶
execute_model(
model_input: T,
kv_caches: Optional[List[Tensor]],
intermediate_tensors: Optional[
IntermediateTensors
] = None,
num_steps: int = 1,
**kwargs,
) -> Optional[List[SamplerOutput]]
Execute the model on the given input.
Source code in vllm/worker/model_runner_base.py
get_generators
¶
Return dict of per-request generators used for random sampling.
Source code in vllm/worker/model_runner_base.py
make_model_input_from_broadcasted_tensor_dict
abstractmethod
¶
Make an instance of a ModelRunnerInputBase from the broadcasted tensor dict.
Source code in vllm/worker/model_runner_base.py
prepare_model_input
abstractmethod
¶
prepare_model_input(
seq_group_metadata_list: List[SequenceGroupMetadata],
virtual_engine: int = 0,
finished_requests_ids: Optional[List[str]] = None,
) -> T
Prepare the inputs to ModelRunnerBase.execute_model from an execution request. This method may move data to the worker's local device. It is not allowed to communicate with other workers or devices.
Source code in vllm/worker/model_runner_base.py
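The prepare_model_input / execute_model split above can be sketched with a toy runner; ToyModelRunner and its "model" (which just sums token ids) are illustrative assumptions, not vLLM classes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToyModelInput:
    # Worker-local, device-ready form of the request data.
    token_lists: List[List[int]]

class ToyModelRunner:
    def prepare_model_input(
            self, seq_group_metadata_list: List[List[int]]) -> ToyModelInput:
        # Convert the execution request into worker-local inputs. Per the
        # contract, this may move data to the local device but must not
        # communicate with other workers or devices.
        return ToyModelInput(token_lists=list(seq_group_metadata_list))

    def execute_model(self,
                      model_input: ToyModelInput,
                      num_steps: int = 1) -> Optional[List[List[int]]]:
        # One "sampler output" per step; here each output is the sum of
        # each sequence's tokens.
        outputs = []
        for _ in range(num_steps):
            outputs.append([sum(tokens) for tokens in model_input.token_lists])
        return outputs

runner = ToyModelRunner()
model_input = runner.prepare_model_input([[1, 2], [3]])
outputs = runner.execute_model(model_input, num_steps=2)
```

The key point of the interface is the same as in vLLM: input preparation is local and side-effect-free with respect to other workers, while execution may exchange data (not control metadata) with peer model runners.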
ModelRunnerInputBase
dataclass
¶
Bases: BroadcastableModelInput
Local inputs to each worker's model runner. May contain device-specific data. Different worker backends may have different methods of converting from the global ExecuteModelRequest produced by the LLM engine to the worker-local ModelRunnerInputBase objects.
Model runners that support multi-GPU execution should define a ModelRunnerInputBase subclass, add their required fields, and specify how to serialize/deserialize a ModelInput for broadcast between workers.
Source code in vllm/worker/model_runner_base.py
ModelRunnerInputBuilderBase
¶
A builder to create ModelRunnerInputBase objects.
Source code in vllm/worker/model_runner_base.py
ModelRunnerWrapperBase
¶
The whole point of this class is to lazily initialize the model_runner.
Source code in vllm/worker/model_runner_base.py
__getattr__
¶
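The delegation mechanism behind the wrapper can be sketched as below; ModelRunnerWrapper and _InnerRunner are hypothetical stand-ins showing only the __getattr__ forwarding pattern:

```python
class _InnerRunner:
    """Hypothetical wrapped runner."""

    def __init__(self) -> None:
        self.device = "cpu"

    def execute_model(self) -> str:
        return "ran"

class ModelRunnerWrapper:
    """Forwards unknown attribute lookups to the wrapped model_runner."""

    def __init__(self, model_runner: _InnerRunner) -> None:
        self.model_runner = model_runner

    def __getattr__(self, attr: str):
        # __getattr__ is invoked only when normal lookup fails, so the
        # wrapper's own attributes (like `model_runner`) take precedence
        # and everything else is delegated.
        return getattr(self.model_runner, attr)

wrapper = ModelRunnerWrapper(_InnerRunner())
```

This lets callers treat the wrapper as if it were the runner itself, while the wrapper controls when the underlying runner is constructed.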
_add_attn_metadata_broadcastable_dict
¶
_add_attn_metadata_broadcastable_dict(
tensor_dict: Dict[str, Any],
attn_metadata: Optional[AttentionMetadata],
) -> None
Helper method to update tensor_dict with broadcastable AttentionMetadata fields.
Source code in vllm/worker/model_runner_base.py
_add_sampling_metadata_broadcastable_dict
¶
_add_sampling_metadata_broadcastable_dict(
tensor_dict: Dict[str, Any],
sampling_metadata: Optional[SamplingMetadata],
) -> None
Helper method to update tensor_dict with broadcastable SamplingMetadata fields.
Source code in vllm/worker/model_runner_base.py
_init_attn_metadata_from_tensor_dict
¶
_init_attn_metadata_from_tensor_dict(
attn_backend: AttentionBackend,
tensor_dict: Dict[str, Any],
) -> Dict[str, Any]
Helper method to initialize AttentionMetadata based on an AttentionBackend and broadcastable AttentionMetadata fields.
Source code in vllm/worker/model_runner_base.py
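The _add_*/_init_* helper pairs above follow a common shape: flatten an optional metadata object's fields into the shared tensor_dict before broadcast, then pop them back out and re-nest them on the receiving side. A simplified sketch of that shape, with SimpleMetadata as a hypothetical metadata type:

```python
from typing import Any, Dict, Optional

class SimpleMetadata:
    """Hypothetical metadata object with broadcastable scalar fields."""

    def __init__(self, num_prefills: int, num_decodes: int) -> None:
        self.num_prefills = num_prefills
        self.num_decodes = num_decodes

    def asdict(self) -> Dict[str, Any]:
        return {"num_prefills": self.num_prefills,
                "num_decodes": self.num_decodes}

def _add_metadata_broadcastable_dict(
        tensor_dict: Dict[str, Any],
        metadata: Optional[SimpleMetadata]) -> None:
    # Flatten the metadata fields into tensor_dict; a None metadata
    # simply contributes nothing.
    if metadata is not None:
        tensor_dict.update(metadata.asdict())

def _init_metadata_from_tensor_dict(
        tensor_dict: Dict[str, Any]) -> Dict[str, Any]:
    # Pop the flattened fields and rebuild the metadata object under a
    # single key, leaving the rest of tensor_dict untouched.
    tensor_dict["metadata"] = SimpleMetadata(
        num_prefills=tensor_dict.pop("num_prefills"),
        num_decodes=tensor_dict.pop("num_decodes"),
    )
    return tensor_dict

td: Dict[str, Any] = {"input_tokens": [1, 2]}
_add_metadata_broadcastable_dict(td, SimpleMetadata(2, 1))
td = _init_metadata_from_tensor_dict(td)
```

In vLLM the receiving side additionally needs the AttentionBackend to know which concrete AttentionMetadata class to construct, which is why _init_attn_metadata_from_tensor_dict takes attn_backend as its first argument.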
_init_frozen_model_input_from_tensor_dict
¶
_init_frozen_model_input_from_tensor_dict(
frozen_model_input_cls: Type[ModelRunnerInputBase],
tensor_dict: Dict[str, Any],
) -> Dict[str, Any]
Helper method to initialize a frozen ModelRunnerInputBase based on broadcastable fields.
Source code in vllm/worker/model_runner_base.py
_init_sampling_metadata_from_tensor_dict
¶
Helper method to initialize SamplingMetadata based on broadcastable SamplingMetadata fields.