vllm.model_executor.models.interfaces
MultiModalEmbeddings
module-attribute
The output embeddings must be one of the following formats:
- A list or tuple of 2D tensors, where each tensor corresponds to one input multimodal data item (e.g., an image).
- A single 3D tensor, with the batch dimension grouping the 2D tensors.
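As an illustrative sketch of the two layouts (the shapes below are hypothetical, not taken from the source):
```python
import torch

hidden_size = 1024   # hypothetical embedding width
num_patches = 576    # hypothetical number of embeddings per image

# Format 1: a list (or tuple) of 2D tensors, one per multimodal item.
embeddings_as_list = [torch.zeros(num_patches, hidden_size) for _ in range(3)]

# Format 2: a single 3D tensor whose batch dimension groups the 2D tensors.
# This requires every item to produce the same number of embeddings.
embeddings_as_batch = torch.zeros(3, num_patches, hidden_size)
```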
HasInnerState
Bases: Protocol
The interface required for all models that have inner state.
HasNoOps
IsAttentionFree
Bases: Protocol
The interface required for all models like Mamba that lack attention, but do have state whose size is constant with respect to the number of tokens.
IsHybrid
Bases: Protocol
The interface required for all models like Jamba that have both attention and mamba blocks; it also indicates that the model's hf_config has 'layers_block_type'.
MixtureOfExperts
Bases: Protocol
The interface required for all mixture-of-experts (MoE) models.
expert_weights
instance-attribute
expert_weights: MutableSequence[Iterable[Tensor]]
Expert weights saved in this rank.
The first dimension is the layer, and the second dimension is the different parameters within a layer, e.g. the up/down projection weights.
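A rough sketch of that layout (purely illustrative shapes, not from any specific model):
```python
import torch

num_layers = 2            # hypothetical
num_local_experts = 4     # hypothetical
hidden_size, intermediate_size = 1024, 4096  # hypothetical

# expert_weights[layer][param]: the outer index walks layers, the inner one
# walks the per-layer expert parameters (e.g. up/down projection weights).
expert_weights = [
    [
        torch.empty(num_local_experts, intermediate_size, hidden_size),  # up proj
        torch.empty(num_local_experts, hidden_size, intermediate_size),  # down proj
    ]
    for _ in range(num_layers)
]
```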
num_expert_groups
instance-attribute
num_expert_groups: int
Number of expert groups in this model.
num_local_physical_experts
instance-attribute
num_local_physical_experts: int
Number of local physical experts in this model.
num_logical_experts
instance-attribute
num_logical_experts: int
Number of logical experts in this model.
num_physical_experts
instance-attribute
num_physical_experts: int
Number of physical experts in this model.
num_redundant_experts
instance-attribute
num_redundant_experts: int
Number of redundant experts in this model.
num_routed_experts
instance-attribute
num_routed_experts: int
Number of routed experts in this model.
num_shared_experts
instance-attribute
num_shared_experts: int
Number of shared experts in this model.
set_eplb_state
set_eplb_state(
expert_load_view: Tensor,
logical_to_physical_map: Tensor,
logical_replica_count: Tensor,
) -> None
Register the EPLB state in the MoE model.
Since these are views of the actual EPLB state, any changes made by the EPLB algorithm are automatically reflected in the model's behavior without requiring additional method calls to set new states.
You should also collect the model's expert_weights here instead of in the weight loader, since after initial weight loading, further processing like quantization may be applied to the weights. (A brief implementation sketch follows the parameter table below.)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expert_load_view | Tensor | A view of the expert load metrics tensor. | required |
logical_to_physical_map | Tensor | Mapping from logical to physical experts. | required |
logical_replica_count | Tensor | Count of replicas for each logical expert. | required |
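A minimal sketch of how a model might implement this hook (illustrative only; self.moe_layers and get_expert_weights() are hypothetical names, not the actual vLLM implementation):
```python
import torch


def set_eplb_state(
    self,
    expert_load_view: torch.Tensor,
    logical_to_physical_map: torch.Tensor,
    logical_replica_count: torch.Tensor,
) -> None:
    # Keep references to the shared views; the EPLB algorithm mutates them
    # in place, so no further setter calls are needed.
    self.expert_load_view = expert_load_view
    self.logical_to_physical_map = logical_to_physical_map
    self.logical_replica_count = logical_replica_count

    # Collect expert weights here rather than in the weight loader, since
    # quantization or other post-processing may have modified them after
    # the initial load. `self.moe_layers` is a hypothetical list of MoE blocks.
    self.expert_weights = [layer.get_expert_weights() for layer in self.moe_layers]
```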
SupportsCrossEncoding
Bases: Protocol
The interface required for all models that support cross encoding.
SupportsLoRA
Bases: Protocol
The interface required for all models that support LoRA.
SupportsMultiModal
Bases: Protocol
The interface required for all multi-modal models.
supports_multimodal
class-attribute
supports_multimodal: Literal[True] = True
A flag that indicates this model supports multi-modal inputs.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
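For example, a model that lists SupportsMultiModal among its bases inherits the flag through the MRO (a minimal sketch, not a complete model definition):
```python
from torch import nn

from vllm.model_executor.models.interfaces import SupportsMultiModal


class MyMultiModalModel(nn.Module, SupportsMultiModal):
    # supports_multimodal = True is inherited via the MRO; no need to redefine it.
    pass


assert MyMultiModalModel.supports_multimodal is True
```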
get_language_model
get_language_model() -> Module
Returns the underlying language model used for text generation.
This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.
Returns:
Type | Description |
---|---|
Module | torch.nn.Module: The core language model component. |
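A brief usage sketch; here model is assumed to be an already-constructed multi-modal model instance:
```python
# `model` is an assumed, already-instantiated multi-modal vLLM model.
language_model = model.get_language_model()

# The result is a regular torch.nn.Module, so the usual module APIs apply.
num_params = sum(p.numel() for p in language_model.parameters())
print(type(language_model).__name__, num_params)
```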
get_multimodal_embeddings
get_multimodal_embeddings(
**kwargs: object,
) -> MultiModalEmbeddings
Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.
Note
The returned multimodal embeddings must be in the same order as their corresponding multimodal data items appear in the input prompt.
get_placeholder_str
classmethod
Get the placeholder text for the i'th modality item in the prompt.
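A hedged usage sketch; the argument names and the index convention are assumptions for illustration, not taken from the rendered entry above:
```python
# Assume `model_cls` is a model class implementing SupportsMultiModal.
placeholder = model_cls.get_placeholder_str("image", 1)  # hypothetical arguments
if placeholder is not None:
    prompt = f"USER: {placeholder}\nWhat is shown in the image? ASSISTANT:"
```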
SupportsPP
Bases: Protocol
The interface required for all models that support pipeline parallelism.
supports_pp
class-attribute
supports_pp: Literal[True] = True
A flag that indicates this model supports pipeline parallelism.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
forward
forward(
*, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
Accept IntermediateTensors when PP rank > 0.
Return IntermediateTensors from every PP rank except the last; the last PP rank returns the final hidden-states tensor.
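A simplified sketch of this contract (the embedding, layer stack, and rank flags below are stand-ins, not the vLLM implementation):
```python
from typing import Optional, Union

import torch
from torch import nn

from vllm.sequence import IntermediateTensors


class PPModelSketch(nn.Module):
    """Illustrative skeleton only; layers, embedding, and rank flags are stand-ins."""

    def __init__(self, is_first_rank: bool, is_last_rank: bool) -> None:
        super().__init__()
        self.is_first_rank = is_first_rank
        self.is_last_rank = is_last_rank
        self.embed = nn.Embedding(100, 16)
        self.layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(2))

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        *,
        intermediate_tensors: Optional[IntermediateTensors] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        if self.is_first_rank:
            hidden_states = self.embed(input_ids)
        else:
            # PP rank > 0: resume from the tensors produced by the previous rank.
            hidden_states = intermediate_tensors["hidden_states"]
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        if not self.is_last_rank:
            # Non-last ranks hand their activations to the next rank.
            return IntermediateTensors({"hidden_states": hidden_states})
        # Only the last PP rank returns the final hidden-states tensor.
        return hidden_states
```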
make_empty_intermediate_tensors
make_empty_intermediate_tensors(
batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
Called when PP rank > 0 for profiling purposes.
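A hedged sketch of such a factory; the tensor keys (hidden_states, residual) and the hidden size are common choices in vLLM models but are assumptions here:
```python
import torch

from vllm.sequence import IntermediateTensors


def make_empty_intermediate_tensors(
    batch_size: int, dtype: torch.dtype, device: torch.device
) -> IntermediateTensors:
    hidden_size = 4096  # hypothetical model width
    return IntermediateTensors({
        "hidden_states": torch.zeros(batch_size, hidden_size, dtype=dtype, device=device),
        "residual": torch.zeros(batch_size, hidden_size, dtype=dtype, device=device),
    })
```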
SupportsQuant
The interface required for all models that support quantization.
packed_modules_mapping
class-attribute
__new__
__new__(*args, **kwargs) -> Self
_find_quant_config
staticmethod
_find_quant_config(
*args, **kwargs
) -> Optional[QuantizationConfig]
Find the quantization config passed through the model constructor args.
SupportsTranscription
Bases: Protocol
The interface required for all models that support transcription.
SupportsV0Only
Bases: Protocol
Models with this interface are not compatible with vLLM V1.
_HasInnerStateType
_HasNoOpsType
_IsAttentionFreeType
_IsHybridType
_SupportsMultiModalType
_SupportsPPType
Bases: Protocol
forward
forward(
*, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
make_empty_intermediate_tensors
make_empty_intermediate_tensors(
batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
_supports_cross_encoding
_supports_cross_encoding(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsCrossEncoding]],
TypeIs[SupportsCrossEncoding],
]
_supports_lora
_supports_pp_attributes
_supports_pp_inspect
has_inner_state
has_inner_state(model: object) -> TypeIs[HasInnerState]
has_inner_state(
model: type[object],
) -> TypeIs[type[HasInnerState]]
has_inner_state(
model: Union[type[object], object],
) -> Union[
TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]
has_noops
has_step_pooler
Check if the model uses a step pooler.
is_attention_free
is_attention_free(model: object) -> TypeIs[IsAttentionFree]
is_attention_free(
model: type[object],
) -> TypeIs[type[IsAttentionFree]]
is_attention_free(
model: Union[type[object], object],
) -> Union[
TypeIs[type[IsAttentionFree]], TypeIs[IsAttentionFree]
]
is_hybrid
is_mixture_of_experts
is_mixture_of_experts(
model: object,
) -> TypeIs[MixtureOfExperts]
supports_cross_encoding
supports_cross_encoding(
model: type[object],
) -> TypeIs[type[SupportsCrossEncoding]]
supports_cross_encoding(
model: object,
) -> TypeIs[SupportsCrossEncoding]
supports_cross_encoding(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsCrossEncoding]],
TypeIs[SupportsCrossEncoding],
]
supports_lora
supports_lora(
model: type[object],
) -> TypeIs[type[SupportsLoRA]]
supports_lora(model: object) -> TypeIs[SupportsLoRA]
supports_lora(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]
supports_multimodal
supports_multimodal(
model: type[object],
) -> TypeIs[type[SupportsMultiModal]]
supports_multimodal(
model: object,
) -> TypeIs[SupportsMultiModal]
supports_multimodal(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsMultiModal]],
TypeIs[SupportsMultiModal],
]
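These helpers accept either a model class or an instance and act as type guards for static checkers; a brief usage sketch (model_cls is assumed to be a model class resolved elsewhere, e.g. via the model registry):
```python
from vllm.model_executor.models.interfaces import supports_lora, supports_multimodal

# `model_cls` is an assumed model class resolved elsewhere.
if supports_multimodal(model_cls):
    # Narrowed to type[SupportsMultiModal]; the class-level flag is True.
    assert model_cls.supports_multimodal

if supports_lora(model_cls):
    print("LoRA adapters can be applied to this model.")
```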
supports_pp
supports_pp(
model: type[object],
) -> TypeIs[type[SupportsPP]]
supports_pp(model: object) -> TypeIs[SupportsPP]
supports_pp(
model: Union[type[object], object],
) -> Union[
bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]
supports_transcription
supports_transcription(
model: type[object],
) -> TypeIs[type[SupportsTranscription]]
supports_transcription(
model: object,
) -> TypeIs[SupportsTranscription]
supports_transcription(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsTranscription]],
TypeIs[SupportsTranscription],
]
supports_v0_only
supports_v0_only(
model: type[object],
) -> TypeIs[type[SupportsV0Only]]
supports_v0_only(model: object) -> TypeIs[SupportsV0Only]
supports_v0_only(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]