vllm.model_executor.layers.resampler
Shared resampler (perceiver) network used in multimodal models, along with related helpers for sincos positional embeddings.
Example models: Qwen (Qwen-VL), MiniCPM-V 2.0
BaseResampler
Bases: Module
A 2D perceiver-resampler network with one cross-attention layer, (grid_size**2) learnable queries, and 2D sincos positional embeddings. Outputs a tensor of shape (grid_size**2, embed_dim).
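A minimal sketch of this pattern, assuming plain PyTorch and illustrative names (SimpleResampler is not part of vLLM): a set of learnable queries cross-attends to projected, position-encoded image features.

import torch
import torch.nn as nn

class SimpleResampler(nn.Module):
    def __init__(self, num_queries: int, embed_dim: int, num_heads: int, kv_dim: int):
        super().__init__()
        # (grid_size**2) learnable query vectors; these become the output tokens.
        self.query = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        # Project image features from kv_dim into the query dimension.
        self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(embed_dim)
        self.ln_kv = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor, pos_embed: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, kv_dim); pos_embed: (num_patches, embed_dim)
        kv = self.ln_kv(self.kv_proj(x)) + pos_embed
        q = self.ln_q(self.query).unsqueeze(0).expand(x.shape[0], -1, -1)
        # One cross-attention layer: queries attend to the image tokens.
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, embed_dim)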
kv_proj
instance-attribute
kv_proj = ReplicatedLinear(
kv_dim,
embed_dim,
bias=False,
quant_config=quant_config,
prefix=f"{prefix}.kv_proj",
)
proj
instance-attribute
proj = (
Parameter(embed_dim**-0.5 * empty(embed_dim, embed_dim))
if do_post_projection
else None
)
__init__
__init__(
num_queries: int,
embed_dim: int,
num_heads: int,
kv_dim: Optional[int] = None,
norm_layer: Callable[[int], LayerNorm] = DEFAULT_LN,
do_post_projection: bool = True,
quant_config: Optional[QuantizationConfig] = None,
prefix: str = "",
) -> None
Resampler2
Bases: BaseResampler
Resampler-perceiver network used by a variety of model types, e.g., Qwen-VL and MiniCPM-V 2.0. The main difference is the addition of the do_post_projection argument, which indicates whether a post layer normalization and projection should be applied after the attention. These are present in MiniCPM-V 2.0, but not in Qwen-VL. A usage sketch of both configurations follows the __init__ signature below.
__init__
__init__(
grid_size: int,
embed_dim: int,
num_heads: int,
kv_dim: Optional[int] = None,
norm_layer: Callable[[int], LayerNorm] = DEFAULT_LN,
adaptive: bool = False,
do_post_projection: bool = True,
quant_config: Optional[QuantizationConfig] = None,
prefix: str = "",
) -> None
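A hedged usage sketch of the do_post_projection difference described above. The hyperparameter values are illustrative only, not taken from any real checkpoint, and standalone construction outside a running engine is assumed to work with the defaults shown.

from vllm.model_executor.layers.resampler import Resampler2

# MiniCPM-V 2.0 style: post layer norm + projection after the attention.
minicpmv_style = Resampler2(
    grid_size=8,            # 8 * 8 = 64 learnable queries
    embed_dim=2048,         # illustrative sizes, not a real checkpoint config
    num_heads=16,
    kv_dim=1024,
    adaptive=True,
    do_post_projection=True,
)

# Qwen-VL style: no post projection after the attention.
qwenvl_style = Resampler2(
    grid_size=16,           # 16 * 16 = 256 learnable queries
    embed_dim=2048,
    num_heads=16,
    kv_dim=1024,
    adaptive=False,
    do_post_projection=False,
)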
forward
forward(
x: Tensor,
tgt_sizes: Optional[Tensor] = None,
attn_mask: Optional[Tensor] = None,
) -> Tensor
get_1d_sincos_pos_embed_from_grid
get_1d_sincos_pos_embed_from_grid(
embed_dim: int,
pos: ndarray,
version: tuple[int, int] = (2, 0),
) -> Tensor
embed_dim: output dimension for each position
pos: a list of positions to be encoded, of size (M,) / (H, W)
out: (M, D) / (H, W, D)
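For reference, a minimal NumPy sketch of the standard sincos construction in the default (2, 0) layout, with the sine and cosine halves concatenated along the feature dimension; this is a sketch, not the vLLM source.

import numpy as np

def sincos_1d_sketch(embed_dim: int, pos: np.ndarray) -> np.ndarray:
    # Encode positions `pos` of shape (M,) into an (M, embed_dim) table.
    assert embed_dim % 2 == 0
    # Geometric frequency ladder, as in the original Transformer embeddings.
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2))
    out = np.einsum("m,d->md", pos.reshape(-1), omega)          # (M, D/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)   # (M, D)

table = sincos_1d_sketch(64, np.arange(16, dtype=np.float64))   # (16, 64)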
get_2d_sincos_pos_embed
get_2d_sincos_pos_embed(
embed_dim: int,
grid_size: Union[int, tuple[int, int]],
cls_token: bool = False,
version: tuple[int, int] = (2, 0),
) -> Tensor
grid_size: int of the grid height and width
return: pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (with or without cls_token)
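Usage sketch for the documented signature; the expected shapes follow the docstring above.

from vllm.model_executor.layers.resampler import get_2d_sincos_pos_embed

# Positional table for a 16x16 feature grid with 64-dim embeddings.
pos_embed = get_2d_sincos_pos_embed(embed_dim=64, grid_size=16)
# shape: (grid_size * grid_size, embed_dim) == (256, 64)

# With a prepended class-token row.
pos_embed_cls = get_2d_sincos_pos_embed(embed_dim=64, grid_size=16, cls_token=True)
# shape: (1 + grid_size * grid_size, embed_dim) == (257, 64)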
get_2d_sincos_pos_embed_from_grid
get_2d_sincos_pos_embed_from_grid(
embed_dim: int,
grid: ndarray,
version: tuple[int, int] = (2, 0),
) -> Tensor