vllm.model_executor.layers.fused_moe.lora_context

MoELoRAContext dataclass

Carries all LoRA state for one MoE forward pass.

Built by FusedMoEWithLoRA.forward() and propagated explicitly through the modular kernel path (FusedMoEKernel -> FusedMoEExpertsModular.apply) so that TritonExperts.apply() can compute the LoRA contribution inline, replacing the decorator-based monkey-patch approach.
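
For orientation, here is a minimal sketch of how FusedMoEWithLoRA.forward() could assemble the context before handing it to the modular kernel path. It is an illustration only, not vLLM's actual code: the layer attribute names and the build_lora_context helper are assumptions; only the MoELoRAContext fields themselves come from this module.

import os

from vllm.model_executor.layers.fused_moe.lora_context import MoELoRAContext


def build_lora_context(layer) -> MoELoRAContext:
    # Hypothetical helper: "layer" stands in for a FusedMoEWithLoRA instance
    # whose attribute names mirror the context fields (an assumption).
    return MoELoRAContext(
        w13_lora_a_stacked=layer.w13_lora_a_stacked,
        w13_lora_b_stacked=layer.w13_lora_b_stacked,
        w2_lora_a_stacked=layer.w2_lora_a_stacked,
        w2_lora_b_stacked=layer.w2_lora_b_stacked,
        adapter_enabled=layer.adapter_enabled,  # (max_loras + 1,) int32
        max_loras=layer.max_loras,
        top_k=layer.top_k,
        w13_num_slices=2,  # gated MLP: one slice each for gate and up
        fully_sharded=layer.fully_sharded,
        tp_rank=layer.tp_rank,
        tp_size=layer.tp_size,
        local_num_experts=layer.local_num_experts,
        punica_wrapper=layer.punica_wrapper,
        use_tuned_config=bool(os.environ.get("VLLM_TUNED_CONFIG_FOLDER")),
    )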

Source code in vllm/model_executor/layers/fused_moe/lora_context.py
@dataclass
class MoELoRAContext:
    """
    Carries all LoRA state for one MoE forward pass.

    Built by FusedMoEWithLoRA.forward() and propagated explicitly through the
    modular kernel path (FusedMoEKernel -> FusedMoEExpertsModular.apply) so
    that TritonExperts.apply() can compute the LoRA contribution inline,
    replacing the decorator-based monkey-patch approach.
    """

    # LoRA weight tensors (same shapes as FusedMoEWithLoRA attributes)
    w13_lora_a_stacked: tuple[torch.Tensor, ...]
    w13_lora_b_stacked: tuple[torch.Tensor, ...]
    w2_lora_a_stacked: tuple[torch.Tensor, ...]
    w2_lora_b_stacked: tuple[torch.Tensor, ...]

    # (max_loras + 1,) int32; slot 0 is the "no-adapter" sentinel
    adapter_enabled: torch.Tensor

    # Metadata
    max_loras: int
    top_k: int
    w13_num_slices: int  # 2 = gated (gate + up), 1 = non-gated or 3D-fused
    fully_sharded: bool
    tp_rank: int
    tp_size: int
    local_num_experts: int

    punica_wrapper: PunicaWrapperBase

    # Whether VLLM_TUNED_CONFIG_FOLDER is set; selects get_lora_op_configs vs
    # try_get_optimal_moe_lora_config for Triton kernel tile configs.
    use_tuned_config: bool
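
The sketch below illustrates how a consumer such as TritonExperts.apply() might read the context fields. It demonstrates the field semantics only and is not the kernel's real code: the token_lora_mapping tensor (per-token adapter slot indices) and the per-slice loop body are assumptions.

import torch

from vllm.model_executor.layers.fused_moe.lora_context import MoELoRAContext


def moe_lora_sketch(ctx: MoELoRAContext, token_lora_mapping: torch.Tensor) -> bool:
    """Return True if this batch carries any LoRA work (illustrative only)."""
    # Slot 0 of adapter_enabled is the "no-adapter" sentinel, so a batch whose
    # tokens all map to slot 0 can skip the LoRA contribution entirely.
    active = ctx.adapter_enabled.index_select(0, token_lora_mapping)
    if not bool(active.any()):
        return False

    # w13_num_slices tells the consumer how many stacked w13 LoRA slices exist:
    # 2 for a gated MLP (separate gate and up adapters), 1 otherwise.
    for slice_idx in range(ctx.w13_num_slices):
        lora_a = ctx.w13_lora_a_stacked[slice_idx]
        lora_b = ctx.w13_lora_b_stacked[slice_idx]
        _ = (lora_a, lora_b)  # placeholder: shrink/expand work for this slice

    return True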