vllm.model_executor.layers.fused_moe.moe_permute_unpermute
_moe_permute
_moe_permute(
curr_hidden_states: Tensor,
a1q_scale: Optional[Tensor],
curr_topk_ids: Tensor,
global_num_experts: int,
expert_map: Optional[Tensor],
block_m: int,
) -> tuple[
Tensor, Optional[Tensor], Tensor, Tensor, Tensor
]
Determine the sorted_token_ids and expert_ids for the given problem size, then permute the hidden states and scales according to sorted_token_ids.
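To illustrate the permutation step, here is a minimal PyTorch sketch of the underlying idea: sort the flattened (token, expert) assignments by expert id so rows bound for the same expert become contiguous. It is not the actual kernel, which also pads each expert's segment to a multiple of block_m; all sizes below are hypothetical.

    import torch

    # Hypothetical sizes: 4 tokens, hidden size 8, top-1 routing over 3 experts.
    hidden_states = torch.randn(4, 8)
    topk_ids = torch.tensor([[2], [0], [2], [1]])
    topk = topk_ids.shape[1]

    # Sort the flattened (token, expert) assignments by expert id so that
    # rows destined for the same expert become contiguous.
    expert_ids, sorted_token_ids = torch.sort(topk_ids.view(-1), stable=True)

    # Gather hidden states in sorted order; // topk maps each expanded row
    # back to its source token.
    permuted_hidden = hidden_states[sorted_token_ids // topk]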
_moe_unpermute_and_reduce
_moe_unpermute_and_reduce(
out: Tensor,
curr_hidden: Tensor,
inv_perm: Optional[Tensor],
topk_weight: Tensor,
apply_router_weight_on_input: bool,
) -> None
Unpermute the final result and apply topk_weights, then perform the final reduction on the hidden states.
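A minimal PyTorch sketch of the unpermute-and-reduce idea; the shapes and names are illustrative, and the real implementation writes into out in place rather than returning a new tensor.

    import torch

    num_tokens, topk, hidden = 4, 2, 8
    curr_hidden = torch.randn(num_tokens * topk, hidden)  # permuted expert outputs
    inv_perm = torch.randperm(num_tokens * topk)          # stand-in for the real inverse permutation
    topk_weight = torch.rand(num_tokens, topk)

    # Undo the permutation, then reduce the top-k expert outputs per token.
    # If apply_router_weight_on_input were True, the weights would already be
    # folded into the activations and the multiply would be skipped.
    unpermuted = curr_hidden[inv_perm].view(num_tokens, topk, hidden)
    out = (unpermuted * topk_weight.unsqueeze(-1)).sum(dim=1)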
moe_permute
moe_permute(
hidden_states: Tensor,
topk_weights: Tensor,
topk_ids: Tensor,
token_expert_indices: Tensor,
topk: int,
n_expert: int,
n_local_expert: int,
expert_map: Optional[Tensor] = None,
align_block_size: Optional[int] = None,
fill_invalid_expert: int = -1,
) -> tuple[Tensor, Tensor, Tensor, Tensor]
This function expands and permutes the activation to gather non-contiguous tokens for each expert.
Parameters:
- hidden_states (torch.Tensor): The input tensor to the MoE layer.
- topk_weights (torch.Tensor): The top-k expert routing weight for each token.
- topk_ids (torch.Tensor): The top-k expert routing ids for each token.
- token_expert_indices (torch.Tensor): Indices for the expanded hidden states.
- topk (int): The number of top-k experts to select.
- n_expert (int): The total number of experts.
- n_local_expert (int): The number of experts on the current EP rank.
- expert_map (Optional[torch.Tensor]): A tensor mapping expert indices
from the global expert space to the local expert space of the expert
parallel shard.
- align_block_size (Optional[int]): Block size to which each expert's group
is aligned for DeepGEMM grouped GEMM.
- fill_invalid_expert (int): Expert id used to fill m_indices entries for
invalid experts, working around DeepGEMM's lack of support for -1 in
m_indices.
Returns:
- permuted_hidden_states (torch.Tensor): The permuted activation.
- expert_first_token_offset (torch.Tensor): Offset of the first token of
each expert for standard grouped GEMM. If align_block_size is set, each
offset is aligned up to align_block_size.
- src_row_id2dst_row_id_map (torch.Tensor): Index map used by moe_unpermute.
- m_indices (torch.Tensor): m_indices for DeepGEMM grouped GEMM;
m_indices[i] records the expert group to which the i-th row of the LHS
belongs.
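A hedged usage sketch; the dtypes, device requirements, and the exact layout of token_expert_indices are assumptions made for illustration, not verified against the kernel.

    import torch
    from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import moe_permute

    num_tokens, hidden_size, topk, n_expert = 16, 128, 2, 8
    hidden_states = torch.randn(num_tokens, hidden_size,
                                device="cuda", dtype=torch.float16)
    topk_weights = torch.rand(num_tokens, topk, device="cuda")
    topk_ids = torch.randint(0, n_expert, (num_tokens, topk),
                             device="cuda", dtype=torch.int32)
    # Assumed layout: one index per expanded (token, expert) row.
    token_expert_indices = torch.arange(
        num_tokens * topk, device="cuda", dtype=torch.int32
    ).view(num_tokens, topk)

    (permuted_hidden_states, expert_first_token_offset,
     src_row_id2dst_row_id_map, m_indices) = moe_permute(
        hidden_states, topk_weights, topk_ids, token_expert_indices,
        topk=topk, n_expert=n_expert, n_local_expert=n_expert,
        align_block_size=128,  # e.g. a DeepGEMM block size
    )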
moe_permute_unpermute_supported
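No signature or docstring is shown for this helper; assuming it takes no arguments and returns a bool, a typical guard might look like the following sketch.

    from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
        moe_permute_unpermute_supported,
    )

    if moe_permute_unpermute_supported():
        ...  # take the fused moe_permute / moe_unpermute path
    else:
        ...  # fall back to a non-fused MoE implementation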
moe_unpermute
moe_unpermute(
permuted_hidden_states: Tensor,
topk_weights: Tensor,
topk_ids: Tensor,
src_row_id2dst_row_id_map: Tensor,
expert_first_token_offset: Tensor,
topk: int,
n_expert: int,
n_local_expert: int,
) -> Tensor
This function unpermutes the activation and reduces the top-k hidden states for each token using topk_weights.
Parameters:
- permuted_hidden_states (torch.Tensor): The permuted activation.
- topk_weights (torch.Tensor): The top-k expert routing weight for each token.
- topk_ids (torch.Tensor): The top-k expert routing ids for each token.
- src_row_id2dst_row_id_map (torch.Tensor): Index map produced by moe_permute.
- expert_first_token_offset (torch.Tensor): Offset of the first token of
each expert for grouped GEMM.
- topk (int): The number of top-k experts to select.
- n_expert (int): The total number of experts.
- n_local_expert (int): The number of experts on the current EP rank.
Returns:
- hidden_states (torch.Tensor): The reduced and unpermuted activation tensor.
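Continuing the moe_permute sketch above, a hedged round-trip call; the same dtype and device assumptions apply, and the permuted activations are reused directly as a stand-in for real grouped-GEMM expert outputs.

    from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import moe_unpermute

    # In a real MoE layer the grouped GEMM outputs would be passed here.
    hidden_out = moe_unpermute(
        permuted_hidden_states, topk_weights, topk_ids,
        src_row_id2dst_row_id_map, expert_first_token_offset,
        topk=topk, n_expert=n_expert, n_local_expert=n_expert,
    )
    assert hidden_out.shape == (num_tokens, hidden_size)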