vllm.model_executor.model_loader.ep_weight_filter ¶
Filter out non-local expert weights during loading to avoid redundant I/O.
In DP+EP deployments, each rank needs only its own expert shard. Skipping non-local expert tensors before they are read from disk eliminates the majority of storage I/O for MoE models, where experts typically account for ~85-90% of total weight bytes.
compute_local_expert_ids ¶
compute_local_expert_ids(
num_experts: int,
ep_size: int,
ep_rank: int,
placement: str = "linear",
) -> set[int] | None
Compute the set of global expert ids owned by ep_rank.
Returns None when EP is not active (ep_size <= 1), meaning all experts are local and no filtering should be performed.
The distribution logic mirrors vllm.model_executor.layers.fused_moe.layer.determine_expert_map.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_experts | int | Total number of global experts. | required |
| ep_size | int | Expert-parallel world size. | required |
| ep_rank | int | Expert-parallel rank of this worker. | required |
| placement | str | Expert placement strategy. | 'linear' |
Source code in vllm/model_executor/model_loader/ep_weight_filter.py
parse_expert_id ¶
Return the expert id embedded in weight_name, or None if it is not a per-expert weight.
Returns None for dense weights (attention, layernorm, embedding), shared experts, and 3D fused-expert tensors where all experts are stored in a single tensor without a numeric expert id in the name.
Source code in vllm/model_executor/model_loader/ep_weight_filter.py
should_skip_weight ¶
Return True if weight_name is an expert weight that does not belong to the local rank and should be skipped during loading.