vllm.transformers_utils.configs.hy_v3

HYV3Config

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [HYV3Model]. It is used to instantiate a HYV3 model (HY V3 MoE language model) according to the specified arguments.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vocab_size` | `int` | Vocabulary size of the model. | `120832` |
| `hidden_size` | `int` | Dimension of the hidden representations. | `4096` |
| `intermediate_size` | `int` | Dimension of the dense FFN intermediate representations. | `13312` |
| `num_hidden_layers` | `int` | Number of hidden layers in the Transformer decoder. | `80` |
| `num_attention_heads` | `int` | Number of attention heads for each attention layer. | `64` |
| `num_key_value_heads` | `int` | Number of key-value heads for grouped-query attention. | `8` |
| `head_dim` | `int` | Dimension per attention head. | `128` |
| `hidden_act` | `str` | Activation function used in FFN layers. | `'silu'` |
| `max_position_embeddings` | `int` | Maximum sequence length supported by the model. | `131072` |
| `initializer_range` | `float` | Standard deviation of the truncated normal initializer for weight initialization. | `0.006` |
| `rms_norm_eps` | `float` | Epsilon for RMS normalization layers. | `1e-05` |
| `use_cache` | `bool` | Whether to use KV cache for decoding. | `True` |
| `pad_token_id` | `int` | Padding token id. | `None` |
| `bos_token_id` | `int` | Beginning-of-sequence token id. | `None` |
| `eos_token_id` | `int` or `List[int]` | End-of-sequence token id(s). | `None` |
| `rope_parameters` | `dict` | The parameters of the RoPE embeddings. | `None` |
| `qk_norm` | `bool` | Whether to apply RMSNorm to query and key states before attention. | `True` |
| `tie_word_embeddings` | `bool` | Whether to tie input and output embedding weights. | `False` |
| `enable_attention_fp32_softmax` | `bool` | Whether to upcast attention softmax to float32. Note: the eager attention path always computes softmax in float32 regardless of this setting; this flag is reserved for future use with custom attention backends. | `False` |
| `enable_lm_head_fp32` | `bool` | Whether to upcast the LM head computation to float32. | `True` |
| `num_experts` | `int` | Total number of MoE experts. | `192` |
| `num_experts_per_tok` | `int` | Number of experts selected per token (top-k routing). | `8` |
| `num_shared_experts` | `int` | Number of always-active shared experts combined into a single MLP. | `1` |
| `expert_hidden_dim` | `int` | Intermediate dimension of each individual MoE expert. | `1536` |
| `moe_router_enable_expert_bias` | `bool` | Whether to use per-expert load-balancing bias in the router. | `True` |
| `moe_router_use_sigmoid` | `bool` | Whether to use sigmoid (instead of softmax) for router scoring. | `True` |
| `route_norm` | `bool` | Whether to normalize routing scores when using sigmoid routing. | `True` |
| `router_scaling_factor` | `float` | Optional multiplicative scaling factor applied to routing scores. | `None` |
| `use_grouped_mm` | `bool` | Whether to use grouped GEMM for expert computation (not yet implemented). | `False` |
| `enable_moe_fp32_combine` | `bool` | Whether to accumulate expert outputs in float32. | `False` |
| `first_k_dense_replace` | `int` | Number of initial decoder layers that use a dense FFN instead of MoE. | `1` |
| `output_router_logits` | `bool` | Whether to output router logits from each MoE layer. Useful for computing auxiliary load-balancing loss during training. Disabled by default to avoid the memory overhead of storing per-layer router tensors during inference. | `False` |
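Taken together, the MoE flags above describe a sigmoid-scored top-k router. The sketch below shows that flow in plain PyTorch, not vLLM's actual kernel path; in particular, the detail that the load-balancing bias influences expert *selection* but not the combine weights is an assumption borrowed from DeepSeek-V3-style routers, not something this page states.

```python
import torch

def route(hidden, router_weight, expert_bias=None, top_k=8,
          use_sigmoid=True, route_norm=True, scaling_factor=None):
    # Router scores: sigmoid per expert (moe_router_use_sigmoid=True)
    # or a softmax over all experts.
    logits = hidden @ router_weight.t()  # [num_tokens, num_experts]
    scores = torch.sigmoid(logits) if use_sigmoid else torch.softmax(logits, dim=-1)

    # Per-expert load-balancing bias (moe_router_enable_expert_bias=True).
    # Assumption: the bias shifts which experts are picked, while the
    # combine weights come from the unbiased scores.
    selection = scores + expert_bias if expert_bias is not None else scores
    _, top_idx = selection.topk(top_k, dim=-1)
    weights = scores.gather(-1, top_idx)

    # route_norm: renormalize the selected sigmoid scores to sum to 1.
    if use_sigmoid and route_norm:
        weights = weights / weights.sum(dim=-1, keepdim=True)
    # router_scaling_factor: optional multiplicative scaling.
    if scaling_factor is not None:
        weights = weights * scaling_factor
    return top_idx, weights
```

With the defaults (`num_experts=192`, `num_experts_per_tok=8`, `first_k_dense_replace=1`), layer 0 keeps the dense FFN of width `intermediate_size=13312`, while the remaining decoder layers route over the 192 experts plus the always-active shared expert.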
Example
>>> from transformers import HYV3Config, HYV3Model

>>> config = HYV3Config()
>>> model = HYV3Model(config)
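Any of the parameters above can be overridden at construction time. A short illustrative follow-on (the override values are arbitrary); note that an integer `eos_token_id` is normalized to a list, and a default `rope_parameters` dict is built when none is given, as the source below shows:

```python
>>> config = HYV3Config(num_experts=64, num_experts_per_tok=4, eos_token_id=2)
>>> config.num_experts
64
>>> config.eos_token_id  # int is normalized to a list
[2]
>>> config.rope_parameters  # built from the default rope_theta when not given
{'rope_type': 'default', 'rope_theta': 11158840.0}
```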
Source code in vllm/transformers_utils/configs/hy_v3.py
class HYV3Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`HYV3Model`].
    It is used to instantiate a HYV3 model (HY V3 MoE language model) according to
    the specified arguments.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to
    control the model outputs. Read the documentation from [`PretrainedConfig`]
    for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 120832):
            Vocabulary size of the model.
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 13312):
            Dimension of the dense FFN intermediate representations.
        num_hidden_layers (`int`, *optional*, defaults to 80):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 64):
            Number of attention heads for each attention layer.
        num_key_value_heads (`int`, *optional*, defaults to 8):
            Number of key-value heads for grouped-query attention.
        head_dim (`int`, *optional*, defaults to 128):
            Dimension per attention head.
        hidden_act (`str`, *optional*, defaults to `"silu"`):
            Activation function used in FFN layers.
        max_position_embeddings (`int`, *optional*, defaults to 131072):
            Maximum sequence length supported by the model.
        initializer_range (`float`, *optional*, defaults to 0.006):
            Standard deviation of the truncated normal initializer for weight
            initialization.
        rms_norm_eps (`float`, *optional*, defaults to 1e-5):
            Epsilon for RMS normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether to use KV cache for decoding.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*):
            Beginning-of-sequence token id.
        eos_token_id (`int` or `List[int]`, *optional*):
            End-of-sequence token id(s).
        rope_parameters (`dict`, *optional*):
            The parameters of the RoPE embeddings.
        qk_norm (`bool`, *optional*, defaults to `True`):
            Whether to apply RMSNorm to query and key states before attention.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embedding weights.
        enable_attention_fp32_softmax (`bool`, *optional*, defaults to `False`):
            Whether to upcast attention softmax to float32. Note: the eager attention
            path always computes softmax in float32 regardless of this setting; this
            flag is reserved for future use with custom attention backends.
        enable_lm_head_fp32 (`bool`, *optional*, defaults to `True`):
            Whether to upcast the LM head computation to float32.
        num_experts (`int`, *optional*, defaults to 192):
            Total number of MoE experts.
        num_experts_per_tok (`int`, *optional*, defaults to 8):
            Number of experts selected per token (top-k routing).
        num_shared_experts (`int`, *optional*, defaults to 1):
            Number of always-active shared experts combined into a single MLP.
        expert_hidden_dim (`int`, *optional*, defaults to 1536):
            Intermediate dimension of each individual MoE expert.
        moe_router_enable_expert_bias (`bool`, *optional*, defaults to `True`):
            Whether to use per-expert load-balancing bias in the router.
        moe_router_use_sigmoid (`bool`, *optional*, defaults to `True`):
            Whether to use sigmoid (instead of softmax) for router scoring.
        route_norm (`bool`, *optional*, defaults to `True`):
            Whether to normalize routing scores when using sigmoid routing.
        router_scaling_factor (`float`, *optional*):
            Optional multiplicative scaling factor applied to routing scores.
        use_grouped_mm (`bool`, *optional*, defaults to `False`):
            Whether to use grouped GEMM for expert computation (not yet implemented).
        enable_moe_fp32_combine (`bool`, *optional*, defaults to `False`):
            Whether to accumulate expert outputs in float32.
        first_k_dense_replace (`int`, *optional*, defaults to 1):
            Number of initial decoder layers that use a dense FFN instead of MoE.
        output_router_logits (`bool`, *optional*, defaults to `False`):
            Whether to output router logits from each MoE layer. Useful for computing
            auxiliary load-balancing loss during training. Disabled by default to avoid
            the memory overhead of storing per-layer router tensors during inference.

    Example:
        ```python
        >>> from transformers import HYV3Config, HYV3Model

        >>> config = HYV3Config()
        >>> model = HYV3Model(config)
        ```
    """

    model_type = "hy_v3"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=120832,
        hidden_size=4096,
        intermediate_size=13312,
        num_hidden_layers=80,
        num_attention_heads=64,
        num_key_value_heads=8,
        head_dim=128,
        hidden_act="silu",
        max_position_embeddings=131072,
        initializer_range=0.006,
        rms_norm_eps=1e-5,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=None,
        eos_token_id=None,
        rope_parameters: dict[str, Any] | None = None,
        qk_norm=True,
        tie_word_embeddings=False,
        enable_attention_fp32_softmax=False,
        enable_lm_head_fp32=True,
        # MoE specific
        num_experts=192,
        num_experts_per_tok=8,
        num_shared_experts=1,
        expert_hidden_dim=1536,
        moe_router_enable_expert_bias=True,
        moe_router_use_sigmoid=True,
        route_norm=True,
        router_scaling_factor=None,
        use_grouped_mm=False,
        enable_moe_fp32_combine=False,
        # Dense/MoE layer control
        first_k_dense_replace=1,
        output_router_logits=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.head_dim = head_dim
        self.hidden_act = hidden_act
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        rope_theta = kwargs.pop("rope_theta", 11158840.0)
        if rope_parameters is None:
            rope_parameters = {"rope_type": "default", "rope_theta": rope_theta}
        self.rope_parameters = rope_parameters
        self.qk_norm = qk_norm
        self.tie_word_embeddings = tie_word_embeddings
        self.enable_lm_head_fp32 = enable_lm_head_fp32
        self.enable_attention_fp32_softmax = enable_attention_fp32_softmax

        # MoE specific
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok
        self.num_shared_experts = num_shared_experts
        self.expert_hidden_dim = expert_hidden_dim
        self.moe_router_enable_expert_bias = moe_router_enable_expert_bias
        self.moe_router_use_sigmoid = moe_router_use_sigmoid
        self.route_norm = route_norm
        self.use_grouped_mm = use_grouped_mm
        self.router_scaling_factor = router_scaling_factor
        self.enable_moe_fp32_combine = enable_moe_fp32_combine

        # Dense/MoE layer control
        self.first_k_dense_replace = first_k_dense_replace
        self.output_router_logits = output_router_logits

        if eos_token_id is not None and isinstance(eos_token_id, int):
            eos_token_id = [eos_token_id]

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
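One constructor behavior worth noting: a legacy `rope_theta` keyword is popped from `kwargs` unconditionally and is folded into `rope_parameters` only when no dict is supplied. A quick sanity check of that path (assuming the class is importable from the module named in the page title):

```python
from vllm.transformers_utils.configs.hy_v3 import HYV3Config

# No rope_parameters given: the legacy rope_theta kwarg seeds the dict.
cfg = HYV3Config(rope_theta=10000.0)
assert cfg.rope_parameters == {"rope_type": "default", "rope_theta": 10000.0}

# An explicit dict wins; a rope_theta kwarg passed alongside it is discarded.
cfg = HYV3Config(rope_parameters={"rope_type": "default", "rope_theta": 500000.0},
                 rope_theta=10000.0)
assert cfg.rope_parameters["rope_theta"] == 500000.0
```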