vllm.transformers_utils.configs.hy_v3
HYV3Config
Bases: PretrainedConfig
This is the configuration class to store the configuration of a [HYV3Model]. It is used to instantiate an HYV3 model (HY V3 MoE language model) according to the specified arguments.
Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | `int`, *optional*, defaults to 120832 | Vocabulary size of the model. | 120832 |
| hidden_size | `int`, *optional*, defaults to 4096 | Dimension of the hidden representations. | 4096 |
| intermediate_size | `int`, *optional*, defaults to 13312 | Dimension of the dense FFN intermediate representations. | 13312 |
| num_hidden_layers | `int`, *optional*, defaults to 80 | Number of hidden layers in the Transformer decoder. | 80 |
| num_attention_heads | `int`, *optional*, defaults to 64 | Number of attention heads for each attention layer. | 64 |
| num_key_value_heads | `int`, *optional*, defaults to 8 | Number of key-value heads for grouped-query attention. | 8 |
| head_dim | `int`, *optional*, defaults to 128 | Dimension per attention head. | 128 |
| hidden_act | `str`, *optional*, defaults to `"silu"` | Activation function used in FFN layers. | `'silu'` |
| max_position_embeddings | `int`, *optional*, defaults to 131072 | Maximum sequence length supported by the model. | 131072 |
| initializer_range | `float`, *optional*, defaults to 0.006 | Standard deviation of the truncated normal initializer for weight initialization. | 0.006 |
| rms_norm_eps | `float`, *optional*, defaults to 1e-5 | Epsilon for RMS normalization layers. | 1e-05 |
| use_cache | `bool`, *optional*, defaults to `True` | Whether to use KV cache for decoding. | `True` |
| pad_token_id | `int`, *optional* | Padding token id. | `None` |
| bos_token_id | `int`, *optional* | Beginning-of-sequence token id. | `None` |
| eos_token_id | `int` or `List[int]`, *optional* | End-of-sequence token id(s). | `None` |
| rope_parameters | `dict`, *optional* | The parameters of the RoPE embeddings. | `None` |
| qk_norm | `bool`, *optional*, defaults to `True` | Whether to apply RMSNorm to query and key states before attention. | `True` |
| tie_word_embeddings | `bool`, *optional*, defaults to `False` | Whether to tie input and output embedding weights. | `False` |
| enable_attention_fp32_softmax | `bool`, *optional*, defaults to `False` | Whether to upcast attention softmax to float32. Note: the eager attention path always computes softmax in float32 regardless of this setting; this flag is reserved for future use with custom attention backends. | `False` |
| enable_lm_head_fp32 | `bool`, *optional*, defaults to `True` | Whether to upcast the LM head computation to float32. | `True` |
| num_experts | `int`, *optional*, defaults to 192 | Total number of MoE experts. | 192 |
| num_experts_per_tok | `int`, *optional*, defaults to 8 | Number of experts selected per token (top-k routing). | 8 |
| num_shared_experts | `int`, *optional*, defaults to 1 | Number of always-active shared experts combined into a single MLP. | 1 |
| expert_hidden_dim | `int`, *optional*, defaults to 1536 | Intermediate dimension of each individual MoE expert. | 1536 |
| moe_router_enable_expert_bias | `bool`, *optional*, defaults to `True` | Whether to use per-expert load-balancing bias in the router. | `True` |
| moe_router_use_sigmoid | `bool`, *optional*, defaults to `True` | Whether to use sigmoid (instead of softmax) for router scoring; see the routing sketch below this table. | `True` |
| route_norm | `bool`, *optional*, defaults to `True` | Whether to normalize routing scores when using sigmoid routing. | `True` |
| router_scaling_factor | `float`, *optional* | Optional multiplicative scaling factor applied to routing scores. | `None` |
| use_grouped_mm | `bool`, *optional*, defaults to `False` | Whether to use grouped GEMM for expert computation (not yet implemented). | `False` |
| enable_moe_fp32_combine | `bool`, *optional*, defaults to `False` | Whether to accumulate expert outputs in float32. | `False` |
| first_k_dense_replace | `int`, *optional*, defaults to 1 | Number of initial decoder layers that use a dense FFN instead of MoE. | 1 |
| output_router_logits | `bool`, *optional*, defaults to `False` | Whether to output router logits from each MoE layer. Useful for computing the auxiliary load-balancing loss during training. Disabled by default to avoid the memory overhead of storing per-layer router tensors during inference. | `False` |
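The routing flags above (`moe_router_use_sigmoid`, `route_norm`, `moe_router_enable_expert_bias`, `router_scaling_factor`) together describe a top-k router. The snippet below is only an illustrative sketch of that scoring path, written from the parameter descriptions in this table rather than from the vLLM source; the function name and tensor shapes are hypothetical, and the assumption that the expert bias influences which experts are selected but not the combine weights is not confirmed by this page.

```python
import torch


def route_tokens(hidden, router_weight, expert_bias=None, top_k=8,
                 use_sigmoid=True, route_norm=True, scaling_factor=None):
    """Hypothetical sketch of the top-k routing implied by the config flags.

    hidden:        [num_tokens, hidden_size]
    router_weight: [num_experts, hidden_size]
    expert_bias:   [num_experts] load-balancing bias, or None
    """
    logits = hidden @ router_weight.t()            # [num_tokens, num_experts]
    if use_sigmoid:                                # moe_router_use_sigmoid
        scores = torch.sigmoid(logits)
    else:
        scores = torch.softmax(logits, dim=-1)

    # Assumption: the bias (moe_router_enable_expert_bias) shifts which
    # experts get picked, while the combine weights come from the raw scores.
    selection = scores + expert_bias if expert_bias is not None else scores
    topk_idx = selection.topk(top_k, dim=-1).indices   # k = num_experts_per_tok
    topk_weights = scores.gather(-1, topk_idx)

    if use_sigmoid and route_norm:                 # route_norm
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    if scaling_factor is not None:                 # router_scaling_factor
        topk_weights = topk_weights * scaling_factor
    return topk_idx, topk_weights
```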
Example
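A minimal usage sketch, assuming the class can be imported from the module path shown at the top of this page; the override values below are arbitrary and chosen only for illustration.

```python
from vllm.transformers_utils.configs.hy_v3 import HYV3Config

# Default configuration; attribute values match the defaults in the table above.
config = HYV3Config()
print(config.hidden_size, config.num_experts)   # 4096 192

# As a PretrainedConfig subclass, individual fields can be overridden
# at construction time.
tiny = HYV3Config(
    hidden_size=1024,
    num_hidden_layers=4,
    num_attention_heads=8,
    num_key_value_heads=2,
    num_experts=16,
    num_experts_per_tok=4,
)
print(tiny.num_experts_per_tok)                  # 4
```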
Source code in vllm/transformers_utils/configs/hy_v3.py