vllm.transformers_utils.configs.dbrx
Dbrx configuration.
DbrxAttentionConfig
¶
Bases: PretrainedConfig
Configuration class for Dbrx Attention. This is the configuration class for the [DbrxAttention] class; it is used to instantiate attention layers according to the specified arguments, defining the layers' architecture.
Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`attn_pdrop` | `float`, *optional* | The dropout probability for the attention layers. | `0.0` |
`clip_qkv` | `float`, *optional* | If not `None`, clip the queries, keys, and values in the attention layer to this value. | `None` |
`kv_n_heads` | `Optional[int]` | For grouped-query attention only, allows the user to specify the number of KV heads. | `1` |
`rope_theta` | `float` | The base frequency for RoPE. | `10000.0` |
Source code in vllm/transformers_utils/configs/dbrx.py
__init__
¶
__init__(
attn_pdrop: float = 0,
clip_qkv: Optional[float] = None,
kv_n_heads: int = 1,
rope_theta: float = 10000.0,
**kwargs: Any,
)
Source code in vllm/transformers_utils/configs/dbrx.py
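A minimal construction sketch using the signature above; the values are illustrative, not the published DBRX settings:
>>> from vllm.transformers_utils.configs.dbrx import DbrxAttentionConfig
>>> # Grouped-query attention with 8 KV heads, QKV clipping, and a custom RoPE base.
>>> attn_config = DbrxAttentionConfig(
...     attn_pdrop=0.0,
...     clip_qkv=8.0,
...     kv_n_heads=8,
...     rope_theta=500000.0,
... )
>>> attn_config.kv_n_heads
8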
from_pretrained
classmethod
¶
Source code in vllm/transformers_utils/configs/dbrx.py
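A hedged usage sketch, assuming this classmethod follows the usual DBRX sub-config pattern of loading the full checkpoint config and keeping only its attn_config block; the checkpoint name is an assumption and requires network access:
>>> from vllm.transformers_utils.configs.dbrx import DbrxAttentionConfig
>>> # Checkpoint name and sub-config extraction behavior are assumptions (see above).
>>> attn_config = DbrxAttentionConfig.from_pretrained("databricks/dbrx-instruct")
>>> isinstance(attn_config.kv_n_heads, int)
True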
DbrxConfig
¶
Bases: PretrainedConfig
Configuration class for Dbrx. This is the configuration class for a [DbrxModel]; it is used to instantiate a Dbrx model according to the specified arguments, defining the model architecture.
Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`d_model` | `int`, *optional* | Dimensionality of the embeddings and hidden states. | `2048` |
`n_heads` | `int`, *optional* | Number of attention heads for each attention layer in the Transformer encoder. | `16` |
`n_layers` | `int`, *optional* | Number of hidden layers in the Transformer encoder. | `24` |
`max_seq_len` | `int`, *optional* | The maximum sequence length of the model. | `2048` |
`vocab_size` | `int`, *optional* | Vocabulary size of the Dbrx model. Defines the maximum number of different tokens that can be represented by the `inputs_ids` passed when calling [DbrxModel]. | `32000` |
`resid_pdrop` | `float`, *optional* | The dropout probability applied to the attention output before combining with the residual. | `0.0` |
`emb_pdrop` | `float`, *optional* | The dropout probability for the embedding layer. | `0.0` |
`attn_config` | `dict`, *optional* | A dictionary used to configure the model's attention module. | `None` |
`ffn_config` | `dict`, *optional* | A dictionary used to configure the model's FFN module. | `None` |
`use_cache` | `bool`, *optional* | Whether or not the model should return the last key/values attentions (not used by all models). | `True` |
`initializer_range` | `float`, *optional* | The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | `0.02` |
`output_router_logits` | `bool`, *optional* | Whether or not the router logits should be returned by the model. Enabling this will also allow the model to output the auxiliary loss. | `False` |
`router_aux_loss_coef` | `float`, *optional* | The auxiliary loss factor for the total loss. | `0.05` |
Example:
>>> from transformers import DbrxConfig, DbrxModel
>>> # Initializing a Dbrx configuration
>>> configuration = DbrxConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = DbrxModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in vllm/transformers_utils/configs/dbrx.py
attribute_map
class-attribute
instance-attribute
¶
attribute_map = {
"num_attention_heads": "n_heads",
"hidden_size": "d_model",
"num_hidden_layers": "n_layers",
"max_position_embeddings": "max_seq_len",
}
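The map lets the Hugging Face standard attribute names resolve to the DBRX-native field names, so either spelling can be read from a DbrxConfig. A quick sketch:
>>> from vllm.transformers_utils.configs.dbrx import DbrxConfig
>>> config = DbrxConfig(d_model=2048, n_heads=16, n_layers=24, max_seq_len=2048)
>>> # HF-standard aliases resolve to the DBRX-native attributes via attribute_map.
>>> config.hidden_size == config.d_model
True
>>> config.num_attention_heads, config.num_hidden_layers, config.max_position_embeddings
(16, 24, 2048)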
__init__
¶
__init__(
d_model: int = 2048,
n_heads: int = 16,
n_layers: int = 24,
max_seq_len: int = 2048,
vocab_size: int = 32000,
resid_pdrop: float = 0.0,
emb_pdrop: float = 0.0,
attn_config: Optional[DbrxAttentionConfig] = None,
ffn_config: Optional[DbrxFFNConfig] = None,
use_cache: bool = True,
initializer_range: float = 0.02,
output_router_logits: bool = False,
router_aux_loss_coef: float = 0.05,
**kwargs: Any,
)
Source code in vllm/transformers_utils/configs/dbrx.py
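A hedged sketch of building a small DbrxConfig with explicit sub-configs. Per the parameter table above, attn_config and ffn_config may be passed as plain dicts; the sizes below are illustrative, not the DBRX release configuration:
>>> from vllm.transformers_utils.configs.dbrx import DbrxConfig
>>> config = DbrxConfig(
...     d_model=1024,
...     n_heads=8,
...     n_layers=4,
...     max_seq_len=4096,
...     vocab_size=32000,
...     attn_config={"kv_n_heads": 2, "rope_theta": 500000.0},
...     ffn_config={"moe_num_experts": 4, "moe_top_k": 1},
... )
>>> config.attn_config.kv_n_heads, config.ffn_config.moe_num_experts
(2, 4)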
DbrxFFNConfig
¶
Bases: PretrainedConfig
Configuration class for Dbrx FFN. This is the configuration class for the [DbrxFFN] class; it is used to instantiate feedforward layers according to the specified arguments, defining the layers' architecture.
Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`ffn_act_fn` | `dict` | A dict specifying the activation function for the FFN. The dict should have a key 'name' with the value being the name of the activation function, along with any additional keyword arguments. | `None` |
`ffn_hidden_size` | `int` | The hidden size of the feedforward network. | `3584` |
`moe_num_experts` | `int` | The number of experts in the mixture of experts layer. | `4` |
`moe_top_k` | `int` | The number of experts to use in the mixture of experts layer. | `1` |
`moe_jitter_eps` | `float` | The jitter epsilon for the mixture of experts layer. | `None` |
`moe_loss_weight` | `float` | The loss weight for the mixture of experts layer. | `0.01` |
`moe_normalize_expert_weights` | `float` | The normalization factor for the expert weights. | `1` |
`uniform_expert_assignment` | `bool` | Whether to use uniform expert assignment. This should only be used for benchmarking purposes. | `False` |
Source code in vllm/transformers_utils/configs/dbrx.py
moe_normalize_expert_weights
instance-attribute
¶
uniform_expert_assignment
instance-attribute
¶
__init__
¶
__init__(
ffn_act_fn: Optional[dict] = None,
ffn_hidden_size: int = 3584,
moe_num_experts: int = 4,
moe_top_k: int = 1,
moe_jitter_eps: Optional[float] = None,
moe_loss_weight: float = 0.01,
moe_normalize_expert_weights: Optional[float] = 1,
uniform_expert_assignment: bool = False,
**kwargs: Any,
)
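A construction sketch for the MoE FFN sub-config; ffn_act_fn follows the documented {'name': ...} convention, and the sizes are illustrative rather than the DBRX release settings:
>>> from vllm.transformers_utils.configs.dbrx import DbrxFFNConfig
>>> # 16 experts with 4 active per token and a SiLU-gated MLP.
>>> ffn_config = DbrxFFNConfig(
...     ffn_act_fn={"name": "silu"},
...     ffn_hidden_size=10752,
...     moe_num_experts=16,
...     moe_top_k=4,
...     moe_loss_weight=0.05,
...     moe_normalize_expert_weights=1.0,
... )
>>> ffn_config.moe_num_experts, ffn_config.moe_top_k
(16, 4)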