vllm.transformers_utils.configs.exaone

Exaone model configuration

EXAONE_PRETRAINED_CONFIG_ARCHIVE_MAP module-attribute

EXAONE_PRETRAINED_CONFIG_ARCHIVE_MAP: dict[str, str] = {}

logger module-attribute

logger = get_logger(__name__)

ExaoneConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a `transformers.ExaoneModel`. It is used to instantiate a GPT Lingvo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the EXAONE model.

Configuration objects inherit from `transformers.PretrainedConfig` and can be used to control the model outputs. Read the documentation of `transformers.PretrainedConfig` for more information.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vocab_size` | `int`, *optional*, defaults to 102400 | Vocabulary size of the GPT Lingvo model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ExaoneModel`. | `102400` |
| `hidden_size` | `int`, *optional*, defaults to 2048 | Dimensionality of the encoder layers and the pooler layer. | `2048` |
| `num_layers` | `int`, *optional*, defaults to 32 | Number of hidden layers in the Transformer encoder. | `32` |
| `num_attention_heads` | `int`, *optional*, defaults to 32 | Number of attention heads for each attention layer in the Transformer decoder. | `32` |
| `num_key_value_heads` | `int`, *optional* | Number of key/value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model uses Multi-Head Attention (MHA); if `num_key_value_heads=1`, it uses Multi-Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group (a sketch follows this table). For more details, see [this paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, defaults to `num_attention_heads`. | `None` |
| `rotary_pct` | `float`, *optional*, defaults to 0.25 | Percentage of hidden dimensions to allocate to rotary embeddings. | `0.25` |
| `intermediate_size` | `int`, *optional* | Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. If not specified, defaults to `4 * hidden_size`. | `None` |
| `activation_function` | `str` or `function`, *optional*, defaults to `"silu"` | The non-linear activation function (function or string) in the encoder and pooler. If a string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. | `'silu'` |
| `embed_dropout` | `float`, *optional*, defaults to 0.0 | The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. | `0.0` |
| `attention_dropout` | `float`, *optional*, defaults to 0.0 | The dropout ratio for the attention probabilities. | `0.0` |
| `max_position_embeddings` | `int`, *optional*, defaults to 2048 | The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048). | `2048` |
| `type_vocab_size` | `int`, *optional*, defaults to 2 | The vocabulary size of the `token_type_ids` passed when calling `EXAONEModel`. | `2` |
| `initializer_range` | `float`, *optional*, defaults to 0.02 | The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | `0.02` |
| `layer_norm_epsilon` | `float`, *optional*, defaults to 1e-6 | The epsilon used by the layer normalization layers. | `1e-06` |
| `use_cache` | `bool`, *optional*, defaults to `True` | Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. | `True` |
| `gradient_checkpointing` | `bool`, *optional*, defaults to `False` | If `True`, use gradient checkpointing to save memory at the expense of a slower backward pass. | `False` |
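The mean-pooling conversion mentioned for `num_key_value_heads` can be illustrated with a small sketch. This is not vLLM or EXAONE code: `mean_pool_kv_heads` and the `[num_attention_heads * head_dim, hidden_size]` weight layout are illustrative assumptions.

```python
# Illustrative sketch only: the helper name and weight layout are assumptions,
# not part of vLLM or the EXAONE checkpoint format.
import torch


def mean_pool_kv_heads(kv_weight: torch.Tensor, num_attention_heads: int,
                       num_key_value_heads: int, head_dim: int) -> torch.Tensor:
    """Collapse an MHA key (or value) projection into GQA groups by
    mean-pooling the heads inside each group."""
    hidden_size = kv_weight.shape[-1]
    group_size = num_attention_heads // num_key_value_heads
    # [num_kv_heads, group_size, head_dim, hidden_size]
    grouped = kv_weight.view(num_key_value_heads, group_size, head_dim, hidden_size)
    return grouped.mean(dim=1).reshape(num_key_value_heads * head_dim, hidden_size)


# 32 heads of size 128 pooled down to 8 key/value heads.
w = torch.randn(32 * 128, 4096)
w_gqa = mean_pool_kv_heads(w, num_attention_heads=32,
                           num_key_value_heads=8, head_dim=128)
assert w_gqa.shape == (8 * 128, 4096)
```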
Example:

>>> from transformers import ExaoneModel, ExaoneConfig

>>> # Initializing an EXAONE configuration
>>> configuration = ExaoneConfig()

>>> # Initializing a model from the configuration
>>> model = ExaoneModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
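Beyond the constructor example above, the class behaves like any other `PretrainedConfig` subclass. A minimal sketch, assuming vLLM is installed so that the module documented here is importable:

```python
# Hedged usage sketch; relies only on PretrainedConfig behaviour (to_dict /
# from_dict) and the attribute_map defined in the source below.
from vllm.transformers_utils.configs.exaone import ExaoneConfig

config = ExaoneConfig(num_layers=24, num_attention_heads=16)

# attribute_map aliases num_hidden_layers to num_layers.
assert config.num_hidden_layers == config.num_layers == 24

# Round-trip through a plain dict, as with any PretrainedConfig subclass.
restored = ExaoneConfig.from_dict(config.to_dict())
assert restored.num_layers == 24
```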
Source code in vllm/transformers_utils/configs/exaone.py
class ExaoneConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:
    `~transformers.ExaoneModel`. It is used to instantiate a GPT Lingvo model
    according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar
    configuration to that of the Exaone

    Configuration objects inherit from {class}`~transformers.PretrainedConfig`
    and can be used to control the model outputs. Read the documentation from :
    class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size ({obj}`int`, `optional`, defaults to 50257):
            Vocabulary size of the GPT Lingvo model. Defines the number of
            different tokens that can be represented by the {obj}`inputs_ids`
            passed when calling {class}`~transformers.ExaoneModel`. Vocabulary
            size of the model.
            Defines the different tokens that can be represented by the
            `inputs_ids` passed to the forward method of :class:
            `~transformers.EXAONEModel`.
        hidden_size ({obj}`int`, `optional`, defaults to 2048):
            Dimensionality of the encoder layers and the pooler layer.
        num_layers ({obj}`int`, `optional`, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the
            Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to
            implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi
            Head Attention (MHA), if `num_key_value_heads=1` the model will use
            Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint,
            each group key and value head should be constructed by meanpooling
            all the original heads within that group. For more details checkout
            [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not
            specified, will default to `num_attention_heads`.
        rotary_pct (`float`, *optional*, defaults to 0.25):
            percentage of hidden dimensions to allocate to rotary embeddings
        intermediate_size ({obj}`int`, `optional`, defaults to 8192):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in
            the Transformer encoder.
        activation_function ({obj}`str` or {obj}`function`, `optional`,
        defaults to {obj}`"gelu_new"`):
            The non-linear activation function (function or string) in the
            encoder and pooler. If string, {obj}`"gelu"`, {obj}`"relu"`,
            {obj}`"selu"` and {obj}`"gelu_new"` are supported.
        embed_dropout ({obj}`float`, `optional`, defaults to 0.0):
            The dropout probabilitiy for all fully connected layers in the
            embeddings, encoder, and pooler.
        attention_dropout ({obj}`float`, `optional`, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        max_position_embeddings ({obj}`int`, `optional`, defaults to 2048):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case
            (e.g., 512 or 1024 or 2048).
        type_vocab_size ({obj}`int`, `optional`, defaults to 2):
            The vocabulary size of the {obj}`token_type_ids` passed when calling
            {class}`~transformers.EXAONEModel`.
        initializer_range ({obj}`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for
            initializing all weight matrices.
        layer_norm_epsilon ({obj}`float`, `optional`, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        use_cache ({obj}`bool`, `optional`, defaults to {obj}`True`):
            Whether or not the model should return the last key/values
            attentions (not used by all models).
            Only relevant if ``config.is_decoder=True``.
        gradient_checkpointing ({obj}`bool`, `optional`,
        defaults to {obj}`False`):
            If True, use gradient checkpointing to save memory at the expense
            of slower backward pass.
        Example::

            >>> from transformers import ExaoneModel, ExaoneConfig

            >>> # Initializing a EXAONE configuration
            >>> configuration = ExaoneConfig()

            >>> # Initializing a model from configuration
            >>> model = ExaoneModel(configuration)

            >>> # Accessing the model configuration
            >>> configuration = model.config
    """

    model_type = "exaone"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {"num_hidden_layers": "num_layers"}

    def __init__(
        self,
        vocab_size=102400,
        max_position_embeddings=2048,
        hidden_size=2048,
        num_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        intermediate_size=None,
        activation_function="silu",
        rotary_pct=0.25,
        resid_dropout=0.0,
        embed_dropout=0.0,
        attention_dropout=0.0,
        layer_norm_epsilon=1e-6,
        initializer_range=0.02,
        use_cache=True,
        bos_token_id=0,
        eos_token_id=2,
        tie_word_embeddings=True,
        **kwargs,
    ):
        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_layers
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        if intermediate_size:
            self.intermediate_size = intermediate_size
        else:
            self.intermediate_size = hidden_size * 4
        self.activation_function = activation_function
        self.resid_dropout = resid_dropout
        self.embed_dropout = embed_dropout
        self.attention_dropout = attention_dropout
        self.layer_norm_epsilon = layer_norm_epsilon
        self.initializer_range = initializer_range
        self.use_cache = use_cache
        self.rotary_pct = rotary_pct

        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id

        self.use_logit_cap = kwargs.pop("use_logit_cap", False)
        self.ln_no_scale = kwargs.pop("ln_no_scale", False)
        self.use_gated = kwargs.pop("use_gated", False)
        self.use_emb_norm = kwargs.pop("use_emb_norm", False)
        self.use_rotary_pos = kwargs.pop("use_rotary_pos", False)
        self.rotary_type = kwargs.pop("rotary_type", None)
        self.scaling_factor = kwargs.pop("scaling_factor", 1)
        self.use_absolute_pos = kwargs.pop("use_absolute_pos", True)
        self.use_extra_logit = kwargs.pop("use_extra_logit", True)
        self.rotary_expand_length = kwargs.pop("rotary_expand_length", None)
        self.rotary_base = kwargs.pop("rotary_base", 10000.0)
        self.use_qkv_fuse = kwargs.pop("use_qkv_fuse", False)
        self.rescale_before_lm_head = kwargs.pop("rescale_before_lm_head",
                                                 (rotary_pct == 0.25))
        if self.use_rotary_pos:
            self.use_absolute_pos = False

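The constructor above derives several values when they are not given explicitly. A minimal sketch of those fallbacks, assuming vLLM is installed:

```python
# The expected values follow directly from the __init__ logic shown above.
from vllm.transformers_utils.configs.exaone import ExaoneConfig

config = ExaoneConfig(hidden_size=4096, num_attention_heads=32)

# num_key_value_heads falls back to num_attention_heads (plain MHA).
assert config.num_key_value_heads == 32
# intermediate_size falls back to 4 * hidden_size when not given.
assert config.intermediate_size == 4 * 4096
# rescale_before_lm_head defaults to True exactly when rotary_pct == 0.25.
assert config.rescale_before_lm_head is True
```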
activation_function instance-attribute

activation_function = activation_function

attention_dropout instance-attribute

attention_dropout = attention_dropout

attribute_map class-attribute instance-attribute

attribute_map = {'num_hidden_layers': 'num_layers'}

bos_token_id instance-attribute

bos_token_id = bos_token_id

embed_dropout instance-attribute

embed_dropout = embed_dropout

eos_token_id instance-attribute

eos_token_id = eos_token_id

hidden_size instance-attribute

hidden_size = hidden_size

initializer_range instance-attribute

initializer_range = initializer_range

intermediate_size instance-attribute

intermediate_size = intermediate_size

keys_to_ignore_at_inference class-attribute instance-attribute

keys_to_ignore_at_inference = ['past_key_values']

layer_norm_epsilon instance-attribute

layer_norm_epsilon = layer_norm_epsilon

ln_no_scale instance-attribute

ln_no_scale = pop('ln_no_scale', False)

max_position_embeddings instance-attribute

max_position_embeddings = max_position_embeddings

model_type class-attribute instance-attribute

model_type = 'exaone'

num_attention_heads instance-attribute

num_attention_heads = num_attention_heads

num_hidden_layers instance-attribute

num_hidden_layers = num_layers

num_key_value_heads instance-attribute

num_key_value_heads = num_key_value_heads

num_layers instance-attribute

num_layers = num_layers

rescale_before_lm_head instance-attribute

rescale_before_lm_head = pop(
    "rescale_before_lm_head", rotary_pct == 0.25
)

resid_dropout instance-attribute

resid_dropout = resid_dropout

rotary_base instance-attribute

rotary_base = pop('rotary_base', 10000.0)

rotary_expand_length instance-attribute

rotary_expand_length = pop('rotary_expand_length', None)

rotary_pct instance-attribute

rotary_pct = rotary_pct

rotary_type instance-attribute

rotary_type = pop('rotary_type', None)

scaling_factor instance-attribute

scaling_factor = pop('scaling_factor', 1)

use_absolute_pos instance-attribute

use_absolute_pos = pop('use_absolute_pos', True)

use_cache instance-attribute

use_cache = use_cache

use_emb_norm instance-attribute

use_emb_norm = pop('use_emb_norm', False)

use_extra_logit instance-attribute

use_extra_logit = pop('use_extra_logit', True)

use_gated instance-attribute

use_gated = pop('use_gated', False)

use_logit_cap instance-attribute

use_logit_cap = pop('use_logit_cap', False)

use_qkv_fuse instance-attribute

use_qkv_fuse = pop('use_qkv_fuse', False)

use_rotary_pos instance-attribute

use_rotary_pos = pop('use_rotary_pos', False)

vocab_size instance-attribute

vocab_size = vocab_size

__init__

__init__(
    vocab_size=102400,
    max_position_embeddings=2048,
    hidden_size=2048,
    num_layers=32,
    num_attention_heads=32,
    num_key_value_heads=None,
    intermediate_size=None,
    activation_function="silu",
    rotary_pct=0.25,
    resid_dropout=0.0,
    embed_dropout=0.0,
    attention_dropout=0.0,
    layer_norm_epsilon=1e-06,
    initializer_range=0.02,
    use_cache=True,
    bos_token_id=0,
    eos_token_id=2,
    tie_word_embeddings=True,
    **kwargs,
)
Source code in vllm/transformers_utils/configs/exaone.py
def __init__(
    self,
    vocab_size=102400,
    max_position_embeddings=2048,
    hidden_size=2048,
    num_layers=32,
    num_attention_heads=32,
    num_key_value_heads=None,
    intermediate_size=None,
    activation_function="silu",
    rotary_pct=0.25,
    resid_dropout=0.0,
    embed_dropout=0.0,
    attention_dropout=0.0,
    layer_norm_epsilon=1e-6,
    initializer_range=0.02,
    use_cache=True,
    bos_token_id=0,
    eos_token_id=2,
    tie_word_embeddings=True,
    **kwargs,
):
    super().__init__(
        bos_token_id=bos_token_id,
        eos_token_id=eos_token_id,
        tie_word_embeddings=tie_word_embeddings,
        **kwargs,
    )

    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.num_attention_heads = num_attention_heads
    self.num_hidden_layers = num_layers
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads
    self.num_key_value_heads = num_key_value_heads
    if intermediate_size:
        self.intermediate_size = intermediate_size
    else:
        self.intermediate_size = hidden_size * 4
    self.activation_function = activation_function
    self.resid_dropout = resid_dropout
    self.embed_dropout = embed_dropout
    self.attention_dropout = attention_dropout
    self.layer_norm_epsilon = layer_norm_epsilon
    self.initializer_range = initializer_range
    self.use_cache = use_cache
    self.rotary_pct = rotary_pct

    self.bos_token_id = bos_token_id
    self.eos_token_id = eos_token_id

    self.use_logit_cap = kwargs.pop("use_logit_cap", False)
    self.ln_no_scale = kwargs.pop("ln_no_scale", False)
    self.use_gated = kwargs.pop("use_gated", False)
    self.use_emb_norm = kwargs.pop("use_emb_norm", False)
    self.use_rotary_pos = kwargs.pop("use_rotary_pos", False)
    self.rotary_type = kwargs.pop("rotary_type", None)
    self.scaling_factor = kwargs.pop("scaling_factor", 1)
    self.use_absolute_pos = kwargs.pop("use_absolute_pos", True)
    self.use_extra_logit = kwargs.pop("use_extra_logit", True)
    self.rotary_expand_length = kwargs.pop("rotary_expand_length", None)
    self.rotary_base = kwargs.pop("rotary_base", 10000.0)
    self.use_qkv_fuse = kwargs.pop("use_qkv_fuse", False)
    self.rescale_before_lm_head = kwargs.pop("rescale_before_lm_head",
                                             (rotary_pct == 0.25))
    if self.use_rotary_pos:
        self.use_absolute_pos = False
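
The EXAONE-specific options at the end of `__init__` are read from `**kwargs` rather than declared as named parameters, so they can be passed like ordinary keyword arguments. A minimal sketch, assuming vLLM is installed:

```python
# Values follow from the kwargs.pop(...) calls and the final use_rotary_pos
# check in the __init__ shown above.
from vllm.transformers_utils.configs.exaone import ExaoneConfig

config = ExaoneConfig(use_rotary_pos=True, rotary_base=500000.0)

assert config.use_rotary_pos is True
assert config.rotary_base == 500000.0
# Enabling rotary positions switches absolute position embeddings off.
assert config.use_absolute_pos is False
```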