
vllm.transformers_utils.configs.cohere2

__all__ module-attribute

__all__ = ['Cohere2Config']

Cohere2Config

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CohereModel]. It is used to instantiate a Cohere model according to the specified arguments, defining the model architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information. Instantiating a configuration with the defaults will yield a similar configuration to that of the CohereForAI/c4ai-command-r-v01 model.

Parameters:

Name Type Description Default
vocab_size `int`, *optional*, defaults to 256000

Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the input_ids passed when calling [CohereModel].

256000
hidden_size `int`, *optional*, defaults to 8192

Dimension of the hidden representations.

8192
intermediate_size `int`, *optional*, defaults to 22528

Dimension of the MLP representations.

22528
logit_scale `float`, *optional*, defaults to 0.0625

The scaling factor for the output logits.

0.0625
num_hidden_layers `int`, *optional*, defaults to 40

Number of hidden layers in the Transformer decoder.

40
num_attention_heads `int`, *optional*, defaults to 64

Number of attention heads for each attention layer in the Transformer decoder.

64
num_key_value_heads `int`, *optional*

This is the number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model will use Multi-Head Attention (MHA); if num_key_value_heads=1, the model will use Multi-Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper: https://arxiv.org/pdf/2305.13245.pdf. If it is not specified, it defaults to num_attention_heads. (A short GQA sketch follows the usage example below.)

None
hidden_act `str` or `function`, *optional*, defaults to `"silu"`

The non-linear activation function (function or string) in the decoder.

'silu'
max_position_embeddings `int`, *optional*, defaults to 8192

The maximum sequence length that this model might ever be used with.

8192
initializer_range `float`, *optional*, defaults to 0.02

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

0.02
layer_norm_eps `float`, *optional*, defaults to 1e-05

The epsilon used by the layer normalization.

1e-05
use_cache `bool`, *optional*, defaults to `True`

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

True
pad_token_id `int`, *optional*, defaults to 0

Padding token id.

0
bos_token_id `int`, *optional*, defaults to 5

Beginning of stream token id.

5
eos_token_id `int`, *optional*, defaults to 255001

End of stream token id.

255001
tie_word_embeddings `bool`, *optional*, defaults to `True`

Whether to tie the input and output word embeddings.

True
rope_theta `float`, *optional*, defaults to 10000.0

The base period of the RoPE embeddings.

10000.0
rope_scaling `dict`, *optional*

Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and expect the model to work on a longer max_position_embeddings, we recommend updating this value accordingly. Expected contents (see the configuration sketch after this parameter list):

    rope_type (str): The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', 'llama3'], with 'default' being the original RoPE implementation.
    factor (float, optional): Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In most scaling types, a factor of x will enable the model to handle sequences of length x * original maximum pre-trained length.
    original_max_position_embeddings (int, optional): Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during pretraining.
    attention_factor (float, optional): Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the factor field to infer the suggested value.
    beta_fast (float, optional): Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
    beta_slow (float, optional): Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
    short_factor (list[float], optional): Only used with 'longrope'. The scaling factor to be applied to short contexts (< original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    long_factor (list[float], optional): Only used with 'longrope'. The scaling factor to be applied to long contexts (> original_max_position_embeddings). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    low_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to the low frequency components of the RoPE.
    high_freq_factor (float, optional): Only used with 'llama3'. Scaling factor applied to the high frequency components of the RoPE.

None
attention_bias `bool`, *optional*, defaults to `False`

Whether to use a bias in the query, key, value and output projection layers during self-attention.

False
attention_dropout `float`, *optional*, defaults to 0.0

The dropout ratio for the attention probabilities.

0.0
sliding_window `int`, *optional*, defaults to 4096

Size of the sliding window attention context.

4096
sliding_window_pattern `int`, *optional*, defaults to 4

Pattern for the sliding window attention.

4
cache_implementation `str`, *optional*, defaults to `"hybrid"`

The cache type to be used with generate.

'hybrid'
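
To make the expected rope_scaling contents concrete, here is a minimal sketch; the 'linear' rope type and the factor value are illustrative assumptions, not recommended settings for this model.

>>> from transformers import Cohere2Config

>>> # Illustrative values only: a 'linear' scaling dict needs a rope_type and a
>>> # factor; factor=4.0 targets roughly 4x the original pre-trained length.
>>> configuration = Cohere2Config(
...     max_position_embeddings=32768,
...     rope_scaling={"rope_type": "linear", "factor": 4.0},
... )
>>> configuration.rope_scaling["factor"]
4.0
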
>>> from transformers import Cohere2Model, Cohere2Config

>>> # Initializing a Cohere2 model configuration
>>> configuration = Cohere2Config()

>>> # Initializing a model from the Cohere2 configuration
>>> model = Cohere2Model(configuration) # doctest: +SKIP

>>> # Accessing the model configuration
>>> configuration = model.config # doctest: +SKIP
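
As a rough illustration of the num_key_value_heads options described in the parameter list above (the head counts here are arbitrary examples, not values from any released checkpoint): leaving it unset keeps MHA, an intermediate divisor of num_attention_heads gives GQA, and 1 gives MQA.

>>> from transformers import Cohere2Config

>>> mha = Cohere2Config()                       # defaults to num_attention_heads (MHA)
>>> gqa = Cohere2Config(num_key_value_heads=8)  # 64 query heads share 8 key/value heads (GQA)
>>> mqa = Cohere2Config(num_key_value_heads=1)  # a single shared key/value head (MQA)

>>> mha.num_key_value_heads, gqa.num_key_value_heads, mqa.num_key_value_heads
(64, 8, 1)
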
Source code in vllm/transformers_utils/configs/cohere2.py
class Cohere2Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CohereModel`]. It is used to instantiate a Cohere
    model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) model.


    Args:
        vocab_size (`int`, *optional*, defaults to 256000):
            Vocabulary size of the Cohere model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`CohereModel`]
        hidden_size (`int`, *optional*, defaults to 8192):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 22528):
            Dimension of the MLP representations.
        logit_scale (`float`, *optional*, defaults to 0.0625):
            The scaling factor for the output logits.
        num_hidden_layers (`int`, *optional*, defaults to 40):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 64):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 8192):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*, defaults to 0):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 5):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 255001):
            End of stream token id.
        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
            Whether to tie the input and output word embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type
            and you expect the model to work on a longer `max_position_embeddings`, we recommend updating this value
            accordingly.
            Expected contents:
                `rope_type` (`str`):
                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
                    'llama3'], with 'default' being the original RoPE implementation.
                `factor` (`float`, *optional*):
                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
                    original maximum pre-trained length.
                `original_max_position_embeddings` (`int`, *optional*):
                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
                    pretraining.
                `attention_factor` (`float`, *optional*):
                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
                    computation. If unspecified, it defaults to value recommended by the implementation, using the
                    `factor` field to infer the suggested value.
                `beta_fast` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
                    ramp function. If unspecified, it defaults to 32.
                `beta_slow` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
                    ramp function. If unspecified, it defaults to 1.
                `short_factor` (`list[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `long_factor` (`list[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `low_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
                `high_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        sliding_window (`int`, *optional*, defaults to 4096):
            Size of the sliding window attention context.
        sliding_window_pattern (`int`, *optional*, defaults to 4):
            Pattern for the sliding window attention.
        cache_implementation (`str`, *optional*, defaults to `"hybrid"`): the cache type to be used with `generate`.

    ```python
    >>> from transformers import Cohere2Model, Cohere2Config

    >>> # Initializing a Cohere2 model configuration
    >>> configuration = Cohere2Config()

    >>> # Initializing a model from the Cohere2 configuration
    >>> model = Cohere2Model(configuration) # doctest: +SKIP

    >>> # Accessing the model configuration
    >>> configuration = model.config # doctest: +SKIP
    ```
    """

    model_type = "cohere2"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=256000,
        hidden_size=8192,
        intermediate_size=22528,
        logit_scale=0.0625,
        num_hidden_layers=40,
        num_attention_heads=64,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=8192,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_cache=True,
        pad_token_id=0,
        bos_token_id=5,
        eos_token_id=255001,
        tie_word_embeddings=True,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        sliding_window=4096,
        sliding_window_pattern=4,
        cache_implementation="hybrid",
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.logit_scale = logit_scale
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.sliding_window = sliding_window
        self.sliding_window_pattern = sliding_window_pattern
        # Need to specify head_dim in the config so it can be used in the attention forward functions
        self.head_dim = hidden_size // num_attention_heads
        self.cache_implementation = cache_implementation

        # Validate the correctness of rotary position embeddings parameters
        rope_config_validation(self)

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
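
Two details of __init__ above are easy to miss: a num_key_value_heads of None silently falls back to num_attention_heads, and head_dim is derived as hidden_size // num_attention_heads (128 with the defaults of 8192 and 64). A quick sketch using only the defaults:

>>> from transformers import Cohere2Config

>>> configuration = Cohere2Config()    # all defaults
>>> configuration.num_key_value_heads  # None was replaced by num_attention_heads
64
>>> configuration.head_dim             # 8192 // 64
128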

attention_bias instance-attribute

attention_bias = attention_bias

attention_dropout instance-attribute

attention_dropout = attention_dropout

cache_implementation instance-attribute

cache_implementation = cache_implementation

head_dim instance-attribute

head_dim = hidden_size // num_attention_heads

hidden_act instance-attribute

hidden_act = hidden_act

hidden_size instance-attribute

hidden_size = hidden_size

initializer_range instance-attribute

initializer_range = initializer_range

intermediate_size instance-attribute

intermediate_size = intermediate_size

keys_to_ignore_at_inference class-attribute instance-attribute

keys_to_ignore_at_inference = ['past_key_values']

layer_norm_eps instance-attribute

layer_norm_eps = layer_norm_eps

logit_scale instance-attribute

logit_scale = logit_scale

max_position_embeddings instance-attribute

max_position_embeddings = max_position_embeddings

model_type class-attribute instance-attribute

model_type = 'cohere2'

num_attention_heads instance-attribute

num_attention_heads = num_attention_heads

num_hidden_layers instance-attribute

num_hidden_layers = num_hidden_layers

num_key_value_heads instance-attribute

num_key_value_heads = num_key_value_heads

rope_scaling instance-attribute

rope_scaling = rope_scaling

rope_theta instance-attribute

rope_theta = rope_theta

sliding_window instance-attribute

sliding_window = sliding_window

sliding_window_pattern instance-attribute

sliding_window_pattern = sliding_window_pattern

use_cache instance-attribute

use_cache = use_cache

vocab_size instance-attribute

vocab_size = vocab_size

__init__

__init__(
    vocab_size=256000,
    hidden_size=8192,
    intermediate_size=22528,
    logit_scale=0.0625,
    num_hidden_layers=40,
    num_attention_heads=64,
    num_key_value_heads=None,
    hidden_act="silu",
    max_position_embeddings=8192,
    initializer_range=0.02,
    layer_norm_eps=1e-05,
    use_cache=True,
    pad_token_id=0,
    bos_token_id=5,
    eos_token_id=255001,
    tie_word_embeddings=True,
    rope_theta=10000.0,
    rope_scaling=None,
    attention_bias=False,
    attention_dropout=0.0,
    sliding_window=4096,
    sliding_window_pattern=4,
    cache_implementation="hybrid",
    **kwargs,
)
Source code in vllm/transformers_utils/configs/cohere2.py
def __init__(
    self,
    vocab_size=256000,
    hidden_size=8192,
    intermediate_size=22528,
    logit_scale=0.0625,
    num_hidden_layers=40,
    num_attention_heads=64,
    num_key_value_heads=None,
    hidden_act="silu",
    max_position_embeddings=8192,
    initializer_range=0.02,
    layer_norm_eps=1e-5,
    use_cache=True,
    pad_token_id=0,
    bos_token_id=5,
    eos_token_id=255001,
    tie_word_embeddings=True,
    rope_theta=10000.0,
    rope_scaling=None,
    attention_bias=False,
    attention_dropout=0.0,
    sliding_window=4096,
    sliding_window_pattern=4,
    cache_implementation="hybrid",
    **kwargs,
):
    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.hidden_size = hidden_size
    self.logit_scale = logit_scale
    self.intermediate_size = intermediate_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads

    # for backward compatibility
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads

    self.num_key_value_heads = num_key_value_heads
    self.hidden_act = hidden_act
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.use_cache = use_cache
    self.rope_theta = rope_theta
    self.rope_scaling = rope_scaling
    self.attention_bias = attention_bias
    self.attention_dropout = attention_dropout
    self.sliding_window = sliding_window
    self.sliding_window_pattern = sliding_window_pattern
    # Need to specify head_dim in the config so it can be used in the attention forward functions
    self.head_dim = hidden_size // num_attention_heads
    self.cache_implementation = cache_implementation

    # Validate the correctness of rotary position embeddings parameters
    rope_config_validation(self)

    super().__init__(
        pad_token_id=pad_token_id,
        bos_token_id=bos_token_id,
        eos_token_id=eos_token_id,
        tie_word_embeddings=tie_word_embeddings,
        **kwargs,
    )
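
Since the token ids and tie_word_embeddings are forwarded to PretrainedConfig.__init__, the usual PretrainedConfig helpers apply to this class as well. A small sketch of a round trip through a plain dict (the overridden values are arbitrary):

>>> from transformers import Cohere2Config

>>> original = Cohere2Config(sliding_window=2048, eos_token_id=255001)
>>> restored = Cohere2Config.from_dict(original.to_dict())
>>> restored.sliding_window, restored.eos_token_id, restored.model_type
(2048, 255001, 'cohere2')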