vllm.model_executor.parameter
__all__
module-attribute
__all__ = [
"BasevLLMParameter",
"PackedvLLMParameter",
"PerTensorScaleParameter",
"ModelWeightParameter",
"ChannelQuantScaleParameter",
"GroupQuantScaleParameter",
"PackedColumnParameter",
"RowvLLMParameter",
]
BasevLLMParameter
Bases: Parameter
Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader. When the provided weight loader is called, the loaded weight is copied into the parameter.
Source code in vllm/model_executor/parameter.py
__init__
Initialize the BasevLLMParameter
:param data: torch tensor with the parameter data
:param weight_loader: weight loader callable
:returns: a torch.nn.Parameter
BlockQuantScaleParameter
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.
ChannelQuantScaleParameter
Bases: _ColumnvLLMParameter
Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.
GroupQuantScaleParameter
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.
ModelWeightParameter
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for linear layer weights. Uses both column and row parallelism.
PackedColumnParameter
Bases: _ColumnvLLMParameter
Parameter class for weights that are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.
__init__
__init__(
packed_factor: Union[int, Fraction],
packed_dim: int,
marlin_tile_size: Optional[int] = None,
bitblas_tile_size: Optional[int] = None,
**kwargs,
)
adjust_shard_indexes_for_packing
PackedvLLMParameter
Bases: ModelWeightParameter
Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends ModelWeightParameter to take the packed factor, the packed dimension and, optionally, the Marlin tile size for Marlin kernels. When loading weights for fused linear layers, adjusts shard_size and shard_offset to account for packing and, optionally, the Marlin tile size.
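To make the idea of packing concrete, the following pure-Python sketch shows how eight 4-bit values fit into one 32-bit word, which is where a packed factor of 8 comes from for int4-in-int32 storage. The helper names pack_int4 and unpack_int4 and the lowest-nibble-first ordering are assumptions made for illustration; they are not part of vLLM, and real kernels may order the nibbles differently.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values into 32-bit words, lowest nibble first.

    Hypothetical helper for illustration only; not a vLLM function.
    """
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for i, v in enumerate(values[start:start + 8]):
            if not 0 <= v < 16:
                raise ValueError("value does not fit in 4 bits")
            # Place value i of this group into bits [4*i, 4*i + 4).
            word |= v << (4 * i)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the 4-bit values from 32-bit words."""
    return [(w >> (4 * i)) & 0xF for w in words for i in range(8)]


values = list(range(16))          # sixteen 4-bit values
packed = pack_int4(values)        # two 32-bit words
packed_factor = len(values) // len(packed)
```

Under these assumptions the packed dimension is eight times shorter than the logical one, so any shard size or offset expressed in logical elements has to be scaled down by the packed factor before it can index the packed tensor.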
__init__
__init__(
packed_factor: Union[int, Fraction],
packed_dim: int,
marlin_tile_size: Optional[int] = None,
bitblas_tile_size: Optional[int] = None,
**kwargs,
)
adjust_shard_indexes_for_packing
PerTensorScaleParameter
Bases: BasevLLMParameter
Parameter class for scales where the number of scales equals the number of logical matrices in a fused linear layer (e.g. for QKV, three scales are loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map each scale to a shard during weight loading.
Note: additional parameter manipulation may be handled per quantization config, within process_weights_after_loading.
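The mapping from a shard to its scale can be pictured with a small pure-Python sketch. It assumes that QKV shard ids arrive either as the strings "q", "k", "v" or as integer indices, and that each one selects a single slot in a list of scales. The names QKV_SLOTS, shard_id_as_int and load_into_shard_id are hypothetical stand-ins written for this example, not the library's implementation.

```python
# Assumed mapping from QKV shard names to slot indices (illustrative only).
QKV_SLOTS = {"q": 0, "k": 1, "v": 2}


def shard_id_as_int(shard_id):
    """Normalize a shard id ("q"/"k"/"v" or an int) to an integer slot."""
    if isinstance(shard_id, int):
        return shard_id
    if shard_id not in QKV_SLOTS:
        raise KeyError(f"unknown shard id: {shard_id!r}")
    return QKV_SLOTS[shard_id]


def load_into_shard_id(scales, loaded_scale, shard_id):
    """Store one per-tensor scale in the slot selected by shard_id."""
    scales[shard_id_as_int(shard_id)] = loaded_scale


# One scale per logical matrix in the fused QKV layer.
scales = [0.0, 0.0, 0.0]
load_into_shard_id(scales, 0.5, "q")
load_into_shard_id(scales, 0.25, "k")
load_into_shard_id(scales, 0.125, 2)   # integer ids are accepted as-is
```

Because a per-tensor scale applies to a whole logical matrix, this sketch never slices the scale by tensor-parallel rank; it only picks the slot.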
__init__
_load_into_shard_id
Slice the parameter data based on the shard id for loading.
_shard_id_as_int
load_column_parallel_weight
load_merged_column_weight
load_qkv_weight
RowvLLMParameter
Bases: BasevLLMParameter
Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters loaded into row-parallel linear layers. Requires an input_dim to be defined.
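The reason input_dim is required can be seen in a minimal sketch of row-parallel sharding: each tensor-parallel rank keeps a contiguous slice of the weight along the input dimension. The example below uses nested Python lists in place of tensors, and the helper shard_along_dim as well as the names tp_rank and tp_size are assumptions for illustration rather than vLLM's code.

```python
def shard_along_dim(matrix, dim, tp_rank, tp_size):
    """Return this rank's contiguous slice of a 2-D nested list along `dim`.

    Hypothetical helper: dim=0 slices rows, dim=1 slices columns.
    """
    full = len(matrix) if dim == 0 else len(matrix[0])
    if full % tp_size != 0:
        raise ValueError("dimension is not divisible by tp_size")
    shard_size = full // tp_size
    start = tp_rank * shard_size
    if dim == 0:
        return [row[:] for row in matrix[start:start + shard_size]]
    return [row[start:start + shard_size] for row in matrix]


# A weight with 2 output features and 4 input features, so input_dim is 1.
weight = [[0, 1, 2, 3],
          [4, 5, 6, 7]]
input_dim = 1

rank0 = shard_along_dim(weight, input_dim, tp_rank=0, tp_size=2)
rank1 = shard_along_dim(weight, input_dim, tp_rank=1, tp_size=2)
```

Each rank ends up with every output row but only half of the input columns, which is the defining property of row parallelism in this sketch.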
load_row_parallel_weight
load_row_parallel_weight(loaded_weight: Tensor)
_ColumnvLLMParameter
Bases: BasevLLMParameter
Private class defining weight loading functionality (load_merged_column_weight, load_qkv_weight) for parameters being loaded into linear layers with column parallelism. This includes QKV and MLP layers which are not already fused on disk. Requires an output dimension to be defined. Called within the weight loader of each of the column parallel linear layers.
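As a rough illustration of loading layers that are not fused on disk into one fused parameter, the sketch below copies a gate matrix and an up matrix into a single buffer along the output dimension. It uses nested lists instead of tensors and assumes the output dimension is the row axis. The function load_merged_shard and the names shard_offset, shard_size and tp_rank follow the vocabulary of this page, but the code is an assumption-laden stand-in, not the library's implementation.

```python
def load_merged_shard(fused, loaded, shard_offset, shard_size, tp_rank):
    """Copy this rank's rows of `loaded` into `fused` at `shard_offset`.

    Hypothetical helper: rows are the output dimension in this sketch.
    """
    start = tp_rank * shard_size
    rows = loaded[start:start + shard_size]
    if len(rows) != shard_size:
        raise ValueError("loaded weight is too small for this rank")
    # Same-length slice assignment, so the fused buffer keeps its shape.
    fused[shard_offset:shard_offset + shard_size] = [row[:] for row in rows]


# Two separate matrices on disk, each with 4 output rows.
gate = [[1, 1], [2, 2], [3, 3], [4, 4]]
up = [[5, 5], [6, 6], [7, 7], [8, 8]]

# With tp_size = 2, each rank holds 2 gate rows followed by 2 up rows.
shard_size = 2
fused = [[0, 0] for _ in range(2 * shard_size)]
load_merged_shard(fused, gate, shard_offset=0, shard_size=shard_size, tp_rank=1)
load_merged_shard(fused, up, shard_offset=shard_size, shard_size=shard_size, tp_rank=1)
```

The same pattern would extend to QKV by using three logical shards with their own offsets; that extension is left out here to keep the sketch short.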
load_column_parallel_weight
load_column_parallel_weight(loaded_weight: Tensor)
load_merged_column_weight
load_merged_column_weight(loaded_weight: Tensor, **kwargs)
load_qkv_weight
load_qkv_weight(loaded_weight: Tensor, **kwargs)
_adjust_shard_indexes_for_bitblas
_adjust_shard_indexes_for_marlin
_adjust_shard_indexes_for_packing
_adjust_shard_indexes_for_packing(
shard_size,
shard_offset,
packed_factor,
marlin_tile_size,
bitblas_tile_size,
)
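The arithmetic behind such an adjustment can be sketched as follows. This is a guess at the general shape, not the function's actual body: it assumes the logical shard size and offset are floor-divided by the packed factor and then, if a tile size is given, multiplied by it. The name sketch_adjust_for_packing and the single tile_size argument are hypothetical; the real signature above takes separate Marlin and BitBLAS tile sizes.

```python
from fractions import Fraction


def sketch_adjust_for_packing(shard_size, shard_offset, packed_factor,
                              tile_size=None):
    """Convert logical shard indexes to indexes into a packed tensor.

    Illustrative assumption: divide by the packed factor first, then scale
    by an optional kernel tile size.
    """
    shard_size = shard_size // packed_factor
    shard_offset = shard_offset // packed_factor
    if tile_size is not None:
        shard_size *= tile_size
        shard_offset *= tile_size
    return shard_size, shard_offset


# int4 packed into int32: eight logical elements per stored element.
plain = sketch_adjust_for_packing(4096, 8192, packed_factor=8)

# Same shard with a hypothetical tile size of 16.
tiled = sketch_adjust_for_packing(4096, 8192, packed_factor=8, tile_size=16)

# A fractional packed factor also works with floor division.
fractional = sketch_adjust_for_packing(4096, 8192, packed_factor=Fraction(8, 3))
```

Accepting a Fraction mirrors the Union[int, Fraction] type of packed_factor shown in the __init__ signatures, which suggests that some formats do not pack a whole number of elements per stored word.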
permute_param_layout_
permute_param_layout_(
param: BasevLLMParameter,
input_dim: int,
output_dim: int,
**kwargs,
) -> BasevLLMParameter
Permute a parameter's layout to the specified input and output dimensions. This is useful for forcing the parameter into a known layout. For example, if a packed (quantized) weight matrix must be in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, call permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0) to ensure x is in that layout. The function permutes x if required and fails an assertion if x cannot be brought into the requested layout.
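For a two-dimensional weight, forcing a known layout amounts to transposing only when the current input and output dimensions differ from the requested ones. The sketch below shows that decision on nested lists. The function permute_to_layout and its explicit curr_input_dim and curr_output_dim arguments are assumptions for illustration; unlike the real function, whose trailing underscore suggests in-place modification, this version returns a new list when it transposes.

```python
def permute_to_layout(matrix, curr_input_dim, curr_output_dim,
                      input_dim, output_dim):
    """Return `matrix` arranged so inputs and outputs sit on the requested dims.

    Hypothetical 2-D helper operating on nested lists.
    """
    if {curr_input_dim, curr_output_dim} != {0, 1}:
        raise ValueError("current dims must be 0 and 1")
    if {input_dim, output_dim} != {0, 1}:
        raise ValueError("requested dims must be 0 and 1")
    if (curr_input_dim, curr_output_dim) == (input_dim, output_dim):
        return matrix  # already in the requested layout
    # Swap the two axes.
    return [list(col) for col in zip(*matrix)]


# 2 output features x 3 input features: output_dim = 0, input_dim = 1.
w = [[1, 2, 3],
     [4, 5, 6]]

same = permute_to_layout(w, 1, 0, input_dim=1, output_dim=0)
swapped = permute_to_layout(w, 1, 0, input_dim=0, output_dim=1)
```

A kernel that expects inputs on dimension 0 can then call such a helper unconditionally instead of checking the stored layout itself.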