vllm.attention.backends.abstract

AttentionBackend
Bases: ABC
Abstract class for attention backends.

advance_step

copy_blocks (abstractmethod, staticmethod)

get_builder_cls (abstractmethod, staticmethod)
get_builder_cls() -> Type[AttentionMetadataBuilder]

get_impl_cls (abstractmethod, staticmethod)
get_impl_cls() -> Type[AttentionImpl]

get_kv_cache_shape (abstractmethod, staticmethod)

get_kv_cache_stride_order (staticmethod)

get_metadata_cls (abstractmethod, staticmethod)
get_metadata_cls() -> Type[AttentionMetadata]

get_state_cls (abstractmethod, staticmethod)
get_state_cls() -> Type[AttentionState]

make_metadata (classmethod)
make_metadata(*args, **kwargs) -> AttentionMetadata

swap_blocks (abstractmethod, staticmethod)
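
A hedged sketch of how the get_*_cls hooks fit together in a concrete backend. MyBackend, MyImpl, MyMetadata, MyState and MyMetadataBuilder are placeholder names rather than vLLM classes, only hooks whose signatures appear above are overridden, and the note about make_metadata's default body is an assumption inferred from its signature.

```python
from typing import Type

from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
                                              AttentionMetadata,
                                              AttentionMetadataBuilder,
                                              AttentionState)


class MyImpl(AttentionImpl): ...                        # placeholder implementation
class MyMetadata(AttentionMetadata): ...                # placeholder metadata dataclass
class MyState(AttentionState): ...                      # placeholder per-runner state
class MyMetadataBuilder(AttentionMetadataBuilder): ...  # placeholder metadata builder


class MyBackend(AttentionBackend):
    """Placeholder backend: each hook returns the class the runner should use."""

    @staticmethod
    def get_impl_cls() -> Type[AttentionImpl]:
        return MyImpl

    @staticmethod
    def get_metadata_cls() -> Type[AttentionMetadata]:
        return MyMetadata

    @staticmethod
    def get_state_cls() -> Type[AttentionState]:
        return MyState

    @staticmethod
    def get_builder_cls() -> Type[AttentionMetadataBuilder]:
        return MyMetadataBuilder


# make_metadata(*args, **kwargs) is a classmethod on the backend; per its
# signature it produces an AttentionMetadata, presumably by instantiating the
# class returned by get_metadata_cls() (an assumption, not shown above).
```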

AttentionImpl

__init__ (abstractmethod)
__init__(
num_heads: int,
head_size: int,
scale: float,
num_kv_heads: Optional[int] = None,
alibi_slopes: Optional[List[float]] = None,
sliding_window: Optional[int] = None,
kv_cache_dtype: str = "auto",
blocksparse_params: Optional[Dict[str, Any]] = None,
logits_soft_cap: Optional[float] = None,
attn_type: str = DECODER,
kv_sharing_target_layer_name: Optional[str] = None,
) -> None

forward (abstractmethod)
forward(
layer: AttentionLayer,
query: Tensor,
key: Tensor,
value: Tensor,
kv_cache: Tensor,
attn_metadata: T,
output: Optional[Tensor] = None,
output_scale: Optional[Tensor] = None,
) -> Tensor
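
To make the contract concrete, here is a hedged skeleton of an implementation: it only stores the constructor arguments and documents what forward is expected to do; it is not a working kernel. The "decoder" default is assumed to match AttentionType.DECODER, and the description of forward's duties is inferred from the metadata fields documented below (e.g. slot_mapping).

```python
from typing import Any, Dict, List, Optional

import torch

from vllm.attention.backends.abstract import AttentionImpl, AttentionLayer


class SkeletonImpl(AttentionImpl):
    """Hypothetical skeleton showing the shape of the contract, not a real kernel."""

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        kv_cache_dtype: str = "auto",
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        attn_type: str = "decoder",  # assumed to match AttentionType.DECODER
        kv_sharing_target_layer_name: Optional[str] = None,
    ) -> None:
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale
        self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads

    def forward(
        self,
        layer: AttentionLayer,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # A real implementation would (1) write key/value into kv_cache at the
        # positions given by attn_metadata.slot_mapping, (2) run prefill and/or
        # decode attention as described by attn_metadata, and (3) fill the
        # preallocated `output` tensor when one is provided.
        raise NotImplementedError
```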

fused_output_quant_supported
Returns whether this attention implementation supports fused output quantization. This is used by the AttnFusionPass to fuse output quantization only onto implementations that support it.
TODO(luka): merge parameters into QuantDescriptor.

Parameters:
    dtype: quantized dtype
    static: static or dynamic quantization
    group_shape: quant group shape; (-1, -1) for per-tensor

Returns:
    Whether fusion is supported for this type of quantization.
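
A hedged sketch of an override that reports support only for per-tensor static fp8 output quantization. The parameter types (a torch.dtype, a bool, a tuple) are assumptions; only the names dtype, static and group_shape come from the docstring above.

```python
import torch

from vllm.attention.backends.abstract import AttentionImpl


class Fp8OutputQuantImpl(AttentionImpl):
    """Hypothetical impl that only fuses per-tensor static fp8 output quant."""

    # __init__ / forward omitted; only the fusion capability query is sketched.
    def fused_output_quant_supported(self, dtype, static, group_shape):
        # dtype: quantized dtype, static: static vs. dynamic quantization,
        # group_shape: quant group shape, with (-1, -1) meaning per-tensor.
        # Parameter types here are assumptions, not the documented signature.
        return dtype == torch.float8_e4m3fn and static and group_shape == (-1, -1)
```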

AttentionLayer
Bases: Protocol

AttentionMetadata (dataclass)
Attention metadata for prefill and decode batched together.

decode_metadata (abstractmethod, property)
decode_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run decode attention.

multi_modal_placeholder_index_maps (instance-attribute)

prefill_metadata (abstractmethod, property)
prefill_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run prefill attention.
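
A hedged sketch of how a caller typically consumes these two views: the batched metadata is split into its prefill and decode parts and each phase is run separately. run_prefill and run_decode are placeholder callables, not vLLM APIs.

```python
def run_split_attention(run_prefill, run_decode, attn_metadata):
    # Each view is None when the batch contains no tokens for that phase.
    if attn_metadata.prefill_metadata is not None:
        run_prefill(attn_metadata.prefill_metadata)
    if attn_metadata.decode_metadata is not None:
        run_decode(attn_metadata.decode_metadata)
```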

__init__
__init__(
num_prefills: int,
num_prefill_tokens: int,
num_decode_tokens: int,
slot_mapping: Tensor,
multi_modal_placeholder_index_maps: Optional[
Dict[str, IndexMap]
],
enable_kv_scales_calculation: bool,
) -> None

asdict_zerocopy
Similar to dataclasses.asdict, but avoids deepcopying.
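
A hedged sketch of the idea: unlike dataclasses.asdict, which deep-copies the values it recurses into, a zero-copy variant walks the fields shallowly so tensors such as slot_mapping keep their original storage. This helper is illustrative, not the vLLM implementation.

```python
import dataclasses


def shallow_asdict(obj, skip_fields=frozenset()):
    # Reference the field values directly instead of deep-copying them.
    return {
        field.name: getattr(obj, field.name)
        for field in dataclasses.fields(obj)
        if field.name not in skip_fields
    }
```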

AttentionMetadataBuilder
Abstract class for attention metadata builders.

__init__ (abstractmethod)
__init__(
input_builder: ModelRunnerInputBuilderBase,
) -> None
Create the builder and remember some configuration and parameters.
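
A hedged sketch of what that contract amounts to: the builder keeps a reference to the runner's input builder and caches whatever configuration it will need when later assembling metadata. The attributes read here are illustrative and not guaranteed to exist on ModelRunnerInputBuilderBase.

```python
from vllm.attention.backends.abstract import AttentionMetadataBuilder


class SketchMetadataBuilder(AttentionMetadataBuilder):
    def __init__(self, input_builder) -> None:
        # Remember the input builder plus any configuration needed later when
        # the attention metadata for a batch is assembled.
        self.input_builder = input_builder
        self.block_size = getattr(input_builder, "block_size", None)          # illustrative
        self.sliding_window = getattr(input_builder, "sliding_window", None)  # illustrative
```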

AttentionState
Holds attention backend-specific objects reused during the lifetime of the model runner.

__init__ (abstractmethod)
__init__(runner: ModelRunnerBase)

begin_forward (abstractmethod)
begin_forward(model_input: ModelRunnerInputBase) -> None

get_graph_input_buffers (abstractmethod)
get_graph_input_buffers(
attn_metadata: T,
is_encoder_decoder_model: bool = False,
) -> Dict[str, Any]
Get attention-specific input buffers for CUDA graph capture.

graph_capture_get_metadata_for_batch (abstractmethod)
graph_capture_get_metadata_for_batch(
batch_size: int, is_encoder_decoder_model: bool = False
) -> T
Get attention metadata for CUDA graph capture of batch_size.

graph_clone (abstractmethod)
graph_clone(batch_size: int) -> AttentionState[T]
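
A hedged sketch of how the three graph-related hooks are typically combined during CUDA graph capture; the exact order inside the model runner may differ, and capture_for_batch_size is a placeholder helper, not a vLLM function.

```python
def capture_for_batch_size(state, batch_size: int, is_enc_dec: bool = False):
    # Clone the backend state so each captured graph owns its buffers.
    graph_state = state.graph_clone(batch_size)

    # Build placeholder attention metadata sized for this batch.
    attn_metadata = graph_state.graph_capture_get_metadata_for_batch(
        batch_size, is_encoder_decoder_model=is_enc_dec)

    # Collect the static, attention-specific input buffers the captured
    # graph will keep reusing on replay.
    input_buffers = graph_state.get_graph_input_buffers(
        attn_metadata, is_encoder_decoder_model=is_enc_dec)

    return graph_state, attn_metadata, input_buffers
```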

AttentionType
Attention type.
Uses strings to stay compatible with torch.compile.
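
Because the members are plain strings, code can branch on them without tripping torch.compile guards on custom enum objects. A hedged usage sketch; the member name is quoted from memory and should be treated as an assumption.

```python
from vllm.attention.backends.abstract import AttentionType


def is_cross_attention(attn_type: str) -> bool:
    # Plain string comparison, which torch.compile traces without trouble.
    return attn_type == AttentionType.ENCODER_DECODER
```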

MLAAttentionImpl
Bases: AttentionImpl[T], Generic[T]

forward (abstractmethod)
forward(
layer: AttentionLayer,
hidden_states_or_cq: Tensor,
kv_c_normed: Tensor,
k_pe: Tensor,
kv_cache: Tensor,
attn_metadata: T,
output: Optional[Tensor] = None,
output_scale: Optional[Tensor] = None,
) -> Tensor