vllm.attention.backends.triton_mla
TritonMLABackend
Bases: MLACommonBackend
Source code in vllm/attention/backends/triton_mla.py
get_impl_cls (staticmethod)
get_impl_cls() -> Type[TritonMLAImpl]
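The backend class is a thin selector: vLLM resolves the concrete implementation class through get_impl_cls. A minimal sketch of that lookup, assuming vLLM is installed and the module is importable:

```python
from vllm.attention.backends.triton_mla import TritonMLABackend

# Static lookup: no backend instance is needed.
impl_cls = TritonMLABackend.get_impl_cls()
print(impl_cls.__name__)  # TritonMLAImpl
```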
TritonMLAImpl
Bases: MLACommonImpl[MLACommonMetadata]
Source code in vllm/attention/backends/triton_mla.py
__init__
__init__(
num_heads: int,
head_size: int,
scale: float,
num_kv_heads: int,
alibi_slopes: Optional[List[float]],
sliding_window: Optional[int],
kv_cache_dtype: str,
blocksparse_params: Optional[Dict[str, Any]],
logits_soft_cap: Optional[float],
attn_type: str,
kv_sharing_target_layer_name: Optional[str],
**mla_args,
) -> None
Source code in vllm/attention/backends/triton_mla.py
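The constructor mirrors vLLM's common attention-impl interface; the MLA-specific projections and ranks travel through **mla_args to MLACommonImpl. Below is a hypothetical wiring sketch only: every value is an illustrative assumption, not a vLLM default, and the MLA-specific kwargs are elided because their names are defined by MLACommonImpl.

```python
from typing import Any, Dict

from vllm.attention.backends.triton_mla import TritonMLABackend

impl_cls = TritonMLABackend.get_impl_cls()

# Interface-level arguments shared with other vLLM attention impls.
# All values below are illustrative assumptions, not vLLM defaults.
common_kwargs: Dict[str, Any] = dict(
    num_heads=16,
    head_size=576,                 # example MLA head size (assumption)
    scale=576 ** -0.5,
    num_kv_heads=1,                # MLA attends over one shared latent KV
    alibi_slopes=None,
    sliding_window=None,
    kv_cache_dtype="auto",
    blocksparse_params=None,
    logits_soft_cap=None,
    attn_type="decoder",
    kv_sharing_target_layer_name=None,
)
# MLA-specific arguments arrive via **mla_args and are consumed by
# MLACommonImpl; their names are elided here.
# impl = impl_cls(**common_kwargs, **mla_args)
```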
_forward_decode
_forward_decode(
q_nope: Tensor,
q_pe: Tensor,
kv_c_and_k_pe_cache: Tensor,
attn_metadata: MLACommonMetadata,
) -> Tensor
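_forward_decode consumes the decode-side MLA layout: q_nope is the query projected into the latent (non-rotary) space, q_pe is its rotary-embedded part, and kv_c_and_k_pe_cache stores the compressed KV latent concatenated with the rotary key component. The following is a simplified, self-contained reference of that computation; the shapes and the flat, unpaged cache layout are assumptions for illustration, whereas the real Triton kernel operates on the block-paged cache described by attn_metadata.

```python
import torch

def mla_decode_reference(
    q_nope: torch.Tensor,               # [num_tokens, num_heads, kv_lora_rank]
    q_pe: torch.Tensor,                 # [num_tokens, num_heads, rope_dim]
    kv_c_and_k_pe_cache: torch.Tensor,  # [seq_len, kv_lora_rank + rope_dim]
    scale: float,
) -> torch.Tensor:
    kv_lora_rank = q_nope.shape[-1]
    kv_c = kv_c_and_k_pe_cache[:, :kv_lora_rank]   # compressed KV latent
    k_pe = kv_c_and_k_pe_cache[:, kv_lora_rank:]   # rotary key component
    # Logits are the sum of the latent-space match and the rotary match.
    logits = (
        torch.einsum("qhd,kd->qhk", q_nope, kv_c)
        + torch.einsum("qhr,kr->qhk", q_pe, k_pe)
    ) * scale
    attn = torch.softmax(logits, dim=-1)
    # The value is the compressed latent itself; the up-projection back to
    # per-head value dimensions happens outside this kernel.
    return torch.einsum("qhk,kd->qhd", attn, kv_c)

# Tiny smoke test with made-up shapes.
out = mla_decode_reference(
    torch.randn(2, 16, 512), torch.randn(2, 16, 64),
    torch.randn(128, 512 + 64), scale=576 ** -0.5,
)
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that the returned tensor stays in the latent space (kv_lora_rank wide); per the MLA formulation, the up-projection to the output dimension is applied after this kernel in the shared MLA path.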