vllm.v1.attention.backends.mla.rocm_aiter_mla ¶
AiterMLAHelper ¶
The AITER MLA implementation requires num_heads >= 16. If num_heads < 16 and 16 % num_heads == 0, q can be padded to 16 heads (see the sketch below); otherwise the AITER backend cannot be used.
Source code in vllm/v1/attention/backends/mla/rocm_aiter_mla.py
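To make the padding rule concrete, here is a minimal hypothetical sketch. The names `maybe_pad_query` and `MIN_NUM_HEADS` are illustrative, not the vLLM API, and zero-padding is one plausible strategy; the docstring above only states the divisibility condition, not how the extra heads are filled.

```python
import torch
import torch.nn.functional as F

MIN_NUM_HEADS = 16  # minimum head count the AITER MLA kernel accepts

def maybe_pad_query(q: torch.Tensor) -> tuple[torch.Tensor, int]:
    # Hypothetical helper illustrating the rule above; not the vLLM API.
    # q: [num_tokens, num_heads, head_dim]
    num_heads = q.shape[1]
    if num_heads >= MIN_NUM_HEADS:
        return q, num_heads
    if MIN_NUM_HEADS % num_heads != 0:
        raise ValueError(
            f"AITER MLA needs num_heads >= {MIN_NUM_HEADS} or a divisor "
            f"of {MIN_NUM_HEADS}; got {num_heads}")
    # Zero-pad the head dimension up to 16 (one plausible fill strategy);
    # callers would slice the kernel output back to the original num_heads.
    q_padded = F.pad(q, (0, 0, 0, MIN_NUM_HEADS - num_heads))
    return q_padded, num_heads
```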
_expand_page_indices_kernel ¶
```python
_expand_page_indices_kernel(
    page_indices,
    block_table,
    block_table_stride,
    cu_num_tokens,
    seq_lens,
    KERNEL_BLOCK_SIZE: constexpr,
    BLOCK_SIZE: constexpr,
)
```
Expand block table entries into per-token flat page indices.
The aiter MLA kernel always operates with page_size=1 internally (kv_buffer is flattened via .view(-1, 1, 1, H)). This kernel converts block-level indices from the block table into individual token positions in the flattened KV buffer.
When KERNEL_BLOCK_SIZE=1: block_idx=t, offset=0, flat=block_id (equivalent to a direct copy -- no regression from the original kernel).
When KERNEL_BLOCK_SIZE=K: block table entry b (covering K tokens) is expanded to flat indices b*K, b*K+1, ..., b*K+(K-1).
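To make the index arithmetic concrete, here is a plain-PyTorch reference sketch of what the Triton kernel computes, one token at a time. The tensor shapes and the exact semantics of `cu_num_tokens` and `seq_lens` are assumptions based on the parameter names, not taken from the kernel source.

```python
import torch

def expand_page_indices_ref(
    block_table: torch.Tensor,    # [num_reqs, max_num_blocks] block ids (assumed layout)
    cu_num_tokens: torch.Tensor,  # [num_reqs + 1] cumulative output offsets (assumed)
    seq_lens: torch.Tensor,       # [num_reqs] tokens per request (assumed)
    kernel_block_size: int,       # K: tokens covered by one block-table entry
) -> torch.Tensor:
    """Sequential reference for the per-token expansion described above."""
    page_indices = torch.empty(int(cu_num_tokens[-1]), dtype=block_table.dtype)
    for req in range(block_table.shape[0]):
        base = int(cu_num_tokens[req])
        for t in range(int(seq_lens[req])):
            block_id = int(block_table[req, t // kernel_block_size])
            # Flat index into the page_size=1 view of the KV buffer:
            # block_id * K + offset within the block.
            page_indices[base + t] = (
                block_id * kernel_block_size + t % kernel_block_size
            )
    return page_indices
```

With kernel_block_size=1 the inner expression reduces to `block_table[req, t]`, matching the direct-copy case described for KERNEL_BLOCK_SIZE=1 above.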