vllm.compilation.fusion_attn
AttentionStaticQuantPattern
Source code in vllm/compilation/fusion_attn.py
quant_key instance-attribute
quant_key = QuantKey(
    dtype=quant_dtype,
    static=True,
    group_shape=PER_TENSOR,
    symmetric=symmetric,
)
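For reference, a minimal sketch of how such a key could be built concretely for the supported scheme (static, per-tensor, symmetric fp8). The import path and the GroupShape.PER_TENSOR spelling are assumptions about vLLM's quantization utilities; the field names mirror the attribute shown above.

# Illustrative only: construct the kind of key this pattern matches on.
# The import path below is an assumption; the fields mirror quant_key above.
import torch

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape,
    QuantKey,
)

# A static, per-tensor, symmetric fp8 quantization scheme -- the only scheme
# the attention fusion patterns currently target.
example_key = QuantKey(
    dtype=torch.float8_e4m3fn,
    static=True,
    group_shape=GroupShape.PER_TENSOR,
    symmetric=True,
)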
__init__
Source code in vllm/compilation/fusion_attn.py
_register
Source code in vllm/compilation/fusion_attn.py
empty_quant
AttnFusionPass
Bases: VllmInductorPass
This pass fuses post-attention quantization into the attention operation when the backend supports it.
It uses the pattern matcher and matches each layer manually, as strings cannot be wildcarded. This also lets us check support on attention layers at registration time instead of during pattern matching.
Currently, only static fp8 quant is supported, but patterns could easily be added for other quant schemes and dtypes. The bigger hurdle for wider support is the attention kernels, which need to support fused output quantization.
Source code in vllm/compilation/fusion_attn.py
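The effect of the fusion can be pictured with the conceptual sketch below. This is not the pass's actual pattern code; it only shows the unfused epilogue (a static per-tensor fp8 quant applied to the attention output) that the rewrite eliminates by handing the scale to the attention op itself. Names are illustrative.

# Conceptual sketch of the epilogue that gets fused away.
import torch

FP8 = torch.float8_e4m3fn

def unfused_epilogue(attn_output: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Before fusion: attention returns a high-precision tensor, then a separate
    # static per-tensor quant op scales, clamps, and casts it to fp8.
    info = torch.finfo(FP8)
    return (attn_output / scale).clamp(info.min, info.max).to(FP8)

# After fusion, this quant node is removed from the graph and the scale is
# passed into the attention op, so a supporting kernel writes fp8 directly.

Because the fused form depends on kernel support, the pass checks each attention layer when its pattern is registered, as described above.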
__init__
__init__(config: VllmConfig)
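For orientation, a minimal usage sketch. It assumes the pass is constructed from a populated VllmConfig and, like other VllmInductorPass subclasses, is later applied to a post-grad FX graph by vLLM's pass manager; the surrounding plumbing is an assumption, not part of this class's documented API.

# Hypothetical wiring sketch: in practice vLLM constructs and registers this
# pass internally when attention-quant fusion is enabled in the compilation
# config.
from vllm.compilation.fusion_attn import AttnFusionPass
from vllm.config import VllmConfig

def build_attn_fusion_pass(vllm_config: VllmConfig) -> AttnFusionPass:
    # One pass instance per compilation; it registers a fusion pattern for
    # each attention layer in the model.
    return AttnFusionPass(vllm_config)

# Conceptually, the inductor pass manager later invokes the pass on the
# post-grad FX graph, e.g. attn_fusion_pass(graph).

In normal operation this wiring happens inside vLLM's compilation pipeline; the sketch only shows where the config enters.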