vllm.triton_utils.jit_monitor
Monitor unexpected Triton kernel JIT compilation during inference.
After server warmup completes, any Triton JIT compilation or autotuning event indicates a cache miss or an unexpected input shape, either of which can cause a latency spike. This module registers hooks in the Triton runtime to detect and log such events so they can be investigated.
Currently monitors:
- Triton @triton.autotune cache misses (via knobs.autotuning.print)
- Triton @triton.jit first-time compilations (via knobs.runtime.jit_post_compile_hook)
_setup_triton_autotuning_print
Enable TRITON_PRINT_AUTOTUNING unless the user opted out.
Source code in vllm/triton_utils/jit_monitor.py
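The opt-out check can be illustrated with a minimal sketch. It assumes that "opted out" means the environment variable is set to exactly "0", as the description of activate suggests. The helper name autotuning_print_opted_out is hypothetical and is not part of vLLM or Triton, and the real implementation may handle other values differently.

```python
import os
from typing import Mapping, Optional


def autotuning_print_opted_out(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when TRITON_PRINT_AUTOTUNING is explicitly set to "0".

    Hypothetical helper for illustration; an unset variable or any other
    value is treated as "not opted out" in this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get("TRITON_PRINT_AUTOTUNING") == "0"


# Usage: enable autotuning printing unless the user opted out.
if not autotuning_print_opted_out({"TRITON_PRINT_AUTOTUNING": "1"}):
    print("autotuning print would be enabled")
```

Passing the environment as a mapping keeps the check easy to test without mutating os.environ.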
_setup_triton_jit_hook
Register a jit_post_compile_hook that warns on compilation.
Source code in vllm/triton_utils/jit_monitor.py
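As a rough illustration of what a warning hook might look like, the sketch below defines a callback that logs a warning whenever it is invoked. The keyword names fn and repr are assumptions about what the Triton runtime passes to the hook, so the callback accepts arbitrary keyword arguments. The logger name and message text are also illustrative, not the ones vLLM uses.

```python
import logging

logger = logging.getLogger("jit_monitor_sketch")  # illustrative logger name


def warn_on_jit_compile(**kwargs):
    """Hook-style callback: log a warning naming the kernel that was compiled.

    The keys "fn" and "repr" are assumed, not guaranteed; missing keys fall
    back to a placeholder so the hook never raises during inference.
    """
    fn = kwargs.get("fn")
    name = getattr(fn, "name", None) or kwargs.get("repr") or "<unknown kernel>"
    logger.warning("Triton kernel JIT-compiled during inference: %s", name)
    return None
```

In a real setup, a callback like this would be assigned to the runtime's jit_post_compile_hook knob. Here it can be exercised directly, for example with warn_on_jit_compile(repr="my_kernel").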
activate
Enable JIT compilation monitoring after warmup.
Call once per worker process at the end of compile_or_warm_up_model. After activation, every Triton kernel compilation or autotuning benchmark that happens during inference is logged as a warning.
Safe to call multiple times — subsequent calls are no-ops.
If the user has explicitly set TRITON_PRINT_AUTOTUNING=0 in their environment, autotuning printing is left disabled; the JIT compilation hook is still registered regardless.