vllm.model_executor.offloader.prefetch ¶
CPU offloading with asynchronous prefetching of weights back to the GPU ahead of use.
Uses static buffers and event-based stream forking for torch.compile + CUDA graph compatibility. Events allow the copy stream to join CUDA graph captures, ensuring H2D copies are properly captured.
ParamInfo dataclass ¶
Metadata about an offloaded parameter.
Source code in vllm/model_executor/offloader/prefetch.py
key property ¶
Unique key for buffer pool grouping.
Includes parameter name to prevent different parameters with the same shape from sharing buffers within the same layer. Parameters with the same name across different layers will share buffers (via slots).
Includes stride because parameters with same shape but different strides need separate buffers to preserve memory layout.
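The grouping rule above can be illustrated with a small sketch. This is not the vLLM class itself: the field names and the `key` tuple layout are assumptions, and a string stands in for `torch.dtype` so the example runs anywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamInfoSketch:
    """Illustrative metadata for an offloaded parameter (field names assumed)."""

    name: str     # parameter name, e.g. "qkv_proj.weight"
    shape: tuple  # tensor shape
    stride: tuple # tensor stride, kept so buffers preserve memory layout
    dtype: str    # stand-in for torch.dtype in this sketch

    @property
    def key(self):
        # Name keeps distinct params within one layer apart; shape/stride/dtype
        # ensure the buffer matches the memory layout. The layer index is
        # deliberately absent, so the same param across layers shares slots.
        return (self.name, self.shape, self.stride, self.dtype)


a = ParamInfoSketch("qkv_proj.weight", (4096, 4096), (4096, 1), "bf16")
b = ParamInfoSketch("o_proj.weight", (4096, 4096), (4096, 1), "bf16")
assert a.key != b.key  # same shape and dtype, different name -> separate buffers
```

Because the key omits the layer index, layer 0's `qkv_proj.weight` and layer 1's `qkv_proj.weight` map to the same buffer group and alternate between slots.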
PrefetchOffloader ¶
Bases: BaseOffloader
Prefetching-based offloader with group-based layer selection.
Groups layers and uses async H2D prefetch to hide transfer latency. Uses static buffers and stream synchronization for torch.compile and CUDA graph compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| group_size | int | Group every N layers together. | required |
| num_in_group | int | Offload this many layers per group (the last N of each group). | required |
| prefetch_step | int | Number of layers to prefetch ahead. | required |
| mode | str | Offload mode ("cpu" is currently supported). | 'cpu' |
Source code in vllm/model_executor/offloader/prefetch.py
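The group-based selection rule ("offload the last N layers of every group") can be sketched as follows. `offloaded_layers` is a hypothetical helper written for illustration; the real selection lives inside PrefetchOffloader.

```python
def offloaded_layers(num_layers, group_size, num_in_group):
    """Pick the last `num_in_group` layers of every `group_size`-layer group.

    Hypothetical helper mirroring the documented selection rule.
    """
    selected = []
    for start in range(0, num_layers, group_size):
        group = list(range(start, min(start + group_size, num_layers)))
        selected.extend(group[-num_in_group:])
    return selected


# 8 layers, groups of 4, offload the last 2 of each group:
assert offloaded_layers(8, 4, 2) == [2, 3, 6, 7]
```

With `prefetch_step=2`, the H2D copy for layer 6 would be kicked off while layer 4 is still computing, hiding the transfer latency behind compute.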
_hook_module_forward ¶
Hook module's forward with torch.compile-compatible sync.
Source code in vllm/model_executor/offloader/prefetch.py
_start_prefetch ¶
_start_prefetch(layer_idx: int)
Called by custom op - start async copy to static buffer.
_wait_for_layer ¶
_wait_for_layer(layer_idx: int)
Called by custom op - wait for copy to complete.
Synchronization strategy:

- During CUDA graph capture: use event-based wait (graph-compatible)
- Outside capture (warmup/eager): use wait_stream (more robust)

During capture, we skip the wait for pre-capture prefetches because:

1. sync_before_graph_capture() ensures pre-capture work is complete
2. We can't wait on pre-capture events during capture (isolation error)
Source code in vllm/model_executor/offloader/prefetch.py
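The wait-selection logic above can be modeled as a small decision function. This is an illustrative sketch, not vLLM code; the function name and string return values are invented for the example, and the flag corresponds to the `_prefetch_in_capture` state described under join_after_forward.

```python
def choose_wait(in_graph_capture, prefetch_started_in_capture):
    """Model of the documented wait selection (illustrative only).

    Returns the synchronization primitive that would be used,
    or None when the wait is skipped.
    """
    if not in_graph_capture:
        # Warmup/eager: wait_stream is the more robust choice.
        return "wait_stream"
    if not prefetch_started_in_capture:
        # Pre-capture prefetches were already completed by
        # sync_before_graph_capture(); waiting on their events during
        # capture would violate CUDA graph isolation.
        return None
    # Prefetch started inside the capture: event-based wait is graph-compatible.
    return "event_wait"


assert choose_wait(False, False) == "wait_stream"
assert choose_wait(True, False) is None
assert choose_wait(True, True) == "event_wait"
```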
join_after_forward ¶
Join copy_stream after model forward completes.
Call this after the model forward pass but before CUDA graph capture ends. This ensures copy_stream is rejoined for any prefetches started during the forward pass.
We join ALL layers that have _prefetch_in_capture=True, meaning their prefetch was started during capture but not yet waited on (joined). This handles both full and piecewise cudagraph modes correctly:

- Full mode: joins layers 0..prefetch_step-1 (prefetched by the last layers)
- Piecewise mode: joins only layers prefetched by THIS subgraph's layers
Source code in vllm/model_executor/offloader/prefetch.py
post_init ¶
Allocate static buffer pool and start initial prefetches.
Note: Parameters have already been offloaded to CPU during wrap_modules() (in _CpuParamOffloader.__init__), so GPU memory is available for the static buffer pool.
Source code in vllm/model_executor/offloader/prefetch.py
sync_prev_onload ¶
Sync previous onload operations.
Ensures any H2D copies in flight on copy_stream complete before the compute stream continues. Call this before CUDA graph capture/replay or when synchronization is needed.
Source code in vllm/model_executor/offloader/prefetch.py
wrap_modules ¶
Wrap modules with prefetch offloading logic.
Source code in vllm/model_executor/offloader/prefetch.py
StaticBufferPool ¶
Pre-allocated GPU buffer pool for offloaded parameters.
Allocates slot_capacity copies of each unique parameter (name, shape, stride, dtype), allowing for double/triple buffering during prefetch.
Buffer slots are reused circularly: layer N uses slot (N % slot_capacity).
The key includes parameter name to prevent different parameters within the same layer from sharing buffers. Parameters with the same name across different layers share buffers via the slot mechanism.
Source code in vllm/model_executor/offloader/prefetch.py
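The circular slot assignment is simple modular arithmetic; a minimal sketch (the helper name is invented for illustration):

```python
def slot_for_layer(layer_idx, slot_capacity):
    """Circular slot reuse as documented: layer N uses slot N % slot_capacity."""
    return layer_idx % slot_capacity


# With double buffering (slot_capacity=2), consecutive offloaded layers
# alternate slots, so layer N+1's H2D copy targets a different buffer than
# the one layer N's forward is still reading.
assert [slot_for_layer(n, 2) for n in range(5)] == [0, 1, 0, 1, 0]
```

slot_capacity trades GPU memory for overlap: 2 slots allow one prefetch in flight while one layer computes; 3 slots (triple buffering) allow deeper prefetch_step values without reuse hazards.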
_BaseParamOffloader ¶
Bases: ABC
Base class for parameter offloading strategies.
Source code in vllm/model_executor/offloader/prefetch.py
_param property ¶
Get the parameter being offloaded.
Supports dotted names (e.g. 'self_attn.qkv_proj.weight') by traversing the module hierarchy.
create staticmethod ¶
create(mode: str, **kwargs) -> _BaseParamOffloader
Factory method to create appropriate offloader for mode.
Source code in vllm/model_executor/offloader/prefetch.py
post_init ¶
sync_cpu_storage abstractmethod ¶
Sync CPU storage with current param.data.
Called after process_weights_after_loading to update _cpu_storage with the final processed weights.
_CpuParamOffloader ¶
Bases: _BaseParamOffloader
Offload parameter to pinned CPU memory.
Uses GPU static buffers as the actual parameter, with CPU storage kept separately. This ensures torch.compile sees GPU tensors at trace time.
The offloading happens in two phases:

1. __init__() - copies GPU data to CPU and frees GPU memory immediately
2. assign_static_buffer() - points param.data to the GPU static buffer
Source code in vllm/model_executor/offloader/prefetch.py
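The two-phase flow can be mimicked with plain Python objects (lists standing in for tensors). This toy model only shows the ordering of ownership changes; all names besides assign_static_buffer are invented for the sketch.

```python
class CpuParamOffloaderSketch:
    """Toy model of the documented two-phase offload (not vLLM code)."""

    def __init__(self, gpu_data):
        # Phase 1: copy "GPU" data into (notionally pinned) CPU storage and
        # drop the GPU reference so its memory can be reclaimed. Weight
        # loading keeps writing into cpu_storage via param_data.
        self.cpu_storage = list(gpu_data)
        self.param_data = self.cpu_storage

    def assign_static_buffer(self, gpu_buffer):
        # Phase 2: point the parameter at the shared GPU static buffer so
        # torch.compile traces a GPU tensor; prefetch later copies
        # cpu_storage -> gpu_buffer before each use.
        self.param_data = gpu_buffer


off = CpuParamOffloaderSketch([1.0, 2.0])
static_buffer = [0.0, 0.0]
off.assign_static_buffer(static_buffer)
assert off.param_data is static_buffer
assert off.cpu_storage == [1.0, 2.0]  # CPU copy survives for prefetching
```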
_offload_to_cpu_internal ¶
Copy parameter data to pinned CPU storage and free GPU memory.
This replaces param.data with CPU storage, allowing weight loading to continue writing to CPU memory. GPU memory is freed when the original GPU tensor is garbage collected.
Source code in vllm/model_executor/offloader/prefetch.py
_update_cpu_storage_from_param ¶
Update _cpu_storage from current param.data, ensuring pinned memory.
After process_weights_after_loading, device_loading_context creates non-pinned CPU tensors via p.data = p.data.to("cpu"). Using non-pinned memory with copy_(src, non_blocking=True) causes CUDA to perform a stream synchronization before the copy, breaking the event-based fork synchronization and potentially allowing the copy to overwrite the GPU buffer while the compute stream still reads it.
This method ensures _cpu_storage always uses pinned memory when available, re-pinning if necessary.
Source code in vllm/model_executor/offloader/prefetch.py
assign_static_buffer ¶
assign_static_buffer(gpu_buffer: Tensor) -> None
Point parameter data to GPU static buffer.
This is called after weight loading AND process_weights_after_loading complete. At this point:

- param.data may have been replaced by device_loading_context (which creates new CPU tensors after quantization processing)
- We need to update _cpu_storage to point to the current param.data so that prefetch copies the processed weights, not stale data
- Then point param.data to the GPU buffer for torch.compile
Source code in vllm/model_executor/offloader/prefetch.py
post_init ¶
sync_cpu_storage ¶
Sync CPU storage with current param.data.
Called after process_weights_after_loading to update _cpu_storage with the final processed weights. This is critical because:

1. process_weights_after_loading may transform weights (quantization)
2. device_loading_context creates NEW CPU tensors when moving back
3. Our old _cpu_storage would have pre-processed or stale data
Source code in vllm/model_executor/offloader/prefetch.py
_ModuleOffloader ¶
Manages offloading for a single module.
Uses static buffers from a shared pool instead of dynamic allocation.
Source code in vllm/model_executor/offloader/prefetch.py
assign_buffer_slot ¶
assign_buffer_slot(pool: StaticBufferPool, slot_idx: int)
Assign this module to a buffer slot in the pool.
Also assigns static GPU buffers to each parameter offloader, which moves the parameter data to point to the GPU buffer.
Source code in vllm/model_executor/offloader/prefetch.py
get_param_infos ¶
Get parameter metadata for buffer pool allocation.
Note: sync_cpu_storage() must be called before this method to ensure _cpu_storage reflects the final processed weights (after quantization).
Source code in vllm/model_executor/offloader/prefetch.py
post_init ¶
Collect total offloaded bytes (offloading already done in __init__).
Source code in vllm/model_executor/offloader/prefetch.py
start_onload_to_static ¶
Start async copy from CPU storage to GPU buffer.
Uses event-based forking to join copy_stream to CUDA graph capture. This ensures H2D copies are properly captured when recording a graph.
IMPORTANT: We must wait for the compute stream before copying, because the previous layer's forward may still be using the buffer (GPU ops are async). Without this sync, we could overwrite the buffer while it's being read.
Source code in vllm/model_executor/offloader/prefetch.py
sync_cpu_storage ¶
Sync CPU storage with current param.data.
Called after process_weights_after_loading to ensure _cpu_storage contains the final processed weights, not stale pre-loading data.