# Offline Inference
You can run vLLM in your own code to perform inference on a list of prompts.
The offline API is based on the LLM class.
To initialize the vLLM engine, create a new instance of LLM
and specify the model to run.
For example, the following code downloads the facebook/opt-125m
model from HuggingFace
and runs it in vLLM using the default configuration.
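A minimal sketch of that initialization (the `LLM` class and the `facebook/opt-125m` model name are taken from the description above):

```python
from vllm import LLM

# Downloads facebook/opt-125m from HuggingFace on first use and
# starts the vLLM engine with the default configuration.
llm = LLM(model="facebook/opt-125m")
```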
After initializing the LLM
instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- Generative models compute log probabilities over output tokens, which are sampled from to obtain the final output text (see the sketch after this list).
- Pooling models output their hidden states directly.
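As an illustration of the generative case, here is a small sketch using the `generate` API with sampling parameters; the prompt and parameter values are arbitrary examples:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Sampling parameters control how output tokens are drawn from the logprobs.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```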
Please refer to the documentation pages for generative and pooling models for more details about each API.