Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the LLM class. To initialize the vLLM engine, create a new instance of LLM and specify the model to run.

For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM using the default configuration.

from vllm import LLM

llm = LLM(model="facebook/opt-125m")

After initializing the LLM instance, you can perform model inference using various APIs. The available APIs depend on the type of model that is being run:

Please refer to the above pages for more details about each API.