# Context Extension
> **Note**
> The `--rope-scaling` parameter used in older versions of vLLM is no longer supported. Use the `--hf-overrides` method with `rope_parameters` instead.
This directory contains examples for extending the context length of models using vLLM.
## Offline Inference Example
The `context_extension.py` script demonstrates how to extend the context length of a Qwen model using the YARN method (via `rope_parameters`) and run a simple chat example.
### Usage
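A typical invocation, assuming the script sits under `examples/offline_inference/` in a vLLM checkout (adjust the path to wherever the example lives in your copy):

```bash
# Run the offline context-extension example from the repository root
python examples/offline_inference/context_extension.py
```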
## OpenAI Online Method
You can also use vLLM's OpenAI-compatible API to serve models with extended context length.
### Usage
Run the vLLM server with the following command to extend the context length using YARN:
```bash
vllm serve Qwen/Qwen3-0.6B \
  --hf-overrides '{"rope_parameters": {"factor": 4.0, "original_max_position_embeddings": 32768, "rope_theta": 1000000, "rope_type": "yarn"}}' \
  --max-model-len 131072
```
### Client Example
After starting the server, you can use the OpenAI Python client to interact with it:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # Dummy API key, required by the client
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"},
    ],
    max_tokens=128,
    temperature=0.8,
    top_p=0.95,
)
print(response.choices[0].message.content)
```
## Key Parameters
The available parameters depend on the `rope_type` you choose. For detailed information about all supported RoPE types and their specific parameters, refer to the Hugging Face Transformers RoPE documentation.
Common parameters include:
- `rope_type`: The type of RoPE implementation (e.g., `"yarn"`, `"linear"`, `"dynamic"`)
- `factor`: The factor by which to extend the context length
- `original_max_position_embeddings`: The model's original maximum position embeddings
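As a rough illustration, `rope_parameters` payloads for a few RoPE types might look like the sketch below. The keys follow the Hugging Face Transformers RoPE conventions; the factor values are made up for a hypothetical model with a 32768-token native context, so consult the Transformers documentation for the exact parameter set each type accepts.

```python
# Hypothetical rope_parameters payloads for a model whose native
# context length is 32768 tokens (values chosen for illustration).
yarn_params = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
linear_params = {"rope_type": "linear", "factor": 2.0}
dynamic_params = {"rope_type": "dynamic", "factor": 2.0}

for params in (yarn_params, linear_params, dynamic_params):
    # Every scaling variant shown here names its type and a scaling factor.
    assert "rope_type" in params and "factor" in params
```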
The following parameters are specific to vLLM:
- `max_model_len`: The new maximum sequence length after extension (`original_max_position_embeddings * factor`). vLLM uses it to pre-allocate the KV cache and to cap request lengths at serving time.
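For the serve command earlier in this document, that arithmetic works out as follows (the values mirror the `--hf-overrides` JSON in that command):

```python
# max_model_len = original_max_position_embeddings * factor
original_max_position_embeddings = 32768  # native context of Qwen3-0.6B, per the serve example
factor = 4.0                              # YARN scaling factor from the serve example
max_model_len = int(original_max_position_embeddings * factor)
print(max_model_len)  # 131072, matching --max-model-len in the serve command
```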