Hidden State Extraction
The Hidden State Extraction feature allows vLLM to save intermediate layer activations from a target model during inference. This is useful for training EAGLE-style draft models, knowledge distillation, or offline analysis of model internals.
Note
To save the last layer's output hidden states, pass num_hidden_layers as a layer ID. These hidden states are not normalized by the model's output norm.
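As a minimal sketch of the note above, the helper below builds a `speculative_config` dict that adds the last layer to a list of layer IDs. The name `build_extract_config` is hypothetical and not part of vLLM, and the value 36 for `num_hidden_layers` is an assumed example; read the real value from the target model's config.

```python
def build_extract_config(layer_ids, num_hidden_layers, include_last_layer=True):
    """Build a speculative_config dict for hidden state extraction.

    Hypothetical helper, not part of vLLM. Passing num_hidden_layers as a
    layer ID selects the last layer's output hidden states, which are not
    normalized by the output norm.
    """
    ids = set(layer_ids)  # drop duplicates
    if include_last_layer:
        ids.add(num_hidden_layers)
    return {
        "method": "extract_hidden_states",
        "num_speculative_tokens": 1,
        "draft_model_config": {
            "hf_config": {"eagle_aux_hidden_state_layer_ids": sorted(ids)},
        },
    }


# 36 is an assumed layer count used only for illustration.
config = build_extract_config([1, 2, 3, 4], num_hidden_layers=36)
layer_ids = config["draft_model_config"]["hf_config"][
    "eagle_aux_hidden_state_layer_ids"
]
print(layer_ids)
```

The resulting dict can be passed as the `speculative_config` argument in the same way as the literal dict in the offline example.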
Offline Example
import tempfile

from vllm import LLM, SamplingParams
from vllm.config.kv_transfer import KVTransferConfig
from vllm.distributed.kv_transfer.kv_connector.v1 import (
    example_hidden_states_connector,
)

# Keep all work inside the context manager: the temporary directory and the
# saved files are deleted when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    llm = LLM(
        model="Qwen/Qwen3-8B",
        enable_chunked_prefill=False,  # chunked prefill must be disabled
        speculative_config={
            "method": "extract_hidden_states",
            "num_speculative_tokens": 1,
            "draft_model_config": {
                "hf_config": {
                    # Layers whose hidden states are saved
                    "eagle_aux_hidden_state_layer_ids": [1, 2, 3, 4],
                },
            },
        },
        kv_transfer_config=KVTransferConfig(
            kv_connector="ExampleHiddenStatesConnector",
            kv_role="kv_producer",
            kv_connector_extra_config={
                "shared_storage_path": tmpdir,
            },
        ),
    )

    outputs = llm.generate(
        ["The future of AI is"],
        SamplingParams(max_tokens=1),
    )

    for output in outputs:
        # Each request reports the path of its saved .safetensors file
        path = output.kv_transfer_params["hidden_states_path"]
        obj = example_hidden_states_connector.load_hidden_states(path)
        print(f"token_ids: {obj['token_ids'].shape}")
        print(f"hidden_states: {obj['hidden_states'].shape}")
A complete example is available at examples/features/speculative_decoding/extract_hidden_states_offline.py.
Online Example
For online usage, a RAM-backed file system such as /dev/shm/ is recommended for better performance, provided the client deletes each file soon after it is generated.
vllm serve Qwen/Qwen3-8B \
    --speculative_config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [1, 2, 3, 4]}}}' \
    --kv_transfer_config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"shared_storage_path": "/dev/shm/hidden_states"}}' \
    --no-enable-chunked-prefill
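On the client side, the cleanup step mentioned above could look roughly like the following sketch. The helper name `load_and_delete` is hypothetical; the loader is passed in as a callable so the sketch has no vLLM dependency, and in practice it would be `load_hidden_states()` from the connector module. How the file path reaches the client is not shown here. The sketch removes only the file at `path`.

```python
import os
import tempfile


def load_and_delete(path, loader):
    """Load one hidden state file, then delete it to free RAM-backed storage.

    Hypothetical helper. `loader` is any callable that takes a path and
    returns the loaded contents.
    """
    try:
        return loader(path)
    finally:
        # Delete the file whether or not loading succeeded, so that the
        # RAM-backed directory does not fill up.
        if os.path.exists(path):
            os.remove(path)


def _read_bytes(path):
    """Stand-in loader used only for this demo."""
    with open(path, "rb") as fh:
        return fh.read()


# Demo with a throwaway file instead of a real hidden state file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo")
demo_path = tmp.name

data = load_and_delete(demo_path, _read_bytes)
file_still_exists = os.path.exists(demo_path)
```

Deleting in a `finally` block is a design choice for this sketch: it favors bounded memory use over keeping a file around for a retry after a failed load.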
Configuration
The kv_connector_extra_config dict accepts these options:
| Parameter | Default | Description |
|---|---|---|
| `shared_storage_path` | `/tmp` | Directory where hidden state files are saved |
| `num_writer_threads` | `8` | Thread pool size for async disk writes |
| `use_synchronization_lock` | `True` | Use file locks so concurrent readers block until writes complete. Can be disabled for batch generation where synchronization is not needed. |
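As a sketch of how these options fit together, the hypothetical helper below assembles the JSON string passed to `--kv_transfer_config`. The function name `kv_transfer_config_json` is an assumption made for illustration; the keys and default values are the ones listed in the table.

```python
import json


def kv_transfer_config_json(
    shared_storage_path="/tmp",
    num_writer_threads=8,
    use_synchronization_lock=True,
):
    """Return a JSON string for the --kv_transfer_config option.

    Hypothetical helper, not part of vLLM; defaults mirror the table.
    """
    return json.dumps(
        {
            "kv_connector": "ExampleHiddenStatesConnector",
            "kv_role": "kv_producer",
            "kv_connector_extra_config": {
                "shared_storage_path": shared_storage_path,
                "num_writer_threads": num_writer_threads,
                "use_synchronization_lock": use_synchronization_lock,
            },
        }
    )


# Example: RAM-backed storage, with locking disabled for batch generation.
cli_value = kv_transfer_config_json(
    shared_storage_path="/dev/shm/hidden_states",
    use_synchronization_lock=False,
)
print(cli_value)
```

Generating the string with `json.dumps` avoids quoting mistakes, such as writing Python's `False` where JSON expects `false`.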
Output Format
Each request produces a .safetensors file containing:
- hidden_states: shape [num_tokens, num_extracted_layers, hidden_size]
- token_ids: shape [num_tokens]
The file path is returned in output.kv_transfer_params["hidden_states_path"]. Use load_hidden_states() from the connector module to read the file with proper synchronization.
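The shapes described above can be checked after loading. The following is a minimal sketch with a hypothetical function name, `check_hidden_states_shapes`. It works on plain shape tuples, so it runs without vLLM, and the hidden size of 4096 in the demo is an assumed value.

```python
def check_hidden_states_shapes(hidden_shape, token_ids_shape, num_extracted_layers):
    """Validate loaded tensor shapes against the documented output format.

    Hypothetical helper. Returns (num_tokens, hidden_size) on success and
    raises ValueError on a mismatch.
    """
    hidden_shape = tuple(hidden_shape)
    token_ids_shape = tuple(token_ids_shape)
    if len(hidden_shape) != 3:
        raise ValueError(f"expected a 3-D hidden_states shape, got {hidden_shape}")
    num_tokens, num_layers, hidden_size = hidden_shape
    if num_layers != num_extracted_layers:
        raise ValueError(
            f"expected {num_extracted_layers} extracted layers, got {num_layers}"
        )
    if token_ids_shape != (num_tokens,):
        raise ValueError(
            f"token_ids shape {token_ids_shape} does not match {num_tokens} tokens"
        )
    return num_tokens, hidden_size


# Demo: 5 tokens, 4 extracted layers, assumed hidden size of 4096.
result = check_hidden_states_shapes((5, 4, 4096), (5,), num_extracted_layers=4)
print(result)
```

With a loaded file, the same check would take `obj["hidden_states"].shape` and `obj["token_ids"].shape` as its first two arguments, along with the number of configured layer IDs.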
Note
Chunked prefill is not compatible with this feature and must be disabled.