vllm serve

JSON CLI Arguments

When passing JSON CLI arguments, the following sets of arguments are equivalent:

  • --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
  • --json-arg.key1 value1 --json-arg.key2.key3 value2

Additionally, list elements can be passed individually using +:

  • --json-arg '{"key4": ["value3", "value4", "value5"]}'
  • --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
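
For example, the following pairs of invocations are intended to be equivalent (a sketch using --eplb-config and --compilation-config, two of the JSON-valued flags documented below; the values are illustrative):

    vllm serve --eplb-config '{"window_size": 2000, "step_interval": 6000}'
    vllm serve --eplb-config.window_size 2000 --eplb-config.step_interval 6000

    vllm serve --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
    vllm serve --compilation-config.cudagraph_capture_sizes+ 1 --compilation-config.cudagraph_capture_sizes+='2,4,8'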

Arguments

--headless

Run in headless mode. See multi-node data parallel documentation for more details.
Default: False

--api-server-count, -asc

How many API server processes to run. Defaults to data_parallel_size if not specified.

--config

Read CLI options from a config file. Must be a YAML file containing the options listed at: https://docs.vllm.ai/en/latest/configuration/serve_args.html
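
A minimal sketch of such a config file, assuming the YAML keys mirror the CLI flag names (values are illustrative):

    # config.yaml
    port: 8000
    tensor-parallel-size: 2
    max-model-len: 8192

    vllm serve Qwen/Qwen3-0.6B --config config.yaml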

--grpc

Launch a gRPC server instead of the HTTP OpenAI-compatible server. Requires: pip install vllm[grpc].
Default: False

--disable-log-stats

Disable logging statistics.
Default: False

--aggregate-engine-logging

Log aggregate rather than per-engine statistics when using data parallelism.
Default: False

--fail-on-environ-validation, --no-fail-on-environ-validation

If set, the engine will raise an error if environment validation fails.
Default: False

--shutdown-timeout

Shutdown timeout in seconds. 0 = abort, >0 = wait.
Default: 0

--gdn-prefill-backend

Possible choices: flashinfer, triton
Select GDN prefill backend.

--enable-log-requests, --no-enable-log-requests

Enable logging request information, dependent on log level:
- INFO: request ID, parameters, and LoRA request.
- DEBUG: prompt inputs (e.g. text, token IDs).
You can set the minimum log level via VLLM_LOGGING_LEVEL.
Default: False
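
For example, since prompt inputs are only logged at DEBUG, a sketch of a launch that logs them would be:

    VLLM_LOGGING_LEVEL=DEBUG vllm serve --enable-log-requests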

Frontend

Arguments for the OpenAI-compatible frontend server.

--lora-modules

--chat-template

--chat-template-content-format

Possible choices: auto, openai, string
Default: auto

--trust-request-chat-template, --no-trust-request-chat-template

Default: False

--default-chat-template-kwargs

Should either be a valid JSON string or JSON keys passed individually.

--response-role

Default: assistant

--return-tokens-as-token-ids, --no-return-tokens-as-token-ids

Default: False

--enable-auto-tool-choice, --no-enable-auto-tool-choice

Default: False

--exclude-tools-when-tool-choice-none, --no-exclude-tools-when-tool-choice-none

Default: False

--tool-call-parser
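
A sketch of a typical launch that enables automatic tool choice; the parser name must match the model's tool-call format (hermes is one illustrative value):

    vllm serve --enable-auto-tool-choice --tool-call-parser hermes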

--tool-parser-plugin

Default: ""

--tool-server

--log-config-file

--max-log-len

--enable-prompt-tokens-details, --no-enable-prompt-tokens-details

Default: False

--enable-server-load-tracking, --no-enable-server-load-tracking

Default: False

--enable-force-include-usage, --no-enable-force-include-usage

Default: False

--enable-tokenizer-info-endpoint, --no-enable-tokenizer-info-endpoint

Default: False

--enable-log-outputs, --no-enable-log-outputs

Default: False

--enable-log-deltas, --no-enable-log-deltas

Default: True

--log-error-stack, --no-log-error-stack

Default: False

--tokens-only, --no-tokens-only

Default: False

--fingerprint-mode

Possible choices: custom, full, hash, none
Default: full

--fingerprint-value

--host

--port

Default: 8000

--uds

--uvicorn-log-level

Possible choices: critical, debug, error, info, trace, warning
Default: info

--disable-uvicorn-access-log, --no-disable-uvicorn-access-log

Default: False

--disable-access-log-for-endpoints

--allow-credentials, --no-allow-credentials

Default: False

--allowed-origins

Default: ['*']

--allowed-methods

Default: ['*']

--allowed-headers

Default: ['*']

--api-key

--ssl-keyfile

--ssl-certfile

--ssl-ca-certs

--enable-ssl-refresh, --no-enable-ssl-refresh

Default: False
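
To serve over TLS, the key and certificate files can be passed directly (a sketch; file names and port are illustrative):

    vllm serve --ssl-keyfile ./key.pem --ssl-certfile ./cert.pem --port 8443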

--ssl-cert-reqs

Default: 0

--ssl-ciphers

--root-path

--middleware

Default: []

--enable-request-id-headers, --no-enable-request-id-headers

Default: False

--disable-fastapi-docs, --no-disable-fastapi-docs

Default: False

--h11-max-incomplete-event-size

Default: 4194304

--h11-max-header-count

Default: 256

--enable-offline-docs, --no-enable-offline-docs

Default: False

--enable-flash-late-interaction, --no-enable-flash-late-interaction

Default: True

ModelConfig

Configuration for the model.

--model

Default: Qwen/Qwen3-0.6B

--runner

Possible choices: auto, draft, generate, pooling
Default: auto

--convert

Possible choices: auto, classify, embed, none
Default: auto

--tokenizer

--tokenizer-mode

Possible choices: auto, deepseek_v32, deepseek_v4, fastokens, hf, mistral, slow
Default: auto

--trust-remote-code, --no-trust-remote-code

Default: False

--dtype

Possible choices: auto, bfloat16, float, float16, float32, half
Default: auto

--seed

Default: 0

--hf-config-path

--allowed-local-media-path

Default: ""

--allowed-media-domains

--revision

--code-revision

--tokenizer-revision

--max-model-len

Parses human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers. Also accepts -1 or 'auto' as a special value for auto-detection.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600
    - '-1' or 'auto' -> -1 (special value for auto-detection)
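
For example, the following two invocations are equivalent under this parsing rule:

    vllm serve --max-model-len 25.6k
    vllm serve --max-model-len 25600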

--quantization, -q

--allow-deprecated-quantization, --no-allow-deprecated-quantization

Default: False

--enforce-eager, --no-enforce-eager

Default: False

--enable-return-routed-experts, --no-enable-return-routed-experts

Default: False

--max-logprobs

Default: 20

--logprobs-mode

Possible choices: processed_logits, processed_logprobs, raw_logits, raw_logprobs
Default: raw_logprobs

--disable-sliding-window, --no-disable-sliding-window

Default: False

--disable-cascade-attn, --no-disable-cascade-attn

Default: True

--skip-tokenizer-init, --no-skip-tokenizer-init

Default: False

--enable-prompt-embeds, --no-enable-prompt-embeds

Default: False

--served-model-name

--config-format

Possible choices: auto, hf, mistral
Default: auto

--hf-token

--hf-overrides

Default: {}

--pooler-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.PoolerConfig

Should either be a valid JSON string or JSON keys passed individually.

--generation-config

Default: auto

--override-generation-config

Should either be a valid JSON string or JSON keys passed individually.

Default: {}

--enable-sleep-mode, --no-enable-sleep-mode

Default: False

--model-impl

Possible choices: auto, terratorch, transformers, vllm
Default: auto

--override-attention-dtype

--logits-processors

--io-processor-plugin

--renderer-num-workers

Default: 1

LoadConfig

Configuration for loading the model weights.

--load-format

Default: auto

--download-dir

--safetensors-load-strategy

--safetensors-prefetch-num-threads

Default: 8

--safetensors-prefetch-block-size

Parses human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600
Default: 16777216

--model-loader-extra-config

Default: {}

--ignore-patterns

Default: ['original/**/*']

--use-tqdm-on-load, --no-use-tqdm-on-load

Default: True

--pt-load-map-location

Default: cpu

AttentionConfig

Configuration for attention mechanisms in vLLM.

--attention-backend

MambaConfig

Configuration for Mamba SSM backends.

--mamba-backend

Default: MambaBackendEnum.TRITON

--enable-mamba-cache-stochastic-rounding, --no-enable-mamba-cache-stochastic-rounding

Default: False

--mamba-cache-philox-rounds

Default: 0

StructuredOutputsConfig

Dataclass which contains structured outputs config for the engine.

--reasoning-parser

Default: ""

--reasoning-parser-plugin

Default: ""

ParallelConfig

Configuration for the distributed execution.

--distributed-executor-backend

Possible choices: external_launcher, mp, ray, uni

--pipeline-parallel-size, -pp

Default: 1

--master-addr

Default: 127.0.0.1

--master-port

Default: 29501

--nnodes, -n

Default: 1

--node-rank, -r

Default: 0

--distributed-timeout-seconds

--numa-bind, --no-numa-bind

Default: False

--numa-bind-nodes

--numa-bind-cpus

--tensor-parallel-size, -tp

Default: 1

--decode-context-parallel-size, -dcp

Default: 1

--dcp-comm-backend

Possible choices: a2a, ag_rs
Default: ag_rs

--dcp-kv-cache-interleave-size

Default: 1

--cp-kv-cache-interleave-size

Default: 1

--prefill-context-parallel-size, -pcp

Default: 1

--data-parallel-size, -dp

Default: 1

--data-parallel-rank, -dpn

Data parallel rank of this instance. When set, enables external load balancer mode for MoE data-parallel deployments. Unsupported for non-MoE models; launch independent vLLM instances instead.

--data-parallel-start-rank, -dpr

Starting data parallel rank for secondary nodes.

--data-parallel-size-local, -dpl

Number of data parallel replicas to run on this node.

--data-parallel-address, -dpa

Address of data parallel cluster head-node.

--data-parallel-rpc-port, -dpp

Port for data parallel RPC communication.

--data-parallel-backend, -dpb

Backend for data parallel, either "mp" or "ray".
Default: mp

--data-parallel-hybrid-lb, --no-data-parallel-hybrid-lb, -dph

Default: False

--data-parallel-external-lb, --no-data-parallel-external-lb, -dpe

Default: False
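
A sketch of a two-node data parallel launch using these flags (the model, address, port, and sizes are illustrative; see the multi-node data parallel documentation referenced under --headless):

    # head node (runs the API server)
    vllm serve <model> --data-parallel-size 2 --data-parallel-size-local 1 \
        --data-parallel-address 192.168.0.10 --data-parallel-rpc-port 13345

    # secondary node (engines only)
    vllm serve <model> --headless --data-parallel-size 2 --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 --data-parallel-address 192.168.0.10 --data-parallel-rpc-port 13345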

--enable-expert-parallel, --no-enable-expert-parallel, -ep

Default: False

--enable-ep-weight-filter, --no-enable-ep-weight-filter

Default: False

--all2all-backend

Possible choices: allgather_reducescatter, deepep_high_throughput, deepep_low_latency, flashinfer_all2allv, flashinfer_nvlink_one_sided, flashinfer_nvlink_two_sided, mori, naive, nixl_ep, pplx
Default: allgather_reducescatter

--enable-dbo, --no-enable-dbo

Default: False

--ubatch-size

Default: 0

--enable-elastic-ep, --no-enable-elastic-ep

Default: False

--dbo-decode-token-threshold

Default: 32

--dbo-prefill-token-threshold

Default: 512

--disable-nccl-for-dp-synchronization, --no-disable-nccl-for-dp-synchronization

--enable-eplb, --no-enable-eplb

Default: False

--eplb-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.EPLBConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: EPLBConfig(window_size=1000, step_interval=3000, num_redundant_experts=0, log_balancedness=False, log_balancedness_interval=1, use_async=False, policy='default', communicator=None)

--expert-placement-strategy

Possible choices: linear, round_robin
Default: linear

--max-parallel-loading-workers

--ray-workers-use-nsight, --no-ray-workers-use-nsight

Default: False

--disable-custom-all-reduce, --no-disable-custom-all-reduce

Default: False

--worker-cls

Default: auto

--worker-extension-cls

Default: ""

CacheConfig

Configuration for the KV cache.

--block-size

--gpu-memory-utilization

Default: 0.92

--kv-cache-memory-bytes

Parses human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600

--kv-cache-dtype

Possible choices: auto, bfloat16, float16, fp8, fp8_ds_mla, fp8_e4m3, fp8_e5m2, fp8_inc, fp8_per_token_head, int8_per_token_head, nvfp4, turboquant_3bit_nc, turboquant_4bit_nc, turboquant_k3v4_nc, turboquant_k8v4
Default: auto

--num-gpu-blocks-override

--enable-prefix-caching, --no-enable-prefix-caching

--prefix-caching-hash-algo

Possible choices: sha256, sha256_cbor, xxhash, xxhash_cbor
Default: sha256

--calculate-kv-scales, --no-calculate-kv-scales

Default: False

--kv-cache-dtype-skip-layers

Default: []

--kv-sharing-fast-prefill, --no-kv-sharing-fast-prefill

Default: False

--mamba-cache-dtype

Possible choices: auto, float16, float32
Default: auto

--mamba-ssm-cache-dtype

Possible choices: auto, float16, float32
Default: auto

--mamba-block-size

--mamba-cache-mode

Possible choices: align, all, none
Default: none

--kv-offloading-size

--kv-offloading-backend

Possible choices: lmcache, native
Default: native

OffloadConfig

Configuration for model weight offloading to reduce GPU memory usage.

--offload-backend

Possible choices: auto, prefetch, uva
Default: auto

--cpu-offload-gb

Default: 0

--cpu-offload-params

Default: set()

--offload-group-size

Default: 0

--offload-num-in-group

Default: 1

--offload-prefetch-step

Default: 1

--offload-params

Default: set()

MultiModalConfig

Controls the behavior of multimodal models.

--language-model-only, --no-language-model-only

Default: False

--limit-mm-per-prompt

Should either be a valid JSON string or JSON keys passed individually.

Default: {}
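
A sketch showing both forms for this flag, limiting the number of items per modality allowed in a single prompt (keys and counts are illustrative):

    vllm serve --limit-mm-per-prompt '{"image": 3, "video": 1}'
    vllm serve --limit-mm-per-prompt.image 3 --limit-mm-per-prompt.video 1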

--enable-mm-embeds, --no-enable-mm-embeds

Default: False

--media-io-kwargs

Should either be a valid JSON string or JSON keys passed individually.

Default: {}

--mm-processor-kwargs

Should either be a valid JSON string or JSON keys passed individually.

--mm-processor-cache-gb

Default: 4

--mm-processor-cache-type

Possible choices: lru, shm
Default: lru

--mm-shm-cache-max-object-size-mb

Default: 128

--mm-encoder-only, --no-mm-encoder-only

Default: False

--mm-encoder-tp-mode

Possible choices: data, weights
Default: weights

--mm-encoder-attn-backend

--mm-encoder-attn-dtype

Possible choices: fp8, None

--mm-encoder-fp8-scale-path

--mm-encoder-fp8-scale-save-path

--mm-encoder-fp8-scale-save-margin

Default: 1.5

--interleave-mm-strings, --no-interleave-mm-strings

Default: False

--skip-mm-profiling, --no-skip-mm-profiling

Default: False

--video-pruning-rate

--mm-tensor-ipc

Possible choices: direct_rpc, torch_shm
Default: direct_rpc

LoRAConfig

Configuration for LoRA.

--enable-lora, --no-enable-lora

If True, enable handling of LoRA adapters.
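
A sketch of a typical launch serving a base model plus a named adapter (the adapter name and path are illustrative; --lora-modules is the frontend flag listed above):

    vllm serve <base-model> --enable-lora --max-loras 2 --lora-modules my-adapter=/path/to/adapter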

--max-loras

Default: 1

--max-lora-rank

Possible choices: 1, 8, 16, 32, 64, 128, 256, 320, 512
Default: 16

--lora-dtype

Default: auto

--enable-tower-connector-lora, --no-enable-tower-connector-lora

Default: False

--max-cpu-loras

--fully-sharded-loras, --no-fully-sharded-loras

Default: False

--lora-target-modules

--default-mm-loras

Should either be a valid JSON string or JSON keys passed individually.

--specialize-active-lora, --no-specialize-active-lora

Default: False

ObservabilityConfig

Configuration for observability - metrics and tracing.

--show-hidden-metrics-for-version

--otlp-traces-endpoint

--collect-detailed-traces

Possible choices: all, model, worker, None, model,worker, model,all, worker,model, worker,all, all,model, all,worker

--kv-cache-metrics, --no-kv-cache-metrics

Default: False

--kv-cache-metrics-sample

Default: 0.01

--cudagraph-metrics, --no-cudagraph-metrics

Default: False

--enable-layerwise-nvtx-tracing, --no-enable-layerwise-nvtx-tracing

Default: False

--enable-mfu-metrics, --no-enable-mfu-metrics

Default: False

--enable-logging-iteration-details, --no-enable-logging-iteration-details

Default: False

SchedulerConfig

Scheduler configuration.

--max-num-batched-tokens

Parses human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600

--max-num-seqs

--max-num-partial-prefills

Default: 1

--max-long-partial-prefills

Default: 1

--long-prefill-token-threshold

Default: 0

--scheduling-policy

Possible choices: fcfs, priority
Default: fcfs

--enable-chunked-prefill, --no-enable-chunked-prefill

--disable-chunked-mm-input, --no-disable-chunked-mm-input

Default: False

--scheduler-cls

--scheduler-reserve-full-isl, --no-scheduler-reserve-full-isl

Default: True

--disable-hybrid-kv-cache-manager, --no-disable-hybrid-kv-cache-manager

--async-scheduling, --no-async-scheduling

--stream-interval

Default: 1

CompilationConfig

Configuration for compilation.

You must pass CompilationConfig to the VllmConfig constructor; VllmConfig's post-init performs further initialization. If used outside of VllmConfig, some fields will be left in an improper state.

It contains PassConfig, which controls the custom fusion/transformation passes. The remaining fields fall into three groups:

- Top-level Compilation control:
    - [`mode`][vllm.config.CompilationConfig.mode]
    - [`debug_dump_path`][vllm.config.CompilationConfig.debug_dump_path]
    - [`cache_dir`][vllm.config.CompilationConfig.cache_dir]
    - [`backend`][vllm.config.CompilationConfig.backend]
    - [`custom_ops`][vllm.config.CompilationConfig.custom_ops]
    - [`splitting_ops`][vllm.config.CompilationConfig.splitting_ops]
    - [`compile_mm_encoder`][vllm.config.CompilationConfig.compile_mm_encoder]
- CudaGraph capture:
    - [`cudagraph_mode`][vllm.config.CompilationConfig.cudagraph_mode]
    - [`cudagraph_capture_sizes`][vllm.config.CompilationConfig.cudagraph_capture_sizes]
    - [`max_cudagraph_capture_size`][vllm.config.CompilationConfig.max_cudagraph_capture_size]
    - [`cudagraph_num_of_warmups`][vllm.config.CompilationConfig.cudagraph_num_of_warmups]
    - [`cudagraph_copy_inputs`][vllm.config.CompilationConfig.cudagraph_copy_inputs]
- Inductor compilation:
    - [`compile_sizes`][vllm.config.CompilationConfig.compile_sizes]
    - [`compile_ranges_endpoints`][vllm.config.CompilationConfig.compile_ranges_endpoints]
    - [`inductor_compile_config`][vllm.config.CompilationConfig.inductor_compile_config]
    - [`inductor_passes`][vllm.config.CompilationConfig.inductor_passes]
    - custom inductor passes

Why we have different sizes for cudagraph and inductor:
- cudagraph: a cudagraph captured for a specific size can only be used
    for that same size, so we need to capture every size we want to use.
- inductor: a graph compiled by inductor for a general shape can be reused
    for different sizes. Inductor can also compile for specific sizes,
    where it has more information to optimize the graph with fully
    static shapes. However, we find the general-shape compilation
    sufficient for most cases. It might still be beneficial to compile for
    certain small batch sizes, where inductor is good at optimizing.
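
For example, the following invocation captures cudagraphs for a few explicit sizes while also asking inductor to specialize on batch size 1 (a sketch; the sizes are illustrative):

    vllm serve --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8], "compile_sizes": [1]}'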

--cudagraph-capture-sizes

--max-cudagraph-capture-size

KernelConfig

Configuration for kernel selection and warmup behavior.

--ir-op-priority

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.IrOpPriorityConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: IrOpPriorityConfig(rms_norm=[], fused_add_rms_norm=[])

--enable-flashinfer-autotune, --no-enable-flashinfer-autotune

--moe-backend

Possible choices: aiter, auto, cutlass, deep_gemm, deep_gemm_mega_moe, emulation, flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, humming, marlin, triton, triton_unfused
Default: auto

VllmConfig

Dataclass which contains all vllm-related configuration. This simplifies passing around the distinct configurations in the codebase.

--speculative-config, -sc

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.SpeculativeConfig

Should either be a valid JSON string or JSON keys passed individually.

--kv-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KVTransferConfig

Should either be a valid JSON string or JSON keys passed individually.

--kv-events-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KVEventsConfig

Should either be a valid JSON string or JSON keys passed individually.

--ec-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ECTransferConfig

Should either be a valid JSON string or JSON keys passed individually.

--compilation-config, -cc

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.CompilationConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'ir_enable_torch_wrap': None, 'splitting_ops': None, 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}

--attention-config, -ac

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.AttentionConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, tq_max_kv_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=None, disable_flashinfer_q_quantization=False, mla_prefill_backend=None, use_prefill_query_quantization=False, use_fp4_indexer_cache=False, use_non_causal=False)

--reasoning-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ReasoningConfig

Should either be a valid JSON string or JSON keys passed individually.

--kernel-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KernelConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[], fused_add_rms_norm=[]), enable_flashinfer_autotune=None, moe_backend='auto')

--additional-config

Default: {}

--structured-outputs-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.StructuredOutputsConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False)

--profiler-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ProfilerConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: ProfilerConfig(profiler=None, torch_profiler_dir='', torch_profiler_with_stack=True, torch_profiler_with_flops=False, torch_profiler_use_gzip=True, torch_profiler_dump_cuda_time_total=True, torch_profiler_record_shapes=False, torch_profiler_with_memory=False, ignore_frontend=False, delay_iterations=0, max_iterations=0, warmup_iterations=0, active_iterations=5, wait_iterations=0)

--optimization-level

Default: 2

--performance-mode

Possible choices: balanced, interactivity, throughput
Default: balanced

--weight-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.WeightTransferConfig

Should either be a valid JSON string or JSON keys passed individually.