vllm bench throughput

JSON CLI Arguments

When passing JSON CLI arguments, the following sets of arguments are equivalent:

  • --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
  • --json-arg.key1 value1 --json-arg.key2.key3 value2

Additionally, list elements can be passed individually using +:

  • --json-arg '{"key4": ["value3", "value4", "value5"]}'
  • --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
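
For example, the two invocations below are equivalent ways to set a nested key of --hf-overrides (the key and value are hypothetical):

    vllm bench throughput --hf-overrides '{"rope_scaling": {"factor": 2.0}}'
    vllm bench throughput --hf-overrides.rope_scaling.factor 2.0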

Arguments

--backend

Possible choices: vllm, hf, mii, vllm-chat
Default: vllm

--dataset-name

Possible choices: sharegpt, random, sonnet, burstgpt, hf, prefix_repetition, random-mm, random-rerank
Name of the dataset to benchmark on.
Default: sharegpt
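
For orientation, a minimal invocation might look like the following sketch (the model and dataset choices are illustrative):

    vllm bench throughput --model Qwen/Qwen3-0.6B --dataset-name random --num-prompts 100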

--dataset

Path to the ShareGPT dataset; this flag will be deprecated in the next release. The dataset is expected to be a JSON file in the form list[dict[..., conversations: list[dict[..., value: ...]]]], i.e. a list of entries whose conversations field is a list of messages, each carrying a value field.

--dataset-path

Path to the dataset

--input-len

Input prompt length for each request

--output-len

Output length for each request. Overrides the output length from the dataset.

--n

Number of generated sequences per prompt.
Default: 1

--num-prompts

Number of prompts to process.
Default: 1000

--hf-max-batch-size

Maximum batch size for HF backend.

--hf-enable-torch-compile

Enable Torch compile for HF backend.
Default: False

--output-json

Path to save the throughput results in JSON format.

--async-engine

Use vLLM async engine rather than LLM class.
Default: False

--disable-detokenize

Do not detokenize the response (i.e. do not include detokenization time in the measurement)
Default: False

--lora-path

Path to the LoRA adapters to use. This can be an absolute path, a relative path, or a Hugging Face model identifier.

--lora-assignment

Possible choices: random, round-robin
Strategy for assigning LoRA adapters to requests. 'random' (default) selects a LoRA at random for each request. 'round-robin' cycles through LoRAs deterministically.
Default: random
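
For example, to cycle deterministically through two adapters (hypothetical paths, and assuming --lora-path accepts multiple adapters as its plural description suggests):

    vllm bench throughput --enable-lora --lora-path /adapters/lora-a /adapters/lora-b --lora-assignment round-robin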

--prefix-len

Number of fixed prefix tokens before the random context in a request.
Default: 0

--hf-subset

Subset of the HF dataset.

--hf-split

Split of the HF dataset.

--hf-name

Name of the dataset on HuggingFace (e.g., 'lmms-lab/LLaVA-OneVision-Data'). Specify this when --dataset-path is a local filesystem path so the benchmark can identify the correct dataset class.
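
For instance, when that dataset has been downloaded to a local directory (the path is hypothetical):

    vllm bench throughput --dataset-name hf --dataset-path /data/llava-onevision --hf-name lmms-lab/LLaVA-OneVision-Data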

--profile

Use vLLM Profiling. --profiler-config must be provided on the server.
Default: False

--prefix-repetition-prefix-len

Number of prefix tokens per request, used only for prefix repetition dataset.

--prefix-repetition-suffix-len

Number of suffix tokens per request, used only for prefix repetition dataset. Total input length is prefix_len + suffix_len.

--prefix-repetition-num-prefixes

Number of prefixes to generate, used only for prefix repetition dataset. Prompts per prefix is num_requests // num_prefixes.
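
For example, with --num-prompts 1000 and --prefix-repetition-num-prefixes 10, each prefix is shared by 1000 // 10 = 100 prompts; combined with --prefix-repetition-prefix-len 512 and --prefix-repetition-suffix-len 128, every input is 512 + 128 = 640 tokens long.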

--prefix-repetition-output-len

Number of output tokens per request, used only for prefix repetition dataset.

--random-input-len

Number of input tokens per request, used only for random sampling.
Default: 1024

--random-output-len

Number of output tokens per request, used only for random sampling.
Default: 128

--random-range-ratio

Range ratio for sampling input/output length, used only for random sampling. A single float applies to both input and output sequence lengths (ISL and OSL). A JSON dict like '{"input": 0.3, "output": 0.5}' sets them independently. Values must be in [0, 1).
Default: 0.0

--random-prefix-len

Number of fixed prefix tokens before the random context in a request. The total input length is the sum of random-prefix-len and a random context length sampled from [input_len * (1 - range_ratio), input_len * (1 + range_ratio)].
Default: 0
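
As a worked example: with --random-input-len 1024, --random-range-ratio 0.25, and --random-prefix-len 100, the random context length is sampled from [1024 * 0.75, 1024 * 1.25] = [768, 1280], so total input lengths fall in [868, 1380] tokens.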

--random-batch-size

Batch size for random sampling. Only used for the embeddings benchmark.
Default: 1

--no-reranker

Indicates that the model does not support reranking natively. Only used for the reranker benchmark.
Default: False

--random-mm-base-items-per-request

Base number of multimodal items per request for random-mm. Actual per-request count is sampled around this base using --random-mm-num-mm-items-range-ratio.
Default: 1

--random-mm-num-mm-items-range-ratio

Range ratio r in [0, 1] for sampling items per request. We sample uniformly from the closed integer range [floor(n(1-r)), ceil(n(1+r))] where n is the base items per request. r=0 keeps it fixed; r=1 allows 0 items. The maximum is clamped to the sum of per-modality limits from --random-mm-limit-mm-per-prompt. An error is raised if the computed min exceeds the max.
Default: 0.0
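
As a worked example: with --random-mm-base-items-per-request 4 and a range ratio of 0.5, the per-request item count is drawn uniformly from [floor(4 * 0.5), ceil(4 * 1.5)] = [2, 6], subject to the clamping described above.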

--random-mm-limit-mm-per-prompt

Per-modality hard caps for items attached per request, e.g. '{"image": 3, "video": 0}'. The sampled per-request item count is clamped to the sum of these limits. When a modality reaches its cap, its buckets are excluded and probabilities are renormalized. Note: only image sampling is supported for now.
Default: {'image': 255, 'video': 1}

--random-mm-bucket-config

The bucket config is a dictionary mapping a multimodal item sampling configuration to a probability. Two modalities are currently allowed: images and videos. Each bucket key is a tuple of (height, width, num_frames), and the value is the probability of sampling that specific item. Example:

    --random-mm-bucket-config {(256, 256, 1): 0.5, (720, 1280, 1): 0.4, (720, 1280, 16): 0.10}

The first item samples 256x256 images with probability 0.5, the second 720x1280 images with probability 0.4, and the third 720x1280 videos with 16 frames with probability 0.1. Note: if the probabilities do not sum to 1, they are normalized. Note also: only image sampling is supported for now.
Default: {(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}

--asr-min-audio-len-sec

Minimum audio duration in seconds for ASR dataset filtering.
Default: 0.0

--asr-max-audio-len-sec

Maximum audio duration in seconds for ASR dataset filtering.
Default: inf

--disable-log-stats

Disable logging statistics.
Default: False

--aggregate-engine-logging

Log aggregate rather than per-engine statistics when using data parallelism.
Default: False

--fail-on-environ-validation, --no-fail-on-environ-validation

If set, the engine will raise an error if environment validation fails.
Default: False

--shutdown-timeout

Shutdown timeout in seconds. 0 = abort, >0 = wait.
Default: 0

--gdn-prefill-backend

Possible choices: flashinfer, triton
Select GDN prefill backend.

--enable-log-requests, --no-enable-log-requests

Enable logging request information, dependent on log level: INFO logs the request ID, parameters, and LoRA request; DEBUG additionally logs prompt inputs (e.g. text, token IDs). You can set the minimum log level via VLLM_LOGGING_LEVEL.
Default: False
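
For example, to include prompt inputs in the request logs (using the environment variable named above):

    VLLM_LOGGING_LEVEL=DEBUG vllm bench throughput --enable-log-requests ...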

ModelConfig

Configuration for the model.

--model

Default: Qwen/Qwen3-0.6B

--runner

Possible choices: auto, draft, generate, pooling
Default: auto

--convert

Possible choices: auto, classify, embed, none
Default: auto

--tokenizer

--tokenizer-mode

Possible choices: auto, deepseek_v32, deepseek_v4, fastokens, hf, mistral, slow
Default: auto

--trust-remote-code, --no-trust-remote-code

Default: False

--dtype

Possible choices: auto, bfloat16, float, float16, float32, half
Default: auto

--seed

Default: 0

--hf-config-path

--allowed-local-media-path

Default: ""

--allowed-media-domains

--revision

--code-revision

--tokenizer-revision

--max-model-len

Accepts human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers. Also accepts -1 or 'auto' as a special value for auto-detection.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600
    - '-1' or 'auto' -> -1 (special value for auto-detection)

--quantization, -q

--allow-deprecated-quantization, --no-allow-deprecated-quantization

Default: False

--enforce-eager, --no-enforce-eager

Default: False

--enable-return-routed-experts, --no-enable-return-routed-experts

Default: False

--max-logprobs

Default: 20

--logprobs-mode

Possible choices: processed_logits, processed_logprobs, raw_logits, raw_logprobs
Default: raw_logprobs

--disable-sliding-window, --no-disable-sliding-window

Default: False

--disable-cascade-attn, --no-disable-cascade-attn

Default: True

--skip-tokenizer-init, --no-skip-tokenizer-init

Default: False

--enable-prompt-embeds, --no-enable-prompt-embeds

Default: False

--served-model-name

--config-format

Possible choices: auto, hf, mistral
Default: auto

--hf-token

--hf-overrides

Default: {}

--pooler-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.PoolerConfig

Should either be a valid JSON string or JSON keys passed individually.

--generation-config

Default: auto

--override-generation-config

Should either be a valid JSON string or JSON keys passed individually.

Default: {}

--enable-sleep-mode, --no-enable-sleep-mode

Default: False

--model-impl

Possible choices: auto, terratorch, transformers, vllm
Default: auto

--override-attention-dtype

--logits-processors

--io-processor-plugin

--renderer-num-workers

Default: 1

LoadConfig

Configuration for loading the model weights.

--load-format

Default: auto

--download-dir

--safetensors-load-strategy

--safetensors-prefetch-num-threads

Default: 8

--safetensors-prefetch-block-size

Accepts human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600
Default: 16777216

--model-loader-extra-config

Default: {}

--ignore-patterns

Default: ['original/**/*']

--use-tqdm-on-load, --no-use-tqdm-on-load

Default: True

--pt-load-map-location

Default: cpu

AttentionConfig

Configuration for attention mechanisms in vLLM.

--attention-backend

MambaConfig

Configuration for Mamba SSM backends.

--mamba-backend

Default: triton

--enable-mamba-cache-stochastic-rounding, --no-enable-mamba-cache-stochastic-rounding

Default: False

--mamba-cache-philox-rounds

Default: 0

StructuredOutputsConfig

Dataclass which contains structured outputs config for the engine.

--reasoning-parser

Default: ""

--reasoning-parser-plugin

Default: ""

ParallelConfig

Configuration for the distributed execution.

--distributed-executor-backend

Possible choices: external_launcher, mp, ray, uni

--pipeline-parallel-size, -pp

Default: 1

--master-addr

Default: 127.0.0.1

--master-port

Default: 29501

--nnodes, -n

Default: 1

--node-rank, -r

Default: 0

--distributed-timeout-seconds

--numa-bind, --no-numa-bind

Default: False

--numa-bind-nodes

--numa-bind-cpus

--tensor-parallel-size, -tp

Default: 1

--decode-context-parallel-size, -dcp

Default: 1

--dcp-comm-backend

Possible choices: a2a, ag_rs
Default: ag_rs

--dcp-kv-cache-interleave-size

Default: 1

--cp-kv-cache-interleave-size

Default: 1

--prefill-context-parallel-size, -pcp

Default: 1

--data-parallel-size, -dp

Default: 1

--data-parallel-rank, -dpn

Data parallel rank of this instance. When set, enables external load balancer mode for MoE data-parallel deployments. Unsupported for non-MoE models; launch independent vLLM instances instead.

--data-parallel-start-rank, -dpr

Starting data parallel rank for secondary nodes.

--data-parallel-size-local, -dpl

Number of data parallel replicas to run on this node.

--data-parallel-address, -dpa

Address of data parallel cluster head-node.

--data-parallel-rpc-port, -dpp

Port for data parallel RPC communication.

--data-parallel-backend, -dpb

Backend for data parallel, either "mp" or "ray".
Default: mp

--data-parallel-hybrid-lb, --no-data-parallel-hybrid-lb, -dph

Default: False

--data-parallel-external-lb, --no-data-parallel-external-lb, -dpe

Default: False

--enable-expert-parallel, --no-enable-expert-parallel, -ep

Default: False

--enable-ep-weight-filter, --no-enable-ep-weight-filter

Default: False

--all2all-backend

Possible choices: allgather_reducescatter, deepep_high_throughput, deepep_low_latency, flashinfer_all2allv, flashinfer_nvlink_one_sided, flashinfer_nvlink_two_sided, mori, naive, nixl_ep, pplx
Default: allgather_reducescatter

--enable-dbo, --no-enable-dbo

Default: False

--ubatch-size

Default: 0

--enable-elastic-ep, --no-enable-elastic-ep

Default: False

--dbo-decode-token-threshold

Default: 32

--dbo-prefill-token-threshold

Default: 512

--disable-nccl-for-dp-synchronization, --no-disable-nccl-for-dp-synchronization

--enable-eplb, --no-enable-eplb

Default: False

--eplb-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.EPLBConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: EPLBConfig(window_size=1000, step_interval=3000, num_redundant_experts=0, log_balancedness=False, log_balancedness_interval=1, use_async=False, policy='default', communicator=None)
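
Per the JSON CLI argument convention at the top of this page, individual fields can be set without spelling out the full JSON; for example (values are hypothetical):

    --eplb-config '{"window_size": 2000, "log_balancedness": true}'
    --eplb-config.window_size 2000 --eplb-config.log_balancedness true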

--expert-placement-strategy

Possible choices: linear, round_robin
Default: linear

--max-parallel-loading-workers

--ray-workers-use-nsight, --no-ray-workers-use-nsight

Default: False

--disable-custom-all-reduce, --no-disable-custom-all-reduce

Default: False

--worker-cls

Default: auto

--worker-extension-cls

Default: ""

CacheConfig

Configuration for the KV cache.

--block-size

--gpu-memory-utilization

Default: 0.92

--kv-cache-memory-bytes

Accepts human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600

--kv-cache-dtype

Possible choices: auto, bfloat16, float16, fp8, fp8_ds_mla, fp8_e4m3, fp8_e5m2, fp8_inc, fp8_per_token_head, int8_per_token_head, nvfp4, turboquant_3bit_nc, turboquant_4bit_nc, turboquant_k3v4_nc, turboquant_k8v4
Default: auto

--num-gpu-blocks-override

--enable-prefix-caching, --no-enable-prefix-caching

--prefix-caching-hash-algo

Possible choices: sha256, sha256_cbor, xxhash, xxhash_cbor
Default: sha256

--calculate-kv-scales, --no-calculate-kv-scales

Default: False

--kv-cache-dtype-skip-layers

Default: []

--kv-sharing-fast-prefill, --no-kv-sharing-fast-prefill

Default: False

--mamba-cache-dtype

Possible choices: auto, float16, float32
Default: auto

--mamba-ssm-cache-dtype

Possible choices: auto, float16, float32
Default: auto

--mamba-block-size

--mamba-cache-mode

Possible choices: align, all, none
Default: none

--kv-offloading-size

--kv-offloading-backend

Possible choices: lmcache, native
Default: native

OffloadConfig

Configuration for model weight offloading to reduce GPU memory usage.

--offload-backend

Possible choices: auto, prefetch, uva
Default: auto

--cpu-offload-gb

Default: 0

--cpu-offload-params

Default: set()

--offload-group-size

Default: 0

--offload-num-in-group

Default: 1

--offload-prefetch-step

Default: 1

--offload-params

Default: set()

MultiModalConfig

Controls the behavior of multimodal models.

--language-model-only, --no-language-model-only

Default: False

--limit-mm-per-prompt

Should either be a valid JSON string or JSON keys passed individually.

Default: {}
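
For example, to cap each prompt at three images and one video (the limits are hypothetical):

    --limit-mm-per-prompt '{"image": 3, "video": 1}'
    --limit-mm-per-prompt.image 3 --limit-mm-per-prompt.video 1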

--enable-mm-embeds, --no-enable-mm-embeds

Default: False

--media-io-kwargs

Should either be a valid JSON string or JSON keys passed individually.

Default: {}

--mm-processor-kwargs

Should either be a valid JSON string or JSON keys passed individually.

--mm-processor-cache-gb

Default: 4

--mm-processor-cache-type

Possible choices: lru, shm
Default: lru

--mm-shm-cache-max-object-size-mb

Default: 128

--mm-encoder-only, --no-mm-encoder-only

Default: False

--mm-encoder-tp-mode

Possible choices: data, weights
Default: weights

--mm-encoder-attn-backend

--mm-encoder-attn-dtype

Possible choices: fp8, None

--mm-encoder-fp8-scale-path

--mm-encoder-fp8-scale-save-path

--mm-encoder-fp8-scale-save-margin

Default: 1.5

--interleave-mm-strings, --no-interleave-mm-strings

Default: False

--skip-mm-profiling, --no-skip-mm-profiling

Default: False

--video-pruning-rate

--mm-tensor-ipc

Possible choices: direct_rpc, torch_shm
Default: direct_rpc

LoRAConfig

Configuration for LoRA.

--enable-lora, --no-enable-lora

If True, enable handling of LoRA adapters.

--max-loras

Default: 1

--max-lora-rank

Possible choices: 1, 8, 16, 32, 64, 128, 256, 320, 512
Default: 16

--lora-dtype

Default: auto

--enable-tower-connector-lora, --no-enable-tower-connector-lora

Default: False

--max-cpu-loras

--fully-sharded-loras, --no-fully-sharded-loras

Default: False

--lora-target-modules

--default-mm-loras

Should either be a valid JSON string or JSON keys passed individually.

--specialize-active-lora, --no-specialize-active-lora

Default: False

ObservabilityConfig

Configuration for observability - metrics and tracing.

--show-hidden-metrics-for-version

--otlp-traces-endpoint

--collect-detailed-traces

Possible choices: all, model, worker, None, model,worker, model,all, worker,model, worker,all, all,model, all,worker

--kv-cache-metrics, --no-kv-cache-metrics

Default: False

--kv-cache-metrics-sample

Default: 0.01

--cudagraph-metrics, --no-cudagraph-metrics

Default: False

--enable-layerwise-nvtx-tracing, --no-enable-layerwise-nvtx-tracing

Default: False

--enable-mfu-metrics, --no-enable-mfu-metrics

Default: False

--enable-logging-iteration-details, --no-enable-logging-iteration-details

Default: False

SchedulerConfig

Scheduler configuration.

--max-num-batched-tokens

Accepts human-readable integers like '1k', '2M', etc., including decimal values with decimal multipliers.

    Examples:
    - '1k' -> 1,000
    - '1K' -> 1,024
    - '25.6k' -> 25,600

--max-num-seqs

--max-num-partial-prefills

Default: 1

--max-long-partial-prefills

Default: 1

--long-prefill-token-threshold

Default: 0

--scheduling-policy

Possible choices: fcfs, priority
Default: fcfs

--enable-chunked-prefill, --no-enable-chunked-prefill

--disable-chunked-mm-input, --no-disable-chunked-mm-input

Default: False

--scheduler-cls

--scheduler-reserve-full-isl, --no-scheduler-reserve-full-isl

Default: True

--disable-hybrid-kv-cache-manager, --no-disable-hybrid-kv-cache-manager

--async-scheduling, --no-async-scheduling

--stream-interval

Default: 1

CompilationConfig

Configuration for compilation.

CompilationConfig must be passed to the VllmConfig constructor; VllmConfig's
post_init performs further initialization. If used outside of VllmConfig,
some fields will be left in an improper state.

It contains PassConfig, which controls the custom fusion/transformation passes.
The rest has three parts:

- Top-level Compilation control:
    - [`mode`][vllm.config.CompilationConfig.mode]
    - [`debug_dump_path`][vllm.config.CompilationConfig.debug_dump_path]
    - [`cache_dir`][vllm.config.CompilationConfig.cache_dir]
    - [`backend`][vllm.config.CompilationConfig.backend]
    - [`custom_ops`][vllm.config.CompilationConfig.custom_ops]
    - [`splitting_ops`][vllm.config.CompilationConfig.splitting_ops]
    - [`compile_mm_encoder`][vllm.config.CompilationConfig.compile_mm_encoder]
- CudaGraph capture:
    - [`cudagraph_mode`][vllm.config.CompilationConfig.cudagraph_mode]
    - [`cudagraph_capture_sizes`][vllm.config.CompilationConfig.cudagraph_capture_sizes]
    - [`max_cudagraph_capture_size`][vllm.config.CompilationConfig.max_cudagraph_capture_size]
    - [`cudagraph_num_of_warmups`][vllm.config.CompilationConfig.cudagraph_num_of_warmups]
    - [`cudagraph_copy_inputs`][vllm.config.CompilationConfig.cudagraph_copy_inputs]
- Inductor compilation:
    - [`compile_sizes`][vllm.config.CompilationConfig.compile_sizes]
    - [`compile_ranges_endpoints`][vllm.config.CompilationConfig.compile_ranges_endpoints]
    - [`inductor_compile_config`][vllm.config.CompilationConfig.inductor_compile_config]
    - [`inductor_passes`][vllm.config.CompilationConfig.inductor_passes]
    - custom inductor passes

Why we have different sizes for cudagraph and inductor:
- cudagraph: a cudagraph captured for a specific size can only be used
    for that same size, so we need to capture every size we want to use.
- inductor: a graph compiled by inductor for a general shape can be used
    for different sizes. Inductor can also compile for specific sizes,
    where it has more information to optimize the graph with fully
    static shapes. However, we find the general-shape compilation
    sufficient for most cases. It may still be beneficial to compile for
    certain small batch sizes, which inductor is good at optimizing.
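
A sketch of how this distinction surfaces on the command line, using keys that appear in the --compilation-config default further below (the sizes themselves are hypothetical):

    vllm bench throughput -cc '{"cudagraph_capture_sizes": [1, 2, 4, 8], "compile_sizes": [1, 2]}'

Every batch size that should run under cudagraphs must appear in cudagraph_capture_sizes, while inductor compiles a general shape plus the listed specializations.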

--cudagraph-capture-sizes

--max-cudagraph-capture-size

KernelConfig

Configuration for kernel selection and warmup behavior.

--ir-op-priority

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.IrOpPriorityConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: IrOpPriorityConfig(rms_norm=[], fused_add_rms_norm=[])

--enable-flashinfer-autotune, --no-enable-flashinfer-autotune

--moe-backend

Possible choices: aiter, auto, cutlass, deep_gemm, deep_gemm_mega_moe, emulation, flashinfer_cutedsl, flashinfer_cutlass, flashinfer_trtllm, humming, marlin, triton, triton_unfused
Default: auto

VllmConfig

Dataclass which contains all vLLM-related configuration. This simplifies passing around the distinct configurations in the codebase.

--speculative-config, -sc

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.SpeculativeConfig

Should either be a valid JSON string or JSON keys passed individually.
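
For instance, a draft-model setup might be passed as follows (a sketch; the model and num_speculative_tokens fields are assumptions to be checked against the SpeculativeConfig API docs linked above):

    --speculative-config '{"model": "my-org/draft-model", "num_speculative_tokens": 5}'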

--kv-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KVTransferConfig

Should either be a valid JSON string or JSON keys passed individually.

--kv-events-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KVEventsConfig

Should either be a valid JSON string or JSON keys passed individually.

--ec-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ECTransferConfig

Should either be a valid JSON string or JSON keys passed individually.

--compilation-config, -cc

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.CompilationConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'ir_enable_torch_wrap': None, 'splitting_ops': None, 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': 'backed', 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}

--attention-config, -ac

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.AttentionConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, tq_max_kv_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=None, disable_flashinfer_q_quantization=False, mla_prefill_backend=None, use_prefill_query_quantization=False, use_fp4_indexer_cache=False, use_non_causal=False)

--reasoning-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ReasoningConfig

Should either be a valid JSON string or JSON keys passed individually.

--kernel-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.KernelConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[], fused_add_rms_norm=[]), enable_flashinfer_autotune=None, moe_backend='auto')

--additional-config

Default: {}

--structured-outputs-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.StructuredOutputsConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False)

--profiler-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.ProfilerConfig

Should either be a valid JSON string or JSON keys passed individually.
Default: ProfilerConfig(profiler=None, torch_profiler_dir='', torch_profiler_with_stack=True, torch_profiler_with_flops=False, torch_profiler_use_gzip=True, torch_profiler_dump_cuda_time_total=True, torch_profiler_record_shapes=False, torch_profiler_with_memory=False, ignore_frontend=False, delay_iterations=0, max_iterations=0, warmup_iterations=0, active_iterations=5, wait_iterations=0)

--optimization-level

Default: 2

--performance-mode

Possible choices: balanced, interactivity, throughput
Default: balanced

--weight-transfer-config

API docs: https://docs.vllm.ai/en/latest/api/vllm/config/#vllm.config.WeightTransferConfig

Should either be a valid JSON string or JSON keys passed individually.