vllm bench sweep serve_workload

JSON CLI Arguments

When passing JSON CLI arguments, the following sets of arguments are equivalent:

  • --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
  • --json-arg.key1 value1 --json-arg.key2.key3 value2

Additionally, list elements can be passed individually using +:

  • --json-arg '{"key4": ["value3", "value4", "value5"]}'
  • --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
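
The equivalence above can be illustrated with a minimal Python sketch. It is not vLLM's actual argument parser: the helper name expand_dotted is hypothetical, and it assumes that a trailing + appends to a list and that a comma-separated value contributes several elements, as the examples above suggest.

```python
import json


def expand_dotted(pairs):
    """Build a nested dict from (dotted_key, value) pairs.

    A trailing '+' on the key appends to a list; a comma-separated
    value contributes several list elements at once.
    """
    result = {}
    for dotted_key, value in pairs:
        append = dotted_key.endswith("+")
        parts = dotted_key.rstrip("+").split(".")
        node = result
        # Walk (and create) the intermediate dicts for every key but the last.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        leaf = parts[-1]
        if append:
            node.setdefault(leaf, []).extend(value.split(","))
        else:
            node[leaf] = value
    return result


# --json-arg.key1 value1 --json-arg.key2.key3 value2
nested = expand_dotted([("key1", "value1"), ("key2.key3", "value2")])
# --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
listed = expand_dotted([("key4+", "value3"), ("key4+", "value4,value5")])

assert nested == json.loads('{"key1": "value1", "key2": {"key3": "value2"}}')
assert listed == json.loads('{"key4": ["value3", "value4", "value5"]}')
```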

Arguments

--serve-cmd

The command used to run the server: vllm serve ...

--bench-cmd

The command used to run the benchmark: vllm bench serve ...

--after-bench-cmd

The command to invoke after each benchmark run completes, replacing the default ServerWrapper.clear_cache().

--show-stdout

If set, logs the standard output of subcommands. Useful for debugging, but the output can be very verbose.
Default: False

--server-ready-timeout

Timeout in seconds to wait for the server to become ready.
Default: 300

--serve-params

Path to JSON file containing parameter combinations for the vllm serve command. Can be either a list of dicts or a dict where keys are benchmark names. If both serve_params and bench_params are given, this script will iterate over their Cartesian product.
Comma-separated list of linked variables between serve and bench, e.g. max_num_seqs=max_concurrency,max_model_len=random_input_len
Default: ""

--bench-params

Path to JSON file containing parameter combinations for the vllm bench serve command. Can be either a list of dicts or a dict where keys are benchmark names. If both serve_params and bench_params are given, this script will iterate over their Cartesian product.
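
As a rough illustration of the Cartesian product, the sketch below pairs every serve entry with every bench entry. The parameter values are hypothetical and the lists stand in for the contents of the two JSON files in their list-of-dicts form; this is not the script's own code.

```python
import itertools

# Hypothetical contents of the two JSON files (list-of-dicts form).
serve_params = [{"max_num_seqs": 32}, {"max_num_seqs": 64}]
bench_params = [
    {"random_input_len": 128},
    {"random_input_len": 512},
    {"random_input_len": 1024},
]

# Every serve entry is paired with every bench entry: 2 x 3 = 6 combinations.
combinations = list(itertools.product(serve_params, bench_params))

for serve, bench in combinations:
    print(serve, bench)
```

Because the number of combinations is the product of the two list lengths, and each combination is repeated for every run, even short parameter files can produce a long sweep.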

-o, --output-dir

The main directory to which results are written.
Default: results

-e, --experiment-name

The name of this experiment. Defaults to the current timestamp. Results are stored under output_dir/experiment_name.

--num-runs

Number of runs per parameter combination.
Default: 3

--dry-run

If set, prints the commands to run, then exits without executing them.
Default: False
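
A dry run is a convenient way to check a sweep before launching it. The sketch below only assembles and prints a command line from the flags documented on this page; the model name and the two JSON file names are placeholders, and the --model option passed to vllm bench serve is assumed to be available in your installation.

```python
import shlex

# Placeholder model name and file paths; substitute your own.
argv = [
    "vllm", "bench", "sweep", "serve_workload",
    "--serve-cmd", "vllm serve my-org/my-model",
    "--bench-cmd", "vllm bench serve --model my-org/my-model",
    "--serve-params", "serve_params.json",
    "--bench-params", "bench_params.json",
    "--output-dir", "results",
    "--num-runs", "3",
    "--dry-run",
]

# shlex.join quotes arguments containing spaces so the result is shell-safe.
command = shlex.join(argv)
print(command)
```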

--resume

Resume a previous execution of this script, i.e., only run parameter combinations for which there are still no output files under output_dir/experiment_name.
Default: False
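
The resume behaviour can be pictured as a filter over the parameter combinations. The sketch below is illustrative only: it assumes one output file per combination named <name>.json, which is a hypothetical layout and not necessarily the one the script uses.

```python
import tempfile
from pathlib import Path


def pending_combinations(experiment_dir, combo_names):
    """Return the combinations that have no output file yet."""
    return [
        name
        for name in combo_names
        if not (experiment_dir / f"{name}.json").exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Simulate output_dir/experiment_name with one finished combination.
    experiment_dir = Path(tmp) / "results" / "my-experiment"
    experiment_dir.mkdir(parents=True)
    (experiment_dir / "combo-a.json").write_text("{}")

    remaining = pending_combinations(
        experiment_dir, ["combo-a", "combo-b", "combo-c"]
    )

print(remaining)
```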

workload options

--workload-var

Possible choices: request_rate, max_concurrency
The variable to adjust in each iteration.
Default: request_rate

--workload-iters

Number of workload levels to explore. This includes the first two iterations used to interpolate the value of workload_var for remaining iterations.
Default: 10
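
This page does not specify how the remaining workload levels are interpolated. As one possible reading, the sketch below treats the first two iterations as a low and a high endpoint and spaces the remaining levels evenly between them; the function name workload_levels and the linear spacing are assumptions, not the script's documented behaviour.

```python
def workload_levels(low, high, iters):
    """Return `iters` levels: the two endpoints, then evenly spaced values between."""
    if iters < 2:
        raise ValueError("need at least two iterations for the endpoints")
    step = (high - low) / (iters - 1)
    # Interior levels are explored after the two endpoint iterations.
    interior = [low + step * i for i in range(1, iters - 1)]
    return [low, high] + interior


# Hypothetical endpoints for request_rate, with the default of 10 iterations.
levels = workload_levels(1.0, 10.0, 10)
print(levels)
```

Under this assumption, setting --workload-iters to 2 would run only the two endpoint iterations.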