vllm.entrypoints.llm

_R module-attribute

_R = TypeVar('_R', default=Any)

logger module-attribute

logger = init_logger(__name__)

LLM

An LLM for generating texts from given prompts and sampling parameters.

This class includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this class generates texts from the model, using an intelligent batching mechanism and efficient memory management.

Parameters:

model (str, required)
    The name or path of a HuggingFace Transformers model.

tokenizer (Optional[str], default None)
    The name or path of a HuggingFace Transformers tokenizer.

tokenizer_mode (TokenizerMode, default 'auto')
    The tokenizer mode. "auto" uses the fast tokenizer if available; "slow" always uses the slow tokenizer.

skip_tokenizer_init (bool, default False)
    If true, skip initialization of the tokenizer and detokenizer. The input is then expected to provide valid prompt_token_ids and None for the prompt.

trust_remote_code (bool, default False)
    Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.

allowed_local_media_path (str, default '')
    Allows API requests to read local images or videos from directories specified by the server file system. This is a security risk and should only be enabled in trusted environments.

tensor_parallel_size (int, default 1)
    The number of GPUs to use for distributed execution with tensor parallelism.

dtype (ModelDType, default 'auto')
    The data type for the model weights and activations. Currently, float32, float16, and bfloat16 are supported. If 'auto', the torch_dtype attribute specified in the model config file is used; however, if that torch_dtype is float32, float16 is used instead.

quantization (Optional[QuantizationMethods], default None)
    The method used to quantize the model weights. Currently, "awq", "gptq", and "fp8" (experimental) are supported. If None, the quantization_config attribute in the model config file is checked first; if that is also None, the model weights are assumed to be unquantized and dtype determines their data type.

revision (Optional[str], default None)
    The specific model version to use. It can be a branch name, a tag name, or a commit id.

tokenizer_revision (Optional[str], default None)
    The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.

seed (Optional[int], default None)
    The seed to initialize the random number generator for sampling.

gpu_memory_utilization (float, default 0.9)
    The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values increase the KV cache size and thus improve the model's throughput; however, if the value is too high, it may cause out-of-memory (OOM) errors.

swap_space (float, default 4)
    The size (GiB) of CPU memory per GPU to use as swap space. This can be used to temporarily store the states of requests whose best_of sampling parameter is larger than 1. If all requests will have best_of=1, you can safely set this to 0; otherwise, too small a value may cause out-of-memory (OOM) errors. Note that best_of is only supported in V0.

cpu_offload_gb (float, default 0)
    The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer on every forward pass.

enforce_eager (bool, default False)
    Whether to enforce eager execution. If True, CUDA graphs are disabled and the model always runs in eager mode. If False, CUDA graphs and eager execution are used in hybrid.

max_seq_len_to_capture (int, default 8192)
    Maximum sequence length covered by CUDA graphs. When a sequence has a context length larger than this, we fall back to eager mode. Additionally, for encoder-decoder models, if the sequence length of the encoder input is larger than this, we fall back to eager mode.

disable_custom_all_reduce (bool, default False)
    See ParallelConfig.

disable_async_output_proc (bool, default False)
    Disable async output processing. This may result in lower performance.

hf_token (Optional[Union[bool, str]], default None)
    The token to use as HTTP bearer authorization for remote files. If True, uses the token generated when running huggingface-cli login (stored in ~/.huggingface).

hf_overrides (Optional[HfOverrides], default None)
    If a dictionary, contains arguments to be forwarded to the HuggingFace config. If a callable, it is called to update the HuggingFace config.

mm_processor_kwargs (Optional[dict[str, Any]], default None)
    Arguments to be forwarded to the model's processor for multi-modal data, e.g., the image processor. These override the multi-modal processor obtained from AutoProcessor.from_pretrained. The available overrides depend on the model being run. For example, for Phi-3-Vision: {"num_crops": 4}.

override_pooler_config (Optional[PoolerConfig], default None)
    Initialize a non-default pooling config or override the default pooling config for the pooling model, e.g. PoolerConfig(pooling_type="mean", normalize=False).

compilation_config (Optional[Union[int, dict[str, Any], CompilationConfig]], default None)
    Either an integer or a dictionary. If it is an integer, it is used as the level of compilation optimization. If it is a dictionary, it can specify the full compilation configuration.

**kwargs (default {})
    Arguments for EngineArgs.

Note

This class is intended to be used for offline inference. For online serving, use the AsyncLLMEngine class instead.
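
For example, a minimal offline-inference sketch (the model name, prompts, and sampling settings below are illustrative):

    from vllm import LLM, SamplingParams

    # Illustrative model and prompts; any supported generation model works here.
    llm = LLM(model="facebook/opt-125m")
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["Hello, my name is", "The capital of France is"],
                           sampling_params)
    for output in outputs:
        print(output.outputs[0].text)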

Source code in vllm/entrypoints/llm.py
class LLM:
    """An LLM for generating texts from given prompts and sampling parameters.

    This class includes a tokenizer, a language model (possibly distributed
    across multiple GPUs), and GPU memory space allocated for intermediate
    states (aka KV cache). Given a batch of prompts and sampling parameters,
    this class generates texts from the model, using an intelligent batching
    mechanism and efficient memory management.

    Args:
        model: The name or path of a HuggingFace Transformers model.
        tokenizer: The name or path of a HuggingFace Transformers tokenizer.
        tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
            if available, and "slow" will always use the slow tokenizer.
        skip_tokenizer_init: If true, skip initialization of tokenizer and
            detokenizer. Expect valid prompt_token_ids and None for prompt
            from the input.
        trust_remote_code: Trust remote code (e.g., from HuggingFace) when
            downloading the model and tokenizer.
        allowed_local_media_path: Allows API requests to read local images
            or videos from directories specified by the server file system.
            This is a security risk and should only be enabled in trusted
            environments.
        tensor_parallel_size: The number of GPUs to use for distributed
            execution with tensor parallelism.
        dtype: The data type for the model weights and activations. Currently,
            we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
            the `torch_dtype` attribute specified in the model config file.
            However, if the `torch_dtype` in the config is `float32`, we will
            use `float16` instead.
        quantization: The method used to quantize the model weights. Currently,
            we support "awq", "gptq", and "fp8" (experimental).
            If None, we first check the `quantization_config` attribute in the
            model config file. If that is None, we assume the model weights are
            not quantized and use `dtype` to determine the data type of
            the weights.
        revision: The specific model version to use. It can be a branch name,
            a tag name, or a commit id.
        tokenizer_revision: The specific tokenizer version to use. It can be a
            branch name, a tag name, or a commit id.
        seed: The seed to initialize the random number generator for sampling.
        gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
            reserve for the model weights, activations, and KV cache. Higher
            values will increase the KV cache size and thus improve the model's
            throughput. However, if the value is too high, it may cause out-of-
            memory (OOM) errors.
        swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
            This can be used for temporarily storing the states of the requests
            when their `best_of` sampling parameters are larger than 1. If all
            requests will have `best_of=1`, you can safely set this to 0.
            Note that `best_of` is only supported in V0. Otherwise, too small
            values may cause out-of-memory (OOM) errors.
        cpu_offload_gb: The size (GiB) of CPU memory to use for offloading
            the model weights. This virtually increases the GPU memory space
            you can use to hold the model weights, at the cost of CPU-GPU data
            transfer for every forward pass.
        enforce_eager: Whether to enforce eager execution. If True, we will
            disable CUDA graph and always execute the model in eager mode.
            If False, we will use CUDA graph and eager execution in hybrid.
        max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
            When a sequence has context length larger than this, we fall back
            to eager mode. Additionally for encoder-decoder models, if the
            sequence length of the encoder input is larger than this, we fall
            back to the eager mode.
        disable_custom_all_reduce: See
            [ParallelConfig][vllm.config.ParallelConfig].
        disable_async_output_proc: Disable async output processing.
            This may result in lower performance.
        hf_token: The token to use as HTTP bearer authorization for remote
            files. If `True`, will use the token generated when running
            `huggingface-cli login` (stored in `~/.huggingface`).
        hf_overrides: If a dictionary, contains arguments to be forwarded to the
            HuggingFace config. If a callable, it is called to update the
            HuggingFace config.
        mm_processor_kwargs: Arguments to be forwarded to the model's processor
            for multi-modal data, e.g., image processor. Overrides for the
            multi-modal processor obtained from `AutoProcessor.from_pretrained`.
            The available overrides depend on the model that is being run.
            For example, for Phi-3-Vision: `{"num_crops": 4}`.
        override_pooler_config: Initialize non-default pooling config or
            override default pooling config for the pooling model.
            e.g. `PoolerConfig(pooling_type="mean", normalize=False)`.
        compilation_config: Either an integer or a dictionary. If it is an
            integer, it is used as the level of compilation optimization. If it
            is a dictionary, it can specify the full compilation configuration.
        **kwargs: Arguments for [`EngineArgs`][vllm.EngineArgs].

    Note:
        This class is intended to be used for offline inference. For online
        serving, use the [AsyncLLMEngine][vllm.AsyncLLMEngine] class instead.
    """

    DEPRECATE_LEGACY: ClassVar[bool] = True
    """A flag to toggle whether to deprecate the legacy generate/encode API."""

    @classmethod
    @contextmanager
    def deprecate_legacy_api(cls):
        cls.DEPRECATE_LEGACY = True

        yield

        cls.DEPRECATE_LEGACY = False

    def __init__(
        self,
        model: str,
        *,
        task: TaskOption = "auto",
        tokenizer: Optional[str] = None,
        tokenizer_mode: TokenizerMode = "auto",
        skip_tokenizer_init: bool = False,
        trust_remote_code: bool = False,
        allowed_local_media_path: str = "",
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: Optional[QuantizationMethods] = None,
        revision: Optional[str] = None,
        tokenizer_revision: Optional[str] = None,
        seed: Optional[int] = None,
        gpu_memory_utilization: float = 0.9,
        swap_space: float = 4,
        cpu_offload_gb: float = 0,
        enforce_eager: bool = False,
        max_seq_len_to_capture: int = 8192,
        disable_custom_all_reduce: bool = False,
        disable_async_output_proc: bool = False,
        hf_token: Optional[Union[bool, str]] = None,
        hf_overrides: Optional[HfOverrides] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None,
        override_pooler_config: Optional[PoolerConfig] = None,
        compilation_config: Optional[Union[int, dict[str, Any],
                                           CompilationConfig]] = None,
        **kwargs,
    ) -> None:
        """LLM constructor."""

        if "disable_log_stats" not in kwargs:
            kwargs["disable_log_stats"] = True

        if "worker_cls" in kwargs:
            worker_cls = kwargs["worker_cls"]
            # if the worker_cls is not qualified string name,
            # we serialize it using cloudpickle to avoid pickling issues
            if isinstance(worker_cls, type):
                kwargs["worker_cls"] = cloudpickle.dumps(worker_cls)

        if "kv_transfer_config" in kwargs and isinstance(
                kwargs["kv_transfer_config"], dict):
            from vllm.config import KVTransferConfig
            raw_config_dict = kwargs["kv_transfer_config"]
            try:
                kwargs["kv_transfer_config"] = KVTransferConfig(
                    **raw_config_dict)
            except ValidationError as e:
                logger.error(
                    "Failed to convert 'kv_transfer_config' dict to "
                    "KVTransferConfig object. Dict: %s. Error: %s",
                    raw_config_dict, e)
                # Consider re-raising a more specific vLLM error or ValueError
                # to provide better context to the user.
                raise ValueError(
                    f"Invalid 'kv_transfer_config' provided: {e}") from e

        if hf_overrides is None:
            hf_overrides = {}

        if compilation_config is not None:
            if isinstance(compilation_config, int):
                compilation_config_instance = CompilationConfig(
                    level=compilation_config)
            elif isinstance(compilation_config, dict):
                predicate = lambda x: is_init_field(CompilationConfig, x[0])
                compilation_config_instance = CompilationConfig(
                    **dict(filter(predicate, compilation_config.items())))
            else:
                compilation_config_instance = compilation_config
        else:
            compilation_config_instance = CompilationConfig()

        engine_args = EngineArgs(
            model=model,
            task=task,
            tokenizer=tokenizer,
            tokenizer_mode=tokenizer_mode,
            skip_tokenizer_init=skip_tokenizer_init,
            trust_remote_code=trust_remote_code,
            allowed_local_media_path=allowed_local_media_path,
            tensor_parallel_size=tensor_parallel_size,
            dtype=dtype,
            quantization=quantization,
            revision=revision,
            tokenizer_revision=tokenizer_revision,
            seed=seed,
            gpu_memory_utilization=gpu_memory_utilization,
            swap_space=swap_space,
            cpu_offload_gb=cpu_offload_gb,
            enforce_eager=enforce_eager,
            max_seq_len_to_capture=max_seq_len_to_capture,
            disable_custom_all_reduce=disable_custom_all_reduce,
            disable_async_output_proc=disable_async_output_proc,
            hf_token=hf_token,
            hf_overrides=hf_overrides,
            mm_processor_kwargs=mm_processor_kwargs,
            override_pooler_config=override_pooler_config,
            compilation_config=compilation_config_instance,
            **kwargs,
        )

        # Create the Engine (autoselects V0 vs V1)
        self.llm_engine = LLMEngine.from_engine_args(
            engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
        self.engine_class = type(self.llm_engine)

        self.request_counter = Counter()
        self.default_sampling_params: Union[dict[str, Any], None] = None

    def get_tokenizer(
        self,
        lora_request: Optional[LoRARequest] = None,
    ) -> AnyTokenizer:
        return self.llm_engine.get_tokenizer_group().get_lora_tokenizer(
            lora_request)

    def set_tokenizer(self, tokenizer: AnyTokenizer) -> None:
        tokenizer_group = self.llm_engine.get_tokenizer_group()

        # While CachedTokenizer is dynamic, we have no choice but to
        # compare class names. A misjudgment can arise if a user-defined
        # tokenizer class name starts with 'Cached'.
        if tokenizer.__class__.__name__.startswith("Cached"):
            tokenizer_group.tokenizer = tokenizer
        else:
            tokenizer_group.tokenizer = get_cached_tokenizer(tokenizer)

    def get_default_sampling_params(self) -> SamplingParams:
        if self.default_sampling_params is None:
            self.default_sampling_params = (
                self.llm_engine.model_config.get_diff_sampling_param())
        if self.default_sampling_params:
            return SamplingParams.from_optional(**self.default_sampling_params)
        return SamplingParams()

    @overload
    def generate(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        /,
        sampling_params: Optional[Union[SamplingParams,
                                        Sequence[SamplingParams]]] = None,
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @overload  # LEGACY: single (prompt + optional token ids)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def generate(
        self,
        prompts: str,
        sampling_params: Optional[Union[SamplingParams,
                                        list[SamplingParams]]] = None,
        prompt_token_ids: Optional[list[int]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @overload  # LEGACY: multi (prompt + optional token ids)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def generate(
        self,
        prompts: list[str],
        sampling_params: Optional[Union[SamplingParams,
                                        list[SamplingParams]]] = None,
        prompt_token_ids: Optional[list[list[int]]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @overload  # LEGACY: single (token ids + optional prompt)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def generate(
        self,
        prompts: Optional[str] = None,
        sampling_params: Optional[Union[SamplingParams,
                                        list[SamplingParams]]] = None,
        *,
        prompt_token_ids: list[int],
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @overload  # LEGACY: multi (token ids + optional prompt)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def generate(
        self,
        prompts: Optional[list[str]] = None,
        sampling_params: Optional[Union[SamplingParams,
                                        list[SamplingParams]]] = None,
        *,
        prompt_token_ids: list[list[int]],
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @overload  # LEGACY: single or multi token ids [pos-only]
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def generate(
        self,
        prompts: None,
        sampling_params: None,
        prompt_token_ids: Union[list[int], list[list[int]]],
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
    ) -> list[RequestOutput]:
        ...

    @deprecate_kwargs(
        "prompt_token_ids",
        is_deprecated=lambda: LLM.DEPRECATE_LEGACY,
        additional_message="Please use the 'prompts' parameter instead.",
    )
    def generate(
        self,
        prompts: Union[Union[PromptType, Sequence[PromptType]],
                       Optional[Union[str, list[str]]]] = None,
        sampling_params: Optional[Union[SamplingParams,
                                        Sequence[SamplingParams]]] = None,
        prompt_token_ids: Optional[Union[list[int], list[list[int]]]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        guided_options_request: Optional[Union[LLMGuidedOptions,
                                               GuidedDecodingRequest]] = None,
        priority: Optional[list[int]] = None,
    ) -> list[RequestOutput]:
        """Generates the completions for the input prompts.

        This class automatically batches the given prompts, considering
        the memory constraint. For the best performance, put all of your prompts
        into a single list and pass it to this method.

        Args:
            prompts: The prompts to the LLM. You may pass a sequence of prompts
                for batch inference. See [PromptType][vllm.inputs.PromptType]
                for more details about the format of each prompts.
            sampling_params: The sampling parameters for text generation. If
                None, we use the default sampling parameters.
                When it is a single value, it is applied to every prompt.
                When it is a list, the list must have the same length as the
                prompts and it is paired one by one with the prompt.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            prompt_adapter_request: Prompt Adapter request to use for
                generation, if any.
            priority: The priority of the requests, if any.
                Only applicable when priority scheduling policy is enabled.

        Returns:
            A list of `RequestOutput` objects containing the
            generated completions in the same order as the input prompts.

        Note:
            Using `prompts` and `prompt_token_ids` as keyword parameters is
            considered legacy and may be deprecated in the future. You should
            instead pass them via the `inputs` parameter.
        """
        runner_type = self.llm_engine.model_config.runner_type
        if runner_type not in ["generate", "transcription"]:
            messages = [
                "LLM.generate() is only supported for (conditional) generation "
                "models (XForCausalLM, XForConditionalGeneration).",
            ]

            supported_runner_types = self.llm_engine.model_config \
                .supported_runner_types
            if "generate" in supported_runner_types:
                messages.append(
                    "Your model supports the 'generate' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task generate`.")

            raise ValueError(" ".join(messages))

        if prompt_token_ids is not None:
            parsed_prompts = self._convert_v1_inputs(
                prompts=cast(Optional[Union[str, list[str]]], prompts),
                prompt_token_ids=prompt_token_ids,
            )
        else:
            parsed_prompts = cast(Union[PromptType, Sequence[PromptType]],
                                  prompts)

        if isinstance(guided_options_request, dict):
            if len(guided_options_request) > 1:
                raise ValueError(
                    "You can only use one guided decoding but multiple is "
                    f"specified: {guided_options_request}")
            guided_options_request = GuidedDecodingRequest(
                **guided_options_request)

        if sampling_params is None:
            # Use default sampling params.
            sampling_params = self.get_default_sampling_params()

        tokenization_kwargs: dict[str, Any] = {}
        truncate_prompt_tokens = None
        if isinstance(sampling_params, SamplingParams):
            truncate_prompt_tokens = sampling_params.truncate_prompt_tokens
        _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                                  truncate_prompt_tokens, tokenization_kwargs)

        self._validate_and_add_requests(
            prompts=parsed_prompts,
            params=sampling_params,
            use_tqdm=use_tqdm,
            lora_request=lora_request,
            prompt_adapter_request=prompt_adapter_request,
            guided_options=guided_options_request,
            tokenization_kwargs=tokenization_kwargs,
            priority=priority,
        )

        outputs = self._run_engine(use_tqdm=use_tqdm)
        return self.engine_class.validate_outputs(outputs, RequestOutput)

    def collective_rpc(self,
                       method: Union[str, Callable[..., _R]],
                       timeout: Optional[float] = None,
                       args: tuple = (),
                       kwargs: Optional[dict[str, Any]] = None) -> list[_R]:
        """
        Execute an RPC call on all workers.

        Args:
            method: Name of the worker method to execute, or a callable that
                is serialized and sent to all workers to execute.

                If the method is a callable, it should accept an additional
                `self` argument, in addition to the arguments passed in `args`
                and `kwargs`. The `self` argument will be the worker object.
            timeout: Maximum time in seconds to wait for execution. Raises a
                [`TimeoutError`][] on timeout. `None` means wait indefinitely.
            args: Positional arguments to pass to the worker method.
            kwargs: Keyword arguments to pass to the worker method.

        Returns:
            A list containing the results from each worker.

        Note:
            It is recommended to use this API to only pass control messages,
            and set up data-plane communication to pass data.
        """

        return self.llm_engine.collective_rpc(method, timeout, args, kwargs)

    def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]:
        """
        Run a function directly on the model inside each worker,
        returning the result for each of them.
        """
        executor = self.llm_engine.model_executor
        return executor.apply_model(func)

    def _get_beam_search_lora_requests(
        self,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]],
        prompts: list[Union[TokensPrompt, TextPrompt]],
    ) -> list[Optional[LoRARequest]]:
        """Get the optional lora request corresponding to each prompt."""
        if isinstance(lora_request,
                      Sequence) and len(lora_request) != len(prompts):
            raise ValueError(
                "Lora request list should be the same length as the prompts")

        if lora_request is None or isinstance(lora_request, LoRARequest):
            return [lora_request] * len(prompts)

        raise TypeError(f"Invalid lora_request type {type(lora_request)}")

    def beam_search(
        self,
        prompts: list[Union[TokensPrompt, TextPrompt]],
        params: BeamSearchParams,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        use_tqdm: bool = False,
    ) -> list[BeamSearchOutput]:
        """
        Generate sequences using beam search.

        Args:
            prompts: A list of prompts. Each prompt can be a string or a list
                of token IDs.
            params: The beam search parameters.
            lora_request: LoRA request to use for generation, if any.
            use_tqdm: Whether to use tqdm to display the progress bar.
        """
        # TODO: how does beam search work together with length penalty,
        # frequency penalty, and stopping criteria, etc.?
        beam_width = params.beam_width
        max_tokens = params.max_tokens
        temperature = params.temperature
        ignore_eos = params.ignore_eos
        length_penalty = params.length_penalty

        lora_requests = self._get_beam_search_lora_requests(
            lora_request, prompts)

        tokenizer = self.get_tokenizer()
        sort_beams_key = create_sort_beams_key_function(
            tokenizer.eos_token_id,
            length_penalty,
        )

        def create_tokens_prompt_from_beam(
                beam: BeamSearchSequence) -> TokensPrompt:
            token_prompt_kwargs: TokensPrompt = {
                "prompt_token_ids": beam.tokens
            }
            if beam.multi_modal_data is not None:
                token_prompt_kwargs["multi_modal_data"] = beam.multi_modal_data

            if beam.mm_processor_kwargs is not None:
                token_prompt_kwargs[
                    "mm_processor_kwargs"] = beam.mm_processor_kwargs
            return TokensPrompt(**token_prompt_kwargs)

        # generate 2 * beam_width candidates at each step
        # following the huggingface transformers implementation
        # at https://github.com/huggingface/transformers/blob/e15687fffe5c9d20598a19aeab721ae0a7580f8a/src/transformers/generation/beam_search.py#L534 # noqa
        beam_search_params = SamplingParams(logprobs=2 * beam_width,
                                            max_tokens=1,
                                            temperature=temperature)
        instances: list[BeamSearchInstance] = []

        for lora_req, prompt in zip(lora_requests, prompts):
            # Add multimodal processor kwargs & data
            mm_kwargs = {}
            if "multi_modal_data" in prompt:
                mm_kwargs["multi_modal_data"] = prompt["multi_modal_data"]
            if "mm_processor_kwargs" in prompt:
                mm_kwargs["mm_processor_kwargs"] = prompt[
                    "mm_processor_kwargs"]

            if "prompt_token_ids" in prompt:
                prompt = cast(TokensPrompt, prompt)  # Needed for mypy
                prompt_tokens = prompt["prompt_token_ids"]
            else:
                prompt_tokens = tokenizer.encode(prompt["prompt"])

            instances.append(
                BeamSearchInstance(
                    prompt_tokens,
                    lora_request=lora_req,
                    logprobs=None,
                    **mm_kwargs,
                ), )

        token_iter = range(max_tokens)
        if use_tqdm:
            token_iter = tqdm(token_iter,
                              desc="Beam search",
                              unit="token",
                              unit_scale=False)
            logger.warning(
                "The progress bar shows the upper bound on token steps and "
                "may finish early due to stopping conditions. It does not "
                "reflect instance-level progress.")

        for _ in token_iter:
            all_beams: list[BeamSearchSequence] = list(
                sum((instance.beams for instance in instances), []))
            pos = [0] + list(
                itertools.accumulate(
                    len(instance.beams) for instance in instances))
            instance_start_and_end: list[tuple[int, int]] = list(
                zip(pos[:-1], pos[1:]))

            if len(all_beams) == 0:
                break

            # create the corresponding batch entries for prompt & optional lora
            prompts_batch, lora_req_batch = zip(
                *[(create_tokens_prompt_from_beam(beam), beam.lora_request)
                  for beam in all_beams])

            # only runs for one step
            # we don't need to use tqdm here
            output = self.generate(prompts_batch,
                                   sampling_params=beam_search_params,
                                   use_tqdm=False,
                                   lora_request=lora_req_batch)

            for (start, end), instance in zip(instance_start_and_end,
                                              instances):
                instance_new_beams = []
                for i in range(start, end):
                    current_beam = all_beams[i]
                    result = output[i]

                    if result.outputs[0].logprobs is not None:
                        # if `result.outputs[0].logprobs` is None, it means
                        # the sequence is completed because of the max-model-len
                        # or abortion. we don't need to add it to the new beams.
                        logprobs = result.outputs[0].logprobs[0]
                        for token_id, logprob_obj in logprobs.items():
                            new_beam = BeamSearchSequence(
                                tokens=current_beam.tokens + [token_id],
                                logprobs=current_beam.logprobs + [logprobs],
                                lora_request=current_beam.lora_request,
                                cum_logprob=current_beam.cum_logprob +
                                logprob_obj.logprob,
                                multi_modal_data=current_beam.multi_modal_data,
                                mm_processor_kwargs=current_beam.
                                mm_processor_kwargs)

                            if token_id == tokenizer.eos_token_id and \
                                not ignore_eos:
                                instance.completed.append(new_beam)
                            else:
                                instance_new_beams.append(new_beam)
                sorted_beams = sorted(instance_new_beams,
                                      key=sort_beams_key,
                                      reverse=True)
                instance.beams = sorted_beams[:beam_width]

        outputs = []
        for instance in instances:
            instance.completed.extend(instance.beams)
            sorted_completed = sorted(instance.completed,
                                      key=sort_beams_key,
                                      reverse=True)
            best_beams = sorted_completed[:beam_width]

            for beam in best_beams:
                beam.text = tokenizer.decode(beam.tokens)
            outputs.append(BeamSearchOutput(sequences=best_beams))

        return outputs

    def chat(
        self,
        messages: Union[list[ChatCompletionMessageParam],
                        list[list[ChatCompletionMessageParam]]],
        sampling_params: Optional[Union[SamplingParams,
                                        list[SamplingParams]]] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[LoRARequest] = None,
        chat_template: Optional[str] = None,
        chat_template_content_format: ChatTemplateContentFormatOption = "auto",
        add_generation_prompt: bool = True,
        continue_final_message: bool = False,
        tools: Optional[list[dict[str, Any]]] = None,
        chat_template_kwargs: Optional[dict[str, Any]] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None,
    ) -> list[RequestOutput]:
        """
        Generate responses for a chat conversation.

        The chat conversation is converted into a text prompt using the
        tokenizer and calls the [generate][] method to generate the
        responses.

        Multi-modal inputs can be passed in the same way you would pass them
        to the OpenAI API.

        Args:
            messages: A list of conversations or a single conversation.

                - Each conversation is represented as a list of messages.
                - Each message is a dictionary with 'role' and 'content' keys.

            sampling_params: The sampling parameters for text generation.
                If None, we use the default sampling parameters. When it
                is a single value, it is applied to every prompt. When it
                is a list, the list must have the same length as the
                prompts and it is paired one by one with the prompt.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            chat_template: The template to use for structuring the chat.
                If not provided, the model's default chat template will be used.
            chat_template_content_format: The format to render message content.

                - "string" will render the content as a string.
                  Example: `"Who are you?"`
                - "openai" will render the content as a list of dictionaries,
                  similar to OpenAI schema.
                  Example: `[{"type": "text", "text": "Who are you?"}]`

            add_generation_prompt: If True, adds a generation template
                to each message.
            continue_final_message: If True, continues the final message in
                the conversation instead of starting a new one. Cannot be
                `True` if `add_generation_prompt` is also `True`.
            chat_template_kwargs: Additional kwargs to pass to the chat
                template.
            mm_processor_kwargs: Multimodal processor kwarg overrides for this
                chat request. Only used for offline requests.

        Returns:
            A list of `RequestOutput` objects containing the generated
            responses in the same order as the input messages.
        """
        list_of_messages: list[list[ChatCompletionMessageParam]]

        # Handle multi and single conversations
        if is_list_of(messages, list):
            # messages is list[list[...]]
            list_of_messages = cast(list[list[ChatCompletionMessageParam]],
                                    messages)
        else:
            # messages is list[...]
            list_of_messages = [
                cast(list[ChatCompletionMessageParam], messages)
            ]

        tokenizer = self.get_tokenizer(lora_request)
        model_config = self.llm_engine.get_model_config()
        resolved_content_format = resolve_chat_template_content_format(
            chat_template,
            tools,
            chat_template_content_format,
            tokenizer,
            model_config=model_config,
        )

        _chat_template_kwargs: dict[str, Any] = dict(
            chat_template=chat_template,
            add_generation_prompt=add_generation_prompt,
            continue_final_message=continue_final_message,
            tools=tools,
        )
        _chat_template_kwargs.update(chat_template_kwargs or {})

        prompts: list[Union[TokensPrompt, TextPrompt]] = []

        for msgs in list_of_messages:
            # NOTE: _parse_chat_message_content_parts() currently doesn't
            # handle mm_processor_kwargs, since there is no implementation in
            # the chat message parsing for it.
            conversation, mm_data = parse_chat_messages(
                msgs,
                model_config,
                tokenizer,
                content_format=resolved_content_format,
            )

            if isinstance(tokenizer, MistralTokenizer):
                prompt_token_ids = apply_mistral_chat_template(
                    tokenizer,
                    messages=msgs,
                    **_chat_template_kwargs,
                )
            else:
                prompt_str = apply_hf_chat_template(
                    tokenizer=tokenizer,
                    conversation=conversation,
                    model_config=model_config,
                    **_chat_template_kwargs,
                )
                # Special tokens are already included in chat templates so
                # should not be added by the tokenizer in this case.
                prompt_token_ids = tokenizer.encode(prompt_str,
                                                    add_special_tokens=False)

            prompt = TokensPrompt(prompt_token_ids=prompt_token_ids)

            if mm_data is not None:
                prompt["multi_modal_data"] = mm_data

            if mm_processor_kwargs is not None:
                prompt["mm_processor_kwargs"] = mm_processor_kwargs

            prompts.append(prompt)

        return self.generate(
            prompts,
            sampling_params=sampling_params,
            use_tqdm=use_tqdm,
            lora_request=lora_request,
        )

    @overload
    def encode(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        /,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @overload  # LEGACY: single (prompt + optional token ids)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def encode(
        self,
        prompts: str,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        prompt_token_ids: Optional[list[int]] = None,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @overload  # LEGACY: multi (prompt + optional token ids)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def encode(
        self,
        prompts: list[str],
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        prompt_token_ids: Optional[list[list[int]]] = None,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @overload  # LEGACY: single (token ids + optional prompt)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def encode(
        self,
        prompts: Optional[str] = None,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        *,
        prompt_token_ids: list[int],
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @overload  # LEGACY: multi (token ids + optional prompt)
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def encode(
        self,
        prompts: Optional[list[str]] = None,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        *,
        prompt_token_ids: list[list[int]],
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @overload  # LEGACY: single or multi token ids [pos-only]
    @deprecated("'prompt_token_ids' will become part of 'prompts'")
    def encode(
        self,
        prompts: None,
        pooling_params: None,
        prompt_token_ids: Union[list[int], list[list[int]]],
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        ...

    @deprecate_kwargs(
        "prompt_token_ids",
        is_deprecated=lambda: LLM.DEPRECATE_LEGACY,
        additional_message="Please use the 'prompts' parameter instead.",
    )
    def encode(
        self,
        prompts: Union[Union[PromptType, Sequence[PromptType]],
                       Optional[Union[str, list[str]]]] = None,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        prompt_token_ids: Optional[Union[list[int], list[list[int]]]] = None,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[PoolingRequestOutput]:
        """Apply pooling to the hidden states corresponding to the input
        prompts.

        This class automatically batches the given prompts, considering
        the memory constraint. For the best performance, put all of your prompts
        into a single list and pass it to this method.

        Args:
            prompts: The prompts to the LLM. You may pass a sequence of prompts
                for batch inference. See [PromptType][vllm.inputs.PromptType]
                for more details about the format of each prompts.
            pooling_params: The pooling parameters for pooling. If None, we
                use the default pooling parameters.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            prompt_adapter_request: Prompt Adapter request to use for
                generation, if any.

        Returns:
            A list of `PoolingRequestOutput` objects containing the
            pooled hidden states in the same order as the input prompts.

        Note:
            Using `prompts` and `prompt_token_ids` as keyword parameters is
            considered legacy and may be deprecated in the future. You should
            instead pass them via the `inputs` parameter.
        """
        runner_type = self.llm_engine.model_config.runner_type
        if runner_type != "pooling":
            messages = ["LLM.encode() is only supported for pooling models."]

            supported_runner_types = self.llm_engine.model_config \
                .supported_runner_types
            if "pooling" in supported_runner_types:
                messages.append(
                    "Your model supports the 'pooling' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task embed`, "
                    "`--task classify`, `--task score` etc.")

            raise ValueError(" ".join(messages))

        if prompt_token_ids is not None:
            parsed_prompts = self._convert_v1_inputs(
                prompts=cast(Optional[Union[str, list[str]]], prompts),
                prompt_token_ids=prompt_token_ids,
            )
        else:
            parsed_prompts = cast(Union[PromptType, Sequence[PromptType]],
                                  prompts)

        if pooling_params is None:
            # Use default pooling params.
            pooling_params = PoolingParams()
        elif isinstance(pooling_params, PoolingParams):
            pooling_params.verify(self.llm_engine.model_config)
        else:
            for pooling_param in pooling_params:
                pooling_param.verify(self.llm_engine.model_config)

        tokenization_kwargs: dict[str, Any] = {}
        _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                                  truncate_prompt_tokens, tokenization_kwargs)

        self._validate_and_add_requests(
            prompts=parsed_prompts,
            params=pooling_params,
            use_tqdm=use_tqdm,
            lora_request=lora_request,
            tokenization_kwargs=tokenization_kwargs,
            prompt_adapter_request=prompt_adapter_request,
        )

        outputs = self._run_engine(use_tqdm=use_tqdm)
        return self.engine_class.validate_outputs(outputs,
                                                  PoolingRequestOutput)
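
    # Usage sketch: `encode` returns the pooled hidden states, one output per
    # prompt. The model name below is illustrative; any pooling-capable
    # checkpoint works.
    #
    #   llm = LLM(model="intfloat/e5-small-v2", task="embed")
    #   outputs = llm.encode(["Hello, my name is", "The future of AI is"])
    #   for out in outputs:
    #       print(out.outputs)  # pooled output for this prompt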

    def embed(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        /,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        pooling_params: Optional[Union[PoolingParams,
                                       Sequence[PoolingParams]]] = None,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[EmbeddingRequestOutput]:
        """
        Generate an embedding vector for each prompt.

        This method automatically batches the given prompts, considering
        the memory constraint. For the best performance, put all of your
        prompts into a single list and pass it to this method.

        Args:
            prompts: The prompts to the LLM. You may pass a sequence of prompts
                for batch inference. See [PromptType][vllm.inputs.PromptType]
                for more details about the format of each prompt.
            pooling_params: The pooling parameters for pooling. If None, we
                use the default pooling parameters.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            prompt_adapter_request: Prompt Adapter request to use for
                generation, if any.

        Returns:
            A list of `EmbeddingRequestOutput` objects containing the
            embedding vectors in the same order as the input prompts.
        """
        if self.llm_engine.model_config.task != "embed":
            raise ValueError(
                "Embedding API is only enabled for `--task embed`")

        items = self.encode(prompts,
                            truncate_prompt_tokens=truncate_prompt_tokens,
                            use_tqdm=use_tqdm,
                            pooling_params=pooling_params,
                            lora_request=lora_request,
                            prompt_adapter_request=prompt_adapter_request)

        return [EmbeddingRequestOutput.from_base(item) for item in items]
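
    # Usage sketch: `embed` wraps `encode` for models started with
    # `--task embed` and yields one embedding vector per prompt. The model
    # name below is illustrative.
    #
    #   llm = LLM(model="intfloat/e5-small-v2", task="embed")
    #   outs = llm.embed(["vLLM is fast", "Pooling models produce vectors"])
    #   vectors = [o.outputs.embedding for o in outs]  # list[float] per prompt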

    def classify(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        /,
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[ClassificationRequestOutput]:
        """
        Generate class logits for each prompt.

        This method automatically batches the given prompts, considering
        the memory constraint. For the best performance, put all of your
        prompts into a single list and pass it to this method.

        Args:
            prompts: The prompts to the LLM. You may pass a sequence of prompts
                for batch inference. See [PromptType][vllm.inputs.PromptType]
                for more details about the format of each prompt.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            prompt_adapter_request: Prompt Adapter request to use for
                generation, if any.

        Returns:
            A list of `ClassificationRequestOutput` objects containing the
            class logits in the same order as the input prompts.
        """
        if self.llm_engine.model_config.task != "classify":
            raise ValueError(
                "Classification API is only enabled for `--task classify`")

        items = self.encode(prompts,
                            use_tqdm=use_tqdm,
                            lora_request=lora_request,
                            prompt_adapter_request=prompt_adapter_request)

        return [ClassificationRequestOutput.from_base(item) for item in items]
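
    # Usage sketch: `classify` requires a sequence-classification checkpoint
    # started with `--task classify` and yields per-class outputs for each
    # prompt. The model name below is a placeholder.
    #
    #   llm = LLM(model="<a-sequence-classification-model>", task="classify")
    #   outs = llm.classify(["This movie was great!"])
    #   print(outs[0].outputs.probs)  # assumed per-class probabilities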

    def _embedding_score(
        self,
        tokenizer: AnyTokenizer,
        text_1: list[Union[str, TextPrompt, TokensPrompt]],
        text_2: list[Union[str, TextPrompt, TokensPrompt]],
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[ScoringRequestOutput]:

        encoded_output: list[PoolingRequestOutput] = self.encode(
            text_1 + text_2,
            truncate_prompt_tokens=truncate_prompt_tokens,
            use_tqdm=use_tqdm,
            lora_request=lora_request,
            prompt_adapter_request=prompt_adapter_request)

        encoded_output_1: list[PoolingRequestOutput] = encoded_output[
            0:len(text_1)]
        encoded_output_2: list[PoolingRequestOutput] = encoded_output[
            len(text_1):]

        if len(encoded_output_1) == 1:
            encoded_output_1 = encoded_output_1 * len(encoded_output_2)

        scores = _cosine_similarity(tokenizer=tokenizer,
                                    embed_1=encoded_output_1,
                                    embed_2=encoded_output_2)

        items = self.engine_class.validate_outputs(scores,
                                                   PoolingRequestOutput)
        return [ScoringRequestOutput.from_base(item) for item in items]

    def _cross_encoding_score(
        self,
        tokenizer: AnyTokenizer,
        text_1: list[str],
        text_2: list[str],
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[ScoringRequestOutput]:

        if isinstance(tokenizer, MistralTokenizer):
            raise ValueError(
                "Score API is only enabled for `--task embed or score`")

        if len(text_1) == 1:
            text_1 = text_1 * len(text_2)

        input_pairs = [(t1, t2) for t1, t2 in zip(text_1, text_2)]

        pooling_params = PoolingParams()

        tokenization_kwargs: dict[str, Any] = {}
        _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                                  truncate_prompt_tokens, tokenization_kwargs)

        parsed_prompts = []

        for q, t in input_pairs:
            prompt_inputs = tokenizer(text=q,
                                      text_pair=t,
                                      **tokenization_kwargs)
            engine_prompt = TokensPrompt(
                prompt_token_ids=prompt_inputs["input_ids"],
                token_type_ids=prompt_inputs.get("token_type_ids"))
            parsed_prompts.append(engine_prompt)

        self._validate_and_add_requests(
            prompts=parsed_prompts,
            params=pooling_params,
            use_tqdm=use_tqdm,
            lora_request=lora_request,
            prompt_adapter_request=prompt_adapter_request,
        )

        outputs = self._run_engine(use_tqdm=use_tqdm)
        items = self.engine_class.validate_outputs(outputs,
                                                   PoolingRequestOutput)

        return [ScoringRequestOutput.from_base(item) for item in items]

    def score(
        self,
        text_1: Union[SingletonPrompt, Sequence[SingletonPrompt]],
        text_2: Union[SingletonPrompt, Sequence[SingletonPrompt]],
        /,
        *,
        truncate_prompt_tokens: Optional[int] = None,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    ) -> list[ScoringRequestOutput]:
        """Generate similarity scores for all pairs `<text,text_pair>`.

        The inputs can be `1 -> 1`, `1 -> N` or `N -> N`.
        In the `1 -> N` case the `text_1` sentence will be replicated `N`
        times to pair with the `text_2` sentences.
        The input pairs are used to build a list of prompts for the
        cross encoder model. This method automatically batches the prompts,
        considering the memory constraint. For the best performance, put all
        of your texts into a single list and pass it to this method.

        Args:
            text_1: Can be a single prompt or a list of prompts, in which
                case it must have the same length as the `text_2` list.
            text_2: The texts to pair with the query to form the input
                to the LLM. See [PromptType][vllm.inputs.PromptType] for
                more details about the format of each prompt.
            use_tqdm: If `True`, shows a tqdm progress bar.
                If a callable (e.g., `functools.partial(tqdm, leave=False)`),
                it is used to create the progress bar.
                If `False`, no progress bar is created.
            lora_request: LoRA request to use for generation, if any.
            prompt_adapter_request: Prompt Adapter request to use for
                generation, if any.

        Returns:
            A list of `ScoringRequestOutput` objects containing the
            generated scores in the same order as the input prompts.
        """
        runner_type = self.llm_engine.model_config.runner_type
        if runner_type != "pooling":
            messages = ["LLM.score() is only supported for pooling models."]

            supported_runner_types = self.llm_engine.model_config \
                .supported_runner_types
            if "pooling" in supported_runner_types:
                messages.append(
                    "Your model supports the 'pooling' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task embed`, "
                    "`--task classify`, `--task score` etc.")

            raise ValueError(" ".join(messages))

        if self.llm_engine.model_config.task not in ("embed", "classify"):
            raise ValueError("Score API is only enabled for "
                             "`--task embed or --task classify`.")

        if (self.llm_engine.model_config.task == "classify"
                and self.llm_engine.model_config.hf_config.num_labels != 1):
            raise ValueError("Score API is only enabled for num_labels == 1.")

        # the tokenizer for models such as
        # "cross-encoder/ms-marco-MiniLM-L-6-v2" doesn't support passing
        # lists of tokens to the `text` and `text_pair` kwargs
        tokenizer = self.get_tokenizer()

        def ensure_str(prompt: SingletonPrompt):
            if isinstance(prompt, dict):
                if "multi_modal_data" in prompt:
                    raise ValueError("Multi-modal prompt is not "
                                     "supported for scoring")
                elif "prompt_token_ids" in prompt:
                    prompt = tokenizer.decode(
                        cast(TokensPrompt, prompt)["prompt_token_ids"])
                elif "prompt" in prompt:
                    prompt = cast(TextPrompt, prompt)["prompt"]
            assert type(prompt) is str
            return prompt

        if isinstance(text_1, (str, dict)):
            # Convert a single prompt to a list.
            text_1 = [text_1]
        input_text_1: list[str] = [ensure_str(t) for t in text_1]

        if isinstance(text_2, (str, dict)):
            # Convert a single prompt to a list.
            text_2 = [text_2]
        input_text_2: list[str] = [ensure_str(t) for t in text_2]

        _validate_score_input_lens(input_text_1, input_text_2)

        if self.llm_engine.model_config.is_cross_encoder:
            return self._cross_encoding_score(tokenizer, input_text_1,
                                              input_text_2,
                                              truncate_prompt_tokens, use_tqdm,
                                              lora_request,
                                              prompt_adapter_request)
        else:
            return self._embedding_score(
                tokenizer,
                input_text_1,  # type: ignore[arg-type]
                input_text_2,  # type: ignore[arg-type]
                truncate_prompt_tokens,
                use_tqdm,
                lora_request,
                prompt_adapter_request)
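
    # Usage sketch: `score` accepts 1 -> 1, 1 -> N or N -> N text pairs and
    # yields one similarity score per pair. Works with cross-encoder models
    # (e.g. "cross-encoder/ms-marco-MiniLM-L-6-v2") as well as embedding
    # models.
    #
    #   llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")
    #   outs = llm.score("What is the capital of France?",
    #                    ["Paris is the capital of France.",
    #                     "The Eiffel Tower is in Paris."])
    #   print([o.outputs.score for o in outs])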

    def start_profile(self) -> None:
        self.llm_engine.start_profile()

    def stop_profile(self) -> None:
        self.llm_engine.stop_profile()

    def reset_prefix_cache(self, device: Optional[Device] = None) -> bool:
        return self.llm_engine.reset_prefix_cache(device)

    def sleep(self, level: int = 1):
        """
        Put the engine to sleep. The engine should not process any requests.
        The caller should guarantee that no requests are being processed
        during the sleep period, before `wake_up` is called.

        Args:
            level: The sleep level.

                - Level 1 offloads the model weights to CPU memory and
                  discards the KV cache. It is suited to putting the engine
                  to sleep and waking it up to run the same model again;
                  make sure there is enough CPU memory to hold the weights.
                - Level 2 discards both the model weights and the KV cache.
                  It is suited to waking the engine up to run a different
                  model or updated weights, where the previous weights are
                  not needed, and it reduces CPU memory pressure.
        """
        self.reset_prefix_cache()
        self.llm_engine.sleep(level=level)

    def wake_up(self, tags: Optional[list[str]] = None):
        """
        Wake up the engine from sleep mode. See the [sleep][] method
        for more details.

        Args:
            tags: An optional list of tags to reallocate the engine memory
                for specific memory allocations. Values must be in
                `("weights", "kv_cache")`. If None, all memory is reallocated.
                wake_up should be called with all tags (or None) before the
                engine is used again.
        """
        self.llm_engine.wake_up(tags)
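
    # Usage sketch: a typical sleep/wake cycle. Level 1 keeps the weights in
    # CPU memory so the same model can resume quickly; tags allow waking the
    # allocations up in stages.
    #
    #   llm.sleep(level=1)
    #   ...  # GPU memory is available for other work here
    #   llm.wake_up(tags=["weights"])
    #   llm.wake_up(tags=["kv_cache"])
    #   outputs = llm.generate(prompts, sampling_params)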

    def get_metrics(self) -> list["Metric"]:
        """Return a snapshot of aggregated metrics from Prometheus.

        Returns:
            A list of ``Metric`` objects capturing the current state
            of all aggregated metrics from Prometheus.

        Note:
            This method is only available with the V1 LLM engine.
        """
        from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine
        assert isinstance(self.llm_engine, V1LLMEngine)
        return self.llm_engine.get_metrics()
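
    # Usage sketch (V1 engine only): inspect aggregated engine metrics after
    # running some requests.
    #
    #   outputs = llm.generate(prompts, sampling_params)
    #   for metric in llm.get_metrics():
    #       print(metric)  # counters, gauges, histograms, ...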

    # LEGACY
    def _convert_v1_inputs(
        self,
        prompts: Optional[Union[str, list[str]]],
        prompt_token_ids: Optional[Union[list[int], list[list[int]]]],
    ):
        # skip_tokenizer_init is now checked in engine

        if prompts is None and prompt_token_ids is None:
            raise ValueError(
                "Either prompts or prompt_token_ids must be provided.")
        if prompts is not None and prompt_token_ids is not None \
                and len(prompts) != len(prompt_token_ids):
            raise ValueError(
                "The lengths of prompts and prompt_token_ids must be the same."
            )

        if prompts is not None:
            prompts = [p["content"] for p in parse_and_batch_prompt(prompts)]
        if prompt_token_ids is not None:
            prompt_token_ids = [
                p["content"] for p in parse_and_batch_prompt(prompt_token_ids)
            ]
        if prompts is not None:
            num_requests = len(prompts)
        elif prompt_token_ids is not None:
            num_requests = len(prompt_token_ids)
        parsed_prompts: list[PromptType] = []
        for i in range(num_requests):
            item: PromptType

            if prompts is not None:
                item = TextPrompt(prompt=prompts[i])
            elif prompt_token_ids is not None:
                item = TokensPrompt(prompt_token_ids=prompt_token_ids[i])
            else:
                raise AssertionError

            parsed_prompts.append(item)

        return parsed_prompts

    def _validate_and_add_requests(
        self,
        prompts: Union[PromptType, Sequence[PromptType]],
        params: Union[SamplingParams, Sequence[SamplingParams], PoolingParams,
                      Sequence[PoolingParams]],
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[Sequence[LoRARequest], LoRARequest]],
        prompt_adapter_request: Optional[PromptAdapterRequest],
        tokenization_kwargs: Optional[dict[str, Any]] = None,
        guided_options: Optional[GuidedDecodingRequest] = None,
        priority: Optional[list[int]] = None,
    ) -> None:
        if guided_options is not None:
            warnings.warn(
                "guided_options_request is deprecated, use "
                "SamplingParams.guided_decoding instead",
                DeprecationWarning,
                stacklevel=2,
            )

        if isinstance(prompts, (str, dict)):
            # Convert a single prompt to a list.
            prompts = [prompts]

        num_requests = len(prompts)
        if isinstance(params, Sequence) and len(params) != num_requests:
            raise ValueError("The lengths of prompts and params "
                             "must be the same.")
        if isinstance(lora_request,
                      Sequence) and len(lora_request) != num_requests:
            raise ValueError("The lengths of prompts and lora_request "
                             "must be the same.")

        for sp in params if isinstance(params, Sequence) else (params, ):
            if isinstance(sp, SamplingParams):
                self._add_guided_params(sp, guided_options)

                # We only care about the final output
                sp.output_kind = RequestOutputKind.FINAL_ONLY

        # Add requests to the engine.
        it = prompts
        if use_tqdm:
            tqdm_func = use_tqdm if callable(use_tqdm) else tqdm
            it = tqdm_func(it, desc="Adding requests")

        for i, prompt in enumerate(it):
            self._add_request(
                prompt,
                params[i] if isinstance(params, Sequence) else params,
                tokenization_kwargs=tokenization_kwargs,
                lora_request=lora_request[i] if isinstance(
                    lora_request, Sequence) else lora_request,
                prompt_adapter_request=prompt_adapter_request,
                priority=priority[i] if priority else 0,
            )

    def _add_request(
        self,
        prompt: PromptType,
        params: Union[SamplingParams, PoolingParams],
        tokenization_kwargs: Optional[dict[str, Any]] = None,
        lora_request: Optional[LoRARequest] = None,
        prompt_adapter_request: Optional[PromptAdapterRequest] = None,
        priority: int = 0,
    ) -> None:
        request_id = str(next(self.request_counter))
        self.llm_engine.add_request(
            request_id,
            prompt,
            params,
            lora_request=lora_request,
            tokenization_kwargs=tokenization_kwargs,
            prompt_adapter_request=prompt_adapter_request,
            priority=priority,
        )

    def _add_guided_params(
            self,
            params: SamplingParams,
            guided_options: Optional[GuidedDecodingRequest] = None):
        if guided_options is None:
            return params

        if params.guided_decoding is not None:
            raise ValueError("Cannot set both guided_options_request and "
                             "params.guided_decoding.")

        params.guided_decoding = GuidedDecodingParams(
            json=guided_options.guided_json,
            regex=guided_options.guided_regex,
            choice=guided_options.guided_choice,
            grammar=guided_options.guided_grammar,
            json_object=guided_options.guided_json_object,
            backend=guided_options.guided_decoding_backend,
            whitespace_pattern=guided_options.guided_whitespace_pattern,
            structural_tag=guided_options.structural_tag,
        )
        return params

    def _run_engine(
        self,
        *,
        use_tqdm: Union[bool, Callable[..., tqdm]] = True
    ) -> list[Union[RequestOutput, PoolingRequestOutput]]:
        # Initialize tqdm.
        if use_tqdm:
            num_requests = self.llm_engine.get_num_unfinished_requests()
            tqdm_func = use_tqdm if callable(use_tqdm) else tqdm
            pbar = tqdm_func(
                total=num_requests,
                desc="Processed prompts",
                dynamic_ncols=True,
                postfix=(f"est. speed input: {0:.2f} toks/s, "
                         f"output: {0:.2f} toks/s"),
            )

        # Run the engine.
        outputs: list[Union[RequestOutput, PoolingRequestOutput]] = []
        total_in_toks = 0
        total_out_toks = 0
        while self.llm_engine.has_unfinished_requests():
            step_outputs = self.llm_engine.step()
            for output in step_outputs:
                if output.finished:
                    outputs.append(output)
                    if use_tqdm:
                        if isinstance(output, RequestOutput):
                            # Calculate tokens only for RequestOutput
                            n = len(output.outputs)
                            assert output.prompt_token_ids is not None
                            total_in_toks += len(output.prompt_token_ids) * n
                            in_spd = total_in_toks / pbar.format_dict["elapsed"]
                            total_out_toks += sum(
                                len(stp.token_ids) for stp in output.outputs)
                            out_spd = (total_out_toks /
                                       pbar.format_dict["elapsed"])
                            pbar.postfix = (
                                f"est. speed input: {in_spd:.2f} toks/s, "
                                f"output: {out_spd:.2f} toks/s")
                            pbar.update(n)
                        else:
                            pbar.update(1)
                        if pbar.n == num_requests:
                            pbar.refresh()

        if use_tqdm:
            pbar.close()
        # Sort the outputs by request ID.
        # This is necessary because some requests may finish earlier than
        # requests that were submitted before them.
        return sorted(outputs, key=lambda x: int(x.request_id))

DEPRECATE_LEGACY class-attribute

DEPRECATE_LEGACY: bool = True

A flag to toggle whether to deprecate the legacy generate/encode API.
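
For example, a script that still relies on the legacy `prompt_token_ids` keyword can silence the deprecation warning by flipping this class-level flag (a minimal sketch; the legacy code path itself is unchanged):

from vllm import LLM

LLM.DEPRECATE_LEGACY = False  # suppress warnings for the legacy prompt_token_ids API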

default_sampling_params instance-attribute

default_sampling_params: Union[dict[str, Any], None] = None

engine_class instance-attribute

engine_class = type(llm_engine)

llm_engine instance-attribute

llm_engine = from_engine_args(
    engine_args=engine_args, usage_context=LLM_CLASS
)

request_counter instance-attribute

request_counter = Counter()

__init__

__init__(
    model: str,
    *,
    task: TaskOption = "auto",
    tokenizer: Optional[str] = None,
    tokenizer_mode: TokenizerMode = "auto",
    skip_tokenizer_init: bool = False,
    trust_remote_code: bool = False,
    allowed_local_media_path: str = "",
    tensor_parallel_size: int = 1,
    dtype: ModelDType = "auto",
    quantization: Optional[QuantizationMethods] = None,
    revision: Optional[str] = None,
    tokenizer_revision: Optional[str] = None,
    seed: Optional[int] = None,
    gpu_memory_utilization: float = 0.9,
    swap_space: float = 4,
    cpu_offload_gb: float = 0,
    enforce_eager: bool = False,
    max_seq_len_to_capture: int = 8192,
    disable_custom_all_reduce: bool = False,
    disable_async_output_proc: bool = False,
    hf_token: Optional[Union[bool, str]] = None,
    hf_overrides: Optional[HfOverrides] = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
    override_pooler_config: Optional[PoolerConfig] = None,
    compilation_config: Optional[
        Union[int, dict[str, Any], CompilationConfig]
    ] = None,
    **kwargs,
) -> None

LLM constructor.
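
A minimal construction example (the model name and settings are illustrative; any checkpoint supported by vLLM works):

from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # small model, convenient for a smoke test
    gpu_memory_utilization=0.9,
    enforce_eager=True,          # skip CUDA graph capture for faster startup
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)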

Source code in vllm/entrypoints/llm.py
def __init__(
    self,
    model: str,
    *,
    task: TaskOption = "auto",
    tokenizer: Optional[str] = None,
    tokenizer_mode: TokenizerMode = "auto",
    skip_tokenizer_init: bool = False,
    trust_remote_code: bool = False,
    allowed_local_media_path: str = "",
    tensor_parallel_size: int = 1,
    dtype: ModelDType = "auto",
    quantization: Optional[QuantizationMethods] = None,
    revision: Optional[str] = None,
    tokenizer_revision: Optional[str] = None,
    seed: Optional[int] = None,
    gpu_memory_utilization: float = 0.9,
    swap_space: float = 4,
    cpu_offload_gb: float = 0,
    enforce_eager: bool = False,
    max_seq_len_to_capture: int = 8192,
    disable_custom_all_reduce: bool = False,
    disable_async_output_proc: bool = False,
    hf_token: Optional[Union[bool, str]] = None,
    hf_overrides: Optional[HfOverrides] = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
    override_pooler_config: Optional[PoolerConfig] = None,
    compilation_config: Optional[Union[int, dict[str, Any],
                                       CompilationConfig]] = None,
    **kwargs,
) -> None:
    """LLM constructor."""

    if "disable_log_stats" not in kwargs:
        kwargs["disable_log_stats"] = True

    if "worker_cls" in kwargs:
        worker_cls = kwargs["worker_cls"]
        # if the worker_cls is not qualified string name,
        # we serialize it using cloudpickle to avoid pickling issues
        if isinstance(worker_cls, type):
            kwargs["worker_cls"] = cloudpickle.dumps(worker_cls)

    if "kv_transfer_config" in kwargs and isinstance(
            kwargs["kv_transfer_config"], dict):
        from vllm.config import KVTransferConfig
        raw_config_dict = kwargs["kv_transfer_config"]
        try:
            kwargs["kv_transfer_config"] = KVTransferConfig(
                **raw_config_dict)
        except ValidationError as e:
            logger.error(
                "Failed to convert 'kv_transfer_config' dict to "
                "KVTransferConfig object. Dict: %s. Error: %s",
                raw_config_dict, e)
            # Consider re-raising a more specific vLLM error or ValueError
            # to provide better context to the user.
            raise ValueError(
                f"Invalid 'kv_transfer_config' provided: {e}") from e

    if hf_overrides is None:
        hf_overrides = {}

    if compilation_config is not None:
        if isinstance(compilation_config, int):
            compilation_config_instance = CompilationConfig(
                level=compilation_config)
        elif isinstance(compilation_config, dict):
            predicate = lambda x: is_init_field(CompilationConfig, x[0])
            compilation_config_instance = CompilationConfig(
                **dict(filter(predicate, compilation_config.items())))
        else:
            compilation_config_instance = compilation_config
    else:
        compilation_config_instance = CompilationConfig()

    engine_args = EngineArgs(
        model=model,
        task=task,
        tokenizer=tokenizer,
        tokenizer_mode=tokenizer_mode,
        skip_tokenizer_init=skip_tokenizer_init,
        trust_remote_code=trust_remote_code,
        allowed_local_media_path=allowed_local_media_path,
        tensor_parallel_size=tensor_parallel_size,
        dtype=dtype,
        quantization=quantization,
        revision=revision,
        tokenizer_revision=tokenizer_revision,
        seed=seed,
        gpu_memory_utilization=gpu_memory_utilization,
        swap_space=swap_space,
        cpu_offload_gb=cpu_offload_gb,
        enforce_eager=enforce_eager,
        max_seq_len_to_capture=max_seq_len_to_capture,
        disable_custom_all_reduce=disable_custom_all_reduce,
        disable_async_output_proc=disable_async_output_proc,
        hf_token=hf_token,
        hf_overrides=hf_overrides,
        mm_processor_kwargs=mm_processor_kwargs,
        override_pooler_config=override_pooler_config,
        compilation_config=compilation_config_instance,
        **kwargs,
    )

    # Create the Engine (autoselects V0 vs V1)
    self.llm_engine = LLMEngine.from_engine_args(
        engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
    self.engine_class = type(self.llm_engine)

    self.request_counter = Counter()
    self.default_sampling_params: Union[dict[str, Any], None] = None

_add_guided_params

_add_guided_params(
    params: SamplingParams,
    guided_options: Optional[GuidedDecodingRequest] = None,
)
Source code in vllm/entrypoints/llm.py
def _add_guided_params(
        self,
        params: SamplingParams,
        guided_options: Optional[GuidedDecodingRequest] = None):
    if guided_options is None:
        return params

    if params.guided_decoding is not None:
        raise ValueError("Cannot set both guided_options_request and "
                         "params.guided_decoding.")

    params.guided_decoding = GuidedDecodingParams(
        json=guided_options.guided_json,
        regex=guided_options.guided_regex,
        choice=guided_options.guided_choice,
        grammar=guided_options.guided_grammar,
        json_object=guided_options.guided_json_object,
        backend=guided_options.guided_decoding_backend,
        whitespace_pattern=guided_options.guided_whitespace_pattern,
        structural_tag=guided_options.structural_tag,
    )
    return params

_add_request

_add_request(
    prompt: PromptType,
    params: Union[SamplingParams, PoolingParams],
    tokenization_kwargs: Optional[dict[str, Any]] = None,
    lora_request: Optional[LoRARequest] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    priority: int = 0,
) -> None
Source code in vllm/entrypoints/llm.py
def _add_request(
    self,
    prompt: PromptType,
    params: Union[SamplingParams, PoolingParams],
    tokenization_kwargs: Optional[dict[str, Any]] = None,
    lora_request: Optional[LoRARequest] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    priority: int = 0,
) -> None:
    request_id = str(next(self.request_counter))
    self.llm_engine.add_request(
        request_id,
        prompt,
        params,
        lora_request=lora_request,
        tokenization_kwargs=tokenization_kwargs,
        prompt_adapter_request=prompt_adapter_request,
        priority=priority,
    )

_convert_v1_inputs

_convert_v1_inputs(
    prompts: Optional[Union[str, list[str]]],
    prompt_token_ids: Optional[
        Union[list[int], list[list[int]]]
    ],
)
Source code in vllm/entrypoints/llm.py
def _convert_v1_inputs(
    self,
    prompts: Optional[Union[str, list[str]]],
    prompt_token_ids: Optional[Union[list[int], list[list[int]]]],
):
    # skip_tokenizer_init is now checked in engine

    if prompts is None and prompt_token_ids is None:
        raise ValueError(
            "Either prompts or prompt_token_ids must be provided.")
    if prompts is not None and prompt_token_ids is not None \
            and len(prompts) != len(prompt_token_ids):
        raise ValueError(
            "The lengths of prompts and prompt_token_ids must be the same."
        )

    if prompts is not None:
        prompts = [p["content"] for p in parse_and_batch_prompt(prompts)]
    if prompt_token_ids is not None:
        prompt_token_ids = [
            p["content"] for p in parse_and_batch_prompt(prompt_token_ids)
        ]
    if prompts is not None:
        num_requests = len(prompts)
    elif prompt_token_ids is not None:
        num_requests = len(prompt_token_ids)
    parsed_prompts: list[PromptType] = []
    for i in range(num_requests):
        item: PromptType

        if prompts is not None:
            item = TextPrompt(prompt=prompts[i])
        elif prompt_token_ids is not None:
            item = TokensPrompt(prompt_token_ids=prompt_token_ids[i])
        else:
            raise AssertionError

        parsed_prompts.append(item)

    return parsed_prompts

_cross_encoding_score

_cross_encoding_score(
    tokenizer: AnyTokenizer,
    text_1: list[str],
    text_2: list[str],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[ScoringRequestOutput]
Source code in vllm/entrypoints/llm.py
def _cross_encoding_score(
    self,
    tokenizer: AnyTokenizer,
    text_1: list[str],
    text_2: list[str],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[ScoringRequestOutput]:

    if isinstance(tokenizer, MistralTokenizer):
        raise ValueError(
            "Score API is only enabled for `--task embed or score`")

    if len(text_1) == 1:
        text_1 = text_1 * len(text_2)

    input_pairs = [(t1, t2) for t1, t2 in zip(text_1, text_2)]

    pooling_params = PoolingParams()

    tokenization_kwargs: dict[str, Any] = {}
    _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                              truncate_prompt_tokens, tokenization_kwargs)

    parsed_prompts = []

    for q, t in input_pairs:
        prompt_inputs = tokenizer(text=q,
                                  text_pair=t,
                                  **tokenization_kwargs)
        engine_prompt = TokensPrompt(
            prompt_token_ids=prompt_inputs["input_ids"],
            token_type_ids=prompt_inputs.get("token_type_ids"))
        parsed_prompts.append(engine_prompt)

    self._validate_and_add_requests(
        prompts=parsed_prompts,
        params=pooling_params,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
        prompt_adapter_request=prompt_adapter_request,
    )

    outputs = self._run_engine(use_tqdm=use_tqdm)
    items = self.engine_class.validate_outputs(outputs,
                                               PoolingRequestOutput)

    return [ScoringRequestOutput.from_base(item) for item in items]

_embedding_score

_embedding_score(
    tokenizer: AnyTokenizer,
    text_1: list[Union[str, TextPrompt, TokensPrompt]],
    text_2: list[Union[str, TextPrompt, TokensPrompt]],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[ScoringRequestOutput]
Source code in vllm/entrypoints/llm.py
def _embedding_score(
    self,
    tokenizer: AnyTokenizer,
    text_1: list[Union[str, TextPrompt, TokensPrompt]],
    text_2: list[Union[str, TextPrompt, TokensPrompt]],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[ScoringRequestOutput]:

    encoded_output: list[PoolingRequestOutput] = self.encode(
        text_1 + text_2,
        truncate_prompt_tokens=truncate_prompt_tokens,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
        prompt_adapter_request=prompt_adapter_request)

    encoded_output_1: list[PoolingRequestOutput] = encoded_output[
        0:len(text_1)]
    encoded_output_2: list[PoolingRequestOutput] = encoded_output[
        len(text_1):]

    if len(encoded_output_1) == 1:
        encoded_output_1 = encoded_output_1 * len(encoded_output_2)

    scores = _cosine_similarity(tokenizer=tokenizer,
                                embed_1=encoded_output_1,
                                embed_2=encoded_output_2)

    items = self.engine_class.validate_outputs(scores,
                                               PoolingRequestOutput)
    return [ScoringRequestOutput.from_base(item) for item in items]

_get_beam_search_lora_requests

_get_beam_search_lora_requests(
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ],
    prompts: list[Union[TokensPrompt, TextPrompt]],
) -> list[Optional[LoRARequest]]

Get the optional lora request corresponding to each prompt.

Source code in vllm/entrypoints/llm.py
def _get_beam_search_lora_requests(
    self,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]],
    prompts: list[Union[TokensPrompt, TextPrompt]],
) -> list[Optional[LoRARequest]]:
    """Get the optional lora request corresponding to each prompt."""
    if isinstance(lora_request,
                  Sequence) and len(lora_request) != len(prompts):
        raise ValueError(
            "Lora request list should be the same length as the prompts")

    if lora_request is None or isinstance(lora_request, LoRARequest):
        return [lora_request] * len(prompts)

    raise TypeError(f"Invalid lora_request type {type(lora_request)}")

_run_engine

_run_engine(
    *, use_tqdm: Union[bool, Callable[..., tqdm]] = True
) -> list[Union[RequestOutput, PoolingRequestOutput]]
Source code in vllm/entrypoints/llm.py
def _run_engine(
    self,
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True
) -> list[Union[RequestOutput, PoolingRequestOutput]]:
    # Initialize tqdm.
    if use_tqdm:
        num_requests = self.llm_engine.get_num_unfinished_requests()
        tqdm_func = use_tqdm if callable(use_tqdm) else tqdm
        pbar = tqdm_func(
            total=num_requests,
            desc="Processed prompts",
            dynamic_ncols=True,
            postfix=(f"est. speed input: {0:.2f} toks/s, "
                     f"output: {0:.2f} toks/s"),
        )

    # Run the engine.
    outputs: list[Union[RequestOutput, PoolingRequestOutput]] = []
    total_in_toks = 0
    total_out_toks = 0
    while self.llm_engine.has_unfinished_requests():
        step_outputs = self.llm_engine.step()
        for output in step_outputs:
            if output.finished:
                outputs.append(output)
                if use_tqdm:
                    if isinstance(output, RequestOutput):
                        # Calculate tokens only for RequestOutput
                        n = len(output.outputs)
                        assert output.prompt_token_ids is not None
                        total_in_toks += len(output.prompt_token_ids) * n
                        in_spd = total_in_toks / pbar.format_dict["elapsed"]
                        total_out_toks += sum(
                            len(stp.token_ids) for stp in output.outputs)
                        out_spd = (total_out_toks /
                                   pbar.format_dict["elapsed"])
                        pbar.postfix = (
                            f"est. speed input: {in_spd:.2f} toks/s, "
                            f"output: {out_spd:.2f} toks/s")
                        pbar.update(n)
                    else:
                        pbar.update(1)
                    if pbar.n == num_requests:
                        pbar.refresh()

    if use_tqdm:
        pbar.close()
    # Sort the outputs by request ID.
    # This is necessary because some requests may finish earlier than
    # requests that were submitted before them.
    return sorted(outputs, key=lambda x: int(x.request_id))

_validate_and_add_requests

_validate_and_add_requests(
    prompts: Union[PromptType, Sequence[PromptType]],
    params: Union[
        SamplingParams,
        Sequence[SamplingParams],
        PoolingParams,
        Sequence[PoolingParams],
    ],
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[Sequence[LoRARequest], LoRARequest]
    ],
    prompt_adapter_request: Optional[PromptAdapterRequest],
    tokenization_kwargs: Optional[dict[str, Any]] = None,
    guided_options: Optional[GuidedDecodingRequest] = None,
    priority: Optional[list[int]] = None,
) -> None
Source code in vllm/entrypoints/llm.py
def _validate_and_add_requests(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    params: Union[SamplingParams, Sequence[SamplingParams], PoolingParams,
                  Sequence[PoolingParams]],
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[Sequence[LoRARequest], LoRARequest]],
    prompt_adapter_request: Optional[PromptAdapterRequest],
    tokenization_kwargs: Optional[dict[str, Any]] = None,
    guided_options: Optional[GuidedDecodingRequest] = None,
    priority: Optional[list[int]] = None,
) -> None:
    if guided_options is not None:
        warnings.warn(
            "guided_options_request is deprecated, use "
            "SamplingParams.guided_decoding instead",
            DeprecationWarning,
            stacklevel=2,
        )

    if isinstance(prompts, (str, dict)):
        # Convert a single prompt to a list.
        prompts = [prompts]

    num_requests = len(prompts)
    if isinstance(params, Sequence) and len(params) != num_requests:
        raise ValueError("The lengths of prompts and params "
                         "must be the same.")
    if isinstance(lora_request,
                  Sequence) and len(lora_request) != num_requests:
        raise ValueError("The lengths of prompts and lora_request "
                         "must be the same.")

    for sp in params if isinstance(params, Sequence) else (params, ):
        if isinstance(sp, SamplingParams):
            self._add_guided_params(sp, guided_options)

            # We only care about the final output
            sp.output_kind = RequestOutputKind.FINAL_ONLY

    # Add requests to the engine.
    it = prompts
    if use_tqdm:
        tqdm_func = use_tqdm if callable(use_tqdm) else tqdm
        it = tqdm_func(it, desc="Adding requests")

    for i, prompt in enumerate(it):
        self._add_request(
            prompt,
            params[i] if isinstance(params, Sequence) else params,
            tokenization_kwargs=tokenization_kwargs,
            lora_request=lora_request[i] if isinstance(
                lora_request, Sequence) else lora_request,
            prompt_adapter_request=prompt_adapter_request,
            priority=priority[i] if priority else 0,
        )

apply_model

apply_model(func: Callable[[Module], _R]) -> list[_R]

Run a function directly on the model inside each worker, returning the result for each of them.
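
For example, a callable can be run on each worker's copy of the model to gather per-worker information (a minimal sketch; `llm` is an already-constructed LLM instance):

# Count parameters on every worker; the callable receives the worker's nn.Module.
param_counts = llm.apply_model(lambda m: sum(p.numel() for p in m.parameters()))
print(param_counts)  # one entry per worker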

Source code in vllm/entrypoints/llm.py
def apply_model(self, func: Callable[[nn.Module], _R]) -> list[_R]:
    """
    Run a function directly on the model inside each worker,
    returning the result for each of them.
    """
    executor = self.llm_engine.model_executor
    return executor.apply_model(func)
beam_search

beam_search(
    prompts: list[Union[TokensPrompt, TextPrompt]],
    params: BeamSearchParams,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    use_tqdm: bool = False,
) -> list[BeamSearchOutput]

Generate sequences using beam search.

Parameters:

Name Type Description Default
prompts list[Union[TokensPrompt, TextPrompt]]

A list of prompts. Each prompt can be a string or a list of token IDs.

required
params BeamSearchParams

The beam search parameters.

required
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
use_tqdm bool

Whether to use tqdm to display the progress bar.

False
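
A minimal usage sketch (the model, prompt, and parameter values are illustrative):

from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")
params = BeamSearchParams(beam_width=4, max_tokens=32)
outputs = llm.beam_search([{"prompt": "The capital of France is"}], params)
for seq in outputs[0].sequences:
    print(seq.text)
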
Source code in vllm/entrypoints/llm.py
def beam_search(
    self,
    prompts: list[Union[TokensPrompt, TextPrompt]],
    params: BeamSearchParams,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    use_tqdm: bool = False,
) -> list[BeamSearchOutput]:
    """
    Generate sequences using beam search.

    Args:
        prompts: A list of prompts. Each prompt can be a string or a list
            of token IDs.
        params: The beam search parameters.
        lora_request: LoRA request to use for generation, if any.
        use_tqdm: Whether to use tqdm to display the progress bar.
    """
    # TODO: how does beam search work together with length penalty,
    # frequency penalty, and stopping criteria, etc.?
    beam_width = params.beam_width
    max_tokens = params.max_tokens
    temperature = params.temperature
    ignore_eos = params.ignore_eos
    length_penalty = params.length_penalty

    lora_requests = self._get_beam_search_lora_requests(
        lora_request, prompts)

    tokenizer = self.get_tokenizer()
    sort_beams_key = create_sort_beams_key_function(
        tokenizer.eos_token_id,
        length_penalty,
    )

    def create_tokens_prompt_from_beam(
            beam: BeamSearchSequence) -> TokensPrompt:
        token_prompt_kwargs: TokensPrompt = {
            "prompt_token_ids": beam.tokens
        }
        if beam.multi_modal_data is not None:
            token_prompt_kwargs["multi_modal_data"] = beam.multi_modal_data

        if beam.mm_processor_kwargs is not None:
            token_prompt_kwargs[
                "mm_processor_kwargs"] = beam.mm_processor_kwargs
        return TokensPrompt(**token_prompt_kwargs)

    # generate 2 * beam_width candidates at each step
    # following the huggingface transformers implementation
    # at https://github.com/huggingface/transformers/blob/e15687fffe5c9d20598a19aeab721ae0a7580f8a/src/transformers/generation/beam_search.py#L534 # noqa
    beam_search_params = SamplingParams(logprobs=2 * beam_width,
                                        max_tokens=1,
                                        temperature=temperature)
    instances: list[BeamSearchInstance] = []

    for lora_req, prompt in zip(lora_requests, prompts):
        # Add multimodal processor kwargs & data
        mm_kwargs = {}
        if "multi_modal_data" in prompt:
            mm_kwargs["multi_modal_data"] = prompt["multi_modal_data"]
        if "mm_processor_kwargs" in prompt:
            mm_kwargs["mm_processor_kwargs"] = prompt[
                "mm_processor_kwargs"]

        if "prompt_token_ids" in prompt:
            prompt = cast(TokensPrompt, prompt)  # Needed for mypy
            prompt_tokens = prompt["prompt_token_ids"]
        else:
            prompt_tokens = tokenizer.encode(prompt["prompt"])

        instances.append(
            BeamSearchInstance(
                prompt_tokens,
                lora_request=lora_req,
                logprobs=None,
                **mm_kwargs,
            ), )

    token_iter = range(max_tokens)
    if use_tqdm:
        token_iter = tqdm(token_iter,
                          desc="Beam search",
                          unit="token",
                          unit_scale=False)
        logger.warning(
            "The progress bar shows the upper bound on token steps and "
            "may finish early due to stopping conditions. It does not "
            "reflect instance-level progress.")

    for _ in token_iter:
        all_beams: list[BeamSearchSequence] = list(
            sum((instance.beams for instance in instances), []))
        pos = [0] + list(
            itertools.accumulate(
                len(instance.beams) for instance in instances))
        instance_start_and_end: list[tuple[int, int]] = list(
            zip(pos[:-1], pos[1:]))

        if len(all_beams) == 0:
            break

        # create the corresponding batch entries for prompt & optional lora
        prompts_batch, lora_req_batch = zip(
            *[(create_tokens_prompt_from_beam(beam), beam.lora_request)
              for beam in all_beams])

        # only runs for one step
        # we don't need to use tqdm here
        output = self.generate(prompts_batch,
                               sampling_params=beam_search_params,
                               use_tqdm=False,
                               lora_request=lora_req_batch)

        for (start, end), instance in zip(instance_start_and_end,
                                          instances):
            instance_new_beams = []
            for i in range(start, end):
                current_beam = all_beams[i]
                result = output[i]

                if result.outputs[0].logprobs is not None:
                    # If `result.outputs[0].logprobs` is None, the sequence
                    # finished because of max-model-len or was aborted, so we
                    # don't add it to the new beams.
                    logprobs = result.outputs[0].logprobs[0]
                    for token_id, logprob_obj in logprobs.items():
                        new_beam = BeamSearchSequence(
                            tokens=current_beam.tokens + [token_id],
                            logprobs=current_beam.logprobs + [logprobs],
                            lora_request=current_beam.lora_request,
                            cum_logprob=current_beam.cum_logprob +
                            logprob_obj.logprob,
                            multi_modal_data=current_beam.multi_modal_data,
                            mm_processor_kwargs=current_beam.
                            mm_processor_kwargs)

                        if token_id == tokenizer.eos_token_id and \
                            not ignore_eos:
                            instance.completed.append(new_beam)
                        else:
                            instance_new_beams.append(new_beam)
            sorted_beams = sorted(instance_new_beams,
                                  key=sort_beams_key,
                                  reverse=True)
            instance.beams = sorted_beams[:beam_width]

    outputs = []
    for instance in instances:
        instance.completed.extend(instance.beams)
        sorted_completed = sorted(instance.completed,
                                  key=sort_beams_key,
                                  reverse=True)
        best_beams = sorted_completed[:beam_width]

        for beam in best_beams:
            beam.text = tokenizer.decode(beam.tokens)
        outputs.append(BeamSearchOutput(sequences=best_beams))

    return outputs

chat

chat(
    messages: Union[
        list[ChatCompletionMessageParam],
        list[list[ChatCompletionMessageParam]],
    ],
    sampling_params: Optional[
        Union[SamplingParams, list[SamplingParams]]
    ] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[LoRARequest] = None,
    chat_template: Optional[str] = None,
    chat_template_content_format: ChatTemplateContentFormatOption = "auto",
    add_generation_prompt: bool = True,
    continue_final_message: bool = False,
    tools: Optional[list[dict[str, Any]]] = None,
    chat_template_kwargs: Optional[dict[str, Any]] = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
) -> list[RequestOutput]

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer, and the [generate][] method is called to generate the responses.

Multi-modal inputs can be passed in the same way you would pass them to the OpenAI API.

Parameters:

Name Type Description Default
messages Union[list[ChatCompletionMessageParam], list[list[ChatCompletionMessageParam]]]

A list of conversations or a single conversation.

  • Each conversation is represented as a list of messages.
  • Each message is a dictionary with 'role' and 'content' keys.
required
sampling_params Optional[Union[SamplingParams, list[SamplingParams]]]

The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.

None
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[LoRARequest]

LoRA request to use for generation, if any.

None
chat_template Optional[str]

The template to use for structuring the chat. If not provided, the model's default chat template will be used.

None
chat_template_content_format ChatTemplateContentFormatOption

The format to render message content.

  • "string" will render the content as a string. Example: "Who are you?"
  • "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{"type": "text", "text": "Who are you?"}]
'auto'
add_generation_prompt bool

If True, adds a generation template to each message.

True
continue_final_message bool

If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.

False
chat_template_kwargs Optional[dict[str, Any]]

Additional kwargs to pass to the chat template.

None
mm_processor_kwargs Optional[dict[str, Any]]

Multimodal processor kwarg overrides for this chat request. Only used for offline requests.

None

Returns:

Type Description
list[RequestOutput]

A list of RequestOutput objects containing the generated responses in the same order as the input messages.
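
A minimal usage sketch (the model name is illustrative; the model must ship a chat template, or one must be supplied via `chat_template`):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about GPUs."},
]
outputs = llm.chat(conversation, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)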

Source code in vllm/entrypoints/llm.py
def chat(
    self,
    messages: Union[list[ChatCompletionMessageParam],
                    list[list[ChatCompletionMessageParam]]],
    sampling_params: Optional[Union[SamplingParams,
                                    list[SamplingParams]]] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[LoRARequest] = None,
    chat_template: Optional[str] = None,
    chat_template_content_format: ChatTemplateContentFormatOption = "auto",
    add_generation_prompt: bool = True,
    continue_final_message: bool = False,
    tools: Optional[list[dict[str, Any]]] = None,
    chat_template_kwargs: Optional[dict[str, Any]] = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
) -> list[RequestOutput]:
    """
    Generate responses for a chat conversation.

    The chat conversation is converted into a text prompt using the
    tokenizer and calls the [generate][] method to generate the
    responses.

    Multi-modal inputs can be passed in the same way you would pass them
    to the OpenAI API.

    Args:
        messages: A list of conversations or a single conversation.

            - Each conversation is represented as a list of messages.
            - Each message is a dictionary with 'role' and 'content' keys.

        sampling_params: The sampling parameters for text generation.
            If None, we use the default sampling parameters. When it
            is a single value, it is applied to every prompt. When it
            is a list, the list must have the same length as the
            prompts and it is paired one by one with the prompt.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        chat_template: The template to use for structuring the chat.
            If not provided, the model's default chat template will be used.
        chat_template_content_format: The format to render message content.

            - "string" will render the content as a string.
              Example: `"Who are you?"`
            - "openai" will render the content as a list of dictionaries,
              similar to OpenAI schema.
              Example: `[{"type": "text", "text": "Who are you?"}]`

        add_generation_prompt: If True, adds a generation template
            to each message.
        continue_final_message: If True, continues the final message in
            the conversation instead of starting a new one. Cannot be
            `True` if `add_generation_prompt` is also `True`.
        chat_template_kwargs: Additional kwargs to pass to the chat
            template.
        mm_processor_kwargs: Multimodal processor kwarg overrides for this
            chat request. Only used for offline requests.

    Returns:
        A list of `RequestOutput` objects containing the generated
        responses in the same order as the input messages.
    """
    list_of_messages: list[list[ChatCompletionMessageParam]]

    # Handle multi and single conversations
    if is_list_of(messages, list):
        # messages is list[list[...]]
        list_of_messages = cast(list[list[ChatCompletionMessageParam]],
                                messages)
    else:
        # messages is list[...]
        list_of_messages = [
            cast(list[ChatCompletionMessageParam], messages)
        ]

    tokenizer = self.get_tokenizer(lora_request)
    model_config = self.llm_engine.get_model_config()
    resolved_content_format = resolve_chat_template_content_format(
        chat_template,
        tools,
        chat_template_content_format,
        tokenizer,
        model_config=model_config,
    )

    _chat_template_kwargs: dict[str, Any] = dict(
        chat_template=chat_template,
        add_generation_prompt=add_generation_prompt,
        continue_final_message=continue_final_message,
        tools=tools,
    )
    _chat_template_kwargs.update(chat_template_kwargs or {})

    prompts: list[Union[TokensPrompt, TextPrompt]] = []

    for msgs in list_of_messages:
        # NOTE: _parse_chat_message_content_parts() currently doesn't
        # handle mm_processor_kwargs, since there is no implementation in
        # the chat message parsing for it.
        conversation, mm_data = parse_chat_messages(
            msgs,
            model_config,
            tokenizer,
            content_format=resolved_content_format,
        )

        if isinstance(tokenizer, MistralTokenizer):
            prompt_token_ids = apply_mistral_chat_template(
                tokenizer,
                messages=msgs,
                **_chat_template_kwargs,
            )
        else:
            prompt_str = apply_hf_chat_template(
                tokenizer=tokenizer,
                conversation=conversation,
                model_config=model_config,
                **_chat_template_kwargs,
            )
            # Special tokens are already included in chat templates so
            # should not be added by the tokenizer in this case.
            prompt_token_ids = tokenizer.encode(prompt_str,
                                                add_special_tokens=False)

        prompt = TokensPrompt(prompt_token_ids=prompt_token_ids)

        if mm_data is not None:
            prompt["multi_modal_data"] = mm_data

        if mm_processor_kwargs is not None:
            prompt["mm_processor_kwargs"] = mm_processor_kwargs

        prompts.append(prompt)

    return self.generate(
        prompts,
        sampling_params=sampling_params,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
    )
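
A minimal usage sketch (the model name is a placeholder; any chat-tuned checkpoint with a chat template works the same way):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder chat model

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# A single conversation; pass a list of such lists to batch several chats.
outputs = llm.chat(conversation, SamplingParams(temperature=0.5, max_tokens=64))
print(outputs[0].outputs[0].text)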

classify

classify(
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[ClassificationRequestOutput]

Generate class logits for each prompt.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

Name Type Description Default
prompts Union[PromptType, Sequence[PromptType]]

The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompts.

required
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
prompt_adapter_request Optional[PromptAdapterRequest]

Prompt Adapter request to use for generation, if any.

None

Returns:

Type Description
list[ClassificationRequestOutput]

A list of ClassificationRequestOutput objects containing the embedding vectors in the same order as the input prompts.

Source code in vllm/entrypoints/llm.py
def classify(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[ClassificationRequestOutput]:
    """
    Generate class logits for each prompt.

    This class automatically batches the given prompts, considering
    the memory constraint. For the best performance, put all of your prompts
    into a single list and pass it to this method.

    Args:
        prompts: The prompts to the LLM. You may pass a sequence of prompts
            for batch inference. See [PromptType][vllm.inputs.PromptType]
            for more details about the format of each prompts.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.

    Returns:
        A list of `ClassificationRequestOutput` objects containing the
        embedding vectors in the same order as the input prompts.
    """
    if self.llm_engine.model_config.task != "classify":
        raise ValueError(
            "Classification API is only enabled for `--task classify`")

    items = self.encode(prompts,
                        use_tqdm=use_tqdm,
                        lora_request=lora_request,
                        prompt_adapter_request=prompt_adapter_request)

    return [ClassificationRequestOutput.from_base(item) for item in items]
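
A minimal sketch, assuming a checkpoint initialized for classification (the model name is a placeholder; the probs field follows ClassificationRequestOutput):

from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")  # placeholder classifier

outputs = llm.classify(["vLLM is wonderful!", "This movie was terrible."])
for out in outputs:
    print(out.outputs.probs)  # class probabilities for each prompt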

collective_rpc

collective_rpc(
    method: Union[str, Callable[..., _R]],
    timeout: Optional[float] = None,
    args: tuple = (),
    kwargs: Optional[dict[str, Any]] = None,
) -> list[_R]

Execute an RPC call on all workers.

Parameters:

Name Type Description Default
method Union[str, Callable[..., _R]]

Name of the worker method to execute, or a callable that is serialized and sent to all workers to execute.

If the method is a callable, it should accept an additional self argument, in addition to the arguments passed in args and kwargs. The self argument will be the worker object.

required
timeout Optional[float]

Maximum time in seconds to wait for execution. Raises a TimeoutError on timeout. None means wait indefinitely.

None
args tuple

Positional arguments to pass to the worker method.

()
kwargs Optional[dict[str, Any]]

Keyword arguments to pass to the worker method.

None

Returns:

Type Description
list[_R]

A list containing the results from each worker.

Note

It is recommended to use this API to only pass control messages, and set up data-plane communication to pass data.

Source code in vllm/entrypoints/llm.py
def collective_rpc(self,
                   method: Union[str, Callable[..., _R]],
                   timeout: Optional[float] = None,
                   args: tuple = (),
                   kwargs: Optional[dict[str, Any]] = None) -> list[_R]:
    """
    Execute an RPC call on all workers.

    Args:
        method: Name of the worker method to execute, or a callable that
            is serialized and sent to all workers to execute.

            If the method is a callable, it should accept an additional
            `self` argument, in addition to the arguments passed in `args`
            and `kwargs`. The `self` argument will be the worker object.
        timeout: Maximum time in seconds to wait for execution. Raises a
            [`TimeoutError`][] on timeout. `None` means wait indefinitely.
        args: Positional arguments to pass to the worker method.
        kwargs: Keyword arguments to pass to the worker method.

    Returns:
        A list containing the results from each worker.

    Note:
        It is recommended to use this API to only pass control messages,
        and set up data-plane communication to pass data.
    """

    return self.llm_engine.collective_rpc(method, timeout, args, kwargs)
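
A minimal sketch that broadcasts a callable to every worker; report_pid is an illustrative helper, not part of vLLM, and llm is an already constructed LLM instance:

import os

def report_pid(worker) -> int:
    # `worker` is the per-rank worker object passed as the extra
    # `self`-style argument; it is unused in this toy example.
    return os.getpid()

pids = llm.collective_rpc(report_pid)  # one result per worker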

deprecate_legacy_api classmethod

deprecate_legacy_api()
Source code in vllm/entrypoints/llm.py
@classmethod
@contextmanager
def deprecate_legacy_api(cls):
    cls.DEPRECATE_LEGACY = True

    yield

    cls.DEPRECATE_LEGACY = False
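
This context manager temporarily sets LLM.DEPRECATE_LEGACY so that legacy keyword arguments such as prompt_token_ids emit deprecation warnings (see the deprecate_kwargs decorators on encode and generate below). A minimal sketch, assuming an existing llm instance:

with LLM.deprecate_legacy_api():
    # Legacy token-id keywords now warn; prefer the `prompts` parameter.
    outputs = llm.generate(prompt_token_ids=[[1, 2, 3]])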

embed

embed(
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    *,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[EmbeddingRequestOutput]

Generate an embedding vector for each prompt.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

Name Type Description Default
prompts Union[PromptType, Sequence[PromptType]]

The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompts.

required
pooling_params Optional[Union[PoolingParams, Sequence[PoolingParams]]]

The pooling parameters for pooling. If None, we use the default pooling parameters.

None
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
prompt_adapter_request Optional[PromptAdapterRequest]

Prompt Adapter request to use for generation, if any.

None

Returns:

Type Description
list[EmbeddingRequestOutput]

A list of EmbeddingRequestOutput objects containing the embedding vectors in the same order as the input prompts.

Source code in vllm/entrypoints/llm.py
def embed(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    *,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    pooling_params: Optional[Union[PoolingParams,
                                   Sequence[PoolingParams]]] = None,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[EmbeddingRequestOutput]:
    """
    Generate an embedding vector for each prompt.

    This class automatically batches the given prompts, considering
    the memory constraint. For the best performance, put all of your prompts
    into a single list and pass it to this method.

    Args:
        prompts: The prompts to the LLM. You may pass a sequence of prompts
            for batch inference. See [PromptType][vllm.inputs.PromptType]
            for more details about the format of each prompts.
        pooling_params: The pooling parameters for pooling. If None, we
            use the default pooling parameters.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.

    Returns:
        A list of `EmbeddingRequestOutput` objects containing the
        embedding vectors in the same order as the input prompts.
    """
    if self.llm_engine.model_config.task != "embed":
        raise ValueError(
            "Embedding API is only enabled for `--task embed`")

    items = self.encode(prompts,
                        truncate_prompt_tokens=truncate_prompt_tokens,
                        use_tqdm=use_tqdm,
                        pooling_params=pooling_params,
                        lora_request=lora_request,
                        prompt_adapter_request=prompt_adapter_request)

    return [EmbeddingRequestOutput.from_base(item) for item in items]
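
A minimal sketch, assuming an embedding checkpoint initialized with task="embed" (the model name is a placeholder):

from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")  # placeholder embedding model

outputs = llm.embed(["Hello, my name is", "The capital of France is"])
print(len(outputs[0].outputs.embedding))  # embedding dimension of the first prompt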

encode

encode(
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    *,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: str,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    prompt_token_ids: Optional[list[int]] = None,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: list[str],
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    prompt_token_ids: Optional[list[list[int]]] = None,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: Optional[str] = None,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    *,
    prompt_token_ids: list[int],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: Optional[list[str]] = None,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    *,
    prompt_token_ids: list[list[int]],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: None,
    pooling_params: None,
    prompt_token_ids: Union[list[int], list[list[int]]],
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]
encode(
    prompts: Union[
        Union[PromptType, Sequence[PromptType]],
        Optional[Union[str, list[str]]],
    ] = None,
    pooling_params: Optional[
        Union[PoolingParams, Sequence[PoolingParams]]
    ] = None,
    prompt_token_ids: Optional[
        Union[list[int], list[list[int]]]
    ] = None,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[PoolingRequestOutput]

Apply pooling to the hidden states corresponding to the input prompts.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

Name Type Description Default
prompts Union[Union[PromptType, Sequence[PromptType]], Optional[Union[str, list[str]]]]

The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompts.

None
pooling_params Optional[Union[PoolingParams, Sequence[PoolingParams]]]

The pooling parameters for pooling. If None, we use the default pooling parameters.

None
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
prompt_adapter_request Optional[PromptAdapterRequest]

Prompt Adapter request to use for generation, if any.

None

Returns:

Type Description
list[PoolingRequestOutput]

A list of PoolingRequestOutput objects containing the pooled hidden states in the same order as the input prompts.

Note

Using prompts and prompt_token_ids as keyword parameters is considered legacy and may be deprecated in the future. You should instead pass them via the inputs parameter.

Source code in vllm/entrypoints/llm.py
@deprecate_kwargs(
    "prompt_token_ids",
    is_deprecated=lambda: LLM.DEPRECATE_LEGACY,
    additional_message="Please use the 'prompts' parameter instead.",
)
def encode(
    self,
    prompts: Union[Union[PromptType, Sequence[PromptType]],
                   Optional[Union[str, list[str]]]] = None,
    pooling_params: Optional[Union[PoolingParams,
                                   Sequence[PoolingParams]]] = None,
    prompt_token_ids: Optional[Union[list[int], list[list[int]]]] = None,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[PoolingRequestOutput]:
    """Apply pooling to the hidden states corresponding to the input
    prompts.

    This class automatically batches the given prompts, considering
    the memory constraint. For the best performance, put all of your prompts
    into a single list and pass it to this method.

    Args:
        prompts: The prompts to the LLM. You may pass a sequence of prompts
            for batch inference. See [PromptType][vllm.inputs.PromptType]
            for more details about the format of each prompts.
        pooling_params: The pooling parameters for pooling. If None, we
            use the default pooling parameters.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.

    Returns:
        A list of `PoolingRequestOutput` objects containing the
        pooled hidden states in the same order as the input prompts.

    Note:
        Using `prompts` and `prompt_token_ids` as keyword parameters is
        considered legacy and may be deprecated in the future. You should
        instead pass them via the `inputs` parameter.
    """
    runner_type = self.llm_engine.model_config.runner_type
    if runner_type != "pooling":
        messages = ["LLM.encode() is only supported for pooling models."]

        supported_runner_types = self.llm_engine.model_config \
            .supported_runner_types
        if "pooling" in supported_runner_types:
            messages.append(
                "Your model supports the 'pooling' runner, but is "
                f"currently initialized for the '{runner_type}' runner. "
                "Please initialize vLLM using `--task embed`, "
                "`--task classify`, `--task score` etc.")

        raise ValueError(" ".join(messages))

    if prompt_token_ids is not None:
        parsed_prompts = self._convert_v1_inputs(
            prompts=cast(Optional[Union[str, list[str]]], prompts),
            prompt_token_ids=prompt_token_ids,
        )
    else:
        parsed_prompts = cast(Union[PromptType, Sequence[PromptType]],
                              prompts)

    if pooling_params is None:
        # Use default pooling params.
        pooling_params = PoolingParams()
    elif isinstance(pooling_params, PoolingParams):
        pooling_params.verify(self.llm_engine.model_config)
    else:
        for pooling_param in pooling_params:
            pooling_param.verify(self.llm_engine.model_config)

    tokenization_kwargs: dict[str, Any] = {}
    _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                              truncate_prompt_tokens, tokenization_kwargs)

    self._validate_and_add_requests(
        prompts=parsed_prompts,
        params=pooling_params,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
        tokenization_kwargs=tokenization_kwargs,
        prompt_adapter_request=prompt_adapter_request,
    )

    outputs = self._run_engine(use_tqdm=use_tqdm)
    return self.engine_class.validate_outputs(outputs,
                                              PoolingRequestOutput)
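
A minimal sketch of calling the pooling entry point directly (embed and classify above delegate to it); the model name is a placeholder and the pooled tensor is assumed to live in outputs[i].outputs.data:

from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")  # placeholder pooling model

outputs = llm.encode(["Hello, my name is"])
pooled = outputs[0].outputs.data  # pooled hidden states for the first prompt
print(pooled.shape)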

generate

generate(
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    sampling_params: Optional[
        Union[SamplingParams, Sequence[SamplingParams]]
    ] = None,
    *,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: str,
    sampling_params: Optional[
        Union[SamplingParams, list[SamplingParams]]
    ] = None,
    prompt_token_ids: Optional[list[int]] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: list[str],
    sampling_params: Optional[
        Union[SamplingParams, list[SamplingParams]]
    ] = None,
    prompt_token_ids: Optional[list[list[int]]] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: Optional[str] = None,
    sampling_params: Optional[
        Union[SamplingParams, list[SamplingParams]]
    ] = None,
    *,
    prompt_token_ids: list[int],
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: Optional[list[str]] = None,
    sampling_params: Optional[
        Union[SamplingParams, list[SamplingParams]]
    ] = None,
    *,
    prompt_token_ids: list[list[int]],
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: None,
    sampling_params: None,
    prompt_token_ids: Union[list[int], list[list[int]]],
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
) -> list[RequestOutput]
generate(
    prompts: Union[
        Union[PromptType, Sequence[PromptType]],
        Optional[Union[str, list[str]]],
    ] = None,
    sampling_params: Optional[
        Union[SamplingParams, Sequence[SamplingParams]]
    ] = None,
    prompt_token_ids: Optional[
        Union[list[int], list[list[int]]]
    ] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
    guided_options_request: Optional[
        Union[LLMGuidedOptions, GuidedDecodingRequest]
    ] = None,
    priority: Optional[list[int]] = None,
) -> list[RequestOutput]

Generates the completions for the input prompts.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

Name Type Description Default
prompts Union[Union[PromptType, Sequence[PromptType]], Optional[Union[str, list[str]]]]

The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompts.

None
sampling_params Optional[Union[SamplingParams, Sequence[SamplingParams]]]

The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.

None
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
prompt_adapter_request Optional[PromptAdapterRequest]

Prompt Adapter request to use for generation, if any.

None
priority Optional[list[int]]

The priority of the requests, if any. Only applicable when priority scheduling policy is enabled.

None

Returns:

Type Description
list[RequestOutput]

A list of RequestOutput objects containing the generated completions in the same order as the input prompts.

Note

Using prompts and prompt_token_ids as keyword parameters is considered legacy and may be deprecated in the future. You should instead pass them via the inputs parameter.

Source code in vllm/entrypoints/llm.py
@deprecate_kwargs(
    "prompt_token_ids",
    is_deprecated=lambda: LLM.DEPRECATE_LEGACY,
    additional_message="Please use the 'prompts' parameter instead.",
)
def generate(
    self,
    prompts: Union[Union[PromptType, Sequence[PromptType]],
                   Optional[Union[str, list[str]]]] = None,
    sampling_params: Optional[Union[SamplingParams,
                                    Sequence[SamplingParams]]] = None,
    prompt_token_ids: Optional[Union[list[int], list[list[int]]]] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    guided_options_request: Optional[Union[LLMGuidedOptions,
                                           GuidedDecodingRequest]] = None,
    priority: Optional[list[int]] = None,
) -> list[RequestOutput]:
    """Generates the completions for the input prompts.

    This class automatically batches the given prompts, considering
    the memory constraint. For the best performance, put all of your prompts
    into a single list and pass it to this method.

    Args:
        prompts: The prompts to the LLM. You may pass a sequence of prompts
            for batch inference. See [PromptType][vllm.inputs.PromptType]
            for more details about the format of each prompts.
        sampling_params: The sampling parameters for text generation. If
            None, we use the default sampling parameters.
            When it is a single value, it is applied to every prompt.
            When it is a list, the list must have the same length as the
            prompts and it is paired one by one with the prompt.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.
        priority: The priority of the requests, if any.
            Only applicable when priority scheduling policy is enabled.

    Returns:
        A list of `RequestOutput` objects containing the
        generated completions in the same order as the input prompts.

    Note:
        Using `prompts` and `prompt_token_ids` as keyword parameters is
        considered legacy and may be deprecated in the future. You should
        instead pass them via the `inputs` parameter.
    """
    runner_type = self.llm_engine.model_config.runner_type
    if runner_type not in ["generate", "transcription"]:
        messages = [
            "LLM.generate() is only supported for (conditional) generation "
            "models (XForCausalLM, XForConditionalGeneration).",
        ]

        supported_runner_types = self.llm_engine.model_config \
            .supported_runner_types
        if "generate" in supported_runner_types:
            messages.append(
                "Your model supports the 'generate' runner, but is "
                f"currently initialized for the '{runner_type}' runner. "
                "Please initialize vLLM using `--task generate`.")

        raise ValueError(" ".join(messages))

    if prompt_token_ids is not None:
        parsed_prompts = self._convert_v1_inputs(
            prompts=cast(Optional[Union[str, list[str]]], prompts),
            prompt_token_ids=prompt_token_ids,
        )
    else:
        parsed_prompts = cast(Union[PromptType, Sequence[PromptType]],
                              prompts)

    if isinstance(guided_options_request, dict):
        if len(guided_options_request) > 1:
            raise ValueError(
                "You can only use one guided decoding but multiple is "
                f"specified: {guided_options_request}")
        guided_options_request = GuidedDecodingRequest(
            **guided_options_request)

    if sampling_params is None:
        # Use default sampling params.
        sampling_params = self.get_default_sampling_params()

    tokenization_kwargs: dict[str, Any] = {}
    truncate_prompt_tokens = None
    if isinstance(sampling_params, SamplingParams):
        truncate_prompt_tokens = sampling_params.truncate_prompt_tokens
    _validate_truncation_size(self.llm_engine.model_config.max_model_len,
                              truncate_prompt_tokens, tokenization_kwargs)

    self._validate_and_add_requests(
        prompts=parsed_prompts,
        params=sampling_params,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
        prompt_adapter_request=prompt_adapter_request,
        guided_options=guided_options_request,
        tokenization_kwargs=tokenization_kwargs,
        priority=priority,
    )

    outputs = self._run_engine(use_tqdm=use_tqdm)
    return self.engine_class.validate_outputs(outputs, RequestOutput)
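
A minimal usage sketch (the model name is a placeholder):

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder generative model

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")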

get_default_sampling_params

get_default_sampling_params() -> SamplingParams
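
Return the default sampling parameters for this model: sampling overrides from the model's generation config (via get_diff_sampling_param, see the source below) are applied when present; otherwise a plain SamplingParams() is returned.
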
Source code in vllm/entrypoints/llm.py
def get_default_sampling_params(self) -> SamplingParams:
    if self.default_sampling_params is None:
        self.default_sampling_params = (
            self.llm_engine.model_config.get_diff_sampling_param())
    if self.default_sampling_params:
        return SamplingParams.from_optional(**self.default_sampling_params)
    return SamplingParams()

get_metrics

get_metrics() -> list[Metric]

Return a snapshot of aggregated metrics from Prometheus.

Returns:

Type Description
list[Metric]

A MetricSnapshot instance capturing the current state of all aggregated metrics from Prometheus.

Note

This method is only available with the V1 LLM engine.

Source code in vllm/entrypoints/llm.py
def get_metrics(self) -> list["Metric"]:
    """Return a snapshot of aggregated metrics from Prometheus.

    Returns:
        A ``MetricSnapshot`` instance capturing the current state
        of all aggregated metrics from Prometheus.

    Note:
        This method is only available with the V1 LLM engine.
    """
    from vllm.v1.engine.llm_engine import LLMEngine as V1LLMEngine
    assert isinstance(self.llm_engine, V1LLMEngine)
    return self.llm_engine.get_metrics()
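
A minimal sketch, assuming the V1 engine is active; which fields a Metric exposes depends on its concrete subclass, so the value is read defensively here:

for metric in llm.get_metrics():
    # Counters and gauges carry a scalar value; histograms carry buckets.
    print(metric.name, getattr(metric, "value", None))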

get_tokenizer

get_tokenizer(
    lora_request: Optional[LoRARequest] = None,
) -> AnyTokenizer
Source code in vllm/entrypoints/llm.py
def get_tokenizer(
    self,
    lora_request: Optional[LoRARequest] = None,
) -> AnyTokenizer:
    return self.llm_engine.get_tokenizer_group().get_lora_tokenizer(
        lora_request)

reset_prefix_cache

reset_prefix_cache(device: Optional[Device] = None) -> bool
Source code in vllm/entrypoints/llm.py
def reset_prefix_cache(self, device: Optional[Device] = None) -> bool:
    return self.llm_engine.reset_prefix_cache(device)

score

score(
    text_1: Union[
        SingletonPrompt, Sequence[SingletonPrompt]
    ],
    text_2: Union[
        SingletonPrompt, Sequence[SingletonPrompt]
    ],
    /,
    *,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[
        Union[list[LoRARequest], LoRARequest]
    ] = None,
    prompt_adapter_request: Optional[
        PromptAdapterRequest
    ] = None,
) -> list[ScoringRequestOutput]

Generate similarity scores for all pairs <text,text_pair>.

The inputs can be 1 -> 1, 1 -> N or N -> N. In the 1 -> N case the text_1 sentence will be replicated N times to pair with the text_2 sentences. The input pairs are used to build a list of prompts for the cross encoder model. This class automatically batches the prompts, considering the memory constraint. For the best performance, put all of your texts into a single list and pass it to this method.

Parameters:

Name Type Description Default
text_1 Union[SingletonPrompt, Sequence[SingletonPrompt]]

can be a single prompt or a list of prompts, in which case it has to have the same length as the text_2 list

required
text_2 Union[SingletonPrompt, Sequence[SingletonPrompt]]

The texts to pair with the query to form the input to the LLM. See PromptType for more details about the format of each prompts.

required
use_tqdm Union[bool, Callable[..., tqdm]]

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

True
lora_request Optional[Union[list[LoRARequest], LoRARequest]]

LoRA request to use for generation, if any.

None
prompt_adapter_request Optional[PromptAdapterRequest]

Prompt Adapter request to use for generation, if any.

None

Returns:

Type Description
list[ScoringRequestOutput]

A list of ScoringRequestOutput objects containing the generated scores in the same order as the input prompts.

Source code in vllm/entrypoints/llm.py
def score(
    self,
    text_1: Union[SingletonPrompt, Sequence[SingletonPrompt]],
    text_2: Union[SingletonPrompt, Sequence[SingletonPrompt]],
    /,
    *,
    truncate_prompt_tokens: Optional[int] = None,
    use_tqdm: Union[bool, Callable[..., tqdm]] = True,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> list[ScoringRequestOutput]:
    """Generate similarity scores for all pairs `<text,text_pair>`.

    The inputs can be `1 -> 1`, `1 -> N` or `N -> N`.
    In the `1 - N` case the `text_1` sentence will be replicated `N`
    times to pair with the `text_2` sentences.
    The input pairs are used to build a list of prompts for the
    cross encoder model. This class automatically batches the prompts,
    considering the memory constraint. For the best performance, put all
    of your texts into a single list and pass it to this method.

    Args:
        text_1: can be a single prompt or a list of prompts, in which
            case it has to have the same length as the `text_2` list
        text_2: The texts to pair with the query to form the input
            to the LLM. See [PromptType][vllm.inputs.PromptType] for
            more details about the format of each prompts.
        use_tqdm: If `True`, shows a tqdm progress bar.
            If a callable (e.g., `functools.partial(tqdm, leave=False)`),
            it is used to create the progress bar.
            If `False`, no progress bar is created.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.

    Returns:
        A list of `ScoringRequestOutput` objects containing the
        generated scores in the same order as the input prompts.
    """
    runner_type = self.llm_engine.model_config.runner_type
    if runner_type != "pooling":
        messages = ["LLM.score() is only supported for pooling models."]

        supported_runner_types = self.llm_engine.model_config \
            .supported_runner_types
        if "pooling" in supported_runner_types:
            messages.append(
                "Your model supports the 'pooling' runner, but is "
                f"currently initialized for the '{runner_type}' runner. "
                "Please initialize vLLM using `--task embed`, "
                "`--task classify`, `--task score` etc.")

        raise ValueError(" ".join(messages))

    if self.llm_engine.model_config.task not in ("embed", "classify"):
        raise ValueError("Score API is only enabled for "
                         "`--task embed or --task classify`.")

    if (self.llm_engine.model_config.task == "classify"
            and self.llm_engine.model_config.hf_config.num_labels != 1):
        raise ValueError("Score API is only enabled for num_labels == 1.")

    # the tokenizer for models such as
    # "cross-encoder/ms-marco-MiniLM-L-6-v2" doesn't support passing
    # lists of tokens to the `text` and `text_pair` kwargs
    tokenizer = self.get_tokenizer()

    def ensure_str(prompt: SingletonPrompt):
        if isinstance(prompt, dict):
            if "multi_modal_data" in prompt:
                raise ValueError("Multi-modal prompt is not "
                                 "supported for scoring")
            elif "prompt_token_ids" in prompt:
                prompt = tokenizer.decode(
                    cast(TokensPrompt, prompt)["prompt_token_ids"])
            elif "prompt" in prompt:
                prompt = cast(TextPrompt, prompt)["prompt"]
        assert type(prompt) is str
        return prompt

    if isinstance(text_1, (str, dict)):
        # Convert a single prompt to a list.
        text_1 = [text_1]
    input_text_1: list[str] = [ensure_str(t) for t in text_1]

    if isinstance(text_2, (str, dict)):
        # Convert a single prompt to a list.
        text_2 = [text_2]
    input_text_2: list[str] = [ensure_str(t) for t in text_2]

    _validate_score_input_lens(input_text_1, input_text_2)

    if self.llm_engine.model_config.is_cross_encoder:
        return self._cross_encoding_score(tokenizer, input_text_1,
                                          input_text_2,
                                          truncate_prompt_tokens, use_tqdm,
                                          lora_request,
                                          prompt_adapter_request)
    else:
        return self._embedding_score(
            tokenizer,
            input_text_1,  # type: ignore[arg-type]
            input_text_2,  # type: ignore[arg-type]
            truncate_prompt_tokens,
            use_tqdm,
            lora_request,
            prompt_adapter_request)
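
A minimal 1 -> N sketch, assuming a cross-encoder checkpoint initialized for scoring (the model name is taken from the comment in the source above):

from vllm import LLM

llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")  # placeholder cross-encoder

outputs = llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "The Eiffel Tower is in Berlin."],
)
print([out.outputs.score for out in outputs])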

set_tokenizer

set_tokenizer(tokenizer: AnyTokenizer) -> None
Source code in vllm/entrypoints/llm.py
def set_tokenizer(self, tokenizer: AnyTokenizer) -> None:
    tokenizer_group = self.llm_engine.get_tokenizer_group()

    # While CachedTokenizer is dynamic, have no choice but
    # compare class name. Misjudgment will arise from
    # user-defined tokenizer started with 'Cached'
    if tokenizer.__class__.__name__.startswith("Cached"):
        tokenizer_group.tokenizer = tokenizer
    else:
        tokenizer_group.tokenizer = get_cached_tokenizer(tokenizer)

sleep

sleep(level: int = 1)

Put the engine to sleep. The engine should not process any requests. The caller should guarantee that no requests are being processed during the sleep period, before wake_up is called.

Parameters:

Name Type Description Default
level int

The sleep level. Level 1 sleep will offload the model weights and discard the kv cache. The content of kv cache is forgotten. Level 1 sleep is good for sleeping and waking up the engine to run the same model again. The model weights are backed up in CPU memory. Please make sure there's enough CPU memory to store the model weights. Level 2 sleep will discard both the model weights and the kv cache. The content of both the model weights and kv cache is forgotten. Level 2 sleep is good for sleeping and waking up the engine to run a different model or update the model, where previous model weights are not needed. It reduces CPU memory pressure.

1
Source code in vllm/entrypoints/llm.py
def sleep(self, level: int = 1):
    """
    Put the engine to sleep. The engine should not process any requests.
    The caller should guarantee that no requests are being processed
    during the sleep period, before `wake_up` is called.

    Args:
        level: The sleep level. Level 1 sleep will offload the model
            weights and discard the kv cache. The content of kv cache
            is forgotten. Level 1 sleep is good for sleeping and waking
            up the engine to run the same model again. The model weights
            are backed up in CPU memory. Please make sure there's enough
            CPU memory to store the model weights. Level 2 sleep will
            discard both the model weights and the kv cache. The content
            of both the model weights and kv cache is forgotten. Level 2
            sleep is good for sleeping and waking up the engine to run a
            different model or update the model, where previous model
            weights are not needed. It reduces CPU memory pressure.
    """
    self.reset_prefix_cache()
    self.llm_engine.sleep(level=level)

start_profile

start_profile() -> None
Source code in vllm/entrypoints/llm.py
def start_profile(self) -> None:
    self.llm_engine.start_profile()

stop_profile

stop_profile() -> None
Source code in vllm/entrypoints/llm.py
def stop_profile(self) -> None:
    self.llm_engine.stop_profile()
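
A minimal sketch that brackets a request with the profiler hooks, assuming an existing llm instance and that profiling has been enabled (e.g. by pointing the VLLM_TORCH_PROFILER_DIR environment variable at a trace directory):

llm.start_profile()
llm.generate(["Hello, my name is"])  # any work between the two calls is traced
llm.stop_profile()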

wake_up

wake_up(tags: Optional[list[str]] = None)

Wake up the engine from sleep mode. See the [sleep][] method for more details.

Parameters:

Name Type Description Default
tags Optional[list[str]]

An optional list of tags to reallocate the engine memory for specific memory allocations. Values must be in ("weights", "kv_cache"). If None, all memory is reallocated. wake_up should be called with all tags (or None) before the engine is used again.

None
Source code in vllm/entrypoints/llm.py
def wake_up(self, tags: Optional[list[str]] = None):
    """
    Wake up the engine from sleep mode. See the [sleep][] method
    for more details.

    Args:
        tags: An optional list of tags to reallocate the engine memory
            for specific memory allocations. Values must be in
            `("weights", "kv_cache")`. If None, all memory is reallocated.
            wake_up should be called with all tags (or None) before the
            engine is used again.
    """
    self.llm_engine.wake_up(tags)
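
A minimal sketch of a sleep/wake cycle; enable_sleep_mode mirrors the engine argument of the same name and is an assumption about how the instance was constructed, and the model name is a placeholder:

from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # placeholder model

llm.generate(["Hello, my name is"])
llm.sleep(level=1)   # offload weights to CPU memory and discard the KV cache
# ... the freed GPU memory can be used for other work here ...
llm.wake_up()        # reallocate weights and KV cache before the next requests
llm.generate(["The capital of France is"])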