vllm.spec_decode.util
Timer
Basic timer context manager for measuring CPU time. Implements __enter__ and __exit__ so it can be used in a with statement.
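For illustration, a minimal sketch of such a timer. The attribute names used here (elapsed_time_s, elapsed_time_ms) are assumptions for this sketch, not confirmed from the source:

```python
import time

class SketchTimer:
    """Minimal CPU timer context manager (illustrative sketch)."""

    def __enter__(self):
        self.start_time = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = time.time()
        # Attribute names below are assumptions for this sketch.
        self.elapsed_time_s = self.end_time - self.start_time
        self.elapsed_time_ms = self.elapsed_time_s * 1000

with SketchTimer() as timer:
    sum(range(1_000_000))  # some CPU-bound work
print(f"took {timer.elapsed_time_ms:.2f} ms")
```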
create_logprobs_output
¶
create_logprobs_output(
    token_id: int,
    token_id_logprob_rank: int,
    token_id_logprob: float,
    topk_token_ids: List[Optional[int]],
    topk_logprobs: List[Optional[float]],
) -> Dict[int, Logprob]
Create a Logprob Dict for a token given the sampling results.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
token_id | int | The sampled token for the sequence. | required |
token_id_logprob_rank | int | The logprob rank of the sampled token. | required |
token_id_logprob | float | The logprob value of the sampled token. | required |
topk_token_ids | List[Optional[int]] | The list of top-k token ids. | required |
topk_logprobs | List[Optional[float]] | The list of top-k logprobs. | required |
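A hedged usage sketch; the token ids and logprob values below are made up for illustration:

```python
from vllm.spec_decode.util import create_logprobs_output

# Illustrative values only: token 42 was sampled with rank 1, and the
# caller requested the top-2 alternatives.
logprobs_dict = create_logprobs_output(
    token_id=42,
    token_id_logprob_rank=1,
    token_id_logprob=-0.05,
    topk_token_ids=[42, 17],
    topk_logprobs=[-0.05, -3.2],
)
# Expected result: a dict keyed by token id, e.g.
# {42: Logprob(logprob=-0.05, rank=1), 17: Logprob(...)}.
```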
create_sequence_group_output
create_sequence_group_output(
    token_id: int,
    token_id_logprob_rank: int,
    token_id_logprob: float,
    seq_id: SeqId,
    topk_token_ids: List[Optional[int]],
    topk_logprobs: List[Optional[float]],
    prompt_logprobs: Optional[PromptLogprobs] = None,
    step_index: Optional[int] = 0,
) -> CompletionSequenceGroupOutput
Create a SequenceGroupOutput given the sampling results.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
token_id | int | The sampled token for the sequence. | required |
token_id_logprob_rank | int | The logprob rank of the sampled token. | required |
token_id_logprob | float | The logprob value of the sampled token. | required |
seq_id | int | The sequence id. | required |
topk_token_ids | List[Optional[int]] | The list of top-k token ids. | required |
topk_logprobs | List[Optional[float]] | The list of top-k logprobs. | required |
step_index | Optional[int] | The index of the speculative token. | 0 |
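A hedged usage sketch; all concrete values are illustrative:

```python
from vllm.spec_decode.util import create_sequence_group_output

# Illustrative values only.
seq_group_output = create_sequence_group_output(
    token_id=42,
    token_id_logprob_rank=1,
    token_id_logprob=-0.05,
    seq_id=0,
    topk_token_ids=[42, 17],
    topk_logprobs=[-0.05, -3.2],
    step_index=0,  # first speculative token of this step
)
```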
get_all_num_logprobs
get_all_num_logprobs(
    seq_group_metadata_list: List[SequenceGroupMetadata],
) -> List[int]
Given a list of SequenceGroupMetadata, create a list of all num_logprobs.
If the sampling params do not call for any logprobs, return 0 for that sequence.
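A sketch of the described behavior. That the requested count lives at sampling_params.logprobs and is None when not requested are assumptions of this sketch:

```python
from typing import List

def get_all_num_logprobs_sketch(seq_group_metadata_list) -> List[int]:
    all_num_logprobs = []
    for seq_group_metadata in seq_group_metadata_list:
        # Assumption: logprobs is None when no logprobs were requested.
        num_logprobs = seq_group_metadata.sampling_params.logprobs
        all_num_logprobs.append(0 if num_logprobs is None else num_logprobs)
    return all_num_logprobs
```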
get_sampled_token_logprobs
get_sampled_token_logprobs(
    logprob_tensor: Tensor, sampled_token_ids: Tensor
) -> Tuple[Tensor, Tensor]
Get the logprobs for the sampled tokens. Returns the ranks and logprobs.
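A sketch of the gather-and-rank computation described; the tensor shapes assumed here ([num_steps, batch_size, vocab_size] and [num_steps, batch_size]) are not confirmed from the source:

```python
import torch

def get_sampled_token_logprobs_sketch(
    logprob_tensor: torch.Tensor,     # assumed [num_steps, batch, vocab]
    sampled_token_ids: torch.Tensor,  # assumed [num_steps, batch]
):
    # Gather the logprob of each sampled token id.
    selected = logprob_tensor.gather(-1, sampled_token_ids.unsqueeze(-1))
    # Rank = 1 + number of vocab entries with a strictly larger logprob.
    ranks = (logprob_tensor > selected).sum(-1) + 1
    return ranks, selected.squeeze(-1)
```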
maybe_mock_device_tensors
maybe_mock_device_tensors(
    sampler_output: SamplerOutput,
    batch_size: int,
    vocab_size: int,
    device: str,
) -> None
Helper method which mocks out the GPU tensors in SamplerOutput with dummy values. This will be removed in PR 7/9. https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit#heading=h.qijw1sdidrer
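A rough sketch of what such mocking could look like; which SamplerOutput fields the real helper populates is an assumption here:

```python
import torch

def mock_device_tensors_sketch(sampler_output, batch_size, vocab_size, device):
    # Fill the output with random but well-formed dummy values.
    # Assumption: these are the fields the real helper replaces.
    logits = torch.rand(batch_size, vocab_size, device=device)
    sampler_output.sampled_token_probs = torch.softmax(logits, dim=-1)
    sampler_output.sampled_token_ids = torch.randint(
        low=0, high=vocab_size, size=(batch_size, ), device=device)
```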
nvtx_range
Context manager / decorator that pushes an NVTX range at the beginning of its scope, and pops it at the end. If extra arguments are given, they are passed as arguments to msg.format().
If running with CUDA graphs, you must enable nsys CUDA graph profiling.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
msg | string | message to associate with the range | required |
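A sketch of such a push/pop helper built on torch.cuda.nvtx; the exact implementation in vllm/spec_decode/util.py may differ:

```python
from contextlib import contextmanager
import torch

@contextmanager
def nvtx_range_sketch(msg, *args, **kwargs):
    # Push a named NVTX range; extra args are applied via msg.format().
    torch.cuda.nvtx.range_push(msg.format(*args, **kwargs))
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()

# Usage: the range shows up in an Nsight Systems (nsys) timeline.
with nvtx_range_sketch("scoring step {}", 3):
    ...  # GPU work to profile
```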
sampler_output_to_torch
sampler_output_to_torch(
    sampler_output_list: Sequence[SamplerOutput],
    sampler_transposed: bool,
) -> Tuple[Tensor, Tensor, Tensor, Optional[Tensor]]
Utility function which converts a list of SamplerOutput to tensors.
Here, sampler_transposed indicates whether additional tensor transpose logic is needed.
Returns:

Name | Type | Description |
---|---|---|
sampled_token_ids | Tensor | torch.Tensor of shape [batch_size, len(sampler_output_list)] |
sampled_token_probs | Tensor | torch.Tensor of shape [batch_size, len(sampler_output_list), vocab_size] |
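A sketch of the stacking and transpose logic described above; the SamplerOutput attribute names (sampled_token_probs, sampled_token_ids) are assumptions of this sketch:

```python
import torch

def stack_sampler_outputs_sketch(sampler_output_list, sampler_transposed):
    # Stack one tensor per step: [num_steps, batch, ...].
    probs = torch.stack(
        [out.sampled_token_probs for out in sampler_output_list], dim=0)
    token_ids = torch.stack(
        [out.sampled_token_ids.flatten() for out in sampler_output_list],
        dim=0)
    if sampler_transposed:
        # Swap to [batch_size, num_steps, ...] as in the documented shapes.
        probs = probs.transpose(0, 1)
        token_ids = token_ids.transpose(0, 1)
    return token_ids, probs
```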
split_batch_by_proposal_len
split_batch_by_proposal_len(
    seq_group_metadata_list: List[SequenceGroupMetadata],
    proposal_lens: List[int],
) -> Tuple[
    Tuple[List[SequenceGroupMetadata], List[int]],
    Tuple[List[SequenceGroupMetadata], List[int]],
]
Utility function that splits a batch based on whether the proposal len is zero or not. We should remove this once vLLM supports per-sequence proposal lens in a batch.
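A sketch of the split; that the non-zero-proposal-len group comes first in the returned tuple is an assumption:

```python
def split_by_proposal_len_sketch(seq_group_metadata_list, proposal_lens):
    nonzero, nonzero_indices = [], []
    zero, zero_indices = [], []
    for i, (metadata, proposal_len) in enumerate(
            zip(seq_group_metadata_list, proposal_lens)):
        if proposal_len > 0:
            nonzero.append(metadata)
            nonzero_indices.append(i)
        else:
            zero.append(metadata)
            zero_indices.append(i)
    # Assumption: the non-zero-proposal-len group is returned first.
    return (nonzero, nonzero_indices), (zero, zero_indices)
```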