vllm.model_executor.layers.typical_acceptance_sampler
TypicalAcceptanceSampler
Bases: SpecDecodeDeterministicBaseSampler
Apply typical acceptance sampling as described in Section 3.3.1 of "MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" (https://arxiv.org/pdf/2401.10774).
Source code in vllm/model_executor/layers/typical_acceptance_sampler.py
__init__
Create a Typical Acceptance Sampler.
Parameters:
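Name | Type | Description | Default |
---|---|---|---|
strict_mode | bool | Whether or not to perform shape/device/dtype checks. | False |
posterior_threshold | float | A threshold value that sets a lower bound on the posterior probability of a token under the target model for it to be accepted. | required |
posterior_alpha | float | A scaling factor for the entropy-based threshold in typical acceptance sampling. | required |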
_evaluate_accepted_tokens
Evaluates and returns a mask of accepted tokens based on the posterior probabilities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_probs | Tensor | A tensor of shape (batch_size, k, vocab_size) representing the probabilities of each token in the vocabulary for each position in the proposed sequence. This is the distribution generated by the target model. | required |
draft_token_ids | Tensor | A tensor of shape (batch_size, k) representing the proposed token ids. | required |
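A draft token_id x_{n+k} is accepted if it satisfies the following condition:

$$
p_{\text{original}}(x_{n+k} \mid x_1, x_2, \dots, x_{n+k-1}) > \min\left(\epsilon,\ \delta \cdot \exp\left(-H\left(p_{\text{original}}(\cdot \mid x_1, x_2, \dots, x_{n+k-1})\right)\right)\right)
$$

where $p_{\text{original}}$ corresponds to target_probs, $H$ is the entropy, and $\epsilon$ and $\delta$ correspond to the hyperparameters specified using self._posterior_threshold and self._posterior_alpha, respectively.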
This method computes the posterior probabilities for the given draft token ids based on the provided target probabilities. It calculates the entropy of the posterior distribution and determines a dynamic threshold for each token position using the provided posterior_threshold and posterior_alpha values. The method then returns a boolean mask indicating which tokens can be accepted.
Returns:
Type | Description |
---|---|
torch.Tensor | A boolean tensor of shape (batch_size, k) where each element indicates whether the corresponding draft token has been accepted or rejected. True indicates acceptance and False indicates rejection. |
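The acceptance rule described above can be illustrated with a minimal pure-Python sketch for a single sequence. This is not the library's tensor implementation, which may differ in numerical details; the function name `evaluate_accepted_tokens`, the list-based inputs, and the hyperparameter values 0.09 and 0.3 are assumptions chosen for illustration only.

```python
import math

def evaluate_accepted_tokens(target_probs, draft_token_ids,
                             posterior_threshold, posterior_alpha):
    """Return one bool per draft position for a single sequence.

    target_probs: list of k probability rows, each of length vocab_size.
    draft_token_ids: list of k proposed token ids.
    """
    accepted = []
    for probs, token_id in zip(target_probs, draft_token_ids):
        # Entropy H of the target distribution at this position
        # (zero-probability entries contribute nothing).
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        # Dynamic threshold: min(epsilon, delta * exp(-H)).
        threshold = min(posterior_threshold,
                        posterior_alpha * math.exp(-entropy))
        accepted.append(probs[token_id] > threshold)
    return accepted

# Hypothetical inputs: k = 2 positions, vocab_size = 4.
target_probs = [
    [0.97, 0.01, 0.01, 0.01],  # peaked distribution, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform distribution, high entropy
]
mask_top = evaluate_accepted_tokens(target_probs, [0, 2], 0.09, 0.3)
mask_low = evaluate_accepted_tokens(target_probs, [1, 2], 0.09, 0.3)
```

In this sketch the uniform position gets a lower threshold than the peaked one, so a token with probability 0.25 is accepted there, while a token with probability 0.01 at the peaked position is rejected.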
_get_recovered_token_ids
Returns the recovered token ids. A recovered token, selected from the target distribution, replaces the first rejected draft token in a sequence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_probs | Tensor | A tensor of shape (batch_size, k, vocab_size) containing the target probability distribution. | required |
Returns:
Type | Description |
---|---|
torch.Tensor | A tensor of shape (batch_size, k) with the recovered token ids, which are selected from the target probabilities. |
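One plausible way to select a recovered token per position is to take the most probable token under the target distribution. The sketch below assumes this greedy (argmax) selection, which the text above does not state explicitly; the function name and list-based inputs are hypothetical and cover a single sequence only.

```python
def get_recovered_token_ids(target_probs):
    """Pick one replacement token per position for a single sequence.

    Assumption: the replacement is the most probable target token (argmax).
    target_probs: list of k probability rows, each of length vocab_size.
    """
    return [max(range(len(probs)), key=probs.__getitem__)
            for probs in target_probs]

# Hypothetical inputs: k = 2 positions, vocab_size = 3.
recovered = get_recovered_token_ids([[0.1, 0.7, 0.2], [0.5, 0.2, 0.3]])
```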
forward
forward(
target_with_bonus_probs: Tensor,
bonus_token_ids: Tensor,
draft_probs: Tensor,
draft_token_ids: Tensor,
) -> Tensor
Sample token ids using typical acceptance sampling. This accepts or rejects tokens proposed by the draft model using the probability of each token according to the target model.
In the worst case where all draft tokens are rejected, it is guaranteed one token will be emitted.
In the case where all draft tokens are accepted, the bonus token will be accepted.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_with_bonus_probs | Tensor | The probability distribution over token ids given context according to the target model. | required |
bonus_token_ids | Tensor | The "bonus" token ids that are accepted iff all speculative tokens in a sequence are accepted. | required |
draft_probs | Tensor | This parameter is unused by the acceptance sampler. | required |
draft_token_ids | Tensor | The token ids that were sampled from the draft probabilities. | required |
Returns:
Name | Type | Description |
---|---|---|
output_token_ids | Tensor | The token ids sampled via typical acceptance sampling, or -1 if unable to sample a token because the previous token was rejected. shape = [batch_size, num_speculative_tokens + num_bonus_tokens] |
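The output layout described above (accepted draft tokens, one recovered token at the first rejection, -1 afterwards, and the bonus token only when every draft token is accepted) can be sketched for a single sequence as follows. This is an illustrative pure-Python sketch, not the library's batched implementation; the function name `assemble_output` and the assumption of exactly one bonus token per sequence are hypothetical.

```python
def assemble_output(accepted, draft_token_ids, recovered_token_ids,
                    bonus_token_id):
    """Build the output row for one sequence.

    accepted: list of k bools (the acceptance mask).
    draft_token_ids / recovered_token_ids: lists of k token ids.
    bonus_token_id: single bonus token id (assumes one bonus token).
    """
    k = len(draft_token_ids)
    output = [-1] * (k + 1)  # k speculative slots plus one bonus slot
    for i in range(k):
        if not accepted[i]:
            # First rejection: emit the recovered token, leave the rest at -1.
            output[i] = recovered_token_ids[i]
            return output
        output[i] = draft_token_ids[i]
    # Every draft token was accepted, so the bonus token is emitted too.
    output[k] = bonus_token_id
    return output

all_accepted = assemble_output([True, True, True], [5, 6, 7], [9, 9, 9], 42)
partial = assemble_output([True, False, True], [5, 6, 7], [1, 2, 3], 42)
none_accepted = assemble_output([False, False, False], [5, 6, 7], [1, 2, 3], 42)
```

Even when every draft token is rejected, the first slot holds a recovered token, which matches the guarantee that at least one token is emitted.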