vllm.entrypoints.serve.rlhf.api_router
attach_router
engine_client
engine_client(request: Request) -> EngineClient
get_world_size async
get_world_size(
    raw_request: Request, include_dp: bool = Query(True)
)
Get the world size from the parallel config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `include_dp` | `bool` | If True (default), returns the world size including data parallelism (TP * PP * DP). If False, returns the world size without data parallelism (TP * PP). | `Query(True)` |
Source code in vllm/entrypoints/serve/rlhf/api_router.py
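The arithmetic behind `include_dp` can be sketched as a standalone function. This is an illustration of the documented rule only, not the router's implementation; the function name `world_size` and its `tp`, `pp`, `dp` arguments are hypothetical stand-ins for the tensor-, pipeline-, and data-parallel sizes in the parallel config.

```python
def world_size(tp: int, pp: int, dp: int, include_dp: bool = True) -> int:
    """Return TP * PP * DP, or TP * PP when data parallelism is excluded.

    Hypothetical helper: it mirrors the documented rule, not vLLM's code.
    """
    size = tp * pp
    if include_dp:
        size *= dp
    return size


print(world_size(tp=4, pp=2, dp=2))                    # 16
print(world_size(tp=4, pp=2, dp=2, include_dp=False))  # 8
```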
init_weight_transfer_engine async
Source code in vllm/entrypoints/serve/rlhf/api_router.py
is_paused async
Return the current pause status.
Source code in vllm/entrypoints/serve/rlhf/api_router.py
pause_generation async
pause_generation(
    raw_request: Request,
    mode: Annotated[PauseMode, Query()] = "abort",
    wait_for_inflight_requests: bool = Query(False),
    clear_cache: Annotated[bool, Query()] = True,
) -> JSONResponse
Pause generation requests to allow weight updates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mode` | `Annotated[PauseMode, Query()]` | How to handle in-flight requests: - | `'abort'` |
| `wait_for_inflight_requests` | `bool` | DEPRECATED. Use | `Query(False)` |
| `clear_cache` | `Annotated[bool, Query()]` | DEPRECATED. Whether to clear KV/prefix caches after draining. Ignored when mode="keep". | `True` |
Source code in vllm/entrypoints/serve/rlhf/api_router.py
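Because `mode` is declared as a query parameter, a client would pass it in the URL's query string. The sketch below only builds such a URL. The base URL and the `/pause` path are placeholders assumed for illustration, since this page does not state the route or the HTTP method.

```python
from urllib.parse import urlencode

# Assumptions: both values are placeholders, not taken from this page.
BASE_URL = "http://localhost:8000"
PAUSE_PATH = "/pause"


def build_pause_url(mode: str = "abort") -> str:
    """Build a URL that carries `mode` as a query parameter."""
    return f"{BASE_URL}{PAUSE_PATH}?{urlencode({'mode': mode})}"


print(build_pause_url())        # http://localhost:8000/pause?mode=abort
print(build_pause_url("keep"))  # http://localhost:8000/pause?mode=keep
```

The deprecated `wait_for_inflight_requests` and `clear_cache` parameters are left out of the sketch on purpose.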
resume_generation async
Resume generation after a pause.
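Since pausing exists to allow weight updates, a caller will usually want resume to run even if the update fails. One way to get that guarantee is an async context manager, sketched here with injected callables. The name `generation_paused` and the `pause`/`resume` arguments are hypothetical; in practice they would wrap whatever client calls reach the pause and resume endpoints.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def generation_paused(pause, resume):
    """Call `pause`, run the body, then always call `resume`."""
    await pause()
    try:
        yield
    finally:
        # Runs even if the body raises, so generation is not left paused.
        await resume()


async def demo() -> list[str]:
    events: list[str] = []

    # Stand-ins for real client calls (assumptions for the demo).
    async def fake_pause() -> None:
        events.append("pause")

    async def fake_resume() -> None:
        events.append("resume")

    async with generation_paused(fake_pause, fake_resume):
        events.append("update weights")
    return events


print(asyncio.run(demo()))  # ['pause', 'update weights', 'resume']
```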