Modal

vLLM can be run on cloud GPUs with Modal, a serverless computing platform designed for fast auto-scaling.

For details on how to deploy vLLM on Modal, see the tutorial in the Modal documentation.
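That tutorial walks through a complete deployment. As a rough sketch of the shape such a deployment takes, the example below wraps vLLM's OpenAI-compatible server (`vllm serve`) in a Modal web endpoint. The app name, GPU type, model, and timeout used here are illustrative assumptions, not values taken from the tutorial.

```python
# Minimal sketch of serving vLLM on Modal (illustrative only; see the Modal
# tutorial for the authoritative version). GPU type, model name, and the
# startup timeout below are assumptions.
import subprocess

import modal

# Container image with vLLM installed on top of a slim Debian base.
vllm_image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

app = modal.App("example-vllm-server")

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed model; any vLLM-supported model works
PORT = 8000


@app.function(image=vllm_image, gpu="A100")
@modal.web_server(port=PORT, startup_timeout=600)
def serve():
    # Start the OpenAI-compatible server in the background; Modal proxies PORT
    # to a public URL and scales containers up and down with demand.
    subprocess.Popen(
        ["vllm", "serve", MODEL, "--host", "0.0.0.0", "--port", str(PORT)]
    )
```

Deploying a file like this with `modal deploy` produces a public URL that any OpenAI-compatible client can target, with Modal starting and stopping GPU containers in response to traffic, which is the auto-scaling behavior described above.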
