Hugging Face adds one-command vLLM server deployment via HF Jobs
Users can now launch an OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single CLI command, billed per second with no server provisioning required.
1 source · cross-referenced
- Hugging Face Jobs now supports launching a private, OpenAI-compatible vLLM server with a single command.
- The endpoint is billed per second and does not require provisioning servers or Kubernetes.
- Access is gated by Hugging Face tokens, and the server can be queried via curl or the OpenAI Python client.
- Users can scale to larger models by selecting higher-tier GPU flavors and adjusting vLLM parameters.
Hugging Face has added a feature to its Jobs platform that lets users launch a private, OpenAI-compatible vLLM server with a single command, eliminating the need to provision servers or manage Kubernetes clusters. The deployment is billed per second based on hardware usage, making it suitable for short-lived workloads such as tests, evaluations, or batch generation.
To launch the server, users run a CLI command that specifies a GPU flavor, exposes the vLLM port, and sets a timeout. For example, the command uses the official vllm/vllm-openai image to serve the Qwen/Qwen3-4B model on an A10G GPU. The command returns a URL where the server is reachable, and users can monitor startup via logs until "Application startup complete" appears.
Access to the endpoint is gated by Hugging Face tokens, requiring a bearer token in each request. Users can query the endpoint using curl or the OpenAI Python client by pointing to the exposed URL and passing their Hugging Face token as the API key. The endpoint adheres to the OpenAI API schema, returning responses in the standard format.
The feature supports scaling to larger models by selecting higher-tier GPU flavors and configuring vLLM parameters such as tensor-parallel-size. For instance, the 122B Qwen3.5 mixture-of-experts model can be served on a dual H200 configuration with adjusted context length and batch settings to avoid out-of-memory errors.
Users can also attach an SSH session to a running job for debugging, monitoring GPU memory, or inspecting logs interactively. This is enabled by adding the --ssh flag to the launch command and registering a public key with Hugging Face.
Hugging Face distinguishes this offering from its managed Inference Endpoints service, positioning the Jobs-based vLLM server as a lightweight option for development and testing rather than production deployments.
- Jun 26, 2026 · TechCrunch — AI
Patronus AI raises $50M Series B to build simulated environments for testing AI agents
Trust79 - Jun 25, 2026 · TechCrunch — AI
Unconventional AI unveils oscillator-based architecture with 1,000x power efficiency claim for inference
Trust71 - Jun 25, 2026 · Hugging Face
Hugging Face study finds hybrid models excel at predicting meaning-bearing tokens but trail on verbatim repeats
Trust79