Skip to content
Tools · Jun 26, 2026

Hugging Face adds one-command vLLM server deployment via HF Jobs

Users can now launch an OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single CLI command, billed per second with no server provisioning required.

Trust79
HypeLow hype

1 source · cross-referenced

ShareXLinkedInEmail
TL;DR
  • Hugging Face Jobs now supports launching a private, OpenAI-compatible vLLM server with a single command.
  • The endpoint is billed per second and does not require provisioning servers or Kubernetes.
  • Access is gated by Hugging Face tokens, and the server can be queried via curl or the OpenAI Python client.
  • Users can scale to larger models by selecting higher-tier GPU flavors and adjusting vLLM parameters.

Hugging Face has added a feature to its Jobs platform that lets users launch a private, OpenAI-compatible vLLM server with a single command, eliminating the need to provision servers or manage Kubernetes clusters. The deployment is billed per second based on hardware usage, making it suitable for short-lived workloads such as tests, evaluations, or batch generation.

To launch the server, users run a CLI command that specifies a GPU flavor, exposes the vLLM port, and sets a timeout. For example, the command uses the official vllm/vllm-openai image to serve the Qwen/Qwen3-4B model on an A10G GPU. The command returns a URL where the server is reachable, and users can monitor startup via logs until "Application startup complete" appears.

Access to the endpoint is gated by Hugging Face tokens, requiring a bearer token in each request. Users can query the endpoint using curl or the OpenAI Python client by pointing to the exposed URL and passing their Hugging Face token as the API key. The endpoint adheres to the OpenAI API schema, returning responses in the standard format.

The feature supports scaling to larger models by selecting higher-tier GPU flavors and configuring vLLM parameters such as tensor-parallel-size. For instance, the 122B Qwen3.5 mixture-of-experts model can be served on a dual H200 configuration with adjusted context length and batch settings to avoid out-of-memory errors.

Users can also attach an SSH session to a running job for debugging, monitoring GPU memory, or inspecting logs interactively. This is enabled by adding the --ssh flag to the launch command and registering a public key with Hugging Face.

Hugging Face distinguishes this offering from its managed Inference Endpoints service, positioning the Jobs-based vLLM server as a lightweight option for development and testing rather than production deployments.

Sources
  1. 01Hugging FaceRun a vLLM Server on HF Jobs in One Command
Also on Tools

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.