Serving an open-source model yourself means renting a GPU and running an inference server on it. In this workshop we take a quantized DeepSeek model and rent a GPU by the hour on RunPod. We serve it with vLLM behind an OpenAI-compatible API. Then we point a small tool-calling FAQ agent at it, the same way we would point it at OpenAI.

Most of the code is written by an AI assistant (Codex), and the prompts are quoted verbatim.

This was a freestyle session with no prep, so it also surfaces the dead ends:

A model that does not fit the card.
A full container disk.
An SSH session that keeps dropping.
A tool parser that needs a flag we did not set.

Each one has a fix worth keeping.

Links

These materials are related:

The system you build

The model runs on the rented GPU, not on your laptop. Everything you write locally talks to it over HTTP, exactly the way it talks to OpenAI. Only the base URL and the API key change.

The setup has these pieces:

A GPU pod on RunPod runs the model.
vLLM serves it behind an OpenAI-compatible HTTP API.
RunPod's HTTPS proxy exposes that API, guarded by a bearer token.
The FAQ agent runs on your laptop and calls one search tool.

flowchart LR AGENT["FAQ agent (your laptop)"] PROXY["RunPod HTTPS proxy <pod-id>-8000.proxy.runpod.net"] VLLM["vLLM server OpenAI-compatible API"] GPU["GPU pod DeepSeek-R1-Distill-Qwen-14B-AWQ"] FAQ["DataTalks.Club FAQ"] AGENT -->|"POST /v1/chat/completions + Bearer token"| PROXY PROXY --> VLLM VLLM --> GPU AGENT -->|search tool| FAQ

The model is stelterlab/DeepSeek-R1-Distill-Qwen-14B-AWQ, a 14-billion-parameter model quantized to fit a single 24 GB consumer GPU.

The agent is the FAQ search agent from the end-to-end agent deployment workshop. We port it from the OpenAI Responses API to vLLM's chat completions API. It answers Data Engineering Zoomcamp questions with one search tool against the course FAQ.

Workshop flow

We build the workshop in this order:

Rent a GPU by the hour on RunPod and connect over SSH.
Install vLLM and serve a quantized model.
Fit the model on the card, then move the caches to the pod volume.
Port the FAQ agent and get the open model to make tool calls.
Reach the model over the network with an API key.
Make the whole thing reproducible with a one-command deploy.
Shut everything down, including the storage that keeps billing after the pod is gone.

The reproducible deploy recreates the part the live session ran out of time to finish. The teardown step is the one I learned the hard way: stopping the pod does not stop the storage bill.

Serving Open Models with vLLM on RunPod

Links

The system you build

Workshop flow

Tutorial pages

Watch the recording