How LiveKit Works: The Complete Engineer's Guide
You've probably heard "real-time voice AI" thrown around a lot lately. Agents that answer your phone, book appointments, handle support calls — they're everywhere. And a huge chunk of them are quietly running on LiveKit under the hood. This post tears it apart, layer by layer, so you actually understand what's happening when someone calls your AI agent.
The Problem Nobody Talks About
Before we get into LiveKit, let's talk about why it exists.
Sending audio in real-time sounds easy. It's not.
The browser protocol for real-time audio and video is called WebRTC. It was designed for two browsers talking directly to each other — peer to peer. Think of a one-to-one video call where media flows straight between the two browsers, never touching a central media server.
The moment you put a server in the middle — which you must do for an AI agent — you're fighting against everything WebRTC was designed for. Your agent needs to:
- Receive audio from the user in real time
- Send it to a speech-to-text (STT) service
- Wait for an LLM to generate a response
- Convert that response to speech via TTS
- Stream the audio back to the user without noticeable delay
That's a lot of moving parts. And the latency budget is brutal. Humans start feeling an awkward pause after about 500 milliseconds. Add network travel, STT processing, LLM thinking time, and TTS synthesis — and you're already tight.
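To make that budget concrete, here's a back-of-envelope sum. The per-stage figures are illustrative assumptions, not measurements — your actual numbers depend on providers and regions:

```python
# Rough, illustrative per-stage latencies in milliseconds (assumed figures).
STAGES_MS = {
    "network (user -> server)": 40,
    "STT (streaming, time-to-first-transcript)": 150,
    "LLM (time-to-first-token)": 200,
    "TTS (time-to-first-audio)": 120,
    "network (server -> user)": 40,
}

total_ms = sum(STAGES_MS.values())
print(f"time-to-first-audio: ~{total_ms} ms")  # ~550 ms — already past the comfort line
```

Even with optimistic numbers for each stage, you blow through 500 ms before doing anything clever. That's why every stage has to stream instead of waiting for the previous one to finish.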
LiveKit is the infrastructure layer that makes this possible without you losing your mind.
What LiveKit Actually Is
At its core, LiveKit is an SFU — Selective Forwarding Unit.
Think of it as a specialized router for media. When a user sends their audio, the SFU receives it and forwards copies to whoever needs it — without re-encoding, without re-processing, just forwarding packets. This is wildly more efficient than mixing audio on a server (what older conferencing systems do).
LiveKit is:
- Open source — you can self-host it entirely
- Written in Go — fast, low memory, made for concurrency
- WebRTC-native — speaks the same protocol as browsers and phones
- Horizontally scalable — run one node or a hundred, same config
But LiveKit is not just the server. The ecosystem includes:
| Component | What it does |
|---|---|
| livekit-server | The SFU core — routes audio/video between participants |
| livekit-agents | Python SDK for building AI agents that join rooms |
| livekit-plugins-* | Drop-in integrations for Deepgram, ElevenLabs, OpenAI, etc. |
| LiveKit Cloud | Managed hosting if you don't want to self-host |

The Three Core Concepts: Room, Participant, Track
Before anything else clicks, you need these three locked in.
Room
A Room is a named virtual space where a conversation happens. Think of it like a phone call session — it has a name, it lives on a server node, and it disappears when everyone leaves.
```python
await room_service.create_room(CreateRoomRequest(name="session-user-abc"))
```
One room = one conversation. Your backend creates it. LiveKit hosts it.

Participant
Anyone — or anything — that joins a room is a participant. This is the subtle thing most people miss: your AI agent is also a participant. It connects to the room with its own identity, just like the user does. It has a name, it can publish audio, it can subscribe to other participants' audio.
So a typical voice agent session has exactly two participants:
1. The human user
2. The AI agent
Track
A Track is a single media stream. The user's microphone = one audio track. The agent's text-to-speech output = one audio track. These tracks are published into the room, and the SFU forwards them to the other participants.
That's it. Room → Participants → Tracks. Everything else builds on top of this.
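The mental model is small enough to sketch in a few dataclasses. To be clear, these are not LiveKit's actual types — just an illustration of how the three concepts relate:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str    # "audio" or "video"
    source: str  # e.g. "microphone", "tts"

@dataclass
class Participant:
    identity: str
    tracks: list[Track] = field(default_factory=list)

@dataclass
class Room:
    name: str
    participants: list[Participant] = field(default_factory=list)

# A typical voice session: one human, one agent, one audio track each.
user = Participant("user-abc", [Track("audio", "microphone")])
agent = Participant("agent", [Track("audio", "tts")])
room = Room("session-user-abc", [user, agent])

print(len(room.participants))  # 2
```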
How the SFU Works (Without Melting Your Brain)
Imagine a phone call without WebRTC complexity. User speaks → audio goes somewhere → agent hears it → agent responds → user hears response.
The SFU sits in the middle. Here's what it's actually doing:
```
User's mic audio  → [SFU] → Agent receives it
Agent's TTS audio → [SFU] → User receives it
```
The SFU never re-encodes the audio. It just reads the packet headers to know where to send them and forwards them. This is why it's fast — it's doing router-level work, not compute-level work.
For a multi-node setup (production), LiveKit nodes coordinate through Redis. If User A connects to Node 1 while another participant in the same room lands on Node 2, Redis holds the shared routing state so the nodes stay in sync. Each room lives on exactly one node, but participants can be spread across nodes transparently.
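If you self-host, wiring up that coordination is a few lines in the server config. A minimal sketch of a `livekit.yaml` for a multi-node deployment — the Redis address is a placeholder for your own instance:

```yaml
# livekit.yaml — multi-node coordination via Redis
port: 7880
redis:
  address: redis.internal:6379   # placeholder — your Redis host
rtc:
  port_range_start: 50000
  port_range_end: 60000
```

With no `redis` block, the server runs happily as a single standalone node; adding it is what turns a set of nodes into a cluster.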
The Agent Pipeline: What Happens When Someone Calls
This is the juicy part. Let's walk through a real call, step by step.
Step 1 — Your backend creates a room
The moment a user initiates a call (via your app, or via Twilio ringing your phone number), your FastAPI backend creates a LiveKit room and mints an access token for the user.
```python
# Your backend
room_name = f"call-{tenant_id}-{uuid4().hex[:8]}"
await room_service.create_room(CreateRoomRequest(
    name=room_name,
    metadata=json.dumps({
        "tenant_id": tenant_id,
        "stt": "deepgram",
        "llm": "gpt-4o-mini",
        "tts": "elevenlabs"
    })
))

# Mint a token for the user — this is their keycard into the room
token = AccessToken(API_KEY, API_SECRET)
token.add_grant(VideoGrants(room_join=True, room=room_name))
return token.to_jwt()
```
The user's app takes this token and connects to LiveKit. They're now inside the room, and their microphone is publishing audio as a track.
Step 2 — LiveKit dispatches a job to your worker
Your agent code runs as a Worker — a persistent process that registers with LiveKit and says: "I'm here. Give me jobs."
When LiveKit sees a new room, it dispatches a job to the worker with the lowest load. The worker receives the job and spawns a subprocess specifically for this session.
```
Worker process (always running, always registered)
└── receives job for "call-tenant1-abc123"
    └── spawns: AgentJob subprocess
        └── connects to "call-tenant1-abc123" as a participant
```
That subprocess connects to the room as a second participant — the agent. Now the room has two participants: the human and the agent.
Step 3 — The voice pipeline starts
Inside the AgentJob subprocess, this runs continuously:
```
User speaks
  → VAD (Voice Activity Detection) detects speech
  → Audio chunks stream to STT (Deepgram / Whisper)
  → Transcript: "Hey, I want to book an appointment for Friday"
  → Turn detection: "Is the user done? Yes."
  → LLM receives transcript, generates response token by token
  → TTS receives tokens, generates audio in real time
  → Audio published as agent's track in the room
  → User hears the agent's voice
```
Each step is streaming. STT doesn't wait for the full sentence. LLM doesn't wait for STT to finish. TTS doesn't wait for LLM to finish. It's a streaming pipeline from mic to speaker.
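The overlap is the whole trick. Here's a toy asyncio sketch of the idea — stand-in stages, not the real plugins — where each stage consumes from a queue and forwards partial results downstream the moment they exist:

```python
import asyncio

async def stt(audio_chunks, out):
    # Emit partial transcripts as audio arrives, not at the end.
    for chunk in audio_chunks:
        await asyncio.sleep(0)       # pretend to do work
        await out.put(f"word-{chunk}")
    await out.put(None)              # end-of-stream marker

async def llm(inp, out):
    # Start responding as soon as the first words arrive.
    while (word := await inp.get()) is not None:
        await out.put(word.upper())  # stand-in for token generation
    await out.put(None)

async def tts(inp, spoken):
    while (token := await inp.get()) is not None:
        spoken.append(token)         # stand-in for audio synthesis

async def main():
    q1, q2, spoken = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently, connected by queues.
    await asyncio.gather(stt(range(3), q1), llm(q1, q2), tts(q2, spoken))
    return spoken

print(asyncio.run(main()))  # ['WORD-0', 'WORD-1', 'WORD-2']
```

The real pipeline is vastly more sophisticated, but the shape is the same: concurrent stages joined by streams, so the user hears audio while the LLM is still generating.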
Step 4 — Interruptions
This is where it gets elegant. If the user starts talking while the agent is mid-sentence, the VAD detects that the user has started speaking and the pipeline cancels everything in flight — the LLM stops generating, the TTS stops playing. The agent shuts up and listens. This is what makes LiveKit agents feel natural rather than robotic.
The Worker / Job Architecture (The Part Everyone Gets Wrong)
Here's the most common misconception: one worker per session.
That's wrong. Here's the reality:
```
One server machine
└── One Worker process (registered with LiveKit)
    ├── AgentJob subprocess #1 → room-user-A
    ├── AgentJob subprocess #2 → room-user-B
    ├── AgentJob subprocess #3 → room-user-C
    └── ... up to ~20 concurrent jobs
```
One Worker can handle 10–25 concurrent sessions. Each session is its own isolated subprocess with its own memory, its own LLM context, and its own audio pipeline. None of them knows the others exist.
Your agent code is like a recipe, not a running process. The Worker is the kitchen. Each session is a chef hired to cook that recipe. When the call ends, the chef clocks out. The kitchen stays open.
What happens when a second user calls?
Your backend creates a new room. LiveKit sees it and dispatches another job to the same Worker (if it's not too busy). Another subprocess spawns. Another room, another pipeline, completely isolated.
```
room-user-A → AgentJob #1 (running STT→LLM→TTS independently)
room-user-B → AgentJob #2 (running STT→LLM→TTS independently)
```
User A and User B are having completely separate conversations, in complete isolation, on the same server.
Scaling to 100 Concurrent Sessions
Now let's say your SaaS is growing and you have 100 people calling simultaneously. Here's how the math works.
A typical voice AI pipeline — Deepgram STT, GPT-4o-mini, ElevenLabs TTS — uses roughly:
- CPU: 0.2–0.5 cores per session (mostly waiting on external API calls)
- RAM: 150–300 MB per session (audio buffers, LLM context)
- Network: ~64 kbps per session (Opus-encoded audio)
On a 4-core / 8GB server, you can comfortably run ~20 concurrent sessions before the Worker starts hitting its load threshold.
For 100 concurrent sessions: 5–6 Worker servers, each handling ~20 sessions.
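That sizing works out like this — the figures come straight from the estimates above, so treat them as assumptions to validate against your own load tests:

```python
import math

# Assumed per-session footprint (worst-case end of the ranges above).
cpu_per_session = 0.5        # cores
ram_per_session_mb = 300
server_cores, server_ram_mb = 4, 8192

by_cpu = server_cores / cpu_per_session      # 8 sessions if fully CPU-bound
by_ram = server_ram_mb / ram_per_session_mb  # ~27 sessions by memory
# In practice sessions mostly wait on network I/O, so ~20/server is realistic.
sessions_per_server = 20

servers_needed = math.ceil(100 / sessions_per_server)
print(servers_needed)  # 5
```

The ceiling is rarely raw CPU or RAM — it's how gracefully the pipeline degrades when all sessions hit their external APIs at once. Leave headroom.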
The Load Function
Each Worker reports its own load back to LiveKit via a load function. LiveKit uses this to decide which Worker to dispatch new jobs to.
```python
import psutil

def my_load_function(ctx: JobContext) -> float:
    # 0.0 = idle, 1.0 = completely full
    return psutil.cpu_percent() / 100.0

cli.run_app(WorkerOptions(
    entrypoint_fnc=entrypoint,
    load_fnc=my_load_function,
    load_threshold=0.7,  # stop taking jobs at 70% CPU
))
```
When Worker 1 hits 70% CPU, LiveKit automatically routes new jobs to Worker 2. When Worker 2 fills up, it goes to Worker 3. This is the autoscaling mechanism — you pair it with your infrastructure's container autoscaling (ECS, Kubernetes, Railway, Fly.io) to spin up new Worker containers automatically.
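Conceptually, the dispatch decision looks something like this — a deliberate simplification of LiveKit's actual scheduler, for intuition only:

```python
def pick_worker(workers, load_threshold=0.7):
    """Pick the least-loaded worker that is still under its threshold."""
    eligible = [w for w in workers if w["load"] < load_threshold]
    if not eligible:
        return None  # all full -> time for your autoscaler to add a worker
    return min(eligible, key=lambda w: w["load"])

workers = [
    {"name": "worker-1", "load": 0.72},  # over threshold, skipped
    {"name": "worker-2", "load": 0.35},
    {"name": "worker-3", "load": 0.10},
]
print(pick_worker(workers)["name"])  # worker-3
```

The `None` case is your autoscaling signal: when no worker is eligible, new jobs queue until a fresh container registers.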
Keep Workers Pre-Warmed
One gotcha: if all your Workers are cold (just started), the first user gets a ~10–15 second delay while the container boots and registers with LiveKit.
Always keep a minimum of 2–3 Workers running at all times, even at zero load. The cost is tiny compared to the UX damage of a cold start.
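If you're on Kubernetes, the fix is simply a floor on replicas. A sketch of what that looks like — resource names here are placeholders for your own deployment:

```yaml
# HorizontalPodAutoscaler with a pre-warmed floor: never scale workers to zero
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-worker          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker
  minReplicas: 2              # the pre-warmed floor, even at zero traffic
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

ECS services, Railway, and Fly.io all have an equivalent minimum-instance setting; the principle is identical.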
Multi-Tenant SaaS: Different Pipeline Per Tenant
Here's the pattern for building a real SaaS where each customer can choose their own STT, LLM, and TTS provider.
The trick: embed the pipeline config in the room metadata. Your worker reads it at job start and builds the right pipeline dynamically. No code changes, no Docker rebuilds.
Backend: embed config when creating the room
```python
tenant = await db.get_tenant_config(tenant_id)

await room_service.create_room(CreateRoomRequest(
    name=room_name,
    metadata=json.dumps({
        "tenant_id": tenant_id,
        "stt": tenant["stt"],               # "deepgram" | "whisper" | "google"
        "stt_model": tenant["stt_model"],
        "llm": tenant["llm"],               # "openai" | "groq" | "anthropic"
        "llm_model": tenant["llm_model"],
        "tts": tenant["tts"],               # "elevenlabs" | "cartesia" | "openai"
        "tts_voice_id": tenant["tts_voice_id"],
    })
))
```
Agent worker: build pipeline from config
```python
async def entrypoint(ctx: JobContext):
    await ctx.connect()
    config = json.loads(ctx.room.metadata)
    keys = await fetch_tenant_keys(config["tenant_id"])  # from your secure store

    session = AgentSession(
        stt=build_stt(config, keys),
        llm=build_llm(config, keys),
        tts=build_tts(config, keys),
        turn_detection=MultilingualModel(),
        vad=silero.VAD.load(),
        preemptive_generation=True,
    )
    await session.start(ctx.room)

def build_stt(config, keys):
    if config["stt"] == "deepgram":
        return deepgram.STTv2(model=config["stt_model"], api_key=keys["deepgram"])
    elif config["stt"] == "whisper":
        return openai.STT(model=config["stt_model"], api_key=keys["openai"])
    raise ValueError(f"Unknown STT: {config['stt']}")

def build_llm(config, keys):
    if config["llm"] == "openai":
        return openai.LLM(model=config["llm_model"], api_key=keys["openai"])
    elif config["llm"] == "groq":
        return groq.LLM(model=config["llm_model"], api_key=keys["groq"])
    elif config["llm"] == "anthropic":
        return anthropic.LLM(model=config["llm_model"], api_key=keys["anthropic"])
    raise ValueError(f"Unknown LLM: {config['llm']}")

def build_tts(config, keys):
    if config["tts"] == "elevenlabs":
        return elevenlabs.TTS(
            voice_id=config["tts_voice_id"],
            model="eleven_flash_v2_5",
            api_key=keys["elevenlabs"]
        )
    elif config["tts"] == "cartesia":
        return cartesia.TTS(voice_id=config["tts_voice_id"], api_key=keys["cartesia"])
    raise ValueError(f"Unknown TTS: {config['tts']}")
```
Your Docker image installs all plugins. Which ones actually run is a runtime decision, not a build-time decision. Three tenants, three different pipeline combinations, zero rebuilds.
The Twilio Path: When Someone Calls a Phone Number
If your voice agent needs to handle inbound phone calls (not just browser/app calls), you connect LiveKit to Twilio via SIP.
When someone dials your number:
- Twilio picks up and fires a webhook to your FastAPI backend
- Your backend creates a LiveKit room
- Your backend tells Twilio to bridge the call into that room via SIP
- LiveKit's built-in SIP server accepts the connection
- From that point, it's identical to a browser call — LiveKit dispatches a job, agent joins the room, pipeline runs
```python
@app.post("/twilio/incoming")
async def incoming_call(request: Request):
    room_name = f"call-{uuid4().hex[:8]}"
    await lk_room_service.create_room(CreateRoomRequest(name=room_name))

    response = VoiceResponse()
    response.dial().sip(f"sip:{room_name}@your-livekit-sip-uri")
    return Response(content=str(response), media_type="text/xml")
```
The agent has no idea whether the human is on a browser, a mobile app, or an old-school phone. It just sees an audio track coming in, and it does its thing.
The Full Production Architecture
Here's what the complete stack looks like when you're running this for real:

The Dockerfile: One Image, Every Provider
```dockerfile
FROM python:3.12-slim
WORKDIR /app

RUN pip install \
    livekit-agents \
    livekit-plugins-deepgram \
    livekit-plugins-openai \
    livekit-plugins-elevenlabs \
    livekit-plugins-cartesia \
    livekit-plugins-groq \
    livekit-plugins-silero \
    livekit-plugins-turn-detector

COPY src/ .

# Only your infra credentials — not tenant credentials
ENV LIVEKIT_URL=""
ENV LIVEKIT_API_KEY=""
ENV LIVEKIT_API_SECRET=""

CMD ["python", "agent.py", "start"]
```
Tenant API keys live in your database, fetched at job start. Never baked into the image.
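The `fetch_tenant_keys` helper used in the multi-tenant example above is yours to implement. A minimal sketch against a dict-backed store — swap the dict for your real database or secrets manager, and note that the store contents and key names here are placeholders:

```python
import asyncio

# Stand-in for your encrypted tenant-credentials table / secrets manager.
_SECRET_STORE = {
    "tenant-1": {"deepgram": "dg_...", "openai": "sk-...", "elevenlabs": "el_..."},
}

async def fetch_tenant_keys(tenant_id: str) -> dict[str, str]:
    """Fetch per-tenant provider keys at job start; never bake them into the image."""
    keys = _SECRET_STORE.get(tenant_id)
    if keys is None:
        raise KeyError(f"no credentials on file for tenant {tenant_id!r}")
    return keys

keys = asyncio.run(fetch_tenant_keys("tenant-1"))
print(sorted(keys))  # ['deepgram', 'elevenlabs', 'openai']
```

Because the fetch happens inside the Worker at job start, a tenant rotating their ElevenLabs key takes effect on their very next call — no deploy needed.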
LiveKit vs. Pipecat: Know the Difference
You'll see Pipecat mentioned alongside LiveKit a lot. They're different things:
| | LiveKit Agents | Pipecat |
|---|---|---|
| Core idea | Infrastructure-first (rooms, participants, dispatch) | Pipeline-first (frames flowing through processors) |
| Multi-user | Built-in job dispatch, automatic | You manage process spawning yourself |
| Transport | LiveKit is the transport (its own SFU) | Transport is a plugin (Daily, WebSocket, LiveKit, local) |
| Best for | SaaS platforms, scalable deployments | Custom pipeline logic, research, tightly controlled flows |
| Echo cancellation | Browser/WebRTC handles it | LocalTransport has none (AEC nightmare on Mac) |
They're not mutually exclusive either. Pipecat has a LiveKit transport, so you can use Pipecat's pipeline flexibility with LiveKit's infrastructure underneath. Some teams do exactly this.
For a production multi-tenant SaaS, pure LiveKit Agents is the right default. The job dispatch, room management, and load balancing are built in. Pipecat is powerful but requires you to build that scaffolding yourself.
Common Gotchas and How to Avoid Them
1. Cold start latency
Workers take 10–15 seconds to boot and register. Keep a minimum of 2 Workers always running. The cost is negligible. The user experience impact is not.
2. API keys in room metadata
Room metadata is visible to all participants, including the user's browser. Don't put raw API keys there — either put only provider names in metadata and fetch keys from your backend inside the Worker, or use LiveKit's server-side participant attributes.
3. One room per session, always
Never reuse rooms. When a session ends, let the room die. Create a fresh one for the next call. Reusing rooms causes state bugs that are miserable to debug.
4. LLM context leaking between sessions
Each AgentJob subprocess is isolated by the OS, but if you're using LangGraph or a persistent conversation store, make sure you're keying the thread/session ID per room, not per user. Otherwise two calls from the same user bleed into each other's conversation history.
```python
# Good — fresh thread ID per session
config = {"configurable": {"thread_id": uuid.uuid4().hex}}
return langchain.LLMAdapter(graph, config=config)
```
5. VAD tuning for your use case
The default VAD settings work well for English, quiet environments. For noisy backgrounds or non-English languages, tune min_silence_duration and prefix_padding_duration. The multilingual turn detection model helps here too.
Quick Recap
If you only remember one thing from this post, make it this:
One call = one room. One room = one agent subprocess. Subprocesses are isolated, parallel, and automatically dispatched by LiveKit's worker system. Your code describes how the agent behaves. LiveKit handles how many run at once.
The rest is just configuration.
LiveKit takes what would be a genuinely hard distributed systems problem — real-time audio, multi-tenant isolation, low-latency AI pipelines, automatic scaling — and turns it into a few Python files and some environment variables. That's rare. That's worth understanding properly.
Built something with LiveKit? The community Discord is genuinely active and the core team ships fast. Worth a look if you're going deep on voice AI. And CortexTech has experience building production-grade LiveKit voice agent pipelines.