How LiveKit Works: The Complete Engineer's Guide
You've probably heard "real-time voice AI" thrown around a lot lately. Agents that answer your phone, book appointments, handle support calls — they're everywhere. And a huge chunk of them are quietly running on LiveKit under the hood. This post tears it apart, layer by layer, so you actually understand what's happening when someone calls your AI agent.
The Problem Nobody Talks About
Before we get into LiveKit, let's talk about why it exists.
Sending audio in real-time sounds easy. It's not.
The browser protocol for real-time audio and video is called WebRTC. It was designed for two browsers talking directly to each other — peer to peer. Think of a one-to-one video call where media flows straight between the two browsers, never touching a central media server.
The moment you put a server in the middle — which you must do for an AI agent — you're fighting against everything WebRTC was designed for. Your agent needs to:
- Receive audio from the user in real time
- Send it to a speech-to-text (STT) service
- Wait for an LLM to generate a response
- Convert that response to speech via TTS
- Stream the audio back to the user without noticeable delay
That's a lot of moving parts. And the latency budget is brutal. Humans start feeling an awkward pause after about 500 milliseconds. Add network travel, STT processing, LLM thinking time, and TTS synthesis — and you're already tight.
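To make that budget concrete, here's a back-of-envelope sum. The per-stage figures are illustrative assumptions, not measurements — your actual numbers depend on providers and regions:

```python
# Rough, illustrative per-stage latencies in milliseconds (assumed figures).
STAGES_MS = {
    "network (user -> server)": 40,
    "STT (streaming, time-to-first-transcript)": 150,
    "LLM (time-to-first-token)": 200,
    "TTS (time-to-first-audio)": 120,
    "network (server -> user)": 40,
}

total_ms = sum(STAGES_MS.values())
print(f"time-to-first-audio: ~{total_ms} ms")  # ~550 ms — already past the comfort line
```

Even with optimistic numbers for each stage, you blow through 500 ms before doing anything clever. That's why every stage has to stream instead of waiting for the previous one to finish.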
LiveKit is the infrastructure layer that makes this possible without you losing your mind.
What LiveKit Actually Is
At its core, LiveKit is an SFU — Selective Forwarding Unit.
Think of it as a specialized router for media. When a user sends their audio, the SFU receives it and forwards copies to whoever needs it — without re-encoding, without re-processing, just forwarding packets. This is wildly more efficient than mixing audio on a server (what older conferencing systems do).
LiveKit is:
- Open source — you can self-host it entirely
- Written in Go — fast, low memory, made for concurrency
- WebRTC-native — speaks the same protocol as browsers and phones
- Horizontally scalable — run one node or a hundred, same config
But LiveKit is not just the server. The ecosystem includes:
| Component | What it does |
|---|---|
| livekit-server | The SFU core — routes audio/video between participants |
| livekit-agents | Python SDK for building AI agents that join rooms |
| livekit-plugins-* | Drop-in integrations for Deepgram, ElevenLabs, OpenAI, etc. |
| LiveKit Cloud | Managed hosting if you don't want to self-host |

The Three Core Concepts: Room, Participant, Track
Before anything else clicks, you need these three locked in.
Room
A Room is a named virtual space where a conversation happens. Think of it like a phone call session — it has a name, it lives on a server node, and it disappears when everyone leaves.
```python
await room_service.create_room(CreateRoomRequest(name="session-user-abc"))
```
One room = one conversation. Your backend creates it. LiveKit hosts it.

Participant
Anyone — or anything — that joins a room is a participant. This is the subtle thing most people miss: your AI agent is also a participant. It connects to the room with its own identity, just like the user does. It has a name, it can publish audio, it can subscribe to other participants' audio.
So a typical voice agent session has exactly two participants:
1. The human user
2. The AI agent
Track
A Track is a single media stream. The user's microphone = one audio track. The agent's text-to-speech output = one audio track. These tracks are published into the room, and the SFU forwards them to the other participants.
That's it. Room → Participants → Tracks. Everything else builds on top of this.
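The mental model is small enough to sketch in a few dataclasses. To be clear, these are not LiveKit's actual types — just an illustration of how the three concepts relate:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str    # "audio" or "video"
    source: str  # e.g. "microphone", "tts"

@dataclass
class Participant:
    identity: str
    tracks: list[Track] = field(default_factory=list)

@dataclass
class Room:
    name: str
    participants: list[Participant] = field(default_factory=list)

# A typical voice session: one human, one agent, one audio track each.
user = Participant("user-abc", [Track("audio", "microphone")])
agent = Participant("agent", [Track("audio", "tts")])
room = Room("session-user-abc", [user, agent])

print(len(room.participants))  # 2
```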
How the SFU Works (Without Melting Your Brain)
Imagine a phone call without WebRTC complexity. User speaks → audio goes somewhere → agent hears it → agent responds → user hears response.
The SFU sits in the middle. Here's what it's actually doing:
```
User's mic audio  → [SFU] → Agent receives it
Agent's TTS audio → [SFU] → User receives it
```
The SFU never re-encodes the audio. It just reads the packet headers to know where to send them and forwards them. This is why it's fast — it's doing router-level work, not compute-level work.
For a multi-node setup (production), LiveKit nodes coordinate through Redis. If User A connects to Node 1 while another participant in the same room lands on Node 2, Redis holds the shared routing state so the nodes stay in sync. Each room lives on exactly one node, but participants can be spread across nodes transparently.
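If you self-host, wiring up that coordination is a few lines in the server config. A minimal sketch of a `livekit.yaml` for a multi-node deployment — the Redis address is a placeholder for your own instance:

```yaml
# livekit.yaml — multi-node coordination via Redis
port: 7880
redis:
  address: redis.internal:6379   # placeholder — your Redis host
rtc:
  port_range_start: 50000
  port_range_end: 60000
```

With no `redis` block, the server runs happily as a single standalone node; adding it is what turns a set of nodes into a cluster.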
The Agent Pipeline: What Happens When Someone Calls
This is the juicy part. Let's walk through a real call, step by step.
Step 1 — Your backend creates a room
The moment a user initiates a call (via your app, or via Twilio ringing your phone number), your FastAPI backend creates a LiveKit room and mints an access token for the user.
```python
# Your backend
room_name = f"call-{tenant_id}-{uuid4().hex[:8]}"
await room_service.create_room(CreateRoomRequest(
    name=room_name,
    metadata=json.dumps({
        "tenant_id": tenant_id,
        "stt": "deepgram",
        "llm": "gpt-4o-mini",
        "tts": "elevenlabs"
    })
))

# Mint a token for the user — this is their keycard into the room
token = AccessToken(API_KEY, API_SECRET)
token.add_grant(VideoGrants(room_join=True, room=room_name))
return token.to_jwt()
```
The user's app takes this token and connects to LiveKit. They're now inside the room, and their microphone is publishing audio as a track.
Step 2 — LiveKit dispatches a job to your worker
Your agent code runs as a Worker — a persistent process that registers with LiveKit and says: "I'm here. Give me jobs."
When LiveKit sees a new room, it dispatches a job to the worker with the lowest load. The worker receives the job and spawns a subprocess specifically for this session.
```
Worker process (always running, always registered)
└── receives job for "call-tenant1-abc123"
    └── spawns: AgentJob subprocess
        └── connects to "call-tenant1-abc123" as a participant
```
That subprocess connects to the room as a second participant — the agent. Now the room has two participants: the human and the agent.
Step 3 — The voice pipeline starts
Inside the AgentJob subprocess, this runs continuously:
```
User speaks
  → VAD (Voice Activity Detection) detects speech
  → Audio chunks stream to STT (Deepgram / Whisper)
  → Transcript: "Hey, I want to book an appointment for Friday"
  → Turn detection: "Is the user done? Yes."
  → LLM receives transcript, generates response token by token
  → TTS receives tokens, generates audio in real time
  → Audio published as agent's track in the room
  → User hears the agent's voice
```
Each step is streaming. STT doesn't wait for the full sentence. LLM doesn't wait for STT to finish. TTS doesn't wait for LLM to finish. It's a streaming pipeline from mic to speaker.
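The overlap is the whole trick. Here's a toy asyncio sketch of the idea — stand-in stages, not the real plugins — where each stage consumes from a queue and forwards partial results downstream the moment they exist:

```python
import asyncio

async def stt(audio_chunks, out):
    # Emit partial transcripts as audio arrives, not at the end.
    for chunk in audio_chunks:
        await asyncio.sleep(0)       # pretend to do work
        await out.put(f"word-{chunk}")
    await out.put(None)              # end-of-stream marker

async def llm(inp, out):
    # Start responding as soon as the first words arrive.
    while (word := await inp.get()) is not None:
        await out.put(word.upper())  # stand-in for token generation
    await out.put(None)

async def tts(inp, spoken):
    while (token := await inp.get()) is not None:
        spoken.append(token)         # stand-in for audio synthesis

async def main():
    q1, q2, spoken = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently, connected by queues.
    await asyncio.gather(stt(range(3), q1), llm(q1, q2), tts(q2, spoken))
    return spoken

print(asyncio.run(main()))  # ['WORD-0', 'WORD-1', 'WORD-2']
```

The real pipeline is vastly more sophisticated, but the shape is the same: concurrent stages joined by streams, so the user hears audio while the LLM is still generating.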
Step 4 — Interruptions
This is where it gets elegant. If the user starts talking while the agent is mid-sentence, the VAD detects that the user has started speaking and the pipeline cancels everything in flight — the LLM stops generating, the TTS stops playing. The agent shuts up and listens. This is what makes LiveKit agents feel natural rather than robotic.
The Worker / Job Architecture (The Part Everyone Gets Wrong)
Here's the most common misconception: one worker per session.
That's wrong. Here's the reality:
```
One server machine
└── One Worker process (registered with LiveKit)
    ├── AgentJob subprocess #1 → room-user-A
    ├── AgentJob subprocess #2 → room-user-B
    ├── AgentJob subprocess #3 → room-user-C
    └── ... up to ~20 concurrent jobs
```
One Worker can handle 10–25 concurrent sessions. Each session is its own isolated subprocess with its own memory, its own LLM context, and its own audio pipeline. None of them knows the others exist.
Your agent code is like a recipe, not a running process. The Worker is the kitchen. Each session is a chef hired to cook that recipe. When the call ends, the chef clocks out. The kitchen stays open.
What happens when a second user calls?
Your backend creates a new room. LiveKit sees it and dispatches another job to the same Worker (if it's not too busy). Another subprocess spawns. Another room, another pipeline, completely isolated.
```
room-user-A → AgentJob #1 (running STT→LLM→TTS independently)
room-user-B → AgentJob #2 (running STT→LLM→TTS independently)
```
User A and User B are having completely separate conversations, in complete isolation, on the same server.
Scaling to 100 Concurrent Sessions
Now let's say your SaaS is growing and you have 100 people calling simultaneously. Here's how the math works.
A typical voice AI pipeline — Deepgram STT, GPT-4o-mini, ElevenLabs TTS — uses roughly:
- CPU: 0.2–0.5 cores per session (mostly waiting on external API calls)
- RAM: 150–300 MB per session (audio buffers, LLM context)
- Network: ~64 kbps per session (Opus-encoded audio)
On a 4-core / 8GB server, you can comfortably run ~20 concurrent sessions before the Worker starts hitting its load threshold.
For 100 concurrent sessions: 5–6 Worker servers, each handling ~20 sessions.
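That sizing works out like this — the figures come straight from the estimates above, so treat them as assumptions to validate against your own load tests:

```python
import math

# Assumed per-session footprint (worst-case end of the ranges above).
cpu_per_session = 0.5        # cores
ram_per_session_mb = 300
server_cores, server_ram_mb = 4, 8192

by_cpu = server_cores / cpu_per_session      # 8 sessions if fully CPU-bound
by_ram = server_ram_mb / ram_per_session_mb  # ~27 sessions by memory
# In practice sessions mostly wait on network I/O, so ~20/server is realistic.
sessions_per_server = 20

servers_needed = math.ceil(100 / sessions_per_server)
print(servers_needed)  # 5
```

The ceiling is rarely raw CPU or RAM — it's how gracefully the pipeline degrades when all sessions hit their external APIs at once. Leave headroom.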
The Load Function
Each Worker reports its own load back to LiveKit via a load function. LiveKit uses this to decide which Worker to dispatch new jobs to.
```python
import psutil

def my_load_function(ctx: JobContext) -> float:
    # 0.0 = idle, 1.0 = completely full
    return psutil.cpu_percent() / 100.0

cli.run_app(WorkerOptions(
    entrypoint_fnc=entrypoint,
    load_fnc=my_load_function,
    load_threshold=0.7,  # stop taking jobs at 70% CPU
))
```
When Worker 1 hits 70% CPU, LiveKit automatically routes new jobs to Worker 2. When Worker 2 fills up, it goes to Worker 3. This is the autoscaling mechanism — you pair it with your infrastructure's container autoscaling (ECS, Kubernetes, Railway, Fly.io) to spin up new Worker containers automatically.
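Conceptually, the dispatch decision looks something like this — a deliberate simplification of LiveKit's actual scheduler, for intuition only:

```python
def pick_worker(workers, load_threshold=0.7):
    """Pick the least-loaded worker that is still under its threshold."""
    eligible = [w for w in workers if w["load"] < load_threshold]
    if not eligible:
        return None  # all full -> time for your autoscaler to add a worker
    return min(eligible, key=lambda w: w["load"])

workers = [
    {"name": "worker-1", "load": 0.72},  # over threshold, skipped
    {"name": "worker-2", "load": 0.35},
    {"name": "worker-3", "load": 0.10},
]
print(pick_worker(workers)["name"])  # worker-3
```

The `None` case is your autoscaling signal: when no worker is eligible, new jobs queue until a fresh container registers.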
Keep Workers Pre-Warmed
One gotcha: if all your Workers are cold (just started), the first user gets a ~10–15 second delay while the container boots and registers with LiveKit.
Always keep a minimum of 2–3 Workers running at all times, even at zero load. The cost is tiny compared to the UX damage of a cold start.
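If you're on Kubernetes, the fix is simply a floor on replicas. A sketch of what that looks like — resource names here are placeholders for your own deployment:

```yaml
# HorizontalPodAutoscaler with a pre-warmed floor: never scale workers to zero
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-worker          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker
  minReplicas: 2              # the pre-warmed floor, even at zero traffic
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

ECS services, Railway, and Fly.io all have an equivalent minimum-instance setting; the principle is identical.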
Multi-Tenant SaaS: Different Pipeline Per Tenant
Here's the pattern for building a real SaaS where each customer can choose their own STT, LLM, and TTS provider.
The trick: embed the pipeline config in the room metadata. Your worker reads it at job start and builds the right pipeline dynamically. No code changes, no Docker rebuilds.
Backend: embed config when creating the room
```python
tenant = await db.get_tenant_config(tenant_id)

await room_service.create_room(CreateRoomRequest(
    name=room_name,
    metadata=json.dumps({
        "tenant_id": tenant_id,
        "stt": tenant["stt"],               # "deepgram" | "whisper" | "google"
        "stt_model": tenant["stt_model"],
        "llm": tenant["llm"],               # "openai" | "groq" | "anthropic"
        "llm_model": tenant["llm_model"],
        "tts": tenant["tts"],               # "elevenlabs" | "cartesia" | "openai"
        "tts_voice_id": tenant["tts_voice_id"],
    })
))
```
Agent worker: build pipeline from config
```python
async def entrypoint(ctx: JobContext):
    await ctx.connect()
    config = json.loads(ctx.room.metadata)
    keys = await fetch_tenant_keys(config["tenant_id"])  # from your secure store

    session = AgentSession(
        stt=build_stt(config, keys),
        llm=build_llm(config, keys),
        tts=build_tts(config, keys),
        turn_detection=MultilingualModel(),
        vad=silero.VAD.load(),
        preemptive_generation=True,
    )
    await session.start(ctx.room)

def build_stt(config, keys):
    if config["stt"] == "deepgram":
        return deepgram.STTv2(model=config["stt_model"], api_key=keys["deepgram"])
    elif config["stt"] == "whisper":
        return openai.STT(model=config["stt_model"], api_key=keys["openai"])
    raise ValueError(f"Unknown STT: {config['stt']}")

def build_llm(config, keys):
    if config["llm"] == "openai":
        return openai.LLM(model=config["llm_model"], api_key=keys["openai"])
    elif config["llm"] == "groq":
        return groq.LLM(model=config["llm_model"], api_key=keys["groq"])
    elif config["llm"] == "anthropic":
        return anthropic.LLM(model=config["llm_model"], api_key=keys["anthropic"])
    raise ValueError(f"Unknown LLM: {config['llm']}")

def build_tts(config, keys):
    if config["tts"] == "elevenlabs":
        return elevenlabs.TTS(
            voice_id=config["tts_voice_id"],
            model="eleven_flash_v2_5",
            api_key=keys["elevenlabs"]
        )
    elif config["tts"] == "cartesia":
        return cartesia.TTS(voice_id=config["tts_voice_id"], api_key=keys["cartesia"])
    raise ValueError(f"Unknown TTS: {config['tts']}")
```
Your Docker image installs all plugins. Which ones actually run is a runtime decision, not a build-time decision. Three tenants, three different pipeline combinations, zero rebuilds.
The Twilio Path: When Someone Calls a Phone Number
If your voice agent needs to handle inbound phone calls (not just browser/app calls), you connect LiveKit to Twilio via SIP.
When someone dials your number:
- Twilio picks up and fires a webhook to your FastAPI backend
- Your backend creates a LiveKit room
- Your backend tells Twilio to bridge the call into that room via SIP
- LiveKit's built-in SIP server accepts the connection
- From that point, it's identical to a browser call — LiveKit dispatches a job, agent joins the room, pipeline runs
```python
@app.post("/twilio/incoming")
async def incoming_call(request: Request):
    room_name = f"call-{uuid4().hex[:8]}"
    await lk_room_service.create_room(CreateRoomRequest(name=room_name))

    response = VoiceResponse()
    response.dial().sip(f"sip:{room_name}@your-livekit-sip-uri")
    return Response(content=str(response), media_type="text/xml")
```
The agent has no idea whether the human is on a browser, a mobile app, or an old-school phone. It just sees an audio track coming in, and it does its thing.
The Full Production Architecture
Here's what the complete stack looks like when you're running this for real:

The Dockerfile: One Image, Every Provider
```dockerfile
FROM python:3.12-slim
WORKDIR /app

RUN pip install \
    livekit-agents \
    livekit-plugins-deepgram \
    livekit-plugins-openai \
    livekit-plugins-elevenlabs \
    livekit-plugins-cartesia \
    livekit-plugins-groq \
    livekit-plugins-silero \
    livekit-plugins-turn-detector

COPY src/ .

# Only your infra credentials — not tenant credentials
ENV LIVEKIT_URL=""
ENV LIVEKIT_API_KEY=""
ENV LIVEKIT_API_SECRET=""

CMD ["python", "agent.py", "start"]
```
Tenant API keys live in your database, fetched at job start. Never baked into the image.
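The `fetch_tenant_keys` helper used in the multi-tenant example above is yours to implement. A minimal sketch against a dict-backed store — swap the dict for your real database or secrets manager, and note that the store contents and key names here are placeholders:

```python
import asyncio

# Stand-in for your encrypted tenant-credentials table / secrets manager.
_SECRET_STORE = {
    "tenant-1": {"deepgram": "dg_...", "openai": "sk-...", "elevenlabs": "el_..."},
}

async def fetch_tenant_keys(tenant_id: str) -> dict[str, str]:
    """Fetch per-tenant provider keys at job start; never bake them into the image."""
    keys = _SECRET_STORE.get(tenant_id)
    if keys is None:
        raise KeyError(f"no credentials on file for tenant {tenant_id!r}")
    return keys

keys = asyncio.run(fetch_tenant_keys("tenant-1"))
print(sorted(keys))  # ['deepgram', 'elevenlabs', 'openai']
```

Because the fetch happens inside the Worker at job start, a tenant rotating their ElevenLabs key takes effect on their very next call — no deploy needed.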
LiveKit vs. Pipecat: Know the Difference
You'll see Pipecat mentioned alongside LiveKit a lot. They're different things:
| | LiveKit Agents | Pipecat |
|---|---|---|
| Core idea | Infrastructure-first (rooms, participants, dispatch) | Pipeline-first (frames flowing through processors) |
| Multi-user | Built-in job dispatch, automatic | You manage process spawning yourself |
| Transport | LiveKit is the transport (its own SFU) | Transport is a plugin (Daily, WebSocket, LiveKit, local) |
| Best for | SaaS platforms, scalable deployments | Custom pipeline logic, research, tightly controlled flows |
| Echo cancellation | Browser/WebRTC handles it | LocalTransport has none (AEC nightmare on Mac) |
They're not mutually exclusive either. Pipecat has a LiveKit transport, so you can use Pipecat's pipeline flexibility with LiveKit's infrastructure underneath. Some teams do exactly this.
For a production multi-tenant SaaS, pure LiveKit Agents is the right default. The job dispatch, room management, and load balancing are built in. Pipecat is powerful but requires you to build that scaffolding yourself.
Common Gotchas and How to Avoid Them
1. Cold start latency
Workers take 10–15 seconds to boot and register. Keep a minimum of 2 Workers always running. The cost is negligible. The user experience impact is not.
2. API keys in room metadata
Room metadata is visible to all participants, including the user's browser. Don't put raw API keys there — either put only provider names in metadata and fetch keys from your backend inside the Worker, or use LiveKit's server-side participant attributes.
3. One room per session, always
Never reuse rooms. When a session ends, let the room die. Create a fresh one for the next call. Reusing rooms causes state bugs that are miserable to debug.
4. LLM context leaking between sessions
Each AgentJob subprocess is isolated by the OS, but if you're using LangGraph or a persistent conversation store, make sure you're keying the thread/session ID per room, not per user. Otherwise two calls from the same user bleed into each other's conversation history.
```python
# Good — fresh thread ID per session
config = {"configurable": {"thread_id": uuid.uuid4().hex}}
return langchain.LLMAdapter(graph, config=config)
```
5. VAD tuning for your use case
The default VAD settings work well for English, quiet environments. For noisy backgrounds or non-English languages, tune min_silence_duration and prefix_padding_duration. The multilingual turn detection model helps here too.
Quick Recap
If you only remember one thing from this post, make it this:
One call = one room. One room = one agent subprocess. Subprocesses are isolated, parallel, and automatically dispatched by LiveKit's worker system. Your code describes how the agent behaves. LiveKit handles how many run at once.
The rest is just configuration.
LiveKit takes what would be a genuinely hard distributed systems problem — real-time audio, multi-tenant isolation, low-latency AI pipelines, automatic scaling — and turns it into a few Python files and some environment variables. That's rare. That's worth understanding properly.
Built something with LiveKit? The community Discord is genuinely active and the core team ships fast. Worth a look if you're going deep on voice AI. And CortexTech has experience building production-grade LiveKit voice agent pipelines.