Complete System (~30 min)

End-to-End Real-Time Avatar

Building a complete conversational avatar system from audio to pixels

The Complete Picture

A real-time conversational avatar isn't a single technology; it's a pipeline. Audio comes in from a microphone, gets transcribed to text, is processed by a language model, is converted back to speech, and is finally rendered as a talking avatar. Each stage adds latency, and the total round trip must stay under ~1 second to feel natural. This guide shows how to wire these pieces together using each of the four avatar approaches, with code you can deploy.

What Makes a Real-Time Avatar?

Listen: Capture and transcribe speech

Think: Generate intelligent response

Speak: Synthesize natural speech

Animate: Render talking avatar

The Voice AI Pipeline

Every conversational avatar follows this flow. Understanding each stage helps you optimize for latency and choose the right tools.

Each stage adds latency. The pipeline visualization reports a total end-to-end latency of 530-1800ms.

1. Audio Capture
Capture user speech via WebRTC or browser MediaRecorder API

2. Voice Activity Detection
Detect when the user starts and stops speaking (turn detection)

3. Speech-to-Text
Transcribe audio to text using Deepgram, Whisper, or AssemblyAI

4. LLM Processing
Generate response using GPT-4, Claude, or other language models

5. Text-to-Speech
Convert text to natural speech audio with ElevenLabs, PlayHT, or Cartesia

6. Avatar Rendering
Animate and render the avatar driven by the audio output

7. Stream to User
Deliver video/audio back to the user via WebRTC
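
Stage 2 (turn detection) can be illustrated with a simple energy threshold. This is a minimal sketch under stated assumptions, not how production detectors work (those typically rely on trained models). The function names, the threshold, and the hangover length are all hypothetical values chosen for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turns(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,   # assumed energy threshold for "speech"
    hangover_frames: int = 3,  # assumed number of silent frames that ends a turn
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of each detected turn; end is exclusive."""
    turns: List[Tuple[int, int]] = []
    start = None
    silence = 0
    last_index = -1
    for i, frame in enumerate(frames):
        last_index = i
        if rms(frame) >= threshold:
            if start is None:
                start = i          # speech begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= hangover_frames:
                # Enough trailing silence: close the turn at the last speech frame.
                turns.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Stream ended while a turn was still open.
        turns.append((start, last_index - silence + 1))
    return turns
```

The hangover is one source of VAD latency: the detector has to wait through some silence before it can declare the turn finished, so a longer hangover means fewer false cut-offs but a slower response.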

Four Paths to Real-Time Avatars

Each approach offers different tradeoffs in latency, quality, setup time, and cost.

Low complexity, 300-800ms total

Streaming Provider (Fastest to Deploy)

Use Hedra, Tavus, or HeyGen via WebRTC

The simplest path to production. Avatar providers handle all rendering server-side—you just send audio and receive video. Best for MVPs and production apps where you don't need custom rendering.

High complexity, 500ms-2s per chunk

Generative Video (Best Quality)

Diffusion models for photorealistic synthesis

Use diffusion models like those powering Hedra or open-source alternatives to generate photorealistic talking head video. Requires GPU infrastructure but produces the highest quality output from a single reference photo.

Medium-High complexity, 50-150ms rendering

MetaHuman / Game Engine (Most Control)

Real-time 3D with skeletal animation

Use Unreal Engine's MetaHuman or similar 3D avatars with real-time face tracking. Audio drives blendshapes via Audio2Face or similar, and the game engine renders at 60+ FPS. Best for games, VR, or when you need precise animation control.
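
To make "audio drives blendshapes" concrete, here is a deliberately crude stand-in that maps audio loudness to a single jaw-open weight per rendered frame. It is not Audio2Face and does not reflect how that tool works; real systems predict many blendshapes from the audio. The function name, gain, and smoothing factor are assumptions for illustration.

```python
import math
from typing import List, Sequence


def jaw_open_weights(
    samples: Sequence[float],
    sample_rate: int = 16000,
    fps: int = 60,
    gain: float = 4.0,       # assumed scaling from RMS energy to blendshape weight
    smoothing: float = 0.5,  # assumed exponential smoothing factor in [0, 1)
) -> List[float]:
    """Return one jaw-open weight in [0, 1] per video frame."""
    hop = sample_rate // fps  # audio samples covered by one video frame
    weights: List[float] = []
    prev = 0.0
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        energy = math.sqrt(sum(s * s for s in window) / hop)
        target = min(1.0, gain * energy)
        # Smooth between frames so the jaw does not jitter.
        prev = smoothing * prev + (1.0 - smoothing) * target
        weights.append(prev)
    return weights
```

The game engine would read one weight per frame and apply it to the face rig. The point of the sketch is the data flow (audio window in, animation parameter out at the render frame rate), not the quality of the lip sync.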

Very High complexity, 16-33ms rendering

Gaussian Splatting (Cutting Edge)

Photorealistic real-time 3D

The newest approach: capture a person as millions of 3D Gaussians, then animate them in real-time. Combines photorealism with real-time rendering speeds. Still research-stage but rapidly maturing.

Approach	Latency	Quality	Setup Time	Cost	Best For
Streaming (Hedra/Tavus)	300-800ms	High	Hours	$$$/min	Production apps, MVPs
Generative Video	500ms-2s	Photorealistic	Days	GPU compute	Async content, highest quality
MetaHuman/3D	50-150ms	3D realistic	Weeks	Render server	Games, VR/AR, control
Gaussian Splatting	16-33ms	Photorealistic	Days-Weeks	Capture + train	Known identities, research

Latency Budget for Natural Conversation

Target: < 1000ms total round-trip for natural conversation

Stage	Latency
Network (user → server)	20-100ms
Voice Activity Detection	100-300ms
Speech-to-Text	100-500ms
LLM (first token)	100-500ms
Text-to-Speech	50-200ms
Avatar Rendering	16-500ms
Network (server → user)	20-100ms
Total	406ms (best case) to 2200ms (worst case)
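
The totals follow directly from summing the per-stage ranges. The sketch below reproduces that arithmetic and reports how far the worst case overshoots the 1000ms target; the dictionary keys and function names are illustrative, not part of any library.

```python
from typing import Dict, Tuple

# (best_ms, worst_ms) per stage, copied from the latency budget in this section.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "network_up": (20, 100),
    "vad": (100, 300),
    "stt": (100, 500),
    "llm_first_token": (100, 500),
    "tts": (50, 200),
    "avatar": (16, 500),
    "network_down": (20, 100),
}
TARGET_MS = 1000


def totals(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(lo for lo, _ in budget.values())
    worst = sum(hi for _, hi in budget.values())
    return best, worst


def worst_case_overrun(budget: Dict[str, Tuple[int, int]], target: int = TARGET_MS) -> int:
    """Milliseconds by which the worst case exceeds the target (0 if it fits)."""
    return max(0, totals(budget)[1] - target)


if __name__ == "__main__":
    best, worst = totals(BUDGET_MS)
    print(f"best={best}ms worst={worst}ms overrun={worst_case_overrun(BUDGET_MS)}ms")
```

Because the worst case is more than double the target, no single stage can be tuned in isolation; the budget only closes when several stages sit near the low end of their ranges or overlap in time.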

Optimization Tips

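→ Use streaming STT (transcribe while the user is still speaking)
→ Stream LLM tokens to TTS as they arrive instead of waiting for the full response
→ Pre-buffer a small amount of TTS audio before starting the avatar
→ Use WebRTC for the lowest network latency
→ Co-locate services in the same region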
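
One way to stream LLM tokens to TTS is to buffer tokens only until a sentence boundary and hand each sentence off immediately, so speech synthesis starts before the model has finished. The sketch below assumes tokens arrive as plain strings from some iterator; `sentences_from_tokens` is a hypothetical helper, and the boundary rule is intentionally naive.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check: flush when the buffer ends in terminal punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello", " there", ".", " How", " are", " you", "?"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)  # each sentence would be sent to TTS here
```

This rule splits on decimals and abbreviations as well (for example "3.14" or "e.g."), so a real pipeline would need a more careful boundary check. The latency benefit is the same either way: the first sentence reaches TTS after a handful of tokens rather than after the whole reply.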

Quality vs Latency Tradeoff

Configure each component of the voice AI pipeline to see how choices affect total latency, quality, and cost. Find the right balance for your use case.

Example configuration:

Component	Choice	Latency
Speech-to-Text	Whisper Base (good balance)	150ms
Language Model	GPT-4 Turbo (better reasoning)	120ms
Text-to-Speech	Streaming TTS (low latency)	50ms
Avatar Rendering	Diffusion-based (Hedra, etc.)	150ms
Network	n/a	50ms

Total Latency: 520ms (Acceptable)
Avg Quality: 86%
Cost Index: $$$$ (11/20 units)

Recommendation: Consider for quality-focused applications only.

Build It Yourself

Follow these steps to deploy a complete real-time avatar system using LiveKit.

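Step 1 of 5: Deploy LiveKit for WebRTC infrastructure. It handles audio/video routing.

```bash
# Easiest option: use LiveKit Cloud (no server to run).
# Alternative: self-host a development server with Docker.
# Ports: 7880 signaling (HTTP/WebSocket), 7881 WebRTC over TCP, 7882/udp WebRTC over UDP.
docker run -d \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server \
  --dev --bind 0.0.0.0
```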

Resources

LiveKit Agents Framework (GitHub)
LiveKit Voice Assistant Example (GitHub)
Hedra API Documentation (docs)
NVIDIA Audio2Face (docs)
This Project's LiveKit Demo (demo)

Which Approach Should You Use?

Do you need to ship in < 1 week?
Yes → Streaming Provider (Fastest to Deploy)
No → Next question

Do you need photorealistic quality?
Yes → Generative or Gaussian (next question)
No → MetaHuman / Game Engine (Most Control)

Is it a specific known person?
Yes → Gaussian Splatting (Cutting Edge)
No → Generative Video (Best Quality)
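
The three questions form a small decision tree, which can be written as a function. This is just the logic above restated in code; the function and parameter names are illustrative.

```python
def choose_approach(ship_within_week: bool, photorealistic: bool, known_person: bool) -> str:
    """Walk the three questions in order and return the suggested approach."""
    if ship_within_week:
        return "Streaming Provider"
    if not photorealistic:
        return "MetaHuman / Game Engine"
    if known_person:
        return "Gaussian Splatting"
    return "Generative Video"


if __name__ == "__main__":
    # Example: no deadline pressure, photorealism required, generic persona.
    print(choose_approach(ship_within_week=False, photorealistic=True, known_person=False))
```

Note that the first question short-circuits the rest: a tight deadline points to a streaming provider regardless of the quality or identity requirements.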

See It In Action

Experience complete real-time avatar systems using two different approaches.

Diffusion + WebRTC

LiveKit + Hedra streaming avatar

MetaHuman + Pixel Streaming

Rapport UE5 cloud-rendered avatar

When to Build a Real-Time Avatar

Good Use Cases

+ Building a conversational AI product
+ Need visual engagement beyond voice-only
+ Want to create emotional connection with users
+ Representing a brand or persona visually

Consider Alternatives

- Voice-only is sufficient for your use case
- Latency budget is extremely tight (< 200ms)
- Running on very low-end devices
- Privacy concerns about face synthesis

Best Use Case

Customer service, education, entertainment, and telepresence applications where visual presence enhances the interaction
