
End-to-End Real-Time Avatar

Building a complete conversational avatar system from audio to pixels


The Complete Picture

A real-time conversational avatar is not a single technology; it is a pipeline. Audio comes in from a microphone, is transcribed to text, processed by a language model, converted back to speech, and finally rendered as a talking avatar. Each stage adds latency, and the total must stay under roughly 1 second for the conversation to feel natural. This guide shows how to wire these pieces together using each of the four avatar approaches, with code you can deploy.

What Makes a Real-Time Avatar?

  • Listen: capture and transcribe speech
  • Think: generate an intelligent response
  • Speak: synthesize natural speech
  • Animate: render the talking avatar

The Voice AI Pipeline

Every conversational avatar follows this flow. Understanding each stage helps you optimize for latency and choose the right tools.

Each stage adds latency. End to end, a full turn through the pipeline typically takes 530-1800ms.

1. Audio Capture: capture user speech via WebRTC or the browser MediaRecorder API
2. Voice Activity Detection: detect when the user starts and stops speaking (turn detection)
3. Speech-to-Text: transcribe audio to text using Deepgram, Whisper, or AssemblyAI
4. LLM Processing: generate a response using GPT-4, Claude, or other language models
5. Text-to-Speech: convert text to natural speech audio with ElevenLabs, PlayHT, or Cartesia
6. Avatar Rendering: animate and render the avatar driven by the audio output
7. Stream to User: deliver video/audio back to the user via WebRTC
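The flow above can be sketched as a chain of async stages with a timer around each one. This is a minimal illustration, not a production design: the four stage functions are hypothetical stand-ins that only sleep and return dummy data, and capture, VAD, and WebRTC delivery are left out. Real code would call an STT, LLM, TTS, and avatar service in their place.

```python
import asyncio
import time

# Hypothetical stand-ins for real services; each sleeps briefly to simulate work.
async def speech_to_text(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "hello avatar"

async def llm_respond(text: str) -> str:
    await asyncio.sleep(0.01)
    return f"You said: {text}"

async def text_to_speech(text: str) -> bytes:
    await asyncio.sleep(0.01)
    return text.encode("utf-8")

async def render_avatar(speech: bytes) -> dict:
    await asyncio.sleep(0.01)
    return {"frames": len(speech), "audio": speech}

STAGES = [
    ("stt", speech_to_text),
    ("llm", llm_respond),
    ("tts", text_to_speech),
    ("avatar", render_avatar),
]

async def run_turn(audio: bytes):
    """Run one conversational turn and record per-stage latency in ms."""
    timings = {}
    value = audio
    for name, stage in STAGES:
        start = time.perf_counter()
        value = await stage(value)
        timings[name] = (time.perf_counter() - start) * 1000
    return value, timings

result, timings = asyncio.run(run_turn(b"\x00" * 320))
for name, ms in timings.items():
    print(f"{name}: {ms:.1f}ms")
print(f"total: {sum(timings.values()):.1f}ms")
```

Running the stages strictly one after another, as here, gives the worst-case total. Streaming partial results between stages is what brings the number down in practice.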

Four Paths to Real-Time Avatars

Each approach offers different tradeoffs in complexity, latency, and output quality.

Low complexity, 300-800ms total

Streaming Provider (Fastest to Deploy)

Use Hedra, Tavus, or HeyGen via WebRTC

The simplest path to production. Avatar providers handle all rendering server-side, so you only send audio and receive video. Best for MVPs and production apps where you don't need custom rendering.

High complexity, 500ms-2s per chunk

Generative Video (Best Quality)

Diffusion models for photorealistic synthesis

Use diffusion models like those powering Hedra or open-source alternatives to generate photorealistic talking head video. Requires GPU infrastructure but produces the highest quality output from a single reference photo.

Medium-High complexity, 50-150ms rendering

MetaHuman / Game Engine (Most Control)

Real-time 3D with skeletal animation

Use Unreal Engine's MetaHuman or similar 3D avatars with real-time face tracking. Audio drives blendshapes via Audio2Face or similar, and the game engine renders at 60+ FPS. Best for games, VR, or when you need precise animation control.

Very High complexity, 16-33ms rendering

Gaussian Splatting (Cutting Edge)

Photorealistic real-time 3D

The newest approach: capture a person as millions of 3D Gaussians, then animate them in real time. It combines photorealism with real-time rendering speeds. Still research-stage but rapidly maturing.

| Approach | Latency | Quality | Setup Time | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| Streaming (Hedra/Tavus) | 300-800ms | High | Hours | $$$/min | Production apps, MVPs |
| Generative Video | 500ms-2s | Photorealistic | Days | GPU compute | Async content, highest quality |
| MetaHuman/3D | 50-150ms | 3D realistic | Weeks | Render server | Games, VR/AR, control |
| Gaussian Splatting | 16-33ms | Photorealistic | Days-Weeks | Capture + train | Known identities, research |
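One way to use the comparison programmatically is to hold it as data and filter by a latency budget. The sketch below copies the latency ranges from the table. The `Approach` class and `within_budget` helper are illustrative names, not part of any library. Note that the table's figures are not all measured the same way (the streaming row is a total, the 3D rows are rendering only), so treat the filter as a rough first pass.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Approach:
    name: str
    min_ms: int       # lower bound of the latency range in the table
    max_ms: int       # upper bound of the latency range in the table
    quality: str
    best_for: str

APPROACHES = [
    Approach("Streaming (Hedra/Tavus)", 300, 800, "High", "Production apps, MVPs"),
    Approach("Generative Video", 500, 2000, "Photorealistic", "Async content, highest quality"),
    Approach("MetaHuman/3D", 50, 150, "3D realistic", "Games, VR/AR, control"),
    Approach("Gaussian Splatting", 16, 33, "Photorealistic", "Known identities, research"),
]

def within_budget(budget_ms: int) -> list:
    """Return the names of approaches whose upper latency bound fits the budget."""
    return [a.name for a in APPROACHES if a.max_ms <= budget_ms]

print(within_budget(200))
print(within_budget(1000))
```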

Latency Budget for Natural Conversation

Target: < 1000ms total round-trip for natural conversation

  • Network (user → server): 20-100ms
  • Voice Activity Detection: 100-300ms
  • Speech-to-Text: 100-500ms
  • LLM (first token): 100-500ms
  • Text-to-Speech: 50-200ms
  • Avatar Rendering: 16-500ms
  • Network (server → user): 20-100ms
  • Total: 406ms (best case) to 2200ms (worst case)
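The totals follow directly from summing the per-stage ranges. The short check below reproduces them and compares each against the 1000ms target. The dictionary keys are illustrative labels, and the ranges are the ones listed above.

```python
# (best_ms, worst_ms) per stage, taken from the latency budget above.
BUDGET_MS = {
    "network_up": (20, 100),
    "vad": (100, 300),
    "stt": (100, 500),
    "llm_first_token": (100, 500),
    "tts": (50, 200),
    "avatar": (16, 500),
    "network_down": (20, 100),
}
TARGET_MS = 1000

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())

# The stage with the widest range is where tool choice matters most.
widest = max(BUDGET_MS, key=lambda stage: BUDGET_MS[stage][1] - BUDGET_MS[stage][0])

print(f"best case:  {best}ms (headroom {TARGET_MS - best}ms)")    # 406ms, headroom 594ms
print(f"worst case: {worst}ms (over by {worst - TARGET_MS}ms)")   # 2200ms, over by 1200ms
print(f"widest range: {widest}")                                  # avatar
```

The best case leaves comfortable headroom, while the worst case is more than double the target, so at least some stages have to run near the low end of their range or overlap with each other.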

Optimization Tips

  • Use streaming STT (process as the user speaks)
  • Stream LLM tokens to TTS immediately
  • Pre-buffer TTS audio before starting the avatar
  • Use WebRTC for lowest network latency
  • Co-locate services in the same region
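As an illustration of the "stream LLM tokens to TTS" tip, the sketch below groups an incoming token stream into sentences, so speech synthesis can start on the first sentence while the model is still generating the rest. The function name and the punctuation-based boundary rule are assumptions made for this example. The rule is naive and would split on decimals or abbreviations, so a real system would need a more careful segmenter.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never received closing punctuation.
    if buffer.strip():
        yield buffer.strip()

# Simulated LLM token stream; in practice this would be a streaming API response.
stream = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each sentence could be handed to TTS at this point
```

Because the function is a generator, the first sentence is available after three tokens rather than after the whole response, which is where the latency saving comes from.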

Quality vs Latency Tradeoff

The choice made for each component of the voice AI pipeline affects total latency, quality, and cost. The example configuration below shows one such balance.

  • Speech-to-Text: 150ms (Whisper Base, good balance)
  • Language Model: 120ms (GPT-4 Turbo, better reasoning)
  • Text-to-Speech: 50ms (streaming TTS, low latency)
  • Avatar Rendering: 150ms (diffusion-based, e.g. Hedra)

  • Total latency: 520ms (Acceptable)
  • Average quality: 86%
  • Cost index: $$$$ (11/20 units)

Latency Breakdown

  • STT: 150ms
  • LLM: 120ms
  • TTS: 50ms
  • Avatar: 150ms
  • Network: 50ms

Response Timeline

User speaks (0ms) → STT → LLM → Avatar → avatar responds (520ms)

Recommendation

Consider for quality-focused applications only.

Build It Yourself

Follow these steps to deploy a complete real-time avatar system using LiveKit.

Step 1 of 5: Deploy LiveKit for WebRTC infrastructure, which handles audio/video routing.

```bash
# Option 1: use LiveKit Cloud (easiest; nothing to run locally)
# Option 2: self-host with Docker in development mode:
docker run -d \
  -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server \
  --dev --bind 0.0.0.0
```
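Once the server is running, clients need an access token to join a room. The sketch below builds one with only the Python standard library, to show what such a token contains. It assumes LiveKit's documented token layout (an HS256-signed JWT with a `video` grant) and the `devkey` / `secret` credentials that the server uses in `--dev` mode. The function name and the identity and room values are hypothetical. For real deployments, the official LiveKit server SDK is the safer way to issue tokens.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def make_access_token(api_key: str, api_secret: str, identity: str, room: str,
                      ttl_s: int = 3600, now: Optional[int] = None) -> str:
    """Build an HS256 JWT that grants `identity` permission to join `room`."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,            # API key identifies the issuer
        "sub": identity,           # participant identity
        "nbf": issued,
        "exp": issued + ttl_s,
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"

# Assumed dev-mode credentials; never use these outside local development.
token = make_access_token("devkey", "secret", "user-1", "demo-room")
print(token)
```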

Resources

  • LiveKit Agents Framework (GitHub)
  • LiveKit Voice Assistant Example (GitHub)
  • Hedra API Documentation (docs)
  • NVIDIA Audio2Face (docs)
  • This Project's LiveKit Demo (demo)

Which Approach Should You Use?

1. Do you need to ship in < 1 week?
  • Yes → Streaming Provider (Fastest to Deploy)
  • No → Next question

2. Do you need photorealistic quality?
  • Yes → Generative or Gaussian (next question)
  • No → MetaHuman / Game Engine (Most Control)

3. Is it a specific known person?
  • Yes → Gaussian Splatting (Cutting Edge)
  • No → Generative Video (Best Quality)
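The three questions above amount to a small decision function. The sketch below encodes them in the same order. The function and parameter names are illustrative only.

```python
def choose_approach(ship_in_under_a_week: bool,
                    needs_photorealism: bool,
                    specific_known_person: bool) -> str:
    """Walk the three questions in order and return the suggested approach."""
    if ship_in_under_a_week:
        return "Streaming Provider"
    if not needs_photorealism:
        return "MetaHuman / Game Engine"
    if specific_known_person:
        return "Gaussian Splatting"
    return "Generative Video"

print(choose_approach(True, True, True))     # Streaming Provider
print(choose_approach(False, False, False))  # MetaHuman / Game Engine
print(choose_approach(False, True, True))    # Gaussian Splatting
print(choose_approach(False, True, False))   # Generative Video
```

The ordering matters: a tight shipping deadline overrides the other two answers, which matches the way the questions are sequenced.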


See It In Action

Experience complete real-time avatar systems using two different approaches.

Diffusion + WebRTC

LiveKit + Hedra streaming avatar


MetaHuman + Pixel Streaming

Rapport UE5 cloud-rendered avatar


When to Build a Real-Time Avatar

Good Use Cases

  • Building a conversational AI product
  • Need visual engagement beyond voice-only
  • Want to create emotional connection with users
  • Representing a brand or persona visually

Consider Alternatives

  • Voice-only is sufficient for your use case
  • Latency budget is extremely tight (< 200ms)
  • Running on very low-end devices
  • Privacy concerns about face synthesis

Best Use Case

Customer service, education, entertainment, and telepresence applications where visual presence enhances the interaction
