Overview
Interactive digital humans that respond in near real-time to user input are becoming central to virtual communication, gaming, and AI assistants. Achieving a convincing digital human requires balancing visual realism, low latency, precise controllability, and feasible deployment.
Recent advances (2023-2024) have produced several distinct approaches to real-time responsive avatars, each with unique trade-offs in latency, fidelity, control, and system cost. Additionally, streaming infrastructure like LiveKit now enables production-ready avatar deployment with multiple provider integrations.
- MetaHuman Pipeline: game-engine characters driven by performance capture or animation rigs for real-time rendering in Unreal Engine.
- Generative Video Models: diffusion or transformer-based models that directly synthesize avatar video frames from audio or other signals.
- Gaussian Splatting: neural 3D scene representation using Gaussian primitives that can be efficiently animated and rendered in real time.
- Streaming Avatars: production-ready WebRTC infrastructure integrating multiple avatar providers with voice AI agents via LiveKit.
MetaHuman Pipeline
Epic Games' MetaHuman framework exemplifies the graphics-based approach to digital humans. MetaHumans are highly detailed 3D character models with rigged faces and bodies, designed for real-time rendering in Unreal Engine.
Key Features
- 60+ FPS rendering with ~30-50ms latency
- Precise control via rigs and blendshapes
- Live Link support for real-time streaming
- No per-person ML training required
Limitations
- CGI look may not achieve true photorealism
- Significant content creation effort upfront
- Requires capable GPU and game engine
- Manual design needed for specific likenesses
How It Works
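A capture source, typically the Live Link Face app streaming ARKit face data over the local network, drives the MetaHuman's facial rig and blendshapes, and Unreal Engine renders the result in real time (the tutorial below walks through this setup). The snippet here is a minimal, engine-agnostic sketch of one common step in such a pipeline, smoothing noisy incoming blendshape weights before they reach the rig; the curve names are standard ARKit blendshapes, while the smoothing factor and frame loop are purely illustrative.

```python
# Illustrative only: exponential smoothing of ARKit-style blendshape weights
# (range 0..1) before they are applied to a facial rig. Not Unreal/Live Link code.

ALPHA = 0.6  # higher = snappier response, lower = smoother motion

def smooth_frame(prev: dict[str, float], raw: dict[str, float]) -> dict[str, float]:
    """Blend each incoming curve value with the previous frame's value."""
    return {
        name: ALPHA * value + (1.0 - ALPHA) * prev.get(name, 0.0)
        for name, value in raw.items()
    }

prev_weights: dict[str, float] = {}
incoming = {"jawOpen": 0.42, "eyeBlinkLeft": 0.91, "mouthSmileLeft": 0.10}
prev_weights = smooth_frame(prev_weights, incoming)
print(prev_weights)
```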
Generative Video Models
AI generative models, often based on diffusion or transformer architectures, directly synthesize video frames of a talking or moving person. A single input image can be turned into a lifelike talking video with one-shot generalization to unseen identities.
Key Features
- Photorealistic output from minimal input
- One-shot: no per-subject training needed
- Natural behaviors (blinks, head movements)
- 20-30 FPS achievable on high-end GPUs
Limitations
- Heavy compute requirements (A100+ GPU)
- Limited explicit control over output
- Risk of artifacts or identity drift
- Higher first-frame latency (~0.3-1s)
Key Techniques
- Models like CausVid use block-wise causal attention for a 40x speedup over vanilla diffusion (see the sketch below).
- Reference Sink and RAPR techniques prevent identity drift over extended generation.
- Second-stage discriminator training recovers detail lost in distillation.
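To make the block-wise causal idea concrete, the following NumPy sketch builds the kind of attention mask such models use: frames within a block attend to each other bidirectionally, while attention across blocks only flows backward in time, which is what lets new blocks be generated and cached autoregressively. The function name and block size are illustrative and not taken from the CausVid code.

```python
import numpy as np

def block_causal_mask(num_frames: int, block_size: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed).

    Frames in the same block see each other bidirectionally; across blocks,
    a query frame may only attend to frames in its own or earlier blocks.
    """
    block_idx = np.arange(num_frames) // block_size      # block id per frame
    # entry [i, j] is True iff query frame i may attend to key frame j
    return block_idx[None, :] <= block_idx[:, None]

mask = block_causal_mask(num_frames=8, block_size=2)
print(mask.astype(int))  # rows: query frames, columns: key frames
```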
Neural Gaussian Splatting
3D Gaussian Splatting (3DGS) enables real-time rendering of photorealistic 3D scenes using a cloud of Gaussian primitives. Capturing a person as a set of animatable, textured 3D Gaussians yields a streaming neural avatar that renders at very high frame rates while remaining photorealistic.
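For intuition, the sketch below condenses the core rendering idea into a few lines of NumPy: splats are depth-sorted and alpha-composited front to back, each contributing a soft Gaussian footprint to the pixel. It is deliberately simplified (isotropic Gaussians, orthographic projection, a single pixel) and is not the real tile-based 3DGS rasterizer.

```python
import numpy as np

# Toy splat cloud: position, isotropic radius, color, opacity per Gaussian.
rng = np.random.default_rng(0)
n = 1_000
means = rng.normal(size=(n, 3))
radii = np.abs(rng.normal(0.05, 0.02, n)) + 1e-3
colors = rng.uniform(size=(n, 3))
opacities = rng.uniform(0.2, 0.9, n)

def shade_pixel(pix_xy: np.ndarray, cam_z: float = 5.0) -> np.ndarray:
    """Alpha-composite splats front to back for one pixel (orthographic)."""
    order = np.argsort(cam_z - means[:, 2])       # closest splats first
    color, transmittance = np.zeros(3), 1.0
    for i in order:
        d2 = np.sum((means[i, :2] - pix_xy) ** 2)
        alpha = opacities[i] * np.exp(-0.5 * d2 / radii[i] ** 2)
        color += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-3:                  # early ray termination
            break
    return color

print(shade_pixel(np.array([0.0, 0.0])))
```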
Key Features
- 60+ FPS rendering on consumer GPUs
- Photorealistic for the captured subject
- Multi-view consistent output for AR/VR
- Can be driven by parametric models
Limitations
- Requires multi-view capture per person
- Hours of training time per identity
- Fixed identity (one model = one person)
- Quality degrades outside training range
Notable Projects
- Factors the full human avatar into layered Gaussian clusters (body, garments, face) attached to a deformable cage rig.
- First to generate photorealistic multi-view talking-head sequences from audio input, with expression-dependent details.
Streaming Avatars with LiveKit
LiveKit Agents provides production-ready infrastructure for deploying real-time avatars at scale. Rather than building avatar rendering from scratch, it integrates multiple avatar providers through a unified API, handling WebRTC streaming, synchronization, and voice AI pipelines automatically.
Key Features
- Multiple avatar providers (Tavus, Hedra, Simli, etc.)
- Built-in voice AI pipeline (STT + LLM + TTS)
- WebRTC-based low-latency streaming
- Production deployment with load balancing
- Cross-platform SDKs (Web, iOS, Android, Flutter)
Limitations
- Requires third-party avatar provider subscription
- Less control over the avatar rendering pipeline
- Dependent on provider capabilities and quality
- Per-minute or per-session pricing from providers
Architecture
The avatar worker joins as a separate participant, receiving audio from the agent and publishing synchronized video back to users. This minimizes latency by having the provider connect directly to LiveKit rooms.
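To make the topology concrete, here is a minimal subscriber-side sketch using the LiveKit Python realtime SDK (`pip install livekit`); it simply logs when the avatar worker's tracks arrive. In production you would usually consume the tracks in a web or mobile client instead, as noted in the tutorial below, and the exact event signatures should be checked against the SDK docs.

```python
import asyncio
from livekit import rtc  # LiveKit Python realtime SDK

async def watch_avatar(url: str, token: str) -> None:
    room = rtc.Room()

    @room.on("track_subscribed")
    def on_track(track: rtc.Track,
                 pub: rtc.RemoteTrackPublication,
                 participant: rtc.RemoteParticipant) -> None:
        # The avatar worker appears as just another remote participant
        # publishing synchronized audio and video tracks.
        print(f"subscribed to {track.kind} from {participant.identity}")

    await room.connect(url, token)
    await asyncio.sleep(60)  # keep the connection open while media flows

# asyncio.run(watch_avatar("wss://<your-livekit-url>", "<access-token>"))
```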
Supported Avatar Providers
- Photorealistic digital twins with custom voice cloning and persona training.
- Character-based avatars with expressive animations and customizable styles.
- Real-time lip-sync avatars optimized for conversational AI applications.
- AI-powered digital humans with natural gestures and emotional expressions.
- Enterprise-grade avatars for customer service and virtual assistance.
- Hyper-realistic avatars with advanced facial animation technology.
Side-by-Side Comparison
| Aspect | MetaHuman | Generative | Gaussian | Streaming |
|---|---|---|---|---|
| Latency | ~30-50ms (60+ FPS) | ~0.3-1s first frame, 20-30 FPS | <100ms (30-60 FPS) | ~100-300ms (provider dependent) |
| Visual Realism | High-quality CGI | Photorealistic | Photorealistic (subject-specific) | Varies by provider |
| Controllability | Explicit, fine-grained | Limited, audio-driven | Moderate to high | Audio-driven, provider APIs |
| New Identity | Moderate effort (modeling) | One-shot (just an image) | Low (capture + training) | Provider-specific setup |
| Training Required | None per character | Base model only | Per-subject (hours) | None (managed by provider) |
| Hardware | Gaming GPU | A100+ or cloud | Consumer GPU | Any (cloud-hosted) |
| Best For | Production, precise control | Quick deployment, any face | VR/AR telepresence | Voice AI apps, rapid deploy |
Getting Started Tutorial
Choose your approach based on your requirements. Below are quick-start guides for each method with links to open-source implementations.
MetaHuman + Live Link
1. Install Unreal Engine from the Epic Games Launcher (MetaHuman requires UE 5.0+).
2. Use MetaHuman Creator (metahuman.unrealengine.com) to design or import a character.
3. Install the Live Link Face app on an iPhone and connect it to Unreal over your local network.
4. Enable the Live Link plugin, create a Live Link preset, and connect the ARKit face data to your MetaHuman blueprint.
SadTalker (Diffusion-based)
Clone the repository and install its dependencies:

```bash
git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
pip install -r requirements.txt
```

Run the download script or manually download the checkpoints from the releases page.
Then generate a talking-head video from a single image and an audio clip:

```bash
python inference.py --source_image face.jpg --driven_audio speech.wav
```
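If you need more than a one-off clip, the same CLI can be wrapped from Python for batch generation. This is a hypothetical convenience wrapper that only uses the flags shown above; see the SadTalker README for output-directory and enhancement options.

```python
import subprocess
from pathlib import Path

def generate(face: Path, audio: Path) -> None:
    """Invoke the SadTalker CLI (run from inside the SadTalker checkout)."""
    subprocess.run(
        ["python", "inference.py",
         "--source_image", str(face),
         "--driven_audio", str(audio)],
        check=True,
    )

# Assumes a clips/ directory of driving audio files next to the checkout.
for wav in sorted(Path("clips").glob("*.wav")):
    generate(Path("face.jpg"), wav)
```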
D3GA (Gaussian Avatars)
Clone the repository:

```bash
git clone https://github.com/facebookresearch/D3GA.git
```

Record the subject from multiple angles; the more viewpoints, the better the reconstruction.
Run the training script on your captured data; this may take several hours depending on the dataset size.
Use FLAME parameters, body poses, or audio input to animate your trained avatar in real time (see the sketch below).
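To illustrate what driving the avatar in real time looks like, the sketch below streams per-frame driving packets at a fixed frame rate. The field names, parameter sizes, and the `render_fn` callback are hypothetical placeholders for illustration, not D3GA's actual interface.

```python
import time
import numpy as np

def make_drive_packet(t: float) -> dict:
    """Hypothetical per-frame driving signal for a trained Gaussian avatar."""
    return {
        "flame_expression": 0.3 * np.sin(t) * np.ones(50),   # expression coefficients
        "flame_jaw_pose": np.array([0.1 * max(np.sin(3 * t), 0.0), 0.0, 0.0]),
        "body_pose": np.zeros(3 * 21),                       # per-joint axis-angle (size illustrative)
        "timestamp": t,
    }

def drive_loop(render_fn, fps: int = 30, seconds: float = 2.0) -> None:
    """Push driving packets to a renderer callback at a fixed frame rate."""
    start = time.time()
    while (now := time.time() - start) < seconds:
        render_fn(make_drive_packet(now))
        time.sleep(1.0 / fps)

drive_loop(lambda pkt: print(f"frame @ {pkt['timestamp']:.2f}s"), fps=30, seconds=0.2)
```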
LiveKit Agents + Avatar
Install the agents framework and an avatar plugin:

```bash
pip install livekit-agents livekit-plugins-hedra
```

Set up a LiveKit Cloud account and obtain API keys from your chosen avatar provider (Hedra, Tavus, Simli, etc.). Then wire the avatar session to the voice AI agent:
```python
from livekit.agents import AgentSession
from livekit.plugins import hedra

# Create voice AI agent
agent_session = AgentSession(
    stt="assemblyai/universal-streaming",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3",
)

# Create avatar session
avatar_session = hedra.AvatarSession()

# Start avatar with agent
await avatar_session.start(
    agent_session=agent_session,
    room=ctx.room,
)
```

Use LiveKit's React hooks or native SDKs to display the avatar video track. The avatar worker publishes synchronized audio/video to the room.
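For completeness, the snippet above would normally live inside an agent worker's entrypoint. The following is a minimal sketch assuming the LiveKit Agents 1.x Python API; verify names and signatures against the official docs.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import hedra

async def entrypoint(ctx: agents.JobContext):
    # Connect this worker to the room it was dispatched to
    await ctx.connect()

    agent_session = AgentSession(
        stt="assemblyai/universal-streaming",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-3",
    )

    # The avatar worker joins the room as its own participant and
    # lip-syncs the agent's audio into a published video track
    avatar_session = hedra.AvatarSession()
    await avatar_session.start(agent_session=agent_session, room=ctx.room)

    await agent_session.start(
        agent=Agent(instructions="You are a friendly on-screen assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```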
Open Source Resources
- Audio-driven talking head generation with 3D motion prediction
- Drivable 3D Gaussian Avatars from Facebook Research
- Real-time face animation with first-order motion model
- Curated list of talking head generation papers and code
- Build real-time AI agents with voice and avatar support
- Official documentation for integrating avatar providers