Diffusion / Streaming, ~60 min

Video Generation

Diffusion-based synthesis and streaming avatar infrastructure for real-time talking heads

1. Introduction  2. Pipeline  3. Key Concepts  4. Providers  5. Build It  6. Trade-offs

What is Video Generation?

Video generation encompasses two complementary sides of real-time talking heads: diffusion-based synthesis (creating photorealistic video from noise and conditioning signals) and streaming infrastructure (delivering that video to users in real time via WebRTC and cloud providers). Together, they form a complete pipeline from a single reference photo to a live, interactive avatar on any device.

The Core Idea in 30 Seconds

Add Noise, Learn to Reverse

Train on clear-to-noisy, generate noisy-to-clear

Audio Drives Motion

Sound patterns map to mouth shapes automatically

Stream to Any Device

WebRTC delivers video in real time, no plugins needed

The Marble Sculpture Metaphor

1. Perfect Sculpture

Start with a clear image

2. Sandpaper 1000x

Add noise until unrecognizable

3. Train Restorer

Learn to undo one stroke

4. Generate

Rough block, sculpture emerges

Unlike GANs, which generate in a single forward pass, diffusion models refine their output iteratively. This makes training more stable (it avoids the mode collapse GAN training is prone to) and generation more controllable (you can guide each step). The cost: generation requires 20-50 network passes instead of one, which makes real-time video challenging.

Diffusion Fundamentals

Noise Levels: forward adds noise, reverse removes it (t=0 is the clean original image, t=T is pure noise)
Gaussian Noise: bell curve distribution for noise, G(x) = e^(-x²/(2σ²)), shown with σ = 1.0
Noise Schedule: linear (constant rate) vs cosine noise timing
DDPM vs DDIM: stochastic vs deterministic sampling (shown at 50 steps)
Text Embeddings: prompt to vector representation
CFG Scale Effect: prompt adherence strength, from low to high (shown at CFG scale 7.5)

Model Architecture

U-Net Skip Connections: encoder-decoder with shortcuts
VAE Latent Space: compress/decompress images (64×64×4 latent)
Residual Connections: skip connections for gradient flow; the block output is F(x) + x

Attention Mechanisms

Attention Heatmap: where the model focuses (click a token to see its attention)
Multi-Head Attention: parallel attention patterns (heads H1-H4; 4 heads × 16d = 64d total)
Attention Q·K: query-key dot product similarity (a query attends to keys A-D)
Positional Encoding: sinusoidal patterns encode position information in transformers
Layer Normalization: stabilizes transformer training
Softmax: attention weight normalization, softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ) (example weights: 63%, 23%, 14%)

Generation Control

CFG Strength: conditioning guidance scale; the result is pushed from the unconditional prediction toward the conditional one (CFG 7 is marked "good")
Temporal Consistency: frame-to-frame stability
Embedding Space: semantic word clustering (king, queen, man, woman, cat, dog); similar concepts cluster together in vector space

Neural Network Basics

Single Neuron: weighted sum + activation. Example with inputs 0.5, 0.8, 0.3 and weights 0.4, -0.2, 0.6: σ(Σwᵢxᵢ + b) = 0.579

Activation Functions: ReLU, Sigmoid, Tanh

Convolution Kernel: edge/blur/sharpen filters; a 3×3 kernel slides over the image. Example (edge detection):

-1.0  -1.0  -1.0
-1.0   8.0  -1.0
-1.0  -1.0  -1.0

Pooling Layers: max/avg downsampling. Example, 4×4 → 2×2 (take maximum):

4 2 1 3
1 5 6 2    →    5 6
3 1 2 4         3 5
2 3 5 1

Dropout: regularization by random deactivation (shown with a 30% drop rate)

Batch Normalization: normalize layer outputs

Training Dynamics

Loss Landscape: optimization surface (click to place the optimizer; yellow marks the minimum)
Learning Rate: step size affects convergence
Backpropagation: gradient computation through layers
Overfitting: train vs validation loss divergence
Vanishing Gradient: gradient decay in deep networks (shown with 5 layers; the gradient shrinks layer by layer)

Denoising Visualizer

Watch how diffusion models progressively remove noise to reveal an image. Each step predicts and subtracts a small amount of noise.

Step 0 of 50: pure noise, completely random.

DDPM Reverse Step

x_49 = denoise(x_50, t=50)

At each step t, the U-Net predicts the noise ε_θ(x_t, t), which is scaled according to the noise schedule (α, β) and subtracted from x_t.

Noise schedule: runs from t=0 (clean) to t=50 (noise).
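
The reverse step above can be sketched in a few lines. This is a minimal sketch, assuming the standard DDPM update with a linear β schedule and zero-based timesteps; the names `linear_betas`, `alpha_bars`, and `ddpm_reverse_step` are illustrative, and `eps_pred` stands in for the U-Net's noise prediction.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta values (one common choice, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative product of alpha_t = 1 - beta_t (fraction of signal left)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def ddpm_reverse_step(x_t, eps_pred, t, betas, rng=random):
    """One reverse step x_t -> x_{t-1} on a flat list of values."""
    alpha_t = 1.0 - betas[t]
    a_bar_t = alpha_bars(betas)[t]
    coef = betas[t] / math.sqrt(1.0 - a_bar_t)
    # Remove the scaled noise estimate, then rescale.
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # the final step adds no fresh noise
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
```

In a real sampler, `eps_pred` comes from the trained denoiser at every step; here it is passed in so the arithmetic can be checked in isolation.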

Diffusion Steps Comparison

Compare image quality at different denoising step counts. More steps = better quality but slower generation.

Quality vs Speed Trade-off

Steps | Quality | Approx. generation time
4 | Fastest, lower quality | ~200ms
10 | Fast, decent quality | ~400ms
25 | Balanced | ~800ms
50 | High quality | ~1500ms

Why Fewer Steps = Lower Quality?

  • Each step removes a small amount of noise
  • Fewer steps = larger noise jumps = artifacts
  • Fine details emerge in later steps
  • Blocky/blurry results from skipped refinement

For Real-Time Avatars

  • 4-8 steps typical for interactive use
  • Consistency models enable 1-4 step generation
  • Distillation transfers quality to fewer steps
  • Trade acceptable quality for latency

Try this: Generate with different seeds and compare the 4-step vs 50-step results. Notice how the basic structure is there in 4 steps, but fine details like eye highlights are missing!

Latent Space Explorer

Explore how VAEs compress images into a low-dimensional latent space. Click and drag in the space to interpolate between expressions.

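2D Latent Space

Axes: Expression (Positive → Serious) and Arousal (Calm → Excited). The decoded face is shown for the selected latent vector, e.g. z = [0.500, 0.500].

Compression: 8x per spatial dimension. 512×512×3 pixels → 64×64×4 latent, which is 48x fewer values (786,432 → 16,384).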

How VAE Latent Space Works

Encode

Compress 512×512 image → small latent vector (e.g., 64×64×4)

Manipulate

Diffusion happens in latent space, much faster than pixel space

Decode

Reconstruct full image from modified latent vector
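
To make the size reduction concrete, the arithmetic can be checked directly. This is a small sketch, assuming 8x downsampling per side and 4 latent channels as in the example above; the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent grid for an image of the given size."""
    return (height // downsample, width // downsample, latent_channels)

def value_count(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixels = value_count((512, 512, 3))             # values in the RGB image
latents = value_count(latent_shape(512, 512))   # values in the latent
ratio = pixels / latents                        # how many times fewer values
```

Every denoising step touches `latents` values instead of `pixels` values, which is where the speedup of latent diffusion comes from.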

Lip Sync Playground

Explore how audio phonemes map to visual mouth shapes (visemes). Click visemes or play phrases to see the lip sync in action.

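Mouth shape parameters: Jaw Open, Lip Width, and Lip Round. The silent viseme (sil, mouth closed) is Jaw Open 0.00, Lip Width 0.50, Lip Round 0.00.

Controls: viseme palette, sample phrases, playback speed (default 1x), and smoothing (default 0.50; higher = smoother transitions, lower = snappier).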

Audio-to-Lip-Sync Pipeline

Audio

Raw waveform

Phonemes

"Hello" → h-ə-l-oʊ

Visemes

Mouth shape targets

Animation

Blended shapes
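
The last two pipeline stages can be sketched as a lookup followed by smoothing. This is a minimal sketch, not how any particular product does it: the viseme poses and the phoneme-to-viseme table are hypothetical values chosen for illustration, and the smoothing is a simple exponential blend toward each target.

```python
# Hypothetical viseme poses: (jaw_open, lip_width, lip_round), each in [0, 1].
VISEME_POSES = {
    "sil": (0.0, 0.5, 0.0),    # closed mouth, as in the demo's silent state
    "open": (0.7, 0.6, 0.0),
    "mid": (0.3, 0.5, 0.1),
    "round": (0.3, 0.2, 0.9),
}

# Hypothetical phoneme-to-viseme table covering "Hello" (h-ə-l-oʊ).
PHONEME_TO_VISEME = {"h": "open", "ə": "mid", "l": "mid", "oʊ": "round"}

def phonemes_to_targets(phonemes):
    """Look up a target mouth pose for each phoneme (unknown -> silence)."""
    return [VISEME_POSES[PHONEME_TO_VISEME.get(p, "sil")] for p in phonemes]

def smooth(targets, smoothing=0.5, start=VISEME_POSES["sil"]):
    """Blend toward each target; higher smoothing keeps more of the old pose."""
    current = start
    frames = []
    for target in targets:
        current = tuple(
            smoothing * c + (1.0 - smoothing) * t for c, t in zip(current, target)
        )
        frames.append(current)
    return frames

frames = smooth(phonemes_to_targets(["h", "ə", "l", "oʊ"]))
```

With `smoothing=0.0` the mouth snaps straight to each target; values near 1.0 barely move, which is the lag the smoothing slider trades against.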

Identity Preservation Demo

See how identity lock maintains facial features while allowing expression/motion changes. Without identity lock, the face drifts to an average appearance.

Identity strength (90% in the demo): lower values allow more drift toward the average face.

Identity features (each 0.50 by default): eye distance, nose length, jaw width, forehead height, lip fullness.

How Identity Lock Works

1. Extract Identity

Encoder captures facial features (eye spacing, nose shape, etc.) as a latent vector

2. Condition Generation

Identity vector is injected via cross-attention at every denoising step

3. Preserve + Animate

Motion changes expressions while identity features remain locked

How Neural Networks Learn

Diffusion models learn to denoise through gradient descent. This fundamental algorithm drives all deep learning.

Gradient Descent Visualizer

Watch how neural networks learn by following the gradient downhill. Click anywhere on the landscape to set a starting point.

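Legend: current position, path, global minimum, gradient.

Readout: loss, step count, and position, e.g. loss 4.023 at position (2.00, 2.00) after 0 steps.

Controls: learning rate (default 0.10, "balanced"), momentum (default 0.00, i.e. vanilla SGD), and preset starting points.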

The Update Rule

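w = w - lr * gradient + momentum * velocity

Here velocity is the previous update (the last change applied to w). With momentum = 0 this reduces to vanilla SGD.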

This is how diffusion models, face encoders, and all neural networks learn. The gradient points uphill; we go downhill.
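
The update rule can be sketched as a short loop. This is a minimal sketch, assuming a simple bowl-shaped loss x² + y² rather than the landscape used in the visualizer; `gradient_descent` and `bowl_grad` are illustrative names.

```python
def gradient_descent(grad_fn, start, lr=0.1, momentum=0.0, steps=100):
    """Follow the negative gradient, optionally carrying momentum."""
    w = list(start)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        grad = grad_fn(w)
        # velocity is the update applied this step; it remembers a fraction
        # of the previous update when momentum > 0.
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w

def bowl_grad(w):
    """Gradient of the assumed loss x^2 + y^2 (minimum at the origin)."""
    return [2.0 * wi for wi in w]

final = gradient_descent(bowl_grad, [2.0, 2.0], lr=0.1)
```

Starting from (2.00, 2.00), each step on this surface shrinks the position by a constant factor, so the path heads straight for the minimum.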

Neural Network Forward Pass

Watch data flow through layers of neurons, the basic building block of diffusion models and face encoders. Each neuron computes a weighted sum of its inputs, adds a bias, then applies an activation function.

Input values: x1 = 0.50, x2 = 0.80, x3 = 0.30

The Forward Pass Equation

output = activation(Σ(weight × input) + bias)

Green connections = positive weights (amplify signal), Red = negative weights (inhibit signal). Line thickness shows weight magnitude. This is how diffusion U-Nets, face encoders, and LLMs process data.
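
A single neuron's forward pass fits in a few lines. This is a minimal sketch using the demo's input values; the weights 0.4, -0.2, 0.6 are the ones shown on the Single Neuron card, and the bias of 0.1 is an assumption chosen because it reproduces the card's output of 0.579.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """output = activation(sum(weight * input) + bias)"""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

output = neuron(
    inputs=[0.5, 0.8, 0.3],
    weights=[0.4, -0.2, 0.6],
    bias=0.1,  # assumed value, not shown in the demo
)
```

A layer is this same computation repeated for many neurons with different weights, and a network stacks such layers.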

U-Net Architecture

The U-Net is the denoising backbone of Stable Diffusion. It compresses the image (encoder), processes it (middle), then expands it back (decoder). Skip connections preserve detail.

Layer types: conv, attention, downsample, upsample, skip

Why U-Net?

The "U" shape allows: (1) Compression to capture global context, (2) Skip connections to preserve fine detail, (3) Attention layers to incorporate conditioning (text/audio). Height = spatial resolution, Width = channel depth.

Cross-Attention Mechanism

See how audio tokens guide which parts of the image to modify. This is how diffusion models know to move the mouth when you say "ah" but keep the eyes still.

Attention temperature (default 1.0): low = sharp focus on few tokens, high = diffuse attention. Select an image region and an audio token to inspect the attention between them.

How Cross-Attention Works

Query (Image)

"What information do I need?" - each image region asks what audio to attend to

Key (Audio)

"Here's what I represent" - audio tokens provide their semantic meaning

Value (Audio)

"Here's my actual content" - weighted sum of audio features guides denoising

Noise Schedule Comparison

Different schedules control how quickly noise is added/removed. The schedule significantly affects generation quality and speed.

Cosine Schedule

ᾱ(t) = f(t) / f(0), with f(t) = cos²(((t/T + s) / (1 + s)) · π/2)

Here ᾱ(t) is the fraction of signal remaining at step t, s is a small offset (0.008 in the paper that introduced the schedule), and the noise level is 1 - ᾱ(t). At step 0 of 50 in the denoising direction the noise level is 100%.

Best Practices

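The cosine schedule was introduced in Improved DDPM (Nichol & Dhariwal, 2021) as an alternative to the linear schedule of the original DDPM; Stable Diffusion itself uses a scaled-linear schedule. Compared with linear, cosine destroys signal more slowly in the early forward steps, so more of the image survives at intermediate noise levels and fine details can emerge gradually during denoising.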
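
The difference between schedules is easiest to see by computing how much signal is left at a given step. This is a minimal sketch, assuming the cosine schedule in its commonly published form (expressed as the cumulative signal fraction ᾱ, with offset s = 0.008) and a linear β schedule from 1e-4 to 0.02 for comparison; the function names are illustrative.

```python
import math

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Fraction of signal remaining after t forward steps (cosine schedule)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def linear_alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal remaining after t forward steps (linear betas)."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

# Signal left halfway through a 1000-step forward process.
cosine_mid = cosine_alpha_bar(500, 1000)
linear_mid = linear_alpha_bar(500, 1000)
noise_level_mid = 1.0 - cosine_mid
```

Halfway through, the cosine schedule still retains roughly half of the signal, while the linear schedule has already destroyed most of it.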

Classifier-Free Guidance (CFG)

CFG controls how strongly the model follows the conditioning (audio/text). Low values give blurry, generic output; high values give artifacts and oversaturation. Sweet spot: 5-9.

CFG scale (default 7.5): ranges from "ignore prompt" to "follow strictly".

The Math

output = uncond + cfg * (cond - uncond)

CFG scales the difference between conditioned and unconditioned predictions. Higher values amplify prompt influence but can overshoot.
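
The formula applies element by element to the two noise predictions. This is a minimal sketch on plain lists; `apply_cfg` is an illustrative name, and in practice `uncond` and `cond` come from two passes of the denoiser (without and with conditioning).

```python
def apply_cfg(uncond, cond, cfg_scale):
    """output = uncond + cfg * (cond - uncond), element by element."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 0.2]
cond = [1.0, 0.4]

ignore_prompt = apply_cfg(uncond, cond, 0.0)   # equals uncond
plain_cond = apply_cfg(uncond, cond, 1.0)      # equals cond
guided = apply_cfg(uncond, cond, 7.5)          # extrapolates past cond
```

A scale of 1.0 returns the conditioned prediction unchanged; anything above 1.0 extrapolates beyond it, which is why very high scales overshoot.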

Recommended Values

Stable Diffusion: 7-8
SDXL: 5-7
Face gen: 3-5
Artistic: 10-12

For Avatars

Talking head models typically use CFG 3-6. Lower values preserve identity better, higher values give more pronounced expressions but risk artifacts.

Temporal Consistency

Video generation requires frame-to-frame coherence. Without temporal consistency, each frame flickers independently. With it, changes are smooth and natural.

Left: raw output (each frame independent). Right: temporally smoothed (blended with the previous frame).

Temporal weight (default 0.70): ranges from no smoothing to heavy smoothing.

Techniques Used

Latent Blending: Interpolate latent codes between frames
Cross-Frame Attention: Attend to previous frame features
Motion Prior: Predict expected change from audio
Optical Flow: Warp previous frame as initialization

For Talking Heads

Avatar systems must balance temporal consistency with responsiveness. Too much smoothing = laggy lip sync. Too little = jittery video. Most systems use ~0.6-0.8 weight with audio-aware adjustments.
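
The temporal weight can be read as the share of the previous smoothed frame kept in each new frame. This is a minimal sketch of that blending on flat lists of values (pixels or latents); `temporal_blend` is an illustrative name, and real systems combine this with the other techniques listed above.

```python
def temporal_blend(frames, weight=0.7):
    """Blend each frame with the previous smoothed frame.

    weight = 0.0 means no smoothing; values near 1.0 mean heavy smoothing.
    """
    smoothed = []
    previous = None
    for frame in frames:
        if previous is None:
            current = list(frame)  # nothing to blend the first frame with
        else:
            current = [
                weight * p + (1.0 - weight) * f for p, f in zip(previous, frame)
            ]
        smoothed.append(current)
        previous = current
    return smoothed

# A single value flickering between 0 and 1 from frame to frame.
flicker = [[0.0], [1.0], [0.0], [1.0]]
calmer = temporal_blend(flicker, weight=0.5)
```

The smoothed sequence swings less than the raw one, but it also reacts later to real changes, which is the lip sync lag described above.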

Convolution Operation

Convolution slides a kernel (filter) over the image, computing weighted sums. This is the core operation in CNNs used for face recognition and image generation, and it is how face encoders extract features.

Example at position (3, 3), where every input pixel in the window is 200 and every kernel weight is 1/9 (displayed as 0.11):

9 × (200 × 1/9) = 200

Kernel values (box blur):

0.11  0.11  0.11
0.11  0.11  0.11
0.11  0.11  0.11

In Neural Networks

CNNs learn kernel values through backpropagation. Early layers detect edges, later layers detect complex patterns like eyes or noses. This is how face encoders extract identity features.
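
The sliding-window computation can be written out directly. This is a minimal sketch on nested lists with no padding and stride 1; like most deep learning frameworks it does not flip the kernel (strictly speaking, cross-correlation). `conv2d` is an illustrative name.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each spot."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

flat_image = [[200.0] * 5 for _ in range(5)]
box_blur = [[1.0 / 9.0] * 3 for _ in range(3)]
edge_detect = [[-1.0, -1.0, -1.0], [-1.0, 8.0, -1.0], [-1.0, -1.0, -1.0]]

blurred = conv2d(flat_image, box_blur)    # stays at about 200 everywhere
edges = conv2d(flat_image, edge_detect)   # no edges in a flat image
```

A blur kernel leaves a flat region unchanged, while an edge kernel (weights summing to zero) responds only where neighboring pixels differ.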

Sampler Comparison

Compare DDPM, DDIM, Euler, and DPM++ samplers. Different samplers trade off speed vs quality: modern samplers like DPM++ achieve high quality in about 20 steps, versus 1000 for the original DDPM.

Selected sampler: DDIM (Denoising Diffusion Implicit Models), recommended steps: 50.

For Real-Time Avatars

DPM++ or Euler with 4-8 steps is common for real-time face generation. Quality is acceptable, and speed meets the ~100ms budget for talking heads.

Face Encoder Architecture

Face encoders extract identity, expression, and pose into a compact latent code used for conditioning. This enables identity-preserving generation and expression transfer.

Encoder Pipeline

  1. Input: 256×256 RGB face image
  2. Conv layers: Extract spatial features
  3. Global pooling: Spatial → vector
  4. Fully connected: Compress features
  5. Output: Disentangled latent code

Disentanglement

Identity

Who the person is

Expression

Facial state

Pose

Head orientation

Common Architectures

ArcFace: Identity-focused, used for recognition

DECA: 3DMM parameter prediction

InsightFace: Fast, good for real-time

For Avatar Generation

The identity code is used to condition the diffusion model, ensuring the generated face looks like the reference. Expression codes drive animation.

Activation Functions

Explore ReLU, Sigmoid, Tanh, and GELU, the non-linearities that enable neural networks to learn complex patterns. Each has different properties for training stability and gradient flow.

The demo shows the output and gradient of the selected function at a chosen input x.

For Avatar Models

Face encoders typically use ReLU or LeakyReLU. Transformers (in diffusion models) use GELU. The final layer often uses Sigmoid or Tanh to bound outputs.

Backpropagation

The algorithm that trains neural networks: the forward pass computes the output, the backward pass computes gradients, then the weights are updated to reduce the error.

Demo defaults: input 0.50, target 0.80, learning rate 0.50. Connections are colored by weight sign (positive or negative) and annotated with their gradient (∂).

The Algorithm

  1. Forward: Compute activations layer by layer
  2. Backward: Compute gradients via chain rule
  3. Update: w = w - lr × gradient

For Diffusion Models

The U-Net denoiser is trained with backpropagation. Gradients flow from the noise-prediction loss (typically the mean squared error between predicted and actual noise) back through millions of parameters, updating each weight to predict noise better.

Pooling Layers

Pooling reduces spatial dimensions while preserving important features, which is essential for building efficient encoders. Hover over output cells to see the pooling window.

Max Pooling

Takes the maximum value in each window. Preserves the strongest activations and provides translation invariance.

Average Pooling

Takes the mean of values. Smoother downsampling, often used in final layers before classification.
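
Both variants differ only in how each window is summarized. This is a minimal sketch on nested lists with non-overlapping windows; `pool2d` is an illustrative name, and the 4×4 grid is the one from the Pooling Layers card.

```python
def pool2d(grid, size=2, mode="max"):
    """Downsample by summarizing each non-overlapping size x size window."""
    out = []
    for i in range(0, len(grid) - size + 1, size):
        row = []
        for j in range(0, len(grid[0]) - size + 1, size):
            window = [
                grid[i + di][j + dj] for di in range(size) for dj in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out

grid = [
    [4, 2, 1, 3],
    [1, 5, 6, 2],
    [3, 1, 2, 4],
    [2, 3, 5, 1],
]
max_pooled = pool2d(grid, mode="max")   # [[5, 6], [3, 5]]
avg_pooled = pool2d(grid, mode="avg")   # [[3.0, 3.0], [2.25, 3.0]]
```

Max pooling keeps the strongest value in each window, while average pooling flattens the same windows toward their mean.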

Variational Autoencoder (VAE)

Stable Diffusion operates in VAE latent space. VAEs learn a compressed latent representation from which images are reconstructed. Adjust the latent dimensions to see how they affect the output.

Encoder q(z|x)

Maps input to mean μ and variance σ² of the latent distribution.

Reparameterization

z = μ + σ·ε enables backprop through sampling (ε ~ N(0,1)).

Decoder p(x|z)

Reconstructs the input from the sampled latent vector z.
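
The reparameterization step is short enough to write out. This is a minimal sketch, assuming the encoder outputs a log-variance per dimension (a common convention, not stated above); `reparameterize` is an illustrative name.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    The randomness lives in eps, so z stays a smooth function of mu and
    sigma, which is what lets gradients flow back into the encoder.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, 0.0]  # sigma = 1 in both dimensions
samples = [reparameterize(mu, log_var, rng) for _ in range(10000)]
mean_z0 = sum(s[0] for s in samples) / len(samples)
```

Averaged over many draws the samples center on μ, and as σ shrinks toward zero the sampled z collapses onto μ.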

Motion Field (Optical Flow)

Optical flow shows per-pixel motion vectors. Motion fields help video models understand and predict movement patterns.

In Video Generation

Motion fields help maintain temporal consistency. The model predicts motion vectors from audio, then warps the previous frame accordingly before refining with diffusion.

Dropout Regularization

Dropout randomly deactivates neurons during training to prevent overfitting. Watch neurons get "dropped" each iteration as the network learns redundant representations.

Training

Random neurons deactivated each forward pass. Forces network to learn redundant representations.

Inference

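All neurons active. In the original formulation, outputs are scaled by (1-p), where p is the drop probability, to match expected training statistics. Most modern frameworks use inverted dropout instead: surviving activations are scaled by 1/(1-p) during training, so inference needs no scaling.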
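
The training and inference behavior can be sketched in one function. This is a minimal sketch of the inverted variant, where the rescaling happens during training instead of at inference; `dropout` is an illustrative name and `p` is the drop probability.

```python
import random

def dropout(activations, p=0.3, training=True, rng=random):
    """Inverted dropout: drop with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)  # inference: every neuron stays active
    keep = 1.0 - p
    # Dividing survivors by keep preserves the expected activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 20000, p=0.3, training=True, rng=rng)
infer_out = dropout([1.0, 2.0, 3.0], p=0.3, training=False)
mean_train = sum(train_out) / len(train_out)
```

The average activation during training stays close to the inference value, which is the point of the rescaling in either formulation.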

Batch Normalization

BatchNorm normalizes layer inputs to stabilize training. See how it transforms the distribution of activations.

Training

Uses batch statistics (μ, σ). Learnable γ and β allow the network to undo normalization if needed.

Inference

Uses running averages computed during training. No dependency on batch size at test time.

Attention Mechanism

Explore self-attention and cross-attention. Attention allows models to focus on relevant parts of the input. Click different query tokens to see attention patterns.

Self-Attention

Each token attends to all other tokens in the same sequence. Used in transformers for context understanding.

Cross-Attention

Queries from one sequence attend to keys and values from a different sequence (e.g., image or video queries attending to audio tokens). Used for conditioning generation on external signals.
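
Both forms share the same computation and differ only in where the keys and values come from. This is a minimal sketch of scaled dot-product attention for a single query on plain lists; the `temperature` parameter mirrors the attention temperature control in the cross-attention demo, and all names and vectors are illustrative.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into weights that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, temperature=1.0):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores, temperature)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# One image-region query attending to two audio tokens (made-up vectors).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
output, weights = attend(query, keys, values)
_, sharp_weights = attend(query, keys, values, temperature=0.1)
```

For self-attention, the query, keys, and values all come from the same sequence. As a check against the Softmax card, `softmax([2.0, 1.0, 0.5])` gives roughly 63%, 23%, and 14%.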

Loss Functions

Loss functions measure how wrong predictions are. Compare MSE, MAE, BCE, and Huber loss; different losses have different properties for training.

MSE vs MAE

MSE penalizes large errors more heavily (quadratic). MAE is more robust to outliers (linear).

Huber Loss

Combines MSE and MAE: quadratic for small errors, linear for large ones. Best of both worlds.
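
The three regression losses can be compared on the same predictions. This is a minimal sketch on plain lists, assuming the usual Huber definition with threshold `delta`; the function names are illustrative.

```python
def mse(pred, target):
    """Mean squared error: quadratic penalty."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae(pred, target):
    """Mean absolute error: linear penalty."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def huber(pred, target, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)

pred = [0.0, 0.0]
target = [0.5, 3.0]  # one small error and one outlier

losses = (mse(pred, target), mae(pred, target), huber(pred, target))
```

The outlier dominates MSE, while MAE and Huber grow only linearly with it.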

Embedding Space

Neural networks learn to map concepts to vectors, and similar concepts cluster together in embedding space. Click to query nearest neighbors or select points.

Vector Similarity

Similar concepts have similar vectors. Cosine similarity or Euclidean distance measures closeness.

In Generative Models

Face encoders, CLIP, and audio embeddings all learn meaningful spaces for conditioning generation.

The Generation Pipeline

Audio comes in, identity reference is extracted, and through iterative denoising, a photorealistic talking video emerges.

Step 1 of 6: Audio Encoding

Convert speech to mel-spectrograms or learned speech embeddings that capture phonemes and prosody.

Key Architectural Insight

The magic happens in the U-Net or Transformer denoiser, which receives conditioning from both the audio (what motion to generate) and the reference image (what identity to preserve). Cross-attention layers let these signals guide the denoising process.

Key Concepts

Understanding these five pillars will give you a complete mental model of how diffusion-based talking heads work.

1. Diffusion & Denoising (Intermediate)

The model learns to reverse a gradual noise-addition process. Given noisy input and a timestep, it predicts what noise was added so we can subtract it, step by step refining static into a clear image.

Imagine a marble sculpture slowly eroded by sandpaper until it's an unrecognizable block. Now train a 'master restorer' to undo exactly one sandpaper stroke. Give them a rough block, and stroke by stroke, the sculpture emerges.

2. Latent Space (Intermediate)

Instead of working with millions of pixels, we compress images into a smaller 'semantic ZIP' using a VAE. Diffusion happens in this compressed representation, then we decompress back to pixels.

Like how a ZIP file preserves document essentials in a smaller package, the VAE compresses images keeping what matters (face shape, expression) while discarding redundant pixel details.

3. Audio-to-Motion (Intermediate)

The model learns which mouth shapes go with which sounds. 'P' and 'B' = lips closed. 'Ah' = mouth open. It converts audio waveforms into facial motion parameters that drive the talking head.

Think of a puppeteer reading a script with stage directions. Each sound is a cue: 'close mouth for M', 'raise eyebrows for questions'. The model learned these mappings from thousands of hours of people talking.

4. Distillation for Real-Time (Intermediate)

Standard diffusion typically needs 20-50 steps, which is too slow for real-time use. Distillation trains a 'student' model to achieve the same result in 1-4 steps by learning shortcuts, enabling 20+ FPS generation.

A careful artist makes 50 brushstrokes to complete a painting. Distillation trains an apprentice to achieve the same result in 4 bold strokes by learning 'what would the master eventually create?'

5. Identity Preservation (Intermediate)

The hardest challenge: keeping the same person's face throughout the video. ReferenceNet and decoupled attention separate 'what moves' (expression) from 'what stays' (identity).

Like a film actor wearing prosthetic makeup that moves naturally with their expressions while maintaining the character's distinct look. The reference encoder captures this 'prosthetic template'.

Providers & Streaming Infrastructure

Generative video models produce the frames, but delivering them to users requires real-time streaming infrastructure. Avatar providers bundle generation with WebRTC delivery so you can integrate via API.

The Puppeteer Metaphor

Avatar providers are like puppeteers hidden behind the stage. You send them audio (the script), they perform the show (generate video). The audience only sees the puppet, never the puppeteer.

Your Audio (the script) → Avatar Provider (the hidden puppeteer) → Video Stream (the performance) → Your User (the audience)

In conversational AI, total latency determines user experience. Above 500ms feels sluggish, above 1 second feels broken. Streaming avatars must balance quality vs speed: faster generation often means lower quality, and network conditions add unpredictable delays.

WebRTC & LiveKit Architecture

WebRTC enables peer-to-peer encrypted media transport directly in the browser. LiveKit adds a Selective Forwarding Unit (SFU) for scalable multi-party rooms, plus server-side agent infrastructure for connecting AI pipelines (STT, LLM, TTS, Avatar) to the media stream.

ICE / STUN / TURN

NAT traversal for peer connectivity

SFU (Selective Forwarding)

Server relays media without decoding

LiveKit Agents

Server-side AI pipeline participants

Latency Budget (Target: ~500ms)

STT: 90-200ms
LLM: 75-300ms
TTS: 100-200ms
Network: 50-100ms

Provider Comparison

Compare streaming avatar providers across latency, quality, pricing, and features. Each takes a different approach to real-time generation.

Quick Decision Guide

Lowest Latency

Simli (~100ms) for real-time conversation

Best Quality

Hedra for photorealistic output

Most Customizable

Rapport for full MetaHuman control

SFU Architecture Comparison

Understand why Selective Forwarding Units are preferred over mesh (P2P) or MCU topologies for real-time avatar streaming. Watch how data packets flow differently in each architecture. The demo is shown with 4 participants.

Selective Forwarding Unit

Server receives all streams and forwards them without decoding or re-encoding.

Latency: ~100ms
Server load: low (forwarding only)
Bandwidth: medium (per participant: 1 upload, N-1 downloads)

Pros

  • Good scalability
  • Low server CPU
  • Flexible quality adaptation

Cons

  • Server bandwidth costs
  • Still moderate client download

Best For

Video conferencing, live streaming, avatar streaming
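
The bandwidth difference between the three topologies comes down to how many streams each participant sends and receives. This is a minimal sketch that counts streams only (it ignores bitrate, simulcast, and server cost); `streams_per_participant` is an illustrative name.

```python
def streams_per_participant(topology, n):
    """Return (uploads, downloads) per participant in a room of n people."""
    if topology == "p2p":
        # Full mesh: send to and receive from every other peer directly.
        return (n - 1, n - 1)
    if topology == "sfu":
        # Send once; the server forwards every other participant's stream.
        return (1, n - 1)
    if topology == "mcu":
        # Send once; the server mixes everything into a single stream.
        return (1, 1)
    raise ValueError(f"unknown topology: {topology}")

room_of_four = {
    name: streams_per_participant(name, 4) for name in ("p2p", "sfu", "mcu")
}
```

The SFU keeps client upload constant as the room grows without making the server decode and mix media, which is the trade-off described above.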

End-to-End Latency

Understand how latency accumulates across the voice AI pipeline, from microphone input to avatar video output. Adjust individual stages to see the impact on total round-trip time.

Stage | Default | Range
Audio Capture | 20ms | 10-30ms
Upload (WebRTC) | 50ms | 20-100ms
Speech-to-Text | 150ms | 90-300ms
LLM Processing | 200ms | 75-500ms
Text-to-Speech | 150ms | 100-300ms
Avatar Render | 80ms | 30-150ms
Download (WebRTC) | 50ms | 20-100ms

With the defaults, the total round-trip latency is 700ms ("Acceptable, slight delay"). The demo also reports ~350ms effective latency.

Rating | Total latency
Target | ~500ms
Acceptable | ~700ms
Noticeable | ~1000ms
Poor | >1000ms

Optimization Strategies

  • Streaming STT: Start processing before user finishes speaking
  • Streaming TTS: Begin playback before full response generated
  • Edge deployment: Reduce network latency with regional servers
  • Model selection: Smaller LLMs trade quality for speed
  • Turn detection: Faster endpointing reduces perceived latency
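
The latency budget is a simple sum over stages, which makes it easy to see what an optimization buys. This is a minimal sketch using the demo's default stage values; the rating boundaries are an assumption (each label is treated as an upper bound), and the names are illustrative.

```python
DEFAULT_STAGES_MS = {
    "Audio Capture": 20,
    "Upload (WebRTC)": 50,
    "Speech-to-Text": 150,
    "LLM Processing": 200,
    "Text-to-Speech": 150,
    "Avatar Render": 80,
    "Download (WebRTC)": 50,
}

def total_latency(stages):
    """Sequential round-trip time: every stage waits for the previous one."""
    return sum(stages.values())

def rate(total_ms):
    """Map a total to the demo's labels (boundaries assumed inclusive)."""
    if total_ms <= 500:
        return "Target"
    if total_ms <= 700:
        return "Acceptable"
    if total_ms <= 1000:
        return "Noticeable"
    return "Poor"

baseline = total_latency(DEFAULT_STAGES_MS)

# Example: a smaller LLM at the low end of its range (75ms).
faster = dict(DEFAULT_STAGES_MS, **{"LLM Processing": 75})
improved = total_latency(faster)
```

This sums stages strictly in sequence; streaming STT and TTS overlap stages, so perceived latency can be lower than the sum.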

Try the Live Demos

Experience streaming avatars in action with two different approaches.

LiveKit + Hedra: diffusion-based avatar streamed via WebRTC

Rapport MetaHuman: Unreal Engine pixel-streamed avatar

Build It Yourself

Start with SadTalker for quick results, then explore more advanced options.

Step 1 of 5: Clone the repository and set up the environment for audio-driven talking head generation.

bash

git clone https://github.com/OpenTalker/SadTalker.git
cd SadTalker
pip install -r requirements.txt

Resources

  • SadTalker (CVPR 2023): GitHub
  • GeneFace++ (Real-Time NeRF): GitHub
  • LivePortrait (High-Fidelity Animation): GitHub
  • CausVid Paper (Streaming Diffusion): paper
  • Awesome Talking Head Generation: GitHub
  • Hugging Face Diffusers Tutorial: docs

When to Use Video Generation

Use When

  • You need photorealistic results from a single reference image
  • You want to animate any face without per-person training
  • Quality matters more than inference speed (or you can use distillation)
  • You have access to powerful GPUs (A100 / H100 / cloud)
  • You want natural behaviors like blinks and micro-expressions
  • You need to deploy to any device via WebRTC with minimal client requirements
  • You want a hosted provider API for fast time-to-market

Avoid When

  • You need sub-100ms latency (use MetaHuman or Streaming instead)
  • You need precise frame-by-frame control (use rigged models)
  • You're running on consumer hardware without cloud (use Gaussian Splatting)
  • You need guaranteed deterministic output (diffusion is probabilistic)
  • You're concerned about deepfake misuse in your application

Best Use Case

Creating photorealistic talking head videos from any face image without per-person training, ideal for content creation, dubbing, and AI assistants

Common Misconceptions

Diffusion models generate images in one shot

Actually: Diffusion requires iterative refinement - typically 20-50 steps. Each step slightly improves the image. This is fundamentally different from GANs which generate in a single forward pass.

More denoising steps always means better quality

Actually: Returns diminish rapidly after a certain point (often 20-30 steps). Distilled models achieve comparable quality in 1-4 steps through learned shortcuts.

The prompt controls everything precisely

Actually: The mapping from conditioning to output is probabilistic, not deterministic. Artists spend hours refining prompts because results vary.

Real-time diffusion is impossible

Actually: Through distillation (CausVid, DMD, consistency models), diffusion can achieve 9-20+ FPS. The key is training a student model to skip steps.

GANs are obsolete

Actually: GANs still excel at speed and can be better for specific real-time applications. Diffusion offers superior stability, diversity, and mode coverage.

Approaches Comparison

Approach | Method | Example | Best For
2D Warping | First-Order Motion Model: fast and simple but limited head motion | Avatarify | Quick prototypes
3DMM-based | Explicit 3D face model for interpretable control | SadTalker, VividTalk | Controllable animation
NeRF-based | True 3D neural rendering for view synthesis | GeneFace++ | Multi-view synthesis
Full Diffusion | Highest quality using video diffusion priors | Hallo, Sonic, MAGIC-Talk | Maximum quality
