LayerT2V: A Unified Multi-Layer Video Generation Framework

Abstract

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.

Method

LayerT2V leverages the high temporal-spatial compression of modern video backbones to serialize multiple layer streams along a shared trajectory, augmented with LayerAdaLN and layer-aware cross-attention for layer identity and prompt routing.
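
To make the serialization idea concrete, here is a minimal NumPy sketch of stacking per-layer latents along the time axis and splitting them back. The tensor layout `(T, H, W, C)`, the sizes, and the helper names `serialize_layers` and `deserialize_layers` are illustrative assumptions, not details given on this page.

```python
import numpy as np

def serialize_layers(layer_latents):
    """Concatenate per-layer latents of shape (T, H, W, C) along the time axis.

    Returns the stacked sequence plus a per-frame layer index, so every
    latent frame remembers which layer it came from.
    """
    frames_per_layer = [lat.shape[0] for lat in layer_latents]
    sequence = np.concatenate(layer_latents, axis=0)
    layer_ids = np.repeat(np.arange(len(layer_latents)), frames_per_layer)
    return sequence, layer_ids

def deserialize_layers(sequence, layer_ids):
    """Split a serialized sequence back into its per-layer latents."""
    return [sequence[layer_ids == k] for k in np.unique(layer_ids)]

# Hypothetical sizes: 4 layers (full, background, foreground RGB, alpha),
# 5 latent frames each, on an 8x8 latent grid with 16 channels.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((5, 8, 8, 16)) for _ in range(4)]

sequence, layer_ids = serialize_layers(layers)
restored = deserialize_layers(sequence, layer_ids)
print(sequence.shape, layer_ids.tolist())
```

Because all layers sit in one sequence, a single denoising pass sees them together. The per-frame `layer_ids` array is the kind of signal a layer-identity mechanism could consume.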

LayerT2V training pipeline and architecture. — **Stage 1 — Mask VAE Adaptation** fine-tunes the Wan VAE decoder to faithfully reconstruct alpha mattes. **Stage 2 — Multi-layer Joint Generation** jointly models text, video, and mask tokens with LayerAdaLN-conditioned blocks and layer-aware cross-attention to keep each layer aligned with its prompt.
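
The page does not spell out the internals of LayerAdaLN or the layer-aware cross-attention, so the following is only a sketch of one plausible reading. It uses the generic adaptive layer norm pattern (a scale and shift predicted from a conditioning vector) with a learned per-layer embedding added to the timestep embedding, and a hard boolean mask that routes each video token to the text tokens of its own layer prompt. All names, shapes, and the choice of a hard mask are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_adaln(x, layer_ids, t_emb, layer_table, w, b):
    """Assumed LayerAdaLN: AdaLN whose condition also encodes the layer index.

    x: (N, D) tokens, layer_ids: (N,), t_emb: (D,),
    layer_table: (L, D), w: (D, 2D), b: (2D,).
    """
    cond = t_emb[None, :] + layer_table[layer_ids]       # (N, D)
    scale, shift = np.split(cond @ w + b, 2, axis=-1)    # each (N, D)
    return layer_norm(x) * (1.0 + scale) + shift

def prompt_routing_mask(token_layer_ids, prompt_layer_ids):
    """Boolean (N, M) mask: True where a video token may attend to a text token."""
    return token_layer_ids[:, None] == prompt_layer_ids[None, :]

# Hypothetical toy sizes: 6 video tokens, 3 layers, feature width 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
token_layer_ids = np.array([0, 0, 1, 1, 2, 2])
t_emb = rng.standard_normal(4)
layer_table = rng.standard_normal((3, 4))
w = rng.standard_normal((4, 8)) * 0.02
b = np.zeros(8)

out = layer_adaln(x, token_layer_ids, t_emb, layer_table, w, b)

# Six text tokens, tagged with the layer whose prompt they belong to.
prompt_layer_ids = np.array([0, 0, 0, 1, 1, 2])
mask = prompt_routing_mask(token_layer_ids, prompt_layer_ids)
print(out.shape, mask.astype(int))
```

With zero projection weights the modulation reduces to a plain layer norm, which is a common initialization for AdaLN-style blocks. A soft modulation of the attention logits would be an equally reasonable reading of "layer-aware cross-attention modulation".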

VidLayer Dataset

The first large-scale layer-aligned video corpus. An automated pipeline turns 200K raw caption-video pairs into 50K structured samples — each with the full video, foreground, background, alpha matte, and layer-level captions.

VidLayer data construction pipeline. — **Three automated stages**: semantic annotation (Qwen3-VL + SAM3), multi-layer component extraction (MatAnyone + Gen-Omnimatte), and GPT-4o powered artefact auto-checking.

VidLayer dataset samples and statistics. — **Diverse and balanced**: single- and multi-foreground samples covering people, animals, vehicles, and natural scenes — with semantic redundancy filtering for prompt diversity.

Layered Generation Results

Each scene shows all four outputs produced in a single inference pass: the composited full video, the background layer, the foreground RGB layer, and the alpha matte.
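
The four outputs are related by standard alpha compositing, which gives a simple way to check layers against the full video. The sketch below applies the usual "over" operator with NumPy. The array shapes, the value range [0, 1], and the function names are assumptions, and the page does not state that the generated full video matches the recomposited layers exactly.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend one foreground over a background with the 'over' operator.

    foreground, background: (T, H, W, 3) floats in [0, 1]
    alpha:                  (T, H, W, 1) floats in [0, 1]
    """
    return alpha * foreground + (1.0 - alpha) * background

def composite_many(background, foregrounds, alphas):
    """Apply several foreground layers in order, back to front."""
    frame = background
    for fg, a in zip(foregrounds, alphas):
        frame = composite(fg, frame, a)
    return frame

# Hypothetical toy clip: 2 frames of 4x4 pixels.
T, H, W = 2, 4, 4
foreground = np.full((T, H, W, 3), 0.8)
background = np.full((T, H, W, 3), 0.2)
alpha = np.zeros((T, H, W, 1))
alpha[:, :, : W // 2] = 1.0  # left half of every frame is fully opaque

full = composite(foreground, background, alpha)
recomposed = composite_many(background, [foreground], [alpha])
print(np.abs(full - recomposed).max())
```

The same `composite_many` loop covers the multi-foreground case by passing one RGB layer and one matte per foreground.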

Panels shown for each of the 15 scenes: Full, Background, Foreground, Alpha.

Comparison with Baselines

Against LayerFlow on held-out VidLayer prompts, LayerT2V produces sharper foregrounds, cleaner backgrounds, and tighter cross-layer alignment.

Qualitative comparison with LayerFlow. BL/BG/FG denote the full-video (blended), background, and foreground prompts.

VBench (held-out, 200 prompts)

LayerT2V wins on every dimension of foreground / background / blended outputs.

Metric (↑ higher is better): foreground / background / blended
Aesthetic Quality ↑: 0.497 / 0.558 / 0.539
Motion Smoothness ↑: 0.992 / 0.992 / 0.985
Temporal Flickering ↑: 0.987 / 0.983 / 0.968
Subject Consistency ↑: 0.983 / 0.984 / 0.975
Text Alignment ↑: 0.201 / 0.231 / 0.214

User Study (30 participants)

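Preference rates (higher is better), listed as LayerT2V vs. the two compared methods. LayerT2V receives 67–77% of preferences on every criterion, roughly 3.4× to 10× the share of either compared method.

Aesthetic Quality ↑: 0.724 vs. 0.120 / 0.156
Foreground Quality ↑: 0.768 vs. 0.156 / 0.076
Text Alignment ↑: 0.668 vs. 0.136 / 0.196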
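
The preference margins can be recomputed directly from the three rates in each row. The short script below does that arithmetic; the rates are copied from the rows above, while the labels for the two unnamed compared methods and the helper name `preference_ratios` are placeholders introduced here.

```python
# Preference rates per criterion: (LayerT2V, compared method A, compared method B).
# "A" and "B" are placeholder labels; the page does not name them in this table.
rates = {
    "Aesthetic Quality": (0.724, 0.120, 0.156),
    "Foreground Quality": (0.768, 0.156, 0.076),
    "Text Alignment": (0.668, 0.136, 0.196),
}

def preference_ratios(ours, *others):
    """How many times more often 'ours' is preferred than each other method."""
    return [ours / other for other in others]

all_ratios = []
for metric, (ours, *others) in rates.items():
    ratios = preference_ratios(ours, *others)
    all_ratios.extend(ratios)
    total = ours + sum(others)
    formatted = ", ".join(f"{r:.1f}x" for r in ratios)
    print(f"{metric}: rates sum to {total:.3f}, ratios {formatted}")

print(f"smallest ratio {min(all_ratios):.1f}x, largest ratio {max(all_ratios):.1f}x")
```

Each row sums to 1, which suggests every participant choice went to exactly one of three options per criterion.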

Ablation

LayerAdaLN and layered cross-attention are both essential. 4D RoPE fails to disentangle layer identity from temporal dynamics; the native Mask VAE produces blurry mattes — VAE LoRA fixes both.

More Qualitative Results

Three generation modes — single foreground with single subject, single foreground with multiple subjects, and multi-foreground with independent layers.

Additional LayerT2V qualitative results.

BibTeX

@misc{li2026layert2vunifiedmultilayervideo,
  title  = {LayerT2V: A Unified Multi-Layer Video Generation Framework},
  author = {Guangzhao Li and Kangrui Cen and Baixuan Zhao and Yi Xin and
            Siqi Luo and Guangtao Zhai and Lei Zhang and Xiaohong Liu},
  year   = {2026},
  eprint = {2508.04228},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2508.04228}
}
