LayerT2V: A Unified Multi-Layer
Video Generation Framework

International Conference on Machine Learning (ICML), 2026

Guangzhao Li*1,5, Kangrui Cen*2,3, Baixuan Zhao1, Yi Xin4,5, Siqi Luo1, Guangtao Zhai1, Lei Zhang2,3, Xiaohong Liu†1,5
1Shanghai Jiao Tong University 2Hong Kong Polytechnic University 3OPPO Research Institute 4Nanjing University 5Shanghai Innovation Institute

*Equal contribution. †Corresponding author.

Abstract

Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose LayerT2V, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce VidLayer, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.

Method

LayerT2V leverages the high temporal-spatial compression of modern video backbones to serialize multiple layer streams along a shared trajectory, augmented with LayerAdaLN and layer-aware cross-attention for layer identity and prompt routing.
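
The serialization idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the layer order, the latent shapes, and the function names `serialize_layers` and `deserialize_layers` are all assumptions made for the example. It only shows that per-layer latents stacked along the time axis form one sequence that can be split back without loss.

```python
import numpy as np

def serialize_layers(layer_latents):
    """Concatenate per-layer latents along the time axis.

    layer_latents: list of arrays shaped (T, C, H, W), one per layer.
    Returns the joint sequence shaped (L*T, C, H, W) and a per-frame
    layer index shaped (L*T,) recording which layer each frame came from.
    """
    frames = np.concatenate(layer_latents, axis=0)
    layer_ids = np.concatenate(
        [np.full(lat.shape[0], i, dtype=np.int64)
         for i, lat in enumerate(layer_latents)]
    )
    return frames, layer_ids

def deserialize_layers(frames, layer_ids):
    """Split the joint sequence back into one latent array per layer."""
    num_layers = int(layer_ids.max()) + 1
    return [frames[layer_ids == i] for i in range(num_layers)]

# Hypothetical sizes: 4 layers (full video, background, foreground RGB,
# alpha matte), each with 5 latent frames of 16 channels at 8x8.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((5, 16, 8, 8)) for _ in range(4)]
joint, ids = serialize_layers(layers)
restored = deserialize_layers(joint, ids)
```

The per-frame `ids` array is the kind of signal a layer-identity mechanism could consume downstream; how LayerT2V actually encodes it is not specified on this page.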

LayerT2V training pipeline and architecture.
Stage 1 — Mask VAE Adaptation fine-tunes the Wan VAE decoder to faithfully reconstruct alpha mattes. Stage 2 — Multi-layer Joint Generation jointly models text, video, and mask tokens with LayerAdaLN-conditioned blocks and layer-aware cross-attention to keep each layer aligned with its prompt.
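
The two layer-aware components named in this caption can be illustrated with a small NumPy sketch. The page does not give their exact formulation, so everything here is an assumption: `layer_adaln` adds a learned per-layer embedding to the timestep embedding before predicting scale and shift, and `layer_cross_attention_mask` lets each video token attend only to the prompt tokens of its own layer. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Parameter-free layer norm over the feature axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_adaln(x, token_layer_ids, t_emb, layer_table, w_mod, b_mod):
    """Assumed LayerAdaLN: modulation conditioned on timestep + layer id.

    x: (N, D) tokens, token_layer_ids: (N,), t_emb: (D,),
    layer_table: (L, D), w_mod: (D, 2D), b_mod: (2D,).
    """
    cond = t_emb[None, :] + layer_table[token_layer_ids]       # (N, D)
    scale, shift = np.split(cond @ w_mod + b_mod, 2, axis=-1)  # 2 x (N, D)
    return layer_norm(x) * (1.0 + scale) + shift

def layer_cross_attention_mask(token_layer_ids, text_layer_ids):
    """True where a video token and a prompt token share a layer."""
    return token_layer_ids[:, None] == text_layer_ids[None, :]

def masked_cross_attention(q, k, v, mask):
    """Single-head attention; masked-out prompt tokens get zero weight.

    Assumes every query row has at least one allowed key.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy setup: 2 layers, 3 video tokens and 2 prompt tokens each.
rng = np.random.default_rng(0)
dim, num_layers = 8, 2
token_layer_ids = np.array([0, 0, 0, 1, 1, 1])
text_layer_ids = np.array([0, 0, 1, 1])
x = rng.standard_normal((6, dim))
t_emb = rng.standard_normal(dim)
layer_table = rng.standard_normal((num_layers, dim))
w_mod = rng.standard_normal((dim, 2 * dim)) * 0.1
b_mod = np.zeros(2 * dim)
text_k = rng.standard_normal((4, dim))
text_v = rng.standard_normal((4, dim))

h = layer_adaln(x, token_layer_ids, t_emb, layer_table, w_mod, b_mod)
mask = layer_cross_attention_mask(token_layer_ids, text_layer_ids)
out = masked_cross_attention(h, text_k, text_v, mask)
```

In this sketch, layer identity enters only through the modulation and the attention mask, so the positional encoding of the tokens is left untouched.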

VidLayer Dataset

The first large-scale layer-aligned video corpus. An automated pipeline turns 200K raw caption-video pairs into 50K structured samples — each with the full video, foreground, background, alpha matte, and layer-level captions.
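
As a rough picture of what one structured sample might hold, the sketch below models the components listed above as Python dataclasses. The class names, field names, and file paths are hypothetical and are not the dataset's actual schema; the sketch only mirrors the listed contents (full video, background, foreground, alpha matte, and layer-level captions).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ForegroundLayer:
    rgb_path: str     # foreground RGB video (hypothetical field)
    alpha_path: str   # matching alpha matte video (hypothetical field)
    caption: str      # layer-level caption for this foreground

@dataclass
class VidLayerSample:
    full_video_path: str
    full_caption: str
    background_path: str
    background_caption: str
    foregrounds: List[ForegroundLayer] = field(default_factory=list)

    def num_streams(self) -> int:
        # Full video + background + (RGB, alpha) for each foreground.
        return 2 + 2 * len(self.foregrounds)

# Placeholder paths and captions, for illustration only.
sample = VidLayerSample(
    full_video_path="clip_0001/full.mp4",
    full_caption="A dog runs across a beach at sunset.",
    background_path="clip_0001/background.mp4",
    background_caption="An empty beach at sunset.",
    foregrounds=[
        ForegroundLayer(
            rgb_path="clip_0001/fg0_rgb.mp4",
            alpha_path="clip_0001/fg0_alpha.mp4",
            caption="A running dog.",
        )
    ],
)
```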

VidLayer data construction pipeline.
Three automated stages: semantic annotation (Qwen3-VL + SAM3), multi-layer component extraction (MatAnyone + Gen-Omnimatte), and GPT-4o powered artefact auto-checking.
VidLayer dataset samples and statistics.
Diverse and balanced: single- and multi-foreground samples covering people, animals, vehicles, and natural scenes — with semantic redundancy filtering for prompt diversity.

Layered Generation Results

Each scene is shown with all four outputs produced in a single inference pass: the composited full video, the background layer, the foreground RGB, and the alpha matte.
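
For readers unfamiliar with layered video, the four outputs are related by standard alpha compositing (the "over" operator). The sketch below shows that relation in NumPy. It is the textbook formula, not a claim about how LayerT2V produces its full video, which is generated jointly with the layers; the array shapes and the name `composite_over` are assumptions for the example.

```python
import numpy as np

def composite_over(fg_rgb, alpha, bg_rgb):
    """Blend a foreground over a background with an alpha matte.

    fg_rgb, bg_rgb: (T, H, W, 3) floats in [0, 1].
    alpha: (T, H, W, 1) floats in [0, 1]; 1 means fully foreground.
    """
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb

# Toy clip: 2 frames of 2x2 pixels, white foreground over black background.
fg = np.ones((2, 2, 2, 3))
bg = np.zeros((2, 2, 2, 3))
alpha = np.full((2, 2, 2, 1), 0.25)
full = composite_over(fg, alpha, bg)
```

Comparing such a recomposited video against the generated full video is one simple way to inspect cross-layer consistency.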

Comparison with Baselines

Against LayerFlow on held-out VidLayer prompts, LayerT2V produces sharper foregrounds, cleaner backgrounds, and tighter cross-layer alignment.

Qualitative comparison with LayerFlow.
BL/BG/FG denote the blended (full-video), background, and foreground prompts.

VBench (held-out, 200 prompts)

LayerT2V wins on every dimension across foreground, background, and blended outputs. Scores below are listed in that order (foreground / background / blended).

  • Aesthetic Quality ↑: 0.497 / 0.558 / 0.539
  • Motion Smoothness ↑: 0.992 / 0.992 / 0.985
  • Temporal Flickering ↑: 0.987 / 0.983 / 0.968
  • Subject Consistency ↑: 0.983 / 0.984 / 0.975
  • Text Alignment ↑: 0.201 / 0.231 / 0.214

User Study (30 participants)

Preference rates (higher is better); the three rates in each row sum to 1. LayerT2V is preferred roughly 3.4× to 10× as often as either alternative.

  • Aesthetic Quality ↑: 0.724 vs. 0.120 / 0.156
  • Foreground Quality ↑: 0.768 vs. 0.156 / 0.076
  • Text Alignment ↑: 0.668 vs. 0.136 / 0.196
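
The preference margins implied by these rates can be checked with a few lines of arithmetic. The snippet below uses only the numbers listed above. The two comparison rates in each row are kept unnamed because the page does not label them, and the variable names are hypothetical.

```python
# (LayerT2V rate, first alternative, second alternative), as listed above.
user_study = {
    "Aesthetic Quality": (0.724, 0.120, 0.156),
    "Foreground Quality": (0.768, 0.156, 0.076),
    "Text Alignment": (0.668, 0.136, 0.196),
}

ratios = {}
for name, (ours, alt_a, alt_b) in user_study.items():
    # Each row should account for all votes.
    assert abs(ours + alt_a + alt_b - 1.0) < 1e-9
    # Margin over the stronger and the weaker alternative, respectively.
    ratios[name] = (ours / max(alt_a, alt_b), ours / min(alt_a, alt_b))

for name, (vs_stronger, vs_weaker) in ratios.items():
    print(f"{name}: {vs_stronger:.2f}x to {vs_weaker:.2f}x")
```

Across the three criteria the margins range from about 3.4× (Text Alignment, against the stronger alternative) to about 10.1× (Foreground Quality, against the weaker one).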

Ablation

LayerAdaLN and layer-aware cross-attention are both essential. 4D RoPE fails to disentangle layer identity from temporal dynamics, and the native Mask VAE produces blurry mattes, which VAE LoRA sharpens.

Ablation results.
(a–b) 4D RoPE breaks pretrained spatiotemporal encoding. (c–d) VAE LoRA produces sharper masks than a from-scratch Mask VAE.
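
For context on the VAE LoRA variant, the sketch below shows the generic low-rank adaptation update on a single linear map: a frozen weight `W` plus a trainable product `B @ A` scaled by `alpha / rank`. It is a minimal NumPy illustration of standard LoRA, not the authors' configuration. The page does not state which VAE layers are adapted or the rank used, and a plain linear layer stands in here for the decoder's actual layers.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map with a trainable low-rank update (generic LoRA)."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen, pretrained
        self.a = rng.standard_normal((rank, in_dim)) * 0.01  # trainable
        self.b = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; x has shape (N, in_dim).
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def merged_weight(self):
        # Equivalent dense weight, useful for folding the adapter in.
        return self.weight + self.scale * self.b @ self.a

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))       # hypothetical pretrained weight
layer = LoRALinear(w, rank=2, alpha=4.0, rng=rng)
x = rng.standard_normal((3, 4))
y_init = layer(x)                     # equals the frozen layer's output

# Pretend a training step updated B; the adapter now changes the output.
layer.b = rng.standard_normal((6, 2)) * 0.1
y_tuned = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, so fine-tuning starts from the pretrained behaviour rather than from scratch.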

More Qualitative Results

Three generation modes — single foreground with single subject, single foreground with multiple subjects, and multi-foreground with independent layers.

Additional LayerT2V qualitative results.

BibTeX

@misc{li2026layert2vunifiedmultilayervideo,
  title  = {LayerT2V: A Unified Multi-Layer Video Generation Framework},
  author = {Guangzhao Li and Kangrui Cen and Baixuan Zhao and Yi Xin and
            Siqi Luo and Guangtao Zhai and Lei Zhang and Xiaohong Liu},
  year   = {2026},
  eprint = {2508.04228},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2508.04228}
}