05 Active Research

Video Generation Models

Temporal coherence and controllable generation in video diffusion models.

Diffusion, Temporal Modeling, Controllability, Efficiency

Overview

Video generation sits at the frontier of generative AI. We research temporal consistency in diffusion-based video models, efficient architectures for long video synthesis, and fine-grained motion control for production media workflows.

Research Directions

Temporal attention architectures

Attention mechanisms that reason across frames without compute cost growing quadratically with sequence length.

Motion conditioning

Trajectory-conditioned generation enabling precise control over camera movement and object motion.

Video LoRA fine-tuning

Adapting pre-trained video models to specific subjects, styles, and motion vocabularies.

Latent video compression

Efficient video VAEs that compress temporal sequences without frame-level flicker artifacts.

Long-form coherence

Maintaining semantic and visual consistency across clips exceeding 60 seconds.

The architecture landscape

Current production video models fall into two architectural families: video diffusion transformers (DiT-based, e.g., Sora, CogVideoX) and cascaded latent diffusion models (e.g., Stable Video Diffusion). DiT models scale more predictably with compute but require significantly larger training budgets. Latent diffusion models are more accessible for fine-tuning but tend to reach a quality ceiling sooner.

Temporal consistency as an optimization target

Frame-by-frame diffusion generates sharp images but produces temporal flicker: small inconsistencies between adjacent frames that are imperceptible in any single frame but visually disturbing in motion. Solving this requires conditioning each frame on its neighbors via 3D convolutions, temporal attention, or explicit optical flow guidance. We use a hybrid: sparse 3D attention for local consistency and cross-frame CLIP embeddings for global semantic coherence.
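
To make the "local" half of that hybrid concrete, here is a minimal sketch of a banded temporal attention mask, where each frame attends only to nearby frames. It is an illustration under stated assumptions, not our production attention pattern: the `radius` parameter and both function names are hypothetical, and the spatial axes that a real sparse 3D attention layer also covers are left out.

```python
def local_temporal_mask(num_frames: int, radius: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    `radius` is a hypothetical knob: each frame sees itself plus up to
    `radius` neighbors on either side along the time axis.
    """
    if num_frames <= 0 or radius < 0:
        raise ValueError("num_frames must be positive and radius non-negative")
    return [
        [abs(i - j) <= radius for j in range(num_frames)]
        for i in range(num_frames)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query frame, key frame) pairs the mask allows."""
    return sum(sum(row) for row in mask)


mask = local_temporal_mask(num_frames=8, radius=1)
sparse_pairs = attended_pairs(mask)  # pairs kept by the band
dense_pairs = 8 * 8                  # pairs in full temporal attention
print(sparse_pairs, dense_pairs)
```

For 8 frames and a radius of 1, the band keeps 22 of the 64 frame pairs. With a fixed radius the number of kept pairs grows roughly as `num_frames * (2 * radius + 1)`, which is linear in clip length rather than quadratic.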

Efficient inference for long video

Generating a 30-second clip at 24 fps with a standard video DiT requires processing 720 frames (30 s × 24 fps) through multiple denoising steps. Naive inference at this scale exceeds the memory of a single H100. We apply sliding-window denoising with cross-window overlap and use a distilled consistency model for the later denoising steps, cutting inference time by a factor of 3.5 while keeping perceptual quality, measured by SSIM, above 0.92.
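
The windowing idea can be sketched in a few lines. The example below is an assumption-laden illustration rather than our inference code: the window length of 48 frames, the overlap of 8 frames, the linear cross-fade, and all function names are hypothetical, and plain floats stand in for the latent tensors a denoiser would produce.

```python
def sliding_windows(num_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Split [0, num_frames) into half-open windows sharing `overlap` frames."""
    if num_frames <= 0 or not 0 <= overlap < window:
        raise ValueError("need num_frames > 0 and 0 <= overlap < window")
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            return windows
        start += stride


def ramp_weight(pos: int, length: int, overlap: int) -> float:
    """Blend weight for the frame at offset `pos` in a window of `length` frames.

    Weights ramp up over the leading overlap and down over the trailing one,
    so neighboring windows cross-fade instead of switching abruptly.
    """
    if overlap == 0:
        return 1.0
    rise = (pos + 1) / (overlap + 1)
    fall = (length - pos) / (overlap + 1)
    return min(1.0, rise, fall)


def blend_windows(num_frames, windows, outputs, overlap) -> list[float]:
    """Weighted average of per-window outputs, one value per frame."""
    total = [0.0] * num_frames
    weight = [0.0] * num_frames
    for (start, end), out in zip(windows, outputs):
        for pos in range(end - start):
            w = ramp_weight(pos, end - start, overlap)
            total[start + pos] += w * out[pos]
            weight[start + pos] += w
    return [t / w for t, w in zip(total, weight)]


num_frames = 30 * 24  # 30 seconds at 24 fps
windows = sliding_windows(num_frames, window=48, overlap=8)

# Stand-in for denoiser output: window k reports the constant value k,
# which makes the cross-fade between neighboring windows easy to inspect.
outputs = [[float(k)] * (end - start) for k, (start, end) in enumerate(windows)]
blended = blend_windows(num_frames, windows, outputs, overlap=8)
print(len(windows), windows[0], windows[1], windows[-1])
```

With these hypothetical settings the 720 frames split into 18 windows, so peak memory is bounded by the window length rather than the clip length. Frames inside an overlap get a weighted mix of both windows, which is what keeps the seams from showing up as visible jumps.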

Motion control and trajectory conditioning

For production workflows, uncontrolled motion generation is insufficient. We train motion ControlNet adapters on camera trajectory annotations from large video datasets, enabling precise conditioning on pan, tilt, zoom, and dolly paths. Object-level motion is conditioned via 2D bounding box trajectories that the model follows while filling in plausible appearance variations.
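
As an illustration of what a 2D bounding box trajectory might look like as a conditioning input, the sketch below expands a few keyframed boxes into one box per frame by linear interpolation. The keyframe format, the normalized `(x_min, y_min, x_max, y_max)` convention, and the function name are assumptions made for this example; the source does not specify how trajectories are authored.

```python
from bisect import bisect_right

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = tuple[float, float, float, float]


def interpolate_boxes(keyframes: dict[int, Box], num_frames: int) -> list[Box]:
    """Expand sparse keyframe boxes into a per-frame trajectory.

    Frames before the first keyframe or after the last one hold the
    nearest keyframe's box; frames in between are linearly interpolated.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    frames = sorted(keyframes)
    track = []
    for t in range(num_frames):
        if t <= frames[0]:
            track.append(keyframes[frames[0]])
        elif t >= frames[-1]:
            track.append(keyframes[frames[-1]])
        else:
            hi = bisect_right(frames, t)          # first keyframe after t
            f0, f1 = frames[hi - 1], frames[hi]   # bracketing keyframes
            alpha = (t - f0) / (f1 - f0)
            b0, b1 = keyframes[f0], keyframes[f1]
            track.append(tuple(a + alpha * (b - a) for a, b in zip(b0, b1)))
    return track


# An object sliding left to right over the first second of a 2-second clip.
keyframes = {0: (0.1, 0.4, 0.3, 0.6), 24: (0.6, 0.4, 0.8, 0.6)}
track = interpolate_boxes(keyframes, num_frames=48)
print(track[12])
```

The resulting list has one box per frame and could be rasterized or embedded before being passed to an adapter; that step depends on the model and is not shown here.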

Trajectory encoding

Camera trajectories are encoded as sequences of 4x4 homogeneous rotation-translation matrices, which are then projected into a learned embedding space. This representation is rotation-equivariant and generalizes to camera paths not seen during training.
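
A small sketch of the raw input to such an encoder: a camera pan expressed as a sequence of 4x4 matrices, each flattened into a 16-value feature vector. This is an illustration under assumptions, not the encoder itself. The yaw-only pan, the evenly spaced angles, and the function names are hypothetical, the camera-to-world versus world-to-camera convention is left open, and the learned projection is not shown.

```python
import math

Matrix = list[list[float]]


def pose_matrix(yaw_rad: float, translation: tuple[float, float, float]) -> Matrix:
    """Build a 4x4 homogeneous matrix: rotation about the y axis plus translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, 0.0, s, tx],
        [0.0, 1.0, 0.0, ty],
        [-s, 0.0, c, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def pan_trajectory(num_frames: int, total_yaw_deg: float) -> list[Matrix]:
    """One pose per frame, sweeping the yaw evenly from 0 to `total_yaw_deg`."""
    poses = []
    for t in range(num_frames):
        frac = t / (num_frames - 1) if num_frames > 1 else 0.0
        poses.append(pose_matrix(math.radians(total_yaw_deg * frac), (0.0, 0.0, 0.0)))
    return poses


def flatten(pose: Matrix) -> list[float]:
    """Row-major 16-value vector, the per-frame input to a learned projection."""
    return [value for row in pose for value in row]


trajectory = pan_trajectory(num_frames=5, total_yaw_deg=90.0)
features = [flatten(pose) for pose in trajectory]
print(len(features), len(features[0]))
```

The first pose is the identity matrix and the last is a 90-degree rotation about the vertical axis. Tilt, zoom, and dolly paths would fill in the other rotation axes, the intrinsics, and the translation column respectively.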
