Temporal coherence and controllable generation in video diffusion models.
Overview
Video generation sits at the frontier of generative AI. We research temporal consistency in diffusion-based video models, efficient architectures for long video synthesis, and fine-grained motion control for production media workflows.
Research Directions
Temporal attention architectures
Attention mechanisms that reason across frames without cost that grows quadratically with sequence length.
Motion conditioning
Trajectory-conditioned generation enabling precise control over camera movement and object motion.
Video LoRA fine-tuning
Adapting pre-trained video models to specific subjects, styles, and motion vocabularies.
Latent video compression
Efficient video VAEs that compress temporal sequences without frame-level flicker artifacts.
Long-form coherence
Maintaining semantic and visual consistency across clips exceeding 60 seconds.
Current production video models fall into two architectural families: video diffusion transformers (DiT-based, e.g., Sora, CogVideoX) and U-Net-based latent diffusion models (e.g., Stable Video Diffusion). DiT models scale more predictably with compute but require significantly larger training budgets. U-Net-based models are more accessible for fine-tuning but reach their quality ceiling earlier.
Frame-level diffusion generates sharp images but produces temporal flicker: small inconsistencies between adjacent frames that are imperceptible in any single frame but visually disturbing in motion. Solving this requires conditioning each frame on its neighbors, through 3D convolutions, temporal attention, or explicit optical flow guidance. We use a hybrid: sparse 3D attention for local consistency and cross-frame CLIP embeddings for global semantic coherence.
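To make the local-consistency idea concrete, here is a minimal sketch of attention restricted to a window of neighboring frames. It is an illustration only, not the hybrid architecture described above: it attends along the time axis alone, uses each frame's feature vector directly as query, key, and value (no learned projections), and the function names and the radius parameter are hypothetical.

```python
import math


def local_window_indices(num_frames: int, radius: int) -> list[list[int]]:
    """For each frame, list the frame indices it may attend to (itself plus neighbors)."""
    return [
        list(range(max(0, t - radius), min(num_frames, t + radius + 1)))
        for t in range(num_frames)
    ]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_temporal_attention(frames: list[list[float]], radius: int) -> list[list[float]]:
    """Single-head attention over time, limited to a local window around each frame.

    frames holds one feature vector per frame. Assumption: queries, keys, and
    values are the raw features, so there are no trained weights in this sketch.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for t, neighbors in enumerate(local_window_indices(len(frames), radius)):
        # Scaled dot-product score between frame t and each neighbor in its window.
        scores = [
            scale * sum(q * k for q, k in zip(frames[t], frames[j]))
            for j in neighbors
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the neighbors' features.
        output.append(
            [sum(w * frames[j][d] for w, j in zip(weights, neighbors)) for d in range(dim)]
        )
    return output
```

With 720 frames and a radius of 4, this scores 6,460 frame pairs, compared with 518,400 for full attention over the same sequence, which is where the savings over quadratic attention come from.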
Generating a 30-second clip at 24 fps with a standard video DiT means processing 720 frames (30 s × 24 fps) through multiple denoising steps. Naive inference at this scale exceeds the memory of a single H100. We apply sliding-window denoising with cross-window overlap and use a distilled consistency model for the later denoising steps, reducing inference time by 3.5x while keeping SSIM above 0.92.
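The windowing step can be sketched as follows. This is a simplified illustration under stated assumptions: the window length, overlap, and linear-ramp blending are hypothetical choices, each frame is reduced to a single number, and the denoiser and the distilled consistency model are left out entirely.

```python
def window_spans(num_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Return [start, end) spans that cover every frame, with neighbors sharing `overlap` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    spans = []
    start = 0
    while start + window < num_frames:
        spans.append((start, start + window))
        start += stride
    # The final window is shifted back so it stays full length and ends on the last frame.
    spans.append((num_frames - window, num_frames))
    return spans


def ramp_weights(length: int, overlap: int) -> list[float]:
    """Blend weights that rise linearly near each window edge and are 1.0 in the interior."""
    weights = []
    for i in range(length):
        edge_distance = min(i, length - 1 - i)
        weights.append(min(1.0, (edge_distance + 1) / (overlap + 1)))
    return weights


def blend_windows(
    num_frames: int,
    spans: list[tuple[int, int]],
    window_outputs: list[list[float]],
    overlap: int,
) -> list[float]:
    """Merge per-window results into one sequence by weighted averaging in the overlaps."""
    total = [0.0] * num_frames
    norm = [0.0] * num_frames
    for (start, _end), values in zip(spans, window_outputs):
        weights = ramp_weights(len(values), overlap)
        for offset, (w, v) in enumerate(zip(weights, values)):
            total[start + offset] += w * v
            norm[start + offset] += w
    return [t / n for t, n in zip(total, norm)]
```

For example, window_spans(720, 48, 8) yields 18 windows, so the model only ever holds 48 frames in memory at once. Frames inside an overlap receive contributions from both windows, which smooths the seam between them.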
For production workflows, uncontrolled motion generation is insufficient. We train motion ControlNet adapters on camera trajectory annotations from large video datasets, enabling precise conditioning on pan, tilt, zoom, and dolly paths. Object-level motion is conditioned via 2D bounding box trajectories that the model follows while filling in plausible appearance variations.
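As an illustration of what a 2D bounding box trajectory can look like as a conditioning signal, the sketch below expands a few keyframe boxes into one box per frame by linear interpolation. The keyframe format, the normalized (x_min, y_min, x_max, y_max) layout, and the function name are assumptions made for this example, not a description of the adapter's actual input format.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]


def interpolate_boxes(keyframes: dict[int, Box], num_frames: int) -> list[Box]:
    """Expand sparse keyframe boxes into a per-frame trajectory.

    Boxes are linearly interpolated between keyframes and held constant
    before the first and after the last keyframe.
    """
    times = sorted(keyframes)
    trajectory = []
    for t in range(num_frames):
        if t <= times[0]:
            trajectory.append(keyframes[times[0]])
        elif t >= times[-1]:
            trajectory.append(keyframes[times[-1]])
        else:
            before = max(k for k in times if k <= t)
            after = min(k for k in times if k >= t)
            if before == after:
                trajectory.append(keyframes[before])
            else:
                alpha = (t - before) / (after - before)
                start_box, end_box = keyframes[before], keyframes[after]
                trajectory.append(
                    tuple((1 - alpha) * a + alpha * b for a, b in zip(start_box, end_box))
                )
    return trajectory


# An object sliding from the left of the frame to the right over five frames.
boxes = interpolate_boxes({0: (0.0, 0.4, 0.2, 0.6), 4: (0.8, 0.4, 1.0, 0.6)}, num_frames=5)
```

The resulting per-frame boxes could then be rasterized or embedded before being passed to the model; that step depends on the adapter design and is not shown here.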
Trajectory encoding
Camera trajectories are encoded as sequences of 4x4 homogeneous rotation-translation matrices, which are then projected into a learned embedding space. This representation is rotation-equivariant and generalizes to camera paths not seen during training.
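The matrix representation can be sketched as below. The axis convention (pan as rotation about the y axis, tilt about the x axis, dolly along z), the helper names, and the choice to flatten the top three rows into 12 values per frame are assumptions for illustration. The learned projection into the embedding space is a trained component and is not shown.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def camera_pose(pan_deg: float, tilt_deg: float, position: tuple[float, float, float]) -> Matrix:
    """Build a 4x4 homogeneous matrix [R | t; 0 0 0 1] with R = R_y(pan) @ R_x(tilt)."""
    pan, tilt = math.radians(pan_deg), math.radians(tilt_deg)
    rot_y = [
        [math.cos(pan), 0.0, math.sin(pan)],
        [0.0, 1.0, 0.0],
        [-math.sin(pan), 0.0, math.cos(pan)],
    ]
    rot_x = [
        [1.0, 0.0, 0.0],
        [0.0, math.cos(tilt), -math.sin(tilt)],
        [0.0, math.sin(tilt), math.cos(tilt)],
    ]
    rot = matmul(rot_y, rot_x)
    x, y, z = position
    return [rot[0] + [x], rot[1] + [y], rot[2] + [z], [0.0, 0.0, 0.0, 1.0]]


def pan_and_dolly(num_frames: int, total_pan_deg: float, total_dolly: float) -> list[Matrix]:
    """One pose per frame for a camera that pans and dollies at constant speed."""
    poses = []
    for i in range(num_frames):
        progress = i / (num_frames - 1) if num_frames > 1 else 0.0
        poses.append(
            camera_pose(progress * total_pan_deg, 0.0, (0.0, 0.0, progress * total_dolly))
        )
    return poses


def flatten_pose(pose: Matrix) -> list[float]:
    """Drop the constant bottom row and flatten the remaining 3x4 block to 12 values."""
    return [value for row in pose[:3] for value in row]


# A 720-frame path: 90-degree pan combined with a 2-unit dolly move.
trajectory = [flatten_pose(p) for p in pan_and_dolly(720, 90.0, 2.0)]
```

Each frame contributes a 12-value vector, so a 720-frame path becomes a 720 x 12 sequence that a projection layer could map into the model's embedding dimension.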