Ziyi Wu1,2,3,
Aliaksandr Siarohin1,
Willi Menapace1,
Ivan Skorokhodov1,
Yuwei Fang1,
Varnith Chordia1,
Igor Gilitschenski2,3,*,
Sergey Tulyakov1,*
1Snap Research
2University of Toronto
3Vector Institute
* Equal Supervision
MinT is the first text-to-video model capable of generating sequential events and controlling their timestamps.
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
(a) Our model takes in a global caption, a list of temporal captions, and scene cut conditioning (optional).
Each temporal caption and scene cut is bound to a time span in the video.
(b) To condition on time-based event captions, we introduce a new temporal cross-attention layer in the DiT block.
(c) We design a novel Rescaled Rotary Position Embedding (ReRoPE) to indicate temporal correspondence between video tokens and event captions & scene cut tokens (optional).
This allows MinT to control the start and end time of events and the shot transition time.
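To make the positional-encoding idea concrete, below is a minimal PyTorch sketch of the intuition behind ReRoPE. It is an illustration under simplifying assumptions, not the paper's implementation: each event caption is assigned an integer position, and every video frame receives a fractional position by rescaling its timestamp within the span of its bound event, so frames and their corresponding caption end up with nearby rotary phases in the temporal cross-attention. The function names (`rerope_positions`, `apply_rope_1d`) and the exact rescaling are our own choices and may differ from the paper.

```python
import torch

def rerope_positions(frame_times, event_spans):
    """Assign each video frame a (possibly fractional) position by rescaling its
    timestamp inside the span of the event it belongs to. Event caption i uses
    integer position i, so frames bound to event i get rotary phases close to
    that caption's phase. Simplified illustration, not the paper's exact formula."""
    pos = torch.zeros_like(frame_times)
    for i, (start, end) in enumerate(event_spans):
        mask = (frame_times >= start) & (frame_times < end)
        # Linearly map [start, end) to [i - 0.5, i + 0.5).
        pos[mask] = i - 0.5 + (frame_times[mask] - start) / (end - start)
    return pos

def apply_rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding over the last dim of x, driven by the scalar
    positions computed above (fractional positions are fine)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)  # (half,)
    angles = pos[..., None] * freqs                              # (..., half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy example: a 9.1s video with 24 latent frames and two events.
frame_times = torch.linspace(0.0, 9.0, 24)
spans = [(0.0, 4.5), (4.5, 9.1)]
q = torch.randn(24, 64)                                 # video-token queries (T, d)
k = torch.randn(2, 64)                                  # event-caption keys (N_events, d)
q_rot = apply_rope_1d(q, rerope_positions(frame_times, spans))
k_rot = apply_rope_1d(k, torch.arange(2, dtype=torch.float32))
# q_rot and k_rot then enter the temporal cross-attention.
```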
Here, we show some high-resolution videos (1024x576). We use colored borders and subtitles to indicate the time period of each event. We first play the video with a pause before each event, and then play it continuously again. You can find more 512x288 videos (the resolution we used for the main experiments) here.
Sora [9] has a storyboard function (released one week after MinT) that supports multiple text prompts with time control. We run Sora with the same event captions and timestamps as our model and compare the results below. Despite being designed for this task, Sora still sometimes misses events, introduces undesired scene cuts, and is inaccurate in its event timing.
Temporal Captions | MinT (Ours) | Sora (Storyboard)
---|---|---
[0.0s → 2.3s]: A man lifts up his head and raises up both arms. [2.3s → 4.5s]: The man lowers down his head and puts down both arms. [4.5s → 6.8s]: The man turns his head to the right and extends both arms to the right. [6.8s → 9.1s]: The man turns his head to the left and extends both arms to the left. | (video) | (video)
[0.0s → 2.3s]: A young man typing on the laptop keyboard with both hands. [2.3s → 4.5s]: The man touches the headphones with his right hand. [4.5s → 6.5s]: The man closes the laptop with his left hand. [6.5s → 9.1s]: The man stands up. | (video) | (video)
[0.0s → 1.8s]: The woman is holding the phone in her left hand, looking at it while tapping on it with her right hand. [1.8s → 3.8s]: The woman holds the phone with both hands, extending them forward at face level to take a selfie. [3.8s → 7.0s]: The woman lowers the phone and begins typing on it with her right hand. [7.0s → 12.2s]: The woman adjusts her hair with her right hand, tucking it behind her left ear. | (video) | (video)
[0.0s → 1.2s]: The man holds a tablet with his left hand and uses it with his right hand. [1.2s → 5.6s]: The man looks at the blue bottles on his left and points his right hand towards them. [5.6s → 6.5s]: The man looks at the tablet and uses it with his right hand. [6.5s → 12.2s]: The man looks at the camera while holding the tablet with his left hand. | (video) | (video)
[0.0s → 2.1s]: The woman waves with her right hand. [2.1s → 7.5s]: The woman talks gesturing with her hands. [7.5s → 10.5s]: The woman makes a heart gesture. [10.5s → 12.2s]: The woman gives a blow kiss with her right hand. | (video) | (video)
Existing video generators struggle with sequential event generation.
We compare with the SOTA open-source models CogVideoX-5B [1] and Mochi 1 [2], and the commercial models Kling 1.5 [3] and Gen-3 Alpha [4].
We concatenate all temporal captions into one long prompt and run their online APIs to generate videos.
The prompts we used for the SOTA models can be found at prompts.
Existing models often miss some events, or merge multiple events and confuse their order.
In contrast, MinT synthesizes all events seamlessly following their desired time spans.
Please refer to paper Appendix C.6 for more analysis on SOTA model behaviors.
Check out more comparisons here.
Temporal Captions | MinT (Ours) | CogVideoX-5B | Mochi 1 | Kling 1.5 | Gen-3 Alpha
---|---|---|---|---|---
[0.0s → 2.3s]: A man lifts up his head and raises up both arms. [2.3s → 4.5s]: The man lowers down his head and puts down both arms. [4.5s → 6.8s]: The man turns his head to the right and extends both arms to the right. [6.8s → 9.1s]: The man turns his head to the left and extends both arms to the left. | (video) | (video) | (video) | (video) | (video)
[0.0s → 2.3s]: A young man typing on the laptop keyboard with both hands. [2.3s → 4.5s]: The man touches the headphones with his right hand. [4.5s → 6.5s]: The man closes the laptop with his left hand. [6.5s → 9.1s]: The man stands up. | (video) | (video) | (video) | (video) | (video)
[0.0s → 1.8s]: The woman is holding the phone in her left hand, looking at it while tapping on it with her right hand. [1.8s → 3.8s]: The woman holds the phone with both hands, extending them forward at face level to take a selfie. [3.8s → 7.0s]: The woman lowers the phone and begins typing on it with her right hand. [7.0s → 12.2s]: The woman adjusts her hair with her right hand, tucking it behind her left ear. | (video) | (video) | (video) | (video) | (video)
[0.0s → 1.2s]: The man holds a tablet with his left hand and uses it with his right hand. [1.2s → 5.6s]: The man looks at the blue bottles on his left and points his right hand towards them. [5.6s → 6.5s]: The man looks at the tablet and uses it with his right hand. [6.5s → 12.2s]: The man looks at the camera while holding the tablet with his left hand. | (video) | (video) | (video) | (video) | (video)
[0.0s → 2.1s]: The woman waves with her right hand. [2.1s → 7.5s]: The woman talks gesturing with her hands. [7.5s → 10.5s]: The woman makes a heart gesture. [10.5s → 12.2s]: The woman gives a blow kiss with her right hand. | (video) | (video) | (video) | (video) | (video)
MinT is fine-tuned on temporally captioned videos that mostly describe human-centric events. Nevertheless, our model retains the base model's ability to generate novel concepts. Here, we show videos generated by MinT conditioned on out-of-distribution prompts. Check out more OOD results here.
Temporal Captions | MinT (Ours)
---|---
[0.0s → 2.4s]: A sweeping crane shot reveals two warriors on the edge of the rugged cliffs, swords at the ready. [2.4s → 4.3s]: One warrior advances, swinging his sword in a wide arc aimed at his opponent's side. [4.3s → 6.7s]: The other warrior parries the attack, causing a shower of sparks to fly from the swords' contact. [6.7s → 9.1s]: The camera cranes up to capture both warriors circling each other, blades poised. | (video)
[0.0s → 2.3s]: The astronaut bends forward to pick up a sparkler from a metal container on the table. [2.3s → 4.5s]: The astronaut lights up a sparkler with a matchstick. [4.5s → 6.8s]: The astronaut waves the lit sparkler in a circle, leaving a trail of glowing sparkles. [6.8s → 9.1s]: The astronaut holds up the sparkler at eye level and admires the burst of colorful sparks. | (video)
We leverage LLMs to extend a short prompt into a detailed global caption and a list of temporal captions, from which we can generate more interesting videos with richer motion. This allows regular users to use our model without the tedious process of manually specifying events and timestamps. The instruction we used for the LLM can be found at prompt. Here, we compare with videos generated by our base model using the original short prompt (dubbed Short) and the detailed global caption (dubbed Global). Please refer to paper Appendix C.2 for more details.
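As a rough illustration of this prompt-extension step, the sketch below expands a short prompt into a global caption plus timed event captions. The instruction template, the `call_llm` placeholder, and the JSON schema are illustrative assumptions; the actual instruction used in the paper is linked above and may differ.

```python
import json

def call_llm(instruction: str) -> str:
    """Hypothetical placeholder -- plug in whatever LLM client you use."""
    raise NotImplementedError

INSTRUCTION_TEMPLATE = """You are writing a storyboard for a {duration:.1f}-second video.
Short prompt: "{short_prompt}"
1. Write a detailed global caption describing the subject and scene.
2. Split the video into 3-5 sequential events; give each a start time, an end
   time, and a caption describing a single action.
Return JSON: {{"global_caption": str, "events": [{{"start": float, "end": float, "caption": str}}]}}"""

def extend_prompt(short_prompt: str, duration: float = 8.1) -> dict:
    reply = call_llm(INSTRUCTION_TEMPLATE.format(short_prompt=short_prompt,
                                                 duration=duration))
    return json.loads(reply)

# extend_prompt("a cat drinking water") would return a global caption plus a
# list of timed event captions like the examples in the table below.
```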
Captions | MinT (Ours) | Short | Global
---|---|---|---
Short caption: a cat drinking water. Extended temporal captions: [0.0s → 1.7s]: A fluffy, orange cat walks towards a ceramic water bowl. [1.7s → 4.3s]: The cat has its pink nose dipping into the water as it begins to lap at the water with its tiny tongue. [4.3s → 8.1s]: The cat lifts its head and glances around the room with its green eyes. | (video) | (video) | (video)
Short caption: a bicycle accelerating to gain speed. Extended temporal captions: [0.0s → 2.2s]: The camera is at ground level capturing a close-up of the bicycle's wheels, standing still. [2.2s → 4.0s]: The camera tilts up to show the rider lightly pushing down on the pedal with their foot. [4.0s → 5.9s]: The camera zooms out to a medium shot, revealing the rider steadily pedaling while leaning forward. [5.9s → 8.1s]: A smooth track motion shows the bicycle racing down a street, gaining speed quickly. | (video) | (video) | (video)
Short caption: a bear catching a salmon in its powerful jaws. Extended temporal captions: [0.0s → 2.5s]: A large brown bear standing waist-deep in a rushing river. [2.5s → 4.3s]: The bear lunging catches a fish from the water. [4.3s → 8.1s]: The bear clenches a silver salmon in its jaws and lifts its head triumphantly. | (video) | (video) | (video)
Short caption: a bear climbing a tree. Extended temporal captions: [0.0s → 1.3s]: A brown bear walks towards a tall tree. [1.3s → 3.3s]: The bear stands on its hind legs and places its front paws on the tree trunk. [3.3s → 6.0s]: The bear slowly climbs higher up the tree. [6.0s → 8.1s]: The bear pauses to catch its breath. | (video) | (video) | (video)
Long videos often contain rich events, yet they also come with many scene cuts.
Directly training a video generator on them leads to undesired abrupt shot transitions in the generated results.
Instead, we propose to explicitly condition the model on scene cut timestamps during training.
Once the model learns such conditioning, we can set it to zero to generate a cut-free video at inference time.
Here, we compare videos generated with different scene cut conditioning.
We pause the video at the input scene cut times (highlighted with cyan borders).
Our model introduces the desired shot transitions while preserving subject identities and scene backgrounds.
Please refer to paper Appendix C.3 for more analysis.
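As a rough illustration of this conditioning, the snippet below builds a hypothetical per-frame cut indicator that is filled from annotated cut timestamps during training and zeroed out (or set at the desired times) at inference. The function name and the binary encoding are assumptions made for illustration; the paper's actual conditioning mechanism may differ.

```python
import torch

def scene_cut_signal(frame_times, cut_times):
    """Hypothetical per-frame scene-cut indicator: 1.0 for the first frame at or
    after each annotated cut, 0.0 elsewhere."""
    signal = torch.zeros_like(frame_times)
    for t in cut_times:
        idx = int(torch.searchsorted(frame_times, torch.tensor(float(t))))
        if idx < frame_times.numel():
            signal[idx] = 1.0
    return signal

frame_times = torch.linspace(0.0, 9.0, 24)
train_signal = scene_cut_signal(frame_times, cut_times=[3.5])  # real cuts from the data
infer_signal = torch.zeros_like(frame_times)                   # no cuts -> cut-free video
```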
Temporal Captions | No Cut | Temporal Captions | After First Event | Temporal Captions | After All Events
---|---|---|---|---|---
[0.0s → 3.5s]: Event1 [3.5s → 6.0s]: Event2 [6.0s → 9.1s]: Event3 | (video) | [0.0s → 3.5s]: Event1 3.5s: Scene cut [3.5s → 6.0s]: Event2 [6.0s → 9.1s]: Event3 | (video) | [0.0s → 3.5s]: Event1 3.5s: Scene cut [3.5s → 6.0s]: Event2 6.0s: Scene cut [6.0s → 9.1s]: Event3 | (video)
[0.0s → 2.7s]: Event1 [2.7s → 6.6s]: Event2 [6.6s → 9.3s]: Event3 | (video) | [0.0s → 2.7s]: Event1 2.7s: Scene cut [2.7s → 6.6s]: Event2 [6.6s → 9.3s]: Event3 | (video) | [0.0s → 2.7s]: Event1 2.7s: Scene cut [2.7s → 6.6s]: Event2 6.6s: Scene cut [6.6s → 9.3s]: Event3 | (video)
[0.0s → 1.9s]: Event1 [1.9s → 4.0s]: Event2 [4.0s → 6.2s]: Event3 [6.2s → 9.1s]: Event4 | (video) | [0.0s → 1.9s]: Event1 1.9s: Scene cut [1.9s → 4.0s]: Event2 [4.0s → 6.2s]: Event3 [6.2s → 9.1s]: Event4 | (video) | [0.0s → 1.9s]: Event1 1.9s: Scene cut [1.9s → 4.0s]: Event2 4.0s: Scene cut [4.0s → 6.2s]: Event3 6.2s: Scene cut [6.2s → 9.1s]: Event4 | (video)
We show MinT's fine-grained control over event timing. In each example, we offset the start and end times of all events by a fixed value. Each row thus shows a smooth progression of the same events. Please refer to paper Appendix C.4 for more details.
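The offsets used here amount to shifting every event's span by a constant; a tiny hypothetical helper like the one below illustrates the operation (the clamping to the video length is our own assumption).

```python
def offset_events(events, delta, video_len):
    """Shift every (start, end, caption) event by `delta` seconds, clamping the
    span to [0, video_len]."""
    return [(max(0.0, min(video_len, s + delta)),
             max(0.0, min(video_len, e + delta)),
             caption)
            for s, e, caption in events]

events = [(0.0, 2.3, "A man lifts up his head and raises up both arms."),
          (2.3, 4.5, "The man lowers down his head and puts down both arms.")]
shifted = offset_events(events, delta=0.5, video_len=9.1)
```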
-1.0s | -0.5s | Original Timestamps | +0.5s | +1.0s
---|---|---|---|---
(video) | (video) | (video) | (video) | (video)
(video) | (video) | (video) | (video) | (video)
(video) | (video) | (video) | (video) | (video)
Here, we show several representative failure cases of our model.
1. Since we fine-tune from a pre-trained video diffusion model, we inherit all its limitations.
The following two examples show that MinT fails to handle human hands and complex physics well.
[0.0s → 2.5s]: The person on the left strokes the woman's hand. [2.5s → 6.0s]: The person holds the woman's hand firmly. The camera tilts up. [6.0s → 9.1s]: The woman responds to the person. The camera tilts up and dollies backward.

[0.0s → 3.3s]: A light-skinned person cuts a strawberry with a knife. [3.3s → 5.9s]: The person cuts the strawberry into four pieces. [5.9s → 9.1s]: The person pushes the strawberry pieces towards the left with a knife.
2. Another failure case involves multi-subject scenes. The example below shows a video with multiple people, where MinT fails to bind attributes and actions to the correct person. While we mainly focus on temporal binding in this paper, this issue might be addressed with spatial binding, such as bounding-box-controlled video generation [8].
[0.0s → 3.4s]: The dark-skinned man says something, and the fair-skinned man holds a piece of pizza in his right hand and points his left hand to the tan-skinned man. The tan-skinned man is holding a piece of pizza in his left hand. [3.4s → 6.1s]: The fair-skinned man points at the TV with his left hand while the tan-skinned man eats the pizza, and the dark-skinned man looks at the TV. [6.1s → 9.1s]: The fair-skinned man picks up a green bottle of cold drink and drinks it while the tan-skinned man picks up a green bottle, and the dark-skinned man looks at the TV.
3. MinT sometimes fails to link subjects between global and temporal captions.
In this example, the global caption describes a woman wearing "a gray-red device on her eyes".
However, when prompted to "adjust the gray-red device" in the second event, she lifts a new device instead of the one on her eyes.
We tried some simple solutions, such as running cross-attention between the text embeddings of the global and temporal captions before feeding them to the DiT, hoping the model would learn to associate subjects, but it did not help.
Yet, this "binding" problem may simply be solved with more training data, which we leave for future work.
Global caption: A woman is doing an eye-checking test. She wears a gray-red device on her eyes and ... [0.0s → 2.8s]: A woman looks forward. [2.8s → 6.3s]: The woman moves her hands up, adjusts the gray-red device, and moves her hands down. [6.3s → 9.1s]: The woman looks forward and nods her head.
@article{MinT,
title = {Mind the Time: Temporally-Controlled Multi-Event Video Generation},
author = {Wu, Ziyi and Siarohin, Aliaksandr and Menapace, Willi and Skorokhodov, Ivan and Fang, Yuwei and Chordia, Varnith and Gilitschenski, Igor and Tulyakov, Sergey},
journal = {arXiv preprint arXiv:},
year = {2024}
}
[1] CogVideoX-5B. https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space.
[2] Mochi 1. https://www.genmo.ai/play.
[3] Kling 1.5. https://klingai.com/.
[4] Gen-3 Alpha. https://runwayml.com/research/introducing-gen-3-alpha.
[5] Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." ICCV. 2023.
[6] Su, Jianlin, et al. "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing. 2024.
[7] Huang, Ziqi, et al. "VBench: Comprehensive benchmark suite for video generative models." CVPR. 2024.
[8] Lian, Long, et al. "LLM-Grounded video diffusion models." ICLR. 2024.
[9] Sora. https://sora.com/.