Mind the Time: Temporally-Controlled Multi-Event Video Generation

Comparison with SOTA Models


Additional comparisons between MinT and state-of-the-art (SOTA) video generators are shown below.


Columns: Temporal Captions | MinT (Ours) | CogVideoX-5B | Mochi 1 | Kling 1.5 | Gen-3 Alpha
[0.0s → 4.5s]: The camera pans left and tilts up to show the female worker picking up the coffee cup, taking a sip, and then putting it back on the table while looking down.
[4.5s → 6.5s]: The camera tilts down to show the male worker beginning to write something.
[6.5s → 12.2s]: The camera trucks to the right, showing the female worker and the male worker both have their left hand under his mouth.


[0.0s → 2.3s]: A woman writes the details on a white sheet of paper.
[2.3s → 4.6s]: The woman looks at the right as a man holding a clipboard is coming to her.
[4.6s → 9.3s]: The man comes to the woman. They look at each other and start to discuss with the paper in the clipboard.


[0.0s → 3.3s]: A woman stands straight and smiles with a happy facial expression.
[3.3s → 7.5s]: A woman smiles with her hands closed at her stomach.
[7.5s → 9.3s]: A woman stands with closed hands begins to laugh while her torso is slightly bent forward.


[0.0s → 2.7s]: A woman stands with her head turned to the left and strokes her right hand with her left hand.
[2.7s → 6.6s]: The woman turns to look at the camera, stands with her hands down, and laughs.
[6.6s → 9.3s]: The woman stands with her head slightly tilted to her left and her left hand resting at waist level.


[0.0s → 3.1s]: The woman is writing something on a table.
[3.1s → 8.0s]: The woman looks upwards with a smile and spreads her arms.
[8.0s → 9.3s]: The woman lowers her arms and resumes to write something on the table.
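
The temporal captions above all follow the same `[start → end]: text` pattern, with each event starting where the previous one ends. As a minimal sketch (not the authors' code), the snippet below parses such lines into structured events and checks that they are contiguous. The names `Event`, `parse_temporal_captions`, and `is_contiguous` are hypothetical and introduced only for illustration.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    """One timed caption: start and end in seconds, plus the event text."""
    start: float
    end: float
    text: str


# Matches lines such as "[0.0s → 4.5s]: The camera pans left ..."
# (pattern inferred from the captions on this page; an assumption).
_CAPTION_RE = re.compile(
    r"\[(\d+(?:\.\d+)?)s\s*→\s*(\d+(?:\.\d+)?)s\]:\s*(.+)"
)


def parse_temporal_captions(lines: List[str]) -> List[Event]:
    """Parse caption lines into events, skipping lines that do not match."""
    events = []
    for line in lines:
        match = _CAPTION_RE.match(line.strip())
        if match is None:
            continue  # not a temporal caption line
        start, end = float(match.group(1)), float(match.group(2))
        if end <= start:
            raise ValueError(f"event ends before it starts: {line!r}")
        events.append(Event(start, end, match.group(3).strip()))
    return events


def is_contiguous(events: List[Event], tol: float = 1e-6) -> bool:
    """Return True if each event starts where the previous one ends."""
    return all(
        abs(prev.end - nxt.start) <= tol
        for prev, nxt in zip(events, events[1:])
    )


if __name__ == "__main__":
    sample = [
        "[0.0s → 3.1s]: The woman is writing something on a table.",
        "[3.1s → 8.0s]: The woman looks upwards with a smile and spreads her arms.",
        "[8.0s → 9.3s]: The woman lowers her arms and resumes to write something on the table.",
    ]
    for event in parse_temporal_captions(sample):
        print(f"{event.start:>4.1f}s to {event.end:>4.1f}s  {event.text}")
```

Lines that do not match the pattern are skipped rather than rejected, so the parser tolerates surrounding page text. A stricter variant could raise an error instead.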
