MinT

Temporal Captions	MinT (Ours)	CogVideoX-5B	Mochi 1	Kling 1.5	Gen-3 Alpha
[0.0s → 4.5s]: The camera pans left and tilts up to show the female worker picking up the coffee cup, taking a sip, and then putting it back on the table while looking down. [4.5s → 6.5s]: The camera tilts down to show the male worker beginning to write something. [6.5s → 12.2s]: The camera trucks to the right, showing the female worker and the male worker both have their left hand under his mouth.

[0.0s → 2.3s]: A woman writes the details on a white sheet of paper. [2.3s → 4.6s]: The woman looks at the right as a man holding a clipboard is coming to her. [4.6s → 9.3s]: The man comes to the woman. They look at each other and start to discuss with the paper in the clipboard.

[0.0s → 3.3s]: A woman stands straight and smiles with a happy facial expression. [3.3s → 7.5s]: A woman smiles with her hands closed at her stomach. [7.5s → 9.3s]: A woman stands with closed hands begins to laugh while her torso is slightly bent forward.

[0.0s → 2.7s]: A woman stands with her head turned to the left and strokes her right hand with her left hand. [2.7s → 6.6s]: The woman turns to look at the camera, stands with her hands down, and laughs. [6.6s → 9.3s]: The woman stands with her head slightly tilted to her left and her left hand resting at waist level.

[0.0s → 3.1s]: The woman is writing something on a table. [3.1s → 8.0s]: The woman looks upwards with a smile and spreads her arms. [8.0s → 9.3s]: The woman lowers her arms and resumes to write something on the table.
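
The temporal captions above all follow the same textual pattern: a bracketed time span in seconds, a colon, and the event description. As an illustration only, the sketch below splits such a string into timed events. The names `Event` and `parse_temporal_captions` are hypothetical and are not taken from the MinT code; the regular expression assumes the exact `[start s → end s]:` layout shown on this page.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    """One timed event: start/end in seconds plus its caption (hypothetical type)."""
    start: float
    end: float
    caption: str


# Matches a span header such as "[0.0s → 3.1s]: " (layout assumed from this page).
_SPAN = re.compile(r"\[(\d+(?:\.\d+)?)s\s*→\s*(\d+(?:\.\d+)?)s\]:\s*")


def parse_temporal_captions(text: str) -> List[Event]:
    """Split a temporal-caption string into a list of timed events."""
    matches = list(_SPAN.finditer(text))
    events = []
    for i, m in enumerate(matches):
        # The caption runs from the end of this header to the start of the next one.
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        caption = text[m.end():stop].strip()
        events.append(Event(float(m.group(1)), float(m.group(2)), caption))
    return events


example = (
    "[0.0s → 3.1s]: The woman is writing something on a table. "
    "[3.1s → 8.0s]: The woman looks upwards with a smile and spreads her arms. "
    "[8.0s → 9.3s]: The woman lowers her arms and resumes to write something on the table."
)
events = parse_temporal_captions(example)
for event in events:
    print(f"{event.start:>4.1f}s to {event.end:>4.1f}s  {event.caption}")
```

In the examples on this page, each event starts at the time the previous one ends, so a parsed list can be checked for contiguity by comparing each `end` with the following `start`.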

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Comparison with SOTA Models