MinT

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Results on OOD Prompts

MinT is fine-tuned on videos with temporal captions that mostly describe human-centric events. Nevertheless, the model retains the base model's ability to generate novel concepts. Below are videos generated by MinT conditioned on out-of-distribution (OOD) prompts. All prompts were generated by an LLM.

Temporal Captions MinT (Ours)

[0.0s → 2.5s]: A fleet of starships glide through space. [2.5s → 5.4s]: A starship gets attacked by an energy beam from the enemy. [5.4s → 7.4s]: The starship combusts into a fiery explosion, sending out fragments of metallic debris. [7.4s → 9.1s]: The ego starship retreats between shattered vessels and cosmic rocks, evading further conflict.

[0.0s → 2.6s]: The camera shows two raccoons sitting on a bench near Times Square. [2.6s → 5.6s]: The camera zooms in on the raccoons holding books with their front paws. [5.6s → 9.1s]: One raccoon flips a page of the book.

[0.0s → 2.1s]: A close-up shot of a fluffy brown teddy bear standing near a kitchen sink. [2.1s → 5.0s]: The teddy bear picks up a plate from the sink. [5.0s → 6.9s]: The teddy bear scrubs the plate with little circular motions. [6.9s → 9.1s]: The teddy bear places the washed plate on the table.

[0.0s → 2.5s]: A fluffy white cat in colorful workout gear lying on a yoga mat. [2.5s → 5.2s]: The cat transitions into a perfect downward-facing dog pose, stretching its back. [5.2s → 7.5s]: The cat transitions into a tree pose on its hind legs, raising a paw towards its head. [7.5s → 9.1s]: The cat moves into a cobra pose, lowering down its paw.
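
Each temporal caption above is a sequence of events written as `[start → end]: text`, with every event starting where the previous one ends. For readers who want to work with these prompts programmatically, the following is a minimal sketch of how such a string could be split into timed events. It assumes only the textual layout shown on this page; the `Event` type and the `parse_temporal_caption` function are hypothetical names introduced for illustration and are not part of MinT.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    """One timed event from a temporal caption (hypothetical helper type)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches a time span such as "[0.0s → 2.5s]:" and captures both timestamps.
_SPAN = re.compile(r"\[\s*(\d+(?:\.\d+)?)s\s*→\s*(\d+(?:\.\d+)?)s\s*\]:")


def parse_temporal_caption(caption: str) -> List[Event]:
    """Split one temporal caption string into its timed events."""
    matches = list(_SPAN.finditer(caption))
    events = []
    for i, m in enumerate(matches):
        # The event text runs until the next time span, or to the end of the string.
        text_end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        text = caption[m.end():text_end].strip()
        events.append(Event(float(m.group(1)), float(m.group(2)), text))
    return events


caption = (
    "[0.0s → 2.6s]: The camera shows two raccoons sitting on a bench near Times Square. "
    "[2.6s → 5.6s]: The camera zooms in on the raccoons holding books with their front paws. "
    "[5.6s → 9.1s]: One raccoon flips a page of the book."
)
events = parse_temporal_caption(caption)
for e in events:
    print(f"{e.start:4.1f}-{e.end:4.1f}s  {e.text}")
```

Representing each event as a `(start, end, text)` record makes it easy to check that consecutive time spans are contiguous, as they are in all four captions on this page.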