Mind the Time: Temporally-Controlled Multi-Event Video Generation

Results on OOD Prompts


MinT is fine-tuned on videos with temporal captions that mostly describe human-centric events. Nevertheless, the model retains the base model's ability to generate novel concepts. Below are videos generated by MinT conditioned on out-of-distribution prompts. All prompts are generated by an LLM.


Temporal Captions MinT (Ours)
[0.0s → 2.5s]: A fleet of starships glide through space.
[2.5s → 5.4s]: A starship gets attacked by an energy beam from the enemy.
[5.4s → 7.4s]: The starship combusts into a fiery explosion, sending out fragments of metallic debris.
[7.4s → 9.1s]: The ego starship retreats between shattered vessels and cosmic rocks, evading further conflict.
[0.0s → 2.6s]: The camera shows two raccoons sitting on a bench near Times Square.
[2.6s → 5.6s]: The camera zooms in on the raccoons holding books with their front paws.
[5.6s → 9.1s]: One raccoon flips a page of the book.
[0.0s → 2.1s]: A close-up shot of a fluffy brown teddy bear standing near a kitchen sink.
[2.1s → 5.0s]: The teddy bear picks up a plate from the sink.
[5.0s → 6.9s]: The teddy bear scrubs the plate with little circular motions.
[6.9s → 9.1s]: The teddy bear places the washed plate on the table.
[0.0s → 2.5s]: A fluffy white cat in colorful workout gear lying on a yoga mat.
[2.5s → 5.2s]: The cat transitions into a perfect downward-facing dog pose, stretching its back.
[5.2s → 7.5s]: The cat transitions into a tree pose on its hind legs, raising a paw towards its head.
[7.5s → 9.1s]: The cat moves into a cobra pose, lowering down its paw.
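
The temporal captions above all follow the same `[start → end]: text` layout, with each event starting where the previous one ends. As a minimal sketch, assuming only the textual layout shown on this page, such a prompt could be parsed into timed events as follows. The names `Event`, `parse_temporal_captions`, and `is_contiguous` are hypothetical helpers for illustration and are not part of MinT.

```python
import re
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    start: float  # event start time in seconds
    end: float    # event end time in seconds
    text: str     # caption describing the event


# Matches lines such as "[0.0s → 2.5s]: A fleet of starships glide through space."
# An ASCII "->" arrow is also accepted (an assumption, for convenience).
_CAPTION = re.compile(
    r"^\[(\d+(?:\.\d+)?)s\s*(?:→|->)\s*(\d+(?:\.\d+)?)s\]:\s*(.+)$"
)


def parse_temporal_captions(lines: Iterable[str]) -> List[Event]:
    """Parse caption lines into Event tuples, skipping blank lines."""
    events = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = _CAPTION.match(line)
        if match is None:
            raise ValueError(f"not a temporal caption: {line!r}")
        start, end = float(match.group(1)), float(match.group(2))
        if end <= start:
            raise ValueError(f"event must end after it starts: {line!r}")
        events.append(Event(start, end, match.group(3)))
    return events


def is_contiguous(events: List[Event], tol: float = 1e-6) -> bool:
    """Return True if each event starts where the previous one ends."""
    return all(abs(a.end - b.start) <= tol for a, b in zip(events, events[1:]))
```

For example, passing the three raccoon captions to `parse_temporal_captions` yields three events spanning 0.0s to 9.1s, and `is_contiguous` confirms there are no gaps between them.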