VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

University of North Carolina, Chapel Hill
Paper | Code

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: in (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering them with an MLLM. In (2) refinement planning, we identify the accurately generated objects and then create localized prompts to refine the other areas of the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. Finally, in (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct regions. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.

Method

[Figure: Overview of the VideoRepair pipeline]

VideoRepair refines the generated video in four stages: (1) video evaluation, (2) refinement planning, (3) region decomposition, and (4) localized refinement. From the initial prompt p, we first generate a fine-grained evaluation question set Qo and ask an MLLM to answer it. Next, the MLLM identifies the correctly generated objects O*, and an LLM plans how to refine the remaining regions. Using O*, we decompose the frame with Molmo and Semantic-SAM to obtain the region to keep. Finally, we perform localized refinement with the original, frozen T2V model.
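To make the stages concrete, here is a minimal Python sketch of one VideoRepair pass, assuming the pipeline components are supplied as callables. All names below (generate_video, make_questions, answer_with_mllm, plan_refinement, segment_keep_region, refine_locally) are hypothetical placeholders for the frozen T2V model, the MLLM evaluator, the LLM planner, the Molmo + Semantic-SAM grounding module, and the masked regeneration step; they are not a released API.

# Minimal sketch of one VideoRepair pass, following the four stages above.
# Each callable argument is a hypothetical stand-in for a pipeline component.
def video_repair(prompt, generate_video, make_questions, answer_with_mllm,
                 plan_refinement, segment_keep_region, refine_locally):
    video = generate_video(prompt)                      # initial T2V generation
    # (1) Video evaluation: fine-grained questions Qo answered by the MLLM.
    answers = answer_with_mllm(video, make_questions(prompt))
    # (2) Refinement planning: keep the correctly generated objects O*,
    #     and build localized prompts for the misaligned regions.
    keep_objects, local_prompts = plan_refinement(prompt, answers)
    # (3) Region decomposition: mask of the correctly generated region to keep.
    keep_mask = segment_keep_region(video, keep_objects)
    # (4) Localized refinement: regenerate the misaligned regions with the
    #     frozen T2V model while preserving the masked region.
    return refine_locally(video, keep_mask, local_prompts)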

Qualitative Results

1. Comparison with Baselines (T2V-turbo)

"1 bear and 2 people making pizza"
GIF 4
+ Ours (VideoRepair)
"five aliens in a forest"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)
"A camel lounging in front of a snowman"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)
"A blue car parked next to a red fire hydrant on the street"
GIF 4
T2V-turbo
GIF 5
OPT2I
GIF 6
Vico
GIF 7
Ours (VideoRepair)
"Yellow rose swaying near a green bench"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)

2. Comparison with Baselines (VideoCrafter2)

"A dog sitting under a umbrella on a sunny beach"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"With the style of pointilism, A green apple and a black backpack."
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"three foxes in a snowy forest"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"A basket placed below a television"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"A penguin standing on the right side of a cactus in a desert"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)

3. Iterative Refinement

"A mother and her child feed ducks at a pond."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A family of four set up a tent and build a campfire, enjoying a night of camping under the stars"
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A bright yellow umbrella with a wooden handle. It's compact and easy to carry."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A group of six dancers perform a ballet on stage, their movements synchronized and graceful."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
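Because VideoRepair is training-free and model-agnostic, the same refinement pass can be applied repeatedly to the previous output, as in the Iter 1 / Iter 2 examples above. Below is a small illustrative sketch of such a loop under that assumption; generate_video, repair_pass, and is_aligned are hypothetical caller-supplied callables (e.g., the base T2V model, one pass of the pipeline sketched earlier, and a re-run of the MLLM evaluation questions).

# Illustrative sketch of iterative refinement: one VideoRepair pass is applied
# repeatedly, re-checking text-video alignment on each output.
def iterative_repair(prompt, generate_video, repair_pass, is_aligned, max_iters=2):
    video = generate_video(prompt)             # initial video
    for _ in range(max_iters):
        if is_aligned(prompt, video):          # stop once evaluation finds no misalignment
            break
        video = repair_pass(prompt, video)     # Iter 1, Iter 2, ...
    return video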

4. More Examples

Quantitative Results

1. EvalCrafter

VideoRepair surpasses all baselines on the text-video alignment metric (evaluated using CLIP, BLIP2, and SAM-Track) across all four splits by a significant margin, achieving relative improvements of +2.87% over the initial VideoCrafter2 generations and +11.09% over the initial T2V-turbo generations.

2. T2V-CompBench

We observe that VideoRepair improves the initial videos from both T2V models (VideoCrafter2 and T2V-turbo) across all three splits, achieving relative improvements of +8.76% and +5.40%, respectively.