VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

University of North Carolina at Chapel Hill

Abstract

Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of two stages: In (1) video refinement planning, we first detect misalignments by generating fine-grained evaluation questions and answering them using an MLLM. Based on video evaluation outputs, we identify accurately generated objects and construct localized prompts to precisely refine misaligned regions. In (2) localized refinement, we enhance video alignment by “repairing” the misaligned regions from the original video while preserving the correctly generated areas. This is achieved by frame-wise region decomposition using our Region-Preserving Segmentation (RPS) module. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.

Method

[Figure: VideoRepair pipeline]

VideoRepair refines a generated video in two stages: (1) video refinement planning (Sec. 3.1) and (2) localized refinement (Sec. 3.2). Given the prompt p, we first generate a set of fine-grained evaluation questions and ask the MLLM to answer them. Next, we identify the accurately generated objects O* and use the MLLM/LLM to plan the refinement prompt p_r for the remaining regions. Based on O*, the RPS module determines which regions to preserve and which to refine. Finally, we apply localized refinement with the original T2V model.
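
The planning step and the preserve/refine split described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Answer` record, the functions `plan_refinement` and `preserve_mask`, and the bounding-box representation of objects are all hypothetical. In the actual framework the answers come from an MLLM and the regions come from the RPS module, neither of which is modeled here.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """One evaluation question's outcome (hypothetical format)."""
    obj: str       # object the question is about
    correct: bool  # True if the video was judged aligned for this question


def plan_refinement(answers):
    """Split objects into those to preserve (O*) and those to refine.

    Assumption: an object is preserved only if every question about it passed.
    """
    failed = {a.obj for a in answers if not a.correct}
    preserve = []
    for a in answers:
        if a.obj not in failed and a.obj not in preserve:
            preserve.append(a.obj)
    return preserve, sorted(failed)


def preserve_mask(height, width, boxes):
    """Binary mask for one frame: 1 inside any preserved box, 0 elsewhere.

    Boxes are (top, left, bottom, right) with exclusive bottom/right edges.
    Pixels with 0 are the region left for the T2V model to regenerate.
    """
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                mask[y][x] = 1
    return mask


# Illustrative inputs only; these are not outputs of any real model.
answers = [Answer("bear", True), Answer("people", False), Answer("pizza", True)]
preserve, refine = plan_refinement(answers)
print(preserve, refine)
print(preserve_mask(2, 4, [(0, 0, 2, 2)]))
```

A real system would produce one mask per frame from segmentation rather than from fixed boxes; the box form is used here only to keep the example self-contained.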

Qualitative Results

1. Comparison with Baselines (T2V-turbo)

"1 bear and 2 people making pizza"
[Video: + Ours (VideoRepair)]
"five aliens in a forest"
[Videos: T2V-turbo | + OPT2I | Vico | + Ours (VideoRepair)]
"A camel lounging in front of a snowman"
[Videos: T2V-turbo | + OPT2I | Vico | + Ours (VideoRepair)]
"A blue car parked next to a red fire hydrant on the street"
[Videos: T2V-turbo | + OPT2I | Vico | + Ours (VideoRepair)]
"Yellow rose swaying near a green bench"
[Videos: T2V-turbo | + OPT2I | Vico | + Ours (VideoRepair)]

2. Comparison with Baselines (VideoCrafter2)

"A dog sitting under a umbrella on a sunny beach"
[Videos: VideoCrafter2 | + OPT2I | Vico | + Ours (VideoRepair)]
"With the style of pointilism, A green apple and a black backpack."
[Videos: VideoCrafter2 | + OPT2I | Vico | + Ours (VideoRepair)]
"three foxes in a snowy forest"
[Videos: VideoCrafter2 | + OPT2I | Vico | + Ours (VideoRepair)]
"A basket placed below a television"
[Videos: VideoCrafter2 | + OPT2I | Vico | + Ours (VideoRepair)]
"A penguin standing on the right side of a cactus in a desert"
[Videos: VideoCrafter2 | + OPT2I | Vico | + Ours (VideoRepair)]

3. Iterative Refinement

"A mother and her child feed ducks at a pond."
[Videos: Initial video (T2V-turbo) | Iter 1 | Iter 2]
"A family of four set up a tent and build a campfire, enjoying a night of camping under the stars"
[Videos: Initial video (T2V-turbo) | Iter 1 | Iter 2]
"A bright yellow umbrella with a wooden handle. It's compact and easy to carry."
[Videos: Initial video (T2V-turbo) | Iter 1 | Iter 2]
"A group of six dancers perform a ballet on stage, their movements synchronized and graceful."
[Videos: Initial video (T2V-turbo) | Iter 1 | Iter 2]

4. More examples

Quantitative Results

1. EvalCrafter

VideoRepair consistently outperforms all baselines on text-video alignment across all four evaluation splits, with relative gains of +6.22%, +7.65%, and +3.11% over VideoCrafter2, T2V-turbo, and CogVideoX-5B, respectively. Quality and consistency scores are highlighted in red when they deteriorate by more than 1% from the original model's performance.
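
For readers reproducing these figures, the sketch below shows one plausible way to compute a relative gain and to apply the 1% deterioration rule. The page does not spell out the formula, so the definition (new - base) / base, the reading of "more than 1%" as a relative drop, and the function names are assumptions. The scores in the example are made up and are not values from the paper.

```python
def relative_change_pct(base, new):
    """Relative change of `new` over `base`, in percent (assumed definition)."""
    if base == 0:
        raise ValueError("base score must be non-zero")
    return (new - base) / base * 100.0


def deteriorated(base, new, threshold_pct=1.0):
    """True if the score dropped by more than `threshold_pct` relative to base."""
    return relative_change_pct(base, new) < -threshold_pct


# Made-up scores for illustration only.
print(round(relative_change_pct(50.0, 53.0), 2))  # 6.0
print(deteriorated(60.0, 59.7))  # False: a drop of about 0.5%
print(deteriorated(60.0, 59.0))  # True: a drop of about 1.67%
```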

2. T2V-CompBench

We observe that VideoRepair improves the initial videos from all T2V models (VideoCrafter2, T2V-turbo, and CogVideoX-5B) in all three splits, with relative improvements of +8.16%, +8.69%, and +15.87%, respectively.