Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering them with a multimodal large language model (MLLM). In (2) refinement planning, we identify the accurately generated objects and create localized prompts for refining the remaining regions of the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. Finally, in (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct ones. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
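To make the pipeline concrete, the following is a minimal Python sketch of the four stages. It is a hypothetical illustration: every object and method name (t2v_model.generate, llm.generate_eval_questions, mllm.answer, llm.plan_refinement, grounder.segment, t2v_model.regenerate) is an assumed stand-in for the components described above, not an API from the paper.

```python
def video_repair(prompt, t2v_model, mllm, llm, grounder):
    """Evaluation-guided localized video refinement (hypothetical sketch)."""
    video = t2v_model.generate(prompt)

    # (1) Video evaluation: generate fine-grained questions about each
    # object/attribute in the prompt and have the MLLM answer them.
    questions = llm.generate_eval_questions(prompt)
    answers = mllm.answer(video, questions)

    # (2) Refinement planning: identify the correctly generated objects O*
    # and write localized prompts for the misaligned regions.
    correct_objects, local_prompts = llm.plan_refinement(prompt, answers)

    # (3) Region decomposition: segment the area to preserve
    # (e.g., point with Molmo, then segment with Semantic SAM).
    keep_mask = grounder.segment(video, correct_objects)

    # (4) Localized refinement: regenerate only outside keep_mask,
    # reusing the same frozen T2V model.
    return t2v_model.regenerate(local_prompts, keep_mask, video)
```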
VideoRepair refines the generated video in four stages: (1) video evaluation, (2) refinement planning, (3) region decomposition, and (4) localized refinement. From the initial prompt p, we first generate a fine-grained evaluation question set Qo and ask an MLLM to answer it. Next, we identify the correctly generated objects O* with the MLLM and plan, via an LLM, how to refine the remaining regions. Using O*, we decompose the video with Molmo and Semantic SAM to obtain the region to keep. Finally, we perform localized refinement with the original, frozen T2V model.
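One plausible way to realize stage (4) with a frozen diffusion model is RePaint-style latent blending: at each denoising step, the latent inside the keep-region mask is overwritten with a re-noised copy of the initial video, so only the misaligned region is regenerated. The sketch below assumes a diffusers-style scheduler interface and a generic denoiser callable; it illustrates the general technique, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def localized_refinement(denoiser, scheduler, init_latent, keep_mask, prompt_emb):
    """RePaint-style masked regeneration (a sketch under assumed interfaces).

    init_latent: clean latent of the initial video, e.g. shape (B, C, T, H, W)
    keep_mask:   1 where content is correct and must be preserved, 0 elsewhere
    prompt_emb:  embedding of the localized refinement prompt
    """
    latent = torch.randn_like(init_latent)  # misaligned region starts from noise
    for t in scheduler.timesteps:
        # Denoise one step, conditioned on the localized prompt.
        noise_pred = denoiser(latent, t, prompt_emb)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
        # Re-noise the original video to the current noise level and paste it
        # into the keep region, so the correct content survives unchanged.
        noised_keep = scheduler.add_noise(init_latent, torch.randn_like(init_latent), t)
        latent = keep_mask * noised_keep + (1 - keep_mask) * latent
    return latent
```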
VideoRepair surpasses all baselines on text-video alignment metrics (evaluated using CLIP, BLIP2, and SAM-Track) across all four splits by a significant margin, achieving relative improvements of +2.87% and +11.09% over the initial video generations of VideoCrafter2 and T2V-turbo, respectively.
We observe that VideoRepair improves the initial videos of both T2V models (VideoCrafter2 and T2V-turbo) across all three splits, achieving relative improvements of +8.76% and +5.40%, respectively.