Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of two stages. In (1) video refinement planning, we first detect misalignments by generating fine-grained evaluation questions and answering them with an MLLM. Based on the evaluation outputs, we identify the accurately generated objects and construct localized prompts to precisely refine the misaligned regions. In (2) localized refinement, we enhance text-video alignment by “repairing” the misaligned regions of the original video while preserving the correctly generated areas, achieved via frame-wise region decomposition with our Region-Preserving Segmentation (RPS) module. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
VideoRepair refines the generated video in two stages: (1) video refinement planning (Sec. 3.1) and (2) localized refinement (Sec. 3.2). Given the prompt p, we first generate a fine-grained evaluation question set and query an MLLM for answers. Next, we identify the accurately generated objects O* and plan a refinement prompt p_r for the remaining regions using an MLLM/LLM. Based on O*, we determine which regions to preserve or refine using the RPS module. Finally, we apply localized refinement with the original T2V model, as sketched below.
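To make the control flow concrete, the following is a minimal sketch of this two-stage loop in Python. Every name here (gen_questions, make_plan, rps_masks, inpaint, and so on) is a hypothetical placeholder introduced for illustration rather than a released API; the MLLM evaluator, LLM planner, RPS segmenter, and T2V backbone are injected as callables, mirroring the model-agnostic, training-free design.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

Frame = Any  # e.g., an H x W x 3 image array
Video = List[Frame]
Mask = Any   # per-frame binary mask of regions to preserve


@dataclass
class RefinementPlan:
    preserved_objects: List[str]  # O*: objects judged correctly generated
    refine_prompt: str            # p_r: localized prompt for misaligned regions


def video_repair(
    prompt: str,
    t2v: Callable[[str], Video],                          # any T2V diffusion model
    gen_questions: Callable[[str], List[str]],            # fine-grained questions from prompt
    mllm_qa: Callable[[Video, List[str]], List[str]],     # MLLM answers on the video
    make_plan: Callable[[str, List[str], List[str]], RefinementPlan],  # MLLM/LLM planner
    rps_masks: Callable[[Video, List[str]], List[Mask]],  # RPS: frame-wise region decomposition
    inpaint: Callable[[Video, List[Mask], str], Video],   # localized re-generation
) -> Video:
    # Initial generation with the original T2V model.
    video = t2v(prompt)

    # Stage 1: video refinement planning.
    questions = gen_questions(prompt)             # evaluation question set
    answers = mllm_qa(video, questions)           # MLLM-based fine-grained evaluation
    plan = make_plan(prompt, questions, answers)  # derive O* and p_r

    # If nothing is misaligned, return the original video unchanged.
    if not plan.refine_prompt:
        return video

    # Stage 2: localized refinement.
    masks = rps_masks(video, plan.preserved_objects)  # regions to keep, per frame
    # Re-generate only the misaligned regions, conditioned on p_r, while the
    # masked areas of the original video are preserved during denoising.
    return inpaint(video, masks, plan.refine_prompt)
```

Injecting the components as callables is what keeps the sketch model-agnostic: swapping VideoCrafter2 for CogVideoX-5B only changes the t2v and inpaint arguments, with no retraining of any module.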
VideoRepair consistently outperforms all baselines across all four evaluation splits in text-video alignment, achieving relative gains of +6.22%, +7.65%, and +3.11% over the initial videos from VideoCrafter2, T2V-turbo, and CogVideoX-5B, respectively. In the results table, quality and consistency scores are highlighted in red when they deteriorate by more than 1% from the original performance.
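The relative gains reported here and below are not derived explicitly in this excerpt; we assume the standard definition, where s_orig is the alignment score of the initial T2V output and s_repair is the score after refinement:

```latex
\Delta_{\mathrm{rel}} = \frac{s_{\mathrm{repair}} - s_{\mathrm{orig}}}{s_{\mathrm{orig}}} \times 100\%
```

Under this reading, a +6.22% gain over VideoCrafter2 means the refined alignment score is 1.0622 times the initial one.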
We observe that VideoRepair improves upon the initial videos from all three T2V models (VideoCrafter2, T2V-turbo, and CogVideoX-5B) across all three splits, with relative improvements of +8.16%, +8.69%, and +15.87%, respectively.