Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

1UNC Chapel Hill   2NTU Singapore   3Allen Institute for AI   4Johns Hopkins University
ACL 2026 Findings 🔥
Paper | Code

Abstract

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair adopts a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which selects correctly generated entities to preserve, segments their regions across frames, and constructs targeted prompts for the misaligned areas; and (iii) localized refinement, which selectively regenerates the misaligned regions while jointly optimizing preserved and newly generated areas so that faithful content is kept intact. On two benchmarks, EvalCrafter and T2V-CompBench, with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.

Method

Figure: Overview of the VideoRepair pipeline.

VideoRepair refines the generated video in two stages: (1) video refinement planning (Sec. 3.1) and (2) localized refinement (Sec. 3.2). Given the prompt p, we first generate a fine-grained evaluation question set and ask the MLLM to answer it on the generated video. Next, we identify the accurately generated objects O* and use the MLLM/LLM to plan a refinement prompt p_r for the remaining regions. Based on O*, the RPS module determines which regions to preserve and which to refine. Finally, we apply localized refinement with the original T2V model.
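For intuition, the loop below is a minimal Python sketch of this pipeline under our own naming assumptions; `t2v_model`, `mllm`, `segment_regions`, and the `Answer` record are hypothetical stand-ins for the components described above, not a released API.

```python
from dataclasses import dataclass

# Structural sketch of the VideoRepair loop. All interfaces here
# (t2v_model, mllm, segment_regions) are hypothetical placeholders
# for the components described above, not an actual released API.

@dataclass
class Answer:
    entity: str       # object/attribute that the evaluation question targets
    is_correct: bool  # MLLM judged this region as aligned with the prompt

def video_repair(prompt, t2v_model, mllm, segment_regions, max_iters=2):
    video = t2v_model.generate(prompt)
    for _ in range(max_iters):
        # (1) Misalignment detection: answer auto-generated fine-grained
        # evaluation questions with the MLLM over the generated video.
        answers: list[Answer] = mllm.answer(video, mllm.generate_questions(prompt))
        if all(a.is_correct for a in answers):
            return video  # already aligned; nothing to regenerate

        # (2) Refinement planning: keep correctly generated objects O*,
        # build a targeted refinement prompt p_r for the misaligned regions,
        # and segment the preserved regions across frames (RPS).
        o_star = [a.entity for a in answers if a.is_correct]
        p_r = mllm.plan_refinement(prompt, answers)
        keep_masks = segment_regions(video, o_star)  # per-frame masks

        # (3) Localized refinement: regenerate only the unmasked regions,
        # jointly optimizing preserved and newly generated areas.
        video = t2v_model.regenerate(p_r, init_video=video, keep=keep_masks)
    return video
```

The loop can be run for a fixed number of iterations or until the MLLM evaluation reports no remaining misalignments (see Iterative Refinement below).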

Qualitative Results

1. Comparison with Baselines (T2V-turbo)

"1 bear and 2 people making pizza"
GIF 4
+ Ours (VideoRepair)
"five aliens in a forest"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)
"A camel lounging in front of a snowman"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)
"A blue car parked next to a red fire hydrant on the street"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)
"Yellow rose swaying near a green bench"
GIF 4
T2V-turbo
GIF 5
+ OPT2I
GIF 6
Vico
GIF 7
+ Ours (VideoRepair)

2. Comparison with Baselines (VideoCrafter2)

"A dog sitting under a umbrella on a sunny beach"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"With the style of pointilism, A green apple and a black backpack."
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"three foxes in a snowy forest"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"A basket placed below a television"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)
"A penguin standing on the right side of a cactus in a desert"
GIF 1
VideoCrafter2
GIF 2
+ OPT2I
GIF 3
Vico
GIF 4
+ Ours (VideoRepair)

3. Iterative Refinement

"A mother and her child feed ducks at a pond."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A family of four set up a tent and build a campfire, enjoying a night of camping under the stars"
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A bright yellow umbrella with a wooden handle. It's compact and easy to carry."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2
"A group of six dancers perform a ballet on stage, their movements synchronized and graceful."
GIF 1
Initial video (T2V-turbo)
GIF 2
Iter 1
GIF 4
Iter 2

Quantitative Results

1. EvalCrafter

VideoRepair consistently outperforms all baselines across all four evaluation splits in text-video alignment, achieving relative gains of +6.22%, +7.65%, and +3.11% over VideoCrafter2, T2V-turbo, and CogVideoX-5B, respectively. In the results table, we highlight quality and consistency scores in red when they deteriorate by more than 1% from the original model's performance.
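For clarity, the relative gains above follow the standard formula (refined − original) / original; a minimal sketch with hypothetical scores:

```python
def relative_gain(original: float, refined: float) -> float:
    """Relative improvement in percent: 100 * (refined - original) / original."""
    return 100.0 * (refined - original) / original

# Hypothetical example: an alignment score improving from 64.3 to 69.2
print(f"{relative_gain(64.3, 69.2):+.2f}%")  # -> +7.62%
```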

2. T2V-CompBench

We observe that VideoRepair improves the initial videos from all T2V models (VideoCrafter2, T2V-turbo, and CogVideoX-5B) in all three splits, with relative improvements of +8.16%, +8.69%, and +15.87%, respectively.

Ablations

(Left) Ablations of VideoRepair components across evaluation question type, refinement planning strategy, and ranking metric. Our proposed configuration achieves the best overall alignment performance. (Right) Robustness analysis under MLLM substitution: replacing internal MLLM components with GPT-4o, Qwen2.5-VL-7B, Gemini-2.5-Flash, and human annotations yields minimal performance variation, demonstrating the framework's modularity and flexibility.

Error Analysis

(Left) Categorized breakdown of failure modes (modes can co-occur, so percentages sum to more than 100%): mask drift (58.8%) is the most frequent failure, followed by identity inconsistency (47.1%), QA hallucination (35.3%), boundary artifacts (23.5%), and planning errors (11.8%). (Right) Error propagation analysis examining the conditional probability of a downstream error given an upstream failure. Mask drift shows the strongest cascading effect, causing identity inconsistencies with 43.5% probability, while QA hallucinations propagate weakly (9.8–21.3%).
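As a sketch of how such conditional probabilities can be tabulated from per-video failure annotations (the annotations below are hypothetical toy data, not the paper's labels):

```python
# Hypothetical per-video annotations: the set of failure modes
# observed in each evaluated video (multiple modes can co-occur).
videos = [
    {"mask_drift", "identity_inconsistency"},
    {"mask_drift"},
    {"qa_hallucination"},
    {"mask_drift", "boundary_artifacts"},
    set(),  # no failure observed
]

def p_downstream_given_upstream(upstream: str, downstream: str) -> float:
    """Estimate P(downstream error | upstream error) over annotated videos."""
    with_upstream = [v for v in videos if upstream in v]
    if not with_upstream:
        return float("nan")
    return sum(downstream in v for v in with_upstream) / len(with_upstream)

print(p_downstream_given_upstream("mask_drift", "identity_inconsistency"))
# -> 0.333... with these toy annotations
```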