ReadPaper Blog
GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
The paper addresses a central bottleneck in real-world image restoration: the lack of scalable paired low-quality and high-quality training data that reflects complex, mixed real degradations. It proposes Generative Ground Truth, a dataset-construction paradigm that uses multimodal foundation models to synthesize high-quality restoration targets from real-world low-quality images, then filters them through quality-control stages. The resulting GGT-100K dataset contains 103,707 training pairs and a 500-pair test set, and the paper reports consistent generalization gains across multiple restoration model families.
Source: GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

The Data Problem
The paper motivates GGT-100K from the observation that real-world image restoration is constrained less by model architecture than by supervision quality. Classical restoration tasks such as denoising, deblurring, super-resolution, and dehazing often assume predefined degradations, while real-world restoration must handle mixed, unknown, and scene-dependent corruption. The authors argue that paired low-quality and high-quality examples are especially hard to obtain when degradation arises from weather, low light, motion, blur, noise, compression, or old-photo artifacts. This scarcity leaves CNN-based, Transformer-based, all-in-one, and generative restoration systems with limited ability to generalize outside their training distributions. The paper therefore frames the core research question as whether high-quality restoration targets can be generated reliably enough to serve as supervised training data.

Why Existing Data Fails
The paper contrasts two dominant routes for building paired restoration data and explains why both remain insufficient for broad real-world generalization. Synthetic datasets are scalable because low-quality images can be created from high-quality sources using hand-designed degradation models, but the authors emphasize that such simulations do not fully capture the real image formation process. Physically collected real-world pairs provide more realistic supervision, yet they are expensive, difficult to align, and limited in scene diversity because the same scene must be captured with and without degradation under changing conditions. The paper notes that rain, haze, illumination changes, and motion make reference capture especially difficult, and even recent real-world datasets remain constrained in degradation coverage. This analysis motivates a third route that starts from diverse real-world low-quality images without requiring an already captured high-quality counterpart.
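
To make the synthetic route concrete, here is a minimal sketch of a hand-designed degradation model: blur, downsample, then add noise. The specific operations, parameter values, and function names (`box_blur`, `degrade`) are illustrative assumptions and are not taken from the paper. The point is only that every step is fixed in advance by the designer, which is why such pipelines scale easily but cannot reproduce everything that real image formation does.

```python
import random

# Grayscale image as nested lists, values in [0, 1] (illustrative representation).
Image = list[list[float]]


def box_blur(img: Image, k: int = 3) -> Image:
    """Blur with a k x k box filter (k must be odd), clamping at the borders."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / (k * k))
        out.append(row)
    return out


def degrade(hq: Image, scale: int = 2, noise_sigma: float = 0.02, seed: int = 0) -> Image:
    """Hypothetical degradation: blur -> subsample -> additive Gaussian noise."""
    rng = random.Random(seed)
    blurred = box_blur(hq)
    low_res = [row[::scale] for row in blurred[::scale]]
    return [
        [min(1.0, max(0.0, v + rng.gauss(0.0, noise_sigma))) for v in row]
        for row in low_res
    ]


# Build a synthetic (low-quality, high-quality) pair from a clean gradient image.
hq = [[(x + y) / 30.0 for x in range(16)] for y in range(16)]
lq = degrade(hq)
print(len(lq), len(lq[0]))
```

Real degradations such as rain or haze do not reduce to a fixed chain like this, which is the gap the paper attributes to synthetic datasets.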

The Weird New Idea
The central idea of the paper is Generative Ground Truth, or GGT, in which generative multimodal foundation models produce high-quality targets directly from real-world low-quality inputs. The authors treat modern instruction-following image models as possible restoration-target generators because they can condition on both an image and a textual prompt. The paper also stresses that restoration is stricter than ordinary image editing, since the generated target must improve perceptual quality while preserving scene structure and content fidelity. This requirement makes hallucination, structural distortion, color shift, and prompt instability major risks rather than minor aesthetic flaws. GGT is therefore presented not simply as image generation, but as a supervised-data construction paradigm that must balance perceptual realism with faithful correspondence to the low-quality input.

How They Built It
To determine whether multimodal foundation models can support this role, the paper systematically evaluates nine state-of-the-art models, including Nano-Banana-2 and GPT-Image-2, across different scenes and degradation types. The evaluation compares fixed prompts with VLM-based adaptive prompting, reflecting the authors’ claim that prompt design affects both restoration strength and content preservation. Rather than relying on a single metric, the selection criteria combine image content fidelity, perceptual quality, VLM-based assessment, and human preference. The paper reports that Nano-Banana-2 combined with VLM-based adaptive prompting provides the most reliable high-quality targets among the tested settings. Based on this model-selection study, the authors build a generation pipeline that uses Nano-Banana-2 to synthesize candidate targets. The pipeline then applies automatic metric-based filtering, VLM-assisted screening, manual verification, and iterative refinement to remove low-gain or unfaithful samples.
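
The staged quality control described above can be sketched as a chain of filters, each of which removes candidates the previous stage let through. This is only a rough illustration under stated assumptions. The `Candidate` fields, the stage names, and the thresholds (0.8 and 0.1) are all hypothetical, and the VLM and human verdicts are represented as precomputed booleans rather than real model or annotator calls. Iterative refinement is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    pair_id: str
    fidelity: float       # hypothetical content-fidelity score in [0, 1]
    quality_gain: float   # hypothetical perceptual gain of target over input
    vlm_accepts: bool     # stand-in for a VLM screening verdict
    human_accepts: bool   # stand-in for a manual verification verdict


Stage = tuple[str, Callable[[Candidate], bool]]


def run_filters(candidates: list[Candidate], stages: list[Stage]):
    """Apply stages in order; return survivors and per-stage rejection counts."""
    survivors = list(candidates)
    rejected: dict[str, int] = {}
    for name, keep in stages:
        kept = [c for c in survivors if keep(c)]
        rejected[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, rejected


# Thresholds are illustrative assumptions, not values reported in the paper.
STAGES: list[Stage] = [
    ("metric_filter", lambda c: c.fidelity >= 0.8 and c.quality_gain >= 0.1),
    ("vlm_screening", lambda c: c.vlm_accepts),
    ("manual_check", lambda c: c.human_accepts),
]

candidates = [
    Candidate("a", 0.95, 0.30, True, True),    # passes every stage
    Candidate("b", 0.60, 0.40, True, True),    # unfaithful: low fidelity
    Candidate("c", 0.90, 0.02, True, True),    # low gain: barely improved
    Candidate("d", 0.90, 0.25, False, True),   # rejected by VLM screening
    Candidate("e", 0.85, 0.20, True, False),   # rejected by manual check
]

survivors, rejected = run_filters(candidates, STAGES)
print([c.pair_id for c in survivors])  # ['a']
print(rejected)  # {'metric_filter': 2, 'vlm_screening': 1, 'manual_check': 1}
```

Ordering the cheap automatic check first means the more expensive VLM and human stages only see candidates that already pass the basic fidelity and gain thresholds.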

The Result
The resulting GGT-100K dataset contains 103,707 low-quality and high-quality training pairs at 1024 × 1024 resolution, along with a carefully established 500-pair test set. Its source images come from existing datasets, Internet images, and the authors’ own captures. The degradation coverage includes general mixed degradation, rain, haze, snow, low-light conditions, and old photos, with the authors emphasizing that these categories often contain coupled artifacts such as blur, noise, and compression rather than isolated single degradations. The experimental section retrains or fine-tunes representative CNN-based models, Transformer-based models, all-in-one restoration methods, and generative restoration models with and without GGT-100K. The paper reports consistent improvements in real-world generalization, with especially strong benefits for generative models adapted to restoration, supporting the claim that multimodal foundation models can be practical tools for restoration-oriented data generation.
