ReadPaper Blog
Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
The paper introduces DIRECT, a framework for pose-controllable object insertion that composites a reference object into a target image while obeying a user-specified 6-DoF pose. It addresses the limitation of diffusion-based 2D inpainting methods, which can preserve appearance but lack explicit 3D geometric control, by converting a single reference image into a rendered 3D visual proxy and injecting appearance, geometry, and context through separate pathways. The result matters because it connects interactive 3D manipulation with high-fidelity 2D image synthesis, enabling more reliable object placement in realistic scenes.
Source: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Pose Control Hits a Wall
The paper frames object insertion as a problem that has moved beyond simple visual plausibility toward precise spatial control. Existing reference-guided generation methods based on backbones such as Stable Diffusion and FLUX can harmonize identity and environment, but they usually treat insertion as 2D inpainting conditioned on a mask, a background, and a reference image. The authors argue that this formulation cannot reliably satisfy a user-specified 3D pose because natural-language instructions are spatially ambiguous and low-dimensional pose parameters are difficult to translate into dense image deformations. They define the harder task as pose-controllable object insertion, where the output must preserve the reference object while conforming to a 6-DoF pose ξ in the target scene. This formulation highlights a practical gap in methods such as Nano Banana Pro, Object3DIT, AnyDoor, SEELE, IMPRINT, and InsertAnything: strong appearance transfer does not by itself guarantee geometrically correct placement.

The DIRECT Solution
DIRECT addresses this gap by turning the desired 6-DoF manipulation into a dense visual condition rather than relying on text or sparse rotation controls. The method first uses recent feed-forward image-to-3D generation, drawing on advances such as TRELLIS-style 3D priors, to lift the single reference object image into a coarse 3D representation. It then renders this proxy under the user-adjusted pose, producing an RGB geometric condition image that encodes the target orientation and projection at pixel level. The paper emphasizes that RGB proxy renderings can preserve semantic pose information that depth or normal maps may lose, especially for symmetric objects whose orientation is ambiguous under standard spatial signals. By giving the diffusion generator an explicit rendered proxy, DIRECT bridges rigid 3D interaction and flexible 2D synthesis without requiring a high-quality manually built 3D asset.
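To make the idea of turning a 6-DoF pose into a pixel-level condition concrete, here is a minimal sketch, not the paper's renderer. It builds a rigid transform from three rotation angles and a translation, then projects proxy points through a pinhole camera. The function names, the Euler-angle convention, and the camera parameters (`focal`, `cx`, `cy`) are all assumptions chosen for illustration; a real pipeline would rasterize the textured proxy rather than project bare points.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians.
    The z-y-x convention is an illustrative assumption."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def pose_matrix(yaw, pitch, roll, translation):
    """4x4 rigid transform: 3 rotational + 3 translational degrees of freedom."""
    T = np.eye(4)
    T[:3, :3] = euler_to_rotation(yaw, pitch, roll)
    T[:3, 3] = translation
    return T

def project_points(points, pose, focal, cx, cy):
    """Map (N, 3) object-space points to (N, 2) pixel coordinates.
    Assumes a pinhole camera at the origin looking down +z."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (pose @ homo.T).T[:, :3]           # points in camera space
    u = focal * cam[:, 0] / cam[:, 2] + cx   # perspective divide
    v = focal * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)

# Hypothetical usage: rotate a proxy 90 degrees about z and push it 2 units away.
pose = pose_matrix(np.pi / 2, 0.0, 0.0, [0.0, 0.0, 2.0])
pixels = project_points(np.array([[1.0, 0.0, 0.0]]), pose, 100.0, 64.0, 64.0)
```

The point of the sketch is that six scalars fully determine where every proxy pixel lands, which is what lets a rendered image act as a dense stand-in for a sparse pose parameter.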

Three Signals, Three Paths
A central technical claim of the paper is that a rendered 3D proxy is useful for geometry but insufficient as the sole source of conditioning because current image-to-3D models often degrade texture and introduce artifacts. DIRECT therefore decomposes the insertion signal into appearance guidance from the original reference image, geometry guidance from the posed 3D proxy rendering, and context guidance from the target background. These three components are injected through independent pathways, including decomposed LoRA-style adapters and separate positional handling, so the model can reduce feature entanglement between identity, pose, and scene adaptation. The framework also modifies the masked background condition by pasting the rendered object in the desired pose into the target region, giving the generator a spatially aligned local cue. This design lets the method use the proxy for geometric obedience while drawing high-fidelity texture from the reference and lighting or environmental compatibility from the background.
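The modified background condition described above can be sketched as a simple compositing step. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function name `build_context_condition` is hypothetical, the proxy render is assumed to be already aligned to the target image frame, and the insertion region is assumed to be blanked to zero before the proxy is pasted in.

```python
import numpy as np

def build_context_condition(background, proxy_rgb, proxy_alpha, insert_mask):
    """Blank the insertion region of the background, then paste the posed proxy.

    background:  (H, W, 3) float image in [0, 1]
    proxy_rgb:   (H, W, 3) rendered proxy, aligned to the image frame (assumption)
    proxy_alpha: (H, W) render coverage, 1 where the proxy is visible
    insert_mask: (H, W) 1 inside the region the generator should fill
    """
    mask3 = insert_mask[..., None]
    alpha3 = proxy_alpha[..., None]
    masked_bg = background * (1.0 - mask3)  # zero out the target region
    paste = mask3 * alpha3                  # proxy pixels that fall inside the region
    return masked_bg + proxy_rgb * paste

# Hypothetical 4x4 example: grey background, white proxy covering one pixel.
bg = np.full((4, 4, 3), 0.5)
proxy = np.ones((4, 4, 3))
alpha = np.zeros((4, 4))
alpha[1, 1] = 1.0
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
cond = build_context_condition(bg, proxy, alpha, mask)
```

Pixels outside the mask keep the original scene, pixels inside the mask but outside the render stay blank, and only the covered pixels carry the proxy, so the generator sees the desired pose exactly where the object should appear.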

Training the Hero Squad
The paper also argues that training data is a bottleneck for pose-controllable insertion because object-centric 3D datasets often contain simplified settings and limited real-world scene diversity. To address this, the authors introduce an automated data construction pipeline that builds paired training examples from single-view, in-the-wild images. The pipeline uses an agent powered by a vision-language model (VLM) to filter high-quality object instances and a generative editing model to synthesize novel views, creating supervision that links a reference object, a posed proxy, and a target insertion context. The resulting hybrid dataset combines synthesized samples from SA-1B with a high-quality subset of MVImgNet and contains over 160k pairs. This data strategy is intended to improve generalization to complex backgrounds, richer object categories, and realistic object-scene interactions that are hard to capture with conventional 3D datasets alone.

Victory in the City Scene
In evaluation, the paper reports that DIRECT outperforms prior methods in both geometric controllability and visual quality on the test split of the curated hybrid dataset. The authors compare against approaches that rely on strong 2D generation or 3D-aware editing and find that DIRECT better preserves identity while following complex pose transformations. The reported results emphasize robustness to artifacts in the upstream 3D prior, which is important because the method deliberately uses coarse generated proxies rather than assuming perfect 3D assets. The framework’s separation of appearance, geometry, and context helps avoid the common failure modes of geometric distortion, texture degradation, and poor scene alignment. The broader implication is that explicit visual 3D proxies can make object insertion more controllable while retaining the photorealistic strengths of diffusion-based image synthesis.
