ReadPaper Blog
CRAFTER: A Multi-Agent Harness for Editable Scientific Figure Generation
The paper introduces CRAFTER, a multi-agent harness for generating publication-quality scientific figures from diverse inputs such as text, papers, sketches, reference images, and partial layouts. It addresses two practical gaps in prior systems: generation that is limited to narrow, text-only inputs, and raster outputs that cannot be edited locally. CRAFTER tackles the first with structured agentic planning, verification, and revision; the companion system CRAFTEDITOR tackles the second with raster-to-SVG conversion.
Source: CRAFTER: A Multi-Agent Harness for Editable Scientific Figure Generation

Why scientific figures are hard
CRAFTER is motivated by the observation that scientific figures are unusually hard for automated image systems because they are not merely pictures, but structured compositions of semantic components. The paper emphasizes that labels, boxes, arrows, icons, annotations, and spatial relationships each carry scientific meaning, so small local failures can make an otherwise polished figure unusable. Existing text-to-image systems may produce visually appealing outputs, but the authors argue that high output variance, garbled text, misaligned connectors, and inconsistent layouts remain persistent obstacles for publication-quality scientific illustration. The paper frames this as a workflow problem in research communication, where creating and revising figures is one of the most labor-intensive parts of preparing a paper. Its central claim is that scientific figure generation requires targeted correction of localized structural errors rather than simply relying on a stronger image-generation backbone.

Two gaps: narrow inputs and non-editable outputs
The paper identifies two major limitations in prior automated figure-generation work: narrow input assumptions and non-editable outputs. Current agentic pipelines and code-generation methods often focus on text-to-image generation for a single figure type, while real researchers work from papers, rough sketches, partial layouts, visual references, icons, and iterative design constraints. The authors also stress that raster outputs are poorly suited to scientific revision because users often need to change individual labels, adjust color schemes, move components, or repair local layout mistakes. Code-generation approaches such as TikZ-style diagram synthesis offer editability, but the paper argues that they often lack the visual richness required for icons and stylized scientific layouts. CRAFTER therefore targets cross-type, cross-condition generation, while the companion CRAFTEDITOR system addresses the missing editability layer.

The core idea: a harness
The core technical idea in the paper is a harness: an orchestration layer that wraps an executor, such as an image generator or code generator, with planning, verification, structured memory, and revision. CRAFTER instantiates this harness with cooperating agents including an intent reasoner, a plan generator, a critic, a specification refiner, and a convergence judge. Instead of accumulating contradictory free-text prompt edits, the system maintains an evolving structured specification that records the current plan, revision history, diagnostics, and typed edits. The method uses diversity-driven plan exploration to generate multiple candidate framings, then applies a directive critic that reports targeted defects and suggested corrections rather than only scalar scores. This verify-then-refine loop is intended to correct specific failures in scientific layouts while keeping the system general across figure types and input conditions without architectural changes.
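
The verify-then-refine loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `Spec`, `TypedEdit`, `run_harness`, `toy_execute`, and `toy_critique` are hypothetical, the "figure" is just a string, and the critic is a hard-coded label check standing in for a model-based verifier. The sketch only shows the shape of the idea: a structured specification that accumulates typed edits and diagnostics, and a critic that returns targeted corrections instead of a scalar score.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class TypedEdit:
    """A targeted correction: which component, what kind of fix, and the new value."""
    target: str
    kind: str
    value: str


@dataclass
class Spec:
    """Structured specification: current plan plus revision history and diagnostics."""
    plan: Dict[str, str]
    history: List[TypedEdit] = field(default_factory=list)
    diagnostics: List[str] = field(default_factory=list)


def run_harness(
    spec: Spec,
    execute: Callable[[Dict[str, str]], str],
    critique: Callable[[str], List[TypedEdit]],
    max_rounds: int = 3,
) -> Tuple[str, Spec]:
    """Wrap an executor with verification and revision (hypothetical sketch)."""
    output = execute(spec.plan)
    for round_no in range(1, max_rounds + 1):
        edits = critique(output)
        if not edits:  # stand-in for a convergence judge: no defects reported
            break
        for edit in edits:
            spec.plan[edit.target] = edit.value  # apply the typed edit to the plan
            spec.history.append(edit)
            spec.diagnostics.append(f"round {round_no}: {edit.kind} on {edit.target}")
        output = execute(spec.plan)  # re-render from the revised specification
    return output, spec


# Toy stand-ins (assumptions): the intended labels and a string "renderer".
INTENDED = {"box1": "Encoder", "box2": "Decoder"}


def toy_execute(plan: Dict[str, str]) -> str:
    """Render the plan as a chain of labelled boxes."""
    return " -> ".join(f"[{plan[key]}]" for key in sorted(plan))


def toy_critique(output: str) -> List[TypedEdit]:
    """Report one typed edit per intended label that is missing from the output."""
    return [
        TypedEdit(target=key, kind="fix_label", value=want)
        for key, want in INTENDED.items()
        if f"[{want}]" not in output
    ]


spec = Spec(plan={"box1": "Encodr", "box2": "Decoder"})
figure, spec = run_harness(spec, toy_execute, toy_critique)
print(figure)
print(spec.diagnostics)
```

Because every correction is recorded as a typed edit against a named component, the specification stays consistent across rounds, whereas appending free-text prompt edits can leave earlier and later instructions in conflict.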

Two systems, same pattern
The paper presents two complementary systems built around the same harness abstraction: CRAFTER for figure generation and CRAFTEDITOR for raster-to-SVG conversion. In CRAFTER, the designer role proposes actionable plans, the executor renders images through an image-generation backend, the verifier evaluates outputs with multidimensional diagnostics, and the reviser writes typed corrections into the shared specification. In CRAFTEDITOR, the same pattern is adapted to editability through extraction, processing, and composition phases. The extraction phase removes text overlays and visual clutter to recover graphical assets, the processing phase captions assets and classifies them as vector or raster, and the composition phase assembles an SVG skeleton and refines it with a hybrid critic. This design lets the paper connect generation and downstream editing into a single workflow rather than treating scientific figures as final static images.
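
As a rough illustration of the processing and composition phases, the sketch below classifies already-extracted assets as vector or raster and assembles an SVG skeleton in which text is kept as editable `<text>` elements. Everything here is an assumption made for illustration: the `Asset` and `TextLabel` types, the colour-count threshold used by `classify`, and the choice to draw vector assets as plain rectangles are not taken from the paper. The extraction phase and the hybrid-critic refinement step are omitted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    """A graphical asset recovered by extraction (hypothetical structure)."""
    name: str
    x: int
    y: int
    w: int
    h: int
    color_count: int  # assumed proxy for visual complexity
    href: str = ""    # path to the cropped image, used for raster assets


@dataclass
class TextLabel:
    """A text overlay separated from the graphics during extraction."""
    content: str
    x: int
    y: int


def classify(asset: Asset, max_vector_colors: int = 8) -> str:
    """Assumed heuristic: simple, flat-coloured assets become vector shapes."""
    return "vector" if asset.color_count <= max_vector_colors else "raster"


def compose_svg(assets: List[Asset], labels: List[TextLabel], width: int, height: int) -> str:
    """Assemble an SVG skeleton with one element per asset and per label."""
    svg = ET.Element(
        "svg",
        {"xmlns": "http://www.w3.org/2000/svg", "width": str(width), "height": str(height)},
    )
    for asset in assets:
        box = {
            "id": asset.name,
            "x": str(asset.x),
            "y": str(asset.y),
            "width": str(asset.w),
            "height": str(asset.h),
        }
        if classify(asset) == "vector":
            # Placeholder shape; a real system would emit traced paths.
            ET.SubElement(svg, "rect", {**box, "fill": "none", "stroke": "black"})
        else:
            # Complex assets are embedded as images rather than redrawn.
            ET.SubElement(svg, "image", {**box, "href": asset.href})
    for label in labels:
        node = ET.SubElement(svg, "text", {"x": str(label.x), "y": str(label.y)})
        node.text = label.content  # text stays editable in the SVG
    return ET.tostring(svg, encoding="unicode")


assets = [
    Asset("box1", 10, 10, 80, 40, color_count=2),
    Asset("icon1", 110, 10, 40, 40, color_count=120, href="icon1.png"),
]
labels = [TextLabel("Encoder", 20, 35)]
svg_text = compose_svg(assets, labels, width=200, height=80)
print(svg_text)
```

Keeping labels as separate `<text>` nodes and giving each asset its own `id` is what makes local edits possible: changing a label or moving a component means editing a single element instead of regenerating the whole image.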

Evidence and takeaway
The evidence in the paper comes from experiments on PaperBanana-Bench and the authors’ new CRAFTBENCH benchmark. CRAFTBENCH contains 279 samples spanning three figure types and four input conditions, curated from published papers, award-tier conference posters, and research blogs with human quality annotation. The authors report that CRAFTER substantially outperforms standalone generators and the strongest agentic baseline under controlled comparisons, and they use VLM-based judging protocols to assess output quality against real images. Ablation studies support the contribution of the harness components, with removal of any single component causing a reported 5.04 to 8.90 point drop. The paper’s broader implication is that reliable scientific figure automation depends on structured agent orchestration and editable representations, not just more capable one-shot image generation.
