HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers
HakushoBench addresses the lack of realistic non-English benchmarks for evaluating whether vision-language models can understand chart and table images in Japanese documents. The paper builds a Japanese visual question answering benchmark from governmental white papers, using diverse real-world chart and table images with manually annotated, challenging QA pairs. Its results show that open-weight VLMs still struggle with complex Japanese chart and table reasoning, leaving a large gap relative to proprietary systems.
Source: HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Why a New Benchmark?
The paper motivates HakushoBench from a central evaluation gap: chart and table visual question answering is essential for real-world document understanding, but most established benchmarks are English-centric. The authors argue that strong performance on English datasets such as ChartQA, ChartQAPro, and CharXiv does not necessarily imply robust understanding of Japanese charts and tables. Japanese documents can differ in terminology, layout structure, information density, geographic conventions, and mixtures of vertical and horizontal text. These language- and culture-specific properties make chart and table VQA a multilingual document-understanding problem rather than a purely visual extraction task. HakushoBench is therefore designed to test whether VLMs can handle realistic Japanese chart and table images that require holistic interpretation rather than isolated recognition of labels or values.

The Gap in Japanese Evaluation
The paper identifies JGraphQA as the closest existing Japanese benchmark but argues that it is too small and limited to serve as a demanding measure of current VLM capability. JGraphQA contains around 200 examples, covers relatively limited visual diversity, and follows a question style focused largely on data extraction and basic arithmetic. The authors note that this has led to performance saturation, with even 3B-scale VLMs already achieving over 80% accuracy. This saturation weakens the benchmark’s ability to distinguish models that can perform deeper Japanese chart and table reasoning from models that handle only simpler cases. HakushoBench is proposed as a larger and more challenging alternative that better reflects the complexity of real Japanese documents.

The Core Idea
The core methodological idea of the paper is to use governmental white papers, or hakusho, as a scalable source of realistic Japanese chart and table images. These documents are publicly available through sources such as Japan’s e-Gov portal and cover broad policy domains including defense, energy, welfare, education, economy, security, and society. The paper emphasizes that white papers contain professionally designed, information-dense figures and tables intended for general readers, making them suitable for evaluating practical VLM document understanding. Because many governments publish comparable white papers, the authors also frame this source strategy as extensible beyond Japanese to future multilingual benchmarks. This choice directly addresses the difficulty of collecting diverse, naturally occurring non-English chart and table images at scale.

How HakushoBench Was Built
HakushoBench is constructed through a pipeline that combines collection, manual filtering, annotation, and verification. The authors collect images from 33 Japanese governmental white paper series, selecting recent HTML editions to make image extraction more reliable and to reduce repeated annual variants. From an initial pool of 18,539 images, they manually remove unsuitable content such as photographs, unreadable low-resolution images, and near-duplicates, retaining 5,903 candidate chart and table images. Twenty-one native Japanese-speaking annotators then create one high-difficulty QA pair per image, targeting skills such as global understanding, multi-hop reasoning, counting, visual interpretation, and the use of external knowledge when appropriate. Cross-annotator verification filters the annotations into 2,053 final VQA examples spanning more than 10 image types.
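The filtering funnel described above can be made concrete with a little arithmetic. The sketch below uses only the three counts quoted in this summary and assumes, as the text implies, that each stage is a subset of the one before it. The names `STAGES` and `retention` are illustrative and do not come from the paper.

```python
# Stage counts as quoted in the summary above; the stage labels are ours.
STAGES = [
    ("collected images", 18_539),
    ("after manual filtering", 5_903),
    ("after cross-annotator verification", 2_053),
]


def retention(stages):
    """Return (name, count, share of previous stage, share of initial pool)."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


if __name__ == "__main__":
    for name, count, step, overall in retention(STAGES):
        print(f"{name:<36} {count:>6}  step {step:6.1%}  overall {overall:6.1%}")
```

Under that assumption, manual filtering keeps roughly 32% of the collected images, verification keeps roughly 35% of the annotated candidates, and about 11% of the initial pool survives into the final benchmark.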

What the Results Say
The experiments show that HakushoBench is substantially more difficult for current open-weight VLMs than prior Japanese chart and table benchmarks. Among the evaluated models, the best open-weight system reported, Qwen3-VL 8B, reaches only 58.6% accuracy on HakushoBench. The benchmark also exposes a 34.9-point accuracy gap between the best proprietary and best open-weight models, indicating that open-weight VLMs remain far from matching frontier systems on complex Japanese chart and table understanding. Comparisons with ChartQA, ChartQAPro, CharXiv, and JGraphQA suggest that HakushoBench is especially effective at revealing weaknesses that simpler or saturated datasets can hide. Manual error analysis of Gemini 3 Pro further finds failures involving perception, external knowledge, and counting, implying that even state-of-the-art models have unresolved limitations on information-dense Japanese visual documents.
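The two headline numbers above imply a third that the summary does not state directly. The sketch below derives it, assuming the 34.9-point gap is measured from Qwen3-VL 8B's 58.6% and that both figures are plain accuracy percentages. The function names are illustrative, and the derived values are arithmetic consequences of that assumption rather than figures quoted from the paper.

```python
def implied_top_accuracy(best_open: float, gap_points: float) -> float:
    """Accuracy (%) of the best proprietary model implied by the reported gap."""
    return round(best_open + gap_points, 1)


def error_rate(accuracy: float) -> float:
    """Share of questions answered incorrectly, in percent."""
    return round(100.0 - accuracy, 1)


if __name__ == "__main__":
    best_open = 58.6   # Qwen3-VL 8B, as quoted above
    gap = 34.9         # proprietary minus open-weight, in accuracy points

    best_prop = implied_top_accuracy(best_open, gap)
    print(f"implied best proprietary accuracy: {best_prop}%")
    print(f"open-weight error rate: {error_rate(best_open)}%")
    print(f"proprietary error rate: {error_rate(best_prop)}%")
```

On this reading, the best proprietary model would score about 93.5%, so the best open-weight model misses roughly 41 questions in 100 where the proprietary one misses about 6 or 7.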
