ReadPaper Blog
GENEB: Why Genomic Models Are Hard to Compare
The paper introduces GENEB, a large-scale diagnostic benchmark for comparing genomic foundation models under a unified evaluation protocol. It addresses the fragmentation of genomic model evaluation by testing frozen representations from 40 models across 100 DNA classification tasks in 13 functional categories, showing that model rankings depend strongly on task category, architecture, tokenization, pretraining data, and scale.

The Comparison Crisis
The paper argues that progress in genomic foundation models is difficult to assess because the field lacks a shared basis for comparison. Models such as DNA-GPT, GENOMEOCEAN, and EVO are often evaluated on different task sets, preprocessing pipelines, and reporting conventions, so claims of superiority are not necessarily comparable. The authors describe a fragmented comparison landscape in which published models are sparsely connected by common baselines, making it hard to determine whether apparent improvements reflect genuine progress. This evaluation gap becomes more consequential as genomic foundation models grow larger and claims about generality become broader. The paper’s central motivation is therefore methodological: genomic machine learning needs a controlled framework that can compare representations across models, tasks, and design choices.

Enter GENEB
GENEB is proposed as that framework: a benchmark evaluating frozen representations from 40 genomic foundation models on 100 DNA classification tasks spanning 13 functional categories. The benchmark is positioned as a diagnostic resource rather than a single-task contest, analogous in spirit to MTEB in natural language processing. Its design enables matched comparisons across model scale, architecture, tokenization, and pretraining data, which are otherwise confounded in cross-paper evaluations. The task coverage includes diverse functional categories such as histone modifications, lncRNA, splice sites, enhancers, promoters, transcription factor binding, DNA methylation, species classification, chromatin accessibility, and virus/phage-related prediction. By producing a complete performance matrix under one protocol, GENEB makes task-level trade-offs visible rather than hiding them behind an aggregate headline score.
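
To illustrate why a complete performance matrix matters, here is a minimal sketch in which the model with the best aggregate score is not the best model in most categories. The model names, the scores, and the choice of aggregate (an unweighted mean of category means) are all hypothetical and are not taken from the paper.

```python
from statistics import mean

# Hypothetical per-task MCC scores: model -> category -> task scores.
# These numbers are invented for illustration; they are not GENEB results.
scores = {
    "toy_model_a": {
        "promoters": [0.62, 0.58],
        "splice sites": [0.71],
        "species classification": [0.30, 0.34],
    },
    "toy_model_b": {
        "promoters": [0.55, 0.57],
        "splice sites": [0.64],
        "species classification": [0.52, 0.48],
    },
}


def category_means(per_category):
    """Average the task scores inside each category."""
    return {cat: mean(vals) for cat, vals in per_category.items()}


def aggregate(per_category):
    """Assumed aggregate: unweighted mean of the category means."""
    return mean(list(category_means(per_category).values()))


# Model x category matrix, plus one aggregate number per model.
matrix = {model: category_means(cats) for model, cats in scores.items()}
overall = {model: aggregate(cats) for model, cats in scores.items()}

best_overall = max(overall, key=overall.get)
categories = sorted(next(iter(matrix.values())))
best_by_category = {
    cat: max(matrix, key=lambda m: matrix[m][cat]) for cat in categories
}

print("best overall:", best_overall)
for cat, model in best_by_category.items():
    print(f"best on {cat}: {model}")
```

In this toy matrix, toy_model_b has the higher aggregate score because of its lead on species classification, yet toy_model_a is ahead on both promoters and splice sites. A single headline number would hide that split.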

How the Test Works
The methodology uses an embedding-based probing protocol to isolate the quality of frozen sequence representations. For each task, the paper extracts frozen embeddings from each genomic foundation model and trains a lightweight logistic regression classifier with a maximum of 1000 iterations. The evaluation includes 1-shot, 10-shot, and full-data regimes, allowing the benchmark to assess both low-data behavior and performance when all available training examples are used. Results are averaged over five fixed random seeds (13, 17, 42, 123, and 997) to reduce sensitivity to sampling variation. The paper reports the Matthews Correlation Coefficient (MCC), a metric well suited to imbalanced genomic classification tasks. Tasks exceeding 100,000 sequences are subsampled, a choice supported by an empirical stabilization analysis using GENOMEOCEAN embeddings.
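
The probing protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses scikit-learn as the classifier library, treats "k-shot" as k training examples per class, and replaces real frozen embeddings with synthetic Gaussian vectors. The helper names sample_k_shot and probe_mcc are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef

SEEDS = [13, 17, 42, 123, 997]  # the five fixed seeds named in the text


def sample_k_shot(labels, k, rng):
    """Pick k training indices per class (assumed meaning of k-shot).

    k=None stands for the full-data regime and returns every index.
    """
    if k is None:
        return np.arange(len(labels))
    picked = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        size = min(k, len(cls_idx))
        picked.append(rng.choice(cls_idx, size=size, replace=False))
    return np.concatenate(picked)


def probe_mcc(train_x, train_y, test_x, test_y, k=None):
    """Mean test MCC of a logistic-regression probe across the seeds."""
    seed_scores = []
    for seed in SEEDS:
        rng = np.random.default_rng(seed)
        idx = sample_k_shot(train_y, k, rng)
        clf = LogisticRegression(max_iter=1000, random_state=seed)
        clf.fit(train_x[idx], train_y[idx])
        seed_scores.append(matthews_corrcoef(test_y, clf.predict(test_x)))
    return float(np.mean(seed_scores))


# Synthetic stand-in for frozen embeddings: two shifted Gaussian classes.
# A real run would load embeddings extracted from a frozen model instead.
data_rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
x = data_rng.normal(size=(400, 16)) + 1.5 * y[:, None]
order = data_rng.permutation(400)
train, test = order[:300], order[300:]

results = {
    k: probe_mcc(x[train], y[train], x[test], y[test], k=k)
    for k in (1, 10, None)
}
for k, score in results.items():
    print("full" if k is None else f"{k}-shot", round(score, 3))
```

Because the embedding model is never updated, any difference in score between two models under this setup comes from the representations themselves and the small linear probe. In this sketch the seed only affects which few-shot examples are drawn, so the full-data score is the same for every seed.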

What the Evidence Says
The paper’s evidence shows that aggregate genomic model leaderboards are unstable and that larger models are not uniformly better. Across the full benchmark, model size is associated with aggregate performance, with a Spearman correlation of 0.573 between log parameter count and macro-MCC, rising to 0.694 after excluding the prokaryotic-only EVO-1-131K outlier. Yet the authors find many cases where much smaller models outperform larger ones, including MUTBERT at 86M parameters exceeding ECDNAMAMBA at 537M parameters by 0.110 macro-MCC. Category-level scaling correlations vary substantially, from stronger associations in histone modifications, lncRNA, and splice sites to weaker or non-significant relationships in species classification and chromatin accessibility. Controlled comparisons also suggest that architecture and pretraining alignment can outweigh parameter count, with Transformer models outperforming the available matched Mamba model in the multi-species BPE setting and encoder models exceeding decoders across matched Transformer pairs in the reported analysis.
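
The scaling analysis rests on a rank correlation between model size and aggregate score. The sketch below computes a Spearman correlation from scratch and shows how dropping a single outlier can raise it. The six models and their numbers are hypothetical, chosen only to make the effect visible; they do not reproduce the reported values of 0.573 and 0.694.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical (parameter count, macro-MCC) pairs; not GENEB results.
toy_models = {
    "toy_a": (8e6, 0.21),
    "toy_b": (50e6, 0.40),
    "toy_c": (110e6, 0.33),
    "toy_d": (500e6, 0.29),
    "toy_e": (1.2e9, 0.45),
    "toy_outlier": (7e9, 0.30),  # large model with a middling score
}


def size_score_correlation(models):
    # The log is monotonic, so it leaves the ranks unchanged; it is kept
    # here only to mirror the "log parameter count" wording in the text.
    log_params = [math.log10(p) for p, _ in models.values()]
    macro_mcc = [s for _, s in models.values()]
    return spearman(log_params, macro_mcc)


rho_all = size_score_correlation(toy_models)
trimmed = {k: v for k, v in toy_models.items() if k != "toy_outlier"}
rho_trimmed = size_score_correlation(trimmed)

print("all models:", round(rho_all, 3))
print("without outlier:", round(rho_trimmed, 3))
```

Even in the trimmed toy set the correlation stays well below 1, because toy_b outscores the larger toy_c and toy_d. That is the same kind of pattern the paper describes when smaller models outperform larger ones.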

The Takeaway
The paper’s practical conclusion is that genomic foundation models should be selected with task category and evaluation protocol in mind rather than by a single overall ranking. GENEB exposes cases where aggregate macro-MCC, micro-MCC, few-shot behavior, architecture, and pretraining domain can point to different model choices. This matters for applications because a model that is strong on one functional category may be less appropriate for another, especially when cross-species generalization or domain mismatch is involved. The benchmark therefore reframes evaluation as a category-aware diagnostic process rather than a universal leaderboard. By planning public benchmark releases and hosted evaluations on Hugging Face, the authors position GENEB as shared infrastructure for more reproducible and principled comparison in genomic machine learning.
