Boundlessness Overtaking Benchmarks: The Crisis of Evaluating AI Scientists

As AI systems begin drafting full research reports, our long-standing evaluation mindset is hitting its limits. We are used to benchmarking models on massive data sets with well-defined, comparable metrics. But modern AI-generated science is now judged on only a small number of long, open-ended research outputs, making traditional notions of generalization hard to verify. In the absence of standard evaluation frameworks, researchers find themselves creating case-specific evaluation criteria. This blog post is a wake-up call: a look at how quickly LLM-based scientific agents are outgrowing our inherited evaluation paradigms, and why we must rethink our long-held assumptions to build rigorous and standardized ways of assessing this new form of AI-driven scientific work.

Introduction

Since the origins of machine learning as a field, research has been anchored by a simple, shared and stable framework of evaluation:

fix a data set → fix an evaluation metric → compare models → assess performance

In this, the task is well-specified, the output space is bounded, and the ground truth is fixed. This framework rewarded rigor and reproducibility, and it provided a shared language for progress.
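This classic loop can be sketched in a few lines of Python. The toy data set, the exact-match metric, and the two "models" below are illustrative placeholders, not a real benchmark:

```python
# Minimal sketch of the classic ML evaluation loop: a fixed data set,
# a fixed metric, and models compared on identical ground truth.
# All data and "models" here are toy placeholders.

dataset = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]

def exact_match(prediction: str, reference: str) -> float:
    """Fixed metric: 1.0 if the prediction matches the reference exactly."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def evaluate(model, data) -> float:
    """Average the fixed metric over the fixed data set."""
    scores = [exact_match(model(q), ref) for q, ref in data]
    return sum(scores) / len(scores)

# Two toy "models" standing in for the systems under comparison.
model_a = {"What is 2+2?": "4", "Capital of France?": "Paris"}.get
model_b = {"What is 2+2?": "4", "Capital of France?": "Lyon"}.get

print(evaluate(model_a, dataset))  # 1.0
print(evaluate(model_b, dataset))  # 0.5
```

Everything about this loop presumes a bounded output space and a fixed reference answer; the rest of this post is about what happens when those presumptions fail.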

Machine learning (ML) has applied this evaluation framework in many sub-fields, from computer vision to speech recognition, but our discussion here focuses on NLP, especially knowledge-grounded scientific tasks. Benchmarks like Natural Questions, HotpotQA, and BioASQ provided a comfort zone where researchers could iterate rapidly, measure objectively, and scale evaluations to tens or hundreds of thousands of samples to ensure generalizability and robustness. Even in the era of long-form tasks, like machine translation or summarization, metrics such as BLEU, ROUGE, METEOR, and BERTScore—whether they measure n-gram overlap or embedding-based similarity—offered a shared reference point. The outputs were usually short, with a low level of abstraction, and the evaluation problem was fundamentally one of lexical or semantic similarity. Imperfect, yes, but collectively understood. These benchmarks gave us the comforting illusion that we were comparing like with like.

And it worked spectacularly until we built models that escaped this framework. LLMs did not merely improve performance; they changed the fundamental nature of what a model is and what an output can be. AI models today can effortlessly switch from generating a short, three-sentence summary to writing a detailed, 60-page research proposal.

We are therefore drifting towards a world in which outputs are long, open-ended, and effectively unbounded; canonical ground truths are absent; and each system is judged by bespoke, case-specific criteria.

We urgently need a corresponding shift in our epistemic infrastructure in order to be able to evaluate and further develop this new world.

In this blog post, we focus exclusively on the emerging space of AI systems that generate scientific outputs—hypotheses, experimental plans, methodological rationales, data-interpretation narratives, and full research papers. Our discussion is less concerned with tasks such as math olympiad problems or abstract reasoning questions, whose well-defined solutions are fundamentally different from long-form scientific output.

The Evaluation Problem No One Wants to Name

The fundamental problem with rapidly evolving AI models and their expanding capabilities is that current machine-learning benchmarks rest on assumptions that no longer hold:

Assumption 1: There exists a canonical correct answer.
Scientific long-form tasks rarely have one. In science, there is usually no clear right/wrong dichotomy; rather, contributions are evaluated according to their usefulness for further progress. Wrong answers can be just as useful: they can refute hypotheses, motivate further work, or simply serve as eye-openers. AI models for science can therefore generate entire argumentative universes, not just labels.

Assumption 2: Evaluation is output-based.
In science, it is often not the final output that matters most but the process of how one got there: new algorithms often emerge from constructive proofs, and multiple paths can lead to the same result, not all of them equally insightful. In scientific texts, too, justification, citation accuracy, experimental validity, and methodological rigor are often more important than the mere quality of the language.

Assumption 3: Humans can serve as gold-standard evaluators.
Scientific quality is hard to evaluate, even for humans, as often only time will tell. Human peer review is notoriously inconsistent even for conventional papers. For more groundbreaking, innovative papers, experts wildly disagree. For AI-generated content—which can be longer, denser, and more numerous—there is a simple scaling problem: having humans evaluate thousands of long-form research outputs is simply impossible. We have pushed the complexity of the output space into a regime where humans themselves cannot provide consistent, scalable ground truth.

Assumption 4: Correctness is equivalent to alignment with ground truth.
This assumption worked well for traditional long-form QA, where correctness could be approximated by matching key facts or phrases to a reference answer. However, for open-ended tasks such as hypothesis generation, research writing, or literature review, aspects like creativity and perspective are intrinsic. For instance, if the task involves generating a literature review, the original survey may serve as a reference point, but the organization, articulation, and interpretation of the research objectives produced by an AI system can turn out to be completely different from any previously known ground truth. Since humans write divergent reviews from the same sources, misalignment with a reference does not imply incorrectness.

In summary, when AI models violate one or several of these assumptions, their performance cannot be evaluated by word-pattern matching, embedding similarity, or any simple classification proxy. In this sense, recent LLMs have escaped the benchmarking infrastructure we once built around deterministic tasks.

Progress Lacks Standard Measurement

With the arrival of AI Scientists, lab-automation agents, and autonomous discovery pipelines, we face a new, uncomfortable reality: the traditional foundations of ML evaluation do not extend to AI-generated scientific texts. The recent generation of AI systems does not merely answer questions. These systems generate full reasoning paths, complete with rationale, exposition, and self-evaluation. This shift in what AI systems produce requires rethinking how they are evaluated. A strong indicator of this shift can be seen in the examples of recent works below, which report results on very small test data sets—often 3 to 20 samples—rather than the large-scale evaluations common in traditional ML. As a result, the emphasis moves towards showcasing a model’s capabilities rather than testing generalization. The examples that follow illustrate this change and collectively point to the absence of a shared, large-scale evaluation protocol for AI-driven scientific discovery.

  1. ChemCrow is an LLM-based chemistry agent, published by EPFL in Nature Machine Intelligence in May 2024. It uses a tool-augmented (ReAct) approach, with access to 18 external tools, including ChemSpace, PubChem, and the IUPAC-to-SMILES converter OPSIN. The authors evaluated its chemical reasoning capabilities across 14 specialized tasks from drug discovery, organic synthesis, and materials design. They compared ChemCrow’s output against GPT-4 outputs using EvaluatorGPT (LLM judge) and domain experts. The expert chemists were asked to evaluate each model’s performance for each task along three dimensions: (a) quality of reasoning, (b) correctness of the chemistry, and (c) degree of task completion. As per the code and Appendix G in the supplementary material, it appears that each task included only a single test instance. This yields a data set of merely 14 samples in total. In their limitations section (Appendix F), the authors acknowledged a broader need for standardized and scalable assessment frameworks for AI scientist agents. Heavy reliance on expert judgment and the demanding nature of designing domain-specific experiments to display the strengths and weaknesses of an agent both limit the speed and consistency of evaluation.

  2. AI Scientist-v1, published by SakanaAI as an arXiv preprint in August 2024, demonstrated the ability to autonomously generate research artifacts—code, experiments, and papers—across multiple subfields of machine learning, including diffusion models, transformer-based language modeling, and learning dynamics. To evaluate the outputs, the authors built an automated peer-reviewer (powered by an LLM) that assigns scores to the generated papers similar to human peer-review, assessing: novelty, methodological soundness, clarity, empirical results, and overall contribution. To validate the automated reviewer, they tested it on a data set of 500 real papers from ICLR 2022 (from the OpenReview public archive). The data set was unbalanced, containing more rejected papers. They then compared the LLM’s accept/reject decisions with the actual outcomes. The automated reviewer achieved roughly 70% overlap with the human-written reviews. When evaluated on a balanced subset of accepted vs. rejected papers, it attained roughly 65% balanced accuracy (human-level ~66% in the same consistency experiment). The system then generated its own ML-research papers, which were fed into the same automated reviewer. Some of them “passed”, i.e., they exceeded the acceptance threshold as defined by the reviewer.

  3. AI Co-Scientist, published by Google as an arXiv preprint in February 2025, is a multi-agent system built on Gemini 2.0. It uses a “generate → debate → evolve” workflow: given a high-level research goal in natural language, specialized agents generate candidate hypotheses, critique and rank them, then refine and evolve them—akin to an automated “scientific debate and selection” process. The authors demonstrated the potential for augmenting biomedical and scientific discovery in three applications: novel target discovery for liver fibrosis, drug repurposing for acute myeloid leukemia, and explaining mechanisms of bacterial evolution and antimicrobial resistance. The authors combined multiple evaluation strategies:
    • Automated hypothesis-quality evaluation: A pool of 203 distinct research goals was curated to evaluate system-generated hypotheses using Elo rating. Such automated ranking (Elo), however, only provides relative plausibility, not guaranteed scientific feasibility.
    • Full paper evaluation: A subset of 15 research goals was curated by seven biomedical human experts to evaluate the full research articles generated by the model. There was no broad “peer review of full reports” for all tasks or research goals. The expert human evaluation was limited to novelty/impact preference and did not include a full rigorous peer review of the generated research articles.
    • Report validation with domain experts: Three of the selected hypotheses were validated in wet-lab experiments (in vitro / organoid). This provided isolated stand-alone demonstrations of success, as some of the AI’s top (albeit scientifically rather unsurprising) predictions could be experimentally confirmed.
  4. AI Scientist-v2, the second version of AI-Scientist, published by SakanaAI as an arXiv preprint in April 2025, is an end-to-end agentic system that autonomously generates scientific hypotheses, designs and runs experiments, analyzes and visualizes the data, and finally writes manuscripts—making it the first system claimed to produce fully AI-generated papers that passed peer review at a workshop. For evaluation, the authors submitted three manuscripts produced entirely by the system (without human-written code templates) to a workshop at ICLR 2025, in cooperation with the workshop organizers under a double-blind peer review process. The submitted papers underwent the standard peer-review evaluation by human reviewers, who scored them on criteria such as scientific soundness, clarity, novelty, quality of experiments, and presentation. One of the three manuscripts earned an average reviewer score of 6.33 (with individual ratings 6, 7, and 6), placing it approximately in the top 45% of submissions—above the average acceptance threshold—thus marking the first time a fully AI-generated paper successfully passed a human scientific peer-review process.

  5. Biomni is a general-purpose biomedical AI agent, published by Stanford as a bioRxiv preprint in June 2025. It uses a tool-augmented (ReAct+Code) approach with a unified biomedical action space consisting of around 150 specialized tools, 59 databases, and 105 software packages. The authors evaluated the agent’s performance on established biomedical benchmarks, such as HLE and LabBench’s DbQA and SeqQA, both of which are multiple-choice question-answering data sets, in line with traditional evaluation protocols. Additionally, they evaluated their agent on eight biomedical tasks, namely rare disease diagnosis, drug repurposing, patient gene prioritization, variant prioritization, GWAS (Genome-Wide Association Study) causal gene detection, CRISPR (a gene-editing technology) perturbation screen design, single-cell RNA-sequence annotation, and microbiome disease-taxa analysis. Looking in depth into the available data sets on Hugging Face, each task contains between 10 and 50 samples, many of which follow template-based formats. Overall, Biomni integrates a diverse landscape of biomedical tools into a unified framework, enabling seamless knowledge grounding and accelerating complex scientific reasoning.

  6. Agent Laboratory, published by ETH Zurich at EMNLP in November 2025, examined whether autonomous agents can conduct end-to-end research workflows. It used five research questions from the fields of NLP and computer vision to produce 15 papers with three LLM backends: GPT-4o, o1-mini, and o1-preview. The generated reports were evaluated by human reviewers according to experimental quality, report quality, and usefulness. In addition, the authors used NeurIPS-style review scores for the criteria quality, significance, clarity, soundness, presentation, contribution, and overall. Human and automated reviewer scores were evaluated side by side. They then computed scores for cost, time, and success rate of subtasks. Such scores, however, do not capture the scientific quality, originality, or usefulness of the generated research output but rather quantify the computational and data efficiency of the agent.

  7. KOSMOS, published by Future House as an arXiv preprint in November 2025, spans multiple scientific domains—metabolomics, materials science, neuroscience, and statistical genetics. The technical report highlights seven discoveries. KOSMOS takes as input a research objective and a data set, both provided by a human scientist. KOSMOS then attempts to complete the research objective by using LLMs, data analysis agents, literature search agents, and a “world model” to perform iterative discovery cycles. Three of its discoveries independently reproduced findings from preprinted or unpublished manuscripts that were not accessible to KOSMOS. The other four discoveries marked truly novel contributions to the scientific literature: two supporting existing findings with novel methods, one developing a new method, and one providing a novel discovery not previously identified by human researchers. The authors used the following ways to evaluate KOSMOS:
    • Expert auditing of sample statements: The authors extracted 102 claims from three generated reports and had domain-expert scientists classify each as “Supported” vs. “Refuted”—i.e., whether the claim could be replicated by independent analysis or found in the literature. The resulting 79.4% accuracy suggests that many of the statements were meaningfully supported by data or literature, serving as a human-grounded measure of reliability.
    • Estimating human-equivalent research effort: They measured how much “work” KOSMOS did in each run, e.g., how many lines of code were written or how many papers were read. A typical run wrote ~42,000 lines of code and read ~1,500 papers, which the authors estimate to equal ~4.1 “expert-months” of human work. Such metrics, however, capture neither the scientific validity of the generated outputs nor the ingenuity or productivity of the agent.
    • Domain-expert evaluation: The four novel discoveries were checked by domain expert collaborators, who verified that the reasoning, code, and citations made sense, but didn’t fully reproduce the results with their own experiments. Therefore, the KOSMOS discoveries can be seen as interesting, potentially valuable ideas that passed initial expert screening.
      Importantly, the KOSMOS paper acknowledges that evaluating which insights truly matter still depends on substantial human effort, as each report contains several discovery narratives, each with dozens of claims, and no automated method exists to judge accuracy, novelty, or significance.
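To make one of these protocols concrete: the reviewer validation in example 2 hinges on balanced accuracy, which averages per-class recalls so that each class counts equally and degenerate judges are exposed on unbalanced data (such as the reject-heavy ICLR set). A minimal sketch, with synthetic labels rather than the actual review data:

```python
# Why balanced accuracy matters on an unbalanced validation set.
# The labels below are synthetic illustrations, not the actual ICLR data.

def accuracy(y_true, y_pred):
    """Raw accuracy: fraction of matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# 8 rejects, 2 accepts: a judge that always says "reject" looks good
# on raw accuracy but is exposed by balanced accuracy.
y_true = ["reject"] * 8 + ["accept"] * 2
y_pred = ["reject"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A judge that always rejects scores 0.8 on raw accuracy here but only 0.5 on balanced accuracy, which is why the balanced figure is the more honest one to report on class-skewed data.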

These examples illustrate our point that while AI scientists appear formidable within their own familiar environment and domain, we lack a standardized arena in which to objectively evaluate the extent of their actual capabilities and their future potential for scientifically useful contributions.

All of the above examples used different and mutually incomparable evaluation metrics and protocols. Some agents were assessed using traditional-looking benchmarks like GPQA, HLE, and LabBench. These are multiple-choice question-answering benchmarks that gauge recall and reasoning but fail to capture scientific novelty, creativity, or usefulness. Others used internal auto-metrics like Elo-style self-play for hypothesis assessment, or human-evaluated metrics such as novelty and plausibility on very small sample sizes. In some cases, isolated claims extracted from generated output were verified by domain experts or experiments, while others had experts score full papers without validating or reproducing the results. This was complemented by an array of ad-hoc numerical metrics like cost, time, task-completion success rate, lines of code generated, papers read, and human-equivalent effort.
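To illustrate what such an Elo-style internal auto-metric involves: each pairwise "debate" between two hypotheses updates their ratings by the standard logistic Elo rule. The K-factor, initial ratings, and match outcome below are illustrative assumptions, not any of the surveyed systems' actual configuration:

```python
# Hedged sketch of Elo-style relative ranking for generated hypotheses.
# K-factor, starting ratings, and the comparison outcome are invented.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise comparison (a 'debate')."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"hypothesis_A": 1200.0, "hypothesis_B": 1200.0}
# One simulated comparison in which A is judged better:
ratings["hypothesis_A"], ratings["hypothesis_B"] = elo_update(
    ratings["hypothesis_A"], ratings["hypothesis_B"], a_won=True
)
print(ratings)  # with equal starting ratings and K=32, A gains 16, B loses 16
```

Because ratings only move relative to each other, such scores express relative plausibility within one pool of hypotheses and say nothing about absolute scientific feasibility.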

While each of these metrics is valuable in itself, there is no community consensus on which subset of them to minimally report in order to make results comparable. Standardized reporting is, however, needed to meaningfully judge performance, create fair competition, and render results reproducible. All of these are prerequisites of the scientific method. Defining shared, domain-specific problems, task structures, and unified evaluation paradigms would bridge the current fragmentation and improve communication within the AI community.

Conclusion

The above examples serve to illustrate the two main points we wish to emphasize: (1) Recent long-form AI tools, exemplified by “AI-Scientists”, generate potentially unbounded output with no clear right/wrong dichotomy. (2) Every team evaluates their system with different, often purpose-made evaluation protocols. We lack a scalable standard evaluation framework for such outputs. This indicates that a structural transition in AI is underway. Historically, evaluation in machine learning relied on large sample sizes (often ~10,000 or more) because models produced short, paragraph-length output, and generating thousands of samples was computationally inexpensive. With contemporary LLMs capable of producing research-grade documents, multi-page analyses, and even book-length content, the computational cost of each sample has increased dramatically. As a result, evaluating 10,000 such outputs is no longer feasible, neither for model generation nor for human or hybrid assessment.

Consequently, the field is shifting away from traditional large-N statistical generalization toward a different evaluative paradigm—one centered on assessing a model’s capacity for sound reasoning, cross-domain competence, trustworthiness, robustness to manipulation, and broader cognitive generalization. This transition is not a problem in itself, but it does represent a fundamental change in what counts as evidence for model capability, and it requires us to rethink the epistemic foundations of evaluation in the age of AI-generated scientific work.

This blog post is a timely wake-up call to build a standard evaluation protocol for AI-driven science well before systems reach Artificial Super Intelligence (ASI).

Without such a protocol, the path toward ASI becomes substantially more difficult. The pre-ASI systems are likely to generate elegant-looking, unreliable theories, lead to the massive accumulation of unverifiable claims, and produce detrimental artifacts. Even if ASI emerges without a prior evaluation framework and develops its own evaluation methods, it would be unreasonable to assume those methods will align with the norms and goals of human scientific institutions. Therefore, developing standard evaluation protocols is essential both to guide development and to prevent a situation in which one system dictates what is true and controls the mechanisms for verifying it.

Now And Next

Given the risks associated with flying blind in AI-Science, it is reassuring to see that recent research increasingly highlights the need for a standardization of evaluation protocols for AI-Scientist systems. While there is no consensus yet, awareness is growing and progress is accelerating. Recent data sets and benchmarks, like IdeaBench (biomedical research ideas), HypoBench (generic hypothesis discovery), and ScienceAgentBench (scientific agent performance) constitute an incremental but much needed advancement toward standard evaluation frameworks. Looking ahead, promising solutions include domain-specific shared tasks that enable communities to converge around common challenges, as well as benchmarks designed with strict evaluation-only data sets to prevent data leakage. Standardized evaluation will likely rely on multi-metric scorecards rather than a single score, combining measures of correctness and task success (e.g., accuracy, discovery rate), novelty and impact (e.g., expert- and model-assessed plausibility), efficiency (e.g., cost and wall-clock time), robustness and reproducibility, safety and governance, and the usefulness of human-AI collaboration. To reduce cherry-picking and ensure transparency, researchers may also be expected to release full trace logs and code as evaluation artifacts. Future directions include exploring cross-disciplinary metrics and experimental workflows that can reliably verify AI-generated scientific output. Additionally, LLM-as-a-judge methods and meta-evaluation of those judges would enable scalable, responsible, and reliable evaluation of increasingly autonomous scientific AI systems.
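As one illustration of what such a multi-metric scorecard could look like as a reporting artifact, here is a hypothetical sketch; the field names, dimensions, and values are our own assumptions, not an agreed community standard:

```python
# Hypothetical multi-metric scorecard for an AI-Scientist run.
# Fields and values are illustrative assumptions, not a standard.
from dataclasses import dataclass, asdict

@dataclass
class ScienceAgentScorecard:
    task_success_rate: float      # correctness / discovery rate (0-1)
    expert_novelty: float         # expert-assessed novelty & plausibility (0-1)
    cost_usd: float               # efficiency: dollars per run
    wall_clock_hours: float       # efficiency: time per run
    reproducibility_rate: float   # fraction of runs whose results replicate
    trace_logs_released: bool     # transparency artifact

card = ScienceAgentScorecard(
    task_success_rate=0.6,
    expert_novelty=0.4,
    cost_usd=120.0,
    wall_clock_hours=9.5,
    reproducibility_rate=0.8,
    trace_logs_released=True,
)
for field, value in asdict(card).items():
    print(f"{field}: {value}")
```

The point of keeping the fields separate rather than collapsing them into one aggregate is that no single number can hide a trade-off, e.g., a cheap and fast agent whose outputs rarely replicate.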