Methodology

Validation Highlights

6.1 - Critique quality on H-Max (n=50). Above the best-human-review anchor of 5.0, benchmarked using the ScholarPeer framework.

ScholarPeer, arXiv 2601.22638

0.81 - Inter-rater reliability (ICC₂) with adjudication, against a human peer-review benchmark of 0.34.

Bornmann, Mutz & Daniel (2010)

85%+ of retracted papers were flagged in the bottom two Study Quality tiers - blind, before retraction information was available. The evaluation also surfaces the latest retraction status.

Validation corpus. Numbers update as the expanded corpus completes.

1. Abstract

Nabu evaluates research papers on intrinsic merit - Study Quality, Trust Signals, and Impact Potential - benchmarked against the standard in the paper’s field and methodology type. Evaluation is blinded to publication venue, citation count, and other proxies.

Each paper's Study Quality is scored against a four-dimension framework (Contribution, Methodological Rigour, Reporting, Positioning) by blinded AI-Human hybrid reviewers, with adjudicated scoring and a documented rationale per component. A separate Trust Signals rating (Red / Amber / Green) flags concerns - methodological, statistical, or post-publication - that could change how the findings should be interpreted. A separate Impact Potential score captures likely real-world significance and translation. The three signals are never blended.

The framework is rubric-driven by design: explicit criteria, field-specific calibration anchors, and component-level scoring keep AI judgments evidence-bound and surface insufficient-information cases rather than guessing through them.

The framework has been tested against an initial validation corpus (sampled across OECD Fields of Science) on four measures: critique quality benchmarked against expert human reviews, blind detection of retracted work, inter-rater reliability, and decision-divergence from the journal-prestige default. On average, Nabu's critiques score above the best human review (6.1 vs the 5.0 anchor; no Nabu review scored below 4; n=50). Its reviewers reach an inter-rater reliability (ICC₂) of 0.81 (n=400+). Evaluated blind, it placed 85%+ of retracted papers in the bottom two Study Quality tiers (n=50). Details below.

2. Methodology

Every paper is evaluated on three independent signals: Study Quality, Trust Signals, and Impact Potential. Each draws on the same evaluation pipeline, with criteria calibrated to the paper’s methodology type and field of science. The three signals are scored, displayed, and reasoned about separately - never blended into a single number.


2.1 Study Quality Framework

The Study Quality score is built from four weighted dimensions. Each dimension is scored holistically against a set of guiding criteria rather than many individually scored sub-components.

Contribution · 25%
Does this paper move the field forward?
Methodological Rigour · 45%
Is the methodology sound for the question asked?
Reporting · 10%
Is there enough detail, clearly presented, to evaluate and reproduce the work?
Positioning · 20%
Does the paper engage fairly with relevant and contradictory prior work, and represent its own novelty and limitations honestly?

Study Quality score calibration

Score range	Tier	What the paper looks like
4.4 - 5.0	Exemplary	Holds up on every dimension, with no material gaps.
3.8 - 4.3	Strong	Clearly strong overall, with only minor weaknesses on any single dimension.
3.0 - 3.7	Moderate	Real strengths alongside meaningful gaps across one or more dimensions.
2.0 - 2.9	Limited	Notable weaknesses across several dimensions limit what the work establishes.
0.0 - 1.9	Very Limited	Serious shortfalls on the dimensions that matter most.

Scores are calibrated against the standards of published work at the time of publication, not against current standards. Components that cannot be scored from what the paper reports are marked II (Insufficient Information); their weight is redistributed across the reportable components rather than counted as a penalty.
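The weighting, tiering, and II rule above can be sketched in a few lines of Python. This is an illustration only, not Nabu's implementation: the function names are hypothetical, "redistributed" is assumed to mean proportional renormalisation of the remaining weights, and scores are assumed to be rounded to one decimal before a tier is looked up (the published bands leave gaps such as 4.3 to 4.4).

```python
from typing import Optional

# Dimension weights as stated in section 2.1.
STUDY_QUALITY_WEIGHTS = {
    "contribution": 0.25,
    "methodological_rigour": 0.45,
    "reporting": 0.10,
    "positioning": 0.20,
}

# Tier lower bounds from the calibration table, highest first.
TIERS = [
    (4.4, "Exemplary"),
    (3.8, "Strong"),
    (3.0, "Moderate"),
    (2.0, "Limited"),
    (0.0, "Very Limited"),
]


def weighted_score(scores: dict[str, Optional[float]],
                   weights: dict[str, float]) -> Optional[float]:
    """Weighted mean over reportable components.

    A score of None stands for II (Insufficient Information). Its weight is
    dropped and the remaining weights are renormalised proportionally
    (an assumed reading of "weight is redistributed").
    """
    reportable = {k: v for k, v in scores.items() if v is not None}
    total_weight = sum(weights[k] for k in reportable)
    if total_weight == 0:
        return None  # every component was marked II
    return sum(weights[k] * v for k, v in reportable.items()) / total_weight


def tier(score: float) -> str:
    """Map a 0.0-5.0 score to its tier, rounding to one decimal first (assumption)."""
    rounded = round(score, 1)
    for lower_bound, name in TIERS:
        if rounded >= lower_bound:
            return name
    raise ValueError("score must be non-negative")


# Hypothetical paper whose Reporting dimension could not be scored.
example = {
    "contribution": 4.0,
    "methodological_rigour": 3.0,
    "reporting": None,
    "positioning": 4.0,
}
score = weighted_score(example, STUDY_QUALITY_WEIGHTS)
print(score, tier(score))
```

Under these assumptions the missing Reporting weight (10%) is spread across the other three dimensions in proportion to their own weights, so an unreportable component neither raises nor lowers the result.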

2.2 Trust Signals

The Trust Signals rating is a gate, applied both before and after scoring, that flags methodological, statistical, or evidentiary concerns.

A Red rating does not mean "bad research." It means "proceed with caution - specific concerns identified." Rigorous work can carry an Amber or Red flag if trust-signal concerns are present, and weak work can carry a Green flag if no specific concerns are identified.

Internal Coherence
Methods–results alignment. Statistical results computable. Claims supported by evidence presented in the paper.
Research Conduct
Ethics approval declared. Conflicts of interest disclosed. Preregistration referenced with identifier. Data and code availability as reported.
Reference Integrity
Every cited reference resolves to a real publication whose metadata matches the citation (Topaz et al. 2026, The Lancet).
Post-Publication Record
Retraction status. Post-publication concerns raised by the scientific community.

Each component returns one of three statuses: Clean (no concerns identified), Noted (concerns present but do not change interpretation of core findings), or Concern (concerns that could change interpretation if confirmed). The overall Trust Signals flag is derived from the worst component status across the four.

Green: No concerns identified. Study Quality and Impact Potential scores reflect merit on their own.
Amber: Watch-outs noted. The concerns do not change the interpretation of core findings, but readers should consult the rationale.
Red: Concerns identified that could change the interpretation of findings. Affected components are capped, and the paper carries an explicit reliability flag in downstream displays.
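The "worst component status" rule can be sketched as a maximum over an ordered status scale. The mapping Clean to Green, Noted to Amber, Concern to Red is inferred from the flag descriptions above rather than stated outright, and the identifiers below are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Component statuses from section 2.2, ordered by severity."""
    CLEAN = 0
    NOTED = 1
    CONCERN = 2


# Assumed mapping from the worst component status to the overall flag.
FLAG_FOR_WORST = {
    Status.CLEAN: "Green",
    Status.NOTED: "Amber",
    Status.CONCERN: "Red",
}

# The four Trust Signals components (hypothetical key names).
COMPONENTS = (
    "internal_coherence",
    "research_conduct",
    "reference_integrity",
    "post_publication_record",
)


def overall_flag(statuses: dict[str, Status]) -> str:
    """Derive the overall flag from the most severe component status."""
    missing = [c for c in COMPONENTS if c not in statuses]
    if missing:
        raise ValueError(f"missing component statuses: {missing}")
    worst = max(statuses[c] for c in COMPONENTS)
    return FLAG_FOR_WORST[worst]


# Hypothetical paper: clean everywhere except one noted conduct issue.
print(overall_flag({
    "internal_coherence": Status.CLEAN,
    "research_conduct": Status.NOTED,
    "reference_integrity": Status.CLEAN,
    "post_publication_record": Status.CLEAN,
}))
```

Taking the maximum means a single Concern is enough to turn the whole paper Red, which matches the gate-like role the rating plays.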

2.3 Impact Potential Framework

Study Quality and Impact Potential are measured separately, always. A perfectly executed study of a trivial question may score high on Study Quality and low on Impact Potential. The reverse also occurs: an ambitious agenda paper with weak execution can carry strong Impact Potential and a low Study Quality score. The two signals answer different questions, and the framework keeps them apart.

The Impact Potential score is built from four components: Relevance, Readiness, Generalizability, and Trajectory.

Relevance · 30%
Does the paper address a problem that real stakeholders are actively trying to solve?
Readiness · 30%
How close are the findings to deployment or applied use?
Generalizability · 20%
How likely are the findings to hold beyond the specific study conditions?
Trajectory · 20%
Do the findings advance a cumulative evidence base?

Impact Potential calibration

Score range	Tier	What the paper looks like
4.4 - 5.0	Very High	Directly addresses a documented need with clear stakeholder pathway.
3.8 - 4.3	High	Clear real-world relevance with a plausible pathway to use.
3.0 - 3.7	Medium	Meaningful potential, but pathway remains partly defined.
2.0 - 2.9	Low	Narrow or early-stage pathway requiring substantial additional work.
0.0 - 1.9	Minimal	No clear pathway to application, policy influence, or practical use yet.

The same II (Insufficient Information) rule applies: where a component cannot be evaluated from what the paper reports, it is marked II and weight is redistributed.
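As a worked illustration, suppose redistribution means proportional renormalisation (an assumption; the exact scheme is not spelled out here). With component scores s_i, weights w_i, and R the set of components that could be scored, the Impact Potential score would be the weighted mean over R. The paper and its scores below are hypothetical.

```latex
\[
  \mathrm{IP} \;=\; \frac{\sum_{i \in R} w_i \, s_i}{\sum_{i \in R} w_i}
\]
% Hypothetical paper: Relevance 4.0, Readiness II, Generalizability 3.0, Trajectory 3.5
\[
  \mathrm{IP} \;=\; \frac{0.30(4.0) + 0.20(3.0) + 0.20(3.5)}{0.30 + 0.20 + 0.20}
  \;=\; \frac{2.50}{0.70} \;\approx\; 3.6
\]
```

In this example the result falls in the Medium band (3.0 - 3.7), and the unscored Readiness component has no effect on it in either direction.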

2.4 Role of AI in the Pipeline

Nabu’s reviewers are AI-Human hybrid agents operating within the structured rubric. Adjudication is also AI-executed within the rubric, with an escalation and quality-assurance path to human reviewers. The system is engineered so that the rubric does the work, not the model.

“The rubric is the evaluator. The AI is the instrument.”

Four operational guardrails constrain the AI to evidence-based judgments:

Component-level scoring, not holistic judgment.
Each reviewer scores each rubric component independently, not the paper as a whole.
Evidence-bound rationale per component.
Every component score must be accompanied by rationale that cites specific text, methods, or results from the paper.
The Insufficient Information (II) flag.
When a reviewer cannot extract enough evidence from the paper to score a component, the reviewer marks the component II rather than guess.
Field-specific calibration anchors.
Each rubric component is calibrated against literature and standards for the paper's Field of Science. This prevents the AI from defaulting to a generic prior of good research.

Two further controls operate inside this pipeline:

Bias control
Reviewers are blinded to author, journal, and institutional information; calibration anchors are field-specific to prevent default-to-prestigious patterns.
Anti-metric gaming
Rubric components reward methodological substance, not surface signals; scores cannot be improved by edits that do not reflect underlying quality.
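The blinding control can be pictured as a filter applied to a paper's metadata before it reaches a reviewer. The sketch below is illustrative only: the field names are hypothetical and depend on the ingestion schema, and it covers structured metadata, not author or venue mentions inside the manuscript text.

```python
# Hypothetical metadata keys; real field names depend on the ingestion schema.
BLINDED_FIELDS = frozenset(
    {"authors", "affiliations", "journal", "citation_count"}
)


def blind_metadata(paper: dict) -> dict:
    """Return a copy of the paper record without prestige-bearing fields.

    Publication year is deliberately kept, since it is needed for
    era-appropriate calibration.
    """
    return {key: value for key, value in paper.items()
            if key not in BLINDED_FIELDS}


# Hypothetical record before blinding.
record = {
    "title": "Example study",
    "year": 2015,
    "authors": ["A. Author"],
    "affiliations": ["Example University"],
    "journal": "Example Journal",
    "citation_count": 120,
    "full_text": "...",
}
print(sorted(blind_metadata(record)))
```

Using a denylist keeps the sketch short; an allowlist of fields reviewers may see would fail safer if new metadata fields were added later.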

3. Validation

Initial validation results from the Nabu evaluation corpus. Numbers will be updated as the expanded validation corpus completes.

Result 1: Critique quality vs. expert human reviews

Nabu’s review critiques were scored on H-Max, a metric that calibrates critique quality against the full set of human expert reviews, with the best human review anchored at 5.0. The approach follows ScholarPeer, a peer-review framework published by Google. Nabu's mean score was 6.1, above the best-human-review anchor (n=50). No Nabu review scored below 4, the threshold below which a critique counts as worse than the human reviewers.

Nabu critiques (mean H-Max): 6.1 · Best human review (benchmark anchor): 5.0

H-Max calibrates critique quality against the full set of human expert reviews. Approach based on ScholarPeer, a peer-review framework from Google (arXiv 2601.22638).

Result 2: Inter-rater reliability

Nabu’s primary reviewers achieve an ICC₂ of 0.81 (absolute agreement) across all scoring dimensions (n=400+, sampled across OECD Fields of Science) - more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts).

Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
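For readers who want to check an agreement figure of this kind on their own data, here is a minimal sketch of an intraclass correlation under a two-way random-effects, absolute-agreement, single-rater model, often written ICC(2,1). Whether the ICC₂ reported above is this exact form is an assumption, and the function name and example ratings are hypothetical.

```python
from typing import Sequence


def icc2_absolute(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score rater j gave to paper i. Every paper must be
    scored by the same k >= 2 raters, and there must be n >= 2 papers.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k matrix with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: papers (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_rows - ms_error) / denominator


# Hypothetical scores: four papers, each rated by the same three reviewers.
example = [
    [4.1, 4.0, 4.3],
    [2.5, 2.8, 2.6],
    [3.6, 3.4, 3.7],
    [1.9, 2.2, 2.0],
]
print(round(icc2_absolute(example), 2))
```

Absolute agreement penalises a reviewer who is consistently harsher or more lenient than the others, which a consistency-only ICC would ignore. That makes it the stricter choice when scores are compared against fixed tier boundaries.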

Result 3: Retraction detection

A curated corpus of confirmed-retracted papers (n=50) was evaluated blind alongside a control set of non-retracted papers from the same sources, with no knowledge of retraction status available to the reviewers.

85%+ of retracted papers were placed in the bottom two Study Quality tiers (Limited or Very Limited, 0.0–2.9).
The low scores were driven by specific, documented methodological concerns - not a generic “this seems bad” signal.
The remainder were papers retracted for reasons not reliably visible in the rubric-scorable text alone (post-hoc data fabrication, image manipulation, ethical violations).

The rubric identified what post-publication scrutiny later confirmed.

4. Limitations

The framework is calibrated against published research. It is most reliable for paper formats where methodology, claims, and evidence are explicit and reportable. It is less reliable, by design, in the following cases:

Trust Signals and validation results in the Validation section should be read with these scope conditions in mind.

5. Commitments

Open methodology.
The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.
Full blinding.
No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.
No unpaid reviewers.
All reviewers engaged by Nabu for calibration, escalation and quality control are dedicated professional reviewers trained and compensated accordingly.
No conflicts of interest.
Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent: there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.
Venue-independent.
The same rubric everywhere. A paper is a paper.
Living evaluations.
Post-publication signals - replications, retractions, method validity - update the reliability layer over time. Field feedback runs alongside the assessment: readers can register their own calibration on Study Quality, Trust Signals, and Impact Potential. Where the field meaningfully disagrees with the Nabu assessment, the work returns for re-evaluation.
DORA and CoARA alignment.
Paper-level, methodology-based, venue-independent. The framework operationalises what 3,000+ signatory institutions committed to.
