How Nabu Evaluates Research

A framework for evaluating research papers on intrinsic merit — methodology, rubric, and validation.

Cite as: Nabu Science (2026). Nabu Evaluation Framework v1.0. nabu.science/methodology


Abstract

Nabu evaluates research papers on intrinsic merit (quality, reliability, and likely impact), benchmarked against the published standard in the paper’s field. It does not score publication venue, citation count, or other proxies.

Quality is scored against the 4C framework (Contribution, Craft, Clarity, Context) by three blinded AI-Human hybrid reviewers per paper, with adjudicated scoring and documented rationale per component. A separate Reliability rating (Red / Amber / Green) flags concerns — methodological, statistical, or evidentiary — that could change how the findings should be interpreted, based on what’s reported in the paper. A separate Impact Potential score captures likely real-world significance. The three signals are never blended.

The framework is rubric-driven by design: explicit criteria, field-specific calibration anchors, and component-level scoring keep AI judgments evidence-bound and surface insufficient-information cases rather than guessing through them.

In initial validation (n=400+, sampled across OECD Fields of Science), primary reviewers achieved an inter-rater reliability (ICC2) of 0.81; with adjudication, 0.89. Blind evaluation of a curated retraction set (n=50) correctly flagged 90%+ as fundamentally flawed before retraction information was available.

Methodology, rubric, validation results, and limitations are below.

1. Introduction

Nabu evaluates research papers on intrinsic merit (quality, reliability, and likely impact), benchmarked against the published standard in the paper’s field. It does not score publication venue, citation count, or other proxies.

Quality is scored against the 4C framework (Contribution, Craft, Clarity, Context) by three blinded AI-Human hybrid reviewers per paper, with adjudicated scoring and documented rationale per component. A separate Reliability rating (Red / Amber / Green) flags concerns - methodological, statistical, or evidentiary - that could change how the findings should be interpreted, based on what is reported in the paper. A separate Impact Potential score captures likely real-world significance. The three signals are never blended.

The framework is rubric-driven by design: explicit criteria, field-specific calibration anchors, and component-level scoring keep AI judgments evidence-bound and surface insufficient-information cases rather than guessing through them.

In initial validation (n=400+, sampled across OECD Fields of Science), primary reviewers achieved an inter-rater reliability (ICC2) of 0.81; with adjudication, 0.89. Blind evaluation of a curated retraction set (n=50) correctly flagged 90%+ as fundamentally flawed before retraction information was available.

Methodology, rubric, validation results, and limitations are below.

2. Methodology

Every paper is evaluated on three independent signals: Quality, Reliability, and Impact Potential. Each draws on the same evaluation pipeline, with criteria calibrated to the paper’s methodology type and field of science. The three signals are scored, displayed, and reasoned about separately - they are never blended into a single number.

  • Quality (4C)
  • Reliability (RAG)
  • Impact Potential (4T)

2.1 Quality (4C) Framework

The Quality score is built from four weighted dimensions: Contribution, Craft, Clarity, and Context. Craft carries the highest weight (45%) because methodological rigour is the foundation on which the other dimensions stand. Each dimension is scored holistically against a set of guiding criteria rather than many individually scored sub-components.

Quality score calibration

Score range   Rating        What the paper looks like
4.5 - 5.0     Exceptional   Reference-quality work with strong evidence across dimensions.
3.5 - 4.4     Very Good     Clearly strong work with limited, non-fundamental weaknesses.
2.5 - 3.4     Good          Solid work with identifiable strengths and some meaningful gaps.
1.5 - 2.4     Acceptable    Minimum threshold met; notable concerns reduce confidence.
0.0 - 1.4     Poor          Fundamental concerns that materially limit reliability.

Scores are calibrated against the published work of the paper's own publication period, not against current standards. Components that cannot be scored from what the paper reports are marked II (Insufficient Information); their weight is redistributed across the reportable components rather than counted as a penalty.
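
The weighting and the II rule can be illustrated with a small sketch. Only Craft's 45% weight comes from the text; the other three weights, the rounding to one decimal, and the function names are assumptions made for illustration. The sketch also redistributes weight at the dimension level for brevity, whereas the framework describes redistribution across components.

```python
from typing import Optional

# Craft's 45% weight is stated in the text. The other three weights are
# placeholders chosen only so that the total is 1.0.
WEIGHTS = {"contribution": 0.25, "craft": 0.45, "clarity": 0.15, "context": 0.15}

# Calibration bands from the Quality table, as (lower bound, label).
BANDS = [(4.5, "Exceptional"), (3.5, "Very Good"), (2.5, "Good"),
         (1.5, "Acceptable"), (0.0, "Poor")]


def composite_quality(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the scored dimensions; None marks an II entry."""
    reportable = {dim: s for dim, s in scores.items() if s is not None}
    if not reportable:
        return None  # nothing could be scored, so no composite exists
    # Dividing by the weight of the scored dimensions redistributes the
    # weight of II entries proportionally instead of treating them as zero.
    total_weight = sum(WEIGHTS[dim] for dim in reportable)
    value = sum(WEIGHTS[dim] * s for dim, s in reportable.items()) / total_weight
    return round(value, 1)  # one-decimal rounding is an assumption


def band(score: float) -> str:
    """Map a composite score to its calibration band."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise ValueError("score must be non-negative")
```

With Context marked II, the remaining three weights (0.85 in this sketch) are rescaled to sum to 1, so the missing dimension neither raises nor lowers the composite.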

2.2 Reliability Screen

The Reliability rating is independent of the Quality and Impact Potential scores. It is a pre-scoring gate, applied to every paper before any dimension is scored, that flags methodological, statistical, or evidentiary concerns visible in what the paper reports.

A Red rating does not mean "bad research." It means "proceed with caution - specific concerns identified." High-quality work can carry an Amber or Red flag if reliability concerns are present, and weak work can carry a Green flag if no specific concerns are identified.

  • Green: No reliability concerns identified. Quality and Impact Potential scores reflect merit on their own.

  • Amber: Watch-outs noted. The concerns do not change the interpretation of core findings, but readers should consult the rationale.

  • Red: Concerns identified that could change the interpretation of findings. Affected components are capped, and the paper carries an explicit reliability flag in downstream displays.
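
One way to picture the Red rule is as a ceiling applied only to the affected components. The sketch below is illustrative: the cap value, the component names, and the function name are hypothetical, since the text states that affected components are capped but not at what level.

```python
from enum import Enum


class Reliability(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


# Hypothetical ceiling; the framework text does not state the cap value.
RED_CAP = 2.4


def apply_reliability(scores: dict[str, float],
                      rating: Reliability,
                      affected: set[str]) -> dict[str, float]:
    """Return a copy of the component scores with the Red cap applied."""
    if rating is not Reliability.RED:
        # Green and Amber leave scores untouched; Amber only adds rationale.
        return dict(scores)
    return {name: min(value, RED_CAP) if name in affected else value
            for name, value in scores.items()}
```

Components outside the affected set keep their original scores, which matches the idea that the flag is targeted rather than a blanket penalty.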

Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time. A reliability rating is not fixed at publication.

2.3 Impact Potential (4T) Framework

Quality and Impact Potential are measured separately, always. A perfectly executed study of a trivial question may score high on Quality and low on Impact Potential. The reverse also occurs: an ambitious agenda paper with weak execution can carry strong Impact Potential and a low Quality score. The two signals answer different questions, and the framework keeps them apart.

The Impact Potential score is built from four components: Traction, Translation, Transferability, and Trajectory.

Impact Potential calibration

Score range   Rating     What the paper looks like
4.5 - 5.0     High       Directly addresses a documented need with clear stakeholder pathway.
3.5 - 4.4     Strong     Clear real-world relevance with a plausible pathway to use.
2.5 - 3.4     Moderate   Meaningful potential, but pathway remains partly defined.
1.5 - 2.4     Limited    Narrow or early-stage pathway requiring substantial additional work.
0.0 - 1.4     Minimal    No clear pathway to application, policy influence, or practical use yet.

The same II (Insufficient Information) rule applies: where a component cannot be evaluated from what the paper reports, it is marked II and weight is redistributed.

2.4 Methodology Modules

Specific scoring criteria within the rubric apply differently depending on the paper’s methodology. An RCT is evaluated differently from a qualitative study, because it should be. Six modules cover the primary research designs. Each paper is assigned a module by a classification agent before review begins.

2.5 Evaluation Pipeline

Each paper passes through a structured pipeline. A classification agent first assigns the paper to its OECD Field of Science and selects the appropriate methodology module. Three independent reviewers then evaluate the paper against the rubric, each blind to the others' scoring. An adjudicator resolves any divergence on strength of evidence, applying the same rubric and field-specific calibration anchors.

The architecture is designed so that the rubric, not any single reviewer, is the evaluator. When reviewers disagree, the rubric - applied by the adjudicator - decides.
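
The flow from three blinded reviewers to adjudication might be sketched as follows. The divergence threshold, the use of the median when reviewers agree, and the `adjudicate` callback are all assumptions for illustration; the text specifies only that an adjudicator resolves divergence on strength of evidence.

```python
from statistics import median
from typing import Callable

# Hypothetical spread (in score points) beyond which reviewers are treated
# as diverging; the framework text does not publish a threshold.
DIVERGENCE_THRESHOLD = 1.0


def needs_adjudication(reviewer_scores: list[float]) -> bool:
    """True when the gap between highest and lowest score is too wide."""
    return max(reviewer_scores) - min(reviewer_scores) > DIVERGENCE_THRESHOLD


def resolve(reviewer_scores: list[float],
            adjudicate: Callable[[list[float]], float]) -> float:
    """Return one component score from three independent reviews.

    `adjudicate` stands in for the rubric-applying adjudicator, which in the
    real pipeline weighs the cited evidence rather than the numbers alone.
    """
    if len(reviewer_scores) != 3:
        raise ValueError("expected exactly three reviewer scores")
    if needs_adjudication(reviewer_scores):
        return adjudicate(reviewer_scores)
    return median(reviewer_scores)  # assumed rule when reviewers agree
```

Passing the adjudicator in as a callback keeps the routing logic separate from the judgment itself, which mirrors the stated principle that the rubric, not any single reviewer, decides.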

Two further controls operate inside this pipeline:

  • Bias control - reviewers are blinded to author, journal, and institutional information; calibration anchors are field-specific to prevent default-to-prestigious patterns.

  • Anti-metric gaming - rubric components reward methodological substance, not surface signals; scores cannot be improved by edits that do not reflect underlying quality.

2.6 Role of AI in the Pipeline

Nabu’s reviewers are AI-Human hybrid agents: multiple frontier large language models operating within the structured rubric. Adjudication is also AI-executed within the rubric, with an escalation and quality-assurance path to human reviewers where divergence cannot be resolved on the evidence available, where the paper falls outside well-calibrated training distributions, or where rubric application produces systematic anomalies. The system is engineered so that the rubric does the work, not the model.

“The rubric is the evaluator. The AI is the instrument.”

Four operational guardrails constrain the AI to evidence-based judgments:

  • Component-level scoring, not holistic judgment.

    Each reviewer scores each rubric component independently, not the paper as a whole.

  • Evidence-bound rationale per component.

    Every component score must be accompanied by rationale that cites specific text, methods, or results from the paper.

  • The Insufficient Information (II) flag.

    When a reviewer cannot extract enough evidence from the paper to score a component, the reviewer marks the component II rather than guess.

  • Field-specific calibration anchors.

    Each rubric component is calibrated against the published literature for the paper's OECD Field of Science. This prevents the AI from defaulting to a generic prior of what good research looks like.
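
The first three guardrails can be expressed as constraints on the record a reviewer returns for each component. This is a minimal sketch under assumed names (`ComponentScore`, `evidence`), not Nabu's actual schema: a score of `None` represents the II flag, and a numeric score is rejected unless it cites at least one passage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentScore:
    """One reviewer's judgment on one rubric component (hypothetical schema)."""
    component: str
    score: Optional[float]          # None means Insufficient Information (II)
    evidence: tuple[str, ...] = ()  # quoted text, methods, or results

    def __post_init__(self) -> None:
        if self.score is None:
            return  # II needs no evidence: the point is that none was found
        if not 0.0 <= self.score <= 5.0:
            raise ValueError("score must lie on the 0.0 - 5.0 scale")
        if not self.evidence:
            raise ValueError("a numeric score must cite evidence from the paper")
```

For example, `ComponentScore("sampling", None)` is a valid II record, while `ComponentScore("sampling", 4.0)` raises because no supporting passage is given.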

3. Validation

Initial validation results from the Nabu evaluation corpus. Numbers will be updated as the expanded validation corpus completes.

Result 1: Inter-rater reliability

Nabu’s primary reviewers achieve an ICC2 of 0.81 (absolute agreement) on the composite Quality score - more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts). When the adjudication layer is applied (resolving score divergence on strength of evidence), reliability rises to 0.89.

Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
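
For readers who want to reproduce this kind of figure on their own data, the sketch below computes the single-rater, absolute-agreement form, ICC(2,1), from a papers-by-raters table using the standard two-way ANOVA mean squares. The text reports "ICC2 (absolute agreement)" without saying whether the single-rater or averaged form was used, so the choice of ICC(2,1) here is an assumption, as is the function name.

```python
def icc2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score that rater j gave to paper i. Every paper
    must be scored by every rater.
    """
    n = len(ratings)      # number of papers
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for papers (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic rater differences (ms_cols) count
    # against reliability, unlike in consistency-type ICCs.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Because rater-level offsets enter the denominator, a reviewer who scores every paper one point higher than the others lowers this coefficient even when the rank ordering of papers is identical.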

[Figure: Inter-rater reliability (ICC), on a 0.00 - 1.00 axis with a "Good reliability threshold" line. Human peer review (meta-analysis mean): 0.34. Nabu (primary reviewers): 0.81. Nabu (with adjudication): 0.89.]

Result 2: Score distribution

Distribution of Quality scores across the non-retracted evaluation corpus. The majority of evaluated papers score in the Very Good range (3.5-4.4). Only 3% score Exceptional - the standard is intentionally conservative and calibrated to the published literature, not to internal benchmarks. The rubric discriminates: scores span all five bands rather than clustering in one.

[Figure: Share of papers by Quality band. Poor (0.0 - 1.4): 5%. Acceptable (1.5 - 2.4): 16%. Good (2.5 - 3.4): 20%. Very Good (3.5 - 4.4): 56%. Exceptional (4.5 - 5.0): 3%.]

Distribution computed across the current evaluation corpus (n=240). Updated as the expanded corpus is processed.

Result 3: Retraction detection

A curated corpus of 50 papers with confirmed retracted status was evaluated blind: no retraction information was available to the reviewers at the time of scoring.

  • 90%+ scored in the bottom two quality tiers (Poor or Acceptable, 0.0-2.4)
  • 90%+ flagged Red for reliability concerns by all three reviewers independently
  • Mean Quality score across the retracted corpus: 1.2 / 5.0
  • The remaining ~10% included papers retracted for non-methodological reasons (post-hoc data fabrication, image manipulation, ethical violations) that are not reliably visible from the rubric-scorable text alone

The rubric identified what post-publication scrutiny later confirmed. In each case, the low scores were driven by specific, documented methodological concerns - not by a generic “this seems bad” signal.

4. Limitations

The framework is calibrated against published research. It is most reliable for paper formats where methodology, claims, and evidence are explicit and reportable. It is less reliable, by design, in the following cases:

  • Theoretical and conceptual papers.

    The Craft dimension assumes empirical methodology. For papers whose contribution is a framework, argument, or proof rather than data collection, the Theoretical / Conceptual module applies, but the dimension weights remain calibrated against empirical work. Quality scores for these papers reflect logical rigour and engagement with the literature more than methodological rigour; readers should weight the dimension breakdown more heavily than the composite score in these cases.

  • Very short formats.

    Letters, editorials, commentaries, and brief reports often lack the reportable detail (sample, methods, analysis) that the rubric scores. The framework will return Insufficient Information flags on multiple components, and the resulting composite score is a low-confidence signal. We recommend reading dimension scores rather than the composite for short formats.

  • Niche subfields and emerging methods.

    Calibration anchors are field-specific (OECD Fields of Science), but within fields, some subfields have less published baseline material. In these cases, the calibration may default to a broader field anchor, which can systematically under- or over-rate work depending on subfield norms. Subfield-level expert calibration panels (planned) are intended to close this gap.

  • Pre-registered reports.

    When a paper is a Registered Report (where the methodology is reviewed and accepted before data collection), some Craft components - specifically design pre-specification - are partially redundant with the registration itself. Nabu treats these papers like any other on the rubric, which can produce slightly higher scores than for non-registered work of equivalent quality. We are evaluating whether to surface registration status as a separate signal rather than absorb it into the score.

Reliability and validation results in Section 3 should be read with these scope conditions in mind.

5. Commitments

  • Open methodology.

    The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.

  • Full blinding.

    No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.

  • No conflicts of interest.

    Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent: there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.

  • Venue-independent.

    The same rubric everywhere. A paper is a paper.

  • Living evaluations.

    Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time.

  • DORA and CoARA alignment.

    Paper-level, methodology-based, venue-independent. The framework operationalises what 3,000+ signatory institutions committed to.
