Proof · data, computation, accuracy

What we can defend with real data

The MESSAI platform’s evidence for prediction accuracy, computational depth, and a curated data corpus — sourced from on-disk training artifacts and live database counts. Numbers refresh per request; every tile cites its source file. Where the evidence is thin or missing, the honest-gaps section says so — nothing is silently hidden.

Out-of-sample coverage

97.98%

95% CI · 940 held-out obs

Expected calibration error

1.96%

overall · calibration.json

Measured parameter values

195,846

live · ExtractedParameterData

Research papers

23,480

live · ResearchPaper

Canonical parameters

706

live · ParameterClassification

DAG edges

2,812

live · ParameterEdge

All numbers above are read at request time from apps/web/public/data/computed/research/ + Prisma. Live counts grow across sessions; snapshot fallbacks render when the DB is unreachable.

§1 Prediction accuracy

Out-of-sample 95% credible-interval coverage

Out-of-sample prediction accuracy

7 strata · 940 held-out observations · cohort holdout-2026-05-21 · generated 2026-05-21

generalizes

OOS coverage

97.98%

vs 95% target

In-sample coverage

95.85%

IS − OOS gap: -2.13 pp (lower is better)

Expected calibration error

1.96%

overall · calibration.json

Holdout strata

7

940 observations

Per-stratum coverage (sorted by n)

Parameter | System | n | Inside | OOS coverage | IS − OOS (pp)
power_density_areal | MFC | 234 | 229 | 97.86% | -0.43
cod_removal | MFC | 220 | 214 | 97.27% | -3.18
current_density_areal | MFC | 163 | 159 | 97.55% | -0.61
coulombic_efficiency | MFC | 135 | 135 | 100.00% | -7.41
current_density_areal | MEC | 69 | 69 | 100.00% | -1.45
coulombic_efficiency | MEC | 64 | 64 | 100.00% | +0.00
cod_removal | MEC | 55 | 51 | 92.73% | +0.00

Negative IS − OOS gap means the model generalizes better than its fit cohort would suggest — the holdout papers happened to be slightly easier than average. A large positive gap would indicate overfitting; none of the 7 strata show that.
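The headline coverage can be re-derived from the per-stratum counts. The following is a minimal sketch, assuming the table columns are n (held-out observations) and Inside (observations falling within the 95% credible interval); the function names are illustrative and not part of the MESSAI codebase.

```python
# Per-stratum holdout counts copied from the table above:
# (parameter, system, n, inside).
STRATA = [
    ("power_density_areal", "MFC", 234, 229),
    ("cod_removal", "MFC", 220, 214),
    ("current_density_areal", "MFC", 163, 159),
    ("coulombic_efficiency", "MFC", 135, 135),
    ("current_density_areal", "MEC", 69, 69),
    ("coulombic_efficiency", "MEC", 64, 64),
    ("cod_removal", "MEC", 55, 51),
]


def coverage_pct(inside: int, n: int) -> float:
    """Share of held-out observations inside the interval, in percent."""
    return 100.0 * inside / n


def pooled_coverage(strata):
    """Pool counts across strata (weights each observation equally)."""
    total_n = sum(n for _, _, n, _ in strata)
    total_inside = sum(inside for _, _, _, inside in strata)
    return total_n, total_inside, coverage_pct(total_inside, total_n)


total_n, total_inside, pooled = pooled_coverage(STRATA)
print(total_n, total_inside, round(pooled, 2))
```

Pooling by observation count means the large MFC strata dominate the headline figure, so the per-stratum rows remain the better place to spot a weak slice such as cod_removal on MEC.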

§2 Computational moat

Models, physics, and discovered laws

GP-SCM — Gaussian Process Structural Causal Model

Multi-output GP with learned coregionalization kernel. Supports inversion (target → design parameters), Pareto front, and global sensitivity.

in production

Trained artifact

508 KB

gp-scm-fitted-named-db-2026-05-19-v3.pkl

Kernel

ICM

Intrinsic Coregionalization

Training data

195,846

from 9,511 papers

Trained at

2026-05-19

v3, latest of 6 snapshots
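For readers unfamiliar with the Intrinsic Coregionalization Model, the sketch below builds an ICM covariance in NumPy. It assumes the common parametrization B = W Wᵀ + diag(κ) for the output covariance and a squared-exponential input kernel; neither choice is confirmed by the trained artifact, and all names here are illustrative.

```python
import numpy as np


def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel between two sets of input rows."""
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def coregionalization_matrix(W, kappa):
    """Output covariance B = W W^T + diag(kappa), positive semi-definite."""
    return W @ W.T + np.diag(kappa)


def icm_kernel(X1, X2, W, kappa, lengthscale=1.0):
    """ICM covariance: Kronecker product of B (outputs) and k (inputs)."""
    B = coregionalization_matrix(W, kappa)
    return np.kron(B, rbf_kernel(X1, X2, lengthscale))


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))      # 4 design points, 2 input features
W = rng.normal(size=(3, 1))      # 3 outputs, rank-1 coupling
kappa = np.full(3, 0.1)          # per-output independent variance
K = icm_kernel(X, X, W, kappa)   # joint covariance, shape (12, 12)
```

The single shared input kernel is what makes the model "intrinsic": outputs differ only through B, which is why a fit on one output can inform the others.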

Physics constraint laws — 8 encoded

Pure-function residual checks anchored to Logan et al., Newman & Thomas-Alyea, and the Faraday/Nernst foundations. Used to gate every extracted row for physical plausibility.

Law | Formula | Source
Power identity | P = V · I | Logan et al.
Coulombic efficiency | CE = ∫I dt / Q_substrate | Logan 2008
Max power theorem | P_max = V_oc² / (4 · R_int) | Thevenin
Mass-transfer diffusion limits | j_lim ∝ D · C / δ | Newman & Thomas-Alyea
COD electron-accounting | 8 mol e⁻ / mol COD | Sleutels 2012
Faraday's law | n_H₂ = Q / (2F) | Faraday
Nernst equation | E = E° − (RT/nF) ln Q | Nernst
pH-dependent electrode potential | −59 mV / pH-unit @ 25°C | Logan 2008 (DOI 10.1021/es801553e)
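A few of the laws above can be written as pure functions in the spirit of the residual checks described. This is a hedged sketch only: the function names and the 5% tolerance are assumptions for illustration, not the platform's actual gate.

```python
import math

F = 96485.33212   # Faraday constant, C/mol
R = 8.314462618   # molar gas constant, J/(mol K)


def power_identity_residual(p_w, v_v, i_a):
    """Relative residual of P = V * I for a reported (P, V, I) triple."""
    expected = v_v * i_a
    return abs(p_w - expected) / abs(expected)


def max_power_w(v_oc_v, r_int_ohm):
    """Maximum power transfer: P_max = V_oc^2 / (4 * R_int)."""
    return v_oc_v**2 / (4.0 * r_int_ohm)


def hydrogen_moles(charge_c):
    """Faraday's law for H2 (2 electrons per molecule): n = Q / (2F)."""
    return charge_c / (2.0 * F)


def nernst_potential_v(e0_v, n, q, temp_k=298.15):
    """Nernst equation: E = E0 - (RT / nF) * ln(Q)."""
    return e0_v - (R * temp_k / (n * F)) * math.log(q)


def nernst_slope_mv_per_ph(temp_k=298.15):
    """pH slope for a reaction with equal proton and electron counts."""
    return -1000.0 * math.log(10) * R * temp_k / F


def is_plausible(residual, tol=0.05):
    """Hypothetical gate: accept rows whose residual is within tol."""
    return residual <= tol
```

For example, `is_plausible(power_identity_residual(0.010, 0.5, 0.02))` accepts a row whose reported power matches V · I, while a row reporting twice that power would be rejected.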

Symbolic regression — PySR (Cranmer 2023)

Genetic-programming search for closed-form mechanistic laws. 2 of 5 fits succeeded · 18.7s wall time

40% success

Discovered laws

  • powerDensity ~ f(currentDensity) · MFC · n=51 · complexity=11
    (x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749

    Logan 2008 §3.4: P = V·I. At matched load V≈V_oc/2 ≈ const → P ∝ I

  • coulombic_efficiency ~ f(cod_removal) · MFC · n=29 · complexity=13
    ((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836

    Sleutels 2012 §4: at high COD-removal, more substrate goes to biomass (not e-) → CE drops
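The discovered expressions can be evaluated directly. The sketch below transcribes both as Python functions and adds an algebraically simplified form of the first one, which shows it is x0 plus a small rational correction and an offset. The units and any scaling of x0 are not stated on this page, so the inputs used here are purely illustrative.

```python
def power_density_law(x0):
    """PySR expression for powerDensity ~ f(currentDensity), as printed."""
    return (x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749


def power_density_law_simplified(x0):
    """Same expression rearranged: x0 - a / (x0 + b) - c."""
    a = 0.22200376 / 1.8066067
    return x0 - a / (x0 + 2.3158035) - 0.6148749


def coulombic_efficiency_law(x0):
    """PySR expression for coulombic_efficiency ~ f(cod_removal), as printed."""
    return ((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836


# Local slope of the first law at large x0: the correction term fades,
# leaving a near-unit slope, consistent with the P ∝ I rationale.
slope_at_100 = power_density_law(101.0) - power_density_law(100.0)
print(round(slope_at_100, 4))
```

Both expressions have a pole at a negative x0 (where a denominator vanishes), so they should only be evaluated inside the range the fit was trained on.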

Skipped fits (3) — honest reporting

  • internalResistance ~ f(electrolyte_conductivity): insufficient_data (n=1)
  • specific_h2_production ~ f(currentDensity): insufficient_data (n=2)
  • powerDensity ~ f(currentDensity, internalResistance): insufficient_data (n=15)

Learned causal DAG — Bayesian network discovery

HillClimbSearch with BIC scoring (pgmpy) on 173 papers across 9 features · v1-pgmpy-hillclimb-bic-2026-05-10

Novel edges learned (1)

  • powerDensity → currentDensity

Handcrafted edges NOT confirmed by data (14)

  • appliedPotential → currentDensity
  • biofilmThickness → massTransportLimitation
  • coulombic_efficiency → energy_efficiency
  • coulombic_efficiency → h2_yield
  • coulombic_efficiency → specific_h2_production
  • electrodeMaterial → specificSurfaceArea
  • exchangeCurrentDensity → chargeTransferResistance
  • flowRate → biofilmThickness
  • ...and 6 more
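The two lists above amount to a set comparison between the learned and the handcrafted edge sets. A minimal sketch follows, assuming each edge is a (source, target) pair with the first-named parameter as the source; that direction is an assumption, and the function name is illustrative.

```python
def compare_edges(learned, handcrafted):
    """Split edges into novel, confirmed, and unconfirmed groups."""
    learned, handcrafted = set(learned), set(handcrafted)
    return {
        "novel": sorted(learned - handcrafted),        # found by search only
        "confirmed": sorted(learned & handcrafted),    # present in both
        "unconfirmed": sorted(handcrafted - learned),  # expert-only edges
    }


# A small subset of the edges listed above, for illustration.
learned_edges = [("powerDensity", "currentDensity")]
handcrafted_edges = [
    ("appliedPotential", "currentDensity"),
    ("flowRate", "biofilmThickness"),
]

report = compare_edges(learned_edges, handcrafted_edges)
print(report)
```

An unconfirmed edge is not evidence that the edge is wrong: with 173 papers and 9 features, a BIC-penalized search will drop many real but weakly supported dependencies.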

§3 Data moat

Curated corpus, taxonomies, and provenance

Data corpus and taxonomies

Database counts are read live from Supabase Postgres; curated taxonomy counts come from catalog snapshots.

live

Proprietary corpus (live from DB)

Research papers

23,480

ResearchPaper

Extracted parameter values

195,846

ExtractedParameterData

Canonical parameters

706

ParameterClassification

DAG edges

2,812

ParameterEdge

Curated taxonomies (snapshot — open-source/mess-*)

Microbe species + priors

28

mess-microbes seed v1.3

Genus electroactivity priors

28

EET pathway × inoculum source

Electrode catalog entries

7

Pourbaix · conductivity · cost

Materials Project entries

160

DFT-backed, cross-referenced

What makes this a moat

  • Every extracted value carries provenance: derivationMethod, uncertaintyPlus/Minus/Type, confidence, snippet, conditions (JSONB), canonical (JSONB).
  • Parameter DAG is DB-as-source-of-truth with FK constraints on slug identifiers (post 2026-05-09); no JSON drift.
  • Materials catalog bridges DFT (Materials Project API) + lab electrochemistry + Pourbaix system-class data; it is hand-curated, not bulk-imported.
  • Two-tier extraction with provider fallback (Anthropic Gateway / Gemini / Groq) + schema-tolerance shim for non-Anthropic providers.
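The provider-fallback idea in the last bullet can be illustrated with stub providers. This is a sketch under stated assumptions: the required keys, the alias map, and the function names are hypothetical, and no real provider SDK is called.

```python
REQUIRED_KEYS = ("parameter", "value", "unit")  # hypothetical minimal schema
KEY_ALIASES = {"name": "parameter", "param": "parameter", "units": "unit"}


def normalize(raw):
    """Tolerance shim: map common key variants, return None if unusable."""
    row = {KEY_ALIASES.get(key, key): val for key, val in raw.items()}
    if all(key in row for key in REQUIRED_KEYS):
        return {key: row[key] for key in REQUIRED_KEYS}
    return None


def extract_with_fallback(text, providers):
    """Try each (name, callable) in order; return the first usable row."""
    errors = []
    for name, call in providers:
        try:
            row = normalize(call(text))
        except Exception as exc:  # provider throttled, timed out, etc.
            errors.append(f"{name}: {exc}")
            continue
        if row is not None:
            return name, row
        errors.append(f"{name}: schema mismatch")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def throttled_primary(text):
    raise TimeoutError("rate limited")


def secondary(text):
    # Returns non-canonical keys that the shim is expected to repair.
    return {"name": "cod_removal", "value": 85.0, "units": "%"}


used, row = extract_with_fallback(
    "example passage", [("primary", throttled_primary), ("secondary", secondary)]
)
```

Returning the provider name alongside the row keeps fallback usage auditable, which matters when a fallback provider is known to classify differently.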

What is curation, not ownership

  • ~35 MB of Zenodo / FigShare / Recherche Data Gouv metadata in mess-datasets-catalog is cross-linked & classified, but the underlying datasets are public.
  • Public datasets count as a curation moat (consolidation + cross-classification effort) but not a data moat (we don’t own the underlying measurements).
  • Research papers themselves are public; the moat is the extraction → DB pipeline with verifier sign-off, archive-never-delete, and per-row reproducibility scoring.

§4 Trust map

Per-parameter Bayesian prior states

Parameter trust map

102 unique parameter slugs · 268 stratum-level states (pooled + by system_type + by application_domain) · v2-trust-2026-05-16 · fitted 2026-05-16

live snapshot

Calibrated

12

4.5% of strata · converged + LOO reliable

Modeled

69

25.7% of strata · converged, weak LOO

Curated

178

66.4% of strata · literature prior, low-n

Flagged

9

3.4% of strata · anomalous posterior p

Breakdown by axis

Axis | Calibrated | Modeled | Curated | Flagged
Pooled (across systems) | 8 | 20 | 71 | 3
By system_type | 3 | 36 | 75 | 4
By application_domain | 1 | 13 | 32 | 2

Each parameter has up to 3 stratification axes; a parameter that converges with reliable LOO at the pooled level might still be ‘curated’ at narrower system_type slices where n is small. This is honest stratification — the headline number above counts every stratum-level state, not just the best ones.
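The headline shares follow from the per-axis counts. The sketch below re-derives them, reading the breakdown table as four state counts per axis; the dictionary layout and function names are illustrative only.

```python
# State counts per stratification axis, read from the breakdown table.
AXES = {
    "pooled": {"calibrated": 8, "modeled": 20, "curated": 71, "flagged": 3},
    "system_type": {"calibrated": 3, "modeled": 36, "curated": 75, "flagged": 4},
    "application_domain": {"calibrated": 1, "modeled": 13, "curated": 32, "flagged": 2},
}


def totals_by_state(axes):
    """Sum each trust state across all axes."""
    totals = {}
    for counts in axes.values():
        for state, count in counts.items():
            totals[state] = totals.get(state, 0) + count
    return totals


def shares_pct(totals):
    """Each state's share of all stratum-level states, in percent."""
    grand_total = sum(totals.values())
    return {state: round(100.0 * n / grand_total, 1) for state, n in totals.items()}


totals = totals_by_state(AXES)
shares = shares_pct(totals)
print(totals, shares)
```

The pooled row alone sums to 102, matching the count of unique parameter slugs, while the three rows together give the 268 stratum-level states.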

§5 Logan retrospective backtest

25 peer-reviewed papers, encoded and re-predicted

Logan retrospective — literature backtest

25 peer-reviewed papers encoded as LoganPaperPreset objects with DOI, figure reference, reported values, units, and caveats. Re-predicted at each paper’s exact experimental configuration; percent-delta vs reported values classified by Logan & Regan 2006 lab-to-lab band.

93% green (MFC)

Encoded papers

25

apps/lab/src/lib/sweep/presets/*.ts

MFC green band

13/14

|Δ| ≤ 25% · 92.9%

Median |Δ|

6.7%

across MFC presets

Anchor tests passing

10/10

100% · Jest CI gate

MFC band distribution (Logan & Regan 2006 lab-to-lab thresholds)

Green (|Δ| ≤ 25%) · 13 / 14 · 92.9%

Within lab-to-lab reproducibility

Amber (25% < |Δ| ≤ 50%) · 1 / 14 · 7.1%

Investigate; may indicate missing physics term

Red (|Δ| > 50%) · 0 / 14 · 0.0%

Genuine model gap
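The band thresholds above translate into a short classifier. This is a minimal sketch, assuming percent-delta is computed relative to the reported value; the function names are illustrative and not taken from the preset test files.

```python
def percent_delta(predicted, reported):
    """Signed percent difference of the prediction from the reported value."""
    return 100.0 * (predicted - reported) / reported


def classify_band(delta_pct):
    """Map a percent-delta to the green / amber / red bands."""
    magnitude = abs(delta_pct)
    if magnitude <= 25.0:
        return "green"
    if magnitude <= 50.0:
        return "amber"
    return "red"


# Illustrative values only, not taken from any encoded preset.
print(classify_band(percent_delta(75.0, 100.0)))   # boundary case, -25%
print(classify_band(percent_delta(130.0, 100.0)))  # roughly +30%
```

Using the absolute value treats over- and under-prediction symmetrically; a signed summary would be needed to detect systematic bias.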

Source files

  • apps/lab/src/lib/sweep/presets/ (25 preset files)
  • apps/lab/src/lib/sweep/__tests__/backtest-presets.test.ts
  • apps/lab/src/lib/sweep/__tests__/butler-volmer-calibration.test.ts

Anchor tests assert 10 hard-coded numerical ranges from canonical Logan-group papers (Liu-Logan 2004 wastewater/PEM/acetate, Cheng 2006 carbon-cloth, Cheng-Logan 2007 brush anode, plus 5 more). All 10 pass the Jest regression gate pre-release. Snapshot dated 2026-05-21; auto-refresh would require a scripts/research/snapshot-backtest.ts helper (planned follow-up).

§6 Honest gaps

What we don’t yet have evidence for

Gap · cov-power-density · 1,285%

Power density CoV ~1,285%

Why & mitigation

Power-density measurements span ~4 orders of magnitude across the literature; point estimates are not meaningful without conditioning on system class, scale, and electrode geometry.

Source

open-source/mess-parameters/data/SCIENTIFIC_INTEGRITY.md (Rule §3)

Mitigation

Predictive intervals are reported per-stratum (system_class × application_domain) via hierarchical Bayesian priors; point estimates only render in the lab UI when n ≥ 5 in the stratum.
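The two ideas in this gap, a coefficient of variation inflated by a wide spread and the n ≥ 5 rule for showing point estimates, can be sketched as follows. The sample values are invented for illustration and are not MESSAI data; using the median as the point estimate is also an assumption.

```python
import statistics


def coefficient_of_variation_pct(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def point_estimate(values, min_n=5):
    """Return a point estimate only when the stratum has at least min_n rows."""
    if len(values) < min_n:
        return None  # too few observations: show the interval only
    return statistics.median(values)


# Illustrative measurements spanning about four orders of magnitude.
wide_spread = [0.5, 2.0, 8.0, 40.0, 300.0, 5000.0]

cov = coefficient_of_variation_pct(wide_spread)
print(round(cov), point_estimate(wide_spread), point_estimate([1.0, 2.0, 3.0]))
```

Even six values with this spread push the coefficient of variation well past 100%, which is why an unconditioned mean says little about any single system.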

Gap · canonical-coverage · 19%

Canonical-slug coverage 19% (raw)

Why & mitigation

Of all raw parameter names in extracted data, 19% map directly to a canonical slug. The remaining 81% are aliases that need DAG nodes added (706 canonicals today; ~58 high-frequency aliases unmapped).

Source

CLAUDE.local.md → "Open v2 followups" items 1 + 2

Mitigation

After refresh-all.sh runs, the mapped rate rises to ~68%; work continues in open-source/mess-parameters/scripts/sync-from-database.js.

Gap · extraction-success · 75%

v2 extraction success rate ~75%

Why & mitigation

Of papers fed to the v2 extractor (simple_value_extractor.ts), ~75% produce a usable values_v2.json. The remaining 25% fail on schema-tolerance for non-Anthropic providers, malformed PDFs, or paywalls.

Source

CLAUDE.local.md → "Other problems" section

Mitigation

Schema-tolerance shim landed 2026-05-09 (commit 3d215187b); BioC-PMC XML path is the next planned mitigation.

Gap · gemini-other-regression · 52%

system_class=OTHER 52% under Gemini fallback

Why & mitigation

When the Gateway is throttled and Gemini-2.5-flash takes over, system_class classification regresses from ~14% OTHER (Haiku) to ~52% OTHER on the first 21 batch5 papers. Several papers with "microbial fuel cell" in the title land in OTHER.

Source

docs/extraction/provider-fallback-2026-05-09.md

Mitigation

Under evaluation: inline prompt enumeration, a Tier-1 regex strawman, and switching to gemini-2.5-pro. The paper_class field remains reliable on Gemini.

Gap · multiclass-validation · 2 of 7

Bayesian holdout validation: MFC + MEC only

Why & mitigation

The 97.98% OOS-coverage number covers 7 strata across MFC + MEC. MDC, MES, MNRC, MMRC, MBES analytical predictors landed 2026-05-15 but have not been validated against held-out paper cohorts.

Source

libs/shared/electrochemistry/src/predictors/{mdc,mes,mnrc,mmrc,mbes}.ts + holdout-coverage-2026-05-21.json

Mitigation

Multi-class holdout cohort assembly is the next planned validation workstream.

Gap · bioc-pmc-unconsumed · 1,400+

1,400+ BioC-PMC XML files unconsumed

Why & mitigation

paperscraper accumulated 1,400+ BioC-PMC XML files alongside PDFs. Neither v1 nor v2 extractors consume XML today — these are ground-truth full-text that would feed v2 with higher accuracy than PDF + Nougat.

Source

CLAUDE.local.md → "Known gaps" → "BioC-PMC XML extraction path"

Mitigation

Either write an XML-aware extractor or feed XML to v2 with a different system prompt — sized but not scheduled.

Gap · ml-predict-mfc-surrogate · 1 of 17

/api/ml/predict still uses MFC surrogate

Why & mitigation

The ML inference path defaults to predictMFC regardless of the input system_class. Phase 2 per-class analytical predictors landed in the lab UI 2026-05-15 but the /api/ml/predict endpoint has not been routed to dispatch per system class yet.

Source

docs/system-class-aware-predictor-architecture.md

Mitigation

Per-class ML routing is tracked as a separate workstream in the system-class-aware-predictor doc.
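Per-class routing of the kind described here is typically a lookup from system class to predictor with an explicit fallback. The sketch below uses stub predictors; the registry, the function names, and the fell_back flag are hypothetical and do not reflect the design in the architecture doc.

```python
def make_stub_predictor(system_class):
    """Build a placeholder predictor that only reports which class ran."""
    def predict(inputs):
        return {"predictor": system_class, "inputs": inputs}
    return predict


# Hypothetical registry covering the classes named on this page.
PREDICTORS = {
    name: make_stub_predictor(name)
    for name in ("MFC", "MEC", "MDC", "MES", "MNRC", "MMRC", "MBES")
}


def dispatch_predict(inputs, predictors=PREDICTORS, default="MFC"):
    """Route to the predictor for inputs['system_class'], else the default."""
    requested = str(inputs.get("system_class", "")).upper()
    used = requested if requested in predictors else default
    result = predictors[used](inputs)
    # Surface the fallback so callers can tell a routed result from a default.
    result["fell_back"] = used != requested
    return result


routed = dispatch_predict({"system_class": "mec"})
defaulted = dispatch_predict({"system_class": "UNKNOWN"})
```

Flagging the fallback in the response avoids the silent behavior described in this gap, where every request is answered by the MFC predictor without the caller knowing.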

Gap · symbolic-regression-coverage · 2/5

Symbolic regression: 2 of 5 fits succeeded

Why & mitigation

PySR run on 2026-05-10 attempted 5 mechanistic fits; 2 succeeded (powerDensity↔currentDensity, coulombic_efficiency↔cod_removal); 3 were skipped for insufficient_data (internalResistance, specific_h2_production, P from current+R).

Source

apps/web/public/data/computed/research/symbolic-regression-laws.json

Mitigation

Re-run PySR after extraction backfills bring per-pair n above the data-sufficiency threshold (currently many pairs at n < 5).

What we don’t claim

We do not claim ownership of the public datasets we cross-link (Zenodo, FigShare, Recherche Data Gouv) — that’s curation, not data ownership. We do not claim field-deployment validation — all accuracy numbers above are against peer-reviewed lab/pilot literature, not in-situ deployments. We do not claim that the per-class predictors (MDC / MES / MNRC / MMRC / MBES) are Bayesian-validated — only MFC and MEC are in the 7-stratum holdout cohort today.

Source pipeline: services/ml-engine/ trainer publishes artifacts to apps/web/public/data/computed/research/; live counts come from the Supabase PostgreSQL DB via @messai/database/server. See /insights for the corpus-level rollup that backs these numbers.