Proof · data, computation, accuracy

What we can defend with real data

The MESSAI platform’s evidence for prediction accuracy, computational depth, and a curated data corpus — sourced from on-disk training artifacts and live database counts. Numbers refresh per request; every tile cites its source file. Where the evidence is thin or missing, the honest-gaps section says so — nothing is silently hidden.

Out-of-sample coverage

97.98%

95% credible interval · 940 held-out obs

Expected calibration error

1.96%

overall · calibration.json

Measured parameter values

196,522

live · ExtractedParameterData

Research papers

23,569

live · ResearchPaper

Canonical parameters

706

live · ParameterClassification

DAG edges

2,812

live · ParameterEdge

All numbers above are read at request time from apps/web/public/data/computed/research/ + Prisma. Live counts grow across sessions; snapshot fallbacks render when the DB is unreachable.
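The live-then-snapshot behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: `count_with_fallback`, `fetch_live_count`, `SNAPSHOT_COUNTS`, and `unreachable_db` are hypothetical names, the snapshot values simply reuse the figures shown on this page, and the live value 23,600 is a made-up stand-in.

```python
from typing import Callable

# Hypothetical snapshot; values copied from the tiles on this page.
SNAPSHOT_COUNTS = {
    "ResearchPaper": 23_569,
    "ExtractedParameterData": 196_522,
    "ParameterClassification": 706,
    "ParameterEdge": 2_812,
}


def count_with_fallback(
    table: str, fetch_live_count: Callable[[str], int]
) -> tuple[int, str]:
    """Return (count, source): the live count, or the snapshot if the query fails."""
    try:
        return fetch_live_count(table), "live"
    except Exception:
        # DB unreachable (or query failed): fall back to the last snapshot.
        return SNAPSHOT_COUNTS[table], "snapshot"


def unreachable_db(table: str) -> int:
    """Stand-in for a database client whose connection fails."""
    raise ConnectionError("database unreachable")


# Made-up live count, used only to show the "live" branch.
live = count_with_fallback("ResearchPaper", lambda table: 23_600)
fallback = count_with_fallback("ResearchPaper", unreachable_db)
print(live, fallback)
```

Returning the source label alongside the count is one way to let each tile say whether it is showing a live or a snapshot value.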

§1 Prediction accuracy

Out-of-sample 95% credible-interval coverage

Out-of-sample prediction accuracy

7 strata · 940 held-out observations · cohort holdout-2026-05-21 · generated 2026-05-21

generalizes

OOS coverage

97.98%

vs 95% target

In-sample coverage

95.85%

-2.13 pp gap (lower is better)

Expected calibration error

1.96%

calibration.json

Holdout strata

940 observations

Per-stratum coverage (sorted by n)

Parameter	System	n	Inside	OOS coverage	IS − OOS (pp)
power_density_areal	MFC	234	229	97.86%	-0.43
cod_removal	MFC	220	214	97.27%	-3.18
current_density_areal	MFC	163	159	97.55%	-0.61
coulombic_efficiency	MFC	135	135	100.00%	-7.41
current_density_areal	MEC	69	69	100.00%	-1.45
coulombic_efficiency	MEC	64	64	100.00%	+0.00
cod_removal	MEC	55	51	92.73%	+0.00

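A negative IS − OOS gap means out-of-sample coverage exceeded in-sample coverage, so the model generalizes better than its fit cohort would suggest; the holdout papers happened to be slightly easier than average. A large positive gap would indicate overfitting, and none of the 7 strata show one. One stratum, cod_removal (MEC), sits below the 95% target at 92.73% (51 of 55).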
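The headline coverage and the IS − OOS gap can be reproduced from the per-stratum table with a few lines of arithmetic. The sketch below only re-derives numbers already shown on this page; the variable names are illustrative.

```python
# (parameter, system, n, inside) rows copied from the per-stratum table.
STRATA = [
    ("power_density_areal", "MFC", 234, 229),
    ("cod_removal", "MFC", 220, 214),
    ("current_density_areal", "MFC", 163, 159),
    ("coulombic_efficiency", "MFC", 135, 135),
    ("current_density_areal", "MEC", 69, 69),
    ("coulombic_efficiency", "MEC", 64, 64),
    ("cod_removal", "MEC", 55, 51),
]

total_n = sum(n for _, _, n, _ in STRATA)
total_inside = sum(inside for _, _, _, inside in STRATA)
oos_coverage = 100 * total_inside / total_n   # pooled OOS coverage, percent

in_sample_coverage = 95.85                    # reported on this page
gap_pp = in_sample_coverage - oos_coverage    # IS - OOS, percentage points

# Strata whose own coverage falls below the 95% target.
below_target = [
    (param, system) for param, system, n, inside in STRATA
    if 100 * inside / n < 95.0
]

print(total_n, total_inside, round(oos_coverage, 2), round(gap_pp, 2))
print(below_target)
```

Pooling by observation count (921 of 940) rather than averaging the seven stratum percentages is what yields the 97.98% figure.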

§2 Computational moat

Models, physics, and discovered laws

GP-SCM — Gaussian Process Structural Causal Model

Multi-output GP with a learned coregionalization kernel. Supports inversion (target → design parameters), Pareto-front computation, and global sensitivity analysis.

in production

Trained artifact

508 KB

gp-scm-fitted-named-db-2026-05-19-v3.pkl

Kernel

ICM

Intrinsic Coregionalization

Training data

195,846

from 9,511 papers

Trained at

2026-05-19

v3, latest of 6 snapshots

Physics constraint laws — 8 encoded

Pure-function residual checks anchored to Logan et al., Newman & Thomas-Alyea, and the Faraday/Nernst foundations. Used to gate every extracted row for physical plausibility.

Law	Formula	Source
Power identity	P = V · I	Logan et al.
Coulombic efficiency	CE = ∫I dt / Q_substrate	Logan 2008
Max power theorem	P_max = V_oc² / (4 · R_int)	Thevenin
Mass-transfer diffusion limits	j_lim ∝ D · C / δ	Newman & Thomas-Alyea
COD electron-accounting	8 g COD / mol e⁻ (4 mol e⁻ / mol O₂)	Sleutels 2012
Faraday's law	n_H₂ = Q / (2F)	Faraday
Nernst equation	E = E° − (RT/nF) ln Q	Nernst
pH-dependent electrode potential	−59 mV / pH-unit @ 25°C	Logan 2008 (DOI 10.1021/es801553e)
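A few of the laws in this table can be written as pure-function checks. The sketch below is a minimal illustration, not the platform's code: the function names, the `is_plausible` gate, and the 5% tolerance are assumptions, and only four of the eight laws are shown.

```python
import math

F = 96485.33212      # Faraday constant, C/mol
R = 8.314462618      # gas constant, J/(mol*K)


def power_identity_residual(p_w: float, v: float, i_a: float) -> float:
    """Relative residual of P = V * I."""
    return abs(p_w - v * i_a) / max(abs(p_w), 1e-12)


def max_power_w(v_oc: float, r_int_ohm: float) -> float:
    """Maximum power transfer: P_max = V_oc^2 / (4 * R_int)."""
    return v_oc ** 2 / (4.0 * r_int_ohm)


def h2_moles(charge_c: float) -> float:
    """Faraday's law for hydrogen: n_H2 = Q / (2F)."""
    return charge_c / (2.0 * F)


def nernst_slope_v_per_ph(temp_k: float = 298.15) -> float:
    """Magnitude of the pH slope for an equal proton/electron couple: RT ln(10) / F."""
    return R * temp_k * math.log(10) / F


def is_plausible(p_w, v, i_a, v_oc, r_int_ohm, tol=0.05) -> bool:
    """Hypothetical gate: power must match V*I and not exceed P_max (within tol)."""
    return (power_identity_residual(p_w, v, i_a) <= tol
            and p_w <= max_power_w(v_oc, r_int_ohm) * (1 + tol))


print(round(nernst_slope_v_per_ph() * 1000, 1))   # mV per pH unit at 25 °C
print(is_plausible(p_w=0.0015, v=0.3, i_a=0.005, v_oc=0.8, r_int_ohm=100.0))
```

Keeping each law as a side-effect-free function makes it straightforward to run every check against a row and report which residual failed.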

Symbolic regression — PySR (Cranmer 2023)

Genetic-programming search for closed-form mechanistic laws. 2 of 5 fits succeeded · 18.7s wall time

40% success

Discovered laws

powerDensity ~ f(currentDensity) · MFC · n=51 · complexity=11
(x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749
Logan 2008 §3.4: P = V·I. At matched load V≈V_oc/2 ≈ const → P ∝ I
coulombic_efficiency ~ f(cod_removal) · MFC · n=29 · complexity=13
((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836
Sleutels 2012 §4: at high COD-removal, more substrate goes to biomass (not e-) → CE drops
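The two expressions above can be evaluated directly as functions of `x0`. The sketch below copies them verbatim; the function names are illustrative, and because this page does not state the units or scaling of `x0`, the inputs used here are arbitrary numbers chosen only to show the shape of the fits.

```python
def power_density_law(x0: float) -> float:
    """powerDensity ~ f(currentDensity), MFC, n=51, as reported by PySR."""
    return (x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749


def coulombic_efficiency_law(x0: float) -> float:
    """coulombic_efficiency ~ f(cod_removal), MFC, n=29, as reported by PySR."""
    return ((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836


# For large x0 the correction term in the first law vanishes, so the fit
# behaves like x0 minus a constant: a unit slope, in line with P ∝ I.
slope = power_density_law(101.0) - power_density_law(100.0)
print(round(slope, 4))
print(coulombic_efficiency_law(1.0))
```

Checking the limiting slope is a quick way to compare a discovered expression against the mechanistic annotation attached to it.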

Skipped fits (3) — honest reporting

· internalResistance ~ f(electrolyte_conductivity) — insufficient_data (n=1)
· specific_h2_production ~ f(currentDensity) — insufficient_data (n=2)
· powerDensity ~ f(currentDensity, internalResistance) — insufficient_data (n=15)

Learned causal DAG — Bayesian network discovery

HillClimbSearch + BICScore on 173 papers across 9 features · v1-pgmpy-hillclimb-bic-2026-05-10

Novel edges learned (1)

· powerDensity → currentDensity

Handcrafted edges NOT confirmed by data (14)

· appliedPotential → currentDensity
· biofilmThickness → massTransportLimitation
· coulombic_efficiency → energy_efficiency
· coulombic_efficiency → h2_yield
· coulombic_efficiency → specific_h2_production
· electrodeMaterial → specificSurfaceArea
· exchangeCurrentDensity → chargeTransferResistance
· flowRate → biofilmThickness
...and 6 more
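The "novel" and "not confirmed" lists above are set differences between the learned and the handcrafted edge sets. The sketch below shows that comparison on a small illustrative subset of the edges listed here; it is not the full graph and the variable names are assumptions.

```python
# Illustrative edge sets only; the real graphs have more edges than shown.
handcrafted = {
    ("appliedPotential", "currentDensity"),
    ("flowRate", "biofilmThickness"),
    ("coulombic_efficiency", "h2_yield"),
}
learned = {
    ("powerDensity", "currentDensity"),
}

novel = learned - handcrafted          # learned from data, absent from the prior DAG
unconfirmed = handcrafted - learned    # in the prior DAG, not recovered by the search
confirmed = handcrafted & learned      # present in both

print(sorted(novel))
print(len(unconfirmed), len(confirmed))
```

Edges are compared as ordered (source, target) pairs, so an edge learned in the reverse direction counts as novel rather than confirmed.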

§3 Data moat

Curated corpus, taxonomies, and provenance

Data corpus and taxonomies

Live database counts plus curated catalog snapshots. Corpus counts are read live from Supabase Postgres; taxonomy counts come from the catalog snapshots.

live

Proprietary corpus (live from DB)

Research papers

23,569

ResearchPaper

Extracted parameter values

196,522

ExtractedParameterData

Canonical parameters

706

ParameterClassification

DAG edges

2,812

ParameterEdge

Curated taxonomies (snapshot — open-source/mess-*)

Microbe species + priors

mess-microbes seed v1.3

Genus electroactivity priors

EET pathway × inoculum source

Electrode catalog entries

Pourbaix · conductivity · cost

Materials Project entries

160

DFT-backed, cross-referenced

What makes this a moat

· Every extracted value carries provenance: derivationMethod, uncertaintyPlus/Minus/Type, confidence, snippet, conditions (JSONB), canonical (JSONB).
· Parameter DAG is DB-as-source-of-truth with FK constraints on slug identifiers (post 2026-05-09); no JSON drift.
· Materials catalog bridges DFT (Materials Project API) + lab electrochemistry + Pourbaix system-class data — hand-curated, not bulk-imported.
· Two-tier extraction with provider fallback (Anthropic Gateway / Gemini / Groq) + schema-tolerance shim for non-Anthropic providers.
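The provenance fields named in the first bullet can be pictured as a single record per extracted value. The sketch below is an assumption-laden illustration: the field names come from this page, but the types, the `value` and `unit` fields, the example row, the `reported_in_text` label, and the `has_full_provenance` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExtractedValue:
    """Hypothetical in-memory view of one extracted row and its provenance."""
    value: float
    unit: str
    derivationMethod: str                      # how the value was obtained
    confidence: float                          # assumed to lie in [0, 1]
    snippet: str                               # supporting text from the paper
    uncertaintyPlus: Optional[float] = None
    uncertaintyMinus: Optional[float] = None
    uncertaintyType: Optional[str] = None
    conditions: dict[str, Any] = field(default_factory=dict)   # JSONB
    canonical: dict[str, Any] = field(default_factory=dict)    # JSONB

    def has_full_provenance(self) -> bool:
        """True when the row records a derivation method and a quoted passage."""
        return bool(self.derivationMethod) and bool(self.snippet.strip())


# Made-up example row, for illustration only.
row = ExtractedValue(
    value=1.2, unit="W/m^2", derivationMethod="reported_in_text",
    confidence=0.9, snippet="a maximum power density of 1.2 W/m2 was obtained",
    conditions={"system_class": "MFC"},
)
print(row.has_full_provenance())
```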

What is curation, not ownership

· ~35 MB of Zenodo / FigShare / Recherche Data Gouv metadata in mess-datasets-catalog is cross-linked & classified — but the underlying datasets are public.
· Public datasets count as a curation moat (consolidation + cross-classification effort) but not a data moat (we don’t own the underlying measurements).
· Research papers themselves are public; the moat is the extraction → DB pipeline with verifier sign-off, archive-never-delete, and per-row reproducibility scoring.

§4 Trust map

Per-parameter Bayesian prior states

Parameter trust map

102 unique parameter slugs · 268 stratum-level states (pooled + by system_type + by application_domain) · v2-trust-2026-05-16 · fitted 2026-05-16

live snapshot

Calibrated

12

4.5% of strata · converged + LOO reliable

Modeled

69

25.7% of strata · converged, weak LOO

Curated

178

66.4% of strata · literature prior, low-n

Flagged

9

3.4% of strata · anomalous posterior p

Breakdown by axis

Axis	Calibrated	Modeled	Curated	Flagged
Pooled (across systems)	8	20	71	3
By system_type	3	36	75	4
By application_domain	1	13	32	2

Each parameter has up to 3 stratification axes; a parameter that converges with reliable LOO at the pooled level might still be ‘curated’ at narrower system_type slices where n is small. This is honest stratification — the headline number above counts every stratum-level state, not just the best ones.
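The headline shares follow from summing the breakdown table over all three axes. The sketch below re-derives them from the rows shown above; the variable names are illustrative.

```python
# Rows copied from the "Breakdown by axis" table:
# (calibrated, modeled, curated, flagged)
BREAKDOWN = {
    "pooled": (8, 20, 71, 3),
    "system_type": (3, 36, 75, 4),
    "application_domain": (1, 13, 32, 2),
}
STATES = ("calibrated", "modeled", "curated", "flagged")

# Sum each state across the three axes.
totals = {
    state: sum(row[i] for row in BREAKDOWN.values())
    for i, state in enumerate(STATES)
}
n_strata = sum(totals.values())
shares = {state: round(100 * count / n_strata, 1) for state, count in totals.items()}

print(totals, n_strata)
print(shares)
```

The denominator is the full set of 268 stratum-level states, which is why narrow low-n slices pull the calibrated share down.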

§5 Literature retrospective backtest

25 peer-reviewed papers, encoded and re-predicted

Literature retrospective backtest

25 peer-reviewed papers encoded as LoganPaperPreset objects with DOI, figure reference, reported values, units, and caveats. Re-predicted at each paper’s exact experimental configuration; percent-delta vs reported values classified by Logan & Regan 2006 lab-to-lab band.

93% green (MFC)

Encoded papers

25

apps/lab/src/lib/sweep/presets/*.ts

MFC green band

13/14

|Δ| ≤ 25% · 92.9%

Median |Δ|

6.7%

across MFC presets

Anchor tests passing

10/10

100% · Jest CI gate

MFC band distribution (Logan & Regan 2006 lab-to-lab thresholds)

Green: |Δ| ≤ 25% · 13 / 14 · 92.9%

Within lab-to-lab reproducibility

Amber: 25% < |Δ| ≤ 50% · 1 / 14 · 7.1%

Investigate; may indicate missing physics term

Red: |Δ| > 50% · 0 / 14 · 0.0%

Genuine model gap
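The band thresholds above translate into a small classifier on the percent delta between a re-predicted and a reported value. This is a sketch of that rule only; the function names and the example preset values are hypothetical, and taking the delta relative to the reported value is an assumption.

```python
def percent_delta(predicted: float, reported: float) -> float:
    """Signed percent difference of the prediction relative to the reported value."""
    return 100.0 * (predicted - reported) / reported


def classify_band(delta_pct: float) -> str:
    """Map |delta| to the green / amber / red bands used on this page."""
    magnitude = abs(delta_pct)
    if magnitude <= 25.0:
        return "green"
    if magnitude <= 50.0:
        return "amber"
    return "red"


# Hypothetical preset: reported 1000 mW/m^2, re-predicted 1067 mW/m^2.
delta = percent_delta(predicted=1067.0, reported=1000.0)
print(round(delta, 1), classify_band(delta))
print(classify_band(-30.0), classify_band(80.0))
```

Classifying on the absolute value treats over- and under-prediction symmetrically, matching the |Δ| notation in the band definitions.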

Source files

· apps/lab/src/lib/sweep/presets/ (25 preset files)
· apps/lab/src/lib/sweep/__tests__/backtest-presets.test.ts
· apps/lab/src/lib/sweep/__tests__/butler-volmer-calibration.test.ts

Anchor tests assert 10 hard-coded numerical ranges from canonical peer-reviewed papers (Liu-Logan 2004 wastewater/PEM/acetate, Cheng 2006 carbon-cloth, Cheng-Logan 2007 brush anode, plus 5 more). All 10 pass the Jest regression gate pre-release. Snapshot dated 2026-05-21; auto-refresh would require a scripts/research/snapshot-backtest.ts helper (planned follow-up).

§6 Honest gaps

What we don’t yet have evidence for

We're honest about gaps — peer section, not buried

Below are 9 known gaps in MESSAI’s defensibility today. They sit beside the moat sections with the same visual weight because surfacing them is the trust signal — reviewers should see them on the first scroll, not hunt for them in a footnote. Each card cites its source memory/handoff doc so the gap can be independently verified.

Gap · cov-power-density · 1,285%

Power density CoV ~1,285%

Why & mitigation

Power-density measurements span ~4 orders of magnitude across the literature; point estimates are not meaningful without conditioning on system class, scale, and electrode geometry.

Source

open-source/mess-parameters/data/SCIENTIFIC_INTEGRITY.md (Rule §3)

Mitigation

Predictive intervals are reported per-stratum (system_class × application_domain) via hierarchical Bayesian priors; point estimates only render in the lab UI when n ≥ 5 in the stratum.
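The effect of conditioning on a stratum, and the n ≥ 5 rule for showing a point estimate, can be illustrated with a short calculation. Everything numeric below is made up for illustration, the function names are hypothetical, and using the median as the point estimate is an assumption; the page does not say which statistic the lab UI renders.

```python
import statistics
from typing import Optional


def coefficient_of_variation_pct(values: list[float]) -> float:
    """CoV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def point_estimate(values: list[float], min_n: int = 5) -> Optional[float]:
    """Return the stratum median only when the stratum has at least min_n values."""
    if len(values) < min_n:
        return None
    return statistics.median(values)


# Made-up power densities (mW/m^2) spanning four orders of magnitude.
pooled = [0.5, 3.0, 20.0, 150.0, 900.0, 5000.0]
# Made-up single-stratum subset with a much narrower spread.
stratum = [120.0, 150.0, 180.0, 200.0, 240.0]

print(round(coefficient_of_variation_pct(pooled)), round(coefficient_of_variation_pct(stratum)))
print(point_estimate(stratum), point_estimate(stratum[:3]))
```

Returning `None` for an under-populated stratum leaves the caller to fall back to the predictive interval instead of a single number.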

Gap · canonical-coverage · 19%

Canonical-slug coverage 19% (raw)

Why & mitigation

Of all raw parameter names in extracted data, 19% map directly to a canonical slug. The remaining 81% are aliases that need DAG nodes added (~706 canonicals today; ~58 high-frequency aliases unmapped).

Source

CLAUDE.local.md → "Open v2 followups" items 1 + 2

Mitigation

After refresh-all.sh runs, the mapped rate rises to ~68%; work continues in open-source/mess-parameters/scripts/sync-from-database.js.

Gap · extraction-success · 75%

v2 extraction success rate ~75%

Why & mitigation

Of the papers fed to the v2 extractor (simple_value_extractor.ts), ~75% produce a usable values_v2.json. The remaining ~25% fail on schema mismatches from non-Anthropic providers, malformed PDFs, or paywalls.

Source

CLAUDE.local.md → "Other problems" section

Mitigation

Schema-tolerance shim landed 2026-05-09 (commit 3d215187b); BioC-PMC XML path is the next planned mitigation.

Gap · gemini-other-regression · 52%

system_class=OTHER 52% under Gemini fallback

Why & mitigation

When the Gateway is throttled and Gemini-2.5-flash takes over, system_class classification regresses from ~14% OTHER (Haiku) to ~52% OTHER on the first 21 batch5 papers. Several papers with "microbial fuel cell" in the title land in OTHER.

Source

docs/extraction/provider-fallback-2026-05-09.md

Mitigation

Mitigations under evaluation: inline prompt enumeration, a Tier-1 regex strawman, and switching to gemini-2.5-pro. The paper_class field is reliable on Gemini.

Gap · multiclass-validation · 2 of 7

Bayesian holdout validation: MFC + MEC only

Why & mitigation

The 97.98% OOS-coverage number covers 7 strata across MFC + MEC. MDC, MES, MNRC, MMRC, MBES analytical predictors landed 2026-05-15 but have not been validated against held-out paper cohorts.

Source

libs/shared/electrochemistry/src/predictors/{mdc,mes,mnrc,mmrc,mbes}.ts + holdout-coverage-2026-05-21.json

Mitigation

Multi-class holdout cohort assembly is the next planned validation workstream.

Gap · bioc-pmc-unconsumed · 1,400+

1,400+ BioC-PMC XML files unconsumed

Why & mitigation

paperscraper accumulated 1,400+ BioC-PMC XML files alongside PDFs. Neither the v1 nor the v2 extractor consumes XML today, even though these files are ground-truth full text that would feed v2 more accurately than PDF + Nougat.

Source

CLAUDE.local.md → "Known gaps" → "BioC-PMC XML extraction path"

Mitigation

Either write an XML-aware extractor or feed XML to v2 with a different system prompt — sized but not scheduled.

Gap · bvm-mechanistic-data-gated · 19 / 30 cells

Mechanistic BVM kinetics: data-gated, not yet identifiable

Why & mitigation

After the 2026-06-09 curve-sync fix recovered stranded 3-electrode data, the corpus has 843 half-cell (vs-SHE) polarization points (up from 0) and 19 of 30 BVM-posterior-ready cells (up from 13). That still falls short of the 30-cell target, and the genuine potential-sweep data is thin (≈3 CV curves), so the Butler-Volmer transfer coefficient α remains under-identified. We make NO mechanistic BVM prediction claims today.

Source

scripts/audits/bvm-identifiability-audit.ts + scripts/extraction/sync-curves-to-polarization-points.ts (2026-06-09)

Mitigation

We surface the validated within-design DIRECTION skill instead (power↔substrate 63%, power↔pH 62%, current↔external-resistance 71–78%); closing the remaining α gap needs new 3-electrode / cyclic-voltammetry papers + lab partnerships.

Gap · ml-predict-mfc-surrogate · 1 of 17

/api/ml/predict still uses MFC surrogate

Why & mitigation

The ML inference path defaults to predictMFC regardless of the input system_class. Phase 2 per-class analytical predictors landed in the lab UI 2026-05-15 but the /api/ml/predict endpoint has not been routed to dispatch per system class yet.

Source

docs/system-class-aware-predictor-architecture.md

Mitigation

Per-class ML routing is tracked as a separate workstream in the system-class-aware-predictor doc.

Gap · symbolic-regression-coverage · 2/5

Symbolic regression: 2 of 5 fits succeeded

Why & mitigation

PySR run on 2026-05-10 attempted 5 mechanistic fits; 2 succeeded (powerDensity↔currentDensity, coulombic_efficiency↔cod_removal); 3 were skipped for insufficient_data (internalResistance, specific_h2_production, P from current+R).

Source

apps/web/public/data/computed/research/symbolic-regression-laws.json

Mitigation

Re-run PySR after extraction backfills bring per-pair n above the data-sufficiency threshold (currently many pairs at n < 5).