Proof · data, computation, accuracy
What we can defend with real data
The MESSAI platform’s evidence for prediction accuracy, computational depth, and a curated data corpus — sourced from on-disk training artifacts and live database counts. Numbers refresh per request; every tile cites its source file. Where the evidence is thin or missing, the honest-gaps section says so — nothing is silently hidden.
Out-of-sample coverage
97.98%
95% CI · 940 held-out obs
Expected calibration error
1.96%
overall · calibration.json
Measured parameter values
195,846
live · ExtractedParameterData
Research papers
23,480
live · ResearchPaper
Canonical parameters
706
live · ParameterClassification
DAG edges
2,812
live · ParameterEdge
All numbers above are read at request time from apps/web/public/data/computed/research/ + Prisma. Live counts grow across sessions; snapshot fallbacks render when the DB is unreachable.
§1 Prediction accuracy
Out-of-sample 95% credible-interval coverage
Out-of-sample prediction accuracy
7 strata · 940 held-out observations · cohort holdout-2026-05-21 · generated 2026-05-21
OOS coverage
97.98%
vs 95% target
In-sample coverage
95.85%
-2.13 pp gap (lower is better)
Expected calibration error
—
calibration.json
Holdout strata
7
940 observations
Per-stratum coverage (sorted by n)
| Parameter | System | n | Inside | OOS coverage | IS − OOS (pp) |
|---|---|---|---|---|---|
| power_density_areal | MFC | 234 | 229 | 97.86% | -0.43 |
| cod_removal | MFC | 220 | 214 | 97.27% | -3.18 |
| current_density_areal | MFC | 163 | 159 | 97.55% | -0.61 |
| coulombic_efficiency | MFC | 135 | 135 | 100.00% | -7.41 |
| current_density_areal | MEC | 69 | 69 | 100.00% | -1.45 |
| coulombic_efficiency | MEC | 64 | 64 | 100.00% | +0.00 |
| cod_removal | MEC | 55 | 51 | 92.73% | +0.00 |
A negative IS − OOS gap means held-out coverage exceeded in-sample coverage, so the model generalizes at least as well as its fit cohort would suggest; the holdout papers may simply have been easier than average. A large positive gap would indicate overfitting; none of the 7 strata show one (the largest gap is +0.00 pp).
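The headline coverage figure can be reproduced from the per-stratum table. The sketch below is illustrative only: the row tuples are copied from the table above, while the function names and the "implied in-sample" column (OOS coverage plus the reported gap) are assumptions made for this example, not the platform's own code.

```python
# Rows copied from the per-stratum table:
# (parameter, system, n, inside, is_minus_oos_pp)
STRATA = [
    ("power_density_areal", "MFC", 234, 229, -0.43),
    ("cod_removal", "MFC", 220, 214, -3.18),
    ("current_density_areal", "MFC", 163, 159, -0.61),
    ("coulombic_efficiency", "MFC", 135, 135, -7.41),
    ("current_density_areal", "MEC", 69, 69, -1.45),
    ("coulombic_efficiency", "MEC", 64, 64, 0.00),
    ("cod_removal", "MEC", 55, 51, 0.00),
]


def coverage_pct(inside: int, n: int) -> float:
    """Share of held-out observations that fell inside the 95% interval."""
    return 100.0 * inside / n


def pooled_coverage_pct(rows) -> float:
    """Pool by observation count, not by averaging stratum percentages."""
    total_n = sum(row[2] for row in rows)
    total_inside = sum(row[3] for row in rows)
    return 100.0 * total_inside / total_n


pooled = pooled_coverage_pct(STRATA)
print(f"pooled OOS coverage: {pooled:.2f}%")  # 921 of 940 observations

for param, system, n, inside, gap_pp in STRATA:
    oos = coverage_pct(inside, n)
    # Implied in-sample coverage = OOS + (IS - OOS); an assumption of this sketch.
    print(f"{param:24s} {system}  OOS={oos:6.2f}%  implied IS={oos + gap_pp:6.2f}%")
```

Pooling by observation count matters here: the unweighted mean of the seven stratum percentages would give a slightly different number than 921/940.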
§2 Computational moat
Models, physics, and discovered laws
GP-SCM — Gaussian Process Structural Causal Model
Multi-output GP with a learned coregionalization kernel. Supports inversion (target → design parameters), Pareto-front search, and global sensitivity analysis.
Trained artifact
508 KB
gp-scm-fitted-named-db-2026-05-19-v3.pkl
Kernel
ICM
Intrinsic Coregionalization
Training data
195,846
from 9,511 papers
Trained at
2026-05-19
v3, latest of 6 snapshots
Physics constraint laws — 8 encoded
Pure-function residual checks anchored to Logan et al., Newman & Thomas-Alyea, and the Faraday/Nernst foundations. Used to gate every extracted row for physical plausibility.
| Law | Formula | Source |
|---|---|---|
| Power identity | P = V · I | Logan et al. |
| Coulombic efficiency | CE = ∫I dt / Q_substrate | Logan 2008 |
| Max power theorem | P_max = V_oc² / (4 · R_int) | Thevenin |
| Mass-transfer diffusion limits | j_lim ∝ D · C / δ | Newman & Thomas-Alyea |
| COD electron-accounting | 8 g COD / mol e⁻ (4 mol e⁻ per mol O₂) | Sleutels 2012 |
| Faraday's law | n_H₂ = Q / (2F) | Faraday |
| Nernst equation | E = E° − (RT/nF) ln Q | Nernst |
| pH-dependent electrode potential | −59 mV / pH-unit @ 25°C | Logan 2008 (DOI 10.1021/es801553e) |
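A few of the laws in the table can be written as pure functions and used as residual checks. This is a minimal sketch under stated assumptions: the function names and the 5% tolerance are hypothetical and are not taken from the platform's code; only the formulas and physical constants are standard.

```python
import math

R = 8.314462618   # gas constant, J / (mol K)
F = 96485.33212   # Faraday constant, C / mol


def nernst_slope_mv_per_ph(temp_c: float = 25.0) -> float:
    """Magnitude of the Nernstian slope, (RT/F) ln 10, in mV per pH unit."""
    return 1000.0 * R * (temp_c + 273.15) / F * math.log(10.0)


def max_power_w(v_oc: float, r_int: float) -> float:
    """Maximum power theorem: P_max = V_oc^2 / (4 R_int)."""
    return v_oc ** 2 / (4.0 * r_int)


def h2_moles_from_charge(q_coulombs: float) -> float:
    """Faraday's law for hydrogen: two electrons per H2 molecule."""
    return q_coulombs / (2.0 * F)


def relative_residual(observed: float, expected: float) -> float:
    return abs(observed - expected) / abs(expected)


def passes_power_identity(p_w: float, v_v: float, i_a: float, tol: float = 0.05) -> bool:
    """Check P = V * I within a relative tolerance (tol is a hypothetical value)."""
    return relative_residual(p_w, v_v * i_a) <= tol


print(f"{nernst_slope_mv_per_ph():.2f} mV per pH unit at 25 C")
print(passes_power_identity(p_w=0.0016, v_v=0.4, i_a=0.004))
```

Keeping each law as a side-effect-free function makes it straightforward to apply the same check to every extracted row and to unit-test each law in isolation.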
Symbolic regression — PySR (Cranmer 2023)
Genetic-programming search for closed-form mechanistic laws. 2 of 5 fits succeeded · 18.7s wall time
Discovered laws
- powerDensity ~ f(currentDensity) · MFC · n=51 · complexity=11
  (x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749
  Logan 2008 §3.4: P = V·I. At matched load V ≈ V_oc/2 ≈ const → P ∝ I
- coulombic_efficiency ~ f(cod_removal) · MFC · n=29 · complexity=13
  ((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836
  Sleutels 2012 §4: at high COD removal, more substrate goes to biomass (not e⁻) → CE drops
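The two fitted expressions can be evaluated directly. The sketch below transcribes them as written; the function names are made up for illustration, and because the page does not state how x0 and the output are scaled or transformed, the inputs used here are arbitrary probe values rather than physical quantities.

```python
def power_density_law(x0: float) -> float:
    """powerDensity ~ f(currentDensity), transcribed verbatim from the PySR output."""
    return (x0 + (0.22200376 / ((x0 + 2.3158035) * -1.8066067))) - 0.6148749


def coulombic_efficiency_law(x0: float) -> float:
    """coulombic_efficiency ~ f(cod_removal), transcribed verbatim from the PySR output."""
    return ((x0 + -0.19921306) / ((x0 + (x0 + 0.46581972)) / 0.0033559396)) + -1.308836


# Both expressions divide by a term that can reach zero, so each has a pole:
POWER_LAW_POLE = -2.3158035             # x0 + 2.3158035 == 0
CE_LAW_POLE = -0.46581972 / 2.0         # 2 * x0 + 0.46581972 == 0

for x0 in (0.0, 0.5, 1.0, 2.0):
    print(x0, power_density_law(x0), coulombic_efficiency_law(x0))
```

Away from its pole the first expression is dominated by the linear x0 term, which is consistent with the P ∝ I reading given above.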
Skipped fits (3) — honest reporting
- internalResistance ~ f(electrolyte_conductivity): insufficient_data (n=1)
- specific_h2_production ~ f(currentDensity): insufficient_data (n=2)
- powerDensity ~ f(currentDensity, internalResistance): insufficient_data (n=15)
Learned causal DAG — Bayesian network discovery
HillClimbSearch + BICScore on 173 papers across 9 features · v1-pgmpy-hillclimb-bic-2026-05-10
Novel edges learned (1)
- powerDensity → currentDensity
Handcrafted edges NOT confirmed by data (14)
- appliedPotential → currentDensity
- biofilmThickness → massTransportLimitation
- coulombic_efficiency → energy_efficiency
- coulombic_efficiency → h2_yield
- coulombic_efficiency → specific_h2_production
- electrodeMaterial → specificSurfaceArea
- exchangeCurrentDensity → chargeTransferResistance
- flowRate → biofilmThickness
- ...and 6 more
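The "novel" and "not confirmed" lists amount to set differences between the learned and handcrafted edge sets. The sketch below illustrates that comparison with the edges visible on this page only: the handcrafted set holds the 8 listed edges (the other 6 are not shown here), and the learned set is reduced to the single novel edge because the full learned edge list is not shown either.

```python
# Directed edges as (cause, effect) tuples. Partial data: see the note above.
HANDCRAFTED = {
    ("appliedPotential", "currentDensity"),
    ("biofilmThickness", "massTransportLimitation"),
    ("coulombic_efficiency", "energy_efficiency"),
    ("coulombic_efficiency", "h2_yield"),
    ("coulombic_efficiency", "specific_h2_production"),
    ("electrodeMaterial", "specificSurfaceArea"),
    ("exchangeCurrentDensity", "chargeTransferResistance"),
    ("flowRate", "biofilmThickness"),
}
LEARNED = {("powerDensity", "currentDensity")}


def compare_edges(learned: set, handcrafted: set) -> dict:
    """Split edges into confirmed, novel, unconfirmed, and direction-reversed."""
    reversed_hits = {(a, b) for (a, b) in learned if (b, a) in handcrafted}
    return {
        "confirmed": learned & handcrafted,
        "novel": learned - handcrafted,
        "unconfirmed": handcrafted - learned,
        "reversed": reversed_hits,
    }


result = compare_edges(LEARNED, HANDCRAFTED)
print(sorted(result["novel"]))
print(len(result["unconfirmed"]), "handcrafted edges not found in the learned graph")
```

The "reversed" bucket is worth tracking separately, since score-based structure learning on observational data often cannot distinguish the direction of an edge.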
§3 Data moat
Curated corpus, taxonomies, and provenance
Data corpus and taxonomies
Live database counts + curated catalog snapshots. Corpus counts are read live from Supabase Postgres; taxonomy counts come from catalog snapshots.
Proprietary corpus (live from DB)
Research papers
23,480
ResearchPaper
Extracted parameter values
195,846
ExtractedParameterData
Canonical parameters
706
ParameterClassification
DAG edges
2,812
ParameterEdge
Curated taxonomies (snapshot — open-source/mess-*)
Microbe species + priors
28
mess-microbes seed v1.3
Genus electroactivity priors
28
EET pathway × inoculum source
Electrode catalog entries
7
Pourbaix · conductivity · cost
Materials Project entries
160
DFT-backed, cross-referenced
What makes this a moat
- Every extracted value carries provenance: derivationMethod, uncertaintyPlus/Minus/Type, confidence, snippet, conditions (JSONB), canonical (JSONB).
- Parameter DAG is DB-as-source-of-truth with FK constraints on slug identifiers (post 2026-05-09); no JSON drift.
- Materials catalog bridges DFT (Materials Project API) + lab electrochemistry + Pourbaix system-class data; hand-curated, not bulk-imported.
- Two-tier extraction with provider fallback (Anthropic Gateway / Gemini / Groq) + schema-tolerance shim for non-Anthropic providers.
What is curation, not ownership
- ~35 MB of Zenodo / FigShare / Recherche Data Gouv metadata in mess-datasets-catalog is cross-linked & classified, but the underlying datasets are public.
- Public datasets count as a curation moat (consolidation + cross-classification effort) but not a data moat (we don’t own the underlying measurements).
- Research papers themselves are public; the moat is the extraction → DB pipeline with verifier sign-off, archive-never-delete, and per-row reproducibility scoring.
§4 Trust map
Per-parameter Bayesian prior states
Parameter trust map
102 unique parameter slugs · 268 stratum-level states (pooled + by system_type + by application_domain) · v2-trust-2026-05-16 · fitted 2026-05-16
Calibrated
12
4.5% of strata · converged + LOO reliable
Modeled
69
25.7% of strata · converged, weak LOO
Curated
178
66.4% of strata · literature prior, low-n
Flagged
9
3.4% of strata · anomalous posterior p
Breakdown by axis
| Axis | Calibrated | Modeled | Curated | Flagged |
|---|---|---|---|---|
| Pooled (across systems) | 8 | 20 | 71 | 3 |
| By system_type | 3 | 36 | 75 | 4 |
| By application_domain | 1 | 13 | 32 | 2 |
Each parameter has up to 3 stratification axes; a parameter that converges with reliable LOO at the pooled level might still be ‘curated’ at narrower system_type slices where n is small. This is honest stratification — the headline number above counts every stratum-level state, not just the best ones.
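The tile percentages follow from the breakdown table. As a quick consistency sketch (the dictionary layout and variable names are assumptions for this example; the counts are copied from the table above):

```python
# Counts copied from the "Breakdown by axis" table.
BREAKDOWN = {
    "pooled": {"calibrated": 8, "modeled": 20, "curated": 71, "flagged": 3},
    "system_type": {"calibrated": 3, "modeled": 36, "curated": 75, "flagged": 4},
    "application_domain": {"calibrated": 1, "modeled": 13, "curated": 32, "flagged": 2},
}

STATES = ("calibrated", "modeled", "curated", "flagged")

# Sum each trust state across the three stratification axes.
state_totals = {s: sum(axis[s] for axis in BREAKDOWN.values()) for s in STATES}
total_strata = sum(state_totals.values())

# Each parameter slug has exactly one pooled state, so the pooled row
# sums to the number of unique parameter slugs.
pooled_slugs = sum(BREAKDOWN["pooled"].values())

for state in STATES:
    share = 100.0 * state_totals[state] / total_strata
    print(f"{state:10s} {state_totals[state]:4d}  {share:4.1f}% of strata")
print("total strata:", total_strata, "| pooled slugs:", pooled_slugs)
```

Because the denominator is all 268 stratum-level states, the 4.5% calibrated share is a stricter figure than the share of parameters calibrated at the pooled level alone (8 of 102).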
§5 Logan retrospective backtest
25 peer-reviewed papers, encoded and re-predicted
Logan retrospective — literature backtest
25 peer-reviewed papers encoded as LoganPaperPreset objects with DOI, figure reference, reported values, units, and caveats. Re-predicted at each paper’s exact experimental configuration; percent-delta vs reported values classified by Logan & Regan 2006 lab-to-lab band.
Encoded papers
25
apps/lab/src/lib/sweep/presets/*.ts
MFC green band
13/14
|Δ| ≤ 25% · 92.9%
Median |Δ|
6.7%
across MFC presets
Anchor tests passing
10/10
100% · Jest CI gate
MFC band distribution (Logan & Regan 2006 lab-to-lab thresholds). Chart legend, in the order shown:
- Within lab-to-lab reproducibility (green band, |Δ| ≤ 25%)
- Investigate; may indicate missing physics term
- Genuine model gap
Source files
- apps/lab/src/lib/sweep/presets/ (25 preset files)
- apps/lab/src/lib/sweep/__tests__/backtest-presets.test.ts
- apps/lab/src/lib/sweep/__tests__/butler-volmer-calibration.test.ts
Anchor tests assert 10 hard-coded numerical ranges from canonical Logan-group papers (Liu-Logan 2004 wastewater/PEM/acetate, Cheng 2006 carbon-cloth, Cheng-Logan 2007 brush anode, plus 5 more). All 10 pass the Jest regression gate pre-release. Snapshot dated 2026-05-21; auto-refresh would require a scripts/research/snapshot-backtest.ts helper (planned follow-up).
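The backtest metrics reduce to a percent delta per preset, a green-band test, and a median. The sketch below shows that arithmetic only: the 25% green threshold comes from the tile above, while the function names and the (predicted, reported) pairs are invented for illustration and do not correspond to any encoded paper.

```python
from statistics import median

GREEN_BAND_PCT = 25.0  # |delta| <= 25% counts as within lab-to-lab reproducibility


def percent_delta(predicted: float, reported: float) -> float:
    """Signed percent difference of the prediction relative to the reported value."""
    return 100.0 * (predicted - reported) / reported


def in_green_band(predicted: float, reported: float) -> bool:
    return abs(percent_delta(predicted, reported)) <= GREEN_BAND_PCT


# Hypothetical (predicted, reported) pairs, not real preset values.
pairs = [(95.0, 100.0), (130.0, 100.0), (104.0, 100.0)]

abs_deltas = [abs(percent_delta(p, r)) for p, r in pairs]
green_count = sum(in_green_band(p, r) for p, r in pairs)

print(f"green band: {green_count}/{len(pairs)}")
print(f"median |delta|: {median(abs_deltas):.1f}%")
```

Reporting the median rather than the mean keeps a single out-of-band preset from dominating the summary figure.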
§6 Honest gaps
What we don’t yet have evidence for
We're honest about gaps — peer section, not buried
Power density CoV ~1,285%
Why & mitigation
Power-density measurements span ~4 orders of magnitude across the literature; point estimates are not meaningful without conditioning on system class, scale, and electrode geometry.
Source
open-source/mess-parameters/data/SCIENTIFIC_INTEGRITY.md (Rule §3)
Mitigation
Predictive intervals are reported per-stratum (system_class × application_domain) via hierarchical Bayesian priors; point estimates only render in the lab UI when n ≥ 5 in the stratum.
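To see why a pooled point estimate is uninformative at this spread, consider the coefficient of variation (standard deviation divided by mean) alongside the n ≥ 5 gate. This is a rough sketch: the sample values are invented to span four orders of magnitude, the helper names are hypothetical, and using the median as the point estimate is an assumption of this example rather than the lab UI's documented behavior.

```python
from statistics import mean, median, stdev
from typing import Optional, Sequence

MIN_N_FOR_POINT_ESTIMATE = 5  # gate stated in the mitigation text


def cov_pct(values: Sequence[float]) -> float:
    """Coefficient of variation in percent: sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


def point_estimate_or_none(values: Sequence[float]) -> Optional[float]:
    """Return a point estimate only when the stratum has enough observations."""
    if len(values) < MIN_N_FOR_POINT_ESTIMATE:
        return None
    return median(values)  # median is this sketch's choice, not a documented one


# Invented values spanning four orders of magnitude.
pooled = [0.5, 5.0, 50.0, 500.0, 5000.0]
print(f"CoV of pooled sample: {cov_pct(pooled):.0f}%")

print(point_estimate_or_none([12.0, 15.0, 11.0]))              # too few observations
print(point_estimate_or_none([12.0, 15.0, 11.0, 14.0, 13.0]))  # gate passes
```

A CoV well above 100% means the standard deviation exceeds the mean, which is why the page reports per-stratum intervals instead of a single pooled number.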
Canonical-slug coverage 19% (raw)
Why & mitigation
Of all raw parameter names in extracted data, 19% map directly to a canonical slug. The remaining 81% are aliases that need DAG nodes added (~706 canonicals today; ~58 high-frequency aliases unmapped).
Source
CLAUDE.local.md → "Open v2 followups" items 1 + 2
Mitigation
After refresh-all.sh runs, the mapped rate rises to ~68%; work continues in open-source/mess-parameters/scripts/sync-from-database.js.
v2 extraction success rate ~75%
Why & mitigation
Of papers fed to the v2 extractor (simple_value_extractor.ts), ~75% produce a usable values_v2.json. The remaining ~25% fail on schema tolerance for non-Anthropic providers, malformed PDFs, or paywalls.
Source
CLAUDE.local.md → "Other problems" section
Mitigation
Schema-tolerance shim landed 2026-05-09 (commit 3d215187b); BioC-PMC XML path is the next planned mitigation.
system_class=OTHER 52% under Gemini fallback
Why & mitigation
When the Gateway is throttled and Gemini-2.5-flash takes over, system_class classification regresses from ~14% OTHER (Haiku) to ~52% OTHER on the first 21 batch5 papers. Several papers with "microbial fuel cell" in the title land in OTHER.
Source
docs/extraction/provider-fallback-2026-05-09.md
Mitigation
Mitigations under evaluation: inline prompt enumeration, a Tier-1 regex strawman, and switching to gemini-2.5-pro. The paper_class field remains reliable on Gemini.
Bayesian holdout validation: MFC + MEC only
Why & mitigation
The 97.98% OOS-coverage number covers 7 strata across MFC + MEC. MDC, MES, MNRC, MMRC, MBES analytical predictors landed 2026-05-15 but have not been validated against held-out paper cohorts.
Source
libs/shared/electrochemistry/src/predictors/{mdc,mes,mnrc,mmrc,mbes}.ts + holdout-coverage-2026-05-21.json
Mitigation
Multi-class holdout cohort assembly is the next planned validation workstream.
1,400+ BioC-PMC XML files unconsumed
Why & mitigation
paperscraper accumulated 1,400+ BioC-PMC XML files alongside PDFs. Neither the v1 nor the v2 extractor consumes XML today; these files are ground-truth full text that would likely feed v2 with higher accuracy than PDF + Nougat.
Source
CLAUDE.local.md → "Known gaps" → "BioC-PMC XML extraction path"
Mitigation
Either write an XML-aware extractor or feed XML to v2 with a different system prompt. The work is sized but not scheduled.
/api/ml/predict still uses MFC surrogate
Why & mitigation
The ML inference path defaults to predictMFC regardless of the input system_class. Phase 2 per-class analytical predictors landed in the lab UI 2026-05-15 but the /api/ml/predict endpoint has not been routed to dispatch per system class yet.
Source
docs/system-class-aware-predictor-architecture.md
Mitigation
Per-class ML routing is tracked as a separate workstream in the system-class-aware-predictor doc.
Symbolic regression: 2 of 5 fits succeeded
Why & mitigation
PySR run on 2026-05-10 attempted 5 mechanistic fits; 2 succeeded (powerDensity↔currentDensity, coulombic_efficiency↔cod_removal); 3 were skipped for insufficient_data (internalResistance, specific_h2_production, P from current+R).
Source
apps/web/public/data/computed/research/symbolic-regression-laws.json
Mitigation
Re-run PySR after extraction backfills bring per-pair n above the data-sufficiency threshold (currently many pairs at n < 5).