How MESSAI works

Methodology & reference

How papers enter the corpus, what fields are extracted from each one, and how the classification taxonomy is defined. This page is the canonical source for understanding what the platform claims about each paper, and the audit-ready definitions behind every chip, badge, and parameter row in the UI. Share with collaborators or domain experts to sanity-check the field choices.

Goal

Why a structured schema?

Bioelectrochemical-systems papers report values in inconsistent ways: power density normalized to anode area in one paper and to total volume in another, peak vs steady-state values reported without distinction, voltages quoted without the reference electrode, and microbial communities described as "mixed culture" with no abundance data. A free-text scrape produces noise. The MESSAI extractor reads each paper's methods/results/tables and emits a structured JSON document with a fixed schema, so every value carries the context needed to compare it against another paper.

Below is every field the extractor produces, organised the way the JSON is shaped. Each field has a What line (the value), a Why line (what it lets the platform do downstream), and where applicable, the rule used to fill it.

is_bes_paper

Step 0 — Classification gate

Before extracting any measurements, the LLM decides whether the paper actually describes a bioelectrochemical system at all. Many corpus PDFs turn out to be off-topic: zinc-air fuel cells, COMSOL simulations of PEM hydrogen cells, VLSI circuits, clinical guidelines, mobile educational apps, etc. The classification gate rejects these cheaply (~$0.001 per call) and quarantines the file from downstream processing.

The LLM only emits the rich extraction schema when is_bes_paper: true. When false, it returns rejection_reason + concept_tags describing what the paper IS about, and the platform soft-hides the paper.

Field	Type	What & why
is_bes_paper	boolean	true if microbial/enzymatic catalysis at electrodes connected to an external circuit; false otherwise. Architecture-aware filter: uses inferred BES rules, not just keyword matching, so a paper describing "biofilm anode + air cathode + 1 kΩ resistor + acetate substrate" is recognized as an MFC even when the term "MFC" never appears.
rejection_reason	string \| null	Single-sentence explanation when is_bes_paper=false. Audit trail. Surfaced inline on quarantined papers so a reviewer can verify the rejection was correct.
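A minimal sketch of how a downstream consumer might route the gate's output. The `route_extraction` helper, the example values, and the placement of `concept_tags` at the top level of the document are assumptions for illustration; only the field names `is_bes_paper`, `rejection_reason`, and `concept_tags` come from the description above.

```python
from typing import Any


def route_extraction(doc: dict[str, Any]) -> str:
    """Return 'extract' for BES papers and 'quarantine' for rejected ones."""
    if doc.get("is_bes_paper") is True:
        return "extract"
    # A rejected paper must carry its explanation so a reviewer can audit it.
    if not doc.get("rejection_reason"):
        raise ValueError("rejected paper is missing rejection_reason")
    return "quarantine"


# Illustrative documents only; the values are not taken from the corpus.
rejected = {
    "is_bes_paper": False,
    "rejection_reason": "Describes a zinc-air fuel cell with no biological catalyst.",
    "concept_tags": ["zinc_air", "fuel_cell"],
}
accepted = {"is_bes_paper": True, "rejection_reason": None}

print(route_extraction(rejected))
print(route_extraction(accepted))
```

Raising on a missing `rejection_reason` is a design choice of this sketch: it keeps a silent rejection from entering the quarantine without an audit trail.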

paper_context.primary_system_type

Primary system type

Twelve canonical types covering the full BES landscape. MES specifically means Microbial Electrosynthesis (cathodic CO₂ reduction), not the umbrella term — the umbrella is BES. See the Glossary tab for the full list with definitions.

Field	Type	What & why
primary_system_type	enum	Single dominant subtype: MFC \| MEC \| MES \| MDC \| EFC \| BER \| BEF \| MEB \| BES_generic \| BES_modeling \| BES_review \| other. Lets the platform stratify by the dominant function of the paper. Modeling/review papers are flagged so the predictor doesn't treat their citations as measurements.

paper_context.system_variants[]

Architecture variants

Multi-tag list — a paper can be many things at once. A "constructed wetland MFC for nitrogen removal" gets variants=[constructed_wetland, single_chamber] + applications=[wastewater_treatment, power_generation]. Variants describe the chamber/biology architecture orthogonal to the primary type.

Field	Type	What & why
system_variants	string[]	Architecture tags drawn from a controlled vocabulary (sediment, plant, photosynthetic, constructed_wetland, biocathode, bioanode_only, mixed_culture, single_chamber, dual_chamber, h_type, tubular, microfluidic, cube, stacked, membraneless, ...). Powers multi-tag UI filters and stratification: research listings can filter to "all MFCs that are sediment + photosynthetic" with a single index lookup.

Single-chamber MFCs by definition have no membrane, so the prompt instructs the LLM to tag both single_chamber AND membraneless when there's one liquid compartment.
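The tagging rule above can also be expressed as a post-extraction check. This is a sketch under assumptions: the platform states the rule in the LLM prompt, and `normalize_variants` is a hypothetical helper, not a documented part of the pipeline.

```python
def normalize_variants(variants: list[str]) -> list[str]:
    """De-duplicate tags (keeping order) and pair single_chamber with membraneless."""
    tags = list(dict.fromkeys(variants))
    if "single_chamber" in tags and "membraneless" not in tags:
        tags.append("membraneless")
    return tags


print(normalize_variants(["constructed_wetland", "single_chamber"]))
```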

paper_context.application_domains[]

Application domains

What the paper is for. Most papers carry 1–3 of these. Lets users discover "what BES research has been done on metal recovery" or "all CO₂-reduction papers" without needing keyword guesses.

Field	Type	What & why
application_domains	string[]	power_generation \| hydrogen_production \| wastewater_treatment \| desalination \| electrosynthesis \| biosensing \| bioremediation \| electrofermentation \| nutrient_recovery \| CO2_reduction \| metal_recovery \| soil_remediation. Lets the platform answer "what works for X application" with explicit semantics instead of full-text search.

paper_context.geometry

Geometry

Dimensional and topological information needed to render a 3D model and (eventually) run a multiphysics simulation that can be cross-checked against the paper's own measurements. When a paper omits these the field is null — never invented.

Field	Type	What & why
topology_class	enum	single_chamber \| dual_chamber_h \| cube \| tubular \| stacked \| membraneless \| upflow \| sediment \| microfluidic. Drives 3D model template selection and the multiphysics topology check (a chamber-count mismatch invalidates a model fit).
working_volume_anode_ml / cathode_ml / total_ml	number	Volumes of each chamber + total. When dual-chamber, anode + cathode should sum to total. Volume is the denominator for volume-normalized power density (W/m³). Without it, only area-normalized comparisons are valid.
inter_electrode_distance_cm	number	Anode-to-cathode separation distance. Sets the ohmic-loss term in the equivalent circuit. A 50 mV η_ohmic error → ~50× i₀ error in Butler-Volmer fits.
anode_projected_area_cm2 / cathode_projected_area_cm2	number	Geometric area of each electrode (NOT specific surface area). Denominator for area-normalized power/current density. The single most common normalization mismatch in BES literature.
membrane_area_cm2 / membrane_thickness_um	number	Membrane geometry when present. Powers ion-transport calculations and proton crossover modeling.
electrode_shape	enum	planar \| brush \| mesh \| cloth \| rod \| tubular \| foam \| other. A "brush" electrode and a "planar" electrode of the same projected area have very different actual surface areas and biofilm-volume ratios.
channel_pattern	enum	For microfluidic / flow-through: straight_parallel \| serpentine \| interdigitated \| y_junction \| spiral \| pin_fin \| mesh_distributed. Renders correctly in 3D and determines flow-regime classification (laminar vs Taylor dispersion).
has_labelled_schematic	boolean	Whether the paper includes a labelled schematic figure. Future Gap B-c: digitizing the schematic for dimensional cross-checks against text-extracted geometry.
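To illustrate why the geometry fields act as denominators, here is a minimal sketch of the volume-sum check and the two normalizations. The function names, the 5% tolerance, and the numeric inputs are assumptions chosen for the example; null geometry stays null and is never guessed.

```python
from typing import Optional


def volumes_consistent(anode_ml, cathode_ml, total_ml, rel_tol=0.05) -> Optional[bool]:
    """Check anode + cathode ≈ total; return None when any volume is missing."""
    if anode_ml is None or cathode_ml is None or total_ml is None:
        return None
    return abs((anode_ml + cathode_ml) - total_ml) <= rel_tol * total_ml


def power_density_area_mw_m2(power_mw, area_cm2) -> Optional[float]:
    """Area-normalized power density in mW/m² (1 cm² = 1e-4 m²)."""
    if power_mw is None or not area_cm2:
        return None
    return power_mw / (area_cm2 * 1e-4)


def power_density_volume_w_m3(power_mw, volume_ml) -> Optional[float]:
    """Volume-normalized power density in W/m³ (1 mL = 1e-6 m³)."""
    if power_mw is None or not volume_ml:
        return None
    return (power_mw * 1e-3) / (volume_ml * 1e-6)


# Hypothetical reactor: 1 mW from a 10 cm² anode in a 28 mL cell.
print(power_density_area_mw_m2(1.0, 10.0))
print(power_density_volume_w_m3(1.0, 28.0))
```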

paper_context.materials

Materials — anode, cathode, membrane, biofilm

Each electrode role gets a structured object with two parallel fields: a canonical id (slug from mess-materials or mess-membranes) and an as_reported field carrying the exact text from the paper for audit. This means even unmatched materials retain their original wording; nothing is silently dropped.

Biological electrodes (biocathode, bioanode, algal_biofilm, photosynthetic_biofilm, enzymatic_electrode) are first-class material slugs starting in v1.2 — when a paper describes a photosynthetic biocathode, that information lands in materials.cathode.id, not just in system_variants.

Field	Type	What & why
materials.{anode,cathode,membrane}.id	string \| null	Canonical slug from mess-materials or mess-membranes. Null when no canonical match found. Joining papers by canonical material lets us answer "what power densities have been reported on carbon_cloth + Geobacter sulfurreducens" in O(1) DB lookups.
materials.{anode,cathode,membrane}.as_reported	string	Exact text from the paper, verbatim. Audit trail. When the canonical id is null, this is the only signal. When the canonical id is set, this lets reviewers verify the mapping was correct.
materials.{anode,cathode,membrane}.treatments[]	array	List of treatment objects with canonical id (acid_HNO3, heat_air, plasma_O2, MnO2_coating, ...) + as_reported + duration + temperature + agent. The same material with different pre-treatments can span a 25× performance range (carbon cloth as-received vs MXene-coated). Without treatments, the material signal is muddied.
materials.cathode.catalyst.{id,as_reported}	object	Catalyst layer on the cathode (Pt/C, MnO2, ABO3 perovskite, etc.), separate from the cathode substrate itself. For air-cathode MFCs the catalyst loading dominates kinetics, not the underlying carbon support.
materials.cathode.environment	enum	air_exposed \| liquid_submerged \| gas_phase \| other. Air-cathode vs aqueous-cathode is a topology-level distinction that drastically changes O₂ availability and ORR kinetics.
materials.biofilm	object	thickness_um, density_g_L, conductivity_mS_cm, dominant_organism.{id, as_reported}. Biofilm physical parameters drive ohmic and conductive-mediator electron-transfer pathways.
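The id / as_reported pairing can be sketched as a lookup that never discards the paper's wording. The alias table and the `resolve_material` helper below are hypothetical; the real mess-materials vocabulary and matching logic are not shown on this page.

```python
# Hypothetical alias table; the real mess-materials slugs may differ.
ALIASES = {
    "carbon cloth": "carbon_cloth",
    "graphite felt": "graphite_felt",
}


def resolve_material(as_reported: str) -> dict:
    """Pair the verbatim text with a canonical id, or None when unmatched."""
    key = " ".join(as_reported.lower().split())  # normalize case and whitespace
    return {"id": ALIASES.get(key), "as_reported": as_reported}


print(resolve_material("Carbon  Cloth"))
print(resolve_material("Toray paper TGP-H-060"))
```

The unmatched case keeps `as_reported` intact with `id` set to `None`, which mirrors the "nothing is silently dropped" rule.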

paper_context.microbial_analysis

Microbial analysis

What organism(s) the paper studied, with provenance about how the identification was made (16S? metagenomics? culture-dependent?). Distinguishes a paper claiming "dominant Geobacter from V1-V3 16S" from one claiming the same identity from full-length PacBio sequencing — the methodological tier matters when training the predictor.

Field	Type	What & why
analysis_method	enum	16S_rRNA \| metagenomics \| metatranscriptomics \| FISH \| qPCR_specific_genes \| culture_dependent \| none. Methodological tier: 16S V4 amplicon < full-length 16S < shotgun metagenomics. Aggregating these as one signal mixes reliability tiers.
sequencing_platform / 16S_region_amplified / read_count	string \| int	Illumina MiSeq / NovaSeq / PacBio / Nanopore / Sanger; V1-V3, V3-V4, V4, V4-V5, full_length; reads per sample. Resolution depends on platform + region. Platform metadata is the audit trail for downstream microbe-feature confidence.
diversity_indices	object	Shannon, Simpson, Chao1, Pielou, Good's coverage %. Community-level signals: a high-Shannon mixed culture behaves differently from a low-Shannon enrichment.
dominant_taxa[]	array	List of {id (canonical mess-microbes slug), as_reported, name, rank, relative_abundance_pct, confidence}. Cross-paper organism analysis: which strains achieve which performance under which conditions.
biofilm_vs_planktonic_separated	boolean	Was sequencing done on biofilm vs free-cells separately, or pooled? Biofilm and planktonic communities can be very different. Pooled sequencing dilutes the electrode-attached signal.

paper_context.electrochemistry

Electrochemistry — Butler-Volmer + Tafel

Per-electrode kinetic parameters. When a paper reports them, we capture them; when it doesn't, we leave the slot null and the predictor uses literature defaults. This schema mirrors the parameters needed by the platform's COMSOL-style validation harness.

Field	Type	What & why
cathode_reaction_identity	enum	O2_ORR \| HER \| ferricyanide \| other. O₂ reduction has fundamentally different kinetics from H⁺ reduction; pretending they're comparable corrupts predictions.
i0_anode_A_m2 / i0_cathode_A_m2	number	Exchange current density per electrode. Butler-Volmer kinetic parameter. The platform's validation harness fits the curve and back-checks against this.
alpha_a / alpha_c	number	Anodic and cathodic charge-transfer coefficients (typically 0–1). Tafel slope = 2.303·RT/(αnF) per decade; without α the slope is uninterpretable.
tafel_slope_anode_mV_dec / cathode	number	Slope of the Tafel plot per electrode (mV/decade). When i₀ is missing, Tafel slope can back-fit α. When both are present, they cross-validate.
E_eq_anode_V / E_eq_cathode_V	number	Equilibrium / Nernst potential per electrode. Anchors the absolute voltage scale. Without it, polarization curves are only relative.
overpotential_partition	object	{eta_act, eta_ohmic, eta_conc, operating_point_mA_cm2}. Tells the predictor which loss mechanism dominates at what current density, which determines where to optimize.
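The relation between α and the Tafel slope can be checked numerically. The sketch below assumes the decadic form b = ln(10)·RT/(αnF), which is the form that matches a slope quoted in mV/decade; the function names and default arguments are illustrative, not part of the platform.

```python
import math

R = 8.314462618    # gas constant, J/(mol·K)
F = 96485.33212    # Faraday constant, C/mol


def tafel_slope_mv_dec(alpha: float, n: int = 1, temp_c: float = 25.0) -> float:
    """Tafel slope in mV/decade for transfer coefficient alpha and n electrons."""
    temp_k = temp_c + 273.15
    return 1000.0 * math.log(10) * R * temp_k / (alpha * n * F)


def alpha_from_tafel(slope_mv_dec: float, n: int = 1, temp_c: float = 25.0) -> float:
    """Invert the same relation to back-fit alpha from a reported slope."""
    temp_k = temp_c + 273.15
    return 1000.0 * math.log(10) * R * temp_k / (slope_mv_dec * n * F)


print(round(tafel_slope_mv_dec(0.5), 1))   # about 118 mV/decade at 25 °C
```

When a paper reports both α and the slope, running one through the inverse and comparing against the other is the cross-validation the table describes.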

condition_sets[] + observations[]

Condition sets — operating conditions + measurements

When a paper varies a condition (temperature, pH, substrate, R_ext, applied voltage, ...) and reports outcomes for each, the LLM emits a separate condition_set per tuple plus an observations entry tied to that set. This preserves within-paper variation, which is the only kind of variation that supports causal-grade inference.

Every observation carries provenance: which section it came from, what value type (peak vs steady-state vs endpoint), what it's normalized to (anode area? cathode area? volume?), the page number, and a 200-character snippet for audit.
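A sketch of how condition sets and observations might link together. The `label` key on a condition set, the label strings, and every numeric value are placeholders invented for illustration; the observation keys follow the per-observation fields documented on this page.

```python
from collections import defaultdict

# Two condition sets from one hypothetical paper that varied R_ext.
condition_sets = [
    {"label": "R_ext_1000_ohm", "temperature_c": 30.0, "external_resistance_ohm": 1000.0},
    {"label": "R_ext_100_ohm", "temperature_c": 30.0, "external_resistance_ohm": 100.0},
]

# One observation per condition set; values are placeholders, not real data.
observations = [
    {"parameter_id": "power_density", "value": 120.0, "unit": "mW/m2",
     "value_type": "peak", "is_normalized_to": "anode_area",
     "condition_set_label": "R_ext_1000_ohm", "section_source": "tables", "page": 5},
    {"parameter_id": "power_density", "value": 310.0, "unit": "mW/m2",
     "value_type": "peak", "is_normalized_to": "anode_area",
     "condition_set_label": "R_ext_100_ohm", "section_source": "tables", "page": 5},
]


def group_by_condition(condition_sets, observations):
    """Group observations by label, rejecting any that point at an unknown set."""
    known = {c["label"] for c in condition_sets}
    grouped = defaultdict(list)
    for obs in observations:
        label = obs["condition_set_label"]
        if label not in known:
            raise ValueError(f"observation refers to unknown condition set: {label}")
        grouped[label].append(obs)
    return dict(grouped)


grouped = group_by_condition(condition_sets, observations)
print(sorted(grouped))
```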

Field	Type	What & why
temperature_c / ph_anolyte / ph_catholyte	number	Operating temperature and chamber-specific pH. Anolyte and catholyte pH can differ in dual-chamber systems. Splitting them captures real biology (e.g. Geobacter at pH 7 anolyte, ORR at pH 3 catholyte).
substrate / substrate_id / substrate_concentration_mg_L	string + slug + number	Substrate name + canonical SubstrateClassification id + concentration. Canonical id (acetate, glucose, lactate, ...) lets the platform run substrate-aware Monod priors. as_reported preserved.
substrate_complexity	enum	pure_compound \| synthetic_mixture \| real_wastewater \| complex_media. 54.7% of the corpus is real wastewater. Without this distinction, predictions on complex substrates degrade to literature mean.
substrate_source	string	For wastewater: domestic_ww \| brewery \| dairy \| landfill_leachate \| industrial \| saline \| marine \| synthetic. Inhibitor profiles vary dramatically by source; this is the granular layer below substrate_complexity.
inhibitors_present[]	array	heavy_metals \| antibiotics \| sulfide \| ammonia_NH3_high \| salinity_high \| phenolic \| VFAs_high. Inhibitors mask the underlying substrate-utilization signal. Flagged so the predictor can stratify.
external_resistance_ohm	number	External load resistance for MFCs; required for I-V interpretation. Different R_ext drives different operating points on the polarization curve. A peak power without R_ext is uninterpretable.
applied_voltage_V	number	Applied potential for MEC / MES / BER. For non-spontaneous BES, this is the input that drives the chemistry.
reference_electrode + voltage_reporting_convention	enum + enum	Ag/AgCl_3M_KCl \| SCE \| Hg/HgO_1M_KOH \| SHE; vs_anode_reference \| vs_cathode_reference \| vs_SHE_normalized \| cell_voltage. A 30-110 mV offset between reference types makes raw voltages incomparable across papers. The platform normalizes everything to vs-SHE for cross-paper analysis.
mass balance fields	numbers	substrate_in/out_mg_L, COD_removed_pct, biomass_produced_mg_L, methane_produced_mol, mass_balance_closure_pct. Honesty check: a paper claiming 85% CE on glucose with 30% COD removal is internally inconsistent. The closure rate is the audit signal.
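As an illustration of the vs-SHE normalization mentioned in the table, here is a minimal sketch. The offsets are approximate textbook values near 25 °C and are assumptions, not the platform's own constants; Hg/HgO is left out because its offset depends on the electrolyte and should come from a verified source. The `to_she` helper is hypothetical.

```python
# Approximate potentials of each reference vs SHE at ~25 °C (assumed values).
REF_OFFSET_V = {
    "SHE": 0.000,
    "Ag/AgCl_3M_KCl": 0.210,
    "SCE": 0.241,
}


def to_she(potential_v: float, reference_electrode: str) -> float:
    """Convert an electrode potential measured vs a reference to the SHE scale."""
    try:
        offset = REF_OFFSET_V[reference_electrode]
    except KeyError:
        raise ValueError(f"no offset known for reference: {reference_electrode}")
    return potential_v + offset


# An anode at -0.300 V vs Ag/AgCl (3 M KCl) sits near -0.090 V vs SHE.
print(round(to_she(-0.300, "Ag/AgCl_3M_KCl"), 3))
```

Only electrode potentials need this shift. A value reported under the `cell_voltage` convention is a difference between two electrodes and is unaffected by the choice of reference.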

Per-observation fields

Field	Type	What & why
parameter_id	string (slug)	Ontology slug from mess-parameters (power_density, coulombic_efficiency, ...). 678 active definitions. Clean joins to ParameterDefinition for unit + bounds + display label. Avoids duplicate parameter definitions across papers.
value + unit	number + string	The reported numeric value + the EXACT unit shown in the paper. The platform never converts units at extraction time. Conversions happen later, with the unit preserved as audit.
value_type	enum	peak \| maximum \| steady_state \| endpoint \| average \| initial \| final \| minimum \| median \| range_min \| range_max. A peak value and a steady-state value of the same parameter are different numbers measuring different things. Conflating them produces 2-3× errors.
is_normalized_to	enum	anode_area \| cathode_area \| membrane_area \| total_volume \| anode_volume \| biomass_mass \| null. For power and current density, the basis is decisive. mW/m² to anode area vs cathode area is a 1.5-3× difference; to volume is 10-1000×.
condition_set_label	string	Which ConditionSet this observation was measured under. Binds the value to its conditions. Without this, an observation is a number with no operating point.
section_source	enum	abstract \| methods \| results \| tables \| discussion \| figure_caption. Trust by section: tables > methods > results > discussion. The platform stratifies confidence by section.
page + snippet	int + string	Page number + ±200 char window from source text. Audit trail. A reviewer can click the value and see the exact text it was extracted from.
confidence	float [0,1]	LLM-reported confidence, calibrated against fixture-paper ground truth. Drives needs_review threshold (default <0.7) and Move 4 calibration weighting.
extraction_method	enum	REGEX \| LLM \| HYBRID \| MANUAL. Regex extractions are typically more reliable for numeric ranges; LLM extractions handle complex context. Knowing which path produced a value lets us audit failures.
uncertainty	object	{value, type: std \| sem \| ci95 \| range} when paper reports ±. A "5.2 ± 0.3 (n=3)" row is high-quality training data; a "~5" row is noise. Move 4 calibration weights by these.
replicates	int	Number of independent replicates the value is averaged over. Reproducibility tier. n≥3 reactors reported separately is the gold standard.
measurement_technique	enum	polarization_curve \| chronoamperometry \| chronopotentiometry \| cyclic_voltammetry \| LSV \| EIS \| open_circuit \| other. Power from polarization-curve peaks vs from chronoamperometry steady-state are different things. Stratifying by technique is essential for fair comparison.
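Two of the per-observation rules lend themselves to simple checks. This is a sketch only: the helper names are hypothetical, treating a missing confidence as needing review is an assumption, and `current_density` is assumed to be a valid parameter slug alongside the documented `power_density`.

```python
DENSITY_PARAMETERS = {"power_density", "current_density"}  # second slug is assumed


def needs_review(observation: dict, threshold: float = 0.7) -> bool:
    """Flag observations below the default 0.7 confidence threshold."""
    confidence = observation.get("confidence")
    # Assumption: an observation with no confidence at all is also reviewed.
    return confidence is None or confidence < threshold


def missing_basis(observation: dict) -> bool:
    """A density value without is_normalized_to cannot be compared across papers."""
    return (
        observation.get("parameter_id") in DENSITY_PARAMETERS
        and observation.get("is_normalized_to") is None
    )


obs = {"parameter_id": "power_density", "value": 5.2, "unit": "W/m2",
       "confidence": 0.65, "is_normalized_to": None}
print(needs_review(obs), missing_basis(obs))
```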

polarization_curves[]

Polarization curves

When a paper reports a polarization curve, even partially, the LLM extracts every (current density, voltage, power density) point as a row. These rows feed Move 1's within-paper effect modeling and let the multiphysics validation harness fit Tafel slopes per condition set.

Field	Type	What & why
polarization_curves[].condition_set_label	string	Which ConditionSet the curve was measured under. Different conditions produce different curves; binding to the condition set keeps them apart.
polarization_curves[].source	enum	table \| figure_digitized \| caption_extracted \| llm_inferred. Tells downstream consumers how reliable each point is. Table > caption > LLM-inferred.
points[].i_A_m2 / V_V / P_W_m2	number	(i, V, P) triplet per point, ordered along the curve. Move 1 fits the curve to extract internal resistance, peak power, and within-paper effect sizes.
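A minimal sketch of what can be derived from extracted points, assuming the simplest possible model: power as P = V·i and a straight-line fit V = OCV − r·i over the whole curve. This is not Move 1's actual fitting procedure, which is not described here; a real curve is only linear in its ohmic region. The sample points are synthetic.

```python
def peak_power_point(points: list[dict]) -> dict:
    """Return the point with the highest P = V * i."""
    return max(points, key=lambda p: p["V_V"] * p["i_A_m2"])


def area_specific_resistance(points: list[dict]) -> float:
    """Least-squares slope of V vs i, returned as a positive resistance in Ω·m²."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    xs = [p["i_A_m2"] for p in points]
    ys = [p["V_V"] for p in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all points share the same current density")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic curve following V = 0.8 - 0.1 * i (not taken from any paper).
points = [
    {"i_A_m2": 0.0, "V_V": 0.8},
    {"i_A_m2": 1.0, "V_V": 0.7},
    {"i_A_m2": 2.0, "V_V": 0.6},
    {"i_A_m2": 3.0, "V_V": 0.5},
    {"i_A_m2": 4.0, "V_V": 0.4},
]
best = peak_power_point(points)
print(best["i_A_m2"], area_specific_resistance(points))
```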

eis_spectra[]

EIS spectra

Same shape as polarization curves but for electrochemical impedance spectroscopy. Each Nyquist point (frequency, Z_real, Z_imag) is captured per electrode (anode, cathode, or whole cell). Equivalent-circuit fits live separately in ElectrochemicalKinetic.

microbial_kinetic_constants[]

Microbial kinetic constants

One row per (organism × substrate × parameter). Captures Monod μ_max, K_s, Y_X/S, decay rate, maintenance coefficient, current-Marcus K_M, and lag phase when reported. Used by the Move 2 substrate-aware Monod prior.
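For reference, the Monod relation that μ_max and K_s parameterize is μ = μ_max·S / (K_s + S). The sketch below evaluates it with placeholder constants; the function name and the numbers are illustrative and do not describe the Move 2 prior itself.

```python
def monod_growth_rate(substrate_mg_l: float, mu_max: float, k_s_mg_l: float) -> float:
    """Specific growth rate from the Monod equation (same time unit as mu_max)."""
    if substrate_mg_l < 0 or k_s_mg_l <= 0:
        raise ValueError("substrate must be >= 0 and K_s must be > 0")
    return mu_max * substrate_mg_l / (k_s_mg_l + substrate_mg_l)


# At S = K_s the rate is exactly half of mu_max.
print(monod_growth_rate(50.0, mu_max=0.4, k_s_mg_l=50.0))
```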

electrochemical_kinetics[]

Electrochemical kinetics

Per-electrode kinetics: exchange current density (i₀), transfer coefficient (α), number of electrons (n), Tafel slope (b), equilibrium potential (E_eq), and overpotential partitions (η_act, η_ohmic, η_conc) at named operating points. Required by the multiphysics validation harness.
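These parameters plug into the Butler-Volmer equation. The sketch below uses one common textbook form, i = i₀[exp(α_a·nFη/RT) − exp(−α_c·nFη/RT)]; sign and n-placement conventions differ between sources, so the form, the function name, and the defaults are assumptions rather than the harness's implementation.

```python
import math

R = 8.314462618    # gas constant, J/(mol·K)
F = 96485.33212    # Faraday constant, C/mol


def butler_volmer(eta_v: float, i0: float, alpha_a: float = 0.5,
                  alpha_c: float = 0.5, n: int = 1, temp_c: float = 25.0) -> float:
    """Net current density (same unit as i0) at activation overpotential eta_v."""
    f = n * F / (R * (temp_c + 273.15))
    return i0 * (math.exp(alpha_a * f * eta_v) - math.exp(-alpha_c * f * eta_v))


print(butler_volmer(0.0, i0=1.0))   # no net current at equilibrium
print(butler_volmer(0.1, i0=1.0))   # anodic current at +100 mV
```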

author_stated_limitations[] / future_work_stated[]

Author-stated limitations + future work

What the paper itself acknowledges as gaps and what it suggests as next steps. Surfaced on the AI Insights tab as "opportunities" — they're a more honest signal than the platform inferring research gaps from the corpus alone.

paper_context.concept_tags / cited_models / analysis_software

Concept tags + cited models

concept_tags are short keywords describing the paper's contribution beyond the system type (e.g. "biofilm_engineered", "novel_architecture", "scalable"). cited_models are theoretical models the paper uses (Butler-Volmer, Monod, Nernst, Tafel, Bruggeman). analysis_software records the tools (Origin, MATLAB, EC-Lab, NOVA, COMSOL).

Layers 1–6 (when run on full pipeline)

Multi-model validation orchestration

The full extraction stack runs each paper through six layers:

Layer 1 — self-consistency: 3× sampling at T=0.3, fields with 3/3 agreement go forward at confidence=1.0; 0/3 flagged.
Layer 2 — cross-model consensus on disputed fields (Gemini 2.5 Flash + Cerebras llama-3.1-70b for triangulation).
Layer 3 — adversarial critic prompt: "find errors in this extraction" looks for unit confusion, value-type misattribution, arithmetic inconsistencies, vs-SHE sanity, stoichiometric impossibilities.
Layer 4 — rule-based: range bounds (CE>100% impossible), internal arithmetic (P ≈ V·I within 15%), substrate stoichiometric closure.
Layer 5 — cross-paper consistency: same-author baseline checks, methodological-clone outlier detection, citation cross-checks, oracle outliers vs Logan/Patil/Rodríguez literature meta-analyses.
Layer 6 — Sonnet 4.6 escalation only when Layers 1+2 disagree AND critic flags errors AND rules trigger. Expected to fire on ≤5% of papers.
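Layer 1's agreement rule can be sketched as a per-field vote across the sampled extractions. The page specifies only the 3/3 and 0/3 outcomes; labeling a 2/3 split as "disputed" (and so a candidate for Layer 2) is an assumption, as are the function name and the output shape.

```python
import json
from collections import Counter


def self_consistency(samples: list[dict]) -> dict:
    """Vote per field across repeated extractions of the same paper."""
    if not samples:
        raise ValueError("need at least one sample")
    fields = sorted({key for sample in samples for key in sample})
    result = {}
    for field in fields:
        # Serialize values so nested lists and objects can be counted.
        votes = Counter(json.dumps(s.get(field), sort_keys=True) for s in samples)
        top_json, count = votes.most_common(1)[0]
        if count == len(samples):
            status = "agreed"        # full agreement goes forward at confidence 1.0
        elif count == 1:
            status = "flagged"       # no two samples agree
        else:
            status = "disputed"      # partial agreement (assumed label)
        result[field] = {
            "value": None if status == "flagged" else json.loads(top_json),
            "agreement": count / len(samples),
            "status": status,
        }
    return result


samples = [
    {"value": 5.2, "unit": "mW/m2", "value_type": "peak"},
    {"value": 5.2, "unit": "mW/m2", "value_type": "maximum"},
    {"value": 5.2, "unit": "W/m2", "value_type": "steady_state"},
]
for field, verdict in self_consistency(samples).items():
    print(field, verdict["status"])
```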

The current corpus state runs primarily Layer 1 + Layer 4 (free-tier round-robin, automatic critic on first 5). Higher layers activate when budget allows.
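Two of the Layer 4 rules named above are simple enough to sketch directly: the CE upper bound and the P ≈ V·I check within 15%. The function name, argument names, and issue labels are hypothetical, and the inputs are assumed to be in mutually consistent units (for example V, A, and W).

```python
def rule_violations(ce_pct=None, voltage=None, current=None, power=None,
                    tol: float = 0.15) -> list[str]:
    """Return labels for the rule-based checks that an extraction fails."""
    issues = []
    # Coulombic efficiency above 100% is physically impossible.
    if ce_pct is not None and ce_pct > 100.0:
        issues.append("coulombic_efficiency_over_100")
    # Reported power should match V * I within the tolerance.
    if voltage is not None and current is not None and power is not None:
        expected = voltage * current
        if abs(power - expected) > tol * abs(expected):
            issues.append("power_not_equal_to_v_times_i")
    return issues


print(rule_violations(ce_pct=120.0, voltage=0.5, current=2.0, power=1.0))
print(rule_violations(ce_pct=80.0, voltage=0.5, current=2.0, power=1.3))
```

Checks with missing inputs are skipped rather than failed, so a paper that omits a value is not penalized for an inconsistency that cannot be computed.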