How MESSAI works

Methodology & reference

How papers enter the corpus, what fields are extracted from each one, and how the classification taxonomy is defined. This page is the canonical source for understanding what the platform claims about each paper, and the audit-ready definitions behind every chip, badge, and parameter row in the UI. Share with collaborators or domain experts to sanity-check the field choices.

Goal

Why a structured schema?

Bioelectrochemical-systems papers report values in inconsistent ways: power density normalised to anode area in one paper and to total volume in another, peak vs steady-state values reported without distinction, voltages quoted without the reference electrode, and microbial communities described as "mixed culture" with no abundance data. A free-text scrape produces noise. The MESSAI extractor reads each paper's methods/results/tables and emits a structured JSON document with a fixed schema, so every value carries the context needed to compare it against another paper.

Below is every field the extractor produces, organised the way the JSON is shaped. Each field has a What line (the value), a Why line (what it lets the platform do downstream), and where applicable, the rule used to fill it.

is_bes_paper

Step 0 — Classification gate

Before extracting any measurements, the LLM decides whether the paper actually describes a bioelectrochemical system at all. Many corpus PDFs turn out to be off-topic: zinc-air fuel cells, COMSOL simulations of PEM hydrogen cells, VLSI circuits, clinical guidelines, mobile educational apps, etc. The classification gate rejects these cheaply (~$0.001 per call) and quarantines the file from downstream processing.

The LLM only emits the rich extraction schema when is_bes_paper: true. When false, it returns rejection_reason + concept_tags describing what the paper IS about, and the platform soft-hides the paper.

Field (type): What & why

is_bes_paper (boolean)
What: true if the paper describes microbial/enzymatic catalysis at electrodes connected to an external circuit; false otherwise.
Why: Architecture-aware filter that uses inferred BES rules, not just keyword matching, so a paper describing "biofilm anode + air cathode + 1 kΩ resistor + acetate substrate" is recognized as an MFC even when the term "MFC" never appears.

rejection_reason (string | null)
What: Single-sentence explanation when is_bes_paper=false.
Why: Audit trail. Surfaced inline on quarantined papers so a reviewer can verify the rejection was correct.
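
A minimal sketch of how a consumer might route the gate's output, assuming the JSON has already been parsed into a dict. The function name route_gate_result and the "extract" / "quarantine" labels are hypothetical; only the field names is_bes_paper, rejection_reason, and concept_tags come from the schema described above.

```python
from typing import Any


def route_gate_result(doc: dict[str, Any]) -> str:
    """Return 'extract' for BES papers and 'quarantine' for rejected ones."""
    if doc.get("is_bes_paper") is True:
        return "extract"
    # A rejected paper must explain itself so a reviewer can audit the decision.
    if not doc.get("rejection_reason"):
        raise ValueError("rejected paper is missing rejection_reason")
    return "quarantine"


# Illustrative gate outputs (invented for this example).
accepted = {"is_bes_paper": True, "rejection_reason": None}
rejected = {
    "is_bes_paper": False,
    "rejection_reason": "Describes a zinc-air fuel cell with no biological catalyst.",
    "concept_tags": ["zinc_air", "fuel_cell"],
}

print(route_gate_result(accepted))
print(route_gate_result(rejected))
```

Raising on a missing rejection_reason is a design choice for the sketch: it keeps the audit trail mandatory rather than optional.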

paper_context.primary_system_type

Primary system type

Twelve canonical types covering the full BES landscape. MES specifically means Microbial Electrosynthesis (cathodic CO₂ reduction), not the umbrella term — the umbrella is BES. See the Glossary tab for the full list with definitions.

Field (type): What & why

primary_system_type (enum)
What: Single dominant subtype: MFC | MEC | MES | MDC | EFC | BER | BEF | MEB | BES_generic | BES_modeling | BES_review | other
Why: Lets the platform stratify by the dominant function of the paper. Modeling/review papers are flagged so the predictor doesn't treat their citations as measurements.

paper_context.system_variants[]

Architecture variants

Multi-tag list — a paper can be many things at once. A "constructed wetland MFC for nitrogen removal" gets variants=[constructed_wetland, single_chamber] + applications=[wastewater_treatment, power_generation]. Variants describe the chamber/biology architecture orthogonal to the primary type.

Field (type): What & why

system_variants (string[])
What: Architecture tags drawn from a controlled vocabulary (sediment, plant, photosynthetic, constructed_wetland, biocathode, bioanode_only, mixed_culture, single_chamber, dual_chamber, h_type, tubular, microfluidic, cube, stacked, membraneless, ...).
Why: Powers multi-tag UI filters and stratification: research listings can filter to "all MFCs that are sediment + photosynthetic" with a single index lookup.

Single-chamber MFCs have one liquid compartment and typically no membrane separating anode and cathode, so the prompt instructs the LLM to tag both single_chamber AND membraneless when there's one liquid compartment.
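
The tagging rule above could also be enforced as a post-processing step, in case the LLM forgets one of the two tags. This is only a sketch: normalize_variants is a hypothetical helper, not part of the documented pipeline.

```python
def normalize_variants(variants: list[str]) -> list[str]:
    """Add 'membraneless' whenever 'single_chamber' is tagged, preserving order."""
    out = list(dict.fromkeys(variants))  # drop duplicates, keep first-seen order
    if "single_chamber" in out and "membraneless" not in out:
        out.append("membraneless")
    return out


print(normalize_variants(["constructed_wetland", "single_chamber"]))
```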

paper_context.application_domains[]

Application domains

What the paper is for. Most papers carry 1–3 of these. Lets users discover "what BES research has been done on metal recovery" or "all CO₂-reduction papers" without needing keyword guesses.

Field (type): What & why

application_domains (string[])
What: power_generation | hydrogen_production | wastewater_treatment | desalination | electrosynthesis | biosensing | bioremediation | electrofermentation | nutrient_recovery | CO2_reduction | metal_recovery | soil_remediation
Why: Lets the platform answer "what works for X application" with explicit semantics instead of full-text search.

paper_context.geometry

Geometry

Dimensional and topological information needed to render a 3D model and (eventually) run a multiphysics simulation that can be cross-checked against the paper's own measurements. When a paper omits these the field is null — never invented.

Field (type): What & why

topology_class (enum)
What: single_chamber | dual_chamber_h | cube | tubular | stacked | membraneless | upflow | sediment | microfluidic
Why: Drives 3D model template selection and the multiphysics topology check (a chamber-count mismatch invalidates a model fit).

working_volume_anode_ml / cathode_ml / total_ml (number)
What: Volumes of each chamber + total. When dual-chamber, anode + cathode should sum to total.
Why: Volume is the denominator for volume-normalized power density (W/m³). Without it, only area-normalized comparisons are valid.

inter_electrode_distance_cm (number)
What: Anode-to-cathode separation distance.
Why: Sets the ohmic-loss term in the equivalent circuit. A 50 mV η_ohmic error propagates exponentially into Butler-Volmer fits: the fitted i₀ shifts by roughly 3× at a 120 mV/decade Tafel slope and by ~50× at 30 mV/decade.

anode_projected_area_cm2 / cathode_projected_area_cm2 (number)
What: Geometric area of each electrode (NOT specific surface area).
Why: Denominator for area-normalized power/current density. The single most common normalization mismatch in BES literature.

membrane_area_cm2 / membrane_thickness_um (number)
What: Membrane geometry when present.
Why: Powers ion-transport calculations and proton crossover modeling.

electrode_shape (enum)
What: planar | brush | mesh | cloth | rod | tubular | foam | other
Why: A "brush" electrode and a "planar" electrode of the same projected area have very different actual surface areas and biofilm-volume ratios.

channel_pattern (enum)
What: For microfluidic / flow-through: straight_parallel | serpentine | interdigitated | y_junction | spiral | pin_fin | mesh_distributed
Why: Renders correctly in 3D and determines flow-regime classification (laminar vs Taylor dispersion).

has_labelled_schematic (boolean)
What: Whether the paper includes a labelled schematic figure.
Why: Future Gap B-c: digitizing the schematic for dimensional cross-checks against text-extracted geometry.
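
To make the role of the geometry fields concrete, here is a small sketch of the volume-sum check and the two power-density normalizations they enable. The helper names and the 5% tolerance are assumptions for illustration; the schema only states that anode + cathode should sum to total and that missing values stay null.

```python
from typing import Optional


def volumes_consistent(
    anode_ml: Optional[float],
    cathode_ml: Optional[float],
    total_ml: Optional[float],
    rel_tol: float = 0.05,  # assumed tolerance, not a documented platform value
) -> Optional[bool]:
    """Check anode + cathode == total; return None when any volume is missing."""
    if anode_ml is None or cathode_ml is None or total_ml is None:
        return None  # never invent a value the paper did not report
    return abs((anode_ml + cathode_ml) - total_ml) <= rel_tol * total_ml


def power_density_w_m2(power_mw: float, projected_area_cm2: float) -> float:
    """Area-normalized power density: mW and cm² converted to W/m²."""
    return (power_mw * 1e-3) / (projected_area_cm2 * 1e-4)


def power_density_w_m3(power_mw: float, volume_ml: float) -> float:
    """Volume-normalized power density: mW and mL converted to W/m³."""
    return (power_mw * 1e-3) / (volume_ml * 1e-6)


print(volumes_consistent(100.0, 100.0, 200.0))
print(power_density_w_m2(1.0, 10.0), power_density_w_m3(1.0, 100.0))
```

The same 1 mW reactor yields 1 W/m² on a 10 cm² electrode and 10 W/m³ on a 100 mL volume, which is why the two bases cannot be compared without the geometry.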

paper_context.materials

Materials — anode, cathode, membrane, biofilm

Each electrode role gets a structured object with two parallel fields: a canonical id (slug from mess-materials or mess-membranes) and an as_reported field carrying the exact text from the paper for audit. This means even unmatched materials retain their original wording; nothing is silently dropped.

Biological electrodes (biocathode, bioanode, algal_biofilm, photosynthetic_biofilm, enzymatic_electrode) are first-class material slugs starting in v1.2 — when a paper describes a photosynthetic biocathode, that information lands in materials.cathode.id, not just in system_variants.

Field (type): What & why

materials.{anode,cathode,membrane}.id (string | null)
What: Canonical slug from mess-materials or mess-membranes. Null when no canonical match found.
Why: Joining papers by canonical material lets us answer "what power densities have been reported on carbon_cloth + Geobacter sulfurreducens" in O(1) DB lookups.

materials.{anode,cathode,membrane}.as_reported (string)
What: Exact text from the paper, verbatim.
Why: Audit trail. When id is null, this is the only signal. When id is set, this lets reviewers verify the mapping was correct.

materials.{anode,cathode,membrane}.treatments[] (array)
What: List of treatment objects with canonical id (acid_HNO3, heat_air, plasma_O2, MnO2_coating, ...) + as_reported + duration + temperature + agent.
Why: The same material with different pre-treatment can produce a 25× performance range (carbon cloth as-received vs MXene-coated). Without treatments, the material signal is muddied.

materials.cathode.catalyst.{id,as_reported} (object)
What: Catalyst layer on the cathode (Pt/C, MnO2, ABO3 perovskite, etc.), separate from the cathode substrate itself.
Why: For air-cathode MFCs the catalyst loading dominates kinetics, not the underlying carbon support.

materials.cathode.environment (enum)
What: air_exposed | liquid_submerged | gas_phase | other
Why: Air-cathode vs aqueous-cathode is a topology-level distinction that drastically changes O₂ availability and ORR kinetics.

materials.biofilm (object)
What: thickness_um, density_g_L, conductivity_mS_cm, dominant_organism.{id, as_reported}.
Why: Biofilm physical parameters drive ohmic and conductive-mediator electron-transfer pathways.
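
The id / as_reported pairing can be illustrated with a toy canonicalizer. This is a sketch under stated assumptions: the ALIASES table is invented for the example (only the carbon_cloth slug appears in this page), and real matching against mess-materials is presumably more involved.

```python
from typing import Optional

# Hypothetical alias table; the real vocabulary lives in mess-materials.
ALIASES: dict[str, str] = {
    "carbon cloth": "carbon_cloth",
    "carbon felt": "carbon_felt",
    "graphite felt": "graphite_felt",
}


def to_material_ref(as_reported: str) -> dict[str, Optional[str]]:
    """Map reported text to a canonical slug, always keeping the verbatim text."""
    key = " ".join(as_reported.lower().split())  # collapse case and whitespace
    return {"id": ALIASES.get(key), "as_reported": as_reported}


print(to_material_ref("Carbon  Cloth"))
print(to_material_ref("MXene-coated sponge"))
```

An unmatched material comes back with id set to None and its original wording intact, which is the "nothing is silently dropped" behaviour described above.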

paper_context.microbial_analysis

Microbial analysis

What organism(s) the paper studied, with provenance about how the identification was made (16S? metagenomics? culture-dependent?). Distinguishes a paper claiming "dominant Geobacter from V1-V3 16S" from one claiming the same identity from full-length PacBio sequencing — the methodological tier matters when training the predictor.

Field (type): What & why

analysis_method (enum)
What: 16S_rRNA | metagenomics | metatranscriptomics | FISH | qPCR_specific_genes | culture_dependent | none
Why: Methodological tier: 16S V4 amplicon < full-length 16S < shotgun metagenomics. Aggregating these as one signal mixes reliability tiers.

sequencing_platform / 16S_region_amplified / read_count (string | int)
What: Illumina MiSeq / NovaSeq / PacBio / Nanopore / Sanger; V1-V3, V3-V4, V4, V4-V5, full_length; reads per sample.
Why: Resolution depends on platform + region. Platform metadata is the audit trail for downstream microbe-feature confidence.

diversity_indices (object)
What: Shannon, Simpson, Chao1, Pielou, Good's coverage %.
Why: Community-level signals: a high-Shannon mixed culture behaves differently from a low-Shannon enrichment.

dominant_taxa[] (array)
What: List of {id (canonical mess-microbes slug), as_reported, name, rank, relative_abundance_pct, confidence}.
Why: Cross-paper organism analysis: which strains achieve which performance under which conditions.

biofilm_vs_planktonic_separated (boolean)
What: Was sequencing done on biofilm vs free-cells separately, or pooled?
Why: Biofilm and planktonic communities can be very different. Pooled sequencing dilutes the electrode-attached signal.
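
For readers less familiar with the diversity indices, the sketch below computes the Shannon index H' = -Σ pᵢ ln pᵢ and Pielou evenness J = H'/ln S from a list of relative abundances. The function names are hypothetical, and the extractor records the indices a paper reports rather than recomputing them; an index computed from dominant taxa alone would only approximate the full-community value.

```python
import math


def shannon_index(relative_abundance_pct: list[float]) -> float:
    """Shannon index (natural log) from relative abundances in percent."""
    total = sum(relative_abundance_pct)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    # Renormalize so partial lists (e.g. dominant taxa only) still sum to 1.
    proportions = [a / total for a in relative_abundance_pct if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def pielou_evenness(relative_abundance_pct: list[float]) -> float:
    """Pielou evenness; returned as 0.0 by convention when fewer than 2 taxa."""
    richness = sum(1 for a in relative_abundance_pct if a > 0)
    if richness < 2:
        return 0.0
    return shannon_index(relative_abundance_pct) / math.log(richness)


mixed = [25.0, 25.0, 25.0, 25.0]   # even mixed culture
enriched = [97.0, 1.0, 1.0, 1.0]   # enrichment dominated by one taxon
print(shannon_index(mixed), shannon_index(enriched))
```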

paper_context.electrochemistry

Electrochemistry — Butler-Volmer + Tafel

Per-electrode kinetic parameters. When a paper reports them, we capture them; when it doesn't, we leave the slot null and the predictor uses literature defaults. This schema mirrors the parameters needed by the platform's COMSOL-style validation harness.

Field (type): What & why

cathode_reaction_identity (enum)
What: O2_ORR | HER | ferricyanide | other.
Why: O₂ reduction has fundamentally different kinetics from H⁺ reduction; pretending they're comparable corrupts predictions.

i0_anode_A_m2 / i0_cathode_A_m2 (number)
What: Exchange current density per electrode.
Why: Butler-Volmer kinetic parameter. The platform's validation harness fits the curve and back-checks against this.

alpha_a / alpha_c (number)
What: Anodic and cathodic charge-transfer coefficients (typically 0–1).
Why: Tafel slope b = 2.303·RT/(αnF) per decade of current; without α the slope is uninterpretable.

tafel_slope_anode_mV_dec / cathode (number)
What: Slope of the Tafel plot per electrode (mV/decade).
Why: When i₀ is missing, Tafel slope can back-fit α. When both are present, they cross-validate.

E_eq_anode_V / E_eq_cathode_V (number)
What: Equilibrium / Nernst potential per electrode.
Why: Anchors the absolute voltage scale. Without it, polarization curves are only relative.

overpotential_partition (object)
What: {eta_act, eta_ohmic, eta_conc, operating_point_mA_cm2}.
Why: Tells the predictor which loss mechanism dominates at what current density, which determines where to optimize.
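
The cross-validation between α and the Tafel slope can be sketched with the standard relation b = ln(10)·RT/(αnF), which gives the slope per decade of current. The function names and the 25 °C default are assumptions for illustration, not part of the platform's harness.

```python
import math

R = 8.314462618   # gas constant, J/(mol*K)
F = 96485.33212   # Faraday constant, C/mol


def tafel_slope_mv_per_decade(alpha: float, n: int = 1, temperature_c: float = 25.0) -> float:
    """Tafel slope b = ln(10)*R*T / (alpha*n*F), returned in mV/decade."""
    t_k = temperature_c + 273.15
    return 1000.0 * math.log(10) * R * t_k / (alpha * n * F)


def alpha_from_tafel_slope(slope_mv_dec: float, n: int = 1, temperature_c: float = 25.0) -> float:
    """Back-fit the transfer coefficient from a reported Tafel slope."""
    t_k = temperature_c + 273.15
    return 1000.0 * math.log(10) * R * t_k / (slope_mv_dec * n * F)


slope = tafel_slope_mv_per_decade(alpha=0.5)
print(round(slope, 1))                          # about 118 mV/decade at 25 °C
print(round(alpha_from_tafel_slope(slope), 3))  # recovers 0.5
```

When a paper reports both α and a Tafel slope, a large gap between the reported slope and the one computed from α is a cheap inconsistency signal.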

condition_sets[] + observations[]

Condition sets — operating conditions + measurements

When a paper varies a condition (temperature, pH, substrate, R_ext, applied voltage, ...) and reports outcomes for each, the LLM emits a separate condition_set per tuple plus an observations entry tied to that set. This preserves within-paper variation, which is the only kind of variation that supports causal-grade inference.

Every observation carries provenance: which section it came from, what value type (peak vs steady-state vs endpoint), what it's normalized to (anode area? cathode area? volume?), the page number, and a 200-character snippet for audit.

Field (type): What & why

temperature_c / ph_anolyte / ph_catholyte (number)
What: Operating temperature and chamber-specific pH.
Why: Anolyte and catholyte pH can differ in dual-chamber systems. Splitting them captures real biology (e.g. Geobacter at pH 7 anolyte, ORR at pH 3 catholyte).

substrate / substrate_id / substrate_concentration_mg_L (string + slug + number)
What: Substrate name + canonical SubstrateClassification id + concentration.
Why: Canonical id (acetate, glucose, lactate, ...) lets the platform run substrate-aware Monod priors. The as_reported text is preserved.

substrate_complexity (enum)
What: pure_compound | synthetic_mixture | real_wastewater | complex_media
Why: 54.7% of the corpus is real wastewater. Without this distinction, predictions on complex substrates degrade to the literature mean.

substrate_source (string)
What: For wastewater: domestic_ww | brewery | dairy | landfill_leachate | industrial | saline | marine | synthetic.
Why: Inhibitor profiles vary dramatically by source; this is the granular layer below substrate_complexity.

inhibitors_present[] (array)
What: heavy_metals | antibiotics | sulfide | ammonia_NH3_high | salinity_high | phenolic | VFAs_high.
Why: Inhibitors mask the underlying substrate-utilization signal. Flagged so the predictor can stratify.

external_resistance_ohm (number)
What: External load resistance for MFCs; required for I-V interpretation.
Why: Different R_ext drives different operating points on the polarization curve. A peak power without R_ext is uninterpretable.

applied_voltage_V (number)
What: Applied potential for MEC / MES / BER.
Why: For non-spontaneous BES, this is the input that drives the chemistry.

reference_electrode + voltage_reporting_convention (enum + enum)
What: Ag/AgCl_3M_KCl | SCE | Hg/HgO_1M_KOH | SHE; vs_anode_reference | vs_cathode_reference | vs_SHE_normalized | cell_voltage.
Why: A 30-110 mV offset between reference types makes raw voltages incomparable across papers. The platform normalizes everything to vs-SHE for cross-paper analysis.

mass balance fields (numbers)
What: substrate_in/out_mg_L, COD_removed_pct, biomass_produced_mg_L, methane_produced_mol, mass_balance_closure_pct.
Why: Honesty check: a paper claiming 85% CE on glucose with 30% COD removal is internally inconsistent. The closure rate is the audit signal.
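
As an illustration of the vs-SHE normalization, the sketch below shifts a potential measured against one of the listed reference electrodes onto the SHE scale. The offsets are commonly quoted approximate values near 25 °C and are an assumption of this example; the platform's actual table may differ, and the helper name to_she is hypothetical.

```python
# Approximate reference potentials vs SHE at ~25 °C (assumed textbook values).
OFFSET_VS_SHE_V: dict[str, float] = {
    "Ag/AgCl_3M_KCl": 0.210,
    "SCE": 0.241,
    "Hg/HgO_1M_KOH": 0.098,
    "SHE": 0.0,
}


def to_she(potential_v: float, reference_electrode: str) -> float:
    """Convert a potential quoted vs a reference electrode to the SHE scale."""
    try:
        offset = OFFSET_VS_SHE_V[reference_electrode]
    except KeyError:
        raise ValueError(f"unknown reference electrode: {reference_electrode}") from None
    return potential_v + offset


# -0.300 V vs Ag/AgCl (3 M KCl) is about -0.090 V vs SHE.
print(round(to_she(-0.300, "Ag/AgCl_3M_KCl"), 3))
```

Failing loudly on an unknown reference keeps a voltage with missing provenance from being silently treated as vs-SHE.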

Per-observation fields

Field (type): What & why

parameter_id (string, slug)
What: Ontology slug from mess-parameters (power_density, coulombic_efficiency, ...). 678 active definitions.
Why: Clean joins to ParameterDefinition for unit + bounds + display label. Avoids duplicate parameter definitions across papers.

value + unit (number + string)
What: The reported numeric value + the EXACT unit shown in the paper.
Why: The platform never converts units at extraction time. Conversions happen later, with the original unit preserved for audit.

value_type (enum)
What: peak | maximum | steady_state | endpoint | average | initial | final | minimum | median | range_min | range_max
Why: A peak value and a steady-state value of the same parameter are different numbers measuring different things. Conflating them produces 2-3× errors.

is_normalized_to (enum)
What: anode_area | cathode_area | membrane_area | total_volume | anode_volume | biomass_mass | null
Why: For power and current density, the basis is decisive. mW/m² to anode area vs cathode area is a 1.5-3× difference; to volume is 10-1000×.

condition_set_label (string)
What: Which ConditionSet this observation was measured under.
Why: Binds the value to its conditions. Without this, an observation is a number with no operating point.

section_source (enum)
What: abstract | methods | results | tables | discussion | figure_caption.
Why: Trust by section: tables > methods > results > discussion. The platform stratifies confidence by section.

page + snippet (int + string)
What: Page number + ±200 char window from source text.
Why: Audit trail. A reviewer can click the value and see the exact text it was extracted from.

confidence (float [0,1])
What: LLM-reported confidence, calibrated against fixture-paper ground truth.
Why: Drives needs_review threshold (default <0.7) and Move 4 calibration weighting.

extraction_method (enum)
What: REGEX | LLM | HYBRID | MANUAL.
Why: Regex extractions are typically more reliable for numeric ranges; LLM extractions handle complex context. Knowing which path produced a value lets us audit failures.

uncertainty (object)
What: {value, type: std | sem | ci95 | range} when paper reports ±.
Why: A "5.2 ± 0.3 (n=3)" row is high-quality training data; a "~5" row is noise. Move 4 calibration weights by these.

replicates (int)
What: Number of independent replicates the value is averaged over.
Why: Reproducibility tier. n≥3 reactors reported separately is the gold standard.

measurement_technique (enum)
What: polarization_curve | chronoamperometry | chronopotentiometry | cyclic_voltammetry | LSV | EIS | open_circuit | other.
Why: Power from polarization-curve peaks vs from chronoamperometry steady-state are different things. Stratifying by technique is essential for fair comparison.
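
A minimal sketch of how a few of these per-observation fields might be used together, assuming a simplified Observation record with only a subset of the fields. The class and both helper functions are hypothetical; the 0.7 default mirrors the needs_review threshold mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    parameter_id: str
    value: float
    unit: str                       # exact unit as shown in the paper
    value_type: str                 # e.g. "peak" or "steady_state"
    is_normalized_to: Optional[str]
    condition_set_label: str
    confidence: float


def needs_review(obs: Observation, threshold: float = 0.7) -> bool:
    """Flag observations whose confidence falls below the review threshold."""
    return obs.confidence < threshold


def directly_comparable(a: Observation, b: Observation) -> bool:
    """Two values compare fairly only if parameter, unit, value type and basis match."""
    return (a.parameter_id, a.unit, a.value_type, a.is_normalized_to) == (
        b.parameter_id, b.unit, b.value_type, b.is_normalized_to
    )


# Illustrative values, not taken from any paper.
peak = Observation("power_density", 850.0, "mW/m²", "peak", "anode_area", "R_ext=1000", 0.92)
steady = Observation("power_density", 410.0, "mW/m²", "steady_state", "anode_area", "R_ext=1000", 0.64)

print(needs_review(peak), needs_review(steady))
print(directly_comparable(peak, steady))
```

Here the peak and steady-state rows share a parameter and a basis but still fail the comparability check, which is the conflation the value_type field exists to prevent.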

polarization_curves[]

Polarization curves

When a paper reports a polarization curve, even partially, the LLM extracts every (current density, voltage, power density) point as a row. These power Move 1's within-paper effect modeling and let the multiphysics validation harness fit Tafel slopes per condition set.

Field (type): What & why

polarization_curves[].condition_set_label (string)
What: Which ConditionSet the curve was measured under.
Why: Different conditions produce different curves; binding to the condition set keeps them apart.

polarization_curves[].source (enum)
What: table | figure_digitized | caption_extracted | llm_inferred
Why: Tells downstream consumers how reliable each point is. Table > caption > LLM-inferred.

points[].i_A_m2 / V_V / P_W_m2 (number)
What: (i, V, P) triplet per point, ordered along the curve.
Why: Move 1 fits the curve to extract internal resistance, peak power, and within-paper effect sizes.
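
To show what can be derived from the extracted points, the sketch below picks the peak-power point and estimates an area-specific internal resistance as the negative slope of a least-squares line through V versus i. This is a deliberate simplification and an assumption of the example: a real fit would normally restrict itself to the ohmic (linear) region, and Move 1's actual method is not specified here. The point values are invented.

```python
def peak_power_point(points: list[dict[str, float]]) -> dict[str, float]:
    """Return the point with the highest reported power density."""
    return max(points, key=lambda p: p["P_W_m2"])


def area_specific_resistance_ohm_m2(points: list[dict[str, float]]) -> float:
    """Negative least-squares slope of V vs i, in ohm*m² (linear-region assumption)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_i = sum(p["i_A_m2"] for p in points) / n
    mean_v = sum(p["V_V"] for p in points) / n
    sxx = sum((p["i_A_m2"] - mean_i) ** 2 for p in points)
    if sxx == 0:
        raise ValueError("all points share the same current density")
    sxy = sum((p["i_A_m2"] - mean_i) * (p["V_V"] - mean_v) for p in points)
    return -sxy / sxx


# Illustrative curve: voltage drops 0.1 V per A/m².
points = [
    {"i_A_m2": 0.0, "V_V": 0.7, "P_W_m2": 0.0},
    {"i_A_m2": 1.0, "V_V": 0.6, "P_W_m2": 0.6},
    {"i_A_m2": 2.0, "V_V": 0.5, "P_W_m2": 1.0},
    {"i_A_m2": 3.0, "V_V": 0.4, "P_W_m2": 1.2},
]

print(peak_power_point(points))
print(round(area_specific_resistance_ohm_m2(points), 3))
```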

eis_spectra[]

EIS spectra

Same shape as polarization curves but for electrochemical impedance spectroscopy. Each Nyquist point (frequency, Z_real, Z_imag) is captured per electrode (anode, cathode, or whole cell). Equivalent-circuit fits live separately in ElectrochemicalKinetic.

microbial_kinetic_constants[]

Microbial kinetic constants

One row per (organism × substrate × parameter). Captures Monod μ_max, K_s, Y_X/S, decay rate, maintenance coefficient, current-Marcus K_M, and lag phase when reported. Used by the Move 2 substrate-aware Monod prior.
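
For reference, the Monod relation that μ_max and K_s parameterize is μ = μ_max·S/(K_s + S). The sketch below evaluates it; the function name and the example numbers are illustrative assumptions, and how the Move 2 prior consumes these constants is not shown.

```python
def monod_growth_rate(substrate_mg_l: float, mu_max_per_h: float, k_s_mg_l: float) -> float:
    """Specific growth rate from the Monod equation: mu_max * S / (K_s + S)."""
    if substrate_mg_l < 0:
        raise ValueError("substrate concentration cannot be negative")
    if k_s_mg_l <= 0:
        raise ValueError("K_s must be positive")
    return mu_max_per_h * substrate_mg_l / (k_s_mg_l + substrate_mg_l)


# At S == K_s the growth rate is half of mu_max, by definition of K_s.
print(monod_growth_rate(50.0, 0.4, 50.0))
# At S >> K_s the rate saturates towards mu_max.
print(monod_growth_rate(5000.0, 0.4, 50.0))
```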

electrochemical_kinetics[]

Electrochemical kinetics

Per-electrode kinetics: exchange current density (i₀), transfer coefficient (α), number of electrons (n), Tafel slope (b), equilibrium potential (E_eq), and overpotential partitions (η_act, η_ohmic, η_conc) at named operating points. Required by the multiphysics validation harness.

author_stated_limitations[] / future_work_stated[]

Author-stated limitations + future work

What the paper itself acknowledges as gaps and what it suggests as next steps. Surfaced on the AI Insights tab as "opportunities" — they're a more honest signal than the platform inferring research gaps from the corpus alone.

paper_context.concept_tags / cited_models / analysis_software

Concept tags + cited models

concept_tags are short keywords describing the paper's contribution beyond the system type (e.g. "biofilm_engineered", "novel_architecture", "scalable"). cited_models are theoretical models the paper uses (Butler-Volmer, Monod, Nernst, Tafel, Bruggeman). analysis_software records the tools (Origin, MATLAB, EC-Lab, NOVA, COMSOL).

Layers 1–6 (when run on full pipeline)

Multi-model validation orchestration

The full extraction stack runs each paper through six layers:

  1. Layer 1: self-consistency. 3× sampling at T=0.3; fields with 3/3 agreement go forward at confidence=1.0, and fields with 0/3 agreement are flagged.
  2. Layer 2 — cross-model consensus on disputed fields (Gemini 2.5 Flash + Cerebras llama-3.1-70b for triangulation).
  3. Layer 3 — adversarial critic prompt: "find errors in this extraction" looks for unit confusion, value-type misattribution, arithmetic inconsistencies, vs-SHE sanity, stoichiometric impossibilities.
  4. Layer 4 — rule-based: range bounds (CE>100% impossible), internal arithmetic (P ≈ V·I within 15%), substrate stoichiometric closure.
  5. Layer 5 — cross-paper consistency: same-author baseline checks, methodological-clone outlier detection, citation cross-checks, oracle outliers vs Logan/Patil/Rodríguez literature meta-analyses.
  6. Layer 6 — Sonnet 4.6 escalation only when Layers 1+2 disagree AND critic flags errors AND rules trigger. Expected to fire on ≤5% of papers.
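
The Layer 1 agreement rule and the Layer 6 escalation condition from the list above can be sketched as follows. The function names are hypothetical. The list does not say what happens to a 2/3 majority, so treating it as "disputed" (and thus a candidate for Layer 2) is an assumption of this example.

```python
from collections import Counter
from typing import Any


def classify_field(values: list[Any]) -> dict[str, Any]:
    """Classify one field from repeated samples (values must be hashable)."""
    top_value, top_count = Counter(values).most_common(1)[0]
    if top_count == len(values):
        # Full agreement, e.g. 3/3: goes forward at confidence 1.0.
        return {"value": top_value, "status": "agreed", "confidence": 1.0}
    if top_count == 1:
        # No two samples agree, e.g. 0/3: flagged.
        return {"value": None, "status": "flagged", "confidence": None}
    # Partial agreement, e.g. 2/3: assumed to be passed on as disputed.
    return {"value": top_value, "status": "disputed", "confidence": None}


def should_escalate(layers_1_2_disagree: bool, critic_flags_errors: bool, rules_triggered: bool) -> bool:
    """Layer 6 fires only when all three conditions hold."""
    return layers_1_2_disagree and critic_flags_errors and rules_triggered


print(classify_field([5.2, 5.2, 5.2]))
print(classify_field([5.2, 5.2, 52.0]))
print(classify_field([5.2, 52.0, 0.52]))
print(should_escalate(True, True, False))
```

Requiring all three signals before escalating is what keeps the expensive layer rare.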

The current corpus state runs primarily Layer 1 + Layer 4 (free-tier round-robin, automatic critic on first 5). Higher layers activate when budget allows.
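
Since Layer 4 is one of the two layers in regular use, here is a sketch of two of its rules: the coulombic-efficiency bound and the P ≈ V·I check within 15%. The dict keys (coulombic_efficiency_pct, voltage_V, current_A, power_W) and the issue labels are hypothetical names chosen for the example, not the platform's schema.

```python
from typing import Optional


def rule_violations(obs: dict[str, Optional[float]], rel_tol: float = 0.15) -> list[str]:
    """Return a list of rule-based issues found in one set of reported values."""
    issues: list[str] = []

    # Range bound: coulombic efficiency above 100% (or below 0%) is impossible.
    ce = obs.get("coulombic_efficiency_pct")
    if ce is not None and not (0.0 <= ce <= 100.0):
        issues.append("coulombic_efficiency_out_of_range")

    # Internal arithmetic: reported power should match V*I within rel_tol.
    v, i, p = obs.get("voltage_V"), obs.get("current_A"), obs.get("power_W")
    if v is not None and i is not None and p is not None:
        expected = v * i
        if abs(p - expected) > rel_tol * abs(expected):
            issues.append("power_not_equal_to_V_times_I")

    return issues


print(rule_violations({"coulombic_efficiency_pct": 120.0}))
print(rule_violations({"voltage_V": 0.5, "current_A": 0.002, "power_W": 0.001}))
print(rule_violations({"voltage_V": 0.5, "current_A": 0.002, "power_W": 0.002}))
```

Missing values are skipped rather than treated as failures, so the check only fires on numbers a paper actually reports.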