For researchers - Data Garden

Scope & introduction

The Data Garden is the public projection of the Protein Discovery Platform - a working dataset that links plant-protein sources to scalable extraction methods and, through them, to functional outcomes in food-relevant assays. The thesis is simple: ingredients that are cheap to isolate at scale frequently fail in the functional roles that matter in food, and the field has lacked a reproducible, comparable evidence base that connects extraction parameters back to those outcomes. The Data Garden is built to fill that gap.

The initial product focus is cheese and egg functionality - high-value, high-difficulty categories where stretch, melt, binding, and foaming behavior have remained elusive across plant-protein systems. The methods and benchmarks are designed to carry over to dairy bases, bakery systems, and aerated products, so the dataset is intentionally broader than its first applications.

The work is staged as alternating exploration and exploitation cycles. Exploration cycles widen the dataset: surveying diverse protein sources under standardized extraction and assay conditions to surface functional "hits." Exploitation cycles deepen the most promising candidates: refining extraction windows, validating reproducibility across replicates, and producing transfer-ready SOPs and proof-of-concept formulations. Reliability comes first; throughput follows.

Technical language

Working definitions for the terms used across the catalog and publications. Three vocabularies meet here - protein chemistry, food functionality, and the platform's own identifier scheme. The glossary anchors each term to the protocol that defines its measurement, so a reader can move from a reported value back to the operation that produced it.

Extraction routes are named at the workflow level (isoelectric precipitation, micelle-based precipitation, alkaline + membrane concentration) rather than the brand-name level. Functional readouts are reported with their assay context - solubility is always paired with pH and ionic strength, foam decay with rotor-stator conditions, gel strength with the texture-analyzer method that produced it. Ingredient identifiers follow ING-NNN at the record level, with a versioned suffix tied to the frozen run (.vN or .vN.M, as in ING-007.v3.2).
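
As a rough illustration of the identifier scheme, the sketch below splits a run identifier into its record number and version. The regular expression is an assumption: it accepts ING-NNN with a .vN suffix and, because run identifiers elsewhere on this page look like ING-007.v3.2, an optional minor component. RunId and parse_run_id are hypothetical names, not part of any platform API.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "ING-" + three digits, then ".v<major>" with an optional ".<minor>".
_RUN_ID = re.compile(r"^ING-(\d{3})\.v(\d+)(?:\.(\d+))?$")


class RunId(NamedTuple):
    record: int            # the NNN part of ING-NNN
    major: int             # the N in .vN
    minor: Optional[int]   # the M in .vN.M, or None when absent


def parse_run_id(text: str) -> RunId:
    """Parse a run identifier, raising ValueError on anything unrecognized."""
    match = _RUN_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized run identifier: {text!r}")
    record, major, minor = match.groups()
    return RunId(int(record), int(major), int(minor) if minor is not None else None)
```

Rejecting anything that does not match, rather than guessing, keeps a mistyped identifier from silently resolving to the wrong run.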

A hit is a protein-extraction combination that meets three criteria simultaneously: benchmark functionality - performance at or above a reference (soy/pea for emulsification and gelation, egg white for foaming, casein for stretch and melt); reproducibility and stability - confirmed across at least two independent replicates within the relevant assay window; and practical feasibility - yield and purity sufficient to support repeat testing and product-format transfer.
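
The three criteria can be read as a conjunction, which the sketch below makes explicit. It is a minimal illustration under stated assumptions: higher readouts are better, "confirmed across replicates" is interpreted as every replicate meeting the benchmark, and feasibility is reduced to two boolean flags. Candidate, is_hit, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    replicate_scores: List[float]  # one functional readout per independent replicate
    benchmark_score: float         # same readout for the reference (e.g. egg white for foaming)
    yield_ok: bool                 # assumed flag: yield supports repeat testing
    purity_ok: bool                # assumed flag: purity supports product-format transfer


def is_hit(c: Candidate, min_replicates: int = 2) -> bool:
    """Return True only when all three criteria hold at once."""
    # Reproducibility: at least two independent replicates are on record.
    reproducible = len(c.replicate_scores) >= min_replicates
    # Benchmark functionality: every replicate is at or above the reference (assumed reading).
    meets_benchmark = reproducible and all(
        score >= c.benchmark_score for score in c.replicate_scores
    )
    # Practical feasibility: both yield and purity are sufficient.
    feasible = c.yield_ok and c.purity_ok
    return reproducible and meets_benchmark and feasible
```

A single strong replicate does not qualify under this reading, which matches the stated priority of reliability over throughput.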

[Glossary pages are stubs.] Each term will link to a working definition with its protocol reference.

Methods

Every measurement reported in the Data Garden traces back to one of the protocols below. Each protocol page documents the instrument, calibration regime, replication policy, and the commit SHA of the analysis code used to produce the reported metric. Limiting the core extraction workflow to three aqueous routes - isoelectric precipitation, micelle-based precipitation, and alkaline solubilization paired with membrane concentration - preserves comparability across sources without resorting to harsh solvents.

Open the Data Garden methods (PDF) · coming soon

The full compendium PDF is in editorial review. Once it ships, it will be a single document with abstract, introduction, all methods, and references; the file will be stamped with its own version and git commit so that a saved copy stays auditable.

Protein extraction (IPI / MPI / Membrane)

The three aqueous routes that anchor the dataset. Shared solubilization step; the three diverge on how protein is concentrated out of solution.

→ writeup coming soon

Proximate composition

Protein (Dumas combustion), moisture (oven drying), ash (muffle furnace). Normalizes every downstream readout to a per-protein or per-solids basis.
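
The normalization step can be sketched as follows. This is an illustration only: the nitrogen-to-protein factor of 6.25 is the conventional general-purpose value and is an assumption here (source-specific factors may apply), and the function names are hypothetical.

```python
def protein_fraction(nitrogen_pct: float, n_factor: float = 6.25) -> float:
    """Crude protein mass fraction (as-is basis) from Dumas nitrogen content."""
    return nitrogen_pct * n_factor / 100.0


def per_protein(value_per_g_sample: float, nitrogen_pct: float,
                n_factor: float = 6.25) -> float:
    """Re-express a per-gram-of-sample readout per gram of protein."""
    return value_per_g_sample / protein_fraction(nitrogen_pct, n_factor)


def per_solids(value_per_g_sample: float, moisture_pct: float) -> float:
    """Re-express a per-gram-of-sample readout per gram of dry solids."""
    return value_per_g_sample / (1.0 - moisture_pct / 100.0)
```

For instance, a sample with 12.8 % nitrogen has an assumed protein fraction of 0.80, so a readout of 2.4 per gram of sample becomes 3.0 per gram of protein.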

→ writeup coming soon

Subunit profile (SDS-PAGE)

Reducing and non-reducing prep on stain-free Mini-PROTEAN gels; band pattern fingerprints which subunits are present in what proportion.

→ writeup coming soon

Particle size (DLS)

Dynamic light scattering on the Anton Paar Litesizer 700; sub-nm to several µm. Z-average diameter, PDI, and full intensity distribution.

→ writeup coming soon

Thermal behavior (DSC)

Discovery DSC 2500. Onset temperature, enthalpy, endotherm width - together a read on how much native structure survives extraction.

→ writeup coming soon

Solubility curve

Bradford colorimetric assay of the supernatant after centrifugation, across pH and ionic strength. Turbiscan-based variant covers the sedimentation kinetics.
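
A minimal sketch of how a solubility curve might be tabulated, assuming solubility is expressed as supernatant protein relative to total protein in the dispersion. The Bradford standard-curve step is omitted; concentrations are taken as already determined. Function names and the keying by (pH, ionic strength) are assumptions.

```python
from typing import Dict, Tuple


def solubility_pct(supernatant_mg_ml: float, total_mg_ml: float) -> float:
    """Soluble protein as a percentage of total protein in the dispersion."""
    if total_mg_ml <= 0:
        raise ValueError("total protein concentration must be positive")
    return 100.0 * supernatant_mg_ml / total_mg_ml


def solubility_curve(
    readings: Dict[Tuple[float, float], float], total_mg_ml: float
) -> Dict[Tuple[float, float], float]:
    """Map each (pH, ionic strength in M) condition to percent solubility."""
    return {
        condition: solubility_pct(conc, total_mg_ml)
        for condition, conc in sorted(readings.items())
    }
```

Keying each value by its pH and ionic strength keeps the readout paired with its assay context.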

→ writeup coming soon

Water-holding capacity

Centrifugal expressible-moisture method on heat-set gels. Reported as grams of water retained per gram of protein solids.
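
The reported unit can be illustrated with a short calculation. The sketch assumes the gel contains only protein solids and water, and that the mass of water expressed during centrifugation is measured directly; the function and parameter names are hypothetical.

```python
def water_holding_capacity(gel_mass_g: float, protein_fraction: float,
                           expressed_water_g: float) -> float:
    """Grams of water retained per gram of protein solids."""
    solids_g = gel_mass_g * protein_fraction
    water_before_g = gel_mass_g - solids_g          # assumes gel = protein solids + water
    retained_g = water_before_g - expressed_water_g
    if retained_g < 0:
        raise ValueError("expressed water exceeds the water initially in the gel")
    return retained_g / solids_g
```

A 10 g gel at 10 % protein that releases 6 g of water retains 3 g per gram of protein solids under these assumptions.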

→ writeup coming soon

Emulsification (EAI, EC, ESI)

Three readouts - activity (turbidimetric), capacity (centrifugation), and stability (Turbiscan time series). All anchored to the protein–oil interface.

→ writeup coming soon

Foaming

Rotor-stator aeration + Turbiscan Tower vertical scan over 30 minutes. Foaming capacity at t=0, stability at t=30 min, drainage profile in between.
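
One way the three foaming readouts could be derived from a height time series is sketched below. The definitions are assumptions, since the protocol writeup is not yet published: capacity is foam height at t=0 relative to the initial liquid height, stability is foam height at 30 min relative to t=0, and a foam-retention profile stands in for the drainage profile.

```python
from typing import List, Tuple


def foam_metrics(series: List[Tuple[float, float]], liquid_height_mm: float):
    """series holds (minutes, foam_height_mm) pairs and must include t=0 and t=30."""
    heights = dict(series)
    h0, h30 = heights[0], heights[30]
    capacity_pct = 100.0 * h0 / liquid_height_mm    # foam formed per unit of starting liquid
    stability_pct = 100.0 * h30 / h0                # foam remaining after 30 min
    # Percent of the initial foam height remaining at each time point.
    retention = [(t, 100.0 * h / h0) for t, h in sorted(series)]
    return capacity_pct, stability_pct, retention
```

With a 20 mm liquid column that foams to 40 mm and decays to 30 mm after 30 min, this gives a capacity of 200 % and a stability of 75 %.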

→ writeup coming soon

Thermal gelation (TPA)

Heat-set 10% protein dispersions probed with Texture Profile Analysis on the TA.XTPlusC. Hardness, springiness, cohesiveness, gumminess.
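
The four TPA parameters follow from a two-compression curve using the conventional definitions: hardness is the first peak force, cohesiveness is the ratio of second to first compression work, springiness is the ratio of second to first compression distance, and gumminess is hardness times cohesiveness. The sketch assumes the peak and area values have already been extracted from the instrument trace; TwoBiteCurve and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TwoBiteCurve:
    peak_force_1_n: float   # peak force of the first compression (N)
    area_1: float           # positive work of the first compression
    area_2: float           # positive work of the second compression
    distance_1_mm: float    # probe travel to peak, first compression
    distance_2_mm: float    # probe travel to peak, second compression


def tpa_metrics(curve: TwoBiteCurve) -> Dict[str, float]:
    """Derive the four reported TPA parameters from pre-extracted curve features."""
    cohesiveness = curve.area_2 / curve.area_1
    return {
        "hardness_n": curve.peak_force_1_n,
        "springiness": curve.distance_2_mm / curve.distance_1_mm,
        "cohesiveness": cohesiveness,
        "gumminess_n": curve.peak_force_1_n * cohesiveness,
    }
```

Because gumminess is derived from hardness and cohesiveness, it adds no independent information; it is a convenience readout.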

→ writeup coming soon

Stretch & melt

Microwave-stretch screen for triage, plus the Texture Analyzer extensibility rig after a 200 °C heat treatment. Benchmark is casein.

→ writeup coming soon

Results browser

Browse the anonymized public projection of the dataset. Records are stripped of supplier identity; remaining fields are extraction route, ingredient class, protocol, and the analytical result with its run identifier.
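
The projection described above can be sketched as an allowlist over record fields. This is an illustration, not the portal's implementation: the field names and the to_public helper are assumptions chosen to mirror the five fields listed.

```python
from typing import Any, Dict

# Assumed field names for the five public fields named in the text.
PUBLIC_FIELDS = ("extraction_route", "ingredient_class", "protocol", "result", "run_id")


def to_public(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; anything else (e.g. supplier) is dropped."""
    missing = [field for field in PUBLIC_FIELDS if field not in record]
    if missing:
        raise KeyError(f"record is missing public fields: {missing}")
    return {field: record[field] for field in PUBLIC_FIELDS}
```

Copying an allowlist, rather than deleting known private fields, means a newly added private field stays out of the public view by default.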

[Browse-by filters are a stub UI.] Live filters land in Phase 4.


Ingredient class · extraction route	Protocol readout	Result	Run identifier
Pea isolate · isoelectric	Denaturation onset (DSC)	76.4 °C	ING-007.v3.2
Pea isolate · isoelectric	Solubility at pH 7	42.1 %	ING-007.v3.2
Fava concentrate · alkaline	Denaturation onset (DSC)	72.0 °C	ING-014.v2.1
Mung isolate · isoelectric	Water-holding capacity	3.1 g/g	ING-022.v1.0
Soy isolate · isoelectric	Foam stability (30 min)	68 %	ING-002.v4.0

[Stub rows.] Realistic data to follow.

Platform details

The Data Garden is a read-only public projection of the underlying lab information management system. Analytical records flow one-way from the lab bench → frozen run → anonymized public view; no portal traffic reaches the source system.

Architecture, at a glance

Two services. The source system records and reviews. The portal reads only what has been signed off, snapshotted, and anonymized. Publications are frozen artifacts with stable identifiers - a correction produces a new version with an erratum, and earlier versions are never retracted, which preserves citation integrity.

Metadata is treated as a first-class output. For every sample, the dataset records supplier identity (held privately by default), any available processing history, and the extraction parameters applied in-house. Not all materials arrive with full crop-level provenance, and that is not treated as a failure - what is known is captured, what is unknown is marked, and the reader can see the boundary.

The release model is intentionally tiered. Raw data is publishable, but a bag of numbers without methodological context and navigation tools is less useful than it looks. The hosted platform layers metadata navigation, ML-assisted insights, and integration with supply and compliance data on top of the raw record, so a working scientist can move from "what was measured" to "what does it imply for my application" without rebuilding the surrounding scaffolding themselves.

ML approach & discussion

Machine learning is part of the design from the outset, not an afterthought bolted onto a finished dataset. The protocol set is constrained on purpose so that measurements are comparable across sources; metadata is rich enough to support feature engineering; replicates are required before a hit advances, so variance is observed rather than assumed. The point is to be ready to learn from the data, not just to publish it.

During exploitation cycles, Bayesian optimization guides the search for operational extraction windows - balancing functionality, reproducibility, and manufacturability rather than exhaustively sweeping every parameter. Replicate data and the source-by-process-by-function structure of the dataset let downstream models reason about which conditions drive which outcomes, and where the uncertainty actually lives. Where any machine-assisted step contributes to a reported metric (peak-picking, anomaly flagging, comparison summarization), the model identifier, version, and training context are recorded alongside the run.
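
To make the idea of a guided search concrete, here is a toy Bayesian optimization loop over a single unit-scaled extraction parameter. It is a sketch under heavy assumptions and not the platform's optimizer: the surrogate is a small Gaussian process with a squared-exponential kernel, the acquisition rule is an upper confidence bound, and toy_score is a made-up stand-in for a lab readout that has already been scalarized across functionality, reproducibility, and manufacturability.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel on a 1-D, unit-scaled parameter."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def posterior(xs, ys, x_new, noise=1e-3):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    y_mean = sum(ys) / len(ys)
    alpha = solve(K, [y - y_mean for y in ys])
    v = solve(K, k)
    mean = y_mean + sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mean, math.sqrt(var)


def suggest(xs, ys, candidates, kappa=2.0):
    """Pick the untried candidate with the highest upper confidence bound."""
    untried = [c for c in candidates if c not in xs]

    def ucb(c):
        mean, std = posterior(xs, ys, c)
        return mean + kappa * std

    return max(untried, key=ucb)


def toy_score(x):
    """Hypothetical scalarized readout; stands in for a real assay run."""
    return math.exp(-((x - 0.62) / 0.18) ** 2)


candidates = [i / 40 for i in range(41)]   # grid over the unit-scaled parameter
xs = [0.0, 0.5, 1.0]                       # three seed runs
ys = [toy_score(x) for x in xs]
for _ in range(8):                         # eight guided runs instead of a full sweep
    x_next = suggest(xs, ys, candidates)
    xs.append(x_next)
    ys.append(toy_score(x_next))
best_x = xs[ys.index(max(ys))]
```

The loop spends 11 evaluations on a 41-point grid. The kappa parameter sets how strongly the search favors uncertain regions over the current best estimate; a real campaign would also need to handle replicate noise and more than one parameter.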

Gardener - the open ML window

Gardener is a query-and-explain layer over the public projection. It operates on the same anonymized view that the catalog exposes - no backdoor access to source data. Three Gardener variants exist (public, entitled, internal) with non-overlapping tool whitelists; the public variant is the only one available without sign-in. Gardener is a food-science LLM grounded in literature, industry best practices, and the platform's own analytical record; its job is to make the dataset navigable, not to invent results.

Citing the Data Garden

Cite individual runs by their stable identifier (e.g. ING-007.v3.2), comparison sets by their snapshot identifier, and the dataset as a whole by its DOI (forthcoming). The repository is versioned at the release level - Exploration-1, Dataset V1, Dataset V2, and forward - so a citation pins both the figure and the surrounding cohort at the exact frozen state that produced it.

[Citation format, BibTeX snippets, and the canonical DOI will land alongside the first formal release.]