There Is No Such Thing as a 'Data Moat'
Biotech will not be won by hoarding measurements, but by engineering identifiability, repeatability, and constraint-aware control
Every few months someone says it with a straight face.
“We have a data moat.”
They mean it the way software people mean it. A scarce asset. A one‑way door. Something that keeps competitors out even if they throw money at the problem. They picture a corpus growing quietly in the background, compounding into inevitability.
Biology does not work that way. Neither do the markets.
I’ve had a long-running gripe with the way people talk about “data moats.” I laid out my thinking in “Hey VCs, Your Outdated AI Investment Strategy Will Cost You and the Ecosystem Dearly” (July 2024). I’ve had plenty of debates with people who are convinced that data alone is a durable advantage. It’s 2025, and I still haven’t seen much evidence that the core argument has shifted, or that anyone has really shown otherwise.
The problem is not that data is useless. It’s that the story people tell about data in biotech commits a category error.
A cell is not a database. It is a regulated dynamical object with hidden state, feedback, redundancy, and context dependence that is far more fragile than our dashboards admit. It adapts to observation. It adapts harder to intervention. What you call a “feature” is often a compensation. What you call “signal” is often a proxy for a state you never measured.
So when someone says “we have a data moat,” I translate it into the only question that matters.
Does this pile of measurements uniquely constrain a set of actions that will move a biological system in the intended direction, across the regimes where translation lives?
Most of the time the honest answer is no. Sometimes it is not even close.
In causal inference, observational data typically does not identify a single causal story. It identifies an equivalence class: multiple causal graphs that agree on every conditional independence you can observe. Distinct mechanisms can fit the same correlations, and you only start collapsing the ambiguity when you intervene, or when you bring strong external constraints to the table. That is not a philosophical complaint. It is part of the formal backbone of causal discovery and identifiability. [1]
This is why the industry keeps confusing correlation with mechanism, and it has nothing to do with intelligence. It is the shape of the problem.
In high-dimensional biology, many causal stories are observationally indistinguishable. They generate the same patterns in your dataset. They generate the same impressive validation curves. They diverge only when you do the one thing that creates value: you intervene.
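To make the equivalence-class point concrete, here is a minimal sketch, not anything from the cited literature. It assumes two toy linear-Gaussian models with hypothetical labels "x_causes_y" and "y_causes_x", a made-up coupling b = 0.8, and a hypothetical do_x argument standing in for an intervention. The two models produce the same observational correlation and part ways only once X is set by hand.

```python
import random

def corr(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate(direction, n=20000, b=0.8, do_x=None, seed=0):
    """Sample (x, y) pairs from one of two toy linear-Gaussian models.

    direction: "x_causes_y" or "y_causes_x" (hypothetical labels).
    do_x: if given, X is set by intervention, cutting any arrow into X.
    """
    rng = random.Random(seed)
    # Without intervention, this noise level gives both variables unit variance.
    noise_sd = (1 - b ** 2) ** 0.5
    xs, ys = [], []
    for _ in range(n):
        if direction == "x_causes_y":
            x = rng.gauss(0, 1) if do_x is None else do_x
            y = b * x + rng.gauss(0, noise_sd)
        else:
            y = rng.gauss(0, 1)
            if do_x is None:
                x = b * y + rng.gauss(0, noise_sd)
            else:
                x = do_x  # Y was generated first and never sees this value
        xs.append(x)
        ys.append(y)
    return xs, ys

# Observational regime: both mechanisms imply the same correlation, b.
obs_corr_a = corr(*simulate("x_causes_y"))
obs_corr_b = corr(*simulate("y_causes_x"))

# Interventional regime: set X = 2 and look at the mean of Y.
_, ys_a = simulate("x_causes_y", do_x=2.0)
_, ys_b = simulate("y_causes_x", do_x=2.0)
mean_y_do_a = sum(ys_a) / len(ys_a)  # expectation is b * 2 = 1.6
mean_y_do_b = sum(ys_b) / len(ys_b)  # expectation is 0: Y ignores X here

print(obs_corr_a, obs_corr_b, mean_y_do_a, mean_y_do_b)
```

No amount of extra observational sampling separates the two models in this sketch. One intervention does.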
And we have empirical proof that models will happily exploit that ambiguity by learning the world around the biology.
Across multiple clinical ML settings, strong internal performance has repeatedly failed to hold up under external validation because models learn acquisition quirks, site signatures, and hidden biases. In multi-hospital chest X-ray work, for example, internal performance exceeded external performance and models could robustly infer hospital system or department, a direct demonstration that “the dataset” contains stable non-biological structure that a model can use as a shortcut. [2] In COVID-19 radiography work, the point is stated even more bluntly: systems can appear accurate while relying on confounders rather than pathology, then break when moved to new hospitals. [3] Broader multi‑modality evaluations show the same phenomenon across imaging, ECGs, clinical notes, and auscultation, with performance often overestimated due to hidden data acquisition biases. [4]
This is not an edge case. It is what happens when you train on observational snapshots of a system you do not control, collected through channels that drift.
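The mechanism behind those failures is easy to reproduce in miniature. The sketch below is a toy, not a reconstruction of any cited study: the cohort generator, the "site marker" feature, the prevalences, and the one-feature threshold classifier are all assumptions chosen for illustration. The classifier is free to pick a weak biological feature or a strong acquisition signature, and it picks whichever scores best on training data.

```python
import random

def make_cohort(n, p_site1, p_pos_by_site, rng):
    """Rows of (bio_feature, site_marker, label) for a toy two-site study."""
    rows = []
    for _ in range(n):
        site = 1 if rng.random() < p_site1 else 0
        label = 1 if rng.random() < p_pos_by_site[site] else 0
        bio = label + rng.gauss(0, 2.0)    # weak, genuine biological signal
        marker = site + rng.gauss(0, 0.1)  # strong acquisition signature
        rows.append((bio, marker, label))
    return rows

def accuracy(rows, feature, threshold):
    """Fraction of rows where 'feature > threshold' matches the label."""
    hits = sum((r[feature] > threshold) == bool(r[2]) for r in rows)
    return hits / len(rows)

def fit_stump(rows):
    """Choose the single feature and threshold with best training accuracy."""
    best_acc, best_feature, best_threshold = -1.0, None, None
    for feature in (0, 1):
        # Every 25th sorted value is enough resolution for this toy.
        candidates = sorted(r[feature] for r in rows)[::25]
        for threshold in candidates:
            acc = accuracy(rows, feature, threshold)
            if acc > best_acc:
                best_acc, best_feature, best_threshold = acc, feature, threshold
    return best_feature, best_threshold

rng = random.Random(1)
# Training hospitals: site 1 is a referral centre with 90% positives.
train = make_cohort(2000, 0.5, (0.1, 0.9), rng)
internal = make_cohort(2000, 0.5, (0.1, 0.9), rng)
# New hospital: same acquisition signature as site 1, but 50% prevalence.
external = make_cohort(2000, 1.0, (0.5, 0.5), rng)

feature, threshold = fit_stump(train)
internal_acc = accuracy(internal, feature, threshold)
external_acc = accuracy(external, feature, threshold)
print(f"chosen feature index: {feature}")
print(f"internal accuracy: {internal_acc:.2f}, external accuracy: {external_acc:.2f}")
```

In this setup the stump selects the site marker, scores well on held-out data from the same hospitals, and drops to roughly chance at the new one. Nothing about the held-out score warned you.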
High-throughput biology makes the same point, in a different accent. Batch effects, reagent lots, personnel differences, processing date, lab temperature, and dozens of unrecorded nuisances can dominate the measured signal. Leek and colleagues surveyed high-throughput studies across platforms and documented that batch effects are widespread and can drive confusing or incorrect biological conclusions, even after normalization. [5]
So the model can learn the lab. It can learn the hospital. It can learn the era. It can learn your procurement chain.
Then you perturb the system or change the context, and the model reveals what it really was: just a compressor of the training world’s regularities, not a map of the levers that control biology.
This is where people reach for identifiability as a savior. They say: “fine, we will make it identifiable.”
And here is where the argument has to be honest, not righteous.
Treating strict global identifiability as a prerequisite is a kind of fetish. In most biological systems it is unattainable, and even when it is attainable it can be economically brutal. The state is partially hidden. Measurements are sparse and noisy. The intervention space is constrained by feasibility, ethics, cost, and time.
There is a systems-biology literature that spells this out. Structural and practical non-identifiability show up in partially observed dynamical models, and profile-likelihood analyses can expose entire manifolds of parameter values consistent with the data. In plain language, the data simply does not contain enough information to pin down what you want to pin down. [6] Sloppiness results go further. Across a collection of systems biology models, “sloppy” sensitivity spectra appear to be common, and the implication is sharp. Even large amounts of ideal data can leave many parameters poorly constrained, while some predictions may still be tight. [7]
That matters because it punctures the naive scaling story. You can keep adding more of the same kind of data and still not resolve the degrees of freedom that matter for intervention.
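A minimal sketch of what a manifold of data-consistent parameters looks like, under loud assumptions: the decay model, the rate names k1 and k2, and the sampling times are invented for illustration, and this is the simplest structural case rather than the sloppiness analysis from the literature. Two loss processes act in parallel, and only their sum ever touches the measurements.

```python
import math

def response(t, amplitude, k1, k2):
    """Toy decay curve where two loss processes act in parallel."""
    return amplitude * math.exp(-(k1 + k2) * t)

def sse(params, data):
    """Sum of squared errors of the toy model against (t, y) observations."""
    return sum((response(t, *params) - y) ** 2 for t, y in data)

# Noise-free "ideal" data generated with k1 = 0.3, k2 = 0.7.
times = [0.0, 0.5, 1.0, 2.0, 4.0]
data = [(t, response(t, 1.0, 0.3, 0.7)) for t in times]

truth = (1.0, 0.3, 0.7)
same_sum = (1.0, 0.9, 0.1)   # different mechanism split, same k1 + k2
other_sum = (1.0, 0.3, 0.9)  # changes the identifiable combination

sse_truth = sse(truth, data)
sse_same_sum = sse(same_sum, data)
sse_other_sum = sse(other_sum, data)

# A prediction at an unobserved time depends only on k1 + k2, so it stays tight.
pred_truth = response(3.0, *truth)
pred_same_sum = response(3.0, *same_sum)

print(sse_truth, sse_same_sum, sse_other_sum)
print(pred_truth, pred_same_sum)
```

Sampling this curve more densely changes nothing about the k1 versus k2 split. Only a different kind of experiment, one that perturbs a single process, would.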
But the opposite move, pretending identifiability does not matter, is worse. It turns biotech into an exercise in aesthetic prediction.
The correct position is sharper.
You do not need global identifiability of the full mechanism. You need decision identifiability.
You need the parts of the system that actually move a translational decision to be pinned down tightly enough that you can act with bounded risk. You accept that many mechanistic explanations remain equivalent, and you stop trying to resolve them if they do not change the decision. You spend your experimental budget collapsing uncertainty along the axes that flip go or no‑go, dose choice, target selection, construct design, or cohort definition.
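One way to operationalize this, sketched under stated assumptions: the Emax-style dose-response form, the three hypothetical (emax, ec50) parameter sets, and the thresholds are all illustrative, and "identified" here just means that every model still consistent with the data makes the same call. The helper name decision_identified is mine, not an established API.

```python
def decision_identified(plausible_models, predict, threshold):
    """Check whether every data-consistent model lands on the same side
    of a go/no-go threshold.

    Returns (identified, decisions), where decisions holds the
    per-model calls ("go" or "no-go").
    """
    decisions = ["go" if predict(m) >= threshold else "no-go"
                 for m in plausible_models]
    return len(set(decisions)) == 1, decisions

def effect_at(dose):
    """Build a predictor for a toy Emax model with params (emax, ec50)."""
    return lambda m: m[0] * dose / (m[1] + dose)

# Hypothetical parameter sets that all predict an effect of 0.5 at dose 1.0,
# so data collected at that dose cannot tell them apart.
plausible = [(1.0, 1.0), (2.0, 3.0), (1.5, 2.0)]

# Decision A: is the effect at dose 1.0 at least 0.4?
low_dose_identified, low_dose_calls = decision_identified(
    plausible, effect_at(1.0), threshold=0.4)

# Decision B: is the effect at dose 4.0 at least 0.9?
high_dose_identified, high_dose_calls = decision_identified(
    plausible, effect_at(4.0), threshold=0.9)

print(low_dose_identified, low_dose_calls)
print(high_dose_identified, high_dose_calls)
```

The parameters are equally unresolved in both cases. Decision A does not care, so no further experiment is needed for it. Decision B flips depending on which model is right, and that is where the experimental budget should go.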
This is a profoundly different relationship to “data.”
In the data moat framing, data is a commodity you accumulate.
In the constraint framing, information is something you manufacture by choosing interventions and measurements that change what is learnable.
That’s why the moat is not the dataset. The moat is the loop design.
Now look at the biology through that lens, without theater.
Cells are not free to do anything. They live under constraints that are not negotiable. Finite resources. Saturation and competition. Kinetic bottlenecks. Feedback architectures that impose stability or commitment. State-dependent responses. Multi-stability, where the same perturbation means different things depending on which basin the cell is already in. Context coupling, where what you call a “mechanism” is only valid within a regime defined by microenvironment, stress, lineage, immune state, or metabolic load.
These constraints are the structure that makes biology intelligible at all.
The industry’s failure is not that it ignores data. It worships data while refusing to do the harder thing, which is to encode constraints in a way that survives contact with real experiments and real deployment.
And here the critique that “constraint capture is hard” is correct. It is hard in physics. It is harder in biology.
Even in domains where we know the governing equations, constraint-enforcing neural methods exhibit serious practical issues, such as unstable training, pathological convergence, and solutions that satisfy a residual objective while being wrong in the quantity you actually care about. Reviews and method papers in physics-informed learning explicitly discuss these failure modes and the ways optimization can collapse to trivial solutions under standard training. [8]
So if you write “enforcement means your model class is restricted” as if it’s a simple architectural choice, a technically literate reader is right to push back. This is not a solved engineering task. It’s a program.
The reason to pursue it anyway is that the alternative is worse. The alternative is to let the model learn whatever shortcuts the dataset offers, then call it biology.
This is where the Koopman perspective, used correctly, becomes more than a fashionable math term.
Koopman is seductive because it offers a way to represent nonlinear dynamics through linear evolution in a space of observables, and linear evolution invites the full machinery of mature control theory. But the seduction comes with a trap: the hard problem moves to the choice of observables.
In physics, we often know what the right observables are. In biology, we usually do not.
Deep learning approaches to Koopman explicitly acknowledge the difficulty: identifying and representing Koopman eigenfunctions is mathematically and computationally challenging, and continuous spectra create unique problems for compact representation. [9] Work on the limits of Koopman learning goes further. There are fundamental barriers, independent of how much data you have, for robustly learning spectral properties in certain settings. In other words, “just collect more trajectories” is not always a way out. [10]
So you do not sell Koopman as “the solution.” That is naive.
You use it as an audit layer.
You force your learned representation to carry a specific kind of dynamical accountability, such that, under the interventions you can actually apply, the representation must evolve in a stable, predictable way inside its claimed regime. When it cannot, the failure should become visible as a residual you can localize, not something that vanishes into weights.
That turns failure into a design signal. It makes ignorance explicit.
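Here is the audit idea reduced to a toy, with everything in it an assumption: a one-dimensional logistic growth map stands in for a cell population hitting finite resources, a single observable (the state itself) stands in for a learned dictionary, and a scalar least-squares fit stands in for a real Koopman or EDMD-style estimate. The linear model is fitted only in the low-density regime it claims, then audited state by state.

```python
def logistic_step(x, r=0.5, capacity=1.0):
    """One step of toy saturating growth (an assumed stand-in for a cell
    population running into finite resources)."""
    return x + r * x * (1 - x / capacity)

def fit_linear_multiplier(pairs):
    """Least-squares k for x_next ~ k * x: a one-observable linear model."""
    num = sum(x * x_next for x, x_next in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

def audit(k, states, tol=0.05):
    """Relative one-step residual of the linear model at each state.

    In a real audit the reference value would be a measured next state;
    here the toy simulator plays that role.
    """
    report = []
    for x in states:
        truth = logistic_step(x)
        residual = abs(truth - k * x) / abs(truth)
        report.append((x, residual, residual <= tol))
    return report

# Fit only on a low-density trajectory, the regime the model claims.
x, train = 0.001, []
for _ in range(8):
    x_next = logistic_step(x)
    train.append((x, x_next))
    x = x_next
k = fit_linear_multiplier(train)

report = audit(k, [0.01, 0.1, 0.5, 0.8])
for state, residual, ok in report:
    print(f"x={state:.2f} residual={residual:.3f} inside_envelope={ok}")
```

The residual stays small at low density and grows as the state approaches saturation. The linear model is not rescued by this. It is told, in state-space coordinates, where it stops being trustworthy.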
This is the deeper math-to-biology translation pathway, stated plainly.
Constraints do not grant truth. They grant falsifiability of the right kind.
They turn “we don’t know” into a measurable discrepancy.
They turn model error into experimental design.
Now connect this to translation, because that is where arguments either cash out or collapse.
Translation is a sequence of commitments made under uncertainty, with consequences. You are not trying to “understand biology” in the abstract. You are trying to make a controlled change in a living system, inside an operating envelope you can defend.
A constraint-first program changes translation in three concrete ways.
First, it shifts the target of learning from retrospective prediction to stability under intervention and shift. Clinical ML has already shown what happens when we ignore this. Widely deployed proprietary models can look compelling on developer reports, then perform poorly under external validation, raising exactly the question that matters here: what did the model learn, and what did it miss? [11]
Second, it changes what “enough evidence” means. Instead of chasing global identifiability, you design for decision identifiability. You do not collect more of the same data to soothe uncertainty. You design the smallest set of new interventions and measurements that collapses the ambiguity relevant to the decision boundary.
Third, it makes uncertainty operational. Not as a cosmetic confidence score, but as a pointer to missing state, regime switching, measurement drift, or unexcited degrees of freedom.
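The second point, designing the smallest intervention that collapses the relevant ambiguity, can be sketched as a model-discrimination step. This is a crude heuristic under stated assumptions, not formal optimal experimental design: the Emax response form, the two rival (emax, ec50) parameter sets, the candidate doses, and the feasibility cap are all hypothetical.

```python
def emax_response(dose, emax, ec50):
    """Toy saturating dose-response curve."""
    return emax * dose / (ec50 + dose)

def most_discriminating_dose(models, candidate_doses, max_feasible_dose):
    """Pick the feasible dose where rival models disagree the most.

    Returns (best_dose, spreads), where spreads maps each feasible dose
    to the gap between the largest and smallest predicted response.
    """
    feasible = [d for d in candidate_doses if d <= max_feasible_dose]

    def spread(dose):
        preds = [emax_response(dose, *m) for m in models]
        return max(preds) - min(preds)

    spreads = {d: spread(d) for d in feasible}
    return max(feasible, key=spread), spreads

# Two rival parameter sets that agree exactly at dose 1.0.
rivals = [(1.0, 1.0), (2.0, 3.0)]
doses = [0.5, 1.0, 2.0, 4.0, 10.0]

best_dose, spreads = most_discriminating_dose(
    rivals, doses, max_feasible_dose=5.0)

for dose, gap in spreads.items():
    print(f"dose={dose}: model disagreement={gap:.3f}")
print(f"run the next experiment at dose {best_dose}")
```

Repeating the measurement at dose 1.0 a thousand times adds nothing here, because the rivals agree there. One measurement at the most discriminating feasible dose does more, and the feasibility cap is part of the design rather than an afterthought.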
Now the markets, because “data moat” is also an economic claim.
If data alone compounded into inevitability, we would expect data-rich companies to be structurally hard to kill. Reality does not cooperate. 23andMe built one of the most recognizable consumer genomics datasets and still filed for Chapter 11 in 2025 after weak demand and reputational damage from a breach, triggering public warnings from state attorneys general about genetic data and deletion rights. Whatever you call that asset, it was not an inevitability engine. It was a complex product, governance, and trust problem with a dataset attached. [12]
So a viable constraint-first company cannot behave like a purity cult. It needs a wedge that creates value before the grand theory is complete. That wedge is rarely “a foundation model trained on everything.” The wedge is usually closer to instrumentation, repeatability, decision reliability, and explicit validity envelopes.
Those are not sexy graphs. They are defensible systems.
Once you control the measurement channel, you can enforce constraints. Once you enforce constraints, you can manufacture information rather than accumulate noise. Once you can do that, you can make stronger interventional claims with fewer experiments. That is the compounding story people want when they say “data moat.”
It just isn’t a data story.
It’s a constraint story.
Future thoughts
Constraint capture needs to be framed as a research and engineering program, not a finished product. The near-term winners will be explicit about what they can enforce today, what remains soft, and how they detect when the model is cheating.
I expect the practical path to be piecewise and state-aware, not monolithic. Mixtures of local dynamical models rather than one universal operator. Regime detectors that admit when the cell moved into a different basin. Observable dictionaries seeded with biology that is hard to violate, then expanded cautiously with learned components that are audited under perturbation.
And the cultural shift is the real prize. Teams stop asking, “how do we get more data?” and start asking, “what is the smallest intervention that collapses the uncertainty blocking this decision?”
Notes
[1] Observational data typically identifies an equivalence class of causal structures (Markov equivalence), not a unique causal graph; interventions refine identifiability. (Journal of Machine Learning Research)
[2] Multi-site pneumonia chest X-ray work showing internal performance exceeding external performance and that CNNs can identify hospital system/department, indicating confounding site signatures. (PLOS)
[3] COVID-19 radiography work showing shortcut learning on confounders rather than pathology and failure in new hospitals. (Nature)
[4] Cross-modality evidence of shortcut learning and performance overestimation due to hidden data acquisition biases across 13 datasets. (Nature)
[5] Batch effects in high-throughput data are pervasive and can lead to incorrect biological conclusions, even with normalization. (PMC)
[6] Structural and practical non-identifiability in partially observed dynamical models and methods to diagnose it. (OUP Academic)
[7] “Sloppiness” in systems biology models: many parameters remain poorly constrained even with large ideal data, while some predictions may still be well constrained. (PLOS)
[8] Practical challenges and failure modes when enforcing hard constraints in neural approaches (examples from physics-informed training: loss balancing, spectral bias, instability, convergence to trivial solutions).
[9] Koopman eigenfunction identification is challenging; continuous spectra create unique representation issues even in deep Koopman approaches. (Nature)
[10] Fundamental limits on Koopman learning: barriers to robustly learning spectral properties in some settings, not cured by more data. (arXiv)
[11] External validation of a widely implemented proprietary sepsis prediction model showing poor discrimination/calibration despite broad adoption. (JAMA Network)
[12] 23andMe’s 2025 bankruptcy and public regulatory/AG responses around genetic data, illustrating that large data assets do not automatically produce durable economics. (Reuters)


