Your Data Never Dies

The purpose fades. Your data doesn't.

Michael Eggleton

March 2026

In 2016, hundreds of millions of people downloaded Pokémon Go and started walking around their neighbourhoods pointing their phones at buildings, parks, and street corners, trying to catch 'em all. They were also building one of the most comprehensive spatial datasets on the planet.

Catching more than we thought

A decade later, that data (30 billion images tagged with precise location metadata) is being used to train a "Large Geospatial Model" that powers autonomous delivery robots navigating city streets[1]. Niantic, the company behind the game, spun out a division called Niantic Spatial to commercialise it. Their robots are already operating in Los Angeles, Chicago, Miami, and Helsinki.

The players who used their phones' cameras and location data weren't consenting to train navigation systems for delivery robots - that use case didn't exist yet. The consent was for a game. The value extraction is for an entirely different industry, a decade later.

Hungry for more

The internet has been scraped. The publicly available text, images, and code that powered the first wave of large language models is largely exhausted as a novel training resource. The frontier of AI development is now proprietary, unseen, real-world data - the kind that doesn't exist on the open web. The kind that belongs to individuals, scattered across their digital and real-world lives.

This creates enormous pressure on any organisation sitting on a unique dataset. Health records, financial transactions, geospatial imagery, sensor data, clinical notes, behavioural patterns - these are the training sets that AI companies need. And the incentive to monetise them, or to acquire the companies that hold them, is increasing every quarter.

Health and financial data sit at the sharpest end of this. They're simultaneously the most sensitive (deeply personal, heavily regulated) and the most commercially valuable (proprietary, high signal, hard to replicate). Clinical datasets in particular - patient histories, diagnostic patterns, treatment outcomes - represent exactly the kind of novel, high-quality training data that the market is hungry for. The same is true for financial transaction data, insurance claims, and purchase records. The value locked inside these datasets is capturable by anyone who can access them - which makes the companies holding them acquisition targets, not just service providers.

A quiet exit

The hunger is the macro incentive. The quiet exit is the micro exposure: the everyday moments where data leaves your control and enters someone else's system.

Every document uploaded to a cloud platform, every conversation with an AI assistant, every file synced to a third-party service - these interactions create data that exists somewhere outside your direct control. The terms of service may say it won't be used for training, but the architecture still creates exposure. The boundary between "operating the service" and "learning from your usage" is not as clean as most people assume.

This extends beyond the digital. At a previous employer, I was issued a corporate expense card through Float. The KYC (Know Your Customer) verification was handled by an external provider who required biometric data. There was no alternative verification method, and no opt-out. Biometric data was collected by a third-party provider I had no relationship with, simply to reduce a little friction for me - and rather more for the finance team.

KYC is a legitimate compliance requirement - but the biometric component was a step beyond what was necessary, and the lack of any alternative meant consent was effectively compulsory. My biometric data now sits in that provider's database. I have no control over their retention policy, their security posture, or what happens to it if they're acquired.

This is the pattern: individually reasonable requests that, in aggregate, create a sprawling footprint of personal data across dozens of third-party systems - most of which you didn't choose and can't monitor.

Organisations holding this data may have every intention of protecting it - but intentions don't survive market pressure forever.

End of life

Even if you trust the company holding your data today, companies don't last forever. They get acquired. They go bankrupt. They pivot. They get desperate. But the data almost always outlives the company that collected it.

In almost every privacy policy, user data - including personally identifiable information (PII) - is explicitly classified as a business asset. That's not buried in fine print; it's the standard legal framework. When a company changes hands, the data changes hands with it. In a bankruptcy, it can be liquidated like furniture, and sometimes it's one of the few assets worth anything at all.

This is the time dimension that most people don't think about. Consent degrades over time, not because the original agreement was violated, but because the conditions that made it reasonable no longer apply. The company you trusted in 2016 may not be the entity holding your data in 2030. The commercial incentive to monetise that data may have increased by orders of magnitude.

This isn't hypothetical. In 2022, Oracle paid $28.3 billion for Cerner, one of the largest electronic health record systems in the world, holding data on millions of patients. The stated rationale was healthcare transformation: better interoperability, a national health records database, AI-powered clinical tools. Cerner became Oracle Health.

Three years later, Oracle needs to fund a $300 billion, five-year datacenter contract with OpenAI. In early 2026, investment bank TD Cowen reported that Oracle is evaluating the sale of Cerner as one of several options to raise the capital[2]. The health data asset acquired to transform care delivery may be divested to finance GPU procurement.

The patients whose records sit inside Oracle Health didn't consent to become a line item in an AI infrastructure financing strategy, but that's where the chain of custody leads. Clinical data entered the system for care. It became a corporate asset inside Cerner. Cerner became a corporate asset inside Oracle. And now that asset may change hands again, not because of anything to do with healthcare, but because of the economics of training frontier AI models.

Each step in that chain was individually defensible. The aggregate is a sequence of custody transfers that no patient could have anticipated, governed by incentives that have nothing to do with the original purpose of collection.

One-way doors

There's a version of this problem that's worse than persistence: irreversibility.

Data that sits in a database can, in theory, be deleted. Retention policies can expire. A regulator can compel erasure. But there are thresholds data can cross after which deletion stops being meaningful: these are one-way doors.

The clearest is model training. When data is used to train a machine learning model, it doesn't sit inside the model as a retrievable record. It reshapes the model's parameters, influencing weights and statistical patterns across billions of variables. The source data can be deleted. The effect of that data on the model cannot. Exact deletion from a trained model is generally considered infeasible. What exists today is closer to mitigation than erasure[3].
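
To make that first threshold concrete, here's a minimal sketch in Python - a toy example under stated assumptions, not any real system. A tiny linear model is trained by gradient descent; deleting a record from the source dataset leaves the learned weights untouched, and the only way to obtain the model that never saw the record is to retrain from scratch. That retraining is trivial for three parameters and infeasible for billions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                       # 100 records, 3 features
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    def fit(X, y, steps=2000, lr=0.05):
        """Plain gradient descent on mean squared error."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            w -= lr * X.T @ (X @ w - y) / len(y)        # MSE gradient step
        return w

    w_full = fit(X, y)          # trained on every record, including record 0

    X_kept, y_kept = X[1:], y[1:]                       # "delete" record 0

    # The stored record is gone, but w_full is unchanged: its influence is
    # baked into the weights. Recovering the never-trained-on-it model
    # requires retraining on the remaining data:
    w_retrained = fit(X_kept, y_kept)
    print(np.allclose(w_full, w_retrained))             # False: the models differ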

The second is derived output. If your data informed a prediction, a generated report, or a synthetic dataset that was shared downstream, there is no way to trace and retract every derivative - the lineage is lost.

The third is custody transfer. Once the entity holding your data changes, the conditions under which you originally shared it are functionally void. You can't un-share something with a company that didn't exist when you made the decision.

The question of whether your data persists is almost secondary. The sharper question is whether it has already crossed a threshold from which there is no return.

What else?

Of course, this isn't a call to stop using technology, or to treat every service as adversarial. Many companies collecting data are acting in good faith, within the current rules.

The problem is structural. The consent model was designed for a world where data had a single, understood purpose at the time of collection. We now live in a world where the most valuable use of data often hasn't been invented yet when it's first collected. The question isn't "do I trust this company?", it's "do I trust every company that might ever hold this data, under every market condition that might ever exist, knowing that some of those paths are irreversible?"

That's a hard question when you're just trying to get something done today. Why should access be permanent when the need isn't? What if the default was single-use consent, not as a policy preference, but as an architectural constraint? Data cryptographically bound to its stated use, with access that degrades or revokes when the conditions change.
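
As a rough illustration of what "cryptographically bound to its stated use" could look like, here's a sketch in Python using the cryptography library's AES-GCM primitive (the seal/unseal protocol is hypothetical; AESGCM and its encrypt/decrypt calls are real). The stated purpose and expiry are folded into the authenticated associated data, so the ciphertext simply won't decrypt for any other declared use.

    import os, time
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def seal(key, data, purpose, expires_at):
        """Encrypt data bound to a stated purpose and a consent window."""
        nonce = os.urandom(12)
        aad = f"{purpose}|{expires_at}".encode()        # authenticated, not secret
        return nonce, AESGCM(key).encrypt(nonce, data, aad)

    def unseal(key, nonce, ciphertext, purpose, expires_at):
        """Decrypt only if the caller states the original purpose in time."""
        if time.time() > expires_at:
            raise PermissionError("consent window has expired")
        aad = f"{purpose}|{expires_at}".encode()
        return AESGCM(key).decrypt(nonce, ciphertext, aad)  # InvalidTag on mismatch

    key = AESGCM.generate_key(bit_length=128)
    window = int(time.time()) + 3600                    # one-hour consent window
    nonce, ct = seal(key, b"applicant biometrics", "kyc-verification", window)

    unseal(key, nonce, ct, "kyc-verification", window)  # succeeds
    try:
        unseal(key, nonce, ct, "model-training", window)
    except Exception:
        print("rejected: data is bound to a different purpose")

The expiry check is the honest weak point: whoever holds the key can ignore it. "Degrades or revokes" in practice means the key lives with a custodian who deletes it when the window closes - an architectural constraint rather than a policy promise.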

The defaults we build today will determine what's still recoverable tomorrow. For our future selves, a different default is worth building around.

[1] https://www.technologyreview.com/2026/03/10/1134099/how-pokemon-go-is-helping-robots-deliver-pizza-on-time/
[2] https://www.theregister.com/2026/01/29/oracle_td_cowen_note/
[3] https://www.techpolicy.press/the-right-to-be-forgotten-is-dead-data-lives-forever-in-ai/

