1. The Short Version

Over roughly six weeks in spring 2026, a senior IBM engineer named Chris Hay published a series of YouTube videos and open-source code making one escalating argument: the knowledge inside an AI language model is not a mysterious fog spread across billions of numbers. It behaves like a database. You can query it. You can trace how it produces an answer. You can write new facts into it by hand, without any retraining. And — the most consequential step — the knowledge does not even need to live inside the model. It can sit on a separate, cheap machine, or potentially outside the model entirely, while the model reads it as if it were its own memory.

Each claim ships with working code anyone can run. The work is built on peer-reviewed research from the last five years; what is new is that nobody had assembled the pieces end-to-end into functioning engineering before. The community response has been fast: his main repository went from 1 star to roughly 800 in five weeks, with outside contributors already landing code.

If the claims hold at scale — a real if, addressed honestly in Section 8 — the consequences touch how AI models are built, priced, deployed, governed, and audited. This report explains the work in plain terms, separates what is demonstrated from what is plausible from what is speculative, and maps where the road leads.

2. A Two-Minute Primer

Four concepts make everything else in this report readable. No math required.

Layers have two working parts

A language model is a stack of layers — you likely know that much. What matters here is that each layer contains two distinct components doing different jobs. The first is attention: the machinery that figures out which words in the input relate to which other words. Think of it as the model's reading comprehension. The second is the feed-forward network (FFN): the machinery where the model's stored knowledge lives — facts, associations, vocabulary, patterns. Think of it as the model's memory. The memory portion is most of the model's physical size; the comprehension portion is comparatively small but does intensive work.
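To make "most of the model's physical size" concrete, here is a rough per-layer parameter count. It is a sketch under stated assumptions, not a description of any particular model: it assumes the classic transformer layout (four attention projection matrices of size d_model by d_model, and a two-matrix FFN whose hidden width is 4 times d_model) and ignores biases, normalization, and embeddings. The function name and the width of 4096 are made up for illustration.

```python
def layer_param_split(d_model: int, ffn_multiplier: int = 4) -> dict:
    """Rough per-layer parameter count for a classic transformer block.

    Assumptions (not from the report): no biases, no gated FFN,
    full multi-head attention with Q, K, V and output projections.
    """
    attention = 4 * d_model * d_model                # Q, K, V, output projections
    ffn = 2 * d_model * (ffn_multiplier * d_model)   # up- and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }


split = layer_param_split(d_model=4096)  # 4096 is an illustrative width
print(f"FFN share of layer parameters: {split['ffn_share']:.0%}")
```

Under these assumptions the FFN holds two thirds of each layer's parameters. Real architectures differ (gated FFNs, grouped-query attention), so the exact ratio varies from model to model.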

The residual stream: a shared scratchpad

As text flows through the layers, the model maintains a running scratchpad called the residual stream. Each layer reads the scratchpad, does its work (comprehension, then memory lookup), and writes its results back. By the final layer, the scratchpad contains everything needed to produce the next word. This scratchpad turns out to be the skeleton key for most of what follows.
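The read, work, write-back cycle described above can be sketched in a few lines. This is a toy under loud assumptions: the scratchpad is a single short list of numbers, the two sublayers are placeholder functions (`toy_attention` and `toy_ffn` are hypothetical names), and real details such as normalization and attention across multiple tokens are left out.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum: how a sublayer's result is written back."""
    return [x + y for x, y in zip(a, b)]


def run_layers(residual: Vector,
               layers: Sequence[Tuple[Sublayer, Sublayer]]) -> Vector:
    """Pass the scratchpad through each layer: comprehension, then memory."""
    for attention, ffn in layers:
        residual = add(residual, attention(residual))  # read, then write back
        residual = add(residual, ffn(residual))        # read, then write back
    return residual


def toy_attention(v: Vector) -> Vector:
    # Placeholder only; real attention mixes information across tokens.
    return [0.5 * x for x in v]


def toy_ffn(v: Vector) -> Vector:
    # Placeholder only; a real FFN is a learned key-value lookup.
    return [1.0 for _ in v]


final = run_layers([2.0, 4.0], [(toy_attention, toy_ffn)] * 2)
print(final)
```

The point of the sketch is the shape of the loop: no sublayer replaces the scratchpad, each one only adds to it.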

Weights: the frozen numbers

A model's weights are the billions of numbers set during training — the slow, expensive process where the model reads vast amounts of text. Once training ends, the weights freeze. Everything the model 'knows' is encoded in them. Until recently, they were treated as an unreadable blob.

The KV cache: working memory for a conversation

During a conversation, the model keeps short-term notes about everything said so far, called the KV cache. It grows with every token processed and, for long conversations, can become the single biggest memory cost when running a model, reaching many gigabytes. The entire AI-serving industry engineers around managing it.
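A back-of-the-envelope calculation shows why the cache gets large. The formula below is the standard one for a transformer that stores one key vector and one value vector per token, per attention head, per layer. The configuration plugged in (32 layers, 8 key-value heads, head size 128, 16-bit values, 128,000 tokens) is a hypothetical example, not a model discussed in this report.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one conversation, in bytes."""
    # Factor of 2: one key vector and one value vector per token,
    # per head, per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=128_000)
print(f"{size / 2**30:.1f} GiB")
```

For this made-up configuration the cache comes to about 15.6 GiB for a single conversation, and it scales linearly with the number of tokens.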

3. Who Is Chris Hay

Chris Hay is a Distinguished Engineer at IBM and the CTO of IBM iX, IBM's customer-transformation consulting division — the arm that guides Fortune 500 companies through technology modernization. Distinguished Engineer is among IBM's most senior technical titles. Before IBM, he worked on M-Pesa, the Kenyan mobile-payments platform that became one of the most successful financial-infrastructure deployments in the world. He maintains over 130 public code repositories and a YouTube channel where he works through ideas in public.

Two calibration notes, both honest. First, he is a self-described trend-spotter who has been early to things that did not pan out: his own profile records past enthusiasm for the metaverse and web3. Second, and cutting the other way: unlike trend commentary, this work ships reproducible code with every claim, stands on established peer-reviewed science, and is accumulating a real contributor community. He is not publishing papers; he is publishing proofs you can run on a laptop. I respect the work, and the honesty.

4. What He Has Demonstrated

The work arrived as four escalating demonstrations. Each one builds on the last, and the sequence matters more than any single piece.

| Step | The claim | The demonstration |
|---|---|---|
| 1. The database | The model's memory is a queryable graph of facts. | A tool (LARQL) that lets you ask a model what it knows, watch an answer form layer by layer, and insert new facts without retraining. |
| 2. The redundancy | The conversation working memory (KV cache) is unnecessary. | The scratchpad already contains everything the cache holds. Demo: the equivalent of a 370,000-word context stored in 2.8 megabytes (thousands of times smaller) on a MacBook. |
| 3. The separation | Memory and comprehension don't need to share a machine. | Ran a 26-billion-parameter model with comprehension on a laptop GPU and the memory served from cheap, GPU-less servers, over the public internet, at near-local speed. |
| 4. The mechanism | Facts are read by address, and the address is the relationship. | Built the memory system by hand, number by number. Hand-wrote a brand-new fact into a model with zero training; the model answered as if it had always known it. |

The detail worth pausing on

The fourth video contains the deepest finding. The model stores many facts packed on top of each other in the same space — researchers call this superposition — and the natural assumption was that reading a fact back requires some untangling step. Hay's hand-built reconstruction shows otherwise: the model never untangles anything. It reads each fact directly by its address, and the address is the relationship itself ('capital of', 'works at', 'treats'). Storage is messy; reading is clean. That single property is why everything else works — why the memory can be turned into a database, why it can live on another machine, and why a fact written by hand at the right address reads back natively.

His closing line points at the destination: 'the lookup doesn't have to live in the weights at all.' If facts are read by address, the fact store can — in principle — be moved entirely outside the model.
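The "messy storage, clean reading" property described above can be illustrated with a textbook linear associative memory. This is a toy sketch, not Hay's code: facts are stored as a sum of outer products in one small matrix, and the relation keys are assumed to be exactly orthonormal. In a real model the keys are only approximately separable, so reads carry some crosstalk. All names and numbers here are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical relation "addresses": orthonormal but dense, so every stored
# fact touches every column of the memory matrix (a toy form of superposition).
KEYS: Dict[str, Vector] = {
    "capital_of": [0.5, 0.5, 0.5, 0.5],
    "works_at": [0.5, -0.5, 0.5, -0.5],
    "treats": [0.5, 0.5, -0.5, -0.5],
}


def write(memory: Matrix, key: Vector, value: Vector) -> Matrix:
    """Superimpose one fact by adding the outer product of value and key."""
    return [
        [cell + v * k for cell, k in zip(row, key)]
        for row, v in zip(memory, value)
    ]


def read(memory: Matrix, key: Vector) -> Vector:
    """Read by address: one matrix-vector product, no untangling step."""
    return [sum(cell * k for cell, k in zip(row, key)) for row in memory]


memory: Matrix = [[0.0] * 4 for _ in range(3)]
memory = write(memory, KEYS["capital_of"], [1.0, 0.0, 2.0])
memory = write(memory, KEYS["works_at"], [0.0, 3.0, 1.0])
before = read(memory, KEYS["capital_of"])

# "Hand-write" a new fact at an unused address; no training loop involved.
memory = write(memory, KEYS["treats"], [4.0, 4.0, 0.0])
new_fact = read(memory, KEYS["treats"])
after = read(memory, KEYS["capital_of"])

print(before, new_fact, after)
```

Every cell of `memory` mixes contributions from all three facts, yet each read returns exactly the value written at that address, and the existing fact is unchanged by the new write. That exactness depends on the orthonormal-key assumption.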

5. The Science Underneath

None of the underlying science is fringe. The finding that a model's memory layers behave as key-value storage was published at a major conference in 2021 (Geva et al.) and is broadly accepted. Editing individual facts inside model weights is an established research line (ROME, 2022; MEMIT, 2023). Superposition — facts packed in shared space — comes from Anthropic's interpretability team (Elhage et al.). Relationships acting as clean linear addresses comes from work on linear relation decoding (Hernandez et al.). Even the boundary Hay draws between what a model looks up and what it must genuinely compute traces to one of the founding results of the field (Minsky & Papert, 1969). Hay's contribution, in his own words: 'I haven't seen these put together end to end on one model — that's what this is.'

Just as telling, major labs are independently converging on adjacent pieces. Apple published a technique for racing ahead of the model's own scratchpad to pre-load memory (M2R2, 2025). Google shipped a fast-drafting system in Gemma 4 that pairs a small predictive model with a large one (May 2026). And Anthropic released Natural Language Autoencoders (May 2026) — a method that translates a model's internal scratchpad states into readable English sentences, which Anthropic has already used to detect models behaving differently when they suspect they are being tested. Different groups, different motives, same underlying separation of comprehension, memory, and scratchpad.

6. What We Can Reasonably Posit

These are implications that follow with medium-to-high confidence if the demonstrations hold at larger scale. None require new science — only engineering maturation.

Models split into a reasoning core and an external memory

The natural endpoint of demonstrations 1 through 4 is a model in two pieces: a comparatively small reasoning core (comprehension plus genuine computation) and a large fact store that can live on disk, on a server, or as a managed database. This mirrors a very old idea — early computers fused processor and memory until someone separated them, and the separation defined the architecture of computing. The same separation is now visible for AI models.
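One way to picture the two-piece design is as an interface boundary: the core asks for a memory lookup and does not care where the answer comes from. The sketch below is an assumption-heavy illustration, not the architecture of any released system. `MemoryBackend`, `LocalMemory`, and `RemoteMemoryStub` are hypothetical names, the lookup is a placeholder computation, attention is omitted, and the "remote" class only counts calls instead of using a network.

```python
from typing import List, Protocol

Vector = List[float]


class MemoryBackend(Protocol):
    def lookup(self, layer: int, residual: Vector) -> Vector: ...


class LocalMemory:
    """Memory held in the same process as the core."""

    def lookup(self, layer: int, residual: Vector) -> Vector:
        return [0.5 * x for x in residual]  # placeholder for a real FFN


class RemoteMemoryStub:
    """Stands in for a call to a separate memory server (no real network)."""

    def __init__(self, backend: MemoryBackend) -> None:
        self._backend = backend
        self.round_trips = 0

    def lookup(self, layer: int, residual: Vector) -> Vector:
        self.round_trips += 1  # each layer costs one request
        return self._backend.lookup(layer, residual)


def run_core(residual: Vector, num_layers: int,
             memory: MemoryBackend) -> Vector:
    """The core loop is identical whichever backend serves the memory."""
    for layer in range(num_layers):
        update = memory.lookup(layer, residual)
        residual = [r + u for r, u in zip(residual, update)]
    return residual


local_result = run_core([2.0, 4.0], 3, LocalMemory())
remote = RemoteMemoryStub(LocalMemory())
remote_result = run_core([2.0, 4.0], 3, remote)
print(local_result == remote_result, remote.round_trips)
```

The sketch also makes the cost visible: in this layout a remote memory means one request per layer per step, so latency between the two pieces matters.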

The hardware economics invert

Today, running large models requires expensive specialized GPU memory, because the entire model must sit in it. Demonstration 3 shows the memory portion — most of the model — running on ordinary CPU RAM, which costs a small fraction as much. A company could serve a very large model from a single RAM-heavy server, with each user's laptop handling only the small comprehension portion. Frontier-scale capability on commodity hardware stops being a contradiction.

Knowledge updates become writes, not retraining

Today, updating what a model knows means an expensive training run, and every model carries a 'knowledge cutoff' — the date its information ends. If facts can be written by address, knowledge updates become database writes: instant, cheap, reversible, versioned. The knowledge-cutoff concept erodes. A model's knowledge becomes as current as its last write.
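If updates really do become writes, the store would need the ordinary properties of a versioned database. The following is a minimal sketch of that idea only. It is a plain in-memory dictionary, not a mechanism for editing model weights, and the class name, address format, and example fact are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Address = Tuple[str, str]  # (subject, relation), e.g. ("ExampleCorp", "ceo")


class VersionedFactStore:
    def __init__(self) -> None:
        self._history: Dict[Address, List[str]] = {}

    def write(self, address: Address, value: str) -> int:
        """Append a new version and return its 1-based version number."""
        versions = self._history.setdefault(address, [])
        versions.append(value)
        return len(versions)

    def read(self, address: Address,
             version: Optional[int] = None) -> Optional[str]:
        """Return the latest value, or a specific past version."""
        versions = self._history.get(address, [])
        if not versions:
            return None
        if version is None:
            return versions[-1]
        if 1 <= version <= len(versions):
            return versions[version - 1]
        return None

    def revert(self, address: Address) -> None:
        """Undo the latest write by re-appending the previous value."""
        versions = self._history.get(address, [])
        if len(versions) < 2:
            raise ValueError("nothing to revert to")
        versions.append(versions[-2])


store = VersionedFactStore()
store.write(("ExampleCorp", "ceo"), "Alice")  # made-up fact
store.write(("ExampleCorp", "ceo"), "Bob")    # the update is a single write
current = store.read(("ExampleCorp", "ceo"))
store.revert(("ExampleCorp", "ceo"))
reverted = store.read(("ExampleCorp", "ceo"))
print(current, reverted)
```

Reverting appends a new version instead of deleting one, so the full history of what the store has claimed remains available.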

A model's confidence becomes inspectable

Watching an answer form layer by layer (demonstration 1) reveals a signature: answers the model genuinely knows resolve cleanly and decisively; answers it is guessing at look contested and scattered. That is a real-time quality signal — a way to flag probable fabrication before the answer reaches a user. Pair it with Anthropic's English-translation technique and you can not only flag the uncertain moment but read a plain-language description of what the model was doing in it.
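One simple way to turn "clean versus contested" into a number is the entropy of the candidate-answer distribution at each layer. The report does not say which metric the tooling uses, so entropy here is an assumption, as are the threshold and the two hand-made traces. Each trace lists, layer by layer, the probabilities assigned to three candidate answers.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; low means one candidate dominates."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def looks_resolved(layer_probs: Sequence[Sequence[float]],
                   threshold_bits: float = 0.5) -> bool:
    """Flag a trace as resolved if the final layer is decisive.

    The 0.5-bit threshold is an arbitrary illustrative choice.
    """
    return entropy_bits(layer_probs[-1]) < threshold_bits


# Synthetic traces, invented for illustration only.
known = [[0.4, 0.3, 0.3], [0.7, 0.2, 0.1], [0.98, 0.01, 0.01]]
guess = [[0.4, 0.3, 0.3], [0.35, 0.35, 0.3], [0.4, 0.35, 0.25]]

for name, trace in (("known", known), ("guess", guess)):
    per_layer = [round(entropy_bits(p), 2) for p in trace]
    print(name, per_layer, "resolved" if looks_resolved(trace) else "contested")
```

In the first trace the entropy falls sharply by the last layer; in the second it stays high throughout, which is the pattern the text describes as a guess.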

The honest boundary: lookup versus computation

This is the limit that keeps the picture honest. Facts — things the model looks up — are addressable, movable, writable. Skills — reasoning, multi-step problem solving, genuine computation — are not. They live in the circuitry, not the address book. Writing 'be careful with money' into a model's memory makes that sentence retrievable; it does not make the model careful. Knowledge externalizes; capability still requires training. Every serious extrapolation in the next section respects that line.

7. Where This Could Lead

This section is explicitly speculative — coherent extrapolation, not demonstration. Confidence decreases as you read down.

Knowledge becomes a market

Today, an AI lab's competitive moat is a single bundle: reasoning ability and knowledge, fused in one set of weights, priced together. Separation splits the moat. Reasoning cores would compete on capability. Fact stores would compete as content businesses — a medical knowledge store, a legal store, a live financial store — each maintained by domain specialists, versioned and licensed like any data product, readable natively by any compatible core. Knowledge stops being baked into a handful of labs' models and becomes an industry of curated stores.

Hallucination gets an honest 'not found'

A meaningful share of AI fabrication happens because the model has no way to represent 'there is no fact at this address' — it retrieves the nearest plausible thing instead. An external store can fail loudly, the way a database does. The read either returns a fact with provenance or returns nothing, and 'I don't know' becomes a real, mechanically grounded state rather than a behavior we hope training instilled.
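The difference between "nearest plausible thing" and "nothing at this address" is easy to show with an exact-match store. This is a sketch of the behavior being described, not of any existing system. The store contents, the provenance label, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[str, str]  # (subject, relation)


@dataclass(frozen=True)
class Fact:
    value: str
    source: str  # provenance travels with the fact


# Hypothetical store contents and provenance label.
STORE: Dict[Address, Fact] = {
    ("France", "capital_of"): Fact("Paris", "hypothetical-gazetteer-v1"),
}


def lookup(address: Address) -> Optional[Fact]:
    """Exact-match read: a miss returns None, never a nearby entry."""
    return STORE.get(address)


def answer(subject: str, relation: str) -> str:
    fact = lookup((subject, relation))
    if fact is None:
        return "I don't know"
    return f"{fact.value} (source: {fact.source})"


print(answer("France", "capital_of"))
print(answer("Atlantis", "capital_of"))
```

The miss is a distinct return value that calling code can act on, which is what makes "I don't know" a state of the system and not just a phrase the model sometimes emits.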

Instructions become auditable objects

Today, an AI's instructions live either baked invisibly into its training or pasted into its prompt — and in both cases, compliance can only be judged by inspecting outputs. If instructions are written at known addresses, reads of those addresses become observable events. For the first time you could distinguish three failures that currently look identical: the model never consulted the rule, the model consulted it and deviated anyway, or the model consulted it and misunderstood it. Each demands a different fix; today, every fix is a guess.
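As a sketch of what "reads become observable events" could mean in practice, the store below logs every read against a run identifier. It is speculative in the same way this section is: the class, the address format, and the rule text are hypothetical. It can separate "never consulted" from "consulted", but telling deviation apart from misunderstanding would need an additional signal about the model's internal state, which this sketch does not model.

```python
from typing import Dict, List, Optional, Tuple


class AuditedStore:
    def __init__(self, rules: Dict[str, str]) -> None:
        self._rules = dict(rules)
        self.read_log: List[Tuple[str, str]] = []  # (run_id, address)

    def read(self, run_id: str, address: str) -> Optional[str]:
        """Return the rule text and record that this run read it."""
        self.read_log.append((run_id, address))
        return self._rules.get(address)

    def was_consulted(self, run_id: str, address: str) -> bool:
        return (run_id, address) in self.read_log


def diagnose(store: AuditedStore, run_id: str, address: str,
             output_complied: bool) -> str:
    """Coarse triage of one run against one rule."""
    if output_complied:
        return "complied"
    if not store.was_consulted(run_id, address):
        return "rule never consulted"
    return "rule consulted, output still non-compliant"


# Hypothetical rule and runs.
store = AuditedStore({"policy/refunds": "Refunds over 500 need approval."})
store.read("run-1", "policy/refunds")

print(diagnose(store, "run-1", "policy/refunds", output_complied=False))
print(diagnose(store, "run-2", "policy/refunds", output_complied=False))
```

Two runs with the same non-compliant output now get different diagnoses, because only one of them left a read event behind.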

Intent itself becomes reviewable

Anthropic's translation technique already reads internal states into English — including, in their published examples, models privately suspecting they are being evaluated while saying nothing. Combine that with snapshots of the scratchpad at the moment an AI agent makes a decision, and you get a new kind of audit artifact: a timestamped, tamper-evident record of not just what an agent did but what its internal state looked like when it decided. For autonomous agents executing real transactions, the auditable question shifts from 'did the action follow the rules' to 'was the agent's intent at the moment of action consistent with its authorization.' No current audit framework can ask that question.

Knowledge acquires lifecycle, provenance — and accounting

Pull the threads together and a pattern appears. If knowledge lives in writable external stores, every write needs authorization (who was allowed to add this fact?), every read can be metered (which agent consumed which knowledge, billed to whom?), and every store needs provenance and version history. Unverified writes are worse than hallucination: they are corruption the model trusts as its own memory. Those requirements have a name: they are the requirements of a ledger, with append-only history, authorized mutations, conservation rules (what was written equals what was authorized; what was consumed equals what is billed), and audit trails. The infrastructure layer beneath a knowledge economy does not exist yet, and it looks much less like today's AI tooling than like the transactional accounting systems that underpin finance. That gap is where this research program intersects with the report author's own work on accounting infrastructure for machine-to-machine commerce.
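The ledger requirements listed above map onto a familiar construction: an append-only log in which each entry's hash covers the previous entry's hash. The sketch below is a minimal illustration of that construction using Python's standard `hashlib` and `json` modules. It is not a design for production infrastructure, and the class name, writer and reader names, and address format are hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Set


class KnowledgeLedger:
    def __init__(self, authorized_writers: Set[str]) -> None:
        self._authorized = set(authorized_writers)
        self.entries: List[Dict] = []

    @staticmethod
    def _digest(prev_hash: str, payload: Dict) -> str:
        body = json.dumps(payload, sort_keys=True)
        return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

    def _append(self, payload: Dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        self.entries.append({
            "payload": payload,
            "prev": prev_hash,
            "hash": self._digest(prev_hash, payload),
        })

    def record_write(self, writer: str, address: str, value: str) -> None:
        """Authorized mutation: unauthorized writers are rejected."""
        if writer not in self._authorized:
            raise PermissionError(f"{writer} may not write")
        self._append({"op": "write", "who": writer,
                      "address": address, "value": value})

    def record_read(self, reader: str, address: str) -> None:
        """Metered read: every consumption leaves an entry."""
        self._append({"op": "read", "who": reader, "address": address})

    def reads_billed_to(self, reader: str) -> int:
        return sum(1 for e in self.entries
                   if e["payload"]["op"] == "read"
                   and e["payload"]["who"] == reader)

    def verify(self) -> bool:
        """Recompute the chain; any edit to past entries breaks it."""
        prev_hash = ""
        for entry in self.entries:
            expected = self._digest(prev_hash, entry["payload"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = KnowledgeLedger(authorized_writers={"curator"})
ledger.record_write("curator", "France/capital_of", "Paris")
ledger.record_read("agent-7", "France/capital_of")
print(ledger.verify(), ledger.reads_billed_to("agent-7"))
```

Changing any recorded value after the fact makes `verify()` fail, which is the tamper-evidence property; the read count is the raw material for the "what was consumed equals what is billed" check.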

The endpoint: training becomes how models sleep

Neuroscience has long held that human memory uses two coupled systems — a fast, writable one for immediate learning and a slow, consolidated one holding deep structure, with sleep transferring between them. The architecture emerging here lands on the same design: a fast external store for continuous knowledge writes, a slow frozen core holding concepts and reasoning, and periodic consolidation runs that fold high-value knowledge into the core itself. In that world, training is no longer how models learn day to day. It is how they sleep.

8. The Honest Caveats

Everything above deserves a discount rate. The specifics:

9. Bottom Line

A credible senior engineer has spent six weeks publicly demonstrating, with runnable code, that an AI model's knowledge behaves like a database: queryable, traceable, writable, and movable — possibly all the way out of the model itself. The science underneath is established; the assembly is new; the scale is unproven. If it holds, the consequences run from running frontier-scale models on cheap hardware, to ending the knowledge cutoff, to making AI intent auditable, to the emergence of knowledge as a governed, versioned, accounted-for asset class. Even the conservative outcome — adoption only where governance matters most — would reshape how regulated industries deploy AI. The right posture is the one this report takes: watch closely, verify independently, and notice that the infrastructure this future requires has not been built yet.

Appendix: Sources and Status

The work

Key research it stands on

Claim status at a glance

| Claim | Status |
|---|---|
| Model memory is queryable and traceable | Demonstrated, reproducible, on solid published science |
| Conversation working memory is redundant | Demonstrated at laptop scale; falsifiable code available |
| Memory can be served from cheap remote machines | Demonstrated at 26B scale, including over public internet |
| Facts can be hand-written and read natively | Demonstrated for direct recall; multi-step use unproven |
| Knowledge can live fully outside the model | Stated direction; not yet demonstrated |
| All of it holds at frontier scale | Unknown; the central open question |