1. The Short Version
Over roughly six weeks in spring 2026, a senior IBM engineer named Chris Hay published a series of YouTube videos and open-source code making one escalating argument: the knowledge inside an AI language model is not a mysterious fog spread across billions of numbers. It behaves like a database. You can query it. You can trace how it produces an answer. You can write new facts into it by hand, without any retraining. And — the most consequential step — the knowledge does not even need to live inside the model. It can sit on a separate, cheap machine, or potentially outside the model entirely, while the model reads it as if it were its own memory.
Each claim ships with working code anyone can run. The work builds on peer-reviewed research, most of it from the last five years; what is new, by Hay's own account, is the end-to-end assembly of those pieces into functioning engineering. The community response has been fast: his main repository went from 1 star to roughly 800 in five weeks, with outside contributors already landing code.
If the claims hold at scale — a real if, addressed honestly in Section 8 — the consequences touch how AI models are built, priced, deployed, governed, and audited. This report explains the work in plain terms, separates what is demonstrated from what is plausible from what is speculative, and maps where the road leads.
2. A Two-Minute Primer
Four concepts make everything else in this report readable. No math required.
Layers have two working parts
A language model is a stack of layers — you likely know that much. What matters here is that each layer contains two distinct components doing different jobs. The first is attention: the machinery that figures out which words in the input relate to which other words. Think of it as the model's reading comprehension. The second is the feed-forward network (FFN): the machinery where the model's stored knowledge lives — facts, associations, vocabulary, patterns. Think of it as the model's memory. The memory portion is most of the model's physical size; the comprehension portion is comparatively small but does intensive work.
The residual stream: a shared scratchpad
As text flows through the layers, the model maintains a running scratchpad called the residual stream. Each layer reads the scratchpad, does its work (comprehension, then memory lookup), and writes its results back. By the final layer, the scratchpad contains everything needed to produce the next word. This scratchpad turns out to be the skeleton key for most of what follows.
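The read, work, write-back loop can be sketched in a few lines. This is a minimal toy in NumPy, not code from Hay's repositories: the sizes, the random weights, and every function name are illustrative assumptions, and real models add normalization and many attention heads.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # toy sizes, chosen for illustration only


def attention(x, Wq, Wk, Wv, Wo):
    """Comprehension: each position gathers from itself and earlier positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # causal mask
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ v) @ Wo


def ffn(x, W_in, W_out):
    """Memory: a wide lookup applied to each position independently."""
    return np.maximum(x @ W_in, 0.0) @ W_out


def layer(x, p):
    x = x + attention(x, p["Wq"], p["Wk"], p["Wv"], p["Wo"])  # read, comprehend, write back
    x = x + ffn(x, p["W_in"], p["W_out"])                      # read, look up, write back
    return x


def random_layer(scale=0.1):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W_in": (d_model, d_ff), "W_out": (d_ff, d_model)}
    return {name: rng.normal(scale=scale, size=shape) for name, shape in shapes.items()}


layers = [random_layer() for _ in range(3)]
stream = rng.normal(size=(seq_len, d_model))  # the scratchpad: one row per token

out = stream
for p in layers:
    out = layer(out, p)

attn_params = sum(p[n].size for p in layers for n in ("Wq", "Wk", "Wv", "Wo"))
ffn_params = sum(p[n].size for p in layers for n in ("W_in", "W_out"))
print(out.shape, attn_params, ffn_params)
```

Every layer only ever adds to the scratchpad, which is why its shape never changes. Even in this toy, the memory matrices hold twice as many numbers as the comprehension matrices.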
Weights: the frozen numbers
A model's weights are the billions of numbers set during training — the slow, expensive process where the model reads vast amounts of text. Once training ends, the weights freeze. Everything the model 'knows' is encoded in them. Until recently, they were treated as an unreadable blob.
The KV cache: working memory for a conversation
During a conversation, the model keeps short-term notes about everything said so far, called the KV cache. It grows with every message and, in long conversations, can become the largest memory cost of running a model, reaching multiple gigabytes. Much of the AI-serving industry engineers around managing it.
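The growth is simple arithmetic. The sketch below uses the standard sizing for a KV cache (one key vector and one value vector per token, per layer, per KV head); the model shape is a hypothetical example, not a figure from Hay's work.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of a standard KV cache.

    The leading 2 counts one key vector plus one value vector,
    stored per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape (an assumption for illustration), 16-bit values.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_tokens in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n_tokens=n_tokens, **shape) / 2**30
    print(f"{n_tokens:>7} tokens -> {gib:.0f} GiB")
```

With these assumed dimensions the cache comes to 1 GiB at 8,192 tokens and grows linearly to 16 GiB at 131,072 tokens, per conversation.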
3. Who Is Chris Hay
Chris Hay is a Distinguished Engineer at IBM and the CTO of IBM iX, IBM's customer-transformation consulting division — the arm that guides Fortune 500 companies through technology modernization. Distinguished Engineer is among IBM's most senior technical titles. Before IBM, he worked on M-Pesa, the Kenyan mobile-payments platform that became one of the most successful financial-infrastructure deployments in the world. He maintains over 130 public code repositories and a YouTube channel where he works through ideas in public.
Two calibration notes, both honest. First, he is a self-described trend-spotter who has been early to things that did not pan out: his own profile records past enthusiasm for the metaverse and web3. Second, and cutting the other way: unlike trend commentary, this work ships reproducible code with every claim, stands on established peer-reviewed science, and is accumulating a real contributor community. He is not publishing papers; he is publishing proofs you can run on a laptop. I respect the work, and the honesty.
4. What He Has Demonstrated
The work arrived as four escalating demonstrations. Each one builds on the last, and the sequence matters more than any single piece.
| Step | The claim | The demonstration |
|---|---|---|
| 1. The database | The model's memory is a queryable graph of facts. | A tool (LARQL) that lets you ask a model what it knows, watch an answer form layer by layer, and insert new facts without retraining. |
| 2. The redundancy | The conversation working memory (KV cache) is unnecessary. | The scratchpad already contains everything the cache holds. Demo: the equivalent of a 370,000-word context stored in 2.8 megabytes on a MacBook, thousands of times smaller than a conventional KV cache. |
| 3. The separation | Memory and comprehension don't need to share a machine. | Ran a 26-billion-parameter model with comprehension on a laptop GPU and the memory served from cheap, GPU-less servers — over the public internet — at near-local speed. |
| 4. The mechanism | Facts are read by address, and the address is the relationship. | Built the memory system by hand, number by number. Hand-wrote a brand-new fact into a model with zero training; the model answered as if it had always known it. |
The detail worth pausing on
The fourth video contains the deepest finding. The model stores many facts packed on top of each other in the same space — researchers call this superposition — and the natural assumption was that reading a fact back requires some untangling step. Hay's hand-built reconstruction shows otherwise: the model never untangles anything. It reads each fact directly by its address, and the address is the relationship itself ('capital of', 'works at', 'treats'). Storage is messy; reading is clean. That single property is why everything else works — why the memory can be turned into a database, why it can live on another machine, and why a fact written by hand at the right address reads back natively.
His closing line points at the destination: 'the lookup doesn't have to live in the weights at all.' If facts are read by address, the fact store can — in principle — be moved entirely outside the model.
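A toy version of "storage is messy; reading is clean" fits in a few lines. The sketch below is a textbook linear associative memory, offered as an illustration of the idea rather than a reconstruction of Hay's hand-built mechanism. The vector width, the random embeddings, and the perfectly orthogonal addresses are all assumptions; orthogonality in particular is the idealization that makes every read exact.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy width

# Toy embeddings for the things a fact can point at.
object_vec = {name: rng.normal(size=d) for name in ("Paris", "Tokyo", "IBM", "Nairobi")}

# One orthonormal address vector per (subject, relationship) pair.
basis, _ = np.linalg.qr(rng.normal(size=(d, d)))
addresses = [("France", "capital of"), ("Japan", "capital of"),
             ("Chris Hay", "works at"), ("Kenya", "capital of")]
address_vec = {addr: basis[:, i] for i, addr in enumerate(addresses)}


def write(memory, addr, obj):
    """Add one fact as a rank-one update; every fact shares the same matrix."""
    return memory + np.outer(object_vec[obj], address_vec[addr])


def read(memory, addr):
    """Read by address, then name the closest known object."""
    out = memory @ address_vec[addr]
    return min(object_vec, key=lambda name: np.linalg.norm(object_vec[name] - out))


# "Training": three facts superposed in a single matrix.
memory = np.zeros((d, d))
for addr, obj in [(("France", "capital of"), "Paris"),
                  (("Japan", "capital of"), "Tokyo"),
                  (("Chris Hay", "works at"), "IBM")]:
    memory = write(memory, addr, obj)

print(read(memory, ("France", "capital of")))   # Paris

# Hand-write a new fact with no training step, then read it back.
memory = write(memory, ("Kenya", "capital of"), "Nairobi")
print(read(memory, ("Kenya", "capital of")))    # Nairobi
print(read(memory, ("Japan", "capital of")))    # Tokyo, undisturbed by the write
```

No cell of the matrix belongs to any one fact, yet each read returns exactly one. When addresses are only approximately orthogonal, reads pick up interference from neighboring facts, which is the scale question raised in Section 8.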
5. The Science Underneath
None of the underlying science is fringe. The finding that a model's memory layers behave as key-value storage was published at a major conference in 2021 (Geva et al.) and is broadly accepted. Editing individual facts inside model weights is an established research line (ROME, 2022; MEMIT, 2023). Superposition — facts packed in shared space — comes from Anthropic's interpretability team (Elhage et al.). Relationships acting as clean linear addresses comes from work on linear relation decoding (Hernandez et al.). Even the boundary Hay draws between what a model looks up and what it must genuinely compute traces to one of the founding results of the field (Minsky & Papert, 1969). Hay's contribution, in his own words: 'I haven't seen these put together end to end on one model — that's what this is.'
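The key-value reading of a memory layer is an algebraic identity, and it can be checked directly. The sketch below assumes a plain ReLU feed-forward layer without biases; the sizes and random weights are illustrative, and many current models use gated variants that this toy does not cover.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_memories = 6, 10  # toy sizes

keys = rng.normal(size=(n_memories, d_model))    # rows of the first FFN matrix
values = rng.normal(size=(n_memories, d_model))  # rows of the second FFN matrix
x = rng.normal(size=d_model)                     # one scratchpad vector

# The usual matrix form of the layer.
matrix_form = np.maximum(keys @ x, 0.0) @ values

# The same computation read as a memory: each key scores the input,
# and the output is the score-weighted sum of the corresponding values.
match = np.maximum(keys @ x, 0.0)
memory_form = sum(score * value for score, value in zip(match, values))

print(np.allclose(matrix_form, memory_form))
```

Keys whose score is zero contribute nothing, so for any given input only a subset of the stored values is actually read.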
Just as telling, major labs are independently converging on adjacent pieces. Apple published a technique for racing ahead of the model's own scratchpad to pre-load memory (M2R2, 2025). Google shipped a fast-drafting system in Gemma 4 that pairs a small predictive model with a large one (May 2026). And Anthropic released Natural Language Autoencoders (May 2026) — a method that translates a model's internal scratchpad states into readable English sentences, which Anthropic has already used to detect models behaving differently when they suspect they are being tested. Different groups, different motives, same underlying separation of comprehension, memory, and scratchpad.
6. What We Can Reasonably Posit
These are implications that follow with medium-to-high confidence if the demonstrations hold at larger scale. None require new science — only engineering maturation.
Models split into a reasoning core and an external memory
The natural endpoint of demonstrations 1 through 4 is a model in two pieces: a comparatively small reasoning core (comprehension plus genuine computation) and a large fact store that can live on disk, on a server, or as a managed database. This mirrors a very old idea — early computers fused processor and memory until someone separated them, and the separation defined the architecture of computing. The same separation is now visible for AI models.
The hardware economics invert
Today, running large models requires expensive specialized GPU memory, because the entire model must sit in it. Demonstration 3 shows the memory portion — most of the model — running on ordinary CPU RAM, which costs a small fraction as much. A company could serve a very large model from a single RAM-heavy server, with each user's laptop handling only the small comprehension portion. Frontier-scale capability on commodity hardware stops being a contradiction.
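A back-of-envelope split shows why the memory portion dominates. The sketch assumes a standard transformer block (four full-size attention projections, a two-matrix feed-forward network four times wider than the model) and ignores embeddings, biases, and normalization; the layer count and widths are hypothetical, not the configuration Hay used.

```python
def block_params(d_model, d_ff):
    """Parameter counts for one standard transformer block (assumed layout)."""
    attention = 4 * d_model * d_model  # query, key, value and output projections
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    return attention, ffn


# Hypothetical model shape, 16-bit weights.
n_layers, d_model, d_ff = 32, 4096, 16384
bytes_per_param = 2

attention, ffn = block_params(d_model, d_ff)
attention_gib = n_layers * attention * bytes_per_param / 2**30
ffn_gib = n_layers * ffn * bytes_per_param / 2**30
ffn_share = ffn / (attention + ffn)

print(f"attention: {attention_gib:.0f} GiB, FFN: {ffn_gib:.0f} GiB, FFN share: {ffn_share:.0%}")
```

Under these assumptions attention needs 4 GiB and the feed-forward memory 8 GiB, so two-thirds of the weights are the part that demonstration 3 moves off the GPU. Gated feed-forward designs use three matrices rather than two, which would push the share higher.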
Knowledge updates become writes, not retraining
Today, updating what a model knows means an expensive training run, and every model carries a 'knowledge cutoff' — the date its information ends. If facts can be written by address, knowledge updates become database writes: instant, cheap, reversible, versioned. The knowledge-cutoff concept erodes. A model's knowledge becomes as current as its last write.
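What "instant, cheap, reversible, versioned" could mean in practice is easy to sketch with an append-only write log. The class below is a hypothetical illustration in plain Python; it is not LARQL's interface, and the company name and values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Append-only fact store: every write is kept, so any version can be read."""
    log: list = field(default_factory=list)

    def write(self, subject, relation, value):
        self.log.append(((subject, relation), value))
        return len(self.log)  # version number after this write

    def read(self, subject, relation, version=None):
        upto = len(self.log) if version is None else version
        for address, value in reversed(self.log[:upto]):
            if address == (subject, relation):
                return value
        return None  # nothing written at this address as of that version


store = FactStore()
v1 = store.write("ExampleCorp", "headquartered in", "Nairobi")
v2 = store.write("ExampleCorp", "headquartered in", "London")

print(store.read("ExampleCorp", "headquartered in"))              # London
print(store.read("ExampleCorp", "headquartered in", version=v1))  # Nairobi
print(store.read("ExampleCorp", "headquartered in", version=0))   # None
```

Reverting a bad write means reading at an earlier version rather than retraining, and the "cutoff" is simply the timestamp of the last entry in the log.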
A model's confidence becomes inspectable
Watching an answer form layer by layer (demonstration 1) reveals a signature: answers the model genuinely knows resolve cleanly and decisively; answers it is guessing at look contested and scattered. That is a real-time quality signal — a way to flag probable fabrication before the answer reaches a user. Pair it with Anthropic's English-translation technique and you can not only flag the uncertain moment but read a plain-language description of what the model was doing in it.
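One simple way to turn "decisive versus scattered" into a number is the entropy of the model's candidate-answer distribution at each layer. The sketch below is an assumption about how such a signal might be scored, not the method Hay's tool uses; the probabilities are invented for illustration and the one-bit threshold is arbitrary.

```python
import numpy as np


def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def looks_contested(layer_probs, threshold_bits=1.0):
    """Flag an answer whose final-layer distribution is still spread out."""
    return entropy_bits(layer_probs[-1]) > threshold_bits


# Invented per-layer distributions over four candidate answers.
known = [[0.25, 0.25, 0.25, 0.25], [0.70, 0.10, 0.10, 0.10], [0.97, 0.01, 0.01, 0.01]]
guessed = [[0.25, 0.25, 0.25, 0.25], [0.35, 0.30, 0.20, 0.15], [0.30, 0.30, 0.20, 0.20]]

for name, trajectory in (("known", known), ("guessed", guessed)):
    per_layer = [round(entropy_bits(p), 2) for p in trajectory]
    print(name, per_layer, "flag:", looks_contested(trajectory))
```

The first trajectory collapses toward a single answer as it moves up the layers; the second stays close to the two bits of a uniform guess and gets flagged.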
The honest boundary: lookup versus computation
This is the limit that keeps the picture honest. Facts — things the model looks up — are addressable, movable, writable. Skills — reasoning, multi-step problem solving, genuine computation — are not. They live in the circuitry, not the address book. Writing 'be careful with money' into a model's memory makes that sentence retrievable; it does not make the model careful. Knowledge externalizes; capability still requires training. Every serious extrapolation in the next section respects that line.
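The report traces this boundary to Minsky & Papert (1969), whose best-known result is that XOR cannot be computed by a single linear unit. The sketch below illustrates that gap; treating it as a stand-in for "lookup versus computation" is an interpretive assumption on our part, not something taken from Hay's code.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)  # XOR of the two inputs

# Best possible single linear readout (with a bias term), by least squares.
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ w


def relu(z):
    return np.maximum(z, 0.0)


def xor_net(inputs):
    """Hand-built two-step computation: sum the inputs, then correct the (1, 1) case."""
    s = inputs.sum(axis=1)
    return relu(s) - 2.0 * relu(s - 1.0)


print(np.round(linear_pred, 3))  # 0.5 for every input: no better than a coin flip
print(xor_net(X))                # exact XOR
```

The one-step readout cannot do better than answering 0.5 everywhere, while two chained steps get every case right. Adding more stored entries does not close that gap; adding a step of computation does.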
7. Where This Could Lead
This section is explicitly speculative — coherent extrapolation, not demonstration. Confidence decreases as you read down.
Knowledge becomes a market
Today, an AI lab's competitive moat is a single bundle: reasoning ability and knowledge, fused in one set of weights, priced together. Separation splits the moat. Reasoning cores would compete on capability. Fact stores would compete as content businesses — a medical knowledge store, a legal store, a live financial store — each maintained by domain specialists, versioned and licensed like any data product, readable natively by any compatible core. Knowledge stops being baked into a handful of labs' models and becomes an industry of curated stores.
Hallucination gets an honest 'not found'
A meaningful share of AI fabrication happens because the model has no way to represent 'there is no fact at this address' — it retrieves the nearest plausible thing instead. An external store can fail loudly, the way a database does. The read either returns a fact with provenance or returns nothing, and 'I don't know' becomes a real, mechanically grounded state rather than a behavior we hope training instilled.
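The contrast between "nearest plausible thing" and a loud miss can be shown with two read functions over the same store. Everything here is a hypothetical illustration: the store contents, the provenance label, and the use of string similarity as a stand-in for a model's fuzzy retrieval.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Fact:
    value: str
    source: str  # provenance label (hypothetical)


store = {
    ("France", "capital of"): Fact("Paris", "example-atlas-v1"),
    ("Japan", "capital of"): Fact("Tokyo", "example-atlas-v1"),
}


def nearest_read(subject: str, relation: str) -> Fact:
    """Fuzzy retrieval: always returns something, however poor the match."""
    best = max(store, key=lambda addr: SequenceMatcher(None, addr[0], subject).ratio())
    return store[best]


def strict_read(subject: str, relation: str) -> Optional[Fact]:
    """Address lookup: a fact with provenance, or nothing."""
    return store.get((subject, relation))


def answer(subject: str, relation: str) -> str:
    fact = strict_read(subject, relation)
    if fact is None:
        return "I don't know"
    return f"{fact.value} (source: {fact.source})"


print(nearest_read("Atlantis", "capital of").value)  # some real capital, wrongly
print(answer("Atlantis", "capital of"))              # I don't know
print(answer("France", "capital of"))                # Paris (source: example-atlas-v1)
```

The fuzzy read has no way to come back empty, so a question about a place that does not exist still yields a confident city name. The strict read makes the miss an explicit state the caller has to handle.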
Instructions become auditable objects
Today, an AI's instructions live either baked invisibly into its training or pasted into its prompt — and in both cases, compliance can only be judged by inspecting outputs. If instructions are written at known addresses, reads of those addresses become observable events. For the first time you could distinguish three failures that currently look identical: the model never consulted the rule, the model consulted it and deviated anyway, or the model consulted it and misunderstood it. Each demands a different fix; today, every fix is a guess.
Intent itself becomes reviewable
Anthropic's translation technique already reads internal states into English — including, in their published examples, models privately suspecting they are being evaluated while saying nothing. Combine that with snapshots of the scratchpad at the moment an AI agent makes a decision, and you get a new kind of audit artifact: a timestamped, tamper-evident record of not just what an agent did but what its internal state looked like when it decided. For autonomous agents executing real transactions, the auditable question shifts from 'did the action follow the rules' to 'was the agent's intent at the moment of action consistent with its authorization.' No current audit framework can ask that question.
Knowledge acquires lifecycle, provenance — and accounting
Pull the threads together and a pattern appears. If knowledge lives in writable external stores, every write needs authorization (who was allowed to add this fact?), every read can be metered (which agent consumed which knowledge, billed to whom?), and every store needs provenance and version history. Unverified writes are worse than hallucination: they are corruption the model trusts as its own memory. Those requirements have a name: they are the requirements of a ledger, with append-only history, authorized mutations, conservation rules (what was written equals what was authorized; what was consumed equals what is billed), and audit trails. The infrastructure layer beneath a knowledge economy does not exist yet, and it looks much less like today's AI tooling than like the transactional accounting systems that underpin finance. That gap is where this research program intersects with the report author's own work on accounting infrastructure for machine-to-machine commerce.
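A minimal sketch of those ledger properties, using only the Python standard library, might look like the following. The class, its method names, and the writer and reader identifiers are all hypothetical; a production system would need real identity, signatures, and durable storage.

```python
import hashlib
import json


class KnowledgeLedger:
    """Append-only, hash-chained log of authorized writes and metered reads."""

    GENESIS = "0" * 64

    def __init__(self, authorized_writers):
        self.authorized = set(authorized_writers)
        self.entries = []

    @staticmethod
    def _digest(prev, record):
        body = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev + body).encode()).hexdigest()

    def _append(self, record):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"record": record, "prev": prev,
                             "hash": self._digest(prev, record)})

    def write(self, writer, address, value):
        if writer not in self.authorized:
            raise PermissionError(f"{writer} may not write facts")
        self._append({"op": "write", "who": writer,
                      "address": list(address), "value": value})

    def read(self, reader, address):
        value = None
        for entry in self.entries:  # latest write to this address wins
            record = entry["record"]
            if record["op"] == "write" and record["address"] == list(address):
                value = record["value"]
        self._append({"op": "read", "who": reader, "address": list(address)})
        return value

    def usage(self):
        """Reads per consumer: the basis for metering and billing."""
        counts = {}
        for entry in self.entries:
            record = entry["record"]
            if record["op"] == "read":
                counts[record["who"]] = counts.get(record["who"], 0) + 1
        return counts

    def verify(self):
        """Recompute the hash chain; any edit to history breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or self._digest(prev, entry["record"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


ledger = KnowledgeLedger(authorized_writers={"curator"})
ledger.write("curator", ("France", "capital of"), "Paris")
print(ledger.read("agent-7", ("France", "capital of")))  # Paris
print(ledger.usage())                                    # {'agent-7': 1}
print(ledger.verify())                                   # True
```

Because reads are logged alongside writes, the same chain answers both "who changed this fact" and "who consumed it". Altering any past record changes its hash and makes `verify()` return False.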
The endpoint: training becomes how models sleep
Neuroscience has long held that human memory uses two coupled systems — a fast, writable one for immediate learning and a slow, consolidated one holding deep structure, with sleep transferring between them. The architecture emerging here lands on the same design: a fast external store for continuous knowledge writes, a slow frozen core holding concepts and reasoning, and periodic consolidation runs that fold high-value knowledge into the core itself. In that world, training is no longer how models learn day to day. It is how they sleep.
8. The Honest Caveats
Everything above deserves a discount rate. The specifics:
- Scale is unproven. The demonstrations run on small models: 4 billion parameters on a laptop, and 26 billion for the distributed demonstration in step 3. Whether clean address-based reading survives at hundreds of billions of parameters and billions of stored facts, where packed facts increasingly interfere, is the central open question.
- Writing is verified only for direct recall. A hand-written fact answers correctly when asked directly. Whether the model can use that fact inside multi-step reasoning — as a premise rather than an answer — has not been shown.
- Expert validation is pending. The mechanistic-interpretability research community (the academics and lab teams who would stress-test these claims hardest) has not yet visibly engaged. Community traction is real but consists mostly of practitioners, not yet specialists.
- The incumbents have reasons to resist. Frontier labs' business model is the fused bundle of reasoning plus knowledge. Externalized knowledge erodes that moat, so the best-resourced organizations have weak incentives to push this direction — which slows it regardless of technical merit.
- The likeliest near-term outcome is the modest one. Not a wholesale rearchitecting of AI, but adoption in the niche where the trade-offs already favor it: organizations that value governance, auditability, data sovereignty, and freshness over raw capability — enterprises, regulated industries, on-premise deployments. Notably, that niche is large, well-funded, and currently locked out of frontier AI precisely because of those concerns.
9. Bottom Line
A credible senior engineer has spent six weeks publicly demonstrating, with runnable code, that an AI model's knowledge behaves like a database: queryable, traceable, writable, and movable — possibly all the way out of the model itself. The science underneath is established; the assembly is new; the scale is unproven. If it holds, the consequences run from running frontier-scale models on cheap hardware, to ending the knowledge cutoff, to making AI intent auditable, to the emergence of knowledge as a governed, versioned, accounted-for asset class. Even the conservative outcome — adoption only where governance matters most — would reshape how regulated industries deploy AI. The right posture is the one this report takes: watch closely, verify independently, and notice that the infrastructure this future requires has not been built yet.
Appendix: Sources and Status
The work
- LARQL (the query tool) — Apache 2.0 licensed
- The Mechanism (hand-built memory)
- Videos: 'LLMs are databases — so query them'; the KV-cache elimination video; the distributed 26B demonstration; 'The Mechanism' — all on Hay's YouTube channel (@chrishayuk)
Key research it stands on
- Foundational science: Geva et al. (2021) — memory layers as key-value stores; Elhage et al. — superposition; Meng et al. — ROME / MEMIT fact editing; Hernandez et al. — relationships as linear addresses; Minsky & Papert (1969) — the lookup/computation boundary
- Convergent industry work: Apple M2R2 (2025); Google Gemma 4 multi-token drafting (May 2026); Anthropic Natural Language Autoencoders (May 2026)
Claim status at a glance
| Claim | Status |
|---|---|
| Model memory is queryable and traceable | Demonstrated, reproducible, on solid published science |
| Conversation working memory is redundant | Demonstrated at laptop scale; falsifiable code available |
| Memory can be served from cheap remote machines | Demonstrated at 26B scale, including over public internet |
| Facts can be hand-written and read natively | Demonstrated for direct recall; multi-step use unproven |
| Knowledge can live fully outside the model | Stated direction; not yet demonstrated |
| All of it holds at frontier scale | Unknown — the central open question |