Zero-Inflated Intelligence

latent-spaces, reliability, AI, hallucination, series

Part IV of Latent Spaces. After The Latent Connection.

Part III closed the philosophical circle. The tree is an objective structure; the observer is not. What's buildable is the tree side of the seam — growth, crumpling, routing — and Part IV begins with the question that follows: what goes wrong when the tree is absent?

The standard framing treats hallucination as a bug to fix. A moral failing, almost — the model lied. The reliability-theoretic framing is less dramatic and more useful: hallucination is a failure mode of a probabilistic system, and failure modes have structure.

The die with missing faces

Any system that produces outputs from a probability distribution has a sample space — the set of all outputs it can generate. For a loaded die, the sample space is {1, 2, 3, 4, 5, 6} with unequal weights. For an LLM responding to a prompt, the sample space is the set of all token sequences the model can produce, weighted by its parameters and the prompt context.

When the sample space contains the correct answer, the system's errors are stochastic. It might produce a wrong answer on any given roll, but the right answer is in there — resampling, averaging, or adjusting temperature can improve performance. These errors are noise around a correct central tendency. Reliability is meaningful. You can measure it, improve it, bound it.

But a gap in the training data is not itself something the model can retrieve. When the sample space does not contain the correct answer, something structurally different happens. The model produces an output (it cannot avoid doing so), but the output is drawn from a distribution that literally cannot include the right answer. No amount of resampling helps. Temperature is irrelevant. The die doesn't have that face. The failure rate in this region isn't high. It's one.

The observed output distribution is a mixture of these two regimes. In some regions of query space, the model has coverage — the right answer is a face on the die, and its errors are stochastic. In other regions, the die is structurally incomplete, and every output is wrong with certainty. The machine produces identical fluency in both regimes. That's the problem.

Zero-inflated distributions

In statistics, this mixture has a name. A zero-inflated distribution models a process with two components: a point mass at zero (the event structurally cannot produce a positive outcome) and a count distribution (the event can produce outcomes, with stochastic variation). The classic example is fisheries data — some lakes have no fish at all (structural zero: wrong habitat, dried up, poisoned), and some lakes have fish in varying abundance (stochastic count). A single observed zero is ambiguous: was this a lake with no fish, or a lake with fish where the net came up empty?

With LLMs, a wrong answer is similarly ambiguous. Was this a query in a region where the model has no coverage (structural zero — the die doesn't have that face), or a region where it has coverage but sampled poorly (stochastic miss — the die rolled wrong)?

You cannot distinguish the two from a single output. The hallucination looks exactly like a mistake. The structural impossibility is wrapped in the same fluency as stochastic noise. This is what makes hallucination qualitatively different from ordinary error, and why treating it as a precision problem — just make the model more careful — misses the mechanism.
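The fisheries ambiguity is easy to simulate. The sketch below draws from a zero-inflated Poisson (the parameters are illustrative) and asks: among observed zeros, what fraction were structural? From the value alone you cannot tell; only the mixture model assigns a probability.

```python
import math
import random

random.seed(0)

def poisson(lam):
    """Knuth's simple Poisson sampler."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def zip_sample(pi_zero, lam):
    """Zero-inflated Poisson: returns (catch, regime)."""
    if random.random() < pi_zero:
        return 0, "structural"          # lake has no fish at all
    return poisson(lam), "stochastic"   # fish present; net may still come up empty

pi_zero, lam = 0.3, 2.0
draws = [zip_sample(pi_zero, lam) for _ in range(100_000)]
zeros = [regime for catch, regime in draws if catch == 0]

# Among observed zeros, the fraction that were structural...
frac_structural = zeros.count("structural") / len(zeros)

# ...matches the analytic posterior: pi / (pi + (1 - pi) * exp(-lam))
analytic = pi_zero / (pi_zero + (1 - pi_zero) * math.exp(-lam))
print(f"simulated P(structural | zero) = {frac_structural:.3f}")
print(f"analytic  P(structural | zero) = {analytic:.3f}")
```

The point of the exercise: the observed zero carries no regime label. The label lives in the mixture, not in the sample.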

Guardrails as intentional zero-inflation

Here's the inversion that clarifies the engineering.

A model without guardrails has a zero-inflated output distribution where the zeros are accidental — structural gaps in coverage distributed invisibly across the weight space. Guardrails add intentional zeros. Same mathematical object. Different intent.

Every guardrail, regardless of implementation, does essentially the same thing: multiply the model's stochastic output by a binary mask. Where the mask is 1, the output passes through. Where it's 0, the output is suppressed. The result is a zero-inflated distribution — by design.
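A minimal sketch of that claim, with a toy discrete distribution standing in for the model's output (none of this corresponds to a real model's API): the guardrail zeroes part of the sample space and renormalizes what survives.

```python
def apply_mask(dist, mask):
    """Multiply a discrete output distribution by a binary mask,
    then renormalize. Returns None if every output is suppressed."""
    kept = {out: p for out, p in dist.items() if mask.get(out, 1) == 1}
    total = sum(kept.values())
    if total == 0:
        return None  # no permitted output: refuse rather than sample
    return {out: p / total for out, p in kept.items()}

dist = {"grounded_answer": 0.6, "risky_claim": 0.3, "off_topic": 0.1}
mask = {"risky_claim": 0}  # suppress one region of the sample space

# The surviving faces reweight; the suppressed one becomes a hard zero.
print(apply_mask(dist, mask))
```

The result is zero-inflated by construction: a point mass of refusal plus a stochastic distribution over what remains.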

The mask can be applied at three levels, and the levels form a progression.

Downstream filtering. The model generates an output; a classifier examines it and suppresses or flags it. The die rolled. You looked at the result and hid it. This is wasteful — the expensive computation already happened — and leaky, because the filter is itself a probabilistic system with its own coverage gaps. But it's simple, and it's where most production systems start.

System-prompt constraint. The model's generation is conditioned on instructions that narrow the output distribution before sampling. The die still has those faces, but the prompt biases the roll away from them. This is a softer mask — not binary but weighted — which is why it leaks. Jailbreaks are precisely the failure mode of a soft mask: adversarial inputs that shift the weights back toward the suppressed region. The mask was never structural. It was persuasion.

Upstream classification. A separate, simpler model examines the query before the expensive model runs and makes a binary decision: is this query in a region where the model has coverage, or not? If not, don't roll the die at all. The zero-inflation happens before the expensive computation. The mask is cheaper, harder (binary, not weighted), and earlier. A RAG system exemplifies this pattern: the absence of appropriate search results can take the output in a completely different direction before any generation happens.

The progression from downstream to upstream is a progression toward determinism of content, though not necessarily of how that content is rendered. Downstream filtering redacts the answer after it's produced: the system detects the problem and covers it up. System-prompt constraint tries to prevent the hallucination through instruction, hoping the model can police itself, which has produced a cottage industry of sleuths sharing jailbreak techniques. Upstream classification eliminates the possibility of reaching into a region of the solution space the model cannot answer from.
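The difference between the first and third levels is where the cost lands. A hedged sketch, with `expensive_model`, `filter_ok`, and `has_coverage` as hypothetical stand-ins for the real components:

```python
def expensive_model(query):
    """Stand-in for the costly generation step."""
    return f"fluent answer to: {query}"

def filter_ok(output):
    """Stand-in downstream classifier -- itself probabilistic in practice."""
    return "forbidden" not in output

def has_coverage(query):
    """Stand-in upstream classifier: is this query in a covered region?"""
    return query in {"covered topic"}

def downstream(query):
    out = expensive_model(query)          # die already rolled: cost paid
    return out if filter_ok(out) else None

def upstream(query):
    if not has_coverage(query):           # decide before rolling
        return None                       # explicit, cheap refusal
    return expensive_model(query)
```

Both produce the same zero-inflated output distribution; the upstream version simply never pays for the roll it was going to discard.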

The obvious architecture

An upstream binary classifier gives you one fork: answer or don't answer. The resolution is too coarse. You know the model has coverage somewhere and lacks it somewhere else, but a single gate can only split the entire query space in two.

What you need is a cascade of forks. Each one narrows the query into a more specific region, and at each fork you can ask: does the model have coverage in this region? By the time you reach the terminal node, you've routed through enough binary decisions that the remaining sample space is dense — the die at that leaf has most of its faces.

This is a tree. And to anyone who has worked with database indexes, this should be utterly obvious.

A B-tree doesn't make the data better. It doesn't improve the records, clean the values, or add missing entries. What it does is route a query through a cascade of comparisons so that by the time you reach a leaf page, you're only asking for records that the page actually contains. Scanning every row in a database is absurd — not because it's slow (though it is), but because you're asking pages for records they don't contain. And unlike a null lookup on disk, a semantic page will give you something — which is worse than nothing.

Yet that's what flat inference does. The query enters the full weight space. The model generates from its entire parameter surface. If the answer lives in a region of that surface with good coverage, you get a good answer. If it doesn't, you get a confident, fluent, wrong answer — a hallucination. Then you filter downstream. Scan-and-filter. On every query. Mixture of Experts is the right kind of idea, but the routing is learned once during training and frozen — it can't extend to regions the training didn't cover.

On the storage side, this has been solved infrastructure for fifty years. You don't expect a storage system to produce data it doesn't contain. You index it so the routing is explicit, and you know which pages cannot exist before you touch them.

The tree is the same architecture applied to inference. Route first. Generate second. Each fork in the tree is a binary classifier — not on the content of the answer, but on whether the model has coverage for this class of query. By the time you reach generation, the sample space has been narrowed to a region where the die is complete. Not perfect — stochastic errors remain — but structurally complete. The noise is then just noise, not absence.
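The cascade of coverage forks can be sketched directly. Everything here is illustrative (the predicates, the regions, the refusal string are invented for the example), but the shape is the point: routing is a walk through binary coverage checks, and generation only happens at a leaf whose die is complete.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    region: str
    has_coverage: bool   # is the die at this leaf complete?

@dataclass
class Fork:
    test: Callable[[str], bool]
    yes: "Node"
    no: "Node"

Node = Union[Fork, Leaf]

def route(node: Node, query: str) -> Leaf:
    """Walk the forks until a terminal region is reached."""
    while isinstance(node, Fork):
        node = node.yes if node.test(query) else node.no
    return node

# A toy coverage tree: code queries are covered, recent events are not.
tree = Fork(
    test=lambda q: "code" in q,
    yes=Leaf("programming", has_coverage=True),
    no=Fork(
        test=lambda q: "2025" in q,
        yes=Leaf("recent events", has_coverage=False),  # structural zero
        no=Leaf("general", has_coverage=True),
    ),
)

def answer(query: str) -> str:
    leaf = route(tree, query)
    if not leaf.has_coverage:
        return f"refuse: no coverage in '{leaf.region}'"
    return f"generate from '{leaf.region}'"
```

Route first, generate second: the refusal at an uncovered leaf is a structural zero made explicit, instead of a fluent wrong answer made invisible.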

What the tree framework predicts

The complexity budget established that C(M) + C(X) ≥ C(Y) — the model and prompt must jointly cover the task's complexity. The zero-inflated frame sharpens this: the inequality isn't about total capacity. It's about local capacity. The model may have enormous aggregate complexity and still have structural zeros in specific regions.

A tree makes the coverage map legible. Each branch represents a region of query space with a known depth of coverage. Shallow branches — few training examples, sparse representations — are where the structural zeros live. Deep branches — dense training, refined representations — are where the die is complete. The routing decision at each fork is, structurally, a coverage check.

This reframes every intervention in the current toolkit:

RLHF reshapes the stochastic hump — reweighting outputs within regions where the model has coverage. Useful in regime 2 (stochastic errors on a complete die). Meaningless in regime 1 (structural zeros). RLHF doesn't add faces to the die. It adjusts the odds on existing faces. The model doesn't learn why something is wrong; it learns that saying it gets penalized. The coverage gaps haven't moved. You've resculpted the surface without deepening the tree, and possibly reduced the global novelty of the model.

Chain-of-thought and reasoning extend the prompt complexity C(X), buying more room in the feasibility condition. This helps when the gap is between the model's representations and the specific answer (regime 2 — the die has the face but it's hard to reach). It doesn't help when the model has no representation at all (regime 1 — the face doesn't exist), although longer reasoning traces can usefully sharpen the contrast between the two regimes, making hallucinations in uncovered regions easier to spot.

Retrieval-augmented generation injects external content into the prompt, increasing C(X) and potentially filling coverage gaps. This is closer to adding faces to the die — but only if the retrieval is itself well-routed. Current RAG is flat retrieval: cosine similarity in embedding space, memoryless, starting from the door every time. It returns chunks that contain the words, not chunks that a reasoning process once connected to those words. RAG with tree-structured retrieval — where the routing learns from its own history — would be RAG that actually fills structural zeros.
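Flat retrieval, as described, fits in a few lines; the vectors below are toy illustrations. What matters is what's absent: no state survives between queries, so nothing a previous reasoning process connected ever influences the next route.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flat_retrieve(query_vec, chunks, k=2):
    """Memoryless top-k: every query starts from the door.
    Nothing learned from previous routes is reused."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

chunks = [
    {"text": "b-trees route queries",  "vec": [0.9, 0.1, 0.0]},
    {"text": "fish population counts", "vec": [0.0, 0.2, 0.9]},
    {"text": "index pages and leaves", "vec": [0.8, 0.3, 0.1]},
]

print(flat_retrieve([1.0, 0.0, 0.0], chunks))
```

It returns chunks near the words, nothing more; a tree-structured retriever would additionally record which routes led to answers that held up.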

Scaling (more parameters) increases aggregate C(M), which generally reduces the number of structural zeros. But the reduction is uneven — some regions get dense coverage while others remain sparse. Scaling fills the most common gaps first (high-frequency patterns in training data) and leaves the long tail untouched. Past a point, you're adding faces that the die already has, and possibly greater fecundity in producing subtle and pleasing hallucinations.

None of these add the routing that would make coverage legible. They operate on the die — adding faces, adjusting weights, injecting context — without building the index that tells you which faces exist before you roll.

The two problems

The zero-inflated frame separates hallucination into two distinct engineering problems, which is useful because they have different solutions.

Problem 1: Detect regime. Given a query, determine whether the model's sample space contains the correct answer (regime 2, stochastic) or not (regime 1, structural zero). This is the coverage check — the upstream classifier, the B-tree routing. It doesn't require knowing the right answer. It requires knowing whether the model has the right answer in its distribution.

Problem 2: Fill gaps. Where structural zeros exist, add coverage — through targeted training, retrieval, or routing to a different model that does have coverage. This is the growth operation from Post 7: extending the tree into regions where it currently has no branches.

Current approaches conflate the two. RLHF tries to solve Problem 1 by modifying the distribution (Problem 2's territory). Guardrails try to solve Problem 1 with a blunt mask. Scaling tries to solve Problem 2 by brute force. Disentangling them — route first, then generate — is the tree architecture applied to inference.

What routing preserves

There's a property of trees that matters here and that none of the flat interventions share: routing is additive.

Add a branch to a tree and the existing branches don't change. A new leaf page in a B-tree doesn't corrupt the pages that were already there. The index grows; it doesn't mutate. Coverage in one region is structurally independent of coverage in another because the routing isolates them. You can extend the tree indefinitely without disturbing what it already knows.

Contrast this with every operation on a shared weight space. Fine-tuning modifies the same parameters that encode existing capabilities. LoRA constrains the modification to a low-rank subspace, but it's still the same substrate — every parameter participates in every output. RLHF reshapes the entire output surface. These operations cannot change one region without perturbing the rest, because there are no regions. There's just the surface.

Catastrophic forgetting is not a bug in these approaches. It is their structural signature — the inevitable consequence of modifying a shared, non-indexed surface. You want the model to learn new things without losing old things, but the old things and the new things live in the same undifferentiated weight space. Every modification is a trade.

A tree doesn't have this problem because routing is a selection operation, not a modification operation. You're choosing which sub-model to query, not changing the sub-model. The die at each leaf stays the same. Adding a new die — a new leaf, a new region of coverage — doesn't alter the faces on the existing dice. The total reachable output space only grows. Nothing that was reachable becomes unreachable. Nothing that was a face on an existing die gets filed off to make room.
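The additivity claim in miniature. Leaves live in a routing table, growth is insertion, and insertion leaves existing entries untouched; a shared weight vector has no region boundary to protect (all names here are illustrative):

```python
# Tree side: a routing table of independent leaf "dies".
leaves = {
    "databases": ["b-tree", "wal", "mvcc"],
    "parsing":   ["lr", "peg", "pratt"],
}
snapshot = {region: list(faces) for region, faces in leaves.items()}

leaves["distributed"] = ["raft", "paxos"]   # grow: a new region of coverage

# Existing regions are exactly what they were before the insert.
assert all(leaves[r] == snapshot[r] for r in snapshot)

# Flat side: one shared parameter vector; any update touches everything.
weights = [0.2, 0.5, 0.3]
update  = [0.0, 0.1, -0.05]
new_weights = [w + d for w, d in zip(weights, update)]
# Every output depends on every weight, so every update is a trade.
```

Selection versus modification: the routing table only ever gains keys, while the weight vector can only ever move.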

This is the same property that makes B-trees reliable in databases: you can insert indefinitely without corrupting existing reads. The data structure guarantees isolation between regions. Flat weight spaces have no such guarantee because they have no such structure.

The next post recovers the current AI toolkit through this lens. LoRA, distillation, fine-tuning, MoE — each is a tree operation applied to a non-tree structure. Understanding which operation each one performs, and where each one breaks, follows from taking the zero-inflated model seriously.