Growth and the Missing Architecture

latent-spaces · AI · architecture · predictive-coding · MoE · series

Part IV of Latent Spaces, concluded. After The Toolkit, Recovered.

The previous post audited the toolkit and found it uniformly subtractive. Leaf-pruning, terminal collapse, branch-pruning — three operations at three scales, all removing something from the tree. No operation in the current inventory makes the tree larger. That absence is not incidental. It is the structural consequence of working on a flat, shared weight space where every modification is also an interference.

What growth requires

The knowledge tree from Part III defined growth as smooth branch extension — the L-system deepening in a familiar direction, adding leaves without disturbing the existing topology. Crumpling was the other operation: manifold folding that brings distant branches into proximity, a phase transition rather than gradual accretion. Both are additive. Neither has a counterpart in the current toolkit.

Growth on a tree has four requirements.

A structured prior as a starting point rather than random weights. The knowledge tree distinguished human learning from neural network training precisely here: humans begin primed, with a tree already shaped by genetics and early experience, and new input routes through existing structure. Neural networks begin from random initialization and build structure through prediction minimization. The difference is not philosophical — it determines whether new capability extends from an existing branch or must construct its own scaffolding from noise. Starting from a strong model — a distilled, RLHF'd, instruction-tuned model with broad but uneven coverage — is starting from a tree with deep branches in common regions and sparse coverage at the periphery. Growth means extending the periphery.

A curriculum of ordered inputs with monotonically increasing surprisal relative to the growing branch. This is the smoothness condition from Constructal Geometry applied to learning: the n-th extension requires differentiability at the (n-1)-th level. Prerequisites are not a pedagogical convenience but the mathematical condition under which the manifold remains smooth enough for the next step to be well-defined. A curriculum that violates this — jumping to material the tree has no branch to receive — produces the same failure as a fractal trying to iterate past its scaling range. The rule can't apply because the substrate isn't ready.
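
The ordering condition can be sketched mechanically: score each candidate input by its surprisal relative to the current tree, and sort. A minimal sketch, with a hypothetical `surprisal` scoring function standing in for the model's actual prediction error:

```python
def order_curriculum(inputs, surprisal):
    """Order inputs so each step is only slightly more surprising
    than the last: the smoothness condition expressed as a sort."""
    return sorted(inputs, key=surprisal)

# Toy stand-in: surprisal measured as depth in a prerequisite chain.
depth = {"arithmetic": 0, "algebra": 1, "calculus": 2, "topology": 3}
curriculum = order_curriculum(list(depth), depth.get)
# curriculum == ["arithmetic", "algebra", "calculus", "topology"]
```

In practice the scores would be recomputed as the branch deepens, since each extension changes what the tree finds surprising; a static sort is the degenerate case.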

Persistence at architectural depth, not context depth. The branch must stay after the input passes, which sounds obvious but rules out most of what passes for "learning" in current systems. In-context learning is not growth — it's routing through a temporarily extended context window that evaporates when the conversation ends. Fine-tuning modifies the substrate globally. Neither produces an isolated, persistent branch. The branch must be written into the architecture, not the context and not the shared weights.

The smoothness condition itself, applied to the attachment point. Growth is possible where the manifold is differentiable — where the existing tree provides enough structure for the new extension to attach. Where prerequisites are absent, the manifold isn't smooth, and growth requires crumpling first: a dimensional fold that brings a distant branch into proximity so it can serve as the attachment point. Growth without crumpling is incremental. Growth through crumpling is what the series has been calling insight — and it's a harder problem, deferred to the forest.

The energy argument

The brain doesn't do backpropagation.

It consumes roughly twenty watts — about twenty percent of the body's resting metabolic budget for two percent of its mass. It operates near its energy ceiling at all times. Every additional operation competes with maintenance, sensory processing, motor control. The organism cannot afford a learning algorithm that requires a full backward pass through every synapse for every update. The selective pressure toward energy-efficient learning is not subtle. It is the dominant constraint.

This is Odum's maximum power principle from The Recursive Second Law, applied to cognition. Systems that dissipate at intermediate efficiency — not maximum rate, not maximum precision, but the operating point where useful work per unit time is highest — outcompete systems at either extreme. A learning algorithm that computes a global gradient is the Carnot engine of cognition: thermodynamically optimal and practically useless, too slow and too expensive for a substrate that can't afford the computation. The energy budget demands something that converges faster, updates locally, and spends its computation only where the error actually is.

The result is what Kahneman named System 1 and System 2, but the naming obscures the mechanism. System 1 is not a different kind of thinking but routing through deep, well-worn branches — inference along paths where the tree has high coverage and the prediction error at each node is near zero, cheap and fast because the computation stays local. System 2 is the brain choosing to pay the energy cost of a more global computation — stepping back from the established routing, considering multiple branches, propagating error across levels — expensive and slow, but necessary when the local predictions don't converge.

The transition from System 2 to System 1 — the process by which effortful, conscious processing becomes automatic and fast — is exactly the process of branch deepening. The tree grows until the routing along that branch produces near-zero prediction error, at which point the expensive global computation is no longer needed. The branch has been internalized. The energy budget is freed for the next frontier.

Any learning algorithm that claims to support growth must respect this thermodynamic constraint. It must be local — updating only the nodes along the active path. It must be cheap in context — not necessarily cheaper than backpropagation in general, but cheaper when starting from a nearly converged state and touching only the nodes that need adjustment. And it must produce branches that, once grown, are cheap to traverse.

Predictive coding as growth mechanism

The algorithm that meets these requirements has been developing in computational neuroscience for twenty-five years, under the name predictive coding.

The core idea, traced to Rao and Ballard (1999) and formalized by Friston's free energy principle, is architectural: each layer in a hierarchy generates a top-down prediction of what the layer below should look like, and the layer below sends back only the prediction error — the residual between what was predicted and what was observed. Learning happens when a node adjusts its parameters to reduce its own prediction error, using only local information: the prediction it received from above, the error signal from below, and its own current state.

There is no backward pass, no global loss function, no gradient flowing from the output layer through every intermediate layer back to the input — each node is a self-contained prediction engine that improves by minimizing the discrepancy between what it expected and what it got. The updates compose — layer by layer, each one settling its local equilibrium — but they compose forward, not backward.
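
The local rule can be shown in a few lines of NumPy. This is a minimal sketch of one node, with the state held fixed so the weight dynamics are easy to see; in full predictive coding the state `x` would also settle against the error:

```python
import numpy as np

rng = np.random.default_rng(0)

# One node in the hierarchy: its state x generates a top-down
# prediction W @ x of the layer below; the layer below returns only
# the residual. Learning is local: the weight update uses just the
# error signal and the node's own state.
x = rng.normal(size=4)
x /= np.linalg.norm(x)            # unit-norm state for a stable step size
W = np.zeros((8, 4))              # generative weights, initially blank
target = rng.normal(size=8)       # bottom-up signal from the child layer

lr = 0.1
for _ in range(200):
    error = target - W @ x        # the only signal sent upward
    W += lr * np.outer(error, x)  # local rule: error times own state

assert np.linalg.norm(target - W @ x) < 1e-6  # prediction error settled
```

No loss was ever defined over the whole network; the node improved using only what arrived at its boundaries.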

The semantic hierarchy this produces is the knowledge tree rendered in neural architecture — abstract representations at the top generating predictions, concrete representations at the bottom measuring error, the trunk predicting the shape of the branches and the branches predicting the shapes of their leaves. When a leaf matches its branch's prediction, the error is zero — recognition, K(new | tree) ≈ 0, the input routes cleanly through existing structure. When the leaf doesn't match — positive surprisal — the error propagates upward, but only as far as it needs to. If the branch can adjust to accommodate the new leaf, the error stops at the branch. If the branch can't, the error reaches the trunk, and the restructuring is deeper.

This is exactly the surprisal-triggered learning from The Knowledge Tree, with a concrete algorithmic implementation. The knowledge tree's two operations map directly: growth is what happens when prediction errors are small and local — the existing branch extends to accommodate a new leaf. Crumpling is what happens when prediction errors are large and propagate to the trunk — the hierarchy itself must restructure.

The key property — the one that makes predictive coding a candidate for growth on a tree architecture — is locality of update. Backpropagation finds a global minimum (or tries to) by computing the gradient of a single loss with respect to every parameter. Predictive coding finds successive local equilibria: each node minimizes its own prediction error, and the equilibria compose without a global coordinator. Millidge, Tschantz, and Buckley showed that under certain conditions, predictive coding approximates the parameter updates that backpropagation would compute. But the convergence path is different, and on a tree-structured architecture, that difference becomes structural.

On a flat weight space, the distinction between local and global optimization is mostly computational — you get the same answer either way, just with different memory and communication patterns. But on an architecture where routing selects which nodes participate in a given computation — an MoE model — locality becomes isolation. The nodes that the router selected for this input receive prediction errors and update. The nodes that weren't selected receive nothing. Not because a regularization term penalizes their update. Not because a careful training schedule freezes their parameters. Because the error signal was never routed to them. The isolation is architectural, guaranteed by the routing, and total.
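
The isolation can be demonstrated directly. A toy top-1 MoE layer, with a hypothetical learned router; only the expert the router selects ever receives an error signal, so only it can change:

```python
import numpy as np

rng = np.random.default_rng(1)

n_experts, d = 4, 8
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
router = rng.normal(size=(n_experts, d))   # hypothetical learned gate

def top1(x):
    return int(np.argmax(router @ x))      # hard top-1 routing

x = rng.normal(size=d)
target = rng.normal(size=d)

snapshot = [e.copy() for e in experts]
k = top1(x)
error = target - experts[k] @ x            # error exists only on the routed path
experts[k] += 0.01 * np.outer(error, x)    # only the selected expert updates

for i in range(n_experts):
    changed = not np.allclose(experts[i], snapshot[i])
    assert changed == (i == k)             # isolation is architectural
```

The unselected experts are unchanged not by a freezing schedule but because no signal ever reached them.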

The error map

Take an existing model — a mixture-of-experts architecture with a learned routing network. For a given input domain — a corpus of inputs that represent the capability you want to grow — run each input through the model and record three things: which experts were selected by the router, what output was produced, and what the correct output should have been.

At each node along the selected path, compute the prediction error: the difference between what that node predicted (its top-down expectation given its parent's state) and what it actually received (the bottom-up signal from its child). In a standard feedforward pass, these errors are invisible — they're the gradients you would have computed if you were doing backpropagation, but since inference doesn't backpropagate, they're never materialized. Predictive coding materializes them as a matter of course, because the forward pass is the process of settling prediction errors at each node.

Accumulate these per-node errors across the input domain. Average them. The result is an error map: a heat map of the model's internal prediction failures, indexed by node and by expert.
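
The accumulation step is simple bookkeeping. A sketch, assuming a hypothetical `run(x)` that exposes the per-node prediction errors along the routed path (the names `build_error_map` and `toy_run` are illustrative, not an existing API):

```python
from collections import defaultdict

def build_error_map(corpus, run):
    """Average per-node prediction errors over an input domain.

    `run(x)` yields (node, error) pairs, where node identifies a
    (layer, expert) position on the routed path for input x.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x in corpus:
        for node, err in run(x):
            totals[node] += err
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}

# Toy run: pretend expert 2 predicts well at layer 0, badly at layer 1.
def toy_run(x):
    return [((0, 2), 0.01), ((1, 2), 0.8 + 0.1 * x)]

emap = build_error_map([0.0, 1.0], toy_run)
# max(emap, key=emap.get) == (1, 2): the hot node, the growth locus
```

The hot entries of the map are the candidate growth loci; the cold entries are candidates for compression, a duality returned to below.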

The map reveals three things.

Where the errors are uniformly low, the tree already handles this domain. The existing routing selects experts whose internal predictions are well-calibrated. Growth is unnecessary.

Where the errors are concentrated in a few nodes within otherwise-capable experts, the tree is close — the right experts are firing, but their internal representations need deepening. This is the locus of growth: specific layers within specific experts that need their predictions refined. The operation is analogous to LoRA, but targeted by the error map rather than applied uniformly. You know which weights to update because the prediction errors tell you.

Where the errors are large and distributed across all selected experts — no expert handles this domain well — you're at the boundary of the tree's coverage. Growth here requires either extending an existing expert into new territory or adding a new expert. This is the harder case, and it's where the forest architecture (discussed below) earns its structure.

Selective deepening

The optimization of the identified nodes uses predictive coding's local update rule, not backpropagation.

For each node in the error locus: receive the top-down prediction from its parent, compare to the bottom-up signal from its child, compute the local prediction error, and adjust the node's parameters to reduce that error. The adjustment uses only local information: the parent's prediction, the child's signal, and the node's own weights. No global loss function. No gradient chain stretching back to the output layer.

The updates iterate — each node settles, its updated prediction changes the error its parent sees, the parent adjusts, and the settling propagates upward until the whole active path reaches a local equilibrium. This is not a single gradient step but a relaxation process, and the number of iterations needed is itself informative: paths that settle quickly are close to equilibrium already, paths that take many iterations are at the growth frontier.
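
The diagnostic value of the iteration count can be illustrated with the same toy node as before, relaxed from two different starting points: a nearly converged weight matrix and a blank one. `iters_to_settle` is a sketch, not an API:

```python
import numpy as np

rng = np.random.default_rng(2)

def iters_to_settle(W, x, target, lr=0.1, tol=1e-3, max_iters=1000):
    """Count relaxation steps until the local prediction error settles."""
    W = W.copy()
    for t in range(max_iters):
        error = target - W @ x
        if np.linalg.norm(error) < tol:
            return t
        W += lr * np.outer(error, x)
    return max_iters

x = rng.normal(size=4)
x /= np.linalg.norm(x)                    # unit-norm state, stable step
target = rng.normal(size=8)

W_frontier = np.zeros((8, 4))             # no structure for this input yet
W_converged = np.outer(target, x) * 0.99  # already predicts almost perfectly

# Paths near equilibrium settle in far fewer iterations than paths
# at the growth frontier.
assert iters_to_settle(W_converged, x, target) < iters_to_settle(W_frontier, x, target)
```

The iteration count is itself a cheap, online estimate of how far a path sits from the frontier.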

Three properties of this procedure distinguish it from every existing toolkit operation.

It is additive — the nodes that aren't in the error locus don't update, the experts that the router didn't select don't participate, and the model after optimization contains everything it contained before plus refined predictions along the targeted path. Nothing was clipped, merged, or amputated.

It is local — the update at each node depends only on its immediate neighbors in the hierarchy, its parent's prediction and its child's signal, with no mechanism by which optimizing an expert for medical queries can degrade its performance on legal queries because the medical and legal paths don't share prediction signals during the update. The isolation isn't hoped for but architectural.

And it is — conditionally — energy-efficient. A caveat is needed here. Predictive coding in the general case is not cheaper than backpropagation. The iterative settling process typically requires 20-100 iterations per training step, each roughly the cost of a forward pass, versus backpropagation's cost of approximately three forward passes. On current GPU hardware, with decades of optimized CUDA libraries behind backprop and almost none behind predictive coding, the wall-clock comparison favors backpropagation by a factor of two to ten. The general claim that predictive coding is computationally efficient is wrong.

But the general case is not this case. Three conditions specific to this proposal change the cost profile. First, the model is already converged — most of the network has near-zero prediction error, so the settling process is fast. The iteration count T drops from 50-100 (training from scratch) to perhaps 5-20 (adjusting a nearly equilibrated system). Second, only the error locus participates — a subset E of the total parameter count N, potentially orders of magnitude smaller. Third, MoE's routing sparsity means prediction errors aren't even computed for unselected experts. The cost per step is roughly T × E rather than backpropagation's 3N. When T is small (near equilibrium) and E is small (targeted by the error map), this is comparable to LoRA in magnitude but spent differently: LoRA applies a uniform low-rank update everywhere, while this applies a full-rank update to a targeted subset. The memory advantage is cleaner — backpropagation stores all intermediate activations for the backward pass, scaling with depth × width, while predictive coding needs only local state at each active node.
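
The arithmetic, with illustrative numbers chosen to match the regime described above (none of these are measurements):

```python
# Back-of-envelope cost per update step under the stated conditions.
N = 100e9   # total parameters in the model
E = 100e6   # error locus: the targeted subset (0.1% of N here)
T = 10      # settling iterations near equilibrium

backprop_cost = 3 * N   # full forward + backward touches every parameter
pc_cost = T * E         # local settling touches only the locus

ratio = backprop_cost / pc_cost   # 300x in this illustrative regime
```

Shrink the locus or start closer to equilibrium and the ratio grows; train from scratch (T near 100, E near N) and it inverts, which is exactly the conditionality the text claims.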

The efficiency is conditional on the setup, not inherent to the algorithm. Starting from a converged model, updating only the error locus, inheriting MoE's routing sparsity — these are the conditions under which the cost scales with active parameters rather than total parameters. Remove any of them and the advantage disappears.

A clarification on what "growth" means at the level of hardware. The tree doesn't conjure new parameters — it differentiates existing ones. Growth requires a pool of isopredictive capacity: parameters whose current predictions are generic enough that specialization doesn't destroy anything, because there's nothing specific to destroy. In an MoE model, the experts not selected for a given input are idle capacity. Within a selected expert, the terminal layers carry slack — nodes whose predictions are broad, undifferentiated, available to specialize. Growth is the transition from isopredictive to differentiated, not from absent to present. The parameter count doesn't change. The prediction error decreases. This is biologically precise: the brain doesn't grow new neurons for most learning but differentiates existing synaptic connections, and predictive coding is particularly natural here because the update is differentiation — a node that predicted "roughly X" now predicts "specifically X given this domain."

This is growth in the series vocabulary. The tree extends through differentiation of existing capacity. The new leaves persist in the expert's updated weights. The routing that selected those experts is unchanged — the router already knew to send these inputs here; now "here" is deeper. And because the update is local, the smoothness condition is satisfied by construction: each node's adjustment is infinitesimal relative to the prediction it already held, differentiable, continuous with its prior state.

MoE as microforest

Mixture of Experts is already a tree. The gating network is the routing; the experts are the branches. But in current implementations, the routing is learned once during training and then frozen. The tree's topology is fixed. The branches can be pruned (distillation), blurred (quantization), or reshaped (fine-tuning), but they cannot grow. The architecture has the right shape and the wrong dynamics.

A live MoE — routing that adapts, experts that deepen, new experts that spawn — is a microforest within a single model. The error map provides the signal for all three operations.

Deepening is the operation described above: predictive coding within existing experts, targeted by the error map, extending capability without disturbing what's already there.

Pruning is what the previous post catalogued: leaf-pruning (RLHF, fine-tuning), terminal collapse (quantization), branch-pruning (distillation). These remain available and are the right tools for their regime — you prune a tree that has grown too bushy, you don't prune a sapling.

Spawning is the new operation: when the error map shows large, distributed errors that no existing expert can absorb — when the domain is genuinely novel relative to the tree's current coverage — create a new expert. Initialize it not from random weights but from the nearest existing expert (the one with the lowest average error for this domain), and deepen it using predictive coding on the new domain's curriculum. The new expert extends the tree's coverage without modifying the expert it was cloned from. The gating network updates to route relevant inputs to the new expert. The tree has grown a new branch.
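
The spawning step reduces to a clone-then-deepen. A sketch, with a hypothetical per-expert error map standing in for the real one:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical error map: average prediction error per expert on the
# new domain. Expert 1 has the lowest error, so it seeds the clone.
avg_error = {0: 0.9, 1: 0.3, 2: 0.7}
experts = {i: rng.normal(size=(8, 8)) * 0.1 for i in avg_error}

def spawn(experts, avg_error):
    parent = min(avg_error, key=avg_error.get)  # nearest existing expert
    new_id = max(experts) + 1
    experts[new_id] = experts[parent].copy()    # clone; parent untouched
    return new_id, parent

new_id, parent = spawn(experts, avg_error)
assert np.array_equal(experts[new_id], experts[parent])  # starts as a copy
# Deepening (the local predictive-coding updates above) then
# specializes experts[new_id] on the new domain's curriculum, while
# experts[parent] is never modified.
```

Updating the gating network to route the new domain's inputs to the clone is the remaining piece, omitted here.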

The three operations — deepening, pruning, spawning — are the complete set of structural modifications on a tree. No other operation exists. Any intervention on a model's capability is one of these three, or a composition of them. Current practice has only pruning. Predictive coding within routed experts provides deepening. Cloning plus deepening provides spawning. The toolkit is now complete, at least in principle.

Degrees of freedom

There is a unifying observation beneath the toolkit that connects growth and compression as two phases of the same arc.

When predictive coding has converged at a node — prediction error near zero — that node's output is almost entirely determined by its input. The parameters exist, but they're informationally redundant: the prediction from above already contains the answer, and the node is just confirming it. Whether you remove that redundancy by quantizing (merging adjacent outputs that were already indistinguishable), by sparsifying the expert (zeroing out parameters that contribute nothing to the prediction), by using a thinner expert with fewer active parameters, or by skipping layers that have become identity functions passing the signal through unchanged — the evaluation doesn't move, because you're removing degrees of freedom that were carrying no information.

This is Jaynes' coarse-graining from The Rearrangement applied at the operational level. Microstates that are indistinguishable in their predictions can be collapsed into a macrostate without loss. The second law says you can't recover them, but if they were isopredictive there's nothing to recover.

Quantization, sparsification, layer reduction, and expert thinning are therefore not separate operations but different projections of the same underlying quantity: degrees of freedom that carry information versus degrees of freedom that don't. And the error map is simultaneously a growth signal and a compression signal. Where prediction errors are high, the degrees of freedom are informative — they need to stay or deepen. That's the growth locus. Where prediction errors are near zero, the degrees of freedom are redundant — they can be safely collapsed by any method, because all methods of removing uninformative capacity yield the same result.
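
The dual reading of the error map reduces to a threshold triage. A minimal sketch, with hypothetical thresholds:

```python
def triage(error_map, grow_above=0.1, compress_below=0.01):
    """Split nodes by average prediction error: one map, two signals.

    High error: informative degrees of freedom, the growth locus.
    Near-zero error: redundant degrees of freedom, safe to collapse
    by any method (quantize, sparsify, thin, skip).
    """
    grow = {n for n, e in error_map.items() if e > grow_above}
    compress = {n for n, e in error_map.items() if e < compress_below}
    keep = set(error_map) - grow - compress
    return grow, keep, compress

emap = {"a": 0.5, "b": 0.05, "c": 0.001}
grow, keep, compress = triage(emap)
# grow == {"a"}, keep == {"b"}, compress == {"c"}
```

The thresholds are free parameters of the procedure, not constants of the theory; the point is that a single measured quantity drives both decisions.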

The full lifecycle of a degree of freedom in this framework: it starts isopredictive (generic capacity in the base model), gets differentiated through predictive coding (growth — the node specializes, prediction error decreases), reaches convergence (near-zero error — the node now predicts its domain accurately), and at convergence becomes compressible again — not because it returned to its generic state, but because its output is now so well-determined that fewer bits suffice to represent it. The circle from undifferentiated to differentiated to compressible is the information-theoretic lifecycle of a branch. Quantization, sparsification, and layer reduction are not separate decisions but the natural consequence of successful growth. The tree grew and the traversal shortened — one thermodynamic arc, not two opposing operations.

What this buys

Growth is now a first-class operation. A practitioner with a domain — medicine, law, materials science, the Upanishads — takes an existing MoE model as a starting point, runs their corpus through it, and the error map identifies which experts fire and where their predictions fail. Predictive coding deepens those experts — locally, and cheaply relative to the full model because the cost scales with the error locus — until the prediction errors converge. The model now handles this domain not because it was retrained from scratch or fine-tuned globally, but because specific nodes along specific paths had their predictions refined by a local algorithm that touched nothing else.

This is not a theoretical proposal that requires a new architecture from scratch — MoE exists, predictive coding networks exist, error maps are computable today. The combination — using predictive coding as the growth algorithm within the routed structure that MoE already provides — is new, but its components are not. The gap is integration, not invention. Whether the error maps are informative enough, whether predictive coding converges fast enough within experts, whether spawning is stable — these are engineering questions with engineering answers.

What the series needed was to show that growth has a mechanism, that the mechanism is local, and that locality on a tree is structural isolation. The current toolkit had only scissors. Now it has a seed.

What happens when many seeds grow — when many trees, each locally optimal and each wrong in different ways, connect through their root systems — is the subject of the epilogue.