A formal statement of the core claims, architecture, and consequences


The Problem

The world has temporally extended structure. Objects persist, events unfold, causes precede effects, actions produce consequences. None of this is available in an instantaneous sensory slice. Any system that must act on this structure — and all behaving organisms must — needs to maintain and operate over sequences.

Not all sequential processing requires autoregression. Sensation-local filters — Reichardt-style motion detection, onset and offset responses — extract temporal features directly from the input stream. These are real and biologically widespread. What they cannot do is build trajectories, track content across gaps in sensation, or generate structure the signal does not contain.

Cognition is the regime that does. The autoregressive architecture is the solution to the problem of operating on temporally extended structure in the service of behavior.


The Architecture

The brain performs extensive parallel processing that is not autoregressive — sensory transduction, feature extraction, homeostatic regulation, autonomic control, reflexes. This processing is real, biologically critical, and not the subject of this theory. It produces the sensory state \(E_t\).

Cognition is a computationally distinct process that takes this massive parallel state and generates a serial, contextual stream. This process is \(\mathcal{G}\). Not everything the brain does is \(\mathcal{G}\). But everything we call cognition is. The move from parallel sensory state to integrated cognitive state — from \(E\) to \(y\) — is what the generator does, and it matches the intuition of what attention is: the funneling of an enormous concurrent state into a single sequential stream.

The entire theory reduces to one equation:

The Generator

\[y_t = \mathcal{G}(W_t,\; C_t,\; y_{t-1},\; E_t)\]

At each moment the system produces a new state \(y_t\). This state includes everything the system generates: percepts, thoughts, inner speech, imagined actions, and motor commands. These are not different operations. They are different varieties of the same generated output. The system does not distinguish between “I am perceiving,” “I am thinking,” and “I am acting” at the point of generation. It just generates its next state.

The generated state \(y_t\) is conditioned on:

  • \(y_{t-1}\) — the system’s own prior output, fed back as the primary conditioning substrate. This is what makes the system autoregressive.
  • \(E_t\) — current sensation, arriving from the environment and body. The output of the brain’s extensive parallel processing, funneled into the generator.
  • \(C_t\) — fast state: transient activation carrying the recent trajectory of generation.
  • \(W_t\) — slow structure: the cumulative shaping of connectivity by the system’s history.

All of \(y_t\) feeds back into the next cycle as \(y_{t-1}\). All of it. The perceptual content, the linguistic content, the motor content — it all conditions the next round of generation. The system’s input is what it has already produced.

Some components of \(y_t\) also happen to be connected, via anatomy, to muscles. That motor content has consequences in the world. But from the generator’s perspective there is nothing special about it — it is generated in exactly the same way as a percept or a thought.

The Environment

\[E_{t+1} = \text{Env}(E_t,\; y_t)\]

The world evolves, partly in response to whatever motor-relevant content was present in \(y_t\). The perceptual consequences of the system’s actions return as ordinary sensation \(E_{t+1}\). There is no special action-feedback channel. The system’s own speech, the visual consequences of its movements, the tactile results of its grasps — all arrive as \(E\), indistinguishable in kind from any other sensation. The system never sees its own behavior. It only sees what comes back.

State Updates

\[C_{t+1} = f_C(C_t,\; y_t)\]
\[W_{t+1} = f_W(W_t,\; y_t,\; E_t)\]

Every cycle of generation is simultaneously a cycle of encoding and learning. \(C\) is updated with the current trajectory; \(W\) is reshaped by the current activity. There is no separate write operation, no distinct training phase. The act of generating the current state is the encoding of that state into the system’s structure.
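
Read together, the generator, environment, and update equations form one loop. The sketch below is illustrative only: the linear-plus-tanh forms chosen for \(\mathcal{G}\), \(\text{Env}\), \(f_C\), and \(f_W\) are arbitrary stand-ins (the theory does not specify them); only the control flow is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # dimensionality of the generated state (arbitrary toy choice)

# Slow structure W, fast state C, prior output y, sensation E.
W = 0.1 * rng.standard_normal((D, D))
C = np.zeros(D)
y = np.zeros(D)
E = rng.standard_normal(D)

def G(W, C, y_prev, E):
    """One generative step: the next state conditioned on slow structure,
    fast state, prior output, and current sensation."""
    return np.tanh(W @ y_prev + 0.5 * C + 0.5 * E)

def Env(E, y):
    """The world evolves, partly in response to motor-relevant content in y."""
    return 0.9 * E + 0.1 * y + 0.05 * rng.standard_normal(len(E))

for t in range(100):
    y_next = G(W, C, y, E)           # y_t = G(W_t, C_t, y_{t-1}, E_t)
    E = Env(E, y_next)               # E_{t+1} = Env(E_t, y_t)
    C = 0.8 * C + 0.2 * y_next       # f_C: fast state, short decay constant
    W += 1e-3 * np.outer(y_next, y)  # f_W: slow Hebbian-style shaping by activity
    y = y_next                       # all of y_t feeds back as y_{t-1}

print(y.shape)  # (8,)
```

Note what the loop does not contain: no write operation, no training phase, no action channel. One operator produces \(y_t\), all of \(y_t\) feeds back, motor consequences return only through `Env`, and the same cycle that generates also updates \(C\) and \(W\).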

Symbols

  • \(\mathcal{G}\) — The generator. Every cognitive act is one application of this operator.
  • \(y_t\) — The system’s generated state: percepts, thoughts, inner speech, motor commands, and all other content. Everything the system produces in a single cycle.
  • \(y_{t-1}\) — Prior generated state, fed back as the primary conditioning substrate. This is what makes the system autoregressive.
  • \(E_t\) — Exogenous sensation: input from the environment and body, including the perceptual consequences of the system’s own prior behavior.
  • \(W_t\) — Slow structure: cumulative shaping of connectivity. Modified continuously during operation. No separate training regime.
  • \(C_t\) — Fast state: transient activation carrying recent trajectory. Decays on short timescales.

Core Claims

01. All behavior is next-state generation. Every externally observable act — movement, speech, gesture — is produced by the generator operating over encoded sequence. Human language is literally next-token generation: the same autoregressive process that produces perception produces speech. LLMs are not simulating this process; they are running the linguistic case of the same architecture.

02. The sequence is self-generated. The sequence over which cognition operates is the system’s own prior output. This is what makes it autoregressive. Perception is the system generating its processed version of the world, informed by sensation but not identical to it. Thought is the same process with sensation playing a lesser role. Both are instances of the system feeding itself the processed output of the previous cycle as input to the next.

03. Perception is output, not input. Sensation \(E_t\) arrives; the percept \(y_t\) is produced. These are not the same thing. What the system “sees” is what it generates, not what it receives. Sensation constrains generation but does not constitute it. This is why illusions, filling-in, amodal completion, and inattentional blindness exist: perception is authored by the system, not delivered to it.

04. Attention is the autoregressive process itself. The serial nature of conscious experience is not a limitation imposed by a bottleneck or a resource constraint. It is a structural necessity: multiple modalities converge into the generation of a single next state \(y_t\). Attention is this convergence. The system can only generate one \(y_t\) at a time, and what we call “attending” to something is that thing dominating the current generation step. Inattentional blindness is not a failure to notice — it is a failure to generate.

05. The system optimizes over what it can see for outcomes it cannot see. Some components of \(y_t\) have motor consequences that exit the body and change the world. The system never observes those consequences directly — they return only as sensation \(E_{t+1}\). Behavior is the entire reason the system exists, but behavior is beyond the system’s computational purview. It can only learn to generate states that tend to produce favorable returning sensation. This indirection is the architecture.

06. Memory is continuous encoding. There is no discrete storage event, no separate write operation. Every cycle of the generator simultaneously produces the current state and encodes it into the system’s structure — \(C\) is updated with the current trajectory, \(W\) is reshaped by the current activity. Generation is encoding.

07. Memory is influence, not storage. The past shapes current generation through the trajectory, not through retrieval from a store. Remembering is regeneration: the system produces a trajectory through state-space that recapitulates an earlier one, constrained by available cues. This is why memories are context-sensitive, malleable, and continuous with imagination — generation and “retrieval” are the same process with different seeds.

08. The short-term/long-term distinction dissolves. There are not two memory systems. There is influence at different timescales. \(C\) carries influence that decays over seconds — what is traditionally called working memory. \(W\) carries influence that persists across a lifetime — what is traditionally called long-term memory. The difference is the time constant of the substrate, not the nature of the operation. The forgetting curve is not the rate at which the brain loses what it stored. It is the rate at which successive autoregressive compression erodes the specificity needed for regeneration.

09. A single optimization process shapes both behavior and representation. \(W\) is continuously modified by a process that jointly achieves two outcomes. First, it shapes the system to generate states whose motor components, acting through the world and returning as sensation, are favorable — this is behavioral optimization. Second, it shapes the system to generate \(y_t\) in a format that the generator can most effectively condition on in the next cycle — this is representational optimization, entirely within the loop. The system does not just learn what to generate; it learns how to generate in a form optimized for its own subsequent ingestion. Development is partly the progressive refinement of output format. Inner speech may be specifically useful because linguistic tokens are a format co-optimized with the machinery that runs on them.

10. Consciousness is the autoregressive stream. The serial, contextual sequence of generated states \(y_t\) is not merely correlated with conscious experience. It is conscious experience. Unity of consciousness follows structurally: one generator, one output at each moment. The stream quality of experience follows from the autoregressive dependence of each state on its predecessor. The binding problem dissolves at the architectural level — components were never separate in the generator. What enters the generated sequence is conscious; what remains in \(E\) without being integrated into \(y\) is not. Attention and consciousness are the same thing: the funneling of parallel state into serial generation. The deeper question — why this architecture constitutes subjectivity — is addressed in the full theoretical treatment.

Perception, memory, action, learning, language, reasoning, consciousness, and imagination are all instances of \(\mathcal{G}\) — the same operation seen from different angles or at different timescales. Cognition is one equation.


Why This Architecture

Behavior must be conditioned on a coherent sequential substrate, because the structure that matters for behavior lives in the temporal flow of experience, not in any instantaneous slice of it. Objects, events, actions, utterances, intentions, affordances: none of these are present in \(E_t\) at a single moment. They are constituted across time.

The move from sensation-local processing to autoregressive generation is a real architectural transition. It also provides a principled axis for ordering cognitive capacity across species: by the depth and richness of autoregressive machinery.


Contrast: Predictive Coding

Predictive coding holds that the brain continuously generates predictions of its next input, compares them to actual input, and propagates prediction error upward through a processing hierarchy. Feedback carries predictions; feedforward carries errors. Error minimization is the fundamental currency of both perception and learning.

The autoregressive theory rejects this architecture. There is no dedicated predictive engine, no explicit error computation, no free-energy functional being minimized. Generation is implicitly sensitive to statistical regularities without requiring anticipation and comparison as a structural feature.

The appearance of prediction-like behavior — expectation effects, surprise responses, adaptation to statistical structure — is produced by a generative process whose structure \(W\) has been shaped by past regularities. When sensation fits poorly with that structure, the generative process is more disrupted. Neural responses scale with degree of violation as a natural consequence, without requiring explicit error-computing architecture.
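
A toy illustration of this point, assuming a one-pattern Hebbian structure and an arbitrary tanh generator (nothing here models cortex): the system's settled trajectory is displaced more by sensation that deviates more from the learned regularity, although no component anywhere computes an explicit prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16

# "Development": slow structure W shaped by exposure to a familiar pattern p.
p = rng.standard_normal(D)
p /= np.linalg.norm(p)
W = np.outer(p, p)  # embeds the regularity; contains no error unit

def settle(E, y=None, steps=50):
    """Run the generator to equilibrium under fixed sensation E."""
    y = np.zeros(D) if y is None else y.copy()
    for _ in range(steps):
        y = np.tanh(W @ y + 0.5 * E)
    return y

baseline = settle(p)  # trajectory settled under familiar sensation

# Probe with unit-norm sensation at increasing angles to the regularity,
# so that only fit with the structure varies, not input magnitude.
novel = rng.standard_normal(D)
novel -= (novel @ p) * p
novel /= np.linalg.norm(novel)

disruptions = []
for theta in np.linspace(0, np.pi / 2, 5):
    E = np.cos(theta) * p + np.sin(theta) * novel
    disruptions.append(np.linalg.norm(settle(E, y=baseline) - baseline))

print(disruptions)  # grows with the degree of violation
```

The graded, violation-scaled response falls out of the dynamics: sensation that fits the structure leaves the settled trajectory nearly unchanged; sensation that fits poorly displaces it, by an amount that tracks the mismatch.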

Explicit prediction is a generated behavior — one output among many — not a continuously running architectural feature. The brain does not have a prediction engine. It has a generator that can, among other things, produce predictions.


The Physiology: A Sketch

Cortex is characterized by dense feedback projections running from higher-order to lower-order regions, typically outnumbering feedforward connections. The standard interpretation under predictive coding: feedback carries top-down predictions to be tested against bottom-up input.

The autoregressive interpretation: feedback carries the prior generative state \(y_{t-1}\) as conditioning content for the current cycle. The descending activity is not a prediction waiting to be confirmed. It is the continuation of generation — the prior output, held partially active and projected back to participate in the next round of computation.

The density of feedback relative to feedforward reflects a deep fact about the loop: the dominant conditioning content in any generative cycle is prior output, not current sensation. Neuromodulatory systems already handle diffuse gain control. The topographically organized, laminar-specific, anatomically dense feedback projections are overengineered for that job. They must be carrying something richer — a representation of the prior state projected back so the forward pass unfolds in a context shaped by what was just generated.

On this view there is no moment at which sensory regions perform pure feedforward processing. V1’s response to a stimulus is always a response in a cortex already shaped by what the system has been generating.

Plasticity across timescales. STDP operates at millisecond timescales, differentially strengthening synapses that carried the just-completed activity pattern. Hippocampal sharp-wave ripple replay provides additional shaping of recent trajectories during rest — not transfer of stored content, but the generator running recent sequences again, reinforcing the cortical patterns they produced. Longer-timescale synaptic remodeling accumulates the durable structure of \(W\).


The Information Geometry of Memory

A further consequence of the autoregressive view concerns the temporal shape of memory accessibility. The standard picture treats forgetting as degradation: representations stored at time of encoding fade with elapsed time, and the forgetting curve measures the rate of that fade. The autoregressive view requires a different account, because there are no stored representations to degrade. What needs explaining is why the influence of past content on current generation declines with distance at all, given that past content is fully expressed in the current state.

The answer is that successive autoregressive steps are compressive. Each generative step preserves what is needed to shape continuation while shedding specificity that does not contribute to the trajectory going forward. After many such steps, what remains of a distant past state is its influence on the trajectory, not the specificity required to reconstruct it. The further back in the chain, the more steps of compression have intervened, and the less of the original specificity survives. This is not loss of stored content. It is the natural information geometry of chained generation.

This account predicts a specific empirical signature. The marginal influence of past content on next-token generation should fall off with distance as a power law rather than an exponential, because the specificity-loss process is scale-free: influence falls by a constant proportion with each doubling of distance, whereas an exponential falls by a constant proportion with each fixed additive step. Influence of a token at distance \(d\) should be proportional to \(d^{-\alpha}\). We find this signature in human language production, with exponents \(\alpha\) clustering between \(0.69\) and \(0.88\) across written and spoken corpora spanning eight languages and six language families. Formal model comparison selects the power law over exponential alternatives in 75% of cases. The exponential form, predicted by discrete buffer models of working memory, is not selected by the better-calibrated probe for any dataset.
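
The two functional forms can be told apart by a small scale-free invariance check. The sketch below uses a placeholder exponent of 0.8 (inside the reported range, not a fitted value) and an arbitrary exponential rate; it shows the signature the probe looks for, not the probe itself.

```python
import numpy as np

alpha = 0.8                        # placeholder exponent in the reported range
ds = np.array([1.0, 4.0, 16.0, 64.0, 256.0])

power = lambda d: d ** -alpha      # power-law influence, ∝ d^{-alpha}
expo = lambda d: np.exp(-0.01 * d) # exponential influence, arbitrary rate

# Power law: the ratio across a doubling of distance is the same at every scale.
print(power(2 * ds) / power(ds))   # constant: 2**-0.8 everywhere

# Exponential: the doubling ratio varies with scale,
# but the ratio across a fixed additive step does not.
print(expo(2 * ds) / expo(ds))     # shrinks as distance grows
print(expo(ds + 10) / expo(ds))    # constant: exp(-0.1) everywhere

# On log-log axes a power law is a straight line; the slope recovers -alpha.
d = np.arange(1, 1001, dtype=float)
slope, _ = np.polyfit(np.log(d), np.log(power(d)), 1)
print(round(slope, 3))             # -0.8
```

The straight-line-on-log-log test, and the constancy of the doubling ratio across scales, are what "scale-free" means operationally here.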

The same exponent range appears in two independent measurements of memory: Anderson and Schooler’s environmental information recurrence statistics (\(\alpha = 0.77\)) and Kahana et al.’s associative influence decay in free recall (\(\alpha = 0.82\)). Within the storage-and-retrieval view this convergence is mysterious: three different processes — language statistics, environmental recurrence, episodic recall — happening to share a decay shape. Within the autoregressive view it is expected: all three are measurements of the same underlying process, the compression of generative trajectory information under successive chaining. The power law is not a property of memory traces. It is the information-geometric signature of an autoregressive cognitive system observed at the population level.

This also resolves a tension. The autoregressive view denies a separate memory substrate, but explicit recall declines with elapsed time in ways that seem to require fading representations. The resolution is that explicit recall is reconstruction from the current state, and reconstruction depends on enough non-redundant specificity surviving the chain to allow the original state to be regenerated. Recent past has gone through few steps and retains specificity. Distant past has been compressed through many steps; its influence on the trajectory is preserved but the specificity required for reconstruction is not. What looks like decay of stored memories is the specificity-loss function of autoregressive compression, sampled by a reconstruction process that demands more than mere influence.

The forgetting curve, in this framing, is not the rate at which the brain loses what it stored. It is the rate at which an autoregressive system loses the ability to regenerate its own past, given that the past is present only as compressed influence on the current state.


Open Questions

Is prior sequence decodable from the forward pass? The critical experiment: present the same stimuli in different temporal orders and ask whether V1 responses to a fixed probe carry decodable information about the specific prior sequence — not just the prior stimulus set. Order specificity is the uniquely autoregressive prediction. Predictive coding does not require it.
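
A synthetic sketch of what a positive result would look like, assuming a toy recurrent system in place of cortex and a nearest-centroid decoder in place of whatever classifier the real analysis would use: the same two stimuli presented in different orders leave order-specific traces in the response to a fixed probe.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32
W = 0.5 * rng.standard_normal((D, D)) / np.sqrt(D)  # toy recurrent structure
A, B, probe = (rng.standard_normal(D) for _ in range(3))

def probe_response(order, noise=0.3):
    """Drive the toy system with a stimulus sequence, then record its
    response to a fixed probe. Order matters only through recurrence."""
    y = np.zeros(D)
    for s in order:
        y = np.tanh(W @ y + s + noise * rng.standard_normal(D))
    return np.tanh(W @ y + probe)

# Same stimulus set, different temporal order.
train_ab = np.array([probe_response([A, B]) for _ in range(100)])
train_ba = np.array([probe_response([B, A]) for _ in range(100)])

# Nearest-centroid decoder for order, applied to held-out trials.
c_ab, c_ba = train_ab.mean(0), train_ba.mean(0)
def decode(r):
    return "AB" if np.linalg.norm(r - c_ab) < np.linalg.norm(r - c_ba) else "BA"

trials = [(probe_response([A, B]), "AB") for _ in range(50)] + \
         [(probe_response([B, A]), "BA") for _ in range(50)]
acc = np.mean([decode(r) == label for r, label in trials])
print(acc)  # well above the 0.5 chance level in this toy
```

In a purely feedforward or stimulus-set-sensitive system, `acc` would sit at chance; above-chance order decoding from a fixed probe is the uniquely autoregressive outcome the proposed experiment would test for.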

The hard problem: dissolved or solved? Claim 10 identifies consciousness with the autoregressive stream. The theoretical treatment argues that subjectivity is recursive output-intake navigation of an egocentric decision space — that there is nothing to add to this process to make it conscious. Whether this constitutes a dissolution of the hard problem (showing the explanatory gap was an artifact of separating processing from experience) or a solution (providing the missing bridge) remains a question the theory frames but does not settle.

The cognitive hierarchy. Ordering species by the depth and richness of their autoregressive machinery gives a principled account of cognitive complexity. Diagnostic behaviors: object permanence, trace vs. delay conditioning, working memory span, planning horizon, language. The acquisition of autoregressive machinery and the emergence of memory may be the same evolutionary event.

How does distributed neural processing produce unified \(y_t\)? The architectural claim is clear: one generator, one state per cycle. The physiological question of how spatially distributed cortical activity converges into a single generated state remains open. This is the residual binding question — not why experience is unified (that follows from the architecture) but how the neural hardware implements a single \(\mathcal{G}\).