The Evidence
Why language is the strongest case for autoregressive cognition
If you want to know what the mind is doing, look first at language. Not because language is the only thing the brain does, but because it is the one cognitive process we can fully observe. Speech and text are overt, recordable, and structured: a complete external trace of an internal computation. Everything else the brain does (vision, motor control, emotion, memory) leaves only fragments. Language leaves the whole sequence.
And the sequence tells a remarkably consistent story: human language is autoregressive next-token generation. Each word is generated from the recent context that precedes it, by a mechanism of the form
\[x_{t+1} = f(x_{\leq t})\]
The next token is a function of the tokens so far. No external syntax engine, no separate semantic module, no storage-and-retrieval lookup. Just continuation under context. Language is the strongest single body of evidence for this claim, and the evidence that kicked off the autoregressive framework itself.
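As a minimal sketch of that form, the Python below trains a toy bigram counter and generates by repeatedly reading context, sampling a next token, and appending it. Everything in it is an illustrative assumption rather than a model of cognition or of any real language model: the names `train_bigram` and `generate`, the eleven-word corpus, and the choice to let f look only at the single most recent token.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, n_new, rng):
    """Autoregressive loop: read context, sample next token, append, advance."""
    seq = list(prompt)
    for _ in range(n_new):
        # Toy assumption: f consults only the last token of the context.
        options = counts.get(seq[-1])
        if not options:  # no known continuation, so stop early
            break
        words = list(options)
        weights = [options[w] for w in words]
        seq.append(rng.choices(words, weights=weights, k=1)[0])
    return seq

corpus = "the dog chased the cat and the cat chased the dog".split()
model = train_bigram(corpus)
sample = generate(model, ["the"], 8, random.Random(0))
print(" ".join(sample))
```

Nothing in the loop consults a grammar or a stored sentence. Each step only extends the sequence with a token that has followed the current context before, which is the sense of "continuation under context" used here.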
1. The Computational Proof Came for Free
The most important fact arrived by accident. We built large language models to predict the next word in human text, and they became fluent, coherent, and competent across an astonishing range of tasks: syntactic judgment, reading comprehension, dialogue, reasoning. All from pure next-token prediction. No rule systems were programmed in. No grammar was hand-coded.
This establishes something stronger than “autoregression works for language.” It establishes that autoregression is sufficient. A single mechanism, scaled, reproduces the full range of linguistic competence we used to attribute to a stack of specialized modules.
Sufficiency carries a sharp implication. If autoregression alone suffices for human-level linguistic competence, then any additional mechanism we want to posit in the human case has to earn its place. It would either be computationally redundant (doing work that next-token generation already does) or it would produce capacities that models demonstrably lack. Since models match human linguistic performance across benchmarks, the extra machinery is superfluous. The most parsimonious hypothesis is that humans run the same kind of process.
This is a sufficiency argument, and it reverses the usual burden of proof. The question is no longer “can autoregression explain language?” It is “what, exactly, requires anything beyond it?”
2. The Archaeological Argument: We Wrote the Training Data
Here is the part that turns sufficiency into something closer to proof.
Language models did not invent autoregressive structure. They discovered it, in the statistics of the corpora we trained them on. But humans produced those corpora. Every sentence, every paragraph, every linguistic regularity that makes next-token prediction succeed was generated by human cognition in the first place.
So when a model succeeds by exploiting the predictive structure of human text, it is reverse-engineering whatever process originally generated that structure. We did not design these systems to be autoregressive because we knew brains were. We designed them to predict human output, and they converged on autoregression because that is the shape of the data.
This is computational archaeology. Training a model on human text excavates the cognitive architecture that produced it from the statistical fossils left behind. There are only two ways to read the result: either humans generate language through an autoregressive process, or humans happen, by sheer coincidence, to produce output perfectly suited to autoregressive processing. The second option is not credible. The sufficiency argument and the archaeological argument form a pincer: one shows autoregression is enough, the other shows the data was made by us.
3. The Statistics of Language Are Designed for It
Set the models aside. Natural language, examined on its own, already carries the signature of autoregressive structure, independent of any theory of mind.
An arrow of time. Surprisal falls in a consistent left-to-right gradient: early words systematically constrain later ones, and uncertainty drops as a clause approaches its boundary. Language has an intrinsic directionality, and autoregressive processing runs with that grain.
Short-range sufficiency. Over 90% of next-word predictability is captured by the preceding ~32–64 tokens. Fluent continuation does not require global access to everything said so far; it requires a rolling window of recent context, exactly what an autoregressive mechanism uses.
Zipfian recombination. Open-ended vocabularies emerge from systematic reuse of frequent elements, with long tails of rare words made predictable by local context. The generative engine is recombination under context, not retrieval from a fixed store.
Distance-dependent decay. Processing degrades smoothly with distance from relevant context: the temporal decay property baked into sequential generation.
These regularities exist before we apply any psychological framework. Language is autogenerative as a property of the sequence itself. Autoregression is simply the mechanism that exploits it.
A note on terms worth keeping straight: autogeneration is a property natural sequences have. Any long enough prefix encodes the regularities needed to continue it lawfully. Autoregression is the mechanism: generating the next element by consulting recent context. LLMs prove language is autogenerative; the claim here is that brains process it autoregressively.
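The surprisal and short-range-window ideas in this section can be made concrete with a small measurement. The sketch below computes mean surprisal, -log2 p(token | previous k tokens), for several window sizes k. It is a toy under stated assumptions: the helper names `fit` and `mean_surprisal` are hypothetical, the corpus is an artificial repeating pattern, probabilities use add-one smoothing, and the model is scored on the same text it was fitted to. It illustrates how such a curve is measured and does not test the figures quoted above.

```python
import math
from collections import Counter, defaultdict

def fit(tokens, k):
    """Count next-token frequencies given the previous k tokens."""
    counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts

def mean_surprisal(tokens, k, vocab_size):
    """Average -log2 p(token | previous k tokens), add-one smoothed.

    Assumption: scored in-sample, on the text used for counting.
    """
    counts = fit(tokens, k)
    total = 0.0
    n = 0
    for i in range(k, len(tokens)):
        ctx = counts[tuple(tokens[i - k:i])]
        p = (ctx[tokens[i]] + 1) / (sum(ctx.values()) + vocab_size)
        total += -math.log2(p)
        n += 1
    return total / n

# Artificial sequence: after "a", the next token depends on what came before "a".
tokens = "a b a c".split() * 20
vocab_size = len(set(tokens))
curve = {k: mean_surprisal(tokens, k, vocab_size) for k in range(4)}
for k, bits in curve.items():
    print(f"window={k}  mean surprisal={bits:.3f} bits")
```

On this sequence, surprisal drops sharply as the window grows from 0 to 2 tokens and then flattens, because two tokens of context already determine the continuation. Natural text would need a held-out corpus and a far stronger model, but the shape of the question is the same: how much context is needed before additional context stops helping.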
4. Real-Time Processing Leaves Autoregressive Fingerprints
The strongest evidence is not statistical. It is behavioral. When you watch language being produced and comprehended in real time, you see a system generating forward under a limited, decaying context window, exactly as the framework predicts.
Speech errors. Slips of the tongue are not random. Anticipations (“heft lemisphere” for “left hemisphere”) show a forward-looking buffer bleeding into the present. Perseverations (“beef needle” for “beef noodle”) show recently produced material lingering as residual activation. Both peak when source and target are 1–3 syllables apart and fall off systematically with distance: the same recency gradient transformer attention shows. Errors almost never span more than 6–8 syllables, marking the edge of the effective context window. And the lexical bias (errors landing on real words far more often than chance) reveals learned weight structure shaping what the generator produces.
Self-repair. Speakers catch errors within a syllable or two, interrupting at syllable boundaries in time with the rhythm of generation, and correcting on internal representations before the wrong word is fully out of the mouth. That is a generation loop monitoring its own output token by token.
Garden-path sentences. “The horse raced past the barn fell” breaks because the reader commits to “raced” as the main verb from local context, fails to hold the alternative parse in the buffer, and has to rebuild when “fell” arrives. This is incremental, commit-as-you-go processing, not parallel evaluation of all parses. And it is not specific to English: German verb-final constructions, Japanese case-marking reanalysis, and Dutch cross-serial dependencies all show difficulty scaling with the distance between the ambiguity and its resolution. “Good enough” processing (settling for a plausible-but-wrong reading when reanalysis is too costly) is what you would expect from a resource-limited generator, not an exhaustive parser.
Priming and residual activation. Recently activated content biases what comes next, with exactly the temporal decay autoregression predicts. Semantic priming (“doctor” speeds “nurse”) fades with delay. Syntactic priming (reusing a structure you just heard, even with zero shared words) shows abstract structural patterns sitting in the context buffer. Crucially, priming splits cleanly along the framework’s two components: fast automatic effects reflect weight-encoded associations, while slower controllable effects reflect deliberate maintenance in the context buffer.
Production–comprehension asymmetries. Production is more strongly forward-biased than comprehension; production priming persists across dozens of sentences while comprehension priming decays in a few trials; speaking primes later listening more than listening primes later speaking. Generation leaves deeper traces than interpretation: the directional asymmetry you would predict if production is the autoregressive generative act.
5. Two Systems, Same Solution, Different Roads
Human memory research and transformer analysis arrived independently at the same picture: a limited effective window (~50–100 tokens in models; comparable constraints in working memory), recency-weighted access, graded rather than binary retention, and stronger competition among nearby elements. Positional schemes that improved language models point the same way. ALiBi builds in distance-dependent decay directly, subtracting from each attention score a penalty proportional to how far back the attended token lies, and RoPE encodes relative position so that attention can depend on the distance between tokens.
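The distance-dependent decay mentioned above can be sketched in a few lines. The code below is written in the spirit of ALiBi's linear distance penalty, not as a reproduction of any library's implementation: the function name `recency_weights`, the single `slope` value, and the convention that the most recent token sits at distance 1 are all assumptions made for illustration.

```python
import math

def recency_weights(scores, slope):
    """Softmax over past positions after subtracting slope * distance.

    scores[j] is the raw relevance of past position j to the current step.
    Assumption: the last element is the most recent token (distance 1).
    """
    n = len(scores)
    biased = [s - slope * (n - j) for j, s in enumerate(scores)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Six past tokens with identical raw relevance: only distance differs.
flat = [0.0] * 6
decayed = recency_weights(flat, slope=0.5)
uniform = recency_weights(flat, slope=0.0)
print([round(w, 3) for w in decayed])
```

With equal raw scores, a positive slope yields weights that rise smoothly toward the most recent token, while a slope of zero gives every position the same weight. Retention is graded rather than binary: distant tokens are down-weighted, not cut off.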
This convergence matters because the two systems were optimized by completely different pressures: natural selection on one side, gradient descent on the other. When two independent optimizers land on the same architecture, that architecture is not an arbitrary implementation choice. It is a fundamental property of the problem: sequential generation under context.
The Bottom Line
Language is the core evidence because it is the one place we can see the whole computation. And from every angle (the computational sufficiency of next-token models, the archaeological fact that we generated their training data, the autoregressive statistics of language itself, the real-time fingerprints in errors, parsing, and priming, and the independent convergence of brains and transformers) the same conclusion holds.
Human language is not produced by a grammar engine consulting a memory store. It is generated the way the models generate it: read the recent context, compute the next token, append it, advance. Language is not like autoregressive next-token prediction. Language is autoregressive next-token prediction. And it is the clearest window we have onto what the rest of the mind is doing too.