There was a question about grokking that had bothered me for a long time.
("In machine learning (ML), grokking, or delayed generalization, is a phenomenon observed in some settings where a model abruptly transitions from overfitting (performing well only on training data) to generalizing (performing well on both training and test data), after many training iterations with little or no improvement on the held-out data.")
I finally got round to asking an LLM about it. The spoilers below contain the responses from Claude 4.5 Sonnet and GPT-5, along with GPT-5's response to a follow-up question about "in-context learning".
USER:
Hi GPT-5,
I'd like to explore an issue with you regarding grokking and LLMs.
It seems clear that the phenomenon is exhibited in LLMs when most queries that you are able to competently respond to aren't present in your training data (coupled with the response you are expected to provide). It is often claimed by AI-skeptics, though, that you can't intelligently respond to queries about topics that aren't in some respect similar to (i.e. don't match a pattern in) your training data. Being able to do so would be, according to those skeptics, a genuine mark of intelligence, understanding and/or creativity. But that requirement is very high, since the criterion can't be satisfied unless the problem tacitly solved by you in order to provide an intelligent response isn't similar in any respect to something figuring in your training data.
But clearly, a mark of intelligence is to be able to identify (or surface) within one's own body of knowledge what it is that is relevant to the task at hand. Finding, for instance, something relevantly similar to the considerations raised so far in the current prompt requires a fair bit of understanding of it.
Yet it isn't only skeptics who raise the bar but also, sometimes, researchers, when they identify instances of grokking as emergent abilities to successfully accomplish "out of sample" tasks whose alleged "in sample" exemplars (within the training data) are merely similar to the test tasks inasmuch as they exhibit the same pattern, as it were. For instance, your replying cogently to the present prompt would be considered "in sample" if there are relevantly similar issues discussed in your training data.
This leads me to wonder what it is that grokking consists in, in cases where the out-of-sample generalization ability (akin to learning to apply a concept in a novel way, such that it has never been applied in a similar way in the training data) emerges. It seems that what should be said isn't that the pre-grokking model was only pattern-matching when it was previously answering the "in sample" queries, but rather that it was just not yet smart enough, as it were, to grasp the novel way to apply the relevant concept (i.e., the concept, or range of conceptual abilities, relevant to solving the test problem). This is, after all, the same sort of way we appraise stages of intellectual development in a human being who comes to master a new topic (such as, say, classical electrodynamics). Did you catch my drift?
Claude 4.5 Sonnet:
I absolutely catch your drift, and you've identified a genuinely thorny philosophical issue that goes to the heart of debates about machine understanding.
The tension you're highlighting is real: if we define "in sample" broadly enough to include anything that shares *any* pattern with training data, the criterion for "genuine" intelligence becomes impossibly stringent—even unfairly so. After all, human intelligence also works by recognizing patterns, drawing analogies, and applying concepts in new contexts. When a physics student finally grasps how to apply Gauss's law to a novel configuration they've never seen before, we don't dismiss this as "mere pattern matching" even though they're clearly drawing on similar problems they've studied.
Your reframing strikes me as importantly right: the pre-grokking model isn't necessarily doing something *categorically different* (like "mere pattern matching" vs "genuine understanding"). Rather, it lacks sufficient conceptual sophistication—the representations aren't yet organized in a way that makes the relevant abstractions available for novel application. Post-grokking, the model has developed more robust, generalizable representations of the underlying structure.
This mirrors human learning remarkably well. A student memorizing physics formulas and fumbling through textbook problems isn't doing something fundamentally different in kind from an expert fluidly applying concepts to novel situations—they're at different points along a continuum of conceptual mastery.
What makes grokking particularly interesting is that it suggests the model is discovering *structural* understanding—not just memorizing input-output mappings, but internalizing the generative rules or principles. When a model suddenly achieves near-perfect generalization on modular arithmetic after extended training, something has changed about how it represents the problem space.
But here's where it gets philosophically interesting: is there a principled distinction between "has internalized the abstract structure" and "has learned sufficiently sophisticated pattern matching"? Your question suggests—and I'm inclined to agree—that this distinction might be less clear than it initially appears. Perhaps what we call "understanding" just *is* having sufficiently rich, well-organized, and flexibly applicable patterns.
The skeptics' objection might ultimately dissolve into an unfalsifiable position: any successful performance can be redescribed as "pattern matching" if we're sufficiently determined. But if that's the case, then either (a) the objection proves too much (it would equally apply to human intelligence), or (b) we need a more substantive criterion for what counts as genuine understanding beyond mere novelty.
What do you think the right criterion should be? Is it about the *degree* of abstraction? The *flexibility* of application? Or something else entirely?
GPT-5:
I’m with you, yes—I think you’ve put your finger on a confusion that muddles a lot of “it’s just pattern-matching” critiques. Three quick moves:
1. “Pattern” isn’t a single thing
There’s a world of difference between (a) shallow surface co-occurrence patterns, (b) higher-order invariances (e.g., linearity, symmetries, type constraints), and (c) algorithmic structure (procedures that compute solutions under many surface realisations). When skeptics say “if it’s similar in any respect, it doesn’t count,” they’ve smuggled in an impossible criterion: all intelligent generalization relies on some relevant similarity. The mark of understanding isn’t the absence of similarity; it’s selecting the right invariants and composing them to solve a novel case.
2. What grokking is (mechanistically)
In tasks where we can inspect training dynamics (modular arithmetic, formal languages, simple algorithmic tasks), networks often first latch onto a high-capacity shortcut that interpolates within the training set (memorization-ish), then—sometimes much later—under pressure from regularization and the inductive bias toward simpler rules, they reorganize their internal geometry so that the decision boundary reflects the task’s true invariants. That late reorganization is the “grokking” phase transition: a shift from many-fragile-rules to one compact algorithm that extrapolates. It’s not “from pattern to understanding,” but “from the wrong patterns to the right invariants.” The same loss curve can hide qualitatively different internal solutions.
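For a concrete picture of that phase transition, here is a minimal sketch of the standard testbed: a small PyTorch MLP trained on modular addition from a fraction of all input pairs, with heavy weight decay. The architecture and hyperparameters are illustrative assumptions rather than any paper's exact settings, and whether or when the late jump in test accuracy shows up depends on them.

```python
# Minimal grokking testbed (sketch): learn (a + b) mod p from 40% of all pairs.
# Train accuracy typically saturates early; under strong weight decay, test
# accuracy can sit near chance for many steps before jumping late.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

embed = nn.Embedding(p, 64)  # token embeddings for a and b
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(
    list(embed.parameters()) + list(mlp.parameters()),
    lr=1e-3, weight_decay=1.0,  # weight decay pressures the late reorganization
)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx: torch.Tensor) -> float:
    with torch.no_grad():
        x = embed(pairs[idx]).reshape(len(idx), -1)
        return (mlp(x).argmax(-1) == labels[idx]).float().mean().item()

for step in range(50_000):  # grokking needs patience
    x = embed(pairs[train_idx]).reshape(len(train_idx), -1)
    loss = loss_fn(mlp(x), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 2_000 == 0:
        print(f"step {step:6d}  train {accuracy(train_idx):.3f}  test {accuracy(test_idx):.3f}")
```

The signature to watch for in the printed log is train accuracy saturating early while test accuracy stays near chance for a long stretch before climbing.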
3. How this maps to LLMs
LLMs are trained on wildly heterogeneous text. Useful solutions are the ones that compress across surface variation—i.e., they must privilege invariants: compositional syntax/semantics, analogical structure, causal/temporal schemata, mathematical identities, discourse moves, etc. When an LLM "groks" some domain, you see:
* **Robust recombination:** it applies a tool where it wasn’t showcased verbatim (e.g., reasoning about a novel physical setup by mapping it to a previously learned conservation-law schema).
* **Analogy under type constraints:** it ports an argument pattern to a new vocabulary because the roles line up (premise–rule–exception structures, proof by contradiction, etc.).
* **Error profile shift:** failures become principled (violating a known invariant) rather than noisy one-off confabulations—evidence that a compact rule is actually being used.
Why “no-similarity” is the wrong bar
If you require that successful responses occur with **zero** relevant similarity to anything in training, you’ve defined a test that no finite learner—human or machine—could pass. Human creativity is saturated with analogy, isomorphism, and transfer. What distinguishes shallow from deep generalization is **which** similarity class carries the work: string n-grams vs algebraic structure; topic words vs discourse logic; step-by-step recipe vs compiled procedure.
A cleaner way to frame the question
Instead of asking “was anything like this in pretraining?”, ask:
* **Which invariants must be carried over for success?** (e.g., associativity, conservation, scope discipline, quantifier behavior)
* **Could a shallow proxy have solved the test split?** (Use adversarial splits that break spurious cues: SCAN/COGS-style compositional splits, lexical-semantic splits that hold out novel role-filler pairings, math/base transfer, unit mismatches, counterfactual rewrites; a toy constructor for such a split is sketched after this list.)
* **Do error modes reflect rule use?** (Systematic, explainable failures suggest a compact internal rule; scattershot errors suggest brittle heuristics.)
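To make the adversarial-split idea concrete, here is a toy constructor for a lexical-semantic split that holds out novel role-filler pairings. The vocabulary, output template, and held-out pairs are made-up illustrations, not taken from SCAN, COGS, or any other benchmark.

```python
# Toy compositional split: every verb and every object is attested in training,
# but the specific verb-object pairings reserved for test never are. Success on
# the test set then has to come from composing roles, not memorized pairs.
import itertools
import random

random.seed(0)
verbs = ["push", "lift", "rotate", "drop"]
objects = ["the red cube", "the blue ball", "the green cone", "the yellow ring"]

held_out = {("push", "the green cone"), ("rotate", "the blue ball")}  # novel pairings
all_pairs = list(itertools.product(verbs, objects))
train_pairs = [pair for pair in all_pairs if pair not in held_out]
test_pairs = [pair for pair in all_pairs if pair in held_out]

def to_example(verb: str, obj: str) -> tuple[str, str]:
    """Render a (command, target program) pair from an illustrative template."""
    return (f"{verb} {obj}", f"ACTION={verb.upper()} TARGET={obj.split()[-1].upper()}")

train_set = [to_example(v, o) for v, o in train_pairs]
test_set = [to_example(v, o) for v, o in test_pairs]

# Sanity check: the held-out pairings are novel, but every word in them is attested.
train_verbs = {v for v, _ in train_pairs}
train_objects = {o for _, o in train_pairs}
assert all(v in train_verbs and o in train_objects for v, o in held_out)
print(f"{len(train_set)} train examples, {len(test_set)} test examples")
```

A proxy that memorizes surface pairings scores zero on the held-out combinations; a learner that has factored the command into role slots transfers.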
A toy illustration (two phases)
* **Pre-grokking:** a model answers many base-10 arithmetic word problems by patterning on lexical cues (“altogether,” “shared equally,” “difference”), succeeding in-distribution but failing on paraphrases, unit switches, or base changes.
* **Post-grokking:** the same model, after more training/regularization or better prompting, represents the arithmetic as operations on typed quantities. It now handles paraphrases, mixes of units with explicit conversions, and even base-7 toy problems once given the base rules. Nothing in pretraining showed *those exact sentences*, but the invariants (typing, operation composition, conservation of quantity) were learned—and that’s precisely the point.
Where I land on your thesis
Yes: calling the earlier answers “mere pattern-matching” is misleading. Before grokking the model is **already** leveraging patterns—just not the right ones. After grokking, the model has restructured its internal representation so that the **task-relevant** invariants do the work, enabling out-of-sample generalization that looks like “novel application of a concept.” That is exactly how we appraise human intellectual development: not by absence of similarity, but by the ability to surface the right similarity class and apply it flexibly.
If we want to be stringent with LLMs, we should design splits that (i) disrupt superficial cues, (ii) preserve deep invariants, and then (iii) check whether performance transfers with principled errors. Passing that bar is evidence of the same kind of conceptual grasp you’re pointing to—the thing people are trying to name with “grokking.”
USER:
That's very informative, thanks! I was struck by your second disjunct in: "Post-grokking: the same model, after more training/regularization or better prompting"
Is there a phenomenon analogous to grokking that manifests itself in response to better prompting rather than training? Would that be due to a structural similarity between training and in-context learning?
GPT-5:
Great question—and yep, there’s a real analogue. Think of two routes to the same hilltop:
* **Training-time grokking**: weights change until the model’s internal circuits reorient around the task’s true invariants.
* **Prompt-time (in-context) grokking**: weights stay fixed, but the **state** you induce in the network—with examples, instructions, and scratch space—pushes it into the “right” computational regime so those already-learned circuits get *selected, composed, and stabilized*.
Why this can work
* **Meta-learned updater**: During pretraining, LLMs don’t just learn facts; they also learn **procedures for updating hypotheses from context** (few-shot pattern completion, induction heads, schema abstraction). A good prompt supplies the “mini-dataset” that this inner updater needs, so it can do at inference-time what SGD did during training—*not by changing weights, but by routing/activating the right features*.
* **Regularization via format**: Structured prompts (variables, units, step labels, checks) act like a test-time regularizer. They suppress spurious cues and force the model to respect invariants (types, conservation, scope), which often triggers a sharp jump from brittle heuristics to algorithmic behavior—i.e., a **phase-like shift** without any new training.
* **State selection over landscape change**: Training reshapes the energy landscape (the weights). Prompting chooses **where** you land in that landscape (the computation). If the landscape contains a “compact algorithm” basin, a good prompt drops the model into it.
What “prompt grokking” looks like
* **Few-shot thresholds**: Nothing… nothing… then after 3–5 carefully varied examples, performance jumps and generalizes to held-out paraphrases or bases.
* **Error-profile flip**: From scattershot mistakes to principled failures (e.g., violating a clearly stated invariant)—a tell that an internal rule was engaged.
* **Transfer under format**: The model solves a novel instance *because* the prompt enforced a representation (tables, symbolic expressions) that aligns with the latent algorithm.
Practical recipe (how to induce it; a concrete template is sketched after the list)
1. **Give schema, not vibes**: Declare variables, units, and goals; outline steps (“Parse → Formalize → Compute → Check invariants”).
2. **Anti-spurious few-shots**: Use diverse examples that *break* superficial cues but preserve the same deep rule. Include one counterexample.
3. **Force intermediate state**: Ask for a table/equations before the final answer (scratchpad as scaffolding).
4. **Invariant checks**: “Verify dimensional consistency / conservation / scope.”
5. **Self-test**: Add a tiny held-out case for the model to solve using the extracted rule.
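As one concrete rendering of this recipe, here is a small Python sketch of a prompt builder; the schema wording, few-shot examples, and counterexample are illustrative assumptions meant to be adapted to the actual task and model.

```python
# Sketch of the recipe as a prompt template: schema first, then diverse few-shot
# examples that share one deep rule (rate * time = quantity) while varying the
# surface wording, including one counterexample, then the new case.
SCHEMA = (
    "Solve the problem using these steps:\n"
    "1. Parse: list the given quantities with their units.\n"
    "2. Formalize: write the relationships as equations.\n"
    "3. Compute: solve step by step, showing intermediate values.\n"
    "4. Check invariants: verify dimensional consistency and conservation.\n"
    "5. Answer: state the final result with units."
)

FEW_SHOTS = [
    "Q: A pump moves 12 L/min. How much water after 5 min?\n"
    "Parse: rate = 12 L/min, t = 5 min. Formalize: V = rate * t.\n"
    "Compute: 60 L. Check: L/min * min = L, consistent. Answer: 60 L.",
    "Q: Walking at 4 km/h for 90 minutes, how far do you get?\n"
    "Parse: v = 4 km/h, t = 90 min = 1.5 h. Formalize: d = v * t.\n"
    "Compute: 6 km. Check: km/h * h = km, consistent. Answer: 6 km.",
    # Counterexample: breaks the superficial cue that two numbers always multiply.
    "Q: A printer does 30 pages/min. How many pages in 2 kg?\n"
    "Parse: rate = 30 pages/min, but '2 kg' is a mass, not a time.\n"
    "Check: pages/min * kg is not well-typed. Answer: the question is ill-posed.",
]

def build_prompt(question: str) -> str:
    """Assemble schema + anti-spurious few-shots + the new question."""
    return SCHEMA + "\n\n" + "\n\n".join(FEW_SHOTS) + f"\n\nQ: {question}\n"

if __name__ == "__main__":
    print(build_prompt("A tap delivers 3 L every 2 minutes. How long to fill a 27 L tank?"))
```

The few-shots deliberately vary units and phrasing so that the only thing they share is the invariant the prompt is trying to activate.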
When you still need training
* If the latent “algorithm basin” doesn’t exist (the model never learned the invariants), no prompt will conjure it. Prompting can *select* or *stabilize* competence; it can’t mint circuits ex nihilo. That’s when more data/finetuning matters.
So, to your disjunct: the second half is real. There **is** a phenomenon—call it *contextual grokking*—where better prompting produces a phase-like jump in generalization. It’s not magical: it leverages a structural kinship between SGD-learned meta-skills and the model’s in-context updating. Training sculpts the tools; prompting lays out the workbench so the right tool actually gets used.