Project Q*, OpenAI, the Chinese Room, and AGI
Sure, thanks for asking!
Each word that ChatGPT spits out is really just a statistically plausible guess at what word might appear next in that context (context = the user's prompt and the other words it has already spit out). There are generally lots of words that might come next, and so long as it hits on one of them, it has done its job.
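To make that concrete, here's a toy sketch of "guess a plausible next word." The context, the word list, and the probabilities are all made up for illustration; a real model works over tokens and a huge vocabulary, and this is not how ChatGPT is actually implemented.

```python
import random

# Hypothetical next-word probabilities for the context "The cat sat on the".
# These numbers are invented for illustration, not taken from any real model.
NEXT_WORD_PROBS = {"mat": 0.5, "sofa": 0.3, "floor": 0.15, "moon": 0.05}

def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same picks
samples = [sample_next_word(NEXT_WORD_PROBS, rng) for _ in range(5)]
print(samples)
```

Any of the four words counts as "doing its job" here, which is exactly the slack that everyday language allows.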
You can probably see where this is going. Math doesn't quite work that way. Yes, there can be multiple next steps that are allowed, but the rules are far more rigid than with everyday speech. The simplest example gets this general point across well enough:
2 + 3 =
How did Q* solve this problem? Here's my wild guess...
If the next word spit out has to be consistent not just with regularities culled from gobs of factual, faulty, and fanciful texts but also with a model of what the text describes [or, really, a model that coheres with it], the system's answers will be far more tightly constrained.
If the next word has to be consistent with (a) regularities picked up from processing texts and (b) a mental model where, for instance, 3 matchsticks are added to 2 matchsticks, that greatly restricts the space of plausible next words.
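Here's a toy sketch of that two-part constraint for "2 + 3 =". Everything in it is hypothetical: the candidate scores are invented, and `matchstick_model` is just a stand-in for the mental model, not a claim about how Q* (or anything at OpenAI) actually works.

```python
# Hypothetical text-only scores for what follows "2 + 3 =" (made-up numbers).
TEXT_SCORES = {"5": 0.6, "6": 0.15, "4": 0.1, "23": 0.1, "fish": 0.05}

def matchstick_model(a, b):
    """Toy 'mental model': lay out a sticks, add b more, count the pile."""
    pile = ["|"] * a + ["|"] * b
    return len(pile)

def constrained_candidates(text_scores, a, b):
    """Keep only the candidates that agree with the matchstick model."""
    expected = str(matchstick_model(a, b))
    return {word: score for word, score in text_scores.items() if word == expected}

print(constrained_candidates(TEXT_SCORES, 2, 3))  # {'5': 0.6}
```

Text statistics alone leave five candidates on the table; once each one also has to match the pile of matchsticks, only one survives.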
Given that there are spatial proofs of the Pythagorean theorem and lots more besides, this takes you a long way into grade-school math. But why get spooked about that (the way OpenAI got spooked by its little mathematician Q*)?
A system that works as described would have lots of other capabilities. It could eventually understand why a mouse needn't worry about being trapped in a jack-o'-lantern. It could engage in forethought prior to taking action (that is, once it gets a body) so as to achieve its goals. It could generate and test hypotheses. And so on. It wouldn't just be reflecting human language back at us in ways that look smart. It would be smart.