Wide and Narrow Thinking
Why are image-generation models fantastic at producing photorealistic content but hilariously bad at rendering text?
And why can ChatGPT work through challenging coding tasks but not reliably tell you whether 9.11 or 9.9 is larger?
It comes down to narrow versus wide thinking, in the loosest sense. If you ask DALL·E or Midjourney for a Renaissance painting of a cat dressed like Louis XIV, there are many paths the AI can take to get there. But if you ask it to add a text label, the space of acceptable outputs is vastly smaller.
The same applies to mathematical and logical reasoning. The space of acceptable steps is much narrower, and we're expecting quite a lot from an AI if it has to reconcile this focused, discrete kind of thinking with its more random, free-flowing nature.
Tools to the rescue
For language models specifically, the most promising way to fix this is tool use (as we've seen in the AI agent case; there's nothing mystical about it). The "wide thinking" LLM recognizes when it's dealing with a math problem and defers to a calculator, or writes Python code to solve it. ChatGPT already does that, of course.
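To make that concrete, here is a minimal sketch of what deferring to a narrow tool could look like. The routing heuristic and the LLM fallback are hypothetical stand-ins (no real LLM API is involved); the point is simply that exact arithmetic goes to code, while everything else stays with the free-form model.

```python
import re
from decimal import Decimal
from typing import Optional

def narrow_tool_compare(text: str) -> Optional[str]:
    """Narrow tool: answer 'which is larger, X or Y?' exactly, using Decimal arithmetic."""
    numbers = re.findall(r"\d+\.\d+|\d+", text)
    if len(numbers) == 2:
        a, b = Decimal(numbers[0]), Decimal(numbers[1])
        if a == b:
            return f"{a} and {b} are equal"
        return f"{max(a, b)} is larger"
    return None  # not a question this narrow tool can handle

def wide_llm_answer(text: str) -> str:
    """Hypothetical stand-in for a free-form LLM reply (wide thinking)."""
    return f"(LLM free-form answer to: {text!r})"

def answer(question: str) -> str:
    """Route to the narrow tool when it applies; otherwise fall back to the LLM."""
    if "larger" in question.lower():
        tool_result = narrow_tool_compare(question)
        if tool_result is not None:
            return tool_result
    return wide_llm_answer(question)

print(answer("Is 9.11 or 9.9 larger?"))    # -> "9.9 is larger"
print(answer("Write a haiku about cats"))  # -> falls back to the wide-thinking model
```

In practice the routing usually isn't a hard-coded keyword check: the model itself decides when to emit a tool call (function calling), and the surrounding runtime executes it and feeds the result back into the conversation.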
Over time, I'd expect more tools to be integrated with LLMs, so the model can focus on what it's good at (wide thinking) and defer to tools where it's weaker, or where more precision and repeatability is desired (narrow thinking).
I could imagine a few such cases, like pairing an AI code assistant with static analysis, automated refactoring, and other goodies. Every industry and job will have its own set of narrow tools to enhance AI assistants' usefulness and reliability.