Armstrong predicts shift to cheaper AI models as cost pressures mount
Mounting operational costs and rising token prices are prompting a potential industry pivot away from the assumption that larger models are always superior, with early tests showing significant savings without quality loss.

The artificial intelligence sector is undergoing a strategic pivot away from the long-held industry assumption that larger, more compute-intensive models are inherently superior. Historically, major laboratories such as OpenAI and Anthropic have pursued a "scaling-first" approach, driven by the belief that the most powerful models would win, while clients relied on heavily subsidised prices to access these advanced options. However, mounting operational costs and rising token prices are forcing a re-evaluation of this strategy. Coinbase co-founder Brian Armstrong predicts that within 12 to 18 months, 80% of AI workloads will shift to significantly cheaper models, with only 20% reserved for tasks requiring maximum intelligence.
This shift is evidenced by early tests from legal AI tool Harvey, in partnership with Fireworks AI, which demonstrated a threefold reduction in inference costs without compromising quality. The test used a hybrid approach, with Claude Opus handling intensive tasks and Fireworks’ GLM 5.1 handling the rest, which significantly lowered server time and overall costs. Harvey co-founder Gabe Pereyra noted that while quality remains paramount, its definition is evolving from using the most powerful model for everything to using the best model that gets the right answer most efficiently.
The trend challenges the "scaling-first" approach of OpenAI and Anthropic and could affect their financial outlook as they prepare for initial public offerings. Armstrong’s prediction, outlined on X, holds that demand for intelligence is near infinite, but that the vast majority of tasks will run on models that are 99% cheaper. That would be a significant departure from the previous industry norm, in which companies competed on quality by defaulting to the most advanced available options.
The real industry divide is increasingly viewed as lying between large and small models rather than between proprietary and open-weight ones. While switching from GPT-5.5 to DeepSeek’s V4 Flash offers savings, switching to GPT-5.4-mini works just as well, which suggests that the specific type of smaller model matters less than the move away from frontier compute. A price war is emerging between in-house inference from big labs and independently served open-weight models, further pressuring the economics of training frontier models.
With token prices rising and investor subsidies slowing, users are facing cost pressure for the first time. It remains uncertain whether enterprise users will fully adopt smaller models or simply reduce usage, but the potential for a major shift in AI economics is evident. If most deployments can run effectively on smaller models, that could dampen the growing demand for inference and raise new questions about how to justify the high cost of training the most compute-intensive models possible.


