7 Comments
Romilly Cocking

My two cents' worth: everyone I know who has the skill and money is exploring the limits of locally running LLMs. At the same time, the Chinese see an opportunity to diminish the competitive threat from the USA by releasing smaller, more powerful models. These can be run locally and come close to proprietary models 10-100x their size. We're also seeing self-improving models and agent teams working well. It's hard to see any of the proprietary model providers surviving, let alone all, and the investment in new data centres will steadily decline. All this against the backdrop of a global energy crisis. Fun times.

Pawel Jozefiak

Beck's investor confidence argument is compelling, but I'd split the constraint in two. Investor signaling is real, but the engineering part is real too: the throttling I saw last month wasn't uniform across use types, which suggests compute cost varies a lot by task. Long agentic sessions got cut harder than single completions. That looks like an infrastructure constraint, not just signaling.

Both can be true at once. More interesting question: if limits are primarily narrative and lift when funding stabilizes, does demand catch up fast enough to immediately recreate the constraint? That cycle could run for a while before the infrastructure actually catches up.

Pawel Brodzinski

My favorite question applies: What does the endgame look like?

How long would the unconstrained demand curve keep looking like it does now? Extend it some time into the future, and even with infinite capital, hardware capabilities, and custom chips, it's unsustainable for either major player.

If so, then taking over dissatisfied users from competitors isn't as surefire a move as it looks. These are, after all, the heaviest users. Those for whom the economic bill might look even worse than it does for the free tier.

Small wonder that Claude Code bans OpenClaw. It's literally like saying, "Dear OpenClawers, go to where the founder is—OpenAI—for them to pay your bills."

Technically, we can generate as much compute demand as we wish. And since this whole thing is still subsidized, no throttling is as much about taking over traffic as it is about suicide (semi-infinite money or not).

Being the first is overrated. As we know from history.

Sudeep

We’ve been seeing this squeeze coming, which is why we built Zoryn dot ai to run state-of-the-art models like Gemma 2 entirely locally.

As you noted, when usage limits hit 'mid-flow,' the developer's work just stops. Moving the inference—and even more importantly, the high volume of compute-light calls—to the user's own hardware feels like the only way to escape that 'narrative bottleneck' and the conversion pressure from the big model providers. Local-first isn't just about privacy anymore; it's about reliability of flow.
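For readers who want to see what 'entirely locally' can look like in practice, here is a minimal sketch using Ollama's Python client (chosen purely for illustration; it is not Zoryn's actual stack, and it assumes Ollama is installed and the model was pulled beforehand with 'ollama pull gemma2'):

  import ollama  # pip install ollama; assumes the local Ollama daemon is running

  # Generic local-inference sketch (illustrative, not Zoryn's stack):
  # chat with a locally pulled Gemma 2 model. Nothing leaves the machine,
  # so there is no provider-side usage limit to hit mid-flow.
  response = ollama.chat(
      model="gemma2",  # assumes 'ollama pull gemma2' was run beforehand
      messages=[{"role": "user", "content": "Refactor this function to be pure."}],
  )
  print(response["message"]["content"])

The flow-reliability point falls out of the design: latency and availability depend only on local hardware, not on a provider's throttling policy.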

Sean Corfield

I use Copilot Chat with VS Code and I run it in "auto" mode, as far as agent/model selection is concerned, and we have a Copilot Business license ($19/month). That seems to track only "premium requests," and after spending literally all day yesterday driving Copilot, it's still within "budget" and has used a mix of Claude Haiku 4.5, Claude Sonnet 4.6, GPT 5.4, and GPT 5.3 Codex (ordered from least-used to most-used). I'll keep an eye on it to see whether it is affected by any throttling and/or changes behavior in the face of this.

Eddy Borremans

I sincerely hope you get to draw a similar graph (soon?) regarding the usefulness of open source models and the point where it becomes really viable to run local inference. Will the brilliance of big-tech models always outweigh the cost of using them?

Chris Brown

I have been posting about my genie journey on LinkedIn. It started as a lark, but it has gotten some traction. My Dad's first cousin, in his 70s, commented that he is using Claude as an executive assistant, and he loves it. I totally believe Anthropic ran into a hard physical wall overnight. We are running it on our own cloud at the moment, so we did not witness the throttling at work. I expect more of this: new data centers do not grow overnight.