Abspeckgeflüster – Forum for people with weight(ing)

Free. Ad-free. Human. Your weight-loss forum.

@unlambda@hachyderm.io
Posts: 3 · Topics: 0 · Shares: 0 · Groups: 0 · Followers: 0 · Following: 0


Posts


  • Wanted: Advice from CS teachers
    unlambda@hachyderm.io

    @EricLawton @maco @aredridel @futurebird @david_chisnall We don't know exactly what inference costs for the closed models; they may be selling at a loss, breaking even, or making a slight profit on inference. But you can tell exactly how much inference costs with open weights models: you can run them on your own hardware and measure the cost of the hardware and power (there's a rough sketch of that calculation at the end of this post). There's also a competitive landscape of providers offering to run them. And open weights models are only lagging the closed models by a few months at this point.

    If the market consolidates down to only one or two leading players, then yes, it's possible for them to put a squeeze on the market and jack up prices. But right now it's a highly competitive market with very little stickiness: it's very easy to move to a different provider if the one you're using raises prices. OpenAI, Anthropic, Google, and xAI are each regularly releasing frontier models that leapfrog one another on various benchmarks, and the Chinese labs are only a few months behind and generally release open weights models, which are much easier to measure and build on top of. There's very little moat right now other than sheer capacity for training and inference.

    And I would expect that if we do get a consolidation and squeeze, it would be by jacking up prices, not by generating too many tokens. Right now inference is highly constrained; the people I work with who use these models hit capacity limits all the time. These companies can't build out capacity fast enough to meet demand, so if anything they're motivated to make things more efficient right now.

    I have a lot of problems with the whole LLM industry, and I feel like in many ways it's being rushed out before we're truly ready for all of the consequences, but it is genuinely in high demand right now.

    Tags: Uncategorized, teaching
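    A minimal back-of-the-envelope sketch of that hardware-and-power calculation; every figure below is an illustrative assumption, not a measurement:

```python
# Rough sketch: $/million tokens for self-hosted inference.
# All inputs are illustrative assumptions, not measured values.

def cost_per_million_tokens(
    hardware_cost_usd: float,        # purchase price of the GPU server
    amortization_years: float,       # how long you expect to use it
    power_draw_kw: float,            # average draw under load
    electricity_usd_per_kwh: float,
    tokens_per_second: float,        # measured throughput for your model/batch size
    utilization: float = 0.5,        # fraction of wall-clock time the box is busy
) -> float:
    hours_per_year = 24 * 365
    # Spread the hardware cost over its useful life, per hour of wall-clock time.
    hardware_usd_per_hour = hardware_cost_usd / (amortization_years * hours_per_year)
    power_usd_per_hour = power_draw_kw * electricity_usd_per_kwh
    usd_per_hour = hardware_usd_per_hour + power_usd_per_hour
    # Tokens actually produced per hour, accounting for idle time.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical numbers: a $40k multi-GPU box, 3-year amortization, 5 kW draw,
# $0.15/kWh, 2,000 tokens/s aggregate throughput, 50% utilization.
print(f"${cost_per_million_tokens(40_000, 3, 5, 0.15, 2_000, 0.5):.2f} per million tokens")
```

    With these made-up numbers it lands well under a dollar per million tokens; the point is only that every input to the calculation is something you can measure yourself.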

  • Wanted: Advice from CS teachers
    unlambda@hachyderm.io

    @futurebird @aredridel @EricLawton @david_chisnall @maco They have been improving the models' ability to write code, probably faster than almost any other ability. They can do this through what's called reinforcement learning with verifiable rewards (RLVR), because with code it's possible to verify whether the result is correct (whether it compiles, whether it passes a particular test or test suite, etc.); there's a small sketch of the idea at the end of this post.

    So while the pre-training is based on just predicting the next token in existing codebases, they can then make the model better and better at coding by giving it problems to solve (get this code to compile, fix this bug, implement this feature, etc.), checking whether it succeeded, and applying positive or negative reinforcement based on the result.

    And this scales fairly easily: you can come up with whole classes of problems, like "implement this feature in <language X>", vary the language while keeping the same test suite, and train the model to write all of those languages better.

    So while there are also improvements in the tooling, the models themselves have been getting quite a bit better both at writing correct code on the first try and at figuring out what went wrong and fixing it when it doesn't work the first time.

    In fact, there are now open weights models (models you can download and run on your own hardware, though for the biggest ones you really need thousands to tens of thousands of dollars of hardware to run the full model) that are competitive on coding tasks with the top-tier closed models from just six months or so ago, in large part because of how effective RLVR is.

    Tags: Uncategorized, teaching
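    A minimal sketch of the "verifiable reward" step described above: run a candidate solution against a test suite and turn pass/fail into a scalar reward. It assumes pytest is installed, the toy solutions and tests are made up, and a real RLVR setup would feed this reward into a policy-gradient update (e.g. PPO or GRPO) rather than just printing it:

```python
# Verifiable reward for code: does the candidate pass a deterministic test suite?
import subprocess
import sys
import tempfile
from pathlib import Path

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the candidate passes the tests, 0.0 otherwise."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(candidate_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
            cwd=tmp, capture_output=True, timeout=60,
        )
        return 1.0 if result.returncode == 0 else 0.0

# Toy problem: "implement add" with a fixed test suite.
TESTS = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(good, TESTS))  # 1.0 -> positive reinforcement
print(verifiable_reward(bad, TESTS))   # 0.0 -> negative reinforcement
```

    The same reward function works for whole families of problems, which is what makes this kind of training scale: swap out the prompt, the language, or the test suite and the check stays mechanical.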

  • Wanted: Advice from CS teachers
    unlambda@hachyderm.io

    @maco @aredridel @futurebird @EricLawton @david_chisnall In general, they charge for both input tokens and output tokens, at different rates. For example, Claude Opus 4.5 costs $5 per million input tokens and $25 per million output tokens.

    For an LLM to keep track of the context of a conversation or coding session, you need to feed the whole conversation in as input again each time, so you end up paying the input token rate many times over (the sketch at the end of this post shows how that adds up over a session).

    However, there's also caching. Since you're going to be sending the same conversation prefix over and over again, the provider can cache the results of processing it (the model's attention state for that prefix). Some providers do this automatically and roll it into their pricing; others let you control it explicitly by paying to keep certain conversations cached for five minutes or an hour. So you pay once for the input and once for the caching, and then you can keep reusing that prefix and appending to it.

    If you're paying like this by the token (which you do if you're just using it as an API user), then yeah, if it gets it wrong, you have to pay all over again for the tokens to correct it.

    However, the LLM companies generally offer special plans for their coding tools, where you pay a fixed rate between $20 and $200/month for a certain guaranteed quota and can use more than that when there's spare capacity, which can work out to more tokens for less money than paying per token. But of course it's not guaranteed; you can run out of quota and need to wait in line if the servers are busy.

    And their tools handle all of that: caching, selecting different models for different kinds of tasks, running external tools for deterministic results, etc.

    Tags: Uncategorized, teaching
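    A rough sketch of how those per-token prices add up when the whole conversation is re-sent each turn, and how caching changes that. The $5/$25 per million token rates are the ones mentioned above; the cache multipliers follow Anthropic's published structure (a surcharge to write the cache, a heavy discount to read it) but are treated here as assumptions, and the turn sizes are made up:

```python
# Cost of a multi-turn session with and without prompt caching.
INPUT_PER_MTOK = 5.00     # $ per million input tokens (from the post above)
OUTPUT_PER_MTOK = 25.00   # $ per million output tokens (from the post above)
CACHE_WRITE_MULT = 1.25   # assumption: writing the cached prefix costs a bit extra
CACHE_READ_MULT = 0.10    # assumption: re-reading a cached prefix is heavily discounted

def conversation_cost(turns, new_input_tokens, output_tokens, cached=False):
    """Each turn re-sends the whole conversation so far plus the new message."""
    cost = 0.0
    prefix = 0  # tokens from earlier turns (user messages + model replies)
    for _ in range(turns):
        if cached:
            # Earlier turns are read back from the cache at a discount; only this
            # turn's new message is billed (and written to the cache) in full.
            # Simplification: the model's replies are assumed to join the cache
            # without an extra write charge.
            cost += prefix * CACHE_READ_MULT * INPUT_PER_MTOK / 1e6
            cost += new_input_tokens * CACHE_WRITE_MULT * INPUT_PER_MTOK / 1e6
        else:
            # The whole conversation is billed again at the normal input rate.
            cost += (prefix + new_input_tokens) * INPUT_PER_MTOK / 1e6
        cost += output_tokens * OUTPUT_PER_MTOK / 1e6
        prefix += new_input_tokens + output_tokens  # the reply becomes part of the prefix
    return cost

# Hypothetical coding session: 30 turns, 2k new input and 1k output tokens per turn.
print(f"uncached: ${conversation_cost(30, 2_000, 1_000):.2f}")
print(f"cached:   ${conversation_cost(30, 2_000, 1_000, cached=True):.2f}")
```

    With these made-up turn sizes the uncached session costs a few times as much as the cached one, which is why the coding tools lean so heavily on caching.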