HACKER Q&A
📣 frannyg

Do LLMs get "better" with more processing power and/or time per request?


Do they make more (recursive) queries into their training data for breadth and depth? Or does the code limit the algorithms by design and/or by constraints other than the incompleteness of the encoded semantics?


  👤 lolinder Accepted Answer ✓
There's a misconception in the question that is important to address first: when an LLM is running inference it isn't querying its training data at all; it's just using a function that we created previously (the "model") to predict the next word in a block of text. That's it. When considering plain inference (no web search or document lookup), the decisions that determine a model's speed and capabilities come before the inference step, during the creation of the model.

Building an LLM consists of defining its "architecture" (an enormous mathematical function that defines the model's shape) and then using a lot of trial and error to guess which "parameters" (constants that we plug into the function, like 'm' and 'b' in y=mx+b) will be most likely to produce text that resembles the training data.

So, to your question: LLMs tend to perform better the more parameters they have, so larger models will tend to beat smaller models. Larger models also require a lot of processing power and/or time per inferred token, so we do tend to see that better models take more processing power. But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.
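
To make the "fixed function with frozen parameters" idea concrete, here's a toy sketch in Python (the vocabulary and weights are made up purely for illustration; real models are vastly larger and more structured):

    import numpy as np

    # Toy "language model": a fixed function of frozen parameters.
    VOCAB = ["the", "cat", "sat", "on", "mat"]
    rng = np.random.default_rng(0)
    W = rng.normal(size=(len(VOCAB), len(VOCAB)))  # the "parameters", fixed after training

    def next_token_probs(last_token):
        # Returns a probability distribution over the next token.
        # Note: no training data is consulted here, only the frozen weights W.
        logits = W[VOCAB.index(last_token)]
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    print(dict(zip(VOCAB, next_token_probs("cat").round(3))))

Throwing more hardware at this function changes how fast it runs, not what it returns.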


👤 HarHarVeryFunny
There are all sorts of changes one could imagine being made to how LLMs are trained and run, but if you are asking about what actually exists today, then:

1) At runtime, when you feed a "request" (prompt) into the model, the model will use a fixed amount of compute/time to generate each word of output. There is no looping going on internally - just a fixed number of steps to generate each word. Giving it more or less processing power at runtime will not change the output, just how fast that output is generated.

If you, as a user, are willing to take more time (and spend more money) to get a better answer, then a trick that often works is to take the LLM's output and feed it back in as a request, just asking the LLM to refine/reword it. You can do this multiple times (a sketch follows at the end of this comment).

2) At training time, for a given size of model and given set of training data, there is essentially an optimal amount of time to train for (= amount of computing power and time taken to train). Train for too short a time and the model won't have learnt all that it could. Train for too long a time (repeating the training data), and the model will start to memorize the training set rather than generalize from it, meaning that the model is getting worse.
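
To illustrate the refine-and-resubmit trick from point 1, here's a minimal sketch (the chat function is a hypothetical stand-in for whatever API or local model you're calling):

    def chat(prompt):
        # Placeholder: swap in a real API call or local model invocation here.
        return "(model output for: " + prompt[:40] + "...)"

    def refine(question, rounds=3):
        # Each extra round spends more time/compute in exchange for (hopefully) a better answer.
        answer = chat(question)
        for _ in range(rounds):
            answer = chat(
                "Improve the accuracy and clarity of this draft answer, "
                "then return only the revised answer.\n\n"
                "Question: " + question + "\n\nDraft: " + answer
            )
        return answer

    print(refine("Why is the sky blue?"))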


👤 fykem
More processing power does not make a model better. You can train models on CPUs with the same result, given the same model architecture and dataset. It'll just take longer to get those results.

What makes models "good" is whether the dataset "fits" the model architecture properly and whether you have given it enough time (epochs) to reach a reasonably accurate prediction ratio (let's say 90% accurate). For the image classification models I've done, around ~100 epochs on 10,000 items seems to be the best certain datasets will ever get. At some point the continued training of the model is either underfitting or overfitting, and no amount of further training/processing power will help improve it.
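
A common way to catch that turning point in practice is early stopping against a held-out validation set. A minimal sketch (the loss curve here is simulated, not from any real model):

    import random

    def validation_loss(epoch):
        # Stand-in for real held-out loss: improves early, then creeps back up (overfitting).
        return abs(epoch - 40) / 40 + random.random() * 0.05

    best, since_best, patience = float("inf"), 0, 5
    for epoch in range(200):
        loss = validation_loss(epoch)
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:  # no improvement for a while: more training won't help
            print("stopping at epoch", epoch, "best val loss", round(best, 3))
            break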


👤 PeterisP
No, the standard LLM implementations currently used apply a fixed amount of computation during inference, which is chosen and "baked in" by the model architecture before training. They don't really have the option to "think a bit more" before giving the answer; generating each token takes the exact same number of matrix multiplications. Well, they probably could theoretically be modified to do it, but we don't do that properly yet, even if some styles of prompt, e.g. "let's think step by step", kind of nudge the model in that direction.

The same model will give the same result, and more processing power will simply enable you to get the inference done faster.

On the other hand, more resources may enable (or be required for) a different, better model.


👤 viraptor
One caveat not mentioned yet is that you can get better responses through priming, few-shot examples, and chain-of-thought prompting. That means if you start by talking about a related problem/concept, mention some keywords, then provide a few examples, then ask the LLM to provide chain-of-thought reasoning, you will get a better answer. Those techniques will extend the runtime and processing power used in practice.
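
For example, a prompt combining all three tricks might be assembled like this (the wording and examples are purely illustrative):

    examples = [
        ("A shop sells pens at $2 each. How much do 3 pens cost?",
         "Each pen is $2, and 3 x $2 = $6. Answer: $6."),
        ("A train travels 60 km in 1 hour. How far does it go in 2.5 hours?",
         "Speed is 60 km/h, so 60 x 2.5 = 150 km. Answer: 150 km."),
    ]
    question = "A box holds 12 eggs. How many eggs are in 7 boxes?"

    prompt = "You are solving short arithmetic word problems.\n\n"  # priming / related context
    for q, a in examples:                                           # few-shot examples
        prompt += "Q: " + q + "\nA: " + a + "\n\n"
    prompt += "Q: " + question + "\nA: Let's think step by step."   # chain-of-thought nudge

    print(prompt)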

👤 HeavyStorm
Without knowing the particulars of an implementation, it's hard to say. Some can refine results by running the model a few more times, so yeah, better processing and/or more time would help, though probably not by much.

Most models, however, don't, so there's no special benefit from better processing other than speed.


👤 sweezyjeezy
Interestingly, it used to be quite standard with 'small' language models to use a search algorithm to generate a full block of text, the most basic being beam search. You can then get better results with more processing power by doing a wider beam search. This isn't quite what the OP is talking about; it just means generating a larger number of candidate continuations. However, it's not necessary or optimal for newer LLMs, because it tends to steer the LLM into quite generic places, and it can get very repetitive.
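
For reference, a bare-bones beam search over next-token log-probabilities looks roughly like this (the next_logprobs function is a stand-in for a real model's output distribution):

    import math

    VOCAB = ["a", "b", "c", "<eos>"]

    def next_logprobs(prefix):
        # Stand-in for a real model: log-probabilities for each possible next token.
        probs = [0.5, 0.3, 0.1, 0.1] if len(prefix) < 5 else [0.05, 0.05, 0.1, 0.8]
        return {tok: math.log(p) for tok, p in zip(VOCAB, probs)}

    def beam_search(beam_width=3, max_len=8):
        beams = [([], 0.0)]  # (tokens so far, total log-probability)
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens and tokens[-1] == "<eos>":
                    candidates.append((tokens, score))
                    continue
                for tok, lp in next_logprobs(tokens).items():
                    candidates.append((tokens + [tok], score + lp))
            # A wider beam keeps more candidates alive, i.e. more compute per output.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams[0]

    print(beam_search(beam_width=3))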

👤 mikewarot
No.

An LLM can only give probabilities of the next token of output. The time to improve an LLM is during design, training, or fine tuning. Once you've got the final weights, the function is "locked in" and doesn't change.

However, part of the process of learning to predict human output from the internet, literature, etc. causes some deeper learning to occur, potentially even more than in humans, and certainly of a different nature. The LLM is communicating through a lossy process, and there is some randomness imposed on its outputs, so results may vary.

The nature of the prompt used can trigger some of this deeper learning, and yield better results than you might otherwise get. These weren't put in by design; they are emergent properties of the LLM. For instance, "chain of thought" prompting has been shown to result in better output.

Prompt "engineering" is an empirical process of discovering the quirks and hidden strengths in the model. It is entirely possible that there is a super-human set of cognitive skills embedded inside GPT4, Mistral, or even LLAMA. Given sufficient time, there might be some prompting that could expose it and make it usable.

Because LLMs aren't "programs" in the traditional sense, you should treat them as if they were an alien intelligence, because that is effectively what they are. They don't understand humans, no matter how well they act like it at times. They are wild beasts, and we haven't figured out how to domesticate them yet.


👤 FlyingAvatar
Short answer is No.

I highly recommend watching Andrej Karpathy's Intro to LLMs talk, particularly the section on System 1 vs System 2 thinking. Long story short, what you are describing, using more processing to prepare a better response, is an area of interest, but it is not currently part of ChatGPT (or any other LLM that I am aware of).

See: https://youtu.be/zjkBMFhNj_g?t=2100&si=jaImuf3UCn6ReTp4


👤 lmeyerov
Yes, but not for the reasons you're thinking

- If you have a fixed time budget and increase the GPU memory+compute available, you can directly query a bigger model. Raw models are basically giant lookup functions, and without the extra memory+compute, they'll spill to slower layers of your memory hierarchy, e.g., GPU RAM -> CPU RAM -> disk. Likewise, with MoE models, there are multiple concurrent models being queried.

- Most 'good' LLM systems are not just direct model calls, but code-based agent frameworks on top that call code tools, analyze the results, and decide to edit+retry things. For example, if doing code generation, they may decide to run lint analysis & type checking on a generated output, and if there are issues, ask the LLM to try again. In Louie.AI, we will even generate database queries and run GPU analytics & visualizations in on-the-fly Python sandboxes. These systems will do backtracking, retries, etc., and > 50% of the quality can easily come from these layers: LLM leaderboards like HumanEval increasingly report both the raw model and the agent framework on top. All this adds up and can quickly become more expensive than the LLM. So better systems can enable more here too.
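
A toy version of that generate-check-retry loop might look like this (the llm_complete function is a hypothetical stand-in, and using pyflakes as the checking tool is just one example; this is not how Louie.AI or any particular framework is actually implemented):

    import subprocess, tempfile

    def llm_complete(prompt):
        # Stand-in for a real model call.
        return "def add(a, b):\n    return a + b\n"

    def lint(code):
        # Run a linter (pyflakes must be installed) and return its complaints, empty if clean.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        result = subprocess.run(["pyflakes", f.name], capture_output=True, text=True)
        return result.stdout + result.stderr

    def generate_checked(task, max_tries=3):
        prompt = task
        for _ in range(max_tries):
            code = llm_complete(prompt)
            problems = lint(code)
            if not problems:
                return code  # the tool is happy, stop here
            # Feed the tool output back in and ask the model to try again.
            prompt = task + "\n\nYour previous attempt had these problems:\n" + problems + "\nTry again."
        return code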


👤 jdsully
For inference the common answer will be "no": you use the model you get, and it takes a constant amount of time to process.

However, the truth is that inference platforms do take shortcuts that affect accuracy. E.g. llama.cpp will down-convert fp32 intermediates to 8-bit quantized values so it can do the work using 8-bit integers, trading some of the computation's accuracy for performance.
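
The basic trade-off looks like this (a generic symmetric 8-bit quantization sketch, not llama.cpp's actual scheme):

    import numpy as np

    weights = np.random.randn(8).astype(np.float32)

    # Quantize: map float values onto 8-bit integers via a single scale factor.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    # Dequantize: storage and arithmetic get cheaper, but the values are no longer exact.
    restored = q.astype(np.float32) * scale
    print("max rounding error:", float(np.abs(weights - restored).max()))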


👤 janalsncm
My understanding of GPT-4 is that it is a mixture of experts. In other words, multiple GPT-3.5 models responding to the same prompt in parallel, and another model on top choosing the best response among them.

So in that case, more models could give a better response, which costs more compute.
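
A hypothetical sketch of that "pick the best of several responses" idea (ask_model and score are stand-ins, not anything that has actually been published about GPT-4):

    def ask_model(model_id, prompt):
        return "candidate answer from model %d" % model_id  # stand-in for a real call

    def score(prompt, answer):
        return len(answer)  # stand-in for a ranking / reward model

    def best_of_n(prompt, n=4):
        # Querying more candidate models costs more compute, but raises the odds
        # that at least one response is good.
        candidates = [ask_model(i, prompt) for i in range(n)]
        return max(candidates, key=lambda ans: score(prompt, ans))

    print(best_of_n("Explain beam search in one sentence."))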


👤 tedivm
The same model will not get better by having more processing power or time. However, that's not the full story.

Larger models generally perform better than smaller models (this is a generalization, but a good enough one for now). The problem is that larger models are also slower.

This ends up being a balancing act for model developers: they could get better results, but it may end up being a worse user experience. Model size can also limit where the model can be deployed.


👤 charcircuit
LLMs output a probability distribution for the next word. Searching the space for the best next word takes more time than just picking a good one and assuming it was a good enough choice.
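
In code terms, the cheap option is just a single pick from that distribution (toy numbers below); anything smarter, like the beam search sketched earlier in the thread, multiplies the work per output word:

    import random

    # Toy next-word distribution a model might output.
    dist = {"cat": 0.5, "dog": 0.3, "mat": 0.15, "xylophone": 0.05}

    greedy = max(dist, key=dist.get)                                      # "picking a good one"
    sampled = random.choices(list(dist), weights=list(dist.values()))[0]  # same cost, adds variety

    print(greedy, sampled)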

👤 neximo64
Both

It takes time to train them; more is generally better. Usually about 6 months or so. More processing power during training can also let the model cram more capability in.


👤 bluecoconut
Directly answering your question requires making some assumptions about what you mean and also what "class" of models you are asking about. Unfortunately I don't think it's just a yes or no, since I can answer in both directions depending on the interpretation.

[No] If you mean "during inference", then the answer is mostly no in my opinion, but it depends on what you are calling a "LLM" and "processing power", haha. This is the interpretation I think you are asking for though.

[Yes] If you mean everything behind an endpoint is an LLM, eg. that includes a RAG system, specialized prompting, special search algorithms for decoding logits into tokens, then actually the answer is obviously a yes, those added things can increase skill/better-ness by using more processing power and increasing latency.

If you mean the raw model itself, and purely inference, then there's sorta 2 classes of answers.

[No] 1. On one side you have the standard LLM (just a gigantic transformer), and these run the same "flop" of compute to predict logits for 1 token's output (at fixed size input), and don't really have a tunable parameter for "think harder" -> this is the "no" that I think your question is mostly asking.

[Yes] 2. For mixture of experts, though they don't do advanced adaptive model techniques, they do sometimes have a "top-K" parameter (eg. top-1 top-2 experts) which "enable" more blocks of weights to be used during inference, in which case you could make the argument that they're gaining skill by running more compute. That said, afaik, everyone seems to run inference with the same N number of experts once set up and don't do dynamic scaling selection.
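
A stripped-down sketch of that top-K idea (illustrative only; real MoE layers route per token inside the transformer and are considerably more involved):

    import numpy as np

    def moe_layer(x, experts, gate, top_k=2):
        # Route input x to the top_k highest-scoring experts and mix their outputs.
        # Raising top_k activates more expert weights per token, i.e. more compute.
        scores = gate @ x                            # gating network's score per expert
        chosen = np.argsort(scores)[-top_k:]         # indices of the top_k experts
        mix = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
        return sum(w * (experts[i] @ x) for w, i in zip(mix, chosen))

    rng = np.random.default_rng(0)
    d, n_experts = 4, 8
    x = rng.normal(size=d)
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
    gate = rng.normal(size=(n_experts, d))

    print(moe_layer(x, experts, gate, top_k=1))
    print(moe_layer(x, experts, gate, top_k=2))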

[Yes] Another interpretation: broadly there's the question of "what factors matter the most" for LLM skill. If you include training compute as part of compute (amortize it or whatever), then per the scaling law papers the 3 key things to keep in mind are [FLOPs, parameters, tokens of training data], and in these there is seemingly power-law scaling of behavior: if you can "increase these", the resulting skill also keeps "improving". Hence one interpretation in which "more processing power" (training) and "time per request" (bigger model / inference latency) are correlated with "better" LLMs.

[No] You mention this idea of "more recursive queries into their training data", and it's worth noting a trained model no longer has access to the training data. And in fact, the training data that gets sent to the model during training (eg. when gradients are being computed and weights are being updated) is usually sent on some "schedule" (or sampling strategy), and isn't really something that is being adaptively controlled or dynamically "sampled" even during training. So the model doesn't have the ability to "look back" (unless it's a retrieval-style architecture or a RAG inference setup).

[Yes] Another thing is the prompting strategy / decoding strategy, hinted at above. E.g. you can decode by just taking 1 output, or you can take 10 outputs in parallel, rank them somehow (consensus ranking, or otherwise), and yes, that can also improve results. (This was contentious when Gemini Ultra was released, because their benchmarks used slightly different prompting strategies than the GPT-4 ones, which made it even more opaque to determine "better" score per cost as some meta-metric.) Some terms here are chain/tree/graph of thought, etc.

[Yes (weak)] Next, there's another "concept" in your question about "more processing power leading to better results": you could argue "in-context learning" is itself more compute (it takes FLOPs to run the context tokens through the model, with N^2 scaling, though with caches). Purely by "giving a model" more instructions at the beginning, you increase the compute and memory required, but also often "increase the skill" of the output tokens. So maybe in that regard, even a frozen model is a "yes": it does get smarter (with the right prompt / context).

One interesting detail about current SotA models, even the mixture-of-experts style models, is that they're "static" in their weights and in the "flow" of activations along the "layer" direction. They're only dynamic (re-using weights) in the "token"/"causal" ordering direction (the N^2 part).

I've personally spent some time (~1 month in Nov last year) working on trying to make more advanced "adaptive models" that use switches like those from the MoE-style network, but route to "the same" QKV attention matrices, so that something like what you describe is possible: make the "number of layers" a dynamic property, and have the model learn to predict after 2 layers, 10 layers, or 5,000 layers, and see if "more time to think" can improve the results, do math with concepts, etc. For there to be dynamic layers, though, the weights can't be "frozen in place" like they currently are. Currently I have nothing good to show here.

One interesting finding (now that I'm rambling and just typing a lot) is that in a static model you can "shuffle" the layers (eg. swap layer 4's weights with layer 7's weights) and the resulting tokens roughly seem similar (likely thanks to the ResNet-style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end, rather than staying in token space the whole way through.

This is why I think it's possible to do more dynamic routing in the middle of networks, which I think is what you're implying when you say "do they make more recursive queries into their data". (I'm projecting, but when I imagine the idea of "self-reflection" or "thought" like that inside of a model, I imagine it at this layer -- which, as far as I know, has not been shown/tested in any current LLM / transformer architecture.)