Inference
The moment the model works for you.
The analogy
Training is the chef's years of culinary school; inference is cooking your dish when you order it. The schooling happened once and cost a fortune; the cooking happens with every order and has to be fast.
In detail
Inference is running an already-trained model to generate answers: the weights don't change, they are only read and used in the computation. Every generated token requires its own pass through the network, which is why answers “type themselves out” progressively (streaming). Optimizations such as quantization, context caching and batching cut cost and latency.
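The token-by-token loop can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: the names WEIGHTS, forward and generate are hypothetical, and the "network" here is just a lookup table standing in for billions of numeric parameters. What it does show is that the weights are only read, and that each new token costs one more pass.

```python
from typing import Iterator, List

# Hypothetical frozen "weights": a table mapping the last token to the next one.
# Assumption for illustration only; a real model stores numeric parameters.
WEIGHTS = {
    "<start>": "Inference",
    "Inference": "runs",
    "runs": "a",
    "a": "trained",
    "trained": "model",
    "model": "<end>",
}


def forward(tokens: List[str]) -> str:
    """One pass through the toy 'network': the weights are read, never written."""
    return WEIGHTS.get(tokens[-1], "<end>")


def generate(prompt: List[str], max_new_tokens: int = 10) -> Iterator[str]:
    """Yield tokens one at a time, each produced by its own forward pass."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = forward(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes input for the next pass
        yield next_token           # the caller can display it immediately (streaming)


if __name__ == "__main__":
    for token in generate(["<start>"]):
        print(token, end=" ", flush=True)
    print()
```

Because generate is a generator, the caller receives each token as soon as it is computed, which mirrors the streaming behavior described above.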
An example
When you watch the answer appear word by word, it's not a visual effect: the model is computing each token at that very moment, one after another.