Inference

The moment the model works for you.

The analogy

Training is the chef's years of culinary school; inference is cooking your dish when you order it. The schooling happened once and cost a fortune; the cooking happens with every order and has to be fast.

In detail

Inference is running an already-trained model to generate answers: the weights don't change; they are only read and used in the computation. Each generated token requires a forward pass through the whole network, which is why answers “type themselves out” progressively (streaming). Optimizations cut cost and latency: quantization stores the weights at lower precision, context caching reuses the computation for tokens already processed, and batching serves several requests in a single pass.
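To make one of these optimizations concrete, here is a minimal sketch of quantization in Python. It assumes the simplest scheme (symmetric, one shared scale for all values); the function names `quantize_int8` and `dequantize` and the sample weights are made up for illustration, and real inference engines use more elaborate variants.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.05, 0.40]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored value differs from the original by at most half a step.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(quantized, scale, max(errors))
```

The trade-off is the point: each weight now fits in one byte instead of four (for 32-bit floats), at the price of a small rounding error bounded by half of `scale`.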

An example

When you watch the answer appear word by word, it's not a visual effect: the model is computing each token at that very moment, one after another.
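The same behavior can be sketched with a toy example in Python. Everything here is an assumption made for illustration: the five-word vocabulary, the `WEIGHTS` table, and the `forward` and `generate` functions are hypothetical, and unlike a real model this toy only looks at the previous token. What it does share with real inference is the shape of the loop: the weights are read but never modified, and each token costs one pass.

```python
import math

# Frozen toy "weights": a score for each candidate next token,
# indexed by the previous token. Inference only reads this table.
VOCAB = ["the", "model", "computes", "tokens", "<eos>"]
WEIGHTS = {
    "<bos>":    [2.0, 0.1, 0.1, 0.1, 0.0],
    "the":      [0.0, 2.0, 0.1, 0.5, 0.0],
    "model":    [0.0, 0.0, 2.0, 0.1, 0.1],
    "computes": [0.1, 0.0, 0.0, 2.0, 0.1],
    "tokens":   [0.0, 0.0, 0.0, 0.0, 2.0],
}


def forward(context):
    """One pass: turn the scores for the last token into probabilities."""
    logits = WEIGHTS[context[-1]]
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(max_tokens=10):
    """Yield tokens one at a time, each computed just before it is yielded."""
    context = ["<bos>"]
    for _ in range(max_tokens):
        probs = forward(context)                # one pass per token
        token = VOCAB[probs.index(max(probs))]  # greedy: pick the most likely
        if token == "<eos>":
            break
        context.append(token)
        yield token


# The caller sees each token as soon as it exists: that is streaming.
for token in generate():
    print(token, end=" ", flush=True)
print()
```

Because `generate` is a generator, the caller can display each token the moment it is produced instead of waiting for the full answer.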

Related concepts