Transformer

The architecture that made everything else possible.

The analogy

At a party, you don't listen to everyone equally: you pay attention to whoever says something relevant to you. The transformer gave machines that ability: “attention” lets every word look at all the others and decide which ones matter for its meaning.

In detail

A neural network architecture introduced in 2017 in the paper “Attention Is All You Need”. Its self-attention mechanism processes all positions of a sequence in parallel, rather than one token at a time as earlier recurrent networks did, and relates distant positions to each other directly. That makes it far easier to train at scale. GPT, Claude, Gemini and Llama are all built on the transformer.
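To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not a production implementation: the function names and the three toy 2-d vectors are hypothetical, and the same vectors serve as queries, keys and values, whereas a real transformer derives each from learned projections of the input.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Each argument is a list of vectors (lists of floats), one per position.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this position against every position, itself included.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        row = softmax(scores)
        # The output is the weighted average of all value vectors.
        out = [
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


# Toy, hand-picked vectors (hypothetical, not learned embeddings).
tokens = ["cat", "park", "asleep"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Simplification: reuse the same vectors as queries, keys and values.
outputs, weights = self_attention(vectors, vectors, vectors)
```

With these toy vectors, the row `weights[2]` (for “asleep”) puts more weight on “cat” than on “park”, because the vector chosen for “asleep” points in nearly the same direction as the one for “cat”. Every row is computed independently of the others, which is what allows all positions to be processed in parallel.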

An example

In “The cat we saw yesterday at the park was asleep”, to understand “was asleep” the model needs to connect it to “The cat”, which is separated from it by six other words. Attention makes that link directly, however many words lie in between.
