Back to the wiki

Multimodal AI

Models that don't just read: they also see, listen and draw.

The analogy

People don't understand the world through reading alone: they look at photos, listen to conversations, interpret charts. Multimodal models make that leap: instead of handling only text, they process images, audio or video as part of the same conversation.

In detail

A multimodal model converts different input types (image, audio, text) into compatible internal representations, usually projecting them into the same space as text tokens. That lets it describe a photo, read a scanned invoice or generate images. The fusion happens inside the model itself, not in separate programs.

An example

You send a photo of your home's breaker panel and ask: “which switch probably tripped if the kitchen has no power?”. The model “looks” at the image, reads the labels and reasons about them.

Related concepts