Multimodal AI
Models that don't just read: they also see, listen and draw.
The analogy
People don't understand the world through reading alone: they look at photos, listen to conversations, interpret charts. Multimodal models make that leap: instead of handling only text, they process images, audio or video as part of the same conversation.
In detail
A multimodal model converts different input types (image, audio, text) into compatible internal representations, usually by projecting them into the same embedding space as text tokens. That lets it describe a photo or read a scanned invoice, and in some models generate images as well. The fusion happens inside the model itself, not in separate programs.
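To make the "same space as text tokens" idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any particular model is implemented: the sizes (D_MODEL, VOCAB_SIZE, PATCH_DIM), the function names and the random weights are all assumptions made up for this example. In a real model the weights are learned and the image features come from a vision encoder.

```python
import random

random.seed(0)

D_MODEL = 8      # hypothetical shared embedding width
VOCAB_SIZE = 50  # hypothetical toy vocabulary size
PATCH_DIM = 12   # hypothetical length of one flattened image-patch feature


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of random floats."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Text side: one D_MODEL-wide vector per token id.
token_embeddings = random_matrix(VOCAB_SIZE, D_MODEL)
# Image side: a linear map from patch features to the same width.
image_projection = random_matrix(PATCH_DIM, D_MODEL)


def embed_text(token_ids):
    """Look up one embedding vector per token id."""
    return [token_embeddings[t] for t in token_ids]


def embed_image(patches):
    """Project each image patch into the text embedding space."""
    return [
        [sum(p[i] * image_projection[i][j] for i in range(PATCH_DIM))
         for j in range(D_MODEL)]
        for p in patches
    ]


def build_sequence(patches, token_ids):
    """Image vectors followed by text vectors, as one input sequence."""
    return embed_image(patches) + embed_text(token_ids)


patches = random_matrix(4, PATCH_DIM)  # 4 made-up image patches
question = [3, 17, 42]                 # 3 made-up token ids
sequence = build_sequence(patches, question)
print(len(sequence), len(sequence[0]))  # 7 8
```

The point of the sketch is the final shape: four image patches and three text tokens end up as seven vectors of the same width, so the rest of the model can treat them as one sequence.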
An example
You send a photo of your home's breaker panel and ask, “Which switch probably tripped if the kitchen has no power?” The model “looks” at the image, reads the labels and reasons about them.