Multimodal AI — Promptpedia

01

The analogy

People don't understand the world through reading alone: they look at photos, listen to conversations, interpret charts. Multimodal models make that leap: instead of handling only text, they process images, audio or video as part of the same conversation.

02

In detail

A multimodal model converts different input types (image, audio, text) into compatible internal representations, usually projecting them into the same space as text tokens. That lets it describe a photo, read a scanned invoice or generate images. The fusion happens inside the model itself, not in separate programs.

03

An example

An example Promptpedia

You send a photo of your home's breaker panel and ask: “which switch probably tripped if the kitchen has no power?”. The model “looks” at the image, reads the labels and reasons about them.

04

Embeddings LLM (large language model) AI agents