Multimodal Model
Also: vision-language model
An AI that can work with more than just text — typically text and images together, sometimes audio or video too. Claude is multimodal: you can paste an image into your conversation and Claude will describe it, answer questions about it, or analyze what's in it. Multimodal models are why AI can now read screenshots, describe photos, analyze charts, and interpret diagrams — not just process typed words.
In practice
You paste a screenshot of an error message into Claude and ask "what's wrong?" Claude reads the image and explains the error. You upload a chart and ask "what trend does this show?" Claude interprets it. A multimodal model handles more than text — it can reason about images, which is what makes these image-based workflows possible.
Related concepts