What happens when you paste a screenshot and ask questions in an LLM?
When conversing with an LLM (Claude, Cursor, ChatGPT), I often paste a screenshot as a reference, to provide context and ask questions. I know that, ultimately, it's all pixels and bits. But how does this work? Do LLMs run image processing to extract text, translate it into word vectors, and then answer the questions, or do they go into a different mode? I find this kind of interaction with the machine mind-blowing.
Multimodal models are trained to understand encoded images. It really is magic. Base64 is a binary-to-text encoding scheme that represents image bytes as a printable ASCII string, typically wrapped in a data URI formatted as data:[<mediatype>][;base64],<data>. We think of LLMs as only good at text, but any structured data is predictable. As long as the input can be turned into an N-dimensional vector that represents a complex idea in the LLM's hidden weights, the model treats it essentially like text. With sufficient training data, it understands what to us looks like noise.
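To make the encoding step concrete, here is a minimal sketch of turning image bytes into a base64 data URI, as described above. This is illustrative only, not any particular API's exact request format, and the fake PNG bytes are just a stand-in for a real screenshot file you would read from disk:

```python
import base64

# Stand-in "image": in practice you'd use real bytes, e.g.
# open("screenshot.png", "rb").read(). These are just a PNG
# signature plus padding, for illustration.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# Base64-encode the raw binary into printable ASCII characters
b64 = base64.b64encode(image_bytes).decode("ascii")

# Wrap it in a data URI: data:[<mediatype>][;base64],<data>
data_uri = f"data:image/png;base64,{b64}"

print(data_uri)
```

Decoding the base64 portion recovers the original bytes exactly, which is why this "noise" is a lossless way to ship an image inside a text payload.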
This is super interesting - LLMs do not distinguish between pictures/screenshots and text - all are vectorized. LLMs process everything together, and it is all part of the same thinking process - it is magic, and a breakthrough. My guess is that this was not by design but a nice after-effect of the core attention design. A lot of papers have been written on it - you will find it a very interesting read.
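A toy sketch of the "all are vectorized" point above (illustrative only; real models use learned vision encoders, and every size and projection here is made up): image patches get projected into the same embedding space as text tokens, and the attention layers then see one combined sequence, with nothing marking which vectors came from pixels:

```python
import numpy as np

d_model = 8  # embedding width (toy size)
rng = np.random.default_rng(0)

# Five "text tokens", each already embedded as a d_model vector
text_embeddings = rng.normal(size=(5, d_model))

# A 4x4 "image" cut into four 2x2 patches, flattened, then linearly
# projected into the same d_model space (the vision-encoder step)
image = rng.normal(size=(4, 4))
patches = image.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, -1)
projection = rng.normal(size=(patches.shape[1], d_model))
image_embeddings = patches @ projection  # shape (4, d_model)

# Downstream attention operates on the concatenated sequence; the rows
# from the image and the rows from the text are indistinguishable.
sequence = np.concatenate([image_embeddings, text_embeddings], axis=0)
print(sequence.shape)  # (9, 8)
```

The point of the sketch: once both modalities live in the same vector space, "processing everything together" falls out of the ordinary transformer machinery rather than any image-specific mode.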