VLMs - Pixels to Tokens
On the cover: Pixels to tokens

In our last post, we spoke about how the Vision Transformer (ViT) took the candy away from CNNs. We learned that if you slice an image into patches and flatten them, you can treat an image just like a sentence. And ...
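To make the "slice into patches and flatten" idea concrete, here is a minimal NumPy sketch. The function name `patchify`, the 224x224 RGB input, and the 16-pixel patch size are illustrative assumptions, not something fixed by this post; a real ViT would also pass each flattened patch through a learned linear projection, which is left out here.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration only.
    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Carve the pixel grid into blocks: (rows, P, cols, P, C) -> (rows, cols, P, P, C)
    blocks = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # Flatten every block into one vector, giving one "token" per patch
    return blocks.reshape(rows * cols, patch_size * patch_size * c)


# Assumed input size and patch size, chosen only for the example
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` plays the role of one word in the sentence: 196 patches, each a 768-number vector, read left to right and top to bottom.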