VLAs - Pixels to Actions

On the cover: Pixels to actions In our last post, we taught a Transformer to see and speak. We bolted a vision encoder onto an LLM, projected pixel-patches into the language space, and tricked the text model into halluci...
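The "bolt a vision encoder onto an LLM" recipe in this teaser boils down to one learned projection. A minimal NumPy sketch (all dimensions and names here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Outputs of a (frozen) vision encoder: one embedding per image patch.
patch_embeds = rng.normal(size=(196, 768))
# Embedded text prompt tokens, already in the LLM's hidden dimension.
text_embeds = rng.normal(size=(12, 4096))

# The learned "projector": maps vision features into the LLM's token space.
W_proj = rng.normal(size=(768, 4096))

visual_tokens = patch_embeds @ W_proj        # pixels, now in the language space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (208, 4096): 196 visual tokens + 12 text tokens
```

From the LLM's point of view, the 196 projected vectors are just more tokens in its input sequence.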

VLMs - Pixels to Tokens

On the cover: Decorative In our last post, we spoke about how the Vision Transformer (ViT) took the candy away from CNNs. We learned that if you slice an image into patches and flatten them, you can treat an image just like a sentence. And a sequ...
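The patch trick this teaser describes fits in a few lines of NumPy. A sketch assuming a square image whose side is divisible by the patch size (the sizes below are the standard ViT-Base values, used only for illustration):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Slice an (H, W, C) image into flattened patches, one per row."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Cut the grid: (rows, patch_h, cols, patch_w, C) -> (rows, cols, patch_h, patch_w, C)
    patches = image.reshape(
        h // patch_size, patch_size, w // patch_size, patch_size, c
    ).transpose(0, 2, 1, 3, 4)
    # Each patch becomes one "token": a flat vector of patch_size*patch_size*c values.
    return patches.reshape(-1, patch_size * patch_size * c)

tokens = patchify(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each flattened
```

The result is a sequence of 196 vectors, and a sequence of vectors is exactly what a Transformer expects.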

ViT - Pixels to Tokens

On the cover: Decorative For nearly a decade, if you wanted a computer to identify a cat in a picture, you had one reliable tool: The Convolutional Neural Network (CNN). CNNs were the undisputed kings of Computer Vision. They were dependable, the...

Cross Entropy Loss

On the cover: A weather forecasting stone meme Imagine you live, work and vacation in three different cities. In City A (home), it is sunny every single day. No clouds. No rain. Ever. You wake up, glance outside, and already know the answer. Bor...
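The three-city setup is heading toward entropy: a perfectly predictable city carries zero surprise, a maximally uncertain one carries the most. A quick sketch of that idea (my framing of where the excerpt is going, using the standard entropy formula H(p) = -Σ p·log₂(p)):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

city_a = [1.0, 0.0, 0.0]     # always sunny: nothing to learn from a forecast
city_c = [1/3, 1/3, 1/3]     # anything goes: maximal uncertainty

print(entropy(city_a))  # 0.0 bits
print(entropy(city_c))  # ~1.585 bits (log2 of 3)
```

Cross-entropy extends this by measuring the surprise when you predict with one distribution while the weather actually follows another.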

MoE - Is Attention All You Really Need?

On the cover: A bunch of different robots, each an expert at a different task It’s been 8 years since the landmark paper “Attention Is All You Need” was published. The paper introduced the Transformer, an architecture built entirely around the attention mechanism, which has revolutionized the field of ...