# Vision Transformer (ViT)

**Vision Transformer (ViT)** adapts the [[Transformer]] architecture to images by slicing each image into fixed-size patches and treating each patch as a token, so the same self-attention mechanism used for language operates on a sequence of image patches. Introduced in "An Image is Worth 16×16 Words" (Dosovitskiy et al., 2020), it is now a common image backbone in multimodal models. ^overview
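
The patch-to-token step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation: the function names `patchify` and `embed_patches`, the 224×224 input size, the embedding width of 768, and the random weights are all assumptions chosen for the example. It also leaves out the class token and position embeddings that the full model adds before the Transformer layers.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch to a token embedding."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # hypothetical input image
patches = patchify(image, patch_size=16)

embed_dim = 768                        # assumed embedding width for this sketch
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)

tokens = embed_patches(patches, weight, bias)
print(patches.shape, tokens.shape)
```

With 16×16 patches, a 224×224 RGB image yields a 14×14 grid, so the sequence has 196 tokens, each built from 16 · 16 · 3 = 768 pixel values. In a trained model the projection weights are learned rather than random.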