Transformer - Mora Project

# Transformer **The Transformer** is a neural network architecture built entirely on *self-attention*: instead of reading a sequence step-by-step like earlier RNNs, every token simultaneously looks at every other token and votes on how much to “attend” to each — making processing fully parallel and capturing long-range dependencies that recurrent models struggled with. Introduced in “Attention Is All You Need” (Vaswani et al., Google, 2017), it became the foundation of LLMs ([[gpt|GPT]], [[bert|BERT]], [[claude|Claude]]) and, via **[[vision_transformer|Vision Transformers (ViT)]]** and **[[diffusion_transformer|Diffusion Transformers (DiT)]]**, of modern image generation pipelines that treat image patches as tokens. ^overview > [!example] > Think of it as a room full of people who can all talk to each other at once instead of whispering down a telephone chain. Each “person” (token) shouts its question (query) to the room, listens for relevant answers (keys), then updates what it knows (values) — all simultaneously. Vision models apply the same trick to image patches, letting the model weigh global composition rather than just local pixel edges.