Paper: Large Language Diffusion Models (LLaDA)
Diffusion language models have recently gained popularity as an alternative to autoregressive models (ARMs), in part because they promise faster inference in theory.
This paper drops the autoregressive formulation altogether. LLaDA is still built on a transformer, but it uses bidirectional attention (no causal mask) and a masked language modeling objective similar to BERT's. Instead of generating tokens one at a time from left to right, it iteratively unmasks tokens in a fixed-length sequence.
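To make the iterative unmasking idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation. The names `toy_predictor` and `iterative_unmask` are hypothetical, the predictor is a random stand-in for a real bidirectional transformer, and keeping the most confident guesses at each step is just one possible selection rule, chosen here for illustration.

```python
import random

MASK = "<mask>"  # placeholder symbol for a position that is still hidden


def toy_predictor(tokens, vocab, rng):
    """Hypothetical stand-in for a bidirectional mask predictor.

    A real model would read the whole sequence at once and score every
    masked position. Here we return a random (token, confidence) guess
    for each masked index.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def iterative_unmask(length, steps, vocab, seed=0):
    """Fill a fixed-length, fully masked sequence over a few steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        guesses = toy_predictor(tokens, vocab, rng)
        if not guesses:
            break  # nothing left to unmask
        # Spread the remaining masked positions over the remaining steps
        # (ceiling division), so the final step reveals everything left.
        remaining_steps = steps - step
        k = -(-len(guesses) // remaining_steps)
        # Assumption: commit the k most confident guesses this step.
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(length=8, steps=4, vocab=["the", "cat", "sat", "down"]))
```

Note that several positions are filled per step and in no particular order, which is where the potential speed-up over one-token-at-a-time decoding would come from.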
Implementation

