Recently, diffusion language models have grown popular by offering theoretical inference speed-ups and an alternative to autoregressive models (ARMs).

This paper removes reliance on ARMs. While built on a transformer architecture, it relies on bi-directional attention with a masked language model objective (similar to BERT). The goal is not to generate tokens one by one, but to iteratively demask words from a fixed-length sequence.

Implementation

LLaDA diagram