Recent discrete diffusion language models (DDLMs) such as MDLM and SEDD approach or match similarly sized autoregressive (AR) models in generation quality. Importantly, DDLMs can generate tokens in any order and in parallel, unlike AR models.
However, sampling from DDLMs requires thousands of steps to achieve good performance. Additionally, since DDLMs use bidirectional attention, KV caching does not apply.
To reduce the number of sampling steps while retaining performance, we propose Self-Distillation
Through Time (SDTT), a novel distillation method for DDLMs.
Most off-the-shelf distillation methods for continuous diffusion rely on deterministic mappings from noise to images, such as DDIM; SDTT requires no such mapping. We demonstrate that SDTT reduces the number of sampling steps for pre-trained MDLMs by a factor of 32 to 64.
Importantly, using only 32 sampling steps, our final student generates samples with lower perplexity than GPT-2 with nucleus sampling.
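To make the step-distillation idea concrete, the following is a minimal toy sketch, not the paper's implementation: all names (`denoiser_logits`, `sdtt_targets`, the two-step teacher rollout, the KL objective) are illustrative assumptions about how a frozen teacher's multi-step output could serve as a one-step training target for the student.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ, MASK = 8, 6, 0  # toy vocabulary; token 0 acts as [MASK]

def denoiser_logits(x, params):
    # stand-in for a masked-diffusion denoiser: per-position logits
    # over the vocabulary (here just a random linear map, for illustration)
    onehot = np.eye(VOCAB)[x]        # (SEQ, VOCAB)
    return onehot @ params           # (SEQ, VOCAB)

def one_denoise_step(x, params):
    # unmask masked positions with the denoiser's argmax prediction
    preds = denoiser_logits(x, params).argmax(-1)
    return np.where(x == MASK, preds, x)

def sdtt_targets(x_t, teacher_params):
    # hypothetical distillation target: run the frozen teacher for TWO
    # sampling steps and record its logits; the student learns to match
    # them in ONE step, halving the step count per distillation round
    x_mid = one_denoise_step(x_t, teacher_params)
    return denoiser_logits(x_mid, teacher_params)

def log_softmax(z):
    z = z - z.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

def kl_to_teacher(student_logits, teacher_logits):
    # KL(teacher || student), averaged over sequence positions
    lp_t, lp_s = log_softmax(teacher_logits), log_softmax(student_logits)
    return (np.exp(lp_t) * (lp_t - lp_s)).sum(-1).mean()

params = rng.normal(size=(VOCAB, VOCAB))
x_t = rng.integers(1, VOCAB, size=SEQ)
x_t[:3] = MASK                       # partially masked sequence
loss = kl_to_teacher(denoiser_logits(x_t, params), sdtt_targets(x_t, params))
```

Repeating such rounds, each halving the required steps, would compound to the 32- to 64-fold reductions reported above; the actual loss and rollout length used by SDTT are specified in the paper, not here.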
Our method is simple to implement and relatively cheap to run. Additionally, we release training and evaluation code along with the distilled models.
Recent studies have shown that the performance of a fixed model can be improved by scaling up compute at inference time. In this work, we instead improve the decoding speed of LLMs by moving away from AR modeling.