Is Your Diffusion Sampler Actually Correct?
A Sampler-Centric Evaluation of Discrete Diffusion Language Models
Project page and code release.
Luhan Tang1, Longxuan Yu1, Shaorong Zhang1, Greg Ver Steeg1
1University of California, Riverside
Abstract
Discrete diffusion language models (dLLMs) offer fast parallel decoding but are harder to evaluate than autoregressive models, because learned-denoiser approximation error and sampler-induced error are often conflated. We introduce a sampler-centric oracle evaluation framework that replaces learned denoisers with an exact HMM posterior derived from a ground-truth Markov chain, enabling controlled isolation of sampler error under method-consistent settings.
Under this oracle setup, we show that few-step dLLM samplers are not distributionally correct: transition-level mismatch is large at small step counts and shrinks only as the number of diffusion steps approaches the sequence length. We further show that improvements in NLL, GenPPL, or MAUVE do not necessarily imply correct sampling.
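To make the oracle concrete: when the ground truth is a first-order Markov chain and corruption is token masking, the exact posterior over each masked position given the observed tokens follows from the standard forward-backward recursion. A minimal NumPy sketch (the function name and interface here are illustrative, not the repository's API):

```python
import numpy as np

def oracle_posterior(T, pi, seq):
    """Exact per-position posteriors for masked tokens (None) in a
    sequence drawn from a first-order Markov chain with transition
    matrix T (K x K) and initial distribution pi, via forward-backward."""
    K, L = T.shape[0], len(seq)

    def obs(x):  # one-hot for an observed token, all-ones for a mask
        return np.ones(K) if x is None else np.eye(K)[x]

    # Forward pass: alpha[t] proportional to p(x_t, observations <= t)
    alpha = np.zeros((L, K))
    alpha[0] = pi * obs(seq[0])
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ T) * obs(seq[t])

    # Backward pass: beta[t] proportional to p(observations > t | x_t)
    beta = np.ones((L, K))
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (obs(seq[t + 1]) * beta[t + 1])

    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```

For example, with a sticky binary chain `T = [[0.9, 0.1], [0.1, 0.9]]` and the partially masked sequence `[0, None, 0]`, the oracle posterior for the middle token is `0.81/0.82` in favor of symbol 0, matching the closed-form Bayes computation.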
Core Contributions
- Sampler-centric oracle evaluation that isolates sampler-induced error from denoiser approximation.
- Method-consistent comparison across LLaDA/MDLM, SEDD, and ReMDM.
- Transition-level diagnosis via unified metrics, beyond single scalar quality scores.
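One simple instance of a transition-level metric is to compare the empirical bigram transition matrix of generated samples against the ground-truth chain by per-row total variation distance. This sketch is illustrative only; the metrics actually implemented in `sampler/metrics_full.py` may be defined differently:

```python
import numpy as np

def transition_tv(samples, T):
    """Average per-row total variation distance between the empirical
    bigram transition matrix of sampled token sequences and the
    ground-truth Markov transition matrix T. Rows never visited in the
    samples fall back to the uniform distribution."""
    K = T.shape[0]
    counts = np.zeros((K, K))
    for seq in samples:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    row_tot = counts.sum(axis=1, keepdims=True)
    emp = np.divide(counts, row_tot,
                    out=np.full_like(counts, 1.0 / K),
                    where=row_tot > 0)
    # TV(p, q) = 0.5 * sum |p - q|, averaged over rows
    return 0.5 * np.abs(emp - T).sum(axis=1).mean()
```

A metric of this form is zero exactly when the sampler's empirical transitions match the ground-truth chain, which is what lets it expose few-step samplers that score well on scalar quality metrics while still sampling from the wrong distribution.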
Repository Highlights
- exp/run_all.sh
- exp/run_mdlm_llada.sh
- exp/run_sedd.sh
- exp/run_remdm.sh
- sampler/metrics_full.py
- exp/build_tokenizer.sh

Quick Start
From the repository root:
- bash exp/build_gt.sh (text8 GT)
- bash exp/build_tokenizer.sh all (OWT tokenizer + GT)
- bash exp/run_all.sh all 0 512 (run all samplers)