Is Your Diffusion Sampler Actually Correct?
A Sampler-Centric Evaluation of Discrete Diffusion Language Models
Project page and code release.
Luhan Tang1, Longxuan Yu1, Shaorong Zhang1, Greg Ver Steeg1
1University of California, Riverside
Abstract
Discrete diffusion language models (dLLMs) offer fast parallel decoding but are harder to evaluate than autoregressive models, because learned-denoiser approximation error and sampler-induced error are often conflated. We introduce a sampler-centric oracle evaluation framework that replaces learned denoisers with an exact HMM posterior derived from a ground-truth Markov chain, enabling controlled isolation of sampler error under method-consistent settings.
Under this oracle setup, we show that few-step dLLM samplers are not distributionally correct: transition-level mismatch is large at small step counts and shrinks only as the number of diffusion steps approaches the sequence length. We further show that improvements in NLL, GenPPL, or MAUVE do not necessarily imply correct sampling.
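To make the oracle concrete: when the ground truth is a first-order Markov chain and corruption is token masking, the exact posterior over each masked position given the observed tokens follows from the standard forward-backward recursion. A minimal NumPy sketch (the function name and interface here are illustrative, not the repository's API):

```python
import numpy as np

def oracle_posterior(T, pi, seq):
    """Exact per-position posteriors for masked tokens (None) in a
    sequence drawn from a first-order Markov chain with transition
    matrix T (K x K) and initial distribution pi, via forward-backward."""
    K, L = T.shape[0], len(seq)

    def obs(x):  # one-hot for an observed token, all-ones for a mask
        return np.ones(K) if x is None else np.eye(K)[x]

    # Forward pass: alpha[t] proportional to p(x_t, observations <= t)
    alpha = np.zeros((L, K))
    alpha[0] = pi * obs(seq[0])
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ T) * obs(seq[t])

    # Backward pass: beta[t] proportional to p(observations > t | x_t)
    beta = np.ones((L, K))
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (obs(seq[t + 1]) * beta[t + 1])

    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```

For example, with a sticky binary chain `T = [[0.9, 0.1], [0.1, 0.9]]` and the partially masked sequence `[0, None, 0]`, the oracle posterior for the middle token is `0.81/0.82` in favor of symbol 0, matching the closed-form Bayes computation.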
Core Contributions
- Sampler-centric oracle evaluation that isolates sampler-induced error from denoiser approximation.
- Method-consistent comparison across LLaDA/MDLM, SEDD, and ReMDM.
- Transition-level diagnosis via unified metrics, beyond single scalar quality scores.
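One simple instance of a transition-level metric is to compare the empirical bigram transition matrix of generated samples against the ground-truth chain by per-row total variation distance. This sketch is illustrative only; the metrics actually implemented in `sampler/metrics_full.py` may be defined differently:

```python
import numpy as np

def transition_tv(samples, T):
    """Average per-row total variation distance between the empirical
    bigram transition matrix of sampled token sequences and the
    ground-truth Markov transition matrix T. Rows never visited in the
    samples fall back to the uniform distribution."""
    K = T.shape[0]
    counts = np.zeros((K, K))
    for seq in samples:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    row_tot = counts.sum(axis=1, keepdims=True)
    emp = np.divide(counts, row_tot,
                    out=np.full_like(counts, 1.0 / K),
                    where=row_tot > 0)
    # TV(p, q) = 0.5 * sum |p - q|, averaged over rows
    return 0.5 * np.abs(emp - T).sum(axis=1).mean()
```

A metric of this form is zero exactly when the sampler's empirical transitions match the ground-truth chain, which is what lets it expose few-step samplers that score well on scalar quality metrics while still sampling from the wrong distribution.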
Repository Highlights
- exp/run_all.sh
- exp/run_mdlm_llada.sh
- exp/run_sedd.sh
- exp/run_remdm.sh
- sampler/metrics_full.py
- exp/build_tokenizer.sh

Quick Start
From the repository root:
- bash exp/build_gt.sh (text8 GT)
- bash exp/build_tokenizer.sh all (OWT tokenizer + GT)
- bash exp/run_all.sh all 0 512 (run all samplers)