Is Your Diffusion Sampler Actually Correct?
A Sampler-Centric Evaluation of Discrete Diffusion Language Models

Project page and code release.

Luhan Tang1, Longxuan Yu1, Shaorong Zhang1, Greg Ver Steeg1
1University of California, Riverside

Abstract

Discrete diffusion language models (dLLMs) offer fast parallel decoding but are harder to evaluate than autoregressive models, because learned-denoiser approximation error and sampler-induced error are often conflated. We introduce a sampler-centric oracle evaluation framework that replaces learned denoisers with an exact HMM posterior derived from a ground-truth Markov chain, enabling controlled isolation of sampler error under method-consistent settings.
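The oracle idea can be sketched in a few lines. For a first-order Markov chain, the exact posterior over a single masked token given its observed neighbors factorizes through the transition matrix; this toy snippet (the function name and the 3-state chain are illustrative, not the released code) shows the computation the oracle denoiser performs in place of a learned network:

```python
import numpy as np

def masked_token_posterior(T, left, right):
    """Exact posterior over one masked token x_i in a first-order Markov
    chain, given observed neighbors x_{i-1}=left and x_{i+1}=right:
        p(x_i = k | left, right) ∝ T[left, k] * T[k, right]
    """
    scores = T[left, :] * T[:, right]
    return scores / scores.sum()

# Toy 3-state ground-truth chain (each row sums to 1).
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

post = masked_token_posterior(T, left=0, right=2)
```

Because the posterior is exact, any deviation of generated samples from the ground-truth chain is attributable to the sampler, not to denoiser approximation error.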

Under this oracle setup, we show that few-step dLLM samplers are not distributionally correct: transition-level mismatch is large at small step counts and shrinks only as diffusion steps approach sequence length. We further show that improvements in NLL, GenPPL, or MAUVE do not necessarily imply correct sampling.
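One plausible way to instantiate "transition-level mismatch" is the average total-variation distance between the empirical transition matrix of generated samples and the ground-truth chain. The function below is an illustrative assumption, not necessarily the paper's exact metric (the released implementation lives in sampler/metrics_full.py):

```python
import numpy as np

def transition_tv(samples, T):
    """Average per-row total-variation distance between the empirical
    transition matrix of `samples` and the ground-truth chain T."""
    K = T.shape[0]
    counts = np.zeros((K, K))
    for seq in samples:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    # Rows never visited in the samples fall back to the uniform distribution.
    emp = np.divide(counts, rows, out=np.full((K, K), 1.0 / K), where=rows > 0)
    return 0.5 * np.abs(emp - T).sum(axis=1).mean()

# Toy check: samples that overuse the 0 -> 0 transition show nonzero mismatch.
T = np.array([[0.5, 0.5],
              [0.5, 0.5]])
mismatch = transition_tv([[0, 0, 0]], T)
```

Under the oracle, this quantity can be tracked as a function of the number of diffusion steps, making the few-step correctness gap directly measurable.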

Core Contributions

- A sampler-centric oracle evaluation framework that replaces learned denoisers with the exact HMM posterior of a ground-truth Markov chain, isolating sampler-induced error from denoiser approximation error.
- Evidence that few-step dLLM samplers are not distributionally correct: transition-level mismatch is large at small step counts and shrinks only as diffusion steps approach sequence length.
- A demonstration that improvements in NLL, GenPPL, or MAUVE do not necessarily imply correct sampling.

Repository Highlights

Experiment entry: exp/run_all.sh
LLaDA/MDLM: exp/run_mdlm_llada.sh
SEDD: exp/run_sedd.sh
ReMDM: exp/run_remdm.sh
Metrics: sampler/metrics_full.py
OWT tokenizer + ground truth (GT): exp/build_tokenizer.sh

Quick Start

From the repository root:
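A plausible invocation sequence, inferred from the scripts listed under Repository Highlights (check each script's header for required arguments and environment setup):

```shell
# Build the OWT tokenizer and the ground-truth chain first.
bash exp/build_tokenizer.sh

# Then run the full experiment suite...
bash exp/run_all.sh

# ...or a single sampler family, e.g. SEDD.
bash exp/run_sedd.sh
```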