dLLM Sampler

ICML 2026

Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models

Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg

University of California, Riverside

Figure: Oracle sampler evaluation pipeline, comparing learned denoisers, the oracle HMM posterior, samplers, and transition metrics.
A sampler-centric evaluation isolates sampling dynamics by replacing learned denoisers with an exact HMM posterior, then measuring distributional mismatch beyond surface quality metrics.

Abstract

Sampler behavior matters, even with an oracle denoiser.

Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler-induced error from the sampling dynamics, a problem that does not arise for ARMs, whose autoregressive sampling exactly reflects the learned probability model.

We introduce a sampler-centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error in a controlled setting. We show that few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition-level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log-likelihood, generative perplexity, or MAUVE do not imply correct sampling.

Motivation

Current metrics may look strong without validating sampler correctness.

Why current metrics are hard to trust

Learned-denoiser approximation and sampler-induced bias can both affect final samples, so metrics such as NLL, GenPPL, and MAUVE do not isolate the source of error. As a result, it remains unclear whether strong metric values actually indicate correct sampler behavior.

What the oracle setup isolates

The learned denoiser is replaced by an exact posterior, leaving the sampler dynamics as the only source of distributional deviation.

Method

A controlled framework for sampler-centric evaluation.

The data distribution is defined as a discrete Markov chain. Under forward masking, exact posteriors can be computed with Hidden Markov Model inference, enabling method-consistent oracle versions of samplers such as SEDD, MDLM/LLaDA, and ReMDM.
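
As a minimal sketch of the kind of inference this relies on (not the repository's sampler/oracle_hmm_posterior.py, whose interface is not shown here), the following forward-backward pass computes exact posterior marginals for a first-order Markov chain in which some positions are masked. The function name masked_chain_posterior, the use of None to mark a masked position, and the plain-list representation of the chain are assumptions made for illustration.

```python
def masked_chain_posterior(pi, P, obs):
    """Posterior marginals p(x_i = k | revealed tokens) for a first-order Markov chain.

    pi:  initial distribution (list of K probabilities)
    P:   row-stochastic K x K transition matrix (list of lists)
    obs: list of token ids, with None marking a masked position
    Assumes the revealed tokens have nonzero probability under the chain.
    """
    K, L = len(pi), len(obs)

    def emit(i, k):
        # A masked position carries no evidence; a revealed one is a hard constraint.
        return 1.0 if obs[i] is None or obs[i] == k else 0.0

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Forward pass: alpha[i][k] is proportional to p(evidence up to i, x_i = k).
    alpha = [normalize([pi[k] * emit(0, k) for k in range(K)])]
    for i in range(1, L):
        alpha.append(normalize([
            emit(i, k) * sum(alpha[i - 1][j] * P[j][k] for j in range(K))
            for k in range(K)
        ]))

    # Backward pass: beta[i][k] is proportional to p(evidence after i | x_i = k).
    beta = [[1.0] * K for _ in range(L)]
    for i in range(L - 2, -1, -1):
        beta[i] = normalize([
            sum(P[k][j] * emit(i + 1, j) * beta[i + 1][j] for j in range(K))
            for k in range(K)
        ])

    # Combine both directions and renormalize per position.
    return [normalize([alpha[i][k] * beta[i][k] for k in range(K)]) for i in range(L)]
```

For example, with pi = [0.5, 0.5], P = [[0.9, 0.1], [0.1, 0.9]], and obs = [0, None, 0], the masked middle position receives probability 0.81 / 0.82 for state 0, because both neighbors favor staying in the same state.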

Oracle posterior

sampler/oracle_hmm_posterior.py

Unified metrics

sampler/metrics_full.py

Experiment runners

exp/run_all.sh
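
To make "transition-level" concrete, here is one possible form such a metric could take. This is a hypothetical sketch and not the contents of sampler/metrics_full.py: it estimates a bigram transition matrix from generated sequences and compares it, row by row, with the ground-truth chain using total variation distance. The function names and the choice of total variation are assumptions.

```python
from collections import Counter


def empirical_transitions(samples, K):
    """Row-normalized bigram counts estimated from generated sequences."""
    counts = Counter((a, b) for seq in samples for a, b in zip(seq, seq[1:]))
    rows = []
    for a in range(K):
        total = sum(counts[(a, b)] for b in range(K))
        # Rows for tokens that never appear as a left neighbor are left as zeros.
        rows.append([counts[(a, b)] / total if total else 0.0 for b in range(K)])
    return rows


def transition_tv(P_true, P_hat):
    """Mean over rows of the total variation distance between transition rows."""
    K = len(P_true)
    return sum(
        0.5 * sum(abs(P_true[a][b] - P_hat[a][b]) for b in range(K))
        for a in range(K)
    ) / K
```

A sampler that matches the ground-truth chain should drive this value toward zero as the number of samples grows; a value that stays away from zero indicates that the generated sequences follow different local dynamics, whatever their surface quality.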

Results

Correct-looking samples are not always correct samples.

Few-step bias persists

Transition-level mismatch remains substantial at small step counts, even with exact denoising.

More steps reduce mismatch

Distributional error vanishes only as the number of diffusion steps approaches the sequence length.

Quality scores are incomplete

Better NLL, GenPPL, or MAUVE does not necessarily imply sampler correctness.
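
A two-token toy case illustrates why an exact denoiser is not sufficient on its own. This is an illustrative sketch, not one of the paper's experiments: it assumes a fully masked length-2 sequence and a sampler that unmasks both positions in a single step, drawing each token independently from its exact posterior marginal. The resulting joint is a product of marginals, which differs from the true chain whenever neighboring tokens are correlated. All function names and the two-state chain are assumptions.

```python
def joint_true(pi, P):
    """Exact joint p(x0 = a, x1 = b) of the ground-truth chain."""
    K = len(pi)
    return [[pi[a] * P[a][b] for b in range(K)] for a in range(K)]


def joint_one_step(pi, P):
    """Joint produced by unmasking both positions in one parallel step."""
    K = len(pi)
    # With both positions masked, the exact posterior marginals are pi and pi @ P.
    m0 = list(pi)
    m1 = [sum(pi[a] * P[a][b] for a in range(K)) for b in range(K)]
    # Independent draws per position give a product distribution.
    return [[m0[a] * m1[b] for b in range(K)] for a in range(K)]


def implied_transition(joint):
    """Conditional q(x1 = b | x0 = a) implied by a joint over two tokens."""
    return [[v / sum(row) for v in row] for row in joint]


def total_variation(p, q):
    """Total variation distance between two joints given as nested lists."""
    return 0.5 * sum(abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq))


pi = [0.5, 0.5]
P = [[0.9, 0.1], [0.1, 0.9]]
p = joint_true(pi, P)
q = joint_one_step(pi, P)
tv = total_variation(p, q)          # 0.4 for this chain
q_trans = implied_transition(q)     # every entry is 0.5, versus 0.9 / 0.1 in P
```

Unmasking the same two tokens over two steps, with the second drawn from the posterior conditioned on the first, would recover the exact joint. This matches the pattern above, where the mismatch disappears only once the step count reaches the sequence length.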

Documentation

Reproduce the sampler evaluations.

The repository includes ground-truth construction, sampler runners, and unified evaluation scripts for text8 and OpenWebText-style experiments.

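# Create and activate a virtual environment, then install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Build the ground truth, then run the experiments on text8
bash exp/build_gt.sh
bash exp/run_all.sh text8 0 512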

For the complete workflow, see README.md and TUTORIAL.md.

Citation

If you find our paper helpful, please cite this work.

@article{tang2026your,
  title={Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models},
  author={Tang, Luhan and Yu, Longxuan and Zhang, Shaorong and Ver Steeg, Greg},
  journal={arXiv preprint arXiv:2602.19619},
  year={2026}
}