Skip to content

KfirCohen-PyLab/Diffusion-BERT

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Diffusion-BERT Text Generation Project


Official implementation of discrete diffusion models for text generation

Update

2025.06.02 Optimized sampling parameters with --topk 100 --temperature 1.5 configuration
2025.05.31 Added multi-GPU training support via DDP_main.py

Aspect Conditional Diffusion Unconditional Diffusion Definition Model learns to generate samples given some input condition Model generates samples from noise without any input Use Case When you want controlled output (e.g. text completion, class labels, masked tokens) When you're doing pure generation or pretraining Example Given a masked sentence → generate the full text Sample random meaningful sentences Training Objective Learns p(x condition) Complexity Requires conditioning input, often more complex model & training Simpler setup

Abstract

This project implements a modified version of DiffusionBERT, combining BERT's language understanding with discrete diffusion models for text generation. Key improvements include:

  1. Dynamic noise scheduling based on token information
  2. Enhanced time-step integration in transformer layers
  3. Optimized sampling with top-k/top-p filtering

Environment Setup

  1. Create conda environment:
conda create --name DB python=3.10
conda activate DB
pip install -r ML_requirements.txt

Required hardware:

NVIDIA GPU (RTX 3090 recommended)

CUDA 11.8+

Data Preparation
For unconditional generation:

bash
python generate_word_freq.py --data_dir ./data --output ./word_freqs/freqs.pt

For conditional tasks (optional):

bash
wget https://huggingface.co/datasets/librispeech_asr/resolve/main/data/train-*.jsonl -P ./conditional_data/


######### Training ##########
num_steps = ceil(dataset_size / batch_size)

Example:
Suppose:

Dataset size = 1,000 samples

Batch size = 100

Then:

1 epoch = 10 steps (because 1,000 / 100 = 10)

After 1 epoch, the model has seen all 1,000 samples once.

Why train for multiple epochs?
Training for multiple epochs allows the model to learn better. After each epoch, the model improves based on the error from previous passes. Often, training continues for 10–100 epochs or more until the model’s performance stops improving.

The reverse part is with KL divergence over full logits.

Single-GPU training:

Quora Question Pairs (QQP) dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other.

#### for uncodintional: ##### generate words non context
bash
python DDP_main.py \
  --lr 3e-5 \
  --num_steps 300 \
  --hybrid_lambda 0.05 \
  --output_dir ./checkpoints

#### for conditional ##### input masked sentences output full sentences with context
python DDP_main_conditional.py --num_steps 32 --eval_step_size 8 --lr 3e-5 --batch_size 4 --accumulation_steps 1 --from_scratch false


Sampling
Basic generation:

bash
python tester.py \
  --checkpoint_dir ./model_name_bert-base-uncased_lr_3e-05_..._ckpts \
  --topk 100 \
  --temperature 1.5 \
  --output ./generation_results/samples.txt

Conditional generation:

bash
python predict_downstream_condition.py \
  --mbr_size 5 \
  --step_size 2

###### Evaluation ######

Compute metrics:

bash
python compute_metric.py \
  --generated ./generation_results/samples.txt \
  --reference ./data/test.txt

Calculate perplexity:
for evaluation!!
bash
python compute_elbo.py 


Best Practices
Recommended sampling args:

bash
--topk 100 --topp 0.95 --temperature 1.5 --t_start 128
Monitoring:

Training logs: /logs directory

GPU utilization: nvidia-smi -l 1

Troubleshooting
Q: CUDA memory errors?
A: Reduce batch size or enable gradient accumulation:

bash
--batch_size 16 --grad_accum_steps 4
Q: Generated text quality issues?
A: Try adjusting temperature:

bash
--temperature 1.2  # more conservative
--temperature 1.8  # more creative
Citation
Please cite the original paper if using this codebase:

@article{he2022diffusionbert,
  title={DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models},
  author={He, Zhengfu and Sun, Tianxiang and Wang, Kuanning and Huang, Xuanjing and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2211.15029},
  year={2022}
}
Contact: [your-email@university.edu] | [Project Issues]


This version:
1. Maintains the exact section flow of the original
2. Incorporates your specific file paths and parameters
3. Adds your optimized configurations (--topk 100 etc.)
4. Includes practical troubleshooting from your experience
5. Keeps all code blocks executable as-is

About

ACL'2023: DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 96.9%
  • Jupyter Notebook 0.8%
  • C 0.7%
  • XSLT 0.7%
  • Cuda 0.4%
  • Cython 0.2%
  • Other 0.3%