
🧑‍💻 CodeT5 for Code-to-Docstring Generation

This project is a mini-project for the NLP course (Master 1, AI, Semester 2) focused on applying and fine-tuning the CodeT5 model to generate natural language docstrings from source code snippets.

In addition to implementing the model, I studied and summarized the CodeT5 research paper to make sure I understood the architecture, training methodology, and evaluation metrics used in the original work.


📌 Project Overview

Automatic code documentation is an important challenge in software engineering. Large models such as CodeT5 already achieve strong performance in code summarization tasks. However, fine-tuning on a carefully prepared dataset can adapt the model to a specific domain and improve the quality of generated docstrings.

In this project:

  • I fine-tuned CodeT5-small on a dataset of 40k code-docstring pairs (see the training sketch after this list).
  • I compared pre-trained performance vs. fine-tuned performance using multiple metrics.
  • I implemented an interactive CLI tool where users can paste code and instantly get generated docstrings.
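
As a rough illustration of that fine-tuning step, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The hyperparameters and the pairs.csv file name are assumptions for the example, not the exact values used in this project:

# Minimal fine-tuning sketch (illustrative; assumes a recent transformers release).
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")

# Hypothetical CSV with "code" and "docstring" columns.
dataset = load_dataset("csv", data_files="pairs.csv")

def preprocess(batch):
    # Tokenize the code as the encoder input and the docstring as the target.
    inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["docstring"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["code", "docstring"])

args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",          # matches the --model_dir used below
    per_device_train_batch_size=16,      # assumed batch size
    num_train_epochs=3,                  # assumed epoch count
    learning_rate=5e-5,                  # assumed learning rate
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()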

⚙️ Installation

# Clone repository
git clone https://github.com/diaazg/code-comment.git
cd code-comment

# Create environment
conda create -n codet5-docstring python=3.10 -y
conda activate codet5-docstring

# Install dependencies
pip install -r requirements.txt

🚀 Usage

Run the interactive CLI tool:

python main.py --model_dir ./checkpoints --device mps  # use cuda, mps, or cpu depending on your device
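
The interactive loop in main.py might look roughly like the sketch below. The argument names follow the command above; the rest is an assumption about the implementation, not the exact script:

# Sketch of an interactive docstring-generation CLI (illustrative).
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", default="./checkpoints")
parser.add_argument("--device", default="cpu")  # cuda, mps, or cpu
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(args.model_dir).to(args.device)

while True:
    code = input("Paste code (or 'quit' to exit): ")
    if code.strip().lower() == "quit":
        break
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512).to(args.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_length=128, num_beams=4)
    print(tokenizer.decode(output[0], skip_special_tokens=True))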

📊 Evaluation

I evaluated the model on 4,000 samples before and after fine-tuning.

Before Fine-Tuning (Pre-trained CodeT5)

  • ROUGE-1: ~0.05
  • METEOR: ~0.027
  • BERTScore (F1): ~0.78

After Fine-Tuning

  • ROUGE-1: ~0.34
  • METEOR: ~0.24
  • BERTScore (F1): ~0.87

✅ Fine-tuning led to a substantial improvement, especially in semantic similarity (BERTScore), showing that the model better captures the meaning of reference docstrings.
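
All three metrics can be computed with the Hugging Face evaluate library; the snippet below is an illustrative sketch (it requires the rouge_score, nltk, and bert_score packages), with toy predictions and references standing in for the real 4,000-sample evaluation:

# Metric computation sketch (illustrative inputs).
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

preds = ["Return the sum of a and b."]       # model-generated docstrings
refs = ["Returns the sum of two numbers."]   # reference docstrings

print(rouge.compute(predictions=preds, references=refs)["rouge1"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])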

🙏 Acknowledgments

This project is based on the CodeT5 model introduced in:

Wang et al., CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021.
