
🧑‍💻 CodeT5 for Code-to-Docstring Generation

This project is a mini-project for the NLP course (Master 1, AI, Semester 2) focused on applying and fine-tuning the CodeT5 model to generate natural language docstrings from source code snippets.

In addition to implementing the model, I studied and summarized the CodeT5 research paper to make sure I understood the architecture, training methodology, and evaluation metrics used in the original work.


📌 Project Overview

Automatic code documentation is an important challenge in software engineering. Large models such as CodeT5 already achieve strong performance in code summarization tasks. However, fine-tuning on a carefully prepared dataset can adapt the model to a specific domain and improve the quality of generated docstrings.

In this project:

  • I fine-tuned CodeT5-small on a dataset of 40k code-docstring pairs (see the training sketch after this list).
  • I compared pre-trained performance vs. fine-tuned performance using multiple metrics.
  • I implemented an interactive CLI tool where users can paste code and instantly get generated docstrings.
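
As a rough illustration of that fine-tuning step, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The hyperparameters and the pairs.csv file name are assumptions for the example, not the exact values used in this project:

# Minimal fine-tuning sketch (illustrative; assumes a recent transformers release).
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")

# Hypothetical CSV with "code" and "docstring" columns.
dataset = load_dataset("csv", data_files="pairs.csv")

def preprocess(batch):
    # Tokenize the code as the encoder input and the docstring as the target.
    inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["docstring"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["code", "docstring"])

args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",          # matches the --model_dir used below
    per_device_train_batch_size=16,      # assumed batch size
    num_train_epochs=3,                  # assumed epoch count
    learning_rate=5e-5,                  # assumed learning rate
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()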

⚙️ Installation

# Clone repository
git clone https://github.com/diaazg/code-comment.git
cd code-comment

# Create environment
conda create -n codet5-docstring python=3.10 -y
conda activate codet5-docstring

# Install dependencies
pip install -r requirements.txt

🚀 Usage

Run the interactive CLI tool:

python main.py --model_dir ./checkpoints --device mps  # use cuda, mps, or cpu depending on your device
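
The interactive loop in main.py might look roughly like the sketch below. The argument names follow the command above; the rest is an assumption about the implementation, not the exact script:

# Sketch of an interactive docstring-generation CLI (illustrative).
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", default="./checkpoints")
parser.add_argument("--device", default="cpu")  # cuda, mps, or cpu
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(args.model_dir).to(args.device)

while True:
    code = input("Paste code (or 'quit' to exit): ")
    if code.strip().lower() == "quit":
        break
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512).to(args.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_length=128, num_beams=4)
    print(tokenizer.decode(output[0], skip_special_tokens=True))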

📊 Evaluation

I evaluated the model on 4,000 samples before and after fine-tuning.

Before Fine-Tuning (Pre-trained CodeT5)

  • ROUGE-1: ~0.05
  • METEOR: ~0.027
  • BERTScore (F1): ~0.78

After Fine-Tuning

  • ROUGE-1: ~0.34
  • METEOR: ~0.24
  • BERTScore (F1): ~0.87

✅ Fine-tuning led to a substantial improvement, especially in semantic similarity (BERTScore), showing that the model better captures the meaning of reference docstrings.
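
All three metrics can be computed with the Hugging Face evaluate library; the snippet below is an illustrative sketch (it requires the rouge_score, nltk, and bert_score packages), with toy predictions and references standing in for the real 4,000-sample evaluation:

# Metric computation sketch (illustrative inputs).
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

preds = ["Return the sum of a and b."]       # model-generated docstrings
refs = ["Returns the sum of two numbers."]   # reference docstrings

print(rouge.compute(predictions=preds, references=refs)["rouge1"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])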

🙏 Acknowledgments

This project is based on the CodeT5 model introduced in:

Wang et al., CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021.
