This project demonstrates the implementation of an AI-powered code autocompletion model using the Salesforce CodeGen model with parameter-efficient fine-tuning techniques.
Code autocompletion tools have revolutionized software development by predicting and suggesting code as developers type. With the growing popularity of GitHub Copilot and Cursor Tab, a code autocomplete project was simply a must-try for me. This project aims to fine-tune a pre-trained Large Language Model (LLM) to complete Python code snippets based on function signatures, docstrings, and the function body itself. The model is trained on Python functions from the `Code_Search_Net` dataset.
- Uses the `Salesforce/codegen-350M-mono` model as a base model (from Hugging Face)
- Implements Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce trainable parameters
- Training and evaluation pipeline for code completion
- Demo App
The project includes:
- Data preparation from the `Code_Search_Net` dataset
- Tokenization optimized for code
- Parameter-Efficient Fine-Tuning with LoRA
- Training and evaluation pipeline
- Example generation and testing
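As an illustration of the data-preparation step, a CodeSearchNet record can be flattened into a single training string. The field names below follow the public CodeSearchNet schema, but `build_example` is a hypothetical helper, not code from this repository:

```python
# Illustrative sketch: turn one CodeSearchNet record into a training string
# so the model learns to continue a docstring/signature into a full body.
# build_example is a hypothetical helper, not code from this repository.
def build_example(record: dict, eos_token: str = "<|endoftext|>") -> str:
    docstring = record.get("func_documentation_string", "").strip()
    code = record["func_code_string"].strip()
    header = f'"""{docstring}"""\n' if docstring else ""
    return header + code + eos_token

sample = {
    "func_documentation_string": "Add two numbers.",
    "func_code_string": "def add(a, b):\n    return a + b",
}
print(build_example(sample))
```

Appending the end-of-sequence token lets the model learn where a completed function ends.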
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- PEFT library
All project libraries and package requirements are listed in the `requirements.txt` file. To install them, see "Install dependencies" below.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ai-code-autocompletion.git
  cd ai-code-autocompletion
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  In a conda environment (recommended), activate the environment first and run the same `pip install` command; note that `conda install -r requirements.txt` is not a valid conda command.
- Open and run the Jupyter notebook:

  ```bash
  jupyter notebook notebooks/ai_code_autocompletion.ipynb
  ```
- Follow the steps in the notebook to train and evaluate the model.
- To train the model locally:

  ```bash
  # Set environment variables
  export MODEL_PATH="models/newly-trained-model"  # The trained model will be saved to this path
  export WANDB_API_KEY="your-wandb-api-key"       # Optional, for metrics tracking

  # Run the training script
  python src/training/train.py \
    --model_name="Salesforce/codegen-350M-mono" \
    --output_dir="$MODEL_PATH" \
    --num_epochs=3 \
    --batch_size=8 \
    --learning_rate=5e-4
  ```
- Alternatively, you can train the model on Google Colab by uploading the `notebooks/Model_training.ipynb` notebook to Google Colab and hitting Run.
After training, you can use the model to complete code snippets:
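The `generate_completion` helper referenced in the example that follows is not shown in this README; a minimal sketch of such a helper (hypothetical signature and defaults) might look like:

```python
# Minimal sketch of a generate_completion helper (hypothetical; the
# project's actual implementation may differ).
import torch

def generate_completion(model, tokenizer, prompt,
                        max_new_tokens=64, temperature=0.2):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens so only the new continuation is returned
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A low temperature keeps completions close to the most likely continuation, which is usually what you want for code.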
```python
# Example code completion
function_prefix = "def process_image(image_path):\n    # Load and preprocess image\n    import numpy as np\n    img = "
completion = generate_completion(model, tokenizer, function_prefix)
print(f"Completion: {completion}")
```

A web-based demo application is available to showcase the code completion model's capabilities:
- Simple web interface built with Streamlit
- Syntax highlighting for Python code
- Adjustable parameters (temperature, max tokens)
```bash
# Make sure to install the required libraries (included in requirements.txt)
pip install streamlit pygments

# Run the demo app (from the root directory)
streamlit run src/app/demo.py
```

Then open your browser at the URL displayed in the terminal (by default http://localhost:8501) to interact with the demo.
The training pipeline uses Weights & Biases for metrics tracking (you will need a Weights & Biases account to get an API key).
- Context Window Limitations: The model can only see a limited amount of context (often just the current function), making it difficult to understand the broader codebase.
- Computational Efficiency: Code suggestions must appear nearly instantaneously to be useful, requiring more (GPU) power and model optimization for low latency.
This project demonstrates the feasibility of fine-tuning a code autocompletion model using modern techniques such as PEFT and LoRA. The Salesforce CodeGen model shows promising results for Python code completion and can easily be extended and trained on other programming languages.
The `ai_code_completions.ipynb` notebook in `notebooks/` was my original submission for the end-of-semester project in the Deep Learning and AI course I took (Winter 2025). I decided to expand the idea into the fully-fledged project you see now, using industry best practices for model development and testing. This is the first of its kind for me. If you're reading this, it really means a lot to me.



