This project demonstrates the implementation of an AI-powered code autocompletion model using the Salesforce CodeGen model with parameter-efficient fine-tuning techniques.
Code autocompletion tools have revolutionized software development by predicting and suggesting code as developers type. With the growing popularity of GitHub Copilot and Cursor Tab, a code autocomplete project was simply a must-try for me. This project aims to fine-tune a pre-trained Large Language Model (LLM) to complete Python code snippets based on function signatures, docstrings, and the function body itself. The model is trained on Python functions from the `Code_Search_Net` dataset.
- Uses the `Salesforce/codegen-350M-mono` model as a base model (from Hugging Face)
- Implements Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce trainable parameters
- Training and evaluation pipeline for code completion
- Demo App
The project includes:
- Data preparation from the `Code_Search_Net` dataset
- Tokenization optimized for code
- Parameter-Efficient Fine-Tuning with LoRA
- Training and evaluation pipeline
- Example generation and testing
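As an illustration of the data-preparation step, a CodeSearchNet record can be flattened into a single training string. The field names below follow the public CodeSearchNet schema, but `build_example` is a hypothetical helper, not code from this repository:

```python
# Illustrative sketch: turn one CodeSearchNet record into a training string
# so the model learns to continue a docstring/signature into a full body.
# build_example is a hypothetical helper, not code from this repository.
def build_example(record: dict, eos_token: str = "<|endoftext|>") -> str:
    docstring = record.get("func_documentation_string", "").strip()
    code = record["func_code_string"].strip()
    header = f'"""{docstring}"""\n' if docstring else ""
    return header + code + eos_token

sample = {
    "func_documentation_string": "Add two numbers.",
    "func_code_string": "def add(a, b):\n    return a + b",
}
print(build_example(sample))
```

Appending the end-of-sequence token lets the model learn where a completed function ends.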
- Python 3.8+
- PyTorch
- Transformers
- Datasets
- PEFT library
All project libraries and package requirements are listed in the `requirements.txt` file. To install them, see "Install dependencies" below.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ai-code-autocompletion.git
  cd ai-code-autocompletion
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  In a conda environment (recommended), activate the environment first and run the same `pip install` command; note that `conda install -r requirements.txt` is not a valid conda command.
- Open and run the Jupyter notebook:

  ```bash
  jupyter notebook notebooks/ai_code_autocompletion.ipynb
  ```
- Follow the steps in the notebook to train and evaluate the model.
- To train the model locally:

  ```bash
  # Set environment variables
  export MODEL_PATH="models/newly-trained-model"  # The trained model will be saved to this path
  export WANDB_API_KEY="your-wandb-api-key"       # Optional, for metrics tracking

  # Run the training script
  python src/training/train.py \
    --model_name="Salesforce/codegen-350M-mono" \
    --output_dir="$MODEL_PATH" \
    --num_epochs=3 \
    --batch_size=8 \
    --learning_rate=5e-4
  ```
- Alternatively, you can train the model on Google Colab by uploading the `notebooks/Model_training.ipynb` notebook to Google Colab and hitting Run.
After training, you can use the model to complete code snippets:
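The `generate_completion` helper referenced in the example that follows is not shown in this README; a minimal sketch of such a helper (hypothetical signature and defaults) might look like:

```python
# Minimal sketch of a generate_completion helper (hypothetical; the
# project's actual implementation may differ).
import torch

def generate_completion(model, tokenizer, prompt,
                        max_new_tokens=64, temperature=0.2):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens so only the new continuation is returned
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A low temperature keeps completions close to the most likely continuation, which is usually what you want for code.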
```python
# Example code completion
function_prefix = "def process_image(image_path):\n    # Load and preprocess image\n    import numpy as np\n    img = "
completion = generate_completion(model, tokenizer, function_prefix)
print(f"Completion: {completion}")
```

A web-based demo application is available to showcase the code completion model's capabilities:
- Simple web interface built with Streamlit
- Syntax highlighting for Python code
- Adjustable parameters (temperature, max tokens)
```bash
# Make sure to install the required libraries (included in requirements.txt)
pip install streamlit pygments

# Run the demo app (from the root directory)
streamlit run src/app/demo.py
```

Then open your browser at the URL displayed in the terminal (by default http://localhost:8501) to interact with the demo.
The training pipeline uses Weights & Biases for metrics tracking (you will need a Weights & Biases account to get an API key).
- Context Window Limitations: The model can only see a limited amount of context (often just the current function), making it difficult to understand the broader codebase.
- Computational Efficiency: Code suggestions must appear nearly instantaneously to be useful, requiring more (GPU) power and model optimization for low latency.
This project demonstrates the feasibility of fine-tuning a code autocompletion model using modern techniques such as PEFT and LoRA. The Salesforce CodeGen model shows promising results for Python code completion and can easily be extended and trained on other programming languages.
The `ai_code_completions.ipynb` notebook in `notebooks/` was my original submission for the end-of-semester project in the Deep Learning and AI course I took (Winter 2025). I decided to expand the idea into the fully-fledged project you see now, using industry best practices for model development and testing. This is the first of its kind for me. If you're reading this, it really means a lot to me.



