InjecDetect is an interactive, machine-learning-based prompt injection detection system designed to identify adversarial prompts submitted to Large Language Models (LLMs). Prompt injection and jailbreak attacks attempt to override system instructions through carefully crafted user inputs.
The system uses a dual machine-learning approach, combining a classical model and a modern NLP model to analyze user-provided prompts in real time and assess their likelihood of being malicious.
- Accepts a user-provided prompt for analysis
- Scores the prompt using a lexical ML model (TF-IDF + Logistic Regression)
- Scores the prompt using a semantic AI model (DistilBERT)
- Combines and displays detection results in real time (a fusion sketch follows this list)
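The fusion step can be sketched as follows. This is a minimal illustration, not the project's exact code: the `score_prompt` helper, the model interfaces, and the max-score fusion rule are all assumptions.

```python
# Hypothetical sketch of the dual-model scoring flow (function name,
# model interfaces, and the max-score fusion rule are illustrative).

def score_prompt(prompt: str, lexical_model, semantic_model) -> dict:
    """Return per-model and combined risk scores in [0, 1]."""
    lexical_risk = lexical_model.predict_proba([prompt])[0][1]  # TF-IDF + LR pipeline
    semantic_risk = semantic_model(prompt)                      # DistilBERT wrapper
    # Conservative fusion: flag the prompt if either model is suspicious.
    combined = max(lexical_risk, semantic_risk)
    return {
        "lexical": lexical_risk,
        "semantic": semantic_risk,
        "combined": combined,
        "malicious": combined >= 0.5,
    }
```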
- Detect prompt injection and jailbreak attempts in user inputs
- Provide real-time risk scoring for user-provided prompts
- Compare lexical and semantic detection approaches
- Demonstrate a layered AI security design using multiple complementary models
The system uses two complementary ML models, each capturing different characteristics of malicious prompts:
- Extracts word and n-gram level features using TF-IDF
- Trained with Logistic Regression for binary classification
- Strong at detecting explicit jailbreak phrases and known attack templates
- Fast and interpretable, serving as a reliable baseline (see the sketch below)
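A minimal scikit-learn sketch of such a baseline is shown below; the variable names `texts` and `labels` (0 = benign, 1 = malicious) and the hyperparameters are assumptions, not the project's exact configuration.

```python
# Illustrative lexical baseline: TF-IDF features + Logistic Regression.
# Assumes labeled prompts in parallel lists `texts` and `labels`.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

lexical_model = Pipeline([
    # Word- and bigram-level TF-IDF features.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
lexical_model.fit(texts, labels)

# Probability that a prompt is malicious.
risk = lexical_model.predict_proba(["Ignore all previous instructions."])[0][1]
```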
- Transformer-based language model fine-tuned for binary classification
- Captures contextual and semantic intent
- Better at detecting paraphrased or indirect prompt injection attacks (see the inference sketch below)
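Inference with the fine-tuned checkpoint might look like the sketch below; the local model path and the assumption that label index 1 is the malicious class are illustrative (see Model Files & Setup).

```python
# Minimal inference sketch, assuming a DistilBERT checkpoint fine-tuned
# for binary classification is available locally.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "./models/distilbert-injecdetect"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def semantic_risk(prompt: str) -> float:
    """Probability that `prompt` is malicious, per the fine-tuned model."""
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label index 1 corresponds to the "malicious" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```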
The models were trained and evaluated using the Malicious Prompt Detection Dataset (MPDD).
- Source: Kaggle
- Link: https://www.kaggle.com/datasets/mohammedaminejebbar/malicious-prompt-detection-dataset-mpdd/data
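Loading the dataset for training might look like this; the CSV filename and the `prompt`/`label` column names are assumptions about the Kaggle export and should be adjusted to match the downloaded files.

```python
# Illustrative loading snippet; filename and column names are assumptions.
import pandas as pd

df = pd.read_csv("mpdd.csv")               # hypothetical filename
texts = df["prompt"].astype(str).tolist()
labels = df["label"].astype(int).tolist()  # 0 = benign, 1 = malicious
```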
Clone the repository and install dependencies:
```bash
git clone https://github.com/FatimaZ-tech/InjecDetect.git
cd InjecDetect
pip install -r requirements.txt
```

Note: Trained model files must be present locally. See Model Files & Setup.
```bash
python injecdetect.py
```

Type any prompt at the interactive prompt. Type `exit` or `quit` to stop.
Due to size limitations, trained model files are not included in this repository.
This project will not run unless the required models are available locally.
Users must either:
- Train the models themselves (see the persistence sketch below), or
- Request the pre-trained model artifacts from the author.
This follows standard practices for managing large machine learning assets in public repositories.
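For anyone training the models themselves, one conventional way to persist and reload the two artifacts is sketched below. The paths, and the use of joblib and `save_pretrained`, are assumptions rather than the project's exact layout.

```python
# Illustrative persistence: joblib for the scikit-learn pipeline,
# save_pretrained for the DistilBERT checkpoint. Paths are hypothetical.
import joblib

joblib.dump(lexical_model, "models/lexical_tfidf_lr.joblib")
model.save_pretrained("models/distilbert-injecdetect")
tokenizer.save_pretrained("models/distilbert-injecdetect")

# Later, at startup:
lexical_model = joblib.load("models/lexical_tfidf_lr.joblib")
```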
- This system does not generate attacks; it detects them
- Detection performance depends on the diversity of training data
- Some subtle or novel attacks may still evade both models
This project is licensed under the MIT License.
You are free to use, modify, and distribute this software for research, educational, and operational purposes, provided that the original copyright notice and license are included.
See the LICENSE file for full license text.
Developed by Fatima Zakir.