This repository contains the implementation and report for Assignment 1 of the Advanced Natural Language Processing (ANLP) 2024 course. The focus is to build a character-level trigram language model and analyze its performance on language detection tasks.
The key objectives of this project are:
- Build a trigram language model: Construct a character-level language model from training data.
- Generate random sequences: Use the language model to generate random text sequences.
- Preprocess input data: Normalize input by lowercasing, removing special characters, and mapping every digit to 0.
- Evaluate models with perplexity: Calculate the perplexity of test documents using language models.
/model/ # Contains all Python scripts and subdirectories
/data/ # Training and test datasets (English, Spanish, German)
/report/ # Report containing analysis and implementation details
main.py # Entry point for running the model
README.md # Project documentation
requirements.txt # List of dependencies for installing with pip
- Preprocessing: Converts text to lowercase, removes non-alphanumeric characters (except .), and normalizes digits to 0.
- Language Model Construction: Collects character 3-grams and estimates probabilities using maximum likelihood estimation (MLE).
- Sequence Generation: Generates text sequences based on the trigram model's probabilities.
- Perplexity Computation: Computes perplexity to measure how well a model predicts unseen text.
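
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function names (`preprocess`, `train_mle`, `generate`, `perplexity`) are hypothetical, and so are the choices to keep spaces, to keep only ASCII letters, and to pad each line with `#` as a boundary marker. Because the estimates are unsmoothed MLE, any trigram not seen in training has probability zero, so the sketch reports infinite perplexity in that case.

```python
import math
import random
import re
from collections import Counter, defaultdict


def preprocess(line):
    """Lowercase, map every digit to '0', and keep only a-z, '0', '.', and space.

    Keeping spaces and restricting letters to ASCII are assumptions of this sketch.
    """
    line = line.lower()
    line = re.sub(r"[0-9]", "0", line)
    return re.sub(r"[^a-z0. ]", "", line)


def train_mle(lines):
    """Estimate P(char | previous two chars) by relative frequency (MLE)."""
    counts = defaultdict(Counter)
    for line in lines:
        padded = "##" + preprocess(line) + "#"  # hypothetical boundary markers
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]][padded[i + 2]] += 1
    return {
        history: {c: n / sum(nexts.values()) for c, n in nexts.items()}
        for history, nexts in counts.items()
    }


def generate(model, max_chars=50, seed=None):
    """Sample characters one at a time from the trigram distribution."""
    rng = random.Random(seed)
    out = "##"
    while len(out) - 2 < max_chars:
        dist = model.get(out[-2:])
        if dist is None:  # history never seen in training
            break
        chars = list(dist)
        nxt = rng.choices(chars, weights=[dist[c] for c in chars])[0]
        if nxt == "#":  # end-of-line marker was sampled
            break
        out += nxt
    return out[2:]


def perplexity(model, text):
    """Return 2 ** (-average log2 probability per predicted character)."""
    padded = "##" + preprocess(text) + "#"
    n = len(padded) - 2  # number of trigrams, always at least 1
    log_prob = 0.0
    for i in range(n):
        p = model.get(padded[i:i + 2], {}).get(padded[i + 2], 0.0)
        if p == 0.0:
            return math.inf  # unsmoothed MLE: unseen trigram has zero probability
        log_prob += math.log2(p)
    return 2 ** (-log_prob / n)


if __name__ == "__main__":
    model = train_mle(["abab"])
    print(generate(model, max_chars=10, seed=0))
    print(perplexity(model, "abab"))
```

For language detection, one plausible use of such a sketch is to train one model per language and assign a test document to the language whose model gives the lowest perplexity. In practice some form of smoothing would be needed so that unseen trigrams do not make every score infinite.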
- Clone the Repository

      git clone <your-repository-url>
      cd <repository-folder>

- Install Dependencies

  Ensure Python 3.x is installed and run:

      pip install -r requirements.txt

- Run the Project

  To run the main script:

      python model/main.py