Language Modeling Project - ANLP 2024

This repository contains the implementation and report for Assignment 1 of the Advanced Natural Language Processing (ANLP) 2024 course. The goal is to build a character-level trigram language model and analyze its performance on language detection.

Overview

The key objectives of this project are:

Build a trigram language model: Construct a character-level language model from training data.
Generate random sequences: Use the language model to generate random text sequences.
Preprocess input data: Normalize input by lowercasing, removing special characters, and mapping every digit to 0.
Evaluate models with perplexity: Calculate the perplexity of test documents using language models.

Repository Structure

/model/           # Contains all Python scripts and subdirectories
  /data/          # Training and test datasets (English, Spanish, German)
  /report/        # Report containing analysis and implementation details
  main.py         # Entry point for running the model
README.md         # Project documentation
requirements.txt  # List of dependencies for installing with pip

Key Features

Preprocessing
Converts text to lowercase, removes non-alphanumeric characters (except .), and normalizes digits to 0.
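The preprocessing rule above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the function name `preprocess_line` is hypothetical, and keeping spaces and accented letters (for the Spanish and German data) is an assumption, since the description does not say how they are handled.

```python
def preprocess_line(line: str) -> str:
    """Lowercase, map every digit to '0', keep only letters, spaces and '.'."""
    out = []
    for ch in line.lower():
        if ch.isdigit():
            out.append("0")          # normalize all digits to 0
        elif ch.isalpha() or ch in ". ":
            out.append(ch)           # assumption: spaces and accented letters are kept
        # every other character is dropped
    return "".join(out)


print(preprocess_line("Hello, World! 42."))
```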
Language Model Construction
Collects character 3-grams and estimates their probabilities using maximum likelihood estimation (MLE).
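Under MLE, the probability of a character given the two preceding characters is the trigram count divided by the count of that two-character history. The sketch below shows one way to compute this; the function name `train_trigram_mle` and the nested-dictionary layout are assumptions made for illustration, not necessarily what the repository uses.

```python
from collections import Counter, defaultdict


def train_trigram_mle(text: str) -> dict:
    """Return P(c3 | c1 c2) as {two_char_history: {next_char: probability}}."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        history, nxt = text[i:i + 2], text[i + 2]
        counts[history][nxt] += 1    # count each character trigram

    model = {}
    for history, nexts in counts.items():
        total = sum(nexts.values())  # how often this history was seen
        model[history] = {c: n / total for c, n in nexts.items()}
    return model


model = train_trigram_mle("abababac")
print(model["ba"])
```

In this toy string the history "ba" is followed by "b" twice and "c" once, so the estimates are 2/3 and 1/3.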
Sequence Generation
Generates text sequences based on the trigram model’s probabilities.
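Generation can be pictured as repeatedly sampling the next character from the distribution attached to the last two characters. The following is a hedged sketch under assumed names (`generate`, `toy_model`) with a hand-written model in the {history: {next_char: probability}} layout; stopping at an unseen history is one possible choice, and the actual script may handle that case differently.

```python
import random


def generate(model, seed_history, length, rng=None):
    """Sample up to `length` characters, starting from a two-character seed."""
    if rng is None:
        rng = random.Random()
    out = seed_history
    for _ in range(length):
        dist = model.get(out[-2:])
        if not dist:
            break                    # unseen history: nothing to sample from
        chars = list(dist)
        weights = [dist[c] for c in chars]
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out


# Hypothetical toy model, for illustration only.
toy_model = {"ab": {"a": 1.0}, "ba": {"b": 0.5, "c": 0.5}}
sample = generate(toy_model, "ab", 10, random.Random(0))
print(sample)
```

Passing a seeded `random.Random` makes the sample reproducible, which is convenient when comparing generated text across runs.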
Perplexity Computation
Computes perplexity to measure how well a model predicts unseen text.
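Perplexity over N trigrams is 2 raised to the negative average log2 probability, so lower values mean the model fits the text better. The sketch below uses assumed names (`perplexity`, `toy_model`) and a hand-written model in the {history: {next_char: probability}} layout. With plain MLE an unseen trigram has probability 0, so this version returns infinity in that case; how the repository handles unseen trigrams is not stated here.

```python
import math


def perplexity(model, text):
    """Perplexity of `text` under a trigram model {history: {char: prob}}."""
    log_sum, n = 0.0, 0
    for i in range(len(text) - 2):
        p = model.get(text[i:i + 2], {}).get(text[i + 2], 0.0)
        if p == 0.0:
            return math.inf          # unseen trigram under unsmoothed MLE
        log_sum += math.log2(p)
        n += 1
    if n == 0:
        raise ValueError("text must contain at least one trigram")
    return 2 ** (-log_sum / n)


# Hypothetical toy model, for illustration only.
toy_model = {"ab": {"a": 1.0}, "ba": {"b": 0.5, "c": 0.5}}
print(perplexity(toy_model, "ababa"))
```

For language detection, one plausible use is to score a test document under each language's model and pick the language with the lowest perplexity.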

How to Run the Project

Clone the Repository

git clone <your-repository-url>
cd <repository-folder>

Install Dependencies
Ensure Python 3.x is installed, then run:
```
pip install -r requirements.txt
```
Run the Project
To run the main script:
```
python model/main.py
```
