This project explores how documents can be represented in a Vector Space Model (VSM) using different lexical processing strategies and term weighting functions.
It implements a complete pipeline starting from raw HTML documents and producing structured numerical representations suitable for tasks such as document comparison, clustering, or classification.
The goal of this project is to study how different preprocessing decisions and weighting schemes affect the representation of documents in a vector space.
The work is framed around a realistic scenario: representing news articles so that they could later be used for unsupervised thematic classification based on term co-occurrence patterns.
The corpus consists of HTML news articles from El País, manually collected from the homepage on a specific date.
Each document contains:
- metadata (keywords, descriptions),
- title,
- and article body.
The system extracts and processes only the semantically relevant textual content from these sources.
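The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name is hypothetical, and the assumption that the article body lives inside an `<article>` element reflects typical El País markup rather than a verified selector.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Collect the semantically relevant text: meta keywords/description,
    title, and article body (selectors are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    # <meta name="keywords"> and <meta name="description"> content
    for meta in soup.find_all("meta", attrs={"name": ["keywords", "description"]}):
        parts.append(meta.get("content", ""))
    # <title>
    if soup.title and soup.title.string:
        parts.append(soup.title.string)
    # <article> body text
    article = soup.find("article")
    if article:
        parts.append(article.get_text(separator=" ", strip=True))
    return " ".join(p for p in parts if p)
```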
The project implements a full NLP pipeline:
- Parsing HTML using BeautifulSoup
- Extracting:
  - `<meta>` tags (keywords, description)
  - `<title>`
  - `<article>` content
- Removal of:
- isolated numbers
- dates
- Conversion of emojis into textual form
- Regex-based tokenization using NLTK
- Supports:
- hyphenated words
- abbreviations (e.g. U.S.A.)
- emoji tokens
- Lowercasing
- Removal of punctuation (e.g. dots in abbreviations)
- Stopword removal (NLTK Spanish stopword list)
- Stemming with NLTK's Spanish Snowball stemmer
After preprocessing, the system builds:
- Vocabulary: the set of unique tokens across the corpus, which defines the dimensions of the vector space
- Sparse representation: each document is stored as a dictionary (token → frequency)
- Inverted file: maps each term to its frequency across all documents, stored in `fichero_invertido.txt`
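These three structures can be sketched in a few lines; the function and variable names here are illustrative, not the project's own:

```python
from collections import Counter


def build_index(docs):
    """docs: dict mapping doc id -> list of preprocessed tokens.
    Returns the vocabulary, per-document sparse vectors, and the
    inverted file (term -> total frequency across all documents)."""
    sparse = {doc_id: dict(Counter(tokens)) for doc_id, tokens in docs.items()}
    vocabulary = sorted({t for tokens in docs.values() for t in tokens})
    inverted = Counter()
    for tokens in docs.values():
        inverted.update(tokens)
    return vocabulary, sparse, dict(inverted)


docs = {"d1": ["elect", "gobiern", "elect"], "d2": ["gobiern", "crisis"]}
vocab, sparse, inverted = build_index(docs)
```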
The project compares several weighting functions:
- Binary → presence/absence
- Term Frequency (TF) → raw counts
- Weighted TF (WTF) → normalized by document length
- Binary + IDF
- TF-IDF
These representations make it possible to analyse how different weighting strategies shape the structure of the vector space.
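The five schemes can be expressed as a single weighting function. The formulas below follow the standard textbook definitions (e.g. idf = log(N/df)) and may differ in detail from the project's exact implementation:

```python
import math


def weight(tf, doc_len, df, n_docs, scheme):
    """tf: term frequency in the document; doc_len: document length in tokens;
    df: number of documents containing the term; n_docs: corpus size."""
    idf = math.log(n_docs / df) if df else 0.0
    if scheme == "bin":       # presence/absence
        return 1.0 if tf > 0 else 0.0
    if scheme == "tf":        # raw counts
        return float(tf)
    if scheme == "wtf":       # TF normalized by document length
        return tf / doc_len
    if scheme == "bin_idf":   # presence weighted by rarity
        return idf if tf > 0 else 0.0
    if scheme == "tf_idf":
        return tf * idf
    raise ValueError(f"unknown scheme: {scheme}")
```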
The system generates:
- Vocabulary file
- Inverted index
- Document vectors, grouped by weighting scheme:
  `bin/`, `tf/`, `wtf/`, `bin_idf/`, `tf_idf/`
Each document is represented as a vector of fixed dimension (size of the vocabulary).
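Projecting a sparse document dictionary onto a fixed vocabulary order yields that vector; a minimal sketch (names illustrative):

```python
def to_dense(sparse_doc, vocabulary):
    """Map a sparse document (token -> weight) to a vector whose
    dimension equals the vocabulary size; absent terms get 0.0."""
    return [sparse_doc.get(term, 0.0) for term in vocabulary]


vocab = ["crisis", "elect", "gobiern"]
to_dense({"elect": 2.0}, vocab)  # -> [0.0, 2.0, 0.0]
```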
- The quality of representations strongly depends on preprocessing decisions (e.g. normalization, stopwords, stemming).
- Large vocabularies increase computational cost significantly.
- Feature selection and normalization are critical for both:
- semantic quality
- efficiency
- Python
- NLTK
- BeautifulSoup
- Regular expressions