ariadnafruits/vector-space-model-representations
Document Vector Representations in the Vector Space Model

This project explores how documents can be represented in a Vector Space Model (VSM) using different lexical processing strategies and term weighting functions.

It implements a complete pipeline starting from raw HTML documents and producing structured numerical representations suitable for tasks such as document comparison, clustering, or classification.


Project Objective

The goal of this project is to study how different preprocessing decisions and weighting schemes affect the representation of documents in a vector space.

The work is framed around a realistic scenario: representing news articles so that they can later be used for unsupervised thematic classification, based on term co-occurrence patterns.


Dataset

The corpus consists of HTML news articles from El País, manually collected from the homepage on a specific date.

Each document contains:

  • metadata (keywords, descriptions),
  • title,
  • and article body.

The system extracts and processes only the semantically relevant textual content from these sources.


Processing Pipeline

The project implements a full NLP pipeline:

1. Text Extraction

  • Parsing HTML using BeautifulSoup
  • Extracting:
    • <meta> tags (keywords, description)
    • <title>
    • <article> content
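The extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual code; the helper name `extract_text` is hypothetical, but the tags pulled out (`<meta>` keywords/description, `<title>`, `<article>`) are the ones listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical helper illustrating the extraction step: collect the
# semantically relevant text from metadata, title, and article body.
def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for meta in soup.find_all("meta", attrs={"name": ["keywords", "description"]}):
        content = meta.get("content")
        if content:
            parts.append(content)
    if soup.title and soup.title.string:
        parts.append(soup.title.string)
    article = soup.find("article")
    if article:
        parts.append(article.get_text(separator=" ", strip=True))
    return " ".join(parts)

html = """<html><head>
<meta name="keywords" content="economía, política">
<meta name="description" content="Resumen de la noticia">
<title>Titular</title></head>
<body><article><p>Cuerpo del artículo.</p></article></body></html>"""
print(extract_text(html))
```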

2. Cleaning

  • Removal of:
    • isolated numbers
    • dates
  • Conversion of emojis into textual form
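A minimal sketch of the cleaning step, assuming simple regex patterns for dates and isolated numbers; the inline emoji mapping is an illustrative stand-in (a real pipeline could use a dedicated emoji library instead).

```python
import re

# Illustrative stand-in for emoji-to-text conversion.
EMOJI_MAP = {"🙂": ":slightly_smiling_face:"}

def clean(text: str) -> str:
    # Convert known emojis into textual tokens
    for em, name in EMOJI_MAP.items():
        text = text.replace(em, f" {name} ")
    # Drop dates such as 12/05/2024 or 2024-05-12
    text = re.sub(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b", " ", text)
    # Drop isolated numbers
    text = re.sub(r"\b\d+\b", " ", text)
    # Collapse leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean("El 12/05/2024 subió un 3 🙂"))
```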

3. Tokenization

  • Regex-based tokenization using NLTK
  • Supports:
    • hyphenated words
    • abbreviations (e.g. U.S.A.)
    • emoji tokens
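The kind of pattern involved can be sketched with the standard library; the project itself applies an equivalent pattern through NLTK's regex-based tokenizer. The exact regex here is an assumption for illustration.

```python
import re

# One alternation per token class: abbreviations, emoji tokens
# (already converted to :name: form by the cleaning step), and
# plain or hyphenated words.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-zÁÉÍÓÚÜÑáéíóúüñ]\.)+"  # abbreviations such as U.S.A.
    r"|:\w+:"                          # emoji tokens like :slightly_smiling_face:
    r"|\w+(?:-\w+)*"                   # words, including hyphenated ones
)

def tokenize(text: str):
    return TOKEN_PATTERN.findall(text)

print(tokenize("U.S.A. celebra el fin-de-semana :slightly_smiling_face:"))
```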

4. Normalization

  • Lowercasing
  • Removal of punctuation (e.g. dots in abbreviations)
  • Stopword removal (NLTK Spanish stopword list)
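The normalization step can be sketched as below; the tiny inline stopword set is a stand-in for the full NLTK Spanish list (`nltk.corpus.stopwords.words("spanish")`).

```python
# Illustrative subset of the Spanish stopword list.
STOPWORDS = {"el", "la", "de", "y", "un", "una"}

def normalize(tokens):
    out = []
    for tok in tokens:
        # Lowercase and drop abbreviation dots (U.S.A. -> usa)
        tok = tok.lower().replace(".", "")
        if tok and tok not in STOPWORDS:
            out.append(tok)
    return out

print(normalize(["U.S.A.", "y", "El", "país"]))  # → ['usa', 'país']
```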

5. Stemming

  • Snowball stemmer for Spanish (a Porter-family algorithm, as shipped with NLTK)
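The stemmer is available directly in NLTK with no extra data download. Exact stems can vary slightly between NLTK versions, but inflected forms of the same lemma map to a shared stem:

```python
from nltk.stem.snowball import SnowballStemmer

# Spanish Snowball stemmer: conflates inflected forms onto one stem.
stemmer = SnowballStemmer("spanish")

for word in ["elecciones", "electoral", "gatos", "gato"]:
    print(word, "->", stemmer.stem(word))
```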

Vector Representation

After preprocessing, the system builds:

Vocabulary

  • Set of unique tokens across the corpus
  • Defines the vector space dimensions

Corpus Matrix

  • Sparse representation:
    each document is stored as a dictionary (token → frequency)
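Both structures can be sketched with the standard library: each document becomes a token → frequency dictionary, and the vocabulary is the union of all keys. The sample stems here are invented for illustration.

```python
from collections import Counter

# Two toy documents after preprocessing (stemmed tokens).
docs = [["econom", "crec", "econom"], ["elect", "vot", "elect", "vot"]]

# Sparse corpus matrix: one token -> frequency dict per document.
corpus_matrix = [Counter(doc) for doc in docs]

# Vocabulary: union of keys, defining the vector space dimensions.
vocabulary = sorted(set().union(*corpus_matrix))

print(vocabulary)        # dimensions of the vector space
print(corpus_matrix[0])  # sparse vector of the first document
```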

Inverted Index

  • Maps each term to its frequency across all documents
  • Stored in fichero_invertido.txt
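A sketch of the inverted index, assuming a postings-style layout (term → documents containing it, with the term's frequency in each); the exact on-disk format of fichero_invertido.txt is not shown here.

```python
from collections import Counter, defaultdict

# Sparse corpus matrix: doc id -> token frequencies.
docs = {0: Counter({"econom": 2, "crec": 1}),
        1: Counter({"elect": 2, "econom": 1})}

# Invert it: term -> {doc id: frequency in that document}.
inverted = defaultdict(dict)
for doc_id, counts in docs.items():
    for term, freq in counts.items():
        inverted[term][doc_id] = freq

print(dict(inverted["econom"]))  # appears in both documents
```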

Weighting Schemes

The project compares several weighting functions:

Local weighting

  • Binary → presence/absence
  • Term Frequency (TF) → raw counts
  • Weighted TF (WTF) → normalized by document length

Global weighting

  • Binary + IDF
  • TF-IDF

These representations make it possible to analyse how different weighting strategies shape the structure of the vector space.
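The five schemes can be sketched side by side. This assumes the usual formulations (WTF as raw count divided by document length, IDF as log(N/df)); the project's exact variants may differ.

```python
import math
from collections import Counter

docs = [Counter({"econom": 2, "crec": 1}),
        Counter({"elect": 2, "econom": 1})]
N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in doc)

def weights(doc):
    length = sum(doc.values())
    out = {}
    for term, tf in doc.items():
        idf = math.log(N / df[term])
        out[term] = {
            "bin": 1,            # presence/absence
            "tf": tf,            # raw count
            "wtf": tf / length,  # normalized by document length
            "bin_idf": idf,      # 1 * idf
            "tf_idf": tf * idf,
        }
    return out

w = weights(docs[0])
print(w["crec"])   # term occurring in one document: nonzero idf
print(w["econom"]) # term occurring in every document: idf = 0
```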


Outputs

The system generates:

  • Vocabulary file
  • Inverted index
  • Document vectors, grouped by weighting scheme:
    • bin/
    • tf/
    • wtf/
    • bin_idf/
    • tf_idf/

Each document is represented as a vector of fixed dimension (size of the vocabulary).
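Materializing such a fixed-dimension vector from the sparse dictionary amounts to looking each vocabulary term up in order, with 0 for absent terms:

```python
# Vocabulary fixes the dimension and ordering of every document vector.
vocabulary = ["crec", "econom", "elect", "vot"]
doc = {"econom": 2, "crec": 1}  # sparse representation

vector = [doc.get(term, 0) for term in vocabulary]
print(vector)  # [1, 2, 0, 0]
```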


Key Insights

  • The quality of representations strongly depends on preprocessing decisions (e.g. normalization, stopwords, stemming).
  • Large vocabularies increase computational cost significantly.
  • Feature selection and normalization are critical for both:
    • semantic quality
    • efficiency

Technologies Used

  • Python
  • NLTK
  • BeautifulSoup
  • Regular expressions
