This project explores how documents can be represented in a Vector Space Model (VSM) using different lexical processing strategies and term weighting functions.
It implements a complete pipeline starting from raw HTML documents and producing structured numerical representations suitable for tasks such as document comparison, clustering, or classification.
The goal of this project is to study how different preprocessing decisions and weighting schemes affect the representation of documents in a vector space.
The work is framed around a realistic scenario: representing news articles so that they could later be used for unsupervised thematic classification based on term co-occurrence patterns.
The corpus consists of HTML news articles from El País, manually collected from the homepage on a specific date.
Each document contains:
- metadata (keywords, descriptions),
- title,
- and article body.
The system extracts and processes only the semantically relevant textual content from these sources.
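The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name is hypothetical, and the assumption that the article body lives inside an `<article>` element reflects typical El País markup rather than a verified selector.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Collect the semantically relevant text: meta keywords/description,
    title, and article body (selectors are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    # <meta name="keywords"> and <meta name="description"> content
    for meta in soup.find_all("meta", attrs={"name": ["keywords", "description"]}):
        parts.append(meta.get("content", ""))
    # <title>
    if soup.title and soup.title.string:
        parts.append(soup.title.string)
    # <article> body text
    article = soup.find("article")
    if article:
        parts.append(article.get_text(separator=" ", strip=True))
    return " ".join(p for p in parts if p)
```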
The project implements a full NLP pipeline:
- Parsing HTML using BeautifulSoup
- Extracting:
  - `<meta>` tags (keywords, description)
  - `<title>`
  - `<article>` content
- Removal of:
- isolated numbers
- dates
- Conversion of emojis into textual form
- Regex-based tokenization using NLTK
- Supports:
- hyphenated words
- abbreviations (e.g. U.S.A.)
- emoji tokens
- Lowercasing
- Removal of punctuation (e.g. dots in abbreviations)
- Stopword removal (NLTK Spanish stopword list)
- Stemming with NLTK's Spanish Snowball stemmer
After preprocessing, the system builds:
- Vocabulary: the set of unique tokens across the corpus, which defines the dimensions of the vector space
- Sparse representation: each document is stored as a dictionary (token → frequency)
- Inverted file: maps each term to its frequency across all documents, stored in `fichero_invertido.txt`
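These three structures can be sketched in a few lines; the function and variable names here are illustrative, not the project's own:

```python
from collections import Counter


def build_index(docs):
    """docs: dict mapping doc id -> list of preprocessed tokens.
    Returns the vocabulary, per-document sparse vectors, and the
    inverted file (term -> total frequency across all documents)."""
    sparse = {doc_id: dict(Counter(tokens)) for doc_id, tokens in docs.items()}
    vocabulary = sorted({t for tokens in docs.values() for t in tokens})
    inverted = Counter()
    for tokens in docs.values():
        inverted.update(tokens)
    return vocabulary, sparse, dict(inverted)


docs = {"d1": ["elect", "gobiern", "elect"], "d2": ["gobiern", "crisis"]}
vocab, sparse, inverted = build_index(docs)
```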
The project compares several weighting functions:
- Binary → presence/absence
- Term Frequency (TF) → raw counts
- Weighted TF (WTF) → normalized by document length
- Binary + IDF
- TF-IDF
These representations make it possible to analyse how different weighting strategies shape the structure of the vector space.
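The five schemes can be expressed as a single weighting function. The formulas below follow the standard textbook definitions (e.g. idf = log(N/df)) and may differ in detail from the project's exact implementation:

```python
import math


def weight(tf, doc_len, df, n_docs, scheme):
    """tf: term frequency in the document; doc_len: document length in tokens;
    df: number of documents containing the term; n_docs: corpus size."""
    idf = math.log(n_docs / df) if df else 0.0
    if scheme == "bin":       # presence/absence
        return 1.0 if tf > 0 else 0.0
    if scheme == "tf":        # raw counts
        return float(tf)
    if scheme == "wtf":       # TF normalized by document length
        return tf / doc_len
    if scheme == "bin_idf":   # presence weighted by rarity
        return idf if tf > 0 else 0.0
    if scheme == "tf_idf":
        return tf * idf
    raise ValueError(f"unknown scheme: {scheme}")
```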
The system generates:
- Vocabulary file
- Inverted index
- Document vectors, grouped by weighting scheme:
  `bin/`, `tf/`, `wtf/`, `bin_idf/`, `tf_idf/`
Each document is represented as a vector of fixed dimension (size of the vocabulary).
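Projecting a sparse document dictionary onto a fixed vocabulary order yields that vector; a minimal sketch (names illustrative):

```python
def to_dense(sparse_doc, vocabulary):
    """Map a sparse document (token -> weight) to a vector whose
    dimension equals the vocabulary size; absent terms get 0.0."""
    return [sparse_doc.get(term, 0.0) for term in vocabulary]


vocab = ["crisis", "elect", "gobiern"]
to_dense({"elect": 2.0}, vocab)  # -> [0.0, 2.0, 0.0]
```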
- The quality of representations strongly depends on preprocessing decisions (e.g. normalization, stopwords, stemming).
- Large vocabularies increase computational cost significantly.
- Feature selection and normalization are critical for both:
- semantic quality
- efficiency
- Python
- NLTK
- BeautifulSoup
- Regular expressions