word2vec

CBOW and Skip-Gram with Negative Sampling (SGNS), implemented from scratch in NumPy. Trains on the text8 corpus and evaluates on the Google analogy and WordSim-353 benchmarks.
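
For orientation, here is a minimal sketch of one CBOW training step with negative sampling in plain NumPy. It is illustrative only, not this repository's code; the variable names (W_in, W_out) mirror the saved embedding files, but the update itself is the textbook formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 20, 8                        # toy vocab size and embedding dim
W_in = rng.normal(0, 0.1, (V, D))   # input (context) embeddings
W_out = np.zeros((V, D))            # output (target) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_ns_step(context, target, negatives, lr=0.025):
    """One CBOW update with negative sampling; returns the pre-update loss."""
    h = W_in[context].mean(axis=0)             # average the context vectors
    ids = np.concatenate(([target], negatives))
    labels = np.zeros(len(ids)); labels[0] = 1.0
    p = sigmoid(W_out[ids] @ h)                # P(word is the true target)
    g = p - labels                             # dL/dscore for each candidate
    grad_h = g @ W_out[ids]                    # backprop into the averaged context
    W_out[ids] -= lr * np.outer(g, h)          # update target-side vectors
    W_in[context] -= lr * grad_h / len(context)  # share gradient across context words
    eps = 1e-9
    return -(np.log(p[0] + eps) + np.log(1 - p[1:] + eps).sum())

losses = [cbow_ns_step(np.array([1, 2, 4, 5]), 3, np.array([7, 11, 13]))
          for _ in range(50)]
print(round(losses[0], 3), round(losses[-1], 3))  # loss should fall
```

Skip-Gram differs only in that each context word is predicted from the target individually instead of averaging the context into one vector.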

Setup

Script

bash setup.sh
conda activate word2vec

Manual

conda create -n word2vec python=3.9
conda activate word2vec
pip install numpy kagglehub pytest

Download the benchmark files once and place them in the project root:

# Google analogy (19,544 questions — semantic + syntactic)
wget https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt

# WordSim-353 (353 word pairs with human similarity ratings)
wget https://raw.githubusercontent.com/commonsense/conceptnet5/master/conceptnet5/support_data/wordsim-353/combined.tab

The training corpus (text8) is downloaded automatically via kagglehub on the first run.

Train

# CBOW (default)
python main.py --epochs 50 --embed-dim 200

# Skip-Gram with Negative Sampling
python main.py --model sgns --epochs 10 --embed-dim 200

Embeddings are saved to embeddings/ after training. Checkpoints are written to embeddings/checkpoints/ every 500k tokens.

Key options:

| Flag | Default | Description |
|------|---------|-------------|
| --model | cbow | Model architecture: cbow or sgns |
| --epochs | 50 | Training passes over the corpus |
| --embed-dim | 200 | Embedding dimensionality |
| --window | 5 | Max context window radius |
| --neg-k | 5 | Negative samples per positive pair |
| --lr-init | 0.025 | Initial learning rate |
| --min-count | 5 | Min word frequency for vocabulary |
| --max-vocab | 30000 | Vocabulary size cap |
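
On --neg-k: word2vec conventionally draws negative samples from the unigram distribution raised to the 3/4 power, which flattens the distribution so rare words are sampled more often than their raw frequency suggests. The exponent below is the one from the original paper; whether this repo hard-codes the same value is an assumption.

```python
import numpy as np

# Toy word counts; real counts come from the text8 vocabulary.
counts = np.array([100, 50, 10, 1], dtype=np.float64)
probs = counts ** 0.75      # flatten the unigram distribution
probs /= probs.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=probs)  # a --neg-k 5 draw
print(probs.round(3), negatives)
```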

Pretrained embeddings

Download pretrained embeddings (CBOW, text8, 10 epochs, embed-dim 200, neg-k 5) from the v1.0 release:

wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_in.npy  -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_out.npy -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/vocab.txt  -P embeddings/
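
Assuming vocab.txt lists one word per line, row-aligned with the rows of W_in.npy (a guess about the release format, not confirmed by the repo), loading the embeddings into a lookup table might look like this. The sketch writes throwaway files so it is self-contained; point the paths at embeddings/ for real use.

```python
import numpy as np
import os
import tempfile

# Stand-in files mimicking the (assumed) release layout.
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "W_in.npy"),
        np.random.default_rng(0).normal(size=(3, 4)))
with open(os.path.join(tmp, "vocab.txt"), "w") as f:
    f.write("the\nking\nqueen\n")

# Load and zip row i of W_in with word i of the vocabulary.
W_in = np.load(os.path.join(tmp, "W_in.npy"))
with open(os.path.join(tmp, "vocab.txt")) as f:
    vocab = f.read().split()
word2vec = {w: W_in[i] for i, w in enumerate(vocab)}
print(len(word2vec), word2vec["king"].shape)
```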

Eval

Interactive REPL

Inspect embeddings after training:

python eval.py
> nn king                      # nearest neighbours
> king - man + woman           # vector arithmetic
> paris - france + germany     # analogy
> quit

Words used in an expression are excluded from the results so they don't trivially dominate.
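
A sketch of how such an exclusion can work, using toy vectors rather than the internals of eval.py: rank every word by cosine similarity to the query vector, then skip any word that appeared in the expression.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
W = rng.normal(size=(len(vocab), 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit rows, so dot = cosine

def nearest(expr_words, vec, k=3):
    """Top-k words by cosine similarity, skipping the expression's own words."""
    sims = W @ (vec / np.linalg.norm(vec))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] not in expr_words][:k]

# king - man + woman, with all three input words excluded from the ranking
v = W[0] - W[2] + W[3]
print(nearest({"king", "man", "woman"}, v))
```

Without the exclusion, the query words themselves would usually top the list, since a vector is most similar to its own components.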

To inspect a checkpoint instead of the final embeddings:

python eval.py --w-in-path embeddings/checkpoints/e02_t0003000000/W_in.npy

Benchmarks

# Analogy accuracy (semantic + syntactic breakdown)
python eval.py --benchmark-analogy questions-words.txt

# Word similarity Spearman correlation
python eval.py --benchmark-similarity combined.tab

# Both at once
python eval.py --benchmark-analogy questions-words.txt --benchmark-similarity combined.tab
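
The WordSim-353 score is a Spearman rank correlation between the model's cosine similarities and the human ratings in combined.tab. A self-contained NumPy version of that statistic (not necessarily the repository's exact implementation) is:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of average ranks (ties averaged)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(1, len(a) + 1)
        for v in np.unique(a):          # average ranks over tied values
            m = a == v
            r[m] = r[m].mean()
        return r
    rx = ranks(np.asarray(x, float))
    ry = ranks(np.asarray(y, float))
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 for a monotone pair
```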

Tests

# Run all tests
python -m pytest tests/

# Run a specific test file
python -m pytest tests/test_model.py
python -m pytest tests/test_dataset.py
python -m pytest tests/test_eval.py

Results on text8

CBOW

| Epochs | Embed dim | Neg-k | Analogy % | WordSim ρ |
|--------|-----------|-------|-----------|-----------|
| 50 | 200 | 5 | 22.4 | 0.569 |
| 50 | 200 | 10 | 30.9 | 0.676 |
| 50 | 300 | 5 | 43.1 | 0.747 |
| 100 | 200 | 5 | 44.2 | 0.719 |
| 100 | 200 | 10 | 43.3 | 0.703 |

SGNS

| Epochs | Embed dim | Neg-k | Analogy % | WordSim ρ |
|--------|-----------|-------|-----------|-----------|
| 5 | 200 | 5 | 29.2 | 0.700 |
| 5 | 200 | 10 | 32.2 | 0.706 |
| 5 | 300 | 5 | 29.2 | 0.701 |
| 10 | 200 | 5 | 37.8 | 0.731 |
| 10 | 200 | 10 | 42.7 | 0.720 |
| 10 | 300 | 10 | 39.3 | 0.731 |
