CBOW and Skip-Gram with Negative Sampling (SGNS), implemented from scratch in NumPy. Trains on the text8 corpus and evaluates on the Google analogy and WordSim-353 benchmarks.
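The core SGNS update fits in a handful of NumPy lines: nudge the centre word's input vector toward its context word's output vector and away from `k` sampled negatives. A minimal sketch of one update (illustrative names, not the repo's actual functions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGNS update: raise p(context | center), lower p(neg | center)."""
    v = W_in[center]                          # centre word vector, shape (d,)
    idx = np.concatenate(([context], negatives))
    labels = np.zeros(len(idx)); labels[0] = 1.0
    u = W_out[idx]                            # (1 + k, d) output vectors
    scores = sigmoid(u @ v)                   # predicted probabilities
    g = scores - labels                       # gradient of log-loss wrt scores
    W_in[center] -= lr * (g @ u)              # update centre vector
    W_out[idx]   -= lr * np.outer(g, v)       # update context + negative vectors
    # loss = -log sigmoid(u_pos . v) - sum log sigmoid(-u_neg . v)
    return -np.log(scores[0] + 1e-12) - np.sum(np.log(1.0 - scores[1:] + 1e-12))
```

Calling this repeatedly on (centre, context, negatives) triples drawn from the corpus drives the loss down; CBOW differs only in averaging the context vectors before the dot products.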
Quick setup:

```bash
bash setup.sh
```

Or set up the environment manually:

```bash
conda create -n word2vec python=3.9
conda activate word2vec
pip install numpy kagglehub pytest
```

Download the benchmark files once and place them in the project root:
```bash
# Google analogy (19,544 questions, semantic + syntactic)
wget https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt

# WordSim-353 (353 word pairs with human similarity ratings)
wget https://raw.githubusercontent.com/commonsense/conceptnet5/master/conceptnet5/support_data/wordsim-353/combined.tab
```

The training corpus (text8) is downloaded automatically via kagglehub on the first run.
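Both benchmark files are plain text. A hedged sketch of loaders, assuming the usual layouts (`: section` headers followed by four-word lines for the analogy set; one header row then tab-separated word1, word2, rating for WordSim-353):

```python
def load_analogies(path):
    """Parse questions-words.txt: ': section' headers, then 4-word lines."""
    sections, current = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(":"):
                current = line[1:].strip()
                sections[current] = []
            else:
                parts = line.lower().split()
                if len(parts) == 4 and current is not None:
                    sections[current].append(tuple(parts))
    return sections

def load_wordsim(path):
    """Parse combined.tab: one header row, then word1<TAB>word2<TAB>rating."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip header row
        for line in f:
            if not line.strip():
                continue
            w1, w2, score = line.rstrip("\n").split("\t")
            pairs.append((w1.lower(), w2.lower(), float(score)))
    return pairs
```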
```bash
# CBOW (default)
python main.py --epochs 50 --embed-dim 200

# Skip-Gram with Negative Sampling
python main.py --model sgns --epochs 10 --embed-dim 200
```

Embeddings are saved to `embeddings/` after training. Checkpoints are written to `embeddings/checkpoints/` every 500k tokens.
Key options:

| Flag | Default | Description |
|---|---|---|
| `--model` | `cbow` | Model architecture: `cbow` or `sgns` |
| `--epochs` | 50 | Training passes over the corpus |
| `--embed-dim` | 200 | Embedding dimensionality |
| `--window` | 5 | Max context window radius |
| `--neg-k` | 5 | Negative samples per positive pair |
| `--lr-init` | 0.025 | Initial learning rate |
| `--min-count` | 5 | Min word frequency for vocabulary |
| `--max-vocab` | 30000 | Vocabulary size cap |
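`--min-count` and `--max-vocab` jointly determine the vocabulary: keep the most frequent words, drop anything rarer than the threshold, and cap the total. A toy sketch of such a filter (not the repo's actual implementation):

```python
from collections import Counter

def build_vocab(tokens, min_count=5, max_vocab=30000):
    """Keep the most frequent words with count >= min_count, capped at max_vocab."""
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common(max_vocab) if c >= min_count]
    return {w: i for i, w in enumerate(kept)}  # word -> integer id

# toy example: only "the" and "cat" appear at least twice
toks = "the cat sat on the mat the cat".split()
vocab = build_vocab(toks, min_count=2, max_vocab=5)
```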
Download pretrained embeddings (CBOW, text8, 10 epochs, embed-dim 200, neg-k 5) from the v1.0 release:
```bash
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_in.npy -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_out.npy -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/vocab.txt -P embeddings/
```

Inspect embeddings after training:
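The saved arrays are ordinary `.npy` files, so they can also be used outside `eval.py`. A sketch of a loader, assuming `vocab.txt` lists words in row order with the word as the first token on each line (an assumption about the file layout):

```python
import numpy as np

def load_embeddings(w_in_path, vocab_path):
    """Load the input embedding matrix and a word -> row-index mapping."""
    W = np.load(w_in_path)                       # shape (vocab_size, embed_dim)
    with open(vocab_path, encoding="utf-8") as f:
        words = [line.split()[0] for line in f if line.strip()]
    word2id = {w: i for i, w in enumerate(words)}
    return W, word2id

# usage (paths assumed from the commands above):
# W, word2id = load_embeddings("embeddings/W_in.npy", "embeddings/vocab.txt")
# v_king = W[word2id["king"]]
```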
```
python eval.py
> nn king                   # nearest neighbours
> king - man + woman        # vector arithmetic
> paris - france + germany  # analogy
> quit
```
Words used in an expression are excluded from the results so they don't trivially dominate.
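That exclusion can be implemented by masking the query words before ranking by cosine similarity. A hypothetical sketch, not the repo's actual code:

```python
import numpy as np

def nearest(W, word2id, query_vec, exclude, topn=5):
    """Rank all words by cosine similarity to query_vec, skipping `exclude`."""
    denom = np.linalg.norm(W, axis=1) * np.linalg.norm(query_vec) + 1e-12
    sims = (W @ query_vec) / denom           # cosine similarity per row
    for w in exclude:                        # mask words used in the expression
        sims[word2id[w]] = -np.inf
    order = np.argsort(-sims)[:topn]         # best-first
    id2word = {i: w for w, i in word2id.items()}
    return [id2word[i] for i in order]
```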
To inspect a checkpoint instead of the final embeddings:

```bash
python eval.py --w-in-path embeddings/checkpoints/e02_t0003000000/W_in.npy
```

Benchmark the embeddings:

```bash
# Analogy accuracy (semantic + syntactic breakdown)
python eval.py --benchmark-analogy questions-words.txt

# Word similarity (Spearman correlation)
python eval.py --benchmark-similarity combined.tab

# Both at once
python eval.py --benchmark-analogy questions-words.txt --benchmark-similarity combined.tab
```

Run the tests:

```bash
# Run all tests
python -m pytest tests/

# Run a specific test file
python -m pytest tests/test_model.py
python -m pytest tests/test_dataset.py
python -m pytest tests/test_eval.py
```

| epochs | embed-dim | neg-k | Analogy % | WordSim ρ |
|---|---|---|---|---|
| 50 | 200 | 5 | 22.4% | 0.569 |
| 50 | 200 | 10 | 30.9% | 0.676 |
| 50 | 300 | 5 | 43.1% | 0.747 |
| 100 | 200 | 5 | 44.2% | 0.719 |
| 100 | 200 | 10 | 43.3% | 0.703 |

| epochs | embed-dim | neg-k | Analogy % | WordSim ρ |
|---|---|---|---|---|
| 5 | 200 | 5 | 29.2% | 0.700 |
| 5 | 200 | 10 | 32.2% | 0.706 |
| 5 | 300 | 5 | 29.2% | 0.701 |
| 10 | 200 | 5 | 37.8% | 0.731 |
| 10 | 200 | 10 | 42.7% | 0.720 |
| 10 | 300 | 10 | 39.3% | 0.731 |
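The WordSim ρ figures above are Spearman rank correlations between the model's cosine similarities and the human ratings. For reference, a dependency-free sketch of the statistic (no tie handling, unlike `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks (ties ignored)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))  # rank of each element
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Feeding it the model's cosine similarity for each WordSim-353 pair against the human ratings reproduces the evaluation, up to tie handling.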