CBOW and Skip-Gram with Negative Sampling (SGNS), implemented from scratch in NumPy. Trains on the text8 corpus and evaluates on the Google analogy and WordSim-353 benchmarks.
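The core SGNS update fits in a handful of NumPy lines: nudge the centre word's input vector toward its context word's output vector and away from `k` sampled negatives. A minimal sketch of one update (illustrative names, not the repo's actual functions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGNS update: raise p(context | center), lower p(neg | center)."""
    v = W_in[center]                          # centre word vector, shape (d,)
    idx = np.concatenate(([context], negatives))
    labels = np.zeros(len(idx)); labels[0] = 1.0
    u = W_out[idx]                            # (1 + k, d) output vectors
    scores = sigmoid(u @ v)                   # predicted probabilities
    g = scores - labels                       # gradient of log-loss wrt scores
    W_in[center] -= lr * (g @ u)              # update centre vector
    W_out[idx]   -= lr * np.outer(g, v)       # update context + negative vectors
    # loss = -log sigmoid(u_pos . v) - sum log sigmoid(-u_neg . v)
    return -np.log(scores[0] + 1e-12) - np.sum(np.log(1.0 - scores[1:] + 1e-12))
```

Calling this repeatedly on (centre, context, negatives) triples drawn from the corpus drives the loss down; CBOW differs only in averaging the context vectors before the dot products.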
Quick setup:

```bash
bash setup.sh
```

Or set up the environment manually:

```bash
conda create -n word2vec python=3.9
conda activate word2vec
pip install numpy kagglehub pytest
```

Download the benchmark files once and place them in the project root:
```bash
# Google analogy (19,544 questions, semantic + syntactic)
wget https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt

# WordSim-353 (353 word pairs with human similarity ratings)
wget https://raw.githubusercontent.com/commonsense/conceptnet5/master/conceptnet5/support_data/wordsim-353/combined.tab
```

The training corpus (text8) is downloaded automatically via kagglehub on the first run.
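Both benchmark files are plain text. A hedged sketch of loaders, assuming the usual layouts (`: section` headers followed by four-word lines for the analogy set; one header row then tab-separated word1, word2, rating for WordSim-353):

```python
def load_analogies(path):
    """Parse questions-words.txt: ': section' headers, then 4-word lines."""
    sections, current = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(":"):
                current = line[1:].strip()
                sections[current] = []
            else:
                parts = line.lower().split()
                if len(parts) == 4 and current is not None:
                    sections[current].append(tuple(parts))
    return sections

def load_wordsim(path):
    """Parse combined.tab: one header row, then word1<TAB>word2<TAB>rating."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip header row
        for line in f:
            if not line.strip():
                continue
            w1, w2, score = line.rstrip("\n").split("\t")
            pairs.append((w1.lower(), w2.lower(), float(score)))
    return pairs
```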
```bash
# CBOW (default)
python main.py --epochs 50 --embed-dim 200

# Skip-Gram with Negative Sampling
python main.py --model sgns --epochs 10 --embed-dim 200
```

Embeddings are saved to `embeddings/` after training. Checkpoints are written to `embeddings/checkpoints/` every 500k tokens.
Key options:

| Flag | Default | Description |
|---|---|---|
| `--model` | `cbow` | Model architecture: `cbow` or `sgns` |
| `--epochs` | 50 | Training passes over the corpus |
| `--embed-dim` | 200 | Embedding dimensionality |
| `--window` | 5 | Max context window radius |
| `--neg-k` | 5 | Negative samples per positive pair |
| `--lr-init` | 0.025 | Initial learning rate |
| `--min-count` | 5 | Min word frequency for vocabulary |
| `--max-vocab` | 30000 | Vocabulary size cap |
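`--min-count` and `--max-vocab` jointly determine the vocabulary: keep the most frequent words, drop anything rarer than the threshold, and cap the total. A toy sketch of such a filter (not the repo's actual implementation):

```python
from collections import Counter

def build_vocab(tokens, min_count=5, max_vocab=30000):
    """Keep the most frequent words with count >= min_count, capped at max_vocab."""
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common(max_vocab) if c >= min_count]
    return {w: i for i, w in enumerate(kept)}  # word -> integer id

# toy example: only "the" and "cat" appear at least twice
toks = "the cat sat on the mat the cat".split()
vocab = build_vocab(toks, min_count=2, max_vocab=5)
```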
Download pretrained embeddings (CBOW, text8, 10 epochs, embed-dim 200, neg-k 5) from the v1.0 release:
```bash
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_in.npy -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/W_out.npy -P embeddings/
wget https://github.com/AStroCvijo/word2vec/releases/download/v1.0/vocab.txt -P embeddings/
```

Inspect embeddings after training:
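The saved arrays are ordinary `.npy` files, so they can also be used outside `eval.py`. A sketch of a loader, assuming `vocab.txt` lists words in row order with the word as the first token on each line (an assumption about the file layout):

```python
import numpy as np

def load_embeddings(w_in_path, vocab_path):
    """Load the input embedding matrix and a word -> row-index mapping."""
    W = np.load(w_in_path)                       # shape (vocab_size, embed_dim)
    with open(vocab_path, encoding="utf-8") as f:
        words = [line.split()[0] for line in f if line.strip()]
    word2id = {w: i for i, w in enumerate(words)}
    return W, word2id

# usage (paths assumed from the commands above):
# W, word2id = load_embeddings("embeddings/W_in.npy", "embeddings/vocab.txt")
# v_king = W[word2id["king"]]
```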
```
python eval.py
> nn king                   # nearest neighbours
> king - man + woman        # vector arithmetic
> paris - france + germany  # analogy
> quit
```
Words used in an expression are excluded from the results so they don't trivially dominate.
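That exclusion can be implemented by masking the query words before ranking by cosine similarity. A hypothetical sketch, not the repo's actual code:

```python
import numpy as np

def nearest(W, word2id, query_vec, exclude, topn=5):
    """Rank all words by cosine similarity to query_vec, skipping `exclude`."""
    denom = np.linalg.norm(W, axis=1) * np.linalg.norm(query_vec) + 1e-12
    sims = (W @ query_vec) / denom           # cosine similarity per row
    for w in exclude:                        # mask words used in the expression
        sims[word2id[w]] = -np.inf
    order = np.argsort(-sims)[:topn]         # best-first
    id2word = {i: w for w, i in word2id.items()}
    return [id2word[i] for i in order]
```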
To inspect a checkpoint instead of the final embeddings:

```bash
python eval.py --w-in-path embeddings/checkpoints/e02_t0003000000/W_in.npy
```

Benchmark the embeddings:

```bash
# Analogy accuracy (semantic + syntactic breakdown)
python eval.py --benchmark-analogy questions-words.txt

# Word similarity (Spearman correlation)
python eval.py --benchmark-similarity combined.tab

# Both at once
python eval.py --benchmark-analogy questions-words.txt --benchmark-similarity combined.tab
```

Run the tests:

```bash
# Run all tests
python -m pytest tests/

# Run a specific test file
python -m pytest tests/test_model.py
python -m pytest tests/test_dataset.py
python -m pytest tests/test_eval.py
```

| epochs | embed-dim | neg-k | Analogy % | WordSim ρ |
|---|---|---|---|---|
| 50 | 200 | 5 | 22.4% | 0.569 |
| 50 | 200 | 10 | 30.9% | 0.676 |
| 50 | 300 | 5 | 43.1% | 0.747 |
| 100 | 200 | 5 | 44.2% | 0.719 |
| 100 | 200 | 10 | 43.3% | 0.703 |

| epochs | embed-dim | neg-k | Analogy % | WordSim ρ |
|---|---|---|---|---|
| 5 | 200 | 5 | 29.2% | 0.700 |
| 5 | 200 | 10 | 32.2% | 0.706 |
| 5 | 300 | 5 | 29.2% | 0.701 |
| 10 | 200 | 5 | 37.8% | 0.731 |
| 10 | 200 | 10 | 42.7% | 0.720 |
| 10 | 300 | 10 | 39.3% | 0.731 |
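The WordSim ρ figures above are Spearman rank correlations between the model's cosine similarities and the human ratings. For reference, a dependency-free sketch of the statistic (no tie handling, unlike `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks (ties ignored)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))  # rank of each element
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean(); ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Feeding it the model's cosine similarity for each WordSim-353 pair against the human ratings reproduces the evaluation, up to tie handling.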