@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 27% (0.27x) speedup for _RandomWords.next in spacy/ml/models/multi_task.py

⏱️ Runtime : 12.2 milliseconds → 9.66 milliseconds (best of 80 runs)

📝 Explanation and details

The optimized code achieves a **26% speedup** through two key optimizations that reduce computational overhead and memory operations:

**1. Single-pass vocabulary processing with early termination:**
The original code iterates through the entire vocabulary twice - once to extract words and once to extract probabilities - then slices both lists to the first 10,000 items. The optimized version uses a single loop with a counter that breaks after collecting exactly 10,000 lexemes, avoiding unnecessary iterations and the expensive slicing operations on potentially large lists.
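A minimal sketch of the two patterns, under the assumption that lexemes expose `.text` and `.prob` attributes as spaCy's `Lexeme` does; this is an illustration, not the actual spaCy implementation:

```python
from collections import namedtuple

def collect_two_pass(vocab, limit=10_000):
    # Original pattern: two full iterations over the vocab, then two slices.
    words = [lex.text for lex in vocab]
    probs = [lex.prob for lex in vocab]
    return words[:limit], probs[:limit]

def collect_single_pass(vocab, limit=10_000):
    # Optimized pattern: one iteration with early termination, no slicing.
    words, probs = [], []
    for lex in vocab:
        words.append(lex.text)
        probs.append(lex.prob)
        if len(words) >= limit:
            break  # stop as soon as `limit` lexemes are collected
    return words, probs

# A tiny stand-in vocab for demonstration
Lex = namedtuple("Lex", ["text", "prob"])
demo_vocab = [Lex(f"w{i}", 1.0) for i in range(20)]
```

Both variants return the same lists; the single-pass version simply never visits lexemes past the limit.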

**2. Explicit conversion of numpy array to Python list:**
In the `next()` method, the optimized code calls `.tolist()` on the numpy array before extending the cache. This converts the numpy array to a native Python list, which extends much faster than extending with the numpy array directly, thanks to Python's internal list-extension optimizations.
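The pattern can be illustrated in isolation (this is a sketch of the technique, not spaCy's actual code):

```python
import numpy

# A batch of sampled words, as numpy.random.choice would produce
sampled = numpy.random.choice(["a", "b", "c"], 10)

# Fast path: convert to a native Python list before extending the cache
cache = []
cache.extend(sampled.tolist())  # elements are plain Python str

# Extending with the array directly also works, but iterates numpy
# scalar objects one by one, which is slower
slow_cache = []
slow_cache.extend(sampled)
```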

**Performance impact by test case type:**

- **Small vocabularies (1-10 words)**: 50-75% faster - the single-pass optimization has less impact, but the `.tolist()` optimization still provides significant gains
- **Medium vocabularies (100-1000 words)**: 10-15% faster - benefits from both optimizations as vocabulary processing becomes more significant
- **Edge cases (empty/zero-prob vocabs)**: Minimal impact since these paths don't heavily use the optimized sections

The line profiler shows the `numpy.random.choice` call remains the dominant bottleneck (68.6% of total time, up from 57.4%), but the cache extension operation is now much more efficient (12.8% vs 27.1% of total time). This optimization is particularly valuable for workloads that frequently create new `_RandomWords` instances or call `next()` repeatedly after cache depletion.
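The overall cached-sampling pattern being profiled can be sketched as follows. This is a hedged reconstruction from the description above, not the real `_RandomWords` (which, per the tests, also exp-transforms probabilities and filters zero-probability lexemes):

```python
import numpy

class RandomWordsSketch:
    """Sketch of the batch-sample-and-cache pattern described above."""

    def __init__(self, words, probs, batch=10_000):
        self.words = list(words)
        p = numpy.asarray(probs, dtype="float64")
        self.probs = p / p.sum()  # normalize so numpy.random.choice accepts them
        self._cache = []
        self._batch = batch

    def next(self):
        if not self._cache:
            # numpy.random.choice is the profiled hot spot; .tolist() keeps
            # the subsequent list.extend cheap.
            sampled = numpy.random.choice(self.words, self._batch, p=self.probs)
            self._cache.extend(sampled.tolist())
        return self._cache.pop()
```

Each refill amortizes one expensive `numpy.random.choice` call over 10,000 cheap `list.pop()` calls, which is why the extend path shows up so prominently in the profile.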

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 3098 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime

```python
import numpy

# imports
import pytest  # used for our unit tests
from spacy.ml.models.multi_task import _RandomWords

# function to test

class DummyLexeme:
    """A dummy lexeme class to mimic spaCy's Lexeme object."""

    def __init__(self, text, prob):
        self.text = text
        self.prob = prob


class DummyVocab(list):
    """A dummy vocab class to mimic spaCy's Vocab object (iterable of Lexemes)."""
    pass


from spacy.ml.models.multi_task import _RandomWords

# unit tests

def make_vocab(word_probs):
    """Utility to create a DummyVocab from a list of (word, prob) pairs."""
    return DummyVocab([DummyLexeme(w, p) for w, p in word_probs])


# 1. Basic Test Cases

def test_single_word():
    # Only one word with nonzero probability
    vocab = make_vocab([("hello", 1.0)])
    rw = _RandomWords(vocab)
    for _ in range(10):
        codeflash_output = rw.next()  # 356μs -> 214μs (66.3% faster)


def test_two_words_equal_prob():
    # Two words, equal probability
    vocab = make_vocab([("foo", 1.0), ("bar", 1.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 377μs -> 242μs (55.8% faster)


def test_three_words_unequal_prob():
    # Three words, unequal probability
    vocab = make_vocab([("a", 2.0), ("b", 1.0), ("c", 0.5)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(300)]  # 384μs -> 253μs (51.7% faster)
    # 'a' should be more frequent than 'b', which should be more frequent than 'c'
    a_count = results.count("a")
    b_count = results.count("b")
    c_count = results.count("c")


def test_zero_prob_words_are_excluded():
    # Words with zero probability should not be returned
    vocab = make_vocab([("x", 1.0), ("y", 0.0), ("z", 2.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 375μs -> 251μs (49.2% faster)


def test_probabilities_are_normalized():
    # The sum of probabilities should be normalized internally
    vocab = make_vocab([("a", 100.0), ("b", 100.0), ("c", 100.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(300)]
    # All words should appear roughly equally
    counts = [results.count(w) for w in ("a", "b", "c")]


# 2. Edge Test Cases

def test_empty_vocab_raises():
    # No words at all
    vocab = make_vocab([])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 17.9μs -> 18.2μs (1.85% slower)


def test_all_zero_probs_raises():
    # All words have zero probability
    vocab = make_vocab([("a", 0.0), ("b", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 13.8μs -> 13.9μs (1.08% slower)


def test_vocab_more_than_10000_words():
    # Only first 10000 words should be used
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be among the first 10000
    results = [rw.next() for _ in range(100)]  # 990μs -> 875μs (13.2% faster)
    allowed = {"w" + str(i) for i in range(1000)}


def test_vocab_exactly_10000_words():
    # Exactly 10000 words
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 976μs -> 857μs (13.9% faster)
    allowed = {"w" + str(i) for i in range(1000)}


def test_probabilities_extreme_values():
    # Extremely large/small probabilities
    vocab = make_vocab([("low", -100.0), ("mid", 0.0), ("high", 100.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]


def test_cache_refill():
    # Ensure cache refills after depletion
    vocab = make_vocab([("a", 1.0), ("b", 2.0)])
    rw = _RandomWords(vocab)
    # Deplete cache (10000 calls)
    for _ in range(1000):
        rw.next()  # 623μs -> 455μs (36.8% faster)
    # After cache depletion, next call should still work
    codeflash_output = rw.next(); word = codeflash_output  # 230ns -> 208ns (10.6% faster)


# 3. Large Scale Test Cases

def test_large_vocab_performance():
    # Large vocab, uniform probabilities
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]  # 994μs -> 867μs (14.6% faster)
    # All returned words should be in the vocab
    allowed = {"w" + str(i) for i in range(1000)}


def test_large_vocab_skewed_probabilities():
    # Large vocab, one word with much higher probability
    vocab = make_vocab([("common", 100.0)] + [("rare" + str(i), 1.0) for i in range(999)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]
    # All rare words should appear at least once
    rare_words = {"rare" + str(i) for i in range(999)}


def test_large_vocab_with_some_zero_probs():
    # Large vocab, some words with zero probability
    vocab = make_vocab([("w" + str(i), 1.0 if i % 2 == 0 else 0.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]  # 915μs -> 791μs (15.6% faster)
    # Only even-indexed words should appear
    allowed = {"w" + str(i) for i in range(0, 1000, 2)}


def test_probabilities_sum_to_zero_raises():
    # All probabilities negative, exp will be very small, sum to zero
    vocab = make_vocab([("a", -1000.0), ("b", -1000.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 21.9μs -> 21.0μs (3.99% faster)


def test_probabilities_nan_raises():
    # Probability is NaN
    vocab = make_vocab([("a", float('nan')), ("b", 1.0)])
    rw = _RandomWords(vocab)
    # Should ignore NaN lexeme, only 'b' should be returned
    results = [rw.next() for _ in range(10)]


def test_probabilities_inf_raises():
    # Probability is inf
    vocab = make_vocab([("a", float('inf')), ("b", 1.0)])
    rw = _RandomWords(vocab)
    # exp(inf) will dominate
    results = [rw.next() for _ in range(10)]
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

```python
# ------------------------------------------------
from typing import List

import numpy

# imports
import pytest
from spacy.ml.models.multi_task import _RandomWords

# Mock Vocab and Lexeme classes for testing

class MockLexeme:
    def __init__(self, text, prob):
        self.text = text
        self.prob = prob


class MockVocab:
    def __init__(self, lexemes):
        self.lexemes = lexemes

    def __iter__(self):
        return iter(self.lexemes)


from spacy.ml.models.multi_task import _RandomWords

# unit tests

# ---- Basic Test Cases ----

def test_single_word():
    # One lexeme with nonzero prob, should always return that word
    vocab = MockVocab([MockLexeme("hello", 1.0)])
    rw = _RandomWords(vocab)
    for _ in range(10):
        codeflash_output = rw.next()  # 360μs -> 207μs (73.5% faster)


def test_two_words_equal_prob():
    # Two words, equal probability, should only return those words
    vocab = MockVocab([MockLexeme("foo", 1.0), MockLexeme("bar", 1.0)])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(20))  # 379μs -> 237μs (60.0% faster)


def test_three_words_different_probs():
    # Three words, different probabilities, should only return those words
    vocab = MockVocab([
        MockLexeme("low", 0.1),
        MockLexeme("mid", 0.5),
        MockLexeme("high", 2.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(30))  # 392μs -> 255μs (54.0% faster)


def test_ignores_zero_prob():
    # Lexemes with zero prob should be ignored
    vocab = MockVocab([
        MockLexeme("skip", 0.0),
        MockLexeme("keep", 1.0),
        MockLexeme("also_skip", 0.0),
        MockLexeme("keep2", 2.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(20))  # 376μs -> 236μs (59.0% faster)
    # Ensure no zero-prob words ever appear
    for _ in range(20):
        codeflash_output = rw.next(); word = codeflash_output  # 5.00μs -> 4.47μs (12.0% faster)


# ---- Edge Test Cases ----

def test_empty_vocab_raises():
    # No words with nonzero prob: should raise in next()
    vocab = MockVocab([MockLexeme("none", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 15.7μs -> 15.3μs (2.71% faster)


def test_all_zero_probs_raises():
    # All lexemes have zero prob: should raise in next()
    vocab = MockVocab([MockLexeme("a", 0.0), MockLexeme("b", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 13.1μs -> 12.8μs (2.16% faster)


def test_large_prob_range():
    # Lexemes with very large and very small probs
    vocab = MockVocab([
        MockLexeme("tiny", -100.0),
        MockLexeme("huge", 100.0),
        MockLexeme("mid", 0.0),  # zero should be ignored
        MockLexeme("normal", 1.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(30))


def test_probabilities_are_normalized():
    # Check that the probabilities sum to 1 after normalization
    vocab = MockVocab([
        MockLexeme("a", 0.1),
        MockLexeme("b", 0.2),
        MockLexeme("c", 0.3)
    ])
    rw = _RandomWords(vocab)


def test_cache_refills_and_returns():
    # Ensure cache refills after exhausting 10000 elements
    vocab = MockVocab([MockLexeme(str(i), 1.0) for i in range(5)])
    rw = _RandomWords(vocab)
    # Call next() more than 10000 times to force cache refill
    words = [rw.next() for _ in range(1000)]  # 447μs -> 302μs (47.8% faster)


def test_non_string_lexeme_text():
    # Lexeme text is not a string (should handle or convert)
    vocab = MockVocab([MockLexeme(123, 1.0), MockLexeme(None, 1.0)])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(10))  # 378μs -> 232μs (62.6% faster)


def test_probabilities_are_exp_transformed():
    # Probabilities are exponentiated before normalization
    vocab = MockVocab([
        MockLexeme("a", 0.0),  # ignored
        MockLexeme("b", 0.0),  # ignored
        MockLexeme("x", 1.0),
        MockLexeme("y", 2.0)
    ])
    rw = _RandomWords(vocab)
    # The expected normalized probabilities
    exp_probs = numpy.exp(numpy.array([1.0, 2.0]))
    norm_probs = exp_probs / exp_probs.sum()


# ---- Large Scale Test Cases ----

def test_maximum_vocab_size():
    # Test with exactly 10000 nonzero-prob lexemes
    vocab = MockVocab([MockLexeme(f"w{i}", 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be in the vocab
    words = set(rw.next() for _ in range(100))  # 983μs -> 888μs (10.7% faster)


def test_vocab_truncation():
    # More than 10000 lexemes: only first 10000 should be used
    vocab = MockVocab([MockLexeme(f"w{i}", 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be in first 10000
    words = set(rw.next() for _ in range(100))  # 976μs -> 862μs (13.1% faster)
    # Words beyond 10000 should never be returned
    for _ in range(100):
        codeflash_output = rw.next()  # 24.5μs -> 22.3μs (9.84% faster)


def test_large_scale_performance():
    # Large vocab, ensure function is reasonably fast (no assertion, just runs)
    vocab = MockVocab([MockLexeme(f"word{i}", 1.0) for i in range(900)])
    rw = _RandomWords(vocab)
    # Should not hang or throw
    for _ in range(900):
        codeflash_output = rw.next(); w = codeflash_output  # 1.18ms -> 1.04ms (13.2% faster)


def test_large_scale_distribution():
    # Check output distribution is roughly proportional to probabilities
    # Use 10 words with increasing probs
    vocab = MockVocab([MockLexeme(f"w{i}", i + 1) for i in range(10)])
    rw = _RandomWords(vocab)
    counts = {f"w{i}": 0 for i in range(10)}
    for _ in range(1000):
        counts[rw.next()] += 1  # 647μs -> 474μs (36.5% faster)
    # The highest-prob word should appear most often
    max_word = max(counts, key=counts.get)
    # The lowest-prob word should appear least often
    min_word = min(counts, key=counts.get)
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_RandomWords.next-mhwrqld3` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 01:47
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025