@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 27% (0.27x) speedup for _RandomWords.next in spacy/ml/models/multi_task.py

⏱️ Runtime : 12.2 milliseconds → 9.66 milliseconds (best of 80 runs)

📝 Explanation and details

The optimized code achieves a **26% speedup** through two key optimizations that reduce computational overhead and memory operations:

**1. Single-pass vocabulary processing with early termination:**
The original code iterates through the entire vocabulary twice - once to extract words and once to extract probabilities - then slices both lists to the first 10,000 items. The optimized version uses a single loop with a counter that breaks after collecting exactly 10,000 lexemes, avoiding unnecessary iterations and the expensive slicing operations on potentially large lists.
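A minimal sketch of the two patterns, under the assumption that lexemes expose `.text` and `.prob` attributes as spaCy's `Lexeme` does; this is an illustration, not the actual spaCy implementation:

```python
from collections import namedtuple

def collect_two_pass(vocab, limit=10_000):
    # Original pattern: two full iterations over the vocab, then two slices.
    words = [lex.text for lex in vocab]
    probs = [lex.prob for lex in vocab]
    return words[:limit], probs[:limit]

def collect_single_pass(vocab, limit=10_000):
    # Optimized pattern: one iteration with early termination, no slicing.
    words, probs = [], []
    for lex in vocab:
        words.append(lex.text)
        probs.append(lex.prob)
        if len(words) >= limit:
            break  # stop as soon as `limit` lexemes are collected
    return words, probs

# A tiny stand-in vocab for demonstration
Lex = namedtuple("Lex", ["text", "prob"])
demo_vocab = [Lex(f"w{i}", 1.0) for i in range(20)]
```

Both variants return the same lists; the single-pass version simply never visits lexemes past the limit.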

**2. Explicit conversion of numpy array to Python list:**
In the `next()` method, the optimized code calls `.tolist()` on the numpy array before extending the cache. This converts the numpy array to a native Python list, which extends much faster than extending with the numpy array directly, thanks to Python's internal list-extension optimizations.
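The pattern can be illustrated in isolation (this is a sketch of the technique, not spaCy's actual code):

```python
import numpy

# A batch of sampled words, as numpy.random.choice would produce
sampled = numpy.random.choice(["a", "b", "c"], 10)

# Fast path: convert to a native Python list before extending the cache
cache = []
cache.extend(sampled.tolist())  # elements are plain Python str

# Extending with the array directly also works, but iterates numpy
# scalar objects one by one, which is slower
slow_cache = []
slow_cache.extend(sampled)
```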

**Performance impact by test case type:**

- **Small vocabularies (1-10 words)**: 50-75% faster - the single-pass optimization has less impact, but the `.tolist()` optimization still provides significant gains
- **Medium vocabularies (100-1000 words)**: 10-15% faster - benefits from both optimizations as vocabulary processing becomes more significant
- **Edge cases (empty/zero-prob vocabs)**: Minimal impact since these paths don't heavily use the optimized sections

The line profiler shows the `numpy.random.choice` call remains the dominant bottleneck (68.6% of total time, up from 57.4%), but the cache extension operation is now much more efficient (12.8% vs 27.1% of total time). This optimization is particularly valuable for workloads that frequently create new `_RandomWords` instances or call `next()` repeatedly after cache depletion.
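The overall cached-sampling pattern being profiled can be sketched as follows. This is a hedged reconstruction from the description above, not the real `_RandomWords` (which, per the tests, also exp-transforms probabilities and filters zero-probability lexemes):

```python
import numpy

class RandomWordsSketch:
    """Sketch of the batch-sample-and-cache pattern described above."""

    def __init__(self, words, probs, batch=10_000):
        self.words = list(words)
        p = numpy.asarray(probs, dtype="float64")
        self.probs = p / p.sum()  # normalize so numpy.random.choice accepts them
        self._cache = []
        self._batch = batch

    def next(self):
        if not self._cache:
            # numpy.random.choice is the profiled hot spot; .tolist() keeps
            # the subsequent list.extend cheap.
            sampled = numpy.random.choice(self.words, self._batch, p=self.probs)
            self._cache.extend(sampled.tolist())
        return self._cache.pop()
```

Each refill amortizes one expensive `numpy.random.choice` call over 10,000 cheap `list.pop()` calls, which is why the extend path shows up so prominently in the profile.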

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 3098 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime

```python
import numpy

# imports
import pytest  # used for our unit tests
from spacy.ml.models.multi_task import _RandomWords

# function to test

class DummyLexeme:
    """A dummy lexeme class to mimic spaCy's Lexeme object."""

    def __init__(self, text, prob):
        self.text = text
        self.prob = prob


class DummyVocab(list):
    """A dummy vocab class to mimic spaCy's Vocab object (iterable of Lexemes)."""
    pass


from spacy.ml.models.multi_task import _RandomWords

# unit tests

def make_vocab(word_probs):
    """Utility to create a DummyVocab from a list of (word, prob) pairs."""
    return DummyVocab([DummyLexeme(w, p) for w, p in word_probs])


# 1. Basic Test Cases

def test_single_word():
    # Only one word with nonzero probability
    vocab = make_vocab([("hello", 1.0)])
    rw = _RandomWords(vocab)
    for _ in range(10):
        codeflash_output = rw.next()  # 356μs -> 214μs (66.3% faster)


def test_two_words_equal_prob():
    # Two words, equal probability
    vocab = make_vocab([("foo", 1.0), ("bar", 1.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 377μs -> 242μs (55.8% faster)


def test_three_words_unequal_prob():
    # Three words, unequal probability
    vocab = make_vocab([("a", 2.0), ("b", 1.0), ("c", 0.5)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(300)]  # 384μs -> 253μs (51.7% faster)
    # 'a' should be more frequent than 'b', which should be more frequent than 'c'
    a_count = results.count("a")
    b_count = results.count("b")
    c_count = results.count("c")


def test_zero_prob_words_are_excluded():
    # Words with zero probability should not be returned
    vocab = make_vocab([("x", 1.0), ("y", 0.0), ("z", 2.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 375μs -> 251μs (49.2% faster)


def test_probabilities_are_normalized():
    # The sum of probabilities should be normalized internally
    vocab = make_vocab([("a", 100.0), ("b", 100.0), ("c", 100.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(300)]
    # All words should appear roughly equally
    counts = [results.count(w) for w in ("a", "b", "c")]


# 2. Edge Test Cases

def test_empty_vocab_raises():
    # No words at all
    vocab = make_vocab([])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 17.9μs -> 18.2μs (1.85% slower)


def test_all_zero_probs_raises():
    # All words have zero probability
    vocab = make_vocab([("a", 0.0), ("b", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 13.8μs -> 13.9μs (1.08% slower)


def test_vocab_more_than_10000_words():
    # Only first 10000 words should be used
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be among the first 10000
    results = [rw.next() for _ in range(100)]  # 990μs -> 875μs (13.2% faster)
    allowed = {"w" + str(i) for i in range(1000)}


def test_vocab_exactly_10000_words():
    # Exactly 10000 words
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]  # 976μs -> 857μs (13.9% faster)
    allowed = {"w" + str(i) for i in range(1000)}


def test_probabilities_extreme_values():
    # Extremely large/small probabilities
    vocab = make_vocab([("low", -100.0), ("mid", 0.0), ("high", 100.0)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(100)]


def test_cache_refill():
    # Ensure cache refills after depletion
    vocab = make_vocab([("a", 1.0), ("b", 2.0)])
    rw = _RandomWords(vocab)
    # Deplete cache (10000 calls)
    for _ in range(1000):
        rw.next()  # 623μs -> 455μs (36.8% faster)
    # After cache depletion, next call should still work
    codeflash_output = rw.next(); word = codeflash_output  # 230ns -> 208ns (10.6% faster)


# 3. Large Scale Test Cases

def test_large_vocab_performance():
    # Large vocab, uniform probabilities
    vocab = make_vocab([("w" + str(i), 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]  # 994μs -> 867μs (14.6% faster)
    # All returned words should be in the vocab
    allowed = {"w" + str(i) for i in range(1000)}


def test_large_vocab_skewed_probabilities():
    # Large vocab, one word with much higher probability
    vocab = make_vocab([("common", 100.0)] + [("rare" + str(i), 1.0) for i in range(999)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]
    # All rare words should appear at least once
    rare_words = {"rare" + str(i) for i in range(999)}


def test_large_vocab_with_some_zero_probs():
    # Large vocab, some words with zero probability
    vocab = make_vocab([("w" + str(i), 1.0 if i % 2 == 0 else 0.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    results = [rw.next() for _ in range(1000)]  # 915μs -> 791μs (15.6% faster)
    # Only even-indexed words should appear
    allowed = {"w" + str(i) for i in range(0, 1000, 2)}


def test_probabilities_sum_to_zero_raises():
    # All probabilities negative, exp will be very small, sum to zero
    vocab = make_vocab([("a", -1000.0), ("b", -1000.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 21.9μs -> 21.0μs (3.99% faster)


def test_probabilities_nan_raises():
    # Probability is NaN
    vocab = make_vocab([("a", float('nan')), ("b", 1.0)])
    rw = _RandomWords(vocab)
    # Should ignore NaN lexeme, only 'b' should be returned
    results = [rw.next() for _ in range(10)]


def test_probabilities_inf_raises():
    # Probability is inf
    vocab = make_vocab([("a", float('inf')), ("b", 1.0)])
    rw = _RandomWords(vocab)
    # exp(inf) will dominate
    results = [rw.next() for _ in range(10)]
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

```python
# ------------------------------------------------
from typing import List

import numpy

# imports
import pytest
from spacy.ml.models.multi_task import _RandomWords

# Mock Vocab and Lexeme classes for testing

class MockLexeme:
    def __init__(self, text, prob):
        self.text = text
        self.prob = prob


class MockVocab:
    def __init__(self, lexemes):
        self.lexemes = lexemes

    def __iter__(self):
        return iter(self.lexemes)


from spacy.ml.models.multi_task import _RandomWords

# unit tests

# ---- Basic Test Cases ----

def test_single_word():
    # One lexeme with nonzero prob, should always return that word
    vocab = MockVocab([MockLexeme("hello", 1.0)])
    rw = _RandomWords(vocab)
    for _ in range(10):
        codeflash_output = rw.next()  # 360μs -> 207μs (73.5% faster)


def test_two_words_equal_prob():
    # Two words, equal probability, should only return those words
    vocab = MockVocab([MockLexeme("foo", 1.0), MockLexeme("bar", 1.0)])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(20))  # 379μs -> 237μs (60.0% faster)


def test_three_words_different_probs():
    # Three words, different probabilities, should only return those words
    vocab = MockVocab([
        MockLexeme("low", 0.1),
        MockLexeme("mid", 0.5),
        MockLexeme("high", 2.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(30))  # 392μs -> 255μs (54.0% faster)


def test_ignores_zero_prob():
    # Lexemes with zero prob should be ignored
    vocab = MockVocab([
        MockLexeme("skip", 0.0),
        MockLexeme("keep", 1.0),
        MockLexeme("also_skip", 0.0),
        MockLexeme("keep2", 2.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(20))  # 376μs -> 236μs (59.0% faster)
    # Ensure no zero-prob words ever appear
    for _ in range(20):
        codeflash_output = rw.next(); word = codeflash_output  # 5.00μs -> 4.47μs (12.0% faster)


# ---- Edge Test Cases ----

def test_empty_vocab_raises():
    # No words with nonzero prob: should raise in next()
    vocab = MockVocab([MockLexeme("none", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 15.7μs -> 15.3μs (2.71% faster)


def test_all_zero_probs_raises():
    # All lexemes have zero prob: should raise in next()
    vocab = MockVocab([MockLexeme("a", 0.0), MockLexeme("b", 0.0)])
    rw = _RandomWords(vocab)
    with pytest.raises(ValueError):
        rw.next()  # 13.1μs -> 12.8μs (2.16% faster)


def test_large_prob_range():
    # Lexemes with very large and very small probs
    vocab = MockVocab([
        MockLexeme("tiny", -100.0),
        MockLexeme("huge", 100.0),
        MockLexeme("mid", 0.0),  # zero should be ignored
        MockLexeme("normal", 1.0)
    ])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(30))


def test_probabilities_are_normalized():
    # Check that the probabilities sum to 1 after normalization
    vocab = MockVocab([
        MockLexeme("a", 0.1),
        MockLexeme("b", 0.2),
        MockLexeme("c", 0.3)
    ])
    rw = _RandomWords(vocab)


def test_cache_refills_and_returns():
    # Ensure cache refills after exhausting 10000 elements
    vocab = MockVocab([MockLexeme(str(i), 1.0) for i in range(5)])
    rw = _RandomWords(vocab)
    # Call next() more than 10000 times to force cache refill
    words = [rw.next() for _ in range(1000)]  # 447μs -> 302μs (47.8% faster)


def test_non_string_lexeme_text():
    # Lexeme text is not a string (should handle or convert)
    vocab = MockVocab([MockLexeme(123, 1.0), MockLexeme(None, 1.0)])
    rw = _RandomWords(vocab)
    results = set(rw.next() for _ in range(10))  # 378μs -> 232μs (62.6% faster)


def test_probabilities_are_exp_transformed():
    # Probabilities are exponentiated before normalization
    vocab = MockVocab([
        MockLexeme("a", 0.0),  # ignored
        MockLexeme("b", 0.0),  # ignored
        MockLexeme("x", 1.0),
        MockLexeme("y", 2.0)
    ])
    rw = _RandomWords(vocab)
    # The expected normalized probabilities
    exp_probs = numpy.exp(numpy.array([1.0, 2.0]))
    norm_probs = exp_probs / exp_probs.sum()


# ---- Large Scale Test Cases ----

def test_maximum_vocab_size():
    # Test with exactly 10000 nonzero-prob lexemes
    vocab = MockVocab([MockLexeme(f"w{i}", 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be in the vocab
    words = set(rw.next() for _ in range(100))  # 983μs -> 888μs (10.7% faster)


def test_vocab_truncation():
    # More than 10000 lexemes: only first 10000 should be used
    vocab = MockVocab([MockLexeme(f"w{i}", 1.0) for i in range(1000)])
    rw = _RandomWords(vocab)
    # All returned words should be in first 10000
    words = set(rw.next() for _ in range(100))  # 976μs -> 862μs (13.1% faster)
    # Words beyond 10000 should never be returned
    for _ in range(100):
        codeflash_output = rw.next()  # 24.5μs -> 22.3μs (9.84% faster)


def test_large_scale_performance():
    # Large vocab, ensure function is reasonably fast (no assertion, just runs)
    vocab = MockVocab([MockLexeme(f"word{i}", 1.0) for i in range(900)])
    rw = _RandomWords(vocab)
    # Should not hang or throw
    for _ in range(900):
        codeflash_output = rw.next(); w = codeflash_output  # 1.18ms -> 1.04ms (13.2% faster)


def test_large_scale_distribution():
    # Check output distribution is roughly proportional to probabilities
    # Use 10 words with increasing probs
    vocab = MockVocab([MockLexeme(f"w{i}", i + 1) for i in range(10)])
    rw = _RandomWords(vocab)
    counts = {f"w{i}": 0 for i in range(10)}
    for _ in range(1000):
        counts[rw.next()] += 1  # 647μs -> 474μs (36.5% faster)
    # The highest-prob word should appear most often
    max_word = max(counts, key=counts.get)
    # The lowest-prob word should appear least often
    min_word = min(counts, key=counts.get)
```

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_RandomWords.next-mhwrqld3` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 01:47
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025