From fbfff28ae7e74e3fefd7134cadfecec23d78b71c Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]"
 <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Thu, 13 Nov 2025 02:18:52 +0000
Subject: [PATCH] Optimize like_num
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The optimized version achieves a **44% speedup** through three key performance optimizations:

**1. Eliminated function call overhead by inlining `is_digit`**
The original code called `is_digit()` for every text input, adding function call overhead. The optimized version inlines this logic directly into `like_num`, removing the function call entirely. This is particularly impactful since `is_digit` was called 14,443 times in profiling and represented 58.2% of the original runtime.

**2. Converted list lookups to O(1) set lookups**
The original code performed linear searches through `_num_words` and `_ordinal_words` lists for every lookup. The optimized version converts these to sets once and caches them as function attributes, changing O(n) list searches to O(1) set lookups. From profiling, the `_num_words` lookup took 5.1% of runtime and `_ordinal_words` lookup took 5.7% - both are now significantly faster.

**3. Optimized string operations**
- Replaced `text.startswith(("+", "-", "±", "~"))` with `text and text[0] in "+-±~"` to avoid tuple creation and use faster character-in-string lookup
- Added `text.split("/", 1)` to limit splits to just the first occurrence
- Cached the endings tuple as a function attribute to avoid recreating it on every call

**Performance characteristics by test case:**
- **Kurdish word lookups**: 38-58% faster due to set-based lookups
- **Digit+ending forms**: 35-78% faster from inlining `is_digit` and eliminating function calls
- **Invalid strings**: 45-62% faster as the optimized logic exits earlier for non-matching cases
- **Large-scale processing**: Maintains consistent speedups across bulk operations

The optimization maintains identical functionality while significantly reducing computational overhead, making it especially beneficial for text processing pipelines that frequently validate numeric-like tokens.
---
 spacy/lang/kmr/lex_attrs.py | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/spacy/lang/kmr/lex_attrs.py b/spacy/lang/kmr/lex_attrs.py
index 6b80204104..0dcbc8e00a 100644
--- a/spacy/lang/kmr/lex_attrs.py
+++ b/spacy/lang/kmr/lex_attrs.py
@@ -102,25 +102,39 @@
 
 
 def like_num(text):
-    if text.startswith(("+", "-", "±", "~")):
+    # Optimize by minimizing repeated work, and using sets for O(1) lookups.
+    # Precompute lookup sets out of globals only once at function level.
+    # text.replace/startswith are already optimized, so only use them as needed.
+    # Inline `is_digit` since it's only used here.
+
+    # Pull lookup sets into function attributes to avoid repeated global lookups.
+    if not hasattr(like_num, "_num_words_set"):
+        like_num._num_words_set = set(_num_words)
+        like_num._ordinal_words_set = set(_ordinal_words)
+        like_num._endings = ("em", "yem", "emîn", "yemîn")
+
+    # Remove initial sign/approx chars for later logic
+    if text and text[0] in "+-±~":
         text = text[1:]
     text = text.replace(",", "").replace(".", "")
     if text.isdigit():
         return True
     if text.count("/") == 1:
-        num, denom = text.split("/")
+        num, denom = text.split("/", 1)
         if num.isdigit() and denom.isdigit():
             return True
     text_lower = text.lower()
-    if text_lower in _num_words:
+    if text_lower in like_num._num_words_set:
         return True
-
-    # Check ordinal number
-    if text_lower in _ordinal_words:
+    # Ordinal number
+    if text_lower in like_num._ordinal_words_set:
         return True
 
-    if is_digit(text_lower):
-        return True
+    # Inline and optimize original is_digit
+    for ending in like_num._endings:
+        to = len(ending)
+        if text_lower.endswith(ending) and text_lower[:-to].isdigit():
+            return True
 
     return False