⚡ Bolt: Optimize keyword density regular expressions#223
- Pre-compile title and company extraction regex patterns
- Avoid `re.IGNORECASE` by lowercasing text and keywords for matching

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
Reviewer's Guide

Pre-compiles regex patterns for job title and company extraction and optimizes keyword counting by performing a single lowercase conversion of resume text and keywords, removing repeated case-insensitive regex overhead inside loops.

Class diagram for keyword_density module optimizations

classDiagram
class keyword_density {
<<module>>
+list _TITLE_PATTERNS
+list _COMPANY_PATTERNS
+tuple _extract_job_details(self, job_description)
+dict _count_keywords_in_resume(self, keywords, resume_data)
}
class _TITLE_PATTERNS {
<<regex_list>>
+pattern0 (job title|position|title):\s*([^\n]+) flags: IGNORECASE MULTILINE
+pattern1 ^([^\n]+)\s*[-|]\s*[^|]+$ flags: IGNORECASE MULTILINE
+pattern2 #\s*([^\n]+) flags: IGNORECASE MULTILINE
}
class _COMPANY_PATTERNS {
<<regex_list>>
+pattern0 (company|organization):\s*([^\n]+) flags: IGNORECASE
+pattern1 (at|from)\s+([A-Z][^\n]+?)(\s+[-\u2014]|\s+$) flags: IGNORECASE
}
keyword_density "1" o-- "1" _TITLE_PATTERNS : uses
keyword_density "1" o-- "1" _COMPANY_PATTERNS : uses
Hey - I've found 1 issue, and left some high level feedback:
- In `_count_keywords_in_resume`, you’re still recompiling the regex for each keyword on every call; consider pre-compiling and caching the per-keyword patterns (e.g., keyed by the original keyword string) to avoid repeated compilation overhead if this is called frequently.
- When normalizing text for case-insensitive matching, `str.casefold()` is usually more robust than `str.lower()` for non-ASCII characters; if resumes can contain international text, switching to `casefold()` would preserve the intended case-insensitive behavior more reliably.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_count_keywords_in_resume`, you’re still recompiling the regex for each keyword on every call; consider pre-compiling and caching the per-keyword patterns (e.g., keyed by the original keyword string) to avoid repeated compilation overhead if this is called frequently.
- When normalizing text for case-insensitive matching, `str.casefold()` is usually more robust than `str.lower()` for non-ASCII characters; if resumes can contain international text, switching to `casefold()` would preserve the intended case-insensitive behavior more reliably.
## Individual Comments
### Comment 1
<location path="cli/utils/keyword_density.py" line_range="367-373" />
<code_context>
- # Get all resume text
- all_text = self._get_all_text(resume_data)
+ # Get all resume text and lower it once to optimize keyword matching
+ # avoiding the overhead of re.IGNORECASE for each keyword
+ all_text = self._get_all_text(resume_data).lower()
for keyword, _ in keywords:
</code_context>
<issue_to_address>
**suggestion:** Consider `casefold()` instead of `lower()` for more robust case-insensitive matching across locales.
Because this logic normalizes the full text once for case-insensitive matching, `str.casefold()` is a better fit than `str.lower()`: it handles more Unicode edge cases (e.g., ß, some accented characters) while remaining a drop-in replacement with similar performance.
```suggestion
# Get all resume text and casefold it once to optimize keyword matching
# avoiding the overhead of re.IGNORECASE for each keyword and handling
# more Unicode edge cases than a simple lowercasing
all_text = self._get_all_text(resume_data).casefold()
for keyword, _ in keywords:
# Casefold keyword for matching against casefolded text
kw_lower = keyword.casefold()
```
</issue_to_address>
+ # Get all resume text and lower it once to optimize keyword matching
+ # avoiding the overhead of re.IGNORECASE for each keyword
+ all_text = self._get_all_text(resume_data).lower()

for keyword, _ in keywords:
- # Count occurrences (case-insensitive)
- count = len(re.findall(rf"\b{re.escape(keyword)}\b", all_text, re.IGNORECASE))
+ # Lowercase keyword for matching against lowercased text
+ kw_lower = keyword.lower()
**suggestion:** Consider `casefold()` instead of `lower()` for more robust case-insensitive matching across locales.
Because this logic normalizes the full text once for case-insensitive matching, `str.casefold()` is a better fit than `str.lower()`: it handles more Unicode edge cases (e.g., ß, some accented characters) while remaining a drop-in replacement with similar performance.
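
A small illustration of the difference the comment describes, using the German ß as the edge case. The `contains_ci` helper is hypothetical and exists only for this demonstration.

```python
def contains_ci(needle: str, haystack: str, fold=str.casefold) -> bool:
    """Case-insensitive substring test using the given folding function."""
    return fold(needle) in fold(haystack)


# casefold() maps "ß" to "ss", so an upper-case keyword still matches.
with_casefold = contains_ci("STRASSE", "Hauptstraße 5")

# lower() leaves "ß" untouched, so the same keyword is missed.
with_lower = contains_ci("STRASSE", "Hauptstraße 5", fold=str.lower)
```

Because `casefold()` can change the length of a string, any code that maps match offsets back to the original text would need extra care; counting matches, as this PR does, is unaffected.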
💡 What:
- Pre-compiled the title and company extraction regex patterns (`_TITLE_PATTERNS` and `_COMPANY_PATTERNS`) as module-level constants in `cli/utils/keyword_density.py`.
- Optimized `_count_keywords_in_resume` by lowercasing the entire resume text once and matching it against lowercased keywords, avoiding the significant overhead of Python's `re.IGNORECASE` flag within loops.

🎯 Why:
- Module-level compilation keeps pattern setup out of `_extract_job_details`.
- In `_count_keywords_in_resume`, applying `re.IGNORECASE` inside a loop over large text blocks can add significant overhead. Lowercasing everything explicitly avoids this bottleneck while maintaining functional equivalence.

📊 Impact:

🔬 Measurement:
Run `pytest tests/test_keyword_density.py` to ensure all extraction and calculation logic remains intact.

PR created automatically by Jules for task 4669696908226529015 started by @anchapin
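
As a rough sanity check of the equivalence claim, the two counting strategies can be compared side by side. This is a sketch under assumptions: keywords are passed as plain strings (the real method iterates over pairs), and the function names are hypothetical. It shows agreement on an ASCII sample only and makes no claim about timing.

```python
import re


def count_ignorecase(keywords, text):
    """Original approach: case-insensitive regex applied per keyword."""
    return {
        kw: len(re.findall(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE))
        for kw in keywords
    }


def count_lowercased(keywords, text):
    """PR approach: lowercase the text once, then match lowercased keywords."""
    lowered = text.lower()
    return {
        kw: len(re.findall(rf"\b{re.escape(kw.lower())}\b", lowered))
        for kw in keywords
    }


sample = "Python, SQL and python. Built CI/CD pipelines; sql tuning."
keywords = ["Python", "SQL", "CI/CD"]
old_counts = count_ignorecase(keywords, sample)
new_counts = count_lowercased(keywords, sample)
```

Wrapping both functions in `timeit.timeit` on a realistic resume would give an actual measurement for the Impact section.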
Summary by Sourcery
Optimize keyword extraction and keyword counting performance in the resume keyword density utility.
Enhancements: