⚡ Bolt: Optimize ATS and Keyword Density performance #216
Conversation
- Pre-lowercase resume text in `_count_keywords_in_resume` to avoid `re.IGNORECASE` overhead in a loop.
- Hoist pre-compiled regex patterns (`_TABLE_PATTERN`, `_SPECIAL_CHARS_PATTERN`, `_EMAIL_PATTERN`, `_PHONE_PATTERN`, `_QUANTIFIABLE_PATTERN`, `_ACRONYM_PATTERN`) and static lists (`_ACTION_VERBS`) in `ats_generator.py` to module level.
- Fix a silent case-sensitivity bug in the ATS `_get_all_text` method while retaining lowercase logic where necessary.

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
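The hoisting described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the constant names mirror the PR description, but the pattern bodies and the `count_acronyms` helper are assumptions.

```python
import re

# Module-level constants: compiled once at import time instead of being
# re-built inside method calls. Pattern bodies here are illustrative.
_ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,4}\b")
_EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.\w+")
_ACTION_VERBS = ("built", "led", "designed", "implemented", "optimized")


def count_acronyms(text: str) -> int:
    # Runs the pre-compiled pattern over case-preserved text; lowercasing
    # the input first would make this always return 0, which is the bug
    # the PR fixes in _get_all_text.
    return len(_ACRONYM_PATTERN.findall(text))
```

Note that `re` already caches compiled patterns internally, so the main win from hoisting is skipping the cache lookup and f-string construction on every call inside hot loops, plus making the constants reusable across methods.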
Reviewer's Guide

Optimizes ATS scoring and keyword density by hoisting regex/list constants to module scope, separating lowercase/uppercase text flows, and updating tests and the keyword-counting logic to use pre-lowercased text for faster, correct matching (especially for acronyms and keyword density).

Sequence diagram for optimized keyword density counting

sequenceDiagram
participant KD as KeywordDensityAnalyzer
participant ATS as ATSGenerator
participant RE as re_module
KD->>ATS: _get_all_text(resume_data)
ATS-->>KD: all_text (case-preserved)
KD->>KD: all_text_lower = all_text.lower()
loop for each keyword in keywords
KD->>KD: keyword_lower = keyword.lower()
KD->>RE: findall(boundary + keyword_lower + boundary, all_text_lower)
RE-->>KD: matches
KD->>KD: counts[keyword] = len(matches)
end
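In code, the flow in this diagram might look like the following sketch. It is hypothetical: the real `_count_keywords_in_resume` in `keyword_density.py` takes `resume_data` and weighted keywords, and may differ in detail.

```python
import re


def count_keywords(all_text: str, keywords: list[str]) -> dict[str, int]:
    # Lowercase the resume text once, outside the loop, instead of paying
    # the re.IGNORECASE cost on every findall call.
    all_text_lower = all_text.lower()
    counts: dict[str, int] = {}
    for keyword in keywords:
        keyword_lower = keyword.lower()
        # Word-boundary match against the pre-lowercased text.
        counts[keyword] = len(
            re.findall(rf"\b{re.escape(keyword_lower)}\b", all_text_lower)
        )
    return counts
```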
Class diagram for ATSGenerator and keyword density utilities

classDiagram
class ATSGenerator {
_check_format_parsing(resume_data)
_check_contact_info(resume_data)
_check_readability(resume_data)
_get_all_text(resume_data)
}
class ATSModuleConstants {
+_TABLE_PATTERN : Pattern
+_SPECIAL_CHARS_PATTERN : Pattern
+_EMAIL_PATTERN : Pattern
+_PHONE_PATTERN : Pattern
+_QUANTIFIABLE_PATTERN : Pattern
+_ACRONYM_PATTERN : Pattern
+_ACTION_VERBS : Tuple[str]
}
class KeywordDensityAnalyzer {
_count_keywords_in_resume(resume_data, keywords)
_get_all_text(resume_data)
}
ATSModuleConstants <.. ATSGenerator : uses
ATSModuleConstants <.. KeywordDensityAnalyzer : uses
ATSGenerator <.. KeywordDensityAnalyzer : shares_text_flow
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="cli/utils/keyword_density.py" line_range="370-371" />
<code_context>
for keyword, _ in keywords:
# Count occurrences (case-insensitive)
- count = len(re.findall(rf"\b{re.escape(keyword)}\b", all_text, re.IGNORECASE))
+ # The keyword should already be lowercased from extraction
+ count = len(re.findall(rf"\b{re.escape(keyword.lower())}\b", all_text_lower))
counts[keyword] = count
</code_context>
<issue_to_address>
**nitpick:** The comment about lowercased keywords and the extra `keyword.lower()` call are slightly inconsistent.
The code assumes keywords are pre-lowercased but still calls `keyword.lower()` in the regex. This is redundant if the contract holds. Either remove `.lower()` and enforce/validate lowercase earlier, or keep the defensive `.lower()` and update the comment to indicate that this line normalizes keywords as a safeguard.
</issue_to_address>
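The two resolutions this nitpick offers can be sketched as follows. Both functions are illustrative only; neither is the actual project code, and the function names are invented for the comparison.

```python
import re


def count_strict(keyword: str, all_text_lower: str) -> int:
    # Option A: enforce the contract -- reject keywords that are not
    # already lowercase, and drop the redundant .lower() call.
    if keyword != keyword.lower():
        raise ValueError("keywords must be pre-lowercased")
    return len(re.findall(rf"\b{re.escape(keyword)}\b", all_text_lower))


def count_defensive(keyword: str, all_text_lower: str) -> int:
    # Option B: keep .lower() as a safeguard, and say so in the comment
    # instead of claiming the input is already normalized.
    return len(re.findall(rf"\b{re.escape(keyword.lower())}\b", all_text_lower))
```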
### Comment 2
<location path="tests/test_ats_generator.py" line_range="359-363" />
<code_context>
- assert "tech corp" in text
- assert "built apis" in text
- assert "python" in text
+ # Text is returned as is
+ assert "John" in text
+ assert "Tech Corp" in text
+ assert "Built APIs" in text
+ assert "Python" in text
</code_context>
<issue_to_address>
**suggestion (testing):** Add a regression test that explicitly proves acronyms are now detected correctly in readability scoring
This change only checks that `_get_all_text()` preserves casing; it doesn’t verify that the readability logic actually detects acronyms now that the `\b[A-Z]{2,4}\b` heuristic is no longer broken by lowercasing.
Please add a regression test that:
- Builds minimal `resume_data` with one or more acronyms (e.g., `API`, `SDK`) plus some normal text.
- Invokes the public readability-scoring entrypoint (or `_check_readability` if that’s the intended test surface).
- Asserts that the resulting details/suggestions show the expected acronym count and that the "Minimal acronyms" branch is taken.
That will directly lock in the behavior this PR fixes and guard against future regressions in casing or the acronym pattern.
Suggested implementation:
```python
# Text is returned as is
assert "John" in text
assert "Tech Corp" in text
assert "Built APIs" in text
assert "Python" in text
def test_readability_detects_acronyms_minimal_branch(ats_generator):
"""
Regression test: acronyms should be detected correctly in readability scoring,
and the "Minimal acronyms" branch should be exercised.
"""
# Minimal resume_data containing acronyms and normal text
resume_data = {
"basics": {
"name": "John Doe",
"summary": "Experienced engineer working with API and SDK integrations.",
},
"work": [
{
"name": "Tech Corp",
"position": "Software Engineer",
"highlights": [
"Designed REST API for internal tools.",
"Integrated third-party SDK for payments.",
],
}
],
}
# Prefer the public readability-scoring entrypoint if it exists,
# otherwise fall back to the internal helper for this regression test.
readability_func = getattr(
ats_generator,
"score_readability",
getattr(ats_generator, "_check_readability"),
)
readability_result = readability_func(resume_data)
# The exact structure of readability_result may differ; this assumes
# a dict-like contract with "details" and "suggestions" keys.
details = readability_result["details"]
suggestions = readability_result["suggestions"]
# Prove that acronyms are actually being detected by the scoring logic
assert details["acronym_count"] >= 2
# Prove that the "Minimal acronyms" branch is taken in the suggestions
assert any(
"Minimal acronyms" in (suggestion.get("message") or suggestion.get("text", ""))
for suggestion in suggestions
)
```
You may need to adjust the new test to match your actual readability API:
1. If the public entrypoint is named something other than `score_readability`, update the `getattr(ats_generator, "score_readability", ...)` call accordingly.
2. If `_check_readability` is not a method on `ats_generator` (for example, if it is a module-level function), replace `getattr(ats_generator, "_check_readability")` with the correct reference.
3. Adapt the handling of `readability_result`:
- If it returns a tuple instead of a dict (e.g., `(score, details, suggestions)`), unpack it accordingly and remove the `["details"]` / `["suggestions"]` indexing.
- If the keys are named differently (e.g., `acronyms` instead of `acronym_count`), update `details["acronym_count"]` to the correct key.
4. If suggestion entries use a different field name than `message`/`text`, update the `suggestion.get(...)` accessors in the `any(...)` check.
</issue_to_address>
💡 What:

- Pre-lowercase resume text in `keyword_density.py` to prevent redundant case-insensitive matching during keyword counting.
- Hoist pre-compiled regex patterns and static lists to module level in `ats_generator.py`.
- Fix `_get_all_text()` in `ats_generator.py` to return case-preserved text so that the acronym regex (`\b[A-Z]{2,4}\b`) functions correctly, while preserving lowercase matching for other heuristic checks.

🎯 Why:
These specific operations are frequently hit during reporting or keyword-density generation loops. Compiling regexes and manipulating string case repeatedly inside standard method executions or nested loops introduces unnecessary CPU cycles and allocations. Removing the `.lower()` cast in `_get_all_text()` also inadvertently fixes a bug where acronyms were never found because the text had already been mutated to lowercase.

📊 Impact:
- Faster `ATSGenerator._count_keywords_in_resume`, since `re.IGNORECASE` is significantly slower than executing a standard regex over a pre-lowercased string, especially when looping over up to 20 AI-extracted job keywords.

🔬 Measurement:
- The `pytest` test suite passes successfully.
- Added `re.IGNORECASE` to the quantifiable heuristic pattern so that string constants like `Users` or `Projects` are matched in the raw-case text flow.

PR created automatically by Jules for task 850629312244089742 started by @anchapin
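The `re.IGNORECASE` performance claim above can be spot-checked with a quick micro-benchmark sketch. This is illustrative only: absolute timings are machine- and Python-version-dependent, and the sample text is invented.

```python
import re
import timeit

# Synthetic resume-like text, repeated to make timings measurable.
text = "Built REST APIs with Python and Django. " * 500
text_lower = text.lower()
pattern = r"\bpython\b"

# Case-insensitive matching on raw text vs. plain matching on
# text that was lowercased once up front.
t_icase = timeit.timeit(lambda: re.findall(pattern, text, re.IGNORECASE), number=200)
t_plain = timeit.timeit(lambda: re.findall(pattern, text_lower), number=200)
print(f"IGNORECASE: {t_icase:.4f}s  pre-lowercased: {t_plain:.4f}s")
```

Both approaches find the same matches; the question the benchmark answers is only how much per-call overhead the flag adds inside a loop.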
Summary by Sourcery
Optimize ATS scoring and keyword density performance while fixing acronym handling in resume text processing.