perf: optimize pagerank using sparse vectorization and np.add.at#40
perf: optimize pagerank using sparse vectorization and np.add.at#40Phantom0299 wants to merge 1 commit into
Conversation
WalkthroughPageRank implementation refactored from Python dict-based loops to NumPy vectorized operations, including precomputed node/edge index arrays, vectorized rank updates with np.add.at, and vectorized checksum/top-node selection. Project now declares NumPy as a runtime dependency. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 29-43: There are trailing spaces on the blank lines around the
graph-building block (near variables node_to_idx, src_list, out_degrees, and
src_indices) causing pre-commit failures; open the chunk that defines
node_to_idx, src_list, dst_list, out_degrees and src_indices and remove any
trailing whitespace from the empty lines (or run a trim-whitespace/format step)
so the lines are truly empty and the file passes CI.
- Line 3: The module imports numpy (import numpy as np) but pyproject.toml lacks
a [project].dependencies entry; add numpy to the project dependencies by adding
a [project] dependencies array with "numpy>=1.20.0" so the package manager
installs numpy before runtime and the import in
chuck/tasks/graph_analytics/task.py resolves.
- Around line 54-56: The current selection using np.argmax(ranks)
(top_idx/top_node) breaks ties differently than the original max(nodes, key=...)
behavior: np.argmax picks the first occurrence, while the original chose the
lexicographically larger node when ranks tie. Replace the
np.argmax/top_idx/top_node logic with a direct max over nodes using the tuple
key (ranks[node], node) so ties pick the node with the larger name (e.g.,
top_node = max(nodes, key=lambda node: (ranks[node], node))); leave checksum
calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged. Ensure
references to top_idx are removed or updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5a12ab7e-ae0c-4d1f-9246-3969a084be6c
📒 Files selected for processing (1)
chuck/tasks/graph_analytics/task.py
| top_idx = np.argmax(ranks) | ||
| top_node = nodes[top_idx] | ||
| checksum = np.sum(np.arange(1, n + 1) * ranks) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for original PageRank top-node selection logic in git history
git log -p --all -S 'max(nodes' -- '*.py' | head -80Repository: iiitl/chuck
Length of output: 37
🏁 Script executed:
# First, check the current state of the file around lines 54-56
wc -l chuck/tasks/graph_analytics/task.pyRepository: iiitl/chuck
Length of output: 93
🏁 Script executed:
# Show the context around lines 54-56
sed -n '45,65p' chuck/tasks/graph_analytics/task.pyRepository: iiitl/chuck
Length of output: 650
🏁 Script executed:
# Check git history for this file to find the original implementation
git log --oneline chuck/tasks/graph_analytics/task.py | head -20Repository: iiitl/chuck
Length of output: 127
🏁 Script executed:
# Look for the original code in git diff
git log -p chuck/tasks/graph_analytics/task.py | head -200Repository: iiitl/chuck
Length of output: 37
🏁 Script executed:
# Show the specific commit that changed the PageRank implementation
git show 180e97bRepository: iiitl/chuck
Length of output: 50370
🏁 Script executed:
# Check if there are any earlier commits or if this is the initial commit
git log --all -- chuck/tasks/graph_analytics/task.pyRepository: iiitl/chuck
Length of output: 434
🏁 Script executed:
# Look for the original code before the optimization commit
git show 180e97b^:chuck/tasks/graph_analytics/task.py 2>/dev/null || echo "No previous version found"Repository: iiitl/chuck
Length of output: 80
🏁 Script executed:
# Check the previous commit to see if there was an original implementation
git show 3bb6d58:chuck/tasks/graph_analytics/task.py 2>/dev/null | head -100Repository: iiitl/chuck
Length of output: 1708
Tie-breaking behavior differs from original implementation.
The original max(nodes, key=lambda node: (rank[node], node)) selects the lexicographically larger node name when ranks are tied, while np.argmax(ranks) returns the first occurrence, selecting the lexicographically smaller node name (since nodes are sorted).
Though regression tests pass (suggesting ties may not occur in practice), this is a correctness discrepancy. To align with the original behavior:
- top_idx = np.argmax(ranks)
- top_node = nodes[top_idx]
+ max_rank = np.max(ranks)
+ # Among nodes with max rank, pick lexicographically largest (matches original behavior)
+ top_indices = np.where(ranks == max_rank)[0]
+ top_idx = top_indices[-1] # Last index among ties = largest node name
+ top_node = nodes[top_idx]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| top_idx = np.argmax(ranks) | |
| top_node = nodes[top_idx] | |
| checksum = np.sum(np.arange(1, n + 1) * ranks) | |
| max_rank = np.max(ranks) | |
| # Among nodes with max rank, pick lexicographically largest (matches original behavior) | |
| top_indices = np.where(ranks == max_rank)[0] | |
| top_idx = top_indices[-1] # Last index among ties = largest node name | |
| top_node = nodes[top_idx] | |
| checksum = np.sum(np.arange(1, n + 1) * ranks) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chuck/tasks/graph_analytics/task.py` around lines 54 - 56, The current
selection using np.argmax(ranks) (top_idx/top_node) breaks ties differently than
the original max(nodes, key=...) behavior: np.argmax picks the first occurrence,
while the original chose the lexicographically larger node when ranks tie.
Replace the np.argmax/top_idx/top_node logic with a direct max over nodes using
the tuple key (ranks[node], node) so ties pick the node with the larger name
(e.g., top_node = max(nodes, key=lambda node: (ranks[node], node))); leave
checksum calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged.
Ensure references to top_idx are removed or updated accordingly.
8bdd7d0 to
38d35df
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 42-43: src_list/dst_list are converted to numpy arrays without a
dtype, so when empty they default to float64 and later indexing into ranks with
src_indices (in function/task using src_indices and dst_indices) raises
IndexError; fix by constructing src_indices and dst_indices with an explicit
integer dtype (e.g., np.intp or np.int64) when calling np.array(src_list) and
np.array(dst_list) so empty edge lists remain integer-typed and indexing becomes
a no-op.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60f8d2e3-eeca-478a-a604-3ceb6b4aeb17
📒 Files selected for processing (3)
chuck/tasks/graph_analytics/task.pypyproject.tomlrequirements.txt
✅ Files skipped from review due to trivial changes (2)
- requirements.txt
- pyproject.toml
| src_indices = np.array(src_list) | ||
| dst_indices = np.array(dst_list) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python -m pip install --quiet numpy
python - <<'PY'
import numpy as np
bad = np.array([])
good = np.array([], dtype=np.intp)
ranks = np.array([1.0])
for label, idx in [("bad", bad), ("good", good)]:
print(label, idx.dtype)
try:
print(ranks[idx])
except Exception as exc:
print(type(exc).__name__, exc)
PYRepository: iiitl/chuck
Length of output: 151
🏁 Script executed:
head -60 chuck/tasks/graph_analytics/task.py | tail -30Repository: iiitl/chuck
Length of output: 921
🏁 Script executed:
head -70 chuck/tasks/graph_analytics/task.py | cat -nRepository: iiitl/chuck
Length of output: 2466
Force an integer dtype for empty edge lists.
If a caller passes a graph whose adjacency lists are all empty, these arrays become float64, and line 50 then raises IndexError on ranks[src_indices]. Using an explicit integer index dtype keeps the zero-edge case as an empty no-op instead of crashing.
🩹 Proposed fix
- src_indices = np.array(src_list)
- dst_indices = np.array(dst_list)
+ src_indices = np.array(src_list, dtype=np.intp)
+ dst_indices = np.array(dst_list, dtype=np.intp)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chuck/tasks/graph_analytics/task.py` around lines 42 - 43, src_list/dst_list
are converted to numpy arrays without a dtype, so when empty they default to
float64 and later indexing into ranks with src_indices (in function/task using
src_indices and dst_indices) raises IndexError; fix by constructing src_indices
and dst_indices with an explicit integer dtype (e.g., np.intp or np.int64) when
calling np.array(src_list) and np.array(dst_list) so empty edge lists remain
integer-typed and indexing becomes a no-op.
|
Sir @Aaryan-Dadu, I have completed the PageRank performance overhaul for both the Python and C++ backends. Summary of PR #40 (Python):
Summary of PR #42 (C++):
Both PRs now have green CI checks and pass all local benchmarks. Please review them |
|
As mentioned in #6 (comment) your PR will be considered if #35 gets closed. |
I realized the original PageRank implementation was hitting a major issue by using nested dictionary loops, which effectively treated the graph like a dense matrix. This wasted a lot of cpu cycles on zero-connections. I’ve swapped this out for a sparse vectorized approach using NumPy, which moved the complexity from O(N^2) to O(E) which is a great reduction in the load taken by cpu and would also improve speed drastically.
What I Did
Sparse Edge Mapping: Instead of looping through nodes, I converted the graph into a sparse edge list
Vectorized Contributions: I used np.add.at to handle the rank updates. It basically lets me do a high-speed "scatter-add" in one vectorized shot, so we aren’t stuck waiting for Python to grind through every single sum one by one.
Index Alignment: Pre-calculated out-degrees as a vector to handle the division step efficiently within the NumPy iterations.
A/B Comparison Results (as per contributing guidelines)
graph_analytics: 6.26x speedup (84.04% faster)
Baseline: 0.004543s -> Optimized: 0.000725s
Reliability: 1.0 (Checksums match perfectly / 100% accuracy)
Learnings & Challenges
Linter & Style: My biggest hurdle was actually the formatting side of things. I ran into some strict linter issues on my previous PR regarding trailing whitespace and docstrings, so I spent extra time here making sure the code was clean and matched the project's 'no-docstring' convention from the start.
Mapping the Math: Figuring out how to translate the standard pagerank summation formula into NumPy compatible index arrays was a great learning experience. It’s a much more efficient way to think about graph problems compared to nested loops.
Stabilization: I had to ensure that the damping factor and teleportation logic stayed perfectly consistent with the original version so that the final checksums would match to the 6th decimal place.
AI Usage - I used Gemini to help me in the transition from the dictionary-based loop to the
np.add.atindexing logic. This helped me map the math correctly, but I personally handled the implementation, debugging the linter errors, and running the local benchmarks to verify the 6.26x speedup.fixes #6
Summary by CodeRabbit