Skip to content

perf: optimize pagerank using sparse vectorization and np.add.at#40

Draft
Phantom0299 wants to merge 1 commit into
iiitl:mainfrom
Phantom0299:perf/graph-analytics-pagerank
Draft

perf: optimize pagerank using sparse vectorization and np.add.at#40
Phantom0299 wants to merge 1 commit into
iiitl:mainfrom
Phantom0299:perf/graph-analytics-pagerank

Conversation

@Phantom0299
Copy link
Copy Markdown

@Phantom0299 Phantom0299 commented Apr 11, 2026

I realized the original PageRank implementation was hitting a major issue by using nested dictionary loops, which effectively treated the graph like a dense matrix. This wasted a lot of cpu cycles on zero-connections. I’ve swapped this out for a sparse vectorized approach using NumPy, which moved the complexity from O(N^2) to O(E) which is a great reduction in the load taken by cpu and would also improve speed drastically.

What I Did
Sparse Edge Mapping: Instead of looping through nodes, I converted the graph into a sparse edge list
Vectorized Contributions: I used np.add.at to handle the rank updates. It basically lets me do a high-speed "scatter-add" in one vectorized shot, so we aren’t stuck waiting for Python to grind through every single sum one by one.
Index Alignment: Pre-calculated out-degrees as a vector to handle the division step efficiently within the NumPy iterations.

A/B Comparison Results (as per contributing guidelines)
graph_analytics: 6.26x speedup (84.04% faster)
Baseline: 0.004543s -> Optimized: 0.000725s
Reliability: 1.0 (Checksums match perfectly / 100% accuracy)

Learnings & Challenges
Linter & Style: My biggest hurdle was actually the formatting side of things. I ran into some strict linter issues on my previous PR regarding trailing whitespace and docstrings, so I spent extra time here making sure the code was clean and matched the project's 'no-docstring' convention from the start.
Mapping the Math: Figuring out how to translate the standard pagerank summation formula into NumPy compatible index arrays was a great learning experience. It’s a much more efficient way to think about graph problems compared to nested loops.
Stabilization: I had to ensure that the damping factor and teleportation logic stayed perfectly consistent with the original version so that the final checksums would match to the 6th decimal place.

AI Usage - I used Gemini to help me in the transition from the dictionary-based loop to the np.add.at indexing logic. This helped me map the math correctly, but I personally handled the implementation, debugging the linter errors, and running the local benchmarks to verify the 6.26x speedup.
fixes #6

Summary by CodeRabbit

  • Refactor
    • Optimized graph analytics computation for faster, more efficient processing of large graphs.
  • Chores
    • Added a numerical library dependency to improve performance and enable vectorized computations.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

Walkthrough

PageRank implementation refactored from Python dict-based loops to NumPy vectorized operations, including precomputed node/edge index arrays, vectorized rank updates with np.add.at, and vectorized checksum/top-node selection. Project now declares NumPy as a runtime dependency.

Changes

Cohort / File(s) Summary
PageRank Vectorization
chuck/tasks/graph_analytics/task.py
Rewrote PageRank internals to use NumPy arrays: added numpy import, built node_to_idx, src_indices, dst_indices, out_degrees; replaced nested loop rank updates with vectorized contributions + np.add.at; switched top-node selection to np.argmax; checksum and top score computed from arrays and cast to float before round6.
Dependencies metadata
pyproject.toml, requirements.txt
Added runtime dependency on numpy (pyproject.toml [project].dependencies includes numpy>=1.20.0) and a requirements.txt with numpy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf: optimize pagerank using sparse vectorization and np.add.at' directly summarizes the main change: replacing dict-based PageRank iteration with vectorized NumPy operations using np.add.at for performance optimization.
Linked Issues check ✅ Passed The PR fully satisfies issue #6 requirements: it replaces the dict-of-dicts iteration in PageRank with vectorized NumPy arrays and np.add.at, improving performance from O(N²) to O(E) while maintaining algorithmic correctness.
Out of Scope Changes check ✅ Passed All changes are in scope: task.py implements vectorized PageRank optimization, pyproject.toml and requirements.txt add required NumPy dependency. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 29-43: There are trailing spaces on the blank lines around the
graph-building block (near variables node_to_idx, src_list, out_degrees, and
src_indices) causing pre-commit failures; open the chunk that defines
node_to_idx, src_list, dst_list, out_degrees and src_indices and remove any
trailing whitespace from the empty lines (or run a trim-whitespace/format step)
so the lines are truly empty and the file passes CI.
- Line 3: The module imports numpy (import numpy as np) but pyproject.toml lacks
a [project].dependencies entry; add numpy to the project dependencies by adding
a [project] dependencies array with "numpy>=1.20.0" so the package manager
installs numpy before runtime and the import in
chuck/tasks/graph_analytics/task.py resolves.
- Around line 54-56: The current selection using np.argmax(ranks)
(top_idx/top_node) breaks ties differently than the original max(nodes, key=...)
behavior: np.argmax picks the first occurrence, while the original chose the
lexicographically larger node when ranks tie. Replace the
np.argmax/top_idx/top_node logic with a direct max over nodes using the tuple
key (ranks[node], node) so ties pick the node with the larger name (e.g.,
top_node = max(nodes, key=lambda node: (ranks[node], node))); leave checksum
calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged. Ensure
references to top_idx are removed or updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5a12ab7e-ae0c-4d1f-9246-3969a084be6c

📥 Commits

Reviewing files that changed from the base of the PR and between a302e5f and 180e97b.

📒 Files selected for processing (1)
  • chuck/tasks/graph_analytics/task.py

Comment thread chuck/tasks/graph_analytics/task.py
Comment thread chuck/tasks/graph_analytics/task.py
Comment on lines +54 to +56
top_idx = np.argmax(ranks)
top_node = nodes[top_idx]
checksum = np.sum(np.arange(1, n + 1) * ranks)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for original PageRank top-node selection logic in git history
git log -p --all -S 'max(nodes' -- '*.py' | head -80

Repository: iiitl/chuck

Length of output: 37


🏁 Script executed:

# First, check the current state of the file around lines 54-56
wc -l chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 93


🏁 Script executed:

# Show the context around lines 54-56
sed -n '45,65p' chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 650


🏁 Script executed:

# Check git history for this file to find the original implementation
git log --oneline chuck/tasks/graph_analytics/task.py | head -20

Repository: iiitl/chuck

Length of output: 127


🏁 Script executed:

# Look for the original code in git diff
git log -p chuck/tasks/graph_analytics/task.py | head -200

Repository: iiitl/chuck

Length of output: 37


🏁 Script executed:

# Show the specific commit that changed the PageRank implementation
git show 180e97b

Repository: iiitl/chuck

Length of output: 50370


🏁 Script executed:

# Check if there are any earlier commits or if this is the initial commit
git log --all -- chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 434


🏁 Script executed:

# Look for the original code before the optimization commit
git show 180e97b^:chuck/tasks/graph_analytics/task.py 2>/dev/null || echo "No previous version found"

Repository: iiitl/chuck

Length of output: 80


🏁 Script executed:

# Check the previous commit to see if there was an original implementation
git show 3bb6d58:chuck/tasks/graph_analytics/task.py 2>/dev/null | head -100

Repository: iiitl/chuck

Length of output: 1708


Tie-breaking behavior differs from original implementation.

The original max(nodes, key=lambda node: (rank[node], node)) selects the lexicographically larger node name when ranks are tied, while np.argmax(ranks) returns the first occurrence, selecting the lexicographically smaller node name (since nodes are sorted).

Though regression tests pass (suggesting ties may not occur in practice), this is a correctness discrepancy. To align with the original behavior:

-    top_idx = np.argmax(ranks)
-    top_node = nodes[top_idx]
+    max_rank = np.max(ranks)
+    # Among nodes with max rank, pick lexicographically largest (matches original behavior)
+    top_indices = np.where(ranks == max_rank)[0]
+    top_idx = top_indices[-1]  # Last index among ties = largest node name
+    top_node = nodes[top_idx]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
top_idx = np.argmax(ranks)
top_node = nodes[top_idx]
checksum = np.sum(np.arange(1, n + 1) * ranks)
max_rank = np.max(ranks)
# Among nodes with max rank, pick lexicographically largest (matches original behavior)
top_indices = np.where(ranks == max_rank)[0]
top_idx = top_indices[-1] # Last index among ties = largest node name
top_node = nodes[top_idx]
checksum = np.sum(np.arange(1, n + 1) * ranks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/graph_analytics/task.py` around lines 54 - 56, The current
selection using np.argmax(ranks) (top_idx/top_node) breaks ties differently than
the original max(nodes, key=...) behavior: np.argmax picks the first occurrence,
while the original chose the lexicographically larger node when ranks tie.
Replace the np.argmax/top_idx/top_node logic with a direct max over nodes using
the tuple key (ranks[node], node) so ties pick the node with the larger name
(e.g., top_node = max(nodes, key=lambda node: (ranks[node], node))); leave
checksum calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged.
Ensure references to top_idx are removed or updated accordingly.

@Phantom0299 Phantom0299 force-pushed the perf/graph-analytics-pagerank branch from 8bdd7d0 to 38d35df Compare April 11, 2026 22:28
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 42-43: src_list/dst_list are converted to numpy arrays without a
dtype, so when empty they default to float64 and later indexing into ranks with
src_indices (in function/task using src_indices and dst_indices) raises
IndexError; fix by constructing src_indices and dst_indices with an explicit
integer dtype (e.g., np.intp or np.int64) when calling np.array(src_list) and
np.array(dst_list) so empty edge lists remain integer-typed and indexing becomes
a no-op.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60f8d2e3-eeca-478a-a604-3ceb6b4aeb17

📥 Commits

Reviewing files that changed from the base of the PR and between 180e97b and 38d35df.

📒 Files selected for processing (3)
  • chuck/tasks/graph_analytics/task.py
  • pyproject.toml
  • requirements.txt
✅ Files skipped from review due to trivial changes (2)
  • requirements.txt
  • pyproject.toml

Comment on lines +42 to +43
src_indices = np.array(src_list)
dst_indices = np.array(dst_list)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python -m pip install --quiet numpy
python - <<'PY'
import numpy as np

bad = np.array([])
good = np.array([], dtype=np.intp)
ranks = np.array([1.0])

for label, idx in [("bad", bad), ("good", good)]:
    print(label, idx.dtype)
    try:
        print(ranks[idx])
    except Exception as exc:
        print(type(exc).__name__, exc)
PY

Repository: iiitl/chuck

Length of output: 151


🏁 Script executed:

head -60 chuck/tasks/graph_analytics/task.py | tail -30

Repository: iiitl/chuck

Length of output: 921


🏁 Script executed:

head -70 chuck/tasks/graph_analytics/task.py | cat -n

Repository: iiitl/chuck

Length of output: 2466


Force an integer dtype for empty edge lists.

If a caller passes a graph whose adjacency lists are all empty, these arrays become float64, and line 50 then raises IndexError on ranks[src_indices]. Using an explicit integer index dtype keeps the zero-edge case as an empty no-op instead of crashing.

🩹 Proposed fix
-    src_indices = np.array(src_list)
-    dst_indices = np.array(dst_list)
+    src_indices = np.array(src_list, dtype=np.intp)
+    dst_indices = np.array(dst_list, dtype=np.intp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chuck/tasks/graph_analytics/task.py` around lines 42 - 43, src_list/dst_list
are converted to numpy arrays without a dtype, so when empty they default to
float64 and later indexing into ranks with src_indices (in function/task using
src_indices and dst_indices) raises IndexError; fix by constructing src_indices
and dst_indices with an explicit integer dtype (e.g., np.intp or np.int64) when
calling np.array(src_list) and np.array(dst_list) so empty edge lists remain
integer-typed and indexing becomes a no-op.

@Phantom0299
Copy link
Copy Markdown
Author

Sir @Aaryan-Dadu, I have completed the PageRank performance overhaul for both the Python and C++ backends.

Summary of PR #40 (Python):

  • Achievement: 6.26x speedup (0.0045s -> 0.0007s).
  • Tech: Swapped dict-loops for sparse NumPy vectorization using np.add.at.
  • Correctness: Fixed the tie-breaking behavior and integer dtypes as per CodeRabbit feedback to ensure 100% checksum reliability.(Fixes Replace PageRank dict-of-dicts iteration in graph_analytics #6)

Summary of PR #42 (C++):

Both PRs now have green CI checks and pass all local benchmarks. Please review them

@Aaryan-Dadu Aaryan-Dadu marked this pull request as draft April 12, 2026 08:35
@Aaryan-Dadu
Copy link
Copy Markdown
Member

As mentioned in #6 (comment) your PR will be considered if #35 gets closed.
Marking this one as draft for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace PageRank dict-of-dicts iteration in graph_analytics

2 participants