perf: optimize pagerank using sparse vectorization and np.add.at by Phantom0299 · Pull Request #40 · iiitl/chuck

Phantom0299 · 2026-04-11T21:57:48Z

I realized the original PageRank implementation was hitting a major issue by using nested dictionary loops, which effectively treated the graph like a dense matrix. This wasted a lot of cpu cycles on zero-connections. I’ve swapped this out for a sparse vectorized approach using NumPy, which moved the complexity from O(N^2) to O(E) which is a great reduction in the load taken by cpu and would also improve speed drastically.

What I Did
Sparse Edge Mapping: Instead of looping through nodes, I converted the graph into a sparse edge list
Vectorized Contributions: I used np.add.at to handle the rank updates. It basically lets me do a high-speed "scatter-add" in one vectorized shot, so we aren’t stuck waiting for Python to grind through every single sum one by one.
Index Alignment: Pre-calculated out-degrees as a vector to handle the division step efficiently within the NumPy iterations.

A/B Comparison Results (as per contributing guidelines)
graph_analytics: 6.26x speedup (84.04% faster)
Baseline: 0.004543s -> Optimized: 0.000725s
Reliability: 1.0 (Checksums match perfectly / 100% accuracy)

Learnings & Challenges
Linter & Style: My biggest hurdle was actually the formatting side of things. I ran into some strict linter issues on my previous PR regarding trailing whitespace and docstrings, so I spent extra time here making sure the code was clean and matched the project's 'no-docstring' convention from the start.
Mapping the Math: Figuring out how to translate the standard pagerank summation formula into NumPy compatible index arrays was a great learning experience. It’s a much more efficient way to think about graph problems compared to nested loops.
Stabilization: I had to ensure that the damping factor and teleportation logic stayed perfectly consistent with the original version so that the final checksums would match to the 6th decimal place.

AI Usage - I used Gemini to help me in the transition from the dictionary-based loop to the np.add.at indexing logic. This helped me map the math correctly, but I personally handled the implementation, debugging the linter errors, and running the local benchmarks to verify the 6.26x speedup.
fixes #6

Summary by CodeRabbit

Refactor
- Optimized graph analytics computation for faster, more efficient processing of large graphs.
Chores
- Added a numerical library dependency to improve performance and enable vectorized computations.

coderabbitai · 2026-04-11T21:58:02Z

Walkthrough

PageRank implementation refactored from Python dict-based loops to NumPy vectorized operations, including precomputed node/edge index arrays, vectorized rank updates with np.add.at, and vectorized checksum/top-node selection. Project now declares NumPy as a runtime dependency.

Changes

Cohort / File(s)	Summary
PageRank Vectorization `chuck/tasks/graph_analytics/task.py`	Rewrote PageRank internals to use NumPy arrays: added `numpy` import, built `node_to_idx`, `src_indices`, `dst_indices`, `out_degrees`; replaced nested loop rank updates with vectorized contributions + `np.add.at`; switched top-node selection to `np.argmax`; checksum and top score computed from arrays and cast to `float` before `round6`.
Dependencies metadata `pyproject.toml`, `requirements.txt`	Added runtime dependency on `numpy` (`pyproject.toml` [project].dependencies includes `numpy>=1.20.0`) and a `requirements.txt` with `numpy`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'perf: optimize pagerank using sparse vectorization and np.add.at' directly summarizes the main change: replacing dict-based PageRank iteration with vectorized NumPy operations using np.add.at for performance optimization.
Linked Issues check	✅ Passed	The PR fully satisfies issue `#6` requirements: it replaces the dict-of-dicts iteration in PageRank with vectorized NumPy arrays and np.add.at, improving performance from O(N²) to O(E) while maintaining algorithmic correctness.
Out of Scope Changes check	✅ Passed	All changes are in scope: task.py implements vectorized PageRank optimization, pyproject.toml and requirements.txt add required NumPy dependency. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 29-43: There are trailing spaces on the blank lines around the
graph-building block (near variables node_to_idx, src_list, out_degrees, and
src_indices) causing pre-commit failures; open the chunk that defines
node_to_idx, src_list, dst_list, out_degrees and src_indices and remove any
trailing whitespace from the empty lines (or run a trim-whitespace/format step)
so the lines are truly empty and the file passes CI.
- Line 3: The module imports numpy (import numpy as np) but pyproject.toml lacks
a [project].dependencies entry; add numpy to the project dependencies by adding
a [project] dependencies array with "numpy>=1.20.0" so the package manager
installs numpy before runtime and the import in
chuck/tasks/graph_analytics/task.py resolves.
- Around line 54-56: The current selection using np.argmax(ranks)
(top_idx/top_node) breaks ties differently than the original max(nodes, key=...)
behavior: np.argmax picks the first occurrence, while the original chose the
lexicographically larger node when ranks tie. Replace the
np.argmax/top_idx/top_node logic with a direct max over nodes using the tuple
key (ranks[node], node) so ties pick the node with the larger name (e.g.,
top_node = max(nodes, key=lambda node: (ranks[node], node))); leave checksum
calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged. Ensure
references to top_idx are removed or updated accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5a12ab7e-ae0c-4d1f-9246-3969a084be6c

📥 Commits

Reviewing files that changed from the base of the PR and between a302e5f and 180e97b.

📒 Files selected for processing (1)

chuck/tasks/graph_analytics/task.py

coderabbitai · 2026-04-11T22:00:47Z

+    top_idx = np.argmax(ranks)
+    top_node = nodes[top_idx]
+    checksum = np.sum(np.arange(1, n + 1) * ranks)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for original PageRank top-node selection logic in git history git log -p --all -S 'max(nodes' -- '*.py' | head -80

Repository: iiitl/chuck

Length of output: 37

🏁 Script executed:

# First, check the current state of the file around lines 54-56 wc -l chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 93

🏁 Script executed:

# Show the context around lines 54-56 sed -n '45,65p' chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 650

🏁 Script executed:

# Check git history for this file to find the original implementation git log --oneline chuck/tasks/graph_analytics/task.py | head -20

Repository: iiitl/chuck

Length of output: 127

🏁 Script executed:

# Look for the original code in git diff git log -p chuck/tasks/graph_analytics/task.py | head -200

Repository: iiitl/chuck

Length of output: 37

🏁 Script executed:

# Show the specific commit that changed the PageRank implementation git show 180e97b

Repository: iiitl/chuck

Length of output: 50370

🏁 Script executed:

# Check if there are any earlier commits or if this is the initial commit git log --all -- chuck/tasks/graph_analytics/task.py

Repository: iiitl/chuck

Length of output: 434

🏁 Script executed:

# Look for the original code before the optimization commit git show 180e97b^:chuck/tasks/graph_analytics/task.py 2>/dev/null || echo "No previous version found"

Repository: iiitl/chuck

Length of output: 80

🏁 Script executed:

# Check the previous commit to see if there was an original implementation git show 3bb6d58:chuck/tasks/graph_analytics/task.py 2>/dev/null | head -100

Repository: iiitl/chuck

Length of output: 1708

Tie-breaking behavior differs from original implementation.

The original max(nodes, key=lambda node: (rank[node], node)) selects the lexicographically larger node name when ranks are tied, while np.argmax(ranks) returns the first occurrence, selecting the lexicographically smaller node name (since nodes are sorted).

Though regression tests pass (suggesting ties may not occur in practice), this is a correctness discrepancy. To align with the original behavior:

- top_idx = np.argmax(ranks) - top_node = nodes[top_idx] + max_rank = np.max(ranks) + # Among nodes with max rank, pick lexicographically largest (matches original behavior) + top_indices = np.where(ranks == max_rank)[0] + top_idx = top_indices[-1] # Last index among ties = largest node name + top_node = nodes[top_idx]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

top_idx = np.argmax(ranks)

top_node = nodes[top_idx]

checksum = np.sum(np.arange(1, n + 1) * ranks)

max_rank = np.max(ranks)

# Among nodes with max rank, pick lexicographically largest (matches original behavior)

top_indices = np.where(ranks == max_rank)[0]

top_idx = top_indices[-1] # Last index among ties = largest node name

top_node = nodes[top_idx]

checksum = np.sum(np.arange(1, n + 1) * ranks)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/graph_analytics/task.py` around lines 54 - 56, The current selection using np.argmax(ranks) (top_idx/top_node) breaks ties differently than the original max(nodes, key=...) behavior: np.argmax picks the first occurrence, while the original chose the lexicographically larger node when ranks tie. Replace the np.argmax/top_idx/top_node logic with a direct max over nodes using the tuple key (ranks[node], node) so ties pick the node with the larger name (e.g., top_node = max(nodes, key=lambda node: (ranks[node], node))); leave checksum calculation (checksum = np.sum(np.arange(1, n + 1) * ranks)) unchanged. Ensure references to top_idx are removed or updated accordingly.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@chuck/tasks/graph_analytics/task.py`:
- Around line 42-43: src_list/dst_list are converted to numpy arrays without a
dtype, so when empty they default to float64 and later indexing into ranks with
src_indices (in function/task using src_indices and dst_indices) raises
IndexError; fix by constructing src_indices and dst_indices with an explicit
integer dtype (e.g., np.intp or np.int64) when calling np.array(src_list) and
np.array(dst_list) so empty edge lists remain integer-typed and indexing becomes
a no-op.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60f8d2e3-eeca-478a-a604-3ceb6b4aeb17

📥 Commits

Reviewing files that changed from the base of the PR and between 180e97b and 38d35df.

📒 Files selected for processing (3)

chuck/tasks/graph_analytics/task.py
pyproject.toml
requirements.txt

✅ Files skipped from review due to trivial changes (2)

requirements.txt
pyproject.toml

coderabbitai · 2026-04-11T22:34:33Z

+    src_indices = np.array(src_list)
+    dst_indices = np.array(dst_list)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash python -m pip install --quiet numpy python - <<'PY' import numpy as np bad = np.array([]) good = np.array([], dtype=np.intp) ranks = np.array([1.0]) for label, idx in [("bad", bad), ("good", good)]: print(label, idx.dtype) try: print(ranks[idx]) except Exception as exc: print(type(exc).__name__, exc) PY

Repository: iiitl/chuck

Length of output: 151

🏁 Script executed:

head -60 chuck/tasks/graph_analytics/task.py | tail -30

Repository: iiitl/chuck

Length of output: 921

🏁 Script executed:

head -70 chuck/tasks/graph_analytics/task.py | cat -n

Repository: iiitl/chuck

Length of output: 2466

Force an integer dtype for empty edge lists.

If a caller passes a graph whose adjacency lists are all empty, these arrays become float64, and line 50 then raises IndexError on ranks[src_indices]. Using an explicit integer index dtype keeps the zero-edge case as an empty no-op instead of crashing.

🩹 Proposed fix

- src_indices = np.array(src_list) - dst_indices = np.array(dst_list) + src_indices = np.array(src_list, dtype=np.intp) + dst_indices = np.array(dst_list, dtype=np.intp)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/graph_analytics/task.py` around lines 42 - 43, src_list/dst_list are converted to numpy arrays without a dtype, so when empty they default to float64 and later indexing into ranks with src_indices (in function/task using src_indices and dst_indices) raises IndexError; fix by constructing src_indices and dst_indices with an explicit integer dtype (e.g., np.intp or np.int64) when calling np.array(src_list) and np.array(dst_list) so empty edge lists remain integer-typed and indexing becomes a no-op.

Phantom0299 · 2026-04-12T06:36:44Z

Sir @Aaryan-Dadu, I have completed the PageRank performance overhaul for both the Python and C++ backends.

Summary of PR #40 (Python):

Achievement: 6.26x speedup (0.0045s -> 0.0007s).
Tech: Swapped dict-loops for sparse NumPy vectorization using np.add.at.
Correctness: Fixed the tie-breaking behavior and integer dtypes as per CodeRabbit feedback to ensure 100% checksum reliability.(Fixes Replace PageRank dict-of-dicts iteration in graph_analytics #6)

Summary of PR #42 (C++):

Achievement: 3.14x speedup, finally overtaking the Python baseline.
Tech: Refactored std::map into contiguous std::vector for better cache locality.(Fixes graph-analytics C++ backend is slower than Python backend #38)

Both PRs now have green CI checks and pass all local benchmarks. Please review them

Aaryan-Dadu · 2026-04-12T08:38:07Z

As mentioned in #6 (comment) your PR will be considered if #35 gets closed.
Marking this one as draft for now

mini-walkerx Bot mentioned this pull request Apr 11, 2026

Replace PageRank dict-of-dicts iteration in graph_analytics #6

Open

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

perf: optimize pagerank using sparse vectorization and np.add.at

38d35df

Phantom0299 force-pushed the perf/graph-analytics-pagerank branch from 8bdd7d0 to 38d35df Compare April 11, 2026 22:28

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

Aaryan-Dadu marked this pull request as draft April 12, 2026 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: optimize pagerank using sparse vectorization and np.add.at#40

perf: optimize pagerank using sparse vectorization and np.add.at#40
Phantom0299 wants to merge 1 commit into
iiitl:mainfrom
Phantom0299:perf/graph-analytics-pagerank

Phantom0299 commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

Phantom0299 commented Apr 12, 2026

Uh oh!

Aaryan-Dadu commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		src_indices = np.array(src_list)
		dst_indices = np.array(dst_list)

Conversation

Phantom0299 commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Phantom0299 commented Apr 12, 2026

Uh oh!

Aaryan-Dadu commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Phantom0299 commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading