Fixes #6 - perf: optimize python backend for task graph-analytics #35
IronKommander wants to merge 3 commits into
Conversation
Walkthrough

The pull request replaces the dictionary- and loop-based PageRank implementation with a NumPy-vectorized approach. Key changes include using dense arrays for node/rank initialization.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pyproject.toml (1)
11-13: Consider pinning a minimum NumPy version for reproducibility.

Leaving the dependency unpinned may lead to inconsistent behavior across environments if NumPy introduces breaking API changes in future major releases. A minimum version constraint (e.g., `numpy>=1.20`) would document the tested baseline while still allowing upgrades.

💡 Suggested improvement

```diff
 dependencies = [
-    "numpy"
+    "numpy>=1.20"
 ]
```

chuck/tasks/graph_analytics/task.py (1)
48-53: Inconsistent spacing before colon on line 52.
"checksum" :has an extra space before the colon, unlike the other keys in the dict. This doesn't affect functionality but is inconsistent with the surrounding code style.✏️ Proposed fix
"top_score": round6(float(rank[top_node])), - "checksum" : round6(checksum), + "checksum": round6(checksum), }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chuck/tasks/graph_analytics/task.py` around lines 48 - 53, The return dictionary in the function that builds the graph analytics result has inconsistent spacing for the "checksum" key; change `"checksum" : round6(checksum),` to `"checksum": round6(checksum),` so the colon spacing matches the other entries (node_count, top_node, top_score) — locate the return block that references N, top_node, rank, round6 and update the key spacing accordingly.
📒 Files selected for processing (2)

- chuck/tasks/graph_analytics/task.py
- pyproject.toml
The diff under review changes the `solve` signature to:

```python
def solve(graph: dict[str, list[str]] , iterations: int = 16, damping: float = 0.85) -> dict[str, Any]:
    nodes = sorted(graph)
    if not nodes:
```

Remove extraneous space before comma.

There's an extra space in the function signature: `dict[str, list[str]] ,` should be `dict[str, list[str]],`.

✏️ Proposed fix

```diff
-def solve(graph: dict[str, list[str]] , iterations: int = 16, damping: float = 0.85) -> dict[str, Any]:
+def solve(graph: dict[str, list[str]], iterations: int = 16, damping: float = 0.85) -> dict[str, Any]:
```
```python
out_degree = np.bincount(rows, minlength=N).astype(np.float64)

rank = np.full((N,), 1.0/N, dtype=np.float64)
base = (1.0 - damping) / N
trans_wt = damping / out_degree
```
Potential division by zero for graphs with sink nodes (dangling nodes).
If a node has no outgoing edges, out_degree[idx] will be 0, causing trans_wt to contain inf at that index. While this won't crash (since dangling nodes don't appear in rows), the node's rank is silently lost rather than being redistributed—standard PageRank typically handles dangling nodes by redistributing their rank to all nodes.
This won't affect the current generator (which always creates 2–4 edges per node), but arbitrary input graphs could yield incorrect results without any warning.
🛡️ Possible defensive fix

```diff
 out_degree = np.bincount(rows, minlength=N).astype(np.float64)
+# Handle dangling nodes: set out_degree to 1 to avoid inf; their rank contribution is zero anyway
+out_degree[out_degree == 0] = 1.0
 rank = np.full((N,), 1.0/N, dtype=np.float64)
```
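The degree-clamp above is only a minimal guard: it avoids the `inf`, but a dangling node's rank still leaks out of the system each iteration. Standard PageRank instead redistributes dangling rank uniformly. A hedged sketch of that fuller treatment (the function name and edge-list calling convention are illustrative, not from this PR):

```python
import numpy as np

def pagerank_edges(rows, cols, N, iterations=16, damping=0.85):
    """PageRank over an edge list (rows[i] -> cols[i]) that redistributes
    dangling-node rank uniformly instead of dropping it."""
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    out_degree = np.bincount(rows, minlength=N).astype(np.float64)
    dangling = out_degree == 0
    # Per-source edge weight; zero (never used) where a node has no out-edges.
    trans_wt = np.zeros(N, dtype=np.float64)
    trans_wt[~dangling] = damping / out_degree[~dangling]
    rank = np.full(N, 1.0 / N, dtype=np.float64)
    base = (1.0 - damping) / N
    for _ in range(iterations):
        msgs = rank[rows] * trans_wt[rows]          # rank sent along each edge
        received = np.bincount(cols, weights=msgs, minlength=N)
        dangling_sum = damping * rank[dangling].sum()
        rank = received + base + dangling_sum / N   # redistribute uniformly
    return rank
```

With this variant, `rank.sum()` stays at 1.0 on every iteration even when some nodes have no outgoing edges.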
```python
recieved = np.bincount(cols, weights=msgs, minlength=N)
rank = recieved + base
```
Typo: recieved → received.
✏️ Proposed fix

```diff
-recieved = np.bincount(cols, weights=msgs, minlength=N)
-rank = recieved + base
+received = np.bincount(cols, weights=msgs, minlength=N)
+rank = received + base
```
The PR is great! And a speedup of 7x is really good, but can we make it even better? Could you look into some other approaches that could outperform this implementation too?
You can refer to https://github.com/iiitl/chuck#:~:text=References%20for%20probabilistic%20direction as a starting point. I know you won't be able to read them all right away, but just take in the gist of it.
Thanks for the documentation, I will certainly look into it. Can you please clarify a doubt of mine:
Yes, but please don't commit them; just tweak them for testing locally. And yeah, I would also favor having a different approach for a lower number of nodes and a higher one, similar in spirit to how
@Aaryan-Dadu, so I went through all the documentation you provided me with. Key things I noted down that maybe could have been of help to me in this task were Bloom filters (for checking whether two elements were connected, to transmit rank between them?) or the Monte Carlo algo to check whether the product of two matrices was equal to the third one (this one was a far-reaching thought). Initially, I was implementing a standard power iteration on a Markov matrix for calculating PageRank. However, if #41 gets merged, the number of nodes will increase drastically and it would not make sense to create such a dense adjacency matrix, with many zeroes. As such, I have been reading on the probabilistic implementation (Monte Carlo) of simply letting surfers walk randomly through the graph, and deriving rank from this. This would save on memory and save time initializing the dense matrix as well. What are your thoughts on this?
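The random-surfer Monte Carlo approach described above can be sketched in a few lines. This is an illustrative estimator, not code from this PR; the complete-path variant below (count every node a walk visits, end the walk with probability 1 - damping or at a dangling node) is one of several published schemes:

```python
import random
from collections import Counter

def monte_carlo_pagerank(graph, walks_per_node=100, damping=0.85):
    """Approximate PageRank by simulating random surfers.

    graph: dict mapping node -> list of successor nodes (assumes every
    successor also appears as a key). Starts walks_per_node walks from
    each node; at every step the surfer follows a random out-edge with
    probability `damping`, otherwise the walk ends (it also ends at a
    dangling node). Rank is estimated from visit frequencies.
    """
    visits = Counter()
    for start in graph:
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                if not graph[node] or random.random() >= damping:
                    break
                node = random.choice(graph[node])
    total = sum(visits.values())
    return {n: visits[n] / total for n in graph}
```

This builds no matrix at all; memory stays proportional to the node count and runtime scales with the number of walks, at the cost of approximate (sampled) results.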
The issue is marked as
Please add **Fixes #6** in your PR description to link the issue to the PR.
👋 @IronKommander, issue #6 is currently assigned to @Phantom0299. If their work doesn't close the issue, you'll get a chance. |
Description
Fixes #6. This involved removing the 16 iterations over a Python dict of dicts and instead switching over to NumPy math.
Performance wins
graph_analytics on the python backend now runs in under a millisecond. Previously on the same backend, it ran in around 7 milliseconds.
Previous python backend v/s New python backend
Current C++ backend v/s current python backend
The C++ backend for graph_analytics is quite slow at around 17ms, due to the implementation using the C++ STL `map`, which is tree-based and quite heavy. I will look to open an issue for this.

Working
This approach is optimized for graphs with a low number of edges but a high number of nodes, avoiding building a full adjacency matrix.
`np.bincount()` has been used to implement this. `cols` is a NumPy array holding the nodes at which directed edges end, and `rows` is a NumPy array holding the nodes from which the directed edges start. Combining them, an adjacency matrix could be created, where `adj[rows][cols] == 1`.

Implementation details
The original backend codebase for `graph_analytics`, written in pure Python, was still very efficient at its task. I originally tried creating an adjacency matrix to build a transition matrix and using fast NumPy matrix multiplication to achieve the task, as can be seen in this commit, but as it turns out, creating and initializing a 1000x1000 matrix is computationally more expensive than iterating over a small number of edges (2-4 per node, so 2000-4000 iterations), even when using NumPy to avoid writing `for` loops.

At first, I was also researching how to improve the `generate` function in Python. This was the original snippet I wrote that would return an adjacency matrix:

This approach seemed faster to me due to reducing reliance on pure Python. However, I soon realized that this would break the regression tests, and I would also have to change the C++ backend for it to have a chance of working.
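Putting the fragments quoted in the review back together, the bincount-based `solve` can be sketched roughly as follows. The `round6` helper, the checksum definition (sum of ranks), the empty-graph result, and the exact return keys are reconstructions from the excerpts and may differ from the actual file:

```python
from typing import Any

import numpy as np

def round6(x: float) -> float:
    return round(float(x), 6)

def solve(graph: dict[str, list[str]], iterations: int = 16, damping: float = 0.85) -> dict[str, Any]:
    nodes = sorted(graph)
    if not nodes:
        return {"node_count": 0, "top_node": None, "top_score": 0.0, "checksum": 0.0}
    N = len(nodes)
    index = {n: i for i, n in enumerate(nodes)}
    # Flatten the adjacency dict into parallel edge arrays (src -> dst).
    # Assumes every destination also appears as a key in `graph`.
    rows, cols = [], []
    for src, dsts in graph.items():
        for dst in dsts:
            rows.append(index[src])
            cols.append(index[dst])
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    out_degree = np.bincount(rows, minlength=N).astype(np.float64)
    out_degree[out_degree == 0] = 1.0  # defensive: avoid division by zero on dangling nodes
    rank = np.full((N,), 1.0 / N, dtype=np.float64)
    base = (1.0 - damping) / N
    trans_wt = damping / out_degree
    for _ in range(iterations):
        msgs = rank[rows] * trans_wt[rows]  # rank pushed along each edge
        received = np.bincount(cols, weights=msgs, minlength=N)
        rank = received + base
    top_node = int(np.argmax(rank))
    return {
        "node_count": N,
        "top_node": nodes[top_node],
        "top_score": round6(float(rank[top_node])),
        "checksum": round6(float(rank.sum())),
    }
```

The key trick is that `np.bincount(cols, weights=msgs, minlength=N)` sums, per destination node, the rank flowing in over every edge, which is exactly the sparse matrix-vector product without ever materializing the N x N transition matrix.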
Pull request checklist

- Documentation updated (`README.md`, `docs/capabilities/*` as needed)

This PR is now ready for review @Aaryan-Dadu
Summary by CodeRabbit
Refactor
Chores