Trace analysis notebook#33
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Warning Review limit reached
More reviews will be available in 52 minutes and 46 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis pull request adds a new Jupyter notebook for analyzing MLflow trace data. The notebook connects to an MLflow server via environment-configured parameters, retrieves all traces for a specified experiment, normalizes trace and request/response metadata into pandas DataFrames, extracts job identifiers from response text, aggregates metrics at the session level, and generates matplotlib visualizations of cost per job. ChangesMLflow Trace Analysis Notebook
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
3a832ea to
9134d90
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
analysis/trace_analysis.ipynb (2)
54-68: ⚡ Quick winCells 2 and 3 duplicate the experiment-lookup logic.
The tracking-URI setup, client creation, and
get_experiment_by_nameblock here repeat what cell 2 (lines 29–44) already does, including a re-import ofmlflow/MlflowClient. Consolidate into a single setup cell soexperiment_idhas one source of truth and the notebook is easier to run top-to-bottom.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@analysis/trace_analysis.ipynb` around lines 54 - 68, Remove the duplicated MLflow setup and experiment lookup by consolidating the repeated imports and logic into a single initialization cell: keep one mlflow import and MlflowClient creation, one mlflow.set_tracking_uri(MLFLOW_TRACKING_URI), one client = MlflowClient(), and a single mlflow.get_experiment_by_name(...) block that assigns experiment_id; delete the redundant cell contents (the second import, set_tracking_uri, client creation, and try/except that recreates experiment_id) so experiment_id is sourced only once and the notebook runs top-to-bottom.
197-198: 💤 Low valueThis cell renders a meaningless DataFrameGroupBy object.
df_traces.groupby('session_id')produces a lazyGroupBy, anddisplayof it shows only its repr, not data.df_sessionsis also reassigned in the next cell, making this assignment dead. Consider removing the cell or replacing it with an actual aggregation/.size()view.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@analysis/trace_analysis.ipynb` around lines 197 - 198, The cell assigns df_sessions = df_traces.groupby('session_id') and then display(df_sessions), which just shows a GroupBy repr and is dead because df_sessions is reassigned later; replace or remove it: either remove the entire cell, or change the assignment to an actual aggregation/view such as df_sessions = df_traces.groupby('session_id').size().reset_index(name='count') (or another meaningful aggregation like .agg(...)) and then display(df_sessions) so the DataFrame is informative; update references to df_sessions accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@analysis/trace_analysis.ipynb`:
- Around line 120-139: The parsing loop accesses trace.data.response and
trace.info.trace_metadata keys unsafely (response_obj, response_text,
token_usage, total_cost, session_id) which can raise TypeError/KeyError and
abort the whole batch; update the logic around response_obj, span_names,
token_usage and session_id to defensively validate and default values: use
trace.data.get('response') or getattr(trace.data, 'response', None), check
isinstance(response_obj, (str, dict)) and wrap json.loads in try/except (skip or
continue on parse failure), use
trace.info.trace_metadata.get('mlflow.trace.tokenUsage') and
.get('mlflow.trace.cost') and .get('mlflow.trace.session') with sensible
defaults or skip traces missing required metadata, and only set response_text
when response_obj is a dict containing the 'response' key to avoid crashes
(apply these guards where spans, span_names, token_usage, total_cost, and
session_id are computed).
- Around line 93-98: The exception handler leaves the variable traces undefined
causing a NameError later; before the try (or inside the except) ensure traces
is initialized to an empty list so downstream cells can iterate safely. Locate
the block that assigns all_traces to traces (and the surrounding try/except) and
set traces = [] before entering the try or add traces = [] in the except
alongside the existing print/traceback so that references to traces (e.g., the
subsequent for trace in traces) degrade gracefully.
- Around line 235-240: Remove the early unguarded assignment of
df_sessions['cost_per_job'] (the line computing session_total_costs / job_count)
to avoid divide-by-zero warnings; keep the later guarded calculation that checks
job_count > 0 before dividing. Specifically, delete the initial cost_per_job
assignment referencing df_sessions, session_total_costs and job_count and rely
solely on the protected computation already present further down.
---
Nitpick comments:
In `@analysis/trace_analysis.ipynb`:
- Around line 54-68: Remove the duplicated MLflow setup and experiment lookup by
consolidating the repeated imports and logic into a single initialization cell:
keep one mlflow import and MlflowClient creation, one
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI), one client = MlflowClient(), and a
single mlflow.get_experiment_by_name(...) block that assigns experiment_id;
delete the redundant cell contents (the second import, set_tracking_uri, client
creation, and try/except that recreates experiment_id) so experiment_id is
sourced only once and the notebook runs top-to-bottom.
- Around line 197-198: The cell assigns df_sessions =
df_traces.groupby('session_id') and then display(df_sessions), which just shows
a GroupBy repr and is dead because df_sessions is reassigned later; replace or
remove it: either remove the entire cell, or change the assignment to an actual
aggregation/view such as df_sessions =
df_traces.groupby('session_id').size().reset_index(name='count') (or another
meaningful aggregation like .agg(...)) and then display(df_sessions) so the
DataFrame is informative; update references to df_sessions accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 118f7269-c342-4106-b40e-d2c450578e1a
📒 Files selected for processing (1)
analysis/trace_analysis.ipynb
| " traces = all_traces\n", | ||
| "\n", | ||
| "except Exception as e:\n", | ||
| " print(f\"Error: {e}\")\n", | ||
| " import traceback\n", | ||
| " traceback.print_exc()" |
There was a problem hiding this comment.
traces is undefined when fetching fails.
On the exception path traces is never assigned, so the next cell's for trace in traces raises NameError instead of surfacing the original failure. Initialize traces = [] before the try (or in the except) so downstream cells degrade gracefully.
Proposed fix
+ traces = []
# Fetch ALL traces with pagination
all_traces = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@analysis/trace_analysis.ipynb` around lines 93 - 98, The exception handler
leaves the variable traces undefined causing a NameError later; before the try
(or inside the except) ensure traces is initialized to an empty list so
downstream cells can iterate safely. Locate the block that assigns all_traces to
traces (and the surrounding try/except) and set traces = [] before entering the
try or add traces = [] in the except alongside the existing print/traceback so
that references to traces (e.g., the subsequent for trace in traces) degrade
gracefully.
| " response_obj = trace.data.response\n", | ||
| "\n", | ||
| " # If it's a string, parse it\n", | ||
| " if isinstance(response_obj, str):\n", | ||
| " response_obj = json.loads(response_obj)\n", | ||
| "\n", | ||
| " # Get the trace breakdown (spans)\n", | ||
| " spans = trace.data.spans\n", | ||
| "\n", | ||
| " span_names = [span.name for span in trace.data.spans]\n", | ||
| "\n", | ||
| " token_usage = json.loads(trace.info.trace_metadata['mlflow.trace.tokenUsage'])\n", | ||
| " total_tokens = token_usage['total_tokens']\n", | ||
| "\n", | ||
| " token_usage = json.loads(trace.info.trace_metadata['mlflow.trace.cost'])\n", | ||
| " total_cost = token_usage['total_cost']\n", | ||
| "\n", | ||
| " session_id = trace.info.trace_metadata['mlflow.trace.session']\n", | ||
| " # Now get the actual response text\n", | ||
| " response_text = response_obj['response']\n", |
There was a problem hiding this comment.
Unguarded response/metadata access can crash the parse loop.
Two reachable failures here:
response_objis only JSON-parsed when it is astr; iftrace.data.responseisNone(or a non-dict),response_obj['response']at line 139 raisesTypeError/KeyError.- The
trace.info.trace_metadata[...]lookups (lines 131–137) assumemlflow.trace.tokenUsage,mlflow.trace.cost, andmlflow.trace.sessionare always present; any trace missing these keys aborts the whole loop via the outerexcept, dropping all already-parsed rows.
Use .get() with defaults and skip/guard traces lacking the required metadata so one malformed trace doesn't fail the entire batch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@analysis/trace_analysis.ipynb` around lines 120 - 139, The parsing loop
accesses trace.data.response and trace.info.trace_metadata keys unsafely
(response_obj, response_text, token_usage, total_cost, session_id) which can
raise TypeError/KeyError and abort the whole batch; update the logic around
response_obj, span_names, token_usage and session_id to defensively validate and
default values: use trace.data.get('response') or getattr(trace.data,
'response', None), check isinstance(response_obj, (str, dict)) and wrap
json.loads in try/except (skip or continue on parse failure), use
trace.info.trace_metadata.get('mlflow.trace.tokenUsage') and
.get('mlflow.trace.cost') and .get('mlflow.trace.session') with sensible
defaults or skip traces missing required metadata, and only set response_text
when response_obj is a dict containing the 'response' key to avoid crashes
(apply these guards where spans, span_names, token_usage, total_cost, and
session_id are computed).
| "# Add the session total tokens and costs\n", | ||
| "df_sessions = df_sessions.merge(session_tokens, on='session_id', how='left')\n", | ||
| "df_sessions = df_sessions.merge(session_costs, on='session_id', how='left')\n", | ||
| "\n", | ||
| "# Calculate cost per job\n", | ||
| "df_sessions['cost_per_job'] = df_sessions['session_total_costs'] / df_sessions['job_count']\n", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In pandas, does DataFrame.merge with a named Series whose index name matches the on key raise KeyError, requiring reset_index?
💡 Result:
In pandas, DataFrame.merge treats a named Series as a DataFrame with a single column, where the column name is the name of the Series [1][2]. The merge parameters on, left_on, and right_on can reference either column names or index level names [3]. If you attempt to merge a DataFrame and a named Series using a key that exists as an index name in the Series, pandas should correctly identify it as a join key without requiring you to manually reset the index [3]. The merge operation treats index levels as valid merge keys, and you do not need to convert the index to a column to perform the merge [3]. If you encounter a KeyError during a merge, it is typically because the specified key does not exist in the object being merged or because of an ambiguity in how the merge parameters are configured [4]. You do not need to reset the index simply because the join key is an index name; you only need to ensure the merge parameters (such as left_on or right_on) correctly point to the intended index levels or columns [3].
Citations:
- 1: https://pandas.pydata.org/docs/reference/api/pandas.merge.html
- 2: https://github.com/pandas-dev/pandas/blob/v3.0.2/pandas/core/reshape/merge.py
- 3: https://pandas.pydata.org/docs/user_guide/merging.html
- 4: https://stackoverflow.com/questions/45624135/python-pandas-merge-cannot-find-existing-column-keyerror
Remove dead cost_per_job assignment (avoid divide-by-zero warnings)
df_sessions['cost_per_job'] is computed once at ~line 240 as session_total_costs / job_count, then overwritten later (~248–251) with a guarded version (job_count > 0 ? ...). Remove the first computation and keep only the guarded calculation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@analysis/trace_analysis.ipynb` around lines 235 - 240, Remove the early
unguarded assignment of df_sessions['cost_per_job'] (the line computing
session_total_costs / job_count) to avoid divide-by-zero warnings; keep the
later guarded calculation that checks job_count > 0 before dividing.
Specifically, delete the initial cost_per_job assignment referencing
df_sessions, session_total_costs and job_count and rely solely on the protected
computation already present further down.
Summary by CodeRabbit