Trace analysis notebook by EoghanOConnor · Pull Request #33 · redhat-et/rhdp-rca-plugin

EoghanOConnor · 2026-05-01T13:04:38Z

Summary by CodeRabbit

New Features
- Added trace analysis notebook for examining MLflow experiments. Features include data retrieval, session aggregation, and visualizations of cost-per-job and job count metrics across sessions.

review-notebook-app · 2026-05-01T13:04:44Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coderabbitai · 2026-05-01T13:04:46Z

Warning

Review limit reached

@EoghanOConnor, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 46 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: eb0b20f2-f7d2-4148-998c-87fa2c6a9078

📥 Commits

Reviewing files that changed from the base of the PR and between 12105f7 and da082f3.

📒 Files selected for processing (1)

analysis/trace_analysis.ipynb

📝 Walkthrough

Walkthrough

This pull request adds a new Jupyter notebook for analyzing MLflow trace data. The notebook connects to an MLflow server via environment-configured parameters, retrieves all traces for a specified experiment, normalizes trace and request/response metadata into pandas DataFrames, extracts job identifiers from response text, aggregates metrics at the session level, and generates matplotlib visualizations of cost per job.

Changes

MLflow Trace Analysis Notebook

Layer / File(s)	Summary
MLflow connection and trace retrieval `analysis/trace_analysis.ipynb`	Environment variables configure the MLflow tracking URI and experiment name. The notebook creates an MLflow client, looks up the experiment by name, and retrieves all traces using paginated `search_traces` calls, aggregating results into `all_traces`.
Trace data normalization and exploration `analysis/trace_analysis.ipynb`	Each trace is parsed to extract request/response JSON, span data, token counts, costs, session ID, and timestamps into normalized dicts. Results are converted to a DataFrame, displayed, and iterated through for console inspection of all fields and values.
Session aggregation and cost-per-job computation `analysis/trace_analysis.ipynb`	Traces are grouped by session ID. Helper functions `extract_job_ids` and `count_job_ids` parse "Job " occurrences from response text. Augmented traces with job counts are used to compute per-session total tokens and costs, select the representative row per session (max job count), merge totals back, and derive `cost_per_job` with division-by-zero handling.
Matplotlib plots and analysis visualization `analysis/trace_analysis.ipynb`	Multiple matplotlib figures are generated from session-level aggregated statistics, including line plots and bar charts showing average cost per job as a function of job count, with an annotated bar chart that includes session count labels.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: adding a Jupyter notebook for trace analysis with MLflow integration, data transformation, and visualization.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

analysis/trace_analysis.ipynb (2)
54-68: ⚡ Quick win

Cells 2 and 3 duplicate the experiment-lookup logic.

The tracking-URI setup, client creation, and get_experiment_by_name block here repeat what cell 2 (lines 29–44) already does, including a re-import of mlflow/MlflowClient. Consolidate into a single setup cell so experiment_id has one source of truth and the notebook is easier to run top-to-bottom.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@analysis/trace_analysis.ipynb` around lines 54 - 68, Remove the duplicated
MLflow setup and experiment lookup by consolidating the repeated imports and
logic into a single initialization cell: keep one mlflow import and MlflowClient
creation, one mlflow.set_tracking_uri(MLFLOW_TRACKING_URI), one client =
MlflowClient(), and a single mlflow.get_experiment_by_name(...) block that
assigns experiment_id; delete the redundant cell contents (the second import,
set_tracking_uri, client creation, and try/except that recreates experiment_id)
so experiment_id is sourced only once and the notebook runs top-to-bottom.
197-198: 💤 Low value

This cell renders a meaningless DataFrameGroupBy object.

df_traces.groupby('session_id') produces a lazy GroupBy, and display of it shows only its repr, not data. df_sessions is also reassigned in the next cell, making this assignment dead. Consider removing the cell or replacing it with an actual aggregation/.size() view.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@analysis/trace_analysis.ipynb` around lines 197 - 198, The cell assigns
df_sessions = df_traces.groupby('session_id') and then display(df_sessions),
which just shows a GroupBy repr and is dead because df_sessions is reassigned
later; replace or remove it: either remove the entire cell, or change the
assignment to an actual aggregation/view such as df_sessions =
df_traces.groupby('session_id').size().reset_index(name='count') (or another
meaningful aggregation like .agg(...)) and then display(df_sessions) so the
DataFrame is informative; update references to df_sessions accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@analysis/trace_analysis.ipynb`:
- Around line 120-139: The parsing loop accesses trace.data.response and
trace.info.trace_metadata keys unsafely (response_obj, response_text,
token_usage, total_cost, session_id) which can raise TypeError/KeyError and
abort the whole batch; update the logic around response_obj, span_names,
token_usage and session_id to defensively validate and default values: use
trace.data.get('response') or getattr(trace.data, 'response', None), check
isinstance(response_obj, (str, dict)) and wrap json.loads in try/except (skip or
continue on parse failure), use
trace.info.trace_metadata.get('mlflow.trace.tokenUsage') and
.get('mlflow.trace.cost') and .get('mlflow.trace.session') with sensible
defaults or skip traces missing required metadata, and only set response_text
when response_obj is a dict containing the 'response' key to avoid crashes
(apply these guards where spans, span_names, token_usage, total_cost, and
session_id are computed).
- Around line 93-98: The exception handler leaves the variable traces undefined
causing a NameError later; before the try (or inside the except) ensure traces
is initialized to an empty list so downstream cells can iterate safely. Locate
the block that assigns all_traces to traces (and the surrounding try/except) and
set traces = [] before entering the try or add traces = [] in the except
alongside the existing print/traceback so that references to traces (e.g., the
subsequent for trace in traces) degrade gracefully.
- Around line 235-240: Remove the early unguarded assignment of
df_sessions['cost_per_job'] (the line computing session_total_costs / job_count)
to avoid divide-by-zero warnings; keep the later guarded calculation that checks
job_count > 0 before dividing. Specifically, delete the initial cost_per_job
assignment referencing df_sessions, session_total_costs and job_count and rely
solely on the protected computation already present further down.

---

Nitpick comments:
In `@analysis/trace_analysis.ipynb`:
- Around line 54-68: Remove the duplicated MLflow setup and experiment lookup by
consolidating the repeated imports and logic into a single initialization cell:
keep one mlflow import and MlflowClient creation, one
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI), one client = MlflowClient(), and a
single mlflow.get_experiment_by_name(...) block that assigns experiment_id;
delete the redundant cell contents (the second import, set_tracking_uri, client
creation, and try/except that recreates experiment_id) so experiment_id is
sourced only once and the notebook runs top-to-bottom.
- Around line 197-198: The cell assigns df_sessions =
df_traces.groupby('session_id') and then display(df_sessions), which just shows
a GroupBy repr and is dead because df_sessions is reassigned later; replace or
remove it: either remove the entire cell, or change the assignment to an actual
aggregation/view such as df_sessions =
df_traces.groupby('session_id').size().reset_index(name='count') (or another
meaningful aggregation like .agg(...)) and then display(df_sessions) so the
DataFrame is informative; update references to df_sessions accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 118f7269-c342-4106-b40e-d2c450578e1a

📥 Commits

Reviewing files that changed from the base of the PR and between 108dc23 and 12105f7.

📒 Files selected for processing (1)

analysis/trace_analysis.ipynb

coderabbitai · 2026-06-03T16:21:19Z

+    "    traces = all_traces\n",
+    "\n",
+    "except Exception as e:\n",
+    "    print(f\"Error: {e}\")\n",
+    "    import traceback\n",
+    "    traceback.print_exc()"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

traces is undefined when fetching fails.

On the exception path traces is never assigned, so the next cell's for trace in traces raises NameError instead of surfacing the original failure. Initialize traces = [] before the try (or in the except) so downstream cells degrade gracefully.

Proposed fix

+ traces = [] # Fetch ALL traces with pagination all_traces = []

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@analysis/trace_analysis.ipynb` around lines 93 - 98, The exception handler leaves the variable traces undefined causing a NameError later; before the try (or inside the except) ensure traces is initialized to an empty list so downstream cells can iterate safely. Locate the block that assigns all_traces to traces (and the surrounding try/except) and set traces = [] before entering the try or add traces = [] in the except alongside the existing print/traceback so that references to traces (e.g., the subsequent for trace in traces) degrade gracefully.

coderabbitai · 2026-06-03T16:21:19Z

+    "            response_obj = trace.data.response\n",
+    "\n",
+    "            # If it's a string, parse it\n",
+    "            if isinstance(response_obj, str):\n",
+    "                response_obj = json.loads(response_obj)\n",
+    "\n",
+    "            # Get the trace breakdown (spans)\n",
+    "            spans = trace.data.spans\n",
+    "\n",
+    "            span_names = [span.name for span in trace.data.spans]\n",
+    "\n",
+    "            token_usage = json.loads(trace.info.trace_metadata['mlflow.trace.tokenUsage'])\n",
+    "            total_tokens = token_usage['total_tokens']\n",
+    "\n",
+    "            token_usage = json.loads(trace.info.trace_metadata['mlflow.trace.cost'])\n",
+    "            total_cost = token_usage['total_cost']\n",
+    "\n",
+    "            session_id = trace.info.trace_metadata['mlflow.trace.session']\n",
+    "            # Now get the actual response text\n",
+    "            response_text = response_obj['response']\n",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unguarded response/metadata access can crash the parse loop.

Two reachable failures here:

response_obj is only JSON-parsed when it is a str; if trace.data.response is None (or a non-dict), response_obj['response'] at line 139 raises TypeError/KeyError.

The trace.info.trace_metadata[...] lookups (lines 131–137) assume mlflow.trace.tokenUsage, mlflow.trace.cost, and mlflow.trace.session are always present; any trace missing these keys aborts the whole loop via the outer except, dropping all already-parsed rows.

Use .get() with defaults and skip/guard traces lacking the required metadata so one malformed trace doesn't fail the entire batch.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@analysis/trace_analysis.ipynb` around lines 120 - 139, The parsing loop accesses trace.data.response and trace.info.trace_metadata keys unsafely (response_obj, response_text, token_usage, total_cost, session_id) which can raise TypeError/KeyError and abort the whole batch; update the logic around response_obj, span_names, token_usage and session_id to defensively validate and default values: use trace.data.get('response') or getattr(trace.data, 'response', None), check isinstance(response_obj, (str, dict)) and wrap json.loads in try/except (skip or continue on parse failure), use trace.info.trace_metadata.get('mlflow.trace.tokenUsage') and .get('mlflow.trace.cost') and .get('mlflow.trace.session') with sensible defaults or skip traces missing required metadata, and only set response_text when response_obj is a dict containing the 'response' key to avoid crashes (apply these guards where spans, span_names, token_usage, total_cost, and session_id are computed).

coderabbitai · 2026-06-03T16:21:19Z

+    "# Add the session total tokens and costs\n",
+    "df_sessions = df_sessions.merge(session_tokens, on='session_id', how='left')\n",
+    "df_sessions = df_sessions.merge(session_costs, on='session_id', how='left')\n",
+    "\n",
+    "# Calculate cost per job\n",
+    "df_sessions['cost_per_job'] = df_sessions['session_total_costs'] / df_sessions['job_count']\n",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In pandas, does DataFrame.merge with a named Series whose index name matches the on key raise KeyError, requiring reset_index?

💡 Result:

In pandas, DataFrame.merge treats a named Series as a DataFrame with a single column, where the column name is the name of the Series [1][2]. The merge parameters on, left_on, and right_on can reference either column names or index level names [3]. If you attempt to merge a DataFrame and a named Series using a key that exists as an index name in the Series, pandas should correctly identify it as a join key without requiring you to manually reset the index [3]. The merge operation treats index levels as valid merge keys, and you do not need to convert the index to a column to perform the merge [3]. If you encounter a KeyError during a merge, it is typically because the specified key does not exist in the object being merged or because of an ambiguity in how the merge parameters are configured [4]. You do not need to reset the index simply because the join key is an index name; you only need to ensure the merge parameters (such as left_on or right_on) correctly point to the intended index levels or columns [3].

Citations:

1: https://pandas.pydata.org/docs/reference/api/pandas.merge.html

2: https://github.com/pandas-dev/pandas/blob/v3.0.2/pandas/core/reshape/merge.py

3: https://pandas.pydata.org/docs/user_guide/merging.html

4: https://stackoverflow.com/questions/45624135/python-pandas-merge-cannot-find-existing-column-keyerror

Remove dead cost_per_job assignment (avoid divide-by-zero warnings)

df_sessions['cost_per_job'] is computed once at ~line 240 as session_total_costs / job_count, then overwritten later (~248–251) with a guarded version (job_count > 0 ? ...). Remove the first computation and keep only the guarded calculation.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@analysis/trace_analysis.ipynb` around lines 235 - 240, Remove the early unguarded assignment of df_sessions['cost_per_job'] (the line computing session_total_costs / job_count) to avoid divide-by-zero warnings; keep the later guarded calculation that checks job_count > 0 before dividing. Specifically, delete the initial cost_per_job assignment referencing df_sessions, session_total_costs and job_count and rely solely on the protected computation already present further down.

EoghanOConnor added 2 commits June 3, 2026 16:58

[WIP]: Trace analysis notebook

1b0d843

small update

3ae7bc8

EoghanOConnor force-pushed the analysis branch 2 times, most recently from 3a832ea to 9134d90 Compare June 3, 2026 16:11

EoghanOConnor changed the title ~~[WIP]: Trace analysis notebook~~ Trace analysis notebook Jun 3, 2026

EoghanOConnor force-pushed the analysis branch from 9134d90 to 12105f7 Compare June 3, 2026 16:14

job cost analysis

da082f3

EoghanOConnor force-pushed the analysis branch from 12105f7 to da082f3 Compare June 3, 2026 16:17

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trace analysis notebook#33

Trace analysis notebook#33
EoghanOConnor wants to merge 3 commits into
redhat-et:mainfrom
EoghanOConnor:analysis

EoghanOConnor commented May 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

review-notebook-app Bot commented May 1, 2026

Uh oh!

coderabbitai Bot commented May 1, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EoghanOConnor commented May 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

review-notebook-app Bot commented May 1, 2026

Uh oh!

coderabbitai Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

EoghanOConnor commented May 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 1, 2026 •

edited

Loading