OLS-3274: add structured audit event logging with independent OTEL tracing #84
vimalk78 wants to merge 1 commit into
📝 Walkthrough
Adds structured audit logging for agent runs, including phase derivation, streaming text/thinking/tool events, completion records with token and cost metrics, provider usage extraction, and query route instrumentation.
Changes
Audit logging flow
Sequence Diagram(s)
sequenceDiagram
participant Client
participant run_endpoint
participant OpenAIProvider
participant AuditLogger
participant stdout
Client->>run_endpoint: POST /v1/agent/run
run_endpoint->>OpenAIProvider: query(...)
OpenAIProvider-->>run_endpoint: ProviderEvent stream
run_endpoint->>AuditLogger: process_event(event)
OpenAIProvider-->>run_endpoint: ResultEvent usage
run_endpoint->>AuditLogger: complete(success, input_tokens, output_tokens, cost_usd)
AuditLogger->>stdout: emit audit JSON
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
|
[APPROVALNOTIFIER] This PR is NOT APPROVED |
|
/retest |
|
All PipelineRuns for this commit have already succeeded. Use |
|
/test lint |
dc5d789 to 4c50771
|
/test e2e-claude |
8671661 to 7c3aee4
|
@vimalk78: This pull request references OLS-3274 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. |
blublinsky left a comment
should-fix: _tool_span leak on timeout, crash, or back-to-back tool calls
Location: src/lightspeed_agentic/audit.py:56-68
Problem:
The `AuditLogger` starts an OTEL span on `tool_call` and ends it on `tool_result`, but there are three paths where the span is never ended:
- Timeout/crash mid-tool: `tool_call` fires, agent hangs, `TimeoutError` raised. `complete(success=False)` doesn't end the span.
- Back-to-back `tool_call`: Two `tool_call` events without an intervening `tool_result` (e.g., parallel tool calls). Second call overwrites `_tool_span` without ending the first.
- Exception during tool execution: Error raised between `tool_call` and `tool_result` events.
Suggested fix:
Add cleanup to complete():
    def complete(self, *, success: bool, ...) -> None:
        self._flush_buffers()
        if self._tool_span is not None:
            self._tool_span.end()
            self._tool_span = None
        self._emit("audit.agent.completed", ...)
And handle back-to-back in process_event:
    case "tool_call":
        self._flush_buffers()
        if self._tool_span is not None:
            self._tool_span.end()  # end previous before starting new
        self._last_tool_name = event.name or "unknown"
        self._tool_span = self._tracer.start_span(...)
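The suggested cleanup can be exercised in isolation. The following is only a sketch under stated assumptions: `FakeSpan`, `FakeTracer`, and `SketchAuditLogger` are hypothetical stand-ins that model nothing but the tool-span lifecycle, not the PR's actual `AuditLogger` or the OTEL SDK. It routes every path through one cleanup helper so that back-to-back `tool_call` events and a failed `complete()` both end the open span.

```python
from typing import List, Optional


class FakeSpan:
    """Stand-in for an OTEL span; records whether end() was called."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ended = False

    def end(self) -> None:
        self.ended = True


class FakeTracer:
    """Stand-in tracer that remembers every span it starts."""

    def __init__(self) -> None:
        self.spans: List[FakeSpan] = []

    def start_span(self, name: str) -> FakeSpan:
        span = FakeSpan(name)
        self.spans.append(span)
        return span


class SketchAuditLogger:
    """Hypothetical, reduced logger: only the tool-span lifecycle is modeled."""

    def __init__(self, tracer: FakeTracer) -> None:
        self._tracer = tracer
        self._tool_span: Optional[FakeSpan] = None

    def _end_tool_span(self) -> None:
        # Single cleanup point, so no path can leave a span open.
        if self._tool_span is not None:
            self._tool_span.end()
            self._tool_span = None

    def process_event(self, kind: str, name: str = "unknown") -> None:
        if kind == "tool_call":
            self._end_tool_span()  # end previous before starting new
            self._tool_span = self._tracer.start_span(f"tool:{name}")
        elif kind == "tool_result":
            self._end_tool_span()

    def complete(self, *, success: bool) -> None:
        # Covers timeout, crash, and exception paths that never see tool_result.
        self._end_tool_span()


tracer = FakeTracer()
logger = SketchAuditLogger(tracer)
logger.process_event("tool_call", "a")
logger.process_event("tool_call", "b")  # back-to-back, no tool_result between
logger.complete(success=False)  # e.g. after a TimeoutError
print(all(span.ended for span in tracer.spans))  # True
```

A single slot still cannot represent truly parallel tool calls as overlapping spans; this sketch only guarantees that nothing is leaked.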
blublinsky left a comment
nice-to-have: Simpler token counting via SDK aggregated usage
Location: src/lightspeed_agentic/providers/openai.py:262-263
Current:
    input_tokens = sum(r.usage.input_tokens for r in result.raw_responses)
    output_tokens = sum(r.usage.output_tokens for r in result.raw_responses)
Suggested:
    input_tokens = result.context_wrapper.usage.input_tokens
    output_tokens = result.context_wrapper.usage.output_tokens
The SDK already aggregates token usage across all responses internally. Single access, no loop, and it's the SDK-recommended approach for totals.
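To compare the two approaches side by side, here is a small sketch. `extract_usage` is a hypothetical helper, not code from the PR, and the result objects are `SimpleNamespace` stand-ins that only mimic the attribute names quoted above. It prefers the aggregated usage and falls back to summing per-response usage when the aggregate is absent.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def extract_usage(result: Any) -> Tuple[int, int]:
    """Return (input_tokens, output_tokens) for a run result.

    Hypothetical helper: prefers the aggregated usage object and falls back
    to summing per-response usage when it is missing.
    """
    wrapper = getattr(result, "context_wrapper", None)
    usage = getattr(wrapper, "usage", None)
    if usage is not None:
        return usage.input_tokens, usage.output_tokens
    responses = getattr(result, "raw_responses", [])
    return (
        sum(r.usage.input_tokens for r in responses),
        sum(r.usage.output_tokens for r in responses),
    )


# Stand-in data: two model responses with made-up token counts.
raw = [
    SimpleNamespace(usage=SimpleNamespace(input_tokens=10, output_tokens=3)),
    SimpleNamespace(usage=SimpleNamespace(input_tokens=20, output_tokens=7)),
]
aggregated = SimpleNamespace(
    context_wrapper=SimpleNamespace(
        usage=SimpleNamespace(input_tokens=30, output_tokens=10)
    ),
    raw_responses=raw,
)
per_response_only = SimpleNamespace(raw_responses=raw)

print(extract_usage(aggregated))         # (30, 10)
print(extract_usage(per_response_only))  # (30, 10)
```

Whether the fallback is worth keeping depends on how much the aggregate is trusted; the review comment suggests the single access is enough.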
7c3aee4 to 7dfa648
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_agentic/routes/query.py`:
- Around line 133-158: The audit completion in the query flow is recorded too
early, so `audit_logger.complete` can mark `success=True` even when the final
`RunResponse` is unsuccessful. Move or recompute the `success` value in
`query.py`’s `run()` handling so it reflects the actual returned outcome after
the empty-text check and the `parsed.get("success", True)` response shaping, and
keep `audit_logger.complete` consistent in both success and failure paths.
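One way to read this finding is that the audit outcome should be derived from the same value the client receives. The sketch below is only illustrative: `RunResponse` here is a reduced stand-in, `RecordingAudit` and `finish_run` are hypothetical names, and the exact empty-text rule in `query.py` is assumed rather than known.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class RunResponse:
    """Hypothetical stand-in for the route's response model."""

    success: bool
    text: str


class RecordingAudit:
    """Stand-in audit logger that records each complete() call."""

    def __init__(self) -> None:
        self.calls: List[bool] = []

    def complete(self, *, success: bool) -> None:
        self.calls.append(success)


def finish_run(text: str, parsed: Dict[str, Any], audit: RecordingAudit) -> RunResponse:
    # Shape the response first. Assumption: empty text counts as a failure,
    # otherwise the parsed payload decides (defaulting to True).
    if not text.strip():
        response = RunResponse(success=False, text="")
    else:
        response = RunResponse(success=bool(parsed.get("success", True)), text=text)
    # Record the audit outcome from the value that is actually returned.
    audit.complete(success=response.success)
    return response


audit = RecordingAudit()
finish_run("", {}, audit)                    # empty text -> failure
finish_run("done", {"success": False}, audit)  # payload reports failure
finish_run("done", {}, audit)                # default success
print(audit.calls)  # [False, False, True]
```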
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 15fe14f2-33e1-40c7-8a96-d457be6efa69
📒 Files selected for processing (5)
src/lightspeed_agentic/audit.py
src/lightspeed_agentic/providers/openai.py
src/lightspeed_agentic/routes/query.py
tests/test_audit.py
tests/test_routes.py
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift/lightspeed-agentic-operator(manual)
Adds AuditLogger that emits structured JSON audit events to stdout during agent execution. Logging and OTEL tracing are independent controls per spec: JSON logs are gated by LIGHTSPEED_AUDIT_ENABLED, OTEL spans are always created when an endpoint is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
7dfa648 to de0dbbd
|
@vimalk78: all tests passed! |
Description:
Summary
- `AuditLogger` class that emits structured JSON audit events (`audit.agent.started`, `tool.call`, `tool.result`, `text`, `thinking`, `completed`) to stdout during agent execution
- JSON logs gated by `LIGHTSPEED_AUDIT_ENABLED`; OTEL spans created whenever an endpoint is configured
- `/run` route for all providers
Test plan
- `make test`: 130 unit tests pass
- `make lint`: clean
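As a rough illustration of the gating described in the summary, the sketch below writes one JSON object per line and does nothing when logging is off. It is not the PR's implementation: `audit_enabled`, `emit`, the accepted truthy values, and the `event`, `timestamp`, and `provider` field names are all assumptions. OTEL spans are deliberately not modeled, since the description says they are controlled separately.

```python
import io
import json
import os
import sys
import time
from typing import Any, Mapping, Optional, TextIO


def audit_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed parsing: "1", "true", or "yes" (case-insensitive) enables logging."""
    source = os.environ if env is None else env
    value = source.get("LIGHTSPEED_AUDIT_ENABLED", "")
    return value.strip().lower() in {"1", "true", "yes"}


def emit(event: str, *, enabled: bool, stream: Optional[TextIO] = None, **fields: Any) -> None:
    """Write one JSON object per line; do nothing when logging is disabled."""
    if not enabled:
        return
    out = sys.stdout if stream is None else stream
    record = {"event": event, "timestamp": time.time(), **fields}
    out.write(json.dumps(record) + "\n")


# Usage: capture output in a buffer instead of stdout for inspection.
buf = io.StringIO()
emit("audit.agent.started", enabled=True, stream=buf, provider="openai")
emit("audit.agent.completed", enabled=False, stream=buf, success=True)  # suppressed
lines = buf.getvalue().splitlines()
print(len(lines), json.loads(lines[0])["event"])  # 1 audit.agent.started
```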