OLS-3274: add structured audit event logging with independent OTEL tracing #84

Open
vimalk78 wants to merge 1 commit into openshift:main from vimalk78:ols-3274-audit-events

Conversation

@vimalk78

@vimalk78 vimalk78 commented Jun 23, 2026

Contributor

Description:

Summary

  • Adds AuditLogger class that emits structured JSON audit events (audit.agent.started, tool.call, tool.result, text, thinking, completed) to
    stdout during agent execution
  • Audit JSON logging and OTEL tracing are independent controls per spec: JSON logs gated by LIGHTSPEED_AUDIT_ENABLED, OTEL spans created whenever an endpoint
    is configured
  • Phase derivation from request context (analysis/execution/verification/escalation)
  • Wired into /run route for all providers
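The JSON side of the independent controls described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SketchAuditLogger`, `audit_enabled`, the record fields (`event`, `phase`, `ts`), and the accepted values of `LIGHTSPEED_AUDIT_ENABLED` are all assumptions. Only the variable name and the `audit.agent.started` event name come from the PR text.

```python
import io
import json
import os
import sys
import time
from typing import Any, TextIO


def audit_enabled() -> bool:
    # Assumption: "1" or "true" (case-insensitive) enables JSON audit logs.
    # The PR does not state which values it accepts.
    return os.environ.get("LIGHTSPEED_AUDIT_ENABLED", "").lower() in {"1", "true"}


class SketchAuditLogger:
    """Hypothetical stand-in for the PR's AuditLogger (JSON output only)."""

    def __init__(self, phase: str, enabled: bool, stream: TextIO = sys.stdout) -> None:
        self._phase = phase
        self._enabled = enabled
        self._stream = stream

    def emit(self, event: str, **fields: Any) -> None:
        # JSON logging is gated on its own flag; tracing would be decided elsewhere.
        if not self._enabled:
            return
        record = {"event": event, "phase": self._phase, "ts": time.time(), **fields}
        self._stream.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    buf = io.StringIO()
    SketchAuditLogger("analysis", enabled=True, stream=buf).emit("audit.agent.started")
    print(buf.getvalue(), end="")
```

Passing `enabled` explicitly, rather than reading the environment inside `emit`, keeps the gate easy to exercise in unit tests.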

Test plan

  • make test — 130 unit tests pass
  • make lint — clean
  • Verified on cluster: audit logs enabled + OTEL disabled → JSON logs to stdout, no sandbox spans
  • Verified on cluster: audit logs disabled + OTEL enabled → no JSON logs, spans in Jaeger
  • Verified on cluster: both enabled → JSON logs + spans
  • Verified trace_id correlation across analysis and execution phases

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026


Warning

Review limit reached

@vimalk78, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 3 minutes and 57 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e07ab6a9-f2ec-481a-bc6a-03d143652936

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfa648 and de0dbbd.

📒 Files selected for processing (5)
  • src/lightspeed_agentic/audit.py
  • src/lightspeed_agentic/providers/openai.py
  • src/lightspeed_agentic/routes/query.py
  • tests/test_audit.py
  • tests/test_routes.py
📝 Walkthrough

Walkthrough

Adds structured audit logging for agent runs, including phase derivation, streaming text/thinking/tool events, completion records with token and cost metrics, provider usage extraction, and query route instrumentation.

Changes

Audit logging flow

Layer | File(s) | Summary
Phase derivation and startup records | src/lightspeed_agentic/audit.py, tests/test_audit.py | derive_phase resolves the audit phase from context, AuditLogger initializes tracing and buffering state, and startup tests assert the emitted audit.agent.started record and common fields.
Streaming events and completion records | src/lightspeed_agentic/audit.py, tests/test_audit.py | AuditLogger.process_event buffers text and thinking, emits tool call/result records, flushes on block boundaries, and complete emits audit.agent.completed; tests cover buffering, tool spans, completion, and disabled logging.
Query route wiring and result metrics | src/lightspeed_agentic/providers/openai.py, src/lightspeed_agentic/routes/query.py, tests/test_routes.py | OpenAIProvider now reads token counts from result.context_wrapper.usage; /run creates an AuditLogger, forwards provider events, records token and cost metrics, and route tests cover enabled/disabled audit output and phase derivation.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant run_endpoint
  participant OpenAIProvider
  participant AuditLogger
  participant stdout
  Client->>run_endpoint: POST /v1/agent/run
  run_endpoint->>OpenAIProvider: query(...)
  OpenAIProvider-->>run_endpoint: ProviderEvent stream
  run_endpoint->>AuditLogger: process_event(event)
  OpenAIProvider-->>run_endpoint: ResultEvent usage
  run_endpoint->>AuditLogger: complete(success, input_tokens, output_tokens, cost_usd)
  AuditLogger->>stdout: emit audit JSON
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.68% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: structured audit logging with separate OTEL tracing.
Description check | ✅ Passed | The description is clearly related to the PR and describes the audit logging and tracing changes.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@openshift-ci openshift-ci Bot requested review from blublinsky and raptorsun June 23, 2026 18:08
@openshift-ci

openshift-ci Bot commented Jun 23, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xrajesh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vimalk78

Contributor Author

/retest

@red-hat-konflux

Contributor

All PipelineRuns for this commit have already succeeded. Use /retest <pipeline-name> to re-run a specific pipeline or /test to re-run all pipelines.

@vimalk78

Contributor Author

/test lint

@vimalk78 vimalk78 force-pushed the ols-3274-audit-events branch from dc5d789 to 4c50771 Compare June 24, 2026 06:05
@vimalk78

Contributor Author

/test e2e-claude

@vimalk78 vimalk78 force-pushed the ols-3274-audit-events branch 2 times, most recently from 8671661 to 7c3aee4 Compare June 25, 2026 07:43
@vimalk78 vimalk78 changed the title from "WIP: Ols 3274 audit events" to "OLS-3274: add structured audit event logging with independent OTEL tracing" Jun 25, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 25, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 25, 2026


@vimalk78: This pull request references OLS-3274 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026

@blublinsky blublinsky left a comment

Contributor

should-fix: _tool_span leak on timeout, crash, or back-to-back tool calls

Location: src/lightspeed_agentic/audit.py:56-68

Problem:

The AuditLogger starts an OTEL span on tool_call and ends it on tool_result, but there are three paths where the span is never ended:

  1. Timeout/crash mid-tool: tool_call fires, the agent hangs, and TimeoutError is raised. complete(success=False) doesn't end the span.
  2. Back-to-back tool_call: two tool_call events arrive without an intervening tool_result (e.g., parallel tool calls). The second call overwrites _tool_span without ending the first.
  3. Exception during tool execution: an error is raised between the tool_call and tool_result events.

Suggested fix:

Add cleanup to complete():

def complete(self, *, success: bool, ...) -> None:
    self._flush_buffers()
    if self._tool_span is not None:
        self._tool_span.end()
        self._tool_span = None
    self._emit("audit.agent.completed", ...)

And handle back-to-back in process_event:

case "tool_call":
    self._flush_buffers()
    if self._tool_span is not None:
        self._tool_span.end()  # end previous before starting new
    self._last_tool_name = event.name or "unknown"
    self._tool_span = self._tracer.start_span(...)
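The two fragments above can be exercised together without an OTEL backend. The following is a rough sketch under stated assumptions: `FakeSpan`, `FakeTracer`, `SpanTrackingLogger`, and its method names are hypothetical stand-ins for illustration, not the PR's classes or the OpenTelemetry API. It only shows that routing every exit path through one cleanup helper leaves no span open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeSpan:
    """Stand-in for an OTEL span; records whether end() was called."""
    name: str
    ended: bool = False

    def end(self) -> None:
        self.ended = True


@dataclass
class FakeTracer:
    """Stand-in tracer that remembers every span it started."""
    spans: List[FakeSpan] = field(default_factory=list)

    def start_span(self, name: str) -> FakeSpan:
        span = FakeSpan(name)
        self.spans.append(span)
        return span


class SpanTrackingLogger:
    """Hypothetical logger that tracks at most one open tool span."""

    def __init__(self, tracer: FakeTracer) -> None:
        self._tracer = tracer
        self._tool_span: Optional[FakeSpan] = None

    def _end_tool_span(self) -> None:
        # Single cleanup point shared by all paths below.
        if self._tool_span is not None:
            self._tool_span.end()
            self._tool_span = None

    def on_tool_call(self, name: str) -> None:
        self._end_tool_span()  # back-to-back calls: end the previous span first
        self._tool_span = self._tracer.start_span(f"tool:{name}")

    def on_tool_result(self) -> None:
        self._end_tool_span()

    def complete(self, *, success: bool) -> None:
        self._end_tool_span()  # covers the timeout and exception paths


tracer = FakeTracer()
logger = SpanTrackingLogger(tracer)
logger.on_tool_call("a")
logger.on_tool_call("b")        # no tool_result in between
logger.complete(success=False)  # e.g. after a TimeoutError
```

One caveat with this shape: for genuinely parallel tool calls, ending the first span when the second call arrives truncates its duration. Tracking spans per call identifier would avoid that, if the event stream exposes one.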

@blublinsky blublinsky left a comment

Contributor

nice-to-have: Simpler token counting via SDK aggregated usage

Location: src/lightspeed_agentic/providers/openai.py:262-263

Current:

input_tokens = sum(r.usage.input_tokens for r in result.raw_responses)
output_tokens = sum(r.usage.output_tokens for r in result.raw_responses)

Suggested:

input_tokens = result.context_wrapper.usage.input_tokens
output_tokens = result.context_wrapper.usage.output_tokens

The SDK already aggregates token usage across all responses internally. Single access, no loop, and it's the SDK-recommended approach for totals.
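To illustrate why the two forms should agree, here is a small sketch with stand-in types. `Usage`, `Response`, and `run_turns` are hypothetical and only mimic the shape of the attributes quoted above. The sketch assumes, as the comment states, that the aggregated usage is a running total over the same responses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, other: "Usage") -> None:
        self.input_tokens += other.input_tokens
        self.output_tokens += other.output_tokens


@dataclass
class Response:
    usage: Usage


def run_turns(per_turn: List[Tuple[int, int]]) -> Tuple[List[Response], Usage]:
    """Simulate a multi-turn run: keep each response and a running total."""
    responses: List[Response] = []
    total = Usage()
    for inp, out in per_turn:
        usage = Usage(inp, out)
        responses.append(Response(usage))
        total.add(usage)
    return responses, total


responses, aggregated = run_turns([(120, 30), (200, 45), (80, 10)])

# Loop form (the "Current" snippet) versus single access (the "Suggested" snippet).
looped_input = sum(r.usage.input_tokens for r in responses)
looped_output = sum(r.usage.output_tokens for r in responses)
assert (looped_input, looped_output) == (aggregated.input_tokens, aggregated.output_tokens)
```

If the real aggregate ever counted something the per-response list does not, the two forms would diverge, so a route test comparing them once may be worthwhile.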

@vimalk78 vimalk78 force-pushed the ols-3274-audit-events branch from 7c3aee4 to 7dfa648 Compare June 25, 2026 10:33

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_agentic/routes/query.py`:
- Around line 133-158: The audit completion in the query flow is recorded too
early, so `audit_logger.complete` can mark `success=True` even when the final
`RunResponse` is unsuccessful. Move or recompute the `success` value in
`query.py`’s `run()` handling so it reflects the actual returned outcome after
the empty-text check and the `parsed.get("success", True)` response shaping, and
keep `audit_logger.complete` consistent in both success and failure paths.
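One way to read this finding is sketched below, with assumptions stated up front: `final_success` is a hypothetical helper, and the actual control flow of `run()` is not shown in this conversation. Only the empty-text check and the `parsed.get("success", True)` expression come from the comment. The idea is to derive the outcome once, after response shaping, and pass that same value to both the audit record and the returned response.

```python
from typing import Any, Dict


def final_success(text: str, parsed: Dict[str, Any]) -> bool:
    """Return the outcome the caller will actually see (hypothetical helper)."""
    # Empty agent output is treated as a failure, per the empty-text check.
    if not text.strip():
        return False
    # Otherwise defer to the parsed payload, defaulting to success.
    return bool(parsed.get("success", True))


# Compute once, then reuse for audit completion and for the response, e.g.:
#   success = final_success(text, parsed)
#   audit_logger.complete(success=success, ...)
outcome_empty = final_success("", {"success": True})
outcome_default = final_success("done", {})
outcome_reported_failure = final_success("done", {"success": False})
```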
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15fe14f2-33e1-40c7-8a96-d457be6efa69

📥 Commits

Reviewing files that changed from the base of the PR and between c9daf3f and 7dfa648.

📒 Files selected for processing (5)
  • src/lightspeed_agentic/audit.py
  • src/lightspeed_agentic/providers/openai.py
  • src/lightspeed_agentic/routes/query.py
  • tests/test_audit.py
  • tests/test_routes.py
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift/lightspeed-agentic-operator (manual)

Comment thread src/lightspeed_agentic/routes/query.py
Adds AuditLogger that emits structured JSON audit events to stdout
during agent execution. Logging and OTEL tracing are independent
controls per spec: JSON logs are gated by LIGHTSPEED_AUDIT_ENABLED,
OTEL spans are always created when an endpoint is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
@vimalk78 vimalk78 force-pushed the ols-3274-audit-events branch from 7dfa648 to de0dbbd Compare June 25, 2026 11:29
@openshift-ci

openshift-ci Bot commented Jun 25, 2026


@vimalk78: all tests passed!


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants