
chore(.claude): add review-ddtrace code review skill#4613

Draft
bm1549 wants to merge 6 commits into `main` from `brian.marks/review-ddtrace-skill`

Conversation

Contributor

@bm1549 bm1549 commented Mar 27, 2026

What does this PR do?

Adds a /review-ddtrace Claude Code skill that reviews PRs against patterns and conventions dd-trace-go reviewers consistently enforce. The skill consists of:

  • review-ddtrace.md — Main skill file with universal checklist (happy path, regression tests, error handling, constants, config patterns, nil safety)
  • style-and-idioms.md — dd-trace-go-specific style patterns (beyond Effective Go)
  • concurrency.md — Mutex discipline, atomics, shared state, race-prone patterns
  • performance.md — Hot-path benchmarking, allocation avoidance, serialization correctness
  • contrib-patterns.md — Integration API design, async lifecycle, DSM patterns, cross-integration consistency

Eval Results

8 iterations of evals across 30+ unique PRs, comparing skill vs baseline (no skill) and measuring the impact of iterative refinements:

| Iteration | Comparison | Skill | Baseline | Delta |
| --- | --- | --- | --- | --- |
| 7 (10 PRs, 3-way) | post-fix vs baseline | 83% | 70% | +13pp |
| 8 (10 PRs, 2-way) | post-cleanup vs pre-cleanup | 76.7% | 76.7% | 0pp (cleanup is safe) |

Key findings:

  • Skill adds ~+15pp effective assertion rate vs baseline across out-of-sample PRs
  • Strongest signal on repo-specific patterns: DSM gating, init() avoidance, restart state, function aliases
  • General Go patterns (profiler overhead, goroutine safety) caught equally by baseline and skill
  • Two rounds of cleanup per reviewer feedback (removing Effective Go duplicates, then removing reviewer quotes / hyperspecific examples) had zero measurable regression

Motivation

Human reviewers on dd-trace-go consistently flag the same patterns — happy path alignment, missing benchmarks, lock discipline, DSM gating, init() avoidance. This skill captures those patterns so Claude Code can flag them during review, reducing round-trips.

Reviewer's Checklist

  • This is a documentation/tooling-only change (.claude/ directory) — no Go code changes
  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors.
  • New code doesn't break existing tests.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Add a Claude Code command and reference files that encode the implicit
review standards dd-trace-go reviewers consistently enforce. Distilled
from 3 months of review comments (668 inline comments across 148 PRs).

The skill is structured as a progressive-disclosure system:
- `.claude/commands/review-ddtrace.md` — main command with universal checklist
- `.claude/review-ddtrace/style-and-idioms.md` — Go style patterns
- `.claude/review-ddtrace/concurrency.md` — locking, races, restart safety
- `.claude/review-ddtrace/contrib-patterns.md` — integration conventions
- `.claude/review-ddtrace/performance.md` — hot path awareness

Evaluated across 16 PRs over 5 iterations. On 10 never-before-seen PRs,
the skill catches 58% of reviewer-flagged patterns vs 35% baseline
(+23pp), with the strongest gains on contrib integration conventions and
type safety/lifecycle patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bm1549 added the AI Generated label (largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on Mar 27, 2026

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.73%. Comparing base (ee8dce2) to head (225a001).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

see 440 files with indirect coverage changes



datadog-prod-us1-6 bot commented Mar 27, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.07% (+0.02%)

🔗 Commit SHA: eb1fd1f


pr-commenter bot commented Mar 27, 2026

Benchmarks

Benchmark execution time: 2026-03-30 18:17:08

Comparing candidate commit eb1fd1f in PR branch brian.marks/review-ddtrace-skill with baseline commit ee8dce2 in branch main.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 214 metrics, with 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:BenchmarkPartialFlushing/Disabled-25

  • 🟩 avgHeapInUse(Mb) [-4.899MB; -1.205MB] or [-8.282%; -2.038%]

Comment thread .claude/review-ddtrace/performance.md Outdated
The example claimed a function going from cost 60 to 90 "won't inline
differently" because "it was already over 80", but 60 is under the
budget — it would inline at 60 and stop at 90. Fix the explanation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread .claude/review-ddtrace/concurrency.md Outdated
Comment thread .claude/review-ddtrace/contrib-patterns.md
Comment thread .claude/review-ddtrace/contrib-patterns.md Outdated
Comment thread .claude/review-ddtrace/style-and-idioms.md
…race-go-specific patterns

Remove sections that duplicate Effective Go content from style-and-idioms.md
(import grouping, std library preference, code organization, function length).
Trim concurrency.md, performance.md, and contrib-patterns.md per reviewer feedback.

Add iteration-7 eval workspace: 3-way comparison (baseline / pre-fix / post-fix)
across 10 new PRs. Results: skill +15pp vs baseline (70%→85%); post-fix and
pre-fix within noise (85% vs 83%), confirming the cleanup has no regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

bm1549 commented Mar 30, 2026

@hannahkm I've just applied all of your feedback and ran a fresh 3-way eval across 10 new PRs to measure the impact of your suggestions.

Results (effective assertions caught out of 30):

  • Baseline (no skill): 21.0/30 (70%)
  • Pre-feedback skill: 25.5/30 (85%)
  • Post-feedback skill (your changes): 25.0/30 (83%)

The cleanup is a clean no-op, since post-feedback and pre-feedback scores are within noise. The suggestions you made to style-and-idioms.md are better than what was there before, and the evals back that up!
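The percentages above are straight ratios of effective assertions caught out of 30; a tiny sketch to reproduce them and the post-minus-pre delta (the rounding explains the 83% figure):

```go
package main

import "fmt"

// pct converts "assertions caught out of total" to a percentage.
func pct(caught, total float64) float64 { return 100 * caught / total }

func main() {
	fmt.Printf("baseline:      %.1f%%\n", pct(21.0, 30)) // 70.0%
	fmt.Printf("pre-feedback:  %.1f%%\n", pct(25.5, 30)) // 85.0%
	fmt.Printf("post-feedback: %.1f%%\n", pct(25.0, 30)) // 83.3%, reported as 83%
	fmt.Printf("post-pre delta: %.1fpp\n", pct(25.0, 30)-pct(25.5, 30)) // -1.7pp, within noise
}
```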

bm1549 and others added 3 commits March 30, 2026 11:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atterns

Single data point in eval dataset — not frequent enough to warrant dedicated guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edback

Strip all blockquoted reviewer comments, generalize hyperspecific PR
examples, collapse redundant sections. 538 → 446 lines (-17%).

Eval iteration-8 confirms identical aggregate scores (0.767 both configs)
across 10 PRs — the cleanup preserves all signal with less noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 requested a review from hannahkm March 30, 2026 18:39

@kakkoyun kakkoyun left a comment


I wonder how the performance would change if we tell it to respect and consider files like CONTRIBUTING.md and contrib/README.md during reviews.

I want to understand if it would keep up with the human maintained documentation.




4 participants