
chore(.claude): add review-ddtrace code review skill#4613

Draft
bm1549 wants to merge 6 commits into `main` from `brian.marks/review-ddtrace-skill`

Conversation

Contributor

@bm1549 bm1549 commented Mar 27, 2026

What does this PR do?

Adds a /review-ddtrace Claude Code skill that reviews PRs against patterns and conventions dd-trace-go reviewers consistently enforce. The skill consists of:

  • review-ddtrace.md — Main skill file with universal checklist (happy path, regression tests, error handling, constants, config patterns, nil safety)
  • style-and-idioms.md — dd-trace-go-specific style patterns (beyond Effective Go)
  • concurrency.md — Mutex discipline, atomics, shared state, race-prone patterns
  • performance.md — Hot-path benchmarking, allocation avoidance, serialization correctness
  • contrib-patterns.md — Integration API design, async lifecycle, DSM patterns, cross-integration consistency

Eval Results

8 iterations of evals across 30+ unique PRs, comparing skill vs baseline (no skill) and measuring the impact of iterative refinements:

| Iteration | Comparison | Skill | Baseline | Delta |
| --- | --- | --- | --- | --- |
| 7 (10 PRs, 3-way) | post-fix vs baseline | 83% | 70% | +13pp |
| 8 (10 PRs, 2-way) | post-cleanup vs pre-cleanup | 76.7% | 76.7% | 0pp (cleanup is safe) |

Key findings:

  • Skill adds ~+15pp effective assertion rate vs baseline across out-of-sample PRs
  • Strongest signal on repo-specific patterns: DSM gating, init() avoidance, restart state, function aliases
  • General Go patterns (profiler overhead, goroutine safety) caught equally by baseline and skill
  • Two rounds of cleanup per reviewer feedback (removing Effective Go duplicates, then removing reviewer quotes / hyperspecific examples) had zero measurable regression

Motivation

Human reviewers on dd-trace-go consistently flag the same patterns — happy path alignment, missing benchmarks, lock discipline, DSM gating, init() avoidance. This skill captures those patterns so Claude Code can flag them during review, reducing round-trips.

Reviewer's Checklist

  • This is a documentation/tooling-only change (.claude/ directory) — no Go code changes
  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors.
  • New code doesn't break existing tests.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Add a Claude Code command and reference files that encode the implicit
review standards dd-trace-go reviewers consistently enforce. Distilled
from 3 months of review comments (668 inline comments across 148 PRs).

The skill is structured as a progressive-disclosure system:
- `.claude/commands/review-ddtrace.md` — main command with universal checklist
- `.claude/review-ddtrace/style-and-idioms.md` — Go style patterns
- `.claude/review-ddtrace/concurrency.md` — locking, races, restart safety
- `.claude/review-ddtrace/contrib-patterns.md` — integration conventions
- `.claude/review-ddtrace/performance.md` — hot path awareness

Evaluated across 16 PRs over 5 iterations. On 10 never-before-seen PRs,
the skill catches 58% of reviewer-flagged patterns vs 35% baseline
(+23pp), with the strongest gains on contrib integration conventions and
type safety/lifecycle patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bm1549 added the AI Generated label (largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on Mar 27, 2026

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.73%. Comparing base (ee8dce2) to head (225a001).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

see 440 files with indirect coverage changes



datadog-prod-us1-6 bot commented Mar 27, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.07% (+0.02%)

🔗 Commit SHA: eb1fd1f


pr-commenter bot commented Mar 27, 2026

Benchmarks

Benchmark execution time: 2026-03-30 18:17:08

Comparing candidate commit eb1fd1f in PR branch brian.marks/review-ddtrace-skill with baseline commit ee8dce2 in branch main.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 214 metrics, with 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:BenchmarkPartialFlushing/Disabled-25

  • 🟩 avgHeapInUse(Mb) [-4.899MB; -1.205MB] or [-8.282%; -2.038%]

Comment thread .claude/review-ddtrace/performance.md Outdated
The example claimed a function going from cost 60 to 90 "won't inline
differently" because "it was already over 80", but 60 is under the
budget — it would inline at 60 and stop at 90. Fix the explanation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread .claude/review-ddtrace/concurrency.md Outdated
Comment thread .claude/review-ddtrace/contrib-patterns.md
Comment thread .claude/review-ddtrace/contrib-patterns.md Outdated
Comment thread .claude/review-ddtrace/style-and-idioms.md
…race-go-specific patterns

Remove sections that duplicate Effective Go content from style-and-idioms.md
(import grouping, std library preference, code organization, function length).
Trim concurrency.md, performance.md, and contrib-patterns.md per reviewer feedback.

Add iteration-7 eval workspace: 3-way comparison (baseline / pre-fix / post-fix)
across 10 new PRs. Results: skill +15pp vs baseline (70%→85%); post-fix and
pre-fix within noise (85% vs 83%), confirming the cleanup has no regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

bm1549 commented Mar 30, 2026

@hannahkm I've just applied all of your feedback and ran a fresh 3-way eval across 10 new PRs to measure the impact of your suggestions.

Results (effective assertions caught out of 30):

  • Baseline (no skill): 21.0/30 (70%)
  • Pre-feedback skill: 25.5/30 (85%)
  • Post-feedback skill (your changes): 25.0/30 (83%)

The cleanup is a clean no-op, since post-feedback and pre-feedback scores are within noise. The suggestions you made to style-and-idioms.md are better than what was there before, and the evals back that up!
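The percentages above are straight ratios of effective assertions caught out of 30; a tiny sketch to reproduce them and the post-minus-pre delta (the rounding explains the 83% figure):

```go
package main

import "fmt"

// pct converts "assertions caught out of total" to a percentage.
func pct(caught, total float64) float64 { return 100 * caught / total }

func main() {
	fmt.Printf("baseline:      %.1f%%\n", pct(21.0, 30)) // 70.0%
	fmt.Printf("pre-feedback:  %.1f%%\n", pct(25.5, 30)) // 85.0%
	fmt.Printf("post-feedback: %.1f%%\n", pct(25.0, 30)) // 83.3%, reported as 83%
	fmt.Printf("post-pre delta: %.1fpp\n", pct(25.0, 30)-pct(25.5, 30)) // -1.7pp, within noise
}
```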

bm1549 and others added 3 commits March 30, 2026 11:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atterns

Single data point in eval dataset — not frequent enough to warrant dedicated guidance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edback

Strip all blockquoted reviewer comments, generalize hyperspecific PR
examples, collapse redundant sections. 538 → 446 lines (-17%).

Eval iteration-8 confirms identical aggregate scores (0.767 both configs)
across 10 PRs — the cleanup preserves all signal with less noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 requested a review from hannahkm March 30, 2026 18:39

@kakkoyun kakkoyun left a comment


I wonder how the performance would change if we tell it to respect and consider files like CONTRIBUTING.md and contrib/README.md during reviews.

I want to understand if it would keep up with the human maintained documentation.




4 participants