
feat: eBPF event deduplication before CEL rule evaluation#762

Open
slashben wants to merge 2 commits into main from feature/ebpf-event-dedup

Conversation


@slashben slashben commented Mar 29, 2026

Summary

  • Adds a lock-free, fixed-size dedup cache (pkg/dedupcache/) using packed atomic.Uint64 slots (48-bit xxhash key + 16-bit expiry bucket at 64ms granularity)
  • Computes per-event-type dedup keys for: open, network, dns, capabilities, http, ssh, symlink, hardlink, ptrace, syscall. Events that should never be deduplicated (exec, exit, fork, randomx, kmod, bpf, unshare, iouring) are excluded.
  • Integrates dedup check in EventHandlerFactory.ProcessEvent() — duplicate events skip RuleManager (CEL evaluation), ContainerProfileManager, and MalwareManager while still flowing to metrics, DNS manager, and network stream
  • Configurable via EventDedupConfig (enabled by default, 2^18 = 262K slots ≈ 2MB)
  • Adds Prometheus counter node_agent_dedup_events_total{event_type, result} for observability
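
The packed-slot encoding described above (48-bit xxhash key plus 16-bit expiry bucket in one atomic uint64) can be sketched as follows. This is a minimal illustration; the helper names are not necessarily the repo's actual code, and the bit layout is inferred from the CheckAndSet snippet quoted later in the review.

```go
package main

import "fmt"

// pack keeps the top 48 bits of the xxhash key and stores a 16-bit
// expiry bucket in the low bits, so one atomic load sees both.
func pack(key uint64, expiry uint16) uint64 {
	return (key & 0xFFFFFFFFFFFF0000) | uint64(expiry)
}

// unpack splits a stored slot back into its key prefix and expiry.
func unpack(stored uint64) (key uint64, expiry uint16) {
	return stored & 0xFFFFFFFFFFFF0000, uint16(stored)
}

func main() {
	stored := pack(0xDEADBEEFCAFE1234, 42)
	k, e := unpack(stored)
	fmt.Printf("key=%#x expiry=%d\n", k, e) // key=0xdeadbeefcafe0000 expiry=42
}
```

Because a slot is one uint64, a lookup needs only a single atomic load followed by a comparison against `key & 0xFFFFFFFFFFFF0000`, with no locks.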

Motivation

Node-Agent Performance Epic §1.3 — targets 10% of the 20% CPU reduction goal by deduplicating repetitive eBPF events (e.g., repeated file opens, DNS lookups) before they reach the expensive CEL rule evaluation engine.

Design

See design/node-agent-performance-epic/ebpf-event-deduplication.md for the full design document.

Benchmarks

  • DedupCache.CheckAndSet: ~6ns/op miss, ~0.8ns/op hit, 0 allocations
  • Key computation: 19–44ns/op depending on event type, 0 allocations
  • Race detector: clean under 100-goroutine concurrent hammer test

Test plan

  • Unit tests for cache correctness (insert, TTL expiry, slot collision, pack/unpack)
  • Concurrent hammer test (100 goroutines × 10,000 ops)
  • Deterministic key output tests and collision resistance checks
  • Benchmarks for cache and all key functions
  • Race detector clean
  • Config test updated with new defaults
  • Full go test ./... passes (pre-existing eBPF permission failures only)
  • Kind cluster benchmark: baseline vs dedup CPU comparison

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Event deduplication to reduce redundant event processing; configurable with tunable cache parameters.
    • New metrics reporting showing deduplicated vs passed events per event type.
  • Documentation

    • Added deduplication benchmark and reproduction guide with results and analysis.
  • Chores

    • Dependency manifest updated to support deduplication-related components.

Add a lock-free dedup cache that prevents structurally identical eBPF
events from reaching the expensive CEL rule engine. Events are keyed by
type-specific fields (mntns + pid + relevant attributes) using xxhash,
with per-event-type TTL windows (2-10s). The cache uses packed
atomic uint64 slots (48-bit key + 16-bit expiry bucket) for zero-lock
concurrent access from the 3,000-goroutine worker pool.
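
The 64ms bucket arithmetic implied above can be sketched as follows; only the 64ms granularity and the 2-10s TTL range come from the description, the variable names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const bucketMs = 64 // granularity stated in the PR description

	// A 2s TTL (the low end of the per-event-type 2-10s windows)
	// spans this many 64ms buckets.
	ttlBuckets := uint16(2000 / bucketMs)

	// Current bucket, truncated to 16 bits as in the integration code.
	currentBucket := uint16(time.Now().UnixNano() / (bucketMs * 1_000_000))

	fmt.Println(ttlBuckets) // 31
	_ = currentBucket
}
```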

Consumers opt in to skipping duplicates: RuleManager, ContainerProfileManager,
and MalwareManager skip; Metrics, DNSManager, NetworkStream, and RulePolicy
always process. No events are dropped — the Duplicate flag is advisory.

Benchmarks: cache check ~7ns/op, key computation 24-52ns/op, 0 allocations.

Implements design/node-agent-performance-epic/ebpf-event-deduplication.md §1.3
(targets 10% of the 20% CPU reduction goal).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben <ben@armosec.io>

coderabbitai bot commented Mar 29, 2026

📝 Walkthrough

Walkthrough

Adds an optional event deduplication system: a lock-free DedupCache, xxhash-based key generators, integration into the event enrichment and handler pipeline, metrics for dedup events, and config flags with defaults, plus an early skip in rule reporting for duplicates. Also promotes xxhash and protobuf to direct go.mod requires.

Changes

  • Dependency Management (go.mod): Moved github.com/cespare/xxhash/v2 v2.3.0 and google.golang.org/protobuf v1.36.11 from indirect to direct require.
  • Configuration (pkg/config/config.go, pkg/config/config_test.go): Added EventDedupConfig and Config.EventDedup; defaults set (enabled=true, slotsExponent=18); tests updated.
  • Deduplication Cache Core (pkg/dedupcache/dedup_cache.go, pkg/dedupcache/dedup_cache_test.go): New lock-free DedupCache using atomic uint64 slots (48-bit key + 16-bit expiry bucket), constructor NewDedupCache, CheckAndSet, TTL semantics, unit tests and benchmarks including concurrency and collision tests.
  • Deduplication Key Functions (pkg/dedupcache/keys.go, pkg/dedupcache/keys_test.go): Added exported Compute* key functions (Open, Network, DNS, Capabilities, HTTP, SSH, Symlink, Hardlink, Ptrace, Syscall) using xxhash; tests and benchmarks validate determinism and input sensitivity.
  • Event Data Model (pkg/ebpf/events/enriched_event.go): EnrichedEvent gains Duplicate bool and DedupBucket uint16.
  • Event Processing Integration (pkg/containerwatcher/v2/container_watcher.go, pkg/containerwatcher/v2/event_handler_factory.go): Container watcher sets DedupBucket during enrichment and conditionally creates DedupCache; EventHandlerFactory accepts dedupCache, computes per-event keys/TTL, calls CheckAndSet, marks Duplicate, reports dedup metrics, and skips handlers in dedupSkipSet for duplicates.
  • Metrics Reporting (pkg/metricsmanager/metrics_manager_interface.go, pkg/metricsmanager/metrics_manager_mock.go, pkg/metricsmanager/prometheus/prometheus.go): Added ReportDedupEvent(eventType, duplicate) to interface and mock; Prometheus adds node_agent_dedup_events_total counter with (event_type, result) and reports passed/deduplicated.
  • Rule Manager Integration (pkg/rulemanager/rule_manager.go): ReportEnrichedEvent now returns immediately for enrichedEvent.Duplicate, skipping enrichment, rule eval, metrics, cooldowns, and exports.
  • Docs (docs/dedup-testing/benchmark-results.md): New benchmarking and results documentation for dedup behavior and resource impact.
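
The rule-manager integration listed above reduces to an early return on the new Duplicate flag. A minimal sketch, with the struct fields taken from the enriched-event change and the function shape assumed for illustration:

```go
package main

import "fmt"

// EnrichedEvent mirrors the fields added in pkg/ebpf/events/enriched_event.go.
type EnrichedEvent struct {
	Duplicate   bool
	DedupBucket uint16
}

// ReportEnrichedEvent sketches the early skip described for
// pkg/rulemanager/rule_manager.go; the signature here is illustrative.
func ReportEnrichedEvent(e *EnrichedEvent) string {
	if e.Duplicate {
		// No enrichment, rule eval, metrics, cooldowns, or exports.
		return "skipped"
	}
	return "evaluated"
}

func main() {
	fmt.Println(ReportEnrichedEvent(&EnrichedEvent{Duplicate: true}))  // skipped
	fmt.Println(ReportEnrichedEvent(&EnrichedEvent{Duplicate: false})) // evaluated
}
```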

Sequence Diagram

sequenceDiagram
    participant EventSource as Event Source
    participant ContainerWatcher as Container Watcher
    participant EventHandlerFactory as Event Handler Factory
    participant DedupCache as Dedup Cache
    participant MetricsManager as Metrics Manager
    participant RuleManager as Rule Manager

    EventSource->>ContainerWatcher: Raw Event
    ContainerWatcher->>ContainerWatcher: Enrich Event, set DedupBucket
    ContainerWatcher->>EventHandlerFactory: enrichAndProcess(EnrichedEvent)
    EventHandlerFactory->>EventHandlerFactory: Compute dedup key & TTL
    EventHandlerFactory->>DedupCache: CheckAndSet(key, ttlBuckets, currentBucket)
    alt Duplicate Found
        DedupCache-->>EventHandlerFactory: true
        EventHandlerFactory->>EventHandlerFactory: Set EnrichedEvent.Duplicate = true
        EventHandlerFactory->>MetricsManager: ReportDedupEvent(..., duplicate=true)
    else New Event
        DedupCache-->>EventHandlerFactory: false
        EventHandlerFactory->>MetricsManager: ReportDedupEvent(..., duplicate=false)
    end
    EventHandlerFactory->>EventHandlerFactory: Dispatch handlers (skip dedupSkipSet if Duplicate)
    EventHandlerFactory->>RuleManager: ReportEnrichedEvent(EnrichedEvent)
    alt Is Duplicate
        RuleManager->>RuleManager: Return early
    else Not Duplicate
        RuleManager->>RuleManager: Enrich, filter, evaluate, export
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • matthyx

Poem

🐇✨ I hopped through hashes, buckets in tow,
A tiny cache where duplicate echoes go.
Keys spun by xxhash, expiry tucked neat,
Events pass once — no repeated beat. 🎩

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 40.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately and concisely describes the main objective of the PR, adding event deduplication before CEL rule evaluation, which is the primary change across the changeset.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
pkg/metricsmanager/metrics_manager_mock.go (1)

67-67: Consider recording dedup results in the mock.

This keeps the interface satisfied, but it also makes the new metric path unassertable in unit tests. Tracking duplicate/passed counts here would keep dedup observability testable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/metrics_manager_mock.go` at line 67, The
MetricsMock.ReportDedupEvent stub should record dedup outcomes so tests can
assert on them: update the MetricsMock struct to include fields (e.g.,
DedupDuplicateCount, DedupPassedCount and/or a DedupEvents slice) and a
sync.Mutex for thread-safety, then implement ReportDedupEvent(eventType
utils.EventType, duplicate bool) to lock, increment the appropriate counter (or
append a record with eventType and duplicate), and unlock; expose simple
getter/accessor methods or make the fields public so unit tests can inspect
counts/events.
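
A recording mock along the lines the nitpick suggests might look like this. The field names follow the suggestion, the event-type parameter is simplified to a string (the real interface uses utils.EventType), and none of this is the repo's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricsMock records dedup outcomes so unit tests can assert on them.
type MetricsMock struct {
	mu                  sync.Mutex
	DedupDuplicateCount int
	DedupPassedCount    int
}

// ReportDedupEvent increments the matching counter under a mutex for
// thread safety, since the real handler pool is highly concurrent.
func (m *MetricsMock) ReportDedupEvent(eventType string, duplicate bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if duplicate {
		m.DedupDuplicateCount++
	} else {
		m.DedupPassedCount++
	}
}

func main() {
	m := &MetricsMock{}
	m.ReportDedupEvent("open", true)
	m.ReportDedupEvent("open", false)
	fmt.Println(m.DedupDuplicateCount, m.DedupPassedCount) // 1 1
}
```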
pkg/metricsmanager/prometheus/prometheus.go (1)

395-400: Avoid per-event label-map allocations here.

ReportDedupEvent runs on the dedup hot path, but With(prometheus.Labels{...}) creates a fresh map on every call. That adds avoidable allocation/GC pressure right where this PR is trying to save CPU. WithLabelValues(string(eventType), result) or a tiny cached lookup would keep this path much cheaper.

♻️ Minimal change
-	p.dedupEventCounter.With(prometheus.Labels{eventTypeLabel: string(eventType), "result": result}).Inc()
+	p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/prometheus/prometheus.go` around lines 395 - 400,
ReportDedupEvent currently allocates a new map on every hot-path call by using
p.dedupEventCounter.With(prometheus.Labels{...}), causing GC/alloc pressure;
change it to use the zero-allocation variant WithLabelValues(string(eventType),
result) or implement a small cached map/observer keyed by eventType/result to
avoid per-call map allocations - update the ReportDedupEvent function to call
p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc() (or use a
precomputed label handle cache) and keep references to dedupEventCounter and
eventTypeLabel names to locate the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d640beeb-a69e-4b2d-9fa9-a9145caec339

📥 Commits

Reviewing files that changed from the base of the PR and between 2b05e16 and e1cdc49.

📒 Files selected for processing (14)
  • go.mod
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/containerwatcher/v2/container_watcher.go
  • pkg/containerwatcher/v2/event_handler_factory.go
  • pkg/dedupcache/dedup_cache.go
  • pkg/dedupcache/dedup_cache_test.go
  • pkg/dedupcache/keys.go
  • pkg/dedupcache/keys_test.go
  • pkg/ebpf/events/enriched_event.go
  • pkg/metricsmanager/metrics_manager_interface.go
  • pkg/metricsmanager/metrics_manager_mock.go
  • pkg/metricsmanager/prometheus/prometheus.go
  • pkg/rulemanager/rule_manager.go

Comment on lines +28 to +32
// EventDedupConfig controls eBPF event deduplication before CEL rule evaluation.
type EventDedupConfig struct {
	Enabled       bool  `mapstructure:"enabled"`
	SlotsExponent uint8 `mapstructure:"slotsExponent"`
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C3 'type EventDedupConfig|SlotsExponent|eventDedup::slotsExponent|NewDedupCache\(' pkg/config pkg/containerwatcher/v2 pkg/dedupcache
rg -n -C4 'func NewDedupCache|slotsExponent|1\s*<<|make\(|panic|return .*error' pkg/dedupcache

Repository: kubescape/node-agent

Length of output: 6520


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for LoadConfig function and any validation around eventDedup
rg -n -A 30 'func LoadConfig' pkg/config/config.go | head -60

# Search for any bounds checking or max exponent constant
rg -n 'max.*exponent|exponent.*max|MAX.*EXPONENT|slotsExponent.*<|slotsExponent.*>|validate|bounds' pkg/ -i

# Check if eventDedup appears anywhere else with potential constraints
rg -n 'eventDedup' pkg/ -C 2

Repository: kubescape/node-agent

Length of output: 7611


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get complete LoadConfig function from config.go
sed -n '145,260p' pkg/config/config.go

# Also check if there's a Validate method on Config struct
rg -n 'func.*Config.*Validate|func \(c \*Config\)' pkg/config/ -A 5

Repository: kubescape/node-agent

Length of output: 6085


Add validation for SlotsExponent to prevent invalid cache allocations.

The slotsExponent field in EventDedupConfig is passed directly to dedupcache.NewDedupCache(...) without bounds checking. Since NewDedupCache performs an unconstrained bit-shift (size := uint64(1) << slotsExponent), an invalid exponent (e.g., > 63) can cause undefined behavior or runtime panic. Add explicit validation in LoadConfig to ensure slotsExponent stays within a safe range (e.g., 0–31, matching typical cache sizes up to 2 GB).

Also applies to: 79, 193–194

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/config.go` around lines 28 - 32, The
EventDedupConfig.SlotsExponent is used unchecked when creating the dedup cache
and can cause an overflow/panic in dedupcache.NewDedupCache; update LoadConfig
to validate SlotsExponent (e.g., ensure 0 <= SlotsExponent <= 31) before using
it, returning an error if out of range. Locate the struct EventDedupConfig and
the code path in LoadConfig (and any other places that call
dedupcache.NewDedupCache) and add explicit bounds checking for SlotsExponent,
with a clear error message referencing the invalid value, so callers cannot pass
an exponent >31 (or your chosen safe max) into dedupcache.NewDedupCache.


func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry) {
	enrichedEvent := cw.eventEnricher.EnrichEvents(entry)
	enrichedEvent.DedupBucket = uint16(time.Now().UnixNano() / (64 * 1_000_000))

⚠️ Potential issue | 🟠 Major

Keep a stable dedup bucket for the whole queue batch.

Line 473 recomputes the bucket per event even though EnrichedEvent.DedupBucket is documented as batch-cached. A large drain can cross a 64ms boundary mid-batch and let same-batch duplicates miss dedup.

🧩 One way to keep the bucket stable
 func (cw *ContainerWatcher) processQueueBatch() {
 	batchSize := cw.cfg.EventBatchSize
 	processedCount := 0
+	dedupBucket := uint16(time.Now().UnixNano() / (64 * 1_000_000))
 	for !cw.orderedEventQueue.Empty() && processedCount < batchSize {
 		event, ok := cw.orderedEventQueue.PopEvent()
 		if !ok {
 			break
 		}
-		cw.enrichAndProcess(event)
+		cw.enrichAndProcess(event, dedupBucket)
 		processedCount++
 	}
 }
 
-func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry) {
+func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry, dedupBucket uint16) {
 	enrichedEvent := cw.eventEnricher.EnrichEvents(entry)
-	enrichedEvent.DedupBucket = uint16(time.Now().UnixNano() / (64 * 1_000_000))
+	enrichedEvent.DedupBucket = dedupBucket
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/containerwatcher/v2/container_watcher.go` at line 473, The DedupBucket is
being recomputed per event which can change mid-batch; compute the bucket once
for the entire queue batch and reuse it. Move the calculation
(uint16(time.Now().UnixNano() / (64 * 1_000_000))) out of the per-event code and
store it in a local variable (e.g., dedupBucket) at the start of the batch
processing routine, then assign enrichedEvent.DedupBucket = dedupBucket for each
event; ensure the calculation and assignment use the same expression so behavior
is unchanged except now stable across the batch.

Comment on lines +193 to +196
// Populate dedupSkipSet: managers that skip processing when event is duplicate.
// RuleManager checks enrichedEvent.Duplicate internally.
factory.dedupSkipSet[containerProfileAdapter] = struct{}{}
factory.dedupSkipSet[malwareManager] = struct{}{}

⚠️ Potential issue | 🟠 Major

Don't let dedup suppress dropped-event reporting.

containerProfileAdapter does more than profile updates: at Lines 101-104 it also calls ReportDroppedEvent. Once the whole adapter is in dedupSkipSet, any duplicate event with HasDroppedEvents()==true skips that call too, so the profile manager can miss kernel-drop notifications and keep treating the profile as complete when it isn't. Please split the dropped-event path out of the adapter, or run that part before the duplicate skip.

Also applies to: 332-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/containerwatcher/v2/event_handler_factory.go` around lines 193 - 196, The
dedupSkipSet currently contains containerProfileAdapter (and malwareManager),
which causes ReportDroppedEvent calls within containerProfileAdapter to be
skipped for duplicate events; modify the event flow so dropped-event reporting
is always executed before deduplication: extract the dropped-event handling path
(calls to ReportDroppedEvent and checks of enrichedEvent.HasDroppedEvents) out
of containerProfileAdapter and invoke it unconditionally prior to evaluating
factory.dedupSkipSet, or alternatively call the adapter's ReportDroppedEvent
logic explicitly before the duplicate-skip check; ensure the same change is
applied to the second occurrence referenced around the other dedupSkipSet
insertion so kernel-drop notifications are never suppressed.

Comment on lines +16 to +21
func NewDedupCache(slotsExponent uint8) *DedupCache {
	size := uint64(1) << slotsExponent
	return &DedupCache{
		slots: make([]atomic.Uint64, size),
		mask:  size - 1,
	}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat pkg/dedupcache/dedup_cache.go

Repository: kubescape/node-agent

Length of output: 1705


🏁 Script executed:

cat > /tmp/test_shift.go << 'EOF'
package main

import "fmt"

func main() {
	// Test shift behavior with counts >= 64
	for exp := uint8(62); exp < 66; exp++ {
		size := uint64(1) << exp
		mask := size - 1
		fmt.Printf("exponent=%d, size=%d, mask=%d\n", exp, size, mask)
	}
	
	// Direct test with 64 and 65
	fmt.Println("\nDirect shift tests:")
	fmt.Printf("uint64(1) << 64 = %d\n", uint64(1) << 64)
	fmt.Printf("uint64(1) << 65 = %d\n", uint64(1) << 65)
	
	// What happens with a zero-sized slice?
	fmt.Println("\nSlice access with zero size:")
	slice := make([]uint64, 0)
	fmt.Printf("len(slice) = %d, cap(slice) = %d\n", len(slice), cap(slice))
	
	// Try to access it (this will panic)
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("Panic caught: %v\n", r)
		}
	}()
	_ = slice[0]
	fmt.Println("Access succeeded (unexpected)")
}
EOF
go run /tmp/test_shift.go

Repository: kubescape/node-agent

Length of output: 300


🏁 Script executed:

cat > /tmp/test_shift.go << 'EOF'
package main

import "fmt"

func main() {
	// Test shift behavior with runtime variable counts >= 64
	for exp := uint8(62); exp < 66; exp++ {
		size := uint64(1) << exp
		mask := size - 1
		fmt.Printf("exponent=%d, size=%d, mask=%d\n", exp, size, mask)
	}
	
	// What happens with a zero-sized slice?
	fmt.Println("\nSlice behavior:")
	size := uint64(0)
	slice := make([]uint64, size)
	fmt.Printf("len(slice) = %d, cap(slice) = %d\n", len(slice), cap(slice))
	
	// Try to access it (this will panic)
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("Panic caught: %v\n", r)
		}
	}()
	_ = slice[0]
	fmt.Println("Access succeeded (unexpected)")
}
EOF
go run /tmp/test_shift.go

Repository: kubescape/node-agent

Length of output: 399


🏁 Script executed:

rg "NewDedupCache" --type go -B 2 -A 2

Repository: kubescape/node-agent

Length of output: 2734


🏁 Script executed:

rg "EventDedup" --type go -B 5 -A 5

Repository: kubescape/node-agent

Length of output: 5619


🏁 Script executed:

rg "SlotsExponent" --type go -B 3 -A 3

Repository: kubescape/node-agent

Length of output: 1406


🏁 Script executed:

rg "EventDedup" --type go -A 20 | grep -A 20 "Validate\|validate"

Repository: kubescape/node-agent

Length of output: 46


🏁 Script executed:

rg "func.*Validate" pkg/config --type go -A 10

Repository: kubescape/node-agent

Length of output: 46


🏁 Script executed:

cat pkg/config/config.go | head -200

Repository: kubescape/node-agent

Length of output: 13131


🏁 Script executed:

cat pkg/config/config.go | tail -100

Repository: kubescape/node-agent

Length of output: 2894


Validate slotsExponent before constructing the cache; invalid values cause a panic.

NewDedupCache is exported, but uint64(1) << slotsExponent produces zero once slotsExponent reaches 64. This results in a zero-length slots slice and mask underflow to 18446744073709551615. At runtime, CheckAndSet will panic with "index out of range" when it tries to access c.slots[idx]. Additionally, exponents much below 64 but still unreasonable (e.g., > 30) create unbounded memory allocations. Add a bounds check in the constructor to reject invalid exponents.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/dedupcache/dedup_cache.go` around lines 16 - 21, NewDedupCache currently
computes size := uint64(1) << slotsExponent without validating slotsExponent
which can yield size==0 (when slotsExponent>=64) or unreasonably large sizes
(e.g., >30) causing panics or huge allocations; add a bounds check at the start
of NewDedupCache to reject invalid exponents (e.g., require slotsExponent > 0
and <= 30 or another project-safe maximum), and return an error or panic with a
clear message instead of constructing a zero/huge slice; ensure the check
prevents creating a DedupCache with zero-length slots and protects callers of
CheckAndSet from out-of-range accesses.

Comment on lines +36 to +46
func (c *DedupCache) CheckAndSet(key uint64, ttlBuckets uint16, currentBucket uint16) bool {
	idx := key & c.mask

	stored := c.slots[idx].Load()
	storedKey, storedExpiry := unpack(stored)
	if storedKey == (key & 0xFFFFFFFFFFFF0000) && storedExpiry > currentBucket {
		return true // duplicate
	}

	c.slots[idx].Store(pack(key, currentBucket+ttlBuckets))
	return false

⚠️ Potential issue | 🔴 Critical

uint16 expiry rolls over every ~70 minutes.

The caller feeds CheckAndSet a uint16 64ms time bucket, so storedExpiry > currentBucket breaks as soon as the counter wraps at 65535 -> 0: fresh entries inserted just before rollover stop matching right after rollover, and stale slots can become "live" again on a later cycle. That means dedup will silently misclassify events after a long-running agent uptime. Please switch expiry tracking to a non-wrapping time base, or clear/epoch the cache on bucket rollover, and add a regression test around the wrap boundary.
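
The wrap failure is easy to reproduce with illustrative numbers. The serial-number comparison at the end is one wrap-tolerant alternative (RFC 1982 style), in addition to the widening and epoching options suggested above:

```go
package main

import "fmt"

func main() {
	// Entry inserted just before the uint16 bucket counter wraps.
	var current uint16 = 65530
	var ttl uint16 = 31 // ~2s of 64ms buckets

	expiry := current + ttl // wraps: 65530+31 mod 65536 = 25
	fmt.Println(expiry)     // 25

	// One bucket later, the naive comparison misclassifies the
	// still-fresh entry as expired.
	next := current + 1
	fmt.Println(expiry > next) // false: looks expired immediately

	// Wrap-tolerant serial-number comparison: the subtraction also
	// wraps, so the signed difference stays positive while alive.
	fmt.Println(int16(expiry-next) > 0) // true: still alive
}
```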

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/dedupcache/dedup_cache.go` around lines 36 - 46, The expiry comparison in
DedupCache.CheckAndSet uses a wrapping uint16 bucket which breaks across the
65535→0 rollover; change the expiry to a non-wrapping time base by extending the
stored expiry field to a larger integer (e.g., use a uint32/uint64 expiry inside
the packed value), update pack and unpack to encode/decode that wider expiry,
ensure CheckAndSet stores pack(key, uint64(currentBucket)+uint64(ttlBuckets))
and compares storedExpiry > uint64(currentBucket) using the wider type, and add
a regression test that simulates buckets across the 65535→0 boundary to verify
dedup behavior; alternatively, if you prefer epoching, implement explicit
rollover detection in CheckAndSet that clears/epochs slots when currentBucket <
lastSeenBucket and add the same wrap test.

Comment on lines +41 to +48
func ComputeNetworkKey(mntns uint64, pid uint32, dstAddr string, dstPort uint16, proto string) uint64 {
	h := xxhash.New()
	writeUint64(h, mntns)
	writeUint32(h, pid)
	h.WriteString(dstAddr)
	writeUint16(h, dstPort)
	h.WriteString(proto)
	return h.Sum64()

⚠️ Potential issue | 🔴 Critical

Length-delimit the string components before hashing.

Several key builders concatenate adjacent variable-length strings directly into the hash stream (capability+syscall, host+path+rawQuery, oldPath+newPath, etc.). Different tuples can therefore serialize to the exact same bytes before xxhash is even applied—for example, host="fo", path="o/bar" and host="foo", path="/bar". Because duplicates bypass security handlers, this turns into false dedup for distinct events. Please prefix each string with its length, or another unambiguous delimiter, before writing it.

🔧 Safer serialization pattern
```diff
+func writeString(h *xxhash.Digest, s string) {
+	writeUint32(h, uint32(len(s)))
+	h.WriteString(s)
+}
+
 func ComputeHTTPKey(mntns uint64, pid uint32, direction string, method string, host string, path string, rawQuery string) uint64 {
 	h := xxhash.New()
 	writeUint64(h, mntns)
 	writeUint32(h, pid)
-	h.WriteString(direction)
-	h.WriteString(method)
-	h.WriteString(host)
-	h.WriteString(path)
-	h.WriteString(rawQuery)
+	writeString(h, direction)
+	writeString(h, method)
+	writeString(h, host)
+	writeString(h, path)
+	writeString(h, rawQuery)
 	return h.Sum64()
 }
```

Apply the same pattern to every helper that hashes more than one variable-length string.

Also applies to: 61-67, 71-80, 93-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/dedupcache/keys.go` around lines 41 - 48, ComputeNetworkKey (and the
other key-builder helpers that hash multiple variable-length strings such as the
capability+syscall builder, the host+path+rawQuery builder, and the
oldPath+newPath builder) currently write adjacent strings directly into the
xxhash stream which can cause ambiguous collisions; update each function
(starting with ComputeNetworkKey) to prefix every string component with its
length (or another unambiguous delimiter) before calling h.WriteString so the
concatenation is unambiguous—e.g., call writeUint32/Uint16/Uint64 for the string
length then write the string itself for each variable-length field; apply the
same change to all similar helpers mentioned in the comment so every
multi-string hash is length-delimited.

Benchmark comparing v0.3.71 (baseline) vs dedup branch on a kind cluster
with 1000 open/s, 100 http/s load. Results show -16% avg CPU, -29% peak
CPU, with 91-99% dedup ratios on high-frequency event types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben <ben@armosec.io>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/dedup-testing/benchmark-results.md`:
- Line 87: Replace the misspelled directory name "perfornamce" with the correct
spelling "performance" in the markdown line that references the directory (the
string "perfornamce" in the diff); update any other occurrences of the same typo
in the document (search for "perfornamce") so all directory references use
"performance".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a3b9631-0ff1-4184-a380-896c0145c38b

📥 Commits

Reviewing files that changed from the base of the PR and between e1cdc49 and 0de1bad.

⛔ Files ignored due to path filters (4)
  • docs/dedup-testing/after_cpu_usage.png is excluded by !**/*.png
  • docs/dedup-testing/after_memory_usage.png is excluded by !**/*.png
  • docs/dedup-testing/before_cpu_usage.png is excluded by !**/*.png
  • docs/dedup-testing/before_memory_usage.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/dedup-testing/benchmark-results.md

## Reproducing

```bash
cd perfornamce
```

⚠️ Potential issue | 🟡 Minor

Fix typo in directory path.

The directory name contains a typo: "perfornamce" should be "performance".

📝 Proposed fix
```diff
-cd perfornamce
+cd performance
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/dedup-testing/benchmark-results.md` at line 87, Replace the misspelled
directory name "perfornamce" with the correct spelling "performance" in the
markdown line that references the directory (the string "perfornamce" in the
diff); update any other occurrences of the same typo in the document (search for
"perfornamce") so all directory references use "performance".

@matthyx matthyx moved this to WIP in KS PRs tracking Mar 31, 2026