
conn, tun: replace pre-made buffers with an allocator pattern#49

Draft
illotum wants to merge 4 commits into tailscale from illotum/rss-optimization

Conversation


@illotum illotum commented Feb 19, 2026

  • Replace fixed 65 KB per-buffer allocations with a tiered sync.Pool
  • Push buffer allocation to the edge: tun.Read and conn.ReceiveFunc accept nil buffer entries and allocate on demand
  • Each buffer carries a recycle function routing it back to its origin pool
  • TX coalescing (coalesceMessages) uses scatter-gather instead of copying into a contiguous slice
  • GRO coalescing uses a bump-pointer Arena for temporary scratch buffers

Reduces resident memory by an order of magnitude and improves best-case (iperf) throughput by about 10%.

Updates tailscale/corp#36989

Change-Id: I1a72cd54d4be2b3b0c7568982e1a700c6a6a6964

@illotum illotum force-pushed the illotum/rss-optimization branch 9 times, most recently from 496dc1a to 61b6c72 Compare March 8, 2026 03:43
@illotum illotum self-assigned this Mar 9, 2026
@illotum illotum force-pushed the illotum/rss-optimization branch 4 times, most recently from ea6f82a to e981b07 Compare March 12, 2026 03:37
buffer/buffer.go Outdated
// The returned Data slice must not be retained past Release.
type Buffer struct {
	data    []byte
	recycle func(*Buffer)
}
Member

Currently, recycle is always nil.

I'd expect that different buffer.Source implementations would want to use different recycle implementations (e.g., tsvnic vs rioconn vs sync.Pool).

What's the plan here?

Member

Currently, buffer.FragmentPool sets the unexported recycle field directly, but external packages won't be able to do that, so I'm wondering what the API surface should look like.

Or is the expectation that all allocators be implemented in this package? We can explore that for rioconn, but perhaps not for tsvnic. This is what we currently have for rioconn: https://github.com/tailscale/tailscale/blob/nickkhyl/rioconn/net/rioconn/request_ring.go

Author

Do either of them need more than uint64 of metadata? Per my original message, I considered just packing a metadata field into every buffer. And yes, making them all public.

If we need more, interfaces become an appealing option.

Member

yeah, I'd expect uint64 (or perhaps even uintptr) to be sufficient.

We can iterate on the specifics later, since it seems we're on the same page about the API, and there's also no expectation that all implementations should live in wireguard-go/buffer.

Author

After some deliberation, I opted to instead embed an interface -- I/O implementations may stuff it with arbitrary data as needed. LMK what you think.

buffer/buffer.go Outdated
package buffer

const (
	MaxMessageSize = (1 << 16) - 1 // largest possible UDP datagram
)
Member

does this relate to a similar constant defined in device/queueconstants*, or should it be independent?

switch err {
case nil:
if bufs[0] == nil {
bufs[0] = buffer.New(make([]byte, buffer.MaxMessageSize), nil)
Member

Related to https://github.com/tailscale/wireguard-go/pull/49/changes#r2955084809

This would have previously been sized to 2016 bytes. Other platforms vary as well, e.g. android, iOS.

@jwhited
Member

jwhited commented Mar 18, 2026

I ran some local, single TCP stream benchmarks w/iperf3 on Ubuntu 24.04 w/Intel i5-12400 CPUs.

4184faf (what's currently used by tailscale/tailscale):

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  14.9 GBytes  12.8 Gbits/sec  604             sender
[  5]   0.00-10.00  sec  14.9 GBytes  12.8 Gbits/sec                  receiver

472f369:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.2 GBytes  10.5 Gbits/sec  291             sender
[  5]   0.00-10.00  sec  12.2 GBytes  10.5 Gbits/sec                  receiver

RSS is much improved w/lots of streams (-P 32): 1GB+ vs ~200MB

messageBuffers *WaitPool
inboundElements *WaitPool
outboundElements *WaitPool
messageBuffers buffer.Source
Member

reminder: We will need to support a limit on outstanding buffers, or on outstanding aggregate buffer size, for mobile platforms (e.g. the PreallocatedBuffersPerPool constant today), which WaitPool previously enforced with its max field set to a nonzero value.

device/send.go Outdated

err = peer.SendBuffers([][]byte{buf})
err = peer.SendBuffers([][]byte{buf.Data()})
buf.Release()
Member

Can we defer release immediately following instantiation? Same with all the other handshake and initiation message sending methods.

tun/tun_linux.go Outdated
}
file := os.NewFile(uintptr(fd), "/dev/tun")
bufPool := buffer.NoErrSource{Source: buffer.NewDefaultFragmentPool()}
arenaBuffer, _ := bufPool.Get(32 * 16 << 10)
Member

Is this sizing arbitrary?

Can we document the decimal value in human-readable form?

@@ -452,10 +468,11 @@ type setGSOFunc func(control *[]byte, gsoSize uint16)

func coalesceMessages(addr *net.UDPAddr, ep *StdNetEndpoint, bufs [][]byte, offset int, msgs []ipv6.Message, setGSO setGSOFunc) int {
Member

I would love to see the scatter-gather changes land in tailscale/tailscale (net/batching pkg) before larger tailscale/wireguard-go changes.

My understanding is that:

  1. they are a prerequisite for the larger packet memory architecture changes in this PR
  2. they provide a modest performance improvement by collapsing memcpy/linearization into one place, kernel-side
  3. they are widely supported: if sendmmsg() is supported, scatter-gather is supported through it
  4. they also benefit peer relay, which uses net/batching

While I don't anticipate any issues with the approach, it would be better (and easier) to identify kernel bugs or whatever early and detached from the larger changes here.

@illotum illotum force-pushed the illotum/rss-optimization branch from 472f369 to 74691ad Compare March 18, 2026 22:18
illotum added 3 commits March 25, 2026 21:09
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: I3852ae79c0a5f6bed0fad16e1860eed06a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: I97a2d22561468de14b17e09d557f212b6a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: I58908d9d3fd09441e9378a74b0ee19136a6a6964
@illotum illotum force-pushed the illotum/rss-optimization branch from 74691ad to d7467f5 Compare March 26, 2026 20:57
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: I48217f8f461b17a01901cee9ab64d45e6a6a6964
@illotum illotum force-pushed the illotum/rss-optimization branch from d7467f5 to f357c5a Compare March 26, 2026 23:41