conn, tun: replace pre-made buffers with an allocator pattern #49
Conversation
Force-pushed 496dc1a to 61b6c72
Force-pushed ea6f82a to e981b07
buffer/buffer.go
Outdated
```go
// The returned Data slice must not be retained past Release.
type Buffer struct {
	data    []byte
	recycle func(*Buffer)
```
Currently, recycle is always nil.
I'd expect that different buffer.Source implementations would want to use different recycle implementations (e.g., tsvnic vs rioconn vs sync.Pool).
What's the plan here?
Currently, buffer.FragmentPool sets the unexported recycle field directly, but external packages won't be able to do that, so I'm wondering what the API surface should look like.
Or is the expectation that all allocators be implemented in this package? We can explore that for rioconn, but perhaps not for tsvnic. This is what we currently have for rioconn: https://github.com/tailscale/tailscale/blob/nickkhyl/rioconn/net/rioconn/request_ring.go
Do either of them need more than uint64 of metadata? Per my original message, I considered just packing a metadata field into every buffer. And yes, making them all public.
If we need more, interfaces become an appealing option.
yeah, I'd expect uint64 (or perhaps even uintptr) to be sufficient.
We can iterate on the specifics later, since it seems we're on the same page about the API, and there's also no expectation that all implementations should live in wireguard-go/buffer.
After some deliberation, I opted to instead embed an interface -- I/O implementations may stuff it with arbitrary data as needed. LMK what you think.
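A minimal sketch of the embedded-interface shape being discussed (field and constructor names here are hypothetical, not the PR's actual API): the Buffer carries an opaque metadata slot that an I/O implementation can stuff as needed, plus a recycle hook supplied by whichever Source created it.

```go
package main

import "fmt"

// Metadata is a hypothetical marker interface; I/O implementations
// (e.g. rioconn, tsvnic) could attach arbitrary per-buffer state here.
type Metadata interface{}

// Buffer sketches the discussed shape: data, an opaque metadata slot,
// and a recycle hook set by the Source that handed the buffer out.
type Buffer struct {
	data    []byte
	meta    Metadata
	recycle func(*Buffer)
}

// New is a hypothetical constructor letting external packages supply
// their own recycle implementation.
func New(data []byte, recycle func(*Buffer)) *Buffer {
	return &Buffer{data: data, recycle: recycle}
}

// Data must not be retained past Release.
func (b *Buffer) Data() []byte { return b.data }

// Release returns the buffer to its Source via the recycle hook.
func (b *Buffer) Release() {
	if b.recycle != nil {
		b.recycle(b)
	}
}

func main() {
	recycled := false
	b := New(make([]byte, 16), func(*Buffer) { recycled = true })
	b.Release()
	fmt.Println(recycled) // true
}
```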
buffer/buffer.go
Outdated
```go
package buffer

const (
	MaxMessageSize = (1 << 16) - 1 // largest possible UDP datagram
```
Does this relate to the similar constant defined in device/queueconstants*, or should it be independent?
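For reference, the constant works out to 65535: the maximum total length an IP datagram's 16-bit length field can express, which upper-bounds any UDP payload. A tiny demonstration (the relationship to the device package's constants is the open question above, not settled here):

```go
package main

import "fmt"

// Mirrors the constant in the diff; whether device/queueconstants*
// should share it is a separate question.
const MaxMessageSize = (1 << 16) - 1 // largest possible UDP datagram

func main() {
	fmt.Println(MaxMessageSize) // 65535
}
```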
tun/tun_windows.go
Outdated
```go
switch err {
case nil:
	if bufs[0] == nil {
		bufs[0] = buffer.New(make([]byte, buffer.MaxMessageSize), nil)
```
Related to https://github.com/tailscale/wireguard-go/pull/49/changes#r2955084809
This would have previously been sized to 2016 bytes. Other platforms vary as well, e.g. Android, iOS.
I ran some local, single TCP stream benchmarks with iperf3 on Ubuntu 24.04 with Intel i5-12400 CPUs, comparing against 4184faf (what's currently used by tailscale/tailscale). RSS is much improved with lots of streams (-P 32): 1GB+ vs ~200MB.
```diff
-	messageBuffers   *WaitPool
-	inboundElements  *WaitPool
-	outboundElements *WaitPool
+	messageBuffers buffer.Source
```
Reminder: we will need to support a limit on outstanding buffers (or on outstanding aggregate buffer size) for mobile platforms, e.g. the PreallocatedBuffersPerPool constant today, which WaitPool previously enforced with its max field set to a nonzero value.
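One way a Source could enforce such a limit, akin to WaitPool with a nonzero max, is a channel used as a counting semaphore: Get blocks once the limit of outstanding buffers is reached, and Put frees a slot. A sketch (all names hypothetical):

```go
package main

import "fmt"

// boundedSource caps the number of outstanding buffers, similar in
// spirit to WaitPool's max field. Names are illustrative only.
type boundedSource struct {
	slots chan struct{} // counting semaphore: cap == max outstanding
	size  int
}

func newBoundedSource(limit, size int) *boundedSource {
	return &boundedSource{slots: make(chan struct{}, limit), size: size}
}

// Get blocks when `limit` buffers are already outstanding.
func (s *boundedSource) Get() []byte {
	s.slots <- struct{}{}
	return make([]byte, s.size)
}

// Put releases one outstanding slot.
func (s *boundedSource) Put(buf []byte) {
	<-s.slots
}

func main() {
	s := newBoundedSource(2, 1500)
	a, b := s.Get(), s.Get()
	fmt.Println(len(a), len(b)) // 1500 1500
	s.Put(a)
	c := s.Get() // would have blocked before Put(a)
	fmt.Println(len(c))
	_ = b
}
```

An aggregate-byte-size limit could use the same shape with a weighted semaphore instead of a unit one.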
device/send.go
Outdated
```diff
-	err = peer.SendBuffers([][]byte{buf})
+	err = peer.SendBuffers([][]byte{buf.Data()})
+	buf.Release()
```
Can we defer release immediately following instantiation? Same with all the other handshake and initiation message sending methods.
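The suggested pattern, in a toy self-contained form (the types here stand in for buffer.Buffer and are invented for illustration): deferring Release right after the buffer is obtained guarantees every exit path, including early error returns, recycles it.

```go
package main

import (
	"errors"
	"fmt"
)

// toy buffer with a release hook, standing in for buffer.Buffer
type buf struct{ released *bool }

func (b buf) Release() { *b.released = true }

// send illustrates the pattern: defer Release immediately after
// instantiation so both the error and success returns recycle.
func send(fail bool, released *bool) error {
	b := buf{released: released}
	defer b.Release() // covers every return path below
	if fail {
		return errors.New("send failed")
	}
	return nil
}

func main() {
	var onSuccess, onError bool
	_ = send(false, &onSuccess)
	_ = send(true, &onError)
	fmt.Println(onSuccess, onError) // true true
}
```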
tun/tun_linux.go
Outdated
```go
	}
	file := os.NewFile(uintptr(fd), "/dev/tun")
	bufPool := buffer.NoErrSource{Source: buffer.NewDefaultFragmentPool()}
	arenaBuffer, _ := bufPool.Get(32 * 16 << 10)
```
Is this sizing arbitrary?
Can we document the decimal value in human-readable form?
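For the record, `32 * 16 << 10` works out to 524288 bytes (512 KiB). Naming the factors would document the intent; the breakdown below (32 slots of 16 KiB each) is my guess at the meaning, not something the PR states:

```go
package main

import "fmt"

// Hypothetical named constants for the sizing in the diff.
// Whether 32 and 16 KiB are the right values is the open question;
// the point is to name them rather than inline 32 * 16 << 10.
const (
	arenaSlots    = 32
	arenaSlotSize = 16 << 10                  // 16 KiB per slot
	arenaSize     = arenaSlots * arenaSlotSize // 524288 bytes (512 KiB)
)

func main() {
	fmt.Println(arenaSize) // 524288
}
```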
```diff
@@ -452,10 +468,11 @@ type setGSOFunc func(control *[]byte, gsoSize uint16)

 func coalesceMessages(addr *net.UDPAddr, ep *StdNetEndpoint, bufs [][]byte, offset int, msgs []ipv6.Message, setGSO setGSOFunc) int {
```
I would love to see the scatter-gather changes land in tailscale/tailscale (net/batching pkg) before larger tailscale/wireguard-go changes.
My understanding is that:
- they are a prerequisite for the larger packet memory architecture changes in this PR
- they provide a modest performance improvement by collapsing memcpy/linearization into one place, kernel-side
- they are widely supported: if sendmmsg() is supported, scatter-gather is supported through it
- they also benefit peer relay, which uses net/batching
While I don't anticipate any issues with the approach, it would be better (and easier) to identify kernel bugs or whatever early and detached from the larger changes here.
Force-pushed 472f369 to 74691ad
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com> Change-Id: I3852ae79c0a5f6bed0fad16e1860eed06a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com> Change-Id: I97a2d22561468de14b17e09d557f212b6a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com> Change-Id: I58908d9d3fd09441e9378a74b0ee19136a6a6964
Force-pushed 74691ad to d7467f5
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com> Change-Id: I48217f8f461b17a01901cee9ab64d45e6a6a6964
Force-pushed d7467f5 to f357c5a
Summary of changes:
- coalesceMessages() uses scatter-gather instead of copying into a contiguous slice
- Arena for temporary scratch buffers

Reduces resident memory by an order of magnitude and improves (best-case, iperf) throughput by about 10%.
Updates tailscale/corp#36989
Change-Id: I1a72cd54d4be2b3b0c7568982e1a700c6a6a6964