Languages: English | 简体中文 | 日本語 | Español | Français
Lock-free and wait-free FIFO queue implementations for Go.
Package lfq provides lock-free and wait-free bounded FIFO queues optimized for different producer/consumer patterns (SPSC, MPSC, SPMC, MPMC), ensuring predictable, zero-allocation scaling under high contention.
lfq builds on foundational concurrency research to overcome the limitations of traditional mutexes and CAS (Compare-And-Swap) loops:
- Lamport Ring Buffer (1977): Powers our Single-Producer/Single-Consumer (SPSC) paths, offering strictly $O(1)$ wait-free latency without expensive atomic read-modify-write instructions.
- Scalable Circular Queue (SCQ, 2019): Our multi-party (MPSC, SPMC, MPMC) paths implement Ruslan Nikolaev's SCQ algorithm. By using hardware Fetch-And-Add (FAA) instructions instead of CAS loops, lfq inherently avoids contention bottlenecks and ABA problems, operating as a standalone queue without external safe memory reclamation (e.g., hazard pointers).
lfq is built for iox-based non-blocking I/O stacks. It eschews implicit runtime blocking in favor of explicit backpressure. If a queue is full or empty, it immediately returns ErrWouldBlock. This concise error model integrates perfectly with event loops and caller-side backoff primitives (like iox.Backoff), allowing predictable traffic management and sub-microsecond latency boundaries.
```go
// Direct constructor (recommended for most cases)
q := lfq.NewSPSC[Event](1024)

// Builder API - auto-selects algorithm based on constraints
q := lfq.Build[Event](lfq.New(1024).SingleProducer().SingleConsumer()) // → SPSC
q := lfq.Build[Event](lfq.New(1024).SingleConsumer())                  // → MPSC
q := lfq.Build[Event](lfq.New(1024).SingleProducer())                  // → SPMC
q := lfq.Build[Event](lfq.New(1024))                                   // → MPMC
```

```sh
go get code.hybscloud.com/lfq
```

Requirements: Go 1.26+
For better performance, compile with the intrinsics-optimized Go compiler:
```sh
# Using Makefile (recommended)
make install-compiler         # Download pre-built release (~30 seconds)
make build                    # Build with intrinsics compiler
make test                     # Test with intrinsics compiler

# Or build compiler from source (bleeding-edge)
make install-compiler-source
```

Manual installation:

```sh
# Pre-built release (recommended)
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/')
URL=$(curl -fsSL https://api.github.com/repos/hayabusa-cloud/go/releases/latest | grep "browser_download_url.*${OS}-${ARCH}\.tar\.gz\"" | cut -d'"' -f4)
curl -fsSL "$URL" | tar -xz -C ~/sdk
mv ~/sdk/go ~/sdk/go-atomix

# Use for building lfq-dependent code
GOROOT=~/sdk/go-atomix ~/sdk/go-atomix/bin/go build ./...
```

The intrinsics compiler inlines atomix operations with proper memory ordering. The standard Go compiler works for basic testing but may exhibit issues under high contention.
| Type | Pattern | Progress Guarantee | Use Case |
|---|---|---|---|
| SPSC | Single-Producer Single-Consumer | Wait-free | Pipeline stages, channels |
| MPSC | Multi-Producer Single-Consumer | Lock-free (wait-free dequeue) | Event aggregation, logging |
| SPMC | Single-Producer Multi-Consumer | Lock-free (wait-free enqueue) | Work distribution |
| MPMC | Multi-Producer Multi-Consumer | Lock-free | General purpose |
- Wait-free: Every operation completes in bounded steps
- Lock-free: System-wide progress guaranteed; at least one thread makes progress
Classic bounded buffer with cached index optimization.
```go
q := lfq.NewSPSC[int](1024)

// Producer
q.Enqueue(&value) // Wait-free O(1)

// Consumer
elem, err := q.Dequeue() // Wait-free O(1)
```

By default, multi-access queues implement the SCQ (Scalable Circular Queue) algorithm using FAA (Fetch-And-Add) instructions. FAA blindly increments position counters, requiring 2n physical slots for capacity n, but scales better under high contention than CAS-based alternatives.
- Trade-off: Requires `2n` physical slots for nominal capacity `n`.
- Transient capacity: With `P` concurrent producers, up to `P-1` additional items may be transiently enqueued beyond `Cap()` before backpressure applies.
```go
// Multiple producers, single consumer
q := lfq.NewMPSC[Event](1024) // FAA producers, wait-free dequeue

// Single producer, multiple consumers
q := lfq.NewSPMC[Task](1024) // Wait-free enqueue, FAA consumers

// Multiple producers and consumers
q := lfq.NewMPMC[*Request](4096) // FAA-based SCQ algorithm
```

Cycle-based slot validation provides ABA safety without epoch counters or hazard pointers.
Indirect and Ptr queue variants (all non-SPSC variants except Compact Indirect) pack sequence number and payload into a single 128-bit atomic. This reduces cache line contention and improves throughput under high concurrency.
For Ptr variants that encode pointers into 128-bit slots (FAA and PtrSeq variants), keep a typed Go reference to enqueued objects until they are dequeued (or otherwise guaranteed consumed). Do not rely on queued pointer bits alone as a GC reachability root.
```go
// Indirect - single 128-bit atomic per operation
q := lfq.NewMPMCIndirect(4096)

// Ptr - same optimization for unsafe.Pointer
q := lfq.NewMPMCPtr(4096)
```

Automatic algorithm selection based on constraints:
```go
// SPSC - both constraints → Lamport ring
q := lfq.Build[T](lfq.New(1024).SingleProducer().SingleConsumer())

// MPSC - single consumer only
q := lfq.Build[T](lfq.New(1024).SingleConsumer())

// SPMC - single producer only
q := lfq.Build[T](lfq.New(1024).SingleProducer())

// MPMC - no constraints (default)
q := lfq.Build[T](lfq.New(1024))
```

Each queue type has three variants:
| Variant | Element Type | Use Case |
|---|---|---|
| Generic | `[T any]` | Type-safe, general purpose |
| Indirect | `uintptr` | Index-based pools, handles |
| Ptr | `unsafe.Pointer` | Zero-copy pointer passing |
```go
// Generic
q := lfq.NewMPMC[MyStruct](1024)

// Indirect - for pool indices
q := lfq.NewMPMCIndirect(1024)
q.Enqueue(uintptr(poolIndex))

// Pointer - zero-copy
q := lfq.NewMPMCPtr(1024)
q.Enqueue(unsafe.Pointer(obj))
```

Compact() selects CAS-based algorithms that use n physical slots (vs 2n for the FAA-based default). Use when memory efficiency is more important than contention scalability:
```go
// Compact mode - CAS-based, n slots
q := lfq.New(4096).Compact().BuildIndirect()
```

| Mode | Algorithm | Physical Slots | Use When |
|---|---|---|---|
| Default | FAA-based | 2n | High contention, scalability |
| Compact | CAS-based | n | Memory constrained |
SPSC variants already use n slots (Lamport ring buffer) and ignore Compact(). For Indirect queues with Compact(), values are limited to 63 bits.
| Operation | Returns | Description |
|---|---|---|
| `Enqueue(elem)` | `error` | Add element; returns `ErrWouldBlock` if full |
| `Dequeue()` | `(T, error)` | Remove element; returns `ErrWouldBlock` if empty |
| `Cap()` | `int` | Queue capacity |
```go
err := q.Enqueue(&item)
if lfq.IsWouldBlock(err) {
	// Queue is full - backpressure or retry
}

elem, err := q.Dequeue()
if lfq.IsWouldBlock(err) {
	// Queue is empty - wait or poll
}
```

The following patterns demonstrate how lfq can be integrated into concurrent systems. By avoiding allocations in the hot path and using the appropriate queue variant, you can achieve substantial latency reductions.
A common pattern for zero-allocation I/O is pre-allocating a pool of buffers and tracking their available indices using a wait-free SPSC queue. This eliminates GC pressure entirely.
```go
const poolSize = 1024
const bufSize = 4096

// Pre-allocate buffers
pool := make([][]byte, poolSize)
for i := range pool {
	pool[i] = make([]byte, bufSize)
}

// Free list tracks available indices
freeList := lfq.NewSPSCIndirect(poolSize)
for i := range poolSize {
	freeList.Enqueue(uintptr(i))
}

// Allocate
func Alloc() ([]byte, uintptr, bool) {
	idx, err := freeList.Dequeue()
	if err != nil {
		return nil, 0, false
	}
	return pool[idx], idx, true
}

// Free
func Free(idx uintptr) {
	freeList.Enqueue(idx)
}
```

```go
type Event struct {
	Source    string
	Timestamp time.Time
	Data      any
}

// Multiple sources → Single processor
events := lfq.NewMPSC[Event](8192)

// Event sources (multiple producers)
for sensor := range slices.Values(sensors) {
	go func(s Sensor) {
		for reading := range s.Readings() {
			ev := Event{
				Source:    s.Name(),
				Timestamp: time.Now(),
				Data:      reading,
			}
			events.Enqueue(&ev)
		}
	}(sensor)
}

// Single aggregator (single consumer)
go func() {
	for {
		ev, err := events.Dequeue()
		if err == nil {
			aggregate(ev)
		}
	}
}()
```

```go
// With retry and yield
func EnqueueWithRetry(q lfq.Queue[Item], item Item, maxRetries int) bool {
	ba := iox.Backoff{}
	for i := range maxRetries {
		if q.Enqueue(&item) == nil {
			return true
		}
		ba.Wait() // Yield to let consumers drain
	}
	return false // Apply backpressure to caller
}
```

FAA-based multi-consumer queues (MPMC, SPMC) include a threshold mechanism to prevent livelock. MPSC also implements Drainer, but as a graceful-shutdown signal only (no dequeue threshold skip). For graceful shutdown where producers finish before consumers, use the Drainer interface:
```go
// Producer goroutines finish
prodWg.Wait()

// Signal no more enqueues will occur
if d, ok := q.(lfq.Drainer); ok {
	d.Drain()
}

// Consumers can now drain all remaining items.
// For MPMC/SPMC, Drain avoids threshold-based early exit.
for {
	item, err := q.Dequeue()
	if err != nil {
		break // Queue is empty
	}
	process(item)
}
```

Drain() is a hint: the caller must ensure no further Enqueue() calls will be made. For MPMC/SPMC, Drain() avoids threshold-based early exit in Dequeue. For MPSC, Drain() is a shutdown signal only. SPSC queues do not implement Drainer; the type assertion naturally handles this case.
```
┌─────────────────────────────────────────────────────────────────┐
│                     How many producers?                         │
│                                                                 │
│   ┌──────────────────┐          ┌────────────────────┐          │
│   │   One (SPSC/     │          │  Multiple (MPMC/   │          │
│   │      SPMC)       │          │      MPSC)         │          │
│   └────────┬─────────┘          └─────────┬──────────┘          │
│            │                              │                     │
│            ▼                              ▼                     │
│   ┌──────────────────┐          ┌──────────────────┐            │
│   │  One consumer?   │          │  One consumer?   │            │
│   └────────┬─────────┘          └────────┬─────────┘            │
│      Yes   │   No                  Yes   │   No                 │
│       │    │    │                   │    │    │                 │
│       ▼    │    ▼                   ▼    │    ▼                 │
│     SPSC   │  SPMC                MPSC   │  MPMC                │
│            │                             │                      │
└────────────┴─────────────────────────────┴─────────────────────┘

Variant Selection:
• Generic [T]  → Type-safe, copying semantics
• Indirect     → Pool indices, buffer offsets (uintptr)
• Ptr          → Zero-copy object passing (unsafe.Pointer)
```
Capacity rounds up to the next power of 2:
```go
q := lfq.NewMPMC[int](3)    // Actual capacity: 4
q := lfq.NewMPMC[int](4)    // Actual capacity: 4
q := lfq.NewMPMC[int](1000) // Actual capacity: 1024
q := lfq.NewMPMC[int](1024) // Actual capacity: 1024
```

Minimum capacity is 2. Constructors panic if capacity < 2.
All queues use cache-line padding (64 bytes) to prevent false sharing:
```go
type MPMC[T any] struct {
	_      [64]byte      // Padding
	tail   atomix.Uint64 // Producer index
	_      [64]byte      // Padding
	head   atomix.Uint64 // Consumer index
	_      [64]byte      // Padding
	buffer []slot[T]
	// ...
}
```

Go's race detector is not designed for lock-free algorithm verification. It tracks explicit sync primitives (mutexes, channels) but cannot observe happens-before relationships established by atomic memory orderings.
Tests use two protection mechanisms:
- Build tag `//go:build !race` excludes example files from race testing
- Runtime check `if lfq.RaceEnabled { t.Skip() }` skips concurrent tests in `lockfree_test.go`
Run `go test -race ./...` for race-safe tests, or `go test ./...` for all tests.
- code.hybscloud.com/iox — Semantic errors (`ErrWouldBlock`)
- code.hybscloud.com/atomix — Atomic primitives with explicit memory ordering
- code.hybscloud.com/spin — Spin primitives
| Platform | Status |
|---|---|
| linux/amd64 | Primary |
| linux/arm64 | Supported |
| linux/riscv64 | Supported |
| linux/loong64 | Supported |
| darwin/amd64, darwin/arm64 | Supported |
| freebsd/amd64, freebsd/arm64 | Supported |
- Nikolaev, R. (2019). A Scalable, Portable, and Memory-Efficient Lock-Free FIFO Queue. DISC 2019 (LIPIcs). https://doi.org/10.4230/LIPIcs.DISC.2019.28. Preprint: https://arxiv.org/abs/1908.04511.
- Lamport, L. (1977). Proving the Correctness of Multiprocess Programs. IEEE Transactions on Software Engineering, 3(2), 125–143.
- Vyukov, D. (2010). Bounded MPMC Queue. 1024cores.net. https://1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue.
- Herlihy, M. (1991). Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 13(1), 124–149.
- Herlihy, M., & Wing, J. M. (1990). Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12(3), 463–492.
- Michael, M. M., & Scott, M. L. (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. In Proceedings of the 15th ACM Symposium on Principles of Distributed Computing (PODC '96), pp. 267–275.
- Adve, S. V., & Gharachorloo, K. (1996). Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12), 66–76.
MIT — see LICENSE.
©2026 Hayabusa Cloud Co., Ltd.