Semantic search for Go with efficient int8 embeddings. Gobed searches over compressed static embeddings, with automatic GPU detection, int8 quantization for memory efficiency, and 7.9x model compression.
- 6.39s average search time on 243K documents (current)
- 1.7 queries/sec throughput with parallel processing
- Int8 quantization - 7.9x compression, 87.4% space saved
- 0.151ms embedding latency with 6,629 embeddings/sec
- 15MB memory usage for full model vs 119MB original
Built on static embeddings with GPU kernel fusion for maximum speed.
# 1. Install
go get github.com/lee101/gobed
# 2. Download model weights (one-time, 119MB)
git clone https://github.com/lee101/gobed
cd gobed
./setup.sh
# 3. Run!
go run examples/search_demo.go

# Prerequisites: CUDA 12.8
./setup_gpu.sh # Automated GPU setup
# Or manual build:
cd gpu_search/cuda_ops
./build.sh
# Run with GPU
go build -tags="gpu cuda" your_app.go
export LD_LIBRARY_PATH="$PWD/gpu_search:$LD_LIBRARY_PATH"
./your_app

package main
import (
    "fmt"

    "github.com/lee101/gobed"
)

func main() {
    // Load model
    model, _ := gobed.LoadModel()

    // Create search engine
    engine := gobed.NewSearchEngine(model)

    // Index your documents
    docs := []string{
        "Machine learning transforms data into insights",
        "Deep learning mimics human neural networks",
        "Natural language processing understands text",
    }
    engine.IndexBatch(docs)

    // Search - returns results in <1ms
    results, _ := engine.Search("neural networks", 3)
    for _, r := range results {
        fmt.Printf("[%.3f] %s\n", r.Similarity, r.Text)
    }
}

- 1ms search latency on datasets that fit in GPU memory
- 150,000+ embeddings/second on CPU alone
- 2.5x faster with GPU for large-scale operations
- 75% less memory with INT8 quantization
- Zero dependencies - pure Go with optional CUDA
Real benchmarks on commodity hardware:
| Dataset Size | Search Latency | Throughput |
|---|---|---|
| 1,000 docs | 357 μs | 2,798 QPS |
| 10,000 docs | 1.77 ms | 566 QPS |
| 100,000 docs | 2.23 ms | 448 QPS |
| 1M docs (GPU) | 947 ms batch | 1,056 QPS |
bed is the command-line front end that applies Gobed embeddings to your local
projects. It can index and search using CPU-only mode or take advantage of a
CUDA-enabled GPU (via cuVS/CAGRA) for sub-millisecond querying.
go install github.com/lee101/gobed/cmd/bed@v1.0.0
# or:
go install github.com/lee101/gobed/bed/cmd/bed@v1.0.0

# 1. Install CUDA 12.8 and fetch the Gobed model
./setup.sh
# 2. Run bed with GPU support (CAGRA + CUDA)
export LD_LIBRARY_PATH="$(pwd)/gpu:/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH}"
bed --gpu "memory leak in handler" # searches the current directoryUseful sub-commands:
# Index a project (stores the embedding index for faster repeat searches)
bed index /path/to/project
# Run a GPU search against an indexed project
bed --gpu --limit 15 "database connection"
# CPU fallback
bed "keyword"
# Live index mode
bed index . --watch
# Performance + quality
bed bench . --queries 200 --ndcg

We added a Go benchmark that indexes the repository's testdata/ directory and
measures semantic search throughput:
cd bed
go test ./src -bench BedSearch -run ^$

The benchmark indexes the sample files once and then repeatedly searches using
SimpleSearchEngine, reporting queries_per_second so you can compare CPU and
GPU configurations on your machine.
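For orientation, a benchmark of the same shape can also be written directly against the public gobed API. The sketch below is illustrative only: the document set and the extra throughput metric are ours, and the real benchmark in bed/src drives SimpleSearchEngine over testdata/ instead.

```go
package search_test

import (
    "testing"

    "github.com/lee101/gobed"
)

// BenchmarkBedSearch indexes a fixed document set once, then measures
// repeated queries, mirroring the shape of the repository benchmark.
func BenchmarkBedSearch(b *testing.B) {
    model, err := gobed.LoadModel()
    if err != nil {
        b.Fatal(err)
    }
    engine := gobed.NewSearchEngine(model)

    docs := []string{
        "goroutine leak in the request handler",
        "database connection pool exhausted",
        "semantic search over static embeddings",
    }
    if _, err := engine.IndexBatch(docs); err != nil {
        b.Fatal(err)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := engine.Search("memory leak", 3); err != nil {
            b.Fatal(err)
        }
    }
    // Report throughput alongside the default ns/op metric.
    b.ReportMetric(float64(b.N)/b.Elapsed().Seconds(), "queries_per_second")
}
```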
// Use 4x less memory with minimal accuracy loss
model, _ := gobed.LoadModelInt8(true)

// Load model normally
model, _ := gobed.LoadModel()

// Create GPU-accelerated search engine
engine := gobed.NewGPUSearchEngine(model)

// Or with custom config:
import "github.com/lee101/gobed/gpu"

config := gpu.GPUSearchConfig{
    EnableGPU: true,
    DeviceID:  0,
    BatchSize: 1000,
    UseInt8:   true, // 4x memory reduction
}
engine := gpu.NewGPUSearchEngineWithConfig(model, config)
- Ultra-Fast Static Embeddings (cuda_ultra_fast.cu)
  - Simple token→vector lookup (not BERT)
  - Pre-quantized int8 embedding table
  - Automatic IVF clustering at 50K+ documents
- Fused Kernels (cuda_fused_embed_search.cu)
  - Single-pass: embed + average + quantize (see the CPU-side sketch after this list)
  - No intermediate memory writes
  - Direct GPU search pipeline
- RTX 3090 Optimizations
  - 164KB shared memory per SM fully utilized
  - 6MB L2 cache for persistent data
  - Warp shuffle reductions
  - Multi-stream processing (4 concurrent)
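To make the single-pass idea concrete, here is a CPU-side Go sketch of the embed → mean-pool → re-quantize flow. The int8 table layout, the scale handling, and the function name are illustrative assumptions, not the kernel's actual interface:

```go
// Illustration only: the embed + average + quantize work the fused CUDA
// kernel does in one pass, written as plain Go. Table layout and scales are
// assumed for the sketch; the real kernels keep all of this on-device.
package main

import "fmt"

const dim = 1024 // model embedding dimension

func abs32(x float32) float32 {
    if x < 0 {
        return -x
    }
    return x
}

// fusedEmbed looks up each token's int8 row, mean-pools the rows, and
// re-quantizes the pooled vector without writing an intermediate buffer.
func fusedEmbed(tokenIDs []int, table [][dim]int8, scale float32) ([dim]int8, float32) {
    var pooled [dim]float32
    for _, id := range tokenIDs {
        for d, v := range table[id] {
            pooled[d] += float32(v) * scale // dequantize + accumulate
        }
    }
    inv := 1 / float32(len(tokenIDs))
    maxAbs := float32(0)
    for d := range pooled {
        pooled[d] *= inv // mean pooling
        if a := abs32(pooled[d]); a > maxAbs {
            maxAbs = a
        }
    }
    outScale := float32(1)
    if maxAbs > 0 {
        outScale = maxAbs / 127
    }
    var out [dim]int8
    for d := range pooled {
        out[d] = int8(pooled[d] / outScale) // re-quantize to int8
    }
    return out, outScale
}

func main() {
    table := make([][dim]int8, 30522) // vocabulary-sized lookup table
    table[101][0], table[2054][0] = 40, -20
    vec, s := fusedEmbed([]int{101, 2054}, table, 0.05)
    fmt.Println(vec[0], s)
}
```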
config := gobed.AsyncSearchConfig()
engine := gobed.NewSearchEngineWithConfig(model, config)
// Non-blocking indexing
response := engine.IndexBatchAsync(millionDocs)
result := <-response // Wait when ready
// Note: result.Stats.ProcessingTime contains duration

// Share index across processes with zero-copy
config := gobed.SearchConfig{
    UseSharedMemory: true,
    SharedBasePath:  "/tmp/my_index",
    MaxVectors:      1000000,
}
engine := gobed.NewSearchEngineWithConfig(model, config)

// Load model
model, err := gobed.LoadModel()
// Create search engine
engine := gobed.NewSearchEngine(model)
// Index documents
id, err := engine.Index("your text")
ids, err := engine.IndexBatch(texts)
// Search
results, err := engine.Search("query", topK)
// Direct encoding
embedding, err := model.Encode("text")
similarity, err := model.Similarity("text1", "text2")

- Go 1.21+
- 119MB for model weights
- Optional: CUDA 12.8 for GPU support
- Optional: AVX-512 CPU for INT8 mode
Using sentence-transformers/static-retrieval-mrl-en-v1:
- 1024-dimensional embeddings
- 30,522 token vocabulary
- Static embeddings with mean pooling
- Learn more: Static Embeddings
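As a quick illustration of working with these 1024-dimensional vectors directly, the sketch below encodes two texts and compares them by cosine similarity. It assumes Encode returns a []float32; model.Similarity gives you the same comparison in a single call:

```go
package main

import (
    "fmt"
    "math"

    "github.com/lee101/gobed"
)

// cosine computes cosine similarity between two embedding vectors.
func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
    model, err := gobed.LoadModel()
    if err != nil {
        panic(err)
    }
    // Assumption for this sketch: Encode returns a 1024-element []float32.
    a, _ := model.Encode("neural networks")
    b, _ := model.Encode("deep learning")
    fmt.Printf("dim=%d cosine=%.3f\n", len(a), cosine(a, b))
}
```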
Model Location: The model files (real_model.safetensors and tokenizer.json) must be in a model/ directory relative to where your code runs. The setup.sh script handles this automatically.
INT8 Mode: Requires a CPU with AVX-512 support. Will crash with "illegal instruction" error on older CPUs. Check your CPU with lscpu | grep avx512 on Linux.
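To avoid that crash programmatically, you can gate the INT8 path behind a runtime feature check. This is a sketch only: the *gobed.Model return type and the exact AVX-512 feature subset gobed's kernels require are assumptions.

```go
package main

import (
    "github.com/lee101/gobed"
    "golang.org/x/sys/cpu"
)

// loadModel prefers the INT8 model only when the CPU reports AVX-512.
// The return type and the feature flags checked here are assumptions
// made for this sketch.
func loadModel() (*gobed.Model, error) {
    if cpu.X86.HasAVX512F && cpu.X86.HasAVX512BW {
        return gobed.LoadModelInt8(true)
    }
    return gobed.LoadModel() // float32 fallback on older CPUs
}

func main() {
    if _, err := loadModel(); err != nil {
        panic(err)
    }
}
```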
GPU Package: The published Go package has GPU build dependencies. For now, clone the repository locally instead of using go get if you need GPU support:
git clone https://github.com/lee101/gobed
cd gobed
# Use replace directive in your go.mod
go mod edit -replace github.com/lee101/gobed=./gobed

# Basic search
cd examples
go run search_demo.go
# Large-scale benchmark
cd cmd/ann_demo
go run main.go
# INT8 demo
cd cmd/int8_demo
go run main.go

# Run tests
make test
# Benchmarks
make bench-cpu
# Format code
make fmt

The model files (~15MB) will be downloaded automatically on first use.
The int8 quantized model is available on HuggingFace:
# Clone from HuggingFace
git clone https://huggingface.co/lee101/bed model/
# Or download with huggingface-cli
huggingface-cli download lee101/bed --local-dir model/

Or via Python:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="lee101/bed", filename="modelint8_512dim.safetensors")
tokenizer_path = hf_hub_download(repo_id="lee101/bed", filename="tokenizer.json")

MIT