sandx-embed

Shared embedding and vector similarity infrastructure for the SandX platform.

Part of the SandX Lab computational infrastructure ecosystem.

What It Does

sandx-embed is the shared latent representation layer used by all SandX engines. It provides:

Pluggable encoders — sentence-transformers models out of the box; register any custom encoder
High-performance ANN indexing — HNSW (production) and exact search (baseline), with save/load
Cross-domain similarity — cosine, L2, inner product; normalized and unnormalized vectors

sandx-embed is not a standalone product; it is consumed by sandx-er, sandx-graph, and sandx-compute as a shared dependency.
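The three similarity measures listed above are closely related once vectors are L2-normalized: the inner product equals the cosine similarity, and the squared L2 distance equals 2 - 2 * cosine. The sketch below checks this in plain Python with no dependencies. The helper names are illustrative and are not part of the sandx-embed API.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (works for unnormalized input)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

cos = cosine_similarity(a, b)
ip = dot(a, b)          # equals cos because a and b have unit length
d2 = squared_l2(a, b)   # equals 2 - 2 * cos for unit-length vectors

print(cos, ip, d2)
```

For unnormalized vectors these identities no longer hold, so the choice of metric changes the neighbor ranking.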

Status

v0.1 — Phase 2 active development

Component	Status
`Encoder` — pluggable model registry	Working
`SentenceTransformerEncoder` — SBERT, E5, BGE	Working
`VectorIndex` — HNSW and exact search	Working
Save / load index	Working
PyPI package	Working

Installation

pip install sandx-embed

Or from source:

git clone https://github.com/sandxlab/sandx-embed
cd sandx-embed
pip install -e ".[dev]"

Demo

pip install "sandx-embed[hnsw]"
python -m examples.encode_and_search

The demo encodes 15 company name variants into dense vectors, builds an ANN index, runs 4 semantic queries, and prints a pairwise similarity matrix. No external data is required.

Quick Start

from sandx_embed import Encoder, VectorIndex

# Encode records into dense vectors
enc = Encoder(model="sentence-bert")   # downloads all-MiniLM-L6-v2 on first use
vectors = enc.encode(["John Smith, Boston", "Jon Smyth, Boston", "Alice Brown, NYC"])
# → np.ndarray shape (3, 384), L2-normalized

# Build an ANN index
idx = VectorIndex(method="hnsw", metric="cosine")
idx.build(vectors, ids=["r0", "r1", "r2"])

# Query nearest neighbors
result = idx.query(vectors[0], k=2)
print(result.ids)        # e.g. ["r0", "r1"]
print(result.distances)  # e.g. [0.0, 0.12] (cosine distance; exact values depend on the model)

# Persist and reload
idx.save("/tmp/my_index")
idx2 = VectorIndex.load("/tmp/my_index")

Built-in Models

Name	HuggingFace model	Dim	Notes
`"sentence-bert"`	`all-MiniLM-L6-v2`	384	Fast, English, recommended default
`"e5-small"`	`intfloat/e5-small-v2`	384	Higher quality, English
`"bge-m3"`	`BAAI/bge-m3`	1024	Multilingual, large
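The Dim column has a direct effect on memory. As a rough sizing aid, the sketch below estimates raw vector storage per million records. It assumes vectors are stored as float32 (4 bytes per component) and ignores any index overhead such as HNSW graph links; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def raw_vector_bytes(num_vectors: int, dim: int) -> int:
    """Bytes needed for the raw vectors alone, excluding index structures."""
    return num_vectors * dim * BYTES_PER_FLOAT32

# Dimensions taken from the Built-in Models table
for name, dim in [("sentence-bert", 384), ("e5-small", 384), ("bge-m3", 1024)]:
    mib = raw_vector_bytes(1_000_000, dim) / 2**20
    print(f"{name}: {mib:.0f} MiB per million vectors")
```

Under these assumptions a 1024-dimensional model needs roughly 2.7 times the storage of a 384-dimensional one for the same number of records.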

Custom Encoders

from sandx_embed.encoder import BaseEncoder, Encoder
import numpy as np

class MyEncoder(BaseEncoder):
    def encode(self, inputs, *, batch_size=64, normalize=True):
        # Placeholder: replace the random vectors with your model's output
        return np.random.rand(len(inputs), 128).astype(np.float32)

    @property
    def dim(self):
        return 128

Encoder.register("my-model", lambda: MyEncoder())
enc = Encoder("my-model")
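Registration takes a zero-argument factory rather than an instance. One plausible reason is that the model is then only constructed when it is first requested. The sketch below shows how such a name-to-factory registry could work. `EncoderRegistry` and `DummyEncoder` are hypothetical names used for illustration; this is not sandx-embed's actual implementation.

```python
from typing import Callable, Dict

class EncoderRegistry:
    """Hypothetical name-to-factory registry (not sandx-embed's code)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, name: str, factory: Callable[[], object]) -> None:
        # Refuse silent overwrites so a typo cannot shadow an existing model
        if name in self._factories:
            raise ValueError(f"encoder {name!r} is already registered")
        self._factories[name] = factory

    def create(self, name: str) -> object:
        # The factory runs here, so construction is deferred until first use
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"unknown encoder {name!r}") from None
        return factory()

class DummyEncoder:
    """Stand-in for a real encoder; only exposes a dimension."""
    dim = 128

registry = EncoderRegistry()
registry.register("my-model", lambda: DummyEncoder())
enc = registry.create("my-model")
print(type(enc).__name__, enc.dim)
```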

Index Methods

Method	Backend	When to use
`"hnsw"`	usearch	N > 10,000; production; fast queries
`"exact"`	numpy	Small datasets; correctness baseline
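To make the "correctness baseline" role of exact search concrete, the sketch below shows a brute-force nearest-neighbor query under cosine distance. It is written in plain Python for clarity and is an illustration of the technique, not the library's numpy backend; `exact_query` is a hypothetical helper.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_query(vectors, ids, query, k):
    """Score every stored vector against the query and keep the k closest."""
    scored = ((cosine_distance(v, query), i) for v, i in zip(vectors, ids))
    return heapq.nsmallest(k, scored)  # list of (distance, id), closest first

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ids = ["r0", "r1", "r2"]

top = exact_query(vectors, ids, query=[1.0, 0.0], k=2)
top_ids = [i for _, i in top]
print(top_ids)
```

Every query touches all N vectors, so cost grows linearly with the dataset. That is acceptable for small collections and makes the result a reference against which an approximate index can be compared.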

Design Principles

Pluggable — any encoder model or index backend can be registered
Portable — indexes serialize to disk and reload without rebuilding
Deterministic — same model version + input → same output
No vendor lock-in — no hard dependency on any hosted vector service
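The determinism principle can be checked mechanically by fingerprinting encoder output and comparing fingerprints across runs. The sketch below hashes the float32 bytes of each vector. `toy_encode` is a stand-in so the example runs without a model; in practice the rows would come from an encoder's output.

```python
import hashlib
import struct

def fingerprint(vectors):
    """SHA-256 over the little-endian float32 bytes of every row."""
    h = hashlib.sha256()
    for row in vectors:
        h.update(struct.pack(f"<{len(row)}f", *row))
    return h.hexdigest()

def toy_encode(texts):
    # Stand-in deterministic encoder (hypothetical): two simple features per text
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

texts = ["John Smith, Boston", "Jon Smyth, Boston"]
first = fingerprint(toy_encode(texts))
second = fingerprint(toy_encode(texts))
print(first == second)
```

With real models, bit-exact equality can be sensitive to hardware and library versions, so a tolerance-based comparison may be the more practical check in some environments.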

License

Apache 2.0 — see LICENSE
