sandxlab/sandx-embed


sandx-embed

Shared embedding and vector similarity infrastructure for the SandX platform.

Python 3.10+ · License: Apache 2.0

Part of the SandX Lab computational infrastructure ecosystem.


What It Does

sandx-embed is the shared latent representation layer used by all SandX engines. It provides:

  • Pluggable encoders — sentence-transformers models out of the box; register any custom encoder
  • High-performance ANN indexing — HNSW (production) and exact search (baseline), with save/load
  • Cross-domain similarity — cosine, L2, inner product; normalized and unnormalized vectors
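The three metrics listed above are closely related once vectors are L2-normalized. The following is a minimal NumPy sketch of that relationship; the helper names are illustrative and are not part of the sandx-embed API.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

raw = np.array([[3.0, 4.0], [4.0, 3.0]])
unit = l2_normalize(raw)

# Unnormalized vectors: inner product (24.0) and cosine (0.96) disagree,
# because the inner product also reflects vector length.
ip_raw = inner_product(raw[0], raw[1])
cos_raw = cosine_similarity(raw[0], raw[1])

# Normalized vectors: inner product equals cosine similarity, and
# squared L2 distance equals 2 - 2 * cosine, so all three metrics
# produce the same neighbor ranking.
ip_unit = inner_product(unit[0], unit[1])
l2_sq_unit = l2_distance(unit[0], unit[1]) ** 2
```

In practice this means the metric choice mostly matters when vectors are left unnormalized.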

Not a standalone product — consumed by sandx-er, sandx-graph, and sandx-compute as a shared dependency.

Status

v0.1 — Phase 2 active development

| Component | Status |
| --- | --- |
| Encoder: pluggable model registry | Working |
| SentenceTransformerEncoder: SBERT, E5, BGE | Working |
| VectorIndex: HNSW and exact search | Working |
| Save / load index | Working |
| PyPI package | Working |

Installation

pip install sandx-embed

Or from source:

git clone https://github.com/sandxlab/sandx-embed
cd sandx-embed
pip install -e ".[dev]"

Demo

pip install "sandx-embed[hnsw]"
python -m examples.encode_and_search

Encodes 15 company name variants into dense vectors, builds an ANN index, runs 4 semantic queries, and prints a pairwise similarity matrix — no external data required.
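For a sense of what a pairwise similarity matrix looks like without downloading a model, here is a small sketch. It is not the demo script: `toy_encode` is a hypothetical stand-in that counts character trigrams, where the real demo uses a sentence-transformers model, and the three names are made up for illustration.

```python
import numpy as np

def trigrams(text: str) -> list[str]:
    """Character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def toy_encode(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in encoder: trigram counts, L2-normalized rows."""
    vocab: dict[str, int] = {}
    for text in texts:
        for gram in trigrams(text):
            vocab.setdefault(gram, len(vocab))
    out = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for gram in trigrams(text):
            out[row, vocab[gram]] += 1.0
    return out / np.linalg.norm(out, axis=1, keepdims=True)

names = ["Acme Corp", "ACME Corporation", "Globex Inc"]
vectors = toy_encode(names)

# Rows are unit length, so one matrix product yields every pairwise
# cosine similarity; entry [i, j] compares names[i] with names[j].
similarity = vectors @ vectors.T
print(np.round(similarity, 2))
```

The two Acme variants score higher with each other than either does with the unrelated name, which is the pattern the demo's matrix is meant to surface.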

Quick Start

from sandx_embed import Encoder, VectorIndex

# Encode records into dense vectors
enc = Encoder(model="sentence-bert")   # downloads all-MiniLM-L6-v2 on first use
vectors = enc.encode(["John Smith, Boston", "Jon Smyth, Boston", "Alice Brown, NYC"])
# → np.ndarray shape (3, 384), L2-normalized

# Build an ANN index
idx = VectorIndex(method="hnsw", metric="cosine")
idx.build(vectors, ids=["r0", "r1", "r2"])

# Query nearest neighbors
result = idx.query(vectors[0], k=2)
print(result.ids)        # ["r0", "r1"]
print(result.distances)  # e.g. [0.0, 0.12]  (cosine distance; exact values depend on the model)

# Persist and reload
idx.save("/tmp/my_index")
idx2 = VectorIndex.load("/tmp/my_index")

Built-in Models

| Name | HuggingFace model | Dim | Notes |
| --- | --- | --- | --- |
| "sentence-bert" | all-MiniLM-L6-v2 | 384 | Fast, English, recommended default |
| "e5-small" | intfloat/e5-small-v2 | 384 | Higher quality, English |
| "bge-m3" | BAAI/bge-m3 | 1024 | Multilingual, large |

Custom Encoders

from sandx_embed.encoder import BaseEncoder, Encoder
import numpy as np

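class MyEncoder(BaseEncoder):
    def encode(self, inputs, *, batch_size=64, normalize=True):
        # Placeholder: replace the random vectors with your model's output.
        vectors = np.random.rand(len(inputs), 128).astype(np.float32)
        if normalize:
            # Honor the flag: scale each row to unit L2 length.
            vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors

    @property
    def dim(self):
        return 128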

Encoder.register("my-model", lambda: MyEncoder())
enc = Encoder("my-model")
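Registering a factory (the `lambda`) instead of an instance suggests that construction is deferred until the encoder is requested. The sketch below shows one way such a name-to-factory registry can work. `ToyRegistry` and `ConstantEncoder` are hypothetical names for illustration, and this is not sandx-embed's actual implementation.

```python
from typing import Callable, Dict

import numpy as np

class ToyRegistry:
    """Hypothetical name -> factory registry (illustration only)."""

    _factories: Dict[str, Callable[[], object]] = {}

    @classmethod
    def register(cls, name: str, factory: Callable[[], object]) -> None:
        if name in cls._factories:
            raise ValueError(f"encoder {name!r} is already registered")
        cls._factories[name] = factory

    @classmethod
    def create(cls, name: str) -> object:
        try:
            factory = cls._factories[name]
        except KeyError:
            known = sorted(cls._factories)
            raise KeyError(f"unknown encoder {name!r}; known: {known}") from None
        # The factory runs only now, so heavy model loading is deferred
        # until an encoder is actually requested.
        return factory()

class ConstantEncoder:
    """Trivial encoder that maps every input to a vector of ones."""

    dim = 4

    def encode(self, inputs):
        return np.ones((len(inputs), self.dim), dtype=np.float32)

ToyRegistry.register("constant", lambda: ConstantEncoder())
enc = ToyRegistry.create("constant")
```

A factory-based registry keeps registration cheap at import time, which matters when a model takes seconds to load.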

Index Methods

| Method | Backend | When to use |
| --- | --- | --- |
| "hnsw" | usearch | N > 10,000; production; fast queries |
| "exact" | numpy | Small datasets; correctness baseline |
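To illustrate why exact search serves as a correctness baseline, here is a brute-force cosine search in plain NumPy with a recall helper for scoring an approximate index against it. Function names are illustrative; this sketch assumes L2-normalized rows and is not the library's "exact" backend.

```python
import numpy as np

def exact_cosine_search(vectors: np.ndarray, query: np.ndarray, k: int):
    """Brute-force top-k by cosine distance; assumes unit-length rows."""
    distances = 1.0 - vectors @ query
    k = min(k, len(distances))
    # argpartition finds the k smallest in O(N); sort only those k.
    top = np.argpartition(distances, k - 1)[:k]
    top = top[np.argsort(distances[top])]
    return top, distances[top]

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 32)).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)

# Querying with a stored vector returns that vector first, at distance ~0.
ids, dists = exact_cosine_search(data, data[0], k=5)
```

Comparing an ANN index's results to `exact_cosine_search` on a sample of queries gives a recall figure to tune against before moving to production.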

Design Principles

  • Pluggable — any encoder model or index backend can be registered
  • Portable — indexes serialize to disk and reload without rebuilding
  • Deterministic — same model version + input → same output
  • No vendor lock-in — no hard dependency on any hosted vector service
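The portability and determinism principles can be checked with a simple round trip: persist an index, reload it, and confirm a query returns the same result. The sketch below does this for a toy brute-force index stored as an `.npz` file; the on-disk format and the function names are assumptions for illustration, not the format sandx-embed uses.

```python
import os
import tempfile

import numpy as np

def save_index(path: str, vectors: np.ndarray, ids: list[str]) -> None:
    """Write vectors and ids to a single .npz file."""
    np.savez(path, vectors=vectors, ids=np.asarray(ids))

def load_index(path: str):
    """Read back the arrays written by save_index."""
    with np.load(path) as data:
        return data["vectors"], data["ids"].tolist()

def query(vectors: np.ndarray, ids: list[str], q: np.ndarray, k: int) -> list[str]:
    """Top-k ids by cosine distance; a stable sort keeps ties deterministic."""
    order = np.argsort(1.0 - vectors @ q, kind="stable")[:k]
    return [ids[i] for i in order]

vectors = np.eye(3, dtype=np.float32)
ids = ["r0", "r1", "r2"]
before = query(vectors, ids, vectors[0], k=2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.npz")
    save_index(path, vectors, ids)
    loaded_vectors, loaded_ids = load_index(path)

# The reloaded index answers the same query identically, with no rebuild.
after = query(loaded_vectors, loaded_ids, loaded_vectors[0], k=2)
```

A round-trip check of this kind is a cheap regression test to run whenever the serialization code changes.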

Related

  • sandx-er — entity resolution engine (uses sandx-embed for blocking + matching)
  • sandx-graph — graph intelligence over resolved entities
  • sandx.io — project home

License

Apache 2.0 — see LICENSE
