nanobier testbench#7

Closed
vedjaw wants to merge 24 commits into main from nanobier

Conversation

@vedjaw
Contributor

@vedjaw vedjaw commented Sep 18, 2025

[Screenshot: 2025-09-18 171716]

Pull Request: Add NanoBEIR Evaluator

Summary

  • Adds a NanoBEIR evaluator (evaluate_nanobeir.py) that can benchmark:
    • Baseline Sentence-Transformers models (HuggingFace).
  • Computes and saves: R@1, R@5, R@10, NDCG@10, MedR, Latency p95, Cost/1k.
  • Includes a one-query debug peek to sanity-check retrieval (relevant IDs vs top-k).

Motivation

  • We needed a lightweight, reproducible retrieval evaluation to validate embeddings on BEIR datasets.
  • Prior attempts were less consistent; this PR hardens the evaluator and standardizes metrics.

What’s in this PR

  • evaluate_nanobeir.py
    • Dataset fetch + load (SciFact, etc.)
    • Encoder:
      • STModel for Sentence-Transformers baselines
    • Metrics: Recall@K (manual), NDCG@10 (BEIR), MedR, Latency p95
    • Debug peek for one query
    • Writes results JSON to outputs/nanobeir_leaderboard.json
  • nanobeir_config.yaml
    • Datasets, models, and Cost_per_1k are configurable.
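
For reference, the "manual" Recall@K and the MedR metric mentioned above can be sketched as standalone functions. This is an illustrative version only; `recall_at_k` and `median_rank` are hypothetical names, not the evaluator's actual API:

```python
import statistics

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def median_rank(rankings):
    """MedR: median 1-based rank of the first relevant doc per query.

    `rankings` is a list of (ranked_ids, relevant_ids) pairs. Queries with
    no relevant doc retrieved are assigned rank len(ranked_ids) + 1.
    """
    first_ranks = []
    for ranked_ids, relevant_ids in rankings:
        rank = next(
            (i + 1 for i, d in enumerate(ranked_ids) if d in relevant_ids),
            len(ranked_ids) + 1,
        )
        first_ranks.append(rank)
    return statistics.median(first_ranks)
```

A MedR of 1.0, as in the sample results below, means the median query has a relevant doc at rank 1.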

How it works

  1. Download BEIR dataset (e.g., scifact).
  2. Encode corpus and queries with the selected encoder.
  3. Compute cosine similarities via BEIR and rank docs per query.
  4. Calculate metrics and save JSON; print one debug query.
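
Steps 2–3 can be sketched as follows. This is a minimal plain-Python illustration, assuming embeddings are lists of floats; the actual evaluator delegates similarity and ranking to BEIR, and `cosine`/`rank_docs` are hypothetical names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_docs(query_vec, corpus_vecs):
    """Return doc IDs sorted by descending cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus_vecs.items()]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda x: -x[1])]
```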

Results (sample)

  • On SciFact with all-MiniLM-L6-v2 baseline:
    • R@1 ~ 0.50, R@5 ~ 0.75, R@10 ~ 0.79
    • NDCG@10 ~ 0.65
    • MedR = 1.0 (median first relevant at rank 1)
    • Latency p95 ~ 8–12 ms (single-query embedding)
    • Cost/1k = 0.10 (from config, not measured)

Example table

Model             R@1   R@5   R@10  NDCG@10  MedR  p95 (ms)  Cost/1k
all-MiniLM-L6-v2  0.50  0.75  0.79  0.65     1.0   8.3       $0.10

Repro / How to Run

TRANSFORMERS_NO_TORCHVISION=1 python3 evaluate_nanobeir.py --config nanobeir_config.yaml

Output location

cat ./outputs/nanobeir_leaderboard.json

Config example

output_dir: ./outputs
leaderboard_json: ./outputs/nanobeir_leaderboard.json

datasets: [scifact]

models:
  baseline:
    hf_id: sentence-transformers/all-MiniLM-L6-v2

cost_per_1k:
  baseline: 0.10

Design notes

  • Robust JSON parsing in the evaluator handles varied embedding-response shapes.
  • Fixed-dim embeddings enforced via first response (pad/truncate afterwards).
  • Retries with progressive truncation mitigate failures on long inputs.
  • Kept BEIR imports lazy to reduce cold-start overhead.
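
The fixed-dim enforcement described above can be illustrated in isolation: the first response locks the dimension, and later vectors are padded with zeros or truncated to match. `DimLock` is a hypothetical standalone version of this logic, not the evaluator's actual class:

```python
class DimLock:
    """Lock the embedding dimension from the first vector seen."""

    def __init__(self):
        self.dim = None  # set from the first normalized vector

    def normalize(self, vec):
        if self.dim is None:
            self.dim = len(vec)
        if len(vec) < self.dim:
            return vec + [0.0] * (self.dim - len(vec))  # pad short vectors
        return vec[: self.dim]  # truncate long vectors
```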

Risks / limitations

  • Cost/1k is a config constant (not auto-measured).
  • Only a few BEIR datasets are pre-mapped; others can be added trivially.

Testing done

  • End-to-end on scifact baseline.
  • Verified debug peek overlap shows relevant docs among top-k.
  • Confirmed output JSON schema and contents.

@NirantK
Member

NirantK commented Sep 19, 2025

Cannot review till automated checks are cleared

Comment on lines +101 to +187
class LlamaServerEncoder:
    """Client for llama.cpp --embedding server (/embedding). Ensures fixed-dim vectors."""

    def __init__(self, endpoint: str):
        import requests
        from requests.adapters import HTTPAdapter
        from urllib3.util.retry import Retry
        self.session = requests.Session()
        retry = Retry(total=3, backoff_factor=0.2, status_forcelist=(502, 503, 504))
        self.session.mount("http://", HTTPAdapter(max_retries=retry))
        self.endpoint = endpoint.rstrip("/")
        self._dim = None  # lock the embedding size after first good response

    def _parse_embedding_json(self, js):
        # try to dig out the first numeric list from many shapes
        def first_vector(obj):
            if isinstance(obj, dict):
                # common fields
                for k in ("embedding", "vector", "values", "data"):
                    if k in obj:
                        return first_vector(obj[k])
                # otherwise try the first value
                if obj:
                    return first_vector(next(iter(obj.values())))
                return []
            if isinstance(obj, (list, tuple)):
                if not obj:
                    return []
                if isinstance(obj[0], dict):
                    return first_vector(obj[0])
                if isinstance(obj[0], (list, tuple)):
                    return first_vector(obj[0])
                return obj
            return [obj]
        return first_vector(js)

    def _to_1d_numeric(self, obj):
        def flatten(xs):
            for x in xs:
                if isinstance(x, (list, tuple)):
                    yield from flatten(x)
                elif isinstance(x, dict):
                    for k in ("vector", "embedding", "values", "data"):
                        if k in x:
                            yield from flatten(x[k])
                            break
                    else:
                        for v in x.values():
                            yield from flatten(v)
                else:
                    yield x
        vec = list(flatten(self._parse_embedding_json(obj)))
        return [float(x) for x in vec]

    def _normalize_vec(self, vec):
        vec = self._to_1d_numeric(vec)
        if self._dim is None:
            self._dim = len(vec)
            print(f"[llama] locked embedding dim = {self._dim}", flush=True)
        if len(vec) < self._dim:
            vec = vec + [0.0] * (self._dim - len(vec))
        elif len(vec) > self._dim:
            vec = vec[:self._dim]
        return vec

    def _embed_one(self, text: str, timeout=60):
        orig = text
        for backoff, shrink in zip(RETRY_BACKOFF, RETRY_SHRINK):
            if backoff:
                time.sleep(backoff)
            t = orig[: int(MAX_CHARS * shrink)]
            r = self.session.post(self.endpoint, json={"content": t}, timeout=timeout)
            if r.status_code >= 500:
                continue
            r.raise_for_status()
            return self._normalize_vec(r.json())
        # final tiny attempt
        t = orig[: min(512, len(orig))]
        r = self.session.post(self.endpoint, json={"content": t}, timeout=timeout)
        r.raise_for_status()
        return self._normalize_vec(r.json())

    def encode_queries(self, queries, batch_size=1, **kwargs):
        return [self._embed_one(q) for q in queries]

    def encode_corpus(self, corpus, batch_size=1, **kwargs):
        docs_iter = corpus.values() if isinstance(corpus, dict) else corpus
        texts = [_doc_text(d) for d in docs_iter]
        return [self._embed_one(t) for t in texts]
Member


Why are we embedding inside this code here? Instead of the existing embedding code?

Contributor Author


resolved

Comment on lines +5 to +7
models:
  baseline:
    hf_id: sentence-transformers/all-MiniLM-L6-v2
Member


We're ~8 weeks into the semester. You should at least know the embedding models we are working on. all-MiniLM-L6-v2 is not even discussed or mentioned once.

It's okay to copy code off the Internet but for heaven's sake, don't sleep-walk through the project

Contributor Author


resolved

@NirantK
Member

NirantK commented Sep 21, 2025

I don't see any of the changes for which you've commented "Resolved" @vedjaw.

Do you not know how to use GitHub? You need to push changes. And fix the linting while you're at it

did the required changes (qwen/qwen3-embedding-0.6b)
did the required changes
@NirantK
Member

NirantK commented Sep 23, 2025

The linter is still failing, not ready for review

@NirantK NirantK self-requested a review October 2, 2025 12:34