nanobier testbench#7

Closed
vedjaw wants to merge 24 commits into main from nanobier

Conversation

@vedjaw
Contributor

@vedjaw vedjaw commented Sep 18, 2025

[Screenshot: 2025-09-18 171716]

Pull Request: Add NanoBEIR Evaluator

Summary

  • Adds a NanoBEIR evaluator (evaluate_nanobeir.py) that can benchmark:
    • Baseline Sentence-Transformers models (HuggingFace).
  • Computes and saves: R@1, R@5, R@10, NDCG@10, MedR, Latency p95, Cost/1k.
  • Includes a one-query debug peek to sanity-check retrieval (relevant IDs vs top-k).

Motivation

  • We needed a lightweight, reproducible retrieval evaluation to validate embeddings on BEIR datasets.
  • Prior attempts were less consistent; this PR hardens the evaluator and standardizes metrics.

What’s in this PR

  • evaluate_nanobeir.py
    • Dataset fetch + load (SciFact, etc.)
    • Encoder:
      • STModel for Sentence-Transformers baselines
    • Metrics: Recall@K (manual), NDCG@10 (BEIR), MedR, Latency p95
    • Debug peek for one query
    • Writes results JSON to outputs/nanobeir_leaderboard.json
  • nanobeir_config.yaml
    • Datasets, models, and Cost_per_1k are configurable.
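
For reference, the "manual" Recall@K and the MedR metric mentioned above can be sketched as standalone functions. This is an illustrative version only; `recall_at_k` and `median_rank` are hypothetical names, not the evaluator's actual API:

```python
import statistics

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def median_rank(rankings):
    """MedR: median 1-based rank of the first relevant doc per query.

    `rankings` is a list of (ranked_ids, relevant_ids) pairs. Queries with
    no relevant doc retrieved are assigned rank len(ranked_ids) + 1.
    """
    first_ranks = []
    for ranked_ids, relevant_ids in rankings:
        rank = next(
            (i + 1 for i, d in enumerate(ranked_ids) if d in relevant_ids),
            len(ranked_ids) + 1,
        )
        first_ranks.append(rank)
    return statistics.median(first_ranks)
```

A MedR of 1.0, as in the sample results below, means the median query has a relevant doc at rank 1.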

How it works

  1. Download BEIR dataset (e.g., scifact).
  2. Encode corpus and queries with the selected encoder.
  3. Compute cosine similarities via BEIR and rank docs per query.
  4. Calculate metrics and save JSON; print one debug query.
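
Steps 2–3 can be sketched as follows. This is a minimal plain-Python illustration, assuming embeddings are lists of floats; the actual evaluator delegates similarity and ranking to BEIR, and `cosine`/`rank_docs` are hypothetical names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_docs(query_vec, corpus_vecs):
    """Return doc IDs sorted by descending cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus_vecs.items()]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda x: -x[1])]
```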

Results (sample)

  • On SciFact with all-MiniLM-L6-v2 baseline:
    • R@1 ~ 0.50, R@5 ~ 0.75, R@10 ~ 0.79
    • NDCG@10 ~ 0.65
    • MedR = 1.0 (median first relevant at rank 1)
    • Latency p95 ~ 8–12 ms (single-query embedding)
    • Cost/1k = 0.10 (from config, not measured)

Example table

Model             R@1   R@5   R@10  NDCG@10  MedR  p95 (ms)  Cost/1k
all-MiniLM-L6-v2  0.50  0.75  0.79  0.65     1.0   8.3       $0.10

Repro / How to Run

TRANSFORMERS_NO_TORCHVISION=1 python3 evaluate_nanobeir.py --config nanobeir_config.yaml

Output location

cat ./outputs/nanobeir_leaderboard.json

Config example

output_dir: ./outputs
leaderboard_json: ./outputs/nanobeir_leaderboard.json

datasets: [scifact]

models:
  baseline:
    hf_id: sentence-transformers/all-MiniLM-L6-v2

cost_per_1k:
  baseline: 0.10

Design notes

  • Robust JSON parsing in the evaluator handles varied embedding-response shapes.
  • Fixed-dim embeddings enforced via first response (pad/truncate afterwards).
  • Retries with progressive truncation mitigate failures on long inputs.
  • Kept BEIR imports lazy to reduce cold-start overhead.
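
The fixed-dim enforcement described above can be illustrated in isolation: the first response locks the dimension, and later vectors are padded with zeros or truncated to match. `DimLock` is a hypothetical standalone version of this logic, not the evaluator's actual class:

```python
class DimLock:
    """Lock the embedding dimension from the first vector seen."""

    def __init__(self):
        self.dim = None  # set from the first normalized vector

    def normalize(self, vec):
        if self.dim is None:
            self.dim = len(vec)
        if len(vec) < self.dim:
            return vec + [0.0] * (self.dim - len(vec))  # pad short vectors
        return vec[: self.dim]  # truncate long vectors
```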

Risks / limitations

  • Cost/1k is a config constant (not auto-measured).
  • Only a few BEIR datasets are pre-mapped; others can be added trivially.

Testing done

  • End-to-end on scifact baseline.
  • Verified debug peek overlap shows relevant docs among top-k.
  • Confirmed output JSON schema and contents.

@NirantK
Member

NirantK commented Sep 19, 2025

Cannot review till automated checks are cleared

Comment on lines +101 to +187
class LlamaServerEncoder:
    """Client for llama.cpp --embedding server (/embedding). Ensures fixed-dim vectors."""

    def __init__(self, endpoint: str):
        import requests
        from requests.adapters import HTTPAdapter
        from urllib3.util.retry import Retry
        self.session = requests.Session()
        retry = Retry(total=3, backoff_factor=0.2, status_forcelist=(502, 503, 504))
        self.session.mount("http://", HTTPAdapter(max_retries=retry))
        self.endpoint = endpoint.rstrip("/")
        self._dim = None  # lock the embedding size after first good response

    def _parse_embedding_json(self, js):
        # try to dig out the first numeric list from many shapes
        def first_vector(obj):
            if isinstance(obj, dict):
                # common fields
                for k in ("embedding", "vector", "values", "data"):
                    if k in obj:
                        return first_vector(obj[k])
                # otherwise try the first value
                if obj:
                    return first_vector(next(iter(obj.values())))
                return []
            if isinstance(obj, (list, tuple)):
                if not obj:
                    return []
                if isinstance(obj[0], dict):
                    return first_vector(obj[0])
                if isinstance(obj[0], (list, tuple)):
                    return first_vector(obj[0])
                return obj
            return [obj]
        return first_vector(js)

    def _to_1d_numeric(self, obj):
        def flatten(xs):
            for x in xs:
                if isinstance(x, (list, tuple)):
                    yield from flatten(x)
                elif isinstance(x, dict):
                    for k in ("vector", "embedding", "values", "data"):
                        if k in x:
                            yield from flatten(x[k])
                            break
                    else:
                        for v in x.values():
                            yield from flatten(v)
                else:
                    yield x
        vec = list(flatten(self._parse_embedding_json(obj)))
        return [float(x) for x in vec]

    def _normalize_vec(self, vec):
        vec = self._to_1d_numeric(vec)
        if self._dim is None:
            self._dim = len(vec)
            print(f"[llama] locked embedding dim = {self._dim}", flush=True)
        if len(vec) < self._dim:
            vec = vec + [0.0] * (self._dim - len(vec))
        elif len(vec) > self._dim:
            vec = vec[:self._dim]
        return vec

    def _embed_one(self, text: str, timeout=60):
        orig = text
        for backoff, shrink in zip(RETRY_BACKOFF, RETRY_SHRINK):
            if backoff:
                time.sleep(backoff)
            t = orig[: int(MAX_CHARS * shrink)]
            r = self.session.post(self.endpoint, json={"content": t}, timeout=timeout)
            if r.status_code >= 500:
                continue
            r.raise_for_status()
            return self._normalize_vec(r.json())
        # final tiny attempt
        t = orig[: min(512, len(orig))]
        r = self.session.post(self.endpoint, json={"content": t}, timeout=timeout)
        r.raise_for_status()
        return self._normalize_vec(r.json())

    def encode_queries(self, queries, batch_size=1, **kwargs):
        return [self._embed_one(q) for q in queries]

    def encode_corpus(self, corpus, batch_size=1, **kwargs):
        docs_iter = corpus.values() if isinstance(corpus, dict) else corpus
        texts = [_doc_text(d) for d in docs_iter]
        return [self._embed_one(t) for t in texts]
Member


Why are we embedding inside this code here? Instead of the existing embedding code?

Contributor Author


resolved

Comment on lines +5 to +7
models:
  baseline:
    hf_id: sentence-transformers/all-MiniLM-L6-v2
Member


We're ~8 weeks into the semester. You should at least know the embedding models we are working on. all-MiniLM-L6-v2 is not even discussed or mentioned once.

It's okay to copy code off the Internet but for heaven's sake, don't sleep-walk through the project

Contributor Author


resolved

@NirantK
Member

NirantK commented Sep 21, 2025

I don't see any of the changes for which you've commented "Resolved" @vedjaw.

Do you not know how to use GitHub? You need to push changes. And fix the linting while you're at it

did the required changes (qwen/qwen3-embedding-0.6b)
did the required changes
@NirantK
Member

NirantK commented Sep 23, 2025

The linter is still failing, not ready for review

@NirantK NirantK self-requested a review October 2, 2025 12:34