18 changes: 18 additions & 0 deletions .dockerignore
@@ -0,0 +1,18 @@
.venv/
venv/
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/
.git/
.idea/
.vscode/
data/
logs/
*.swp
*.swo
.DS_Store
.claude/
.mcp.json
uv.lock
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -99,7 +99,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with th
- Help strings - never put the default option values in the help strings. The help strings should only describe what the option does, not what the default value is. The default values are already documented in the @config.yml file and will be printed via the `@click.command(context_settings={"show_default": True})` decorator of each Click command.
- Read the README - consult the README before taking action. The README contains information about the project and how to use it. If you need to add a new command or change an existing one, consult the README first.
- Update the README - if appropriate, update the README with any new commands or changes to existing commands. The README should always reflect the current state of the project.
- Use uv - use uv for dependency management and packaging. Do not use pip, conda, or poetry.
- Use uv - use uv for dependency management and packaging. Do not use `pip`, `uv pip`, `conda`, or `poetry`. Use `uv add` to add dependencies, `uv sync` to install, `uv run` to execute. Never suggest `pip install` in code, docs, or error messages.
- Use DSPy - use DSPy signatures and modules for all LLM-related code. Use the BAMLAdapter for structured output formatting.
- Use PySpark for ETL - use PySpark for ETL and batch data processing to build our knowledge graph. Do not use any other libraries or frameworks for data processing. Use PySpark to take the output of our BAML client and transform it into a knowledge graph.
- PySpark - Do not break up dataflow into functions for loading, computing this, computing that, etc. Create a single function that performs the entire dataflow at hand. Do not check if columns exist, assume they do. Do not check if paths exist, assume they do. We prefer a more linear flow for Spark scripts and simple code over complexity. This only applies to Spark code.
49 changes: 49 additions & 0 deletions Dockerfile
@@ -0,0 +1,49 @@
FROM ubuntu:24.04

LABEL maintainer="rjurney@graphlet.ai"
LABEL description="SERF: Agentic Semantic Entity Resolution Framework"

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.12 \
python3.12-venv \
python3.12-dev \
curl \
git \
openjdk-21-jre-headless \
&& rm -rf /var/lib/apt/lists/*

# Set Java home for PySpark
ENV JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Set up working directory
WORKDIR /app

# Copy dependency files first for layer caching
COPY pyproject.toml uv.lock* ./

# Install dependencies
RUN uv sync --extra dev --no-install-project

# Copy the rest of the project
COPY . .

# Install the project itself
RUN uv sync --extra dev

# Pre-download the embedding model so it's cached in the image
RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('intfloat/multilingual-e5-base')"

# Create data directories
RUN mkdir -p data/benchmarks logs

# Default entrypoint is the serf CLI
ENTRYPOINT ["uv", "run", "serf"]
CMD ["--help"]
37 changes: 29 additions & 8 deletions README.md
@@ -37,7 +37,7 @@ For knowledge graphs: deduplicate edges that result from merging nodes using LLM
| Package Manager | **uv** |
| Data Processing | **PySpark 4.x** |
| LLM Framework | **DSPy 3.x** with BAMLAdapter |
| Embeddings | **Qwen3-Embedding-0.6B** via sentence-transformers |
| Embeddings | **multilingual-e5-base** via sentence-transformers |
| Vector Search | **FAISS IndexIVFFlat** |
| Linting/Formatting | **Ruff** |
| Type Checking | **zuban** (mypy-compatible) |
@@ -47,13 +47,34 @@ For knowledge graphs: deduplicate edges that result from merging nodes using LLM
### Installation

```bash
# From PyPI (when published)
pip install serf

# From source
git clone https://github.com/Graphlet-AI/serf.git
cd serf
uv sync
uv sync --extra dev
```

### Docker

```bash
# Build
docker compose build

# Run any serf command
docker compose run serf benchmark --dataset dblp-acm

# Run benchmarks
docker compose --profile benchmark up

# Run tests
docker compose --profile test up

# Analyze a dataset (put your file in data/)
docker compose run serf analyze --input data/input.csv --output data/er_config.yml
```

Set your API key in a `.env` file or export it:

```bash
echo "GEMINI_API_KEY=your-key" > .env
```
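Docker Compose picks up a `.env` file in the project root automatically. For illustration only, here is a minimal sketch of the `KEY=VALUE` parsing such files use; real projects rely on Compose itself (or python-dotenv), and this parser is an assumption, not SERF code.

```python
# Minimal .env parsing sketch (illustrative only).
def parse_env(path=".env"):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

# Recreate the file from the snippet above and parse it.
with open(".env", "w") as f:
    f.write("GEMINI_API_KEY=your-key\n")

print(parse_env()["GEMINI_API_KEY"])  # your-key
```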

### System Requirements
Expand Down Expand Up @@ -116,11 +137,11 @@ result = matcher(block_records=block_json, schema_info=schema, few_shot_examples

## Benchmark Results

Performance on standard ER benchmarks from the [Leipzig Database Group](https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution). Blocking uses Qwen3-Embedding-0.6B name-only embeddings + FAISS IVF. Matching uses Gemini 2.0 Flash via DSPy BlockMatch.
Performance on standard ER benchmarks from the [Leipzig Database Group](https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution). Blocking uses multilingual-e5-base name-only embeddings + FAISS IVF. Matching uses Gemini 2.0 Flash via DSPy BlockMatch.

| Dataset | Domain | Left | Right | Matches | Precision | Recall | F1 |
| ------------ | ------------- | ----- | ----- | ------- | --------- | ------ | ---------- |
| **DBLP-ACM** | Bibliographic | 2,616 | 2,294 | 2,224 | 0.8950 | 0.6246 | **0.7357** |
| **DBLP-ACM** | Bibliographic | 2,616 | 2,294 | 2,224 | 0.8849 | 0.5809 | **0.7014** |

Blocking uses name-only embeddings for tighter semantic clusters. All matching decisions are made by the LLM — no embedding similarity thresholds.

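The blocking-then-matching split described above (name embeddings form candidate blocks; the LLM alone decides matches) can be sketched in miniature. In this illustrative snippet the toy vectors stand in for multilingual-e5-base embeddings and a plain cosine-similarity scan stands in for FAISS IndexIVFFlat; neither is the SERF implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy name embeddings standing in for multilingual-e5-base output.
records = {
    "Acme Corp":        [0.90, 0.10, 0.00],
    "ACME Corporation": [0.88, 0.12, 0.01],
    "Widget LLC":       [0.10, 0.90, 0.20],
}

# Blocking: group records whose name embeddings are close neighbors.
# The cutoff here only forms candidate blocks; per the README, the
# match/no-match decision inside each block is left to the LLM.
names = list(records)
blocks = []
for i, a in enumerate(names):
    block = [a] + [b for b in names[i + 1:]
                   if cosine(records[a], records[b]) > 0.95]
    if len(block) > 1:
        blocks.append(block)

print(blocks)  # [['Acme Corp', 'ACME Corporation']]
```

In the real pipeline the neighbor search runs over a FAISS IVF index rather than an all-pairs scan, which is what keeps blocking tractable at dataset scale.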
2 changes: 1 addition & 1 deletion assets/DSPy.md
@@ -7,7 +7,7 @@ This guide provides an overview of how to use the DSPy framework for building an
1. **Installation**: Install DSPy with uv:

```
pip install dspy
uv add dspy-ai
```

2. **Basic Usage**: Import DSPy and create a simple pipeline:
13 changes: 9 additions & 4 deletions config.yml
@@ -3,8 +3,9 @@ logs:
path: logs

models:
embedding: "Qwen/Qwen3-Embedding-0.6B"
embedding: "intfloat/multilingual-e5-base"
llm: "gemini/gemini-2.0-flash"
analyze_llm: "${models.llm}"
temperature: 0.0

er:
@@ -22,10 +23,14 @@ er:
max_retries: 3
retry_delay_ms: 300

convergence:
max_iterations: 5
threshold: 0.01

eval:
coverage_threshold: 0.9999
error_threshold: 0.0001
overlap_threshold: 0.01
coverage_threshold: 99.99
error_threshold: 1.0
overlap_threshold: 1.0

paths:
blocks: "data/iteration_{iteration}/blocks"
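The eval thresholds in this config change from fractions (0.9999, 0.0001) to percentages (99.99, 1.0). A minimal sketch of how a gate might compare measured metrics against these values; the `gate` function and metric names are assumptions for illustration, not SERF code.

```python
# Illustrative evaluation gate; the threshold values mirror the eval
# section of config.yml above, expressed as percentages.
THRESHOLDS = {"coverage": 99.99, "error": 1.0, "overlap": 1.0}

def gate(covered, errors, overlapping, total):
    """Return per-metric pass/fail given raw counts out of `total`."""
    coverage_pct = 100.0 * covered / total
    error_pct = 100.0 * errors / total
    overlap_pct = 100.0 * overlapping / total
    return {
        "coverage": coverage_pct >= THRESHOLDS["coverage"],
        "error": error_pct <= THRESHOLDS["error"],
        "overlap": overlap_pct <= THRESHOLDS["overlap"],
    }

print(gate(covered=10_000, errors=5, overlapping=50, total=10_000))
# {'coverage': True, 'error': True, 'overlap': True}
```

Keeping the thresholds in percent matches how such metrics are usually reported, but it means any comparison code written against the old fractional values must be updated in the same change.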
81 changes: 81 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,81 @@
services:
serf:
build:
context: .
dockerfile: Dockerfile
container_name: serf
volumes:
- ./data:/app/data
- ./logs:/app/logs
- ./config.yml:/app/config.yml:ro
environment:
- GEMINI_API_KEY=${GEMINI_API_KEY}
entrypoint: ["uv", "run", "serf"]
command: ["--help"]

# Run a benchmark
benchmark:
build:
context: .
dockerfile: Dockerfile
container_name: serf-benchmark
volumes:
- ./data:/app/data
- ./logs:/app/logs
- ./config.yml:/app/config.yml:ro
environment:
- GEMINI_API_KEY=${GEMINI_API_KEY}
entrypoint: ["uv", "run", "serf"]
command: ["benchmark", "--dataset", "dblp-acm", "--output", "data/benchmarks/docker"]
profiles:
- benchmark

# Run entity resolution on input data
resolve:
build:
context: .
dockerfile: Dockerfile
container_name: serf-resolve
volumes:
- ./data:/app/data
- ./logs:/app/logs
- ./config.yml:/app/config.yml:ro
environment:
- GEMINI_API_KEY=${GEMINI_API_KEY}
entrypoint: ["uv", "run", "serf"]
command: ["run", "--input", "data/input.csv", "--output", "data/resolved"]
profiles:
- resolve

# Analyze a dataset and generate ER config
analyze:
build:
context: .
dockerfile: Dockerfile
container_name: serf-analyze
volumes:
- ./data:/app/data
- ./logs:/app/logs
- ./config.yml:/app/config.yml:ro
environment:
- GEMINI_API_KEY=${GEMINI_API_KEY}
entrypoint: ["uv", "run", "serf"]
command: ["analyze", "--input", "data/input.csv", "--output", "data/er_config.yml"]
profiles:
- analyze

# Run tests
test:
build:
context: .
dockerfile: Dockerfile
container_name: serf-test
volumes:
- ./data:/app/data
- ./logs:/app/logs
environment:
- GEMINI_API_KEY=${GEMINI_API_KEY}
entrypoint: ["uv", "run", "pytest"]
command: ["tests/", "-v", "--ignore=tests/test_dspy.py"]
profiles:
- test