From 1e2c7862c5dbbca0e834e591a940c105ed86f8a3 Mon Sep 17 00:00:00 2001
From: Russell Jurney
Date: Sat, 7 Mar 2026 19:24:15 -0800
Subject: [PATCH 1/3] Add $100 overnight budget constraint for Gemini API usage

Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro allowed
only for validation data generation with < 2K API calls.
---
 docs/SERF_LONG_SHOT_PLAN.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/docs/SERF_LONG_SHOT_PLAN.md b/docs/SERF_LONG_SHOT_PLAN.md
index 03d09e7..7ab23c6 100644
--- a/docs/SERF_LONG_SHOT_PLAN.md
+++ b/docs/SERF_LONG_SHOT_PLAN.md
@@ -790,6 +790,26 @@ After 3 rounds with merge factor 0.8 per round:
 - Comparison pairs shrink to 0.8^6 = 26.2% of original
 - Each round is cheaper than the last due to smaller dataset
 
+### 9.6 Overnight Build Budget Constraint
+
+**Hard budget: $100 total Gemini API spend for the overnight build.**
+
+A `GEMINI_API_KEY` environment variable will be provided. The agent must stay within budget by following these rules:
+
+1. **Use Gemini 2.0 Flash exclusively** for all ER pipeline operations (blocking analysis, matching, merging, edge resolution). At $0.10/$0.40 per 1M input/output tokens, this allows ~160M+ input tokens -- more than enough for iterative ER across all three benchmark datasets.
+
+2. **Gemini 2.5 Pro is allowed ONLY for generating validation data** -- high-quality labeled match/non-match pairs and few-shot examples that will be used to evaluate and optimize the pipeline. Limit Gemini 2.5 Pro to **fewer than 2,000 API calls** total. At ~2,500 tokens per call with $1.25/$10.00 per 1M input/output tokens, 2K calls cost roughly $50 -- leaving ample headroom for Flash usage.
+
+3. **Never use Claude, GPT-4o, or any non-Gemini model** for pipeline operations during the build. The DSPy signatures and pipeline code should be model-agnostic, but all actual LLM calls during this build session must go through Gemini.
+
+4. 
**Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 2.5 Pro calls and finish remaining work with Flash only.
+
+| Use Case                       | Model            | Max Calls                 | Est. Cost  |
+| ------------------------------ | ---------------- | ------------------------- | ---------- |
+| ER pipeline (match/merge/edge) | Gemini 2.0 Flash | Unlimited (within budget) | ~$10-30    |
+| Validation data generation     | Gemini 2.5 Pro   | < 2,000                   | ~$50       |
+| **Total**                      |                  |                           | **< $100** |
+
 ---
 
 ## 10. Implementation Plan

From 8f051f8854eddd5345389a17435eb4cbe5f82609 Mon Sep 17 00:00:00 2001
From: Russell Jurney
Date: Sat, 7 Mar 2026 19:57:42 -0800
Subject: [PATCH 2/3] Replace BAML types with fresh DSPy Pydantic types and auto-generation

Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic Pydantic
classes for DSPy signatures. Add auto-generation of entity types from PySpark
DataFrame schemas via type_generator module. Add Spark-to-Python type mapping
and DatasetProfile-driven field descriptions.
---
 docs/SERF_LONG_SHOT_PLAN.md | 86 ++++++++++++++++++++++++++++++------
 1 file changed, 71 insertions(+), 15 deletions(-)

diff --git a/docs/SERF_LONG_SHOT_PLAN.md b/docs/SERF_LONG_SHOT_PLAN.md
index 7ab23c6..5d8ba3c 100644
--- a/docs/SERF_LONG_SHOT_PLAN.md
+++ b/docs/SERF_LONG_SHOT_PLAN.md
@@ -155,19 +155,21 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patter
 14. **Cross-iteration UUID validation** -- Evaluation validates source_uuids against ALL historical UUIDs from all previous iterations, not just the current round
 15. **Auto-scaling target block size** -- `effective_target = max(10, target_block_size // iteration)` creates tighter clusters in later rounds when remaining duplicates are harder to find
 16. **UDTF factory pattern** -- `create_split_large_blocks_udtf()` dynamically creates PySpark UDTF classes with configurable return types and max block sizes
-17. 
**SparkDantic schema bridge** -- `SparkModel` subclasses of BAML Pydantic types automatically generate Spark schemas with IntegerType-to-LongType conversion +17. **SparkDantic schema bridge** -- `SparkModel` subclasses of Pydantic types automatically generate Spark schemas with IntegerType-to-LongType conversion. In SERF, this works bidirectionally: Pydantic-to-Spark for writing, and Spark-to-Pydantic for auto-generating entity types from input DataFrames 18. **Comprehensive analysis reports** -- 7-section blocking reports with input data, clustering params, distance distribution (percentiles), block size stats, Levenshtein analysis, sample blocks, and recommendations 19. **Per-stage timing and throughput** -- CLI tracks wall-clock time and throughput (companies/sec, blocks/sec) per pipeline stage, plus cross-iteration summary table 20. **Company name normalization** -- `cleanco` library for corporate suffix removal, multilingual stop word filtering for acronym generation, domain suffix removal for blocking keys ### 3.6 Patterns to Evolve in SERF -1. **BAML -> DSPy signatures** -- Replace BAML templates with DSPy `Signature` classes + `BAMLAdapter` for output formatting -2. **Poetry -> uv** -- Replace Poetry with the faster, standards-compliant uv package manager -3. **PySpark 3.5 -> 4.1** -- Leverage VARIANT type, Spark Connect, Python Data Source API -4. **Parquet -> Iceberg** -- Add ACID transactions, time travel, schema evolution for iterative ER -5. **Company-only -> domain-agnostic** -- Generalize entity types beyond companies -6. **Manual orchestration -> agentic** -- DSPy ReAcT agents control the pipeline dynamically +1. **BAML types -> fresh DSPy Pydantic types** -- Do NOT reuse Abzu's BAML-generated types (`abzu.baml_client.types`). Those types were auto-generated by BAML for a company-specific domain. SERF needs fresh, domain-agnostic Pydantic classes designed for DSPy signatures. 
Study Abzu's types for field patterns and ER metadata (source_ids, source_uuids, match_skip, match_skip_history), but build new classes from scratch. +2. **BAML templates -> DSPy signatures** -- Replace BAML templates with DSPy `Signature` classes + `BAMLAdapter` for output formatting. The signatures should use the new Pydantic types as input/output fields. +3. **Auto-generate entity types from DataFrames** -- When a user provides a PySpark DataFrame (or Parquet/CSV file), SERF should automatically infer Pydantic entity types from the DataFrame schema. This means inspecting `df.schema` (StructType), mapping Spark types to Python types, and generating a Pydantic class with the appropriate fields. The profiler (Section 5.3) identifies which fields are names, identifiers, dates, etc. -- this metadata enriches the generated type with DSPy field descriptions. Use this approach when it simplifies the user experience (e.g., `serf resolve --input data.parquet` should work without the user defining any types). For advanced use cases, users can define their own Pydantic entity types. +4. **Poetry -> uv** -- Replace Poetry with the faster, standards-compliant uv package manager +5. **PySpark 3.5 -> 4.1** -- Leverage VARIANT type, Spark Connect, Python Data Source API +6. **Parquet -> Iceberg** -- Add ACID transactions, time travel, schema evolution for iterative ER +7. **Company-only -> domain-agnostic** -- Generalize entity types beyond companies +8. 
**Manual orchestration -> agentic** -- DSPy ReAcT agents control the pipeline dynamically --- @@ -268,7 +270,8 @@ serf/ main.py # CLI entry point (exists, extend) dspy/ __init__.py - types.py # Pydantic entity types (exists, extend) + types.py # Fresh Pydantic entity types for DSPy (exists, rewrite) + type_generator.py # Auto-generate entity types from DataFrame schemas (NEW) baml_adapter.py # BAMLAdapter for DSPy (exists) signatures.py # DSPy signatures for ER (NEW) agents.py # DSPy ReAcT agents (NEW) @@ -306,6 +309,7 @@ serf/ tests/ test_config.py test_types.py + test_type_generator.py test_baml_adapter.py test_embeddings.py test_faiss_blocker.py @@ -337,13 +341,19 @@ serf/ ## 5. Data Model and Type System +**Important: SERF does NOT reuse Abzu's BAML-generated types.** Abzu's types were auto-generated by BAML for a company-specific SEC filing domain. SERF builds fresh Pydantic types designed for DSPy, preserving only the proven ER metadata patterns (source_ids, source_uuids, match_skip, match_skip_history). + ### 5.1 Domain-Agnostic Entity Types SERF should generalize beyond Abzu's Company-only model. The base entity type is inspired by [schema.org](https://schema.org/Person) property conventions: ```python class Entity(BaseModel): - """Base entity type for all resolvable entities.""" + """Base entity type for all resolvable entities. + + Domain-specific fields live in `attributes` or in subclasses. + ER metadata fields (id, uuid, source_ids, etc.) are fixed across all domains. + """ id: int uuid: Optional[str] = None name: str @@ -357,7 +367,7 @@ class Entity(BaseModel): match_skip_history: Optional[list[int]] = None # iterations that skipped this entity ``` -Specialized entity types extend this base: +Specialized entity types extend this base. 
These are **examples** -- users can define their own or let SERF auto-generate them from DataFrame schemas: ```python class Company(Entity): @@ -385,6 +395,51 @@ class Person(Entity): nationality: Optional[str] = None ``` +### 5.1.1 Auto-Generating Entity Types from DataFrames + +When the user provides a PySpark DataFrame (or Parquet/CSV file) without a custom entity type, SERF should auto-generate a Pydantic entity class from the DataFrame schema: + +```python +def entity_type_from_spark_schema( + schema: StructType, + profile: DatasetProfile, + entity_type_name: str = "AutoEntity", +) -> type[Entity]: + """Generate a Pydantic Entity subclass from a Spark StructType schema. + + Uses the DatasetProfile to enrich fields with DSPy descriptions + (e.g., marking a field as "name", "identifier", "date"). + + Parameters + ---------- + schema : StructType + The Spark schema to convert + profile : DatasetProfile + Profiling results identifying field types and roles + entity_type_name : str + Name for the generated class + + Returns + ------- + type[Entity] + A dynamically created Pydantic subclass of Entity + """ + ... +``` + +The type mapping from Spark to Python is straightforward: + +| Spark Type | Python Type | Notes | +| --------------------- | --------------------- | --------------------------- | +| StringType | str | Default for most fields | +| LongType/IntegerType | int | Numeric identifiers, counts | +| DoubleType/FloatType | float | Revenue, percentages | +| BooleanType | bool | Flags | +| ArrayType(StringType) | list[str] | Tags, categories | +| StructType | nested Pydantic model | Auto-generated recursively | + +ER metadata fields (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) are automatically added by the framework -- the user's schema only needs to contain domain fields. The profiler's `FieldProfile` provides DSPy-compatible descriptions for each field. 
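The generator described above can be sketched with Pydantic's `create_model`. To keep the sketch self-contained it takes `(field_name, spark_type_name)` pairs rather than a live `StructType` (in the pipeline these would come from `df.schema.fields` via `field.name` / `field.dataType.typeName()`), and it stands in a plain dict of field descriptions for `DatasetProfile`; the abbreviated `Entity` base and helper names here are illustrative, not SERF's final API:

```python
from typing import Optional

from pydantic import BaseModel, Field, create_model

# Spark `DataType.typeName()` -> Python type, per the mapping table above.
SPARK_TO_PYTHON: dict[str, type] = {
    "string": str,
    "long": int,
    "integer": int,
    "double": float,
    "float": float,
    "boolean": bool,
}


class Entity(BaseModel):
    """Abbreviated base entity carrying the fixed ER metadata fields."""

    id: int
    uuid: Optional[str] = None
    source_ids: list[int] = []
    source_uuids: list[str] = []
    match_skip: bool = False
    match_skip_reason: Optional[str] = None
    match_skip_history: Optional[list[int]] = None


def entity_type_from_spark_schema(
    fields: list[tuple[str, str]],       # (name, Spark typeName) pairs
    descriptions: dict[str, str],        # stand-in for DatasetProfile output
    entity_type_name: str = "AutoEntity",
) -> type[Entity]:
    """Build a Pydantic Entity subclass from a flattened Spark schema.

    Domain fields become Optional with profile-derived descriptions; ER
    metadata fields come from the Entity base, never from the input schema.
    """
    domain_fields = {
        name: (
            Optional[SPARK_TO_PYTHON.get(spark_type, str)],
            Field(default=None, description=descriptions.get(name, "")),
        )
        for name, spark_type in fields
        if name not in Entity.model_fields  # never shadow ER metadata
    }
    return create_model(entity_type_name, __base__=Entity, **domain_fields)


# Example: a DBLP-ACM-style bibliographic schema yields a matching-ready type.
Publication = entity_type_from_spark_schema(
    [("title", "string"), ("authors", "string"), ("venue", "string"), ("year", "integer")],
    {"title": "publication title", "year": "publication year"},
)
```

The real implementation would additionally walk nested `StructType` fields and recurse into them to produce nested models, as the table above indicates.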
+ ### 5.2 Block and Match Types ```python @@ -597,7 +652,7 @@ Raw Entities -> Embed (Qwen3) -> FAISS IVF Cluster -> Blocks ### 7.3 Phase 2: Schema Alignment + Matching + Merging -All three operations in a single DSPy signature: +All three operations in a single DSPy signature. The `schema_info` field is auto-generated from the Pydantic entity type (which itself may have been auto-generated from the input DataFrame schema via Section 5.1.1): ```python class BlockMatch(dspy.Signature): @@ -611,7 +666,7 @@ class BlockMatch(dspy.Signature): 5. Return ALL entities (merged + non-matched) """ block_records: str = dspy.InputField(desc="JSON array of entity records in this block") - schema_info: str = dspy.InputField(desc="Description of the record schema and field meanings") + schema_info: str = dspy.InputField(desc="Auto-generated description of entity fields and their roles from the Pydantic type and DatasetProfile") few_shot_examples: str = dspy.InputField(desc="Examples of correct merge behavior") resolution: BlockResolution = dspy.OutputField() ``` @@ -824,13 +879,14 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr 4. Create all module directories with `__init__.py` files 5. Update `CLAUDE.md` to reflect new tooling (uv, Ruff) -### Step 2: Core Type System (1 hr) +### Step 2: Core Type System (1.5 hr) -1. Rewrite `src/serf/dspy/types.py` with domain-agnostic `Entity` base class + `Company`, `Person` specializations +1. Rewrite `src/serf/dspy/types.py` from scratch with fresh, DSPy-appropriate Pydantic types -- do NOT copy Abzu's BAML-generated types. Build domain-agnostic `Entity` base class with ER metadata (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) + example `Company`, `Person` specializations 2. Add `EntityBlock`, `MatchDecision`, `BlockResolution` types 3. Add `FieldProfile`, `DatasetProfile` types for dataset analysis 4. Add `IterationMetrics`, `BlockingMetrics` TypedDicts -5. 
Write exhaustive unit tests: `tests/test_types.py` +5. Create `src/serf/dspy/type_generator.py` -- `entity_type_from_spark_schema()` that auto-generates a Pydantic Entity subclass from a Spark StructType + DatasetProfile. Maps Spark types to Python types, adds ER metadata fields automatically, generates DSPy field descriptions from profiling results +6. Write exhaustive unit tests: `tests/test_types.py`, `tests/test_type_generator.py` ### Step 3: DSPy Signatures and Adapter (1 hr) From 9e636212316589987cc8ae58cb7b3f0aab7030e6 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Sun, 8 Mar 2026 04:13:43 +0000 Subject: [PATCH 3/3] Update SERF long-shot plan from research decisions Co-authored-by: Russell Jurney --- docs/SERF_LONG_SHOT_PLAN.md | 146 ++++++++++++++++++++++++------------ 1 file changed, 97 insertions(+), 49 deletions(-) diff --git a/docs/SERF_LONG_SHOT_PLAN.md b/docs/SERF_LONG_SHOT_PLAN.md index 5d8ba3c..148a6da 100644 --- a/docs/SERF_LONG_SHOT_PLAN.md +++ b/docs/SERF_LONG_SHOT_PLAN.md @@ -23,6 +23,20 @@ This document is a comprehensive implementation plan for building **SERF** (Sema --- +## 0. Research Phase Decisions (2026-03-08) + +This plan is aligned to the current implementation directives: + +1. Migrate from Poetry to `uv` first, before complete system implementation. +2. Exclude BAML runtime integration and `BAMLAdapter` for now. +3. Enforce a hard Gemini spend cap of `$100` for the build. +4. Use Gemini 2.0 Flash for ER pipeline operations; allow limited Gemini 2.5 Pro for validation data and GEPA/reflection workflows. +5. Complete research and document updates first, then pause for instruction before full implementation. +6. Start with entities represented by all three key benchmark tracks: DBLP-ACM, Walmart-Amazon, DBLP-Scholar. +7. Pin PySpark to `4.1+`. +8. Prioritize benchmark validation order: DBLP-ACM, then Walmart-Amazon, then DBLP-Scholar. +9. Reference Abzu prompt semantics and semantic blocking code directly; defer name-based blocking. 
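Decisions 3 and 4 above (and the `budget.py` guard that appears in the module layout later in this plan) can be enforced mechanically. Below is a minimal sketch of such a spend guard using the per-1M-token prices quoted in this plan; the class and method names are hypothetical, not an existing API:

```python
from dataclasses import dataclass

# Per-1M-token (input, output) prices in USD, as quoted in this plan.
PRICES = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gemini-2.5-pro": (1.25, 10.00),
}


@dataclass
class BudgetGuard:
    """Tracks cumulative Gemini spend against the $100 hard cap."""

    hard_cap_usd: float = 100.0
    pro_cutoff_usd: float = 80.0   # stop 2.5 Pro calls past this point
    max_pro_calls: int = 2_000     # hard limit on 2.5 Pro API calls
    spent_usd: float = 0.0
    pro_calls: int = 0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Accumulate cost from the token counts in an API response."""
        in_price, out_price = PRICES[model]
        self.spent_usd += (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        if model == "gemini-2.5-pro":
            self.pro_calls += 1

    def allow(self, model: str) -> bool:
        """Check whether another call to `model` fits the budget rules."""
        if self.spent_usd >= self.hard_cap_usd:
            return False
        if model == "gemini-2.5-pro":
            return self.pro_calls < self.max_pro_calls and self.spent_usd < self.pro_cutoff_usd
        return True
```

A pipeline wrapper would call `guard.allow(model)` before each LLM request and `guard.record(...)` with the usage metadata afterward, degrading to Flash-only once the Pro cutoff is reached.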
+ ## 1. Introduction and Motivation Entity resolution (ER) -- the process of determining when two or more records refer to the same real-world entity -- is among the oldest and most important problems in data management. The Fellegi-Sunter model (1969) formalized probabilistic record linkage, and decades of research have produced systems based on string similarity metrics (Jaccard, Jaro-Winkler, Levenshtein), rule-based blocking, and supervised classifiers. These traditional systems, exemplified by production platforms like Senzing and Informatica MDM, are fast and auditable but fundamentally brittle: they require hand-crafted rules per domain, break on schema variations, and cannot capture the deep semantic understanding needed when data sources use different terminology for the same concepts. @@ -45,7 +59,7 @@ Senzing's Jeff Jonas has argued that LLMs are [too slow, too expensive, and too | ----------------- | ----------------------------------------------------------------------------------------------- | | Too slow | Semantic blocking reduces LLM calls to O(blocks); PySpark parallelizes across clusters | | Too expensive | Gemini 2.0 Flash at $0.10/1M tokens; block-level matching reduces calls 50x or more vs pairwise | -| Hallucinations | BAML structured outputs + DSPy optimization + multi-round convergence | +| Hallucinations | DSPy structured outputs + Pydantic typing + multi-round convergence | | Not deterministic | Temperature=0, multiple convergence rounds, confidence scoring | | Not explainable | Chain-of-thought reasoning, match rationale in structured output fields | | Not scalable | PySpark for ETL, embedding models for blocking, LLMs only for semantic decisions | @@ -72,7 +86,7 @@ The concept of **Knowledge Graph Factories** ([blog post](https://blog.graphlet. 
| --------- | ------------------------- | --------------------------------- | ---------------------------------- | --------------------------- | | 2019-2022 | Deep Discovery / Graphlet | Sentence-transformers | Embedding similarity + classifiers | Rule-based field resolution | | 2023-2025 | Abzu | FAISS IVF + sentence-transformers | Gemini via BAML (block-level) | LLM-guided via BAML | -| 2026+ | **SERF** | FAISS IVF + Qwen3 embeddings | DSPy agents + Gemini/Claude | DSPy-optimized LLM merging | +| 2026+ | **SERF** | FAISS IVF + Qwen3 embeddings | DSPy agents + Gemini models | DSPy-optimized LLM merging | --- @@ -80,23 +94,21 @@ The concept of **Knowledge Graph Factories** ([blog post](https://blog.graphlet. Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patterns that SERF should adopt and generalize. The agent building SERF should study these files carefully: -### 3.1 BAML Templates (LLM Prompt Engineering) +### 3.1 Abzu Prompt and Client References (Port Semantics Only) -- **`baml_src/multi_er.baml`** -- The core multi-entity resolution prompt. Defines `MultiEntityResolution` and `FewShotMultiEntityResolution` functions using `Gemini20Flash`. Key types: `CompanyList` (block_key, block_key_type, block_size, companies), `MergeCompany` (id, source_ids only -- no name field). The LLM receives an entire block of companies and returns merged results with comprehensive ID tracking (lowest input ID becomes master, ALL OTHER IDs go to `source_ids`). Company class includes `match_skip_history` field for tracking which iterations skipped each company. Test cases cover two-company merges, UUID tracking, and multiple source UUID accumulation. **This pattern must be preserved in SERF using DSPy signatures instead of BAML.** +- **`baml_src/multi_er.baml`** -- Primary prompt semantics to port into DSPy signatures. 
Preserve ID-tracking behavior exactly: lowest input ID becomes master, all others go into `source_ids`, and merged records preserve full source lineage. SERF should reuse this logic in DSPy signatures and Python code, but should not include BAML runtime dependencies. - **`baml_src/article.baml`** -- Base schema definitions. `Company` class with fields: id, uuid, name, cik, ticker, description, website_url, headquarters_location, jurisdiction, revenue_usd, employees, founded_year, ceo, linkedin_url, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history. Also defines `Ticker` (id, uuid, symbol, exchange), `Exchange` enum (50+ Bloomberg codes), and `Relationship` types. -- **`baml_src/edge_er.baml`** -- Edge resolution prompt. Defines comprehensive types: `EdgeRelationshipInput` (type, description, amount, currency, date, percentage, quarter, url), `EdgeBlock` (src_name, dst_name, relationships, block_size), `MergedEdgeRelationship`, and `EdgeResolutionResult` (merged_relationships, was_resolved, original_count, resolved_count). The `ResolveEdgeBlock` function uses `Gemini20Flash` with rules for merging same-deal relationships across different types (e.g., "Supplier" + "SupplyAgreement"). Test cases cover duplicate investments and distinct relationships. +- **`baml_src/edge_er.baml`** -- Reference edge-merge semantics for `serf.edge.resolver` prompt behavior. Port intent and output constraints into DSPy signatures and typed Python models. -- **`baml_src/clients.baml`** -- LLM client configurations. Active clients: `Gemini20Flash` (60s/120s timeouts), `Gemini20FlashLong` (180s/180s timeouts, same model), `Gemini25FlashLite` (60s/120s timeouts). All use `temperature 0.0` and exponential retry policy (3 retries, 300ms base delay, 1.5x multiplier, 10s max). Commented out: O3Mini, GPT4o, Gemini25Flash, Gemini25Pro, fallback strategies. 
+- **`baml_src/clients.baml`** -- Reference timeout/retry policy values to carry over into DSPy LM configuration (`temperature=0`, exponential retries, and bounded request timeouts). -- **`baml_src/company_er.baml`** -- Pairwise company matching for cases where blocking isn't sufficient. - -- **`baml_src/final_er.baml`** -- Final deduplication for companies sharing the same UUID across different blocking iterations. +- **`baml_src/company_er.baml`**, **`baml_src/final_er.baml`** -- Secondary references for fallback and final-dedupe semantics; optional for first complete implementation. ### 3.2 Python ER Modules -- **`abzu/er/uuid.py`** (574 lines) -- **Critical pattern.** `UUIDMapper` maps UUIDs to consecutive integers for LLM processing (LLMs work better with small integers than UUIDs). `process_block_with_uuid_mapping()` is the core async function: (1) maps UUIDs to ints, (2) strips source_uuids before sending to LLM (avoids context bloat), (3) calls BAML, (4) restores source_uuids from cached data, (5) recovers missing companies via two-phase recovery (Step 1: add back missing UUIDs to existing output companies; Step 2: recover entire missing companies with `match_skip_reason = "missing_in_match_output"`). Ticker normalization handles str, list, Row, dict formats with Exchange enum validation. **This UUID-to-integer mapping pattern is essential for SERF.** +- **`abzu/er/uuid.py`** (574 lines) -- **Critical pattern.** `UUIDMapper` maps UUIDs to consecutive integers for LLM processing (LLMs work better with small integers than UUIDs). 
`process_block_with_uuid_mapping()` is the core async function: (1) maps UUIDs to ints, (2) strips source_uuids before sending to LLM (avoids context bloat), (3) calls LLM match/merge logic, (4) restores source_uuids from cached data, (5) recovers missing companies via two-phase recovery (Step 1: add back missing UUIDs to existing output companies; Step 2: recover entire missing companies with `match_skip_reason = "missing_in_match_output"`). Ticker normalization handles str, list, Row, dict formats with Exchange enum validation. **This UUID-to-integer mapping pattern is essential for SERF.** - **`abzu/er/match.py`** (537 lines) -- Main matching orchestration. `match_entities()` loads blocks from Parquet (Arrow optimization disabled to preserve Python lists), validates block schema via `validate_block_schema()`, separates singleton/multi-company blocks, processes multi-company blocks via `process_blocks_async()` with semaphore-based rate limiting. Includes `backup_file()` for file/directory backup before overwriting (keeps only most recent backup). Comprehensive error recovery: marks error-recovered companies with `match_skip_reason = "error_recovery"`, updates `match_skip_history` with current iteration number, and logs detailed recovery statistics. Always generates new UUIDs for resolved companies. @@ -104,13 +116,13 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patter - **`abzu/er/faiss.py`** (142 lines) -- `FAISSBlocker` class using `IndexIVFFlat` with inner product metric. Calculates `nlist` from `target_block_size`, caps at `sqrt(n)`. Optional `max_distance` filtering. Returns `{block_key: [uuid1, uuid2, ...]}`. -- **`abzu/er/edge_match.py`** (176 lines) -- Async edge resolution: `process_edge_block()`, `resolve_edge_blocks()`, `run_edge_resolution()`. Uses `EdgeBlock`, `EdgeRelationshipInput` BAML types with semaphore rate limiting. Error recovery returns original relationships unchanged. 
+- **`abzu/er/edge_match.py`** (176 lines) -- Async edge resolution: `process_edge_block()`, `resolve_edge_blocks()`, `run_edge_resolution()`. Uses typed edge blocks with semaphore rate limiting. Error recovery returns original relationships unchanged. - **`abzu/er/few_shot.py`** (80 lines) -- Few-shot example generation for merge prompts. `company_dicts_to_baml()` converts dicts to `MergeCompanyExampleSet`. Hardcoded `company_id_tracking_dicts` example demonstrating merge of companies with ids 1 (source_ids [3,7]) and 22 (source_ids [2,4]) into master id=1, source_ids=[22,3,7,2,4]. - **`abzu/er/metrics.py`** (161 lines) -- `BlockInfo`, `BlockingMetrics`, `IterationMetrics` TypedDicts. `IterationMetrics` includes `overall_reduction_pct` for cumulative tracking from original baseline. `get_all_iteration_metrics()` computes both per-round and overall reduction by tracking original company count from iteration 1. Also provides `get_blocking_metrics()`, `get_matching_metrics()`, and `get_evaluation_metrics()`. -- **`abzu/er/acronyms.py`** (88 lines) -- **Company name normalization.** `get_basename()` uses `cleanco` library for corporate suffix removal (Inc., LLC, etc.). `get_corporate_ending()` extracts the corporate ending. `get_acronyms()` generates acronyms from cleaned company names, filtering multilingual stop words via `get_multilingual_stop_words()`. Used by name-based blocking for key generation. +- **`abzu/er/acronyms.py`** (88 lines) -- Name normalization utilities. Keep this as optional preprocessing support, but do not prioritize Abzu's name-based blocking path in the first complete SERF implementation. ### 3.3 PySpark ER Modules @@ -118,7 +130,7 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patter - **`abzu/spark/er_eval.py`** (601 lines) -- Post-match evaluation with iteration-aware validation. `evaluate_er_matches()` explodes resolved companies, deduplicates exact copies, splits into BAML-processed vs skipped companies. 
Validates source_uuids against ORIGINAL raw companies data (always iteration 0). For iteration 2+, loads previous iteration output for comparison and validates against ALL historical UUIDs from all previous iterations. Comprehensive metrics: `iteration_input_companies`, `total_original_companies`, `companies_that_went_into_matching`, `skipped_records`, `unique_baml_processed`, `reduction_from_matching`, `ids_dropped_by_baml`. Match skip reason analysis: `singleton_block_count`, `error_recovery_count`, `missing_uuid_recovery_count`, `missing_primary_uuid_count`, `missing_source_uuid_count`. Overall status assessment with PASS/FAIL checks (coverage >= 99.99%, error < 0.01%, overlap < 1.0%). Saves `er_evaluation_metrics.parquet` with all metrics. -- **`abzu/spark/er_block.py`** (848 lines) -- Name-based (heuristic) blocking. Three strategies combined via union: first-word extraction (`get_first_word()`), acronym extraction (`get_acronym()` via `abzu.er.acronyms`), and domain suffix removal (`remove_domain_suffix()` with 50+ TLD suffixes). `build_blocks()` is the main entry point. Uses `normalize_company_dataframe()` for consistent schema and `create_split_large_blocks_udtf()` for oversized block splitting. +- **`abzu/spark/er_block.py`** (848 lines) -- Name-based (heuristic) blocking. Reference only. Do not implement this path in the first full SERF build; prioritize semantic blocking from `er_block_semantic.py`. - **`abzu/spark/refine_kg.py`** -- Maps relationships to resolved company nodes, optional edge ER. @@ -163,13 +175,23 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patter ### 3.6 Patterns to Evolve in SERF 1. **BAML types -> fresh DSPy Pydantic types** -- Do NOT reuse Abzu's BAML-generated types (`abzu.baml_client.types`). Those types were auto-generated by BAML for a company-specific domain. SERF needs fresh, domain-agnostic Pydantic classes designed for DSPy signatures. 
Study Abzu's types for field patterns and ER metadata (source_ids, source_uuids, match_skip, match_skip_history), but build new classes from scratch. -2. **BAML templates -> DSPy signatures** -- Replace BAML templates with DSPy `Signature` classes + `BAMLAdapter` for output formatting. The signatures should use the new Pydantic types as input/output fields. +2. **Abzu prompt semantics -> DSPy signatures** -- Port merge/match prompt behavior into DSPy `Signature` classes and typed Python orchestration. Do not use BAML runtime or `BAMLAdapter` in the first full implementation. 3. **Auto-generate entity types from DataFrames** -- When a user provides a PySpark DataFrame (or Parquet/CSV file), SERF should automatically infer Pydantic entity types from the DataFrame schema. This means inspecting `df.schema` (StructType), mapping Spark types to Python types, and generating a Pydantic class with the appropriate fields. The profiler (Section 5.3) identifies which fields are names, identifiers, dates, etc. -- this metadata enriches the generated type with DSPy field descriptions. Use this approach when it simplifies the user experience (e.g., `serf resolve --input data.parquet` should work without the user defining any types). For advanced use cases, users can define their own Pydantic entity types. -4. **Poetry -> uv** -- Replace Poetry with the faster, standards-compliant uv package manager -5. **PySpark 3.5 -> 4.1** -- Leverage VARIANT type, Spark Connect, Python Data Source API +4. **Poetry -> uv (first task)** -- Replace Poetry with the faster, standards-compliant uv package manager before any major feature work. +5. **PySpark 3.5 -> 4.1+** -- Pin to PySpark 4.1+ and leverage VARIANT type, Spark Connect, and Python Data Source API. 6. **Parquet -> Iceberg** -- Add ACID transactions, time travel, schema evolution for iterative ER 7. **Company-only -> domain-agnostic** -- Generalize entity types beyond companies 8. 
**Manual orchestration -> agentic** -- DSPy ReAcT agents control the pipeline dynamically +9. **Name-based blocking deferred** -- Ignore Abzu name-based blocking for the first complete build; focus on semantic blocking quality and scale. + +### 3.7 Required Abzu Reference Workflow During Implementation + +During implementation, fetch and reference the Abzu repository directly and map behavior before coding: + +1. Pull Abzu and inspect prompt semantics in `baml_src/multi_er.baml` (match + merge rules and ID/source lineage behavior). +2. Port semantic blocking patterns from `abzu/spark/er_block_semantic.py` first. +3. Ignore `abzu/spark/er_block.py` (name-based blocking) for the first full SERF build. +4. Preserve operational parity for UUID mapping, missing-record recovery, and iterative evaluation metrics. --- @@ -177,19 +199,19 @@ Abzu's codebase at `/Users/rjurney/Software/weave` contains the proven ER patter ### 4.1 Core Technologies -| Component | Technology | Rationale | -| ---------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | -| **Package Manager** | **uv** (replacing Poetry) | 10-100x faster, PEP 621 compliant, built-in Python version management | -| **Data Processing** | **PySpark 4.1** | Python-first, Spark Connect, VARIANT type, Arrow UDFs | -| **Table Format** | **Apache Iceberg** | ACID transactions, time travel for iteration tracking, schema evolution | -| **LLM Framework** | **DSPy 3.x** with `BAMLAdapter` | Programming-not-prompting, automatic optimization, Pydantic integration | -| **Embeddings** | **Qwen3-Embedding** via sentence-transformers | Top MTEB leaderboard, multilingual support | -| **Vector Search** | **FAISS IndexIVFFlat** | Fast approximate nearest neighbor for semantic blocking | -| **Graph Processing** | **GraphFrames** | Connected components for transitive closure of match decisions | 
-| **LLM Models** | **Gemini 2.0 Flash** (production), **Gemini 2.5 Flash Lite** (lightweight), **Claude Opus 4** (quality) | Flash for cost efficiency at temp=0; Opus for difficult cases | -| **CLI** | **Click** | Existing SERF pattern, `show_default=True` | -| **Type Checking** | **zuban** (mypy-compatible) | Existing SERF pattern | -| **Linting/Formatting** | **Ruff** (replacing black/isort/flake8) | Single tool, 10-100x faster | +| Component | Technology | Rationale | +| ---------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | +| **Package Manager** | **uv** (replacing Poetry) | 10-100x faster, PEP 621 compliant, built-in Python version management | +| **Data Processing** | **PySpark 4.1+** | Python-first, Spark Connect, VARIANT type, Arrow UDFs | +| **Table Format** | **Apache Iceberg** | ACID transactions, time travel for iteration tracking, schema evolution | +| **LLM Framework** | **DSPy 3.x** with native structured outputs | Programming-not-prompting, automatic optimization, Pydantic integration | +| **Embeddings** | **Qwen3-Embedding** via sentence-transformers | Top MTEB leaderboard, multilingual support | +| **Vector Search** | **FAISS IndexIVFFlat** | Fast approximate nearest neighbor for semantic blocking | +| **Graph Processing** | **GraphFrames** | Connected components for transitive closure of match decisions | +| **LLM Models** | **Gemini 2.0 Flash** (all ER pipeline), **Gemini 2.5 Pro** (limited validation/GEPA reflection) | Flash for cost efficiency; Pro only for constrained quality workflows | +| **CLI** | **Click** | Existing SERF pattern, `show_default=True` | +| **Type Checking** | **zuban** (mypy-compatible) | Existing SERF pattern | +| **Linting/Formatting** | **Ruff** (replacing black/isort/flake8) | Single tool, 10-100x faster | ### 4.2 Migration from Poetry to uv @@ -208,7 +230,7 @@ dependencies = [ 
"dspy-ai>=3.0.3", "click>=8.1", "pyyaml>=6.0", - "pyspark>=4.0,<5.0", + "pyspark>=4.1,<5.0", "sentence-transformers>=5.1", "faiss-cpu>=1.9", "graphframes>=0.8", @@ -229,6 +251,7 @@ dev-dependencies = [ ``` Replace `poetry install` with `uv sync`, `poetry run` with `uv run`, `poetry add` with `uv add`. +This migration is mandatory as Step 1 and must complete before full pipeline implementation. ### 4.3 PySpark 4.1 Features to Leverage @@ -272,9 +295,9 @@ serf/ __init__.py types.py # Fresh Pydantic entity types for DSPy (exists, rewrite) type_generator.py # Auto-generate entity types from DataFrame schemas (NEW) - baml_adapter.py # BAMLAdapter for DSPy (exists) signatures.py # DSPy signatures for ER (NEW) agents.py # DSPy ReAcT agents (NEW) + budget.py # Gemini token/cost budget guard (NEW) block/ __init__.py embeddings.py # Sentence-transformer embedder (NEW) @@ -310,7 +333,6 @@ serf/ test_config.py test_types.py test_type_generator.py - test_baml_adapter.py test_embeddings.py test_faiss_blocker.py test_uuid_mapper.py @@ -574,11 +596,10 @@ For ML engineers building custom ER pipelines with DSPy optimization: import dspy from serf.dspy.signatures import BlockMatch, EntityMerge from serf.dspy.agents import ERAgent -from dspy.adapters.baml_adapter import BAMLAdapter # Configure DSPy lm = dspy.LM("gemini/gemini-2.0-flash", api_key=GEMINI_API_KEY) -dspy.configure(lm=lm, adapter=BAMLAdapter()) +dspy.configure(lm=lm) # Use individual signatures matcher = dspy.ChainOfThought(BlockMatch) @@ -632,7 +653,7 @@ class ERAgent(dspy.Module): The agent has access to tools for each pipeline phase: - `profile_dataset(path)` -- Run dataset profiling -- `create_blocks(path, method, params)` -- Run semantic or name-based blocking +- `create_blocks(path, method, params)` -- Run semantic blocking - `match_blocks(blocks_path, iteration)` -- Run LLM matching on blocks - `evaluate_matches(matches_path, raw_path)` -- Compute quality metrics - `check_convergence(metrics)` -- Decide if another 
round is needed @@ -789,6 +810,22 @@ class BenchmarkDataset: ... ``` +### 8.5 Initial Entity Schemas to Support First + +The first complete implementation should support entities from all three primary benchmark tracks at the type-system and profiling layers from day one: + +1. **DBLP-ACM (bibliographic)**: + - `title`, `authors`, `venue`, `year` + - Entity focus: publication/citation records +2. **Walmart-Amazon (product)**: + - `title`, `category`, `brand`, `modelno`, `price` + - Entity focus: product catalog records across heterogeneous schemas +3. **DBLP-Scholar (bibliographic at scale)**: + - Same core bibliographic fields as DBLP-ACM with larger candidate space + - Entity focus: publication records with significant scale asymmetry + +The profiler and auto type-generation path should identify these fields reliably and generate matching-ready Pydantic entity types without manual schema coding for common benchmark runs. + --- ## 9. Scaling Analysis @@ -818,17 +855,19 @@ Blocking is essential. Without it, LLM-based ER is economically impossible beyon | Claude Opus 4 | $15.00/1M | $75.00/1M | **~$56.25** | | GPT-4o | $2.50/1M | $10.00/1M | **~$8.13** | +The non-Gemini rows are comparative market context only; this build should use Gemini models exclusively per Section 9.6. + **Key insight**: Block-level matching reduces costs by 20-40x compared to pairwise LLM matching. At $0.63 per 10K records with Gemini Flash, SERF can process 1M records for ~$63. ### 9.3 Recommended Cascade Architecture -For production at scale, implement a cost-optimization cascade: +For production at scale, implement a Gemini-only cost-optimization cascade: 1. **Embedding similarity filter** (free): Skip blocks where all pairwise embedding similarities < threshold -2. **Gemini Flash** ($0.10/1M tokens): Process medium-difficulty blocks -3. **Gemini Pro / Claude Sonnet** ($3-10/1M tokens): Process the hardest blocks where Flash confidence is low +2. 
**Gemini 2.0 Flash** ($0.10/$0.40 per 1M input/output tokens): Process all ER pipeline blocks +3. **Gemini 2.5 Pro** ($1.25/$10.00 per 1M input/output tokens): Reserved for limited validation-data generation and GEPA reflection workflows -This cascade can reduce costs by 80%+ while maintaining F1 within 1-2% of full expensive-model coverage. +This policy keeps operational ER on Flash and reserves Pro for bounded quality loops while staying under budget. ### 9.4 Latency and Parallelism @@ -859,6 +898,8 @@ A `GEMINI_API_KEY` environment variable will be provided. The agent must stay wi 4. **Track token usage** by logging input/output token counts from API responses. If cumulative spend approaches $80, stop making Gemini 2.5 Pro calls and finish remaining work with Flash only. +5. **Hard stop at $100**: Before every Gemini API call, estimate worst-case incremental cost from token limits and reject the call if it would push cumulative spend above $100. Persist cumulative usage to a local budget ledger so restarts cannot bypass limits. + | Use Case | Model | Max Calls | Est. Cost | | ------------------------------ | ---------------- | ------------------------- | ---------- | | ER pipeline (match/merge/edge) | Gemini 2.0 Flash | Unlimited (within budget) | ~$10-30 | @@ -873,12 +914,18 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr ### Step 1: Project Infrastructure (30 min) -1. Convert `pyproject.toml` from Poetry to uv (PEP 621 format) +1. Convert `pyproject.toml` from Poetry to uv (PEP 621 format). This must be completed before any major feature build work. 2. Update `.pre-commit-config.yaml` to use Ruff instead of black/isort/flake8 3. Update `config.yml` with ER pipeline paths, iteration templates, model configs, benchmark dataset URLs 4. Create all module directories with `__init__.py` files 5. Update `CLAUDE.md` to reflect new tooling (uv, Ruff) +### Step 1.5: Research Deliverable and Design Freeze (45 min) + +1. 
Update this plan with implementation constraints and benchmark-first priorities. +2. Confirm architecture choices (no BAMLAdapter runtime, PySpark 4.1+, Gemini budget guard). +3. Pause for user instruction before beginning complete implementation. + ### Step 2: Core Type System (1.5 hr) 1. Rewrite `src/serf/dspy/types.py` from scratch with fresh, DSPy-appropriate Pydantic types -- do NOT copy Abzu's BAML-generated types. Build domain-agnostic `Entity` base class with ER metadata (id, uuid, source_ids, source_uuids, match_skip, match_skip_reason, match_skip_history) + example `Company`, `Person` specializations @@ -886,17 +933,18 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr 3. Add `FieldProfile`, `DatasetProfile` types for dataset analysis 4. Add `IterationMetrics`, `BlockingMetrics` TypedDicts 5. Create `src/serf/dspy/type_generator.py` -- `entity_type_from_spark_schema()` that auto-generates a Pydantic Entity subclass from a Spark StructType + DatasetProfile. Maps Spark types to Python types, adds ER metadata fields automatically, generates DSPy field descriptions from profiling results -6. Write exhaustive unit tests: `tests/test_types.py`, `tests/test_type_generator.py` +6. Add benchmark-ready schema mappings for DBLP-ACM, Walmart-Amazon, and DBLP-Scholar entity fields. +7. Write exhaustive unit tests: `tests/test_types.py`, `tests/test_type_generator.py` -### Step 3: DSPy Signatures and Adapter (1 hr) +### Step 3: DSPy Signatures and LM Configuration (1 hr) 1. Create `src/serf/dspy/signatures.py` with DSPy signatures: - `BlockMatch` -- match entire blocks - `EntityMerge` -- merge matched entities - `EdgeResolve` -- merge duplicate edges - `AnalyzeDataset` -- profile and recommend ER strategy -2. Verify `BAMLAdapter` works with new Pydantic types -3. Write unit tests: `tests/test_signatures.py`, `tests/test_baml_adapter.py` +2. Configure DSPy with Gemini model routing and typed output validation (no `BAMLAdapter`). +3. 
Write unit tests: `tests/test_signatures.py` ### Step 4: Embeddings and Blocking (1.5 hr) @@ -962,10 +1010,10 @@ The following ordered steps should be executed by the Cursor Agent. Each step pr 1. Create `tests/test_pipeline_integration.py` -- End-to-end pipeline test with small synthetic dataset 2. Test iterative convergence (3 rounds) 3. Test Iceberg time travel between iterations -4. Test benchmark evaluation on three datasets covering easy, medium, and hard difficulties: +4. Test benchmark evaluation on three datasets, in this order: - **DBLP-ACM** (easy, bibliographic) -- Baseline sanity check, expect ~99% F1 - - **DBLP-Scholar** (medium, bibliographic) -- Tests scale (64K right-side records) and fuzzy matching - - **Walmart-Amazon** (hard, products) -- Tests cross-schema matching with very different field formats and 22K right-side records + - **Walmart-Amazon** (hard, products) -- Cross-schema product matching and robustness check + - **DBLP-Scholar** (medium, bibliographic) -- Scale asymmetry (2.6K vs 64K) and fuzzy matching ### Step 13: Documentation and Cleanup (30 min) @@ -996,10 +1044,10 @@ Every module gets exhaustive unit tests. Tests should: ### 11.2 Integration Tests - **Pipeline test**: Synthetic dataset through block -> match -> eval -> iterate -- **Benchmark tests**: Run against all three benchmark datasets from the start: +- **Benchmark tests**: Run against all three benchmark datasets from the start, in this order: - **DBLP-ACM** (easy) -- Verify basic pipeline correctness and metrics computation - - **DBLP-Scholar** (medium) -- Verify handling of scale asymmetry (2.6K vs 64K records) - **Walmart-Amazon** (hard) -- Verify cross-schema matching with heterogeneous product data + - **DBLP-Scholar** (medium) -- Verify handling of scale asymmetry (2.6K vs 64K records) - **Iceberg test**: Write/read/time-travel with local Iceberg catalog - **GraphFrames test**: Connected components on known graph structure
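
To make rules 4 and 5 of Section 9.6 concrete, the new `budget.py` module could take a shape like the sketch below. This is a minimal illustration only, not the committed design: the `BudgetGuard` class name, the ledger filename, and the method signatures are assumptions, and the prices mirror the per-1M-token figures quoted in Sections 9.2 and 9.6.

```python
import json
from pathlib import Path

# Per-1M-token (input, output) prices, mirroring Section 9.6.
PRICES = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gemini-2.5-pro": (1.25, 10.00),
}


class BudgetGuard:
    """Persist cumulative Gemini spend so restarts cannot bypass the $100 cap."""

    def __init__(self, ledger_path: str = "budget_ledger.json",
                 hard_cap: float = 100.0, pro_cutoff: float = 80.0):
        self.path = Path(ledger_path)
        self.hard_cap = hard_cap      # rule 5: absolute spend ceiling
        self.pro_cutoff = pro_cutoff  # rule 4: stop 2.5 Pro calls near $80
        # Reload cumulative spend from the local ledger if it exists.
        self.spent = (json.loads(self.path.read_text())["spent"]
                      if self.path.exists() else 0.0)

    def _cost(self, model: str, in_tokens: int, out_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

    def check(self, model: str, max_in: int, max_out: int) -> bool:
        """Reject a call whose worst-case cost would cross the hard cap."""
        worst_case = self._cost(model, max_in, max_out)
        if self.spent + worst_case > self.hard_cap:
            return False
        # Rule 4: once spend nears $80, finish with Flash only.
        if model == "gemini-2.5-pro" and self.spent >= self.pro_cutoff:
            return False
        return True

    def record(self, model: str, in_tokens: int, out_tokens: int) -> None:
        """Log actual usage from an API response and persist the ledger."""
        self.spent += self._cost(model, in_tokens, out_tokens)
        self.path.write_text(json.dumps({"spent": self.spent}))
```

A pipeline wrapper would call `check()` with an input-token estimate and the request's output-token limit before each Gemini call, then `record()` with the actual token counts reported in the API response's usage metadata.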