Skip to content

Conversation

@conrabeatriz
Copy link

@conrabeatriz conrabeatriz commented Jan 28, 2026

Summary by Sourcery

Add a new LLM-powered entity disambiguation endpoint and supporting utilities for canonical entity refinement and mapping back into NER predictions.

New Features:

  • Introduce /anonymizer/disambiguatev2 endpoint that combines fuzzy pre-clustering with LLM-based canonical entity curation and role assignment.
  • Add utilities to enrich canonical entities with contextual windows and perform LLM inference with batching and token-limit handling for entity disambiguation.
  • Expose mapping logic to synchronize LLM-refined canonical entities back into document-level NER predictions, including role subclasses.

Enhancements:

  • Adjust CanonicalEntity model by disabling unused relations field for the current LLM disambiguation phase.
  • Extend experimentation notebooks to define prompts, run LLM-based disambiguation, and grid-search hyperparameters for models and prompt configurations.
  • Add reusable prompt templates for person entity disambiguation to support the new LLM endpoint.

…lar to the NER predictions, now enriched with role and entity_id fields where applicable.
… the DocumentAnnotations list with the role and the canonincal_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id
… the DocumentAnnotations list with the role and the canonincal_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id
… the DocumentAnnotations list with the role and the canonincal_entity_id where applicable. When there is a prediction that wasn't mapped the program generates a canonical_entity_id
@sourcery-ai
Copy link

sourcery-ai bot commented Jan 28, 2026

Reviewer's Guide

Adds a new /anonymizer/disambiguatev2 endpoint that uses an LLM-backed inference pipeline to refine canonical entity disambiguation and role assignment, including utilities for token-budgeted batching, context enrichment, and NER prediction remapping, plus supporting prompt templates and experimental notebooks.

Sequence diagram for the new /anonymizer/disambiguatev2 LLM-backed disambiguation endpoint

sequenceDiagram
    actor Client
    participant Router as anonymizer_disambiguate_v2
    participant Utils as anonymizer_utils
    participant LLM as OllamaLLMProvider
    participant HF as AutoTokenizer

    Client->>Router: POST /anonymizer/disambiguatev2
    activate Router

    note over Router: Validate query params (threshold, scorer, processor,
    note over Router: system_prompts, user_prompt_templates,
    note over Router: tokenizer_model, token_limit_frac)

    Router->>Utils: resolve_processor(processor)
    Router->>Utils: build_canonical_entities(labels, target_labels,
    Router-->>Router: canonical_entities_pre_cluster

    loop For each target_label in user_prompt_templates
        Router->>Utils: validate_canonical_entities(canonical_entities_pre_cluster, target_label)
        Router-->>Router: canonical_entities_val

        Router->>Utils: llm_canonical_entities_inference(
        activate Utils
        Utils->>Utils: add_canonical_entities_context(paragraphs,
        Utils-->>Utils: canonical_entities_with_context

        Utils->>HF: from_pretrained(tokenizer_model)
        HF-->>Utils: tokenizer

        loop If decompose_by is None
            Utils->>HF: apply_chat_template + encode
            HF-->>Utils: token_count
            Utils-->>Utils: adjust decompose_by within token_limit
        end

        Utils->>Utils: split canonical_entities_prompt into batches

        loop For each batch
            Utils->>HF: apply_chat_template + encode
            HF-->>Utils: len_tokens
            alt len_tokens <= model_context
                Utils->>LLM: generate(messages, options, format)
                LLM-->>Utils: response.text (JSON CanonicalEntities)
                Utils-->>Utils: parse canonical_entities from response
            else len_tokens > model_context
                Utils-->>Utils: skip batch
            end
        end

        Utils->>Utils: validate_canonical_entities(all_raw_outputs)
        Utils-->>Router: {
        deactivate Utils

        Router-->>Router: append canonical_entities_llm for target_label
    end

    Router->>Utils: map_canonical_entities_NER_preds(paragraphs, canonical_entities_llm)
    Utils-->>Router: predictions_llm

    Router-->>Client: DocumentAnnotations(data=predictions_llm)
    deactivate Router
Loading

Class diagram for updated canonical entity and document annotation models

classDiagram
    class CanonicalEntity {
        +UUID entity_id
        +str aymurai_label
        +str canonical_text
        +list~str~ aliases
        +dict~str, Any~ attributes
        +validate_entity_id() CanonicalEntity
    }

    class CanonicalEntities {
        +list~CanonicalEntity~ canonical_entities
    }

    class DocumentInformation {
        +str document
        +list~DocLabel~ labels
    }

    class DocLabel {
        +Any attrs
        +int start_char
        +int end_char
        +str text
    }

    class DocumentAnnotations {
        +list~DocumentInformation~ data
    }

    CanonicalEntities "1" --> "*" CanonicalEntity
    DocumentAnnotations "1" --> "*" DocumentInformation
    DocumentInformation "1" --> "*" DocLabel
    CanonicalEntity "1" --> "0..*" DocLabel : linked via
    CanonicalEntity --> DocumentAnnotations : used to enrich
Loading

File-Level Changes

Change Details Files
Introduce LLM-based canonical entity disambiguation utilities and mapping logic reused by the new endpoint.
  • Extend anonymizer utils imports to include DocumentInformation, CanonicalEntities, JSON pretty-print helper, and Ollama LLM provider.
  • Add validate_canonical_entities helper to normalize/validate CanonicalEntity objects and expose hex entity_id.
  • Add add_canonical_entities_context to attach per-entity context windows from paragraph predictions, with optional label filtering and window sizing.
  • Add get_model_tokens to compute token counts for a chat prompt using a configurable tokenizer model.
  • Add llm_canonical_entities_inference to batch canonical entities, respect model context/token limits, invoke Ollama with structured JSON schema, and aggregate LLM-refined canonical entities and prompts.
  • Add map_canonical_entities_NER_preds to sync LLM canonical entities with NER predictions by setting canonical_entity_id and label_subclass (role) or generating new IDs when no match exists.
  • Comment out unused CanonicalEntity.relations to simplify the model for the LLM phase.
aymurai/api/endpoints/routers/anonymizer/utils.py
aymurai/meta/entities.py
Expose a new /anonymizer/disambiguatev2 API endpoint that orchestrates fuzzy pre-clustering, LLM refinement per label, and mapping back into DocumentAnnotations.
  • Define anonymizer_disambiguate_v2 FastAPI route accepting paragraph-level predictions, per-label system and user prompts, fuzzy clustering parameters, LLM model and tokenizer parameters, and batching controls.
  • Validate query/body parameters including similarity threshold bounds, supported scorer/processor, presence of prompts, tokenizer model, and token_limit_frac range.
  • Reuse existing fuzzy clustering (build_canonical_entities) to build pre-clustered CanonicalEntity list filtered by optional target_labels.
  • Iterate over provided user_prompt_templates by label, invoke llm_canonical_entities_inference for each label with corresponding system prompt, and accumulate LLM canonical entities across labels.
  • Apply map_canonical_entities_NER_preds to propagate canonical_entity_id and role subclasses back into the DocumentInformation predictions and return them wrapped in DocumentAnnotations.
aymurai/api/endpoints/routers/anonymizer/anonymizer.py
Add reusable prompt templates for person-role and direction disambiguation and wire them into experiments and environment configuration.
  • Create prompt_templates.py with Spanish system and user prompts for PER-role assignment and canonical cleanup.
  • Introduce extensive prompt sets and templates for PER and DIRECCION experiments within new LLM disambiguation notebook, plus utilities for token counting, context enrichment, LLM-based grid search, and evaluation.
  • Update smoke-test notebook to support the v2 disambiguation endpoint, PER and DIRECCION labels, and LLM/ tokenizer configuration via environment variables.
  • Add disambiguate_v2 client helper in the smoke notebook sending paragraphs, prompt dicts, and LLM parameters to /anonymizer/disambiguatev2, and adjust document iteration slices and runtime config.
  • Change default API_BASE_URL and data paths in notebooks to new ports/directories and expand target labels to include DIRECCION.
aymurai/api/endpoints/routers/anonymizer/prompt_templates.py
notebooks/experiments/entity-disambiguation/07-LLM-entities-disambiguation.ipynb
notebooks/experiments/entity-disambiguation/08-disambiguation-endpoint-smoke.ipynb

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • In utils.validate_canonical_entities the type hints say it returns list[CanonicalEntity], but it actually returns a list of dicts (model_dump() plus entity_id); either adjust the annotations/return type or avoid model_dump and return actual models to keep the contract consistent.
  • get_model_tokens in utils.py uses AutoTokenizer but the module never imports it (unlike the notebook); add the appropriate from transformers import AutoTokenizer import or wire the tokenizer in via dependency injection to avoid runtime errors.
  • llm_canonical_entities_inference is annotated to return a dict but may return None when even one entity exceeds the token limit; consider making the return type dict | None (and handling None at the call site) or raising an exception instead, so the behavior is explicit and type-safe.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `utils.validate_canonical_entities` the type hints say it returns `list[CanonicalEntity]`, but it actually returns a list of dicts (`model_dump()` plus `entity_id`); either adjust the annotations/return type or avoid `model_dump` and return actual models to keep the contract consistent.
- `get_model_tokens` in `utils.py` uses `AutoTokenizer` but the module never imports it (unlike the notebook); add the appropriate `from transformers import AutoTokenizer` import or wire the tokenizer in via dependency injection to avoid runtime errors.
- `llm_canonical_entities_inference` is annotated to return a `dict` but may return `None` when even one entity exceeds the token limit; consider making the return type `dict | None` (and handling `None` at the call site) or raising an exception instead, so the behavior is explicit and type-safe.

## Individual Comments

### Comment 1
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:240-242` </location>
<code_context>
     return canonical_entities
+
+
+def validate_canonical_entities(
+    canonical_entities_raw: list[CanonicalEntity], target_label: str | None = None
+) -> list[CanonicalEntity]:
+
+    if target_label:
</code_context>

<issue_to_address>
**issue (bug_risk):** The return type annotation of `validate_canonical_entities` does not match its actual return value (dicts vs `CanonicalEntity` instances).

Currently the function converts inputs to `CanonicalEntity`, immediately `model_dump()`s them, adds an `entity_id` hex, and returns those dicts. That means the real return type is `list[dict]`, not `list[CanonicalEntity]`. Please either (a) return `CanonicalEntity` instances and do `model_dump`/`entity_id` decoration at the call sites that need dicts, or (b) change the annotation to something like `list[dict[str, Any]]` (or a named alias) so it matches the actual output and doesn’t mislead callers or static analysis.
</issue_to_address>

### Comment 2
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:482-491` </location>
<code_context>
+    new_ids_map = {}
</code_context>

<issue_to_address>
**issue (bug_risk):** `map_canonical_entities_NER_preds` calls `uuid.uuid4` but `uuid` is not imported in this module.

This will raise a `NameError` when that code path runs. Please add the missing `uuid` import in this module, or route ID generation through a helper that already handles `uuid` usage.
</issue_to_address>

### Comment 3
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:407-408` </location>
<code_context>
+
+            current_batch_size -= 1
+
+        if not decompose_by:  # Safety check if even 1 entity is too large
+            return None
+
+    # 4. Prepare batches. If decompose_by is 0 or less, we put all entities in one single list (one batch)
</code_context>

<issue_to_address>
**issue (bug_risk):** `llm_canonical_entities_inference` can return `None` despite being annotated as returning `dict`, and the caller assumes a dict.

When `decompose_by` is `None` and even a single-entity batch exceeds the token limit, this branch returns `None`, but the function is annotated as `-> dict` and the caller in `anonymizer_disambiguate_v2` immediately does `llm_response_dict["canonical_entities_llm"]`. That will raise when `None` is returned. Please either (1) always return a dict (with an explicit status/empty list) or (2) update the type to allow `None` and handle that case explicitly at the call site (e.g., skip the label or raise a controlled HTTP error).
</issue_to_address>

### Comment 4
<location> `aymurai/api/endpoints/routers/anonymizer/anonymizer.py:287-294` </location>
<code_context>
+
+    canonical_entities_llm = []
+
+    for target_label, user_prompt_template in user_prompt_templates.items():
+        system_prompt = system_prompts[target_label]
+
+        canonical_entities_val = validate_canonical_entities(
+            canonical_entities_raw=canonical_entities, target_label=target_label
+        )
+
+        llm_response_dict = llm_canonical_entities_inference(
+            paragraphs=paragraphs,
+            canonical_entities_pre_cluster=canonical_entities_val,
</code_context>

<issue_to_address>
**issue (bug_risk):** The loop over `user_prompt_templates` does not guard against empty canonical-entity sets, which can trigger the `None` return path in `llm_canonical_entities_inference`.

You’re calling `validate_canonical_entities(..., target_label=target_label)` and `llm_canonical_entities_inference` for every label, even when `canonical_entities_val` is empty. In that case, `llm_canonical_entities_inference` can return `None` (e.g., when `decompose_by` is `None`), causing `for ce in llm_response_dict["canonical_entities_llm"]` to fail. Please either skip labels with empty `canonical_entities_val` or explicitly handle a `None`/empty response before iterating or appending, especially for labels present in `system_prompts` / `user_prompt_templates` but not in `target_labels` or predictions.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +482 to +491
new_ids_map = {}

for document in predictions_llm:
if not document.labels:
continue

for label in document.labels:
label_text = label.attrs.aymurai_alt_text
if (
label.attrs.canonical_entity_id is None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): map_canonical_entities_NER_preds calls uuid.uuid4 but uuid is not imported in this module.

This will raise a NameError when that code path runs. Please add the missing uuid import in this module, or route ID generation through a helper that already handles uuid usage.

Comment on lines +407 to +408
if not decompose_by: # Safety check if even 1 entity is too large
return None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): llm_canonical_entities_inference can return None despite being annotated as returning dict, and the caller assumes a dict.

When decompose_by is None and even a single-entity batch exceeds the token limit, this branch returns None, but the function is annotated as -> dict and the caller in anonymizer_disambiguate_v2 immediately does llm_response_dict["canonical_entities_llm"]. That will raise when None is returned. Please either (1) always return a dict (with an explicit status/empty list) or (2) update the type to allow None and handle that case explicitly at the call site (e.g., skip the label or raise a controlled HTTP error).

Comment on lines +287 to +294
for target_label, user_prompt_template in user_prompt_templates.items():
system_prompt = system_prompts[target_label]

canonical_entities_val = validate_canonical_entities(
canonical_entities_raw=canonical_entities, target_label=target_label
)

llm_response_dict = llm_canonical_entities_inference(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): The loop over user_prompt_templates does not guard against empty canonical-entity sets, which can trigger the None return path in llm_canonical_entities_inference.

You’re calling validate_canonical_entities(..., target_label=target_label) and llm_canonical_entities_inference for every label, even when canonical_entities_val is empty. In that case, llm_canonical_entities_inference can return None (e.g., when decompose_by is None), causing for ce in llm_response_dict["canonical_entities_llm"] to fail. Please either skip labels with empty canonical_entities_val or explicitly handle a None/empty response before iterating or appending, especially for labels present in system_prompts / user_prompt_templates but not in target_labels or predictions.

token_limit_frac: float = 2 / 3,
temperature: int = 0,
decompose_by: int | None = 0,
) -> dict:
Copy link
Contributor

@jedzill4 jedzill4 Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about this function:

  • avoid returning dict objects, it would be better for manteneability to use pydantic objects (if validation is required) or dataclasses instead
  • use google format for doctrings
  • avoid adding unnecesary llm coments overexplaning the code. If its necessary to explain a block of code its better to add a comment as a note. i.e # NOTE: this block of code do ...

Comment on lines +12 to 13
API_BASE_URL=http://localhost:8000
REQUEST_TIMEOUT=300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be aware that this changes over the .env.common affect everyone

return canonical_entities_val


def add_canonical_entities_context(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • too much complexity of this function. too much for loops & if statements, if they are really necessary, extract the code and create a helper function that run the code inside the loop/statement
  • avoid returning dict. use instead pydantic basemodels (if validation is required) or dataclasses


# MARK: Disambiguate
@router.post("/disambiguatev2", response_model=DocumentAnnotations)
async def anonymizer_disambiguate_v2(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • just use one version, we cant have 2 endpoints. You can add a new query paramater named backend that can be a Literal["fuzzyregex", "llm"] to manage which route should be taken
  • system_prompt/user_prompt should be optional. by default should load the promnts on prompt_templates
  • scorer/processor looks like choices ? there are any other options different to "ligth_normalizer" ? if so, use Literal["ligth_normalizer", ...] or Enums to declare literal options (if they are 2+ i would prefer Enums). If not, remove parameter

if threshold < 0 or threshold > 100:
raise HTTPException(status_code=400, detail="threshold must be 0-100.")

scorer_fn = SCORER_MAP.get(scorer.lower())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now i see. there are other options. So yes, instead of validating manually, use Literal or Enums. specially in this case I would use Enums

from enum import StrEnum, auto

class Scorer(StrEnum):
    ratio = auto()
    partial_ratio = auto()
    ...

_SCORER_MAPPER = {
    'ratio': ratio_fn,
    ...
}

def get_scorer(name: str) -> Callable[...]:
    # This would raise an exception if scorer is wrong, we like fast fails instead of try/excepts. Also the endpoint already handle that the option exists 
    return _SCORER_MAPPER[name]  
from aymurai.endpoints.router.anonimizer.utils import Scorer, get_scorer

async def anonymizer_disambiguate_v2(
    ...
    scorer: Scorer = Query(...),
    ...
)
    scorer_fn = get_scorer(scorer)

Also, we will support all those alternative scorers ? remember this is a production endpoint. I like flexibility but we need to think on what we also can mantain

if scorer_fn is None:
raise HTTPException(status_code=400, detail=f"Unsupported scorer: {scorer}")

if processor.lower() not in PROCESSOR_MAP:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

)
processor_fn = resolve_processor(processor)

for label, prompt in system_prompts.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system_prompt needs to be a pydantic base model instead of dict, so we can handle the validation automatically

from pydantic import BaseModel

class NamedPrompt(BaseModel):
    label: str  # Actually, we should expose the possible labels instead of just let the user handle it as str
    prompt: str

class NamedPrompts(RootModel):
    _root: list[NamedPrompt] = Field(default_factory=list)

    @cached_property
    def as_dict(self) -> dict[str, NamedPrompt]:
        return {p.label: p for p in self._root}

    def get(self, label: str) -> NamedPrompt | None:
        return self.as_dict.get(label)

def get_default_prompt(label: str) -> str:
    <load the default prompt by label from a file (i would suggest use .yaml instead of .py, but its a matter of choice)>
    return prompt

then

async def anonymizer_disambiguate_v2(  
    ...
    # be aware that we are using plural so its more readable what its the object
    system_prompts: NamedPrompts = Query(default_factory=NamedPrompts, ...)
    user_template_prompts: NamedPrompts = Query(default_factory=NamedPrompts, ...)
    ...
):
    ...
    system_prompts = [
        system_prompts.get(label) or get_default_prompt(label)
        for label in Labels
    ]

    # Same for user_template_prompt

Note: this all is just psudo code

detail=f"user_prompt_template for label {label} doesn't exist.",
)

if not tokenizer_model:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why one would use a tokenizer different than the model ? its a needed paramater ?

"microsoft/phi-4",
description="Tokenizer instance to get the tokens of our prompt",
),
decompose_by: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is the decompose_by parameter? its needed for the user ?

context_window_length: Optional[int] = Query(
120, description="Length of context window. Use None for full paragraph."
),
token_limit_frac: float = Query(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the user really needs to be able to change this parameter ? if not, can we make it an env variable instead ? define it as Setting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants