Endpoint /disambiguate with LLM Inference #72
base: release/v2.0.0
Conversation
…int and its own name
…lar to the NER predictions, now enriched with role and entity_id fields where applicable.
… the DocumentAnnotations list with the role and the canonical_entity_id where applicable. When there is a prediction that wasn't mapped, the program generates a canonical_entity_id
Reviewer's Guide

Adds a new /anonymizer/disambiguatev2 endpoint that uses an LLM-backed inference pipeline to refine canonical entity disambiguation and role assignment, including utilities for token-budgeted batching, context enrichment, and NER prediction remapping, plus supporting prompt templates and experimental notebooks.

Sequence diagram for the new /anonymizer/disambiguatev2 LLM-backed disambiguation endpoint

```mermaid
sequenceDiagram
actor Client
participant Router as anonymizer_disambiguate_v2
participant Utils as anonymizer_utils
participant LLM as OllamaLLMProvider
participant HF as AutoTokenizer
Client->>Router: POST /anonymizer/disambiguatev2
activate Router
note over Router: Validate query params (threshold, scorer, processor,<br/>system_prompts, user_prompt_templates,<br/>tokenizer_model, token_limit_frac)
Router->>Utils: resolve_processor(processor)
Router->>Utils: build_canonical_entities(labels, target_labels, ...)
Router-->>Router: canonical_entities_pre_cluster
loop For each target_label in user_prompt_templates
Router->>Utils: validate_canonical_entities(canonical_entities_pre_cluster, target_label)
Router-->>Router: canonical_entities_val
Router->>Utils: llm_canonical_entities_inference(...)
activate Utils
Utils->>Utils: add_canonical_entities_context(paragraphs, ...)
Utils-->>Utils: canonical_entities_with_context
Utils->>HF: from_pretrained(tokenizer_model)
HF-->>Utils: tokenizer
opt decompose_by is None
Utils->>HF: apply_chat_template + encode
HF-->>Utils: token_count
Utils-->>Utils: adjust decompose_by within token_limit
end
Utils->>Utils: split canonical_entities_prompt into batches
loop For each batch
Utils->>HF: apply_chat_template + encode
HF-->>Utils: len_tokens
alt len_tokens <= model_context
Utils->>LLM: generate(messages, options, format)
LLM-->>Utils: response.text (JSON CanonicalEntities)
Utils-->>Utils: parse canonical_entities from response
else len_tokens > model_context
Utils-->>Utils: skip batch
end
end
Utils->>Utils: validate_canonical_entities(all_raw_outputs)
Utils-->>Router: {canonical_entities_llm: ...}
deactivate Utils
Router-->>Router: append canonical_entities_llm for target_label
end
Router->>Utils: map_canonical_entities_NER_preds(paragraphs, canonical_entities_llm)
Utils-->>Router: predictions_llm
Router-->>Client: DocumentAnnotations(data=predictions_llm)
deactivate Router
```
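For orientation, here is a minimal sketch of the token-budget check the batching loop above performs, assuming a HuggingFace tokenizer. The `model_context` value and the prompt strings are placeholders; `microsoft/phi-4` and the `2 / 3` fraction come from the endpoint's own defaults.

```python
from transformers import AutoTokenizer


def count_prompt_tokens(tokenizer, system_prompt: str, user_prompt: str) -> int:
    """Render the chat template exactly as the LLM provider would, then count tokens."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    rendered = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return len(tokenizer.encode(rendered))


tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model_context = 16_384  # assumed context size of the serving model
token_limit = int(model_context * 2 / 3)  # token_limit_frac default from the endpoint

if count_prompt_tokens(tokenizer, "You are...", "Disambiguate...") > token_limit:
    pass  # shrink decompose_by or skip the batch, as in the diagram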
Class diagram for updated canonical entity and document annotation models

```mermaid
classDiagram
class CanonicalEntity {
+UUID entity_id
+str aymurai_label
+str canonical_text
+list~str~ aliases
+dict~str, Any~ attributes
+validate_entity_id() CanonicalEntity
}
class CanonicalEntities {
+list~CanonicalEntity~ canonical_entities
}
class DocumentInformation {
+str document
+list~DocLabel~ labels
}
class DocLabel {
+Any attrs
+int start_char
+int end_char
+str text
}
class DocumentAnnotations {
+list~DocumentInformation~ data
}
CanonicalEntities "1" --> "*" CanonicalEntity
DocumentAnnotations "1" --> "*" DocumentInformation
DocumentInformation "1" --> "*" DocLabel
CanonicalEntity "1" --> "0..*" DocLabel : linked via
CanonicalEntity --> DocumentAnnotations : used to enrich
```
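A rough pydantic rendering of this diagram may help when reading the sequence diagram above. Field names and types come from the diagram; the defaults and the body of `validate_entity_id` are assumptions.

```python
import uuid
from typing import Any

from pydantic import BaseModel, Field, model_validator


class CanonicalEntity(BaseModel):
    entity_id: uuid.UUID = Field(default_factory=uuid.uuid4)
    aymurai_label: str
    canonical_text: str
    aliases: list[str] = Field(default_factory=list)
    attributes: dict[str, Any] = Field(default_factory=dict)

    @model_validator(mode="after")
    def validate_entity_id(self) -> "CanonicalEntity":
        # Placeholder: the diagram only names this validator; its real logic isn't shown.
        return self


class CanonicalEntities(BaseModel):
    canonical_entities: list[CanonicalEntity] = Field(default_factory=list)


class DocLabel(BaseModel):
    attrs: Any = None
    start_char: int
    end_char: int
    text: str


class DocumentInformation(BaseModel):
    document: str
    labels: list[DocLabel] = Field(default_factory=list)


class DocumentAnnotations(BaseModel):
    data: list[DocumentInformation] = Field(default_factory=list)
```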
Hey - I've found 4 issues, and left some high-level feedback:

- In `utils.validate_canonical_entities` the type hints say it returns `list[CanonicalEntity]`, but it actually returns a list of dicts (`model_dump()` plus `entity_id`); either adjust the annotations/return type or avoid `model_dump` and return actual models to keep the contract consistent.
- `get_model_tokens` in `utils.py` uses `AutoTokenizer` but the module never imports it (unlike the notebook); add the appropriate `from transformers import AutoTokenizer` import or wire the tokenizer in via dependency injection to avoid runtime errors.
- `llm_canonical_entities_inference` is annotated to return a `dict` but may return `None` when even one entity exceeds the token limit; consider making the return type `dict | None` (and handling `None` at the call site) or raising an exception instead, so the behavior is explicit and type-safe.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `utils.validate_canonical_entities` the type hints say it returns `list[CanonicalEntity]`, but it actually returns a list of dicts (`model_dump()` plus `entity_id`); either adjust the annotations/return type or avoid `model_dump` and return actual models to keep the contract consistent.
- `get_model_tokens` in `utils.py` uses `AutoTokenizer` but the module never imports it (unlike the notebook); add the appropriate `from transformers import AutoTokenizer` import or wire the tokenizer in via dependency injection to avoid runtime errors.
- `llm_canonical_entities_inference` is annotated to return a `dict` but may return `None` when even one entity exceeds the token limit; consider making the return type `dict | None` (and handling `None` at the call site) or raising an exception instead, so the behavior is explicit and type-safe.
## Individual Comments
### Comment 1
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:240-242` </location>
<code_context>
return canonical_entities
+
+
+def validate_canonical_entities(
+ canonical_entities_raw: list[CanonicalEntity], target_label: str | None = None
+) -> list[CanonicalEntity]:
+
+ if target_label:
</code_context>
<issue_to_address>
**issue (bug_risk):** The return type annotation of `validate_canonical_entities` does not match its actual return value (dicts vs `CanonicalEntity` instances).
Currently the function converts inputs to `CanonicalEntity`, immediately `model_dump()`s them, adds an `entity_id` hex, and returns those dicts. That means the real return type is `list[dict]`, not `list[CanonicalEntity]`. Please either (a) return `CanonicalEntity` instances and do `model_dump`/`entity_id` decoration at the call sites that need dicts, or (b) change the annotation to something like `list[dict[str, Any]]` (or a named alias) so it matches the actual output and doesn’t mislead callers or static analysis.
</issue_to_address>
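A minimal sketch of option (a), keeping the annotated contract. It assumes the pydantic `CanonicalEntity` model from the class diagram, that `target_label` filters on `aymurai_label`, and a hypothetical import path.

```python
from aymurai.api.endpoints.routers.anonymizer.schemas import CanonicalEntity  # assumed path


def validate_canonical_entities(
    canonical_entities_raw: list[CanonicalEntity],
    target_label: str | None = None,
) -> list[CanonicalEntity]:
    # Coerce raw inputs into models; pydantic raises on invalid entries.
    entities = [CanonicalEntity.model_validate(e) for e in canonical_entities_raw]
    if target_label:
        entities = [e for e in entities if e.aymurai_label == target_label]
    return entities


# Call sites that need plain dicts do the dumping themselves:
# rows = [e.model_dump() | {"entity_id": e.entity_id.hex} for e in entities]
```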
### Comment 2
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:482-491` </location>
<code_context>
+ new_ids_map = {}
</code_context>
<issue_to_address>
**issue (bug_risk):** `map_canonical_entities_NER_preds` calls `uuid.uuid4` but `uuid` is not imported in this module.
This will raise a `NameError` when that code path runs. Please add the missing `uuid` import in this module, or route ID generation through a helper that already handles `uuid` usage.
</issue_to_address>
### Comment 3
<location> `aymurai/api/endpoints/routers/anonymizer/utils.py:407-408` </location>
<code_context>
+
+ current_batch_size -= 1
+
+ if not decompose_by: # Safety check if even 1 entity is too large
+ return None
+
+ # 4. Prepare batches. If decompose_by is 0 or less, we put all entities in one single list (one batch)
</code_context>
<issue_to_address>
**issue (bug_risk):** `llm_canonical_entities_inference` can return `None` despite being annotated as returning `dict`, and the caller assumes a dict.
When `decompose_by` is `None` and even a single-entity batch exceeds the token limit, this branch returns `None`, but the function is annotated as `-> dict` and the caller in `anonymizer_disambiguate_v2` immediately does `llm_response_dict["canonical_entities_llm"]`. That will raise when `None` is returned. Please either (1) always return a dict (with an explicit status/empty list) or (2) update the type to allow `None` and handle that case explicitly at the call site (e.g., skip the label or raise a controlled HTTP error).
</issue_to_address>
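A sketch of option (1), returning an explicit empty payload instead of `None`. Only `canonical_entities_llm` appears in the PR; the `status` key is a hypothetical addition.

```python
from typing import Any


def empty_inference_result(reason: str) -> dict[str, Any]:
    """Explicit 'nothing inferred' payload, so callers can always index the dict."""
    return {"canonical_entities_llm": [], "status": reason}


# Inside llm_canonical_entities_inference, the safety check becomes:
# if not decompose_by:  # even 1 entity is too large
#     return empty_inference_result("token_limit_exceeded")
```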
### Comment 4
<location> `aymurai/api/endpoints/routers/anonymizer/anonymizer.py:287-294` </location>
<code_context>
+
+ canonical_entities_llm = []
+
+ for target_label, user_prompt_template in user_prompt_templates.items():
+ system_prompt = system_prompts[target_label]
+
+ canonical_entities_val = validate_canonical_entities(
+ canonical_entities_raw=canonical_entities, target_label=target_label
+ )
+
+ llm_response_dict = llm_canonical_entities_inference(
+ paragraphs=paragraphs,
+ canonical_entities_pre_cluster=canonical_entities_val,
</code_context>
<issue_to_address>
**issue (bug_risk):** The loop over `user_prompt_templates` does not guard against empty canonical-entity sets, which can trigger the `None` return path in `llm_canonical_entities_inference`.
You’re calling `validate_canonical_entities(..., target_label=target_label)` and `llm_canonical_entities_inference` for every label, even when `canonical_entities_val` is empty. In that case, `llm_canonical_entities_inference` can return `None` (e.g., when `decompose_by` is `None`), causing `for ce in llm_response_dict["canonical_entities_llm"]` to fail. Please either skip labels with empty `canonical_entities_val` or explicitly handle a `None`/empty response before iterating or appending, especially for labels present in `system_prompts` / `user_prompt_templates` but not in `target_labels` or predictions.
</issue_to_address>
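A sketch of the two suggested guards, spliced into the caller's loop. Names mirror the `<code_context>` snippet above; everything around the loop is elided, and whether the results are appended or extended is an assumption.

```python
for target_label, user_prompt_template in user_prompt_templates.items():
    system_prompt = system_prompts[target_label]

    canonical_entities_val = validate_canonical_entities(
        canonical_entities_raw=canonical_entities, target_label=target_label
    )
    if not canonical_entities_val:
        continue  # guard: nothing to disambiguate for this label

    llm_response_dict = llm_canonical_entities_inference(
        paragraphs=paragraphs,
        canonical_entities_pre_cluster=canonical_entities_val,
    )
    if llm_response_dict is None:
        continue  # guard: batching could not fit the token budget

    canonical_entities_llm.extend(llm_response_dict["canonical_entities_llm"])
```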
```python
new_ids_map = {}

for document in predictions_llm:
    if not document.labels:
        continue

    for label in document.labels:
        label_text = label.attrs.aymurai_alt_text
        if (
            label.attrs.canonical_entity_id is None
```
issue (bug_risk): `map_canonical_entities_NER_preds` calls `uuid.uuid4` but `uuid` is not imported in this module.
This will raise a `NameError` when that code path runs. Please add the missing `uuid` import in this module, or route ID generation through a helper that already handles `uuid` usage.
```python
if not decompose_by:  # Safety check if even 1 entity is too large
    return None
```
issue (bug_risk): `llm_canonical_entities_inference` can return `None` despite being annotated as returning `dict`, and the caller assumes a dict.
When `decompose_by` is `None` and even a single-entity batch exceeds the token limit, this branch returns `None`, but the function is annotated as `-> dict` and the caller in `anonymizer_disambiguate_v2` immediately does `llm_response_dict["canonical_entities_llm"]`. That will raise when `None` is returned. Please either (1) always return a dict (with an explicit status/empty list) or (2) update the type to allow `None` and handle that case explicitly at the call site (e.g., skip the label or raise a controlled HTTP error).
```python
for target_label, user_prompt_template in user_prompt_templates.items():
    system_prompt = system_prompts[target_label]

    canonical_entities_val = validate_canonical_entities(
        canonical_entities_raw=canonical_entities, target_label=target_label
    )

    llm_response_dict = llm_canonical_entities_inference(
```
issue (bug_risk): The loop over `user_prompt_templates` does not guard against empty canonical-entity sets, which can trigger the `None` return path in `llm_canonical_entities_inference`.
You're calling `validate_canonical_entities(..., target_label=target_label)` and `llm_canonical_entities_inference` for every label, even when `canonical_entities_val` is empty. In that case, `llm_canonical_entities_inference` can return `None` (e.g., when `decompose_by` is `None`), causing `for ce in llm_response_dict["canonical_entities_llm"]` to fail. Please either skip labels with empty `canonical_entities_val` or explicitly handle a `None`/empty response before iterating or appending, especially for labels present in `system_prompts` / `user_prompt_templates` but not in `target_labels` or predictions.
```python
    token_limit_frac: float = 2 / 3,
    temperature: int = 0,
    decompose_by: int | None = 0,
) -> dict:
```
about this function:
- avoid returning dict objects; it would be better for maintainability to use pydantic objects (if validation is required) or dataclasses instead
- use Google format for docstrings
- avoid adding unnecessary LLM comments overexplaining the code. If it's necessary to explain a block of code, it's better to add the comment as a note, e.g.
# NOTE: this block of code does ...
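For instance, a dataclass result along the lines the reviewer suggests might look like this; fields beyond `canonical_entities_llm` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LLMInferenceResult:
    canonical_entities_llm: list[dict[str, Any]] = field(default_factory=list)
    skipped_batches: int = 0  # hypothetical bookkeeping field


# The function returns LLMInferenceResult(...) and callers use
# result.canonical_entities_llm, which is always a list.
```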
```
API_BASE_URL=http://localhost:8000
REQUEST_TIMEOUT=300
```
be aware that these changes to `.env.common` affect everyone
```python
    return canonical_entities_val


def add_canonical_entities_context(
```
- too much complexity in this function: too many for loops & if statements. If they are really necessary, extract the code into a helper function that runs the code inside the loop/statement
- avoid returning dict; use pydantic base models instead (if validation is required) or dataclasses
```python
# MARK: Disambiguate
@router.post("/disambiguatev2", response_model=DocumentAnnotations)
async def anonymizer_disambiguate_v2(
```
- just use one version, we can't have 2 endpoints. You can add a new query parameter named `backend` that can be a `Literal["fuzzyregex", "llm"]` to manage which route should be taken
- system_prompt/user_prompt should be optional; by default they should load the prompts in prompt_templates
- scorer/processor look like choices? Are there any other options besides "ligth_normalizer"? If so, use `Literal["ligth_normalizer", ...]` or Enums to declare the literal options (if there are 2+ I would prefer Enums). If not, remove the parameter
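A sketch of the single-endpoint suggestion; `router`, the request model, and the two handler helpers are assumptions about the surrounding code.

```python
from typing import Literal

from fastapi import APIRouter, Query

router = APIRouter()


@router.post("/disambiguate", response_model=DocumentAnnotations)
async def anonymizer_disambiguate(
    document: DocumentAnnotations,
    backend: Literal["fuzzyregex", "llm"] = Query("fuzzyregex"),
) -> DocumentAnnotations:
    # Dispatch to the existing fuzzy/regex pipeline or the new LLM pipeline.
    if backend == "llm":
        return await _disambiguate_llm(document)  # hypothetical helper
    return await _disambiguate_fuzzyregex(document)  # hypothetical helper
```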
```python
if threshold < 0 or threshold > 100:
    raise HTTPException(status_code=400, detail="threshold must be 0-100.")

scorer_fn = SCORER_MAP.get(scorer.lower())
```
Now I see, there are other options. So yes, instead of validating manually, use Literal or Enums; especially in this case I would use Enums:

```python
from enum import StrEnum, auto


class Scorer(StrEnum):
    ratio = auto()
    partial_ratio = auto()
    ...


_SCORER_MAPPER = {
    "ratio": ratio_fn,
    ...
}


def get_scorer(name: str) -> Callable[...]:
    # This would raise an exception if the scorer is wrong; we like fast fails
    # instead of try/excepts. Also, the endpoint already ensures the option exists.
    return _SCORER_MAPPER[name]
```

```python
from aymurai.endpoints.router.anonimizer.utils import Scorer, get_scorer


async def anonymizer_disambiguate_v2(
    ...
    scorer: Scorer = Query(...),
    ...
):
    scorer_fn = get_scorer(scorer)
```

Also, will we support all those alternative scorers? Remember this is a production endpoint. I like flexibility, but we need to think about what we can also maintain.
```python
if scorer_fn is None:
    raise HTTPException(status_code=400, detail=f"Unsupported scorer: {scorer}")

if processor.lower() not in PROCESSOR_MAP:
```
same as above
```python
)
processor_fn = resolve_processor(processor)

for label, prompt in system_prompts.items():
```
system_prompt needs to be a pydantic base model instead of a dict, so we can handle the validation automatically:

```python
from pydantic import BaseModel


class NamedPrompt(BaseModel):
    label: str  # Actually, we should expose the possible labels instead of just letting the user handle it as str
    prompt: str


class NamedPrompts(RootModel):
    _root: list[NamedPrompt] = Field(default_factory=list)

    @cached_property
    def as_dict(self) -> dict[str, NamedPrompt]:
        return {p.label: p for p in self._root}

    def get(self, label: str) -> NamedPrompt | None:
        return self.as_dict.get(label)


def get_default_prompt(label: str) -> str:
    # <load the default prompt by label from a file (I would suggest .yaml instead of .py, but it's a matter of choice)>
    return prompt
```

then

```python
async def anonymizer_disambiguate_v2(
    ...
    # be aware that we are using plural, so it's more readable what the object is
    system_prompts: NamedPrompts = Query(default_factory=NamedPrompts, ...),
    user_template_prompts: NamedPrompts = Query(default_factory=NamedPrompts, ...),
    ...
):
    ...
    system_prompts = [
        system_prompts.get(label) or get_default_prompt(label)
        for label in Labels
    ]
    # Same for user_template_prompt
```

Note: this is all just pseudocode.
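For reference, a runnable variant of that sketch under pydantic v2, where `RootModel` exposes the payload as `root` (not a private `_root`) and the lookup can stay a plain method; the `"PER"` label is illustrative.

```python
from pydantic import BaseModel, Field, RootModel


class NamedPrompt(BaseModel):
    label: str
    prompt: str


class NamedPrompts(RootModel[list[NamedPrompt]]):
    root: list[NamedPrompt] = Field(default_factory=list)

    def as_dict(self) -> dict[str, NamedPrompt]:
        return {p.label: p for p in self.root}

    def get(self, label: str) -> NamedPrompt | None:
        return self.as_dict().get(label)


prompts = NamedPrompts.model_validate([{"label": "PER", "prompt": "Disambiguate people..."}])
assert prompts.get("PER") is not None
```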
```python
        detail=f"user_prompt_template for label {label} doesn't exist.",
    )

if not tokenizer_model:
```
why would one use a tokenizer different from the model? Is it a needed parameter?
| "microsoft/phi-4", | ||
| description="Tokenizer instance to get the tokens of our prompt", | ||
| ), | ||
| decompose_by: int |
what is the `decompose_by` parameter? Is it needed for the user?
```python
    context_window_length: Optional[int] = Query(
        120, description="Length of context window. Use None for full paragraph."
    ),
    token_limit_frac: float = Query(
```
Does the user really need to be able to change this parameter? If not, can we make it an env variable instead? Define it as a Setting
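A minimal sketch of that suggestion with pydantic-settings; the class name and env-var mapping are assumptions.

```python
from pydantic_settings import BaseSettings


class AnonymizerSettings(BaseSettings):
    # Read TOKEN_LIMIT_FRAC from the environment (e.g. .env.common);
    # the default matches the endpoint's current 2/3.
    token_limit_frac: float = 2 / 3


settings = AnonymizerSettings()
```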
Summary by Sourcery
Add a new LLM-powered entity disambiguation endpoint and supporting utilities for canonical entity refinement and mapping back into NER predictions.