Add Lookup System for Reference Data Enrichment in Prompts#1780
gaya3-zipstack wants to merge 28 commits into main from
Conversation
… prompt studio projects
Signed-off-by: Gayathri <142381512+gaya3-zipstack@users.noreply.github.com>
Summary by CodeRabbit
Walkthrough

Adds a full Look-Up system: Django models, migrations, REST APIs, services (indexing, retrieval, orchestration), LLM and storage integrations, Prompt Studio linkage, frontend UI for managing and debugging lookups, extensive tests, and configuration flags for automatic enrichment and timeouts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PS as Prompt Studio
    participant Integration as LookupIntegrationService
    participant Orchestrator as LookUpOrchestrator
    participant Executor as LookUpExecutor
    participant Loader as ReferenceDataLoader
    participant Cache as LLM/Redis Cache
    participant LLM as LLM Provider
    participant Log as LookupLogEmitter

    PS->>Integration: request enrich_if_linked(extracted_data, prompt_lookup_map)
    Integration->>Orchestrator: build projects, execution_id
    Orchestrator->>Executor: execute(project, input_data, execution_id)
    Executor->>Loader: load_latest_for_project(project_id)
    Loader-->>Executor: concatenated reference_data
    Executor->>Cache: check cache(prompt + reference)
    alt cache hit
        Cache-->>Executor: cached_response
    else cache miss
        Executor->>LLM: generate(prompt, config)
        LLM-->>Executor: llm_response
        Executor->>Cache: store response
    end
    Executor->>Log: emit_enrichment_* (start/success/failure)
    Executor-->>Orchestrator: enrichment_result
    Orchestrator-->>Integration: aggregated_results
    Integration-->>PS: lookup_enrichment + _lookup_metadata
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 20
Note
Due to the large number of review comments, comments of Critical and Major severity were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
frontend/src/components/logging/log-modal/LogModal.jsx (1)
223-229: ⚠️ Potential issue | 🟡 Minor

Incorrect PropTypes definitions.

`logDescModalOpen` is used as a boolean (lines 175, 195) and `setLogDescModalOpen` is used as a function (line 97), but both are defined as `PropTypes.array`.

🔧 Proposed fix

```diff
 LogModal.propTypes = {
   executionId: PropTypes.string,
   fileId: PropTypes.string,
-  logDescModalOpen: PropTypes.array,
-  setLogDescModalOpen: PropTypes.array,
+  logDescModalOpen: PropTypes.bool,
+  setLogDescModalOpen: PropTypes.func,
   filterParams: PropTypes.object,
 };
```

backend/prompt_studio/prompt_studio_core_v2/views.py (1)
435-462: ⚠️ Potential issue | 🟠 Major

Handle `LookupEnrichmentError` in `single_pass_extraction` as well.

The same lookup failure that now returns 400 in `fetch_response` will still bubble up as a 500 here. Mirror the new error handling for consistency.

🛠️ Proposed fix

```diff
 def single_pass_extraction(self, request: HttpRequest, pk: uuid) -> Response:
     """API Entry point method to fetch response to prompt.
     @@
     """
+    from prompt_studio.prompt_studio_output_manager_v2.output_manager_helper import (
+        LookupEnrichmentError,
+    )
     # TODO: Handle fetch_response and single_pass_
     # extraction using common function
     custom_tool = self.get_object()
     @@
-    response: dict[str, Any] = PromptStudioHelper.prompt_responder(
-        tool_id=tool_id,
-        org_id=UserSessionUtils.get_organization_id(request),
-        user_id=custom_tool.created_by.user_id,
-        document_id=document_id,
-        run_id=run_id,
-    )
-    return Response(response, status=status.HTTP_200_OK)
+    try:
+        response: dict[str, Any] = PromptStudioHelper.prompt_responder(
+            tool_id=tool_id,
+            org_id=UserSessionUtils.get_organization_id(request),
+            user_id=custom_tool.created_by.user_id,
+            document_id=document_id,
+            run_id=run_id,
+        )
+        return Response(response, status=status.HTTP_200_OK)
+    except LookupEnrichmentError as e:
+        error_response = {
+            "error": str(e),
+            "error_type": e.error_type,
+            **e.details,
+        }
+        return Response(error_response, status=status.HTTP_400_BAD_REQUEST)
```

backend/workflow_manager/workflow_v2/file_execution_tasks.py (1)
1025-1043: ⚠️ Potential issue | 🟠 Major

Lookup enrichment is skipped entirely for filesystem and database destinations, a design mismatch with the intended support.

Lookup integration is designed to support ETL, Workflow, and API deployments (per the `workflow_integration.py` documentation), but the current implementation only applies enrichment to API destinations. The condition on line 1037 (`if output_result and not processing_error:`) prevents enrichment from running for filesystem and database destinations because `handle_output()` returns `None` for those types; it only returns the tool execution result for API deployments.

This creates silent data loss: enrichment is skipped for FILESYSTEM/DATABASE workflows even though the lookup system is designed to support them. Additionally, for FILESYSTEM/DATABASE destinations, persistence occurs inside `handle_output()` before any enrichment opportunity, so even if enrichment ran, it wouldn't reach the persisted outputs.

Filesystem/database workflows should either receive enriched output via a redesigned flow, or enrichment scope should be explicitly limited to API deployments (with documentation clarifying why).
🤖 Fix all issues with AI agents
In `@backend/lookup/integrations/llm_provider.py`:
- Around line 207-222: The current JSON extraction uses the regex json_pattern =
r"\{[^{}]*\}" which fails on nested objects; replace it with an iterative
balanced-brace extractor that scans the response string (the variable response)
and uses a stack/counter to find matching '{' ... '}' spans, attempt json.loads
on each balanced substring and return the first valid parse; update the logic
where matches are built/iterated (the block referencing json_pattern, matches,
and the json.loads try/except) to use this brace-matching approach so nested
JSON like {"data":{"value":1}} is correctly extracted and parsed.
In `@backend/lookup/integrations/redis_cache.py`:
- Around line 374-381: The warmup loop uses hashlib.md5 to build keys, which
diverges from the main generate_cache_key() scheme (SHA256) and triggers
security linting; update the key generation inside the preload_data loop to use
hashlib.sha256(prompt.encode()).hexdigest() (matching the generate_cache_key()
approach), keeping the same self.key_prefix and project:{project_id}: namespace
so keys are consistent; ensure the change is made where the code builds key (the
preload loop that calls self.set(key, response)) so warmup entries match the
primary cache lookup.
In `@backend/lookup/models/lookup_data_source.py`:
- Around line 168-196: The pre_save signal handler
auto_increment_version_and_update_latest on LookupDataSource is vulnerable to
race conditions because it reads Max("version_number") and then writes, so
concurrent creations can get duplicate version_number; update this by acquiring
a row-level lock and doing the read/update inside a transaction (e.g., wrap in
transaction.atomic() and use select_for_update() on LookupDataSource objects
filtered by project to lock the relevant rows before computing max and updating
is_latest) or switch to a DB-backed sequence/serial for version_number and set
is_latest in the same transaction to ensure uniqueness; ensure all logic
referencing version_number and is_latest in
auto_increment_version_and_update_latest runs inside that transaction context.
In `@backend/lookup/models/lookup_execution_audit.py`:
- Around line 134-140: The validation currently requires enriched_output only
when self.status == "success" but was_successful() treats "partial" as
successful too; update the validation in the LookupExecutionAudit model (the
block that checks self.status and self.enriched_output) to require
enriched_output when self.status is either "success" or "partial" (e.g., change
the condition to check status in ["success", "partial"]) and keep the
ValidationError message consistent.
In `@backend/lookup/models/lookup_project.py`:
- Around line 142-163: The clean() method wrongly enforces LLM_PROVIDER_CHOICES
even when llm_provider is nullable; update the validation in clean() to only
check membership if self.llm_provider is not None (and optionally not an empty
string) before comparing against valid_providers (referencing clean(),
LLM_PROVIDER_CHOICES and llm_provider), so nullable values pass and only
non-null providers are validated.
In `@backend/lookup/serializers.py`:
- Around line 331-361: The AdapterProcessor.get_adapter_instance_by_id function
currently returns adapter.adapter_name (string) but should return the full
AdapterInstance object; update get_adapter_instance_by_id in
backend/adapter_processor_v2/adapter_processor.py to return the adapter object
(not adapter.adapter_name), correct the return type annotation to the actual
AdapterInstance type (or Adapter) to match callers, and ensure error handling
returns None or raises appropriately before returning so callers like
LookupProfileSerializer.to_representation (which calls
AdapterProcessor.get_adapter_instance_by_id) receive the full instance or a
clear failure signal.
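To make the intended contract concrete, here is a minimal sketch of the corrected helper. The `AdapterInstance` import path and lookup field are assumptions based on the prompt above; the real layout of `backend/adapter_processor_v2` may differ:

```python
# Sketch only: model import path and lookup field are assumptions.
from adapter_processor_v2.models import AdapterInstance


class AdapterProcessor:
    @staticmethod
    def get_adapter_instance_by_id(adapter_instance_id: str) -> "AdapterInstance | None":
        """Return the full AdapterInstance object, not just adapter_name."""
        try:
            return AdapterInstance.objects.get(id=adapter_instance_id)
        except AdapterInstance.DoesNotExist:
            # Callers such as LookupProfileSerializer.to_representation get
            # an explicit failure signal instead of a bare name string.
            return None
```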
In `@backend/lookup/services/enrichment_merger.py`:
- Around line 103-112: The conflict counter misses cases where the existing
entry wins (winner == existing) because the current elif is unreachable; update
the resolution block in enrichment_merger.py so that whenever there is a
conflict between existing and field_entry (i.e., they are different lookup_ids)
you always increment conflicts_resolved after selecting a winner: keep the
existing branch that updates field_sources/merged_data/fields_added when winner
differs from existing (winner["lookup_id"] != existing["lookup_id"]), and
replace the elif with an else (or an explicit check for existing winning) that
simply increments conflicts_resolved when winner == existing and
existing["lookup_id"] != field_entry["lookup_id"]; this ensures
conflicts_resolved increments for both cases (new winner or existing winner).
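A minimal sketch of the corrected resolution logic. The entry shape and the confidence-based tie-break rule are assumptions; only the counting behavior (increment on every conflict, whichever side wins) comes from the item above:

```python
def merge_field(
    field: str,
    existing: dict,
    field_entry: dict,
    merged_data: dict,
    field_sources: dict,
    stats: dict,
) -> None:
    """Merge one field from a new lookup entry into the running result."""
    if existing["lookup_id"] == field_entry["lookup_id"]:
        return  # same source, nothing to resolve
    # Assumed policy: the entry with higher confidence wins.
    winner = max(existing, field_entry, key=lambda e: e.get("confidence", 0))
    if winner["lookup_id"] != existing["lookup_id"]:
        field_sources[field] = winner
        merged_data[field] = winner["value"]
        stats["fields_added"] += 1
    # Count the conflict in both outcomes (new winner or existing winner).
    stats["conflicts_resolved"] += 1
```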
In `@backend/lookup/services/llm_cache.py`:
- Around line 112-134: The generate_cache_key method currently concatenates
prompt and reference_data directly, causing possible collisions; update
generate_cache_key in LLMResponseCache to combine inputs with an unambiguous
scheme (e.g., length-prefixed fields like
"<len>:<prompt>|<len>:<reference_data>" or a reserved delimiter with escaping)
before hashing, and make the same change in the corresponding key-building
function in redis_cache (the method that assembles prompt+reference_data at
backend/lookup/integrations/redis_cache.py line ~107) so both implementations
use the identical separator/length-prefix format to ensure deterministic,
collision-free keys.
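For illustration, a sketch of a length-prefixed key builder that both `llm_cache.py` and `redis_cache.py` could share (the function name and prefix are placeholders):

```python
import hashlib


def generate_cache_key(prompt: str, reference_data: str, prefix: str = "lookup:") -> str:
    """Build a collision-free cache key via length-prefixed fields.

    Plain concatenation is ambiguous: ("ab", "c") and ("a", "bc")
    would produce the same hash input. Length prefixes make the
    encoding injective, so distinct inputs hash differently.
    """
    p = prompt.encode("utf-8")
    r = reference_data.encode("utf-8")
    material = b"%d:%b|%d:%b" % (len(p), p, len(r), r)
    return prefix + hashlib.sha256(material).hexdigest()
```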
In `@backend/lookup/services/vector_db_cleanup_service.py`:
- Around line 286-301: The loop that calls cleanup_index_ids currently always
clears index_manager.index_ids_history, raw_index_id and marks
status/reindex_required even when result indicates failures, which can orphan
vector DB rows; change the logic in the loop that iterates index_managers so
that you only clear index_manager.index_ids_history, set
index_manager.raw_index_id = None, mark status {"indexed": False, "cleaned":
True} and set reindex_required = True (and save) when result["success"] is True
(or equivalent full-success indicator used by cleanup_for_data_source); if
result shows failures, preserve index_ids_history/raw_index_id and set status to
reflect cleanup failure (or leave indexed state) and still save errors via
index_manager.save(), matching the pattern in cleanup_for_data_source and
ensuring retryability.
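A sketch of the success-gated loop described above, with the managers and cleanup callable injected as parameters (field names follow the prompt and may not match the real models exactly):

```python
def apply_cleanup_results(index_managers, cleanup_index_ids) -> None:
    """Clear index state only when every vector-DB deletion succeeded."""
    for index_manager in index_managers:
        result = cleanup_index_ids(index_manager.index_ids_history)
        if result["success"]:
            index_manager.index_ids_history = []
            index_manager.raw_index_id = None
            index_manager.status = {"indexed": False, "cleaned": True}
            index_manager.reindex_required = True
        else:
            # Preserve history and raw_index_id so cleanup can be retried.
            index_manager.status = {"indexed": True, "cleanup_failed": True}
        index_manager.save()
```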
In `@backend/lookup/services/workflow_integration.py`:
- Around line 179-183: The query in workflow_integration.py that matches
PromptStudioRegistry.prompt_registry_id against ToolInstance.tool_id is
type-unsafe because tool_instance.tool_id is a CharField while
prompt_registry_id is a UUIDField; update the lookup (around where
ToolInstance.objects.filter(...).first() is used) to explicitly convert
tool_instance.tool_id to a UUID before querying PromptStudioRegistry (e.g.,
parse with uuid.UUID) and handle invalid UUIDs with a try/except (log the
invalid value and return original_output, False) so the ORM receives a proper
UUID object rather than a raw string; alternatively, if you prefer a schema
change, migrate ToolInstance.tool_id to a UUIDField and adjust related code
accordingly.
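A small helper illustrating the safe-conversion option (the logging behavior on failure mirrors the fallback described above):

```python
import logging
import uuid

logger = logging.getLogger(__name__)


def parse_tool_uuid(tool_id) -> "uuid.UUID | None":
    """Convert ToolInstance.tool_id (a CharField) to a UUID for ORM lookups.

    Returns None for malformed values so the caller can fall back to
    (original_output, False) instead of raising inside the query.
    """
    try:
        return uuid.UUID(str(tool_id))
    except (ValueError, TypeError):
        logger.warning("Invalid tool_id %r; skipping lookup enrichment", tool_id)
        return None
```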
In `@backend/lookup/tests/test_api/test_template_api.py`:
- Around line 21-37: The test setup creates a LookupPromptTemplate without the
required project relation; update setUp() to create a LookupProject and
associate it when creating the template (create a LookupProject instance and
pass it via the project argument to LookupPromptTemplate.objects.create),
ensuring the template's required OneToOneField to LookupProject is satisfied;
adjust any variable names as needed and keep the created project tied to self
(e.g., self.project) for reuse in tests referencing LookupPromptTemplate.
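A sketch of the corrected setUp; the concrete field names (`name`, `template_text`) are placeholders, since only the required `project` relation is specified above:

```python
from django.test import TestCase

from lookup.models import LookupProject, LookupPromptTemplate


class LookupPromptTemplateAPITest(TestCase):
    def setUp(self):
        # Create the required parent project first, then satisfy the
        # template's OneToOneField by passing it explicitly.
        self.project = LookupProject.objects.create(name="Test Project")
        self.template = LookupPromptTemplate.objects.create(
            project=self.project,
            template_text="Enrich {field} using {reference_data}",
        )
```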
In `@backend/lookup/tests/test_migrations.py`:
- Around line 24-33: The setUp currently asserts self.migrate_from truthy
causing failures when migrate_from is intentionally None to represent the zero
state; remove that requirement and treat None as the initial migration by
allowing self.migrate_from to be [(self.app, None)] (instead of asserting) and
keep self.migrate_to handling as before; update the setUp logic around setUp,
self.migrate_from, self.migrate_to and the MigrationExecutor(connection) call so
tests can pass when migrate_from is None.
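A sketch of a migration test base class that treats `migrate_from = None` as the app's zero state rather than asserting it is truthy:

```python
from django.db import connection
from django.db.migrations.executor import MigrationExecutor
from django.test import TransactionTestCase


class MigrationTestBase(TransactionTestCase):
    app = "lookup"
    migrate_from = None  # None means "before any migration of this app"
    migrate_to = None

    def setUp(self):
        # [(app, None)] is a valid MigrationExecutor target that unapplies
        # every migration of the app, so no truthiness assert is needed.
        migrate_from = [(self.app, self.migrate_from)]
        migrate_to = [(self.app, self.migrate_to)]
        executor = MigrationExecutor(connection)
        executor.migrate(migrate_from)
        self.old_apps = executor.loader.project_state(migrate_from).apps
        executor.loader.build_graph()  # reload after partial migration
        executor.migrate(migrate_to)
```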
In `@backend/lookup/tests/test_services/test_lookup_orchestrator.py`:
- Around line 93-103: The test mocks define several helpers named
execute_side_effect that only accept positional args and use the name project,
but LookUpOrchestrator invokes executor.execute with keyword args
(lookup_project, input_data, execution_id, prompt_studio_project_id), causing
TypeError; update all six execute_side_effect helpers to accept the keyword
parameters (e.g., def execute_side_effect(*, lookup_project, input_data,
execution_id=None, prompt_studio_project_id=None) or accept **kwargs and read
lookup_project) and use lookup_project instead of project when constructing the
returned dict so the mocked signature matches the actual executor.execute call.
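For clarity, one corrected helper in full; the returned dict is representative only, since the real tests build richer results:

```python
def execute_side_effect(
    *,
    lookup_project,
    input_data,
    execution_id=None,
    prompt_studio_project_id=None,  # accepted to match the call signature
):
    """Mock side effect matching LookUpExecutor.execute's keyword call."""
    return {
        "lookup_id": str(lookup_project.id),
        "status": "success",
        "enriched_data": dict(input_data),
        "execution_id": execution_id,
    }
```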
In
`@backend/prompt_studio/prompt_studio_output_manager_v2/output_manager_helper.py`:
- Around line 373-392: The current INFO logs in the post-processing loop print
full extracted values (see extracted_data_for_lookup, prompts loop, outputs and
logger.info calls), which may expose sensitive data; change those logger.info
calls to either logger.debug or remove the value content and instead log only
the prompt key and value type (e.g., prompt.prompt_key and
type(output_value).__name__) or a redacted placeholder, and similarly avoid
logging the full extracted_data_for_lookup dict at INFO—log its keys or count
only; update the two logger.info occurrences in this block and keep the skip
debug for PSOMKeys.NOTES as-is.
- Around line 432-459: Enriched values in the serialized_data loop are being
inserted raw (possibly dict/list) which breaks consistency with earlier
enforce_type serialization; fetch the prompt metadata from prompt_by_key (reuse
prompt_by_key) for the current prompt_key, run the same
enforce_type/serialization logic used elsewhere (e.g., json.dumps for structured
types like json/table/record/line-item) on enriched_value before assigning
item["lookup_replacement"]["enriched_value"], item["output"], and before calling
PromptStudioOutputManager.objects.filter(...).update(...), so stored/returned
outputs keep the same serialized format as non-enriched values.
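Two short Python sketches tying the items above together. First, for the logging item, a redaction-friendly summary logger (names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def log_extraction_summary(extracted_data_for_lookup: dict) -> None:
    """Log keys and value types at INFO; keep full values at DEBUG."""
    logger.info(
        "Prepared %d fields for lookup enrichment: %s",
        len(extracted_data_for_lookup),
        {k: type(v).__name__ for k, v in extracted_data_for_lookup.items()},
    )
    # Full values may contain sensitive extracted content: DEBUG only.
    logger.debug("Extracted data for lookup: %s", extracted_data_for_lookup)
```

Second, for the serialization item, a sketch of the shared serialization step; the set of structured type names is an assumption and should mirror the real enforce_type values used elsewhere in this helper:

```python
import json

# Assumed structured types; align with the real enforce_type values.
STRUCTURED_TYPES = {"json", "table", "record", "line-item"}


def serialize_enriched_value(enriched_value, enforce_type: str):
    """Serialize enriched values the same way as non-enriched outputs."""
    if enforce_type in STRUCTURED_TYPES and not isinstance(enriched_value, str):
        return json.dumps(enriched_value)
    return enriched_value
```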
In `@backend/prompt_studio/prompt_studio_v2/serializers.py`:
- Around line 35-58: The validation uses self.instance.tool_id_id (old value)
instead of respecting an incoming tool_id in updates; change the logic in the
serializer validation block so that if "tool_id" exists in self.initial_data it
overrides self.instance.tool_id_id (i.e., determine tool_id =
self.initial_data.get("tool_id") if present else self.instance.tool_id_id), then
perform the PromptStudioLookupLink.objects.filter check
(prompt_studio_project_id=tool_id, lookup_project=value) and raise
serializers.ValidationError if no link exists; keep the ImportError handling
as-is.
In
`@frontend/src/components/custom-tools/prompt-card/LookupReplacementIndicator.jsx`:
- Around line 48-61: The Popover uses trigger="hover" which blocks keyboard
access and the Button lacks an aria-label; change Popover's trigger to
trigger={["hover","focus"]} and add a descriptive aria-label to the Button (the
element rendering SwapOutlined) so screen readers and keyboard users can
discover the replacement details. Also replace any uses of the logical OR
fallback (e.g. value || fallback) in this component with the nullish coalescing
operator (value ?? fallback) so valid falsy values (0, false, "") are preserved;
look for usages in LookupReplacementIndicator.jsx around the Popover/content
rendering and update them accordingly.
In `@frontend/src/components/custom-tools/settings-modal/SettingsModal.jsx`:
- Around line 119-124: LookupManager is using the same numeric key (9) as
CustomDataSettings which overwrites listOfComponents[9] and creates duplicate
menu keys; update the LookupManager menu entry and its component mapping to use
an unused key (for example 10) by changing the getMenuItem call and the
listOfComponents index from 9 to 10 (referencing LookupManager, getMenuItem, and
listOfComponents) so CustomDataSettings remains reachable and keys are unique.
In `@frontend/src/components/logging/log-modal/LogModal.jsx`:
- Around line 43-57: The component is missing imports for Tag and SearchOutlined
which causes runtime errors: add Tag from "antd" and SearchOutlined from
"@ant-design/icons" to the existing import lists so getStageStyle (which returns
<SearchOutlined />) and the column render that uses <Tag> work correctly; update
the import statement(s) that currently import Button, Modal, Table, Tooltip and
CopyOutlined to also include Tag and SearchOutlined respectively.
In `@frontend/src/components/lookups/create-project-modal/CreateProjectModal.jsx`:
- Around line 9-16: handleSubmit currently treats any error as a validation
failure; change it to first await form.validateFields() in a try/catch to handle
and log/return validation errors, then call await onCreate(values) in a separate
try/catch so API/create failures are handled distinctly (log the error, show
user-visible feedback via message.error or set a submission error) and still
call form.resetFields() only on successful creation; reference handleSubmit,
form.validateFields, onCreate, and form.resetFields when updating the control
flow.
🟡 Minor comments (25)
frontend/src/components/custom-tools/display-logs/DisplayLogs.css (1)

27-33: ⚠️ Potential issue | 🟡 Minor

CSS class mismatch with DisplayLogs.jsx.

The CSS defines `.display-logs-stage-extract` and `.display-logs-stage-index`, but DisplayLogs.jsx uses the TOOL stage, which maps to `.display-logs-stage-tool` (not defined here). This mismatch means:

- TOOL stage tags won't get their custom border styling
- The EXTRACT and INDEX class definitions appear unused

🔧 Proposed fix to align with DisplayLogs.jsx

```diff
-.display-logs-stage-extract {
-  border-color: #52c41a;
-}
-
-.display-logs-stage-index {
+.display-logs-stage-tool {
+  border-color: #13c2c2;
+}
+
+.display-logs-stage-default {
   border-color: #faad14;
 }
```

frontend/src/components/lookups/debug-tab/DebugTab.jsx (1)
353-444: ⚠️ Potential issue | 🟡 Minor

Guard against missing/zero token metrics to avoid NaN UI math.

If `tokenCount` or `contextLimit` is missing or zero, the percentage math yields NaN/Infinity and can break the progress display. Add a defensive guard with a fallback alert.

🛡️ Suggested fix

```diff
   const { tokenCount, contextLimit, model } = errorDetails;
+  if (
+    !Number.isFinite(tokenCount) ||
+    !Number.isFinite(contextLimit) ||
+    contextLimit <= 0
+  ) {
+    return (
+      <Alert
+        message="Context Window Exceeded"
+        description="Token usage details are unavailable for this run."
+        type="error"
+        showIcon={false}
+        style={{ marginBottom: 16 }}
+      />
+    );
+  }
   const usagePercent = Math.min(
     Math.round((tokenCount / contextLimit) * 100),
     100
   );
-  const overflowAmount = tokenCount - contextLimit;
+  const overflowAmount = Math.max(tokenCount - contextLimit, 0);
   const overflowPercent = Math.round((overflowAmount / contextLimit) * 100);
```

frontend/src/components/lookups/create-project-modal/CreateProjectModal.jsx (1)
20-24: ⚠️ Potential issue | 🟡 Minor

Reset the form on cancel to avoid stale values on reopen.

Currently, canceling leaves prior input in the form since the modal stays mounted. When reopened, users see previously entered values in the name and description fields, creating a poor UX. Implement the suggested `handleCancel` handler to reset the form to its initial state before calling `onCancel`.

♻️ Proposed refinement

```diff
 export function CreateProjectModal({ open, onCancel, onCreate }) {
   const [form] = Form.useForm();
+  const handleCancel = () => {
+    form.resetFields();
+    onCancel();
+  };
   ...
   return (
     <Modal
       title="Create Look-Up Project"
       open={open}
-      onCancel={onCancel}
+      onCancel={handleCancel}
```

frontend/src/components/custom-tools/prompt-card/LookupReplacementIndicator.jsx (1)
24-41: ⚠️ Potential issue | 🟡 Minor

Use nullish coalescing (`??`) instead of logical OR (`||`) to preserve valid falsy values.

The current code uses `|| "(empty)"`, which will incorrectly display `(empty)` for legitimate falsy values like `0`, `false`, or empty strings. Since `original_value` and `enriched_value` can be any type (per PropTypes), use `??` to treat only `null` and `undefined` as empty.

Suggested fix

```diff
- {originalValue || "(empty)"}
+ {originalValue ?? "(empty)"}

- {enrichedValue || "(empty)"}
+ {enrichedValue ?? "(empty)"}
```

backend/lookup/services/vector_db_cleanup_service.py (1)
244-250: ⚠️ Potential issue | 🟡 Minor

State update may incorrectly remove failed IDs from history.

When `result["deleted"] > 0`, the code removes all `ids_to_delete` from `index_ids_history`, but some of those IDs may have failed deletion. Since `cleanup_index_ids` doesn't return which specific IDs succeeded, failed IDs will be lost from history and can't be retried.

Consider either:

- Only updating history when `result["success"]` is True (all deletions succeeded)
- Having `cleanup_index_ids` return the list of successfully deleted IDs

🔧 Conservative fix option

```diff
-        # Update history to remove deleted IDs
-        if result["deleted"] > 0:
+        # Only update history when all deletions succeeded
+        if result["success"]:
             remaining_ids = [
                 id for id in index_manager.index_ids_history if id not in ids_to_delete
             ]
             index_manager.index_ids_history = remaining_ids
             index_manager.save()
```

backend/lookup/tests/test_services/test_audit_logger.py (1)
226-245: ⚠️ Potential issue | 🟡 Minor

Drop the unused `result` assignment.

Static analysis flagged `result` as unused here; the call can stand alone.

🧹 Cleanup

```diff
-        result = audit_logger.log_execution(**minimal_params)
+        audit_logger.log_execution(**minimal_params)
```

backend/lookup/tests/test_services/test_audit_logger.py (1)
93-105: ⚠️ Potential issue | 🟡 Minor

Fix the mock exception setup in `test_project_not_found`.

Using `mock_project_model.DoesNotExist()` as a side_effect creates a MagicMock instance rather than raising an exception, so the except block catching `LookupProject.DoesNotExist` is never exercised. Create a real exception class on the mock instead:

Suggested fix

```diff
-        mock_project_model.objects.get.side_effect = mock_project_model.DoesNotExist()
+        mock_project_model.DoesNotExist = type("DoesNotExist", (Exception,), {})
+        mock_project_model.objects.get.side_effect = mock_project_model.DoesNotExist
```

backend/lookup/tests/test_services/test_lookup_executor.py (1)
365-368: ⚠️ Potential issue | 🟡 Minor

Silence the unused-argument warning by prefixing parameters with underscores.

Ruff's ARG rule will flag the unused `*args` and `**kwargs` in this test mock function.

🔧 Suggested fix

```diff
-    def delayed_generate(*args, **kwargs):
+    def delayed_generate(*_args, **_kwargs):
         import time
         time.sleep(0.01)  # 10ms delay
         return '{"result": "data"}'
```

backend/lookup/tests/test_api/test_execution_api.py (1)
31-35: ⚠️ Potential issue | 🟡 Minor

Suppress Ruff S106 for test-only passwords (both setups).

🛠️ Suggested fix

```diff
-            password='testpass123'
+            password='testpass123',  # noqa: S106 - test-only password
```

Also applies to: 166-170

backend/lookup/tests/test_api/test_execution_api.py (1)
71-74: ⚠️ Potential issue | 🟡 Minor

Silence the unused patched arg to satisfy Ruff ARG002.

🛠️ Suggested fix

```diff
-    def test_debug_with_ps_project(self, mock_executor_class, mock_orchestrator_class):
+    def test_debug_with_ps_project(self, _mock_executor_class, mock_orchestrator_class):
```

backend/lookup/tests/test_api/test_linking_api.py (1)
101-108: ⚠️ Potential issue | 🟡 Minor

Remove unused link variables to satisfy Ruff F841.

🧹 Suggested fix

```diff
-        link1 = PromptStudioLookupLink.objects.create(
+        PromptStudioLookupLink.objects.create(
             prompt_studio_project_id=self.ps_project_id, lookup_project=self.lookup1
         )
-        link2 = PromptStudioLookupLink.objects.create(
+        PromptStudioLookupLink.objects.create(
             prompt_studio_project_id=self.ps_project_id, lookup_project=self.lookup2
         )
```

backend/lookup/integrations/redis_cache.py (1)
120-131: ⚠️ Potential issue | 🟡 Minor

Treat empty cached values as hits.

Redis can return an empty string; the current truthiness check marks that as a miss.

🛠️ Suggested fix

```diff
-        if value:
+        if value is not None:
             logger.debug(f"Redis cache hit for key: {key[:20]}...")
             self._update_stats("hits")
             return value
         else:
             logger.debug(f"Redis cache miss for key: {key[:20]}...")
             self._update_stats("misses")
```

backend/lookup/tests/test_api/test_linking_api.py (1)
27-31: ⚠️ Potential issue | 🟡 Minor

Suppress Ruff S106 for the test-only password.

This avoids a lint error while keeping the test simple.

🛠️ Suggested fix

```diff
-            password='testpass123'
+            password='testpass123',  # noqa: S106 - test-only password
```

backend/lookup/tests/test_services/test_reference_data_loader.py (1)
71-71: ⚠️ Potential issue | 🟡 Minor

Change `file_type="text/csv"` to `file_type="csv"`.

The `file_type` field in `LookupDataSource` validates against `FILE_TYPE_CHOICES`, which only accepts `"csv"`, not the MIME type `"text/csv"`. This will fail model validation. Use the plain choice value instead.

backend/lookup/integrations/unstract_llm_client.py (1)
122-163: ⚠️ Potential issue | 🟡 Minor

Unused `config` and `timeout` parameters should be utilized or removed.

The `config` parameter (for temperature, etc.) and `timeout` parameter are declared but never used. The LLM call at line 142 doesn't apply the configuration, and no timeout is enforced.

Proposed fix to utilize or document the parameters

If the parameters are intended for future use, add a TODO comment. Otherwise, apply config to the LLM call:

```diff
 def generate(self, prompt: str, config: dict[str, Any], timeout: int = 30) -> str:
     ...
     try:
         # Create LLM instance using SDK
-        llm = LLM(adapter_id=self.adapter_id, adapter_metadata=self.adapter_metadata)
+        # Apply config settings if provided
+        metadata = {**self.adapter_metadata}
+        if config.get("temperature") is not None:
+            metadata["temperature"] = config["temperature"]
+        llm = LLM(adapter_id=self.adapter_id, adapter_metadata=metadata)
+        # TODO: Implement timeout handling for LLM calls
```

backend/lookup/services/lookup_retrieval_service.py (1)
151-155: ⚠️ Potential issue | 🟡 Minor

Potential `IndexError` when accessing `unique_chunks` by index.

If all chunks are duplicates (edge case), `unique_chunks` could be empty after deduplication, causing an `IndexError` on line 154 when accessing `unique_chunks[-1]` and `unique_chunks[0]`.

Proposed fix

```diff
+        if not unique_chunks:
+            logger.warning(
+                f"All {len(all_chunks)} chunks were duplicates for project {project_id}"
+            )
+            return ""
+
         logger.info(
             f"Retrieved {len(unique_chunks)} unique chunks "
             f"(from {len(all_chunks)} total) for project {project_id}. "
             f"Score range: {unique_chunks[-1].score:.3f} - {unique_chunks[0].score:.3f}"
         )
```

backend/lookup/tests/test_integrations/test_storage_integration.py (1)
164-196: ⚠️ Potential issue | 🟡 Minor

Remove the unused `expected_meta` variable.

The variable `expected_meta` is computed but never used in assertions. Either add an assertion to verify metadata content or remove it.

Proposed fix: add an assertion or remove the variable

Option 1, add an assertion:

```diff
         # Mock JSON encoding for metadata
         import json
         expected_meta = json.dumps(metadata, indent=2)
         ...
         # Check metadata upload
         meta_path = f"{expected_path}.meta.json"
         meta_calls = [call for call in self.mock_fs.write.call_args_list
                       if call[1]['path'] == meta_path]
         self.assertEqual(len(meta_calls), 1)
+        # Verify metadata content
+        actual_meta_data = meta_calls[0][1]['data'].decode('utf-8')
+        self.assertEqual(actual_meta_data, expected_meta)
```

Option 2, remove the unused variable:

```diff
-        # Mock JSON encoding for metadata
-        import json
-        expected_meta = json.dumps(metadata, indent=2)
-
```

backend/lookup/integrations/llmwhisperer_client.py (1)
188-222: ⚠️ Potential issue | 🟡 Minor

Blocking poll loop could impact request threads.

The `wait_for_extraction` method uses `time.sleep()`, which blocks the thread. If called from a request handler, this could exhaust worker threads. Consider:

- Using this only in background tasks/Celery workers
- Adding documentation about the intended usage context
- For request threads, returning immediately and using webhooks/callbacks

backend/lookup/models/prompt_studio_lookup_link.py (1)
91-106: ⚠️ Potential issue | 🟡 Minor

Auto-assignment condition may incorrectly skip assignment for explicit zero.

The condition `self.execution_order == 0 and not self.pk` will not auto-assign if someone explicitly sets `execution_order=0`. Also, for UUID primary keys with defaults, `not self.pk` may not work as expected since UUID fields have defaults.

🐛 Proposed fix

```diff
     def save(self, *args, **kwargs):
         """Override save to auto-assign execution_order if not set."""
-        if self.execution_order == 0 and not self.pk:
+        # self.pk is always set for UUIDs due to default=uuid.uuid4,
+        # so detect new instances via the instance state instead.
+        if self._state.adding and self.execution_order == 0:
             # Get the maximum execution order for this PS project
             max_order = PromptStudioLookupLink.objects.filter(
                 prompt_studio_project_id=self.prompt_studio_project_id
             ).aggregate(max_order=models.Max("execution_order"))["max_order"]
             self.execution_order = (max_order or 0) + 1
         super().save(*args, **kwargs)
```

Using `self._state.adding` is the Django-idiomatic way to check for new instances.

backend/lookup/tests/test_integrations/test_redis_cache_integration.py (1)
144-164: ⚠️ Potential issue | 🟡 Minor

Move the RedisError import to the top of the file for consistency.

The `from redis.exceptions import RedisError` import at line 147 is placed inline within the test method. While redis is a declared project dependency (in backend/pyproject.toml and other modules), the import statement itself should sit at the top of the file with the other imports for consistency and maintainability, rather than being deferred until the test method executes.

backend/lookup/services/indexing_service.py (1)
398-401: ⚠️ Potential issue | 🟡 Minor

Add exception chaining for better debugging.

When re-raising in an except clause, use `raise ... from error` or `raise ... from None` to make the exception-cause handling explicit.

🔧 Proposed fix

```diff
     try:
         project = LookupProject.objects.get(id=project_id)
     except LookupProject.DoesNotExist:
-        raise ValueError(f"Project {project_id} not found")
+        raise ValueError(f"Project {project_id} not found") from None
```

backend/lookup/services/lookup_executor.py (1)
664-669: ⚠️ Potential issue | 🟡 Minor

Add exception chaining in `_parse_llm_response`.

When converting exceptions to `ParseError`, chain the original exception to preserve debugging context.

🔧 Proposed fix

```diff
         except json.JSONDecodeError as e:
             logger.warning(f"Failed to parse LLM response as JSON: {e}")
-            raise ParseError(f"Invalid JSON response: {str(e)}")
+            raise ParseError(f"Invalid JSON response: {e!s}") from e
         except Exception as e:
             logger.warning(f"Unexpected error parsing LLM response: {e}")
-            raise ParseError(f"Parse error: {str(e)}")
+            raise ParseError(f"Parse error: {e!s}") from e
```

backend/lookup/views.py (1)
1357-1361: ⚠️ Potential issue | 🟡 Minor

Add exception chaining for better debugging.

When converting `IntegrityError` to `DuplicateData`, chain the original exception to preserve the database error details for debugging.

🔧 Proposed fix

```diff
         try:
             self.perform_create(serializer)
-        except IntegrityError:
-            raise DuplicateData(LookupProfileManagerErrors.PROFILE_NAME_EXISTS)
+        except IntegrityError as e:
+            raise DuplicateData(LookupProfileManagerErrors.PROFILE_NAME_EXISTS) from e
```

backend/lookup/views.py (1)
260-265: ⚠️ Potential issue | 🟡 Minor

Hardcoded `user_id` may cause issues in multi-user scenarios.

Using a fixed `user_id = "lookup"` means all lookup uploads share the same path segment, which could cause file collisions if two users upload files with the same name to the same project. Consider using the actual user ID or a unique identifier.

🔧 Proposed fix

```diff
         org_id = UserContext.get_organization_identifier()
-        # Use a fixed user_id for lookup uploads to match PS path structure
-        # Path: {base_path}/{org_id}/{user_id}/{tool_id}/{filename}
-        user_id = "lookup"
+        # Use actual user ID to prevent file collisions between users
+        user_id = str(request.user.user_id) if hasattr(request.user, 'user_id') else str(request.user.id)
         tool_id = str(project.id)
```

backend/lookup/integrations/llm_provider.py (1)
100-142: ⚠️ Potential issue | 🟡 Minor

Unused `timeout` parameter and missing exception chaining.

The `timeout` parameter is declared but never used. Additionally, catching a broad `Exception` and re-raising without chaining loses the original traceback context.

🔧 Proposed fix

```diff
-    def generate(self, prompt: str, config: dict[str, Any], timeout: int = 30) -> str:
+    def generate(self, prompt: str, config: dict[str, Any], timeout: int = 30) -> str:  # noqa: ARG002
         """Generate LLM response for Look-Up enrichment.

         Args:
             prompt: The prompt text with resolved variables
             config: LLM configuration (provider, model, temperature, etc.)
-            timeout: Request timeout in seconds
+            timeout: Request timeout in seconds (reserved for future use)

         Returns:
             JSON-formatted string with enrichment data

         Raises:
             RuntimeError: If LLM call fails
         """
         start_time = time.time()

         try:
             if self.llm_available:
                 # Use actual Unstract LLM integration
                 llm_config = self._get_llm_instance(config)
                 # In production, this would make actual LLM call
                 # For now, simulate the call
                 response = self._simulate_llm_call(prompt, llm_config)
             else:
                 # Fallback implementation
                 response = self._fallback_generate(prompt, config)

             # Validate response is JSON
             try:
                 json.loads(response)
             except json.JSONDecodeError:
                 # Try to extract JSON from response
                 response = self._extract_json(response)

             elapsed_time = time.time() - start_time
             logger.info(f"LLM generation completed in {elapsed_time:.2f}s")

             return response

         except Exception as e:
-            logger.error(f"LLM generation failed: {e}")
-            raise RuntimeError(f"LLM generation failed: {str(e)}")
+            logger.exception("LLM generation failed")
+            raise RuntimeError(f"LLM generation failed: {e!s}") from e
```
```python
        # Try to find JSON in the response
        import re

        # Look for JSON object pattern
        json_pattern = r"\{[^{}]*\}"
        matches = re.findall(json_pattern, response)

        if matches:
            # Try to parse each match
            for match in matches:
                try:
                    json.loads(match)
                    return match
                except json.JSONDecodeError:
                    continue
```
JSON extraction regex won't match nested objects.
The pattern r"\{[^{}]*\}" excludes any braces inside the match, so nested JSON like {"data": {"value": 1}} won't be extracted correctly. Consider using a more robust approach.
🔧 Proposed fix using iterative brace matching

```diff
     def _extract_json(self, response: str) -> str:
         """Extract JSON from LLM response if wrapped in text."""
-        # Try to find JSON in the response
-        import re
-
-        # Look for JSON object pattern
-        json_pattern = r"\{[^{}]*\}"
-        matches = re.findall(json_pattern, response)
-
-        if matches:
-            # Try to parse each match
-            for match in matches:
-                try:
-                    json.loads(match)
-                    return match
-                except json.JSONDecodeError:
-                    continue
+        # Try to find JSON by locating balanced braces
+        start_idx = response.find("{")
+        if start_idx == -1:
+            # No JSON object found
+            return json.dumps({
+                "raw_response": response[:500],
+                "confidence": 0.3,
+                "warning": "Could not extract structured data",
+            })
+
+        # Try candidate substrings ending in "}", from longest to shortest
+        for end_idx in range(len(response), start_idx, -1):
+            if response[end_idx - 1] == "}":
+                candidate = response[start_idx:end_idx]
+                try:
+                    json.loads(candidate)
+                    return candidate
+                except json.JSONDecodeError:
+                    continue
         # If no valid JSON found, create a basic response
         return json.dumps({
             "raw_response": response[:500],
             "confidence": 0.3,
             "warning": "Could not extract structured data",
         })
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 219-219: Consider moving this statement to an else block
(TRY300)
```python
        count = 0

        for prompt, response in preload_data.items():
            # Generate key with project namespace
            key = f"{self.key_prefix}project:{project_id}:{hashlib.md5(prompt.encode()).hexdigest()}"

            if self.set(key, response):
                count += 1
```
🧩 Analysis chain (verification scripts)

```bash
# Explore the file structure
cd backend/lookup/integrations && wc -l redis_cache.py

# Get context around the flagged lines
sed -n '350,400p' redis_cache.py

# Search for hash-related code throughout the file
grep -n "hashlib\|sha256\|md5" redis_cache.py

# Inspect the main cache key scheme around generate_cache_key()
sed -n '95,120p' redis_cache.py
```
Align warmup cache key hashing with the main cache scheme — use SHA256 instead of MD5.
The main cache key generation (generate_cache_key()) uses SHA256, but the warmup method uses MD5, creating an inconsistency. Additionally, MD5 is cryptographically weak and flagged by security linters.
🛠️ Suggested fix
- key = f"{self.key_prefix}project:{project_id}:{hashlib.md5(prompt.encode()).hexdigest()}"
+ prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
+ key = f"{self.key_prefix}project:{project_id}:{prompt_hash}"📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.14.14)
[error] 378-378: Probable use of insecure hash functions in hashlib: md5
(S324)
```python
@receiver(pre_save, sender=LookupDataSource)
def auto_increment_version_and_update_latest(sender, instance, **kwargs):
    """Signal to auto-increment version number and manage is_latest flag.

    This signal:
    1. Auto-increments version_number if not set
    2. Marks all previous versions as not latest
    """
    # Check if this is a new instance by querying the database
    # (instance.pk is always truthy for UUIDField with default=uuid.uuid4)
    is_new = not LookupDataSource.objects.filter(pk=instance.pk).exists()

    if is_new:
        # Get the highest version number for this project
        max_version = LookupDataSource.objects.filter(project=instance.project).aggregate(
            max_version=models.Max("version_number")
        )["max_version"]

        # Always auto-increment version for new instances
        instance.version_number = (max_version or 0) + 1

        # Mark all previous versions as not latest
        LookupDataSource.objects.filter(project=instance.project, is_latest=True).update(
            is_latest=False
        )

        # Ensure new version is marked as latest
        instance.is_latest = True
```
Race condition risk in version auto-increment signal.
The pre_save signal queries Max("version_number") and then sets the new version, but concurrent inserts could result in duplicate version numbers since the unique constraint check happens at commit time. Consider using select_for_update() or database-level sequences.
🔒 Proposed mitigation using select_for_update
```diff
 @receiver(pre_save, sender=LookupDataSource)
 def auto_increment_version_and_update_latest(sender, instance, **kwargs):
     # Check if this is a new instance
     is_new = not LookupDataSource.objects.filter(pk=instance.pk).exists()

     if is_new:
+        from django.db import transaction
+        with transaction.atomic():
+            # Lock existing records to prevent race conditions
+            existing = LookupDataSource.objects.filter(
+                project=instance.project
+            ).select_for_update().order_by('-version_number').first()
+
+            instance.version_number = (existing.version_number if existing else 0) + 1
-        max_version = LookupDataSource.objects.filter(project=instance.project).aggregate(
-            max_version=models.Max("version_number")
-        )["max_version"]
-
-        instance.version_number = (max_version or 0) + 1

         # Mark all previous versions as not latest
         LookupDataSource.objects.filter(project=instance.project, is_latest=True).update(
             is_latest=False
         )

         # Ensure new version is marked as latest
         instance.is_latest = True
```

Note: This approach requires the signal to be called within a transaction context. Alternatively, consider handling version assignment in the model's save() method, where you have more control over the transaction.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 169-169: Unused function argument: sender
(ARG001)
[warning] 169-169: Unused function argument: kwargs
(ARG001)
```python
        # Ensure error_message is provided for failed status
        if self.status == "failed" and not self.error_message:
            raise ValidationError("Error message is required for failed executions")

        # Ensure enriched_output is provided for success status
        if self.status == "success" and not self.enriched_output:
            raise ValidationError("Enriched output is required for successful executions")
```
🧩 Analysis chain (verification scripts)

```bash
# Check how partial status is used in the codebase
rg '"partial"' --type py -C3 | head -100

# Search for status patterns and partial usage
rg "status.*partial|partial.*status" --type py -C2 | head -80

# Look at the full context of the validation file
cat -n backend/lookup/models/lookup_execution_audit.py | sed -n '120,160p'

# Find all status values used in the codebase
rg "status\s*=\s*[\"']" --type py | grep -E "(success|failed|partial)" | head -30
```
Add validation for enriched_output on "partial" status to match "success" requirements.
The code validates enriched_output for status "success" but not for "partial". Since the was_successful() property treats both "success" and "partial" as successful outcomes, the validation should be consistent. If "success" requires enriched output, "partial" should too—partial success still implies some processed output exists.
Consider adding validation for "partial" status:
if self.status in ["success", "partial"] and not self.enriched_output:
raise ValidationError("Enriched output is required for successful executions")🧰 Tools
🪛 Ruff (0.14.14)
[warning] 136-136: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 140-140: Avoid specifying long messages outside the exception class
(TRY003)
```python
    def clean(self):
        """Validate model fields."""
        super().clean()
        from django.core.exceptions import ValidationError

        # Validate LLM provider
        valid_providers = [choice[0] for choice in self.LLM_PROVIDER_CHOICES]
        if self.llm_provider not in valid_providers:
            raise ValidationError(
                f"Invalid LLM provider: {self.llm_provider}. "
                f"Must be one of: {', '.join(valid_providers)}"
            )

        # Validate LLM config structure
        if self.llm_config and not isinstance(self.llm_config, dict):
            raise ValidationError("LLM config must be a dictionary")

        # Validate lookup_type (only static_data for POC)
        if self.lookup_type != "static_data":
            raise ValidationError(
                "Only 'static_data' lookup type is supported in this POC"
            )
```
Validation fails for nullable llm_provider when it's None.
The llm_provider field allows null=True, but clean() validates that it must be in LLM_PROVIDER_CHOICES. When llm_provider is None, this validation will raise a ValidationError since None is not in the valid choices list.
🐛 Proposed fix
```diff
 def clean(self):
     """Validate model fields."""
     super().clean()
     from django.core.exceptions import ValidationError

     # Validate LLM provider
-    valid_providers = [choice[0] for choice in self.LLM_PROVIDER_CHOICES]
-    if self.llm_provider not in valid_providers:
-        raise ValidationError(
-            f"Invalid LLM provider: {self.llm_provider}. "
-            f"Must be one of: {', '.join(valid_providers)}"
-        )
+    if self.llm_provider is not None:
+        valid_providers = [choice[0] for choice in self.LLM_PROVIDER_CHOICES]
+        if self.llm_provider not in valid_providers:
+            raise ValidationError(
+                f"Invalid LLM provider: {self.llm_provider}. "
+                f"Must be one of: {', '.join(valid_providers)}"
+            )

     # Validate LLM config structure
     if self.llm_config and not isinstance(self.llm_config, dict):
         raise ValidationError("LLM config must be a dictionary")
```
raise ValidationError("LLM config must be a dictionary")🧰 Tools
🪛 Ruff (0.14.14)
[warning] 150-153: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 157-157: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 161-163: Avoid specifying long messages outside the exception class
(TRY003)
```python
        # Get tool_id from instance (update) or initial data (create)
        tool_id = None
        if self.instance:
            tool_id = self.instance.tool_id_id
        elif "tool_id" in self.initial_data:
            tool_id = self.initial_data["tool_id"]

        if tool_id:
            try:
                from lookup.models import PromptStudioLookupLink

                link_exists = PromptStudioLookupLink.objects.filter(
                    prompt_studio_project_id=tool_id,
                    lookup_project=value,
                ).exists()

                if not link_exists:
                    raise serializers.ValidationError(
                        "Selected lookup project must be linked to this "
                        "Prompt Studio project at the project level first."
                    )
            except ImportError:
                # Lookup app not installed, skip validation
                pass
```
🧩 Analysis chain (verification scripts)

```bash
head -100 backend/prompt_studio/prompt_studio_v2/serializers.py | cat -n
grep -n "class ToolStudioPrompt" backend/prompt_studio/prompt_studio_v2/models.py
sed -n '11,80p' backend/prompt_studio/prompt_studio_v2/models.py | cat -n
grep -n "lookup_project" backend/prompt_studio/prompt_studio_v2/models.py
sed -n '143,155p' backend/prompt_studio/prompt_studio_v2/models.py | cat -n
find backend -type f -name "*.py" | xargs grep -l "PromptStudioLookupLink" | head -5
cat -n backend/lookup/models/prompt_studio_lookup_link.py
grep -n "ToolStudioPromptSerializer" backend/prompt_studio/prompt_studio_v2/views.py | head -10
sed -n '20,80p' backend/prompt_studio/prompt_studio_v2/views.py | cat -n
grep -A 5 "class Meta:" backend/prompt_studio/prompt_studio_v2/serializers.py | head -10
```

The chain also walks through a PATCH scenario: with the database holding `tool_id='old-tool'` and a request updating both `tool_id` and `lookup_project`, `self.initial_data["tool_id"]` carries the incoming value while `self.instance.tool_id_id` carries the old one. The current code prioritizes the instance, so the lookup link is validated against the old `tool_id` (incorrect); the suggested logic prioritizes `initial_data`, validating against the new `tool_id` (correct).
Validate against the incoming tool_id on updates.
When a PATCH/PUT request includes both a new tool_id and lookup_project, the current logic validates against self.instance.tool_id_id (the old value), allowing the lookup project to be linked to the wrong project. The incoming tool_id from self.initial_data should take precedence during updates.
Suggested fix

```diff
-        tool_id = None
-        if self.instance:
-            tool_id = self.instance.tool_id_id
-        elif "tool_id" in self.initial_data:
-            tool_id = self.initial_data["tool_id"]
+        tool_id = None
+        if "tool_id" in self.initial_data:
+            tool_id = self.initial_data["tool_id"]
+        elif self.instance:
+            tool_id = self.instance.tool_id_id
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 52-55: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@backend/prompt_studio/prompt_studio_v2/serializers.py` around lines 35 - 58,
The validation uses self.instance.tool_id_id (old value) instead of respecting
an incoming tool_id in updates; change the logic in the serializer validation
block so that if "tool_id" exists in self.initial_data it overrides
self.instance.tool_id_id (i.e., determine tool_id =
self.initial_data.get("tool_id") if present else self.instance.tool_id_id), then
perform the PromptStudioLookupLink.objects.filter check
(prompt_studio_project_id=tool_id, lookup_project=value) and raise
serializers.ValidationError if no link exists; keep the ImportError handling
as-is.
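To see the corrected precedence in context, here is a minimal sketch of the full field validator (the method name, queryset fields, and error message are assumptions drawn from the AI-agent prompt above, not verified against the codebase):

```python
# Sketch only: DRF field validator with initial_data taking precedence.
def validate_lookup_project(self, value):
    if value is None:
        return value
    tool_id = None
    if "tool_id" in self.initial_data:
        # The incoming value wins on PATCH/PUT so a changed tool_id is honored.
        tool_id = self.initial_data["tool_id"]
    elif self.instance:
        # Fall back to the stored value for partial updates without tool_id.
        tool_id = self.instance.tool_id_id
    if tool_id and not PromptStudioLookupLink.objects.filter(
        prompt_studio_project_id=tool_id, lookup_project=value
    ).exists():
        raise serializers.ValidationError(
            "Lookup project is not linked to this Prompt Studio project."
        )
    return value
```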
```jsx
<Popover
  content={content}
  title={null}
  trigger="hover"
  placement="top"
  overlayClassName="lookup-replacement-popover"
>
  <Button
    size="small"
    type="text"
    className="prompt-card-action-button lookup-indicator-btn"
  >
    <SwapOutlined className="prompt-card-actions-head lookup-indicator-icon" />
  </Button>
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n frontend/src/components/custom-tools/prompt-card/LookupReplacementIndicator.jsx
```
Length of output: 2759
🏁 Script executed:
rg "trigger=\[" --type jsx --type tsx -A 2 -B 2 frontend/src/components/ | head -50Repository: Zipstack/unstract
Length of output: 88
🏁 Script executed:
rg "LookupReplacementIndicator" --type jsx --type tsx -B 3 -A 3Repository: Zipstack/unstract
Length of output: 88
🏁 Script executed:
```shell
rg 'trigger=\[' -A 2 -B 2 frontend/src/components/ | head -50
```
Length of output: 43
🏁 Script executed:
rg "LookupReplacementIndicator" -B 3 -A 3Repository: Zipstack/unstract
Length of output: 5353
🏁 Script executed:
rg "trigger=" frontend/src/components/ | grep -i popoverRepository: Zipstack/unstract
Length of output: 43
🌐 Web query:
Ant Design Popover trigger prop array syntax React
💡 Result:
In Ant Design (React), Popover inherits the trigger prop behavior from Tooltip, where trigger can be a single mode or multiple modes by passing an array. (4x.ant.design)
```jsx
import { Popover, Button } from "antd";

<Popover
  title="Title"
  content="Content"
  trigger={["hover", "click"]} // array syntax
>
  <Button>Hover or click me</Button>
</Popover>
```

Valid trigger values: `"hover" | "focus" | "click" | "contextMenu"` (and an array of them). (4x-ant-design.antgroup.com)
Citations:
- 1: https://4x.ant.design/components/popover/?utm_source=openai
- 2: https://4x-ant-design.antgroup.com/components/tooltip/?utm_source=openai
Improve keyboard accessibility and fix falsy-value rendering.
The hover-only trigger prevents keyboard users from discovering replacement details. Change `trigger="hover"` to `trigger={["hover", "focus"]}` and add an `aria-label` to the button for screen reader users.
Additionally, the `||` operator incorrectly treats valid falsy values (`0`, `false`, empty strings) as missing data. Use the nullish coalescing operator `??` instead to only replace `null` or `undefined`.
Suggested fix

```diff
 return (
   <Popover
     content={content}
     title={null}
-    trigger="hover"
+    trigger={["hover", "focus"]}
     placement="top"
     overlayClassName="lookup-replacement-popover"
   >
     <Button
       size="small"
       type="text"
       className="prompt-card-action-button lookup-indicator-btn"
+      aria-label="Show look-up replacement details"
     >
       <SwapOutlined className="prompt-card-actions-head lookup-indicator-icon" />
     </Button>
   </Popover>
 );
```

Update lines 29 and 40 to use nullish coalescing:
```diff
-<Text className="lookup-replacement-value original">
-  {originalValue || "(empty)"}
+<Text className="lookup-replacement-value original">
+  {originalValue ?? "(empty)"}
 </Text>
 ...
-<Text className="lookup-replacement-value enriched">
-  {enrichedValue || "(empty)"}
+<Text className="lookup-replacement-value enriched">
+  {enrichedValue ?? "(empty)"}
 </Text>
```

🤖 Prompt for AI Agents
In
`@frontend/src/components/custom-tools/prompt-card/LookupReplacementIndicator.jsx`
around lines 48 - 61, The Popover uses trigger="hover" which blocks keyboard
access and the Button lacks an aria-label; change Popover's trigger to
trigger={["hover","focus"]} and add a descriptive aria-label to the Button (the
element rendering SwapOutlined) so screen readers and keyboard users can
discover the replacement details. Also replace any uses of the logical OR
fallback (e.g. value || fallback) in this component with the nullish coalescing
operator (value ?? fallback) so valid falsy values (0, false, "") are preserved;
look for usages in LookupReplacementIndicator.jsx around the Popover/content
rendering and update them accordingly.
```jsx
if (LookupManager) {
  items.push(getMenuItem("Lookups", 9, <SearchOutlined />));
  listOfComponents[9] = (
    <LookupManager handleUpdateTool={handleUpdateTool} />
  );
}
```
Key collision: LookupManager uses the same key (9) as CustomDataSettings.
The key 9 is already assigned to "Custom Data" on line 47 and listOfComponents[9] is set to <CustomDataSettings /> on line 68. When LookupManager is available, this code overwrites listOfComponents[9] with <LookupManager /> and adds a duplicate menu item with the same key, causing:
- `CustomDataSettings` component becomes unreachable
- React will warn about duplicate keys in the menu
Use an unused key (e.g., 10) for the Lookups menu item.
🐛 Proposed fix

```diff
 if (LookupManager) {
-  items.push(getMenuItem("Lookups", 9, <SearchOutlined />));
-  listOfComponents[9] = (
+  items.push(getMenuItem("Lookups", 10, <SearchOutlined />));
+  listOfComponents[10] = (
     <LookupManager handleUpdateTool={handleUpdateTool} />
   );
 }
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```jsx
if (LookupManager) {
  items.push(getMenuItem("Lookups", 10, <SearchOutlined />));
  listOfComponents[10] = (
    <LookupManager handleUpdateTool={handleUpdateTool} />
  );
}
```
🤖 Prompt for AI Agents
In `@frontend/src/components/custom-tools/settings-modal/SettingsModal.jsx` around
lines 119 - 124, LookupManager is using the same numeric key (9) as
CustomDataSettings which overwrites listOfComponents[9] and creates duplicate
menu keys; update the LookupManager menu entry and its component mapping to use
an unused key (for example 10) by changing the getMenuItem call and the
listOfComponents index from 9 to 10 (referencing LookupManager, getMenuItem, and
listOfComponents) so CustomDataSettings remains reachable and keys are unique.
```jsx
const handleSubmit = async () => {
  try {
    const values = await form.validateFields();
    await onCreate(values);
    form.resetFields();
  } catch (error) {
    console.error("Validation failed:", error);
  }
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n frontend/src/components/lookups/create-project-modal/CreateProjectModal.jsx
```
Length of output: 2220
Don't swallow onCreate errors or mislabel them as validation failures.
The current catch block logs both validation errors and create failures (e.g., API errors) as "Validation failed:", providing no visibility into what actually went wrong. Users receive no feedback when project creation fails. Distinguish between validation and create errors so create failures get proper error messaging.
🔧 Proposed fix (separate validation vs. create failures)

```diff
-import { Form, Input, Modal } from "antd";
+import { Form, Input, Modal, message } from "antd";
 ...
 const handleSubmit = async () => {
   try {
     const values = await form.validateFields();
     await onCreate(values);
     form.resetFields();
   } catch (error) {
-    console.error("Validation failed:", error);
+    if (error?.errorFields) {
+      console.error("Validation failed:", error);
+      return;
+    }
+    message.error("Failed to create project. Please try again.");
+    console.error("Create failed:", error);
   }
 };
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```jsx
const handleSubmit = async () => {
  try {
    const values = await form.validateFields();
    await onCreate(values);
    form.resetFields();
  } catch (error) {
    if (error?.errorFields) {
      console.error("Validation failed:", error);
      return;
    }
    message.error("Failed to create project. Please try again.");
    console.error("Create failed:", error);
  }
};
```
🤖 Prompt for AI Agents
In `@frontend/src/components/lookups/create-project-modal/CreateProjectModal.jsx`
around lines 9 - 16, handleSubmit currently treats any error as a validation
failure; change it to first await form.validateFields() in a try/catch to handle
and log/return validation errors, then call await onCreate(values) in a separate
try/catch so API/create failures are handled distinctly (log the error, show
user-visible feedback via message.error or set a submission error) and still
call form.resetFields() only on successful creation; reference handleSubmit,
form.validateFields, onCreate, and form.resetFields when updating the control
flow.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@docs/lookup_architecture_final.md`:
- Around line 186-196: The static method
LookupProfileManager.get_default_profile references an undefined
DefaultProfileError and assumes exactly one default exists; define or import
DefaultProfileError (e.g., from backend.lookup.exceptions) and update docs to
describe the profile initialization strategy (auto-create a default on
LookupProject creation or require profile creation) and system-wide defaults;
also handle the potential MultipleObjectsReturned from
LookupProfileManager.objects.get by documenting/implementing the uniqueness
constraint (or catching MultipleObjectsReturned and choosing a deterministic
resolution) so callers won’t fail on multiple/default-missing cases.
- Around line 273-278: The docs mention a UniqueConstraint named
unique_data_source_profile_index that prevents duplicate (data_source, profile)
entries but the indexing workflow's Step 7 creates a new IndexManager row
without addressing existing rows; update the documentation to specify the
intended reindexing behavior for the IndexManager model (whether to update the
existing IndexManager record, delete-and-recreate it, or use a DB upsert), and
change the guidance in Step 7 to explicitly recommend and describe the chosen
approach (for example, use IndexManager.objects.update_or_create(...) instead of
IndexManager.objects.create(...) if you want an upsert, or describe the
delete-then-create flow and its transactional implications if you prefer
recreate).
- Around line 434-443: The docs mark LLM selection as optional but don't state
the decision logic; update the Step 6 section to explicitly describe when to
call LLM.select_best_match (e.g., based on vector-search confidence thresholds,
profile configuration, or result count), reference the profile LLM flag
(profile.llm.id or a boolean like profile.enable_llm_selection) and the inputs
(matches array and query), and document rules such as "invoke when top-N scores
are below X, or when multiple candidates exceed threshold Y, otherwise skip if a
single match score > X" or point to the exact configuration parameter that
controls this behavior so readers know when LLM(...) and .select_best_match(...)
will be executed.
- Around line 321-385: The workflow lacks error handling and should document and
implement retries, rollbacks and status transitions: for failures in
LookupProfileManager.get_default_profile/X2Text.process (Step 2) set
data_source.extraction_status='failed' and schedule retry/backoff; for
Embedding.embed (Step 5) and VectorDB.index (Step 6) catch adapter-specific
exceptions, mark indexing_status='failed' on the LookupIndexManager entry (or on
data_source), and trigger a cleanup routine to delete any partial uploads in
storage and any partial vector DB entries; replace the direct
LookupIndexManager.objects.create call with
LookupIndexManager.objects.update_or_create (or an explicit reindex pathway) to
avoid IntegrityError on reindex, and document the exact retry policy,
idempotency guarantees, and atomic rollback steps (storage.delete,
vector_db.delete_index) so each failure path transitions the data_source and
LookupIndexManager to deterministic states.
- Around line 46-57: The model currently allows multiple profiles with
is_default=True because the Meta.constraints only enforces unique
(prompt_studio_tool, profile_name); add a partial unique constraint to the same
Meta that enforces uniqueness of prompt_studio_tool when is_default=True (i.e.,
UniqueConstraint(fields=["prompt_studio_tool"], condition=Q(is_default=True),
name="unique_default_profile_per_tool")), import Q from django.db.models and
ensure this pattern is also added to LookupProfileManager.Meta to enforce a
single default per tool/project (note: requires PostgreSQL and Django 4.0+).
🧹 Nitpick comments (2)
docs/lookup_architecture_final.md (2)
287-317: Add language specifiers to fenced code blocks.

The workflow diagrams at lines 287-317, 321-385, and 389-445 should specify a language for better rendering and syntax highlighting. Consider using `text`, `mermaid`, or `plaintext`.

Example fix

````diff
-```
+```text
 User Action: Create New Profile
     ↓
````

As per coding guidelines, markdownlint-cli2 flags these as MD040 violations.
Also applies to: 321-385, 389-445
453-500: Add language specifiers and document API error responses.

The API endpoint blocks should specify a language (e.g., `http` or `text`) for proper rendering.

Additionally, consider documenting:
- Authentication requirements (JWT, session, API key?)
- Error response formats (HTTP status codes, error payload structure)
- Pagination for list endpoints
- Rate limiting policies
As per coding guidelines, markdownlint-cli2 flags these as MD040 violations.
docs/lookup_architecture_final.md
Outdated
```python
# Flags
is_default = BooleanField(default=False)
reindex = BooleanField(default=False)

# Unique constraint
class Meta:
    constraints = [
        UniqueConstraint(
            fields=["prompt_studio_tool", "profile_name"],
            name="unique_prompt_studio_tool_profile_name_index"
        )
    ]
```
Missing constraint: Multiple default profiles allowed per project.
The model allows is_default=True for multiple profiles within the same project because there's no database constraint enforcing uniqueness. The UniqueConstraint on lines 52-57 only prevents duplicate profile names, not multiple defaults.
This same issue applies to the Lookup System pattern (lines 146-184).
🔒 Suggested constraint to ensure single default profile
Add a partial unique constraint (supported in PostgreSQL, Django 4.0+):
```python
class Meta:
    constraints = [
        UniqueConstraint(
            fields=["prompt_studio_tool", "profile_name"],
            name="unique_prompt_studio_tool_profile_name_index"
        ),
        UniqueConstraint(
            fields=["prompt_studio_tool"],
            condition=Q(is_default=True),
            name="unique_default_profile_per_tool"
        )
    ]
```

Apply the same pattern to `LookupProfileManager.Meta` (lines 175-184).
🤖 Prompt for AI Agents
In `@docs/lookup_architecture_final.md` around lines 46 - 57, The model currently
allows multiple profiles with is_default=True because the Meta.constraints only
enforces unique (prompt_studio_tool, profile_name); add a partial unique
constraint to the same Meta that enforces uniqueness of prompt_studio_tool when
is_default=True (i.e., UniqueConstraint(fields=["prompt_studio_tool"],
condition=Q(is_default=True), name="unique_default_profile_per_tool")), import Q
from django.db.models and ensure this pattern is also added to
LookupProfileManager.Meta to enforce a single default per tool/project (note:
requires PostgreSQL and Django 4.0+).
docs/lookup_architecture_final.md
Outdated
```python
@staticmethod
def get_default_profile(project: LookupProject) -> "LookupProfileManager":
    """Get default profile for a lookup project."""
    try:
        return LookupProfileManager.objects.get(
            lookup_project=project,
            is_default=True
        )
    except LookupProfileManager.DoesNotExist:
        raise DefaultProfileError("No default profile found for lookup project")
```
Undefined exception and missing safeguards.
The DefaultProfileError exception is referenced but never defined in this document. Additionally, the static method assumes exactly one default profile exists, but there's no documented strategy for ensuring a default profile is created when a new LookupProject is initialized.
Recommendations:

- Define `DefaultProfileError` in the exceptions section or reference where it's defined (e.g., `backend/lookup/exceptions.py`)
- Document the profile initialization strategy: Should a default profile be auto-created with the project? Should project creation require a profile? Should there be system-wide default adapter selections?
- Consider handling the edge case where `.get()` might return multiple results if the constraint from the previous comment isn't implemented (see the sketch below)
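A defensive sketch that covers both failure modes (it assumes `DefaultProfileError` lives in a hypothetical `backend/lookup/exceptions.py` and uses an assumed `created_at` field to break ties deterministically):

```python
# Sketch only: tolerate both missing and duplicate defaults.
from lookup.exceptions import DefaultProfileError  # assumed module path


@staticmethod
def get_default_profile(project: "LookupProject") -> "LookupProfileManager":
    """Return the default profile, degrading gracefully on duplicates."""
    defaults = LookupProfileManager.objects.filter(
        lookup_project=project, is_default=True
    )
    # created_at is an assumed field; ordering keeps repeated calls consistent
    # until the partial unique constraint guarantees at most one default.
    profile = defaults.order_by("created_at").first()
    if profile is None:
        raise DefaultProfileError("No default profile found for lookup project")
    return profile
```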
🤖 Prompt for AI Agents
In `@docs/lookup_architecture_final.md` around lines 186 - 196, The static method
LookupProfileManager.get_default_profile references an undefined
DefaultProfileError and assumes exactly one default exists; define or import
DefaultProfileError (e.g., from backend.lookup.exceptions) and update docs to
describe the profile initialization strategy (auto-create a default on
LookupProject creation or require profile creation) and system-wide defaults;
also handle the potential MultipleObjectsReturned from
LookupProfileManager.objects.get by documenting/implementing the uniqueness
constraint (or catching MultipleObjectsReturned and choosing a deterministic
resolution) so callers won’t fail on multiple/default-missing cases.
docs/lookup_architecture_final.md
Outdated
```python
constraints = [
    models.UniqueConstraint(
        fields=["data_source", "profile"],
        name="unique_data_source_profile_index"
    )
]
```
🛠️ Refactor suggestion | 🟠 Major
Document reindexing behavior with unique constraint.
The unique_data_source_profile_index constraint prevents duplicate index entries, but the indexing workflow (Step 7, line 374) shows creating a new IndexManager entry without handling the case where one already exists.
Document whether reindexing should:

- Update the existing `IndexManager` entry
- Delete and recreate it
- Use `update_or_create()` instead of `create()` (sketched below)
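For the upsert option, Step 7 could be rewritten along these lines (field names are taken from the workflow excerpt, not verified against the implementation):

```python
# Sketch only: upsert keyed on the constrained (data_source, profile) pair
# so reindexing updates in place instead of raising IntegrityError.
index_entry, created = LookupIndexManager.objects.update_or_create(
    data_source=data_source,
    profile=profile,
    defaults={
        "vector_db_index_id": index_id,
        "indexing_status": "completed",
        "chunk_count": len(chunks),
    },
)
```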
🤖 Prompt for AI Agents
In `@docs/lookup_architecture_final.md` around lines 273 - 278, The docs mention a
UniqueConstraint named unique_data_source_profile_index that prevents duplicate
(data_source, profile) entries but the indexing workflow's Step 7 creates a new
IndexManager row without addressing existing rows; update the documentation to
specify the intended reindexing behavior for the IndexManager model (whether to
update the existing IndexManager record, delete-and-recreate it, or use a DB
upsert), and change the guidance in Step 7 to explicitly recommend and describe
the chosen approach (for example, use IndexManager.objects.update_or_create(...)
instead of IndexManager.objects.create(...) if you want an upsert, or describe
the delete-then-create flow and its transactional implications if you prefer
recreate).
docs/lookup_architecture_final.md
Outdated
```
User Action: Upload Reference Data File
    ↓
Frontend: Reference Data Tab
    - Upload vendors.csv
    ↓
API: POST /api/v1/unstract/{org_id}/lookup/data-source/
    - Creates LookupDataSource with extraction_status='pending'
    ↓
Backend: Celery Task - Extract and Index
    ↓
Step 1: Get Default Profile
    profile = LookupProfileManager.get_default_profile(project)
    ↓
Step 2: Extract Text using Profile's X2Text Adapter
    x2text = X2Text(
        tool=util,
        adapter_instance_id=profile.x2text.id
    )
    extracted_text = x2text.process(file_content)
    ↓
Step 3: Store Extracted Text in Object Storage
    storage.upload(
        path=f"lookup/{project_id}/{filename}.txt",
        content=extracted_text
    )
    ↓
Step 4: Chunk Text
    chunks = create_chunks(
        text=extracted_text,
        chunk_size=profile.chunk_size,
        overlap=profile.chunk_overlap
    )
    ↓
Step 5: Generate Embeddings using Profile's Embedding Adapter
    embedding = Embedding(
        tool=util,
        adapter_instance_id=profile.embedding_model.id
    )
    vectors = [embedding.embed(chunk) for chunk in chunks]
    ↓
Step 6: Store in Vector DB using Profile's Vector DB Adapter
    vector_db = VectorDB(
        tool=util,
        adapter_instance_id=profile.vector_store.id
    )
    index_id = vector_db.index(
        documents=chunks,
        embeddings=vectors,
        metadata={...}
    )
    ↓
Step 7: Create Index Manager Entry
    LookupIndexManager.objects.create(
        data_source=data_source,
        profile=profile,
        vector_db_index_id=index_id,
        indexing_status='completed',
        chunk_count=len(chunks)
    )
    ↓
Step 8: Update Data Source Status
    data_source.extraction_status = 'completed'
    data_source.save()
```
Document error handling and rollback strategy.
The indexing workflow shows an 8-step happy path but doesn't address:
- What happens if extraction fails at Step 2?
- How to handle embedding failures at Step 5?
- What if vector DB indexing fails at Step 6?
- Should partial indexing be rolled back?
- How are retries handled?
At Step 7 (line 374), create() will raise IntegrityError if an index already exists due to the unique constraint. The workflow should use update_or_create() or document the reindexing pattern.
Recommendation: Add an error handling subsection documenting retry logic, rollback strategy, and status transitions for failed operations.
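One possible shape for that subsection, sketched under the assumption that a single Celery task drives all eight steps (the helper functions, cleanup call, and status strings are illustrative, not from the codebase):

```python
# Sketch only: deterministic status transitions with partial-state cleanup.
def run_indexing(data_source, project, file_content, storage, vector_db):
    profile = LookupProfileManager.get_default_profile(project)
    index_id = None
    try:
        text = extract_text(profile, file_content)        # Steps 2-3, assumed helper
        chunks, vectors = chunk_and_embed(profile, text)  # Steps 4-5, assumed helper
        index_id = vector_db.index(documents=chunks, embeddings=vectors)
        LookupIndexManager.objects.update_or_create(      # Step 7 as an upsert
            data_source=data_source,
            profile=profile,
            defaults={
                "vector_db_index_id": index_id,
                "indexing_status": "completed",
                "chunk_count": len(chunks),
            },
        )
        data_source.extraction_status = "completed"
    except Exception:
        # Roll back partial artifacts before surfacing the failure.
        if index_id is not None:
            vector_db.delete_index(index_id)  # hypothetical cleanup helper
        data_source.extraction_status = "failed"
        raise  # let Celery's retry/backoff policy take over
    finally:
        data_source.save()
```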
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 321-321: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@docs/lookup_architecture_final.md` around lines 321 - 385, The workflow lacks
error handling and should document and implement retries, rollbacks and status
transitions: for failures in
LookupProfileManager.get_default_profile/X2Text.process (Step 2) set
data_source.extraction_status='failed' and schedule retry/backoff; for
Embedding.embed (Step 5) and VectorDB.index (Step 6) catch adapter-specific
exceptions, mark indexing_status='failed' on the LookupIndexManager entry (or on
data_source), and trigger a cleanup routine to delete any partial uploads in
storage and any partial vector DB entries; replace the direct
LookupIndexManager.objects.create call with
LookupIndexManager.objects.update_or_create (or an explicit reindex pathway) to
avoid IntegrityError on reindex, and document the exact retry policy,
idempotency guarantees, and atomic rollback steps (storage.delete,
vector_db.delete_index) so each failure path transitions the data_source and
LookupIndexManager to deterministic states.
docs/lookup_architecture_final.md
Outdated
```
Step 6: LLM Selection (Optional) using Profile's LLM Adapter
    llm = LLM(
        tool=util,
        adapter_instance_id=profile.llm.id
    )
    best_match = llm.select_best_match(
        query="Micro soft",
        candidates=matches
    )
    ↓
```
🛠️ Refactor suggestion | 🟠 Major
Clarify when LLM selection is used.
Step 6 labels LLM selection as "(Optional)" but doesn't explain the criteria for when it should be invoked. Questions:
- Is it based on confidence scores from vector search?
- Is it a configuration setting in the profile?
- Does it depend on the number of matches returned?
- Is it always skipped if only one result has score > threshold?
Document the decision logic or reference the configuration parameter that controls this behavior.
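As one illustration of documentable decision logic (the `enable_llm_selection` flag, the score shape, and the threshold are assumptions, not existing profile fields):

```python
# Sketch only: gate the LLM call on vector-search confidence.
HIGH_CONFIDENCE = 0.90  # assumed threshold; tune per profile


def needs_llm_selection(profile, matches) -> bool:
    """Return True when Step 6 should invoke llm.select_best_match()."""
    if not getattr(profile, "enable_llm_selection", True):  # hypothetical flag
        return False
    if not matches:
        return False
    top_score = max(match["score"] for match in matches)
    # A single confident hit stands on its own; ambiguity goes to the LLM.
    return len(matches) > 1 or top_score < HIGH_CONFIDENCE
```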
🤖 Prompt for AI Agents
In `@docs/lookup_architecture_final.md` around lines 434 - 443, The docs mark LLM
selection as optional but don't state the decision logic; update the Step 6
section to explicitly describe when to call LLM.select_best_match (e.g., based
on vector-search confidence thresholds, profile configuration, or result count),
reference the profile LLM flag (profile.llm.id or a boolean like
profile.enable_llm_selection) and the inputs (matches array and query), and
document rules such as "invoke when top-N scores are below X, or when multiple
candidates exceed threshold Y, otherwise skip if a single match score > X" or
point to the exact configuration parameter that controls this behavior so
readers know when LLM(...) and .select_best_match(...) will be executed.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/LOOKUP_ARCHITECTURE.md`:
- Around line 344-395: Several fenced code blocks in the LOOKUP_ARCHITECTURE.md doc are missing language identifiers (MD040); update each triple-backtick block that lists routes (e.g., the blocks starting with "GET /lookup-projects/...", "GET /lookup-profiles/...", "GET /data-sources/...", "GET /lookup-templates/...", the "Linking & Execution" block with "GET /lookup-links/..." and others) to include a language tag such as `text` (or `bash`) so markdownlint stops flagging MD040; apply the same change to the other reported ranges (402-412, 435-455, 459-476, 480-494, 499-518, 615-637) ensuring every fenced block with route listings has a language identifier.
- Around line 5-6: Update the "Last Updated" field in the document by replacing
the existing date value "2025-02-05" with the correct current date "2026-02-05"
so the metadata accurately reflects the new document addition; locate the **Last
Updated** line in LOOKUP_ARCHITECTURE.md and change only the date string,
leaving the **Status** line and other content unchanged.
**Last Updated**: 2025-02-05
**Status**: Implemented
Update the “Last Updated” date to reflect the current doc change.
This file is newly added in Feb 2026 but still shows 2025-02-05.
🔧 Proposed update
```diff
-**Last Updated**: 2025-02-05
+**Last Updated**: 2026-02-05
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
**Last Updated**: 2026-02-05
**Status**: Implemented
🤖 Prompt for AI Agents
In `@docs/LOOKUP_ARCHITECTURE.md` around lines 5 - 6, Update the "Last Updated"
field in the document by replacing the existing date value "2025-02-05" with the
correct current date "2026-02-05" so the metadata accurately reflects the new
document addition; locate the **Last Updated** line in LOOKUP_ARCHITECTURE.md
and change only the date string, leaving the **Status** line and other content
unchanged.
```
GET /lookup-projects/            # List all projects
POST /lookup-projects/           # Create new project
GET /lookup-projects/{id}/       # Get project details
PUT /lookup-projects/{id}/       # Update project
DELETE /lookup-projects/{id}/    # Delete project
```

### Profile Management

```
GET /lookup-profiles/                    # List profiles
POST /lookup-profiles/                   # Create profile
GET /lookup-profiles/{id}/               # Get profile details
PUT /lookup-profiles/{id}/               # Update profile
DELETE /lookup-profiles/{id}/            # Delete profile
POST /lookup-profiles/{id}/set-default/  # Set as default profile
```

### Data Source Management

```
GET /data-sources/                # List data sources
POST /data-sources/               # Upload new reference data
GET /data-sources/{id}/           # Get data source details
DELETE /data-sources/{id}/        # Delete data source
POST /data-sources/{id}/reindex/  # Trigger reindexing
```

### Template Management

```
GET /lookup-templates/          # List templates
POST /lookup-templates/         # Create template
GET /lookup-templates/{id}/     # Get template details
PUT /lookup-templates/{id}/     # Update template
DELETE /lookup-templates/{id}/  # Delete template
```

### Linking & Execution

```
GET /lookup-links/              # List PS project links
POST /lookup-links/             # Create link
DELETE /lookup-links/{id}/      # Remove link

GET /execution-audits/          # List execution history
GET /execution-audits/{id}/     # Get execution details

POST /lookup-debug/test/        # Test lookup execution
```
Add language identifiers to fenced code blocks (MD040).
Multiple fenced blocks are missing a language tag, which trips markdownlint and reduces readability.
✅ Example fix pattern
````diff
-```
+```text
 GET /lookup-projects/ # List all projects
 ...
````

Apply similarly to the other fenced blocks in this file.
Also applies to: 402-412, 435-455, 459-476, 480-494, 499-518, 615-637

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 344-344: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 354-354: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 365-365: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 375-375: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 385-385: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@docs/LOOKUP_ARCHITECTURE.md` around lines 344 - 395, several fenced code blocks are missing language identifiers (MD040); update each triple-backtick block that lists routes (e.g., the blocks starting with "GET /lookup-projects/...", "GET /lookup-profiles/...", "GET /data-sources/...", "GET /lookup-templates/...", the "Linking & Execution" block with "GET /lookup-links/..." and others) to include a language tag such as `text` (or `bash`) so markdownlint stops flagging MD040; apply the same change to the other reported ranges (402-412, 435-455, 459-476, 480-494, 499-518, 615-637) ensuring every fenced block with route listings has a language identifier.
Test Results

Summary

Runner Tests - Full Report
SDK1 Tests - Full Report
What

- New `lookup` Django app with complete backend models, services, and APIs

Why

How

Backend (`backend/lookup/`)

Models:

- `LookupProject` - Container for lookup configurations
- `LookupDataSource` - Reference data file storage and metadata
- `LookupProfileManager` - Adapter configurations (X2Text, Embedding, VectorDB, LLM)
- `LookupIndexManager` - Vector DB index tracking
- `LookupPromptTemplate` - Template configurations for lookup execution
- `PromptStudioLookupLink` - Links Prompt Studio projects to Lookup projects
- `LookupExecutionAudit` - Execution history and debugging

Services:

- `IndexingService` - Document indexing with chunking and vector embeddings
- `LookupExecutor` - Executes lookups using RAG retrieval
- `LookupOrchestrator` - Coordinates lookup workflow
- `VectorDBCleanupService` - Manages vector DB lifecycle
- `AuditLogger` - Execution logging and tracking

Integrations:

Frontend (`frontend/src/components/lookups/`)

- `LookUpProjectList` - Project listing and management
- `LookUpProjectDetail` - Project detail view with tabs
- `ProfileManagementTab` - Adapter profile configuration
- `ReferenceDataTab` - Reference data file management
- `TemplateTab` - Lookup template configuration
- `LinkedProjectsTab` - Prompt Studio project linking
- `ExecutionHistoryTab` - Execution audit viewing
- `DebugTab` - Testing and debugging interface

Prompt Studio Integration

- Adds a `lookup_project` field to the `ToolStudioPrompt` model

Can this PR break any existing features? If yes, please list possible items. If no, please explain why.
Low risk of breaking existing features:

- The only change to existing models is a new nullable field (`lookup_project`)
- New lookup endpoints are registered in `urls_v2.py` without modifying existing endpoints

Database Migrations

- `0001_initial.py` - Creates all lookup tables (LookupProject, LookupDataSource, LookupProfileManager, LookupIndexManager, LookupPromptTemplate, PromptStudioLookupLink, LookupExecutionAudit)
- `0002_remove_reference_data_type.py` - Schema refinement
- `0003_add_file_execution_id.py` - Adds file execution tracking
- `0004_fix_file_paths_for_minio.py` - MinIO storage path fixes
- `0005_add_reindex_required_field.py` - Adds reindex tracking
- `prompt_studio_v2/0014_toolstudioprompt_lookup_project.py` - Adds lookup_project FK to prompts

Env Config
No new required environment variables. Uses existing adapter configurations (X2Text, Embedding, VectorDB, LLM).
Relevant Docs
- `docs/lookup_architecture_final.md` - Complete architecture design

Related Issues or PRs
Dependencies Versions
No new external dependencies added.
Notes on Testing
- Tests added under `backend/lookup/tests/`

Screenshots
Lookups-Screenshots.zip
Checklist
I have read and understood the Contribution Guidelines.
🤖 Generated with Claude Code