feat: implement real Gemini lead extraction, fix migrations, and align UI#457
feat: implement real Gemini lead extraction, fix migrations, and align UI#457KhushiMulchandani wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR adds a persistent lead-scrape job model, async scraping task, API endpoints for job control and status, and frontend UI/script changes to start scrapes, poll progress, and show history. ChangesAI Lead Scraping Flow
Sequence Diagram(s)sequenceDiagram
participant Browser
participant LeadViewSet
participant scrape_leads_task
participant GeminiAPI as Gemini API
participant LeadScrapeJob
Browser->>LeadViewSet: POST /leads/scrape/ query, limit
LeadViewSet->>LeadScrapeJob: create PENDING job
LeadViewSet->>scrape_leads_task: delay(job_id, query, limit, org_id)
LeadViewSet-->>Browser: 201 job id
par background task
scrape_leads_task->>LeadScrapeJob: set RUNNING and started_at
scrape_leads_task->>GeminiAPI: request raw JSON leads
GeminiAPI-->>scrape_leads_task: response text
scrape_leads_task->>LeadScrapeJob: set COMPLETED or FAILED
and browser polling
loop until terminal status
Browser->>LeadViewSet: GET /leads/scrape/{job_id}/status/
LeadViewSet-->>Browser: serialized job data
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@Kuldeeep18 Kindly review and if it is good to go please merge. |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
backend/leads/tasks.py (3)
236-257: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winValidate email format before creating leads.
The model is prompted for "valid email structure," but the response is untrusted.
Lead.objects.create()does not run field validation, so malformed addresses get persisted. The CSV import path (import_leads_from_csv) already guards withvalidate_email; mirror that here for consistency.♻️ Suggested guard
for item in leads_data: email = (item.get('email') or '').strip().lower() if not email: continue + try: + validate_email(email) + except ValidationError: + continueRequires importing
validate_email/ValidationErrorif not already present.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/leads/tasks.py` around lines 236 - 257, Validate the email before calling Lead.objects.create in the lead task loop, since this path currently trusts raw input and bypasses model field validation. In the lead-creation logic inside the task that iterates over leads_data, mirror the CSV import behavior by using validate_email and skipping any address that raises ValidationError. Make sure the helper imports are added where needed and keep the existing organization/email deduplication and custom_variables handling intact.
236-257: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueOptional: batch the dedup/insert to cut per-lead DB round-trips.
For up to 200 leads this issues a query per item for
.exists()plus a separate insert each. Prefetching existing emails once and usingbulk_create(..., ignore_conflicts=True)against the(organization, email)unique constraint would reduce round-trips.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/leads/tasks.py` around lines 236 - 257, The current per-item dedup and insert in the lead creation loop causes unnecessary database round-trips. Update the logic in the task that processes leads_data to prefetch existing emails for the organization once, filter duplicates in memory, and then use bulk_create with ignore_conflicts=True on the Lead model so the organization/email unique constraint handles races efficiently. Keep the existing field mapping and custom_variables default, and preserve the leads_created counting based on the records actually queued for insertion.
269-276: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winLog the secondary failure instead of silently passing.
If marking the job
FAILEDitself fails, the job is left stuck inRUNNINGwith no trace. Log the swallowed exception so it's diagnosable.🪵 Suggested change
- except Exception: - pass + except Exception: + logger.exception("Failed to mark scrape job %s as FAILED", job_id)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/leads/tasks.py` around lines 269 - 276, The fallback update block in the LeadScrapeJob failure handling is swallowing a secondary exception, which can leave the job stuck in RUNNING with no visibility. In the task flow around LeadScrapeJob.objects.get, job.save(), and the FAILED status assignment, catch the exception from the failed status update and log it with a clear error message instead of using a bare pass; include the exception details so the failure is diagnosable while preserving the existing job error handling.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/leads/tasks.py`:
- Around line 198-218: The Gemini call in the lead generation task is missing
prior client configuration, so the model is instantiated and used without
setting the API key. Update the lead generation flow in tasks.py to configure
genai with GEMINI_API_KEY before creating the GenerativeModel and calling
generate_content, reusing the same setup pattern used in campaigns/ai.py so the
job can authenticate successfully.
- Around line 199-200: The inline comment above the model instantiation is stale
because it names gemini-2.0-flash while the Lead task logic in the model setup
uses gemini-2.5-flash. Update that comment near the GenerativeModel
initialization in the task flow to match the actual model being created so the
documentation stays consistent with the code.
In `@frontend/leads.html`:
- Around line 1002-1003: The error display paths in leads.html are inserting
dynamic messages into innerHTML without escaping, which can allow unsafe HTML
from err.message and data.error_message. Update the affected error handling
blocks in the same way job.query is already escaped: sanitize the dynamic text
before assigning it to statusEl.innerHTML, and apply the same fix consistently
in the other referenced error branches within the leads page.
- Line 569: The HTML in the modal section contains an extra unmatched closing
div, which breaks the DOM structure. Remove the stray closing tag in the leads
template after the modal wrappers are already closed, and verify the surrounding
modal markup (such as the modal container and dialog/content wrappers) remains
balanced with a single matching opener for each closer.
- Around line 1075-1082: The polling loop in the scrape status handler keeps
running on non-OK responses and repeated fetch exceptions, which leaves the
submit/close controls disabled indefinitely. Update the setInterval callback
around fetchWithAuth and scrapePollInterval so failures are surfaced to the UI
and polling is stopped or cleaned up after an error threshold or terminal
failure state. Make the same fix in the related retry/error handling block
referenced by the other comment, and ensure the jobId-based status polling path
clears the interval and re-enables controls when a failure is detected.
---
Nitpick comments:
In `@backend/leads/tasks.py`:
- Around line 236-257: Validate the email before calling Lead.objects.create in
the lead task loop, since this path currently trusts raw input and bypasses
model field validation. In the lead-creation logic inside the task that iterates
over leads_data, mirror the CSV import behavior by using validate_email and
skipping any address that raises ValidationError. Make sure the helper imports
are added where needed and keep the existing organization/email deduplication
and custom_variables handling intact.
- Around line 236-257: The current per-item dedup and insert in the lead
creation loop causes unnecessary database round-trips. Update the logic in the
task that processes leads_data to prefetch existing emails for the organization
once, filter duplicates in memory, and then use bulk_create with
ignore_conflicts=True on the Lead model so the organization/email unique
constraint handles races efficiently. Keep the existing field mapping and
custom_variables default, and preserve the leads_created counting based on the
records actually queued for insertion.
- Around line 269-276: The fallback update block in the LeadScrapeJob failure
handling is swallowing a secondary exception, which can leave the job stuck in
RUNNING with no visibility. In the task flow around LeadScrapeJob.objects.get,
job.save(), and the FAILED status assignment, catch the exception from the
failed status update and log it with a clear error message instead of using a
bare pass; include the exception details so the failure is diagnosable while
preserving the existing job error handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 3fbfb522-0891-4f27-a681-7b56310c1b0e
📒 Files selected for processing (6)
backend/leads/migrations/0006_leadscrapejob.pybackend/leads/models.pybackend/leads/serializers.pybackend/leads/tasks.pybackend/leads/views.pyfrontend/leads.html
| # 2. Configure the Gemini model | ||
| # Using gemini-2.0-flash as specified in the issue architecture requirements | ||
| model = genai.GenerativeModel('gemini-2.5-flash') | ||
|
|
||
| prompt = f""" | ||
| You are an advanced automated B2B lead generation assistant. | ||
| Generate exactly {limit} highly realistic business or professional leads matching the prospecting query: "{query}". | ||
|
|
||
| Return the response strictly as a JSON array containing objects with the following keys: | ||
| - first_name (string or null) | ||
| - last_name (string or null) | ||
| - email (string, must be a valid email structure) | ||
| - company (string or null) | ||
| - phone (string, include country code if possible, or null) | ||
| - linkedin_url (string, valid LinkedIn URL format, or null) | ||
|
|
||
| Do not wrap the response in markdown code blocks like ```json ... ```. Output raw JSON only. | ||
| """ | ||
|
|
||
| # 3. Call the API | ||
| response = model.generate_content(prompt) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify genai.configure is invoked somewhere in the backend
rg -nP 'genai\.configure|configure\s*\(\s*api_key' backend
# Also check for a settings/env key feeding it
rg -nP 'GEMINI|GOOGLE_API_KEY|GENAI' backend -g '!*.lock'Repository: Kuldeeep18/LeadOrbit
Length of output: 775
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the task file around the relevant lines, plus the campaign AI helper
# that already configures Gemini, to see whether leads/tasks.py shares that setup.
for f in backend/leads/tasks.py backend/campaigns/ai.py backend/backend/settings.py; do
echo "===== $f ====="
wc -l "$f"
done
echo "===== backend/leads/tasks.py (imports + relevant block) ====="
sed -n '1,260p' backend/leads/tasks.py
echo "===== backend/campaigns/ai.py (Gemini setup) ====="
sed -n '1,340p' backend/campaigns/ai.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 20057
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all direct Gemini model calls and configuration calls in backend.
rg -n "GenerativeModel\(|genai\.configure\(|GEMINI_API_KEY|GOOGLE_API_KEY|GENAI" backend -g '!*.lock'Repository: Kuldeeep18/LeadOrbit
Length of output: 946
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether backend/leads/tasks.py imports or reuses the campaign AI module.
rg -n "from .*campaigns\.ai|import .*campaigns\.ai|genai|google\.generativeai" backend/leads/tasks.py backend/campaigns -g '!*.lock'Repository: Kuldeeep18/LeadOrbit
Length of output: 612
Configure Gemini before instantiating the model
backend/leads/tasks.py creates genai.GenerativeModel(...) and calls generate_content(...) without a prior genai.configure(api_key=...). Reuse the GEMINI_API_KEY setup from backend/campaigns/ai.py, or the scrape job will fail on auth and end up in FAILED.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/tasks.py` around lines 198 - 218, The Gemini call in the lead
generation task is missing prior client configuration, so the model is
instantiated and used without setting the API key. Update the lead generation
flow in tasks.py to configure genai with GEMINI_API_KEY before creating the
GenerativeModel and calling generate_content, reusing the same setup pattern
used in campaigns/ai.py so the job can authenticate successfully.
| # Using gemini-2.0-flash as specified in the issue architecture requirements | ||
| model = genai.GenerativeModel('gemini-2.5-flash') |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Stale comment: says gemini-2.0-flash, code uses gemini-2.5-flash.
Update the comment to match the model actually instantiated to avoid confusion.
✏️ Fix comment
- # 2. Configure the Gemini model
- # Using gemini-2.0-flash as specified in the issue architecture requirements
+ # 2. Configure the Gemini model
+ # Using gemini-2.5-flash per the issue architecture requirements
model = genai.GenerativeModel('gemini-2.5-flash')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/tasks.py` around lines 199 - 200, The inline comment above the
model instantiation is stale because it names gemini-2.0-flash while the Lead
task logic in the model setup uses gemini-2.5-flash. Update that comment near
the GenerativeModel initialization in the task flow to match the actual model
being created so the documentation stays consistent with the code.
| </div> | ||
| </div> | ||
| </div> | ||
| </div> |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Remove the unmatched closing </div>.
Line 569 has no matching opener after the modal wrappers are already closed on Lines 566-568, which can corrupt the DOM structure.
Proposed fix
-</div>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| </div> |
🧰 Tools
🪛 HTMLHint (1.9.2)
[error] 569-569: Tag must be paired, no start tag: [ ]
(tag-pair)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` at line 569, The HTML in the modal section contains an
extra unmatched closing div, which breaks the DOM structure. Remove the stray
closing tag in the leads template after the modal wrappers are already closed,
and verify the surrounding modal markup (such as the modal container and
dialog/content wrappers) remains balanced with a single matching opener for each
closer.
Source: Linters/SAST tools
| } catch (err) { | ||
| statusEl.innerHTML = `<span class="text-danger">Error: ${err.message}</span>`; |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Escape dynamic error text before assigning innerHTML.
err.message and data.error_message are inserted raw; use the same escaping pattern already used for job.query.
Proposed fix
- statusEl.innerHTML = `<span class="text-danger">Error: ${err.message}</span>`;
+ statusEl.innerHTML = `<span class="text-danger">Error: ${escapeHtml(err.message)}</span>`;
- document.getElementById('liveStatusText').innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>Deployment Aborted: ${err.message}`;
+ document.getElementById('liveStatusText').innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>Deployment Aborted: ${escapeHtml(err.message)}`;
- statusText.innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Sandbox timeout'}`;
+ statusText.innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>Error: ${escapeHtml(data.error_message || 'Sandbox timeout')}`;
- tbody.innerHTML = `<tr><td colspan="4" class="text-center text-danger py-4 border-0" style="background-color: `#090d16` !important;">Error fetching tracking logs: ${err.message}</td></tr>`;
+ tbody.innerHTML = `<tr><td colspan="4" class="text-center text-danger py-4 border-0" style="background-color: `#090d16` !important;">Error fetching tracking logs: ${escapeHtml(err.message)}</td></tr>`;Also applies to: 1054-1058, 1103-1107, 1164-1166
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 1002-1002: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusEl.innerHTML = <span class="text-danger">Error: ${err.message}</span>
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').
(inner-outer-html)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` around lines 1002 - 1003, The error display paths in
leads.html are inserting dynamic messages into innerHTML without escaping, which
can allow unsafe HTML from err.message and data.error_message. Update the
affected error handling blocks in the same way job.query is already escaped:
sanitize the dynamic text before assigning it to statusEl.innerHTML, and apply
the same fix consistently in the other referenced error branches within the
leads page.
Source: Linters/SAST tools
| scrapePollInterval = setInterval(async () => { | ||
| try { | ||
| const res = await fetchWithAuth('/leads/import_csv/', { | ||
| method: 'POST', | ||
| body: formData, | ||
| }); | ||
| const response = await fetchWithAuth(`/leads/scrape/${jobId}/status/`); | ||
|
|
||
| if (!response.ok) { | ||
| console.warn(`Connection status error: ${response.status}`); | ||
| return; | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Surface polling failures instead of looping forever.
Non-OK status responses and repeated fetch exceptions only log and continue polling, leaving the submit/close controls disabled indefinitely.
Proposed fix
- if (!response.ok) {
- console.warn(`Connection status error: ${response.status}`);
- return;
- }
+ if (!response.ok) {
+ throw new Error(`Status request failed (${response.status})`);
+ }
...
- } catch (err) {
- console.error('Polling metrics exception:', err);
+ } catch (err) {
+ console.error('Polling metrics exception:', err);
+ clearInterval(scrapePollInterval);
+ scrapePollInterval = null;
+ statusText.innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>${escapeHtml(err.message || 'Lost connection to scrape status service.')}`;
+ bar.style.width = '100%';
+ bar.className = 'progress-bar bg-danger';
+ btn.disabled = false;
+ closeBtn.disabled = false;
+ btn.innerHTML = `<i class="bi bi-cpu me-1"></i> Re-Deploy AI Browser Agent`;
}Also applies to: 1113-1115
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` around lines 1075 - 1082, The polling loop in the scrape
status handler keeps running on non-OK responses and repeated fetch exceptions,
which leaves the submit/close controls disabled indefinitely. Update the
setInterval callback around fetchWithAuth and scrapePollInterval so failures are
surfaced to the UI and polling is stopped or cleaned up after an error threshold
or terminal failure state. Make the same fix in the related retry/error handling
block referenced by the other comment, and ensure the jobId-based status polling
path clears the interval and re-enables controls when a failure is detected.
|
Hi @KhushiMulchandani 👋 LeadOrbit Bot here 🤖 We noticed you've opened a Pull Request but haven't starred the repository yet. ⭐ Starring the repository is mandatory for PR review and merge. Please:
Once you've done that, the bot will continue processing your PR. Note: PRs from contributors who haven't starred the repository will remain pending until this requirement is completed. Thanks for contributing to LeadOrbit! 🚀 |
|
@Kuldeeep18 Done! |
PR #433 follow up: Comprehensive Implementation of AI-Driven Lead Generation Agent & Tracking System ( Closes #53)
Overview
This update transitions the asynchronous background scraping worker from hardcoded mock data pools (Austin/Miami mock profiles) to a fully functional, production-ready extraction agent powered directly by Google Generative AI. It addresses all core architectural issues highlighted during the initial PR review, aligns system database schema rules, fixes endpoint namespace routing glitches, adds a complete background scrape execution history tracking log framework, and color-matches the responsive dashboard components.
Core Issues Resolved & Implementation Details
1. Transition to Live AI Generation Engine (Gemini 2.5 Pro Override)
time.sleep()delays and static dictionaries withinbackend/leads/tasks.py. The agent now makes live API requests to dynamically extract leads based on user query inputs.gemini-2.0-flash, live deployment testing encountered heavy concurrent throughput constraints and recurring429 Too Many Requestsstatus triggers. To bypass these strict rate-limiting caps, the system was upgraded togemini-2.5-pro. This provides higher tokens-per-minute constraints, sturdier structural exception-handling blocks, and more stable parsing layout validation for localized geographical search targets (such as the test case: "colleges in Ahmedabad").2. Fully Integrated Scrape History Logging System
LeadScrapeJobdatabase model. Every time an extraction agent is deployed, a tracking entry is initialized with statuses matchingPENDING,RUNNING,COMPLETED, orFAILED.3. Celery Concurrency Execution Adjustment
This configuration forces safe, sequentially consistent task execution routines across async thread steps without throwing transaction deadlocks.
4. Database Integrity & Routing Restorations
Schema Alignments: Generated and packed the structural data tracking migration block (
0006_leadscrapejob.py). Rebased the branch directly over the latest upstreammainto resolve database generation errors relating to missing default values for the newly introducedcustom_variablescolumn block.URL Routing & Scope Normalization: Patched the
LeadViewSetrouting configuration with specific view tracking parameters (detail=False,url_path='scrape_history'). Adjusted the client-side authentication utility function (fetchWithAuth) parameter strings to eliminate double-prefix mapping anomalies (/api/v1/api/v1/...) that were causing404 Not FoundAPI error logs in terminal streams.5. Interface Uniformity Optimization (Dark Mode / EdTech Look)
Aesthetic Refinement: Completely stripped out the default unmatching Bootstrap royal blue theme variables. Custom styled components lock navigation elements to
#0b1329(matching your sidebar paneling layouts) and dynamic buttons to a vibrant Teal hue (#14b8a6).📊 Verification Summary
gemini-2.5Structured GenerationLeadScrapeJobtracking database integration0006_leadscrapejobPacked & Migratedcustom_variables(Crashed on Save)404 Not FoundNamespace ClashesLink to ScreenRecording for easy PR review and proof of updates
https://drive.google.com/file/d/1ds8NyEFxoKVUmisCqftOtsca51KCauJc/view?usp=drive_link
Summary by CodeRabbit