Fix UnboundLocalError in spider() when crawler encounters exception #16
MichaelMVS wants to merge 2 commits into americo:main
Conversation
📝 Walkthrough: Link parsing and crawl control were reworked. LinkParser now requires a baseUrl; getLinks returns (content_type, links) and robustly handles network/encoding errors; spider uses a FIFO queue, tracks visited URLs, and returns the accumulated links only after the crawl completes. Module-level side effects were removed.
Sequence Diagram(s)

sequenceDiagram
participant Spider
participant LinkParser
participant Network
participant HTMLParser
Spider->>LinkParser: instantiate(baseUrl)
loop crawl pages (FIFO)
Spider->>LinkParser: getLinks(url)
LinkParser->>Network: fetch URL (urllib)
alt network error
Network-->>LinkParser: URLError / HTTPError
LinkParser-->>Spider: (content_type="", links=[])
else success
Network-->>LinkParser: response (content_type, bytes)
LinkParser->>HTMLParser: decode bytes (utf-8, errors="replace") & parse tags
HTMLParser-->>LinkParser: extracted hrefs
LinkParser-->>Spider: (content_type, filtered_links)
end
Spider->>Spider: mark url visited, enqueue new unseen links (FIFO)
end
Spider-->>Caller: return aggregated links
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
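The reworked crawl flow in the walkthrough above can be sketched as a minimal, self-contained script. This is an illustrative approximation, not the project's actual code: `get_links_stub` stands in for `LinkParser.getLinks` and returns a hard-coded link graph so the example runs offline.

```python
# Hedged sketch of the reworked spider() flow: FIFO queue, visited set,
# and links returned only after the crawl completes.
from collections import deque

def get_links_stub(url):
    """Stand-in for LinkParser.getLinks: returns (content_type, links)."""
    graph = {
        "http://a": ["http://b", "http://c"],
        "http://b": ["http://c"],
        "http://c": [],
    }
    return ("text/html", graph.get(url, []))

def spider(url, max_pages):
    pages_to_visit = deque([url])   # FIFO queue of URLs to crawl
    visited = set()                 # URLs already crawled
    all_links = []                  # accumulated unique discoveries
    while pages_to_visit and len(visited) < max_pages:
        current = pages_to_visit.popleft()
        if current in visited:
            continue
        visited.add(current)
        _, links = get_links_stub(current)
        for link in links:
            if link not in visited and link not in all_links:
                pages_to_visit.append(link)
                all_links.append(link)
    return all_links  # returned only after the whole crawl finishes

result = spider("http://a", 10)  # → ["http://b", "http://c"]
```

Note how the FIFO `popleft()` gives breadth-first order, and the duplicate check against both `visited` and `all_links` keeps the result unique.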
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Pull request overview
Fixes a crash in core/crawler.py where spider() could raise UnboundLocalError if link extraction fails before links is assigned, and removes an ad-hoc debug invocation.
Changes:
- Move `return links` outside the crawl loop to avoid returning mid-iteration and to prevent `links` being unbound.
- Remove the unused local (`foundWord`) and the module-level debug `print(spider(...))`.
core/crawler.py (outdated)

```python
links = []
pagesToVisit = [url]
numberVisited = 0
while numberVisited < maxPages and pagesToVisit:
    numberVisited += 1
    url = pagesToVisit.pop()
```
core/crawler.py (outdated)

```python
        parser = LinkParser()
        data, links = parser.getLinks(url)
    except:
        pass
    return links
```
core/crawler.py (outdated)

```python
    try:
        parser = LinkParser()
        data, links = parser.getLinks(url)
    except:
        pass
```
core/crawler.py (outdated)

```python
        response = urlopen(url)
        if response.getheader('Content-Type') == 'text/html':
            htmlBytes = response.read()
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []
```
core/crawler.py (outdated)

```python
def spider(url, maxPages):
    """Web crawler that visits pages and collects links.

    Args:
        url (str): Starting URL to begin crawl
        maxPages (int): Maximum number of pages to visit

    Returns:
        list: All unique links discovered during crawl
    """
```
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/crawler.py`:
- Around line 56-67: The spider() loop currently overwrites links each iteration
and never enqueues discovered URLs, so change it to accumulate and crawl beyond
the seed: keep a visited set (e.g., visited_urls) and a queue/pagesToVisit list,
when parser.getLinks(url) returns data and found_links append found_links to a
global all_links list (or extend a set) and for each found link, if not in
visited_urls and not in pagesToVisit then append to pagesToVisit; mark the
current url in visited_urls and pop from pagesToVisit until numberVisited hits
maxPages; also avoid bare except by catching specific exceptions around
parser.getLinks(url) so failures don't stop enqueuing. Ensure references to
pagesToVisit, numberVisited, links (rename to found_links or all_links) and
parser.getLinks are updated accordingly.
- Around line 37-43: The Content-Type equality check is too strict and drops
valid HTML like "text/html; charset=utf-8"; update the conditional in the
fetch/parse routine to robustly detect HTML by reading
response.getheader('Content-Type') (handle None), lowercasing it, and checking
startswith('text/html') or 'text/html' in the header before proceeding to
read/decode into htmlBytes/htmlString and calling self.feed(htmlString) to
populate self.links; otherwise return the empty result as before.
- Around line 65-66: Replace the bare "except: pass" in core/crawler.py with
scoped exception handling: explicitly re-raise system-exiting exceptions
(KeyboardInterrupt, SystemExit) and catch Exception as e for all other errors,
then log the failure with context (e.g., URL or function name) using
logger.exception or logger.error; for example, change the bare except block to
"except (KeyboardInterrupt, SystemExit): raise" followed by "except Exception as
e: logger.exception('Failed to crawl %s', url, exc_info=e)" so you don't
silently swallow errors and you retain useful debug information.
- Line 36: In getLinks(), validate the URL scheme before calling urlopen by
using parse.urlparse(url).scheme and only proceed for "http" or "https"
(skip/log and return empty links for others), and add a timeout to the urlopen
call (e.g., urlopen(url, timeout=10) or use a configurable TIMEOUT constant) to
avoid hanging on slow/unresponsive hosts; ensure you handle/propagate the same
exceptions that getLinks currently expects when adding the timeout.
core/crawler.py (outdated)

```python
    self.links = []
    self.baseUrl = url
    response = urlopen(url)
```
🧩 Analysis chain
🏁 Scripts executed (repository: americo/sqlifinder):

```shell
find . -name "crawler.py" -type f | head -20   # length of output: 79
head -50 ./core/crawler.py                     # length of output: 1571
wc -l ./core/crawler.py                        # length of output: 82
cat -n ./core/crawler.py                       # length of output: 2446
grep -n "timeout" ./core/crawler.py            # length of output: 44
```
Add URL scheme validation and timeout to urlopen() call.
The getLinks() method opens URLs without validating the scheme, which could allow unsafe non-HTTP protocols. Additionally, missing timeout can hang the crawler on slow or unresponsive servers. The parse module is already imported.
Proposed fix
```diff
 self.links = []
 self.baseUrl = url
-response = urlopen(url)
+parsed = parse.urlparse(url)
+if parsed.scheme not in {"http", "https"}:
+    return "", []
+response = urlopen(url, timeout=10)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
self.links = []
self.baseUrl = url
parsed = parse.urlparse(url)
if parsed.scheme not in {"http", "https"}:
    return "", []
response = urlopen(url, timeout=10)
```
🧰 Tools
🪛 Ruff (0.15.9)
[error] 36-36: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/crawler.py` at line 36, In getLinks(), validate the URL scheme before
calling urlopen by using parse.urlparse(url).scheme and only proceed for "http"
or "https" (skip/log and return empty links for others), and add a timeout to
the urlopen call (e.g., urlopen(url, timeout=10) or use a configurable TIMEOUT
constant) to avoid hanging on slow/unresponsive hosts; ensure you
handle/propagate the same exceptions that getLinks currently expects when adding
the timeout.
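The scheme-validation half of the suggestion above can be checked in isolation. This sketch uses only `urllib.parse.urlparse` and never calls `urlopen`, so it runs offline; the function name `allowed_scheme` is illustrative.

```python
# Sketch of the URL-scheme allowlist suggested above: only http/https
# should reach urlopen(); anything else (file:, ftp:, custom) is rejected.
from urllib.parse import urlparse

def allowed_scheme(url):
    return urlparse(url).scheme in {"http", "https"}
```

Validating the scheme before `urlopen()` closes off surprises like `file:///etc/passwd` being fetched by the crawler, which is what the Ruff S310 finding warns about.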
core/crawler.py (outdated)

```python
    except:
        pass
```
🧩 Analysis chain
🏁 Script executed (repository: americo/sqlifinder):

```shell
cat -n core/crawler.py | head -80   # length of output: 2446
```
Replace bare except + pass with scoped exception handling and logging.
This bare except suppresses all exceptions—including system exceptions like KeyboardInterrupt—and silently hides failures during crawling, making the crawler impossible to debug when URLs fail. Any network error, parsing error, or other exception simply vanishes with no trace.
Proposed fix

```diff
+import logging
@@
-        except:
-            pass
+        except Exception as exc:
+            logging.debug("Failed to crawl %s: %s", url, exc)
+            continue
```

📝 Committable suggestion
```python
        except Exception as exc:
            logging.debug("Failed to crawl %s: %s", url, exc)
            continue
```
🧰 Tools
🪛 Ruff (0.15.9)
[error] 65-65: Do not use bare except
(E722)
[error] 65-66: try-except-pass detected, consider logging the exception
(S110)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/crawler.py` around lines 65 - 66, Replace the bare "except: pass" in
core/crawler.py with scoped exception handling: explicitly re-raise
system-exiting exceptions (KeyboardInterrupt, SystemExit) and catch Exception as
e for all other errors, then log the failure with context (e.g., URL or function
name) using logger.exception or logger.error; for example, change the bare
except block to "except (KeyboardInterrupt, SystemExit): raise" followed by
"except Exception as e: logger.exception('Failed to crawl %s', url, exc_info=e)"
so you don't silently swallow errors and you retain useful debug information.
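The exception-handling pattern described in the prompt above can be demonstrated with a runnable stub. `crawl_one` and `_boom` are hypothetical names for illustration; the real code wraps `parser.getLinks(url)` rather than a passed-in callable.

```python
# Sketch of scoped exception handling: re-raise system-exiting
# exceptions, log everything else with context instead of swallowing it.
import logging

logger = logging.getLogger(__name__)

def crawl_one(fetch, url):
    try:
        return fetch(url)
    except (KeyboardInterrupt, SystemExit):
        raise  # never swallow Ctrl-C or interpreter shutdown
    except Exception:
        logger.exception("Failed to crawl %s", url)
        return None  # crawl continues; failure is recorded, not hidden

def _boom(url):
    raise ValueError("simulated parse failure")

result = crawl_one(_boom, "http://example.com")  # logs and returns None
```

The key difference from a bare `except: pass` is that `KeyboardInterrupt` still stops the crawler, and every failed URL leaves a traceback in the logs.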
…ptions

Issues addressed (per PR feedback):

1. spider() now accumulates all discovered links across pages instead of only returning the last page's links. A visited set prevents re-queuing.
2. Content-Type check uses .startswith() to handle values like 'text/html; charset=utf-8'.
3. Bare 'except:' replaced with specific (URLError, HTTPError, ValueError) to avoid suppressing KeyboardInterrupt/SystemExit.
4. Added LinkParser.__init__ to accept baseUrl, fixing the original UnboundLocalError on 'self.baseUrl' referenced before assignment.
5. spider() now pushes newly discovered links onto pagesToVisit so the loop actually traverses up to maxPages pages.
♻️ Duplicate comments (2)
core/crawler.py (2)
Lines 72-78: ⚠️ Potential issue | 🟡 Minor

`numberVisited` is incremented before the duplicate check. When a duplicate URL is popped from `pagesToVisit` (see the related comment about duplicates), `numberVisited` is incremented before the `if url in visited: continue` check. This wastes crawl budget on skipped duplicates. Move the increment after the visited check:

Proposed fix

```diff
 while numberVisited < maxPages and pagesToVisit:
-    numberVisited += 1
     url = pagesToVisit.pop(0)
     if url in visited:
         continue
     visited.add(url)
+    numberVisited += 1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/crawler.py` around lines 72 - 78, In core/crawler.py inside the main crawl loop (the while loop that uses numberVisited, pagesToVisit and visited), move the increment of numberVisited so it happens only after confirming the popped URL is not a duplicate: pop url from pagesToVisit, check "if url in visited: continue" (and only then add to visited), and only after that increment numberVisited; update any related logic that assumes the increment location (e.g., any logging or loop-exit checks that reference numberVisited) so the crawl budget isn't consumed by duplicates.
Lines 85-88: ⚠️ Potential issue | 🟠 Major

Duplicate URLs can be added to `all_links` and `pagesToVisit`. The check `if link not in visited` is insufficient: if the same link appears on multiple pages before being crawled, it is added multiple times because it hasn't entered `visited` yet.

Consequences:
- `all_links` contains duplicates (violates the docstring: "All unique URLs")
- Duplicates in `pagesToVisit` waste `numberVisited` counts when popped and skipped

Proposed fix using a set for discovered links

```diff
-    all_links: list[str] = []
+    all_links: set[str] = set()
     numberVisited = 0
     while numberVisited < maxPages and pagesToVisit:
@@
         for link in links:
-            if link not in visited:
+            if link not in visited and link not in all_links:
                 pagesToVisit.append(link)
-                all_links.append(link)
+                all_links.add(link)
-    return all_links
+    return list(all_links)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/crawler.py` around lines 85 - 88, The loop that appends links uses only the visited set so duplicate links discovered before being visited can be added multiple times to pagesToVisit and all_links; introduce a new set (e.g., discovered or discovered_links) initialized from visited and update it whenever you append a link: check "if link not in discovered" before appending to pagesToVisit and all_links, add the link to discovered immediately after appending, and keep using the existing visited set when a page is actually crawled (this ensures pagesToVisit and all_links contain unique URLs while preserving the current visited logic).
🧹 Nitpick comments (2)
core/crawler.py (2)
Lines 80-83: Exception handling in `spider()` is redundant.

`getLinks()` already catches `URLError` and `HTTPError` internally (lines 44-45) and returns `("", [])`, so these exceptions never propagate to line 82. Only `ValueError` could potentially be caught here (e.g., from `HTMLParser.feed()`). Consider simplifying to only catch what can actually propagate:

Proposed fix

```diff
     try:
         _, links = LinkParser(url).getLinks(url)
-    except (URLError, HTTPError, ValueError):
+    except ValueError:
         continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/crawler.py` around lines 80 - 83, In spider(), the except block is catching URLError and HTTPError which never propagate because LinkParser.getLinks (called as LinkParser(url).getLinks(url)) already handles them and returns ("", []); update the error handling to only catch exceptions that can actually bubble up (e.g., ValueError from HTMLParser.feed()) by changing the except clause to "except ValueError:" (or remove the try/except if you prefer to let ValueError propagate), ensuring you reference the LinkParser.getLinks call so the change is applied to the correct spot.
Lines 67-74: Consider using `collections.deque` for an efficient FIFO queue.

`list.pop(0)` is O(n) because all remaining elements must shift; with `deque.popleft()` this becomes O(1).

Proposed fix

```diff
+from collections import deque
@@
-    pagesToVisit: list[str] = [url]
+    pagesToVisit: deque[str] = deque([url])
@@
-    url = pagesToVisit.pop(0)
+    url = pagesToVisit.popleft()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/crawler.py` around lines 67 - 74, The current FIFO queue uses a list named pagesToVisit and calls pagesToVisit.pop(0) inside the while loop (in the loop controlling numberVisited), which is O(n); replace pagesToVisit with a collections.deque (import collections or from collections import deque), initialize pagesToVisit = deque([url]) and change the pop to pagesToVisit.popleft() and any list appends to use pagesToVisit.append(...) so queue operations become O(1); update any type hints/comments accordingly (e.g., pagesToVisit: deque[str]) and ensure imports are added at top of core/crawler.py.
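As a quick runnable check of the claim above: `deque.popleft()` and `list.pop(0)` yield the same FIFO order, so swapping them is behavior-preserving; only the per-pop cost changes from O(n) to O(1).

```python
# Demonstration that list.pop(0) and deque.popleft() produce identical
# FIFO order; the deque version just avoids shifting remaining elements.
from collections import deque

urls = ["a", "b", "c"]
lst = list(urls)
dq = deque(urls)

list_order = [lst.pop(0) for _ in range(len(urls))]     # O(n) per pop
deque_order = [dq.popleft() for _ in range(len(urls))]  # O(1) per pop
```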
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@core/crawler.py`:
- Around line 72-78: In core/crawler.py inside the main crawl loop (the while
loop that uses numberVisited, pagesToVisit and visited), move the increment of
numberVisited so it happens only after confirming the popped URL is not a
duplicate: pop url from pagesToVisit, check "if url in visited: continue" (and
only then add to visited), and only after that increment numberVisited; update
any related logic that assumes the increment location (e.g., any logging or
loop-exit checks that reference numberVisited) so the crawl budget isn't
consumed by duplicates.
- Around line 85-88: The loop that appends links uses only the visited set so
duplicate links discovered before being visited can be added multiple times to
pagesToVisit and all_links; introduce a new set (e.g., discovered or
discovered_links) initialized from visited and update it whenever you append a
link: check "if link not in discovered" before appending to pagesToVisit and
all_links, add the link to discovered immediately after appending, and keep
using the existing visited set when a page is actually crawled (this ensures
pagesToVisit and all_links contain unique URLs while preserving the current
visited logic).
---
Nitpick comments:
In `@core/crawler.py`:
- Around line 80-83: In spider(), the except block is catching URLError and
HTTPError which never propagate because LinkParser.getLinks (called as
LinkParser(url).getLinks(url)) already handles them and returns ("", []); update
the error handling to only catch exceptions that can actually bubble up (e.g.,
ValueError from HTMLParser.feed()) by changing the except clause to "except
ValueError:" (or remove the try/except if you prefer to let ValueError
propagate), ensuring you reference the LinkParser.getLinks call so the change is
applied to the correct spot.
- Around line 67-74: The current FIFO queue uses a list named pagesToVisit and
calls pagesToVisit.pop(0) inside the while loop (in the loop controlling
numberVisited), which is O(n); replace pagesToVisit with a collections.deque
(import collections or from collections import deque), initialize pagesToVisit =
deque([url]) and change the pop to pagesToVisit.popleft() and any list appends
to use pagesToVisit.append(...) so queue operations become O(1); update any type
hints/comments accordingly (e.g., pagesToVisit: deque[str]) and ensure imports
are added at top of core/crawler.py.
Summary
Fixed an `UnboundLocalError` bug in `core/crawler.py` that occurs when the `spider()` function encounters an exception before the `while` loop completes.

The Bug

The original code had the `return links` statement INSIDE the `while` loop. This caused:
- the function to return mid-iteration, so only the first page's links were collected; and
- if an exception occurred during the `parser.getLinks()` call, `links` was never assigned, causing `UnboundLocalError`.

This matches the reported issue #6.
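The failure mode described above can be reproduced with a few lines. This is a hypothetical, simplified stand-in for the real `spider()`: the `raise` simulates `parser.getLinks()` failing before `links` is ever assigned, while the assignment statement still makes `links` a local variable.

```python
# Minimal reproduction of the reported bug: if the try block fails
# before `links` is assigned, returning `links` raises UnboundLocalError
# (the assignment below makes `links` local even though it never runs).
def buggy_spider(url):
    try:
        raise OSError("simulated network failure")  # getLinks() stand-in
        links = [url]                               # never executed
    except OSError:
        pass
    return links  # UnboundLocalError: 'links' referenced before assignment

caught = False
try:
    buggy_spider("http://example.com")
except UnboundLocalError:
    caught = True
```

Initializing `links = []` before the loop (as the fixed code does) and returning after the loop removes both the crash and the early return.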
Fix

Moved `return links` outside the `while` loop so all pages in `pagesToVisit` are processed, and the function properly returns an empty list if no links were found. Also removed the debug `print(spider(...))` statement at the end of the file.

Summary by CodeRabbit
Bug Fixes
Improvements