
Fix UnboundLocalError in spider() when crawler encounters exception#16

Open
MichaelMVS wants to merge 2 commits into americo:main from MichaelMVS:fix-spider-unboundlocalerror

Conversation


@MichaelMVS MichaelMVS commented Apr 10, 2026

Summary

Fixed an UnboundLocalError bug in core/crawler.py that occurs when the spider() function encounters an exception before links is assigned inside the while loop.

The Bug

The original code had the return links statement INSIDE the while loop:

while numberVisited < maxPages and pagesToVisit:
    numberVisited += 1
    url = pagesToVisit.pop()
    try:
        parser = LinkParser()
        data, links = parser.getLinks(url)
    except:
        pass
    return links  # BUG: inside while loop, returns on first iteration!

This caused:

  1. The function returned after processing only the first URL.
  2. If an exception occurred before any successful parser.getLinks() call, links was never assigned, raising UnboundLocalError.

This matches the reported issue #6.

Fix

Moved return links outside the while loop so all pages in pagesToVisit are processed, and the function properly returns an empty list if no links were found.
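The corrected control flow can be sketched as follows (a minimal sketch rather than the committed code; the fetch parameter is a hypothetical stand-in for LinkParser().getLinks() so the example is self-contained and network-free):

```python
def spider(url, maxPages, fetch):
    """Visit up to maxPages pages, collecting every link found.

    fetch(url) -> list[str] is a hypothetical injectable fetcher,
    standing in for LinkParser().getLinks() from the real module.
    """
    links = []
    pagesToVisit = [url]
    numberVisited = 0
    while numberVisited < maxPages and pagesToVisit:
        numberVisited += 1
        url = pagesToVisit.pop()
        try:
            found = fetch(url)
        except Exception:
            continue  # a failed page no longer aborts the crawl
        links.extend(found)
        pagesToVisit.extend(found)
    return links  # outside the loop: links is [] if nothing was found
```

With the return dedented and links initialized up front, a crawl whose every fetch fails returns [] instead of raising UnboundLocalError.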

Also removed the debug print(spider(...)) statement at the end of the file.

Summary by CodeRabbit

  • Bug Fixes

    • Crawler no longer stops prematurely and returns complete results after finishing the crawl.
    • Skips already-visited pages to avoid duplicates.
  • Improvements

    • More robust link parsing and network handling, reducing errors on malformed or non-HTML responses.
    • Crawl order changed to breadth-first (FIFO) for more predictable traversal.
    • Removed stray console output for cleaner runtime behavior.
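The traversal-order change in that last group comes down to which end of the list pop removes from; a quick illustration:

```python
# pop() removes from the end of the list (LIFO, the old behaviour),
# while pop(0) removes from the front (FIFO, the new breadth-first order).
pages = ["a", "b", "c"]

stack, lifo_order = list(pages), []
while stack:
    lifo_order.append(stack.pop())   # visits "c", then "b", then "a"

queue, fifo_order = list(pages), []
while queue:
    fifo_order.append(queue.pop(0))  # visits "a", then "b", then "c"
```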

Copilot AI review requested due to automatic review settings April 10, 2026 12:42
@coderabbitai

coderabbitai bot commented Apr 10, 2026

📝 Walkthrough

Walkthrough

Link parsing and crawl control were reworked: LinkParser now requires a baseUrl, getLinks returns (content_type, links) and robustly handles network/encoding; spider uses FIFO queue, tracks visited URLs, and returns the accumulated links only after the crawl completes. Module-level side effects removed.

Changes

Cohort / File(s): Crawler core (core/crawler.py)
Summary: Major rewrite of crawler control and parser. LinkParser now requires baseUrl in __init__, stores typed links, and only appends non-None hrefs. getLinks(url: str) -> tuple[str, list[str]] adds explicit URLError/HTTPError handling, decodes with utf-8 and errors="replace", and returns empty results for non-HTML Content-Type prefixes. spider(url: str, maxPages: int) -> list[str] now uses a FIFO queue (pop(0)) and a visited set to skip duplicates, removes the premature return inside the loop, and eliminates the module-level print() and the unused foundWord variable.
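Two of the robustness details in that summary, the Content-Type prefix check and the lossy decode, can be sketched as small helpers (hypothetical names, not the PR's actual functions):

```python
def is_html(content_type):
    """True for any text/html Content-Type, including parameterised
    values like "text/html; charset=utf-8" that a strict '==' check
    would reject; also tolerates a missing (None) header."""
    return (content_type or "").lower().startswith("text/html")

def decode_body(raw_bytes):
    """Decode response bytes leniently: errors="replace" substitutes
    U+FFFD for malformed sequences instead of raising UnicodeDecodeError."""
    return raw_bytes.decode("utf-8", errors="replace")
```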

Sequence Diagram(s)

sequenceDiagram
    participant Spider
    participant LinkParser
    participant Network
    participant HTMLParser

    Spider->>LinkParser: instantiate(baseUrl)
    loop crawl pages (FIFO)
        Spider->>LinkParser: getLinks(url)
        LinkParser->>Network: fetch URL (urllib)
        alt network error
            Network-->>LinkParser: URLError / HTTPError
            LinkParser-->>Spider: (content_type="", links=[])
        else success
            Network-->>LinkParser: response (content_type, bytes)
            LinkParser->>HTMLParser: decode bytes (utf-8, errors="replace") & parse tags
            HTMLParser-->>LinkParser: extracted hrefs
            LinkParser-->>Spider: (content_type, filtered_links)
        end
        Spider->>Spider: mark url visited, enqueue new unseen links (FIFO)
    end
    Spider-->>Caller: return aggregated links

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I hopped through tags and fixed my pace,
No lost loops, no sudden race,
Visited paths now kept in sight,
Queues flow proper, encodings right,
A tidy crawl—how pure delight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main fix: resolving an UnboundLocalError in the spider() function caused by a misplaced return statement inside the while loop.




Copilot AI left a comment


Pull request overview

Fixes a crash in core/crawler.py where spider() could raise UnboundLocalError if link extraction fails before links is assigned, and removes an ad-hoc debug invocation.

Changes:

  • Move return links outside the crawl loop to avoid returning mid-iteration and to prevent links being unbound.
  • Remove unused local (foundWord) and the module-level debug print(spider(...)).


core/crawler.py Outdated
Comment on lines +56 to +61

links = []
pagesToVisit = [url]
numberVisited = 0
while numberVisited < maxPages and pagesToVisit:
    numberVisited += 1
    url = pagesToVisit.pop()
core/crawler.py Outdated
Comment on lines +63 to +67

        parser = LinkParser()
        data, links = parser.getLinks(url)
    except:
        pass
    return links
(no newline at end of file)
core/crawler.py Outdated
Comment on lines +62 to +66

    try:
        parser = LinkParser()
        data, links = parser.getLinks(url)
    except:
        pass
core/crawler.py Outdated
Comment on lines +36 to +43

response = urlopen(url)
if response.getheader('Content-Type')=='text/html':
    htmlBytes = response.read()
    htmlString = htmlBytes.decode("utf-8")
    self.feed(htmlString)
    return htmlString, self.links
else:
    return "",[]
core/crawler.py Outdated
Comment on lines +45 to +54

def spider(url, maxPages):
    """Web crawler that visits pages and collects links.

    Args:
        url (str): Starting URL to begin crawl
        maxPages (int): Maximum number of pages to visit

    Returns:
        list: All unique links discovered during crawl
    """

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/crawler.py`:
- Around line 56-67: The spider() loop currently overwrites links each iteration
and never enqueues discovered URLs, so change it to accumulate and crawl beyond
the seed: keep a visited set (e.g., visited_urls) and a queue/pagesToVisit list,
when parser.getLinks(url) returns data and found_links append found_links to a
global all_links list (or extend a set) and for each found link, if not in
visited_urls and not in pagesToVisit then append to pagesToVisit; mark the
current url in visited_urls and pop from pagesToVisit until numberVisited hits
maxPages; also avoid bare except by catching specific exceptions around
parser.getLinks(url) so failures don't stop enqueuing. Ensure references to
pagesToVisit, numberVisited, links (rename to found_links or all_links) and
parser.getLinks are updated accordingly.
- Around line 37-43: The Content-Type equality check is too strict and drops
valid HTML like "text/html; charset=utf-8"; update the conditional in the
fetch/parse routine to robustly detect HTML by reading
response.getheader('Content-Type') (handle None), lowercasing it, and checking
startswith('text/html') or 'text/html' in the header before proceeding to
read/decode into htmlBytes/htmlString and calling self.feed(htmlString) to
populate self.links; otherwise return the empty result as before.
- Around line 65-66: Replace the bare "except: pass" in core/crawler.py with
scoped exception handling: explicitly re-raise system-exiting exceptions
(KeyboardInterrupt, SystemExit) and catch Exception as e for all other errors,
then log the failure with context (e.g., URL or function name) using
logger.exception or logger.error; for example, change the bare except block to
"except (KeyboardInterrupt, SystemExit): raise" followed by "except Exception as
e: logger.exception('Failed to crawl %s', url, exc_info=e)" so you don't
silently swallow errors and you retain useful debug information.
- Line 36: In getLinks(), validate the URL scheme before calling urlopen by
using parse.urlparse(url).scheme and only proceed for "http" or "https"
(skip/log and return empty links for others), and add a timeout to the urlopen
call (e.g., urlopen(url, timeout=10) or use a configurable TIMEOUT constant) to
avoid hanging on slow/unresponsive hosts; ensure you handle/propagate the same
exceptions that getLinks currently expects when adding the timeout.
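The scheme allow-list from that last point can be isolated into a small helper (hypothetical name, not part of the PR):

```python
from urllib.parse import urlparse

def has_allowed_scheme(url, allowed=("http", "https")):
    """Allow-list check so file://, ftp://, or custom schemes never
    reach urlopen(); scheme-less strings also fail the check."""
    return urlparse(url).scheme in allowed
```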

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3787df60-1883-4d50-95db-0e2816f71e0a

📥 Commits

Reviewing files that changed from the base of the PR and between ef57bf3 and 655fe9a.

📒 Files selected for processing (1)
  • core/crawler.py

core/crawler.py Outdated

self.links = []
self.baseUrl = url
response = urlopen(url)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "crawler.py" -type f | head -20

Repository: americo/sqlifinder

Length of output: 79


🏁 Script executed:

head -50 ./core/crawler.py

Repository: americo/sqlifinder

Length of output: 1571


🏁 Script executed:

wc -l ./core/crawler.py

Repository: americo/sqlifinder

Length of output: 82


🏁 Script executed:

cat -n ./core/crawler.py

Repository: americo/sqlifinder

Length of output: 2446


🏁 Script executed:

grep -n "timeout" ./core/crawler.py

Repository: americo/sqlifinder

Length of output: 44


Add URL scheme validation and timeout to urlopen() call.

The getLinks() method opens URLs without validating the scheme, which could allow unsafe non-HTTP protocols. Additionally, missing timeout can hang the crawler on slow or unresponsive servers. The parse module is already imported.

Proposed fix
     self.links = []
     self.baseUrl = url
-    response = urlopen(url)
+    parsed = parse.urlparse(url)
+    if parsed.scheme not in {"http", "https"}:
+        return "", []
+    response = urlopen(url, timeout=10)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
response = urlopen(url)
self.links = []
self.baseUrl = url
parsed = parse.urlparse(url)
if parsed.scheme not in {"http", "https"}:
    return "", []
response = urlopen(url, timeout=10)
🧰 Tools
🪛 Ruff (0.15.9)

[error] 36-36: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` at line 36, In getLinks(), validate the URL scheme before
calling urlopen by using parse.urlparse(url).scheme and only proceed for "http"
or "https" (skip/log and return empty links for others), and add a timeout to
the urlopen call (e.g., urlopen(url, timeout=10) or use a configurable TIMEOUT
constant) to avoid hanging on slow/unresponsive hosts; ensure you
handle/propagate the same exceptions that getLinks currently expects when adding
the timeout.

core/crawler.py Outdated
Comment on lines +65 to +66
except:
    pass

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n core/crawler.py | head -80

Repository: americo/sqlifinder

Length of output: 2446


Replace bare except + pass with scoped exception handling and logging.

This bare except suppresses all exceptions—including system exceptions like KeyboardInterrupt—and silently hides failures during crawling, making the crawler impossible to debug when URLs fail. Any network error, parsing error, or other exception simply vanishes with no trace.

Proposed fix
+import logging
@@
-        except:
-            pass
+        except Exception as exc:
+            logging.debug("Failed to crawl %s: %s", url, exc)
+            continue
📝 Committable suggestion


Suggested change
except:
    pass
except Exception as exc:
    logging.debug("Failed to crawl %s: %s", url, exc)
    continue
🧰 Tools
🪛 Ruff (0.15.9)

[error] 65-65: Do not use bare except

(E722)


[error] 65-66: try-except-pass detected, consider logging the exception

(S110)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` around lines 65 - 66, Replace the bare "except: pass" in
core/crawler.py with scoped exception handling: explicitly re-raise
system-exiting exceptions (KeyboardInterrupt, SystemExit) and catch Exception as
e for all other errors, then log the failure with context (e.g., URL or function
name) using logger.exception or logger.error; for example, change the bare
except block to "except (KeyboardInterrupt, SystemExit): raise" followed by
"except Exception as e: logger.exception('Failed to crawl %s', url, exc_info=e)"
so you don't silently swallow errors and you retain useful debug information.

…ptions

Issues addressed (per PR feedback):
1. spider() now accumulates all discovered links across pages instead of
   only returning the last page's links. A visited set prevents re-queuing.
2. Content-Type check uses .startswith() to handle values like
   'text/html; charset=utf-8'.
3. Bare 'except:' replaced with specific (URLError, HTTPError, ValueError)
   to avoid suppressing KeyboardInterrupt/SystemExit.
4. Added LinkParser.__init__ to accept baseUrl, fixing the original
   UnboundLocalError on 'self.baseUrl' referenced before assignment.
5. spider() now pushes newly discovered links onto pagesToVisit so the
   loop actually traverses up to maxPages pages.
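Taken together, the five points above describe a loop along these lines (a hedged, self-contained sketch rather than the committed code; fetch is a hypothetical stand-in for LinkParser(url).getLinks(url), Exception is caught broadly for brevity, and the dedup check on all_links folds in the reviewer's duplicate fix):

```python
from collections import deque

def crawl(start, maxPages, fetch):
    """Breadth-first crawl of up to maxPages pages from start,
    returning the unique links discovered along the way."""
    visited = set()
    all_links = []
    pagesToVisit = deque([start])
    numberVisited = 0
    while numberVisited < maxPages and pagesToVisit:
        url = pagesToVisit.popleft()      # FIFO: breadth-first order
        if url in visited:
            continue                      # skip duplicates before spending budget
        visited.add(url)
        numberVisited += 1
        try:
            links = fetch(url)
        except Exception:
            continue                      # one bad page does not end the crawl
        for link in links:
            if link not in visited and link not in all_links:
                pagesToVisit.append(link)
                all_links.append(link)
    return all_links
```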

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
core/crawler.py (2)

72-78: ⚠️ Potential issue | 🟡 Minor

numberVisited is incremented before the duplicate check.

When a duplicate URL is popped from pagesToVisit (see related comment about duplicates), numberVisited is incremented before the if url in visited: continue check. This wastes crawl budget on skipped duplicates.

Move the increment after the visited check:

Proposed fix
     while numberVisited < maxPages and pagesToVisit:
-        numberVisited += 1
         url = pagesToVisit.pop(0)
 
         if url in visited:
             continue
         visited.add(url)
+        numberVisited += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` around lines 72 - 78, In core/crawler.py inside the main
crawl loop (the while loop that uses numberVisited, pagesToVisit and visited),
move the increment of numberVisited so it happens only after confirming the
popped URL is not a duplicate: pop url from pagesToVisit, check "if url in
visited: continue" (and only then add to visited), and only after that increment
numberVisited; update any related logic that assumes the increment location
(e.g., any logging or loop-exit checks that reference numberVisited) so the
crawl budget isn't consumed by duplicates.

85-88: ⚠️ Potential issue | 🟠 Major

Duplicate URLs can be added to all_links and pagesToVisit.

The check if link not in visited is insufficient. If the same link appears on multiple pages before being crawled, it will be added multiple times because it hasn't entered visited yet.

Consequences:

  1. all_links contains duplicates (violates docstring: "All unique URLs")
  2. Duplicates in pagesToVisit waste numberVisited counts when popped and skipped
Proposed fix using a set for discovered links
-    all_links: list[str] = []
+    all_links: set[str] = set()
     numberVisited = 0
 
     while numberVisited < maxPages and pagesToVisit:
@@
         for link in links:
-            if link not in visited:
+            if link not in visited and link not in all_links:
                 pagesToVisit.append(link)
-                all_links.append(link)
+                all_links.add(link)
 
-    return all_links
+    return list(all_links)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` around lines 85 - 88, The loop that appends links uses only
the visited set so duplicate links discovered before being visited can be added
multiple times to pagesToVisit and all_links; introduce a new set (e.g.,
discovered or discovered_links) initialized from visited and update it whenever
you append a link: check "if link not in discovered" before appending to
pagesToVisit and all_links, add the link to discovered immediately after
appending, and keep using the existing visited set when a page is actually
crawled (this ensures pagesToVisit and all_links contain unique URLs while
preserving the current visited logic).
🧹 Nitpick comments (2)
core/crawler.py (2)

80-83: Exception handling in spider() is redundant.

getLinks() already catches URLError and HTTPError internally (lines 44-45) and returns ("", []). These exceptions will never propagate to line 82. Only ValueError could potentially be caught here (e.g., from HTMLParser.feed()).

Consider simplifying to only catch what can actually propagate:

Proposed fix
         try:
             _, links = LinkParser(url).getLinks(url)
-        except (URLError, HTTPError, ValueError):
+        except ValueError:
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` around lines 80 - 83, In spider(), the except block is
catching URLError and HTTPError which never propagate because
LinkParser.getLinks (called as LinkParser(url).getLinks(url)) already handles
them and returns ("", []); update the error handling to only catch exceptions
that can actually bubble up (e.g., ValueError from HTMLParser.feed()) by
changing the except clause to "except ValueError:" (or remove the try/except if
you prefer to let ValueError propagate), ensuring you reference the
LinkParser.getLinks call so the change is applied to the correct spot.

67-74: Consider using collections.deque for efficient FIFO queue.

list.pop(0) is O(n) because all remaining elements must shift. With deque.popleft(), this becomes O(1).

Proposed fix
+from collections import deque
@@
-    pagesToVisit: list[str] = [url]
+    pagesToVisit: deque[str] = deque([url])
@@
-        url = pagesToVisit.pop(0)
+        url = pagesToVisit.popleft()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/crawler.py` around lines 67 - 74, The current FIFO queue uses a list
named pagesToVisit and calls pagesToVisit.pop(0) inside the while loop (in the
loop controlling numberVisited), which is O(n); replace pagesToVisit with a
collections.deque (import collections or from collections import deque),
initialize pagesToVisit = deque([url]) and change the pop to
pagesToVisit.popleft() and any list appends to use pagesToVisit.append(...) so
queue operations become O(1); update any type hints/comments accordingly (e.g.,
pagesToVisit: deque[str]) and ensure imports are added at top of
core/crawler.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@core/crawler.py`:
- Around line 72-78: In core/crawler.py inside the main crawl loop (the while
loop that uses numberVisited, pagesToVisit and visited), move the increment of
numberVisited so it happens only after confirming the popped URL is not a
duplicate: pop url from pagesToVisit, check "if url in visited: continue" (and
only then add to visited), and only after that increment numberVisited; update
any related logic that assumes the increment location (e.g., any logging or
loop-exit checks that reference numberVisited) so the crawl budget isn't
consumed by duplicates.
- Around line 85-88: The loop that appends links uses only the visited set so
duplicate links discovered before being visited can be added multiple times to
pagesToVisit and all_links; introduce a new set (e.g., discovered or
discovered_links) initialized from visited and update it whenever you append a
link: check "if link not in discovered" before appending to pagesToVisit and
all_links, add the link to discovered immediately after appending, and keep
using the existing visited set when a page is actually crawled (this ensures
pagesToVisit and all_links contain unique URLs while preserving the current
visited logic).

---

Nitpick comments:
In `@core/crawler.py`:
- Around line 80-83: In spider(), the except block is catching URLError and
HTTPError which never propagate because LinkParser.getLinks (called as
LinkParser(url).getLinks(url)) already handles them and returns ("", []); update
the error handling to only catch exceptions that can actually bubble up (e.g.,
ValueError from HTMLParser.feed()) by changing the except clause to "except
ValueError:" (or remove the try/except if you prefer to let ValueError
propagate), ensuring you reference the LinkParser.getLinks call so the change is
applied to the correct spot.
- Around line 67-74: The current FIFO queue uses a list named pagesToVisit and
calls pagesToVisit.pop(0) inside the while loop (in the loop controlling
numberVisited), which is O(n); replace pagesToVisit with a collections.deque
(import collections or from collections import deque), initialize pagesToVisit =
deque([url]) and change the pop to pagesToVisit.popleft() and any list appends
to use pagesToVisit.append(...) so queue operations become O(1); update any type
hints/comments accordingly (e.g., pagesToVisit: deque[str]) and ensure imports
are added at top of core/crawler.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 315db880-0799-42ac-a77a-34c37b93ea74

📥 Commits

Reviewing files that changed from the base of the PR and between 655fe9a and fe9f9e8.

📒 Files selected for processing (1)
  • core/crawler.py

