
Add sports team website scraping workflow #14

Merged
DisabledAbel merged 5 commits into main from feature/sports-website-scraping-16370083516219159048
Jun 12, 2026

@DisabledAbel DisabledAbel commented Jun 11, 2026

Owner

Added official sports team website scraping via Firecrawl to supplement TheSportsDB data, allowing for up to 200 games per team. Merged scraped data into the ICS generation workflow.


PR created automatically by Jules for task 16370083516219159048 started by @DisabledAbel

Summary by CodeRabbit

  • New Features
    • Upcoming events now merge per-team supplemental schedules (scraped and CSV-sourced) for more complete, timely listings — improving coverage for select teams (notably WNBA).
  • Data
    • Added cached per-team supplemental schedule files to surface recent team-specific games when absent from league feeds.
  • Tests
    • Added test verifying supplemental schedule integration with upcoming events.
  • CI
    • Workflow updated to include supplemental schedule files when preparing commits.

This change enhances the sports schedule workflow by optionally scraping
official team websites using Firecrawl's structured extraction.

Key changes:
- Defined `SPORTS_EXTRACT_SCHEMA` and `fetchScheduleFromWebsite` in `lib/sports.js`.
- Updated `getUpcomingEvents` to merge supplemental data from `lib/data/sports/supplemental/`.
- Enhanced `scripts/fetch-sports.js` to discover team websites and manage scraping with a 24h staleness check.
- Added tests to verify merging of scraped data.
- Exported normalization logic for use in the fetch script.
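
As an illustration of the merge step listed above, the sketch below combines league-feed events with supplemental ones, drops duplicates and past events, and sorts by start time. It is not the code from this PR: `mergeEvents` is a hypothetical name, and the dedupe key (`strEvent` plus `strTimestamp`) is an assumption about which fields identify a game.

```javascript
// Hypothetical helper, not the PR's implementation.
// Assumes each event carries `strEvent` and an ISO `strTimestamp`.
function mergeEvents(primary, supplemental, now = Date.now()) {
  const seen = new Set();
  const merged = [];
  for (const event of [...primary, ...supplemental]) {
    // Assumed dedupe key: same title at the same start time is the same game.
    const key = `${event.strEvent}|${event.strTimestamp}`;
    if (seen.has(key)) continue;
    seen.add(key);
    const startsAt = Date.parse(event.strTimestamp);
    // Skip events with unparseable timestamps and events already in the past.
    if (Number.isNaN(startsAt) || startsAt < now) continue;
    merged.push(event);
  }
  return merged.sort(
    (a, b) => Date.parse(a.strTimestamp) - Date.parse(b.strTimestamp)
  );
}
```

Passing `now` as a parameter keeps the function deterministic in tests, which matches the later review feedback about time-dependent assertions.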
@google-labs-jules

Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@vercel

vercel Bot commented Jun 11, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| make-ics | Ready | Preview, Comment | Jun 12, 2026 5:15am |

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor


Warning

Review limit reached

@DisabledAbel, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes and 1 second. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0fb82ab-666d-4754-90dd-e9f0efaaaa93

📥 Commits

Reviewing files that changed from the base of the PR and between f130c18 and a11c703.

📒 Files selected for processing (1)
  • scripts/fetch-sports.js
📝 Walkthrough


Adds a Firecrawl extraction schema and two helpers (normalizeScrapedEvent, fetchScheduleFromWebsite) that fetch and normalize scraped team schedules, and caches per-team supplemental JSON under lib/data/sports/supplemental/. getUpcomingEvents now loads those supplemental events, fetch-sports produces the supplemental files (from the WNBA CSV and from website scrapes), CI stages them, and a new test validates the merged results.

Changes

Supplemental Event Aggregation via Firecrawl

| Layer / File(s) | Summary |
| --- | --- |
| Firecrawl extraction schema and helpers (`lib/sports.js`) | Adds SUPPLEMENTAL_DATA_DIR and SPORTS_EXTRACT_SCHEMA; exports normalizeScrapedEvent(game, teamName) and fetchScheduleFromWebsite(websiteUrl, { env, fetchImpl }) to call Firecrawl, enforce API key and timeout, and return extracted games arrays. |
| Supplemental event loading in getUpcomingEvents (`lib/sports.js`) | Adds internal loadSupplementalTeamEvents(teamId) to read `supplemental/<teamId>.json` and integrates its promise into getUpcomingEvents so supplemental events are merged with league/season and eventsnext results and then deduped/filtered. |
| Supplemental data collection in fetch-sports script (`scripts/fetch-sports.js`) | Initializes SUPPLEMENTAL_DATA_DIR, updates WNBA league id to 4516, adds fetchWNBASupplemental(teams) to parse a WNBA CSV into per-team supplemental JSON, adds isSupplementalStale(teamId), and implements per-team Firecrawl scraping + normalization + write when stale (throttled, per-team error handling). |
| Supplemental data files and CI (`lib/data/sports/supplemental/*.json`, `.github/workflows/fetch-sports.yml`) | Adds supplemental JSON files for teams 136437 and 136438; updates CI commit step to include `lib/data/sports/supplemental/*.json` when staging changes. |
| Test for supplemental event merging (`test/sports.test.js`) | Adds test that writes a supplemental team JSON file, invokes getUpcomingEvents, asserts supplemental events are present in merged results, and removes the file in cleanup. |
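
To make the loading step in the walkthrough concrete, here is a rough sketch of reading a per-team supplemental file. It is a synchronous simplification for illustration only (the walkthrough describes a promise-based loader), and the signature with an explicit `dir` argument is an assumption. The file shape (`teamId`, `teamName`, `updatedAt`, `events`) follows the fields named in the review comments.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical, synchronous simplification of the loader described above.
// Returns [] when the per-team file is missing, unreadable, or malformed,
// so a team without supplemental data never breaks the merge.
function loadSupplementalTeamEvents(dir, teamId) {
  const filePath = path.join(dir, `${teamId}.json`);
  try {
    const data = JSON.parse(fs.readFileSync(filePath, 'utf8'));
    return Array.isArray(data.events) ? data.events : [];
  } catch {
    return [];
  }
}
```

Swallowing read errors here is a design choice for a supplemental source: the league feed remains the primary data, so a missing cache file degrades coverage rather than failing the request.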

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

codex

Poem

🐰 I hopped through HTML and CSV,
Fetched team schedules merrily,
I normalized each game with cheer,
Saved per-team files for all to share,
Calendar-bound — the season's near!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add sports team website scraping workflow' accurately reflects the main addition: a new Firecrawl-based website scraping workflow integrated into the sports data pipeline. However, it omits the equally significant WNBA supplemental CSV source and the merged supplemental data integration into getUpcomingEvents, which are substantial parts of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/sports.js`:
- Around line 123-141: The Firecrawl POST currently has no timeout and can hang;
modify the call in lib/sports.js that invokes fetchImpl(...) to use an
AbortController signal and a timeout (e.g., from env.FIRECRAWL_TIMEOUT_MS with a
sensible default like 10000 ms): create AbortController, pass controller.signal
in the fetchImpl options, start a setTimeout to call controller.abort() after
the timeout, and clear the timer after the response; ensure downstream error
handling treats an abort as a timeout error.
- Around line 97-108: The scraped game.time is appended with ":00"
unconditionally causing malformed times when seconds are already present; update
the logic in lib/sports.js around the timestamp/strTime construction
(variables/functions: game.time, strTime, strTimestamp, timestamp) to first
normalize game.time to a valid HH:mm or HH:mm:ss format (e.g., if it matches
HH:mm append ":00", if it already matches HH:mm:ss leave as-is, or if missing
use "00:00:00"), then compose strTime from that normalized value and build
timestamp as `${game.date}T${normalizedTime}` so downstream filtering sees valid
upcoming timestamps.

In `@scripts/fetch-sports.js`:
- Around line 183-195: The current block only writes a file when games &&
games.length > 0, leaving teams with zero results unpersisted; change the logic
so you always write a supplemental JSON for the team even when games is empty.
Specifically, compute normalizedEvents = games && games.length ? games.map(g =>
normalizeScrapedEvent(g, team.strTeam)) : [] and always call fs.writeFile to
write an object containing teamId (team.idTeam), teamName (team.strTeam),
updatedAt (new Date().toISOString()), and events: normalizedEvents (empty array
when no games); keep the console.log but adjust it to reflect zero results when
applicable so teams aren't retried repeatedly.
- Around line 188-193: The fetch-sports script writes supplemental team files to
lib/data/sports/supplemental/<teamId>.json (see the write to filePath in
scripts/fetch-sports.js where JSON with teamId/teamName/updatedAt/events is
created), but the GitHub Actions job in .github/workflows/fetch-sports.yml only
stages lib/data/sports/*.json and thus will skip the supplemental folder; update
the workflow's staging/commit/include pattern to also include
lib/data/sports/supplemental/**/*.json (or move the output to the already-staged
path) so the supplemental artifacts produced by the writeFile call are picked up
and committed.

In `@test/sports.test.js`:
- Around line 200-241: The test "getUpcomingEvents merges supplemental (scraped)
data" is time-dependent; make it deterministic by fixing the clock during the
test: stub Date.now()/new Date() (or use sinon.useFakeTimers) so now is a fixed
timestamp, or compute all scrapedData date/time fields relative to a controlled
"now" used by getUpcomingEvents; ensure the scrapedData fields (dateEvent,
strTime, strTimestamp, updatedAt) are set consistently with the fake clock and
restore the clock in finally; reference the test name, getUpcomingEvents, and
createFetchMock when applying the clock stub so assertions about future-event
inclusion remain stable.
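
For reference, the timeout pattern requested in the first inline comment could look roughly like the sketch below. It is not the code merged in this PR: `fetchWithTimeout` and its parameters are hypothetical names, and `fetchImpl` stands for any fetch-compatible function.

```javascript
// Hypothetical wrapper around a fetch-compatible function.
// Aborts the request after `timeoutMs` and reports the abort as a timeout.
async function fetchWithTimeout(
  url,
  options = {},
  { fetchImpl = fetch, timeoutMs = 10000 } = {}
) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { ...options, signal: controller.signal });
  } catch (error) {
    // Only relabel the error when our own timer triggered the abort.
    if (controller.signal.aborted) {
      throw new Error(`Request to ${url} timed out after ${timeoutMs} ms`);
    }
    throw error;
  } finally {
    clearTimeout(timer); // always clear, on success and on failure
  }
}
```

A caller could pass `timeoutMs: Number(env.FIRECRAWL_TIMEOUT_MS) || 10000`, using the environment variable name suggested in the review. Note that the timer is cleared once the response headers arrive, so reading a slow body is not covered by this sketch.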

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8529cc9e-0bd3-4bae-b708-77a8b9a84e49

📥 Commits

Reviewing files that changed from the base of the PR and between 28527fa and fc39d3e.

📒 Files selected for processing (3)
  • lib/sports.js
  • scripts/fetch-sports.js
  • test/sports.test.js

Comment thread lib/sports.js
Comment thread lib/sports.js Outdated
Comment thread scripts/fetch-sports.js Outdated
Comment thread scripts/fetch-sports.js Outdated
Comment thread test/sports.test.js
- Implement Firecrawl scraping for official sports team websites
- Add extraction schema and normalization for scraped events
- Support supplemental data merging in getUpcomingEvents
- Add staleness check based on JSON updatedAt field
- Implement 10s timeout for Firecrawl requests
- Update GitHub workflow to persist supplemental data
- Ensure deterministic testing for merging logic
- Implement supplemental WNBA schedule fetching from SportsDataverse CSV
- Add Firecrawl-based scraping for official sports team websites
- Implement a robust character-based CSV parser for SportsDataverse data
- Add 10s timeout and AbortController for Firecrawl requests
- Ensure 'Z' UTC designator in scraped timestamps for correct filtering
- Update staleness check to use JSON 'updatedAt' field
- Fix WNBA league ID to 4516 in fetch script
- Make supplemental merging tests deterministic with mock timers
- Update GitHub workflow to persist and commit supplemental data folder
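
The time normalization mentioned in these commit notes (and in the earlier review comment about the unconditional ":00" suffix) can be sketched as follows. `normalizeTime` and `buildTimestamp` are hypothetical names, and appending "Z" assumes the scraped times are already in UTC, which this thread does not confirm.

```javascript
// Hypothetical helper: coerce "HH:mm" or "HH:mm:ss" into "HH:mm:ss",
// falling back to midnight when the time is missing or unrecognized.
function normalizeTime(time) {
  const match = /^(\d{1,2}):(\d{2})(?::(\d{2}))?$/.exec(String(time ?? '').trim());
  if (!match) return '00:00:00';
  const [, hours, minutes, seconds = '00'] = match;
  return `${hours.padStart(2, '0')}:${minutes}:${seconds}`;
}

// Build a UTC timestamp. The trailing "Z" keeps Date parsing from
// interpreting the value as local time. Assumes the input is already UTC.
function buildTimestamp(date, time) {
  return `${date}T${normalizeTime(time)}Z`;
}
```

For example, `buildTimestamp('2026-06-20', '19:30')` yields `2026-06-20T19:30:00Z`, and a value that already includes seconds is left unchanged.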

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/fetch-sports.js`:
- Around line 129-131: The CSV parser currently can drop the final row when the
file lacks a trailing newline because leftover data remains in
currentField/currentRow and never gets pushed into rows; update the parsing
logic (where currentField, currentRow, and rows are maintained—e.g., inside the
parse loop function) to, after the loop completes, append the last currentField
to currentRow (if any) and then append currentRow to rows (if it contains any
fields) before using rows[0] as header; ensure you handle empty strings vs
undefined appropriately so a final empty field is preserved when intended.
- Around line 246-249: The current freshness check computes lastUpdated = new
Date(data.updatedAt).getTime() but doesn't handle malformed dates (NaN), so
files with bad updatedAt are treated as fresh; update the logic around
lastUpdated (the variable computed from data.updatedAt) to detect NaN (e.g.,
Number.isNaN(lastUpdated) or !isFinite(lastUpdated)) and immediately treat the
record as stale (return true) instead of proceeding to the subtraction; keep the
existing now and twentyFourHoursMs check for valid timestamps.
- Around line 93-96: This fetch call to download the WNBA CSV lacks a timeout;
wrap it with an AbortController and the same FETCH_TIMEOUT_MS timeout pattern
used elsewhere (e.g., in fetchJson) so the request can't hang indefinitely:
create an AbortController, start a timer that calls controller.abort() after
FETCH_TIMEOUT_MS, pass controller.signal into fetch(url, { signal }), and ensure
the timer is cleared in a finally block so response/text handling (response,
csvText) only proceeds if not aborted.
- Line 158: The strTime value created in the CSV parsing can be in "HH:mm" which
is inconsistent with normalizeScrapedEvent and other sources expecting
"HH:mm:ss"; update the logic that assigns strTime (the line creating strTime
from date.split('T')[1] in scripts/fetch-sports.js) to normalize times to
"HH:mm:ss" by detecting a two-segment time (regex or split) and appending ":00"
when needed, otherwise leaving a full "HH:mm:ss" unchanged so downstream parsers
receive a consistent format.
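
The first finding above, about losing the last row when the file has no trailing newline, is easiest to see in a small character-based parser. The following is a minimal sketch under the assumption of RFC 4180-style quoting; it reuses the `currentField`, `currentRow`, and `rows` names from the comment but is not the parser from `scripts/fetch-sports.js`.

```javascript
// Minimal character-based CSV parser (a sketch, not the PR's implementation).
// Handles quoted fields, escaped quotes ("") and CRLF/LF line endings.
function parseCsv(text) {
  const rows = [];
  let currentRow = [];
  let currentField = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i += 1) {
    const char = text[i];
    if (inQuotes) {
      if (char === '"' && text[i + 1] === '"') {
        currentField += '"';
        i += 1; // skip the second quote of an escaped pair
      } else if (char === '"') {
        inQuotes = false;
      } else {
        currentField += char;
      }
    } else if (char === '"') {
      inQuotes = true;
    } else if (char === ',') {
      currentRow.push(currentField);
      currentField = '';
    } else if (char === '\n' || char === '\r') {
      if (char === '\r' && text[i + 1] === '\n') i += 1; // treat CRLF as one break
      currentRow.push(currentField);
      rows.push(currentRow);
      currentRow = [];
      currentField = '';
    } else {
      currentField += char;
    }
  }

  // Flush the final row when the file has no trailing newline.
  if (currentField !== '' || currentRow.length > 0) {
    currentRow.push(currentField);
    rows.push(currentRow);
  }
  return rows;
}
```

The flush condition also preserves a final empty field (for example `a,` with no newline parses to `['a', '']`). Blank lines come back as single-field rows containing an empty string, so a caller may want to filter those out.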

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 052c3391-fa20-4766-9fc8-cdd6b9c91879

📥 Commits

Reviewing files that changed from the base of the PR and between fc39d3e and e2ed30f.

📒 Files selected for processing (6)
  • .github/workflows/fetch-sports.yml
  • lib/data/sports/supplemental/136437.json
  • lib/data/sports/supplemental/136438.json
  • lib/sports.js
  • scripts/fetch-sports.js
  • test/sports.test.js
✅ Files skipped from review due to trivial changes (1)
  • lib/data/sports/supplemental/136438.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/sports.js
  • test/sports.test.js

Comment thread scripts/fetch-sports.js
Comment thread scripts/fetch-sports.js
Comment thread scripts/fetch-sports.js Outdated
Comment thread scripts/fetch-sports.js
- Add WNBA schedule data from SportsDataverse (wehoop) CSV
- Implement official team website scraping using Firecrawl extraction
- Build robust character-based CSV parser for wehoop datasets
- Add AbortController timeouts for all supplemental data fetches
- Normalize scraped/CSV times to HH:mm:ss with 'Z' UTC designator
- Refine staleness logic to handle malformed updatedAt dates
- Update GitHub workflow to persist the supplemental/ data folder
- Fix WNBA league ID to 4516 for accurate TSDB lookups
- Ensure deterministic sports merging tests with mock timers

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
scripts/fetch-sports.js (1)

186-199: 💤 Low value

Team name mismatch may silently skip events.

The lookup teamSupplemental.get(team.strTeam) assumes TheSportsDB team names exactly match the CSV's home_display_name/away_display_name. If they differ (e.g., "Las Vegas Aces" vs "Aces"), events won't be saved for that team, with no warning logged.

Consider logging when a team has no matching events:

Proposed fix
     for (const team of teams) {
       const teamEvents = teamSupplemental.get(team.strTeam);
       if (teamEvents) {
         const filePath = path.join(SUPPLEMENTAL_DATA_DIR, `${team.idTeam}.json`);
         // ...
         console.log(`    Saved ${teamEvents.length} WNBA supplemental events for ${team.strTeam}`);
+      } else {
+        console.log(`    No WNBA supplemental events found for ${team.strTeam} (name mismatch?)`);
       }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/fetch-sports.js` around lines 186 - 199, The code currently uses
teamSupplemental.get(team.strTeam) and silently skips teams if the exact name
doesn't match; update the block around teamSupplemental.get to log a warning
when teamEvents is falsy (include team.strTeam and team.idTeam in the message)
and attempt a tolerant fallback lookup (e.g., normalized lowercase/trim match or
match on team.idTeam or other CSV display name keys) before deciding to skip; if
no fallback match is found, still log that no supplemental events were found for
that team so missing data is visible, otherwise proceed to write the merged JSON
as before using filePath and fs.writeFile.
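
One possible shape for the tolerant lookup described in this prompt is sketched below. The helper names are hypothetical, and the normalization (trim, lowercase, collapse whitespace) is an assumption; it would not reconcile genuinely different names such as "Las Vegas Aces" and "Aces".

```javascript
// Hypothetical normalizer: case-insensitive, trimmed, single-spaced.
const normalizeName = (name) =>
  String(name ?? '').trim().toLowerCase().replace(/\s+/g, ' ');

// Index CSV events by normalized display name.
function buildTeamIndex(eventsByName) {
  const index = new Map();
  for (const [name, events] of eventsByName) {
    index.set(normalizeName(name), events);
  }
  return index;
}

// Look up a team with the same normalization and log when nothing matches,
// so missing data is visible instead of silently skipped.
function findTeamEvents(index, team) {
  const events = index.get(normalizeName(team.strTeam));
  if (!events) {
    console.warn(
      `No WNBA supplemental events found for ${team.strTeam} (${team.idTeam})`
    );
  }
  return events ?? [];
}
```

Normalizing once when building the index keeps each lookup a single `Map.get`, rather than scanning every CSV name per team.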

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec2a04a3-4112-4057-a201-314f305d09e5

📥 Commits

Reviewing files that changed from the base of the PR and between e2ed30f and f130c18.

📒 Files selected for processing (1)
  • scripts/fetch-sports.js

- Integrate WNBA schedule data from SportsDataverse (wehoop) CSV
- Implement official team website scraping using Firecrawl extraction
- Implement a robust character-based CSV parser for wehoop datasets
- Add AbortController timeouts for all supplemental data fetches
- Implement tolerant (case-insensitive/trimmed) team name matching for WNBA
- Normalize scraped/CSV times to HH:mm:ss with 'Z' UTC designator
- Refine staleness logic to handle malformed updatedAt dates
- Update GitHub workflow to persist the supplemental/ data folder
- Fix WNBA league ID to 4516 for accurate TSDB lookups
- Ensure deterministic sports merging tests with mock timers
@DisabledAbel DisabledAbel merged commit 7cb1758 into main Jun 12, 2026
4 checks passed
@DisabledAbel DisabledAbel deleted the feature/sports-website-scraping-16370083516219159048 branch June 12, 2026 05:17