Add sports team website scraping workflow #14
This change enhances the sports schedule workflow by optionally scraping official team websites using Firecrawl's structured extraction.
Key changes:
- Defined `SPORTS_EXTRACT_SCHEMA` and `fetchScheduleFromWebsite` in `lib/sports.js`.
- Updated `getUpcomingEvents` to merge supplemental data from `lib/data/sports/supplemental/`.
- Enhanced `scripts/fetch-sports.js` to discover team websites and manage scraping with a 24h staleness check.
- Added tests to verify merging of scraped data.
- Exported normalization logic for use in the fetch script.
📝 Walkthrough
Adds Firecrawl extraction schema and two helpers (
Changes
Supplemental Event Aggregation via Firecrawl
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/sports.js`:
- Around line 123-141: The Firecrawl POST currently has no timeout and can hang;
modify the call in lib/sports.js that invokes fetchImpl(...) to use an
AbortController signal and a timeout (e.g., from env.FIRECRAWL_TIMEOUT_MS with a
sensible default like 10000 ms): create AbortController, pass controller.signal
in the fetchImpl options, start a setTimeout to call controller.abort() after
the timeout, and clear the timer after the response; ensure downstream error
handling treats an abort as a timeout error.
- Around line 97-108: The scraped game.time is appended with ":00"
unconditionally causing malformed times when seconds are already present; update
the logic in lib/sports.js around the timestamp/strTime construction
(variables/functions: game.time, strTime, strTimestamp, timestamp) to first
normalize game.time to a valid HH:mm or HH:mm:ss format (e.g., if it matches
HH:mm append ":00", if it already matches HH:mm:ss leave as-is, or if missing
use "00:00:00"), then compose strTime from that normalized value and build
timestamp as `${game.date}T${normalizedTime}` so downstream filtering sees valid
upcoming timestamps.
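The time normalization requested for `lib/sports.js` could look roughly like the sketch below. The helper names `normalizeTime` and `buildTimestamp` are hypothetical (they do not appear in the PR), and falling back to `00:00:00` for unrecognized formats is an assumption that goes slightly beyond the comment, which only covers a missing time.

```javascript
// Hypothetical helper: normalizes a scraped time to HH:mm:ss.
// Assumption: anything missing or unrecognized falls back to "00:00:00".
function normalizeTime(rawTime) {
  if (typeof rawTime !== 'string') return '00:00:00';
  const time = rawTime.trim();
  if (/^\d{2}:\d{2}:\d{2}$/.test(time)) return time; // already HH:mm:ss
  if (/^\d{2}:\d{2}$/.test(time)) return `${time}:00`; // HH:mm -> HH:mm:ss
  return '00:00:00';
}

// Hypothetical helper: composes the timestamp from the normalized time.
function buildTimestamp(game) {
  return `${game.date}T${normalizeTime(game.time)}`;
}
```

With this shape, a scraped `19:30:15` is left alone instead of becoming `19:30:15:00`.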
In `@scripts/fetch-sports.js`:
- Around line 183-195: The current block only writes a file when games &&
games.length > 0, leaving teams with zero results unpersisted; change the logic
so you always write a supplemental JSON for the team even when games is empty.
Specifically, compute normalizedEvents = games && games.length ? games.map(g =>
normalizeScrapedEvent(g, team.strTeam)) : [] and always call fs.writeFile to
write an object containing teamId (team.idTeam), teamName (team.strTeam),
updatedAt (new Date().toISOString()), and events: normalizedEvents (empty array
when no games); keep the console.log but adjust it to reflect zero results when
applicable so teams aren't retried repeatedly.
- Around line 188-193: The fetch-sports script writes supplemental team files to
lib/data/sports/supplemental/<teamId>.json (see the write to filePath in
scripts/fetch-sports.js where JSON with teamId/teamName/updatedAt/events is
created), but the GitHub Actions job in .github/workflows/fetch-sports.yml only
stages lib/data/sports/*.json and thus will skip the supplemental folder; update
the workflow's staging/commit/include pattern to also include
lib/data/sports/supplemental/**/*.json (or move the output to the already-staged
path) so the supplemental artifacts produced by the writeFile call are picked up
and committed.
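One way to read the "always write a supplemental JSON" request is to build the payload unconditionally and leave only the file write to the caller. The sketch below assumes a hypothetical `buildSupplementalPayload` helper; the normalizer is passed in as a parameter so the example stays self-contained, standing in for `normalizeScrapedEvent`.

```javascript
// Hypothetical helper: builds the supplemental payload even when no games
// were scraped, so an empty result still records a fresh updatedAt.
function buildSupplementalPayload(team, games, normalizeEvent, now = new Date()) {
  const events = Array.isArray(games) && games.length > 0
    ? games.map((g) => normalizeEvent(g, team.strTeam))
    : [];
  return {
    teamId: team.idTeam,
    teamName: team.strTeam,
    updatedAt: now.toISOString(),
    events, // empty array when the scrape returned nothing
  };
}
```

The returned object would then be serialized with `JSON.stringify` and written with `fs.writeFile` regardless of `events.length`, which is what lets the staleness check skip teams that legitimately have no games.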
In `@test/sports.test.js`:
- Around line 200-241: The test "getUpcomingEvents merges supplemental (scraped)
data" is time-dependent; make it deterministic by fixing the clock during the
test: stub Date.now()/new Date() (or use sinon.useFakeTimers) so now is a fixed
timestamp, or compute all scrapedData date/time fields relative to a controlled
"now" used by getUpcomingEvents; ensure the scrapedData fields (dateEvent,
strTime, strTimestamp, updatedAt) are set consistently with the fake clock and
restore the clock in finally; reference the test name, getUpcomingEvents, and
createFetchMock when applying the clock stub so assertions about future-event
inclusion remain stable.
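A dependency-free way to pin the clock, as an alternative to `sinon.useFakeTimers`, is sketched below. `withFixedNow` and `filterUpcoming` are hypothetical names; `filterUpcoming` only stands in for whatever filtering `getUpcomingEvents` does. The sketch stubs `Date.now()` only, so it assumes the code under test does not call `new Date()` without arguments.

```javascript
// Hypothetical helper: runs fn with Date.now() pinned to a fixed instant and
// always restores the real clock afterwards, even if fn throws or rejects.
async function withFixedNow(isoString, fn) {
  const realNow = Date.now;
  const fixed = new Date(isoString).getTime();
  Date.now = () => fixed;
  try {
    return await fn();
  } finally {
    Date.now = realNow;
  }
}

// Hypothetical stand-in for the "upcoming only" filter in getUpcomingEvents.
function filterUpcoming(events) {
  const now = Date.now();
  return events.filter((e) => new Date(e.strTimestamp).getTime() > now);
}
```

Fixture timestamps can then be written as literals relative to the pinned instant, so the assertions do not drift as real time passes.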
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8529cc9e-0bd3-4bae-b708-77a8b9a84e49
📒 Files selected for processing (3)
- lib/sports.js
- scripts/fetch-sports.js
- test/sports.test.js
- Implement Firecrawl scraping for official sports team websites
- Add extraction schema and normalization for scraped events
- Support supplemental data merging in getUpcomingEvents
- Add staleness check based on JSON updatedAt field
- Implement 10s timeout for Firecrawl requests
- Update GitHub workflow to persist supplemental data
- Ensure deterministic testing for merging logic
- Implement supplemental WNBA schedule fetching from SportsDataverse CSV
- Add Firecrawl-based scraping for official sports team websites
- Implement a robust character-based CSV parser for SportsDataverse data
- Add 10s timeout and AbortController for Firecrawl requests
- Ensure 'Z' UTC designator in scraped timestamps for correct filtering
- Update staleness check to use JSON 'updatedAt' field
- Fix WNBA league ID to 4516 in fetch script
- Make supplemental merging tests deterministic with mock timers
- Update GitHub workflow to persist and commit supplemental data folder
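The 10s timeout with AbortController mentioned in these commits can be sketched as a small wrapper. `fetchWithTimeout` is a hypothetical name, not the function in the PR; the injectable `fetchImpl` parameter mirrors the `fetchImpl(...)` call named in the review and defaults to the global `fetch` (assumes Node 18+).

```javascript
// Hypothetical wrapper: aborts the request if no response arrives in time.
async function fetchWithTimeout(url, options = {}, timeoutMs = 10000, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { ...options, signal: controller.signal });
  } catch (err) {
    // Report an abort triggered by our own timer as a timeout error.
    if (controller.signal.aborted) {
      throw new Error(`Request to ${url} timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // avoid a stray timer keeping the process alive
  }
}
```

Note that the timer is cleared once the response headers arrive, so this sketch does not bound the time spent reading the body.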
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/fetch-sports.js`:
- Around line 129-131: The CSV parser currently can drop the final row when the
file lacks a trailing newline because leftover data remains in
currentField/currentRow and never gets pushed into rows; update the parsing
logic (where currentField, currentRow, and rows are maintained—e.g., inside the
parse loop function) to, after the loop completes, append the last currentField
to currentRow (if any) and then append currentRow to rows (if it contains any
fields) before using rows[0] as header; ensure you handle empty strings vs
undefined appropriately so a final empty field is preserved when intended.
- Around line 246-249: The current freshness check computes lastUpdated = new
Date(data.updatedAt).getTime() but doesn't handle malformed dates (NaN), so
files with bad updatedAt are treated as fresh; update the logic around
lastUpdated (the variable computed from data.updatedAt) to detect NaN (e.g.,
Number.isNaN(lastUpdated) or !isFinite(lastUpdated)) and immediately treat the
record as stale (return true) instead of proceeding to the subtraction; keep the
existing now and twentyFourHoursMs check for valid timestamps.
- Around line 93-96: This fetch call to download the WNBA CSV lacks a timeout;
wrap it with an AbortController and the same FETCH_TIMEOUT_MS timeout pattern
used elsewhere (e.g., in fetchJson) so the request can't hang indefinitely:
create an AbortController, start a timer that calls controller.abort() after
FETCH_TIMEOUT_MS, pass controller.signal into fetch(url, { signal }), and ensure
the timer is cleared in a finally block so response/text handling (response,
csvText) only proceeds if not aborted.
- Line 158: The strTime value created in the CSV parsing can be in "HH:mm" which
is inconsistent with normalizeScrapedEvent and other sources expecting
"HH:mm:ss"; update the logic that assigns strTime (the line creating strTime
from date.split('T')[1] in scripts/fetch-sports.js) to normalize times to
"HH:mm:ss" by detecting a two-segment time (regex or split) and appending ":00"
when needed, otherwise leaving a full "HH:mm:ss" unchanged so downstream parsers
receive a consistent format.
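The trailing-row issue from the first comment is easiest to see in a minimal character-based parser. The sketch below is not the parser in `scripts/fetch-sports.js`; it is a simplified stand-in that reuses the `currentField`, `currentRow`, and `rows` names from the comment to show the flush after the loop.

```javascript
// Simplified character-based CSV parser (illustrative, not the PR's code).
function parseCsv(text) {
  const rows = [];
  let currentRow = [];
  let currentField = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        currentField += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        currentField += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      currentRow.push(currentField);
      currentField = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // treat CRLF as one break
      currentRow.push(currentField);
      rows.push(currentRow);
      currentRow = [];
      currentField = '';
    } else {
      currentField += ch;
    }
  }
  // Flush the last row when the input has no trailing newline. Checking
  // currentRow.length keeps a final empty field (e.g. "1,") intact.
  if (currentField !== '' || currentRow.length > 0) {
    currentRow.push(currentField);
    rows.push(currentRow);
  }
  return rows;
}
```

Without the final flush, `parseCsv('a,b\n1,2')` would return only the header row.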
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 052c3391-fa20-4766-9fc8-cdd6b9c91879
📒 Files selected for processing (6)
- .github/workflows/fetch-sports.yml
- lib/data/sports/supplemental/136437.json
- lib/data/sports/supplemental/136438.json
- lib/sports.js
- scripts/fetch-sports.js
- test/sports.test.js
✅ Files skipped from review due to trivial changes (1)
- lib/data/sports/supplemental/136438.json
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/sports.js
- test/sports.test.js
- Add WNBA schedule data from SportsDataverse (wehoop) CSV
- Implement official team website scraping using Firecrawl extraction
- Build robust character-based CSV parser for wehoop datasets
- Add AbortController timeouts for all supplemental data fetches
- Normalize scraped/CSV times to HH:mm:ss with 'Z' UTC designator
- Refine staleness logic to handle malformed updatedAt dates
- Update GitHub workflow to persist the supplemental/ data folder
- Fix WNBA league ID to 4516 for accurate TSDB lookups
- Ensure deterministic sports merging tests with mock timers
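The refined staleness logic for malformed `updatedAt` dates might be sketched as follows. `isStale` and `TWENTY_FOUR_HOURS_MS` are hypothetical names; the point is that a `NaN` timestamp is caught before the subtraction, since any comparison against `NaN` is false and would otherwise mark the file as fresh.

```javascript
const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: returns true when the supplemental file should be
// re-fetched. A missing or unparseable updatedAt counts as stale.
function isStale(data, now = Date.now()) {
  const lastUpdated = new Date(data && data.updatedAt).getTime();
  if (Number.isNaN(lastUpdated)) return true;
  return now - lastUpdated > TWENTY_FOUR_HOURS_MS;
}
```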
🧹 Nitpick comments (1)
scripts/fetch-sports.js (1)
186-199: 💤 Low value: Team name mismatch may silently skip events.
The lookup `teamSupplemental.get(team.strTeam)` assumes TheSportsDB team names exactly match the CSV's `home_display_name`/`away_display_name`. If they differ (e.g., "Las Vegas Aces" vs "Aces"), events won't be saved for that team, with no warning logged.
Consider logging when a team has no matching events:
Proposed fix
  for (const team of teams) {
    const teamEvents = teamSupplemental.get(team.strTeam);
    if (teamEvents) {
      const filePath = path.join(SUPPLEMENTAL_DATA_DIR, `${team.idTeam}.json`);
      // ...
      console.log(`  Saved ${teamEvents.length} WNBA supplemental events for ${team.strTeam}`);
+   } else {
+     console.log(`  No WNBA supplemental events found for ${team.strTeam} (name mismatch?)`);
    }
  }
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@scripts/fetch-sports.js`:
- Around line 186-199: The code currently uses
teamSupplemental.get(team.strTeam) and silently skips teams if the exact name
doesn't match; update the block around teamSupplemental.get to log a warning
when teamEvents is falsy (include team.strTeam and team.idTeam in the message)
and attempt a tolerant fallback lookup (e.g., normalized lowercase/trim match or
match on team.idTeam or other CSV display name keys) before deciding to skip; if
no fallback match is found, still log that no supplemental events were found for
that team so missing data is visible, otherwise proceed to write the merged JSON
as before using filePath and fs.writeFile.
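A tolerant fallback lookup along the lines of this nitpick could be sketched as below. `normalizeTeamName` and `findTeamEvents` are hypothetical names, and the sketch assumes `teamSupplemental` is a `Map` keyed by the CSV display name. It only handles case and whitespace differences; a genuinely different name such as "Aces" vs "Las Vegas Aces" would still miss and should be logged.

```javascript
// Hypothetical helper: case-insensitive, whitespace-trimmed comparison key.
function normalizeTeamName(name) {
  return String(name ?? '').trim().toLowerCase();
}

// Hypothetical helper: exact lookup first, then a normalized scan.
function findTeamEvents(teamSupplemental, teamName) {
  const exact = teamSupplemental.get(teamName);
  if (exact) return exact;
  const wanted = normalizeTeamName(teamName);
  for (const [key, events] of teamSupplemental) {
    if (normalizeTeamName(key) === wanted) return events;
  }
  return undefined; // caller should log a warning with strTeam and idTeam
}
```

When `findTeamEvents` returns `undefined`, the caller can emit something like `console.warn` with `team.strTeam` and `team.idTeam` so the missing data is visible in the workflow logs.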
- Integrate WNBA schedule data from SportsDataverse (wehoop) CSV
- Implement official team website scraping using Firecrawl extraction
- Implement a robust character-based CSV parser for wehoop datasets
- Add AbortController timeouts for all supplemental data fetches
- Implement tolerant (case-insensitive/trimmed) team name matching for WNBA
- Normalize scraped/CSV times to HH:mm:ss with 'Z' UTC designator
- Refine staleness logic to handle malformed updatedAt dates
- Update GitHub workflow to persist the supplemental/ data folder
- Fix WNBA league ID to 4516 for accurate TSDB lookups
- Ensure deterministic sports merging tests with mock timers
Added official sports team website scraping via Firecrawl to supplement TheSportsDB data, allowing for up to 200 games per team. Merged scraped data into the ICS generation workflow.
PR created automatically by Jules for task 16370083516219159048 started by @DisabledAbel