
📅 Expand sports data with ESPN scraping and SportsDataverse #15

Merged
DisabledAbel merged 4 commits into main from feature/sports-website-scraping-16370083516219159048 on Jun 12, 2026

Conversation

@DisabledAbel

DisabledAbel commented Jun 12, 2026

Owner
  • Automate ESPN schedule scraping every 6 hours via Firecrawl
  • Integrate SportsDataverse CSVs for NBA, NFL, NHL, and WNBA
  • Implement robust character-based CSV parser for external datasets
  • Add tolerant team matching (abbreviations, prefix, case-insensitive)
  • Normalize all supplemental times to HH:mm:ss with 'Z' designator
  • Update GitHub Actions workflow to run every 6 hours
  • Synchronize internal cache staleness check with 6-hour workflow
  • Ensure deterministic merging tests with mock timers
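
To illustrate the "character-based CSV parser" item in the list above, here is a minimal sketch of quote-aware parsing. The name `parseCSV` and its treatment of blank lines are assumptions made for illustration; this is not the parser in `scripts/fetch-sports.js`.

```javascript
// Hypothetical helper: walks the input one character at a time and tracks
// whether the cursor is inside a quoted field.
function parseCSV(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // doubled quote inside a quoted field is a literal quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and line breaks are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // treat CRLF as one break
      row.push(field);
      field = '';
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }

  // Flush the final record when the input has no trailing line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

A plain `split(',')` would break on fields such as `"Portland, OR"`, which is the usual reason to parse character by character. Note that this sketch emits a single empty field for a blank line, so callers may want to filter those rows out.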

Summary by CodeRabbit

  • New Features

    • Schedules now refresh every 6 hours; added ESPN schedule scraping with timeout and validation.
    • Supplemental CSV ingestion added, merging per-team supplemental data with fallback scraping.
  • Bug Fixes

    • Normalized team names, badge image URLs, and match status fields across datasets.
  • Chores

    • Workflow staging updated to tolerate supplemental data file presence; removed some outdated supplemental team files.
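
The "scraping with timeout" item in the summary above can be sketched generically. The helper name `withTimeout`, its signature, and the error message format are assumptions for illustration; the actual Firecrawl request and error mapping in `lib/sports.js` may differ.

```javascript
// Hypothetical helper: runs an async task with an AbortController-backed
// deadline and maps an abort to a labelled timeout error.
async function withTimeout(task, timeoutMs, label = 'request') {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The task is expected to pass `signal` on to fetch() or similar.
    return await task(controller.signal);
  } catch (err) {
    if (controller.signal.aborted) {
      throw new Error(`${label} timed out after ${timeoutMs}ms`);
    }
    throw err; // unrelated failures pass through unchanged
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after completion
  }
}
```

A caller could wrap a request as `withTimeout((signal) => fetch(url, { signal }), 30000, 'ESPN schedule scrape')`, where `url` and the 30-second budget are placeholders.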

@vercel

vercel Bot commented Jun 12, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| make-ics | Ready | Preview, Comment | Jun 12, 2026 7:17pm |

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds ESPN schedule scraping (Firecrawl), per-league CSV supplemental ingestion, reduces supplemental staleness to 6 hours, updates main orchestration to prefer ESPN with website fallback, and runs the workflow every 6 hours while staging supplemental JSON outputs.

Changes

Supplemental Schedule Scraping and Caching

| Layer | File(s) | Summary |
| --- | --- | --- |
| Workflow scheduling and supplemental file staging | `.github/workflows/fetch-sports.yml` | Cron updated from 12-hour to 6-hour intervals; commit staging now includes `lib/data/sports/supplemental/*.json` with tolerant git add behavior. |
| ESPN schedule scraping via Firecrawl | `lib/sports.js` | New exported `fetchScheduleFromESPN(leagueSlug, teamSlug, options)` validates `FIRECRAWL_API_KEY`, issues a Firecrawl extract with ESPN prompt and schema, enforces timeout via AbortController, and returns parsed games with ESPN-specific timeout error mapping. |
| ESPN integration and helpers | `scripts/fetch-sports.js` | Adds `LEAGUE_TO_ESPN_SLUG`, `TEAM_ESPN_SLUG_OVERRIDES`, and imports `fetchScheduleFromESPN`; provides `getESPNTeamSlug(team)` helper (overrides then slugified fallback). |
| Supplemental CSV ingestion | `scripts/fetch-sports.js` | `fetchLeagueSupplementalCSV(league, teams)` downloads CSVs, performs quote-aware parsing, validates headers, normalizes rows into event objects, aggregates events per team with tolerant matching, and writes per-team supplemental JSON files. |
| Main orchestration and staleness | `scripts/fetch-sports.js` | `main()` ensures supplemental dir, runs per-league CSV ingestion when configured, sets supplemental staleness to 6 hours, rescrapes stale teams preferring ESPN (league+team slug) with fallback to team `strWebsite`, merges/normalizes results, writes supplemental JSON, and applies per-team throttling. |
| Data updates | `lib/data/sports/...`, `lib/data/sports/supplemental/...` | Many `lib/data/sports/*.json` files had `updatedAt` changes and event normalizations (badge URL migrations, strStatus/score/video/result normalizations); two `lib/data/sports/supplemental/*.json` WNBA supplemental JSON files were removed. |
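
The "overrides then slugified fallback" behavior described for `getESPNTeamSlug(team)` can be pictured roughly as follows. This is a sketch under assumptions: the override entry is invented, keying overrides by the team name in `team.strTeam` is a guess, and `slugify` is a hypothetical helper rather than the one in `scripts/fetch-sports.js`.

```javascript
// Hypothetical override table; the real TEAM_ESPN_SLUG_OVERRIDES contents
// and key format are not shown in this PR page.
const TEAM_ESPN_SLUG_OVERRIDES = new Map([
  ['Example Athletic', 'example-ath'],
]);

// Hypothetical slugifier: strips accents, lowercases, and collapses every
// run of non-alphanumeric characters into a single hyphen.
function slugify(name) {
  return name
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// An explicit override wins; otherwise fall back to the slugified name.
function getESPNTeamSlug(team) {
  return TEAM_ESPN_SLUG_OVERRIDES.get(team.strTeam) ?? slugify(team.strTeam);
}
```

Keeping overrides in a small table means only the teams whose ESPN slug cannot be derived from the name need manual entries.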

Sequence Diagram

sequenceDiagram
  participant FetchScript as scripts/fetch-sports.js
  participant SportsLib as lib/sports.fetchScheduleFromESPN
  participant Firecrawl as Firecrawl API
  participant FS as Filesystem

  FetchScript->>SportsLib: fetchScheduleFromESPN(leagueSlug, teamSlug, opts)
  SportsLib->>Firecrawl: POST /extract (ESPN URL, schema, prompt, timeout)
  Firecrawl-->>SportsLib: extracted { games: [...] }
  SportsLib-->>FetchScript: parsed games array
  FetchScript->>FS: merge/normalize and write `lib/data/sports/supplemental/<teamId>.json`

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • DisabledAbel/MakeICS#14: Modifies the same supplemental scraping/merging pipeline and workflow staging for lib/data/sports/supplemental/*.json.

"🐰 A rabbit hops and parses CSV lines,
Firecrawl scrapes schedules through sunny times.
Six-hour beats keep data in tune,
Supplemental JSONs hum by noon.
Hooray — fresh games for every moon!"

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title includes an emoji and uses vague phrasing ('Expand sports data') that doesn't clearly convey the main technical changes, making it generic and non-descriptive. | Replace with a more specific, emoji-free title that highlights the primary change, e.g., 'Add ESPN scraping via Firecrawl and SportsDataverse CSV integration for sports schedules.' |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@DisabledAbel

DisabledAbel commented Jun 12, 2026

Owner Author

@coderabbitai fix conflicts

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

No merge conflicts detected when merging into main.

Your branch is good to go!

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (3)
.github/workflows/fetch-sports.yml (1)

5-5: Consider reducing FEED_REFRESH_INTERVAL to match the workflow cadence.

The workflow now runs every 6 hours, but the ICS feeds use a 24-hour refresh interval (set by FEED_REFRESH_INTERVAL = 'PT24H' in lib/sports.js). ICS clients will only check for updates once per day, potentially showing data up to 24 hours stale even though fresh data is available every 6 hours. If you want clients to benefit from the increased update frequency, consider reducing FEED_REFRESH_INTERVAL to 'PT6H' or 'PT12H'.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/fetch-sports.yml at line 5, The feed refresh interval in
lib/sports.js (FEED_REFRESH_INTERVAL = 'PT24H') is out of sync with the
fetch-sports.yml workflow cron (now every 6 hours); update FEED_REFRESH_INTERVAL
to a shorter ISO8601 duration such as 'PT6H' or 'PT12H' in lib/sports.js so ICS
clients check more frequently and align with the workflow cadence, ensuring you
change the FEED_REFRESH_INTERVAL constant and run tests that reference it (e.g.,
any tests or consumers of FEED_REFRESH_INTERVAL) to confirm no regressions.
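
One way to keep the feed interval and the cache staleness window from drifting apart again is to derive both from a single number. This is only a sketch: `REFRESH_HOURS`, `toIsoHoursDuration`, and `STALENESS_MS` are hypothetical names, and only `FEED_REFRESH_INTERVAL` appears in the review.

```javascript
// Assumed to mirror the workflow cron cadence (every 6 hours).
const REFRESH_HOURS = 6;

// Hypothetical helper: formats a whole number of hours as an ISO 8601 duration.
function toIsoHoursDuration(hours) {
  if (!Number.isInteger(hours) || hours <= 0) {
    throw new RangeError(`hours must be a positive integer, got ${hours}`);
  }
  return `PT${hours}H`;
}

// Both values now follow from REFRESH_HOURS.
const FEED_REFRESH_INTERVAL = toIsoHoursDuration(REFRESH_HOURS); // 'PT6H'
const STALENESS_MS = REFRESH_HOURS * 60 * 60 * 1000;
```

The cron expression in the workflow file would still need to be edited by hand, so this narrows the drift to one place rather than removing it.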
scripts/fetch-sports.js (1)

454-469: ⚖️ Poor tradeoff

Firecrawl scraping overwrites CSV-derived supplemental data.

When Firecrawl is enabled and scraping succeeds, the code writes to the same file path as fetchLeagueSupplementalCSV, replacing any CSV-derived events. This means for leagues with both CSV config and Firecrawl support (NBA, NFL, NHL, WNBA), the CSV data is fetched but immediately overwritten if Firecrawl succeeds.

If this is intentional (Firecrawl data is more authoritative), consider skipping CSV fetch when Firecrawl is available, or merging both sources:

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/fetch-sports.js` around lines 454 - 469, The Firecrawl save block
overwrites CSV-derived supplemental files (written to SUPPLEMENTAL_DATA_DIR
using team.idTeam.json) causing CSV events to be lost; update the logic in
scripts/fetch-sports.js so that when Firecrawl scraping (allScrapedGames /
normalizeScrapedEvent) succeeds you either (A) skip calling
fetchLeagueSupplementalCSV for leagues where Firecrawl is available, or (B) read
the existing JSON file (if present), merge CSV-derived events with normalized
Firecrawl events de-duplicating by a stable key (e.g., event ID/date), then
write the merged events back to the same filePath; ensure you reference and
preserve teamId/teamName/updatedAt fields when writing the merged payload.
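
Option (B) from the comment above, merging CSV-derived and scraped events, might look like the sketch below. The function names, the fallback key of date plus event name, and the rule that scraped fields win on conflict are assumptions for illustration; `team.strTeam` is assumed to hold the team name.

```javascript
// Hypothetical stable key: prefer the event ID, else date plus event name.
const eventKey = (e) => e.idEvent ?? `${e.dateEvent}|${e.strEvent}`;

// Merges two event lists, de-duplicating by key. When both sources carry
// the same event, scraped fields overwrite CSV fields (an assumption).
function mergeEvents(csvEvents, scrapedEvents) {
  const merged = new Map();
  for (const e of csvEvents) merged.set(eventKey(e), e);
  for (const e of scrapedEvents) {
    const key = eventKey(e);
    merged.set(key, { ...merged.get(key), ...e });
  }
  return [...merged.values()];
}

// Builds the per-team payload with a single updatedAt for the merged result.
function buildMergedPayload(team, csvEvents, scrapedEvents, now = new Date()) {
  return {
    teamId: team.idTeam,
    teamName: team.strTeam,
    updatedAt: now.toISOString(),
    events: mergeEvents(csvEvents, scrapedEvents),
  };
}
```

Because a `Map` preserves insertion order, CSV events keep their original positions and scrape-only events are appended at the end.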
lib/sports.js (1)

111-124: 💤 Low value

Home/away team inference may fail on partial name matches.

The startsWith/endsWith logic on lines 114-115 assumes exact team name at the boundaries. If the scraped game.name is "Chelsea vs Arsenal FC" and teamName is "Arsenal", the endsWith check fails because the string ends with "arsenal fc" not "arsenal". This is a minor concern since game.homeTeam/game.awayTeam from scraping would take precedence when populated.

Consider using includes for more tolerant matching if the fallback is frequently needed:

♻️ Optional improvement
   return {
     idEvent: id,
     strEvent: game.name,
-    strHomeTeam: game.homeTeam || (game.name.toLowerCase().startsWith(teamName.toLowerCase()) ? teamName : null),
-    strAwayTeam: game.awayTeam || (game.name.toLowerCase().endsWith(teamName.toLowerCase()) ? teamName : null),
+    strHomeTeam: game.homeTeam || (game.name.toLowerCase().split(/\s+vs\.?\s+/i)[0]?.includes(teamName.toLowerCase()) ? teamName : null),
+    strAwayTeam: game.awayTeam || (game.name.toLowerCase().split(/\s+vs\.?\s+/i)[1]?.includes(teamName.toLowerCase()) ? teamName : null),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sports.js` around lines 111 - 124, The fallback boundary checks for
assigning strHomeTeam/strAwayTeam use startsWith/endsWith on game.name which
fails for suffixes like "FC"; update the fallback logic that sets strHomeTeam
and strAwayTeam (the return block using game.name and teamName) to perform a
case-insensitive, more tolerant match — either use String.prototype.includes
with both values lowercased or, preferably, use a case-insensitive word-boundary
regex for teamName so you still respect whole-word matches but allow trailing
prefixes/suffixes (e.g., match "Arsenal" inside "Arsenal FC"); apply this change
to the expressions that currently use startsWith/endsWith so the fallback
correctly sets strHomeTeam/strAwayTeam when scraped fields are missing.
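
The word-boundary variant mentioned in the prompt above could be sketched like this. `inferSides` and `escapeRegExp` are hypothetical helpers, and the sketch assumes event names use a "Home vs Away" layout, as in the review's example.

```javascript
// Escapes regex metacharacters so a team name can be embedded in a pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical fallback: infers which side the team is on from an event
// name such as "Chelsea vs Arsenal FC". Returns null for a side when the
// team name does not appear there as a whole word.
function inferSides(eventName, teamName) {
  const [home = '', away = ''] = eventName.split(/\s+vs\.?\s+/i);
  const pattern = new RegExp(`\\b${escapeRegExp(teamName)}\\b`, 'i');
  return {
    strHomeTeam: pattern.test(home) ? teamName : null,
    strAwayTeam: pattern.test(away) ? teamName : null,
  };
}
```

Compared with a plain `includes`, the `\b` anchors keep "Arsenal" from matching inside a longer word while still tolerating suffixes such as "FC".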
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/fetch-sports.js`:
- Around line 231-248: The code currently blindly builds strTimestamp from
dateRaw which can produce invalid ISO strings for non-YYYY-MM-DD inputs; update
the logic around dateRaw/timeRaw handling (symbols: dateRaw, dateEvent, strTime,
strTimestamp) to validate and normalize dates before concatenation: if dateRaw
contains 'T' keep existing branch, otherwise only construct
`${dateEvent}T${strTime}Z` when dateEvent matches /^\d{4}-\d{2}-\d{2}$/; if it
doesn't, attempt to parse dateRaw with new Date(dateRaw) and if valid use the
parsed date's YYYY-MM-DD (or toISOString()) to build a correct ISO timestamp,
and if parsing fails set strTimestamp to null (or skip constructing) and ensure
downstream parseApiTimestamp is fed null/handled accordingly and/or log a
warning.
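
The validation described in the inline comment above could take roughly this shape. It is a sketch, not the code in `scripts/fetch-sports.js`: `buildTimestamp` and `normalizeTime` are hypothetical names, defaulting an unreadable time to midnight is an assumption, and so is re-serializing inputs that already start with a full `YYYY-MM-DDT` prefix.

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;
const ISO_DATETIME_PREFIX = /^\d{4}-\d{2}-\d{2}T/;
const TIME = /^(\d{1,2}):(\d{2})(?::(\d{2}))?Z?$/i;

const pad = (n) => String(n).padStart(2, '0');

// Normalizes "H:mm", "HH:mm" or "HH:mm:ss" (optionally ending in Z) to HH:mm:ss.
function normalizeTime(timeRaw) {
  const m = TIME.exec(String(timeRaw ?? '').trim());
  if (!m) return '00:00:00'; // assumed default for missing or unreadable times
  const [h, min, s] = [Number(m[1]), Number(m[2]), Number(m[3] ?? 0)];
  if (h > 23 || min > 59 || s > 59) return '00:00:00';
  return `${pad(h)}:${pad(min)}:${pad(s)}`;
}

// Returns an ISO timestamp string, or null when the date cannot be trusted.
function buildTimestamp(dateRaw, timeRaw) {
  const raw = String(dateRaw ?? '').trim();
  if (ISO_DATETIME_PREFIX.test(raw)) {
    // Already a full timestamp. Inputs without a zone are read as local time.
    const parsed = new Date(raw);
    return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
  }
  let dateEvent = raw;
  if (!ISO_DATE.test(dateEvent)) {
    // Non-ISO strings rely on the engine's lenient Date parsing.
    const parsed = new Date(raw);
    if (Number.isNaN(parsed.getTime())) return null;
    // Local getters match how non-ISO date strings are interpreted.
    dateEvent = `${parsed.getFullYear()}-${pad(parsed.getMonth() + 1)}-${pad(parsed.getDate())}`;
  }
  return `${dateEvent}T${normalizeTime(timeRaw)}Z`;
}
```

Returning `null` instead of a malformed string lets downstream consumers such as `parseApiTimestamp` handle the missing value explicitly.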


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78dbb368-82f4-4a03-8242-26f88a473011

📥 Commits

Reviewing files that changed from the base of the PR and between 85c1bfc and ed214ca.

📒 Files selected for processing (4)
  • .github/workflows/fetch-sports.yml
  • lib/sports.js
  • scripts/fetch-sports.js
  • test/sports.test.js

Comment thread scripts/fetch-sports.js
Resolved conflicts in:
- lib/sports.js (content)
- scripts/fetch-sports.js (content)

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
scripts/fetch-sports.js (2)

369-381: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Zero-result scrapes never advance the staleness marker.

isSupplementalStale() only looks at updatedAt, but this loop only writes a file when allScrapedGames.length > 0. Teams with no schedule yet, a bad ESPN slug, or a transient extractor miss will stay stale forever and get retried on every 6-hour run, which defeats the throttle and can burn the Firecrawl budget.

Also applies to: 417-423, 457-472

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/fetch-sports.js` around lines 369 - 381, isSupplementalStale()
currently only checks data.updatedAt, but the scraper only writes the
supplemental file when allScrapedGames.length > 0, so zero-result scrapes never
update the marker and teams are retried every run; change the write logic so
every scrape attempt updates a timestamp (either update updatedAt on every
attempt or add a lastAttempt/lastSuccess pair) and have isSupplementalStale()
consider the appropriate timestamp(s) (e.g., lastAttempt to throttle retries and
lastSuccess to detect stale valid data). Update the code paths that write the
supplemental JSON (the branch that currently only writes when
allScrapedGames.length > 0) to always persist the marker and ensure
isSupplementalStale() reads the new field names you choose.
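
The `lastAttempt`/`lastSuccess` idea from the prompt above can be sketched as follows. The field name `lastAttempt`, the payload builder, and keeping `updatedAt` as the last-success marker are assumptions; the real `isSupplementalStale()` reads a file from disk, which this sketch leaves out.

```javascript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Throttles on the most recent attempt, falling back to updatedAt for
// files written before lastAttempt existed.
function isSupplementalStale(data, now = Date.now()) {
  const stamp = Date.parse(data?.lastAttempt ?? data?.updatedAt);
  return Number.isNaN(stamp) || now - stamp >= SIX_HOURS_MS;
}

// Always records the attempt. updatedAt and events only change when the
// scrape actually returned games, so a zero-result run keeps prior data.
function buildSupplementalPayload(team, events, previous, nowIso) {
  const hasEvents = events.length > 0;
  return {
    teamId: team.idTeam,
    teamName: team.strTeam,
    lastAttempt: nowIso,
    updatedAt: hasEvents ? nowIso : (previous?.updatedAt ?? null),
    events: hasEvents ? events : (previous?.events ?? []),
  };
}
```

With this split, a team whose scrape returns nothing is still skipped until the next six-hour window, which is the throttling behavior the comment asks for.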

413-415: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

CSV refresh and scrape refresh currently cannot coexist in the same per-team cache.

fetchLeagueSupplementalCSV() writes ${team.idTeam}.json with a fresh updatedAt before the Firecrawl loop runs, so isSupplementalStale() immediately skips ESPN for CSV-backed leagues. If a scrape does run later, Lines 462-467 replace that same file with scraped events only, so the CSV rows are dropped instead of merged. Right now the two sources are mutually exclusive.

Also applies to: 417-423, 457-467

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/fetch-sports.js` around lines 413 - 415, fetchLeagueSupplementalCSV
currently writes `${team.idTeam}.json` with a fresh updatedAt before the
Firecrawl scrape runs, causing isSupplementalStale to skip ESPN and later
scraped events to overwrite the file (dropping CSV rows); change
fetchLeagueSupplementalCSV to not overwrite the canonical `${team.idTeam}.json`
prematurely but instead either (A) read the existing cache file if present and
merge CSV rows into it (deduplicating by event id) and update events/metadata
atomically, or (B) write CSV data to a separate interim file like
`${team.idTeam}.supplemental.json` and then, in the Firecrawl write path that
currently replaces the same file (the code that writes `${team.idTeam}.json`
after scraping), merge interim supplemental rows with scraped events and write
the combined result with a single updatedAt; ensure isSupplementalStale still
inspects the merged result so CSV and scraped sources coexist.
lib/sports.js (1)

212-215: ⚠️ Potential issue | 🟠 Major

Validate extract.games is an array before returning (ESPN and website).

fetchScheduleFromESPN(...) returns payload?.data?.extract?.games || payload?.extract?.games || [] without an Array.isArray check. In scripts/fetch-sports.js, the result is treated as an array (espnGames.length + allScrapedGames.push(...espnGames)) and then normalized via normalizeScrapedEvent, which dereferences game.name.toLowerCase(). If Firecrawl returns a non-array (e.g., string/object), this can throw during normalization and break the scrape.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sports.js` around lines 212 - 215, fetchScheduleFromESPN currently
returns payload?.data?.extract?.games || payload?.extract?.games || [] without
validating the type; update fetchScheduleFromESPN to check that the chosen value
is an array (use Array.isArray) before returning and otherwise return [] (and
optionally log a warning including payload or the problematic extract) so
callers like scripts/fetch-sports.js that rely on espnGames being an array
(espnGames.length, allScrapedGames.push(...espnGames), normalizeScrapedEvent
dereferencing game.name) won't throw when Firecrawl returns a non-array.
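
A minimal sketch of the `Array.isArray` guard described above. `extractGames` is a hypothetical helper name, and filtering out entries without a string `name` is an extra assumption, added because `normalizeScrapedEvent` is said to call `game.name.toLowerCase()`.

```javascript
// Picks the first payload location that actually holds an array, so a
// string or object in either location can never reach the caller.
function extractGames(payload) {
  const candidates = [payload?.data?.extract?.games, payload?.extract?.games];
  const games = candidates.find(Array.isArray);
  if (!games) return [];
  // Assumed extra safety: drop entries that lack a usable name.
  return games.filter((game) => game && typeof game.name === 'string');
}
```

Unlike the `a || b || []` chain, this version does not return a truthy non-array from the first location when the second location holds valid data.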

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7525c059-9958-40f7-993f-472f86f6fb91

📥 Commits

Reviewing files that changed from the base of the PR and between ed214ca and b32ff03.

📒 Files selected for processing (2)
  • lib/sports.js
  • scripts/fetch-sports.js

- Scrape ESPN team schedule pages every 6 hours via Firecrawl
- Integrate bulk schedules from SportsDataverse (NBA, NFL, NHL, WNBA)
- Implement robust character-based CSV parser for external datasets
- Add tolerant team matching with abbreviations and prefix support
- Normalize all times to HH:mm:ss with 'Z' UTC designator
- Synchronize internal caching with 6-hour workflow frequency
- Ensure deterministic merging logic in unit tests
- Fixed WNBA league ID to 4516 and improved date validation robustness

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/data/sports/4350.json (1)

51-7500: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve scheduled status instead of null for upcoming fixtures.

The bulk change sets strStatus to null for upcoming events that previously used "NS". That diverges from upstream ingestion/normalization (scripts/fetch-sports.js and lib/sports.js both initialize scheduled events as "NS"), and can break downstream status filtering/rendering semantics.

Keep strStatus: "NS" for scheduled fixtures (or map to a documented non-null enum), and reserve null for truly unknown state only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/data/sports/4350.json` around lines 51 - 7500, The JSON entries set
strStatus to null for scheduled fixtures, breaking upstream normalization
expected by scripts/fetch-sports.js and lib/sports.js which use "NS" for
not-started events; restore strStatus: "NS" (or another documented non-null
enum) for all upcoming/scheduled events (identify entries by the
strTimestamp/strEvent or idEvent fields, e.g., idEvent values like "2487453" and
similar) so downstream filtering/rendering keeps the original scheduled state;
only leave strStatus null when the state is truly unknown.
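
If the null statuses come from a normalization step rather than hand edits, a guard like the following could restore "NS" for upcoming fixtures. This is a sketch: `normalizeStatus` is a hypothetical helper, and treating "start time in the future" as the definition of scheduled is an assumption.

```javascript
// Keeps any existing status. When the status is missing, uses "NS" for
// events that start in the future and null when the state is unknown.
function normalizeStatus(event, now = Date.now()) {
  if (event.strStatus) return event;
  // Timestamps without a zone designator are parsed as local time.
  const start = Date.parse(event.strTimestamp);
  const isUpcoming = !Number.isNaN(start) && start > now;
  return { ...event, strStatus: isUpcoming ? 'NS' : null };
}
```

Past events with no status stay `null` here, which matches the review's suggestion to reserve `null` for a truly unknown state.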
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/data/sports/4387.json`:
- Line 4: The league cache file lib/data/sports/4387.json contains an outdated
"updatedAt" timestamp; update the pipeline so league files either get their
"updatedAt" set to the current run time when refreshed or ensure
system/container time is correct so files are written with real-time timestamps.
In scripts/fetch-sports.js adjust the write path for league cache files (or the
code that generates updatedAt) so it assigns Date.now()/new Date().toISOString()
on successful refresh, or alternatively ensure the process that writes
lib/data/sports/*.json uses the same freshness logic as isSupplementalStale
(referencing isSupplementalStale and teamId) so 6-hour staleness detection is
driven by the correct supplemental files rather than stale league timestamps.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f453581-53f5-4b6a-8cba-4749790ca86e

📥 Commits

Reviewing files that changed from the base of the PR and between b32ff03 and 67cf17d.

📒 Files selected for processing (24)
  • lib/data/sports/4329.json
  • lib/data/sports/4330.json
  • lib/data/sports/4331.json
  • lib/data/sports/4332.json
  • lib/data/sports/4334.json
  • lib/data/sports/4335.json
  • lib/data/sports/4337.json
  • lib/data/sports/4339.json
  • lib/data/sports/4344.json
  • lib/data/sports/4346.json
  • lib/data/sports/4350.json
  • lib/data/sports/4351.json
  • lib/data/sports/4380.json
  • lib/data/sports/4387.json
  • lib/data/sports/4391.json
  • lib/data/sports/4408.json
  • lib/data/sports/4424.json
  • lib/data/sports/4480.json
  • lib/data/sports/4481.json
  • lib/data/sports/4482.json
  • lib/data/sports/supplemental/136437.json
  • lib/data/sports/supplemental/136438.json
  • lib/sports.js
  • scripts/fetch-sports.js
💤 Files with no reviewable changes (3)
  • lib/data/sports/supplemental/136438.json
  • lib/data/sports/supplemental/136437.json
  • lib/sports.js
✅ Files skipped from review due to trivial changes (11)
  • lib/data/sports/4480.json
  • lib/data/sports/4482.json
  • lib/data/sports/4329.json
  • lib/data/sports/4346.json
  • lib/data/sports/4339.json
  • lib/data/sports/4391.json
  • lib/data/sports/4335.json
  • lib/data/sports/4331.json
  • lib/data/sports/4332.json
  • lib/data/sports/4408.json
  • lib/data/sports/4380.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/fetch-sports.js

Comment thread lib/data/sports/4387.json
  "leagueId": "4387",
  "leagueName": "NBA",
- "updatedAt": "2026-06-12T14:31:29.226Z",
+ "updatedAt": "2026-06-11T06:34:00.453Z",

Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Find where updatedAt is set and check for timestamp generation issues.

echo "=== Searching for updatedAt assignment in fetch-sports.js ==="
rg -n -C5 'updatedAt.*=' scripts/fetch-sports.js

echo -e "\n=== Checking other league files for backwards timestamps ==="
# Extract updatedAt from all league JSON files and sort to identify anomalies
fd -e json . lib/data/sports/ --exec sh -c 'echo -n "{}: "; jq -r ".updatedAt // \"missing\"" {}'

echo -e "\n=== Searching for Date or timestamp generation patterns ==="
rg -n -C3 'new Date|Date\.now|toISOString' scripts/fetch-sports.js

Repository: DisabledAbel/MakeICS

Length of output: 2956


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Show lib/data/sports/4387.json (around updatedAt) ==="
nl -ba lib/data/sports/4387.json | sed -n '1,80p'

echo "=== Show scripts/fetch-sports.js around the updatedAt writes (lines ~280-520) ==="
if [ -f scripts/fetch-sports.js ]; then
  nl -ba scripts/fetch-sports.js | sed -n '260,340p'
  nl -ba scripts/fetch-sports.js | sed -n '360,430p'
  nl -ba scripts/fetch-sports.js | sed -n '450,510p'
else
  echo "scripts/fetch-sports.js not found"
  fd -e js fetch-sports . | head -n 20
fi

Repository: DisabledAbel/MakeICS

Length of output: 183


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Show lib/data/sports/4387.json (around updatedAt) ==="
nl -ba lib/data/sports/4387.json | sed -n '1,80p'

echo "=== Show scripts/fetch-sports.js around the updatedAt writes (lines ~260-520) ==="
if [ -f scripts/fetch-sports.js ]; then
  nl -ba scripts/fetch-sports.js | sed -n '260,340p'
  nl -ba scripts/fetch-sports.js | sed -n '360,430p'
  nl -ba scripts/fetch-sports.js | sed -n '450,510p'
else
  echo "scripts/fetch-sports.js not found"
  fd -e js fetch-sports . | head -n 20
fi

Repository: DisabledAbel/MakeICS

Length of output: 183


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== lib/data/sports/4387.json (start) ==="
sed -n '1,40p' lib/data/sports/4387.json

echo "=== lib/data/sports/4387.json (around updatedAt) ==="
rg -n '"updatedAt"' lib/data/sports/4387.json
# Print a small window around the match
line=$(rg -n '"updatedAt"' lib/data/sports/4387.json | head -n1 | cut -d: -f1)
start=$((line-5)); end=$((line+5))
sed -n "${start},${end}p" lib/data/sports/4387.json

echo "=== scripts/fetch-sports.js: around updatedAt writes ==="
# show multiple windows around the line numbers from earlier rg output
sed -n '290,330p' scripts/fetch-sports.js
sed -n '395,430p' scripts/fetch-sports.js
sed -n '455,485p' scripts/fetch-sports.js

echo "=== scripts/fetch-sports.js: around 6-hour staleness check ==="
sed -n '365,415p' scripts/fetch-sports.js

Repository: DisabledAbel/MakeICS

Length of output: 3627


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== lib/data/sports/4387.json (lines 1-20) ==="
sed -n '1,20p' lib/data/sports/4387.json

echo "=== lib/data/sports/4387.json (lines 1-10 around updatedAt) ==="
line=$(rg -n '"updatedAt"' lib/data/sports/4387.json | head -n1 | cut -d: -f1 || true)
echo "updatedAt line: ${line}"
start=$((line-5)); end=$((line+5))
if [ "${start}" -lt 1 ]; then start=1; fi
sed -n "${start},${end}p" lib/data/sports/4387.json

echo "=== scripts/fetch-sports.js (staleness check + updatedAt writes) ==="
sed -n '330,430p' scripts/fetch-sports.js
sed -n '430,520p' scripts/fetch-sports.js
sed -n '250,330p' scripts/fetch-sports.js
sed -n '350,405p' scripts/fetch-sports.js

Repository: DisabledAbel/MakeICS

Length of output: 11780


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find staleness logic ==="
rg -n 'sixHoursMs|isSupplementalStale' -S .

echo "=== Find who reads data.updatedAt / updatedAt for staleness ==="
rg -n 'data\.updatedAt|\.updatedAt.*getTime|now - lastUpdated' -S scripts lib src .

echo "=== Find writes to DATA_DIR / league json updatedAt ==="
rg -n 'DATA_DIR|lib/data/sports|fetchLeagueEvents|events\.length > 0' -S scripts/fetch-sports.js
rg -n 'await fs\.writeFile\([^)]*DATA_DIR|leagueId: league\.id|updatedAt: new Date\(\)\.toISOString\(\)' -S scripts/fetch-sports.js

Repository: DisabledAbel/MakeICS

Length of output: 992


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Where league JSON updatedAt is written (DATA_DIR write) ==="
rg -n 'DATA_DIR|lib/data/sports|leagueId: league\.id|writeFile\(.+\$\{league\.id\}\.json' scripts -S

echo "=== Where updatedAt is read elsewhere ==="
rg -n '\bupdatedAt\b' --glob '!node_modules/**' -S .

echo "=== Where league JSON files are consumed ==="
rg -n 'lib/data/sports|sports/\$\{.*\}\.json|DATA_DIR.*sports' --glob '!node_modules/**' -S .

Repository: DisabledAbel/MakeICS

Length of output: 3198


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== lib/sports.js (entire file if small; else key sections) ==="
wc -l lib/sports.js || true
sed -n '1,220p' lib/sports.js
sed -n '220,500p' lib/sports.js

echo "=== test/sports.test.js (around updatedAt cases) ==="
wc -l test/sports.test.js || true
sed -n '1,260p' test/sports.test.js
sed -n '260,520p' test/sports.test.js

Repository: DisabledAbel/MakeICS

Length of output: 22611


Fix updatedAt semantics for league cache files

  • lib/data/sports/4387.json currently has "updatedAt": "2026-06-11T06:34:00.453Z".
  • In scripts/fetch-sports.js, the 6-hour staleness check (isSupplementalStale) reads only lib/data/sports/supplemental/${teamId}.json, so the league-level updatedAt value has no effect on supplemental staleness detection.
  • If updatedAt is meant to record the latest refresh run (for monitoring or cache invalidation), confirm that the pipeline's system clock is not skewed and that league cache files are not being written with stale timestamps. Other lib/data/sports/*.json files also show updatedAt values from 2026-06-11.
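
For illustration, a 6-hour staleness check of the kind described above could look roughly like the sketch below. The helper name `isStale`, its parameters, and the reference time are assumptions made for this example; the actual `isSupplementalStale` in scripts/fetch-sports.js may be implemented differently.

```javascript
// Hypothetical helper, not the repository's isSupplementalStale.
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

function isStale(updatedAt, now = Date.now(), maxAgeMs = SIX_HOURS_MS) {
  const last = Date.parse(updatedAt);
  // A missing or unparseable timestamp is treated as stale so a refresh is attempted.
  if (Number.isNaN(last)) return true;
  return now - last >= maxAgeMs;
}

// Arbitrary reference time chosen for the example.
const now = Date.parse("2026-06-12T12:00:00.000Z");
console.log(isStale("2026-06-11T06:34:00.453Z", now)); // true
console.log(isStale("2026-06-12T11:00:00.000Z", now)); // false
```

Passing `now` as a parameter keeps the check deterministic under test, which fits the mock-timer approach mentioned in the PR description.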
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/data/sports/4387.json` at line 4: the league cache file
lib/data/sports/4387.json contains an outdated "updatedAt" timestamp. Update the
pipeline so that league files get "updatedAt" set to the current run time when
they are refreshed, or confirm that the system/container clock is correct so
files are written with real timestamps. In scripts/fetch-sports.js, adjust the
write path for league cache files (or the code that generates updatedAt) so it
assigns Date.now()/new Date().toISOString() on successful refresh. Alternatively,
have the process that writes lib/data/sports/*.json use the same freshness logic
as isSupplementalStale (keyed by teamId), so that 6-hour staleness detection is
driven by the supplemental files rather than by stale league timestamps.
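
One possible shape for the "assign the timestamp on successful refresh" option is sketched below. `buildLeagueCache` and the sample event are hypothetical and do not come from the repository; the sketch only assumes that a league cache object carries `leagueId`, `updatedAt`, and `events` fields.

```javascript
// Hypothetical sketch: buildLeagueCache is not a function from the repository.
function buildLeagueCache(leagueId, events, previous = null, now = new Date()) {
  // Keep the previous cache (and its old timestamp) when a fetch returns nothing,
  // so updatedAt only advances on a successful refresh.
  if (!Array.isArray(events) || events.length === 0) {
    return previous;
  }
  return { leagueId, updatedAt: now.toISOString(), events };
}

const previous = { leagueId: "4387", updatedAt: "2026-06-11T06:34:00.453Z", events: [] };
const refreshed = buildLeagueCache(
  "4387",
  [{ id: "example-event" }], // placeholder event for the example
  previous,
  new Date("2026-06-12T12:00:00.000Z")
);
console.log(refreshed.updatedAt); // 2026-06-12T12:00:00.000Z
console.log(buildLeagueCache("4387", [], previous) === previous); // true
```

With this shape, an old `updatedAt` in a league file would indicate that the last fetch returned no events, not that the clock was wrong. Whether that is the desired meaning is a design decision for the maintainers.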

- Implemented structured extraction using Firecrawl API to scrape official team websites and ESPN.
- Integrated supplemental bulk data from SportsDataverse/NFLverse for WNBA, NBA, NFL, and NHL.
- Updated background workflow to run every 6 hours and persist team-specific supplemental data.
- Enhanced merging and deduplication logic in `lib/sports.js` to handle multi-source schedules.
- Added deterministic unit tests for event merging.
@DisabledAbel

Owner Author

@coderabbitai fix conflicts

@DisabledAbel DisabledAbel merged commit e93b928 into main Jun 12, 2026
4 checks passed
@DisabledAbel DisabledAbel deleted the feature/sports-website-scraping-16370083516219159048 branch June 12, 2026 19:53