Merged
34 changes: 34 additions & 0 deletions .github/workflows/daily-scraper-go.yml
@@ -0,0 +1,34 @@
name: Daily Go Scraper

on:
  schedule:
    - cron: '30 4 * * *' # 4:30 AM UTC, runs shortly after the Python version
  workflow_dispatch: # Allow manual runs

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: 'stable'
          cache-dependency-path: scraper_go/go.sum

      - name: Run Scraper
        env:
          TMDB_API_KEY: ${{ secrets.TMDB_API_KEY }}
        run: |
          cd scraper_go
          go run cmd/scraper/main.go

      - name: Commit and push changes
        run: |
          git config --global user.name "github-actions[bot]"
          git config --global user.email "github-actions[bot]@users.noreply.github.com"
          git add frontend/public/data_go.json
          git diff --quiet && git diff --staged --quiet || git commit -m "chore: daily showtime update (Go)"
          git push
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
-letterboxd/
+/letterboxd/

.env
.worktrees/
13 changes: 7 additions & 6 deletions conductor/tech-stack.md
@@ -11,12 +11,13 @@ kinꚘbok uses a decoupled architecture with a statically hosted frontend that c
- **Testing:** Vitest

# Backend (Scraper)
-- **Language:** Python 3.11+
-- **HTTP Client:** HTTPX
-- **HTML Parsing:** BeautifulSoup4
-- **Data Validation & Modeling:** Pydantic
-- **String Matching:** RapidFuzz
-- **Testing:** Pytest
+- **Language:** Go 1.25+ & Python 3.11+ (Running in parallel during migration)
+- **Framework (Go):** Colly/v2 (for web scraping)
+- **HTTP Client:** HTTPX (Python) / net/http (Go)
+- **HTML Parsing:** BeautifulSoup4 (Python) / Goquery via Colly (Go)
+- **Data Validation & Modeling:** Pydantic (Python) / Custom Go schemas with validation
+- **String Matching & Normalization:** RapidFuzz (Python) / Custom slug-matching & GenerateSlug (Go)
+- **Testing:** Pytest (Python) / Go testing toolchain (Go)

# CI/CD & Deployment
- **Automation:** GitHub Actions (daily scraper runs, formatting checks, deployment)
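
Note on the `GenerateSlug` entry in the tech-stack hunk above: the Go implementation is not part of the files shown here, so the following is only a minimal sketch of what slug normalization for Polish film titles could look like. The function body, the `polishFold` table, and the hyphen-separated output format are all assumptions made for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// polishFold maps lowercase Polish diacritics to ASCII letters.
// Assumption: the real scraper may fold a wider set of characters.
var polishFold = map[rune]rune{
	'ą': 'a', 'ć': 'c', 'ę': 'e', 'ł': 'l', 'ń': 'n',
	'ó': 'o', 'ś': 's', 'ź': 'z', 'ż': 'z',
}

// GenerateSlug is a hypothetical stand-in for the helper named in the
// tech stack. It lowercases the title, folds Polish diacritics, keeps
// ASCII letters and digits, and joins the remaining runs with hyphens.
func GenerateSlug(title string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if folded, ok := polishFold[r]; ok {
			r = folded
		}
		isAlnum := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
		if !isAlnum {
			// Collapse any run of separators into a single hyphen.
			pendingHyphen = true
			continue
		}
		if pendingHyphen && b.Len() > 0 {
			b.WriteByte('-')
		}
		pendingHyphen = false
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(GenerateSlug("Diuna: Część druga"))
}
```

Comparing slugs for equality is stricter than the fuzzy matching RapidFuzz provides on the Python side, so titles that differ by more than punctuation or diacritics would need extra handling.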
5 changes: 5 additions & 0 deletions conductor/tracks.md
@@ -11,3 +11,8 @@ This file tracks all major tracks for the project. Each track has its own detail

- [x] **Track: UX revamp when user clicks cinema map points and typeaheads for search bar**
*Link: [./tracks/map_search_ux_20260619/](./tracks/map_search_ux_20260619/)*

---

- [x] **Track: Implement the kinobok scraper in Golang using Colly with Goroutines/Channels for concurrency, running in parallel with the Python scraper.**
*Link: [./tracks/golang_scraper_20260702/](./tracks/golang_scraper_20260702/)*
5 changes: 5 additions & 0 deletions conductor/tracks/golang_scraper_20260702/index.md
@@ -0,0 +1,5 @@
# Track golang_scraper_20260702 Context

- [Specification](./spec.md)
- [Implementation Plan](./plan.md)
- [Metadata](./metadata.json)
8 changes: 8 additions & 0 deletions conductor/tracks/golang_scraper_20260702/metadata.json
@@ -0,0 +1,8 @@
{
  "track_id": "golang_scraper_20260702",
  "type": "feature",
  "status": "new",
  "created_at": "2026-07-02T00:00:00Z",
  "updated_at": "2026-07-02T00:00:00Z",
  "description": "Implement the kinobok scraper in Golang using Colly with Goroutines/Channels for concurrency, running in parallel with the Python scraper."
}
34 changes: 34 additions & 0 deletions conductor/tracks/golang_scraper_20260702/plan.md
@@ -0,0 +1,34 @@
# Implementation Plan: Golang Scraper

## Phase 1: Filmweb Scraper (Concurrent)
- [x] Task: Filmweb Models and Colly Setup
    - [x] Write Tests (Red Phase): Define mock server responses and test basic Colly initialization.
    - [x] Implement (Green Phase): Configure Colly collector and set up Goroutine/Channel architecture.
- [x] Task: Filmweb Parsing Logic
    - [x] Write Tests (Red Phase): Test parsing logic for extracting titles, cinemas, and showtimes from mock HTML.
    - [x] Implement (Green Phase): Implement Colly callbacks, parse HTML, and feed results through channels.
- [x] Task: Conductor - User Manual Verification 'Phase 1: Filmweb Scraper (Concurrent)' (Protocol in workflow.md)

## Phase 2: Letterboxd and TMDB Integrations
- [x] Task: TMDB API Integration
    - [x] Write Tests (Red Phase): Test concurrent fetching of metadata and posters using a mock HTTP client.
    - [x] Implement (Green Phase): Write concurrent HTTP requests to TMDB and merge with movie data.
- [x] Task: Letterboxd Integration
    - [x] Write Tests (Red Phase): Test extraction/parsing of Letterboxd watchlists.
    - [x] Implement (Green Phase): Build Letterboxd scraping/parsing logic.
- [x] Task: Conductor - User Manual Verification 'Phase 2: Letterboxd and TMDB Integrations' (Protocol in workflow.md)

## Phase 3: Data Aggregation & Export
- [x] Task: Orchestration in Main
    - [x] Write Tests (Red Phase): Test the synchronization and merging of data from Filmweb, TMDB, and Letterboxd.
    - [x] Implement (Green Phase): Coordinate channels and Goroutines in the main entrypoint (`cmd/scraper/main.go`).
- [x] Task: Strict Parity JSON Export
    - [x] Write Tests (Red Phase): Assert that the generated `data_go.json` strictly adheres to the existing Next.js frontend schema.
    - [x] Implement (Green Phase): Write the final JSON export logic in `internal/export`.
- [x] Task: Conductor - User Manual Verification 'Phase 3: Data Aggregation & Export' (Protocol in workflow.md)

## Phase 4: CI/CD Integration
- [x] Task: GitHub Actions Updates
    - [x] Write Tests (Red Phase): (Skip logic tests, test via dry-run or local action simulator if possible).
    - [x] Implement (Green Phase): Configure `daily-scraper-go.yml` to execute the Go scraper concurrently with Python and upload `data_go.json` as an artifact or commit it.
- [x] Task: Conductor - User Manual Verification 'Phase 4: CI/CD Integration' (Protocol in workflow.md)
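
Note on the Goroutine/Channel architecture that the plan above refers to: the scraper source is not included in this diff, so the sketch below only illustrates one common way to structure it, a fixed worker pool that fans out over cinemas and fans results back in over a channel. The `Showtime` type, `fetchCinema`, and `scrapeAll` are hypothetical names; a real implementation would call Colly inside the fetch function instead of returning canned data.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Showtime is a hypothetical record type; the real schema is not shown in this diff.
type Showtime struct {
	Cinema string
	Title  string
	Time   string
}

// fetchCinema stands in for a real scrape of one cinema page.
func fetchCinema(cinema string) []Showtime {
	return []Showtime{{Cinema: cinema, Title: "Example Film", Time: "18:00"}}
}

// scrapeAll fans cinema names out to a fixed number of workers and
// collects every Showtime they produce from a single results channel.
func scrapeAll(cinemas []string, workers int, fetch func(string) []Showtime) []Showtime {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan Showtime)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for cinema := range jobs {
				for _, s := range fetch(cinema) {
					results <- s
				}
			}
		}()
	}

	// Producer: feed the jobs, then signal that no more are coming.
	go func() {
		for _, c := range cinemas {
			jobs <- c
		}
		close(jobs)
	}()

	// Closer: once every worker has returned, end the results stream.
	go func() {
		wg.Wait()
		close(results)
	}()

	var all []Showtime
	for s := range results {
		all = append(all, s)
	}
	// Workers finish in arbitrary order, so sort for deterministic output.
	sort.Slice(all, func(i, j int) bool { return all[i].Cinema < all[j].Cinema })
	return all
}

func main() {
	for _, s := range scrapeAll([]string{"Kino B", "Kino A"}, 2, fetchCinema) {
		fmt.Println(s.Cinema, s.Title, s.Time)
	}
}
```

Sorting before export matters for a job that commits its output daily: without a stable order, the JSON file would change on every run even when the showtimes did not.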
29 changes: 29 additions & 0 deletions conductor/tracks/golang_scraper_20260702/spec.md
@@ -0,0 +1,29 @@
# Specification: Golang Scraper Implementation

## Overview
This track focuses on implementing the backend scraper for kinobok in Golang using the Colly framework, intended to eventually replace the existing Python scraper. The initial deployment will run in parallel with the Python scraper to ensure data parity before a full transition.

## Functional Requirements
1. **Filmweb Scraper:** Implement the logic to scrape movie showtimes and cinema details from Filmweb using the Colly framework.
2. **Letterboxd Scraper:** Implement the logic to process Letterboxd watchlists/data.
3. **TMDB Scraper:** Implement integration with the TMDB API/scraper to fetch posters and metadata for movies.
4. **Data Export:** Generate the final JSON file (`data_go.json`) containing all parsed and matched data.

## Non-Functional Requirements
1. **Strict Parity:** The exported JSON file (`data_go.json`) MUST perfectly match the schema of the current Python scraper's `data.json` to ensure Next.js frontend compatibility.
2. **Parallel Execution:** The new Golang scraper must be integrated into the existing CI/CD (GitHub Actions) to run alongside the Python scraper, outputting to a separate file (`data_go.json`) without breaking the current production build.
3. **Language/Framework:** Use Golang and the Colly web scraping framework as defined in the `scraper_go` directory.
4. **Concurrency:** Heavily utilize Goroutines and Channels within the scraping process to maximize throughput and enhance the overall execution speed.

## Acceptance Criteria
- [ ] `FilmwebScraper` correctly parses cinemas, times, and movie titles utilizing concurrent processing.
- [ ] `Letterboxd` integration correctly extracts watchlist data.
- [ ] `TMDB` integration accurately fetches required movie metadata concurrently.
- [ ] `export` package correctly generates `data_go.json` with strict schema parity.
- [ ] Concurrency patterns (Goroutines/Channels) are demonstrably used for performance.
- [ ] GitHub Actions workflow is updated to run the Golang scraper and output `data_go.json` alongside `data.json`.
- [ ] The Next.js frontend can flawlessly consume `data_go.json` if swapped (to be tested locally or manually).

## Out of Scope
- Modifying the frontend application code to permanently switch to `data_go.json`.
- Removing or disabling the Python scraper.
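
Note on the "Strict Parity" requirement in the spec above: one way to check it automatically is to compare the structure of the two files rather than their values. The sketch below is an assumption about how such a check might be written, not the test that ships with this PR; `sameShape` and `parityOK` are hypothetical helpers, and the JSON fields in `main` are invented because the real `data.json` schema does not appear in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sameShape reports whether two decoded JSON values have the same
// structure: identical object keys and matching value types.
// Arrays are compared by their first element only.
func sameShape(a, b any) bool {
	switch av := a.(type) {
	case map[string]any:
		bv, ok := b.(map[string]any)
		if !ok || len(av) != len(bv) {
			return false
		}
		for k, x := range av {
			y, present := bv[k]
			if !present || !sameShape(x, y) {
				return false
			}
		}
		return true
	case []any:
		bv, ok := b.([]any)
		if !ok {
			return false
		}
		if len(av) == 0 || len(bv) == 0 {
			return true // nothing to compare element-wise
		}
		return sameShape(av[0], bv[0])
	case string:
		_, ok := b.(string)
		return ok
	case float64:
		_, ok := b.(float64)
		return ok
	case bool:
		_, ok := b.(bool)
		return ok
	case nil:
		return b == nil
	}
	return false
}

// parityOK decodes both documents and compares their shapes.
func parityOK(pythonJSON, goJSON []byte) (bool, error) {
	var p, g any
	if err := json.Unmarshal(pythonJSON, &p); err != nil {
		return false, fmt.Errorf("python output: %w", err)
	}
	if err := json.Unmarshal(goJSON, &g); err != nil {
		return false, fmt.Errorf("go output: %w", err)
	}
	return sameShape(p, g), nil
}

func main() {
	// Invented sample documents, for illustration only.
	py := []byte(`{"title":"A","showtimes":[{"time":"18:00"}]}`)
	gó := []byte(`{"title":"B","showtimes":[{"time":"20:30"}]}`)
	ok, err := parityOK(py, gó)
	fmt.Println(ok, err)
}
```

This check treats a `null` in one file and a string in the other as a mismatch, which is stricter than a frontend with optional fields may need; a real parity test would have to decide how nullable fields are handled.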