From 01984941c768fb1a3ddc09a72f46d5622674f4d3 Mon Sep 17 00:00:00 2001
From: Elshad Toklayev
Date: Sun, 31 May 2026 20:52:32 +0300
Subject: [PATCH] docs(report): align Advanced SQL section with the implemented queries

Section 5 printed SQL that no longer matched the live API, which is a
risk for a graded report where the source is inspected during the demo.
Rewrite Q1-Q6 to mirror apps/api/src/modules/analytics/analytics.service.ts:

- Q1 top artists: DENSE_RANK (not RANK), no 30-day window, no
  role='primary' filter, HAVING COUNT(*) > 1; drop the
  pct_of_plays/distinct_tracks columns.
- Q2 heatmap: drop the 90-day window the code never applied; add the
  tracks.hidden_at join.
- Q3 hidden gems: global "never played by anyone" anti-join
  (LEFT JOIN ... WHERE lh.id IS NULL), artist via
  albums.primary_artist_id, configurable minPlaylistCount, random
  sampling; drop the per-user/preview-only framing.
- Q4 discover: single top-track cohort (not all played tracks),
  cooccurrence_count, NOT EXISTS exclusion, random sampling.
- Q5 trending: two-CTE recent/prior growth-ratio with a threshold, not
  a single-scan COUNT(*) FILTER delta.
- Q6: replace the fabricated "curated picks" RANK query (no such SQL
  exists) with the real Jaccard similar-playlists query behind the
  "More like this" rail.

Also fix the "Five representative queries" count (six), the Figure 9
"seed track" caption, the Figure 11 "(Query 6)" reference, the
row-count "13 application tables" label (six are catalog tables), and
the "deterministic seed scripts" wording.
---
 final-report/statify-final-report.md | 264 ++++++++++++++++-----------
 1 file changed, 153 insertions(+), 111 deletions(-)

diff --git a/final-report/statify-final-report.md b/final-report/statify-final-report.md
index 8af615e..bed8b8a 100644
--- a/final-report/statify-final-report.md
+++ b/final-report/statify-final-report.md
@@ -37,7 +37,7 @@ Listening history is stored one row per play and aggregated for top-N rankings,
 
 ## 2. Entity-Relationship Diagram
 
-The diagram below is generated from the live production schema in crow's-foot notation. It comprises the thirteen application tables, grouped into two domains: a read-mostly **catalog** derived from the Million Playlist Dataset, and the read-write **application** data created by listeners and administrators. `tracks` and `users` are the two hubs that nearly every relationship passes through.
+The diagram below is generated from the live production schema in crow's-foot notation. It comprises the thirteen tables, grouped into two domains: a read-mostly **catalog** derived from the Million Playlist Dataset, and the read-write **application** data created by listeners and administrators. `tracks` and `users` are the two hubs that nearly every relationship passes through.
 
 ![Statify ER diagram](assets/fig01_erd.png)
 
@@ -297,7 +297,7 @@ The catalog is built from the **Spotify Million Playlist Dataset**. The first te
 
 ### Application data
 
-Accounts, listening history, and playlists combine real in-app activity with deterministic seed scripts. Listening history is generated as roughly 60,000 play events following a Zipf popularity distribution, so a small head of tracks is played heavily while a long tail is played lightly and "most played" rankings are non-degenerate; a handful of additional rows come from genuine preview plays. Community users and their public and private playlists are seeded similarly, and the operational tables (`refresh_tokens`, `audit_log`, `ingest_checkpoints`) are produced by ordinary application and ingestion activity.
+Accounts, listening history, and playlists combine real in-app activity with randomized seed scripts (Zipf-distributed, and re-anchored to the seed run's date rather than reproducible run-to-run). Listening history is generated as roughly 60,000 play events following a Zipf popularity distribution, so a small head of tracks is played heavily while a long tail is played lightly and "most played" rankings are non-degenerate; a handful of additional rows come from genuine preview plays. Community users and their public and private playlists are seeded similarly, and the operational tables (`refresh_tokens`, `audit_log`, `ingest_checkpoints`) are produced by ordinary application and ingestion activity.
 
 ### Row counts
 
@@ -320,53 +320,56 @@ _Production row counts, as of the database export on 29 May 2026._
 | `user_playlist_tracks` | 2,810 | Tracks across user playlists |
 | `audit_log` | 176 | Recorded admin actions |
 | `ingest_checkpoints` | 10 | Ingested dataset slices |
-| **Total** | **1,195,704** | **across 13 application tables** |
+| **Total** | **1,195,704** | **across all 13 tables** |
 
 ---
 
 ## 5. Advanced SQL Queries
 
-Five representative queries that power the product. Each is valid PostgreSQL against the schema above; in the running application the same logic is served by the API (via Prisma and parameterized raw SQL). Bind parameters such as `$1` carry the current listener's id.
+Six queries that power the product, transcribed from the live API service in `apps/api/src/modules/analytics/analytics.service.ts`. Each is valid PostgreSQL against the schema above and runs in the application as parameterized raw SQL through Prisma; bind parameters such as `$1` carry the current listener's id, and `$2`/`$3` carry the row limit and thresholds.
 
 ### Q1. Top artists by play count
 
 ```sql
-SELECT a.id,
+SELECT DENSE_RANK() OVER (ORDER BY COUNT(*) DESC,
+                                   SUM(lh.duration_played_ms) DESC) AS rank,
+       a.id,
        a.name,
-       COUNT(*) AS plays,
-       COUNT(DISTINCT lh.track_id) AS distinct_tracks,
-       ROUND(SUM(lh.duration_played_ms) / 60000.0, 1) AS minutes,
-       RANK() OVER (ORDER BY COUNT(*) DESC) AS rank,
-       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_plays
+       a.image_url,
+       COUNT(*)::int AS listen_count,
+       ROUND(SUM(lh.duration_played_ms)::numeric / 60000.0, 2) AS total_minutes
 FROM listening_history lh
-JOIN track_artists ta ON ta.track_id = lh.track_id AND ta.role = 'primary'
+JOIN tracks t ON t.id = lh.track_id
+JOIN track_artists ta ON ta.track_id = t.id
 JOIN artists a ON a.id = ta.artist_id
 WHERE lh.user_id = $1
-  AND lh.played_at >= now() - interval '30 days'
+  AND t.hidden_at IS NULL
   AND a.hidden_at IS NULL
-GROUP BY a.id, a.name
-ORDER BY plays DESC
-LIMIT 10;
+GROUP BY a.id, a.name, a.image_url
+HAVING COUNT(*) > 1
+ORDER BY rank ASC, a.name ASC
+LIMIT $2;
 ```
 
-- **What it does:** Joins each play event to its primary artist, counts plays per artist over the last 30 days, and uses window functions to assign a dense ranking and each artist's share of total plays.
-- **Why it is useful:** This is the core ranking behind the listener's profile. The nested `SUM(COUNT(*)) OVER ()` turns raw counts into percentages in a single pass, and filtering on `role = 'primary'` avoids double-counting featured credits.
+- **What it does:** Joins each play to every artist credited on the track, counts plays per artist over the listener's full history, and uses `DENSE_RANK()` (with total minutes listened as the tie-breaker) to rank them, keeping only artists with more than one play.
+- **Why it is useful:** This is the core ranking behind the listener's profile. The window function produces a compact, gap-free ranking in a single pass, and the minutes tie-breaker keeps the order stable when two artists have the same play count.
 - **In the product:** Powers **Most played** on the Catalog page (Figure 8) and the **Top artists** tab on Stats (Figure 5).
 
 ### Q2. Listening heatmap by weekday and hour
 
 ```sql
-SELECT EXTRACT(DOW FROM lh.played_at)::int AS dow, -- 0 = Sunday .. 6 = Saturday
-       EXTRACT(HOUR FROM lh.played_at)::int AS hour, -- 0 .. 23
-       COUNT(*) AS plays
+SELECT EXTRACT(DOW FROM lh.played_at)::int AS day_of_week, -- 0 = Sunday .. 6 = Saturday
+       EXTRACT(HOUR FROM lh.played_at)::int AS hour_of_day, -- 0 .. 23
+       COUNT(*)::int AS listen_count
 FROM listening_history lh
+JOIN tracks t ON t.id = lh.track_id
 WHERE lh.user_id = $1
-  AND lh.played_at >= now() - interval '90 days'
-GROUP BY dow, hour
-ORDER BY dow, hour;
+  AND t.hidden_at IS NULL
+GROUP BY day_of_week, hour_of_day
+ORDER BY day_of_week, hour_of_day;
 ```
 
-- **What it does:** Buckets every play in the last 90 days by day-of-week and hour-of-day and counts how many fall in each of the 168 cells.
+- **What it does:** Buckets every non-hidden play across the listener's full history by day-of-week and hour-of-day and counts how many fall in each of the 168 cells.
 - **Why it is useful:** Returns the long form that the client pivots into a 7x24 grid, revealing when a listener is most active. Date-part extraction keeps the work in the database rather than pulling raw timestamps to the client.
 - **In the product:** Renders the **When you listen** heatmap on the Stats page (Figure 5).
 
@@ -375,119 +378,158 @@ ORDER BY dow, hour;
 ```sql
 SELECT t.id,
        t.name,
-       a.name AS artist,
-       COUNT(DISTINCT mpt.playlist_id) AS playlist_count,
-       COALESCE(plays.cnt, 0) AS your_plays
+       pa.name AS primary_artist_name,
+       COUNT(DISTINCT mpt.playlist_id) AS playlist_count
 FROM tracks t
-JOIN mpd_playlist_tracks mpt ON mpt.track_id = t.id
-JOIN track_artists ta ON ta.track_id = t.id AND ta.role = 'primary'
-JOIN artists a ON a.id = ta.artist_id
-LEFT JOIN (
-    SELECT track_id, COUNT(*) AS cnt
-    FROM listening_history
-    WHERE user_id = $1
-    GROUP BY track_id
-  ) plays ON plays.track_id = t.id
-WHERE t.hidden_at IS NULL
-  AND t.preview_url IS NOT NULL
-GROUP BY t.id, t.name, a.name, plays.cnt
-HAVING COUNT(DISTINCT mpt.playlist_id) >= 3
-   AND COALESCE(plays.cnt, 0) = 0
-ORDER BY playlist_count DESC
-LIMIT 24;
+JOIN albums al ON al.id = t.album_id
+JOIN artists pa ON pa.id = al.primary_artist_id
+JOIN mpd_playlist_tracks mpt ON mpt.track_id = t.id
+LEFT JOIN listening_history lh ON lh.track_id = t.id
+WHERE lh.id IS NULL -- never played by anyone (anti-join)
+  AND t.hidden_at IS NULL
+  AND al.hidden_at IS NULL
+  AND pa.hidden_at IS NULL
+GROUP BY t.id, t.name, pa.name
+HAVING COUNT(DISTINCT mpt.playlist_id) >= $1 -- minimum playlist count, default 3
+ORDER BY playlist_count DESC, t.name ASC
+LIMIT $2;
 ```
 
-- **What it does:** Counts how many MPD playlists each track appears in, left-joins the listener's own play counts, then keeps only tracks that sit in at least three playlists yet have never been played by the user.
-- **Why it is useful:** Surfaces broadly endorsed but personally unheard tracks. The `LEFT JOIN` plus `HAVING ... = 0` is an anti-join that expresses "popular with curators, absent from your history", and the preview filter guarantees every result is playable.
+- **What it does:** Counts how many MPD playlists each track appears in and, through a `LEFT JOIN ... WHERE lh.id IS NULL` anti-join, keeps only tracks that sit in at least three playlists yet have never been played by anyone in the app.
+- **Why it is useful:** Surfaces broadly endorsed but overlooked tracks. The anti-join expresses "popular with curators, absent from the play history" without a correlated subquery; the service then draws the top-ranked rows into a pool and applies `ORDER BY random()` so the page varies between visits.
 - **In the product:** Drives the Hidden gems page (Figure 10), where each card shows the playlist count.
 
 ### Q4. Discovery by playlist co-occurrence
 
 ```sql
-WITH seeds AS ( -- tracks the listener has actually played
-    SELECT DISTINCT track_id
-    FROM listening_history
-    WHERE user_id = $1
+WITH top_track AS ( -- the listener's single most-played track
+    SELECT lh.track_id
+    FROM listening_history lh
+    WHERE lh.user_id = $1
+    GROUP BY lh.track_id
+    ORDER BY COUNT(*) DESC, lh.track_id ASC
+    LIMIT 1
 ),
-neighbours AS ( -- MPD playlists those seed tracks live in
+cohort_playlists AS ( -- MPD playlists that contain that track
     SELECT DISTINCT mpt.playlist_id
     FROM mpd_playlist_tracks mpt
-    JOIN seeds s ON s.track_id = mpt.track_id
+    WHERE mpt.track_id = (SELECT track_id FROM top_track)
 )
 SELECT t.id,
        t.name,
-       a.name AS artist,
-       COUNT(DISTINCT mpt.playlist_id) AS shared_playlists
+       pa.name AS primary_artist_name,
+       COUNT(*)::int AS cooccurrence_count
 FROM mpd_playlist_tracks mpt
-JOIN neighbours n ON n.playlist_id = mpt.playlist_id
-JOIN tracks t ON t.id = mpt.track_id
-JOIN track_artists ta ON ta.track_id = t.id AND ta.role = 'primary'
-JOIN artists a ON a.id = ta.artist_id
-WHERE t.hidden_at IS NULL
-  AND t.preview_url IS NOT NULL
-  AND t.id NOT IN (SELECT track_id FROM seeds) -- exclude what they already know
-GROUP BY t.id, t.name, a.name
-ORDER BY shared_playlists DESC
-LIMIT 24;
+JOIN tracks t ON t.id = mpt.track_id
+JOIN albums al ON al.id = t.album_id
+JOIN artists pa ON pa.id = al.primary_artist_id
+WHERE mpt.playlist_id IN (SELECT playlist_id FROM cohort_playlists)
+  AND mpt.track_id <> (SELECT track_id FROM top_track)
+  AND t.hidden_at IS NULL
+  AND al.hidden_at IS NULL
+  AND pa.hidden_at IS NULL
+  AND NOT EXISTS (SELECT 1 FROM listening_history lh
+                  WHERE lh.user_id = $1 AND lh.track_id = t.id)
+GROUP BY t.id, t.name, pa.name
+ORDER BY cooccurrence_count DESC, t.name ASC
+LIMIT $2; -- candidate pool; the service then samples it with ORDER BY random()
 ```
 
-- **What it does:** A two-stage CTE: collect the tracks a listener has played (seeds), find the MPD playlists those seeds appear in (neighbours), then rank every other track in those playlists by how many of them it shares.
-- **Why it is useful:** This is item-to-item collaborative filtering expressed in pure SQL: tracks that repeatedly co-occur with a listener's music in human-made playlists are strong recommendations. The seed set is reused to exclude already-played tracks.
-- **In the product:** Generates the Discover feed (Figure 9): every candidate is annotated with the seed track and its shared-playlist count.
+- **What it does:** Takes the listener's single most-played track, finds the MPD playlists it appears in (its "cohort"), then ranks every other track in those playlists by how often it co-occurs, excluding tracks the listener has already played.
+- **Why it is useful:** This is item-to-item collaborative filtering expressed in pure SQL: tracks that repeatedly share human-made playlists with a listener's favourite are strong recommendations. The service samples the top co-occurrence pool with `ORDER BY random()` so the feed stays fresh between visits.
+- **In the product:** Generates the Discover feed (Figure 9): every candidate is annotated with its shared-playlist (co-occurrence) count.
 
 ### Q5. Trending artists: last 7 days vs the previous 7
 
 ```sql
-SELECT a.id,
-       a.name,
-       COUNT(*) FILTER (
-           WHERE lh.played_at >= now() - interval '7 days') AS plays_7d,
-       COUNT(*) FILTER (
-           WHERE lh.played_at >= now() - interval '14 days'
-             AND lh.played_at < now() - interval '7 days') AS plays_prev_7d,
-       COUNT(*) FILTER (WHERE lh.played_at >= now() - interval '7 days')
-     - COUNT(*) FILTER (
-           WHERE lh.played_at >= now() - interval '14 days'
-             AND lh.played_at < now() - interval '7 days') AS delta
-FROM listening_history lh
-JOIN track_artists ta ON ta.track_id = lh.track_id AND ta.role = 'primary'
-JOIN artists a ON a.id = ta.artist_id
-WHERE lh.user_id = $1
-  AND lh.played_at >= now() - interval '14 days'
-  AND a.hidden_at IS NULL
-GROUP BY a.id, a.name
-HAVING COUNT(*) FILTER (WHERE lh.played_at >= now() - interval '7 days') > 0
-ORDER BY delta DESC
-LIMIT 10;
+WITH recent AS ( -- plays in the last 7 days, per artist
+    SELECT a.id AS artist_id, a.name AS artist_name, COUNT(*)::int AS plays
+    FROM listening_history lh
+    JOIN tracks t ON t.id = lh.track_id
+    JOIN track_artists ta ON ta.track_id = lh.track_id
+    JOIN artists a ON a.id = ta.artist_id
+    WHERE lh.user_id = $1
+      AND lh.played_at >= now() - interval '7 days'
+      AND t.hidden_at IS NULL
+      AND a.hidden_at IS NULL
+    GROUP BY a.id, a.name
+),
+prior AS ( -- plays in the preceding 7 days (days 8-14)
+    SELECT a.id AS artist_id, COUNT(*)::int AS plays
+    FROM listening_history lh
+    JOIN tracks t ON t.id = lh.track_id
+    JOIN track_artists ta ON ta.track_id = lh.track_id
+    JOIN artists a ON a.id = ta.artist_id
+    WHERE lh.user_id = $1
+      AND lh.played_at >= now() - interval '14 days'
+      AND lh.played_at < now() - interval '7 days'
+      AND t.hidden_at IS NULL
+      AND a.hidden_at IS NULL
+    GROUP BY a.id
+)
+SELECT recent.artist_id,
+       recent.artist_name,
+       recent.plays AS recent_plays,
+       COALESCE(prior.plays, 0) AS prior_plays,
+       CASE WHEN COALESCE(prior.plays, 0) = 0 THEN recent.plays::numeric
+            ELSE ROUND((recent.plays - prior.plays)::numeric
+                       / prior.plays::numeric, 4)
+       END AS growth
+FROM recent
+LEFT JOIN prior ON prior.artist_id = recent.artist_id
+WHERE CASE WHEN COALESCE(prior.plays, 0) = 0 THEN recent.plays::numeric
+           ELSE (recent.plays - prior.plays)::numeric / prior.plays::numeric
+      END >= $2 -- growth threshold, default 0.25
+ORDER BY growth DESC, recent_plays DESC, recent.artist_name ASC
+LIMIT $3;
 ```
 
-- **What it does:** Uses filtered aggregates to count each artist's plays in the current week and the week before it, then computes the week-over-week change in one scan of the last fourteen days.
-- **Why it is useful:** `COUNT(*) FILTER (WHERE ...)` compares two time windows without self-joins or subqueries, so the momentum (rising or fading) of each artist comes from a single, index-friendly pass.
+- **What it does:** Aggregates each artist's plays in two windows - the last 7 days (`recent`) and the 7 days before that (`prior`) - then joins them and computes a week-over-week growth ratio, keeping artists whose growth clears a threshold (default 0.25).
+- **Why it is useful:** Expressing momentum as a ratio rather than a raw difference keeps light and heavy listeners comparable, and the `CASE` fallback treats a brand-new artist (no prior plays) as pure growth. The 7-day window is a fixed application constant.
 - **In the product:** Backs **Trending artists - 7d vs prior 7d** on the Overview and Stats pages (Figures 4 and 5).
 
-### Q6. Community: most-followed curated playlists
+### Q6. Similar playlists by shared-track overlap (Jaccard)
 
 ```sql
--- Curated MPD playlists, ranked by reach and weighted by how active they are
-SELECT p.id,
-       p.name,
-       p.num_followers,
-       p.num_edits,
-       COUNT(mpt.track_id) AS track_count,
-       ROUND(p.duration_ms / 3600000.0, 1) AS hours,
-       RANK() OVER (ORDER BY p.num_followers DESC) AS follower_rank
-FROM mpd_playlists p
-JOIN mpd_playlist_tracks mpt ON mpt.playlist_id = p.id
-GROUP BY p.id, p.name, p.num_followers, p.num_edits, p.duration_ms
-HAVING COUNT(mpt.track_id) >= 10 -- ignore thin or stub playlists
-ORDER BY p.num_followers DESC,
-         p.num_edits DESC
-LIMIT 12;
+WITH source AS ( -- the tracks in the playlist we compare against
+    SELECT mpt.track_id
+    FROM mpd_playlist_tracks mpt
+    JOIN tracks st ON st.id = mpt.track_id
+    WHERE mpt.playlist_id = $1
+      AND st.hidden_at IS NULL
+)
+SELECT mp.id,
+       mp.name,
+       ROUND(
+           COUNT(*) FILTER (WHERE other.track_id IN (SELECT track_id FROM source))::numeric
+           / NULLIF((
+               SELECT COUNT(*) FROM (
+                   SELECT track_id FROM source
+                   UNION
+                   SELECT mpt2.track_id
+                   FROM mpd_playlist_tracks mpt2
+                   JOIN tracks ut ON ut.id = mpt2.track_id
+                   WHERE mpt2.playlist_id = mp.id
+                     AND ut.hidden_at IS NULL
+               ) u
+           ), 0)::numeric,
+           4
+       ) AS jaccard,
+       COUNT(*) FILTER (WHERE other.track_id IN (SELECT track_id FROM source))::int AS shared_tracks
+FROM mpd_playlists mp
+JOIN mpd_playlist_tracks other ON other.playlist_id = mp.id
+JOIN tracks ot ON ot.id = other.track_id
+WHERE mp.id <> $1
+  AND ot.hidden_at IS NULL
+GROUP BY mp.id, mp.name
+HAVING COUNT(*) FILTER (WHERE other.track_id IN (SELECT track_id FROM source)) > 0
+ORDER BY jaccard DESC, shared_tracks DESC, mp.name ASC
+LIMIT $2;
 ```
 
-- **What it does:** Ranks the curated MPD playlists by follower count, joining in their track lists to report the size and total running time of each, and drops playlists with fewer than ten tracks so the leaderboard reflects substantial sets.
-- **Why it is useful:** It turns the raw dataset's popularity signal (`num_followers`) into the editorial "curated picks" rail, and the `RANK()` window plus the `num_edits` tie-breaker give a stable ordering even when several playlists share a follower count.
-- **In the product:** Powers the **Curated picks - Ranked by followers** section on the Community page (Figure 11).
+- **What it does:** Given one MPD playlist, computes its Jaccard similarity (shared tracks divided by the union of tracks) to every other playlist, using `FILTER`ed aggregates for the intersection and a `UNION` subquery wrapped in `NULLIF` for the union size, and keeps only playlists that share at least one track.
+- **Why it is useful:** Jaccard overlap is a standard set-similarity measure; computing it entirely in SQL turns the raw playlist-track memberships into a "more like this" recommendation without moving any data to the client. (The Community page's "curated picks" rail, by contrast, is a simple `ORDER BY num_followers DESC` and is not counted among these advanced queries.)
+- **In the product:** Powers the **More like this** rail on each MPD playlist's detail page (`/catalog/playlists/[id]`).
 
 ---
 
@@ -533,7 +575,7 @@ _Figure 8. The catalog browser over artists, albums, tracks, and MPD playlists.
 
 ![Discovery by playlist co-occurrence (Query 4): tracks that s](assets/fig09_discover.jpg)
 
-_Figure 9. Discovery by playlist co-occurrence (Query 4): tracks that share MPD playlists with the listener's history, each labelled with the seed track and how many playlists they share._
+_Figure 9. Discovery by playlist co-occurrence (Query 4): tracks that share MPD playlists with the listener's most-played track, each annotated with how many playlists they share._
 
 ![Hidden gems (Query 3): tracks that appear in many MPD playli](assets/fig10_hidden_gems.jpg)
 
@@ -543,7 +585,7 @@ _Figure 10. Hidden gems (Query 3): tracks that appear in many MPD playlists but
 
 ![The community view of public `user_playlists`, plus editoria](assets/fig11_community.jpg)
 
-_Figure 11. The community view of public `user_playlists`, plus editorial "curated picks" drawn from MPD playlists ranked by follower count (Query 6)._
+_Figure 11. The community view of public `user_playlists`, plus editorial "curated picks" drawn from MPD playlists ordered by follower count._
 
 ![Account settings backed by the `users` row: display name, em](assets/fig12_account.jpg)
 
@@ -567,7 +609,7 @@ The database is PostgreSQL 16, hosted on Neon in production, with the `pg_trgm`
 
 ### Authentication and data pipeline
 
-Passwords are hashed with Argon2id; sessions use short-lived JWT access tokens plus rotating refresh tokens that are stored hashed in `refresh_tokens` and delivered as httpOnly cookies. The data pipeline is a resumable MPD ingest (parse, normalize, batch-upsert, checkpoint), a media-backfill step against iTunes, Spotify, and Deezer for artwork and previews, and deterministic seed scripts for listening history and playlists.
+Passwords are hashed with Argon2id; sessions use short-lived JWT access tokens plus rotating refresh tokens that are stored hashed in `refresh_tokens` and delivered as httpOnly cookies. The data pipeline is a resumable MPD ingest (parse, normalize, batch-upsert, checkpoint), a media-backfill step against iTunes, Spotify, and Deezer for artwork and previews, and seed scripts for listening history and playlists.
 
 ### Quality and delivery