vinted-dataset is a standalone scraper and JSON dataset generator for Vinted metadata.
It is meant as a practical replacement for stale static datasets such as teddy-vltn/vinted-dataset, with a focus on:
- current catalog trees
- current brand ids
- current color ids
- current status ids
- region-aware output
- publishable JSON snapshots committed into the repository
The project is intentionally simple:
- a Playwright-based scraper in `src/`
- generated dataset files in `output/`
- an optional tiny HTTP server for serving the latest local snapshot
Vinted metadata changes over time.
Static datasets become outdated quickly:
- categories are added or moved
- new brands appear
- region coverage changes
- localized labels differ across markets
At the same time, Vinted does not expose a stable public dataset endpoint for all of this. The relevant data is available through a mix of:
- server-rendered page payloads
- internal filter endpoints
- market-specific localized catalog pages
This repository automates collecting that data and exporting it back into plain JSON files that can be reused by other projects.
Vinted currently protects important endpoints behind anti-bot systems such as Cloudflare challenges and DataDome-style checks.
That means:
- direct `fetch`/`axios`/raw HTTP scraping is unreliable
- internal APIs often return `403` or `503` without a valid browser session
- HTML alone is not enough for all metadata
This scraper first creates a real browser session with Playwright, then uses that same session to read:
- catalog tree data embedded in Next.js / Flight payloads
- filter metadata from Vinted internal JSON endpoints
That is the main design decision of this repository.
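In code, the session-first flow looks roughly like the Playwright sketch below. The catalog URL and `apiPath` argument are illustrative placeholders, not Vinted's documented contract, and Playwright is imported lazily so the file can be loaded even where it is not installed:

```javascript
// Sketch of the session-first approach. regionHost/apiPath are illustrative;
// this is not Vinted's real endpoint contract.
async function fetchWithBrowserSession(regionHost, apiPath) {
  // Lazy import so this module loads without Playwright installed.
  const { chromium } = await import("playwright");
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    // Step 1: open a real page to pick up cookies and anti-bot tokens.
    await page.goto(`https://${regionHost}/catalog`, { waitUntil: "domcontentloaded" });
    // Step 2: reuse the same session's cookies for an internal JSON endpoint.
    const response = await context.request.get(`https://${regionHost}${apiPath}`);
    return await response.json();
  } finally {
    await browser.close();
  }
}
```

`context.request` shares the browser context's cookie jar, which is the point: the JSON call rides on the session the challenge pages already validated.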
The dataset is written into `output/`.
Current files:
- `output/brand.json`
- `output/colors.json`
- `output/statuses.json`
- `output/sizes.json`
- `output/regions.json`
- `output/_meta.json`
- `output/<region>/groups.json`
Example layout:
```
output/
  _meta.json
  brand.json
  colors.json
  regions.json
  sizes.json
  statuses.json
  de/
    groups.json
  fr/
    groups.json
  it/
    groups.json
  ...
```
Flat object mapping brand label to brand id:
```json
{
  "Nike": "53",
  "adidas": "14",
  "Balenciaga": "2369"
}
```

Flat array of color objects:
```json
[
  {
    "id": "1",
    "label": "Black",
    "hex": "#000000"
  }
]
```

Note:
- `hex` is only present when Vinted exposes it
- not every market returns the same color labels
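Because `hex` is optional, it is worth normalizing when you load the file. A small consumer-side sketch (the `null` fallback is this example's choice, not part of the dataset format):

```javascript
// Index colors.json by id; hex is optional, so fall back explicitly.
function indexColors(colors) {
  const byId = {};
  for (const c of colors) {
    byId[c.id] = { label: c.label, hex: c.hex ?? null };
  }
  return byId;
}

const byId = indexColors([
  { id: "1", label: "Black", hex: "#000000" },
  { id: "2", label: "Multi" }, // no hex exposed for this entry
]);
// byId["2"] → { label: "Multi", hex: null }
```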
Flat array of status objects:
```json
[
  {
    "id": "1",
    "label": "Uusi ilman hintalappua",
    "type": "default"
  }
]
```

Important:
- labels are localized by the market the scraper used
- ids stay stable
- if you need multilingual labels, collect and merge them across regions in your own downstream pipeline
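That downstream merge can be very small. A sketch, assuming you have one `statuses.json`-shaped array per region (the merged shape and the sample German label are this example's own, not something the scraper emits):

```javascript
// Merge localized status labels from several regions into one
// multilingual map keyed by status id (ids are stable across markets).
function mergeStatusLabels(byRegion) {
  const merged = {};
  for (const [region, statuses] of Object.entries(byRegion)) {
    for (const { id, label } of statuses) {
      merged[id] = merged[id] || {};
      merged[id][region] = label;
    }
  }
  return merged;
}

const multilingual = mergeStatusLabels({
  fi: [{ id: "1", label: "Uusi ilman hintalappua", type: "default" }],
  de: [{ id: "1", label: "Neu ohne Etikett", type: "default" }],
});
// multilingual["1"] → { fi: "Uusi ilman hintalappua", de: "Neu ohne Etikett" }
```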
Object keyed by size-group id:
```json
{
  "4": {
    "XS / 34": 2,
    "S / 36": 3
  }
}
```

Important:
- size-group coverage depends on what Vinted exposes through filter responses
- this file may remain sparse unless size extraction is extended further
Nested category tree for one Vinted market:
```json
{
  "Women": {
    "id": 1904,
    "slug": "WOMEN_ROOT",
    "children": {
      "Clothing": {
        "id": 4,
        "slug": "CLOTHING",
        "children": {}
      }
    }
  }
}
```

Node shape:
- `id`: Vinted catalog id
- `slug`: catalog code or generated fallback
- `size_id`: optional
- `children`: nested subcategories
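For consumers of `groups.json`, a recursive walk flattens the tree into `(path, id)` rows. A sketch using the node shape described above:

```javascript
// Flatten a groups.json category tree into breadcrumb-style rows.
function flattenCatalog(tree, prefix = []) {
  const rows = [];
  for (const [label, node] of Object.entries(tree)) {
    const path = [...prefix, label];
    rows.push({ path: path.join(" > "), id: node.id });
    if (node.children) rows.push(...flattenCatalog(node.children, path));
  }
  return rows;
}

const rows = flattenCatalog({
  Women: {
    id: 1904,
    slug: "WOMEN_ROOT",
    children: { Clothing: { id: 4, slug: "CLOTHING", children: {} } },
  },
});
// rows → [{ path: "Women", id: 1904 }, { path: "Women > Clothing", id: 4 }]
```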
Metadata about the generated snapshot:
```json
{
  "generated_at": "2026-04-05T19:50:00.000Z",
  "region_count": 25,
  "roots": [],
  "scraped_regions": ["de", "fr", "it"]
}
```

Repository layout:

```
src/
  cli.js
  config.js
  service.js
output/
  ...
package.json
Dockerfile
README.md
```
`src/cli.js` — entry point for:
- `scrape`
- `serve`
`src/config.js` — static configuration:
- supported regions
- default root catalogs
- brand search alphabet
`src/service.js` — core implementation:
- browser bootstrap
- catalog tree extraction
- filter endpoint calls
- JSON writing
- optional HTTP server
Requirements:
- Node.js 20+
- npm
- Playwright Chromium
Setup:
```
npm install
npx playwright install chromium
npm run scrape
```

This will:
- create or update `output/`
- scrape all configured regions
- merge results into existing JSON files
- preserve already scraped regions on later runs
```
REGION_CODES=de npm run scrape
REGION_CODES=de,fr,it npm run scrape
REGION_CODES=at,ie,gr npm run scrape
```

This is useful when:
- you want to bootstrap quickly
- a few markets are missing
- you want to refresh one subset without touching everything
```
REGION_CODES=de,fr BRAND_REGION=de npm run scrape
```

Brand discovery is the slowest part of the scraper.
By default, one region is used as the primary source for `brand.json`.
```
HEADLESS=false REGION_CODES=de npm run scrape
```

This is useful when:
- browser startup works but the scraper fails later
- you want to inspect challenge pages manually
- Vinted changes frontend behavior
Scrapes are incremental by design.
The scraper does not wipe the dataset folder before every run.
That means:
- existing region files remain in place
- later runs can add missing markets
- global files such as `brand.json` are merged with existing data
- `_meta.json` tracks which regions have already been scraped
This is important if you plan to commit `output/` into the repository.
The project also includes a minimal HTTP server:
```
npm run serve
```

Optional:

```
PORT=4010 npm run serve
```

Endpoints:
- `GET /health`
- `GET /dataset`
- `GET /dataset/brand.json`
- `GET /dataset/colors.json`
- `GET /dataset/statuses.json`
- `POST /refresh`
This is intentionally small and local-first. It is not meant to be a production API platform.
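Consuming it from another Node process can be a one-liner around `fetch`. A sketch, assuming Node 18+ (global `fetch`) and a server already running locally; the function is defined but not called here:

```javascript
// Fetch the latest local brand snapshot from the tiny HTTP server.
// Assumes `npm run serve` is running on the given port.
async function getBrandMap(baseUrl = "http://localhost:4010") {
  const res = await fetch(`${baseUrl}/dataset/brand.json`);
  if (!res.ok) throw new Error(`dataset fetch failed: ${res.status}`);
  return res.json();
}
```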
A simple Dockerfile is included.
Build:
```
docker build -t vinted-dataset-service .
```

Run:

```
docker run --rm -p 4010:4010 vinted-dataset-service
```

If you want to persist output outside the container, mount a volume and set `DATASET_OUTPUT_DIR`.
Supported variables:
- `REGION_CODES`
- `BRAND_REGION`
- `HEADLESS`
- `PORT`
- `DATASET_OUTPUT_DIR`
Examples:
```
REGION_CODES=de,fr BRAND_REGION=de npm run scrape
HEADLESS=false REGION_CODES=de npm run scrape
PORT=4010 npm run serve
DATASET_OUTPUT_DIR=./output npm run scrape
```

At a high level:
- Launch Chromium with Playwright.
- Open a Vinted catalog page to establish a valid browser session.
- Read root catalogs from embedded Next.js / Flight payloads.
- Walk each root catalog page and extract nested category trees.
- Query internal filter endpoints for brands, colors, sizes, and statuses.
- Merge results into JSON snapshots under `output/`.
For brands specifically:
- the scraper uses a prefix-based search strategy
- this is slower than the other metadata collectors
- this is currently the most practical way to enumerate many brand ids
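The prefix strategy can be sketched as a generator that yields every one- and two-letter prefix to feed into brand search. The alphabet and depth below are illustrative; the project keeps its own brand search alphabet in its static configuration:

```javascript
// Yield every search prefix up to the given depth, e.g. "a", "aa", "ab", ...
// Each prefix becomes one brand-search query; results are unioned downstream.
function* brandPrefixes(alphabet = "abcdefghijklmnopqrstuvwxyz", depth = 2) {
  for (const a of alphabet) {
    yield a;
    if (depth > 1) {
      for (const b of alphabet) yield a + b;
    }
  }
}

const prefixes = [...brandPrefixes("ab", 2)];
// prefixes → ["a", "aa", "ab", "b", "ba", "bb"]
```

This is why brand discovery is slow: the query count grows with alphabet size squared, and each query is a full round trip through the browser session.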
This repository is practical, not perfect.
Current limitations:
- Vinted can change frontend payload structure at any time
- anti-bot behavior can change without notice
- brand discovery is expensive and may still miss edge cases
- status labels are localized, not automatically multilingual
- size coverage is less complete than category coverage
- some regions may temporarily fail depending on Vinted availability
- duplicate brand ids with different labels can exist across markets
The first places to inspect are:
- `src/service.js`
- `src/config.js`
Most likely breakpoints:
- catalog tree extraction from `self.__next_f.push(...)`
- internal filter endpoint parameters
- anti-bot/session behavior
- localized response shapes
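As an illustration of the first breakpoint, a simplified extractor for `self.__next_f.push(...)` chunks might look like this. The regex is a sketch: real Flight payloads are split across many `push` calls and need more careful unescaping and reassembly:

```javascript
// Pull Flight payload chunks out of a page's inline <script> content.
function extractFlightChunks(html) {
  const chunks = [];
  const re = /self\.__next_f\.push\(\[\d+,\s*"((?:[^"\\]|\\.)*)"\]\)/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    // Each chunk is a JS string literal; unescape it before further parsing.
    chunks.push(JSON.parse(`"${m[1]}"`));
  }
  return chunks;
}

const sample = '<script>self.__next_f.push([1,"hello \\"world\\""])</script>';
const chunks = extractFlightChunks(sample);
// chunks → ['hello "world"']
```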
Recommended debugging workflow:
- Run with `HEADLESS=false`.
- Limit to one region: `REGION_CODES=de npm run scrape`
- Watch console output.
- Check which step failed:
- browser launch
- root discovery
- catalog extraction
- brand collection
- filter collection
This repository is designed to support committed snapshots.
A practical workflow:
- Run a refresh locally.
- Review changes in `output/`.
- Commit both code and updated dataset files.
- Publish the repository.
That gives users:
- the scraper itself
- a usable current snapshot immediately after clone
If you move this out into its own GitHub repository, a sensible layout is:
- keep `src/`
- keep `output/`
- keep `package.json`
- keep `Dockerfile`
- keep this README
- remove unrelated Vintrack-specific files
You can then publish it as:
- a scraper tool
- a JSON dataset repository
- both
This project relies on Vinted’s live website behavior and internal responses.
Use it responsibly and expect maintenance work over time.
This is not an official Vinted project and has no guarantee of long-term API stability.