WorldOfTaxonomy

1,000 classification systems. 1,212,000+ codes. 321,000+ crosswalk edges.
The open-source Rosetta Stone for global industry, trade, occupation, health, and regulatory taxonomies.

The Problem

Every country, industry body, and standards organization has its own classification system. When you need to reconcile data across them, you're on your own.

A truck driver in the US is NAICS 484, SOC 53-3032, ISCO-08 8332, NACE 49.4, and ISIC 4923 - five different codes in five different systems that all mean the same thing. Figuring that out manually costs hours. Doing it at scale costs entire teams.

WorldOfTaxonomy solves this. One queryable graph connects all 1,000 systems. One API call translates any code to any other system. One MCP server gives AI agents access to the entire taxonomy universe.

Architecture

System Overview

The platform serves four consumer interfaces - a web application, a REST API, an MCP server, and an AI-readable wiki - all backed by a shared PostgreSQL database.

graph TB
  subgraph Data["Data Layer"]
    PG[(PostgreSQL)]
    WIKI["wiki/*.md files"]
  end
  subgraph Backend["Python Backend"]
    INGEST["Ingesters - 1,000 systems"]
    API["FastAPI REST API - /api/v1/*"]
    MCP["MCP Server - stdio transport"]
    WIKILOADER["Wiki Loader - wiki.py"]
  end
  subgraph Frontend["Next.js Frontend"]
    NEXT["Next.js 15 App Router"]
    GUIDE["/guide/* pages"]
  end
  subgraph Consumers
    BROWSER["Web Browsers"]
    AIAGENT["AI Agents - Claude, GPT, etc."]
    CRAWLER["AI Crawlers - Perplexity, etc."]
    DEV["Developer Applications"]
  end
  INGEST -->|ingest| PG
  API -->|query| PG
  MCP -->|query| PG
  WIKILOADER -->|read| WIKI
  MCP -->|instructions| WIKILOADER
  NEXT -->|proxy /api/*| API
  NEXT -->|read| WIKI
  GUIDE -->|render| WIKI
  BROWSER --> NEXT
  BROWSER --> GUIDE
  AIAGENT --> MCP
  CRAWLER -->|/llms-full.txt| NEXT
  DEV --> API

Ingestion Pipeline

Each of the 1,000 systems has a dedicated ingester that fetches from authoritative sources and loads into three core tables.

graph TD
  subgraph Sources["Official Sources"]
    CSV["CSV files - NAICS, ISIC"]
    XLSX["Excel files - NACE, ANZSIC"]
    HTML["HTML/PDF - SIC, NIC"]
    CURATED["Expert-Curated - Domain taxonomies"]
  end
  subgraph Pipeline["Ingestion Pipeline"]
    PARSE["Parse and Validate"]
    UPSERT["Upsert Nodes into classification_node"]
    XWALK["Build Crosswalks into equivalence"]
    PROV["Set Provenance - 4-tier audit"]
  end
  subgraph DB["Database Tables"]
    SYS["classification_system - 1,000+ systems"]
    NODE["classification_node - 1.2M+ nodes"]
    EQUIV["equivalence - 321K+ edges"]
  end
  CSV --> PARSE
  XLSX --> PARSE
  HTML --> PARSE
  CURATED --> PARSE
  PARSE --> UPSERT
  PARSE --> XWALK
  PARSE --> PROV
  UPSERT --> NODE
  XWALK --> EQUIV
  PROV --> SYS
  SYS --- NODE
  NODE --- EQUIV

API Request Flow

sequenceDiagram
  participant C as Client
  participant RL as Rate Limiter
  participant AUTH as Auth Layer
  participant R as Router
  participant Q as Query Layer
  participant DB as PostgreSQL

  C->>RL: GET /api/v1/search?q=physician
  RL->>RL: Check rate - 30/min anon, 1000/min auth
  RL->>AUTH: Forward request
  AUTH->>AUTH: Validate JWT or API key
  AUTH->>R: Authenticated request
  R->>Q: search(conn, query, limit)
  Q->>DB: SELECT with ts_vector query
  DB-->>Q: Matching nodes
  Q-->>R: Results with system context
  R-->>C: JSON response

MCP Session Lifecycle

sequenceDiagram
  participant AI as AI Agent
  participant MCP as MCP Server
  participant WIKI as Wiki Loader
  participant DB as PostgreSQL

  AI->>MCP: initialize - JSON-RPC
  MCP->>WIKI: build_wiki_context()
  WIKI-->>MCP: Structural knowledge - ~15K tokens
  MCP-->>AI: serverInfo + instructions + capabilities
  Note over AI: Agent now knows all 1,000 systems and crosswalk topology
  AI->>MCP: tools/call search_classifications
  MCP->>DB: Query nodes
  DB-->>MCP: Results
  MCP-->>AI: Tool response as JSON
  AI->>MCP: resources/read taxonomy://wiki/crosswalk-map
  MCP->>WIKI: load_wiki_page - crosswalk-map
  WIKI-->>MCP: Full markdown content
  MCP-->>AI: Resource content

Directory Structure

WorldOfTaxonomy/
├── world_of_taxonomy/
│   ├── api/              # FastAPI REST API (lifespan pool, rate limiting)
│   │   ├── routers/      # systems, nodes, search, equivalences, countries, auth, crosswalk_graph
│   │   └── schemas.py    # Pydantic response models
│   ├── mcp/              # MCP server (stdio transport, 21 tools)
│   ├── ingest/           # One ingester per system (100+ files)
│   │   ├── naics.py      # Downloads from Census Bureau
│   │   ├── nace_derived.py  # EU national adaptations (copy NACE + equivalences)
│   │   ├── isic_derived.py  # LATAM/Asia/Africa adaptations
│   │   └── crosswalk_*.py   # 20+ crosswalk ingesters
│   ├── query/            # Query layer (browse, search, equivalence)
│   ├── schema.sql        # Core tables
│   └── schema_auth.sql   # Auth tables
├── frontend/             # Next.js 15 + TypeScript + Tailwind + shadcn/ui
│   └── src/app/          # Home, Explore, System, Dashboard, Crosswalk Explorer, Guide
├── wiki/                 # Curated guides (serves web, MCP, llms.txt, and API)
├── tests/                # pytest (test_wot schema isolation, never touches production)
└── data/                 # Downloaded source files (gitignored, re-downloadable)

Database (PostgreSQL):

classification_system      -- 1,000 rows: id, name, region, authority, node_count
classification_node        -- 1.2M rows: system_id, code, title, level, parent_code
equivalence                -- 321K rows: source_system, source_code, target_system, target_code, match_type
country_system_link        -- 27K rows: country_code, system_id, relevance ('official'|'regional'|'recommended')

Quick Start

Docker (recommended - runs in under 2 minutes):

git clone https://github.com/colaberry/WorldOfTaxonomy.git
cd WorldOfTaxonomy
docker compose up

Open http://localhost:3000. The API is at http://localhost:8000.

Then ingest your first systems:

# Core global systems (~3 minutes)
docker compose exec backend python3 -m world_of_taxonomy ingest naics
docker compose exec backend python3 -m world_of_taxonomy ingest isic
docker compose exec backend python3 -m world_of_taxonomy ingest crosswalk

# Everything (~30-45 minutes)
docker compose exec backend python3 -m world_of_taxonomy ingest all

Python only (bring your own PostgreSQL):

pip install -e .
cp .env.example .env   # set DATABASE_URL and JWT_SECRET
python3 -m world_of_taxonomy init
python3 -m world_of_taxonomy ingest naics
python3 -m uvicorn world_of_taxonomy.api.app:create_app --factory --port 8000

API in 60 Seconds

# Translate NAICS 4841 (general freight trucking) to all equivalent systems
curl "http://localhost:8000/api/v1/systems/naics_2022/nodes/4841/translations"

# Search for "hospital" across all 1,000 systems simultaneously
curl "http://localhost:8000/api/v1/search?q=hospital&grouped=true"

# Get every classification system applicable to Germany
curl "http://localhost:8000/api/v1/countries/DE"

# Find all codes in NACE with no mapping to NAICS
curl "http://localhost:8000/api/v1/diff?a=nace_rev2&b=naics_2022"

# Full text search within a specific system
curl "http://localhost:8000/api/v1/search?q=logistics&system=isco_08"

Python client:

import httplib2, json

base = "http://localhost:8000/api/v1"

# Get all systems for France
r = httplib2.Http().request(f"{base}/countries/FR")
profile = json.loads(r[1])
# {'official': 'naf_rev2', 'regional': 'nace_rev2', 'recommended': ['isic_rev4', ...]}

# Translate a code
r = httplib2.Http().request(f"{base}/systems/naics_2022/nodes/5415/translations")
translations = json.loads(r[1])
# {'nace_rev2': '62.01', 'isic_rev4': '6201', 'sic_1987': '7371', ...}

Use With Claude / AI Agents (MCP)

WorldOfTaxonomy ships with a Model Context Protocol server. Add it to Claude Desktop and your AI gets instant access to all 1,000 systems as structured tools.

~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "world-of-taxonomy": {
      "command": "python3",
      "args": ["-m", "world_of_taxonomy", "mcp"],
      "env": {
        "DATABASE_URL": "your-database-url"
      }
    }
  }
}

21 tools available, including:

Tool	What it does
`translate_code`	Translate any code to a target system
`translate_across_all_systems`	One code -> all 1,000 systems at once
`search_classifications`	Full-text search across all codes
`get_country_taxonomy_profile`	Official + recommended systems for any country
`compare_sector`	Side-by-side root nodes across two systems
`get_system_diff`	Codes in system A with no mapping to B
`explore_industry_tree`	Browse hierarchy with context
`find_by_keyword_all_systems`	Search grouped by system

Example prompt to Claude:

"I have a dataset with NACE codes. Convert every unique code to NAICS and ISIC equivalents and flag any that have no crosswalk."

What's Covered

16 categories. 1,000 systems. Every major region.

Category	Systems	Highlights
Industry	68	NAICS, ISIC, NACE + 58 national adaptations (EU, LATAM, Asia, Africa)
Life Sciences	108	ICD-11, ICD-10-CM/PCS, LOINC (102K), MeSH, SNOMED, NDC, NCI Thesaurus (211K)
Domain Deep-Dives	400+	Sector vocabularies for 40+ industry verticals
Regulatory	80+	GDPR, FDA, SOX, HIPAA, ISO standards, EU directives, NIST frameworks
Occupational	10	SOC, ISCO-08, ESCO (14K skills), O*NET, ANZSCO, NOC, KldB, ROME
Product / Trade	11	HS 2022, UNSPSC (77K codes), CPC, SITC, HTS, Schedule B, ECCN
Research & Knowledge	8	FORD, JEL, LCC, PACS, MSC, ACM CCS, arXiv, ANZSRC
Financial / Investment	7	GICS, ICB, CFI (ISO 10962), COFOG, COICOP, GHG Protocol, Patent CPC
Geographic	7	ISO 3166-1/2, UN M.49, EU NUTS, US FIPS, World Bank income groups
Education	3	ISCED 2011, ISCED-F 2013, CIP 2020

249 countries are profiled with their official, regional, and recommended systems.

Use Cases

Data engineering: Reconcile supplier data (NAICS) with EU reporting (NACE) automatically
Compliance: Map your product portfolio to HS codes for any customs jurisdiction
HR / Recruitment: Translate job titles between SOC (US), ESCO (EU), ISCO (global)
Financial research: Bridge GICS sector codes to NAICS for cross-market analysis
AI / LLM: Give your AI agent structured knowledge of every industry and occupation
Academic: Study how different countries classify the same economic activity
Healthcare: Cross-reference ICD-11 diagnoses against ICD-10-CM/PCS variants
Trade compliance: Classify goods across HS, HTS, ECCN, and Schedule B simultaneously

REST API Reference

GET /api/v1/systems                                List all systems
GET /api/v1/systems?country={code}                 Systems for a country
GET /api/v1/systems/{id}                           System detail
GET /api/v1/systems/{id}/nodes/{code}              A specific code
GET /api/v1/systems/{id}/nodes/{code}/children     Direct children
GET /api/v1/systems/{id}/nodes/{code}/ancestors    Parent chain to root
GET /api/v1/systems/{id}/nodes/{code}/equivalences Cross-system mappings
GET /api/v1/systems/{id}/nodes/{code}/translations All equivalences in one call
GET /api/v1/systems/{id}/nodes/{code}/siblings     Sibling codes
GET /api/v1/search?q={query}                       Full-text search
GET /api/v1/search?q={query}&grouped=true          Results grouped by system
GET /api/v1/compare?a={sys}&b={sys}                Side-by-side comparison
GET /api/v1/diff?a={sys}&b={sys}                   Codes in A with no match in B
GET /api/v1/countries/stats                        Coverage stats (world map)
GET /api/v1/countries/{code}                       Country taxonomy profile
GET /api/v1/systems/{src}/crosswalk/{tgt}/graph    Crosswalk graph for visualization

Interactive docs: http://localhost:8000/docs (Swagger UI, auto-generated)

Rate limits (default): 30 req/min unauthenticated, 1000 req/min with API key.

Contributing

We are actively seeking contributors to add classification systems.

Every country has a national industry classification standard that should be in this graph. Most are public domain or openly licensed. Adding one takes about 2 hours following our guide.

Read CONTRIBUTING.md for the full TDD-enforced guide.

Quick version:

Pick a system from the open issues or request one
Write failing tests first (tests/test_ingest_<system>.py) - TDD is non-negotiable
Implement world_of_taxonomy/ingest/<system>.py
Wire into __main__.py, update CLAUDE.md
Open a PR - one system per PR

Systems most wanted:

Additional national industry codes (Middle East, Sub-Saharan Africa, Central Asia)
UNSD/Eurostat statistical classifications
Commodity classifications (agricultural, mineral, pharmaceutical)
US state-level occupation codes

Roadmap

See ROADMAP.md for the full list. Key near-term items:

Hosted public API (no self-hosting required)
Python client library (pip install world-of-taxonomy-client)
JavaScript / TypeScript client
Bulk export: Parquet + CSV snapshots on Hugging Face Datasets
API key dashboard (frontend UI for managing keys)
dbt package for warehouse-native crosswalk joins
Weekly "new system" digest for contributors

Data Sources and Licensing

Code is MIT licensed. Classification data is sourced from public domain and openly licensed sources (UN, Eurostat, US Census Bureau, WHO, ILO, etc.).

WorldOfTaxonomy does not redistribute raw data files. Each ingester downloads data directly from its authoritative source at ingest time. See DATA_SOURCES.md for per-system attribution and license information.

Community

GitHub Issues: Bug reports, feature requests, new system requests
GitHub Discussions: Questions, ideas, show-and-tell
Contributing: See CONTRIBUTING.md

Built with precision. Open forever. PRs welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 218 Commits
.github		.github
blog		blog
crosswalk-data		crosswalk-data
data		data
docs		docs
frontend		frontend
migrations		migrations
scripts		scripts
skills		skills
tests		tests
tree-data		tree-data
wiki		wiki
world_of_taxonomy		world_of_taxonomy
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
DATA_SOURCES.md		DATA_SOURCES.md
DESIGN.md		DESIGN.md
Dockerfile.backend		Dockerfile.backend
HANDOVER.md		HANDOVER.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
OAUTH_PRODUCTION_SETUP.md		OAUTH_PRODUCTION_SETUP.md
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
alembic.ini		alembic.ini
cloudbuild.yaml		cloudbuild.yaml
docker-compose.yml		docker-compose.yml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
run_mcp.sh		run_mcp.sh
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WorldOfTaxonomy

The Problem

Architecture

System Overview

Ingestion Pipeline

API Request Flow

MCP Session Lifecycle

Directory Structure

Quick Start

API in 60 Seconds

Use With Claude / AI Agents (MCP)

What's Covered

Use Cases

REST API Reference

Contributing

Roadmap

Data Sources and Licensing

Community

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WorldOfTaxonomy

The Problem

Architecture

System Overview

Ingestion Pipeline

API Request Flow

MCP Session Lifecycle

Directory Structure

Quick Start

API in 60 Seconds

Use With Claude / AI Agents (MCP)

What's Covered

Use Cases

REST API Reference

Contributing

Roadmap

Data Sources and Licensing

Community

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages