You give it a company name, a website, or some PDFs — it figures out which FSC codes that company falls under.
Under the hood it crawls the website, pulls text out of documents, runs everything through embeddings + vector search against ~498 pre-seeded FSC codes, then uses GPT-4o to pick the best matches and explain why.
┌──────────────┐ ┌──────────────┐ ┌─────────────────────────────────┐
│ Next.js │────▶│ NestJS │────▶│ Trigger.dev Background Jobs │
│ Frontend │◀────│ API │ │ │
└──────────────┘ └──────────────┘ │ 1. Crawl website (axios/cheerio)│
React 19 Port 3001 │ 2. Parse PDFs (pdf-parse/OCR) │
TanStack Query File uploads │ 3. Extract company summary │
shadcn/ui CORS + validation │ 4. Embed → pgvector search │
│ 5. GPT-4o rerank → top codes │
└───────────┬─────────────────────┘
│
▼
┌─────────────────────────┐
│ PostgreSQL + pgvector │
│ 498 FSC codes w/ │
│ OpenAI embeddings │
└─────────────────────────┘
- Submit a company with a name, website URL, and/or uploaded PDFs/images
- Crawls the website (axios + cheerio), focuses on about/products/services pages, picks up PDFs it finds along the way
- Extracts text from documents — tries pdf-parse first, falls back to GPT-4o vision for scanned/image-based content
- Embeds everything with text-embedding-3-small, then searches pgvector for the closest FSC codes
- GPT-4o reranks the top candidates down to 5–10 results, each with a confidence score and written reasoning
- If a very similar company was already classified (>95% cosine similarity), it just reuses those codes instead of burning more API calls
- The frontend polls for status updates so you can watch it go through each step in real time
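The cache reuse above is just a nearest-neighbor check over previously stored company embeddings. A minimal sketch of the idea in plain TypeScript (names like `CachedCompany` and `findCachedMatch` are illustrative, not the actual implementation — in the real pipeline the comparison happens inside pgvector):

```typescript
// Illustrative sketch of the >0.95 cosine-similarity cache check.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CachedCompany {
  id: string;
  embedding: number[];
  fscCodes: string[]; // previously assigned codes, reused on a hit
}

const CACHE_THRESHOLD = 0.95;

// Returns the best previously classified company above the threshold,
// or null if nothing is similar enough and a fresh rerank is needed.
function findCachedMatch(
  embedding: number[],
  previous: CachedCompany[],
): CachedCompany | null {
  let best: CachedCompany | null = null;
  let bestScore = CACHE_THRESHOLD;
  for (const company of previous) {
    const score = cosineSimilarity(embedding, company.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = company;
    }
  }
  return best;
}
```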
Each code comes with a confidence bar, and you can expand the AI's reasoning.
| Layer | Tech |
|---|---|
| Frontend | Next.js 16, React 19, TailwindCSS v4, shadcn/ui, TanStack Query |
| API | NestJS 11, class-validator, multer |
| Background Jobs | Trigger.dev v4.4.1 |
| Database | PostgreSQL + Prisma + pgvector |
| AI | GPT-4o (reranking + OCR vision), text-embedding-3-small (embeddings) |
| Crawling | axios + cheerio |
| PDF parsing | pdf-parse, GPT-4o vision (OCR fallback) |
fsc-classifier/
├── apps/
│ ├── api/ # NestJS REST API
│ ├── frontend/ # Next.js App Router
│ └── trigger/ # Trigger.dev background tasks
├── packages/
│ ├── database/ # Prisma schema + pgvector helpers
│ ├── openai/ # OpenAI client singleton
│ └── shared/ # Shared TypeScript types
└── data/ # FSC classification source PDF
Company Input (name + URL + documents)
│
▼
┌─── Crawl Website ───────────────────────────┐
│ axios + cheerio │
│ → Homepage + priority subpages (max 8) │
│ → Auto-detect & download PDFs from site │
└──────────────────────────────────────────────┘
│
▼
┌─── Parse Documents ──────────────────────────────┐
│ pdf-parse (fast) → GPT-4o vision (OCR fallback) │
│ → Extract text from uploaded PDFs/images │
└───────────────────────────────────────────────────┘
│
▼
┌─── Classify ───────────────────────────────────────────┐
│ 1. Aggregate all text (cap at 30K chars) │
│ 2. GPT-4o extracts a structured company summary │
│ 3. text-embedding-3-small embeds the summary │
│ 4. pgvector cosine search → top 20 FSC candidates │
│ 5. Cache check: similar company already done? reuse it │
│ 6. GPT-4o reranks to top 5-10 with confidence + reason │
└─────────────────────────────────────────────────────────┘
│
▼
Final FSC codes (ranked, with confidence & reasoning)
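Step 4 in the box above is a standard pgvector pattern: order by the `<=>` cosine-distance operator and take the top 20. A hedged sketch of what that query typically looks like (the `fsc_codes` table and `embedding` column names are assumptions, not taken from the actual Prisma schema):

```typescript
// Sketch of the cosine-distance search from step 4.

const TOP_K = 20;

// pgvector's <=> operator returns cosine distance (0 = identical),
// so similarity = 1 - distance. $1 is the query embedding parameter.
function buildFscSearchQuery(topK: number = TOP_K): string {
  return `
    SELECT code, title, 1 - (embedding <=> $1::vector) AS similarity
    FROM fsc_codes
    ORDER BY embedding <=> $1::vector
    LIMIT ${topK}
  `.trim();
}
```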
Six tasks orchestrated by classify-company. The flow depends on what input you provide:
With website URL:
classify-company
│
├─ [CRAWLING] crawl-website
│ ├─ fetch-page (homepage)
│ ├─ fetch-page (batch: /about, /products, /services... up to 7)
│ └─ detect-and-fetch-pdfs (if PDF links found on site)
│ └─ parse-document (batch: each discovered PDF)
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
├─ embed text → text-embedding-3-small
├─ pgvector search → top 20 FSC candidates
├─ cache hit (>0.95 similarity)? → reuse codes, done
└─ cache miss → GPT-4o rerank → save top 5-10 codes
With uploaded documents only (no URL):
classify-company
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
├─ embed text → text-embedding-3-small
├─ pgvector search → top 20 FSC candidates
├─ cache hit? → reuse codes, done
└─ cache miss → GPT-4o rerank → save top 5-10 codes
The crawling step is skipped entirely when there's no URL — the status goes straight from PENDING to PARSING. Subpages and PDFs are fetched in parallel batches. If any step fails, the job is marked FAILED and Trigger.dev retries (1-3x depending on the task).
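The status progression described above can be sketched as a small function (the PENDING/CRAWLING/PARSING/CLASSIFYING/FAILED names come from this section; a terminal COMPLETED status is assumed here for illustration):

```typescript
type JobStatus =
  | "PENDING"
  | "CRAWLING"
  | "PARSING"
  | "CLASSIFYING"
  | "COMPLETED"
  | "FAILED";

// Illustrative: the sequence of statuses a successful job moves
// through. With no website URL, CRAWLING is skipped entirely and the
// job goes straight from PENDING to PARSING.
function pipelineStages(hasWebsiteUrl: boolean): JobStatus[] {
  const stages: JobStatus[] = ["PENDING"];
  if (hasWebsiteUrl) stages.push("CRAWLING");
  stages.push("PARSING", "CLASSIFYING", "COMPLETED");
  return stages;
}
```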
You'll need: Node.js 20+, pnpm, PostgreSQL with pgvector, and API keys for OpenAI and Trigger.dev.
pnpm install
cp .env.example .env
# fill in DATABASE_URL, OPENAI_API_KEY, TRIGGER_SECRET_KEY
pnpm db:generate
pnpm db:migrate
# seeds ~498 FSC codes with embeddings, takes about 2 min
pnpm --filter @fsc-c/db seed:fsc
pnpm dev

That gives you:
- Frontend at http://localhost:3000
- API at http://localhost:3001
- Trigger.dev dev server running tasks locally
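The seed step embeds all ~498 FSC codes, and embedding APIs cap how many inputs you can send per request, so the script presumably batches them. A chunking helper of the usual shape (purely illustrative — the batch size of 100 is an assumption, not the script's actual value):

```typescript
// Illustrative: split the ~498 FSC codes into batches before calling
// the embeddings API.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```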
| Method | Path | What it does |
|---|---|---|
| POST | /classify | Submit a company (multipart/form-data) |
| GET | /classify/:id | Poll status + get results |
| GET | /classify | List all companies |
| GET | /classify/search?fscCode=3416 | Find companies by FSC code |
All responses: `{ success: boolean, data?: T, error?: string }`
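That envelope is easy to handle with a small generic helper on the client (a sketch only — `unwrap` is not part of the codebase):

```typescript
// The response envelope used by every endpoint above.
interface ApiResponse<T> {
  success: boolean;
  data?: T;
  error?: string;
}

// Illustrative helper: throw on failure, return typed data on success.
function unwrap<T>(res: ApiResponse<T>): T {
  if (!res.success || res.data === undefined) {
    throw new Error(res.error ?? "Unknown API error");
  }
  return res.data;
}
```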
Things I'd want to tackle next if I keep building on this:
- Smarter crawling — Right now the crawler follows a hardcoded list of priority subpages (/about, /products, /services, etc.). I'd like to replace that with AI-driven link discovery — let the model decide which pages are worth visiting based on context. Looking at Firecrawl as a potential drop-in replacement that handles this out of the box (smart crawling, JS rendering, structured extraction).
- E2E tests — No tests yet. Would add Playwright tests covering the full flow: submit a company, watch it process, verify the results page renders correctly. Also API integration tests for the classification pipeline with mocked OpenAI responses.
- User accounts + OAuth — Currently there's no auth at all — anyone can submit and view everything. Next step would be adding user registration via OAuth (Google, GitHub) so companies and classification history are tied to individual accounts.
- Multiple AI provider support — The whole pipeline is hardwired to OpenAI right now (embeddings, reranking, OCR). I'd want to abstract that behind provider adapters so you could swap in Anthropic, Gemini, open-source models, etc. without touching the pipeline logic. Especially useful for the embedding step, where there are good cheaper alternatives.
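One possible shape for that provider abstraction (entirely hypothetical — nothing like this exists in the repo yet; every name here is made up for illustration):

```typescript
// Hypothetical adapters the pipeline could depend on instead of the
// OpenAI client directly.
interface EmbeddingProvider {
  embed(texts: string[]): Promise<number[][]>;
}

interface RerankedCode {
  code: string;
  confidence: number; // 0–1
  reasoning: string;
}

interface RerankProvider {
  rerank(companySummary: string, candidateCodes: string[]): Promise<RerankedCode[]>;
}

// A trivial in-memory stub showing that the pipeline would only need
// the interface, not a specific vendor.
class FakeEmbeddingProvider implements EmbeddingProvider {
  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((t) => [t.length, 0, 0]); // placeholder vectors
  }
}
```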
MIT



