kolezka/fsc-classifier

FSC Classifier

You give it a company name, a website, or some PDFs — it figures out which FSC codes that company falls under.

Under the hood it crawls the website, pulls text out of documents, runs everything through embeddings + vector search against ~498 pre-seeded FSC codes, then uses GPT-4o to pick the best matches and explain why.

Dashboard

Architecture

┌──────────────┐     ┌──────────────┐     ┌───────────────────────────────────┐
│   Next.js    │────▶│   NestJS     │────▶│   Trigger.dev Background Jobs     │
│   Frontend   │◀────│   API        │     │                                   │
└──────────────┘     └──────────────┘     │  1. Crawl website (axios/cheerio) │
     React 19          Port 3001          │  2. Parse PDFs (pdf-parse/OCR)    │
     TanStack Query    File uploads       │  3. Extract company summary       │
     shadcn/ui         CORS + validation  │  4. Embed → pgvector search       │
                                          │  5. GPT-4o rerank → top codes     │
                                          └───────────┬───────────────────────┘
                                                      │
                                                      ▼
                                          ┌──────────────────────────┐
                                          │  PostgreSQL + pgvector   │
                                          │  498 FSC codes w/        │
                                          │  OpenAI embeddings       │
                                          └──────────────────────────┘

What it does

  • Submit a company with a name, website URL, and/or uploaded PDFs/images
  • Crawls the website (axios + cheerio), focuses on about/products/services pages, picks up PDFs it finds along the way
  • Extracts text from documents — tries pdf-parse first, falls back to GPT-4o vision for scanned/image-based content
  • Embeds everything with text-embedding-3-small, searches pgvector for the closest FSC codes
  • GPT-4o reranks the top candidates down to 5–10 results, each with a confidence score and written reasoning
  • If a very similar company was already classified (>95% cosine similarity), it just reuses those codes instead of burning more API calls
  • The frontend polls for status updates so you can watch it go through each step in real time
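
The pdf-parse → vision fallback described above can be sketched as a small wrapper. The parser functions are injected here so the sketch stays self-contained; the "too little text means it's probably scanned" heuristic and the names `fastParse`/`ocrParse` are my assumptions, not the repo's actual code:

```typescript
// Sketch of the document-parsing fallback: try the fast text extractor first,
// fall back to OCR when it errors or returns (almost) nothing.
const MIN_TEXT_CHARS = 50; // assumed threshold for "this PDF has real text"

async function parseDocument(
  buffer: Buffer,
  fastParse: (buf: Buffer) => Promise<string>, // e.g. pdf-parse
  ocrParse: (buf: Buffer) => Promise<string>   // e.g. GPT-4o vision
): Promise<string> {
  try {
    const text = await fastParse(buffer);
    if (text.trim().length >= MIN_TEXT_CHARS) return text;
  } catch {
    // parser blew up on this file — fall through to OCR
  }
  return ocrParse(buffer); // scanned or image-based content
}
```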

Classification Results

Each code comes with a confidence bar and you can expand the AI's reasoning:

AI Reasoning

Tech Stack

Layer            Tech
Frontend         Next.js 16, React 19, TailwindCSS v4, shadcn/ui, TanStack Query
API              NestJS 11, class-validator, multer
Background Jobs  Trigger.dev v4.4.1
Database         PostgreSQL + Prisma + pgvector
AI               GPT-4o (reranking + OCR vision), text-embedding-3-small (embeddings)
Crawling         axios + cheerio
PDF parsing      pdf-parse, GPT-4o vision (OCR fallback)

Project Structure

fsc-classifier/
├── apps/
│   ├── api/            # NestJS REST API
│   ├── frontend/       # Next.js App Router
│   └── trigger/        # Trigger.dev background tasks
├── packages/
│   ├── database/       # Prisma schema + pgvector helpers
│   ├── openai/         # OpenAI client singleton
│   └── shared/         # Shared TypeScript types
└── data/               # FSC classification source PDF

The pipeline, step by step

Company Input (name + URL + documents)
        │
        ▼
┌─── Crawl Website ───────────────────────────┐
│  axios + cheerio                             │
│  → Homepage + priority subpages (max 8)      │
│  → Auto-detect & download PDFs from site     │
└──────────────────────────────────────────────┘
        │
        ▼
┌─── Parse Documents ──────────────────────────────┐
│  pdf-parse (fast) → GPT-4o vision (OCR fallback) │
│  → Extract text from uploaded PDFs/images        │
└──────────────────────────────────────────────────┘
        │
        ▼
┌─── Classify ───────────────────────────────────────────┐
│  1. Aggregate all text (cap at 30K chars)               │
│  2. GPT-4o extracts a structured company summary        │
│  3. text-embedding-3-small embeds the summary           │
│  4. pgvector cosine search → top 20 FSC candidates      │
│  5. Cache check: similar company already done? reuse it │
│  6. GPT-4o reranks to top 5-10 with confidence + reason │
└─────────────────────────────────────────────────────────┘
        │
        ▼
  Final FSC codes (ranked, with confidence & reasoning)
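
The cache check in step 5 boils down to a cosine-similarity comparison against previously classified companies. A minimal sketch, where `findCachedCodes` and its in-memory shape are illustrative (the real lookup happens in pgvector), with the >0.95 threshold taken from the pipeline description:

```typescript
// Plain cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const CACHE_THRESHOLD = 0.95;

// If a previously classified company is close enough, reuse its codes
// instead of paying for another GPT-4o rerank.
function findCachedCodes(
  summaryEmbedding: number[],
  previous: { embedding: number[]; fscCodes: string[] }[]
): string[] | null {
  for (const company of previous) {
    if (cosineSimilarity(summaryEmbedding, company.embedding) > CACHE_THRESHOLD) {
      return company.fscCodes; // cache hit — skip the rerank step
    }
  }
  return null; // cache miss — continue to GPT-4o rerank
}
```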

Trigger.dev Task Flow

Six tasks orchestrated by classify-company. The flow depends on what input you provide:

With website URL:

classify-company
│
├─ [CRAWLING] crawl-website
│  ├─ fetch-page (homepage)
│  ├─ fetch-page (batch: /about, /products, /services... up to 7)
│  └─ detect-and-fetch-pdfs (if PDF links found on site)
│     └─ parse-document (batch: each discovered PDF)
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit (>0.95 similarity)? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes

With uploaded documents only (no URL):

classify-company
│
├─ [PARSING] parse-document (batch: each uploaded file)
│
└─ [CLASSIFYING] classify-fsc
   ├─ embed text → text-embedding-3-small
   ├─ pgvector search → top 20 FSC candidates
   ├─ cache hit? → reuse codes, done
   └─ cache miss → GPT-4o rerank → save top 5-10 codes

The crawling step is skipped entirely when there's no URL — the status goes straight from PENDING to PARSING. Subpages and PDFs are fetched in parallel batches. If any step fails, the job is marked FAILED and Trigger.dev retries (1-3x depending on the task).
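
The "parallel batches" part can be sketched with `Promise.allSettled`, so one dead subpage doesn't sink the whole crawl. `fetchPage` here is a stand-in for the real fetch-page task, not the repo's actual signature:

```typescript
// Fetch a batch of subpages in parallel, tolerating individual failures:
// rejected pages are simply dropped from the result set.
async function crawlBatch(
  urls: string[],
  fetchPage: (url: string) => Promise<string>
): Promise<{ url: string; html: string }[]> {
  const results = await Promise.allSettled(urls.map((u) => fetchPage(u)));
  return results.flatMap((r, i) =>
    r.status === "fulfilled" ? [{ url: urls[i], html: r.value }] : []
  );
}
```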

Running locally

You'll need: Node.js 20+, pnpm, PostgreSQL with pgvector, and API keys for OpenAI and Trigger.dev.

pnpm install

cp .env.example .env
# fill in DATABASE_URL, OPENAI_API_KEY, TRIGGER_SECRET_KEY

pnpm db:generate
pnpm db:migrate

# seeds ~498 FSC codes with embeddings, takes about 2 min
pnpm --filter @fsc-c/db seed:fsc

pnpm dev

That gives you the frontend and API running locally:

Add Company

API

Method  Path                           What it does
POST    /classify                      Submit a company (multipart/form-data)
GET     /classify/:id                  Poll status + get results
GET     /classify                      List all companies
GET     /classify/search?fscCode=3416  Find companies by FSC code

All responses: { success: boolean, data?: T, error?: string }
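
That envelope maps onto a discriminated union with a small unwrap helper; `ApiResponse` and `unwrap` are my naming for this sketch, not necessarily what lives in packages/shared:

```typescript
// Response envelope shared by all endpoints: either data or an error string.
type ApiResponse<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Narrowing type guard, so callers get the right variant without casts.
function isSuccess<T>(res: ApiResponse<T>): res is { success: true; data: T } {
  return res.success;
}

// Throw on error, return the payload otherwise — handy in polling loops.
function unwrap<T>(res: ApiResponse<T>): T {
  if (!isSuccess(res)) throw new Error(res.error);
  return res.data;
}
```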

Next Steps

Things I'd want to tackle next if I keep building on this:

  • Smarter crawling — Right now the crawler follows a hardcoded list of priority subpages (/about, /products, /services, etc.). I'd like to replace that with AI-driven link discovery — let the model decide which pages are worth visiting based on context. Looking at Firecrawl as a potential drop-in replacement that handles this out of the box (smart crawling, JS rendering, structured extraction).

  • E2E tests — No tests yet. Would add Playwright tests covering the full flow: submit a company, watch it process, verify the results page renders correctly. Also API integration tests for the classification pipeline with mocked OpenAI responses.

  • User accounts + OAuth — Currently there's no auth at all — anyone can submit and view everything. Next step would be adding user registration via OAuth (Google, GitHub) so companies and classification history are tied to individual accounts.

  • Multiple AI provider support — The whole pipeline is hardwired to OpenAI right now (embeddings, reranking, OCR). I'd want to abstract that behind provider adapters so you could swap in Anthropic, Gemini, open-source models, etc. without touching the pipeline logic. Especially useful for the embedding step where there are good cheaper alternatives.
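
One possible shape for those provider adapters — nothing below exists in the repo yet, it's just a sketch of how the pipeline could depend on interfaces instead of the OpenAI client directly:

```typescript
// Hypothetical provider interfaces: the pipeline only sees these,
// so OpenAI, Anthropic, Gemini, etc. become interchangeable adapters.
interface EmbeddingProvider {
  embed(text: string): Promise<number[]>;
}

interface RerankProvider {
  rerank(
    summary: string,
    candidates: { code: string; title: string }[]
  ): Promise<{ code: string; confidence: number; reasoning: string }[]>;
}

// The classification flow stays provider-agnostic: embed → search → rerank.
async function classify(
  summary: string,
  embedder: EmbeddingProvider,
  reranker: RerankProvider,
  searchCandidates: (embedding: number[]) => Promise<{ code: string; title: string }[]>
) {
  const embedding = await embedder.embed(summary);
  const candidates = await searchCandidates(embedding);
  return reranker.rerank(summary, candidates);
}
```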

License

MIT

About

Full-stack app that classifies companies into Federal Supply Classification (FSC) codes using web crawling, PDF parsing, OpenAI embeddings, pgvector search, and GPT-4o reranking
