CourseForge

An AI-powered, multi-pass course generation engine that discovers, curates, and sequences real-world educational resources into personalized, dependency-aware learning paths.

CourseForge does not generate synthetic content. Instead, it acts as an intelligent curriculum architect: it searches the open web for existing high-quality educational material (YouTube tutorials, documentation, academic papers, Wikipedia, university syllabi), extracts the conceptual structure of a topic, builds a pedagogically grounded curriculum DAG (Directed Acyclic Graph), and assigns the best-fit resource to each node using vector similarity search and inference model evaluation.

The result is a complete, structured course where every node has a real learning resource, every prerequisite is mapped, and the entire curriculum is coherent — generated end-to-end in a single pipeline run.

Architecture Overview

CourseForge is a full-stack application with a React + Vite frontend and a Node.js / Express backend, backed by MongoDB (with Atlas Vector Search for semantic retrieval). The backend orchestrates a complex, multi-pass generation pipeline that coordinates multiple inference models, search APIs, reranking classifiers, and embedding models to produce a complete course from a single topic string.

The pipeline is designed around three core principles:

Ground everything in discovered evidence. The inference model never invents the topic structure from parametric knowledge alone. Every concept in the curriculum must trace back to real educational content found during discovery. Concepts are assigned confidence scores based on how many independent source types confirm them.
Separate concerns across specialized models. Different pipeline stages have fundamentally different computational profiles. Chunk classification needs speed and volume (thousands of binary yes/no calls). Concept extraction needs reasoning depth. Skeleton generation needs creative synthesis. CourseForge allows routing each stage to a different inference model, optimizing cost, latency, and quality simultaneously.
Preserve context across stateless inference calls. Each inference model call is stateless, but the pipeline is deeply stateful. CourseForge maintains and injects a running context — skeleton design reasoning, per-node design notes, a resource assignment log — into every downstream prompt so the model can act as a coherent reviewer of its own prior decisions.

┌─────────────────────────────────────────────────────────────────────┐
│                        GENERATION PIPELINE                         │
│                                                                    │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────────────┐  │
│  │  Pass 1  │──▶│  Pass 2  │──▶│  Pass 3  │──▶│   Pass 3.5     │  │
│  │Discovery │   │ Skeleton │   │ RAG-Fill │   │ Dedup Checker  │  │
│  └──────────┘   └──────────┘   └──────────┘   └────────────────┘  │
│       │              ▲              │                   │          │
│       ▼              │              ▼                   ▼          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────┐   │
│  │  Filter  │   │  Skill   │   │  Vector  │   │   Pass 4     │   │
│  │ + Embed  │   │  Level   │   │  Search  │   │  Coherence   │   │
│  │ + Graph  │   │  + Gate  │   │  + Score │   │  Validation  │   │
│  └──────────┘   └──────────┘   └──────────┘   └──────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

The Generation Pipeline

Pass 1 — Unified Multi-Source Discovery

Pass 1 is the foundation. It discovers, validates, extracts concepts from, and indexes all available educational material for the topic. It runs as a sequence of sub-steps (1A through 1G), each building on the previous.

Step 1A.0 — Sub-Domain Decomposition (LLM Query Expansion)

Before any searches are launched, an inference model decomposes the user's topic into 5–7 distinct sub-domains or learning milestones. For a topic like "MERN Stack", this produces targeted queries like "MongoDB aggregation pipeline tutorial" and "JWT middleware Express.js best practices" rather than generic "MERN Stack full course" variations.

Why this matters: Generic search queries produce overlapping results that cover the same introductory ground. Sub-domain decomposition forces search diversity across the topic's breadth, ensuring the discovery phase finds resources for niche sub-topics that would otherwise be missed.

Steps 1A.1–1A.4 — Parallel Multi-Source Discovery

Four independent discovery streams launch simultaneously via Promise.allSettled:

Stream	Source	What It Finds
YouTube	`yt-search`	Tutorial videos, lecture series, crash courses
Educational Web	Tavily Search API	Documentation, guides, blog tutorials, reference material
Academic	arXiv + Semantic Scholar	Research papers, survey papers, formal treatments
Structured	Wikipedia + University Syllabi	Section headings (concept lists), curated curricula

Each stream is fault-tolerant — a failed stream contributes zero documents but does not abort the pipeline. The Promise.allSettled pattern ensures a Tavily API outage never blocks YouTube discovery.
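
The fault-tolerance behaviour described above can be sketched roughly as follows. The stream functions passed in and the runDiscovery name are hypothetical stand-ins, not the project's actual services; only the Promise.allSettled pattern comes from the text.

```javascript
// Minimal sketch: run every discovery stream in parallel and keep whatever
// succeeds. `streams` maps a stream name to a hypothetical async function
// that resolves to an array of documents.
async function runDiscovery(streams, topic) {
  const names = Object.keys(streams);
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also captures synchronous throws.
    names.map((name) => Promise.resolve().then(() => streams[name](topic)))
  );

  const documents = [];
  const failures = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      documents.push(...result.value);
    } else {
      // A failed stream contributes zero documents; the pipeline continues.
      failures.push({ stream: names[i], reason: String(result.reason) });
    }
  });
  return { documents, failures };
}
```

Recording the failures separately, rather than discarding them, would let a later step such as the coverage profile report which source types are missing.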

Why multi-source matters: Each source type has different strengths. YouTube provides accessible worked examples but suffers from simplification bias. Academic papers provide formal rigor but lack practical demonstrations. Wikipedia provides editorial consensus on what concepts exist. University syllabi provide expert-designed prerequisite orderings. No single source is sufficient; the intersection of all four produces a robust concept space.

Step 1B — Content Validation and Filtering

All discovered documents pass through source-type-specific filters:

YouTube: Content density pre-filter on description text, then caption fetching. Videos without captions or with captions shorter than 200 characters are discarded (captions are essential for concept extraction and chunk embedding).
Web/Academic: Content density scoring using a ratio of explanatory language markers ("because", "for example", "this means") vs. declarative markers ("announces", "will launch", "according to"). Documents scoring below the density threshold are removed.
Structured: Fast-passed — Wikipedia outlines and syllabi represent editorial consensus and are always accepted if they have content.
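
As an illustration of the Web/Academic density filter, here is a minimal sketch. The marker phrases are the ones quoted above; the +1 smoothing, the threshold value of 1.5, and the function names are assumptions made for the example, since the actual scoring formula is not given here.

```javascript
const EXPLANATORY = ["because", "for example", "this means"];
const DECLARATIVE = ["announces", "will launch", "according to"];

// Count non-overlapping occurrences of each marker phrase (case-insensitive).
function countMarkers(text, markers) {
  const lower = text.toLowerCase();
  return markers.reduce((total, m) => total + lower.split(m).length - 1, 0);
}

// Ratio of explanatory to declarative markers. The +1 smoothing is an
// assumption that avoids division by zero on marker-free text.
function densityScore(text) {
  const explanatory = countMarkers(text, EXPLANATORY);
  const declarative = countMarkers(text, DECLARATIVE);
  return (explanatory + 1) / (declarative + 1);
}

// The default threshold is a hypothetical value, not the project's setting.
function passesDensityFilter(text, threshold = 1.5) {
  return densityScore(text) >= threshold;
}
```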

Step 1C — Concept Extraction Per Source Type

Each filtered document is processed to extract the concepts it teaches, their inferred dependencies, and any explicitly stated prerequisites.

Map-Reduce Strategy: Documents are split into boundary-aware semantic chunks (1200–4800 characters each, using topic-shift markers like "now let's talk about", "the next concept is", and markdown headers). Each chunk is sent to an inference model independently, and the results are merged per-document via a reduce phase that deduplicates concepts and unions dependency relationships.

Why chunk-level extraction: Sending an entire 30-minute video transcript to a model causes token truncation and dilutes signal. Boundary-aware chunking ensures each inference call receives a focused, semantically coherent unit — the same units that will later be embedded and stored for vector search.
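
A rough sketch of boundary-aware chunking under the stated size limits is shown below. The marker phrases and the 1200–4800 character range come from the description above; the merge-then-hard-split strategy and the chunkText name are assumptions, and the final chunk of a document may fall below the minimum.

```javascript
const MIN_CHARS = 1200;
const MAX_CHARS = 4800;

// Zero-width split points: markdown headers at line start and the spoken
// topic-shift phrases. Lookaheads keep the marker inside the next segment.
const BOUNDARY = /(?=^#{1,6}\s)|(?=now let's talk about)|(?=the next concept is)/im;

function chunkText(text) {
  const segments = text.split(BOUNDARY).filter((s) => s.trim().length > 0);
  const chunks = [];
  let current = "";
  for (const segment of segments) {
    // Close the running chunk at a boundary once it is long enough.
    if (current.length >= MIN_CHARS) {
      chunks.push(current);
      current = "";
    }
    current += segment;
    // Hard-split anything that grows past the maximum size.
    while (current.length > MAX_CHARS) {
      chunks.push(current.slice(0, MAX_CHARS));
      current = current.slice(MAX_CHARS);
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```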

AIMD Congestion Control: All extraction calls across all streams share a single global queue governed by an Additive Increase / Multiplicative Decrease (AIMD) algorithm, the same congestion control strategy used in TCP. On each successful call, the concurrency window grows by 0.5. On a rate-limit response (HTTP 429) or a local inference server capacity error (HTTP 400 from KV-cache exhaustion), the window is halved and the queue pauses for the provider's requested retry duration. This allows a high-throughput API key to quickly scale to 20+ concurrent calls while a free-tier key safely settles at 1–3 concurrent calls without the run ever being aborted.
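
The window arithmetic can be sketched as a small class. The +0.5 increase, the halving, and the retry pause follow the description above; the class name, the initial/min/max bounds, and the canDispatch helper are assumptions for illustration.

```javascript
// Minimal AIMD concurrency window. Bounds are hypothetical defaults.
class AimdWindow {
  constructor({ initial = 1, min = 1, max = 32 } = {}) {
    this.window = initial;
    this.min = min;
    this.max = max;
    this.pausedUntil = 0; // epoch ms before which nothing is dispatched
  }

  // Whole number of calls allowed in flight right now.
  get concurrency() {
    return Math.max(this.min, Math.floor(this.window));
  }

  // Additive increase: each successful call widens the window by 0.5.
  onSuccess() {
    this.window = Math.min(this.max, this.window + 0.5);
  }

  // Multiplicative decrease: halve the window and pause for the retry delay.
  onCongestion(retryAfterMs = 0, now = Date.now()) {
    this.window = Math.max(this.min, this.window / 2);
    this.pausedUntil = now + retryAfterMs;
  }

  canDispatch(inFlight, now = Date.now()) {
    return now >= this.pausedUntil && inFlight < this.concurrency;
  }
}
```

A queue worker would call canDispatch before starting each extraction call and report the outcome through onSuccess or onCongestion.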

Step 1D — Confidence-Weighted Concept Graph

All per-document concept maps are merged into a single, deduplicated concept graph. Each concept accumulates:

Sources: Which source types mentioned it (YouTube, academic, web, structured)
Confidence: A weighted average based on source reliability weights:
- Structured (0.90) — editorial consensus, expert curricula
- Academic (0.85) — peer-reviewed, formal terminology
- Educational Web (0.70) — practical coverage, variable quality
- YouTube (0.55) — accessible but simplification-prone

A concept confirmed by Wikipedia AND an arXiv paper AND a tutorial blog has much higher confidence than one mentioned only in a single YouTube video. This confidence score directly influences skeleton generation — high-confidence concepts are treated as confirmed curriculum material, while low-confidence concepts are flagged and included only if pedagogically necessary.
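
The exact confidence formula is not spelled out here. The sketch below shows one hypothetical normalization that is consistent with the description: the reliability weight of every distinct confirming source type is summed and divided by the total available weight, so more independent confirmations yield a higher score. Only the four weights are taken from the list above.

```javascript
const SOURCE_WEIGHTS = { structured: 0.9, academic: 0.85, web: 0.7, youtube: 0.55 };

// Hypothetical scoring: share of total reliability weight that confirms the
// concept. Repeated mentions from the same source type count once.
function conceptConfidence(sourceTypes) {
  const total = Object.values(SOURCE_WEIGHTS).reduce((a, b) => a + b, 0);
  const confirmed = [...new Set(sourceTypes)].reduce(
    (sum, type) => sum + (SOURCE_WEIGHTS[type] ?? 0),
    0
  );
  return confirmed / total;
}
```

Under this assumption, a concept seen in structured, academic, and web sources scores roughly 0.82, while one seen only on YouTube scores roughly 0.18.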

Observed vs. Inferred Prerequisites: The graph stores two types of dependency relationships. Inferred dependencies are the inference model's judgment about which other concepts a given concept requires. Observed prerequisites are explicitly stated in source text, in phrases like "assuming you already understand closures" or "prerequisite: linear algebra". Observed prerequisites are treated as high-confidence DAG edges during skeleton generation.

Step 1E — Coverage Profile

A coverage analysis determines the health of the discovery results: which source type dominates, the overall weighted confidence, whether structured/academic coverage exists, and coverage warnings (e.g., "No practical tutorial content found. Course may lack worked examples."). This profile is injected into downstream prompts so the inference model can calibrate its assumptions.

Step 1F — Embed and Store All Content Chunks

All filtered documents are chunked using boundary-aware semantic chunking, classified as pedagogical or non-pedagogical (using a three-layer strategy: regex heuristic → inference model binary classifier → fail-open fallback), and the pedagogical chunks are embedded using a vector embedding model (Voyage AI voyage-3-large, 1024 dimensions, asymmetric query/document modes) and stored in MongoDB as ContentChunk documents.

Pedagogical Classification: A three-layer classifier filters out non-educational content (channel intros, sponsor reads, call-to-action segments, outro filler) from the embedding store:

Heuristic pre-check (sync, zero cost) — regex patterns for known non-pedagogical phrases
Inference model classifier (async, API call) — binary yes/no pedagogical judgment
Fail-open fallback — if the API call fails, the chunk is included rather than dropped
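
The three layers can be sketched as a single function. The regex patterns and the classify callback (assumed to resolve to "yes" or "no") are hypothetical; the layer order and the fail-open behaviour follow the list above.

```javascript
// Hypothetical examples of non-pedagogical phrases.
const NON_PEDAGOGICAL = [
  /don't forget to subscribe/i,
  /this video is sponsored by/i,
  /smash that like button/i,
];

async function isPedagogical(chunkText, classify) {
  // Layer 1: synchronous heuristic, no API cost.
  if (NON_PEDAGOGICAL.some((pattern) => pattern.test(chunkText))) return false;
  try {
    // Layer 2: binary judgment from the inference model.
    const answer = await classify(chunkText);
    return String(answer).trim().toLowerCase().startsWith("yes");
  } catch {
    // Layer 3: fail open, so an API failure never drops usable content.
    return true;
  }
}
```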

Step 1G — Persist Concept Map and Coverage Profile

The merged concept graph, sub-domain list, source URLs, and coverage profile are persisted to the Course document. This is the knowledge base that all downstream passes operate on.

Skill Level Analysis & User Interaction Gates

The pipeline includes two human-in-the-loop interaction gates where the user makes decisions that shape the remaining generation:

Gate 1 — Skill Level Selection

After Pass 1 completes, the inference model analyzes the concept graph and coverage profile to generate 2–5 skill level options calibrated to the actual discovered content. Each option specifies:

A skill level (novice, beginner, capable, intermediate, advanced)
A topic-specific label and description
An assumedKnowledge list — concepts the learner already knows at this level
An estimated skippedNodeCount

The user selects their level, and the pipeline resumes with a feasibility check — the inference model estimates how many weeks the topic requires at the selected skill level, split into core weeks and scaffolding weeks (prerequisite ramp-up). This produces the minimum and recommended duration constraints for the duration picker.

Gate 2 — Duration Selection

The user picks their target duration (in weeks). The system computes a timePressure signal (tight, comfortable, or generous) by comparing the chosen duration to the feasibility recommendation. This signal influences resource selection — tight budgets bias toward concise text articles; generous budgets allow deeper, longer resources.
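
A minimal sketch of how such a signal could be derived is shown below. The three labels come from the text; the comparison rule and the 1.5 multiplier are assumptions, since the actual cut-offs are not stated.

```javascript
// Hypothetical thresholds: below the recommendation is "tight"; at least
// generousFactor times the recommendation is "generous".
function timePressure(chosenWeeks, recommendedWeeks, generousFactor = 1.5) {
  if (chosenWeeks < recommendedWeeks) return "tight";
  if (chosenWeeks >= recommendedWeeks * generousFactor) return "generous";
  return "comfortable";
}
```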

These two gates transform the pipeline from a one-size-fits-all generator into a learner-calibrated engine. A novice studying "Machine Learning" for 12 weeks gets a fundamentally different curriculum than an intermediate learner studying it for 4 weeks.

Prerequisite Augmentation

For novice and beginner learners only, the pipeline runs a targeted prerequisite discovery pass before skeleton generation. It examines the concept graph's observedAssumes chains, asks the inference model which prerequisites are pedagogically critical for this learner profile, and runs focused YouTube + web searches for foundational content on those prerequisites. The discovered prerequisite material is filtered, chunked, embedded, and added to the vector store so that the skeleton and RAG-fill passes have material to draw from for scaffold nodes.

Why this matters: Without prerequisite augmentation, a novice studying "Distributed Systems" would have a skeleton that references concepts like "consensus algorithms" but the vector store would have no content about "network protocols" or "client-server architecture" — the foundational material the novice actually needs first.

Pass 2 — Grounded Skeleton Generation

Pass 2 generates the curriculum DAG — the structural backbone of the course. The inference model receives:

The full concept map with confidence scores
The learner's assumed knowledge (from skill level selection)
Time budget constraints (duration × 10 hours/week)
A scaffolding directive (deep for novices, moderate for beginners, none for higher levels)
Sub-domain organization from Pass 1

It produces a chapter-organized DAG where each node has:

A title, type (concept, skill, project, assessment), and learning objective
Key terms, estimated duration, prerequisite node references
isAssumedKnowledge flag for nodes the learner can skip
conceptConfidenceLow flag for concepts backed only by low-reliability sources
optional flag for nodes that can be cut under time pressure
nodeRole — scaffold (prerequisite) or core (main topic)
A designNote explaining why this node is a distinct topic

The skeleton also includes a skeletonReasoning — a 1–2 paragraph explanation of the curriculum's pedagogical progression, which is persisted and injected into all downstream inference calls.

DAG Validation: The generated skeleton is validated for cycles using iterative DFS, topologically sorted using a chapter-aware variant of Kahn's algorithm (which preserves chapter ordering when multiple nodes have in-degree zero), and persisted as TopicNode documents with their learning objective embeddings.
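
The two graph steps can be sketched as follows. The node shape ({ id, chapter, prerequisites }) and the function names are assumptions; the sketch also assumes every prerequisite ID refers to a node in the list, and it uses the lowest chapter number as the tie-break among ready nodes.

```javascript
// Iterative DFS with an explicit stack; state 1 = on stack, 2 = finished.
function hasCycle(nodes) {
  const prereqs = new Map(nodes.map((n) => [n.id, n.prerequisites]));
  const state = new Map();
  for (const start of prereqs.keys()) {
    if (state.has(start)) continue;
    state.set(start, 1);
    const stack = [[start, 0]];
    while (stack.length > 0) {
      const frame = stack[stack.length - 1];
      const [id, next] = frame;
      const edges = prereqs.get(id) ?? [];
      if (next < edges.length) {
        frame[1] = next + 1;
        const target = edges[next];
        if (state.get(target) === 1) return true; // back edge: cycle found
        if (!state.has(target)) {
          state.set(target, 1);
          stack.push([target, 0]);
        }
      } else {
        state.set(id, 2);
        stack.pop();
      }
    }
  }
  return false;
}

// Kahn's algorithm; ties among in-degree-zero nodes go to the lowest chapter.
function topoSort(nodes) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const inDegree = new Map(nodes.map((n) => [n.id, n.prerequisites.length]));
  const dependents = new Map(nodes.map((n) => [n.id, []]));
  for (const n of nodes) {
    for (const p of n.prerequisites) dependents.get(p).push(n.id);
  }
  const ready = nodes.filter((n) => n.prerequisites.length === 0).map((n) => n.id);
  const order = [];
  while (ready.length > 0) {
    ready.sort((a, b) => byId.get(a).chapter - byId.get(b).chapter);
    const id = ready.shift();
    order.push(id);
    for (const d of dependents.get(id)) {
      inDegree.set(d, inDegree.get(d) - 1);
      if (inDegree.get(d) === 0) ready.push(d);
    }
  }
  if (order.length !== nodes.length) throw new Error("cycle detected");
  return order;
}
```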

Pass 3 — RAG-Based Resource Filling

Pass 3 walks the DAG in topological order and assigns a real learning resource to each node using Retrieval-Augmented Generation:

Vector Search: The node's learning objective embedding is used for approximate nearest-neighbor search against all stored ContentChunk embeddings (via Atlas Vector Search, with an in-memory cosine similarity fallback for non-Atlas deployments). Results are grouped by source, and scores are adjusted by sourceConfidence and penalized if the resource was already assigned to a prior node.
Video Duration Filtering: YouTube candidates are enriched with metadata (duration, channel name, thumbnail) via yt-search, and videos whose duration falls outside 0.5×–3× the node's estimated duration are penalized.
LLM Candidate Scoring: The top candidates are presented to an inference model along with the node's learning objective, key terms, prior node objectives, the resource assignment log, skeleton reasoning, the node's design note, and time pressure / node role signals. The model selects the best-fit resource and returns a fitScore, conceptsTaught, conceptsAssumed, concept connections, and reasoning.
Multi-Source Gap-Fill Search: If the vector store yields no candidate above the minimum fit threshold (0.65), a cascading gap-fill search launches:
- Web articles via Tavily (with full content extraction)
- YouTube videos via yt-search (scored by the inference model)
- Academic papers via arXiv + Semantic Scholar
Each gap-fill candidate passes through the three-tier filter (algorithmic scoring → Jina + Cohere rerankers → inference model spot-check) before being considered.
Resource Persistence: Accepted resources are persisted as Resource documents with full metadata — fit score, concepts taught, assignment reasoning, reading guides (generated by the inference model for text articles), and concept connections.
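
The score adjustments in the steps above can be sketched roughly as below. The 0.5×–3× duration window and the 0.65 fit threshold come from the text; multiplying by sourceConfidence, the two penalty sizes, and all field names are assumptions made for illustration.

```javascript
const MIN_FIT = 0.65;

// candidate: { similarity, sourceConfidence, sourceId, durationMinutes? }
// node: { estimatedMinutes }
function adjustScore(candidate, node, assignedSourceIds, options = {}) {
  const { reusePenalty = 0.15, durationPenalty = 0.1 } = options; // hypothetical sizes
  let score = candidate.similarity * candidate.sourceConfidence;
  // Penalize resources already assigned to an earlier node.
  if (assignedSourceIds.has(candidate.sourceId)) score -= reusePenalty;
  // Penalize videos outside 0.5x to 3x of the node's estimated duration.
  if (candidate.durationMinutes != null) {
    const ratio = candidate.durationMinutes / node.estimatedMinutes;
    if (ratio < 0.5 || ratio > 3) score -= durationPenalty;
  }
  return score;
}

// Gap-fill is needed when no candidate reaches the minimum fit score
// (treating "at or above" as passing is an assumption).
function needsGapFill(fitScores) {
  return !fitScores.some((score) => score >= MIN_FIT);
}
```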

Pass 3.5 — Duplicate Resource Checker

After resource filling, an algorithmic scan detects cases where the same resource (identified by YouTube video ID or article URL) was assigned to multiple nodes. Each duplicate group is classified by severity:

Critical: Two nodes share all their resources — they may be redundant nodes
High: Two single-resource nodes share that one resource
Moderate: Partial overlap with other unique resources

For each duplicate group, the inference model receives full context (both nodes' design notes, learning objectives, assignment reasoning, and available alternative candidates from the vector store) and makes a three-way decision:

Reassign: Replace the duplicate with an alternative resource on one node
Merge: Combine the two nodes into one (with DAG surgery — prerequisite re-pointing, resource migration, topological re-sort)
Confirm: The duplication is intentional (same resource genuinely serves both nodes)

This runs for up to 2 rounds to catch cascading duplicates created by reassignments.
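
The algorithmic scan that precedes the model's decision can be sketched as follows. The input shape (a Map from node ID to resource keys) and the function names are assumptions. The severity rules follow the definitions above, with one added assumption: when both nodes hold only the shared resource, the pair is reported as High rather than Critical.

```javascript
// assignments: Map of nodeId -> array of resource keys
// (YouTube video ID or article URL).
function findDuplicateGroups(assignments) {
  const nodesByResource = new Map();
  for (const [nodeId, keys] of assignments) {
    for (const key of new Set(keys)) {
      if (!nodesByResource.has(key)) nodesByResource.set(key, []);
      nodesByResource.get(key).push(nodeId);
    }
  }
  const groups = [];
  for (const [resourceKey, nodeIds] of nodesByResource) {
    if (nodeIds.length > 1) groups.push({ resourceKey, nodeIds });
  }
  return groups;
}

// Severity for two nodes already known to share at least one resource.
function pairSeverity(keysA, keysB) {
  const a = new Set(keysA);
  const b = new Set(keysB);
  const identical = a.size === b.size && [...a].every((k) => b.has(k));
  if (identical && a.size === 1) return "high";
  if (identical) return "critical";
  return "moderate";
}
```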

Pass 4 — Coherence Validation

The final pass presents the entire ordered curriculum — every node's title, type, learning objective, design note, assigned resources with fit scores and reasoning — to the inference model for a holistic review. It identifies:

Knowledge gaps: Nodes where the assigned resource doesn't adequately cover the learning objective
Redundancies: Node pairs that overlap significantly
Ordering issues: Nodes that should appear earlier or later in the sequence

For every flagged gap (including gaps from Pass 3 where no resource met the fit threshold), the pipeline runs a targeted refetch — a fresh vector search biased by the gap description, followed by the full multi-source gap-fill cascade if needed. Resolved gaps clear their coherenceFlags; unresolved gaps remain flagged for the user.

Finally, the pipeline computes the initial UserProgress state — marking assumed-knowledge nodes as completed and unlocking the first available node via topological traversal.
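
A minimal sketch of that initial state, assuming the nodes arrive in topological order: the isAssumedKnowledge flag is the one described in Pass 2, while the completedNodes and unlockedNodes field names and the function name are hypothetical.

```javascript
// orderedNodes: [{ id, isAssumedKnowledge, prerequisites: [id, ...] }]
function initialProgress(orderedNodes) {
  const completed = new Set(
    orderedNodes.filter((n) => n.isAssumedKnowledge).map((n) => n.id)
  );
  // First node in topological order that is not yet completed and whose
  // prerequisites are all completed.
  const first = orderedNodes.find(
    (n) => !completed.has(n.id) && n.prerequisites.every((p) => completed.has(p))
  );
  return {
    completedNodes: [...completed],
    unlockedNodes: first ? [first.id] : [],
  };
}
```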

Context Persistence Across Pipeline Stages

A critical design challenge in multi-pass inference pipelines is that each model call is stateless, but curriculum design is inherently a stateful process — resource selection for node 15 should be informed by what was already assigned to nodes 1–14.

CourseForge solves this through context persistence — a set of data structures that capture the reasoning from earlier pipeline stages and inject relevant context into every downstream prompt:

Context Signal	Generated In	Injected Into	Purpose
`skeletonReasoning`	Pass 2	Pass 3, 3.5, 4	Why the curriculum is structured this way
`designNote` (per node)	Pass 2	Pass 3	Why this node is a distinct topic
`assignmentLog` (running)	Pass 3	Pass 3 (subsequent nodes), Pass 3.5	What was already assigned and why
`assignmentReasoning` (per resource)	Pass 3	Pass 4	Why each resource was chosen
`timePressure`	Gate 2	Pass 3	Resource type bias (concise articles vs. deep videos)
`nodeRole`	Pass 2	Pass 3	Scaffold nodes prefer video; core nodes flex by time pressure

This allows the inference model to act as a coherent reviewer of its own prior decisions — detecting intentional vs. accidental resource reuse, understanding why neighboring nodes are separate topics, and evaluating whether the executed curriculum matches the intended pedagogical progression.
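
To make the injection concrete, here is a sketch of how a Pass 3 scoring prompt might be assembled from the persisted signals. The signal names match the table above; the prompt wording, the shape of the log entries, and the buildScoringPrompt name are assumptions.

```javascript
// context: { skeletonReasoning, timePressure, assignmentLog: [{ nodeTitle,
// resourceTitle, reasoning }] }. Entry fields are hypothetical.
function buildScoringPrompt(node, candidates, context) {
  const log = context.assignmentLog
    .map((e) => `- ${e.nodeTitle}: ${e.resourceTitle} (${e.reasoning})`)
    .join("\n");
  const options = candidates.map((c, i) => `${i + 1}. ${c.title}`).join("\n");
  return [
    `Curriculum rationale: ${context.skeletonReasoning}`,
    `Node: ${node.title} (role: ${node.nodeRole})`,
    `Design note: ${node.designNote}`,
    `Time pressure: ${context.timePressure}`,
    `Already assigned:\n${log || "- nothing yet"}`,
    `Candidates:\n${options}`,
  ].join("\n\n");
}
```

After each assignment, the caller would append a new entry to assignmentLog so that the next node's prompt sees it.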

Multi-Model Architecture (BYOM)

CourseForge implements a Bring Your Own Model (BYOM) architecture where the user configures separate inference model credentials for four distinct pipeline tiers:

Tier	Pipeline Stages	Computational Profile
Chunk Classification	Pedagogical classification, content density	High volume (hundreds to thousands of calls), binary yes/no, latency-sensitive
Spot Checking	Resource filter spot-checks	Medium volume, short structured output
Concept Extraction	Per-chunk concept extraction (Step 1C)	High volume, moderate reasoning depth, AIMD-controlled
Course Generation	Feasibility, skeleton, resource scoring, coherence, duplicate review	Low volume, deep reasoning, long structured output

Each tier is wired to a createLLMClient instance from the unified LLM abstraction layer (llmService.js), which provides a single complete(prompt, options) interface over 11 supported providers:

Cloud: OpenAI, Anthropic (Claude), Google Gemini, Mistral, Groq, OpenRouter
Cloud Inference: OpenAI-compatible inference endpoints
Local: LM Studio, Ollama
Enterprise: Amazon Bedrock (Converse API with native AWS Sig V4 signing — no SDK required)

All providers are implemented via raw fetch() calls — no provider SDKs are used. The abstraction handles:

Retry with exponential backoff for transient errors (429, 502, 503, 504) — up to 10 attempts with full-jitter backoff
Multi-strategy JSON parsing — markdown fence stripping, regex block extraction, JavaScript-to-JSON normalization (for local models that return single-quoted keys or trailing commas)
Thinking model support — <think>...</think> tag stripping for reasoning models (DeepSeek-R1, Qwen3), minimum token floor enforcement for local inference servers
Malformed JSON retry — automatic re-prompt with stricter instructions on first parse failure

This architecture means a user can route high-volume classification calls through a fast, cheap model while using a more capable model for the creative skeleton generation — optimizing cost and quality simultaneously.
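
Two of the behaviours listed above, response parsing and full-jitter backoff, can be sketched as below. This is an illustrative reduction rather than the contents of llmService.js: it handles think-tag stripping, fence stripping, block extraction, and trailing commas, but not single-quoted keys, and the function names and default delays are assumptions.

```javascript
const FENCE = "`".repeat(3); // markdown code fence marker
const FENCED = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Remove reasoning-model scratchpad output before parsing.
function stripThinking(text) {
  return text.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

// Try progressively looser strategies until one yields valid JSON.
function parseModelJson(raw) {
  const text = stripThinking(raw);
  const candidates = [text];
  const fenced = text.match(FENCED);
  if (fenced) candidates.push(fenced[1]);
  const block = text.match(/\{[\s\S]*\}|\[[\s\S]*\]/);
  if (block) candidates.push(block[0]);
  for (const candidate of candidates) {
    // Second attempt drops trailing commas before a closing brace or bracket.
    for (const attempt of [candidate, candidate.replace(/,\s*([}\]])/g, "$1")]) {
      try {
        return JSON.parse(attempt);
      } catch {
        // fall through to the next strategy
      }
    }
  }
  throw new SyntaxError("no parseable JSON in model output");
}

// Full jitter: a uniform delay in [0, min(cap, base * 2^attempt)).
function fullJitterDelay(attempt, baseMs = 500, capMs = 30000, random = Math.random) {
  return random() * Math.min(capMs, baseMs * 2 ** attempt);
}
```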

Data Structures & Models

Course

The top-level document representing a generated course. Stores the topic, user preferences, generation progress, concept map (with full confidence-weighted graph and coverage profile), skill level options, feasibility results, selected duration, skeleton reasoning, coherence report, and a timestamped generation log.

TopicNode

A single node in the curriculum DAG. Contains the title, type, learning objective, key terms, prerequisite references (as ObjectId edges), topological index, estimated duration, design note, node role (scaffold/core), optional flag, assumed knowledge flag, concept confidence flag, coherence flags, and a 1024-dimensional objective embedding vector for semantic matching.

ContentChunk

A semantically coherent text segment extracted from a discovered resource. Stores the source type, source identifiers (video ID or article URL), raw text, a 1024-dimensional embedding vector, pedagogical classification, chunk index, and source confidence weight. These documents are the retrieval targets for Atlas Vector Search during Pass 3.

Resource

A learning resource assigned to a specific TopicNode. Polymorphic — stores YouTube-specific metadata (video ID, thumbnail, channel name, duration) or article-specific metadata (URL, site name, raw content, reading guide, estimated read time). Also stores the assignment reasoning, fit score, concepts taught, concepts assumed, and concept connections.

UserProgress

Tracks the learner's advancement through the DAG: completed nodes, unlocked nodes, and current position. The initial state is computed by the pipeline (assumed-knowledge nodes start completed; the first topologically reachable node is unlocked).

User

Stores authentication (Google OAuth), profile information, and the BYOM configuration — four separate model configs (provider, model ID, API key) for the four pipeline tiers.

Key Services & Components

Backend Services

Service	Responsibility
`generationService.js`	Pipeline orchestrator — coordinates all passes, manages SSE streaming, implements feasibility validation, skeleton generation, RAG-fill, duplicate checking, and coherence validation
`llmService.js`	Unified LLM abstraction — provider-agnostic `complete()` interface with retry, JSON parsing, thinking model support
`conceptGraphService.js`	Concept extraction (Map-Reduce over chunks), AIMD queue, concept graph merge, coverage profile computation
`embeddingService.js`	Vector embeddings (Voyage AI), boundary-aware semantic chunking, pedagogical classification, Atlas Vector Search with in-memory fallback
`filterService.js`	Three-tier resource filter — algorithmic scoring (URL structure, page structure, content density, video channel signals), classifier layer (Jina Reranker + Cohere Rerank), LLM spot-check
`graphService.js`	Pure graph algorithms — iterative DFS cycle detection, chapter-aware topological sort (Kahn's), unlocked node computation, shortest prerequisite path (BFS + topo-sort)
`searchService.js`	Tavily web search integration for educational content discovery
`youtubeService.js`	YouTube search and caption extraction
`academicService.js`	arXiv and Semantic Scholar paper discovery
`structuredReferenceService.js`	Wikipedia outline and university syllabus extraction

Frontend

The React frontend (built with Vite, Tailwind CSS, Framer Motion) provides:

Dashboard — course cards with progress tracking
Course Viewer — split-panel curriculum navigation with a DAG-based node progression system, content rendering (video embeds, article reader with markdown support, PDF rendering with arXiv probe-and-fallback), and tabbed multi-resource views
Settings — BYOM model configuration for all four pipeline tiers
Real-time Progress — SSE-driven generation progress streaming with phase-specific status messages

Technology Stack

Backend

Runtime: Node.js with Express
Database: MongoDB with Mongoose ODM
Vector Search: MongoDB Atlas Vector Search (cosine similarity, 1024-dim) with in-memory fallback
Embeddings: Voyage AI (voyage-3-large) — asymmetric query/document embedding modes
Search APIs: Tavily (web), yt-search (YouTube), arXiv API, Semantic Scholar API
Rerankers: Jina Reranker v2, Cohere Rerank v3.5
Authentication: Google OAuth 2.0 with JWT session tokens
Real-time: Server-Sent Events (SSE) for generation progress streaming
Inference Models: Provider-agnostic via unified abstraction (OpenAI, Anthropic, Gemini, Mistral, Groq, OpenRouter, LM Studio, Ollama, Amazon Bedrock)

Frontend

Framework: React 19 with Vite
Styling: Tailwind CSS 4
Animations: Framer Motion
Icons: Lucide React
SSE Client: @microsoft/fetch-event-source
Layout: react-resizable-panels for split-panel course viewer
Routing: React Router v7
Markdown: react-markdown with remark-gfm

Project Structure

CourseForge2/
├── client/                          # React frontend (Vite)
│   └── src/
│       ├── api/                     # Axios API client
│       ├── components/
│       │   ├── course/              # Course viewer components
│       │   └── dashboard/           # Dashboard components
│       ├── context/                 # React context providers
│       ├── pages/                   # Route pages (Dashboard, CourseViewer, Login, Settings)
│       └── utils/                   # Client utilities
│
├── server/                          # Express backend
│   ├── index.js                     # Server entry point
│   └── src/
│       ├── config/                  # Database and environment config
│       ├── controllers/             # Route handlers (auth, course, graph)
│       ├── middleware/              # Auth, validation, error handling
│       ├── models/                  # Mongoose schemas (Course, TopicNode, ContentChunk, Resource, User, UserProgress)
│       ├── routes/                  # Express route definitions
│       └── services/                # Core business logic (generation, LLM, embedding, filter, graph, search, etc.)
│
└── CourseForge_sequence_of_proposals/  # Historical design documents (development context)

Getting Started

Prerequisites

Node.js 18+
MongoDB (local or Atlas; Atlas is required for Atlas Vector Search, while local deployments use the in-memory cosine similarity fallback)
At least one inference model API key (see BYOM configuration)

Environment Variables

Server (server/.env):

MONGODB_URI=mongodb+srv://...
JWT_SECRET=your-jwt-secret
GOOGLE_CLIENT_ID=your-google-oauth-client-id

# Embedding & Classification (server-managed)
VOYAGE_API_KEY=your-voyage-api-key

# Search APIs
TAVILY_API_KEY=your-tavily-api-key

# Rerankers (used in resource filtering)
JINA_API_KEY=your-jina-api-key
COHERE_API_KEY=your-cohere-api-key

Client (client/.env):

VITE_API_URL=http://localhost:3001
VITE_GOOGLE_CLIENT_ID=your-google-oauth-client-id

Installation

# Install server dependencies
cd server && npm install

# Install client dependencies
cd ../client && npm install

Running

# Start the backend (with file watching)
cd server && npm run dev

# Start the frontend (in a separate terminal)
cd client && npm run dev

Atlas Vector Search Setup

If using MongoDB Atlas, create a Vector Search index on the contentchunks collection:

Index name: content_embed_index
Index type: Atlas Vector Search (NOT Atlas Search)

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "courseId"
    }
  ]
}

If running locally without Atlas, the system automatically falls back to in-memory cosine similarity search.
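
For orientation, a query against that index and the in-memory fallback might look like the sketch below. The index name, path, and courseId filter match the definition above, and $vectorSearch is the standard Atlas aggregation stage; the projected field names, the numCandidates multiplier, and the function names are assumptions.

```javascript
// Aggregation pipeline for Atlas Vector Search, scoped to one course.
function buildVectorSearchPipeline(queryVector, courseId, limit = 10) {
  return [
    {
      $vectorSearch: {
        index: "content_embed_index",
        path: "embedding",
        queryVector,
        numCandidates: limit * 10, // hypothetical oversampling factor
        limit,
        filter: { courseId: { $eq: courseId } },
      },
    },
    // `text` and `sourceType` are assumed field names on ContentChunk.
    { $project: { text: 1, sourceType: 1, score: { $meta: "vectorSearchScore" } } },
  ];
}

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Fallback for non-Atlas deployments: brute-force scan over loaded chunks.
function inMemorySearch(queryVector, chunks, limit = 10) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

With Mongoose, the pipeline would typically be passed to the model's aggregate() method; the brute-force fallback is linear in the number of chunks, which is usually acceptable for a single course.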

License

This project is private and not currently licensed for distribution.
