ScholarAI
An Advanced AI-Powered Academic Research Assistant System
The academic landscape is overwhelming. Researchers, students, and educators spend countless hours manually parsing through dense research papers, struggling to find relevant information, summarize findings, and connect concepts across multiple documents. We built ScholarAI to solve this exact problem. Our goal was to create an intelligent system capable of understanding complex academic texts, enabling rapid knowledge retrieval, and ultimately accelerating the research process.
We are a dedicated team of student developers passionate about leveraging artificial intelligence to solve real-world educational challenges. This project was developed as a comprehensive semester project, demonstrating our ability to architect and deploy complex, full-stack AI applications.
The Team:
- Godfrey - Full Stack & AI Integrations
- Grish Narayanan S - Backend Architecture & DevOps
- Hariprakash - Frontend Experience & UI/UX
- Handling Complex PDF Structures: Parsing academic papers with multi-column layouts, tables, and mathematical formulas required extensive tuning of PyMuPDF and PyPDF combinations.
- Context Window Limitations: Designing an effective Retrieval-Augmented Generation (RAG) pipeline to feed relevant context to LLMs without exceeding token limits or losing critical information.
- State Management in React: Ensuring real-time synchronization between the document upload state, processing status, and chat interface using Zustand and React Query.
- Performance Optimization: Vector search over large document collections using ChromaDB needed significant optimization to provide sub-second query responses.
We adopted a modern, decoupled architecture. The backend is powered by Python and FastAPI, chosen for its asynchronous capabilities and seamless integration with the Python AI ecosystem (LangChain, Groq, Google GenAI). We utilized ChromaDB for local vector storage, allowing us to keep embedding and retrieval fast and private. The frontend is a React 19 application built with Vite, TypeScript, and Tailwind CSS, offering a highly responsive and visually appealing user interface. We utilized Framer Motion to provide smooth transitions and feedback during long-running tasks like document processing.
- Security: Implemented robust JWT-based authentication stored securely in HttpOnly cookies. Rate limiting and input validation are enforced at the API gateway level to prevent abuse. Role-Based Access Control (RBAC) ensures isolated document access between users.
- UX: Prioritized a clean, distraction-free reading and chatting interface. Uploads are handled via drag-and-drop with real-time progress indicators. The system supports a "dark mode" tailored for long reading sessions, reducing eye strain.
- Deepened our understanding of Vector Databases and semantic search methodologies.
- Mastered asynchronous programming in Python with FastAPI and ARQ.
- Learned the intricacies of managing complex React state and optimizing rendering performance in large applications.
- Gained practical experience with advanced prompt engineering and LLM orchestration.
- Collaborative Workspaces: Allow multiple researchers to annotate and chat over shared document collections.
- Citation Generation: Automatic extraction and formatting of citations in various styles (APA, MLA, IEEE).
- Knowledge Graph Integration: Visualizing relationships between different papers and concepts.
- Cloud Deployment: Fully containerized Kubernetes deployment for high availability.
"Building ScholarAI has been an incredible journey. We pushed the boundaries of what we thought was possible within a single semester. We hope this system serves as a powerful tool for anyone navigating the complex world of academic research. Dive in, explore the code, and feel free to contribute!" - The ScholarAI Team
- Project Overview
- Technology Stack
- System Architecture
- Core Workflows
- Directory Structure
- Installation & Setup
- Configuration
- API Documentation
- Deployment Guide
- Security & Privacy
- Scalability Considerations
- Testing
- Troubleshooting
- Contributing
- License
- Acknowledgments
- FAQs
ScholarAI is a comprehensive Web Application and AI Platform designed specifically for the academic and research community. It acts as an intelligent assistant, enabling users to interact directly with their research documents through advanced natural language processing.
EdTech, Academic Research, Artificial Intelligence.
- University Students (Undergraduates, Postgraduates, PhD Candidates)
- Academic Researchers and Professors
- R&D Professionals in Corporate Sectors
- Anyone who needs to rapidly digest large volumes of technical or academic literature.
To drastically reduce the time spent reading, summarizing, and extracting key information from academic papers, PDFs, and textbooks by providing a conversational interface powered by state-of-the-art Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
The project utilizes a modern, robust, and highly scalable technology stack.
- Core: React 19, TypeScript
- Build Tool: Vite
- Styling: Tailwind CSS 4.x, Framer Motion (Animations)
- State Management: Zustand (Global State), React Query (Server State/Caching)
- Routing: React Router DOM
- Forms & Validation: React Hook Form, Zod
- Markdown & Syntax Highlighting: React Markdown, React Syntax Highlighter
- Framework: FastAPI (Python)
- Server: Uvicorn
- Background Jobs: ARQ (Async Redis Queue)
- Data Validation: Pydantic
- Document Processing: PyPDF, PyMuPDF (Fitz), pdf2image, pytesseract
- Web Scraping/Parsing: BeautifulSoup4, HTTPX
- Primary Data Store: MongoDB (Motor Async Driver)
- Vector Database: ChromaDB (Local vector storage)
- Caching & Queue Broker: Redis
- Strategy: JWT (JSON Web Tokens) via PyJWT
- Password Hashing: bcrypt, passlib
- Session Management: Cookie-based authentication (connect.sid)
- Orchestration: LangChain (Text splitters, Prompt management)
- LLM Providers:
- Groq (Ultra-fast inference)
- Google GenAI (Gemini integration)
- OpenAI
- Anthropic
- Embeddings: Sentence Transformers, HuggingFace Hub
- Proxy/Reverse Proxy: Nginx (Configuration included)
- Environment Management: python-dotenv
ScholarAI follows a modern client-server architecture with specialized micro-services for handling AI inference and document processing.
graph TD
Client[Web Browser Client\nReact/Vite] -->|HTTPS Requests| Proxy[Nginx / API Gateway]
Proxy -->|Static Assets| FrontendStore[Static File Hosting]
Proxy -->|REST API| BackendAPI[FastAPI Backend Server]
BackendAPI -->|Authentication| AuthModule[Auth Service]
BackendAPI -->|Document Upload| Storage[Local/Cloud File Storage]
BackendAPI -->|Async Tasks| RedisQueue[Redis Message Queue]
RedisQueue --> Worker[ARQ Background Worker]
Worker -->|Extract Text| Parsing[PDF Parsing Engine]
Worker -->|Generate Embeddings| EmbeddingModel[Sentence Transformers]
Worker -->|Store Vectors| Chroma[ChromaDB Vector Store]
BackendAPI -->|Query Vector| Chroma
BackendAPI -->|Construct Prompt| RAG[RAG Pipeline]
RAG -->|Inference| LLM[LLM Providers: Groq, Google, OpenAI]
AuthModule -->|User Data| MongoDB[(MongoDB Database)]
BackendAPI -->|Metadata| MongoDB
- Presentation Layer (Frontend): Handles user interactions, document uploads, and displaying chat responses. Communicates with the backend exclusively via RESTful APIs.
- API Layer (Backend): Built with FastAPI, it routes requests, authenticates users, and orchestrates the flow of data between the database, vector store, and AI models.
- Processing Layer (Workers): Heavy tasks like parsing 100-page PDFs and generating embeddings are offloaded to background workers using ARQ and Redis. This ensures the main API remains responsive.
- Data Layer:
- MongoDB: Stores user profiles, billing plans, document metadata, and chat history.
- ChromaDB: Stores the mathematical representations (embeddings) of text chunks extracted from documents, enabling semantic similarity search.
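To make "semantic similarity search" concrete, here is a minimal sketch of ranking stored chunk embeddings against a query embedding. It uses cosine similarity, which is one common measure; the metric the project's ChromaDB collection actually uses is not stated here. The names `cosine_similarity`, `top_k`, and `TOY_CHUNKS` are hypothetical and exist only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], chunks: dict, k: int = 2) -> list:
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumption: real sentence-transformer
# vectors have hundreds of dimensions and come from the embedding model).
TOY_CHUNKS = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for chunk_id, score in top_k([1.0, 0.1, 0.0], TOY_CHUNKS, k=2):
        print(chunk_id, round(score, 3))
```

A vector database performs the same kind of ranking, but with an index so that it does not have to score every stored chunk on each query.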
This sequence diagram illustrates what happens when a user uploads a new academic paper.
sequenceDiagram
participant U as User
participant F as Frontend
participant API as FastAPI Backend
participant DB as MongoDB
participant Q as Redis Queue
participant W as Worker
participant V as ChromaDB
U->>F: Uploads PDF Document
F->>API: POST /documents/upload (Multipart Form)
API->>API: Validate File & User Quota
API->>DB: Save Document Metadata (Status: Pending)
API->>Q: Enqueue 'process_document' Job
API-->>F: Return Document ID & Status
Note over Q,W: Asynchronous Processing
Q->>W: Process Job
W->>W: Extract Text (PyMuPDF)
W->>W: Chunk Text (LangChain)
W->>W: Generate Embeddings
W->>V: Store Chunks + Embeddings
W->>DB: Update Document Status (Status: Completed)
F->>API: Poll /documents/{id}/status
API-->>F: Status: Completed
F-->>U: Display Ready Notification
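The "Chunk Text" step in the worker can be pictured with a small sliding-window splitter. This is only a sketch of the idea: the project uses LangChain's text splitters, which also split on separators such as paragraphs and sentences, while the hypothetical `chunk_text` below splits purely by character count. The default sizes are assumptions, not the project's settings.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters.

    The overlap keeps a sentence that straddles a boundary visible in both
    neighbouring chunks, so retrieval is less likely to lose its context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk would then be embedded and stored together with its metadata (for example, the page number) so that answers can cite their sources.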
When a user asks a question about their uploaded document, the following process occurs:
flowchart TD
A[User Inputs Question] --> B(Frontend sends POST request to /chat)
B --> C{API Validates Request & Auth}
C --> D[Generate Embedding for User Question]
D --> E[Search ChromaDB for Similar Vectors]
E --> F[Retrieve Top-K Relevant Chunks]
F --> G[Construct Prompt with Context & Question]
G --> H[Send Prompt to LLM e.g., Groq/Gemini]
H --> I[Receive LLM Response]
I --> J[Save Chat to MongoDB]
J --> K[Return Response to Frontend]
K --> L[Render Answer with Citations]
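The "Construct Prompt with Context & Question" step is where the context-window limit matters. The sketch below packs the highest-scoring chunks into a fixed budget and skips any chunk that would overflow it. It is an illustration under assumptions: `RetrievedChunk`, `build_prompt`, the prompt wording, and the character-based budget (a stand-in for a real token count) are all hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    page_number: int
    text: str
    score: float  # higher means more similar to the question


def build_prompt(question: str, chunks: list, max_context_chars: int = 2000) -> str:
    """Assemble a RAG prompt from the best-scoring chunks that fit the budget."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)

    selected = []
    used = 0
    for chunk in ranked:
        entry = f"[page {chunk.page_number}] {chunk.text}"
        if used + len(entry) > max_context_chars:
            continue  # this chunk would overflow the budget; try the next one
        selected.append(entry)
        used += len(entry)

    context = "\n".join(selected) if selected else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "Cite page numbers for each claim.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    demo = [
        RetrievedChunk(12, "decoherence times remain a significant hurdle", 0.9),
        RetrievedChunk(3, "x" * 50, 0.5),
    ]
    print(build_prompt("What limits qubits?", demo, max_context_chars=60))
```

Keeping the page number next to each chunk is what allows the final answer to be rendered with citations.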
The project is structured as a monorepo containing both the frontend and backend applications.
ScholarAI-AcademicAssistant/
│
├── backend/                     # Python FastAPI Backend
│   ├── app/                     # Application Logic
│   │   ├── api/                 # API Route Handlers
│   │   ├── core/                # Configuration, Security, Logging
│   │   ├── models/              # Pydantic & MongoDB Models
│   │   ├── services/            # Business Logic (RAG, Auth, Processing)
│   │   └── main.py              # FastAPI Application Entrypoint
│   ├── chroma_db/               # Local Vector Database Storage (Auto-generated)
│   ├── tests/                   # Pytest Test Suites
│   ├── run.py                   # Development Server Runner
│   ├── requirements.txt         # Python Dependencies
│   └── pytest.ini               # Pytest Configuration
│
├── frontend/                    # React Frontend
│   ├── src/                     # React Source Code
│   │   ├── components/          # Reusable UI Components
│   │   ├── pages/               # Page/Route Components
│   │   ├── store/               # Zustand State Management
│   │   ├── hooks/               # Custom React Hooks
│   │   ├── lib/                 # Utility Functions (Axios instances, helpers)
│   │   └── App.tsx              # Main Application Component
│   ├── public/                  # Static Assets (Images, Icons)
│   ├── package.json             # Node Dependencies & Scripts
│   ├── vite.config.ts           # Vite Configuration
│   ├── tsconfig.json            # TypeScript Configuration
│   └── nginx.conf               # Nginx Reverse Proxy Config
│
├── Docs/                        # Additional Documentation
├── start.bat                    # Windows Startup Script
├── start.sh                     # Unix/Linux Startup Script
├── openapi.yaml                 # Complete OpenAPI Specification
└── README.md                    # Project Documentation (This File)
Follow these comprehensive steps to set up the ScholarAI development environment on your local machine.
Ensure you have the following installed on your system:
- Node.js (v18.0.0 or higher) - Required for the frontend.
- Python (v3.10.0 or higher) - Required for the backend.
- MongoDB - Running locally or accessible via a cloud URI (e.g., MongoDB Atlas).
- Redis - Running locally or accessible via a network URI.
- Git - Version control system.
git clone https://github.com/godfrey/ScholarAI-AcademicAssistant.git
cd ScholarAI-AcademicAssistant
- Navigate to the backend directory:
cd backend
- Create and activate a virtual environment (highly recommended):
# On Windows
python -m venv venv
venv\Scripts\activate
# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
- Install the required Python dependencies:
pip install -r requirements.txt
- Configure Environment Variables: Create a .env file in the backend directory based on the .env.test file or the configuration section below.
- Start the backend development server:
python run.py
# The API will be available at http://localhost:2022
- Open a new terminal window and navigate to the frontend directory:
cd frontend
- Install the Node dependencies:
npm install
# or if using yarn:
yarn install
- Start the frontend development server:
npm run dev
# The application will be available at http://localhost:5173 (or as configured by Vite)
For background tasks (like PDF processing) to function, you must start the ARQ worker process. In a new terminal window (with the Python virtual environment activated):
cd backend
arq app.worker.WorkerSettings
The system relies heavily on environment variables for configuration. Do not hardcode secrets into the source code.
Create a file named .env in the backend/ directory.
# Server Configuration
PROJECT_NAME="ScholarAI API"
VERSION="1.0.0"
API_PREFIX="/api"
PORT=2022
DEBUG=True
# Security Configuration
SECRET_KEY="your-super-secret-jwt-key-change-in-production"
ALGORITHM="HS256"
ACCESS_TOKEN_EXPIRE_MINUTES=1440 # 24 hours
# Database Configuration
MONGODB_URL="mongodb://localhost:27017"
MONGODB_DB_NAME="scholarai_db"
# Redis Configuration (For ARQ)
REDIS_URL="redis://localhost:6379/0"
# AI Provider API Keys
GROQ_API_KEY="your-groq-api-key"
GOOGLE_API_KEY="your-gemini-api-key"
OPENAI_API_KEY="your-openai-api-key" # Optional
ANTHROPIC_API_KEY="your-anthropic-api-key" # Optional
# Storage Configuration
UPLOAD_DIRECTORY="./scratch/uploads"
CHROMA_DB_DIRECTORY="./chroma_storage"
The frontend uses Vite's environment variables. Create a .env file in the frontend/ directory.
# API Endpoint
VITE_API_URL="http://localhost:2022/api"
# Feature Flags
VITE_ENABLE_ANALYTICS=false
ScholarAI provides a comprehensive RESTful API. Below is a subset of the critical endpoints. The full interactive Swagger documentation is automatically generated by FastAPI and can be accessed at http://localhost:2022/docs when the backend is running.
You can also view the complete specification in the openapi.yaml file located in the project root.
Creates a new user account.
Request Body:
{
  "email": "user@university.edu",
  "password": "SecurePassword123!",
  "name": "Jane Doe"
}
Response (201 Created): User registered successfully.
Authenticates a user and sets an HttpOnly cookie containing the JWT session token.
Request Body:
{
  "email": "user@university.edu",
  "password": "SecurePassword123!"
}
Response (200 OK):
{
  "user": {
    "_id": "60d5ecb54b3... ",
    "email": "user@university.edu",
    "name": "Jane Doe",
    "role": "user",
    "planTier": "FREE"
  }
}
Retrieves a list of all documents uploaded by the authenticated user. Requires a valid session cookie.
Response (200 OK):
[
  {
    "id": "doc_123",
    "filename": "quantum_computing_review.pdf",
    "uploadDate": "2023-10-27T10:00:00Z",
    "status": "COMPLETED",
    "sizeBytes": 2048576
  }
]
Uploads a new document for processing. Requires a valid session cookie. Payload is multipart/form-data.
Sends a question related to a specific document.
Request Body:
{
  "documentId": "doc_123",
  "query": "What are the main limitations of the Qubit decoherence mentioned in the paper?"
}
Response (200 OK):
{
  "answer": "Based on the provided text, the main limitations are... [Detailed Response]",
  "sources": [
    {
      "pageNumber": 12,
      "textSnippet": "...decoherence times remain a significant hurdle..."
    }
  ]
}
This section outlines how to deploy ScholarAI to a production environment.
To deploy using Docker, you would typically use Docker Compose to orchestrate the Frontend, Backend, MongoDB, and Redis containers.
Note: Dockerfiles are not explicitly provided in the base repository yet, but this is the standard approach.
Sample docker-compose.yml structure:
version: '3.8'
services:
  frontend:
    build: ./frontend
    ports:
      - "80:80"
    depends_on:
      - backend
  backend:
    build: ./backend
    ports:
      - "2022:2022"
    environment:
      - MONGODB_URL=mongodb://mongo:27017
      - REDIS_URL=redis://redis:6379/0
      # ... other environment variables
    depends_on:
      - mongo
      - redis
  worker:
    build: ./backend
    command: arq app.worker.WorkerSettings
    environment:
      - MONGODB_URL=mongodb://mongo:27017
      - REDIS_URL=redis://redis:6379/0
      # ... other environment variables
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
volumes:
  mongo_data:
- Server Setup: Provision a Virtual Private Server (VPS) from AWS (EC2), DigitalOcean, or Linode.
- Dependencies: Install Nginx, Python 3.10+, Node.js, Redis, and MongoDB on the server.
- Backend Setup:
- Clone the repository.
- Set up a virtual environment and install dependencies.
- Configure systemd services to run the FastAPI application via Gunicorn with Uvicorn workers.
- Configure a systemd service for the ARQ worker.
- Frontend Setup:
- Navigate to the frontend directory.
- Run npm install and npm run build.
- The compiled assets will be in the dist folder.
- Nginx Configuration:
- Use the provided frontend/nginx.conf as a template.
- Configure Nginx to serve the static frontend files from the dist directory.
- Set up a reverse proxy in Nginx to forward API requests (e.g., /api/*) to the Gunicorn server running on port 2022.
- SSL/TLS: Use Certbot (Let's Encrypt) to secure your domain with HTTPS.
Security is paramount, especially when handling potentially sensitive academic research and user data.
- Authentication: We use industry-standard JSON Web Tokens (JWT) for authentication. Tokens are stored in HttpOnly, Secure cookies, so client-side scripts cannot read them and they are sent only over HTTPS. This limits token theft through Cross-Site Scripting (XSS).
- Password Hashing: Passwords are never stored in plaintext. They are hashed using bcrypt with a strong work factor before being stored in the database.
- Data Isolation: The system employs Role-Based Access Control (RBAC) and strict ownership checks. A user can only query and access documents they have explicitly uploaded.
- Vector Database Security: ChromaDB runs locally within the backend network context. It is not exposed to the public internet.
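- Input Validation: All API inputs are strictly validated using Pydantic schemas. Enforcing the expected field types helps guard against NoSQL injection (for example, a query-operator object submitted where a plain string is expected). SQL injection does not apply directly, since MongoDB is the data store.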
- API Rate Limiting: Prevent abuse and DDoS attacks by limiting the number of requests per IP address.
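The per-IP rate limiting mentioned in the last item can be sketched as a sliding-window counter. The README does not say which mechanism or library the project uses, so `SlidingWindowRateLimiter` and its parameters below are hypothetical. The clock is injectable only so the behaviour can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the caller would respond with HTTP 429
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=10.0)
    print([limiter.allow("203.0.113.7") for _ in range(3)])
```

This version keeps its counters in process memory. With several API instances behind a load balancer, the counters would need to live in a shared store such as Redis so that every instance sees the same totals.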
As user adoption grows, the system is designed to scale horizontally.
- Stateless API: The FastAPI backend is completely stateless (session state is in JWT/MongoDB), allowing you to spin up multiple instances behind a load balancer (like Nginx or AWS ALB).
- Asynchronous Workers: Document processing is CPU-intensive. By decoupling this via Redis and ARQ, you can scale the number of worker instances independently from the web API instances to handle spikes in document uploads.
- Database Scaling:
- MongoDB: Can be scaled horizontally using Sharding if the user/metadata volume becomes massive.
- ChromaDB: For enterprise scale, the local ChromaDB instance should be migrated to a distributed vector database solution like Milvus, Pinecone, or a managed Chroma cloud instance.
- Caching: Implement aggressive caching strategies using Redis for frequently accessed metadata or popular queries to reduce LLM API calls and database load.
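The caching idea for popular queries can be illustrated with a small time-to-live cache keyed on the document and a normalized form of the question. This is a sketch under assumptions: an in-memory dictionary stands in for Redis, and `cache_key`, `TTLCache`, and `answer_with_cache` are hypothetical names rather than project code.

```python
import hashlib
import time
from typing import Callable, Optional


def cache_key(document_id: str, query: str) -> str:
    """Build a stable key; case and extra whitespace in the query are ignored."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{document_id}:{normalized}".encode("utf-8")).hexdigest()
    return f"chat:{digest}"


class TTLCache:
    """Minimal in-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired; behave as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def answer_with_cache(cache: TTLCache, document_id: str, query: str,
                      ask_llm: Callable[[str, str], str]) -> str:
    """Return a cached answer if one exists; otherwise call the LLM and store the result."""
    key = cache_key(document_id, query)
    cached = cache.get(key)
    if cached is not None:
        return cached
    answer = ask_llm(document_id, query)
    cache.set(key, answer)
    return answer
```

Because the key includes the document ID, answers are never shared across documents. If document IDs were not unique per user, the user ID would also need to be part of the key to preserve data isolation.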
Ensuring code quality and reliability is a core part of the development process.
The backend uses pytest for unit and integration testing.
cd backend
# Run all tests
pytest
# Run tests with coverage report
pytest --cov=app tests/
Note: Ensure you have set up a separate test database (configured via .env.test) to avoid modifying your development data.
Automated frontend tests (for example, with Vitest or Jest) are not documented here. Currently, the focus is on manual QA and static analysis via TypeScript.
cd frontend
# Run TypeScript compiler checks
npm run lint
Issue: Backend server fails to start.
- Solution: Ensure MongoDB and Redis are running and accessible at the URLs specified in your .env file. Check that all required Python dependencies are installed in your activated virtual environment.
Issue: Document processing is stuck in 'Pending' state.
- Solution: Ensure the ARQ background worker is running (arq app.worker.WorkerSettings in the backend directory). Check the worker logs for any errors related to PDF parsing or API limits with the LLM providers.
Issue: Frontend shows 'Network Error' when attempting to login or upload.
- Solution: Verify that the VITE_API_URL in the frontend .env is pointing to the correct backend address (usually http://localhost:2022/api). Ensure the backend is actually running. Check CORS configurations in FastAPI if accessing from a different domain.
Issue: High memory usage by ChromaDB.
- Solution: Local ChromaDB keeps a lot of data in memory. If processing massive documents, you may need to increase your system's RAM or migrate to a client/server vector database model.
We welcome contributions to ScholarAI!
- Fork the repository.
- Create a feature branch: git checkout -b feature/amazing-feature.
- Commit your changes: git commit -m 'Add amazing feature'.
- Push to the branch: git push origin feature/amazing-feature.
- Open a Pull Request.
Please ensure your code follows the existing style guidelines and passes all tests.
This project is licensed under the MIT License - see the LICENSE file for details.
- The creators and maintainers of FastAPI, React, and Tailwind CSS.
- LangChain for simplifying LLM orchestration.
- HuggingFace for providing accessible open-source embedding models.
- Our university professors and peers for their invaluable feedback during the development process.
Q: Do I need an OpenAI API key to use this?
A: Not necessarily. The system is configured to support multiple providers including Groq and Google GenAI. You only need API keys for the services you intend to use. Groq provides a very generous free tier which is excellent for development.
Q: How is data stored?
A: User data and document metadata are stored in MongoDB. The actual text content is extracted, converted into vector embeddings, and stored locally in ChromaDB. Original PDF files are stored in the local file system (or configurable cloud storage).
Q: Can I use this for commercial purposes?
A: Yes, the project is released under the MIT License, which permits commercial use. However, you must comply with the terms of service of the third-party APIs (LLMs, etc.) you connect to the system.
Q: Is the system capable of handling images within PDFs?
A: The system utilizes PyMuPDF and pytesseract to attempt text extraction, but complex diagrams and non-textual image analysis are currently outside the primary scope of the RAG pipeline. The focus is on textual academic content.