Chat with Your Documents Using GPT-4.1
Transform how you interact with documents through conversational AI and intelligent semantic search
SemanticScout is a cutting-edge "Chat with Your Documents" application that combines the power of GPT-4.1 conversational AI with advanced semantic search. Ask questions naturally and get intelligent, context-aware answers from your document collection. Built with the latest 2025 AI technologies, it features adaptive search intelligence that automatically optimizes for any domain - from technical papers to financial reports.
- 🧠 Intelligent Processing: Multi-format document support (PDF, DOCX, TXT, MD)
- 🔍 Semantic Search: Natural language queries with contextual understanding
- 🎯 Adaptive Intelligence: Domain-agnostic threshold adaptation for optimal results
- 💬 Chat Interface: GPT-4.1 powered conversational document exploration
- 📊 Visual Analytics: Interactive document relationship visualization
- ⚡ Real-time Results: Sub-2-second search responses with relevance scoring
- 🎨 Professional UI: Modern Gradio interface with automatic dark/light theme support
- 🛡️ Enterprise Ready: Secure, scalable architecture with comprehensive testing
Deploy your own instance on Hugging Face Spaces (see deployment section)
    docker-compose up --build

Prerequisites:

- Python 3.11+
- OpenAI API key (get one at https://platform.openai.com)
- Git for version control
1. Clone the repository:

        git clone https://github.com/NeurArk/SemanticScout.git
        cd SemanticScout

2. Create a virtual environment:

        python -m venv venv
        source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies:

        pip install -r requirements.txt

4. Configure the environment:

        cp .env.example .env
        # Edit .env and add your OpenAI API key
        # Optional: set OPENAI_MODEL to gpt-4.1-mini for lower costs

5. Initialize the environment:

        python scripts/setup.py

6. Launch the application:

        python app.py

7. Open your browser to http://localhost:7860
- Drag & drop or click to upload PDF, DOCX, TXT, or MD files
- Maximum file size: 100MB
- Real-time processing with progress indicators
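Under the hood, multi-format processing typically reduces to extracting raw text and splitting it into overlapping chunks before embedding. The project's actual chunker lives in `core/document_processor.py`; the sketch below is a hypothetical, simplified version, and the `chunk_size` and `overlap` values are illustrative rather than the project's real defaults:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so context survives chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` chars each time
    return chunks

# Example: a 2500-character document with 1000-char chunks and 200-char overlap
doc = "x" * 2500
pieces = chunk_text(doc)
print(len(pieces))  # 4 chunks, starting at offsets 0, 800, 1600, 2400
```

The overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, which matters for retrieval quality.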
- Conversational Interface: Ask questions naturally and get intelligent responses
- Context-Aware: Maintains conversation history for follow-up questions
- Source Attribution: Every answer includes document sources
- Enter to Send: Streamlined UX with keyboard shortcuts
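Maintaining conversation history for follow-up questions can be as simple as carrying prior turns alongside the retrieved sources. The real logic lives in `core/chat_engine.py`; this is a hypothetical minimal sketch in which the `answer_fn` callable stands in for the GPT-4.1 call:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ChatTurn:
    question: str
    answer: str
    sources: list[str]

@dataclass
class ChatSession:
    """Keeps conversation history so follow-up questions see prior context."""
    answer_fn: Callable[[str, list[ChatTurn]], tuple[str, list[str]]]
    history: list[ChatTurn] = field(default_factory=list)

    def ask(self, question: str) -> ChatTurn:
        # The model sees the full history, enabling contextual follow-ups
        answer, sources = self.answer_fn(question, self.history)
        turn = ChatTurn(question, answer, sources)
        self.history.append(turn)
        return turn

# Stub model: echoes the question and cites a fixed document
session = ChatSession(lambda q, hist: (f"Answer to: {q}", ["report.pdf"]))
first = session.ask("What is attention?")
followup = session.ask("And how is it computed?")
print(len(session.history), followup.sources)  # 2 ['report.pdf']
```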
- Enter natural language queries like:
  - "What is attention?" - Tests adaptive search
  - "Compare revenue models across documents" - Multi-document analysis
  - "Explain the transformer architecture" - Technical deep dives
- Document Distribution: See your document types at a glance
- Size vs Complexity: Scatter plot showing document characteristics
- Theme Adaptive: Charts automatically adjust to light/dark modes
- Apple Financial Report (`apple_financial_report.pdf`) - Q3 2024 quarterly results
- Attention Is All You Need (`attention_is_all_you_need.pdf`) - Transformer architecture paper
- SaaS Agreement Example (`saas_agreement_example.pdf`) - Enterprise software contract
- General: "What is attention?" → Tests adaptive search for short queries
- Financial: "What was Apple's revenue in Q3 2024?"
- Technical: "Explain the transformer architecture"
- Legal: "What are the termination clauses in the SaaS agreement?"
- Cross-document: "Compare the complexity between transformers and Apple's financials"
| Component | Technology | Version | Purpose |
|---|---|---|---|
| AI Framework | LangChain + LangGraph | Latest | RAG pipeline orchestration |
| Language Model | OpenAI GPT-4.1 | Latest (2025) | Chat & query understanding |
| Embeddings | text-embedding-3-large | 3072-dim | Semantic vector generation |
| Vector DB | ChromaDB | Latest | Efficient similarity search |
| UI Framework | Gradio | Latest | Interactive web interface |
| Visualization | Plotly + NetworkX | Latest | Document relationship graphs |
    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │    Document     │────▶│   Processing    │────▶│     Vector      │
    │     Upload      │     │    Pipeline     │     │     Storage     │
    │  (Multi-format) │     │  (Extraction)   │     │   (ChromaDB)    │
    └─────────────────┘     └─────────────────┘     └─────────────────┘
             │                       │                       │
             ▼                       ▼                       ▼
    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │    Gradio UI    │◀────│  Search Engine  │────▶│   OpenAI API    │
    │ (User Interface)│     │   (Semantic)    │     │    (GPT-4.1)    │
    └─────────────────┘     └─────────────────┘     └─────────────────┘
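Conceptually, a query flows through the boxes above as retrieve → assemble prompt → generate. A hypothetical end-to-end sketch, with the vector search and LLM call stubbed out since the real versions (in `core/rag_pipeline.py`) hit ChromaDB and the OpenAI API:

```python
def rag_answer(query, retrieve, generate, top_k=5):
    """Minimal RAG loop: fetch relevant chunks, ground the prompt, generate."""
    chunks = retrieve(query, top_k)                    # vector similarity search
    context = "\n\n".join(c["text"] for c in chunks)   # grounding context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    answer = generate(prompt)                          # LLM call
    sources = sorted({c["source"] for c in chunks})    # for source attribution
    return answer, sources

# Stubs standing in for ChromaDB and GPT-4.1
fake_retrieve = lambda q, k: [{"text": "Attention weighs token relevance.",
                               "source": "attention_is_all_you_need.pdf"}]
fake_generate = lambda prompt: "Attention scores how much each token matters."
answer, sources = rag_answer("What is attention?", fake_retrieve, fake_generate)
```

Passing the retriever and generator as callables keeps the loop testable without network access, which is how the stubs above run.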
SemanticScout employs an advanced domain-agnostic adaptive search system that automatically adjusts retrieval parameters based on:
- Query Complexity Analysis: Linguistic patterns determine optimal search thresholds
- Corpus Vocabulary Extraction: Dynamic analysis of document collection characteristics
- Auto-calibration: Real-time adjustment based on result distribution
- Query Expansion: Automatic enhancement of short queries for better recall
This ensures optimal results whether searching technical papers, financial reports, or legal documents without manual configuration.
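The exact calibration logic lives in `core/utils/`; as a rough illustration of the idea only, here is a sketch in which every constant (the base threshold, the word-count cutoffs, the expansion suffix) is invented for the example, not taken from the project:

```python
def adaptive_threshold(query: str, base: float = 0.35) -> float:
    """Loosen the similarity cutoff for short, vague queries (favor recall)
    and tighten it for long, specific ones (favor precision).
    All constants here are illustrative."""
    words = query.split()
    if len(words) <= 3:
        return base - 0.10   # short query: cast a wider net
    if len(words) >= 10:
        return base + 0.10   # detailed query: demand closer matches
    return base

def expand_query(query: str) -> str:
    """Naive query expansion: pad very short queries to improve recall."""
    if len(query.split()) <= 3:
        return f"{query} (definition, explanation, overview)"
    return query

print(round(adaptive_threshold("What is attention?"), 2))  # 0.25
print(expand_query("What is attention?"))  # What is attention? (definition, explanation, overview)
```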
- Processing Speed: < 30 seconds per document
- Chat Response: < 3 seconds for contextual answers
- Search Accuracy: Adaptive thresholds ensure optimal recall/precision balance
- Scalability: Tested with technical papers, financial reports, and legal documents
- Resource Efficient: ~$0.014 per query with GPT-4.1 (or ~$0.003 with GPT-4.1-mini)
- Test Coverage: 82% with comprehensive unit and integration tests
    SemanticScout/
    ├── app.py                     # Main Gradio application
    ├── core/                      # Core business logic
    │   ├── models/                # Data models
    │   ├── document_processor.py  # Document processing
    │   ├── embedder.py            # Embedding generation
    │   ├── rag_pipeline.py        # RAG orchestration
    │   ├── chat_engine.py         # Chat functionality
    │   ├── vector_store.py        # Vector storage
    │   └── utils/                 # Utilities including adaptive search
    ├── config/                    # Configuration
    ├── tests/                     # Test suites (82% coverage)
    ├── samples/                   # Example documents
    ├── images/                    # UI screenshots
    └── requirements.txt           # Dependencies
    # Run all tests
    pytest

    # Run with coverage (82% achieved)
    pytest --cov=core

    # Run specific test categories
    pytest tests/unit/
    pytest tests/integration/

    # Format code
    black . --line-length 100

    # Run linting
    ruff check .

    # Type checking (if configured)
    mypy core/

    # Run all tests with coverage
    pytest --cov=core --cov-report=html

Perfect for client presentations and development:

    python app.py --host 0.0.0.0 --port 7860

Free hosting for demos:
- Fork this repository
- Create a new Space on Hugging Face
- Connect your GitHub repository
- Add your OpenAI API key to Space secrets
    docker build -t semantic-scout .
    docker run -p 7860:7860 -e OPENAI_API_KEY=your_key semantic-scout

- Document Discovery: Find relevant documents across large repositories
- Knowledge Management: Organize and search company knowledge bases
- Research Assistance: Accelerate literature reviews and research
- Compliance: Locate policy documents and regulatory information
- Client Presentations: Showcase AI capabilities with real documents
- Technical Interviews: Demonstrate semantic search understanding
- Portfolio Projects: Highlight modern AI/ML development skills
- Proof of Concepts: Validate semantic search for specific domains
If something fails during the demo, restart the application, check the logs in the `logs/` directory, or re-run `docker-compose up`.
Combines the power of semantic search with GPT-4.1 to provide accurate, contextual answers based on your documents.
Our proprietary algorithm automatically adjusts search parameters based on query complexity and document characteristics, ensuring optimal results without manual tuning.
Efficient vector storage with cosine similarity search, optimized for semantic retrieval at scale.
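Cosine similarity compares the direction of two embedding vectors while ignoring their magnitude; scores range from -1 to 1, with higher meaning more semantically similar. A pure-Python sketch of the metric (the vectors here are tiny toys; the real embeddings from text-embedding-3-large have 3072 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings"
query_vec = [0.2, 0.8, 0.1]
doc_close = [0.25, 0.75, 0.12]  # nearly parallel: high similarity
doc_far = [0.9, -0.1, 0.4]      # different direction: low similarity
print(cosine_similarity(query_vec, doc_close) > cosine_similarity(query_vec, doc_far))  # True
```

In production this comparison runs inside ChromaDB's index rather than in Python, so search stays fast as the collection grows.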
We welcome contributions! Please see our development guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow the existing code style and patterns
- Add tests for new features
- Update documentation for significant changes
- Ensure all CI checks pass
This project is licensed under the MIT License - see the LICENSE file for details.
SemanticScout showcases advanced AI/ML capabilities including:
- RAG (Retrieval Augmented Generation) implementation
- Vector database integration and optimization
- Modern LLM application development
- Production-ready software architecture
- Enterprise-grade user experience design
Perfect for demonstrating expertise in:
- 🤖 AI/ML Engineering
- 🐍 Python Development
- 🏗️ System Architecture
- 🎨 UI/UX Design
- 📊 Data Visualization
- Live Demo: [Coming Soon on Hugging Face]
- GitHub: https://github.com/NeurArk/SemanticScout
- Portfolio: NeurArk
- Issues & Support: GitHub Issues
Built with ❤️ by NeurArk • Powered by OpenAI GPT-4.1 & LangChain
Transform your document search experience with the power of semantic AI.
