A powerful semantic search application providing intelligent Q&A across a curated collection of options trading educational content. Built with modern technologies for fast, accurate, and context-aware responses.
OPTEEE uses advanced natural language processing and vector similarity search to help traders learn from a comprehensive knowledge base of options trading transcripts and educational videos. Ask questions in plain English and get detailed answers with direct links to relevant source material.
- Semantic Search: Advanced NLP-powered search that understands meaning, not just keywords
- Fast Retrieval: FAISS vector database delivers millisecond search responses
- Multi-Source Knowledge Base: Combines video transcripts and academic research papers
- Video Integration: Direct links to specific timestamps in source YouTube videos
- Research Paper Support: Academic papers with page references and section context
- Chat Interface: Modern, responsive chat UI with conversation history
- Persistent Conversation History: Full user/assistant threads stored in SQL (Postgres recommended)
- Source Citations: Every answer includes clickable references with timestamps or page numbers
- Context-Aware: Maintains conversation history for follow-up questions
- Responsive Design: Works seamlessly on desktop and mobile devices
OPTEEE draws from two primary sources:
| Source Type | Content | Count |
|---|---|---|
| Video Transcripts | Options trading tutorials, strategy explanations, market analysis | 17,200+ chunks |
| Research Papers | Academic papers on PEAD, volatility, retail trading behavior | 8,900+ chunks |
Total: 26,100+ searchable knowledge chunks
- Backend: FastAPI with RESTful API endpoints
- Frontend: React with modern UI components
- Search Engine: Sentence-transformers with FAISS vector database
- NLP Model: all-MiniLM-L6-v2 for semantic embeddings
- Deployment: Docker containerization with resource-limited local serving
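To make the search engine layer concrete, here is a minimal pure-Python sketch of top-k cosine similarity over toy vectors. It is not the project's code: the real system embeds text with all-MiniLM-L6-v2 and searches a FAISS index, whereas the three-dimensional vectors and the `cosine` and `top_k_cosine` helpers below are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_cosine(query, vectors, k=5):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "chunk embeddings" (hypothetical values; all-MiniLM-L6-v2 produces 384 dimensions).
chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
results = top_k_cosine([1.0, 0.1, 0.0], chunks, k=2)
print([index for index, _ in results])
```

A brute-force scan like this is linear in the number of chunks; a dedicated index such as FAISS exists to make the same lookup fast at the scale of tens of thousands of vectors.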
- Python 3.13 (recommended; 3.14+ not yet supported by scipy/numba)
- Docker (optional, for containerized deployment)
- Git
Use a virtual environment for all local Python work (serving, transcript pipeline, vector store rebuild):
cd opteee
# Create venv (once)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements-serve.txt # Serving only
# OR
pip install -r requirements.txt # Full (includes Whisper, pipeline)
Always activate venv before running Python scripts:
source venv/bin/activate
python3 main.py
python3 run_pipeline.py --step scrape
The venv/ directory is in .gitignore and should not be committed.
- Clone the repository:
git clone https://github.com/bthaile/opteee.git
cd opteee
- Create your .env file:
cp .env.example .env
# Edit .env with your API key(s)
- Build and run:
docker compose up --build
The application will be available at http://localhost:7860
The Docker setup uses Dockerfile.serve — a slim image with CPU-only PyTorch, no Whisper/Selenium overhead, and mounts the pre-built vector store as a volume. Resource limits (2GB RAM, 2 CPUs) are configured in docker-compose.yml.
source venv/bin/activate
pip install -r requirements-serve.txt
python main.py
Use requirements-serve.txt for serving only. The full requirements.txt includes pipeline dependencies (Whisper, YouTube downloaders) needed for transcript processing.
- GET /api/health - Health check endpoint
  - Returns service status and version information
- POST /api/chat - Main chat endpoint
  - Request body:
    {
      "query": "What is a covered call?",
      "provider": "claude",
      "num_results": 5,
      "format": "html",
      "conversation_history": [],
      "conversation_id": "optional-existing-conversation-id"
    }
  - format supports html, json, and bot (prefer json for chat bots)
  - Returns answer with sources, timestamps, and conversation_id
- POST /api/conversations - Create a new persisted conversation
  - Returns conversation metadata (id, title, timestamps)
- GET /api/conversations?limit=25 - List recent conversations
  - Returns newest-first conversation summaries for sidebar/history UIs
- GET /api/conversations/{conversation_id} - Load one conversation with full message history
  - Returns all persisted user and assistant messages for replay/rebuild
- GET / - Serves the React frontend application
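As a rough illustration of calling the chat endpoint, the sketch below builds a POST /api/chat request with the standard library, using the request fields listed above. The `build_chat_request` helper is hypothetical, and omitting conversation_id for a new conversation is an assumption based on it being described as optional. The request is only constructed here; sending it requires a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # default local address from the Docker setup

def build_chat_request(query, provider="claude", num_results=5,
                       fmt="json", conversation_id=None, history=None):
    """Build (but do not send) a POST /api/chat request."""
    payload = {
        "query": query,
        "provider": provider,
        "num_results": num_results,
        "format": fmt,
        "conversation_history": history or [],
    }
    # Assumption: the field can be left out when starting a new conversation.
    if conversation_id:
        payload["conversation_id"] = conversation_id
    return urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("What is a covered call?")
# To send it against a running server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     answer = json.loads(resp.read())
```

For follow-up questions, pass the conversation_id returned by the previous response so the server can attach the new message to the same thread.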
opteee/
├── main.py # FastAPI application entry point
├── config.py # Configuration and settings
├── rag_pipeline.py # RAG implementation
├── vector_search.py # Vector similarity search
├── create_vector_store.py # Vector store creation (transcripts + PDFs)
├── rebuild_vector_store.py # Vector store rebuilding
├── run_transcripts.sh # Transcript pipeline (venv + scrape→whisper→vectors)
├── process_pdfs.py # PDF semantic chunking utility
├── app/
│ ├── db/ # SQLAlchemy engine, models, and DB init
│ ├── models/ # Pydantic models
│ │ └── chat_models.py # Chat request/response models (supports video + PDF)
│ └── services/ # Business logic services
│ ├── rag_service.py # RAG service implementation
│ ├── conversation_service.py # Conversation/message persistence service
│ ├── history_utils.py # History sanitization for prompt context
│ └── formatters.py # Response formatting (HTML + JSON/bot-friendly)
├── frontend/
│ └── build/ # React production build
├── vector_store/ # FAISS vector database files
├── processed_transcripts/ # Processed video transcript chunks (JSON)
├── processed_pdfs/ # Processed PDF document chunks (JSON)
├── transcripts/ # Raw transcript data
├── static/ # Static assets (CSS, JS)
├── templates/ # HTML templates
├── bots/ # Platform-agnostic bot integration docs/examples
│ ├── README.md # Canonical bot integration guide
│ └── examples/ # Minimal client examples
├── docs/ # Documentation
├── archive/ # Archived utilities and scripts
├── Dockerfile # Production Docker image (full pipeline)
├── Dockerfile.serve # Slim Docker image (serving only, CPU-only PyTorch)
├── docker-compose.yml # Local Docker serving with resource limits
├── requirements.txt # Full dependencies (pipeline + serving)
├── requirements-serve.txt # Slim dependencies (serving only)
└── tests/ # Unit tests for persistence and history logic
- FastAPI - High-performance Python web framework
- React - Modern frontend JavaScript library
- Sentence Transformers - State-of-the-art sentence embeddings
- FAISS - Efficient similarity search and clustering
- Docker - Containerization platform
- LangChain - LLM orchestration and RAG pipeline
- Backend Changes: Modify FastAPI endpoints in main.py or services in app/services/
- Frontend Changes: Update React components in frontend/src/ (requires separate build)
- Testing: Run locally with source venv/bin/activate && python main.py
- Vector Store Updates: Rebuild with source venv/bin/activate && python rebuild_vector_store.py
- Deploy: docker compose up --build
Key configuration options in config.py:
- MODEL_NAME: Sentence transformer model (default: "all-MiniLM-L6-v2")
- TOP_K: Number of top results to retrieve (default: 5)
- CHUNK_SIZE: Size of text chunks for processing (default: 500)
- CHUNK_OVERLAP: Overlap between chunks (default: 50)
OPTEEE uses an automated GitHub Actions workflow to keep the knowledge base up-to-date with the latest educational content. The system automatically discovers new videos, generates transcripts, and deploys updates.
The knowledge base is automatically updated every Sunday at 8:00 PM UTC (3:00 PM CDT / 2:00 PM CST) through the Process Video Transcripts Weekly workflow:
What happens automatically:
- Video Discovery - Scans YouTube channels for new educational content
- Transcript Generation - Creates text transcripts from videos using YouTube API and Whisper
- Text Processing - Chunks transcripts into searchable segments (250 words with 50-word overlap)
- Repository Update - Commits new transcripts and processed data to the repository
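To check when the next scheduled run falls, the weekly slot can be computed directly. This is a small sketch, not part of the project: `next_weekly_run` is a hypothetical helper, and it only models the Sunday 20:00 UTC schedule stated above (which would correspond to a cron expression such as 0 20 * * 0; the workflow file itself is not reproduced here).

```python
from datetime import datetime, timedelta, timezone

def next_weekly_run(now):
    """Return the next Sunday 20:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (6 - now.weekday()) % 7  # Monday=0 ... Sunday=6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=20, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate

print(next_weekly_run(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)))
```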
After the weekly pipeline commits new transcripts, rebuild the vector store locally to make new content searchable:
source venv/bin/activate
python rebuild_vector_store.py
docker compose up --build
You can manually trigger the knowledge base update at any time:
Via GitHub Web Interface:
- Navigate to the Actions tab in the GitHub repository
- Select "Process Video Transcripts Weekly" workflow
- Click "Run workflow" button
- Choose the branch (usually main)
- Click "Run workflow" to start
Via GitHub CLI:
gh workflow run "Process Video Transcripts Weekly"
To run the full transcript pipeline locally, including Whisper for videos without YouTube captions:
Prerequisites:
- Python 3.13 (brew install python@3.13 on macOS; Python 3.14 is not yet supported by scipy/numba)
- ffmpeg installed (brew install ffmpeg on macOS)
- venv created with Python 3.13: python3.13 -m venv venv && source venv/bin/activate && pip install -r requirements.txt
- YOUTUBE_API_KEY in .env (for video discovery)
Activate venv before running any pipeline step:
source venv/bin/activate
One-liner (runs all steps in sequence):
./run_transcripts.sh
Step 1 (scrape) scans the configured YouTube channels (see pipeline_config.py → CHANNEL_URLS) and writes a list of all video IDs, titles, and metadata to outlier_trading_videos.json. Already-known videos are skipped on subsequent runs.
source venv/bin/activate
python3 run_pipeline.py --step scrape --non-interactive
What it does:
- Uses yt-dlp to enumerate every video across all channel URLs (videos, shorts, streams, podcasts, live)
- De-duplicates by video ID
- Saves results to outlier_trading_videos.json
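The de-duplication step can be pictured as follows. This is a sketch under stated assumptions, not the scraper's actual code: the `dedupe_by_video_id` helper, the `video_id` key name, and the sample entries are all hypothetical, since the schema of outlier_trading_videos.json is not shown here.

```python
def dedupe_by_video_id(videos):
    """Keep the first entry seen for each video ID, preserving order."""
    seen = set()
    unique = []
    for video in videos:
        vid = video["video_id"]  # hypothetical key name
        if vid not in seen:
            seen.add(vid)
            unique.append(video)
    return unique

# Hypothetical entries: the same video can surface under more than one channel tab.
found = [
    {"video_id": "abc", "title": "Covered calls"},
    {"video_id": "xyz", "title": "Iron condors"},
    {"video_id": "abc", "title": "Covered calls (listed again under streams)"},
]
unique = dedupe_by_video_id(found)
print([v["video_id"] for v in unique])
```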
Step 2 (transcripts) attempts to pull captions directly from YouTube for every video in outlier_trading_videos.json. Videos where captions are unavailable are marked for Whisper processing.
source venv/bin/activate
python3 run_pipeline.py --step transcripts --non-interactive
What it does:
- Reads outlier_trading_videos.json
- Calls the youtube-transcript-api for each video
- Saves successful transcripts to transcripts/<video_id>.txt (one line per segment: 123.45s: text)
- Records successes/failures in transcript_progress.json
- Skips videos already in transcripts/ and only processes new ones
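The 123.45s: text line format is easy to parse back into a timestamp and a text segment. The sketch below shows one way to do that and to turn a timestamp into a deep link. The `parse_transcript_line` and `timestamp_url` helpers are hypothetical, and the URL shape (the `t` query parameter on a standard watch URL) is an assumption about how links could be built, not a statement of how OPTEEE builds them.

```python
import re

LINE_RE = re.compile(r"^(\d+(?:\.\d+)?)s:\s*(.*)$")

def parse_transcript_line(line):
    """Split '123.45s: text' into (seconds, text); return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    return float(match.group(1)), match.group(2)

def timestamp_url(video_id, seconds):
    """Build a YouTube watch URL that starts playback at the given second."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"

parsed = parse_transcript_line("123.45s: A covered call pairs long stock with a short call.")
print(parsed)
print(timestamp_url("ABC123", parsed[0]))
```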
To force re-fetching everything:
python3 run_pipeline.py --step transcripts --non-interactive --force-reprocess
Step 3 (whisper): for any video without YouTube captions, this step downloads the audio track and runs OpenAI Whisper to generate a transcript locally. The pipeline records the failures from Step 2 and processes them here.
source venv/bin/activate
# Run as pipeline step (recommended — runs automatically after transcripts)
python3 run_pipeline.py --step whisper --non-interactive
# Or run retry_and_whisper directly for more control:
python3 retry_and_whisper.py --whisper-only # Skip YouTube retry, go straight to Whisper
python3 retry_and_whisper.py # Retry YouTube first, then Whisper for rest
python3 retry_and_whisper.py --retry-only # Only retry YouTube (no Whisper)
python3 retry_and_whisper.py --max-whisper 10 # Limit to N videos (for testing)
What it does under the hood:
- Audio download: Uses yt-dlp + ffmpeg to pull the best available audio stream and convert it to a 128 kbps MP3, saved to audio_files/<video_id>.mp3
- Whisper transcription: Loads the Whisper model (WHISPER_MODEL in pipeline_config.py, default tiny) and transcribes the audio, producing timestamped segments
- Writes the result to transcripts/<video_id>.txt in the same 123.45s: text format as YouTube transcripts
- Updates transcript_progress.json (whisper_processed list)
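The write step above amounts to rendering timestamped segments in the shared transcript format. Here is a minimal sketch of that conversion. The `segments_to_lines` helper is hypothetical, the two-decimal timestamp formatting is inferred from the 123.45s example, and the sample data only mimics the shape of the segments list in a Whisper transcription result (start, end, text); no model is loaded.

```python
def segments_to_lines(segments):
    """Render segments as '<start>s: <text>' lines, skipping empty text."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        if text:
            lines.append(f"{seg['start']:.2f}s: {text}")
    return lines

# Made-up values shaped like Whisper's result["segments"] entries.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back to the channel."},
    {"start": 4.2, "end": 9.87, "text": " Today we cover vertical spreads."},
    {"start": 9.87, "end": 10.0, "text": "  "},
]
lines = segments_to_lines(segments)
print("\n".join(lines))
```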
Whisper model options (set WHISPER_MODEL in pipeline_config.py):
| Model | Speed | Accuracy | Notes |
|---|---|---|---|
| tiny | ~32× faster than base | ~90% | Default; good for large batches |
| base | baseline | ~95% | Good balance |
| small | ~2× slower than base | ~97% | Better for tricky audio |
| medium | ~5× slower | ~99% | High accuracy, slower |
| large | ~10× slower | Best | Use only when quality matters most |
Note: Audio files in audio_files/ are not committed to the repository and can be deleted after transcription to save disk space.
Step 4 (preprocess) converts raw transcript files into overlapping word-window chunks with full metadata (video URL, timestamp, title). This is what gets indexed into the vector store.
source venv/bin/activate
python3 run_pipeline.py --step preprocess --non-interactive
# Or run the preprocessor directly for more control:
python3 preprocess_transcripts.py # Process all new transcripts
python3 preprocess_transcripts.py --force # Force reprocess everything
python3 preprocess_transcripts.py --video-id ABC123 # Process one specific video
What it does:
- Reads all .txt files from transcripts/
- Splits each into overlapping chunks (default: 250 words per chunk, 50-word overlap; configured in pipeline_config.py)
- Attaches metadata: video_id, title, url, timestamp, upload_date
- Outputs one JSON file per video to processed_transcripts/<video_id>.json
- Skips already-processed videos unless --force is passed
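The overlapping word-window split described above can be sketched in a few lines. This is an illustration with the documented defaults (250 words, 50-word overlap), not the code in preprocess_transcripts.py; the `chunk_words` helper is hypothetical and ignores timestamps and metadata.

```python
def chunk_words(text, chunk_size=250, overlap=50):
    """Split text into word windows of chunk_size; each shares `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])
```

With a 600-word input the windows start at words 0, 200, and 400, so consecutive chunks share exactly 50 words and the final chunk is shorter than the rest.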
Step 5 (vectors) embeds all processed chunks using the sentence-transformer model and writes the FAISS index to vector_store/.
source venv/bin/activate
python3 run_pipeline.py --step vectors --non-interactive
# Or rebuild directly (also picks up processed PDFs):
python3 rebuild_vector_store.py
docker compose up --build -d
The vector store is mounted as a volume, so the container picks up the updated index without a full image rebuild. The --build flag ensures any code changes are included.
Run all five steps sequentially (scrape → transcripts → whisper → preprocess → vectors):
python3 run_pipeline.py --non-interactive
The Whisper step runs automatically after transcripts and processes any videos that couldn't get captions from YouTube.
Note: The GitHub Actions workflow uses YouTube transcripts only (no Whisper). Whisper is for local processing of videos without captions.
To rebuild the vector store locally (for development or testing):
source venv/bin/activate
python rebuild_vector_store.py
# Or use the create script directly
python create_vector_store.py
Note: The vector store files (vector_store/) are large and should not be committed to the repository. They are rebuilt automatically during deployment.
The automated pipeline is configured in .github/workflows/process-transcripts.yml:
Key Settings:
- Schedule: Weekly on Sunday at 20:00 UTC
- Timeout: 180 minutes (3 hours) for large processing jobs
- Python Version: 3.10
- Dependencies: FFmpeg (for audio processing), PyTorch, Sentence-Transformers
Required Secrets:
- YOUTUBE_API_KEY - For accessing YouTube API to fetch video metadata and transcripts
Check Processing Status:
- View workflow runs in the GitHub Actions tab
- Each run generates a processing report showing:
- Number of videos discovered
- Transcripts generated
- Processed chunks created
Verify Locally:
- Test the /api/health endpoint
- Run a sample query to verify new content is searchable
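A local health check can be scripted with the standard library. The sketch below is an assumption-laden example: `check_health` is a hypothetical helper, the injectable `fetch` argument exists only so the logic can be exercised without a running server, and the `{"status": "ok"}` payload is invented, since the actual response fields are not documented here.

```python
import json
import urllib.request

def check_health(base_url="http://localhost:7860", fetch=None, timeout=5):
    """Return the parsed /api/health payload, or None if the service is unreachable."""
    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()

    fetch = fetch or default_fetch
    try:
        return json.loads(fetch(f"{base_url}/api/health"))
    except (OSError, ValueError):  # connection errors or a non-JSON body
        return None

# Offline demonstration with a stubbed response (made-up payload).
stub_result = check_health(fetch=lambda url: b'{"status": "ok"}')
print(stub_result)
```

Against a running instance, calling check_health() with no arguments queries http://localhost:7860/api/health and returns None if the container is down.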
To add new YouTube channels or playlists to the discovery process:
- Update the scraper configuration in the pipeline scripts
- The next automated run will discover videos from the new sources
- Or manually trigger the workflow to process immediately
To add academic papers or PDF documents to the knowledge base:
- Prepare PDFs: Place PDF files in a local directory (e.g., ~/research-papers/)
- Process PDFs locally:
# Process PDFs with semantic chunking
python process_pdfs.py ~/research-papers/
# Analyze first without processing (preview)
python process_pdfs.py ~/research-papers/ --analyze-only
- Commit processed chunks:
git add processed_pdfs/
git commit -m "Add research papers: [description]"
git push
- Rebuild vector store:
source venv/bin/activate && python rebuild_vector_store.py && docker compose up --build
PDF Processing Features:
- Semantic chunking: Preserves paragraph boundaries and section context
- Section detection: Identifies headers and includes section names in metadata
- Page tracking: Each chunk includes page number and range
- Author extraction: Extracts author metadata when available
- Lightweight storage: Raw PDFs stay local, only JSON chunks are committed (~95% smaller)
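One way to picture paragraph-preserving chunking with page tracking is the sketch below. It is a simplified stand-in, not the algorithm in process_pdfs.py: the `chunk_paragraphs` helper, the blank-line paragraph rule, the `max_words` budget, and the `page_start`/`page_end` field names are all assumptions made for illustration.

```python
def chunk_paragraphs(pages, max_words=250):
    """Pack whole paragraphs into chunks of at most max_words, tracking page ranges.

    `pages` is a list of (page_number, page_text) tuples; paragraphs are
    separated by blank lines. A paragraph longer than max_words becomes its
    own chunk rather than being split.
    """
    chunks, current, count, first_page, last_page = [], [], 0, None, None
    for page_number, page_text in pages:
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append({"text": "\n\n".join(current),
                               "page_start": first_page, "page_end": last_page})
                current, count, first_page = [], 0, None
            current.append(para)
            count += n
            first_page = page_number if first_page is None else first_page
            last_page = page_number
    if current:
        chunks.append({"text": "\n\n".join(current),
                       "page_start": first_page, "page_end": last_page})
    return chunks

pages = [(1, "Intro paragraph one.\n\nIntro paragraph two."),
         (2, "Methods paragraph spanning the next page.")]
chunks = chunk_paragraphs(pages, max_words=6)
print([(c["page_start"], c["page_end"]) for c in chunks])
```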
Note: Raw PDF files are not committed to the repository (see .gitignore). Only the processed JSON chunks in processed_pdfs/ are stored in Git.
If automated updates fail:
- Check GitHub Actions logs - View detailed error messages in the workflow run
- Verify secrets - Ensure YOUTUBE_API_KEY is valid
- Check API quotas - YouTube API has daily limits
- Manual rebuild - Trigger the workflow manually if the scheduled run was missed
- Local testing - Run the pipeline locally to debug issues
Common Issues:
- YouTube API quota exceeded - Wait for quota reset (midnight Pacific Time)
- Transcripts not available - Some videos may not have captions enabled
- Long processing times - Large batches may take 1-2 hours
Two Dockerfiles are provided:
- Dockerfile.serve (default in docker-compose) - Slim image for serving: CPU-only PyTorch, no Whisper/Selenium, mounts pre-built vector store
- Dockerfile - Full image for production: includes vector store build step, all pipeline dependencies
# .env file (see .env.example)
CLAUDE_API_KEY=... # Anthropic API key (at least one LLM key required)
OPENAI_API_KEY=... # OpenAI API key (optional)
DATABASE_URL=... # Optional; enables persisted conversation history (Postgres recommended)
For Docker + host Postgres on macOS, use:
DATABASE_URL=postgresql+psycopg://postgres:postgres@host.docker.internal:5432/opteee
Notes:
- If DATABASE_URL is not set, OPTEEE falls back to local SQLite (opteee.db).
- Conversation tables are created automatically on app startup.
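The fallback behavior in the notes above can be sketched as a small resolver. This is illustrative only: `resolve_database_url` is a hypothetical helper, and the `sqlite:///opteee.db` URL form is an assumption about how the opteee.db file would be addressed through SQLAlchemy.

```python
import os

DEFAULT_SQLITE_URL = "sqlite:///opteee.db"  # assumed URL form for the local fallback file

def resolve_database_url(env=None):
    """Return DATABASE_URL if set and non-empty, else the local SQLite URL."""
    env = os.environ if env is None else env
    url = (env.get("DATABASE_URL") or "").strip()
    return url or DEFAULT_SQLITE_URL

print(resolve_database_url({}))
print(resolve_database_url(
    {"DATABASE_URL": "postgresql+psycopg://postgres:postgres@host.docker.internal:5432/opteee"}))
```

Treating an empty or whitespace-only value the same as an unset variable avoids a confusing connection error when the .env line is present but blank.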
Docker Compose is configured with sensible limits for local development:
- Memory: 2GB
- CPUs: 2
| Action | Command |
|---|---|
| Stop containers | docker compose down |
| Stop and remove volumes | docker compose down -v |
| Build and start | docker compose up --build |
| Build and start (detached) | docker compose up --build -d |
| Clean rebuild (no cache) | docker compose build --no-cache && docker compose up -d |
| View logs | docker compose logs -f |
| View last 50 lines | docker compose logs --tail 50 |
Typical workflow after code or config changes:
# Stop current containers
docker compose down
# Rebuild and restart
docker compose up --build -d
After vector store rebuild (e.g. after python rebuild_vector_store.py):
docker compose up --build -d
The --build flag ensures the image is rebuilt with any code changes. Because the vector store is mounted as a volume, restarting the container picks up updated vector_store/ files without rebuilding the image.
Use this when running OPTEEE locally with Docker and you want an automatic weekly refresh that:
- Pulls latest changes from GitHub
- Rebuilds/restarts the Docker service
- Verifies /api/health on port 7860
The setup uses two files:
- weekly-refresh.sh - refresh script used by launchd
- com.opteee.weekly-refresh.plist - launchd job definition (template tracked in git)
cd /Users/bradfordhaile/clawd/opteee # Adjust to your local clone path
mkdir -p logs
chmod +x weekly-refresh.sh
cp com.opteee.weekly-refresh.plist ~/Library/LaunchAgents/com.opteee.weekly-refresh.plist
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.opteee.weekly-refresh.plist
launchctl enable gui/$(id -u)/com.opteee.weekly-refresh
The tracked plist is configured to run every Sunday at 11:00 PM local time:
- Weekday=0 (Sunday)
- Hour=23
- Minute=0
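For orientation, a launchd job with this schedule can be expressed as a dictionary and serialized with Python's plistlib. This sketch is not the tracked plist: the script and log paths are placeholders, and the tracked file may contain additional keys. Only the label and the StartCalendarInterval values come from the text above.

```python
import plistlib

job = {
    "Label": "com.opteee.weekly-refresh",
    # Placeholder path; point this at weekly-refresh.sh in your clone.
    "ProgramArguments": ["/bin/bash", "/path/to/opteee/weekly-refresh.sh"],
    # Sunday 23:00 local time, matching the schedule described above.
    "StartCalendarInterval": {"Weekday": 0, "Hour": 23, "Minute": 0},
    # Placeholder log locations.
    "StandardOutPath": "/path/to/opteee/logs/weekly-refresh.out.log",
    "StandardErrorPath": "/path/to/opteee/logs/weekly-refresh.err.log",
}

plist_bytes = plistlib.dumps(job)  # XML property list by default
print(plist_bytes.decode("utf-8"))
```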
# Run immediately (manual test)
launchctl kickstart -k gui/$(id -u)/com.opteee.weekly-refresh
# Check status
launchctl print gui/$(id -u)/com.opteee.weekly-refresh
# Disable/enable
launchctl disable gui/$(id -u)/com.opteee.weekly-refresh
launchctl enable gui/$(id -u)/com.opteee.weekly-refresh
# Reload after plist edits
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.opteee.weekly-refresh.plist
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.opteee.weekly-refresh.plist
Launchd output logs are written to:
- logs/weekly-refresh.out.log
- logs/weekly-refresh.err.log
- The script skips git pull if your repo has uncommitted changes to avoid clobbering local edits.
- Keep Docker Desktop running so scheduled refresh jobs can rebuild/restart successfully.
- macOS compatibility: These launchctl commands use the modern bootstrap/bootout syntax (not legacy load/unload) and work on macOS Tahoe 26 and earlier.
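The "skip git pull when the working tree is dirty" guard can be expressed in Python as follows. This is a sketch of the idea rather than a port of weekly-refresh.sh: the `is_dirty` and `safe_to_pull` helpers are hypothetical, and treating untracked files as uncommitted changes is an assumption (the script may define this differently).

```python
import subprocess

def is_dirty(porcelain_output):
    """True if `git status --porcelain` output lists any changed or untracked path."""
    return any(line.strip() for line in porcelain_output.splitlines())

def safe_to_pull(repo_dir="."):
    """Run git status in repo_dir and report whether a pull would be safe."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return not is_dirty(result.stdout)

print(is_dirty(" M main.py\n?? notes.txt\n"))
print(is_dirty(""))
```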
OPTEEE supports simple bot clients across platforms (Telegram, Slack, webhooks, custom chat apps).
Use bots/README.md as the canonical integration guide (includes conversation state support).
bots/ is the only supported bot integration path.
- docs/BEGINNER_GUIDE.md - Getting started guide
- docs/DEPLOYMENT_STEPS.md - Deployment instructions
- bots/README.md - Canonical bot integration guide (includes conversation support)
- bots/examples/python_client.py - Minimal Python bot client example
- docs/BOT_INTEGRATION.md - Redirect/compatibility bot guide
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests to ensure everything works
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please ensure your code follows the existing style and includes appropriate documentation.
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to all contributors who have helped build this project
- Built with open-source technologies and libraries
- Educational content from various options trading educators
- Issues: Please use GitHub Issues for bug reports and feature requests
- Discussions: Join the conversation in GitHub Discussions
Note: This is an educational tool. Always do your own research and consult with financial professionals before making trading decisions.