Merged
23 changes: 23 additions & 0 deletions .env.example
@@ -0,0 +1,23 @@
# Environment Configuration for Vector Code Retrieval System

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3:8b

# Embedding Configuration
EMBEDDING_SERVER=http://localhost:5000
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5

# Storage Configuration
CHROMA_PATH=./chroma_code

# Service Selection
USE_LOCAL_EMBEDDINGS=true
USE_LOCAL_OLLAMA=true

# Trust Remote Code Settings (automatically managed by trust_manager.py)
# Format: TRUST_REMOTE_CODE_<MODEL_HASH>=true|false
# These are automatically added when you approve/deny models
# Example:
# # TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
# TRUST_REMOTE_CODE_A1B2C3D4=true
14 changes: 0 additions & 14 deletions .env_example

This file was deleted.

7 changes: 7 additions & 0 deletions .gitignore_example
@@ -0,0 +1,7 @@
.gitignore
venv/
.vscode/
__pycache__/
chroma_code/
chroma_db/
*-queries.md
4 changes: 0 additions & 4 deletions .vscode/settings.json

This file was deleted.

133 changes: 116 additions & 17 deletions README.md
@@ -9,7 +9,8 @@ A powerful semantic search system for log files that enables natural language queries
- **Local LLM Integration**: Generates AI responses using Ollama with customizable models
- **Interactive Query Interface**: Rich terminal interface with markdown rendering
- **GPU Acceleration**: Optional GPU support for faster embedding generation
- **Comprehensive File Support**: Indexes `.py`, `.log`, `.js`, `.ts`, `.md`, `.sql`, `.html`, `.csv` files
- **Automatic File Detection**: Intelligently detects and indexes all text-based files by content analysis
- **Security-First Design**: Client-side trust_remote_code management with consent prompts and persistent tracking
- **Environment Configuration**: Fully configurable via `.env` files

## Quick Start
@@ -63,25 +64,30 @@ USE_LOCAL_EMBEDDINGS=true
USE_LOCAL_OLLAMA=true
```

### 3. Index Your Log Files
### 3. Index Your Files

Index a directory using local embeddings (default):
Index a directory with automatic file detection:

```bash
python index.py /path/to/your/logs
python index.py /path/to/your/files
```

The system will:
- Automatically detect all text-based files by content analysis
- Skip binary files and common build/cache directories
- Prompt for trust_remote_code consent if needed for the embedding model
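The README doesn't show how "text-based by content analysis" is decided; a minimal content-probing sketch (the helper name and thresholds are assumptions, not the actual `index.py` logic) might look like:

```python
def looks_like_text(path, probe_size=8192):
    """Heuristic text/binary detection: read a leading sample of the file,
    reject it if it contains NUL bytes, and accept it if the sample decodes
    as UTF-8. Hypothetical helper for illustration only."""
    with open(path, "rb") as f:
        sample = f.read(probe_size)
    if b"\x00" in sample:  # NUL bytes almost always indicate a binary file
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Probing only the first few kilobytes keeps indexing fast on large files while still catching binaries early.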

Or specify embedding type:

```bash
# Use local SentenceTransformer embeddings
python index.py /path/to/logs --local-embeddings
# Use local SentenceTransformer embeddings (default)
python index.py /path/to/files --local-embeddings

# Use Ollama embeddings
python index.py /path/to/logs --ollama-embeddings
python index.py /path/to/files --ollama-embeddings

# Use remote embedding server
python index.py /path/to/logs --remote-embeddings
python index.py /path/to/files --remote-embeddings
```
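The three flags above select among the handler classes; a plausible argparse sketch of the selection (handler names here are illustrative, not the actual classes in `index.py`):

```python
import argparse

def pick_handler(argv):
    """Map mutually exclusive embedding flags to a handler name,
    defaulting to local embeddings when no flag is given."""
    parser = argparse.ArgumentParser(description="Index files for vector search")
    parser.add_argument("path")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--local-embeddings", action="store_true")
    group.add_argument("--ollama-embeddings", action="store_true")
    group.add_argument("--remote-embeddings", action="store_true")
    args = parser.parse_args(argv)
    if args.ollama_embeddings:
        return "OllamaEmbeddingHandler"   # illustrative names
    if args.remote_embeddings:
        return "RemoteEmbeddingHandler"
    return "LocalEmbeddingHandler"        # default, same as --local-embeddings
```

Making the flags mutually exclusive lets argparse reject contradictory combinations instead of silently picking one.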

Additional options:
@@ -94,14 +100,19 @@ python index.py /path/to/logs --model custom-model --chunk-size 1500
python index.py /path/to/logs --chroma-path ./my_custom_db
```

### 4. Query Your Logs
### 4. Query Your Indexed Content

Start the interactive query interface:

```bash
python ask.py
```

The system will:
- Auto-detect the embedding type used during indexing
- Apply the same trust_remote_code settings for consistency
- Generate responses using Ollama's local LLM

Or specify a custom output file:

```bash
@@ -113,22 +124,32 @@ python ask.py my_queries.md
### Core Components

1. **Unified Indexer (`index.py`)**
- Processes repositories and creates vector embeddings
- Processes repositories with automatic file detection
- Supports multiple embedding strategies via handler classes
- Chunks code into configurable segments (default: 2000 characters)
- Chunks content into configurable segments (default: 2000 characters)
- Client-side trust_remote_code management
- Stores embeddings in ChromaDB with metadata tracking

2. **Query Interface (`ask.py`)**
- Interactive CLI for natural language log queries
- Auto-detects embedding type from metadata
- Interactive CLI for natural language queries
- Auto-detects embedding type and trust settings from metadata
- Generates responses using Ollama's local LLM
- Consistent security model with indexing phase
- Saves all Q&A pairs with timestamps

3. **Embedding Server (`embedding_server.py`)**
- Optional remote embedding service with GPU support
- Respects client-side trust_remote_code decisions
- RESTful API with health checks and server info
- Configurable via command-line arguments
- Supports batch processing and model caching
- Dynamic model loading with trust setting caching
- Supports batch processing and multiple model variants

4. **Trust Manager (`trust_manager.py`)**
- Centralized security management for trust_remote_code
- Auto-detection of models requiring remote code execution
- Interactive consent prompts with risk/benefit explanations
- Persistent approval tracking in .env files
- CLI tools for managing trust settings
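The fixed-size chunking mentioned under the indexer (default: 2000 characters) might look like the sketch below; the overlap is an assumed detail the README does not specify:

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size character chunks with a small overlap,
    so that matches spanning a chunk boundary remain retrievable."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Character-based chunking is simple and model-agnostic; token-aware chunking (e.g. via tiktoken, which is listed in the dependencies) would respect embedding-model limits more precisely.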

### Embedding Handlers

@@ -149,6 +170,7 @@ python ask.py my_queries.md
| `CHROMA_PATH` | ChromaDB storage path | `./chroma_code` |
| `USE_LOCAL_EMBEDDINGS` | Default embedding strategy | `true` |
| `USE_LOCAL_OLLAMA` | Use local Ollama instance | `true` |
| `TRUST_REMOTE_CODE_*` | Model-specific trust settings | Auto-managed |

### Command Line Options

@@ -180,6 +202,73 @@ Options:
--debug Enable debug mode
```

## Security: Trust Remote Code Management

The system includes a comprehensive security framework for models that require `trust_remote_code=True`. This client-side security system:

- **Auto-detects** which models likely need remote code execution based on known patterns
- **Prompts for informed consent** with detailed security warnings
- **Persists decisions** in `.env` with model-specific hash tracking
- **Client-side control** - trust decisions made locally, not on remote servers
- **Cross-component consistency** - same security model for indexing, querying, and serving

### How It Works

1. **Detection**: System analyzes model names against known patterns
2. **User Consent**: Interactive prompts with clear risk/benefit explanations
3. **Persistence**: Decisions saved locally with model identification hashes
4. **Communication**: Client sends trust settings to remote embedding servers
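Steps 1–3 could be sketched as follows. The truncated-SHA-256 hash and the helper names are assumptions inferred from the `A1B2C3D4`-style keys shown in `.env.example`, not the actual `trust_manager.py` implementation:

```python
import hashlib

def trust_key(model_name):
    """Build a TRUST_REMOTE_CODE_<HASH> key for a model. The README does
    not state the hash function; a truncated SHA-256 is one plausible choice."""
    digest = hashlib.sha256(model_name.encode("utf-8")).hexdigest()
    return f"TRUST_REMOTE_CODE_{digest[:8].upper()}"

def record_decision(env_path, model_name, approved):
    """Append the decision plus a comment naming the model,
    mirroring the format shown in .env.example."""
    key = trust_key(model_name)
    with open(env_path, "a") as f:
        f.write(f"# {key}_MODEL={model_name}\n")
        f.write(f"{key}={'true' if approved else 'false'}\n")
```

Hashing the model name keeps the `.env` key a fixed, shell-safe length, while the accompanying comment preserves human readability.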

### Managing Trust Settings

```bash
# List all approved/denied models
python trust_manager.py --list

# Check if a specific model needs trust_remote_code
python trust_manager.py --check "nomic-ai/nomic-embed-text-v1.5"
```
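Under the hood, `--check` presumably matches the model name against known model families; a minimal illustrative version (the pattern list here is hypothetical, not the one `trust_manager.py` ships with):

```python
# Model-name fragments assumed to require trust_remote_code=True.
KNOWN_REMOTE_CODE_PATTERNS = (
    "nomic-ai/nomic-embed",
    "jinaai/jina-embeddings",
)

def needs_remote_code(model_name):
    """Return True if the model name matches a known remote-code pattern."""
    name = model_name.lower()
    return any(p.lower() in name for p in KNOWN_REMOTE_CODE_PATTERNS)
```

A substring check like this errs on the side of prompting the user; an unknown model simply falls through to the default (no remote code).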

### Security Flow

When you first use a model requiring remote code execution:

```
==============================================================
SECURITY WARNING: Remote Code Execution
==============================================================
Model: nomic-ai/nomic-embed-text-v1.5

This model may require 'trust_remote_code=True' which allows
the model to execute arbitrary code during loading.

RISKS:
- The model could execute malicious code
- Your system could be compromised
- Data could be stolen or corrupted

BENEFITS:
- Access to newer/specialized models
- Better embedding quality for some models

Your choice will be saved for this model.
==============================================================
Allow remote code execution for this model? [y/N]:
```

### Trust Settings Storage

Approval decisions are stored in your `.env` file:

```bash
# Example entries (automatically managed)
# TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
TRUST_REMOTE_CODE_A1B2C3D4=true

# TRUST_REMOTE_CODE_E5F6G7H8_MODEL=sentence-transformers/all-MiniLM-L6-v2
TRUST_REMOTE_CODE_E5F6G7H8=false
```
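Reading these entries back is straightforward; a sketch of how the query or indexing phase might recover the saved decisions (format inferred from the example above):

```python
def load_trust_settings(env_path):
    """Parse TRUST_REMOTE_CODE_<HASH>=true|false lines from a .env file
    into a {hash: bool} dict. Comment lines carrying the model name are
    skipped, as are unrelated keys."""
    prefix = "TRUST_REMOTE_CODE_"
    settings = {}
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.startswith(prefix):
                settings[key[len(prefix):]] = value.lower() == "true"
    return settings
```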

## Advanced Usage

### Remote Embedding Server
@@ -257,23 +346,33 @@ The system automatically detects and works with databases created by older versions
## Dependencies

- **chromadb**: Vector database for embeddings
- **sentence-transformers**: Local embedding generation
- **sentence-transformers**: Local embedding generation (optional, only needed for local embeddings)
- **ollama**: LLM client for local inference
- **rich**: Enhanced terminal output and markdown rendering
- **flask**: Web server for embedding API
- **python-dotenv**: Environment configuration management
- **tiktoken**: Token counting utilities
- **einops**: Tensor operations for advanced models
- **requests**: HTTP client for remote services

## File Structure

```
├── index.py # Unified indexing script
├── ask.py # Interactive query interface
├── embedding_server.py # Remote embedding server
├── trust_manager.py # Security: trust_remote_code management
├── requirements.txt # Python dependencies
├── .env.example        # Environment configuration template
└── chroma_code/ # Default ChromaDB storage (created after indexing)
```

## License

This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.

## Contributions

I welcome any assistance on this project, especially around trying new models for better performance and testing against more logs than I have at my disposal!

Please fork off of the `dev` branch and then submit a PR.
Binary file removed __pycache__/ask.cpython-311.pyc
Binary file not shown.
Binary file removed __pycache__/index_remote.cpython-311.pyc
Binary file not shown.