Merged
23 changes: 23 additions & 0 deletions .env.example
@@ -0,0 +1,23 @@
# Environment Configuration for Vector Code Retrieval System

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3:8b

# Embedding Configuration
EMBEDDING_SERVER=http://localhost:5000
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5

# Storage Configuration
CHROMA_PATH=./chroma_code

# Service Selection
USE_LOCAL_EMBEDDINGS=true
USE_LOCAL_OLLAMA=true

# Trust Remote Code Settings (automatically managed by trust_manager.py)
# Format: TRUST_REMOTE_CODE_<MODEL_HASH>=true|false
# These are automatically added when you approve/deny models
# Example:
# # TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
# TRUST_REMOTE_CODE_A1B2C3D4=true
14 changes: 0 additions & 14 deletions .env_example

This file was deleted.

7 changes: 7 additions & 0 deletions .gitignore_example
@@ -0,0 +1,7 @@
.gitignore
venv/
.vscode/
__pycache__/
chroma_code/
chroma_db/
*-queries.md
4 changes: 0 additions & 4 deletions .vscode/settings.json

This file was deleted.

133 changes: 116 additions & 17 deletions README.md
@@ -9,7 +9,8 @@ A powerful semantic search system for log files that enables natural language queries
- **Local LLM Integration**: Generates AI responses using Ollama with customizable models
- **Interactive Query Interface**: Rich terminal interface with markdown rendering
- **GPU Acceleration**: Optional GPU support for faster embedding generation
- **Comprehensive File Support**: Indexes `.py`, `.log`, `.js`, `.ts`, `.md`, `.sql`, `.html`, `.csv` files
- **Automatic File Detection**: Intelligently detects and indexes all text-based files by content analysis
- **Security-First Design**: Client-side trust_remote_code management with consent prompts and persistent tracking
- **Environment Configuration**: Fully configurable via `.env` files

## Quick Start
@@ -63,25 +64,30 @@ USE_LOCAL_EMBEDDINGS=true
USE_LOCAL_OLLAMA=true
```

### 3. Index Your Log Files
### 3. Index Your Files

Index a directory using local embeddings (default):
Index a directory with automatic file detection:

```bash
python index.py /path/to/your/logs
python index.py /path/to/your/files
```

The system will:
- Automatically detect all text-based files by content analysis
- Skip binary files and common build/cache directories
- Prompt for trust_remote_code consent if needed for the embedding model
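The README doesn't show how "text-based by content analysis" is decided; a minimal content-probing sketch (the helper name and thresholds are assumptions, not the actual `index.py` logic) might look like:

```python
def looks_like_text(path, probe_size=8192):
    """Heuristic text/binary detection: read a leading sample of the file,
    reject it if it contains NUL bytes, and accept it if the sample decodes
    as UTF-8. Hypothetical helper for illustration only."""
    with open(path, "rb") as f:
        sample = f.read(probe_size)
    if b"\x00" in sample:  # NUL bytes almost always indicate a binary file
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Probing only the first few kilobytes keeps indexing fast on large files while still catching binaries early.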

Or specify embedding type:

```bash
# Use local SentenceTransformer embeddings
python index.py /path/to/logs --local-embeddings
# Use local SentenceTransformer embeddings (default)
python index.py /path/to/files --local-embeddings

# Use Ollama embeddings
python index.py /path/to/logs --ollama-embeddings
python index.py /path/to/files --ollama-embeddings

# Use remote embedding server
python index.py /path/to/logs --remote-embeddings
python index.py /path/to/files --remote-embeddings
```
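The three flags above select among the handler classes; a plausible argparse sketch of the selection (handler names here are illustrative, not the actual classes in `index.py`):

```python
import argparse

def pick_handler(argv):
    """Map mutually exclusive embedding flags to a handler name,
    defaulting to local embeddings when no flag is given."""
    parser = argparse.ArgumentParser(description="Index files for vector search")
    parser.add_argument("path")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--local-embeddings", action="store_true")
    group.add_argument("--ollama-embeddings", action="store_true")
    group.add_argument("--remote-embeddings", action="store_true")
    args = parser.parse_args(argv)
    if args.ollama_embeddings:
        return "OllamaEmbeddingHandler"   # illustrative names
    if args.remote_embeddings:
        return "RemoteEmbeddingHandler"
    return "LocalEmbeddingHandler"        # default, same as --local-embeddings
```

Making the flags mutually exclusive lets argparse reject contradictory combinations instead of silently picking one.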

Additional options:
@@ -94,14 +100,19 @@ python index.py /path/to/logs --model custom-model --chunk-size 1500
python index.py /path/to/logs --chroma-path ./my_custom_db
```

### 4. Query Your Logs
### 4. Query Your Indexed Content

Start the interactive query interface:

```bash
python ask.py
```

The system will:
- Auto-detect the embedding type used during indexing
- Apply the same trust_remote_code settings for consistency
- Generate responses using Ollama's local LLM

Or specify a custom output file:

```bash
@@ -113,22 +124,32 @@ python ask.py my_queries.md
### Core Components

1. **Unified Indexer (`index.py`)**
- Processes repositories and creates vector embeddings
- Processes repositories with automatic file detection
- Supports multiple embedding strategies via handler classes
- Chunks code into configurable segments (default: 2000 characters)
- Chunks content into configurable segments (default: 2000 characters)
- Client-side trust_remote_code management
- Stores embeddings in ChromaDB with metadata tracking

2. **Query Interface (`ask.py`)**
- Interactive CLI for natural language log queries
- Auto-detects embedding type from metadata
- Interactive CLI for natural language queries
- Auto-detects embedding type and trust settings from metadata
- Generates responses using Ollama's local LLM
- Consistent security model with indexing phase
- Saves all Q&A pairs with timestamps

3. **Embedding Server (`embedding_server.py`)**
- Optional remote embedding service with GPU support
- Respects client-side trust_remote_code decisions
- RESTful API with health checks and server info
- Configurable via command-line arguments
- Supports batch processing and model caching
- Dynamic model loading with trust setting caching
- Supports batch processing and multiple model variants

4. **Trust Manager (`trust_manager.py`)**
- Centralized security management for trust_remote_code
- Auto-detection of models requiring remote code execution
- Interactive consent prompts with risk/benefit explanations
- Persistent approval tracking in .env files
- CLI tools for managing trust settings
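The fixed-size chunking mentioned under the indexer (default: 2000 characters) might look like the sketch below; the overlap is an assumed detail the README does not specify:

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size character chunks with a small overlap,
    so that matches spanning a chunk boundary remain retrievable."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Character-based chunking is simple and model-agnostic; token-aware chunking (e.g. via tiktoken, which is listed in the dependencies) would respect embedding-model limits more precisely.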

### Embedding Handlers

@@ -149,6 +170,7 @@ python ask.py my_queries.md
| `CHROMA_PATH` | ChromaDB storage path | `./chroma_code` |
| `USE_LOCAL_EMBEDDINGS` | Default embedding strategy | `true` |
| `USE_LOCAL_OLLAMA` | Use local Ollama instance | `true` |
| `TRUST_REMOTE_CODE_*` | Model-specific trust settings | Auto-managed |

### Command Line Options

@@ -180,6 +202,73 @@ Options:
--debug Enable debug mode
```

## Security: Trust Remote Code Management

The system includes a comprehensive security framework for models that require `trust_remote_code=True`. This client-side security system:

- **Auto-detects** which models likely need remote code execution based on known patterns
- **Prompts for informed consent** with detailed security warnings
- **Persists decisions** in `.env` with model-specific hash tracking
- **Client-side control** - trust decisions made locally, not on remote servers
- **Cross-component consistency** - same security model for indexing, querying, and serving

### How It Works

1. **Detection**: System analyzes model names against known patterns
2. **User Consent**: Interactive prompts with clear risk/benefit explanations
3. **Persistence**: Decisions saved locally with model identification hashes
4. **Communication**: Client sends trust settings to remote embedding servers
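Steps 1–3 could be sketched as follows. The truncated-SHA-256 hash and the helper names are assumptions inferred from the `A1B2C3D4`-style keys shown in `.env.example`, not the actual `trust_manager.py` implementation:

```python
import hashlib

def trust_key(model_name):
    """Build a TRUST_REMOTE_CODE_<HASH> key for a model. The README does
    not state the hash function; a truncated SHA-256 is one plausible choice."""
    digest = hashlib.sha256(model_name.encode("utf-8")).hexdigest()
    return f"TRUST_REMOTE_CODE_{digest[:8].upper()}"

def record_decision(env_path, model_name, approved):
    """Append the decision plus a comment naming the model,
    mirroring the format shown in .env.example."""
    key = trust_key(model_name)
    with open(env_path, "a") as f:
        f.write(f"# {key}_MODEL={model_name}\n")
        f.write(f"{key}={'true' if approved else 'false'}\n")
```

Hashing the model name keeps the `.env` key a fixed, shell-safe length, while the accompanying comment preserves human readability.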

### Managing Trust Settings

```bash
# List all approved/denied models
python trust_manager.py --list

# Check if a specific model needs trust_remote_code
python trust_manager.py --check "nomic-ai/nomic-embed-text-v1.5"
```
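Under the hood, `--check` presumably matches the model name against known model families; a minimal illustrative version (the pattern list here is hypothetical, not the one `trust_manager.py` ships with):

```python
# Model-name fragments assumed to require trust_remote_code=True.
KNOWN_REMOTE_CODE_PATTERNS = (
    "nomic-ai/nomic-embed",
    "jinaai/jina-embeddings",
)

def needs_remote_code(model_name):
    """Return True if the model name matches a known remote-code pattern."""
    name = model_name.lower()
    return any(p.lower() in name for p in KNOWN_REMOTE_CODE_PATTERNS)
```

A substring check like this errs on the side of prompting the user; an unknown model simply falls through to the default (no remote code).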

### Security Flow

When you first use a model requiring remote code execution:

```
==============================================================
SECURITY WARNING: Remote Code Execution
==============================================================
Model: nomic-ai/nomic-embed-text-v1.5

This model may require 'trust_remote_code=True' which allows
the model to execute arbitrary code during loading.

RISKS:
- The model could execute malicious code
- Your system could be compromised
- Data could be stolen or corrupted

BENEFITS:
- Access to newer/specialized models
- Better embedding quality for some models

Your choice will be saved for this model.
==============================================================
Allow remote code execution for this model? [y/N]:
```

### Trust Settings Storage

Approval decisions are stored in your `.env` file:

```bash
# Example entries (automatically managed)
# TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
TRUST_REMOTE_CODE_A1B2C3D4=true

# TRUST_REMOTE_CODE_E5F6G7H8_MODEL=sentence-transformers/all-MiniLM-L6-v2
TRUST_REMOTE_CODE_E5F6G7H8=false
```
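Reading these entries back is straightforward; a sketch of how the query or indexing phase might recover the saved decisions (format inferred from the example above):

```python
def load_trust_settings(env_path):
    """Parse TRUST_REMOTE_CODE_<HASH>=true|false lines from a .env file
    into a {hash: bool} dict. Comment lines carrying the model name are
    skipped, as are unrelated keys."""
    prefix = "TRUST_REMOTE_CODE_"
    settings = {}
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.startswith(prefix):
                settings[key[len(prefix):]] = value.lower() == "true"
    return settings
```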

## Advanced Usage

### Remote Embedding Server
@@ -257,23 +346,33 @@ The system automatically detects and works with databases created by older versions
## Dependencies

- **chromadb**: Vector database for embeddings
- **sentence-transformers**: Local embedding generation
- **sentence-transformers**: Local embedding generation (optional, only needed for local embeddings)
- **ollama**: LLM client for local inference
- **rich**: Enhanced terminal output and markdown rendering
- **flask**: Web server for embedding API
- **python-dotenv**: Environment configuration management
- **tiktoken**: Token counting utilities
- **einops**: Tensor operations for advanced models
- **requests**: HTTP client for remote services

## File Structure

```
├── index.py # Unified indexing script
├── ask.py # Interactive query interface
├── embedding_server.py # Remote embedding server
├── trust_manager.py # Security: trust_remote_code management
├── requirements.txt # Python dependencies
├── .env.example        # Environment configuration template
└── chroma_code/ # Default ChromaDB storage (created after indexing)
```

## License

This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.

## Contributions

I welcome any assistance on this project, especially around trying new models for better performance and testing against more logs than I have at my disposal!

Please fork off of the `dev` branch and then submit a PR.
Binary file removed __pycache__/ask.cpython-311.pyc
Binary file not shown.
Binary file removed __pycache__/index_remote.cpython-311.pyc
Binary file not shown.