Create Classifier Trait and Classification Pipeline Architecture

## Summary

Create the foundational Classifier trait and classification pipeline structure to enable semantic analysis of extracted strings. This framework will allow multiple classifiers to analyze strings and assign tags (URLs, domains, file paths, etc.) based on context and patterns.

## Current State

The extraction infrastructure this feature builds on is already in place; the classification layer itself still needs to be built:

✅ **Existing Components:**
- `FoundString` struct with tags, encoding, offset, RVA, confidence scoring
- `Tag` enum with comprehensive variants (URL, Domain, IPv4, IPv6, FilePath, etc.)
- `ContainerInfo` with sections, imports, exports, resources
- `SectionType` and `BinaryFormat` enums
- `BasicExtractor` for string extraction from binaries
- ASCII and UTF-16LE extraction with confidence scoring

🚧 **To Be Built:**
- `Classifier` trait for pluggable classification strategies
- `StringContext` struct to package extraction context
- `Classification` result type with confidence scoring
- `ClassificationPipeline` orchestration layer
- Concrete classifier implementations

## Context

Stringy extracts strings from binary files and needs to classify them to distinguish meaningful strings from noise. The classification system will:

- Provide a common interface (Classifier trait) for different classification strategies
- Package string data with contextual information (StringContext) for informed classification
- Enable a pipeline architecture where multiple classifiers can process strings sequentially or in parallel
- Support confidence scoring to handle ambiguous cases
- Integrate seamlessly with existing `BasicExtractor` and `FoundString` types
- Leverage existing section weighting and confidence scores from extraction phase

## Proposed Solution

### 1. Classifier Trait

Create a trait that all classifiers will implement in `src/classification/traits.rs`:

```rust
use crate::types::Result;
use super::{Classification, StringContext};

/// Trait for string classifiers that analyze and tag extracted strings
pub trait Classifier: Send + Sync {
    /// Classify a string and return relevant tags with confidence scores
    /// 
    /// Returns a vector of classifications, allowing one string to match
    /// multiple patterns (e.g., a URL that's also a domain).
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>>;
    
    /// Get the name/identifier of this classifier for logging and debugging
    fn name(&self) -> &str;
    
    /// Check if this classifier can handle the given context
    /// 
    /// Allows classifiers to opt-out based on encoding, binary format,
    /// or other contextual factors. Default implementation accepts all.
    fn can_classify(&self, _context: &StringContext) -> bool {
        true
    }
    
    /// Priority/order hint for the pipeline (lower = earlier execution)
    /// 
    /// Some classifiers may want to run before others, e.g., import/export
    /// classifiers should run before generic pattern matching.
    fn priority(&self) -> i32 {
        100  // Default middle priority
    }
}
```
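
As an illustration of what an implementor might look like, here is a minimal sketch of a classifier that only recognizes an HTTP(S) scheme prefix. `SchemeUrlClassifier` is hypothetical, and `Tag`, `StringContext`, `Classification`, and `Result` are reduced stand-ins (not the crate's real definitions) so that the block compiles on its own. The priority of 50 follows the value planned for `UrlClassifier` in the pipeline defaults.

```rust
// Simplified stand-ins for the crate types (assumptions for this sketch only).
#[derive(Debug, Clone, PartialEq)]
pub enum Tag {
    Url,
}

pub struct StringContext<'a> {
    pub text: &'a str,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Classification {
    pub tag: Tag,
    pub confidence: f64,
    pub classifier_name: String,
}

pub type Result<T> = std::result::Result<T, String>;

pub trait Classifier: Send + Sync {
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>>;
    fn name(&self) -> &str;
    fn can_classify(&self, _context: &StringContext) -> bool {
        true
    }
    fn priority(&self) -> i32 {
        100
    }
}

/// Hypothetical classifier: tags strings that start with an HTTP(S) scheme.
pub struct SchemeUrlClassifier;

impl Classifier for SchemeUrlClassifier {
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>> {
        let text = context.text;
        // Strip either scheme; `None` means the string is not a candidate.
        let rest = text
            .strip_prefix("http://")
            .or_else(|| text.strip_prefix("https://"));

        match rest {
            // Something must follow the scheme for this to count as a URL.
            Some(rest) if !rest.is_empty() => Ok(vec![Classification {
                tag: Tag::Url,
                confidence: 1.0,
                classifier_name: self.name().to_string(),
            }]),
            // No match is not an error: return an empty result.
            _ => Ok(Vec::new()),
        }
    }

    fn name(&self) -> &str {
        "scheme_url"
    }

    fn can_classify(&self, context: &StringContext) -> bool {
        // The shortest possible match is "http://x" (8 bytes).
        context.text.len() >= 8
    }

    fn priority(&self) -> i32 {
        50
    }
}
```

Returning an empty vector for a non-match, rather than an error, keeps `Err` reserved for genuine failures that the pipeline should log.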

### 2. StringContext Struct

Encapsulate string data with contextual information in `src/classification/context.rs`:

```rust
use crate::types::{
    FoundString, SectionType, BinaryFormat, ContainerInfo, 
    ImportInfo, ExportInfo
};

/// Context information about an extracted string for classification
#[derive(Debug, Clone)]
pub struct StringContext<'a> {
    /// The extracted string with metadata
    pub found_string: &'a FoundString,
    
    /// Section type where string was found (StringData, Code, etc.)
    pub section_type: Option<SectionType>,
    
    /// Binary format (ELF, PE, Mach-O)
    pub binary_format: BinaryFormat,
    
    /// Nearby strings for context-aware classification (within ±100 bytes)
    pub surrounding_strings: Vec<String>,
    
    /// Full container information
    pub container_info: &'a ContainerInfo,
    
    /// Matching import if string is from import table
    pub import_match: Option<&'a ImportInfo>,
    
    /// Matching export if string is from export table  
    pub export_match: Option<&'a ExportInfo>,
}

impl<'a> StringContext<'a> {
    /// Create a new context for a found string
    pub fn new(
        found_string: &'a FoundString,
        container_info: &'a ContainerInfo,
    ) -> Self {
        let section_type = found_string.section.as_ref()
            .and_then(|name| Self::infer_section_type(name, container_info));
        
        let import_match = container_info.imports.iter()
            .find(|imp| imp.name == found_string.text);
            
        let export_match = container_info.exports.iter()
            .find(|exp| exp.name == found_string.text);
        
        Self {
            found_string,
            section_type,
            binary_format: container_info.format,
            surrounding_strings: Vec::new(),
            container_info,
            import_match,
            export_match,
        }
    }
    
    /// Infer section type from section name
    fn infer_section_type(name: &str, info: &ContainerInfo) -> Option<SectionType> {
        info.sections.iter()
            .find(|s| s.name == name)
            .map(|s| s.section_type)
    }
    
    /// Attach surrounding strings for context (builder style, like
    /// `Classification::with_reason`)
    pub fn with_surrounding(mut self, strings: Vec<String>) -> Self {
        self.surrounding_strings = strings;
        self
    }
}
```
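
The `surrounding_strings` field is documented as holding strings within ±100 bytes, but `new()` leaves it empty. One way the caller could populate it is sketched below. This assumes "within ±100 bytes" means the distance between start offsets, which the issue does not specify; `surrounding_strings` as a free function and the reduced `FoundString` are hypothetical stand-ins.

```rust
/// Simplified stand-in for `FoundString`: only the fields this sketch needs.
pub struct FoundString {
    pub text: String,
    pub offset: u64,
}

/// Collect the text of every other string whose start offset lies within
/// `window` bytes of `strings[index]`. Returns an empty vector when `index`
/// is out of range.
pub fn surrounding_strings(
    strings: &[FoundString],
    index: usize,
    window: u64,
) -> Vec<String> {
    let center = match strings.get(index) {
        Some(s) => s.offset,
        None => return Vec::new(),
    };

    strings
        .iter()
        .enumerate()
        // Skip the string itself and anything outside the window.
        // `abs_diff` avoids underflow regardless of input order.
        .filter(|(i, s)| *i != index && s.offset.abs_diff(center) <= window)
        .map(|(_, s)| s.text.clone())
        .collect()
}
```

This scan is linear per string, so it is quadratic over a whole binary. If the extracted strings are sorted by offset, a sliding window over neighbouring indices would be cheaper.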

### 3. Classification Result

Define the output structure in `src/classification/mod.rs`:

```rust
use crate::types::Tag;

/// Result of classifying a string
#[derive(Debug, Clone, PartialEq)]
pub struct Classification {
    /// The assigned tag
    pub tag: Tag,
    
    /// Confidence score (0.0 to 1.0)
    /// - 1.0: Definitive match (e.g., valid URL with scheme)
    /// - 0.7 to <1.0: Strong match (e.g., domain pattern with TLD)
    /// - 0.5 to <0.7: Moderate match (e.g., path-like structure)
    /// - <0.5: Weak match (should probably be filtered)
    pub confidence: f64,
    
    /// Optional explanation for debugging/logging
    pub reason: Option<String>,
    
    /// The classifier that produced this result
    pub classifier_name: String,
}

impl Classification {
    /// Create a new classification result
    pub fn new(tag: Tag, confidence: f64, classifier_name: String) -> Self {
        Self {
            tag,
            confidence,
            reason: None,
            classifier_name,
        }
    }
    
    /// Add an explanation
    pub fn with_reason(mut self, reason: impl Into<String>) -> Self {
        self.reason = Some(reason.into());
        self
    }
    
    /// Check if this is a high-confidence classification
    pub fn is_high_confidence(&self) -> bool {
        self.confidence >= 0.7
    }
}
```
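
`Classification::new` accepts any `f64`, so a classifier could hand back a value outside 0.0 to 1.0, or `NaN`. A possible guard is sketched below, together with a mapping onto the confidence bands described in the doc comment. `normalize_confidence`, `ConfidenceBand`, and `band` are hypothetical names, and treating everything from 0.7 up to (but excluding) 1.0 as "strong" is an assumption.

```rust
/// Clamp a raw score into the documented 0.0..=1.0 range.
/// `NaN` is mapped to 0.0 because `f64::clamp` would pass it through.
pub fn normalize_confidence(raw: f64) -> f64 {
    if raw.is_nan() {
        0.0
    } else {
        raw.clamp(0.0, 1.0)
    }
}

/// The four bands described for `Classification::confidence`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceBand {
    Definitive,
    Strong,
    Moderate,
    Weak,
}

/// Map a score onto its band after normalizing it.
pub fn band(confidence: f64) -> ConfidenceBand {
    let c = normalize_confidence(confidence);
    if c >= 1.0 {
        ConfidenceBand::Definitive
    } else if c >= 0.7 {
        ConfidenceBand::Strong
    } else if c >= 0.5 {
        ConfidenceBand::Moderate
    } else {
        ConfidenceBand::Weak
    }
}
```

With this mapping, `is_high_confidence()` corresponds to the `Definitive` and `Strong` bands.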

### 4. Classification Pipeline

Create orchestration layer in `src/classification/pipeline.rs`:

```rust
use crate::types::Result;
use super::{Classifier, Classification, StringContext};
use std::collections::HashMap;

/// Pipeline for orchestrating multiple classifiers
pub struct ClassificationPipeline {
    classifiers: Vec<Box<dyn Classifier>>,
}

impl ClassificationPipeline {
    /// Create a new empty pipeline
    pub fn new() -> Self {
        Self {
            classifiers: Vec::new(),
        }
    }
    
    /// Create a pipeline with default classifiers
    pub fn with_defaults() -> Self {
        let mut pipeline = Self::new();
        // Will add default classifiers in future issues:
        // - ImportExportClassifier (priority: 10)
        // - UrlClassifier (priority: 50)
        // - IpAddressClassifier (priority: 60)
        // - FilePathClassifier (priority: 70)
        // - Base64Classifier (priority: 80)
        // - FormatStringClassifier (priority: 90)
        pipeline
    }
    
    /// Add a classifier to the pipeline
    pub fn add_classifier(&mut self, classifier: Box<dyn Classifier>) {
        self.classifiers.push(classifier);
        // Sort by priority after adding
        self.classifiers.sort_by_key(|c| c.priority());
    }
    
    /// Classify a string using all applicable classifiers
    pub fn classify(&self, context: &StringContext) -> Result<Vec<Classification>> {
        let mut results = Vec::new();
        
        for classifier in &self.classifiers {
            // Skip if classifier can't handle this context
            if !classifier.can_classify(context) {
                continue;
            }
            
            // Run classification and collect results
            match classifier.classify(context) {
                Ok(classifications) => results.extend(classifications),
                Err(e) => {
                    eprintln!("Warning: Classifier '{}' failed: {}", 
                              classifier.name(), e);
                    continue;
                }
            }
        }
        
        // Merge and deduplicate results
        Ok(self.merge_results(results))
    }
    
    /// Merge duplicate tags, keeping highest confidence
    fn merge_results(&self, results: Vec<Classification>) -> Vec<Classification> {
        // Keyed by the tag's Debug representation, so `Tag` does not need
        // to implement `Hash` or `Eq`
        let mut tag_map: HashMap<String, Classification> = HashMap::new();
        
        for classification in results {
            let key = format!("{:?}", classification.tag);
            
            tag_map.entry(key)
                .and_modify(|existing| {
                    // Keep classification with higher confidence
                    if classification.confidence > existing.confidence {
                        *existing = classification.clone();
                    }
                })
                .or_insert(classification);
        }
        
        // Sort by confidence (highest first); break ties on the tag key so
        // the output order does not depend on HashMap iteration order
        let mut merged: Vec<_> = tag_map.into_iter().collect();
        merged.sort_by(|(key_a, a), (key_b, b)| {
            b.confidence.partial_cmp(&a.confidence)
                .unwrap_or(std::cmp::Ordering::Equal)
                .then_with(|| key_a.cmp(key_b))
        });
        
        merged.into_iter().map(|(_, classification)| classification).collect()
    }
    
    /// Get the number of classifiers in the pipeline
    pub fn len(&self) -> usize {
        self.classifiers.len()
    }
    
    /// Check if the pipeline is empty
    pub fn is_empty(&self) -> bool {
        self.classifiers.is_empty()
    }
}

impl Default for ClassificationPipeline {
    fn default() -> Self {
        Self::new()
    }
}
```
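
The merge step can be exercised in isolation. The sketch below is a standalone variant that keys the map on the tag itself instead of its `Debug` string. That only works if `Tag` derives `Hash`, `Eq`, and `Ord`, which is an assumption about `src/types.rs` and not something this issue confirms. The reduced `Tag` and `Classification` are stand-ins.

```rust
use std::collections::HashMap;

// Stand-in tag type; the derives are an assumption about the real `Tag`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum Tag {
    Url,
    Domain,
    FilePath,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Classification {
    pub tag: Tag,
    pub confidence: f64,
    pub classifier_name: String,
}

/// Keep the highest-confidence classification per tag, then order the
/// survivors by confidence (highest first), breaking ties by tag.
pub fn merge_results(results: Vec<Classification>) -> Vec<Classification> {
    let mut best: HashMap<Tag, Classification> = HashMap::new();

    for c in results {
        // Replace only when the tag is new or the newcomer is strictly better,
        // so the earliest classifier wins an exact tie.
        let replace = best
            .get(&c.tag)
            .map_or(true, |existing| c.confidence > existing.confidence);
        if replace {
            best.insert(c.tag, c);
        }
    }

    let mut merged: Vec<Classification> = best.into_values().collect();
    merged.sort_by(|a, b| {
        b.confidence
            .total_cmp(&a.confidence)
            .then_with(|| a.tag.cmp(&b.tag))
    });
    merged
}
```

Because exact ties keep the first entry seen, and classifiers run in priority order, a lower priority value wins when two classifiers report the same tag with the same confidence.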

### 5. Integration with Existing Extraction

Example usage in `src/lib.rs` or extraction workflow:

```rust
use crate::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
use crate::classification::{ClassificationPipeline, StringContext};
use crate::types::{ContainerInfo, FoundString, Result};

/// Extract and classify strings from a binary
pub fn extract_and_classify(
    data: &[u8],
    container_info: &ContainerInfo,
) -> Result<Vec<FoundString>> {
    // Extract strings
    let extractor = BasicExtractor::new();
    let config = ExtractionConfig::default();
    let mut strings = extractor.extract(data, container_info, &config)?;
    
    // Set up classification pipeline
    let pipeline = ClassificationPipeline::with_defaults();
    
    // Classify each string
    for found_string in &mut strings {
        let context = StringContext::new(found_string, container_info);
        
        if let Ok(classifications) = pipeline.classify(&context) {
            // Apply high-confidence tags to the string
            for classification in classifications {
                if classification.is_high_confidence() {
                    found_string.tags.push(classification.tag);
                }
            }
        }
    }
    
    Ok(strings)
}
```
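
The tag-application loop above pushes every high-confidence tag, so a string that already carries a tag from the extraction phase could end up with it twice. A possible refinement is sketched below as a small helper. `apply_tags` is a hypothetical name, the types are reduced stand-ins, and the threshold is passed in rather than fixed at 0.7.

```rust
// Stand-ins for the crate types, reduced to what the helper needs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tag {
    Url,
    Domain,
}

pub struct Classification {
    pub tag: Tag,
    pub confidence: f64,
}

/// Append every tag that meets `min_confidence` and is not already present.
/// Returns the number of tags that were added.
pub fn apply_tags(
    tags: &mut Vec<Tag>,
    classifications: &[Classification],
    min_confidence: f64,
) -> usize {
    let mut added = 0;
    for c in classifications {
        // A linear `contains` is fine here: a string carries only a few tags.
        if c.confidence >= min_confidence && !tags.contains(&c.tag) {
            tags.push(c.tag);
            added += 1;
        }
    }
    added
}
```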

## Implementation Steps

1. **Phase 1: Core Framework** (`src/classification/`)
   - Create `traits.rs` with `Classifier` trait
   - Create `context.rs` with `StringContext` struct  
   - Add `Classification` struct to `mod.rs`
   - Create `pipeline.rs` with `ClassificationPipeline`
   - Re-export types in `mod.rs`

2. **Phase 2: Testing Infrastructure**
   - Create `tests/classification_tests.rs`
   - Implement `MockClassifier` for testing
   - Test pipeline ordering and priority
   - Test confidence scoring and merging
   - Test context creation from `FoundString`

3. **Phase 3: Integration**
   - Update `src/lib.rs` to re-export classification types
   - Add integration example in docs
   - Update `BasicExtractor` workflow example

4. **Phase 4: First Real Classifier (Separate Issue)**
   - Implement `ImportExportClassifier` as proof-of-concept
   - Test with real PE/ELF/Mach-O binaries

## Testing Strategy

### Unit Tests

```rust
#[cfg(test)]
mod tests {
    use super::*;
    
    struct MockClassifier {
        name: String,
        priority: i32,
    }
    
    impl Classifier for MockClassifier {
        fn classify(&self, _ctx: &StringContext) -> Result<Vec<Classification>> {
            Ok(vec![
                Classification::new(
                    Tag::Url,
                    0.9,
                    self.name.clone()
                )
            ])
        }
        
        fn name(&self) -> &str {
            &self.name
        }
        
        fn priority(&self) -> i32 {
            self.priority
        }
    }
    
    #[test]
    fn test_pipeline_ordering() {
        let mut pipeline = ClassificationPipeline::new();
        pipeline.add_classifier(Box::new(MockClassifier { 
            name: "low".into(), 
            priority: 100 
        }));
        pipeline.add_classifier(Box::new(MockClassifier { 
            name: "high".into(), 
            priority: 10 
        }));
        
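        // Verify classifiers are sorted by priority (lower value first).
        // Reading the private field assumes this module lives in pipeline.rs.
        assert_eq!(pipeline.len(), 2);
        let names: Vec<&str> = pipeline.classifiers.iter().map(|c| c.name()).collect();
        assert_eq!(names, ["high", "low"]);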
    }
    
    #[test]
    fn test_confidence_merging() {
        // Test that duplicate tags keep highest confidence
    }
    
    #[test]
    fn test_context_creation() {
        // Test StringContext properly extracts section type, 
        // matches imports/exports, etc.
    }
}
```
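
The `test_context_creation` placeholder could be filled in along the following lines. This is a self-contained sketch: the structs are reduced stand-ins that keep only the fields `StringContext::new` reads (`section`, `text`, `sections`, `imports`), `SectionInfo` is an assumed name for the section record, and `sample_container` is a hypothetical fixture.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SectionType {
    StringData,
    Code,
}

pub struct SectionInfo {
    pub name: String,
    pub section_type: SectionType,
}

pub struct ImportInfo {
    pub name: String,
}

pub struct ContainerInfo {
    pub sections: Vec<SectionInfo>,
    pub imports: Vec<ImportInfo>,
}

pub struct FoundString {
    pub text: String,
    pub section: Option<String>,
}

pub struct StringContext<'a> {
    pub found_string: &'a FoundString,
    pub section_type: Option<SectionType>,
    pub import_match: Option<&'a ImportInfo>,
}

impl<'a> StringContext<'a> {
    /// Same lookups as the proposed `StringContext::new`, minus exports.
    pub fn new(found_string: &'a FoundString, info: &'a ContainerInfo) -> Self {
        // Resolve the section name to its type, if the string has a section.
        let section_type = found_string.section.as_deref().and_then(|name| {
            info.sections
                .iter()
                .find(|s| s.name == name)
                .map(|s| s.section_type)
        });
        // An exact text match against the import table.
        let import_match = info
            .imports
            .iter()
            .find(|imp| imp.name == found_string.text);

        Self { found_string, section_type, import_match }
    }
}

/// Hypothetical fixture: a PE-like container with two sections and one import.
pub fn sample_container() -> ContainerInfo {
    ContainerInfo {
        sections: vec![
            SectionInfo { name: ".text".into(), section_type: SectionType::Code },
            SectionInfo { name: ".rdata".into(), section_type: SectionType::StringData },
        ],
        imports: vec![ImportInfo { name: "CreateFileW".into() }],
    }
}
```

A test would build a `FoundString` in `.rdata` whose text equals the import name and assert that both `section_type` and `import_match` are populated, then repeat with a string that has no section and matches no import.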

### Integration Tests

Test with real binaries:
- PE file with imports/exports
- ELF file with URLs in .rodata
- Mach-O with version strings

## Performance Considerations

- **Lazy Evaluation**: Only create `StringContext` for strings that need classification
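- **Parallel Processing**: Future enhancement to run classifiers in parallel (the `Send + Sync` bound on `Classifier` already permits this)
- **Caching**: Compile regexes once per classifier instance rather than on every `classify()` call
- **Short-Circuit**: `can_classify()` lets the pipeline skip classifiers that do not apply to a string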
- **Priority Ordering**: Run fast classifiers first (imports/exports before regex patterns)
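
To show why the `Send + Sync` bound matters for the parallel-processing item, here is a minimal sketch using `std::thread::scope` from the standard library. The trait is reduced to a single method over `&str`, and `PrefixClassifier` and `classify_parallel` are hypothetical. One thread per classifier is for illustration only; a real implementation would more likely parallelize across strings.

```rust
use std::thread;

/// Reduced stand-in for the `Classifier` trait. The `Send + Sync` supertraits
/// are what allow `&dyn Classifier` to be shared across threads.
pub trait Classifier: Send + Sync {
    fn classify(&self, text: &str) -> Vec<String>;
}

/// Hypothetical classifier that emits `label` when `text` starts with `prefix`.
pub struct PrefixClassifier {
    pub prefix: &'static str,
    pub label: &'static str,
}

impl Classifier for PrefixClassifier {
    fn classify(&self, text: &str) -> Vec<String> {
        if text.starts_with(self.prefix) {
            vec![self.label.to_string()]
        } else {
            Vec::new()
        }
    }
}

/// Run each classifier on its own scoped thread and gather the labels
/// in classifier order.
pub fn classify_parallel(classifiers: &[Box<dyn Classifier>], text: &str) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn everything first so the classifiers actually run concurrently.
        let handles: Vec<_> = classifiers
            .iter()
            .map(|c| scope.spawn(move || c.classify(text)))
            .collect();

        // Joining in spawn order keeps the output deterministic.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("classifier thread panicked"))
            .collect()
    })
}
```

Without the `Sync` supertrait, the closure passed to `scope.spawn` would not compile, because it captures a shared reference to the boxed classifier.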

## Acceptance Criteria

- [ ] `Classifier` trait defined with `classify()`, `name()`, `can_classify()`, and `priority()` methods
- [ ] `StringContext` struct with all necessary fields and helper methods
- [ ] `Classification` struct for results with tag, confidence, reason, and classifier name
- [ ] `ClassificationPipeline` with add, classify, and merge functionality
- [ ] Pipeline respects classifier priority ordering
- [ ] Unit tests for pipeline, context, and classification
- [ ] `MockClassifier` for testing framework
- [ ] Documentation comments for all public types and methods
- [ ] Integration example showing usage with `BasicExtractor`
- [ ] Re-exports in `src/lib.rs`

## Dependencies

**Existing Types (from `src/types.rs`):**
- `FoundString` - Contains text, encoding, offset, tags, confidence
- `Tag` - Enum of all semantic tags
- `SectionType` - Classification of binary sections  
- `BinaryFormat` - ELF, PE, Mach-O detection
- `ContainerInfo` - Binary metadata, sections, imports, exports
- `ImportInfo` / `ExportInfo` - Symbol information
- `StringyError` and `Result` - Error handling

**Will Enable (Future Issues):**
- Individual classifier implementations (URL, Domain, IP, FilePath, etc.)
- Smart scoring that combines extraction confidence + classification confidence
- Context-aware classification (e.g., grouped strings suggest log format)

## Related Requirements

- Requirement 3.1: Semantic Classification Framework  
- Task-ID: stringy-analyzer/semantic-classification-framework
- Blocks: URL Classifier (#TBD), IP Classifier (#TBD), FilePath Classifier (#TBD)

## References

- Existing extraction: `src/extraction/basic.rs`
- Type definitions: `src/types.rs`
- Container parsing: `src/container/mod.rs`
