Create the foundational Classifier trait and classification pipeline structure to enable semantic analysis of extracted strings. This framework will allow multiple classifiers to analyze strings and assign tags (URLs, domains, file paths, etc.) based on context and patterns.
Current State
The infrastructure needed for this feature is already in place:
✅ Existing Components:
FoundString struct with tags, encoding, offset, RVA, confidence scoring
Tag enum with comprehensive variants (URL, Domain, IPv4, IPv6, FilePath, etc.)
ContainerInfo with sections, imports, exports, resources
SectionType and BinaryFormat enums
BasicExtractor for string extraction from binaries
ASCII and UTF-16LE extraction with confidence scoring
🚧 To Be Built:
Classifier trait for pluggable classification strategies
StringContext struct to package extraction context
Classification result type with confidence scoring
ClassificationPipeline orchestration layer
Concrete classifier implementations
Context
Stringy extracts strings from binary files and needs to classify them so that meaningful strings can be told apart from byte sequences that merely happen to look like text. The classification system will:
Provide a common interface (Classifier trait) for different classification strategies
Package string data with contextual information (StringContext) for informed classification
Enable a pipeline architecture where multiple classifiers can process strings sequentially or in parallel
Support confidence scoring to handle ambiguous cases
Integrate seamlessly with existing BasicExtractor and FoundString types
Leverage existing section weighting and confidence scores from extraction phase
Proposed Solution
1. Classifier Trait
Create a trait that all classifiers will implement in src/classification/traits.rs:
```rust
use super::{Classification, StringContext};
use crate::types::Result;

/// Trait for string classifiers that analyze and tag extracted strings
pub trait Classifier: Send + Sync {
    /// Classify a string and return relevant tags with confidence scores
    ///
    /// Returns a vector of classifications, allowing one string to match
    /// multiple patterns (e.g., a URL that's also a domain).
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>>;

    /// Get the name/identifier of this classifier for logging and debugging
    fn name(&self) -> &str;

    /// Check if this classifier can handle the given context
    ///
    /// Allows classifiers to opt-out based on encoding, binary format,
    /// or other contextual factors. Default implementation accepts all.
    fn can_classify(&self, _context: &StringContext) -> bool {
        true
    }

    /// Priority/order hint for the pipeline (lower = earlier execution)
    ///
    /// Some classifiers may want to run before others, e.g., import/export
    /// classifiers should run before generic pattern matching.
    fn priority(&self) -> i32 {
        100 // Default middle priority
    }
}
```
2. StringContext Struct
Encapsulate string data with contextual information in src/classification/context.rs:
```rust
use crate::types::{
    BinaryFormat, ContainerInfo, ExportInfo, FoundString, ImportInfo, SectionType,
};

/// Context information about an extracted string for classification
#[derive(Debug, Clone)]
pub struct StringContext<'a> {
    /// The extracted string with metadata
    pub found_string: &'a FoundString,
    /// Section type where string was found (StringData, Code, etc.)
    pub section_type: Option<SectionType>,
    /// Binary format (ELF, PE, Mach-O)
    pub binary_format: BinaryFormat,
    /// Nearby strings for context-aware classification (within ±100 bytes)
    pub surrounding_strings: Vec<String>,
    /// Full container information
    pub container_info: &'a ContainerInfo,
    /// Matching import if string is from import table
    pub import_match: Option<&'a ImportInfo>,
    /// Matching export if string is from export table
    pub export_match: Option<&'a ExportInfo>,
}

impl<'a> StringContext<'a> {
    /// Create a new context for a found string
    pub fn new(found_string: &'a FoundString, container_info: &'a ContainerInfo) -> Self {
        let section_type = found_string
            .section
            .as_ref()
            .and_then(|name| Self::infer_section_type(name, container_info));

        let import_match = container_info
            .imports
            .iter()
            .find(|imp| imp.name == found_string.text);

        let export_match = container_info
            .exports
            .iter()
            .find(|exp| exp.name == found_string.text);

        Self {
            found_string,
            section_type,
            binary_format: container_info.format,
            surrounding_strings: Vec::new(),
            container_info,
            import_match,
            export_match,
        }
    }

    /// Infer section type from section name
    fn infer_section_type(name: &str, info: &ContainerInfo) -> Option<SectionType> {
        info.sections
            .iter()
            .find(|s| s.name == name)
            .map(|s| s.section_type)
    }

    /// Add surrounding strings for context (builder style, like
    /// `Classification::with_reason`)
    pub fn with_surrounding(mut self, strings: Vec<String>) -> Self {
        self.surrounding_strings = strings;
        self
    }
}
```
3. Classification Result
Define the output structure in src/classification/mod.rs:
```rust
use crate::types::Tag;

/// Result of classifying a string
#[derive(Debug, Clone, PartialEq)]
pub struct Classification {
    /// The assigned tag
    pub tag: Tag,
    /// Confidence score (0.0 to 1.0)
    /// - 1.0: Definitive match (e.g., valid URL with scheme)
    /// - 0.7-0.9: Strong match (e.g., domain pattern with TLD)
    /// - 0.5-0.7: Moderate match (e.g., path-like structure)
    /// - <0.5: Weak match (should probably be filtered)
    pub confidence: f64,
    /// Optional explanation for debugging/logging
    pub reason: Option<String>,
    /// The classifier that produced this result
    pub classifier_name: String,
}

impl Classification {
    /// Create a new classification result
    pub fn new(tag: Tag, confidence: f64, classifier_name: String) -> Self {
        Self {
            tag,
            confidence,
            reason: None,
            classifier_name,
        }
    }

    /// Add an explanation
    pub fn with_reason(mut self, reason: impl Into<String>) -> Self {
        self.reason = Some(reason.into());
        self
    }

    /// Check if this is a high-confidence classification
    pub fn is_high_confidence(&self) -> bool {
        self.confidence >= 0.7
    }
}
```
4. Classification Pipeline
Create orchestration layer in src/classification/pipeline.rs:
```rust
use super::{Classification, Classifier, StringContext};
use crate::types::Result;
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Pipeline for orchestrating multiple classifiers
pub struct ClassificationPipeline {
    classifiers: Vec<Box<dyn Classifier>>,
}

impl ClassificationPipeline {
    /// Create a new empty pipeline
    pub fn new() -> Self {
        Self {
            classifiers: Vec::new(),
        }
    }

    /// Create a pipeline with default classifiers
    pub fn with_defaults() -> Self {
        let pipeline = Self::new();

        // Will add default classifiers in future issues:
        // - ImportExportClassifier (priority: 10)
        // - UrlClassifier (priority: 50)
        // - IpAddressClassifier (priority: 60)
        // - FilePathClassifier (priority: 70)
        // - Base64Classifier (priority: 80)
        // - FormatStringClassifier (priority: 90)

        pipeline
    }

    /// Add a classifier to the pipeline
    pub fn add_classifier(&mut self, classifier: Box<dyn Classifier>) {
        self.classifiers.push(classifier);
        // Sort by priority after adding (stable, so ties keep insertion order)
        self.classifiers.sort_by_key(|c| c.priority());
    }

    /// Classify a string using all applicable classifiers
    pub fn classify(&self, context: &StringContext) -> Result<Vec<Classification>> {
        let mut results = Vec::new();

        for classifier in &self.classifiers {
            // Skip if classifier can't handle this context
            if !classifier.can_classify(context) {
                continue;
            }

            // Run classification and collect results; a failing classifier
            // is reported and skipped rather than aborting the pipeline
            match classifier.classify(context) {
                Ok(classifications) => results.extend(classifications),
                Err(e) => {
                    eprintln!("Warning: Classifier '{}' failed: {}", classifier.name(), e);
                }
            }
        }

        // Merge and deduplicate results
        Ok(self.merge_results(results))
    }

    /// Merge duplicate tags, keeping highest confidence
    fn merge_results(&self, results: Vec<Classification>) -> Vec<Classification> {
        // Keyed on the tag's Debug output, which avoids requiring `Hash` on `Tag`
        let mut tag_map: HashMap<String, Classification> = HashMap::new();

        for classification in results {
            let key = format!("{:?}", classification.tag);
            match tag_map.entry(key) {
                Entry::Occupied(mut existing) => {
                    // Keep classification with higher confidence
                    if classification.confidence > existing.get().confidence {
                        existing.insert(classification);
                    }
                }
                Entry::Vacant(slot) => {
                    slot.insert(classification);
                }
            }
        }

        // Sort by confidence (highest first)
        let mut merged: Vec<_> = tag_map.into_values().collect();
        merged.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
        merged
    }

    /// Get the number of classifiers in the pipeline
    pub fn len(&self) -> usize {
        self.classifiers.len()
    }

    /// Check if the pipeline is empty
    pub fn is_empty(&self) -> bool {
        self.classifiers.is_empty()
    }
}

impl Default for ClassificationPipeline {
    fn default() -> Self {
        Self::new()
    }
}
```
5. Integration with Existing Extraction
Example usage in src/lib.rs or extraction workflow:
```rust
use crate::classification::{ClassificationPipeline, StringContext};
use crate::extraction::{BasicExtractor, ExtractionConfig, StringExtractor};
use crate::types::{ContainerInfo, FoundString, Result};

/// Extract and classify strings from a binary
pub fn extract_and_classify(
    data: &[u8],
    container_info: &ContainerInfo,
) -> Result<Vec<FoundString>> {
    // Extract strings
    let extractor = BasicExtractor::new();
    let config = ExtractionConfig::default();
    let mut strings = extractor.extract(data, container_info, &config)?;

    // Set up classification pipeline
    let pipeline = ClassificationPipeline::with_defaults();

    // Classify each string
    for found_string in &mut strings {
        let context = StringContext::new(found_string, container_info);

        if let Ok(classifications) = pipeline.classify(&context) {
            // Apply high-confidence tags to the string
            for classification in classifications {
                if classification.is_high_confidence() {
                    found_string.tags.push(classification.tag);
                }
            }
        }
    }

    Ok(strings)
}
```
Implementation Steps
Phase 1: Core Framework (src/classification/)
Create traits.rs with Classifier trait
Create context.rs with StringContext struct
Add Classification struct to mod.rs
Create pipeline.rs with ClassificationPipeline
Re-export types in mod.rs
Phase 2: Testing Infrastructure
Create tests/classification_tests.rs
Implement MockClassifier for testing
Test pipeline ordering and priority
Test confidence scoring and merging
Test context creation from FoundString
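The ordering test needs some way to observe the order in which classifiers will run. The following is a minimal sketch, not the proposed implementation: `Pipeline` and the trimmed `Classifier` trait are simplified stand-ins, and `classifier_names()` is a hypothetical accessor that does not appear in the proposal.

```rust
/// Simplified stand-in for the proposed `Classifier` trait; only the parts
/// that influence ordering are modelled here.
trait Classifier {
    fn name(&self) -> &str;
    fn priority(&self) -> i32 {
        100 // same default as in the proposal
    }
}

struct MockClassifier {
    name: &'static str,
    priority: i32,
}

impl Classifier for MockClassifier {
    fn name(&self) -> &str {
        self.name
    }
    fn priority(&self) -> i32 {
        self.priority
    }
}

/// Simplified stand-in for `ClassificationPipeline`.
#[derive(Default)]
struct Pipeline {
    classifiers: Vec<Box<dyn Classifier>>,
}

impl Pipeline {
    fn add_classifier(&mut self, classifier: Box<dyn Classifier>) {
        self.classifiers.push(classifier);
        // `sort_by_key` is stable, so equal priorities keep insertion order.
        self.classifiers.sort_by_key(|c| c.priority());
    }

    /// Hypothetical accessor (an assumption of this sketch) that exposes the
    /// execution order so a test can assert on it.
    fn classifier_names(&self) -> Vec<&str> {
        self.classifiers.iter().map(|c| c.name()).collect()
    }
}

/// Registers four mocks out of order and reports the resulting execution order.
fn execution_order() -> Vec<String> {
    let mut pipeline = Pipeline::default();
    for (name, priority) in [("generic", 100), ("imports", 10), ("urls", 50), ("paths", 50)] {
        pipeline.add_classifier(Box::new(MockClassifier { name, priority }));
    }
    pipeline
        .classifier_names()
        .into_iter()
        .map(String::from)
        .collect()
}
```

Because the sort is stable, the two mocks with priority 50 stay in the order they were added, which is worth asserting explicitly if tie-breaking is meant to be part of the contract.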
Phase 3: Integration
Update src/lib.rs to re-export classification types
Add integration example in docs
Update BasicExtractor workflow example
Phase 4: First Real Classifier (Separate Issue)
Implement ImportExportClassifier as proof-of-concept
Test with real PE/ELF/Mach-O binaries
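To give a feel for what the proof-of-concept classifier might look like, here is a rough sketch built on simplified stand-in types. Everything in it is an assumption made for illustration: the `Tag::Import` and `Tag::Export` variants, the flattened `StringContext` fields, the `String` error type, and the sample symbol names are not taken from the existing code.

```rust
/// Assumed tag variants; the real `Tag` enum may name these differently.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    Import,
    Export,
}

#[derive(Debug, Clone, PartialEq)]
struct Classification {
    tag: Tag,
    confidence: f64,
    classifier_name: String,
}

/// Flattened stand-in for the proposed `StringContext`.
struct StringContext<'a> {
    text: &'a str,
    import_names: &'a [&'a str],
    export_names: &'a [&'a str],
}

trait Classifier: Send + Sync {
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>, String>;
    fn name(&self) -> &str;
    fn can_classify(&self, _context: &StringContext) -> bool {
        true
    }
    fn priority(&self) -> i32 {
        100
    }
}

struct ImportExportClassifier;

impl Classifier for ImportExportClassifier {
    fn classify(&self, context: &StringContext) -> Result<Vec<Classification>, String> {
        let mut out = Vec::new();
        // An exact symbol-table match is treated as definitive (confidence 1.0).
        if context.import_names.contains(&context.text) {
            out.push(Classification {
                tag: Tag::Import,
                confidence: 1.0,
                classifier_name: self.name().to_string(),
            });
        }
        if context.export_names.contains(&context.text) {
            out.push(Classification {
                tag: Tag::Export,
                confidence: 1.0,
                classifier_name: self.name().to_string(),
            });
        }
        Ok(out)
    }

    fn name(&self) -> &str {
        "import_export"
    }

    /// Opt out when the container has no symbols at all.
    fn can_classify(&self, context: &StringContext) -> bool {
        !context.import_names.is_empty() || !context.export_names.is_empty()
    }

    /// Runs early, ahead of generic pattern matching.
    fn priority(&self) -> i32 {
        10
    }
}

/// Classifies `text` against a small, made-up symbol table.
fn classify_symbol(text: &str) -> Vec<Classification> {
    let imports = ["CreateFileW", "malloc"];
    let exports = ["DllMain"];
    let context = StringContext {
        text,
        import_names: &imports,
        export_names: &exports,
    };
    let classifier = ImportExportClassifier;
    if !classifier.can_classify(&context) {
        return Vec::new();
    }
    classifier.classify(&context).unwrap_or_default()
}
```

The sketch overrides both `can_classify()` and `priority()`, which exercises every method of the trait and makes it a reasonable first check of the framework's shape.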
Testing Strategy
Unit Tests
```rust
#[cfg(test)]
mod tests {
    use super::*;

    struct MockClassifier {
        name: String,
        priority: i32,
    }

    impl Classifier for MockClassifier {
        fn classify(&self, _ctx: &StringContext) -> Result<Vec<Classification>> {
            Ok(vec![Classification::new(Tag::Url, 0.9, self.name.clone())])
        }

        fn name(&self) -> &str {
            &self.name
        }

        fn priority(&self) -> i32 {
            self.priority
        }
    }

    #[test]
    fn test_pipeline_ordering() {
        let mut pipeline = ClassificationPipeline::new();
        pipeline.add_classifier(Box::new(MockClassifier { name: "low".into(), priority: 100 }));
        pipeline.add_classifier(Box::new(MockClassifier { name: "high".into(), priority: 10 }));

        // Both classifiers are registered; asserting the actual execution
        // order needs an accessor that the pipeline does not expose yet
        assert_eq!(pipeline.len(), 2);
    }

    #[test]
    fn test_confidence_merging() {
        // Test that duplicate tags keep highest confidence
    }

    #[test]
    fn test_context_creation() {
        // Test StringContext properly extracts section type,
        // matches imports/exports, etc.
    }
}
```
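The merging behaviour that `test_confidence_merging` is meant to cover can be exercised in isolation. The sketch below is illustrative only: it uses a two-variant stand-in `Tag` that derives `Hash` and `Eq` (an assumption; the proposal keys on the `Debug` string instead), and a free function `merge_results` in place of the private pipeline method.

```rust
use std::collections::HashMap;

/// Stand-in tag type; deriving `Hash`/`Eq` is an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Tag {
    Url,
    Domain,
}

#[derive(Debug, Clone, PartialEq)]
struct Classification {
    tag: Tag,
    confidence: f64,
    classifier_name: String,
}

/// Small constructor helper to keep the examples short.
fn classification(tag: Tag, confidence: f64, name: &str) -> Classification {
    Classification {
        tag,
        confidence,
        classifier_name: name.to_string(),
    }
}

/// Keeps one classification per tag (the one with the highest confidence)
/// and returns the survivors sorted by descending confidence.
fn merge_results(results: Vec<Classification>) -> Vec<Classification> {
    let mut best: HashMap<Tag, Classification> = HashMap::new();

    for candidate in results {
        // Replace only when the tag is new or the candidate is strictly better,
        // so the first result wins when confidences are equal.
        let replace = best
            .get(&candidate.tag)
            .map_or(true, |existing| candidate.confidence > existing.confidence);
        if replace {
            best.insert(candidate.tag, candidate);
        }
    }

    let mut merged: Vec<_> = best.into_values().collect();
    merged.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    merged
}
```

A test along these lines would feed in two `Url` results with different confidences plus one `Domain` result and check that exactly two entries come back, with the stronger `Url` result first.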
Integration Tests
Test with real binaries:
PE file with imports/exports
ELF file with URLs in .rodata
Mach-O with version strings
Performance Considerations
Lazy Evaluation: Only create StringContext for strings that need classification
Parallel Processing: Future enhancement to run classifiers in parallel (the Send + Sync bound on the Classifier trait already permits sharing classifiers across threads)
Caching: Consider caching regex compilation in classifiers
Short-Circuit: can_classify() allows early exit
Priority Ordering: Run fast classifiers first (imports/exports before regex patterns)
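As an illustration of why the `Send + Sync` bound matters for the parallel-processing idea, the sketch below runs every classifier on its own scoped thread using only the standard library. It is a simplified assumption-laden example, not a proposed design: the trait is reduced to a single method returning tag names as strings, and `PrefixClassifier` and `classify_parallel` are hypothetical names.

```rust
use std::thread;

/// Reduced stand-in for the proposed trait. The `Send + Sync` supertraits are
/// what allow `&dyn Classifier` to be handed to another thread.
trait Classifier: Send + Sync {
    fn classify(&self, text: &str) -> Vec<String>;
}

/// Hypothetical classifier that tags strings starting with a fixed prefix.
struct PrefixClassifier {
    prefix: &'static str,
    tag: &'static str,
}

impl Classifier for PrefixClassifier {
    fn classify(&self, text: &str) -> Vec<String> {
        if text.starts_with(self.prefix) {
            vec![self.tag.to_string()]
        } else {
            Vec::new()
        }
    }
}

/// Runs each classifier on its own scoped thread and concatenates the results
/// in classifier order.
fn classify_parallel(classifiers: &[Box<dyn Classifier>], text: &str) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn all threads first so they actually run concurrently.
        let handles: Vec<_> = classifiers
            .iter()
            .map(|classifier| scope.spawn(move || classifier.classify(text)))
            .collect();

        // Join in spawn order, which keeps the output deterministic.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("classifier thread panicked"))
            .collect()
    })
}
```

One thread per classifier per string is almost certainly too fine-grained for real workloads; splitting the list of strings across a fixed number of workers would likely amortize the thread overhead better, but the trait bounds needed are the same.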
Acceptance Criteria
Classifier trait defined with classify(), name(), can_classify(), and priority() methods
StringContext struct with all necessary fields and helper methods
Classification struct for results with tag, confidence, reason, and classifier name
ClassificationPipeline with add, classify, and merge functionality
Pipeline respects classifier priority ordering
Unit tests for pipeline, context, and classification
MockClassifier for testing framework
Documentation comments for all public types and methods
Integration example showing usage with BasicExtractor
Dependencies
Existing Types (from src/types.rs):
FoundString - Contains text, encoding, offset, tags, confidence
Tag - Enum of all semantic tags
SectionType - Classification of binary sections
BinaryFormat - ELF, PE, Mach-O detection
ContainerInfo - Binary metadata, sections, imports, exports
ImportInfo / ExportInfo - Symbol information
StringyError and Result - Error handling
Will Enable (Future Issues):
Related Requirements
References
src/extraction/basic.rs
src/types.rs
src/container/mod.rs