feat: add artifacts-based configuration for selective destination syncing#126
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a SyncToDestination configuration across the DocumentProcessor, ChunksGenerator, and VectorEmbeddingsGenerator components, allowing for granular control over which processed files are synchronized to the final destination. The controllers have been updated to conditionally trigger reconciliation and synchronization based on these flags and the presence of unprocessed files. Additionally, the Snowflake destination logic was refactored to handle multiple file types and prevent redundant uploads by comparing metadata. Review feedback suggests improving the robustness of JSON parsing for raw file paths by using minimal anonymous structs and refactoring the metadata comparison logic to reduce code duplication using generics.
| func filesAreEqual(filePath string, existingData string, newData []byte) bool { | ||
| if strings.HasSuffix(filePath, VectorEmbeddingsFileSuffix) { | ||
| var existing, incoming EmbeddingsFile | ||
| if json.Unmarshal([]byte(existingData), &existing) == nil && | ||
| json.Unmarshal(newData, &incoming) == nil && | ||
| existing.EmbeddingDocument != nil && incoming.EmbeddingDocument != nil && | ||
| existing.EmbeddingDocument.Metadata != nil && incoming.EmbeddingDocument.Metadata != nil { | ||
| return existing.EmbeddingDocument.Metadata.Equal(incoming.EmbeddingDocument.Metadata) | ||
| } | ||
| } else if strings.HasSuffix(filePath, ChunksFileSuffix) { | ||
| var existing, incoming ChunksFile | ||
| if json.Unmarshal([]byte(existingData), &existing) == nil && | ||
| json.Unmarshal(newData, &incoming) == nil && | ||
| existing.ChunksDocument != nil && incoming.ChunksDocument != nil && | ||
| existing.ChunksDocument.Metadata != nil && incoming.ChunksDocument.Metadata != nil { | ||
| return existing.ChunksDocument.Metadata.Equal(incoming.ChunksDocument.Metadata) | ||
| } | ||
| } else if strings.HasSuffix(filePath, ConvertedFileSuffix) { | ||
| var existing, incoming ConvertedFile | ||
| if json.Unmarshal([]byte(existingData), &existing) == nil && | ||
| json.Unmarshal(newData, &incoming) == nil && | ||
| existing.ConvertedDocument != nil && incoming.ConvertedDocument != nil && | ||
| existing.ConvertedDocument.Metadata != nil && incoming.ConvertedDocument.Metadata != nil { | ||
| return existing.ConvertedDocument.Metadata.Equal(incoming.ConvertedDocument.Metadata) | ||
| } | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
The filesAreEqual function has a significant amount of repetitive code for handling different file types (EmbeddingsFile, ChunksFile, ConvertedFile). The logic for unmarshalling and comparing metadata is nearly identical in each branch of the if/else if chain.
This repetition makes the code harder to maintain and increases the chance of introducing bugs if a new file type is added or if the comparison logic needs to be changed.
Consider refactoring this function to reduce duplication. You could use a generic helper function to encapsulate the common logic. This would make the code cleaner, more concise, and easier to extend in the future.
Example of a refactored approach using generics:
// compareMetadata is a generic helper to reduce repetition in filesAreEqual
func compareMetadata[T any](existingData string, newData []byte, getMeta func(*T) *Metadata) bool {
var existing, incoming T
if json.Unmarshal([]byte(existingData), &existing) != nil || json.Unmarshal(newData, &incoming) != nil {
return false
}
metaExisting := getMeta(&existing)
metaIncoming := getMeta(&incoming)
if metaExisting != nil && metaIncoming != nil {
return metaExisting.Equal(metaIncoming)
}
return false
}
// filesAreEqual compares files based on their type and returns true if metadata is unchanged
func filesAreEqual(filePath string, existingData string, newData []byte) bool {
if strings.HasSuffix(filePath, VectorEmbeddingsFileSuffix) {
return compareMetadata(existingData, newData, func(f *EmbeddingsFile) *Metadata {
if f.EmbeddingDocument != nil { return f.EmbeddingDocument.Metadata }
return nil
})
} else if strings.HasSuffix(filePath, ChunksFileSuffix) {
return compareMetadata(existingData, newData, func(f *ChunksFile) *Metadata {
if f.ChunksDocument != nil { return f.ChunksDocument.Metadata }
return nil
})
} else if strings.HasSuffix(filePath, ConvertedFileSuffix) {
return compareMetadata(existingData, newData, func(f *ConvertedFile) *Metadata {
if f.ConvertedDocument != nil { return f.ConvertedDocument.Metadata }
return nil
})
}
return false
}ab1e868 to
973922f
Compare
973922f to
3d60885
Compare
3d60885 to
847640c
Compare
847640c to
e1b8e69
Compare
…cing
Enables selective syncing of file types with custom artifact paths under
a stages directory structure
Users can configure which file types (converted/chunks/embeddings) to sync
and customize their destination paths
Example usage in UnstructuredDataProduct spec:
destinationConfig:
artifacts:
- type: stage
name: documentProcessorConfig
path: processed-documents
- type: stage
name: chunksGeneratorConfig
path: chunks
- type: stage
name: vectorEmbeddingsGeneratorConfig
path: vector-embeddings
Signed-off-by: Puneet Punamiya <ppunamiy@redhat.com>
e1b8e69 to
ea88c03
Compare
Enables selective syncing of file types with custom artifact paths under
a stages directory structure
Users can configure which file types (converted/chunks/embeddings) to sync
and customize their destination paths
Example usage in UnstructuredDataProduct