Skip to content

feat: add artifacts-based configuration for selective destination syncing#126

Open
PuneetPunamiya wants to merge 1 commit intoredhat-data-and-ai:mainfrom
PuneetPunamiya:adds-support-for-selective-destination-sync
Open

feat: add artifacts-based configuration for selective destination syncing#126
PuneetPunamiya wants to merge 1 commit intoredhat-data-and-ai:mainfrom
PuneetPunamiya:adds-support-for-selective-destination-sync

Conversation

@PuneetPunamiya
Copy link
Copy Markdown
Contributor

@PuneetPunamiya PuneetPunamiya commented Mar 30, 2026

  • Enables selective syncing of file types with custom artifact paths under
    a stages directory structure

  • Users can configure which file types (converted/chunks/embeddings) to sync
    and customize their destination paths

  • Example usage in UnstructuredDataProduct

spec:
  destinationConfig:
    artifacts:
     - type: stage
       name: documentProcessorConfig
       path: processed-documents
     - type: stage
       name: chunksGeneratorConfig
       path: chunks
     - type: stage
       name: vectorEmbeddingsGeneratorConfig
       path: vector-embeddings

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a SyncToDestination configuration across the DocumentProcessor, ChunksGenerator, and VectorEmbeddingsGenerator components, allowing for granular control over which processed files are synchronized to the final destination. The controllers have been updated to conditionally trigger reconciliation and synchronization based on these flags and the presence of unprocessed files. Additionally, the Snowflake destination logic was refactored to handle multiple file types and prevent redundant uploads by comparing metadata. Review feedback suggests improving the robustness of JSON parsing for raw file paths by using minimal anonymous structs and refactoring the metadata comparison logic to reduce code duplication using generics.

Comment on lines +159 to +186
func filesAreEqual(filePath string, existingData string, newData []byte) bool {
if strings.HasSuffix(filePath, VectorEmbeddingsFileSuffix) {
var existing, incoming EmbeddingsFile
if json.Unmarshal([]byte(existingData), &existing) == nil &&
json.Unmarshal(newData, &incoming) == nil &&
existing.EmbeddingDocument != nil && incoming.EmbeddingDocument != nil &&
existing.EmbeddingDocument.Metadata != nil && incoming.EmbeddingDocument.Metadata != nil {
return existing.EmbeddingDocument.Metadata.Equal(incoming.EmbeddingDocument.Metadata)
}
} else if strings.HasSuffix(filePath, ChunksFileSuffix) {
var existing, incoming ChunksFile
if json.Unmarshal([]byte(existingData), &existing) == nil &&
json.Unmarshal(newData, &incoming) == nil &&
existing.ChunksDocument != nil && incoming.ChunksDocument != nil &&
existing.ChunksDocument.Metadata != nil && incoming.ChunksDocument.Metadata != nil {
return existing.ChunksDocument.Metadata.Equal(incoming.ChunksDocument.Metadata)
}
} else if strings.HasSuffix(filePath, ConvertedFileSuffix) {
var existing, incoming ConvertedFile
if json.Unmarshal([]byte(existingData), &existing) == nil &&
json.Unmarshal(newData, &incoming) == nil &&
existing.ConvertedDocument != nil && incoming.ConvertedDocument != nil &&
existing.ConvertedDocument.Metadata != nil && incoming.ConvertedDocument.Metadata != nil {
return existing.ConvertedDocument.Metadata.Equal(incoming.ConvertedDocument.Metadata)
}
}
return false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The filesAreEqual function has a significant amount of repetitive code for handling different file types (EmbeddingsFile, ChunksFile, ConvertedFile). The logic for unmarshalling and comparing metadata is nearly identical in each branch of the if/else if chain.

This repetition makes the code harder to maintain and increases the chance of introducing bugs if a new file type is added or if the comparison logic needs to be changed.

Consider refactoring this function to reduce duplication. You could use a generic helper function to encapsulate the common logic. This would make the code cleaner, more concise, and easier to extend in the future.

Example of a refactored approach using generics:

// compareMetadata is a generic helper to reduce repetition in filesAreEqual
func compareMetadata[T any](existingData string, newData []byte, getMeta func(*T) *Metadata) bool {
	var existing, incoming T
	if json.Unmarshal([]byte(existingData), &existing) != nil || json.Unmarshal(newData, &incoming) != nil {
		return false
	}
	metaExisting := getMeta(&existing)
	metaIncoming := getMeta(&incoming)
	if metaExisting != nil && metaIncoming != nil {
		return metaExisting.Equal(metaIncoming)
	}
	return false
}

// filesAreEqual compares files based on their type and returns true if metadata is unchanged
func filesAreEqual(filePath string, existingData string, newData []byte) bool {
	if strings.HasSuffix(filePath, VectorEmbeddingsFileSuffix) {
		return compareMetadata(existingData, newData, func(f *EmbeddingsFile) *Metadata {
			if f.EmbeddingDocument != nil { return f.EmbeddingDocument.Metadata }
			return nil
		})
	} else if strings.HasSuffix(filePath, ChunksFileSuffix) {
		return compareMetadata(existingData, newData, func(f *ChunksFile) *Metadata {
			if f.ChunksDocument != nil { return f.ChunksDocument.Metadata }
			return nil
		})
	} else if strings.HasSuffix(filePath, ConvertedFileSuffix) {
		return compareMetadata(existingData, newData, func(f *ConvertedFile) *Metadata {
			if f.ConvertedDocument != nil { return f.ConvertedDocument.Metadata }
			return nil
		})
	}
	return false
}

@PuneetPunamiya PuneetPunamiya force-pushed the adds-support-for-selective-destination-sync branch from ab1e868 to 973922f Compare March 31, 2026 10:00
@PuneetPunamiya PuneetPunamiya force-pushed the adds-support-for-selective-destination-sync branch from 973922f to 3d60885 Compare April 7, 2026 05:32
@PuneetPunamiya PuneetPunamiya force-pushed the adds-support-for-selective-destination-sync branch from 3d60885 to 847640c Compare April 7, 2026 05:39
@PuneetPunamiya PuneetPunamiya changed the title feat: add syncToDestination flags for selective file syncing feat: add artifacts-based configuration for selective destination syncing Apr 7, 2026
@PuneetPunamiya PuneetPunamiya force-pushed the adds-support-for-selective-destination-sync branch from 847640c to e1b8e69 Compare April 7, 2026 06:21
…cing

Enables selective syncing of file types with custom artifact paths under
a stages directory structure

Users can configure which file types (converted/chunks/embeddings) to sync
and customize their destination paths

Example usage in UnstructuredDataProduct spec:
  destinationConfig:
    artifacts:
     - type: stage
       name: documentProcessorConfig
       path: processed-documents
     - type: stage
       name: chunksGeneratorConfig
       path: chunks
     - type: stage
       name: vectorEmbeddingsGeneratorConfig
       path: vector-embeddings

Signed-off-by: Puneet Punamiya <ppunamiy@redhat.com>
@PuneetPunamiya PuneetPunamiya force-pushed the adds-support-for-selective-destination-sync branch from e1b8e69 to ea88c03 Compare April 7, 2026 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant