Datafusion codec - adding libs, data source plugin and data source aware plugin #37

Draft
bharath-techie wants to merge 3 commits into feature/datafusion from search-df-codecs

@bharath-techie
Owner

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

mch2 and others added 3 commits July 30, 2025 23:04
Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
Signed-off-by: bharath-techie <bharath78910@gmail.com>
@bharath-techie bharath-techie marked this pull request as draft August 15, 2025 15:15
/**
 * Represents a stream of record batches from a DataFusion query execution.
 * This interface provides access to query results in a streaming fashion.
 */
public interface RecordBatchStream extends AutoCloseable {
Collaborator

Maybe this could be generically typed, e.g. RecordBatchStream<T>?

Owner Author

Yes, makes sense. I had a TODO in the CSV RecordBatchStream (RBS) to refactor it a bit.
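
For illustration, a minimal sketch of what the generic typing suggested in this thread might look like. The `hasNext()`/`next()` methods, `ListRecordBatchStream`, and `GenericStreamDemo` are assumptions made up for this example; the PR's actual interface members are not shown above, and a real implementation would presumably wrap a native stream handle rather than a list. It assumes Java 11 or later.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical generic variant of RecordBatchStream; T is the batch type. */
interface RecordBatchStream<T> extends AutoCloseable {
    /** Returns true if another batch is available. */
    boolean hasNext();

    /** Returns the next batch, or throws NoSuchElementException if none is left. */
    T next();

    /** Narrowed so callers do not have to handle a checked exception. */
    @Override
    void close();
}

/** Illustrative in-memory implementation backed by a list of batches. */
final class ListRecordBatchStream<T> implements RecordBatchStream<T> {
    private final Iterator<T> batches;
    private boolean closed;

    ListRecordBatchStream(List<T> batches) {
        this.batches = batches.iterator();
    }

    @Override
    public boolean hasNext() {
        return !closed && batches.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("stream is closed or exhausted");
        }
        return batches.next();
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

final class GenericStreamDemo {
    /** Drains a stream whose batches are String[] rows and counts the rows. */
    static int countRows(RecordBatchStream<String[]> stream) {
        int rows = 0;
        try (stream) { // closes the stream even if reading fails
            while (stream.hasNext()) {
                rows += stream.next().length;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> batches = List.of(new String[] {"a,1", "b,2"}, new String[] {"c,3"});
        System.out.println(countRows(new ListRecordBatchStream<>(batches)));
    }
}
```

With a type parameter, a CSV-backed stream and, say, an Arrow-backed stream could share one interface while exposing different batch types, which seems to be the motivation behind the suggestion.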

Comment on lines +32 to +55
/**
 * Create a new session context for query execution.
 *
 * @param globalRuntimeEnvId the global runtime environment ID
 * @return a CompletableFuture containing the session context ID
 */
CompletableFuture<Long> createSessionContext(long globalRuntimeEnvId);

/**
 * Execute a Substrait query plan.
 *
 * @param sessionContextId the session context ID
 * @param substraitPlanBytes the serialized Substrait query plan
 * @return a CompletableFuture containing the result stream
 */
CompletableFuture<RecordBatchStream> executeSubstraitQuery(long sessionContextId, byte[] substraitPlanBytes);

/**
 * Close a session context and free associated resources.
 *
 * @param sessionContextId the session context ID to close
 * @return a CompletableFuture that completes when the context is closed
 */
CompletableFuture<Void> closeSessionContext(long sessionContextId);
Collaborator

How is this specific to the data source? Isn't this more around query execution?
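
One way to read this comment is that the three session methods could live on an execution-only interface, separate from any data source plugin. The sketch below assumes that reading. `QueryExecutionService`, `InMemoryExecutionService`, `SessionLifecycleDemo`, and `withSession` are hypothetical names, the `RecordBatchStream` here is a stand-in, and the fake never calls DataFusion; only the three method signatures come from the diff. It assumes Java 11 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Stand-in for the PR's RecordBatchStream; close() is narrowed to throw nothing. */
interface RecordBatchStream extends AutoCloseable {
    @Override
    void close();
}

/** Hypothetical execution-only interface carrying the three methods under review. */
interface QueryExecutionService {
    CompletableFuture<Long> createSessionContext(long globalRuntimeEnvId);
    CompletableFuture<RecordBatchStream> executeSubstraitQuery(long sessionContextId, byte[] substraitPlanBytes);
    CompletableFuture<Void> closeSessionContext(long sessionContextId);
}

/** In-memory fake that only tracks open contexts; it never touches DataFusion. */
final class InMemoryExecutionService implements QueryExecutionService {
    private final AtomicLong nextContextId = new AtomicLong(1);
    private final Map<Long, Long> openContexts = new ConcurrentHashMap<>();

    @Override
    public CompletableFuture<Long> createSessionContext(long globalRuntimeEnvId) {
        long contextId = nextContextId.getAndIncrement();
        openContexts.put(contextId, globalRuntimeEnvId);
        return CompletableFuture.completedFuture(contextId);
    }

    @Override
    public CompletableFuture<RecordBatchStream> executeSubstraitQuery(long sessionContextId, byte[] substraitPlanBytes) {
        if (!openContexts.containsKey(sessionContextId)) {
            return CompletableFuture.failedFuture(new IllegalStateException("unknown session context"));
        }
        RecordBatchStream emptyStream = () -> { }; // placeholder result with a no-op close()
        return CompletableFuture.completedFuture(emptyStream);
    }

    @Override
    public CompletableFuture<Void> closeSessionContext(long sessionContextId) {
        openContexts.remove(sessionContextId);
        return CompletableFuture.completedFuture(null);
    }

    int openContextCount() {
        return openContexts.size();
    }
}

final class SessionLifecycleDemo {
    /** Creates a context, runs the plan, consumes the stream, and always closes the context. */
    static <R> CompletableFuture<R> withSession(QueryExecutionService service, long runtimeEnvId,
                                                byte[] plan, Function<RecordBatchStream, R> consumer) {
        return service.createSessionContext(runtimeEnvId).thenCompose(contextId -> {
            CompletableFuture<R> consumed = service.executeSubstraitQuery(contextId, plan)
                .thenApply(stream -> {
                    try (stream) { // close the stream once the consumer is done
                        return consumer.apply(stream);
                    }
                });
            // Close the context on success and on failure, then surface the original outcome.
            CompletableFuture<CompletableFuture<R>> closed = consumed.handle((result, error) ->
                service.closeSessionContext(contextId).thenCompose(ignored -> outcome(result, error)));
            return closed.thenCompose(Function.identity());
        });
    }

    private static <R> CompletableFuture<R> outcome(R result, Throwable error) {
        return error == null ? CompletableFuture.completedFuture(result) : CompletableFuture.failedFuture(error);
    }

    public static void main(String[] args) {
        InMemoryExecutionService service = new InMemoryExecutionService();
        byte[] plan = "placeholder, not a real Substrait plan".getBytes(StandardCharsets.UTF_8);
        String result = withSession(service, 0L, plan, stream -> "consumed").join();
        System.out.println(result + ", open contexts: " + service.openContextCount());
    }
}
```

The stream is consumed before the context is closed because, in this sketch, the result is assumed to depend on the session staying alive. Whether that holds for the native implementation is not stated in the PR.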

Comment on lines +4 to +15

use datafusion::prelude::*;
use datafusion::execution::context::SessionContext;
use std::collections::HashMap;
use std::sync::Arc;
use anyhow::Result;

/// Manages DataFusion session contexts
pub struct SessionContextManager {
    contexts: HashMap<*mut SessionContext, Arc<SessionContext>>,
    next_runtime_id: u64,
}
Collaborator

Shouldn't we avoid DataFusion dependencies in dataformat-csv?

opensearchplugin {
    name = 'dataformat-csv'
    description = 'CSV data format plugin for OpenSearch DataFusion'
    classname = 'org.opensearch.datafusion.csv.CsvDataFormatPlugin'
Collaborator

Shouldn't we decouple the package namespacing?

Owner Author

Yes, I'll work with Mohit. This whole dataformat plugin will probably be moved or redesigned as something that works for both query and indexing. I have TODOs in the plugin class for this.

Comment on lines +10 to +13
[dependencies]
# DataFusion dependencies
datafusion = "49.0.0"
datafusion-substrait = "49.0.0"
Collaborator

Shouldn't we avoid this dependency?

@bharath-techie bharath-techie force-pushed the feature/datafusion branch 2 times, most recently from 31f431b to 53e2fa9 January 22, 2026 11:46
3 participants