Skip to content

rust: Add identify_reader_sync/async for Read + Seek types#1313

Closed
withbitsinmind wants to merge 2 commits intogoogle:mainfrom
withbitsinmind:main
Closed

rust: Add identify_reader_sync/async for Read + Seek types#1313
withbitsinmind wants to merge 2 commits intogoogle:mainfrom
withbitsinmind:main

Conversation

@withbitsinmind
Copy link

This PR adds two new functions to session, identify_reader_sync and identify_reader_async which makes it possible to pass in any type that implements Read and Seek. It is analogous to the identify_stream function in the Python implementation #979.

I suppose an alternative, and perhaps cleaner solution, would be to drop the sealed trait pattern for SyncInput and SyncInputApi (and the async equivalent) and instead expose the SyncInputApi so that users can implement it for arbitrary types.

@withbitsinmind withbitsinmind requested a review from ia0 as a code owner March 2, 2026 20:26
@google-cla
Copy link

google-cla bot commented Mar 2, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces identify_reader_sync and identify_reader_async to allow identifying file types from Read + Seek sources, which is a great addition for flexibility. The implementation looks solid, but I have a few suggestions. There's a potential issue with integer truncation on 32-bit systems when getting the file length, which could lead to problems with large files. I've also pointed out a small readability improvement in the read_at method and suggested refactoring the new test to reduce code duplication. Overall, these are good changes that enhance the library's usability.

Comment on lines +131 to +157
fn identify_by_reader_reference() {
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct Test {
prediction_mode: String,
content_base64: String,
status: String,
prediction: Option<Prediction>,
}
let path = format!(
"../../tests_data/reference/{MODEL_NAME}-inference_examples_by_content.json.gz"
);
let mut tests = String::new();
GzDecoder::new(File::open(path).unwrap()).read_to_string(&mut tests).unwrap();
let tests: Vec<Test> = serde_json::from_str(&tests).unwrap();
let mut session = Session::new().unwrap();
for test in tests {
if test.prediction_mode != "high-confidence" {
continue;
}
assert_eq!(test.status, "ok");
let expected = test.prediction.unwrap();
let content = BASE64.decode(test.content_base64.as_bytes()).unwrap();
let actual = session.identify_reader_sync(std::io::Cursor::new(content)).unwrap();
assert_prediction(actual, expected, &test.content_base64);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is significant code duplication between this new test identify_by_reader_reference and the existing identify_by_content_reference test. The struct definition, test data loading, and test loop are nearly identical.

To improve maintainability, you could refactor the common parts. For example, define the Test struct once outside the test functions and create a helper function to load the test data. This would make the tests cleaner and easier to manage.

@ia0
Copy link
Member

ia0 commented Mar 3, 2026

Thanks for the PR!

Do you have an example of an object that implements Read and Seek and is not File? This could help figure out the best solution.

I see at least the following options:

  • Add the identify_reader_{sync,async} functions as you did. I don't think we should do that. We should use the identify_content_{sync,async} functions directly.
  • Add the ReadSeek wrapper as you did. This doesn't look great.
  • Unseal {Sync,Async}Input as you proposed. I guess that's an option, but then we need to decide if it should be an unsafe trait.
  • Provide a blanket implementation for Read and Seek. We need to make sure this is not a breaking change.

@withbitsinmind
Copy link
Author

The example object in my case is HttpRangeReader, a private struct, which implements Read and Seek (or the async equivalents) using HTTP Range requests towards a URL. This implementation would make it possible to run magika on said resource without having to download it to a file beforehand. It could for example be the presigned URL for a large S3 object.

As for the unseal alternative, my take is that it should not be marked as unsafe because it is not in itself unsafe, though an implementor could of course provide an incorrect or unsafe implementation.

@ia0
Copy link
Member

ia0 commented Mar 3, 2026

The example object in my case is HttpRangeReader, a private struct, which implements Read and Seek (or the async equivalents) using HTTP Range requests towards a URL.

Thanks! This makes sense. Could you tell me if #1315 would work for you?

@withbitsinmind
Copy link
Author

Brilliant! I can confirm that #1315 works for me 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants