rust: Add identify_reader_sync/async for Read + Seek types#1313
rust: Add identify_reader_sync/async for Read + Seek types#1313withbitsinmind wants to merge 2 commits intogoogle:mainfrom
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request introduces identify_reader_sync and identify_reader_async to allow identifying file types from Read + Seek sources, which is a great addition for flexibility. The implementation looks solid, but I have a few suggestions. There's a potential issue with integer truncation on 32-bit systems when getting the file length, which could lead to problems with large files. I've also pointed out a small readability improvement in the read_at method and suggested refactoring the new test to reduce code duplication. Overall, these are good changes that enhance the library's usability.
| fn identify_by_reader_reference() { | ||
| #[derive(Debug, Deserialize)] | ||
| #[serde(deny_unknown_fields)] | ||
| struct Test { | ||
| prediction_mode: String, | ||
| content_base64: String, | ||
| status: String, | ||
| prediction: Option<Prediction>, | ||
| } | ||
| let path = format!( | ||
| "../../tests_data/reference/{MODEL_NAME}-inference_examples_by_content.json.gz" | ||
| ); | ||
| let mut tests = String::new(); | ||
| GzDecoder::new(File::open(path).unwrap()).read_to_string(&mut tests).unwrap(); | ||
| let tests: Vec<Test> = serde_json::from_str(&tests).unwrap(); | ||
| let mut session = Session::new().unwrap(); | ||
| for test in tests { | ||
| if test.prediction_mode != "high-confidence" { | ||
| continue; | ||
| } | ||
| assert_eq!(test.status, "ok"); | ||
| let expected = test.prediction.unwrap(); | ||
| let content = BASE64.decode(test.content_base64.as_bytes()).unwrap(); | ||
| let actual = session.identify_reader_sync(std::io::Cursor::new(content)).unwrap(); | ||
| assert_prediction(actual, expected, &test.content_base64); | ||
| } | ||
| } |
There was a problem hiding this comment.
There is significant code duplication between this new test identify_by_reader_reference and the existing identify_by_content_reference test. The struct definition, test data loading, and test loop are nearly identical.
To improve maintainability, you could refactor the common parts. For example, define the Test struct once outside the test functions and create a helper function to load the test data. This would make the tests cleaner and easier to manage.
|
Thanks for the PR! Do you have an example of an object that implements I see at least the following options:
|
|
The example object in my case is As for the unseal alternative, my take is that it should not be marked as unsafe because it is not in itself unsafe, though an implementor could of course provide an incorrect or unsafe implementation. |
Thanks! This makes sense. Could you tell me if #1315 would work for you? |
|
Brilliant! I can confirm that #1315 works for me 👍 |
This PR adds two new functions to session,
identify_reader_syncandidentify_reader_asyncwhich makes it possible to pass in any type that implementsReadandSeek. It is analogous to theidentify_streamfunction in the Python implementation #979.I suppose an alternative, and perhaps cleaner solution, would be to drop the sealed trait pattern for
SyncInputandSyncInputApi(and the async equivalent) and instead expose theSyncInputApiso that users can implement it for arbitrary types.