[SYNPY-1749]Allow quote, apostrophe and ellipsis in store_row_async#1316
Conversation
@danlu1 Is this still WIP, or are you looking for reviews?
@andrewelamb sorry, I should have marked this as a draft.
…ctly when upload data from a dataframe
…ger output json string
The integration test failures are in the recordset and submission modules and do not appear to be related to my changes.
linglp
left a comment
I think overall it looks good. The tests can be consolidated a bit to cover all the edge cases in fewer integration tests, which should improve performance, and the docstring can be updated to reflect the new state of the code since json.dumps() was removed. There is also some logic that can be simplified: sample_values is created in the redundant checks but never actually used. The function name could also be more descriptive of what it actually does now (sanitizing special values rather than just converting dtypes).
BryanFauble
left a comment
This is looking great, once we get the last few items handled (and develop merged in), I can approve!
BryanFauble
left a comment
Nice work on this fix -- the backslash-escaping approach for embedded quotes makes sense, and the Ellipsis/pd.NA handling is solid. I flagged one issue that I think needs to be addressed before merge, plus a few cleanup nits.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
Hi @danlu1 ! Thanks for your hard work. I think I found two bugs after a few rounds of testing myself:
Bug 1: Nested np.nan not handled
df = pd.DataFrame({"val": [[1.0, np.nan], [2.0, 3.0]]})
The nested np.nan would currently pass through to JSON serialization, which could cause issues since np.nan is not valid JSON.
This happened because convert_dtypes() only converts top-level np.nan to pd.NA; nested np.nan would remain unchanged. And even though in your code, you have:
if obj is pd.NA:
But this won't handle np.nan. The fix would just be:
def _reformat_special_values(obj):
    if pd.isna(obj):  # Catches pd.NA, np.nan, and None
        return None
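A quick standalone check (not from the PR) illustrating why pd.isna is the right predicate here and why raw np.nan breaks JSON:

```python
import json

import numpy as np
import pandas as pd

# `is pd.NA` misses the other missing-value scalars; pd.isna catches them all.
assert np.nan is not pd.NA
assert pd.isna(pd.NA) and pd.isna(np.nan) and pd.isna(None)

# Left unconverted, np.nan makes json.dumps emit the non-standard NaN token,
# which strict JSON parsers reject.
assert json.dumps([1.0, float("nan")]) == "[1.0, NaN]"
```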
Bug 2: Top-level missing values not handled
def _serialize_json_value(x):
    if isinstance(x, (list, dict)):
        # pd.NA handling is only here, inside _reformat_special_values
        ...
    if x is ...:
        return "..."
    return x  # <-- Top-level pd.NA, np.nan, None just pass through
Top-level pd.NA (or np.nan) isn't converted to None. It only works now because .replace({pd.NA: None}) is called later.
BryanFauble
left a comment
This is great progress on a real user-reported issue, nice work on the recursive handling of Ellipsis and pd.NA in nested structures! There's one item that needs to be fixed before we can merge.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
…uote-apostrophe-in-store-rows
…args
Replace the mutable dict default with None sentinel and merge user
overrides on top of {"escapechar": "\\"} so callers who pass their own
to_csv_kwargs still get the escapechar fix. Update the docstring to
describe the merge behavior.
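The merge described above can be sketched like this (a standalone illustration, not the PR's exact code; `DEFAULT_TO_CSV_KWARGS` and `merge_to_csv_kwargs` are made-up names):

```python
DEFAULT_TO_CSV_KWARGS = {"escapechar": "\\"}


def merge_to_csv_kwargs(user_kwargs=None):
    """Merge caller overrides on top of the escapechar default.

    A None sentinel avoids the mutable-default-argument pitfall, and
    caller-supplied keys win over the defaults.
    """
    return {**DEFAULT_TO_CSV_KWARGS, **(user_kwargs or {})}
```

Callers who pass their own to_csv_kwargs still get the escapechar fix unless they explicitly override that key.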
Pull request overview
This PR addresses JSON/CSV serialization failures when uploading DataFrames via store_rows_async, specifically when cells contain JSON-like dict/list values with embedded quotes/apostrophes and when Ellipsis/pd.NA values appear in nested structures.
Changes:
- Adds a default `escapechar` (`"\\"`) to `to_csv_kwargs` in `store_rows_async`.
- Threads `to_csv_kwargs` through the DataFrame streaming/upload path so the same CSV settings are used downstream.
- Introduces `convert_dtypes_to_json_serializable` and adds unit + integration coverage around quotes/apostrophes/ellipsis handling.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| synapseclient/models/mixins/table_components.py | Adds default escapechar, threads to_csv_kwargs, and adds convert_dtypes_to_json_serializable used before chunked DF upload. |
| synapseclient/core/upload/multipart_upload_async.py | Updates docstring to describe to_csv_kwargs usage for DataFrame multipart uploads. |
| tests/unit/synapseclient/mixins/unit_test_table_components.py | Adds unit tests for convert_dtypes_to_json_serializable. |
| tests/integration/synapseclient/models/async/test_table_async.py | Adds an integration test covering quotes/apostrophes/ellipsis round-tripping through table storage/query. |
…ializable
- Use pd.isna() instead of `is pd.NA` to catch nested np.nan (not valid JSON)
- Handle top-level pd.NA/np.nan/None in _serialize_json_value
- Remove redundant dropna()+len() guard so all-NA columns still get converted to object dtype with None values
- Update convert_dtypes_to_json_serializable docstring to describe actual behavior (recursive Ellipsis/pd.NA/np.nan/None cleanup, dtype cast for all non-ROW_* columns, in-place mutation)
- Use DEFAULT_ESCAPSE_CHAR constant instead of hardcoded "\\" in store_rows_async to_csv_kwargs default
- Rename test_no_conversion_when_no_na_in_column -> test_int_and_float_columns_converted_to_object and fix docstring to match the assertion (columns are always cast to object)
- Rename test_none_in_list_serialized_to_empty_list -> test_none_in_list_column_remains_none and fix docstring to match the actual None (not []) behavior
- In test_mixed_column_types_no_conversion_needed, snapshot df with copy(deep=True) before the in-place call and assert against the snapshot so the test can detect unintended mutations
BryanFauble
left a comment
The last round of changes with my final patches looked good to me.
@danlu1 is going to give this a final review as well before we merge in.
…e-rows merge upstream changes
Problem:
A JSON serialization issue occurs when a DataFrame passed to store_row_async contains a list or dictionary with strings that include both double quotes and apostrophes.
Solution:
"escapechar": "\\"toto_csv_kwargsinstore_rows_asyncto_csv_kwargsto_stream_and_update_from_dfso it can take the passedto_csv_kwargsvalues for downstream data processing.Testing:
Unit test and integration test have been added.