-
Notifications
You must be signed in to change notification settings - Fork 0
QDB-18722 - [python] - remove deprecated API code - fix bugs #115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
e196cc6
a55ed5a
948afbb
f501e5b
3c0a781
dc354a6
ab83d22
2546402
2b23456
e0bba12
9d59029
cbdb5c6
2d653f5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -571,10 +571,29 @@ struct value_converter<py::bytes, qdb_blob_t> | |
| } | ||
| }; | ||
|
|
||
| #if 1 | ||
| template <> | ||
| struct value_converter<traits::pyobject_dtype, qdb_blob_t> | ||
| : public value_converter<py::bytes, qdb_blob_t> | ||
| {}; | ||
| #else template <> | ||
| struct value_converter<traits::pyobject_dtype, qdb_blob_t> | ||
| { | ||
| using dtype = traits::pyobject_dtype; | ||
|
|
||
| value_converter<py::bytes, qdb_blob_t> delegate_{}; | ||
|
|
||
| inline qdb_blob_t operator()(py::object const & x) const | ||
| { | ||
| if (dtype::is_null(x)) | ||
| { | ||
| return traits::null_value<qdb_blob_t>(); | ||
| } | ||
|
|
||
| return delegate_(py::cast<py::bytes>(x)); | ||
| } | ||
| }; | ||
|
Comment on lines
+581
to
+595
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is being fixed here? Why isn't the is-null check of py::bytes enough?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change adds an explicit traits::pyobject_dtype -> qdb_blob_t converter to correctly handle NumPy object arrays when writing blob data. Previously, object-typed values were passed directly through py::castpy::bytes without a null check. As a result, Python nulls such as None were not recognized and were incorrectly sent through the bytes conversion path. With this implementation, nulls are handled explicitly: values are first checked via dtype::is_null(x), and None is converted to traits::null_value<qdb_blob_t>(). Non-null values are delegated to the existing py::bytes -> qdb_blob_t converter, so the byte conversion logic remains unchanged. This fixes blob writes from dtype=object arrays and other object-based inputs by preserving null semantics instead of forcing all values through the bytes conversion path. |
||
| #endif | ||
|
|
||
| template <> | ||
| struct value_converter<qdb_blob_t, py::bytes> | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -505,6 +505,22 @@ class masked_array | |
| py::array_t<bool> ret{ShapeContainer{xs.size()}}; | ||
| bool * p_ret = static_cast<bool *>(ret.mutable_data()); | ||
|
|
||
| #if 0 | ||
| if constexpr (std::is_same_v<Dtype, traits::pyobject_dtype>) | ||
| { | ||
| auto begin = static_cast<PyObject * const *>(xs.data()); | ||
| auto end = begin + xs.size(); | ||
|
|
||
| // NumPy object arrays store raw PyObject * values, not py::object wrappers. | ||
| for (auto cur = begin; cur != end; ++cur, ++p_ret) | ||
| { | ||
| *p_ret = (*cur == nullptr) || (*cur == Py_None); | ||
| } | ||
|
|
||
| return mask::of_array(py::cast<py::array>(ret)); | ||
| } | ||
| #endif | ||
|
|
||
|
Comment on lines
+509
to
+523
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. :o this does a huge array copy, right? Why can't This seems like a very expensive conversion. Why is it not integrated into the In the current shape, this requires a much better approach. From what I understand you're sending a number array of objects, which we don't really support / STRONGLY DISCOURAGE, to prevent people from doing extremely performance-terrible stuff, and handle all the sanitization and conversion in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This special-case is needed because NumPy object arrays store raw PyObject * values, not py::object wrappers. The generic masked_null() path reads elements as Dtype::value_type, which is valid for regular fixed-width dtypes but not for traits::pyobject_dtype. For dtype('O'), we therefore read the buffer as PyObject * const * and mark entries as null when the slot is nullptr or Py_None. This does not introduce an extra object-array copy. masked_null() already allocates a boolean mask and scans the array; this only changes how object-array elements are read during that scan. The fix belongs here because the bug was in mask reconstruction, not in the conversion step that produced the object array. |
||
| // The step_size is `1` for all fixed-width dtypes, but in case | ||
| // of variable width dtypes, is, well, variable. | ||
| py::ssize_t step_size = Dtype::stride_size(xs.itemsize()); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -185,6 +185,7 @@ def _coerce_dtype( | |
| "Forced dtype provided for column '%s' = %s, but that column is not found in the table. Skipping...", | ||
| k, | ||
| ) | ||
| continue | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good |
||
|
|
||
| i = offsets[k] | ||
| dtype_[i] = dt | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,7 +79,7 @@ struct array_recode_cdtype_dispatch; | |
| { \ | ||
| auto tmp = convert::point_array<DTYPE, VALUE_TYPE>(input); \ | ||
| auto tmp2 = convert::point_array<VALUE_TYPE, DTYPE>(tmp); \ | ||
| output = input; \ | ||
| output = std::move(tmp2); \ | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok |
||
| } \ | ||
| }; | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| from quasardb import test_convert as m | ||
|
|
||
| import numpy as np | ||
| import numpy.ma as ma | ||
| import quasardb | ||
|
|
||
| from utils import assert_indexed_arrays_equal | ||
|
|
||
|
|
||
| def _blob_recode(values): | ||
| idx = np.array( | ||
| [ | ||
| np.datetime64("2017-01-01T00:00:00", "ns"), | ||
| np.datetime64("2017-01-01T00:00:01", "ns"), | ||
| np.datetime64("2017-01-01T00:00:02", "ns"), | ||
| np.datetime64("2017-01-01T00:00:03", "ns"), | ||
| ] | ||
| ) | ||
| return m.test_array_recode(quasardb.ColumnType.Blob, np.dtype("O"), (idx, values)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not a big fan of providing blobs as discrete python objects. I believe binary / blob data can be supported using:
What 'O' means is that all of these are encoded as separate Python objects, and that means that when we access them in the C++ side, each of the value lookups is super expensive as we cannot just access numpy's native arrays, which defeats the whole purpose of using numpy -- we may as well just use a regular list. It's easily 100x slower (and use a lot more memory), so I prefer to just make it impossible to do, and do the necessary conversion in numpy/init.py
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I used np.dtype("O") here only to check the exact regression path we were fixing: pyobject_dtype -> qdb_blob_t plus readback mask reconstruction. I am not proposing object arrays as a good blob representation. I agree they are much slower. If we want to forbid blob writes from dtype('O'), I will turn this into a rejection test instead. |
||
|
|
||
|
|
||
| def test_blob_object_array_recode_roundtrip_bytes(): | ||
| values = np.array([b"alpha", b"beta", b"\x00gamma", b"delta"], dtype=np.object_) | ||
| result = _blob_recode(values) | ||
| assert_indexed_arrays_equal((result[0], values), result) | ||
|
|
||
|
|
||
| def test_blob_object_array_recode_roundtrip_none_values(): | ||
| values = np.array([b"alpha", None, b"gamma", None], dtype=np.object_) | ||
| result = _blob_recode(values) | ||
| assert_indexed_arrays_equal((result[0], values), result) | ||
|
|
||
|
|
||
| def test_blob_object_array_recode_roundtrip_masked_values(): | ||
| values = ma.masked_array( | ||
| data=np.array([b"alpha", b"beta", b"gamma", b"delta"], dtype=np.object_), | ||
| mask=[False, True, False, True], | ||
| ) | ||
| result = _blob_recode(values) | ||
| assert_indexed_arrays_equal((result[0], values), result) | ||
|
|
||
|
|
||
| def test_blob_object_array_recode_roundtrip_empty_bytes(): | ||
| values = np.array([b"", b"beta", b"", b"delta"], dtype=np.object_) | ||
| result = _blob_recode(values) | ||
| expected = ma.masked_array( | ||
| data=np.array([None, b"beta", None, b"delta"], dtype=np.object_), | ||
| mask=[True, False, True, False], | ||
| ) | ||
| assert_indexed_arrays_equal((result[0], expected), result) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is being fixed here? This appears to change 'A -> B' conversion into 'B -> A', is that intentional? Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the qdb -> numpy conversion path for delegated dtypes such as int16, int32, and float32.
For delegated dtypes, conversion must go through the delegate type first. For example, int16 is handled via int64: the QDB value is first converted to int64, then the intermediate int64 value is checked against the delegate null sentinel, and only non-null values are narrowed to int16.
Previously, this pipeline was composed in the wrong order. The narrowing transform was applied too early, so QDB null sentinels could be converted into ordinary narrowed values such as 0 instead of being preserved as nulls. This caused masked values to disappear in roundtrip tests.
The pipeline now works in the correct order:
convert from the QDB primitive to the delegate dtype
detect null at the delegate stage
emit the destination null sentinel for null values
otherwise cast to the final dtype
The fix is therefore not just the null check itself, but the placement of that check in the conversion pipeline. Null handling now happens at the correct stage, which preserves masks for narrower numpy dtypes.