QDB-18722 - [python] - remove deprecated API code - fix bugs #115
Conversation
solatis
left a comment
Mostly concerned about starting to support raw object conversions in our C++ code; I don't think that's a great approach.
What exactly is the problem with binary data that needs to be fixed?
Doing this special-case conversion directly inside the masked array is the wrong place. If you implement this in convert/value.hpp (but... I think you already added a py::object-based conversion there?), all array, masked array, etc. conversions will automatically make use of it.
    auto xform = [](delegate_value_type const & x) -> value_type {
        if (Delegate::is_null(x))
        {
    -       return Delegate::null_value();
    +       return To::null_value();
        }
        else
        {
    -       return static_cast<delegate_value_type>(x);
    +       return static_cast<value_type>(x);
        }
    };
    - return ranges::views::transform(xform) | delegate();
    + return delegate() | ranges::views::transform(xform);
What is being fixed here? This appears to change 'A -> B' conversion into 'B -> A', is that intentional? Why?
I fixed the qdb -> numpy conversion path for delegated dtypes such as int16, int32, and float32.
For delegated dtypes, conversion must go through the delegate type first. For example, int16 is handled via int64: the QDB value is first converted to int64, then the intermediate int64 value is checked against the delegate null sentinel, and only non-null values are narrowed to int16.
Previously, this pipeline was composed in the wrong order. The narrowing transform was applied too early, so QDB null sentinels could be converted into ordinary narrowed values such as 0 instead of being preserved as nulls. This caused masked values to disappear in roundtrip tests.
The pipeline now works in the correct order:
- convert from the QDB primitive to the delegate dtype
- detect null at the delegate stage
- emit the destination null sentinel for null values
- otherwise cast to the final dtype
The fix is therefore not just the null check itself, but the placement of that check in the conversion pipeline. Null handling now happens at the correct stage, which preserves masks for narrower numpy dtypes.
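For illustration only, here is a small self-contained sketch of that ordering; the sentinel values and names are placeholders, not the library's actual trait definitions. The null check runs on the wide int64 delegate value, and narrowing to int16 happens only afterwards.

    #include <cstdint>
    #include <limits>
    #include <vector>
    #include <range/v3/range/conversion.hpp>
    #include <range/v3/view/transform.hpp>

    int main()
    {
        // Placeholder sentinels standing in for the delegate / destination null values.
        constexpr std::int64_t delegate_null = std::numeric_limits<std::int64_t>::min();
        constexpr std::int16_t dest_null     = std::numeric_limits<std::int16_t>::min();

        // Values as they come out of the delegate (int64) stage; one of them is null.
        std::vector<std::int64_t> delegate_values{42, delegate_null, 7};

        auto xform = [=](std::int64_t x) -> std::int16_t {
            if (x == delegate_null)
            {
                return dest_null;                   // null stays null
            }
            return static_cast<std::int16_t>(x);    // narrow only non-null values
        };

        // The transform runs after the delegate stage. If the narrowing ran first,
        // the sentinel would collapse into an ordinary value (e.g. 0) and the mask
        // entry would be lost, which is exactly the regression described above.
        auto narrowed = delegate_values
                      | ranges::views::transform(xform)
                      | ranges::to<std::vector<std::int16_t>>();
        // narrowed == {42, dest_null, 7}
    }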
    {
        using dtype = traits::pyobject_dtype;

        value_converter<py::bytes, qdb_blob_t> delegate_{};

        inline qdb_blob_t operator()(py::object const & x) const
        {
            if (dtype::is_null(x))
            {
                return traits::null_value<qdb_blob_t>();
            }

            return delegate_(py::cast<py::bytes>(x));
        }
    };
What is being fixed here? Why isn't the is-null check of py::bytes enough?
This change adds an explicit traits::pyobject_dtype -> qdb_blob_t converter to correctly handle NumPy object arrays when writing blob data.
Previously, object-typed values were passed directly through py::cast<py::bytes> without a null check. As a result, Python nulls such as None were not recognized and were incorrectly sent through the bytes conversion path.
With this implementation, nulls are handled explicitly: values are first checked via dtype::is_null(x), and None is converted to traits::null_value<qdb_blob_t>(). Non-null values are delegated to the existing py::bytes -> qdb_blob_t converter, so the byte conversion logic remains unchanged.
This fixes blob writes from dtype=object arrays and other object-based inputs by preserving null semantics instead of forcing all values through the bytes conversion path.
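For what it's worth, a minimal sketch of that shape; to_blob and the std::optional<std::string> stand-in for qdb_blob_t are illustrative, not the actual converter in the codebase. None is recognized up front and never reaches the bytes path.

    #include <optional>
    #include <string>
    #include <pybind11/pybind11.h>

    namespace py = pybind11;

    // Hypothetical converter mirroring the approach above; std::optional<std::string>
    // stands in for qdb_blob_t and its null sentinel.
    inline std::optional<std::string> to_blob(py::object const & x)
    {
        if (x.is_none())
        {
            return std::nullopt;                  // None maps to the null blob
        }

        // Only real values are delegated to the bytes conversion path.
        auto b = py::cast<py::bytes>(x);
        return static_cast<std::string>(b);
    }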
| "Forced dtype provided for column '%s' = %s, but that column is not found in the table. Skipping...", | ||
| k, | ||
| ) | ||
| continue |
    if constexpr (std::is_same_v<Dtype, traits::pyobject_dtype>)
    {
        auto begin = static_cast<PyObject * const *>(xs.data());
        auto end = begin + xs.size();

        // NumPy object arrays store raw PyObject * values, not py::object wrappers.
        for (auto cur = begin; cur != end; ++cur, ++p_ret)
        {
            *p_ret = (*cur == nullptr) || (*cur == Py_None);
        }

        return mask::of_array(py::cast<py::array>(ret));
    }

    using value_type = typename Dtype::value_type;
:o this does a huge array copy, right? Why can't PyObject * be interpreted as py::object -- py::object is PyObject.
This seems like a very expensive conversion. Why is it not integrated into the convert facilities, and why isn't it explained better why we need to do this conversion here?
In the current shape, this requires a much better approach. From what I understand, you're sending a numpy array of objects, which we don't really support / STRONGLY DISCOURAGE: to prevent people from doing extremely performance-terrible stuff, we handle all the sanitization and conversion in numpy/__init__.py, so that we don't have to do these kinds of crazy things. We support unicode and blobs both as "raw", contiguous allocations; this is a conscious performance vs memory consumption tradeoff.
This special-case is needed because NumPy object arrays store raw PyObject * values, not py::object wrappers. The generic masked_null() path reads elements as Dtype::value_type, which is valid for regular fixed-width dtypes but not for traits::pyobject_dtype.
For dtype('O'), we therefore read the buffer as PyObject * const * and mark entries as null when the slot is nullptr or Py_None.
This does not introduce an extra object-array copy. masked_null() already allocates a boolean mask and scans the array; this only changes how object-array elements are read during that scan. The fix belongs here because the bug was in mask reconstruction, not in the conversion step that produced the object array.
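To make the "no extra copy" point concrete, here is a hedged standalone sketch; null_mask_of_object_array and the std::vector<bool> result are illustrative, not the actual mask::of_array machinery. The scan only reads the raw PyObject * slots and never converts or copies the objects themselves.

    #include <vector>
    #include <pybind11/numpy.h>
    #include <pybind11/pybind11.h>

    namespace py = pybind11;

    // Illustrative only: build a null mask for a contiguous dtype('O') array by
    // reading the buffer as raw PyObject * slots, the same way the change above does.
    // The slots are borrowed pointers, so no Python object is copied or converted;
    // only the boolean mask is allocated.
    inline std::vector<bool> null_mask_of_object_array(py::array const & xs)
    {
        auto begin = static_cast<PyObject * const *>(xs.data());
        auto end   = begin + xs.size();

        std::vector<bool> mask;
        mask.reserve(static_cast<std::size_t>(xs.size()));

        for (auto cur = begin; cur != end; ++cur)
        {
            // A slot counts as null when it was never filled (nullptr) or holds None.
            mask.push_back(*cur == nullptr || *cur == Py_None);
        }
        return mask;
    }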
        xs = convert::masked_array<double, traits::float64_dtype>(
            ranges::views::counted(column.data.doubles, data.row_count));
        break;
    case qdb_ts_column_symbol:
Why is the reader emitting "symbol" as a value type? symbol is a column type, not a value type, and the reader only "sees" value types.
In the same way, the query API cannot return a "symbol" type for its results, because "symbol" is not a value type.
    auto tmp = convert::point_array<DTYPE, VALUE_TYPE>(input); \
    auto tmp2 = convert::point_array<VALUE_TYPE, DTYPE>(tmp); \
    - output = input; \
    + output = std::move(tmp2); \
            np.datetime64("2017-01-01T00:00:03", "ns"),
        ]
    )
    return m.test_array_recode(quasardb.ColumnType.Blob, np.dtype("O"), (idx, values))
I'm not a big fan of providing blobs as discrete python objects.
I believe binary / blob data can be supported using:
- 'S' (zero-terminated blobs, which is... iffy)
- 'b'
- 'i8' (i.e. char)
What 'O' means is that all of these are encoded as separate Python objects, and that means that when we access them on the C++ side, each of the value lookups is super expensive as we cannot just access numpy's native arrays, which defeats the whole purpose of using numpy -- we may as well just use a regular list. It's easily 100x slower (and uses a lot more memory), so I prefer to just make it impossible to do, and do the necessary conversion in numpy/__init__.py.
I used np.dtype("O") here only to check the exact regression path we were fixing: pyobject_dtype -> qdb_blob_t plus readback mask reconstruction.
I am not proposing object arrays as a good blob representation. I agree they are much slower.
If we want to forbid blob writes from dtype('O'), I will turn this into a rejection test instead.
As long as that path is still accepted by the code, though, I think it is worth having a regression test for it.