17 changes: 17 additions & 0 deletions quasardb/convert/array.hpp
@@ -184,6 +184,7 @@ struct convert_array<From, To>

static constexpr convert_array<From, Delegate> const delegate{};

#if 1
[[nodiscard]] constexpr inline auto operator()() const noexcept
{
auto xform = [](value_type const & x) -> delegate_value_type {
@@ -198,6 +199,22 @@ struct convert_array<From, To>
};
return ranges::views::transform(xform) | delegate();
};
#else
[[nodiscard]] constexpr inline auto operator()() const noexcept
{
auto xform = [](delegate_value_type const & x) -> value_type {
if (Delegate::is_null(x))
{
return To::null_value();
}
else
{
return static_cast<value_type>(x);
}
};
return delegate() | ranges::views::transform(xform);
Comment on lines +205 to +215
Contributor

What is being fixed here? This appears to change 'A -> B' conversion into 'B -> A', is that intentional? Why?

Contributor Author

I fixed the qdb -> numpy conversion path for delegated dtypes such as int16, int32, and float32.

For delegated dtypes, conversion must go through the delegate type first. For example, int16 is handled via int64: the QDB value is first converted to int64, then the intermediate int64 value is checked against the delegate null sentinel, and only non-null values are narrowed to int16.

Previously, this pipeline was composed in the wrong order. The narrowing transform was applied too early, so QDB null sentinels could be converted into ordinary narrowed values such as 0 instead of being preserved as nulls. This caused masked values to disappear in roundtrip tests.

The pipeline now works in the correct order:

  • convert from the QDB primitive to the delegate dtype
  • detect null at the delegate stage
  • emit the destination null sentinel for null values
  • otherwise cast to the final dtype

The fix is therefore not just the null check itself, but the placement of that check in the conversion pipeline. Null handling now happens at the correct stage, which preserves masks for narrower numpy dtypes.
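The ordering argument can be modeled in a few lines of Python; the sentinel values and stage functions below are illustrative stand-ins, not the actual quasardb conversion code:

```python
INT64_NULL = -2**63   # assumed delegate (int64) null sentinel
INT16_NULL = -2**15   # assumed destination (int16) null sentinel

def to_delegate(raw):
    # QDB primitive -> delegate dtype: nulls become the delegate sentinel
    return INT64_NULL if raw is None else raw

def convert_fixed(raw_values):
    # correct order: delegate conversion first, null check at the
    # delegate stage, and only then narrow to the final dtype
    return [INT16_NULL if x == INT64_NULL else int(x)
            for x in map(to_delegate, raw_values)]

def convert_broken(raw_values):
    # wrong order: narrowing applied before the null check, so the
    # sentinel wraps to an ordinary value (0) and the mask is lost
    narrowed = [to_delegate(v) & 0xFFFF for v in raw_values]
    return [INT16_NULL if x == INT64_NULL else x for x in narrowed]

print(convert_fixed([5, None, 7]))   # null preserved as INT16_NULL
print(convert_broken([5, None, 7]))  # null silently became 0
```

The broken variant reproduces exactly the symptom described: the masked slot comes back as an ordinary `0` instead of a null.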

};
#endif
};

}; // namespace qdb::convert::detail
19 changes: 19 additions & 0 deletions quasardb/convert/value.hpp
@@ -571,10 +571,29 @@ struct value_converter<py::bytes, qdb_blob_t>
}
};

#if 1
template <>
struct value_converter<traits::pyobject_dtype, qdb_blob_t>
: public value_converter<py::bytes, qdb_blob_t>
{};
#else
template <>
struct value_converter<traits::pyobject_dtype, qdb_blob_t>
{
using dtype = traits::pyobject_dtype;

value_converter<py::bytes, qdb_blob_t> delegate_{};

inline qdb_blob_t operator()(py::object const & x) const
{
if (dtype::is_null(x))
{
return traits::null_value<qdb_blob_t>();
}

return delegate_(py::cast<py::bytes>(x));
}
};
Comment on lines +581 to +595
Contributor

What is being fixed here? Why isn't the is-null check of py::bytes enough?

Contributor Author

This change adds an explicit traits::pyobject_dtype -> qdb_blob_t converter to correctly handle NumPy object arrays when writing blob data.

Previously, object-typed values were passed directly through py::cast<py::bytes> without a null check. As a result, Python nulls such as None were not recognized and were incorrectly sent through the bytes conversion path.

With this implementation, nulls are handled explicitly: values are first checked via dtype::is_null(x), and None is converted to traits::null_value<qdb_blob_t>(). Non-null values are delegated to the existing py::bytes -> qdb_blob_t converter, so the byte conversion logic remains unchanged.

This fixes blob writes from dtype=object arrays and other object-based inputs by preserving null semantics instead of forcing all values through the bytes conversion path.
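A minimal Python sketch of that delegation; BLOB_NULL and bytes_to_blob are hypothetical stand-ins for the C++ traits, not the real API:

```python
BLOB_NULL = object()  # stand-in for traits::null_value<qdb_blob_t>()

def bytes_to_blob(b):
    # stand-in for the existing py::bytes -> qdb_blob_t converter;
    # the fix leaves this logic unchanged
    assert isinstance(b, bytes)
    return b

def object_to_blob(x):
    # the added converter: explicit null check first, then delegate
    if x is None:  # mirrors dtype::is_null(x)
        return BLOB_NULL
    return bytes_to_blob(bytes(x))
```

Non-null values still flow through the unchanged bytes converter; only `None` takes the new early-return path.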

#endif

template <>
struct value_converter<qdb_blob_t, py::bytes>
16 changes: 16 additions & 0 deletions quasardb/masked_array.hpp
@@ -505,6 +505,22 @@ class masked_array
py::array_t<bool> ret{ShapeContainer{xs.size()}};
bool * p_ret = static_cast<bool *>(ret.mutable_data());

#if 0
if constexpr (std::is_same_v<Dtype, traits::pyobject_dtype>)
{
auto begin = static_cast<PyObject * const *>(xs.data());
auto end = begin + xs.size();

// NumPy object arrays store raw PyObject * values, not py::object wrappers.
for (auto cur = begin; cur != end; ++cur, ++p_ret)
{
*p_ret = (*cur == nullptr) || (*cur == Py_None);
}

return mask::of_array(py::cast<py::array>(ret));
}
#endif

Comment on lines +509 to +523
Contributor

:o this does a huge array copy, right? Why can't PyObject * be interpreted as py::object -- py::object is PyObject.

This seems like a very expensive conversion. Why is it not integrated into the convert facilities, and why isn't it explained better why we need to do this conversion here?

In the current shape, this requires a much better approach. From what I understand you're sending a number array of objects, which we don't really support / STRONGLY DISCOURAGE, to prevent people from doing extremely performance-terrible stuff, and handle all the sanitization and conversion in numpy/__init__.py, such that we don't have to do these kind of crazy things: we support unicode and blobs both as "raw", continuous allocations. This is a conscious performance vs memory consumption tradeoff.

Contributor Author

This special-case is needed because NumPy object arrays store raw PyObject * values, not py::object wrappers. The generic masked_null() path reads elements as Dtype::value_type, which is valid for regular fixed-width dtypes but not for traits::pyobject_dtype.

For dtype('O'), we therefore read the buffer as PyObject * const * and mark entries as null when the slot is nullptr or Py_None.

This does not introduce an extra object-array copy. masked_null() already allocates a boolean mask and scans the array; this only changes how object-array elements are read during that scan. The fix belongs here because the bug was in mask reconstruction, not in the conversion step that produced the object array.
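In Python terms the special-cased scan looks roughly like this (numpy is used only to build the object array; the real code walks raw PyObject * slots during the existing mask pass):

```python
import numpy as np

def object_null_mask(xs):
    # For dtype('O'), an element is null when its slot holds None
    # (the C++ scan additionally treats nullptr slots as null).
    out = np.empty(xs.size, dtype=bool)
    for i, v in enumerate(xs):
        out[i] = v is None
    return out

vals = np.array([b"alpha", None, b"gamma", None], dtype=object)
print(object_null_mask(vals).tolist())  # [False, True, False, True]
```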

// The step_size is `1` for all fixed-width dtypes, but in case
// of variable width dtypes, is, well, variable.
py::ssize_t step_size = Dtype::stride_size(xs.itemsize());
1 change: 1 addition & 0 deletions quasardb/numpy/__init__.py
@@ -185,6 +185,7 @@ def _coerce_dtype(
"Forced dtype provided for column '%s' = %s, but that column is not found in the table. Skipping...",
k,
)
continue
Contributor

Good


i = offsets[k]
dtype_[i] = dt
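A hypothetical reconstruction of the loop around this fix (the function body is inferred from the diff context; only the names visible in the diff are real) shows what the added continue prevents:

```python
import logging

def coerce_dtype(forced, offsets, dtype_):
    # hypothetical reconstruction of _coerce_dtype's per-column loop
    for k, dt in forced.items():
        if k not in offsets:
            logging.warning(
                "Forced dtype provided for column '%s' = %s, but that "
                "column is not found in the table. Skipping...", k, dt)
            continue  # without this, offsets[k] below raises KeyError
        i = offsets[k]
        dtype_[i] = dt
    return dtype_

# unknown column is skipped instead of raising
print(coerce_dtype({"a": "i8", "missing": "f8"}, {"a": 0}, [None]))
```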
2 changes: 1 addition & 1 deletion tests/test_convert.cpp
@@ -79,7 +79,7 @@ struct array_recode_cdtype_dispatch;
{ \
auto tmp = convert::point_array<DTYPE, VALUE_TYPE>(input); \
auto tmp2 = convert::point_array<VALUE_TYPE, DTYPE>(tmp); \
output = input; \
output = std::move(tmp2); \
Contributor

Ok

} \
};

50 changes: 50 additions & 0 deletions tests/test_convert_blob.py
@@ -0,0 +1,50 @@
from quasardb import test_convert as m

import numpy as np
import numpy.ma as ma
import quasardb

from utils import assert_indexed_arrays_equal


def _blob_recode(values):
idx = np.array(
[
np.datetime64("2017-01-01T00:00:00", "ns"),
np.datetime64("2017-01-01T00:00:01", "ns"),
np.datetime64("2017-01-01T00:00:02", "ns"),
np.datetime64("2017-01-01T00:00:03", "ns"),
]
)
return m.test_array_recode(quasardb.ColumnType.Blob, np.dtype("O"), (idx, values))
Contributor

I'm not a big fan of providing blobs as discrete python objects.

I believe binary / blob data can be supported using:

  • 'S' (zero-terminated blobs, which is... iffy)
  • 'b'
  • 'i8' (i.e. char)

What 'O' means is that all of these are encoded as separate Python objects, and that means that when we access them in the C++ side, each of the value lookups is super expensive as we cannot just access numpy's native arrays, which defeats the whole purpose of using numpy -- we may as well just use a regular list. It's easily 100x slower (and uses a lot more memory), so I prefer to just make it impossible to do, and do the necessary conversion in numpy/__init__.py
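The "zero-terminated" caveat for 'S' is easy to demonstrate: numpy strips trailing NUL bytes on element access, so arbitrary blobs don't round-trip through fixed-width bytes.

```python
import numpy as np

# Fixed-width bytes ('S') store blobs contiguously, but numpy strips
# trailing NUL bytes when an element is read back.
a = np.array([b"ab\x00", b"xyz"], dtype="S3")
print(a[0])  # b'ab'  -- the trailing \x00 is lost
print(a[1])  # b'xyz'

# Leading/embedded NULs are preserved; only trailing ones are stripped.
b = np.array([b"\x00gamma"], dtype="S6")
print(b[0])  # b'\x00gamma'
```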

Contributor Author

I used np.dtype("O") here only to check the exact regression path we were fixing: pyobject_dtype -> qdb_blob_t plus readback mask reconstruction.

I am not proposing object arrays as a good blob representation. I agree they are much slower.

If we want to forbid blob writes from dtype('O'), I will turn this into a rejection test instead.
As long as that path is still accepted by the code, though, I think it is worth having a regression test for it.



def test_blob_object_array_recode_roundtrip_bytes():
values = np.array([b"alpha", b"beta", b"\x00gamma", b"delta"], dtype=np.object_)
result = _blob_recode(values)
assert_indexed_arrays_equal((result[0], values), result)


def test_blob_object_array_recode_roundtrip_none_values():
values = np.array([b"alpha", None, b"gamma", None], dtype=np.object_)
result = _blob_recode(values)
assert_indexed_arrays_equal((result[0], values), result)


def test_blob_object_array_recode_roundtrip_masked_values():
values = ma.masked_array(
data=np.array([b"alpha", b"beta", b"gamma", b"delta"], dtype=np.object_),
mask=[False, True, False, True],
)
result = _blob_recode(values)
assert_indexed_arrays_equal((result[0], values), result)


def test_blob_object_array_recode_roundtrip_empty_bytes():
values = np.array([b"", b"beta", b"", b"delta"], dtype=np.object_)
result = _blob_recode(values)
expected = ma.masked_array(
data=np.array([None, b"beta", None, b"delta"], dtype=np.object_),
mask=[True, False, True, False],
)
assert_indexed_arrays_equal((result[0], expected), result)