-
Notifications
You must be signed in to change notification settings - Fork 73
Description
Bug: tdigest const_iterator returns dangling reference causing incorrect values
Summary
The tdigest<T>::const_iterator has as issue where it returns std::pair<const T&, const W> with a reference to a temporary value, leading to undefined behavior and incorrect results when iterating over centroids.
Root Cause
In tdigest.hpp line 319, the iterator's value_type is defined as:
using value_type = std::pair<const T&, const W>;In tdigest_impl.hpp line 681, operator*() returns:
return value_type(centroids_[index_].get_mean(), centroids_[index_].get_weight());The problem is that centroid::get_mean() returns by value (line 83 in tdigest.hpp):
T get_mean() const { return mean_; } // Returns T, not T&When constructing the pair, const T& binds to the temporary T returned by get_mean(). This temporary is destroyed immediately after the pair is constructed, leaving the reference dangling.
Impact
- Reading
centroid.firstorit->firstreturns garbage values or always returns the same incorrect value - The issue manifests differently depending on compiler optimizations and build environment
- In some environments (like ClickHouse), all iterations return the same incorrect value (e.g.,
0.0for all centroids) - In other environments, you get random garbage values like
3.00897e-314
Reproduction
Compile and run this test program with optimizations enabled (-O2 or -O3):
#include <iostream>
#include <memory>
#include "tdigest.hpp"
int main() {
auto td = std::make_unique<datasketches::tdigest<double>>(100);
// Insert distinct values
for (int i = 0; i < 10; i++) {
td->update(i);
}
std::cout << "Iterating with explicit begin/end:" << std::endl;
auto it = td->begin();
auto end_it = td->end();
int count = 0;
while (it != end_it) {
double mean = it->first; // Dangling reference!
uint64_t weight = it->second;
std::cout << count++ << ": mean=" << mean << ", weight=" << weight << std::endl;
++it;
}
return 0;
}Expected output: Different mean values (0, 1, 2, 3, ..., 9)
Actual output: Same value repeated or garbage values:
0: mean=0.000, weight=1
1: mean=0.000, weight=1
2: mean=0.000, weight=1
...
or
0: mean=3.00897e-314, weight=1
1: mean=3.00897e-314, weight=1
...
Why range-based for loops sometimes work
Range-based for loops with const auto&& may appear to work in simple test programs due to compiler-specific temporary lifetime extension, but this is not guaranteed and fails in complex build environments.
Proposed Fix
Change the value_type to return values instead of references:
In tdigest.hpp line 319:
// Before:
using value_type = std::pair<const T&, const W>;
// After:
using value_type = std::pair<T, W>;This ensures that operator*() returns a pair containing copied values rather than a reference to a temporary, eliminating the undefined behavior.