Skip to content

Bug: tdigest const_iterator returns dangling reference causing incorrect values #473

@MaheshGPai

Description

@MaheshGPai

Bug: tdigest const_iterator returns dangling reference causing incorrect values

Summary

The tdigest<T>::const_iterator has as issue where it returns std::pair<const T&, const W> with a reference to a temporary value, leading to undefined behavior and incorrect results when iterating over centroids.

Root Cause

In tdigest.hpp line 319, the iterator's value_type is defined as:

using value_type = std::pair<const T&, const W>;

In tdigest_impl.hpp line 681, operator*() returns:

return value_type(centroids_[index_].get_mean(), centroids_[index_].get_weight());

The problem is that centroid::get_mean() returns by value (line 83 in tdigest.hpp):

T get_mean() const { return mean_; }  // Returns T, not T&

When constructing the pair, const T& binds to the temporary T returned by get_mean(). This temporary is destroyed immediately after the pair is constructed, leaving the reference dangling.

Impact

  • Reading centroid.first or it->first returns garbage values or always returns the same incorrect value
  • The issue manifests differently depending on compiler optimizations and build environment
  • In some environments (like ClickHouse), all iterations return the same incorrect value (e.g., 0.0 for all centroids)
  • In other environments, you get random garbage values like 3.00897e-314

Reproduction

Compile and run this test program with optimizations enabled (-O2 or -O3):

#include <iostream>
#include <memory>
#include "tdigest.hpp"

int main() {
    auto td = std::make_unique<datasketches::tdigest<double>>(100);
    
    // Insert distinct values
    for (int i = 0; i < 10; i++) {
        td->update(i);
    }
    
    std::cout << "Iterating with explicit begin/end:" << std::endl;
    auto it = td->begin();
    auto end_it = td->end();
    
    int count = 0;
    while (it != end_it) {
        double mean = it->first;  // Dangling reference!
        uint64_t weight = it->second;
        std::cout << count++ << ": mean=" << mean << ", weight=" << weight << std::endl;
        ++it;
    }
    
    return 0;
}

Expected output: Different mean values (0, 1, 2, 3, ..., 9)

Actual output: Same value repeated or garbage values:

0: mean=0.000, weight=1
1: mean=0.000, weight=1
2: mean=0.000, weight=1
...

or

0: mean=3.00897e-314, weight=1
1: mean=3.00897e-314, weight=1
...

Why range-based for loops sometimes work

Range-based for loops with const auto&& may appear to work in simple test programs due to compiler-specific temporary lifetime extension, but this is not guaranteed and fails in complex build environments.

Proposed Fix

Change the value_type to return values instead of references:

In tdigest.hpp line 319:

// Before:
using value_type = std::pair<const T&, const W>;

// After:
using value_type = std::pair<T, W>;

This ensures that operator*() returns a pair containing copied values rather than a reference to a temporary, eliminating the undefined behavior.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions