[Bug](parquet) SIGSEGV in gen_filter_map when reading nested columns with filter_all=true #63887

@Larborator

Description

Search before asking

  • I have searched the issues and found no similar issues.

Version

4.x/master

What's Wrong?

BE crashes with SIGSEGV (null pointer dereference at 0x0) when querying Parquet-based external tables (Paimon/Hive/Iceberg) with nested type columns (Struct/Array/Map), if a predicate filters out all rows in a RowGroup.

The crash occurs in ScalarColumnReader::gen_filter_map, which dereferences filter_map.filter_map_data(); that pointer is nullptr when filter_all=true.

Root Cause: _read_nested_column checks only has_filter(), not filter_all(). When all rows are filtered out, FilterMap is initialized via init(nullptr, total_rows, true), which sets _has_filter=true but leaves _filter_map_data=nullptr. The newer FilterMap::generate_nested_filter_map already has the correct guard (if (!has_filter() || filter_all()) return error), but the inline gen_filter_map lacks this check.
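
To make the state described above concrete, here is a minimal sketch of a filter map that has been initialized in the filter_all case, together with a guarded expansion routine. `FilterMapSketch` and `expand_to_nested` are hypothetical names used only for illustration; they are not the Doris classes, and the real `FilterMap` may differ in layout and error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for doris::FilterMap; only the state named in the
// root-cause description is modelled here.
struct FilterMapSketch {
    const uint8_t* data = nullptr; // row-level filter, one byte per row
    size_t size = 0;               // number of rows covered by the filter
    bool has_filter = false;
    bool filter_all = false;

    // Assumed to mirror the init(nullptr, total_rows, true) call in the issue.
    void init(const uint8_t* filter_data, size_t rows, bool all) {
        data = filter_data;
        size = rows;
        has_filter = true; // set even when filter_data is nullptr
        filter_all = all;
    }
};

// Expands a row-level filter to one entry per repetition level.
// Returns false without reading `data` whenever the pointer may be null.
inline bool expand_to_nested(const FilterMapSketch& fm,
                             const std::vector<uint16_t>& rep_levels,
                             std::vector<uint8_t>& out) {
    if (!fm.has_filter || fm.filter_all || fm.data == nullptr) {
        return false; // guard that the inline path is reported to lack
    }
    out.resize(rep_levels.size());
    size_t row = 0;
    for (size_t i = 0; i < rep_levels.size(); ++i) {
        if (i != 0 && rep_levels[i] == 0) ++row; // rep level 0 starts a new row
        if (row >= fm.size) return false;        // more rows than filter entries
        out[i] = fm.data[row];
    }
    return true;
}
```

With a map initialized as `init(nullptr, 3, true)`, `expand_to_nested` returns false and never touches the pointer, so the caller has to handle the filter_all case itself (for example by skipping the values). With a real row filter `{1, 0, 1}` and repetition levels `{0, 1, 1, 0, 1, 0}`, the output is `{1, 1, 1, 0, 0, 1}`.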

*** SIGSEGV address not mapped to object (@0x0) received by PID 72584 (TID 88200 OR 0x7f319ec10700) from PID 0; stack trace: ***
 0# 0x000055EC0721DC35 in /doris/be/lib/doris_be
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/local/jdk-17.0.10/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/local/jdk-17.0.10/lib/server/libjvm.so
 3# 0x00007F783FC78630 in /lib64/libpthread.so.0
 4# doris::ScalarColumnReader<false, true>::gen_filter_map(doris::FilterMap&, unsigned long, unsigned long, unsigned long, std::vector<unsigned char, std::allocator<unsigned char> >&, std::unique_ptr<doris::FilterMap, std::default_delete<doris::FilterMap> >*) in /doris/be/lib/doris_be
 5# doris::ScalarColumnReader<false, true>::_read_nested_column(doris::COW<doris::IColumn>::immutable_ptr<doris::IColumn>&, std::shared_ptr<doris::IDataType const>&, doris::FilterMap&, unsigned long, unsigned long*, bool*, bool)::{lambda(unsigned long, unsigned long)#1}::operator()(unsigned long, unsigned long) const in /doris/be/lib/doris_be
 6# doris::ScalarColumnReader<false, true>::_read_nested_column(doris::COW<doris::IColumn>::immutable_ptr<doris::IColumn>&, std::shared_ptr<doris::IDataType const>&, doris::FilterMap&, unsigned long, unsigned long*, bool*, bool) in /doris/be/lib/doris_be
 7# doris::ScalarColumnReader<false, true>::read_column_data(doris::COW<doris::IColumn>::immutable_ptr<doris::IColumn>&, std::shared_ptr<doris::IDataType const>&, std::shared_ptr<doris::TableSchemaChangeHelper::Node> const&, doris::FilterMap&, unsigned long, unsigned long*, bool*, bool, long) in /doris/be/lib/doris_be
 8# doris::StructColumnReader::read_column_data(doris::COW<doris::IColumn>::immutable_ptr<doris::IColumn>&, std::shared_ptr<doris::IDataType const>&, std::shared_ptr<doris::TableSchemaChangeHelper::Node> const&, doris::FilterMap&, unsigned long, unsigned long*, bool*, bool, long) in /doris/be/lib/doris_be
 9# doris::RowGroupReader::_read_column_data(doris::Block*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, unsigned long, unsigned long*, bool*, doris::FilterMap&) in /doris/be/lib/doris_be
10# doris::RowGroupReader::_do_lazy_read(doris::Block*, unsigned long, unsigned long*, bool*) in /doris/be/lib/doris_be
11# doris::RowGroupReader::next_batch(doris::Block*, unsigned long, unsigned long*, bool*) in /doris/be/lib/doris_be
12# doris::ParquetReader::get_next_block(doris::Block*, unsigned long*, bool*) in /doris/be/lib/doris_be
13# doris::PaimonReader::get_next_block_inner(doris::Block*, unsigned long*, bool*) in /doris/be/lib/doris_be
14# doris::TableFormatReader::get_next_block(doris::Block*, unsigned long*, bool*) in /doris/be/lib/doris_be
15# doris::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::Block*, bool*) in /doris/be/lib/doris_be
16# doris::FileScanner::_get_block_impl(doris::RuntimeState*, doris::Block*, bool*) in /doris/be/lib/doris_be
17# doris::Scanner::get_block(doris::RuntimeState*, doris::Block*, bool*) in /doris/be/lib/doris_be
18# doris::Scanner::get_block_after_projects(doris::RuntimeState*, doris::Block*, bool*) in /doris/be/lib/doris_be
19# doris::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::ScannerContext>, std::shared_ptr<doris::ScanTask>) in /doris/be/lib/doris_be
20# 0x000055EC0CCBBB55 in /doris/be/lib/doris_be
21# doris::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) in /doris/be/lib/doris_be
22# doris::PrioritizedSplitRunner::process() in /doris/be/lib/doris_be
23# doris::TimeSharingTaskExecutor::_dispatch_thread() in /doris/be/lib/doris_be
24# doris::Thread::supervise_thread(void*) in /doris/be/lib/doris_be
25# start_thread in /lib64/libpthread.so.0
26# __clone in /lib64/libc.so.6

What You Expected?

The query should return results without crashing. When `filter_all=true`, nested columns should be correctly skipped without dereferencing `nullptr`.


How to Reproduce?

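The following standalone program reproduces the core logic of `gen_filter_map`. As written, it takes the guarded `if (filter_all)` path and passes. Removing that guard so the `else` branch runs while `filter_map_data` is `nullptr` dereferences a null pointer and crashes with SIGSEGV: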

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Simulate filter_all=true: filter_map_data is nullptr
    const uint8_t* filter_map_data = nullptr;
    bool has_filter = true;
    bool filter_all = true;

    // rep_levels for a nested column: 3 rows with varying element counts
    std::vector<uint16_t> rep_levels = {0, 1, 1, 0, 1, 0};
    std::vector<uint8_t> nested_filter_map_data;

    if (has_filter) {
        if (filter_all) {
            // FIX: skip gen_filter_map, produce all-zero nested filter
            nested_filter_map_data.assign(rep_levels.size(), 0);
            printf("PASS: filter_all path correctly produces all-zero nested filter\n");
        } else {
            // BUG: dereferences nullptr → SIGSEGV
            size_t filter_loc = 0;
            nested_filter_map_data.resize(rep_levels.size());
            for (size_t i = 0; i < rep_levels.size(); i++) {
                if (i != 0 && rep_levels[i] == 0) filter_loc++;
                nested_filter_map_data[i] = filter_map_data[filter_loc]; // CRASH HERE
            }
        }
    }

    for (auto v : nested_filter_map_data) assert(v == 0);
    printf("All elements filtered — correct behavior\n");
    return 0;
}
```

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
