Search before asking
Version
Affected images: from doris:be-4.0.0 to doris:be-4.1.1
Tested with: apache/doris:fe-4.1.1 / apache/doris:be-4.1.1
Runtime version reported by FE/BE:
doris-4.1.1-rc01-b10073ad9ca
Official source tag/commit matched from Docker BE log:
4.1.1
b10073ad9ca17cd5685c4dd3b3ef650f256376d0
Test environment:
Host architecture: arm64
Docker client: 28.5.2
Docker server: 29.5.2
FE image: apache/doris:fe-4.1.1, linux/arm64, sha256:318ab41551d884ded601366193d6115ffdf6471e78e28c572c3dfcfa99d2255e
BE image: apache/doris:be-4.1.1, linux/arm64, sha256:4905607a194641fb47284b616836766aca283ae85575491c799351a41deec60d
What's Wrong?
A minimal query using nested lambda expressions crashes Doris BE with SIGSEGV.
The query does not require any physical table or inserted data. It only builds a one-element array with array_agg('a'), then evaluates array_map(x -> array_count(y -> y = x, ids), ids).
Minimal crashing query:
WITH base AS (
SELECT array_agg('a') AS ids
)
SELECT array_map(
x -> array_count(y -> y = x, ids),
ids
) AS result
FROM base;
Expected result:
result
[1]
Actual client error:
ERROR 1105 (HY000) at line 1: RpcException, msg: send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, host: 172.30.82.3
The BE process exits. docker ps -a shows:
doris-minrepro-be Exited (0) 9 seconds ago apache/doris:be-4.1.1
BE log shows the failing query and SIGSEGV:
*** Query id: f944e19b5ee14d86-bfd49529a73f9a6a ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1779956843 (unix time) try "date -d @1779956843" if you are using GNU date ***
*** Current BE git commitID: b10073ad9ca ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 762 (TID 1215 OR 0xfffde7f865c0) from PID 0; stack trace: ***
Relevant stack frames:
doris::signal::(anonymous namespace)::FailureSignalHandler
doris::is_column_const
doris::PreparedFunctionImpl::default_implementation_for_constant_arguments
doris::VectorizedFnCall::_do_execute
doris::VLambdaFunctionExpr::execute_column
doris::ArrayMapFunction::execute
doris::VectorizedFnCall::_do_execute
doris::VLambdaFunctionExpr::execute_column
doris::ArrayMapFunction::execute
doris::PipelineTask::execute
This is not only a crash. Nested lambda binding also appears semantically wrong even when the query does not crash:
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']) AS should_be_zero;
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['a']) AS should_be_one;
Expected:
should_be_zero
[0]
should_be_one
[1]
Actual on 4.1.1:
should_be_zero
[1]
should_be_one
[1]
So the root issue seems to be nested lambda scoping/capture, and the array_agg version turns the same scoping bug into an invalid column access and BE crash.
What You Expected?
The minimal crashing query should return [1] and should never crash BE.
Nested lambdas should resolve variables according to lexical lambda scope. For example:
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']);
should return [0], because the outer lambda variable x is 'b', while the inner array contains only 'a'.
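To make the expected semantics concrete, here is a minimal C++ sketch of the same nested-lambda pattern using ordinary lexical capture. It is only an analogy: `map_count_equal` is a hypothetical helper name chosen for this illustration, not a Doris function, and it says nothing about how the BE evaluates lambdas internally.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not a Doris API). Models
//   array_map(x -> array_count(y -> y = x, inner), outer)
// with plain C++ lambdas, where the inner lambda captures `x` lexically.
std::vector<long> map_count_equal(const std::vector<std::string>& outer,
                                  const std::vector<std::string>& inner) {
    std::vector<long> result;
    result.reserve(outer.size());
    for (const std::string& x : outer) {
        // `y` is the inner lambda's own parameter; `x` comes from the enclosing scope.
        auto matches = std::count_if(inner.begin(), inner.end(),
                                     [&x](const std::string& y) { return y == x; });
        result.push_back(static_cast<long>(matches));
    }
    return result;
}
```

With `outer = {"b"}` and `inner = {"a"}` this returns a one-element vector holding 0, and with `outer = {"a"}` it holds 1, which is the behavior the SQL above is expected to have.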
BE should also fail safely with a query error if an expression binding is invalid. A SQL expression should not be able to terminate the BE process.
How to Reproduce?
Run a minimal FE/BE Docker cluster:
docker rm -f doris-minrepro-fe doris-minrepro-be >/dev/null 2>&1 || true
docker network rm doris-minrepro >/dev/null 2>&1 || true
docker network create --subnet 172.30.82.0/24 doris-minrepro
docker run -d \
  --name doris-minrepro-fe \
  --network doris-minrepro \
  --ip 172.30.82.2 \
  -p 39030:9030 \
  -p 38030:8030 \
  -e FE_SERVERS=fe1:172.30.82.2:9010 \
  -e FE_ID=1 \
  apache/doris:fe-4.1.1
for i in $(seq 1 90); do
  if docker exec doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 -e 'SHOW FRONTENDS'; then
    break
  fi
  sleep 2
done
docker run -d \
  --name doris-minrepro-be \
  --network doris-minrepro \
  --ip 172.30.82.3 \
  -p 38040:8040 \
  -e FE_MASTER_IP=172.30.82.2 \
  -e BE_IP=172.30.82.3 \
  -e BE_PORT=9050 \
  apache/doris:be-4.1.1
for i in $(seq 1 120); do
  docker exec doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 -e 'SHOW BACKENDS' > /tmp/doris-minrepro-backends.out 2>&1 || true
  if grep -q 'true' /tmp/doris-minrepro-backends.out; then
    cat /tmp/doris-minrepro-backends.out
    break
  fi
  sleep 2
done
Run normal control queries first:
docker exec -i doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 <<'SQL'
WITH base AS (SELECT array_agg('a') AS ids)
SELECT ids FROM base;
WITH base AS (SELECT array_agg('a') AS ids)
SELECT array_count(y -> y = 'a', ids) AS result FROM base;
WITH base AS (SELECT array_agg('a') AS ids)
SELECT array_map(x -> x, ids) AS result FROM base;
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['a']) AS expected_result;
SQL
Expected and observed output:
ids
["a"]
result
1
result
["a"]
expected_result
[1]
Run the crashing query:
docker exec -i doris-minrepro-fe mysql -uroot -h127.0.0.1 -P9030 <<'SQL'
WITH base AS (
SELECT array_agg('a') AS ids
)
SELECT array_map(
x -> array_count(y -> y = x, ids),
ids
) AS result
FROM base;
SQL
Observed:
ERROR 1105 (HY000) at line 1: RpcException, msg: send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, host: 172.30.82.3
Collect evidence:
docker ps -a --filter name=doris-minrepro-be --format '{{.Names}}\t{{.Status}}\t{{.Image}}'
docker logs doris-minrepro-be 2>&1 | \
grep -En 'Query id|SIGSEGV|FailureSignalHandler|is_column_const|VectorizedFnCall|VLambdaFunctionExpr|ArrayMapFunction|PipelineTask'
rm -rf /tmp/doris-minrepro-log
docker cp doris-minrepro-be:/opt/apache-doris/be/log /tmp/doris-minrepro-log
Optional physical-table reproduction with one row:
CREATE DATABASE IF NOT EXISTS repro;
USE repro;
DROP TABLE IF EXISTS t_array_lambda;
CREATE TABLE t_array_lambda (
es_id VARCHAR(8),
fid VARCHAR(8)
)
DISTRIBUTED BY HASH(es_id) BUCKETS 1
PROPERTIES (
"replication_num" = "1"
);
INSERT INTO t_array_lambda VALUES ('g', 'a');
WITH base AS (
SELECT es_id, array_agg(fid) AS ids
FROM t_array_lambda
GROUP BY es_id
)
SELECT array_map(
x -> array_count(y -> y = x, ids),
ids
) AS result
FROM base;
Anything Else?
Source-level investigation
The Docker BE log reports commit b10073ad9ca, which matches the official 4.1.1 tag target:
git ls-remote https://github.com/apache/doris.git | grep 'refs/tags/4.1.1'
# 73552015e7587a04f857cb0257fc6e178958d389 refs/tags/4.1.1
# b10073ad9ca17cd5685c4dd3b3ef650f256376d0 refs/tags/4.1.1^{}
From source inspection, the likely root cause is in ArrayMapFunction's handling of VColumnRef gaps for nested lambdas.
ArrayMapFunction::execute collects slot refs from the lambda body, computes gap, then recursively applies that gap to all VColumnRef nodes under the body:
LambdaArgs args_info;
// collect used slot ref in lambda function body
std::vector<int>& output_slot_ref_indexs = args_info.output_slot_ref_indexs;
_collect_slot_ref_column_id(children[0], output_slot_ref_indexs);

int gap = 0;
if (!output_slot_ref_indexs.empty()) {
    auto max_id =
            std::max_element(output_slot_ref_indexs.begin(), output_slot_ref_indexs.end());
    gap = *max_id + 1;
    _set_column_ref_column_id(children[0], gap);
}
The recursive setter does not appear to stop at nested lambda boundaries:
void _set_column_ref_column_id(VExprSPtr expr, int gap) const {
    for (const auto& child : expr->children()) {
        if (child->is_column_ref()) {
            auto* ref = static_cast<VColumnRef*>(child.get());
            ref->set_gap(gap);
        } else {
            _set_column_ref_column_id(child, gap);
        }
    }
}
VColumnRef::set_gap only sets the gap once, so an inner lambda variable can inherit the outer lambda gap and cannot be corrected by the inner ArrayMapFunction execution:
void set_gap(int gap) {
    if (_gap == 0) {
        _gap = gap;
    }
}
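The interaction between the recursive setter and the set-once `set_gap` can be reproduced in a toy model. The sketch below uses stand-in types (`Node`, `Kind`, `make_node`, `inner_gap_after_both_passes` are all hypothetical names, not Doris classes), and the gap values passed in are arbitrary; it only shows that a column ref under a nested lambda keeps the gap written by the outer pass.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for the expression tree; these are NOT the Doris classes.
enum class Kind { Call, Lambda, ColumnRef };

struct Node {
    Kind kind = Kind::Call;
    int gap = 0;  // only meaningful for ColumnRef nodes
    std::vector<std::shared_ptr<Node>> children;

    // Mirrors the quoted set_gap: the first non-zero gap sticks.
    void set_gap(int new_gap) {
        if (gap == 0) {
            gap = new_gap;
        }
    }
};

using NodePtr = std::shared_ptr<Node>;

NodePtr make_node(Kind kind, std::vector<NodePtr> children = {}) {
    auto node = std::make_shared<Node>();
    node->kind = kind;
    node->children = std::move(children);
    return node;
}

// Mirrors the quoted recursive setter: it descends through every non-ColumnRef
// child, so it also reaches ColumnRefs that sit below a nested Lambda node.
void set_gap_recursive(const NodePtr& expr, int gap) {
    assert(expr != nullptr);
    for (const auto& child : expr->children) {
        if (child->kind == Kind::ColumnRef) {
            child->set_gap(gap);
        } else {
            set_gap_recursive(child, gap);
        }
    }
}

// Builds Call(outer) -> Lambda -> Call(inner) -> Lambda -> ColumnRef (inner `y`),
// applies the outer gap first and the inner gap second, and returns the gap
// that the inner ColumnRef ends up with.
int inner_gap_after_both_passes(int outer_gap, int inner_gap) {
    NodePtr inner_ref = make_node(Kind::ColumnRef);
    NodePtr inner_call = make_node(Kind::Call, {make_node(Kind::Lambda, {inner_ref})});
    NodePtr outer_call = make_node(Kind::Call, {make_node(Kind::Lambda, {inner_call})});

    set_gap_recursive(outer_call->children[0], outer_gap);  // outer execution
    set_gap_recursive(inner_call->children[0], inner_gap);  // inner execution
    return inner_ref->gap;
}
```

In this model `inner_gap_after_both_passes(2, 1)` returns 2: the outer pass reaches the inner ref first, and the inner pass can no longer overwrite it.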
At execution time, VColumnRef reads the column by column_id + gap:
Status execute_column(VExprContext* context, const Block* block, Selector* selector,
                      size_t count, ColumnPtr& result_column) const override {
    DCHECK(_open_finished || block == nullptr);
    auto origin_column = block->get_by_position(_column_id + _gap).column;
    result_column = filter_column_with_selector(origin_column, selector, count);
    return Status::OK();
}

DataTypePtr execute_type(const Block* block) const override {
    DCHECK(_open_finished || block == nullptr);
    return block->get_by_position(_column_id + _gap).type;
The const overload of Block::get_by_position has no runtime boundary check:
ColumnWithTypeAndName& get_by_position(size_t position) {
    DCHECK(data.size() > position)
            << ", data.size()=" << data.size() << ", position=" << position;
    return data[position];
}

const ColumnWithTypeAndName& get_by_position(size_t position) const { return data[position]; }
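For contrast, a bounds-checked lookup of the kind the const overload lacks could look roughly like the sketch below. `ToyBlock` and `checked_column_at` are hypothetical names used only for this illustration; they are not Doris types or functions, and real code would return a `Status` rather than `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy block holding column names only. NOT the Doris Block class.
struct ToyBlock {
    std::vector<std::string> columns;
};

// Hypothetical checked lookup: computes column_id + gap and returns
// std::nullopt when the position is past the end, instead of indexing
// out of range as an unchecked operator[] would.
std::optional<std::string> checked_column_at(const ToyBlock& block,
                                             std::size_t column_id,
                                             std::size_t gap) {
    const std::size_t position = column_id + gap;
    if (position >= block.columns.size()) {
        return std::nullopt;  // the caller can report a query error here
    }
    return block.columns[position];
}
```

With a two-column block, `checked_column_at(block, 0, 1)` yields a value, while `checked_column_at(block, 0, 2)` yields `std::nullopt`, which a caller could surface as a failed query rather than a crashed process.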
The wrong/out-of-range ColumnPtr is then dereferenced in the constant-argument path:
PreparedFunctionImpl::default_implementation_for_constant_arguments:
Status PreparedFunctionImpl::default_implementation_for_constant_arguments(
        FunctionContext* context, Block& block, const ColumnNumbers& args, uint32_t result,
        size_t input_rows_count, bool* executed) const {
    *executed = false;
    ColumnNumbers args_expect_const = get_arguments_that_are_always_constant();

    // Check that these arguments are really constant.
    for (auto arg_num : args_expect_const) {
        if (arg_num < args.size() &&
            !is_column_const(*block.get_by_position(args[arg_num]).column)) {
            return Status::InvalidArgument("Argument at index {} for function {} must be constant",
                                           arg_num, get_name());
        }
    }

    if (args.empty() || !use_default_implementation_for_constants() ||
        !VectorizedUtils::all_arguments_are_constant(block, args)) {
        return Status::OK();
    }
|
VectorizedUtils::all_arguments_are_constant:
static bool all_arguments_are_constant(const Block& block, const ColumnNumbers& args) {
    for (const auto& arg : args) {
        if (!is_column_const(*block.get_by_position(arg).column)) {
            return false;
        }
    }
    return true;
}
is_column_const:
bool is_column_const(const IColumn& column) {
    return is_column<ColumnConst>(column);
}
This matches the observed crash stack.
Suggested fix direction
I think the fix should be in lambda variable scoping, not just at the crash site.
- Make _set_column_ref_column_id and _collect_slot_ref_column_id scope-aware.
  - When traversing the current lambda body, do not recurse into nested LAMBDA_FUNCTION_EXPR / nested lambda bodies as if they belonged to the same lambda scope.
  - Inner lambda parameters should be assigned/resolved by the inner lambda execution context only.
- Avoid storing mutable execution-specific gap state directly on shared VColumnRef nodes when nested lambda expressions can reuse the same expression tree.
  - Passing the gap through an execution context, or cloning/rebinding the relevant lambda body per scope, would be safer than mutating VColumnRef::_gap globally.
- Add a runtime guard in VColumnRef::execute_column or use safe_get_by_position before dereferencing.
  - This would turn the current process crash into a query error.
  - However, this is only a safety net. It would not fix the wrong-result case shown above.
- Add regression tests for both cases:
-- Should not crash; should return [1]
WITH base AS (
SELECT array_agg('a') AS ids
)
SELECT array_map(
x -> array_count(y -> y = x, ids),
ids
) AS result
FROM base;
-- Should return [0], not [1]
SELECT array_map(x -> array_count(y -> y = x, ['a']), ['b']) AS should_be_zero;
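The first two suggestions (scope-aware resolution, and keeping the gap out of the shared expression nodes) could be combined along the lines of the sketch below. Everything here is an assumption made for illustration: `ScopedColumnRef`, `LambdaExecContext`, `resolve_position`, and the idea of tagging each ref with an owning scope id do not exist in Doris, and the actual fix may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical types for illustration only; none of these exist in Doris.
struct ScopedColumnRef {
    int owner_scope;        // id of the lambda whose parameter this ref names
    std::size_t column_id;  // parameter index within that lambda
};

// Execution-time state: one gap per lambda scope, kept outside the
// expression tree so that no node has to be mutated during execution.
struct LambdaExecContext {
    std::unordered_map<int, std::size_t> gap_by_scope;
};

// Resolves a ref to a block position using the gap of the scope that owns it.
// Precondition: the owning scope has registered its gap in the context.
std::size_t resolve_position(const ScopedColumnRef& ref, const LambdaExecContext& ctx) {
    auto it = ctx.gap_by_scope.find(ref.owner_scope);
    assert(it != ctx.gap_by_scope.end());
    return ref.column_id + it->second;
}
```

Because each ref looks up the gap of its own scope, an outer parameter captured inside an inner lambda body and the inner lambda's own parameter resolve independently, and neither depends on which traversal happened to run first.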
Production context
The production query that first exposed the bug was an account-behavior aggregation using this pattern:
WITH base AS (
SELECT
es_id,
ARRAY_AGG(fid) AS ids
FROM query_log
WHERE created_at >= '2026-05-18 15:26:00'
AND created_at < '2026-05-18 15:36:00'
GROUP BY es_id
)
SELECT
COUNT(*) AS groups_count,
SUM(ARRAY_SIZE(array_map(
x -> array_count(y -> y = x, ids),
array_distinct(ids)
))) AS bucket_count
FROM base;
The minimal reproduction above removes the production table, timestamp filter, grouping cardinality, and data volume from the equation. A single array_agg('a') is enough to reproduce the BE crash.
Are you willing to submit PR?
Code of Conduct
Source locations of the snippets quoted under "Source-level investigation" (commit b10073a):
- ArrayMapFunction::execute gap computation: doris/be/src/exprs/lambda_function/varray_map_function.cpp, lines 82 to 93
- _set_column_ref_column_id: doris/be/src/exprs/lambda_function/varray_map_function.cpp, lines 339 to 347
- VColumnRef::set_gap: doris/be/src/exprs/vcolumn_ref.h, lines 78 to 82
- VColumnRef::execute_column and execute_type: doris/be/src/exprs/vcolumn_ref.h, lines 59 to 69
- Block::get_by_position: doris/be/src/core/block/block.h, lines 127 to 132
- PreparedFunctionImpl::default_implementation_for_constant_arguments: doris/be/src/exprs/function/function.cpp, lines 122 to 141
- VectorizedUtils::all_arguments_are_constant: doris/be/src/exec/common/util.hpp, lines 156 to 163
- is_column_const: doris/be/src/core/column/column.cpp, lines 80 to 82