Adding feature maxRowsInMemoryPerSegment by itallam · Pull Request #6 · itallam/druid

itallam · 2021-07-30T19:02:20Z

Description

Druid currently supports existing feature maxRowsInMemory. The existing feature however is setting a maximum number of rows at ingestion time before segments are being persisted. It is a global setting and even segments with smaller number of rows will be persisted once the global maxRowsInMemory setting is reached.
The new feature maxRowsInMemoryPerSegment allows for defining the maximum for each separate segment. This supports improved ingestion on middlemanager side. Only once a single segment has reached the maximum number of rows it will be persisted, and any other remaining segments will not be impacted. This reduces the number writes to segments and makes sure there are no segments with only a few rows getting persisted when it is not needed.

Key changed/added classes in this PR

org.apache.druid.segment.realtime.appenderator.StreamAppenderator
org.apache.druid.segment.indexing.TuningConfig

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
been tested in a test Druid cluster.

Make BrokerDynamicConfig backwards-compatible. By default, brokers who poll coordinators without the broker dynamic config endpoint (and receive a 404) will return an empty BrokerDynamicConfig.

* Add minor fixes to follow up #19091 * Reset state if scaling fails * Add anotehr test * Fixes * Remove extra subtype

…erlord-extensions` and `druid-kubernetes-extensions` (#19071) * fabric8 bump checkpoint * expose webclientoptions in configuration and override additionalconfig for vertx * document new power * Remove unnecessary complication of the druid vertx factory wrapper * use junit5 * bump okhttp to try and fix a dependency enforcer * update licenses.yaml * nit spacing * min dep enforcer requires bump of kotlin-stdlib * tinkering with licenses yaml * Bump kubernetes-extensions k8s dep so we can align on okhttp also get a bunch of other licenses in line * cleanup object mapper for k8s-ol-ext * minor dep updates + exclude protobuf from druid-kubernetes-extensions * Keep working on getting these dependencies to play nicely together * Update aws sdk in licenses * fix kubernetes-extensions pom * Working on static checks still * k8s overlord ext pom

* retention * final * format * review * review

Adds two new Broker TierSelectorStrategy implementations to provide finer control over how Brokers select Historical and Realtime servers for query execution. - strict – Only selects servers whose priorities match the configured list. Unlike other existing strategies, there is no fallback to servers with other priorities if the configured priorities are unavailable. This also addresses a current limitation with watched tiers: when multiple tiers are configured, Brokers can still retain visibility into the state of the cluster, while enforcing query isolation at the time of server selection rather than filtering servers at the time of building the Broker's server view. - pooled – Pools servers across the configured priorities and selects among them, allowing queries to utilize multiple priority tiers for improved availability. This is particularly useful for querying realtime servers where the number of task replicas per tier may be limited for cost reasons. Both strategies require the configured set of priorities to be non-empty. Similar to queries routed to tiers that are not part of the watched tiers, these strategies may result in queries returning no data if the configured tiers are unavailable.

…ction engine (#19113) Changes: - Remove experimental tag from compaction supervisors and MSQ compaction engine - Update docs and remove mention of old properties - Override default engine in provided in the supervisor spec

…19118)

* Prometheus config for mergebuffer used bytes * Typo * Update extensions-contrib/prometheus-emitter/src/main/resources/defaultMetrics.json Co-authored-by: aho135 <andrewho135@gmail.com> --------- Co-authored-by: aho135 <andrewho135@gmail.com>

* bump some dependencies to newer versions * use junit 5.13.x still * revert fastutil bump and re-bump junit

…the MSQ engine (#19059) * incremental-compaction-mode * style * pending * format * build * UncompactedInputSpec * test * uncompacted * format * format2 * format * fix * review * Update indexing-service/src/main/java/org/apache/druid/indexing/common/task/MinorCompactionInputSpec.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/common/actions/MarkSegmentToUpgradeAction.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * set * test * review * minor * bug * checkstyle * format * Update server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * review * review * build --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

…eranceTest Clone of #19126.

…19116) * ScheduledExecutors * min

Adds a new supervisor/count metric when SupervisorStatsMonitor is enabled in druid.monitoring.monitors. The metric reports each supervisor’s state (RUNNING, SUSPENDED, UNHEALTHY_SUPERVISOR, etc.) for Prometheus, StatsD, and other metric systems. Available dimensions are supervisorId`, `type`, `state`, `dataSource`, `stream` (optional), `detailedState` (optional).

Updates the labeler GH action from v5 to v6 to get rid of the deprecated node v20 warnings in the GH action runs.

Update various GH actions to more up-to-date versions. This should address any node@20 warnings.

Fix potential NPE in broker dynamic config syncer. Probably unlikely, but might as well handle gracefully.

Upgrade to AWS SDK v2.40.0. Dependencies hdfs-storage + ranger have transitive dependencies on V1. Those will need to be excluded or upgraded as well. Core modules/extensions required for runtime are not reliant on V1.

Upgrades org.apache.zookeeper from 3.8.4 to 3.8.6 to remediate CVE-2026-24308. --------- Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com>

…oted datasource names in resource check (#19147)

Currently the most natural endpoint to use for service health (e.g. if adding Druid services to a load balancer) is /status/health. However, this does not play nicely with graceful shutdown mechanisms. When druid.server.http.unannouncePropagationDelay is used, there is a delay between unannounce and server shutdown, which allows Druid's internal service discovery to stop sending traffic to a service before it shuts down its server. However, /status/health continues to return OK until the server is shut down, so external load balancers cannot take advantage of this. This patch adds /status/ready, an endpoint that is tied to announcement. It allows external load balancers to take advantage of this graceful shutdown mechanism.

GroupBy queries that group on high-cardinality dimensions can create a large number of spill files. This problem is more likely when queries contain many aggregators and/or aggregators with large memory footprints (e.g., DataSketch). This is because GroupBy can only hold a limited number of unique groupings in memory before flushing to disk — the exact limit depends on the size of each row, which is determined by the size of the aggregators. The issue arises when GroupBy attempts to merge all the spill files. Currently, GroupBy merges spill files by opening all of them simultaneously. Opening these files requires memory for objects such as MappingIterator, SmileParser, etc., which can cause historical nodes to OOM. This PR fixes the issue by introducing a new property: druid.query.groupBy.maxSpillFileCount The maximum number of spill files allowed per GroupBy query. When the limit is reached, the query fails with a ResourceLimitExceededException. This property can be used to prevent historical nodes from OOMing due to an excessive number of spill files being opened simultaneously during the merge phase. Defaults to Integer.MAX_VALUE (unlimited). Can also be set per query via the query context key maxSpillFileCount. Note that this new config, maxSpillFileCount, is complementary to the existing maxOnDiskStorage. maxOnDiskStorage limits total bytes across all spill files, but cannot prevent a large number of tiny files — a query can create hundreds of thousands of spill files while staying well under the byte limit. maxSpillFileCount fills this gap by limiting file count directly, which bounds the number of simultaneously open file handles during the merge phase. This situation arises when aggregators like ThetaSketch pre-allocate a large fixed buffer per row in memory (e.g. ~131KB), causing the buffer to flush frequently with only a small number of rows; since each row corresponds to a unique grouping key in a high-cardinality dimension, each sketch has seen very few values at flush time and serializes to only a few bytes on disk using the sketch's compact format.

It is possible for the KillSupervisorsCustomDutyTest to flake because the final getSupervisorHistory call fails to return 404. It is possible that this happens because the two entries are cleaned up in different duty runs, perhaps because the tombstone was too new to clean up on the first run. To guard against this, wait for the metadata/kill/supervisor/count metric to sum to two.

There is a check in QueryVirtualStorageTest.assertQueryMetrics that verifies query/load/batch/time is zero when query/load/count is zero, and it sometimes flakes. This patch should hopefully fix the flakes.

Fixes #18643 and couple of other typos in documentation.

* Migrate kubernetes-extensions to junit5 * Convert kubernetes-overlord-extensions to junit5 * get rid of imports of static assertions

* improve and fix supervisor view * update test * add aggregate lag label

S3 achieve strong read-after-write consistency in 2020. The current s3 backend architecture assumes a prior consistency model and therefore does some redundant calls which are both slow and costly.

…nner (#19467)

…4 conversion (#19482)

…19481) The home dashboard's Services card has two code paths: the SQL path counts every node role via sys.servers, while the coordinator-only fallback was calling /druid/coordinator/v1/servers?simple — an inventory-backed endpoint that only reports segment-loading servers (historical, peon, and brokers that hold broadcast segments). As a result, on clusters where the console talks to the coordinator without SQL, overlord, coordinator, router, broker, and indexer counts were silently dropped from the card, making the cluster look smaller than it actually is.

…l spill runs (#19439)

…new tab (#19483) Several Blueprint MenuItem and AnchorButton usages in the web console open external links in a new tab via target="_blank" but do not set the companion rel attribute. Unlike the project's own ExternalLink component, Blueprint does not inject rel="noopener noreferrer" automatically (verified against the rendered HTML in about-dialog's snapshot), so each new tab can reach back into the opener window and the destination receives a Referer header. Add rel="noopener noreferrer" to every existing target="_blank" call site that was missing it: the help menu and Explore link in the header bar, the "Visit Druid" button in the about dialog, the DruidSQL docs menu item in the workbench, the array ingest mode docs menu item in the run panel, the flattenSpec help button in the load-data view, and the "Learn more" button in the SQL data loader schema step. Snapshot tests are updated to match the new rendered HTML; no other behavior changes.

…aler (#19369)

) The Docker tutorial links to the `ports` section of `docker-compose.yml` to show how to override the console port. The link was hardcoded to the 0.21.1 release of the file, so readers who follow it land on a four-year -old version that no longer reflects the current cluster layout. The three other GitHub links in the same page (lines 51, 84, and 134) use the `{{DRUIDVERSION}}` template variable, which the docs build replaces with the current Druid release tag — this one link looks like it was just missed. Replace `0.21.1` with `{{DRUIDVERSION}}` so the link follows the rest of the file, and update the line anchor from `#L125` to `#L129` to point at the router service's `ports:` row in the current file layout.

This PR fixes incorrect Peon sink-level query metric emission in SinkQuerySegmentWalker. This bug exists since v32, introduced by #17170. The existing code iterated over METRICS_TO_REPORT and used a switch without break statements. Because reportMetric.getValue() is bound to the current metric reporter, fallthrough caused later accumulator values to be reported through the wrong reporter. For example, the query/segment/time reporter could be called with segment time, then wait time, then segment-and-cache time before emit(). Since DefaultQueryMetrics stores metric values by metric name before emission, the last value overwrote earlier ones.

This adds the query context parameter realtimeSegmentsMode and deprecates realtimeSegmentsOnly. realtimeSegmentsOnly=true maps to realtimeSegmentsOnly=exclusive and realtimeSegmentsOnly=false maps to realtimeSegmentsOnly=include. This is useful when performing things like blue/green deployments and you only want to query new historical replica ASGs and not touch any "live" nodes (neither realtime nor historical).

…19487) The `DRUID_LOG4J` bullet in the Docker tutorial points at line 52 of the example `distribution/docker/environment` file, but that line is empty. The actual `DRUID_LOG4J=` entry is on line 51, so the anchor lands one line below the target. Off by one. Update the anchor from `#L52` to `#L51` so the link lands on the `DRUID_LOG4J=` row in the current file layout.

#19488) The non-SQL fallback in `ServicesView` builds rows from `/druid/coordinator/v1/servers?simple`, which does not return `start_time`, so the column is set to a hardcoded epoch placeholder: start_time: '1970:01:01T00:00:00Z', The date portion uses `:` separators instead of `-`, which makes the literal an invalid ISO 8601 string — `new Date('1970:01:01T00:00:00Z')` returns `Invalid Date`. The `Start time` column's `formatDate` cell hands the value to `dayjs(...).toISOString()`, which throws on an invalid input; the surrounding `try/catch` then falls through and returns the original string as-is. As a result, clusters where the console talks to the coordinator without SQL render every row's `Start time` cell as the literal text `1970:01:01T00:00:00Z` rather than a parsed date. Replace the separators with dashes so the placeholder is a valid ISO 8601 epoch timestamp, matching every other ISO 8601 string literal in the web console (e.g. `doctor-checks.tsx`, `sampler.mock.ts`).

* Multi-K8s task scheduling * Register multik8s task runner factory * Support shared executor in Kubernetes task runner * Build per-cluster Kubernetes runners * Support task report streaming for multik8s * Honor shared informers in multik8s factory * Style changes * Tighten multik8s tag map test types * Fix forbidden APIs * Annotate nullable methods * Licenses fix * Propagate pod template selection in multik8s * Harden multik8s capacity executor * Docs for multik8s * Fix multik8s overlord pod source lookup

This patch adds extension points for InputSpecSlicer and InputSliceReader, and uses them to implement TableInputSpec. This eliminates and generalizes the "newTableInputSpecSlicer" method on the ControllerContext, which was previously needed because the slicing logic differs for tasks and Dart.

…rt (#19317)

Overlord task report requests can sometimes be too eager during ingestion and ping a task before the http server servicing the chat requests has spun up. This causes 4xx/5xx to be returned, which are not correctly parsed by the chat client. While this doesn't explicitly fail the ingestion, it spams the logs and causes confusion.

When virtual storage is not enabled, it is not possible for a segment to be acquired unless it is already-existing.

This patch changes StorageMonitor to get metrics from StorageLocations directly, rather than through SegmentLocalCacheManager. This simplifies the logic by removing an unnecessary layer. It also ensures that metrics are reported properly no matter how the StorageLocations are accessed.

* fix unit tests, bump actions-timeline CI is failing to startup to run unit tests, complaining about actions-timeline version not being allowed, switched to latest per https://github.com/apache/infrastructure-actions/blob/main/actions.yml * fix S3InputSourceTest

changes: * add `PartialSegmentMetadataCacheEntry` a `CacheEntry` that range-reads the V10 header on mount, constructs `PartialSegmentFileMapperV10`, and shrinks its reservation to actual on-disk size * add `PartialSegmentBundleCacheEntry` and `PartialSegmentBundleCacheEntryIdentifier` are `CacheEntry` associated with each file bundle of a v10 segment that sparse-allocates and evicts its containers as a unit; places holds metadata and transitive parent bundle entries holds via the `StorageLocation` methods (weak reference holds on the parent cache entries) and reference-counted usage references * add `PartialSegmentCacheBootstrap` a helper that restores partial-format entries from on-disk layout on historical startup (not wired up yet); cleans orphaned bundles * add `ResizableCacheEntry` interface and `StorageLocation.adjustReservation` (shrink-only) so the metadata entry can tighten its reservation post-mount * rename `SegmentFileBuilder.startFileGroup` → `startFileBundle`; introduce `ROOT_BUNDLE_NAME` as the default bundle for containers written without an explicit declaration * rename json field `SegmentFileContainerMetadata.fileGroup` → `bundle`; now non-null via getter, normalizes to `ROOT_BUNDLE_NAME` in the constructor, default value omitted from JSON using a custom `JsonInclude` filter * Extract shared `DirectoryBackedRangeReader` and `CountingRangeReader` test helpers; consolidate duplicates across processing + server tests

…ory (#19508)

#19497) OrcInputFormat.initialize() — which swaps Thread.currentThread().setContextClassLoader() and calls FileSystem.get(conf) — was invoked on every createReader() call. When a ParallelIndexTask runs multiple ORC subtasks concurrently in the same JVM (as in embedded tests)

jtuglu1 and others added 30 commits March 10, 2026 02:43

fix: make BrokerDynamicConfig backwards-compatible (#19120)

b7ac0bf

Make BrokerDynamicConfig backwards-compatible. By default, brokers who poll coordinators without the broker dynamic config endpoint (and receive a 404) will return an empty BrokerDynamicConfig.

Add minor fixes to follow up #19091 (#19119)

760e7e4

* Add minor fixes to follow up #19091 * Reset state if scaling fails * Add anotehr test * Fixes * Remove extra subtype

Use time-based retention for MSQ query result file cleanup (#19074)

bfcc0e5

* retention * final * format * review * review

add thrift inputformat implementation (#19111)

5c842e7

fix native compaction oom issue (#19121)

ad1b5f2

Fix NPE in OverlordCompactionScheduler.getAllCompactionSnapshots() (#…

a28cc24

…19118)

Add missing @JsonProperty in KubernetesTaskRunnerStaticConfig (#19125)

f4a7400

bump some dependencies to newer versions (#19122)

4e3dc36

* bump some dependencies to newer versions * use junit 5.13.x still * revert fastutil bump and re-bump junit

Publish test container logs on failure and fix broken KinesisFaultTol…

a2ed22c

…eranceTest Clone of #19126.

remove experimental wording for tombstones and k8s ingest (#19128)

6fd9250

Refactored MetadataSegmentView to use at most 500ms initial delay (#…

37e95d4

…19116) * ScheduledExecutors * min

Update labeler from v5 to v6(#19130)

7f23b5d

Updates the labeler GH action from v5 to v6 to get rid of the deprecated node v20 warnings in the GH action runs.

Update more GH actions (#19133)

3b60ab3

Update various GH actions to more up-to-date versions. This should address any node@20 warnings.

Fix potential NPE in broker dynamic config syncer

2277611

Fix potential NPE in broker dynamic config syncer. Probably unlikely, but might as well handle gracefully.

Upgrade to AWS SDK V2 (#18891)

e95e050

Upgrade to AWS SDK v2.40.0. Dependencies hdfs-storage + ranger have transitive dependencies on V1. Those will need to be excluded or upgraded as well. Core modules/extensions required for runtime are not reliant on V1.

Increase timeout in IngestionSmokeTest (#19142)

f3e7a7b

Fix CVE-2026-24308: Upgrade Apache ZooKeeper to 3.8.6 (#19135)

7758049

Upgrades org.apache.zookeeper from 3.8.4 to 3.8.6 to remediate CVE-2026-24308. --------- Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com>

Fix authorization failure for table(append(...)) queries caused by qu…

30cdb06

…oted datasource names in resource check (#19147)

Don't include loadTime in LoadSegmentsResult when not loading. (#19149)

0aedeb9

There is a check in QueryVirtualStorageTest.assertQueryMetrics that verifies query/load/batch/time is zero when query/load/count is zero, and it sometimes flakes. This patch should hopefully fix the flakes.

Fix typos in documentation (#19152)

947ddd5

Fixes #18643 and couple of other typos in documentation.

K8s extensions junit5 migration (#19154)

6dd5459

* Migrate kubernetes-extensions to junit5 * Convert kubernetes-overlord-extensions to junit5 * get rid of imports of static assertions

vogievetsky and others added 30 commits May 18, 2026 13:17

minor: Web console: improve and fix supervisor view (#19475)

375dfd1

* improve and fix supervisor view * update test * add aggregate lag label

bump bouncy castle due to CVE-2026-5588 (#19473)

b04b349

perf: Streamline s3 backend (#19394)

8308a66

S3 achieve strong read-after-write consistency in 2020. The current s3 backend architecture assumes a prior consistency model and therefore does some redundant calls which are both slow and costly.

fix: ensure sequence metadata list accesses are threadsafe in task ru…

1001b6f

…nner (#19467)

Bump postgres dep due to high vuln (#19474)

577dd6a

perf: Optimize HLL sketch merge by avoiding unnecessary HLL_8 to HLL_…

4628e43

…4 conversion (#19482)

perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for smal…

781aba6

…l spill runs (#19439)

refactor: clean up min/max task count limits in streaming task autosc…

6e2f472

…aler (#19369)

refactor: add ColumnFormat implementation for string columns (#19410)

953b505

minor: Jump to the last ever derby version (#19492)

d273641

feat: consolidate S3 client creation and enable ARN role for MSQ expo…

a946950

…rt (#19317)

perf: Hoist config check in acquireSegment. (#19503)

f0ea14b

When virtual storage is not enabled, it is not possible for a segment to be acquired unless it is already-existing.

fix: Close exposure to potential NPE in S3Utils.useHttps (#19504)

47925aa

docs: Stop Referring to MSQ as an extension (#19507)

b3b1ff2

refactor: remove deprecated zookeeper-based task runner (#19500)

50ce465

feat: ingest support for numeric typed ExpressionLambdaAggregatorFact…

06ef24c

…ory (#19508)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding feature maxRowsInMemoryPerSegment#6

Adding feature maxRowsInMemoryPerSegment#6
itallam wants to merge 4826 commits into
itallam:feature-maxrowsinmemorypersegmentfrom
apache:master

itallam commented Jul 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

itallam commented Jul 30, 2021

Description

Key changed/added classes in this PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants