Skip to content

Adding feature maxRowsInMemoryPerSegment#6

Open
itallam wants to merge 4826 commits into
itallam:feature-maxrowsinmemorypersegmentfrom
apache:master
Open

Adding feature maxRowsInMemoryPerSegment#6
itallam wants to merge 4826 commits into
itallam:feature-maxrowsinmemorypersegmentfrom
apache:master

Conversation

@itallam
Copy link
Copy Markdown
Owner

@itallam itallam commented Jul 30, 2021

Description

Druid currently supports existing feature maxRowsInMemory. The existing feature however is setting a maximum number of rows at ingestion time before segments are being persisted. It is a global setting and even segments with smaller number of rows will be persisted once the global maxRowsInMemory setting is reached.
The new feature maxRowsInMemoryPerSegment allows for defining the maximum for each separate segment. This supports improved ingestion on middlemanager side. Only once a single segment has reached the maximum number of rows it will be persisted, and any other remaining segments will not be impacted. This reduces the number writes to segments and makes sure there are no segments with only a few rows getting persisted when it is not needed.


Key changed/added classes in this PR
  • org.apache.druid.segment.realtime.appenderator.StreamAppenderator
  • org.apache.druid.segment.indexing.TuningConfig

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • been tested in a test Druid cluster.

jtuglu1 and others added 30 commits March 10, 2026 02:43
Make BrokerDynamicConfig backwards-compatible. By default, brokers who poll coordinators without the broker dynamic config endpoint (and receive a 404) will return an empty BrokerDynamicConfig.
* Add minor fixes to follow up #19091

* Reset state if scaling fails

* Add anotehr test

* Fixes

* Remove extra subtype
…erlord-extensions` and `druid-kubernetes-extensions` (#19071)

* fabric8 bump checkpoint

* expose webclientoptions in configuration and override additionalconfig for vertx

* document new power

* Remove unnecessary complication of the druid vertx factory wrapper

* use junit5

* bump okhttp to try and fix a dependency enforcer

* update licenses.yaml

* nit spacing

* min dep enforcer requires bump of kotlin-stdlib

* tinkering with licenses yaml

* Bump kubernetes-extensions k8s dep so we can align on okhttp

also get a bunch of other licenses in line

* cleanup object mapper for k8s-ol-ext

* minor dep updates + exclude protobuf from druid-kubernetes-extensions

* Keep working on getting these dependencies to play nicely together

* Update aws sdk in licenses

* fix kubernetes-extensions pom

* Working on static checks still

* k8s overlord ext pom
* retention

* final

* format

* review

* review
Adds two new Broker TierSelectorStrategy implementations to provide finer control over how Brokers select Historical and Realtime servers for query execution.

- strict – Only selects servers whose priorities match the configured list.
Unlike other existing strategies, there is no fallback to servers with other priorities if the configured priorities are unavailable. This also addresses a current limitation with watched tiers: when multiple tiers are configured, Brokers can still retain visibility into the state of the cluster, while enforcing query isolation at the time of server selection rather than filtering servers at the time of building the Broker's server view.

- pooled – Pools servers across the configured priorities and selects among them, allowing queries to utilize multiple priority tiers for improved availability. This is particularly useful for querying realtime servers where the number of task replicas per tier may be limited for cost reasons.

Both strategies require the configured set of priorities to be non-empty. Similar to queries routed to tiers that are not part of the watched tiers, these strategies may result in queries returning no data if the configured tiers are unavailable.
…ction engine (#19113)

Changes:
- Remove experimental tag from compaction supervisors and MSQ compaction engine
- Update docs and remove mention of old properties
- Override default engine in provided in the supervisor spec
* Prometheus config for mergebuffer used bytes

* Typo

* Update extensions-contrib/prometheus-emitter/src/main/resources/defaultMetrics.json

Co-authored-by: aho135 <andrewho135@gmail.com>

---------

Co-authored-by: aho135 <andrewho135@gmail.com>
* bump some dependencies to newer versions

* use junit 5.13.x still

* revert fastutil bump and re-bump junit
…the MSQ engine (#19059)

* incremental-compaction-mode

* style

* pending

* format

* build

* UncompactedInputSpec

* test

* uncompacted

* format

* format2

* format

* fix

* review

* Update indexing-service/src/main/java/org/apache/druid/indexing/common/task/MinorCompactionInputSpec.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Update indexing-service/src/main/java/org/apache/druid/indexing/common/actions/MarkSegmentToUpgradeAction.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* set

* test

* review

* minor

* bug

* checkstyle

* format

* Update server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* review

* review

* build

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Adds a new supervisor/count metric when SupervisorStatsMonitor is enabled in druid.monitoring.monitors. The metric reports each supervisor’s state (RUNNING, SUSPENDED, UNHEALTHY_SUPERVISOR, etc.) for Prometheus, StatsD, and other metric systems.


Available dimensions are supervisorId`, `type`, `state`, `dataSource`, `stream` (optional), `detailedState` (optional).
Updates the labeler GH action from v5 to v6 to get rid of the deprecated node v20 warnings in the GH action runs.
Update various GH actions to more up-to-date versions. This should address any node@20 warnings.
Fix potential NPE in broker dynamic config syncer. Probably unlikely, but might as well handle gracefully.
Upgrade to AWS SDK v2.40.0. Dependencies hdfs-storage + ranger have transitive dependencies on V1. Those will need to be excluded or upgraded as well. Core modules/extensions required for runtime are not reliant on V1.
Upgrades org.apache.zookeeper from 3.8.4 to 3.8.6 to remediate CVE-2026-24308.


---------

Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com>
Currently the most natural endpoint to use for service health (e.g. if
adding Druid services to a load balancer) is /status/health. However,
this does not play nicely with graceful shutdown mechanisms.

When druid.server.http.unannouncePropagationDelay is used, there is a
delay between unannounce and server shutdown, which allows Druid's
internal service discovery to stop sending traffic to a service before
it shuts down its server. However, /status/health continues to return OK
until the server is shut down, so external load balancers cannot take
advantage of this.

This patch adds /status/ready, an endpoint that is tied to announcement.
It allows external load balancers to take advantage of this graceful
shutdown mechanism.
GroupBy queries that group on high-cardinality dimensions can create a large number of spill files. This problem is more likely when queries contain many aggregators and/or aggregators with large memory footprints (e.g., DataSketch). This is because GroupBy can only hold a limited number of unique groupings in memory before flushing to disk — the exact limit depends on the size of each row, which is determined by the size of the aggregators. The issue arises when GroupBy attempts to merge all the spill files. Currently, GroupBy merges spill files by opening all of them simultaneously. Opening these files requires memory for objects such as MappingIterator, SmileParser, etc., which can cause historical nodes to OOM.

This PR fixes the issue by introducing a new property: druid.query.groupBy.maxSpillFileCount
The maximum number of spill files allowed per GroupBy query. When the limit is reached, the query fails with a ResourceLimitExceededException. This property can be used to prevent historical nodes from OOMing due to an excessive number of spill files being opened simultaneously during the merge phase. Defaults to Integer.MAX_VALUE (unlimited). Can also be set per query via the query context key maxSpillFileCount.

Note that this new config, maxSpillFileCount, is complementary to the existing maxOnDiskStorage. maxOnDiskStorage limits total bytes across all spill files, but cannot prevent a large number of tiny files — a query can create hundreds of thousands of spill files while staying well under the byte limit. maxSpillFileCount fills this gap by limiting file count directly, which bounds the number of simultaneously open file handles during the merge phase. This situation arises when aggregators like ThetaSketch pre-allocate a large fixed buffer per row in memory (e.g. ~131KB), causing the buffer to flush frequently with only a small number of rows; since each row corresponds to a unique grouping key in a high-cardinality dimension, each sketch has seen very few values at flush time and serializes to only a few bytes on disk using the sketch's compact format.
It is possible for the KillSupervisorsCustomDutyTest to flake
because the final getSupervisorHistory call fails to return 404.

It is possible that this happens because the two entries are
cleaned up in different duty runs, perhaps because the tombstone
was too new to clean up on the first run. To guard against this,
wait for the metadata/kill/supervisor/count metric to sum to two.
There is a check in QueryVirtualStorageTest.assertQueryMetrics that
verifies query/load/batch/time is zero when query/load/count is zero,
and it sometimes flakes. This patch should hopefully fix the flakes.
Fixes #18643 and couple of other typos in documentation.
* Migrate kubernetes-extensions to junit5

* Convert kubernetes-overlord-extensions to junit5

* get rid of imports of static assertions
vogievetsky and others added 30 commits May 18, 2026 13:17
* improve and fix supervisor view

* update test

* add aggregate lag label
S3 achieve strong read-after-write consistency in 2020. The current s3 backend architecture assumes a prior consistency model and therefore does some redundant calls which are both slow and costly.
…19481)

The home dashboard's Services card has two code paths: the SQL path
counts every node role via sys.servers, while the coordinator-only
fallback was calling /druid/coordinator/v1/servers?simple — an
inventory-backed endpoint that only reports segment-loading servers
(historical, peon, and brokers that hold broadcast segments). As a
result, on clusters where the console talks to the coordinator without
SQL, overlord, coordinator, router, broker, and indexer counts were
silently dropped from the card, making the cluster look smaller than it
actually is.
…new tab (#19483)

Several Blueprint MenuItem and AnchorButton usages in the web console
open external links in a new tab via target="_blank" but do not set the
companion rel attribute. Unlike the project's own ExternalLink
component, Blueprint does not inject rel="noopener noreferrer"
automatically (verified against the rendered HTML in about-dialog's
snapshot), so each new tab can reach back into the opener window and
the destination receives a Referer header.

Add rel="noopener noreferrer" to every existing target="_blank" call
site that was missing it: the help menu and Explore link in the header
bar, the "Visit Druid" button in the about dialog, the DruidSQL docs
menu item in the workbench, the array ingest mode docs menu item in the
run panel, the flattenSpec help button in the load-data view, and the
"Learn more" button in the SQL data loader schema step.

Snapshot tests are updated to match the new rendered HTML; no other
behavior changes.
)

The Docker tutorial links to the `ports` section of `docker-compose.yml`
to show how to override the console port. The link was hardcoded to the
0.21.1 release of the file, so readers who follow it land on a four-year
-old version that no longer reflects the current cluster layout. The
three other GitHub links in the same page (lines 51, 84, and 134) use
the `{{DRUIDVERSION}}` template variable, which the docs build replaces
with the current Druid release tag — this one link looks like it was
just missed.

Replace `0.21.1` with `{{DRUIDVERSION}}` so the link follows the rest of
the file, and update the line anchor from `#L125` to `#L129` to point at
the router service's `ports:` row in the current file layout.
This PR fixes incorrect Peon sink-level query metric emission in SinkQuerySegmentWalker. This bug exists since v32, introduced by #17170.

The existing code iterated over METRICS_TO_REPORT and used a switch without break statements. Because reportMetric.getValue() is bound to the current metric reporter, fallthrough caused later accumulator values to be reported through the wrong reporter. For example, the query/segment/time reporter could be called with segment time, then wait time, then segment-and-cache time before emit(). Since DefaultQueryMetrics stores metric values by metric name before emission, the last value overwrote earlier ones.
This adds the query context parameter realtimeSegmentsMode and deprecates realtimeSegmentsOnly. realtimeSegmentsOnly=true maps to realtimeSegmentsOnly=exclusive and realtimeSegmentsOnly=false maps to realtimeSegmentsOnly=include.

This is useful when performing things like blue/green deployments and you only want to query new historical replica ASGs and not touch any "live" nodes (neither realtime nor historical).
…19487)

The `DRUID_LOG4J` bullet in the Docker tutorial points at line 52 of the
example `distribution/docker/environment` file, but that line is empty.
The actual `DRUID_LOG4J=` entry is on line 51, so the anchor lands one
line below the target. Off by one.

Update the anchor from `#L52` to `#L51` so the link lands on the
`DRUID_LOG4J=` row in the current file layout.
#19488)

The non-SQL fallback in `ServicesView` builds rows from
`/druid/coordinator/v1/servers?simple`, which does not return
`start_time`, so the column is set to a hardcoded epoch placeholder:

    start_time: '1970:01:01T00:00:00Z',

The date portion uses `:` separators instead of `-`, which makes the
literal an invalid ISO 8601 string — `new Date('1970:01:01T00:00:00Z')`
returns `Invalid Date`. The `Start time` column's `formatDate` cell
hands the value to `dayjs(...).toISOString()`, which throws on an
invalid input; the surrounding `try/catch` then falls through and
returns the original string as-is. As a result, clusters where the
console talks to the coordinator without SQL render every row's
`Start time` cell as the literal text `1970:01:01T00:00:00Z` rather
than a parsed date.

Replace the separators with dashes so the placeholder is a valid ISO
8601 epoch timestamp, matching every other ISO 8601 string literal in
the web console (e.g. `doctor-checks.tsx`, `sampler.mock.ts`).
* Multi-K8s task scheduling

* Register multik8s task runner factory

* Support shared executor in Kubernetes task runner

* Build per-cluster Kubernetes runners

* Support task report streaming for multik8s

* Honor shared informers in multik8s factory

* Style changes

* Tighten multik8s tag map test types

* Fix forbidden APIs

* Annotate nullable methods

* Licenses fix

* Propagate pod template selection in multik8s

* Harden multik8s capacity executor

* Docs for multik8s

* Fix multik8s overlord pod source lookup
This patch adds extension points for InputSpecSlicer and InputSliceReader,
and uses them to implement TableInputSpec. This eliminates and generalizes
the "newTableInputSpecSlicer" method on the ControllerContext, which was
previously needed because the slicing logic differs for tasks and Dart.
Overlord task report requests can sometimes be too eager during
ingestion and ping a task before the http server servicing the chat
requests has spun up. This causes 4xx/5xx to be returned, which
are not correctly parsed by the chat client. While this doesn't
explicitly fail the ingestion, it spams the logs and causes confusion.
When virtual storage is not enabled, it is not possible for a segment
to be acquired unless it is already-existing.
This patch changes StorageMonitor to get metrics from StorageLocations
directly, rather than through SegmentLocalCacheManager. This simplifies
the logic by removing an unnecessary layer. It also ensures that metrics
are reported properly no matter how the StorageLocations are accessed.
* fix unit tests, bump actions-timeline

CI is failing to startup to run unit tests, complaining about actions-timeline version not being allowed, switched to latest per https://github.com/apache/infrastructure-actions/blob/main/actions.yml

* fix S3InputSourceTest
changes:
* add `PartialSegmentMetadataCacheEntry` a `CacheEntry` that range-reads the V10 header on mount, constructs `PartialSegmentFileMapperV10`, and shrinks its reservation to actual on-disk size
* add `PartialSegmentBundleCacheEntry` and `PartialSegmentBundleCacheEntryIdentifier` are `CacheEntry` associated with each file bundle of a v10 segment that sparse-allocates and evicts its containers as a unit; places holds metadata and transitive parent bundle entries holds via the `StorageLocation` methods (weak reference holds on the parent cache entries) and reference-counted usage references
* add `PartialSegmentCacheBootstrap` a helper that restores partial-format entries from on-disk layout on historical startup (not wired up yet); cleans orphaned bundles
* add `ResizableCacheEntry` interface and `StorageLocation.adjustReservation` (shrink-only) so the metadata entry can tighten its reservation post-mount
* rename `SegmentFileBuilder.startFileGroup` → `startFileBundle`; introduce `ROOT_BUNDLE_NAME` as the default bundle for containers written without an explicit declaration                                                              * rename json field `SegmentFileContainerMetadata.fileGroup` → `bundle`; now non-null via getter, normalizes to `ROOT_BUNDLE_NAME` in the constructor, default value omitted from JSON using a custom `JsonInclude` filter
* Extract shared `DirectoryBackedRangeReader` and `CountingRangeReader` test helpers; consolidate duplicates across processing + server tests
#19497)

OrcInputFormat.initialize() — which swaps Thread.currentThread().setContextClassLoader() and calls FileSystem.get(conf) — was invoked on every createReader() call. When a ParallelIndexTask runs multiple ORC subtasks concurrently in the same JVM (as in embedded tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.