HIVE-29638: Add AutoScaling to K8s operator by ayushtkn · Pull Request #6507 · apache/hive

ayushtkn · 2026-05-26T13:56:20Z

What changes were proposed in this pull request?

Add auto scaling to Hive Operator

Why are the changes needed?

Better resource utilization and cloud cost savings.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

Installed Dependencies (ZK, Postgres & Ozone)

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install zookeeper bitnami/zookeeper \
  --set replicaCount=1 --set auth.enabled=false \
  --set image.repository=bitnamilegacy/zookeeper \
  --set image.tag=3.9.3-debian-12-r21 \
  --set global.security.allowInsecureImages=true --wait


helm install postgres bitnami/postgresql \
  --set auth.username=hive --set auth.password=hive123 \
  --set auth.database=metastore --wait


kubectl create secret generic hive-db-secret --from-literal=password=hive123


helm repo add ozone https://apache.github.io/ozone-helm-charts/
helm install ozone ozone/ozone --version 0.2.0 --wait
sleep 50
kubectl exec statefulset/ozone-om -- ozone sh volume create /s3v
kubectl exec statefulset/ozone-om -- ozone sh bucket create /s3v/hive
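
The fixed `sleep 50` above is a guess at how long the Ozone Manager needs before it accepts commands. A bounded retry loop is one way to make that step less timing-dependent. The `retry` helper below is hypothetical and not part of this PR; it re-runs a command until it succeeds or the attempts run out.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): run a command up to $1 times,
# pausing $2 seconds between failed attempts.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only pause when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Possible usage in place of `sleep 50` (same command as in the description):
# retry 20 5 kubectl exec statefulset/ozone-om -- ozone sh volume create /s3v
```

The attempt count and delay in the usage comment are arbitrary illustration values, not tuned numbers.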

Started Hive Operator With AutoScaling Enabled (Very Low Thresholds for Testing)

helm install hive ./helm/hive-operator \
  --set cluster.database.type=postgres \
  --set cluster.database.url="jdbc:postgresql://postgres-postgresql:5432/metastore" \
  --set cluster.database.driver="org.postgresql.Driver" \
  --set cluster.database.username=hive \
  --set cluster.database.passwordSecretRef.name=hive-db-secret \
  --set cluster.database.passwordSecretRef.key=password \
  --set cluster.database.driverJarUrl="https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.5/postgresql-42.7.5.jar" \
  --set cluster.zookeeper.quorum="zookeeper:2181" \
  --set cluster.storage.coreSiteOverrides."fs\.defaultFS"="s3a://hive" \
  --set cluster.storage.coreSiteOverrides."fs\.s3a\.endpoint"="http://ozone-s3g-rest:9878" \
  --set-string cluster.storage.coreSiteOverrides."fs\.s3a\.path\.style\.access"=true \
  --set 'cluster.storage.envVars[0].name=HADOOP_OPTIONAL_TOOLS' \
  --set 'cluster.storage.envVars[0].value=hadoop-aws' \
  --set 'cluster.storage.envVars[1].name=AWS_ACCESS_KEY_ID' \
  --set 'cluster.storage.envVars[1].value=ozone' \
  --set 'cluster.storage.envVars[2].name=AWS_SECRET_ACCESS_KEY' \
  --set 'cluster.storage.envVars[2].value=ozone' \
  --set cluster.hiveServer2.autoscaling.enabled=true \
  --set cluster.metastore.autoscaling.enabled=true \
  --set cluster.llap.autoscaling.enabled=true \
  --set cluster.tezAm.autoscaling.enabled=true \
  --set-string cluster.llap.configOverrides."hive\.llap\.daemon\.task\.scheduler\.wait\.queue\.size"="1" \
  --set cluster.hiveServer2.autoscaling.scaleUpThreshold=1 \
  --set cluster.metastore.autoscaling.scaleUpThreshold=2
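
For repeated test runs, the autoscaling flags above could also be kept in a Helm values file rather than on the command line. This is a minimal sketch: the file name `autoscaling-values.yaml` is an arbitrary choice, and the keys and values are copied from the `--set` flags in the command above. Nothing else about the chart is assumed.

```shell
#!/bin/sh
# Write the autoscaling-related --set flags from the command above as a values file.
# The file name is arbitrary; keys and values mirror the flags one-to-one.
cat > autoscaling-values.yaml <<'EOF'
cluster:
  hiveServer2:
    autoscaling:
      enabled: true
      scaleUpThreshold: 1
  metastore:
    autoscaling:
      enabled: true
      scaleUpThreshold: 2
  llap:
    autoscaling:
      enabled: true
    configOverrides:
      # Quoted because the original flag was passed with --set-string.
      hive.llap.daemon.task.scheduler.wait.queue.size: "1"
  tezAm:
    autoscaling:
      enabled: true
EOF

# Then pass the file alongside the database, ZooKeeper and storage --set flags:
# helm install hive ./helm/hive-operator -f autoscaling-values.yaml
```

Values given with `--set` on the command line take precedence over those loaded with `-f`, so individual thresholds can still be overridden per run.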

Launched Beeline

kubectl exec -it deployment/hive-hiveserver2 -- beeline -u "jdbc:hive2://hive-hiveserver2:10001/;transportMode=http;httpPath=cliservice"

OUTPUTS:

Initial Start -> Only 1 HMS, 1 HS2 (1 == Min Configured)

First Beeline Session Connects -> Tez AM and LLAP Daemons Start (Min 1 configured)

AutoScaling HS2 to 2 & Tez AM (Reduced max threshold)

Tez AM

HS2

Auto Scaling HMS & LLAP to 2

HMS

LLAP (load had already dropped by then because the query finished :-( )

Scale Downs (After Cooling Periods)

Scheduled

Done (After waiting for cool down period for specific service)

CPU tracking

HS2

HMS

difin

LGTM

zhangbutao · 2026-06-18T13:37:46Z

Thanx @zhangbutao for the great insights!!!

You hit the nail on the head regarding the shift from "YARN-thinking" to "Kubernetes-native thinking."

Physical vs. Logical Isolation
You are completely right about Workload Management (WLM). Trying to carve up a single JVM's heap and CPU cycles among competing tenants is incredibly complex and never gives you 100% true isolation. By shifting to Kubernetes, we get true physical isolation via namespaces, cgroups, and dedicated pod resources.

How this could work technically
What you are describing is entirely feasible. The LLAP instances register themselves in ZooKeeper under a specific app name (defaulting to @llap0). If we update the Operator to support an array of LLAP profiles (e.g., llap-cluster1, llap-cluster2), the Operator would spin up multiple independent StatefulSets, each registering to a different ZK path.

Then, exactly as you said, a user simply sets hive.llap.daemon.service.hosts=@llap-cluster1 in their JDBC string or session. TezAM would look up that specific ZK path, find those specific pods, and route the fragments exclusively to that tenant's dedicated executors.
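
As a rough illustration of the per-session routing described above, the property could be appended to the Beeline URL used in this PR's test. In a HiveServer2 JDBC URL, Hive configuration overrides follow the `?`. The group name `@llap-cluster1` is the hypothetical one from this discussion, and no such LLAP cluster exists in this PR. Whether HiveServer2 accepts this property as a per-session override depends on its configuration (for example the restricted-property list), which this thread does not cover.

```shell
#!/bin/sh
# Base URL copied from the Beeline command in the PR description.
BASE_URL='jdbc:hive2://hive-hiveserver2:10001/;transportMode=http;httpPath=cliservice'
# Hypothetical compute-group name taken from the discussion.
LLAP_GROUP='@llap-cluster1'
# Hive conf overrides are appended after '?' in the JDBC URL.
JDBC_URL="${BASE_URL}?hive.llap.daemon.service.hosts=${LLAP_GROUP}"
echo "$JDBC_URL"

# Possible usage, assuming the multi-profile support discussed here existed:
# kubectl exec -it deployment/hive-hiveserver2 -- beeline -u "$JDBC_URL"
```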

The Autoscaling Synergy
The best part is how it ties into the autoscaling logic in this PR! Because each tenant's LLAP cluster would be its own independent K8s StatefulSet, the autoscaler would scale llap-cluster1 and llap-cluster2 completely independently. If user1 isn't running queries, their dedicated LLAP cluster scales to zero, costing nothing, while user2 can comfortably stay scaled up to 100 pods.

This is a fantastic concept for multi-tenancy. Since the core autoscaling loop and K8s operator primitives are established in this PR, building out "Multi-Tenant LLAP Compute Groups" on top of it feels like a perfect follow-up Jira ticket. I think it is well worth exploring, and I will give it a shot :-)

Your thoughts align completely with mine: this idea is both feasible and highly valuable. I came up with it because other MPP-architecture OLAP engines, such as StarRocks and Doris, already offer similar compute-group functionality that effectively isolates multi-tenant workloads. That shows the approach has practical value, so it is well worth exploring in depth. Thanks @ayushtkn

zhangbutao

+1 LGTM

sonarqubecloud · 2026-06-18T17:45:11Z

Quality Gate passed

Issues
57 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
1.8% Duplication on New Code

See analysis details on SonarQube Cloud

ayushtkn · 2026-06-19T02:23:54Z

Thanx @aturoczy , @tanishq-chugh , @difin and @zhangbutao for the reviews!!!

asf-ci-hive added labels tests pending, tests passed and removed labels tests pending, tests passed on May 26, 2026

ayushtkn force-pushed the K8sautoscaling branch from b985f23 to 2220c4e on May 26, 2026 22:09

asf-ci-hive added labels tests failed, tests pending, tests passed and removed labels tests pending, tests failed on May 26, 2026

ayushtkn force-pushed the K8sautoscaling branch from 2220c4e to 930af89 on May 27, 2026 14:06

asf-ci-hive added labels tests pending, tests passed, tests unstable and removed labels tests passed, tests pending, tests unstable on May 27, 2026

ayushtkn changed the title ~~WIP: Add AutoScaling to K8s operator~~ HIVE-29492: Add AutoScaling to K8s operator on May 29, 2026

asf-ci-hive added labels tests passed, tests pending and removed labels tests pending, tests passed on May 29, 2026

difin reviewed Jun 17, 2026

Comment thread ...ng/src/kubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/MetricsCache.java Outdated

difin reviewed Jun 17, 2026

Comment thread ...ubernetes/src/java/org/apache/hive/kubernetes/operator/autoscaling/PrometheusTextParser.java Outdated

difin reviewed Jun 17, 2026

Comment thread ...kubernetes/src/java/org/apache/hive/kubernetes/operator/dependent/HiveDependentResource.java

difin approved these changes Jun 17, 2026

ayushtkn added 22 commits June 18, 2026 16:17

K8s: Add AutoScaling

1af603b

Fix Scaling HMS & Refactor

54693db

Fix CPU Utilization Scaling

2c09456

Fix HS2 Scaling Down

92e3312

Fixes

926f1cf

Refactor

bde0ca3

Remove Promethous and any other dependency

cb9ff15

Add HS2 Priority for Auto Scaling

5d0b656

CleanUp

a4af13a

Add CPU for HS2 & HMS

50000c0

Auto Suspend

6f36315

Refactor

c25b9c5

Make port configurable

6752554

Formatting issues

8af73b0

Remove Hardcoded Ports

835f530

Refactor

38356af

Cleanup

3c0db8e

Fix Log

be439fb

CoPilot comments

7af3e98

Add Parallel

b2fe477

Fix Some Sonar issues

939ba2c

Review Comments

49c8ec4

zhangbutao approved these changes Jun 18, 2026

ayushtkn merged 22 commits into apache:master from ayushtkn:K8sautoscaling

7 participants
