@zstack-robot-2
Collaborator

[plugin-premium]: GPU/VM page keeps loading when shutting down or encountering errors in Zaku cluster.
Resolves: ZSTAC-80202
Change-Id: I7778676171646874706164777869707279776172

sync from gitlab !9014

chao.he added 2 commits January 13, 2026 12:03
…r encountering errors in Zaku cluster.

Resolves: ZSTAC-80202
Change-Id: I7778676171646874706164777869707279776172
…r encountering errors in Zaku cluster.

Resolves: ZSTAC-80202
Change-Id: I7778676171646874706164777869707279776172
@coderabbitai
coderabbitai bot commented Jan 13, 2026

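Overview

Introduces a new public extension point interface, BeforeCallZWatchExtensionPoint, for running custom operations before ZWatch is called. The interface has two methods: supports() checks whether a given VO class is supported, and beforeCallZWatch() runs business logic before the ZWatch call, receiving the VO type and a list of resource UUIDs.

Changes

| Cohort / File | Change summary |
|---|---|
| New extension point interface: header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java | Adds the public interface BeforeCallZWatchExtensionPoint, which defines two methods: supports(Class<?> voClass) and beforeCallZWatch(Class<?> voClass, List<String> uuids). Includes complete Javadoc and the required import declarations. |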

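Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A new interface quietly arrives,
A clean and clear extension point,
A guardian standing before ZWatch,
Checking support, taking custom action,
The code garden grows a new branch! ✨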


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❌ Error | The pull request title exceeds the required 72-character limit at 73 characters. | Shorten the title to 72 characters or less while maintaining the [scope]: format. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The pull request description is related to the changeset, addressing GPU/VM loading issues in Zaku clusters. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java (1)

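1-23: The interface design follows the ZStack extension point pattern and is well implemented.

The interface is clearly defined and follows ZStack's naming and design conventions for extension points (such as the BeforeXxxExtensionPoint naming convention). The Javadoc comments are complete and written in English, as the coding guidelines require.

One small suggestion: consider extending the Javadoc of beforeCallZWatch to describe the expected behavior when the health check fails (for example, throwing an exception to fail fast). This helps implementers understand the contract of the extension point.

📝 Optional: document the exception behavior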
     /**
      * Perform custom operations before calling ZWatch, for example: health-check
      * @param voClass the VO class type
      * @param uuids the list of resource UUIDs to process
+     * @throws RuntimeException if the pre-check fails and the ZWatch call should be skipped
      */
     void beforeCallZWatch(Class<?> voClass, List<String> uuids);
📜 Review details

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b28558a and 63007bb.

📒 Files selected for processing (1)
  • header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🧰 Additional context used
📓 Path-based instructions (2)
**/*.*

⚙️ CodeRabbit configuration file

**/*.*: - Code must not contain Chinese; error messages, comments, and the like must all be written in correct English with no spelling mistakes

Files:

  • header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
**/*.java

⚙️ CodeRabbit configuration file

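**/*.java: ## 1. API Design Requirements

  • API naming:
    • API names must be unique; duplicates are not allowed.
    • API message classes must extend APIMessage; their return classes must extend APIReply or APIEvent and be marked with the @RestResponse annotation.
    • API messages must carry the @RestRequest annotation and follow these rules:
      • path:
        • Use the plural form for resources.
        • When the path references a field of the message class, use the {variableName} format.
      • HTTP method mapping:
        • Query operations → HttpMethod.GET
        • Update operations → HttpMethod.PUT
        • Create operations → HttpMethod.POST
        • Delete operations → HttpMethod.DELETE
    • API classes must implement the __example__ method so that API documentation can be generated, and must ensure that the corresponding Groovy API Template and API Markdown files are generated.

2. Naming and Formatting Conventions

  • Class names:

    • Use UpperCamelCase.
    • Special cases:
      • VO/AO/EO type classes are exempt.
      • Abstract classes use an Abstract or Base prefix/suffix.
      • Exception classes should end with Exception.
      • Test classes must end with TestCase.
  • Method names, parameter names, member variables, and local variables:

    • Use lowerCamelCase.
  • Constant names:

    • All uppercase, with underscores separating words.
    • Names must be expressive; avoid vague or inaccurate names.
  • Package names:

    • All lowercase and dot-separated; each part should be a single English word with a natural meaning (see the structure of the Spring framework).
  • Naming details:

    • Avoid members or local variables with the same name in parent and child classes or within the same code block, to prevent confusion.
    • Abbreviations:
      • Do not use unnecessary abbreviations such as AbsSchedulerJob, condi, or Fu. Use full words to improve readability.

3. Writing Self-Explanatory Code

  • Expressing intent:

    • Avoid boolean parameters whose meaning is unclear. For example:
      • For stopAgent(boolean ignoreError), split it into separate functions (such as stopAgentIgnoreError()) or use an enum to express the operation type.
    • Names should combine full words to express intent and reflect the data type or purpose (for example, in constant and variable names, put the type word at the end).
    • Avoid magic values:
      • Numbers or strings used directly without a definition (such as if (status == 5)) should be replaced with enums or constants.

      • Example:

      • // Bad: magic value

      • if (user.getStatus() == 5) { ... }

      • // Good: constant or enum

      • public static final int STATUS_ACTIVE = 5;

      • if (user.getStatus() == STATUS_ACTIVE) { ... }

      • // Or use an enum

      • enum UserStatus { ACTIVE, INACTIVE }

  • Comments:

    • Code should be as self-explanatory as possible; explanations shorter than two lines can be written directly in the code.
    • Longer comments must be proofread carefully and updated along with the code so that they stay correct.
    • Interface methods should not carry redundant modifiers (such as public) and must have valid Javadoc comments.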
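
The boolean-parameter guideline in section 3 can be illustrated with a minimal sketch. Everything here is hypothetical: StopErrorPolicy, AgentControllerSketch, and the Runnable stop action are made-up names for illustration and do not come from the ZStack codebase.

```java
/** Hypothetical policy enum replacing an unclear `boolean ignoreError` parameter. */
enum StopErrorPolicy { PROPAGATE, IGNORE }

/** Hypothetical sketch; not part of ZStack. */
class AgentControllerSketch {

    /**
     * Runs the stop action. The policy argument makes the error handling
     * explicit at the call site, unlike stopAgent(true) or stopAgent(false).
     *
     * @return true if the action completed, false if it failed and the failure was ignored
     */
    static boolean stopAgent(Runnable stopAction, StopErrorPolicy policy) {
        try {
            stopAction.run();
            return true;
        } catch (RuntimeException e) {
            if (policy == StopErrorPolicy.PROPAGATE) {
                throw e; // caller asked to see failures
            }
            return false; // caller asked to ignore failures
        }
    }
}
```

A call such as stopAgent(action, StopErrorPolicy.IGNORE) reads unambiguously, which is the point of the guideline.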

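4. Flow Control and Structural Optimization

  • Use of if...else:

    • Minimize the use of if...else structures. Recommendations:
      • Limit nesting to at most two levels, and do not use an else branch at the inner level.
      • Return early: end the handling inside a condition as soon as possible, or extract it into a separate method.
      • Use Java Streams or lambda expressions in place of lengthy loops and conditionals.
  • Conditions:

    • if conditions should not be too long or too complex; when necessary, extract the condition into a descriptively named boolean variable.
  • Code block length:
    ...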

Files:

  • header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🧠 Learnings (2)
📓 Common learnings
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2360
File: network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java:449-490
Timestamp: 2025-08-04T04:48:19.103Z
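Learning: In cherry-pick operations, the ZStack project strictly follows a policy of making no additional changes, even when a performance optimization opportunity is found (such as a memory optimization for batch-saving IP addresses), and gives priority to the integrity and consistency of the cherry-pick.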
📚 Learning: 2025-08-12T03:35:21.034Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2419
File: network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java:287-297
Timestamp: 2025-08-12T03:35:21.034Z
Learning: In the ZStack codebase, extension points may be designed for different execution phases/orders in the same operation. For example, IpRangeDeletionExtensionPoint executes during IP range deletion process, while AfterDeleteIpRangeExtensionPoint executes after the entire deletion process is complete. This allows different plugins to hook into appropriate lifecycle stages.

Applied to files:

  • header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🔍 Remote MCP AtlassianMCP

Summary of Additional Context Found

Based on my search of the available tools, here is the relevant context for reviewing this pull request:

Issue Background (ZSTAC-80202)

The PR resolves a critical P1 bug where GPU/VM pages remain in a loading state when Zaku cluster nodes shut down or encounter errors.

Root Cause: When Zaku nodes are down, GPU page ZQL queries still try to fetch container status monitoring data from Prometheus, so the API calls time out (defaults: 300s read timeout, 15s connect timeout). Each stalled call occupies a thread in the ZQLReturnWith thread pool (default maximum of 10 threads). Pending tasks pile up (57+ queued with only 10 threads available) and eventually block all ZQL-based page queries.
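
The queue build-up described in the root cause can be reproduced with a minimal sketch. This is not ZStack code: PoolSaturationSketch is a hypothetical class, the blocking latch stands in for a call waiting on its read timeout, and the figure of 67 submitted tasks in the usage note is an assumption chosen to match the 10 threads and 57 queued tasks quoted above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative only: shows how blocked workers make a fixed pool's queue grow. */
class PoolSaturationSketch {

    /**
     * Submits {@code tasks} jobs that all block (standing in for stalled calls)
     * and returns how many of them are left waiting in the queue.
     */
    static int queuedWhileBlocked(int poolSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch workersBusy = new CountDownLatch(Math.min(poolSize, tasks));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    workersBusy.countDown();
                    try {
                        release.await(); // simulates a call stuck until its timeout
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            workersBusy.await(); // every worker thread is now occupied
            return pool.getQueue().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            release.countDown(); // let the simulated calls finish
            pool.shutdown();
        }
    }
}
```

Calling queuedWhileBlocked(10, 67) leaves 57 tasks queued behind 10 blocked workers. With a 300s read timeout, each queued task may wait for several rounds of timeouts before it even starts.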

Approved Solution Design

The fix implements three components:

  1. Async Health Monitor - ZakuHealthMonitor executes periodic checks (60s interval) with lightweight health checks (5s timeout) and 120s state caching
  2. Query Flow Integration - ZQLReturnWithExtension checks cluster health status before querying container API with fast failure mechanism and exception logging
  3. HTTP Timeout Optimization - Prometheus API: 5s connection/15s read; Zaku HttpClient: 30s connection/60s socket
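
The first component might be sketched roughly as follows. This is a hypothetical outline, not the actual ZakuHealthMonitor, whose code is not shown in this PR. The class name, the probe and clock parameters, and the choice to treat a missing or stale cache entry as unhealthy are all assumptions; only the 60s interval, 5s probe timeout, and 120s cache lifetime come from the design summary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a periodic health check with a cached result. */
class CachedHealthMonitorSketch {
    static final long CHECK_INTERVAL_SECONDS = 60;
    static final long PROBE_TIMEOUT_SECONDS = 5;
    static final long CACHE_TTL_MILLIS = 120_000;

    private static final ThreadFactory DAEMON = r -> {
        Thread t = new Thread(r, "health-monitor-sketch");
        t.setDaemon(true);
        return t;
    };

    /** Immutable result of one probe, published through a single volatile field. */
    private static final class Snapshot {
        final boolean healthy;
        final long checkedAtMillis;

        Snapshot(boolean healthy, long checkedAtMillis) {
            this.healthy = healthy;
            this.checkedAtMillis = checkedAtMillis;
        }
    }

    private final BooleanSupplier probe;
    private final LongSupplier clockMillis;
    private final ExecutorService probeExecutor = Executors.newSingleThreadExecutor(DAEMON);
    private volatile Snapshot snapshot;

    CachedHealthMonitorSketch(BooleanSupplier probe, LongSupplier clockMillis) {
        this.probe = probe;
        this.clockMillis = clockMillis;
    }

    /** Runs the probe with a short timeout; any failure or timeout counts as unhealthy. */
    void refresh() {
        Callable<Boolean> task = probe::getAsBoolean;
        Future<Boolean> result = probeExecutor.submit(task);
        boolean ok;
        try {
            ok = result.get(PROBE_TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ok = false;
        } catch (ExecutionException | TimeoutException e) {
            result.cancel(true);
            ok = false;
        }
        snapshot = new Snapshot(ok, clockMillis.getAsLong());
    }

    /** Cheap read for query threads: never touches the network. */
    boolean isHealthy() {
        Snapshot s = snapshot;
        boolean fresh = s != null && clockMillis.getAsLong() - s.checkedAtMillis <= CACHE_TTL_MILLIS;
        return fresh && s.healthy; // assumption: missing or stale state means unhealthy
    }

    /** Schedules refresh() in the background every CHECK_INTERVAL_SECONDS. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(DAEMON);
        scheduler.scheduleWithFixedDelay(this::refresh, 0, CHECK_INTERVAL_SECONDS, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

The design choice worth noting is that query threads only read the cached snapshot, so a dead cluster costs them a field read instead of a multi-second network wait.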

Error Handling Pattern

For ZQL return-with queries encountering Zaku cluster health check failures, the backend should return error indicators following the existing {resultName}Error pattern (similar to established {resultName}Total pattern from ZSTAC-35305), allowing distinction between cluster-healthy empty results and cluster-error scenarios.
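
As a rough illustration of the {resultName}Error convention, the sketch below builds result maps for the healthy and cluster-error cases. The class, the method names, the map-based result shape, and the use of a string reason as the error value are all assumptions; only the "Error" suffix on the result name comes from the text above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the {resultName}Error convention; not the actual ZQL result type. */
class ReturnWithResultSketch {
    static final String ERROR_SUFFIX = "Error";

    /** Healthy cluster: only the result key is present, even when the list is empty. */
    static Map<String, Object> healthy(String resultName, List<?> items) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(resultName, items);
        return result;
    }

    /** Cluster error: an empty result plus a sibling {resultName}Error entry. */
    static Map<String, Object> clusterError(String resultName, String reason) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(resultName, Collections.emptyList());
        result.put(resultName + ERROR_SUFFIX, reason);
        return result;
    }

    /** Lets a caller tell "empty because healthy" apart from "empty because of an error". */
    static boolean hasError(Map<String, Object> result, String resultName) {
        return result.containsKey(resultName + ERROR_SUFFIX);
    }
}
```

With a hypothetical result name of "containers", a client can check for a "containersError" key to decide between showing an empty list and showing an error state.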

New Extension Point

The PR introduces BeforeCallZWatchExtensionPoint interface with two methods:

  • boolean supports(Class<?> voClass) - Indicates if the extension supports a given VO class
  • void beforeCallZWatch(Class<?> voClass, List<String> uuids) - Performs custom actions prior to ZWatch calls

This extension point enables pluggable health checking logic before ZWatch monitoring queries execute.
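
A minimal sketch of how such an extension point could be implemented and invoked follows. The interface is redeclared locally with the two signatures listed above so the example is self-contained; the real one lives in org.zstack.header.zql. GpuDeviceVOStub, HealthGateExtensionSketch, ZWatchCallerSketch, and the choice of IllegalStateException for fast failure are hypothetical and only illustrate the idea.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Local copy of the two method signatures described in this PR. */
interface BeforeCallZWatchExtensionPoint {
    boolean supports(Class<?> voClass);

    void beforeCallZWatch(Class<?> voClass, List<String> uuids);
}

/** Hypothetical stand-in for a VO class. */
final class GpuDeviceVOStub { }

/** Hypothetical extension that fails fast when a cached health flag says the cluster is down. */
class HealthGateExtensionSketch implements BeforeCallZWatchExtensionPoint {
    private final BooleanSupplier clusterHealthy;

    HealthGateExtensionSketch(BooleanSupplier clusterHealthy) {
        this.clusterHealthy = clusterHealthy;
    }

    @Override
    public boolean supports(Class<?> voClass) {
        return voClass == GpuDeviceVOStub.class;
    }

    @Override
    public void beforeCallZWatch(Class<?> voClass, List<String> uuids) {
        if (!clusterHealthy.getAsBoolean()) {
            // Assumption: an unchecked exception signals that the ZWatch call should be skipped.
            throw new IllegalStateException(
                    "cluster unhealthy; skipping ZWatch call for " + uuids.size() + " resource(s)");
        }
    }
}

/** Hypothetical caller: runs every supporting extension before the ZWatch call. */
class ZWatchCallerSketch {
    static void runBeforeHooks(List<BeforeCallZWatchExtensionPoint> extensions,
                               Class<?> voClass, List<String> uuids) {
        for (BeforeCallZWatchExtensionPoint extension : extensions) {
            if (extension.supports(voClass)) {
                extension.beforeCallZWatch(voClass, uuids);
            }
        }
    }
}
```

Extensions that do not support the VO class are skipped, so unrelated queries are unaffected by the health gate.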
