<fix>[plugin-premium]: GPU/VM page keeps loading when shutting down or... #3195
base: 5.5.6
…r encountering errors in Zaku cluster. Resolves: ZSTAC-80202 Change-Id: I7778676171646874706164777869707279776172
Walkthrough: Introduces a new public extension point interface.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java (1)
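1-23: The interface design follows the ZStack extension point pattern and is well implemented. The interface definition is clear and follows ZStack's extension point naming and design conventions (such as the BeforeXxxExtensionPoint naming convention). The Javadoc comments are complete and written in English, as the coding guidelines require.
One small suggestion: consider documenting, in the Javadoc of the beforeCallZWatch method, the expected behavior when the health check fails (for example, throwing an exception to fail fast). This helps implementers understand the extension point's contract.
📝 Optional: document the exception behavior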
 /**
  * Perform custom operations before calling ZWatch, for example: health-check
  * @param voClass the VO class type
  * @param uuids the list of resource UUIDs to process
+ * @throws RuntimeException if the pre-check fails and the ZWatch call should be skipped
  */
 void beforeCallZWatch(Class<?> voClass, List<String> uuids);
📜 Review details
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🧰 Additional context used
📓 Path-based instructions (2)
**/*.*
⚙️ CodeRabbit configuration file
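**/*.*: - Code must not contain Chinese. Everything, including error messages and comments, must be written in correct English free of spelling errors.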
Files:
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
**/*.java
⚙️ CodeRabbit configuration file
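**/*.java: ## 1. API design requirements
- API naming:
- API names must be unique; duplicates are not allowed.
- API message classes must extend APIMessage; their return classes must extend APIReply or APIEvent and be annotated with @RestResponse.
- API messages must carry the @RestRequest annotation and follow these rules:
path:
- Use the plural form for resources.
- When the path references a message class variable, use the {variableName} format.
- HTTP method mapping:
- Query operations → HttpMethod.GET
- Update operations → HttpMethod.PUT
- Create operations → HttpMethod.POST
- Delete operations → HttpMethod.DELETE
- API classes must implement the __example__ method so that API documentation can be generated, and must ensure that the corresponding Groovy API Template and API Markdown files are generated.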
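2. Naming and formatting conventions
Class names:
- Use UpperCamelCase.
- Special cases:
- VO/AO/EO type classes are exempt.
- Abstract classes use Abstract or Base as a prefix or suffix.
- Exception classes should end with Exception.
- Test classes must end with Test or Case.
Method names, parameter names, member variables, and local variables:
- Use lowerCamelCase.
Constant names:
- All uppercase, with words separated by underscores.
- Names must be clear; avoid vague or inaccurate names.
Package names:
- All lowercase, separated by dots; each part should be an English word with natural meaning (see the structure of the Spring framework).
Naming details:
- Avoid giving members or local variables the same name in parent and child classes or within the same code block, to prevent confusion.
- Abbreviations:
- Unnecessary abbreviations such as AbsSchedulerJob, condi, and Fu are not allowed. Use full words to improve readability.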
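3. Write self-explanatory code
Expressing intent:
- Avoid boolean parameters that make the meaning unclear. For example:
- For stopAgent(boolean ignoreError), consider splitting it into separate functions (such as stopAgentIgnoreError()) or using an enum to express the operation type.
- Names should express intent with full word combinations and reflect the data type or purpose (for example, in constant and variable names, put the type word at the end).
- Avoid magic values:
Undefined literal numbers or strings used directly (such as if (status == 5)) should be replaced with enums or constants.
Example:
// Wrong: magic value
if (user.getStatus() == 5) { ... }
// Right: constant or enum
public static final int STATUS_ACTIVE = 5;
if (user.getStatus() == STATUS_ACTIVE) { ... }
// Or use an enum
enum UserStatus { ACTIVE, INACTIVE }
Comments:
- Code should be as self-explanatory as possible; explanations shorter than two lines can be written directly in the code.
- Longer comments must be carefully proofread and updated along with the code so that they stay correct.
- Interface methods should not carry redundant modifiers (such as public) and must have valid Javadoc comments.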
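4. Flow control and structural optimization
Use of if...else:
- Minimize the use of if...else structures. Recommendations:
- Limit nesting to at most two levels, and the inner level should not contain an else branch.
- Return early: end the handling logic inside a condition as soon as possible, or extract it into a separate method.
- Use Java Streams or lambda expressions in place of lengthy loops and conditionals.
Conditions:
- if conditions should not be too long or complex; where necessary, extract the condition into a descriptively named boolean variable.
Code block length: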
...
Files:
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🧠 Learnings (2)
📓 Common learnings
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2360
File: network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java:449-490
Timestamp: 2025-08-04T04:48:19.103Z
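Learning: In cherry-pick operations, the ZStack project strictly follows a policy of making no additional modifications, even when a performance optimization opportunity is found (such as a memory optimization for batch-saving IP addresses), and prioritizes the completeness and consistency of the cherry-pick.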
📚 Learning: 2025-08-12T03:35:21.034Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2419
File: network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java:287-297
Timestamp: 2025-08-12T03:35:21.034Z
Learning: In the ZStack codebase, extension points may be designed for different execution phases/orders in the same operation. For example, IpRangeDeletionExtensionPoint executes during IP range deletion process, while AfterDeleteIpRangeExtensionPoint executes after the entire deletion process is complete. This allows different plugins to hook into appropriate lifecycle stages.
Applied to files:
header/src/main/java/org/zstack/header/zql/BeforeCallZWatchExtensionPoint.java
🔍 Remote MCP AtlassianMCP
Summary of Additional Context Found
Based on my search of the available tools, here is the relevant context for reviewing this pull request:
Issue Background (ZSTAC-80202)
The PR resolves a critical P1 bug where GPU/VM pages remain in a loading state when Zaku cluster nodes shut down or encounter errors.
Root Cause: When Zaku nodes are down, the GPU page's ZQL queries still try to fetch container status monitoring data from Prometheus. Those API calls time out (defaults: 300s read timeout, 15s connect timeout) and tie up the ZQLReturnWith thread pool (default maximum of 10 threads). Pending tasks pile up (57+ queued behind only 10 threads), which eventually blocks all ZQL-based page queries.
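The saturation described above can be reproduced in isolation. The following is a minimal sketch, not ZStack code: PoolSaturationSketch and its saturate method are hypothetical names, and a latch stands in for a Prometheus call that hangs until its read timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustration only (hypothetical class): shows how blocked tasks saturate a fixed-size pool. */
final class PoolSaturationSketch {

    /**
     * Submits taskCount tasks that all block (standing in for slow Prometheus calls)
     * to a pool of poolSize threads, then reports {activeThreads, queuedTasks}.
     */
    static int[] saturate(int poolSize, int taskCount) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(poolSize, taskCount));
        try {
            for (int i = 0; i < taskCount; i++) {
                pool.execute(() -> {
                    started.countDown();
                    awaitQuietly(release); // simulates a call stuck until its read timeout
                });
            }
            awaitQuietly(started); // wait until every worker thread is occupied
            return new int[] {pool.getActiveCount(), pool.getQueue().size()};
        } finally {
            release.countDown(); // unblock the simulated calls so the pool can drain
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling saturate(10, 67) leaves 10 threads busy and 57 tasks queued, which matches the figures reported in the issue: no queued query can start until one of the blocked calls returns.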
Approved Solution Design
The fix implements three components:
- Async Health Monitor - ZakuHealthMonitor executes periodic checks (60s interval) with lightweight health checks (5s timeout) and 120s state caching
- Query Flow Integration - ZQLReturnWithExtension checks cluster health status before querying container API with fast failure mechanism and exception logging
- HTTP Timeout Optimization - Prometheus API: 5s connection/15s read; Zaku HttpClient: 30s connection/60s socket
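One way to picture the first component is a monitor that probes in the background and lets query threads read only a cached result. This is a rough sketch under stated assumptions, not the actual ZakuHealthMonitor: the class name CachedHealthMonitor, the injected probe and clock, and the choice to treat a never-checked or stale state as unhealthy are all illustrative. The probe is assumed to enforce its own 5s timeout.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a periodic health check with a cached result. */
final class CachedHealthMonitor {
    static final long CHECK_INTERVAL_SECONDS = 60;                       // interval from the design summary
    static final long CACHE_TTL_MILLIS = TimeUnit.SECONDS.toMillis(120); // cache lifetime from the design summary

    /** Immutable snapshot so readers always see a consistent (result, time) pair. */
    private static final class Snapshot {
        final boolean healthy;
        final long checkedAtMillis;

        Snapshot(boolean healthy, long checkedAtMillis) {
            this.healthy = healthy;
            this.checkedAtMillis = checkedAtMillis;
        }
    }

    private final BooleanSupplier probe;    // assumed to apply its own short timeout
    private final LongSupplier clockMillis; // injected so the cache logic is testable
    private volatile Snapshot snapshot;     // null until the first check completes

    CachedHealthMonitor(BooleanSupplier probe, LongSupplier clockMillis) {
        this.probe = probe;
        this.clockMillis = clockMillis;
    }

    /** Runs one probe and caches the outcome; a throwing probe counts as unhealthy. */
    void refresh() {
        boolean healthy;
        try {
            healthy = probe.getAsBoolean();
        } catch (RuntimeException e) {
            healthy = false;
        }
        snapshot = new Snapshot(healthy, clockMillis.getAsLong());
    }

    /** Never blocks: reads only the cache. Assumption: unknown or stale state is unhealthy. */
    boolean isHealthy() {
        Snapshot current = snapshot;
        if (current == null) {
            return false;
        }
        boolean fresh = clockMillis.getAsLong() - current.checkedAtMillis <= CACHE_TTL_MILLIS;
        return fresh && current.healthy;
    }

    /** Starts the background refresh; the caller is responsible for shutting the scheduler down. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::refresh, 0, CHECK_INTERVAL_SECONDS, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Because isHealthy() never performs I/O, a query thread pays no network cost to learn that the cluster is down, which is what makes a fast-fail check cheap enough to run before every query.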
Error Handling Pattern
For ZQL return-with queries encountering Zaku cluster health check failures, the backend should return error indicators following the existing {resultName}Error pattern (similar to established {resultName}Total pattern from ZSTAC-35305), allowing distinction between cluster-healthy empty results and cluster-error scenarios.
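As a rough illustration of the {resultName}Error idea, the sketch below builds a result map whose keys are derived from the result name. The class ReturnWithResultSketch, its method names, and the map-based shape are assumptions made for this example; the actual ZQL return-with response structure is not shown in this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the {resultName}Error convention; not the real ZQL result type. */
final class ReturnWithResultSketch {
    static final String ERROR_SUFFIX = "Error";

    /** Healthy cluster: the data is stored under the result name, even when it is empty. */
    static Map<String, Object> success(String resultName, List<?> rows) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(resultName, rows);
        return result;
    }

    /** Unhealthy cluster: no data key, only an error indicator under {resultName}Error. */
    static Map<String, Object> failure(String resultName, String reason) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(resultName + ERROR_SUFFIX, reason);
        return result;
    }

    /** Lets a caller tell "healthy but empty" apart from "cluster error". */
    static boolean isError(Map<String, Object> result, String resultName) {
        return result.containsKey(resultName + ERROR_SUFFIX);
    }
}
```

With this shape, an empty list under the result name means the cluster answered and had nothing to report, while the presence of the error key means the query was skipped or failed.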
New Extension Point
The PR introduces BeforeCallZWatchExtensionPoint interface with two methods:
- boolean supports(Class<?> voClass) - Indicates whether the extension supports a given VO class
- void beforeCallZWatch(Class<?> voClass, List<String> uuids) - Performs custom actions prior to ZWatch calls
This extension point enables pluggable health checking logic before ZWatch monitoring queries execute.
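A minimal sketch of how such an extension point might be implemented and invoked follows. Only the two interface method signatures come from the PR. ContainerVO, ClusterHealthPreCheck, and ZWatchPreChecks are hypothetical names, and throwing an unchecked exception on failure is an assumption that follows the reviewer's Javadoc suggestion rather than confirmed behavior.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Interface shape as described in the PR summary. */
interface BeforeCallZWatchExtensionPoint {
    boolean supports(Class<?> voClass);

    void beforeCallZWatch(Class<?> voClass, List<String> uuids);
}

/** Hypothetical placeholder VO type used only for this example. */
final class ContainerVO {
}

/** Hypothetical implementation: fails fast when the cluster is reported unhealthy. */
final class ClusterHealthPreCheck implements BeforeCallZWatchExtensionPoint {
    private final BooleanSupplier clusterHealthy;

    ClusterHealthPreCheck(BooleanSupplier clusterHealthy) {
        this.clusterHealthy = clusterHealthy;
    }

    @Override
    public boolean supports(Class<?> voClass) {
        return ContainerVO.class.isAssignableFrom(voClass);
    }

    @Override
    public void beforeCallZWatch(Class<?> voClass, List<String> uuids) {
        // Assumption: an unchecked exception signals that the ZWatch call should be skipped.
        if (!clusterHealthy.getAsBoolean()) {
            throw new IllegalStateException(
                    "cluster unhealthy, skipping ZWatch call for " + uuids.size() + " resource(s)");
        }
    }
}

/** Hypothetical caller: runs every extension that supports the VO class before querying. */
final class ZWatchPreChecks {
    static void run(List<BeforeCallZWatchExtensionPoint> extensions,
                    Class<?> voClass, List<String> uuids) {
        for (BeforeCallZWatchExtensionPoint extension : extensions) {
            if (extension.supports(voClass)) {
                extension.beforeCallZWatch(voClass, uuids);
            }
        }
    }
}
```

Keeping supports separate from beforeCallZWatch lets the caller skip unrelated extensions cheaply, so a health check written for one resource type does not affect queries on other VO classes.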
[plugin-premium]: GPU/VM page keeps loading when shutting down or encountering errors in Zaku cluster.
Resolves: ZSTAC-80202
Change-Id: I7778676171646874706164777869707279776172
sync from gitlab !9014