Skip to content

Conversation

@MatheMatrix
Copy link
Owner

support thirdparty_ceph to bm root volume

Resolves: ZSTAC-73396

Change-Id: I6f6f616777677a6f74696c6c736c737a626f7863

sync from gitlab !9008

@coderabbitai
Copy link

coderabbitai bot commented Jan 12, 2026

总览

该变更通过数据库架构升级、新扩展点接口和Ceph存储逻辑集成,实现第三方克隆支持。添加数据库列处理裸金属2实例配置,引入扩展机制验证实例类型并转换块卷,在Ceph克隆流程中集成第三方SDK交互。

变更明细

分组 / 文件 变更摘要
数据库架构迁移
conf/db/upgrade/V5.5.6__schema.sql
新增升级存储过程 UpdateBareMetal2InstanceProvisionNicVO,处理 BareMetal2InstanceProvisionNicVO 表的 instanceUuid 列和 isPrimaryProvisionNic 列的添加;为 BareMetal2ChassisNicVO 表添加 isPrimaryProvisionNic 列,包含事务处理和异常回滚逻辑。
扩展点接口
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
新增公共扩展点接口,声明四个方法:isSupportCloneByThirdParty、convertToBlockVolume、isBlockVolume、populateBlockVolumeDetails,用于第三方克隆能力检测和块卷处理。
Ceph存储克隆逻辑
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
为 CloneRsp 添加 volumeId 和 volumeStatus 字段;扩展克隆流程集成扩展点,支持第三方克隆的令牌发现、块卷转换、卷详情填充及安装路径更新,覆盖模板根卷和快照克隆两条路径。

代码审查难度估计

🎯 3 (中等复杂) | ⏱️ ~20 分钟

🐰 ✨ 在代码的梦幻城堡里
克隆之力跨越第三方的桥梁
数据库记忆闪闪发光
扩展点编织新的魔法之网
块卷在Ceph的摇篮中安睡 🌙

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed PR标题遵循了要求的[scope]: 格式(fix类型,ceph作用域),且长度为54字符,在72字符限制内,内容准确反映了主要变更内容。
Description check ✅ Passed PR描述与变更集相关,提到了支持third-party ceph用于裸金属根卷,并包含有效的追踪信息(JIRA key和Change-Id)。

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.40.4)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java

Comment @coderabbitai help to get the list of available commands and usage tips.

support thirdparty_ceph to bm root volume

Resolves: ZSTAC-73396

Change-Id: I6f6f616777677a6f74696c6c736c737a626f7863
@MatheMatrix MatheMatrix force-pushed the sync/jin.shen/ZSTAC-81282@@3 branch from 9fb0ae3 to 7609ddf Compare January 12, 2026 09:58
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @conf/db/upgrade/V4.8.0.8__schema.sql:
- Around line 30-34: The DROP FOREIGN KEY for constraint
fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO lacks an existence check and makes the
migration non-idempotent; change the direct ALTER TABLE ... DROP FOREIGN KEY to
first verify the constraint exists (e.g. query INFORMATION_SCHEMA or reuse the
repository's DROP_CONSTRAINT helper) and only execute the drop when present,
then proceed with the CALL ADD_CONSTRAINT(...) as before so the script is safe
to run multiple times.

In @conf/db/upgrade/V5.5.6__schema.sql:
- Around line 30-34: The DROP FOREIGN KEY for
fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO lacks an existence check and breaks
idempotency; modify V5.5.6__schema.sql to query information_schema
(constraints/KEY_COLUMN_USAGE) to confirm the constraint exists before executing
ALTER TABLE `BareMetal2InstanceProvisionNicVO` DROP FOREIGN KEY
`fkBareMetal2InstanceProvisionNicVOInstanceVO`, and only run the DROP when the
check returns a positive match so the upgrade script is safe to re-run; keep the
subsequent CALL ADD_CONSTRAINT(...) as-is after the conditional drop.

In
@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java:
- Around line 2660-2693: The loop over
CephPrimaryStorageCheckInstanceTypeExtensionPoint has wrong control flow and
null handling: replace the current if (!result) { break; } with if
(!Boolean.TRUE.equals(result)) { continue; } so a null/false just skips that
extension instead of aborting all, and after ext.convertToBlockVolume(vo); add a
break; to stop after the first supporting extension; also review usages of
cmd.token/cmd.monIp/cmd.tpTimeout to avoid emitting sensitive token values in
logs or audit output.
🧹 Nitpick comments (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java (1)

1-13: 接口契约不清晰且 Boolean 返回值易引入 NPE(建议收紧签名并补全 Javadoc)
目前 String uuid 的语义不明确(不同方法可能需要 vmUuid/volumeUuid/psUuid),并且 Boolean 存在返回 null 的潜在风险,调用方若按 primitive 语义使用会出问题;同时接口方法缺少有效 Javadoc。

📜 Review details

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee9f20d and 9fb0ae3.

⛔ Files ignored due to path filters (3)
  • sdk/src/main/java/org/zstack/sdk/BareMetal2ChassisNicInventory.java is excluded by !sdk/**
  • sdk/src/main/java/org/zstack/sdk/BareMetal2InstanceInventory.java is excluded by !sdk/**
  • sdk/src/main/java/org/zstack/sdk/BareMetal2InstanceProvisionNicInventory.java is excluded by !sdk/**
📒 Files selected for processing (4)
  • conf/db/upgrade/V4.8.0.8__schema.sql
  • conf/db/upgrade/V5.5.6__schema.sql
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
🧰 Additional context used
📓 Path-based instructions (3)
**/*.*

⚙️ CodeRabbit configuration file

**/*.*: - 代码里不应当有有中文,包括报错、注释等都应当使用正确的、无拼写错误的英文来写

Files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
  • conf/db/upgrade/V4.8.0.8__schema.sql
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • conf/db/upgrade/V5.5.6__schema.sql
**/*.java

⚙️ CodeRabbit configuration file

**/*.java: ## 1. API 设计要求

  • API 命名:
    • API 名称必须唯一,不能重复。
    • API 消息类需要继承 APIMessage;其返回类必须继承 APIReplyAPIEvent,并在注释中用 @RestResponse 进行标注。
    • API 消息上必须添加注解 @RestRequest,并满足如下规范:
      • path:
        • 针对资源使用复数形式。
        • 当 path 中引用消息类变量时,使用 {variableName} 格式。
      • HTTP 方法对应:
        • 查询操作 → HttpMethod.GET
        • 更新操作 → HttpMethod.PUT
        • 创建操作 → HttpMethod.POST
        • 删除操作 → HttpMethod.DELETE
    • API 类需要实现 __example__ 方法以便生成 API 文档,并确保生成对应的 Groovy API Template 与 API Markdown 文件。

2. 命名与格式规范

  • 类名:

    • 使用 UpperCamelCase 风格。
    • 特殊情况:
      • VO/AO/EO 类型类除外。
      • 抽象类采用 AbstractBase 前缀/后缀。
      • 异常类应以 Exception 结尾。
      • 测试类需要以 TestCase 结尾。
  • 方法名、参数名、成员变量和局部变量:

    • 使用 lowerCamelCase 风格。
  • 常量命名:

    • 全部大写,使用下划线分隔单词。
    • 要求表达清楚,避免使用含糊或不准确的名称。
  • 包名:

    • 统一使用小写,使用点分隔符,每个部分应是一个具有自然语义的英文单词(参考 Spring 框架的结构)。
  • 命名细节:

    • 避免在父子类或同一代码块中出现相同名字的成员或局部变量,防止混淆。
    • 命名缩写:
      • 不允许使用不必要的缩写,如:AbsSchedulerJobcondiFu 等。应使用完整单词提升可读性。

3. 编写自解释代码

  • 意图表达:

    • 避免使用布尔型参数造成含义不明确。例如:
      • 对于 stopAgent(boolean ignoreError),建议拆分为不同函数(如 stopAgentIgnoreError()),或使用枚举表达操作类型。
    • 命名应尽量用完整的单词组合表达意图,并在名称中体现数据类型或用途(例如在常量与变量名称中,将类型词放在末尾)。
    • 避免使用魔法值(Magic Value):
      • 直接使用未经定义的数值或字符串(如 if (status == 5))应替换为枚举或常量。

      • 示例:

      • // 错误示例:魔法值

      • if (user.getStatus() == 5) { ... }

      • // 正确示例:常量或枚举

      • public static final int STATUS_ACTIVE = 5;

      • if (user.getStatus() == STATUS_ACTIVE) { ... }

      • // 或使用枚举

      • enum UserStatus { ACTIVE, INACTIVE }

  • 注释:

    • 代码应尽量做到自解释,对少于两行的说明可以直接写在代码中。
    • 对于较长的注释,需要仔细校对并随代码更新,确保内容正确。
    • 接口方法不应有多余的修饰符(例如 public),且必须配有有效的 Javadoc 注释。

4. 流程控制和结构优化

  • if...else 的使用:

    • 应尽量减少 if...else 结构的使用,建议:
      • 限制嵌套层级最多为两层,且内层不应再出现 else 分支。
      • 尽早返回(Early Return),将条件判断中的处理逻辑提前结束或抽成独立方法。
      • 使用 Java Stream 或 Lambda 表达式代替冗长的循环与条件判断。
  • 条件判断:

    • if 条件表达不宜过长或过于复杂,必要时可以将条件抽成 boolean 变量描述。
  • 代码块长度:
    ...

Files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
**/*.sql

⚙️ CodeRabbit configuration file

**/*.sql: - Review the SQL code, make sure has no errors and confirm that:

  • Upgrading scene has been carefully handled
  • Do not use DEFAULT 0000-00-00 00:00:00 , use DEFAULT CURRENT_TIMESTAMP instead
  • When NOT NULL exists, must use stored procedure or other functions to process historical data, this is very very important
  • 数据库记录中,如果字符串长度不可控,不要用 vchar,用 text 类型

Files:

  • conf/db/upgrade/V4.8.0.8__schema.sql
  • conf/db/upgrade/V5.5.6__schema.sql
🧠 Learnings (7)
📚 Learning: 2025-08-03T04:10:21.683Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 0
File: :0-0
Timestamp: 2025-08-03T04:10:21.683Z
Learning: ZStack 数据库升级脚本 V4.10.16__schema.sql 中的 UPGRADE_VM_METADATA_TABLES_DIRECT 存储过程使用直接 RENAME TABLE 操作,不具备幂等性。需要通过 information_schema 检查表和约束的存在性来确保脚本可以安全重复执行。

Applied to files:

  • conf/db/upgrade/V4.8.0.8__schema.sql
  • conf/db/upgrade/V5.5.6__schema.sql
📚 Learning: 2025-08-14T06:56:19.585Z
Learnt from: zstack-robot-2
Repo: MatheMatrix/zstack PR: 2435
File: storage/src/main/java/org/zstack/storage/snapshot/group/VolumeSnapshotGroupBase.java:47-47
Timestamp: 2025-08-14T06:56:19.585Z
Learning: 在VolumeSnapshotGroupBase.java中,VmInstanceResourceMetadataManager的注入和SKIP_RESOURCE_ROLLBACK标记虽然在当前版本中未被使用,但这些导入在大型重构PR中是有意为之的,用于保持代码一致性或为后续功能做准备。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-14T06:48:00.549Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2435
File: header/src/main/java/org/zstack/header/storage/snapshot/group/APICheckMemorySnapshotGroupConflictMsg.java:3-7
Timestamp: 2025-08-14T06:48:00.549Z
Learning: 在ZStack项目中,同一包内的类(如VolumeSnapshotGroupVO)不需要显式导入。同时,uuid()辅助方法在API消息类的__example__()方法中可以直接使用,无需静态导入,这是通过项目的构建系统或框架配置实现的。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-14T06:48:00.549Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2435
File: header/src/main/java/org/zstack/header/storage/snapshot/group/APICheckMemorySnapshotGroupConflictMsg.java:3-7
Timestamp: 2025-08-14T06:48:00.549Z
Learning: In the ZStack project, API message classes may not require explicit imports for VolumeSnapshotGroupVO references in APIParam annotations or for the uuid() helper method used in __example__() methods, even when these classes/methods are defined in other packages. The build system or framework appears to handle these references automatically.

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-25T03:52:37.301Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2504
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotTreeBase.java:775-819
Timestamp: 2025-08-25T03:52:37.301Z
Learning: 在ZStack项目的VolumeSnapshotTreeBase类中,当通过dbf.findByUuid()方法获取VolumeVO对象时,需要进行null检查,因为该方法可能在找不到对应记录时返回null,直接使用可能导致NullPointerException。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-07-22T02:30:46.123Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2311
File: plugin/sdnController/src/main/java/org/zstack/sdnController/hardwareVxlan/KVMRealizeHardwareVxlanNetworkBackend.java:52-53
Timestamp: 2025-07-22T02:30:46.123Z
Learning: In ZStack, backend realization code (such as KVMRealizeHardwareVxlanNetworkBackend) can safely assume the existence of the VO for the resource being operated on (e.g., HardwareL2VxlanNetworkVO for a given L2NetworkInventory), due to system lifecycle management and strong foreign key constraints in the database schema. Explicit null checks after findByUuid are not required in these contexts.

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-12T05:52:18.323Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2418
File: compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:3873-3885
Timestamp: 2025-08-12T05:52:18.323Z
Learning: 在 compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java 的 setNoIpAllocationStaticIp() 流程中,静态 IP 标签删除已统一在 change-ip-in-database 步骤对 voRemoveSet 循环处理:对收集到的旧 IPv4/IPv6 UsedIpVO 均调用 ipOperator.deleteStaticIpByVmUuidAndL3Uuid(vmUuid, l3Uuid, IPv6NetworkUtils.ipv6AddessToTagValue(ip)) 清理对应的 static-ip 标签。因此无需在分支里分别删除 IPv4/IPv6 的标签。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
🔍 Remote MCP AtlassianMCP

Excellent! Now I have all the relevant context. Let me compile a comprehensive summary of the findings.

Additional Context for PR #3189 Review

Background and Business Context

ZSTAC-73396 is a P0 high-priority feature request from China Export Credit Insurance Corporation for elastic bare metal (弹性裸金属) to support using high-speed cloud disks as root volumes with HA support. The business impact is significant - customer project acceptance is blocked, affecting ~1M total settlement revenue and ~400k ZStack revenue.

Problem Context:

  • Previous deployment used legacy cloud disks as system disks, which encountered IO errors (TIC-3691)
  • Gateway node has single-point-of-failure risk without HA groups
  • Network deployment has single-link risk without aggregation
  • Customer requires high-speed cloud disk support for production use to avoid gateway node bottleneck

Solution: Support high-speed cloud disks as elastic bare metal instance root volumes, targeting March 12 launch date.

Status: ZSTAC-73396 is Closed/Fixed in version 4.8.0-zhongxinbao with 20h DEV effort and 40h QA effort. ZSTAC-81282 is a clone issue for version 5.5.6 (currently Open), assigned to 沈今 (Jin Shen).


Technical Changes Summary

The PR implements third-party Ceph clone support for bare-metal root volumes:

  1. Database Schema Upgrades (SQL migration scripts):

    • Introduces UpdateBareMetal2InstanceProvisionNicVO() procedure in both V4.8.0.8 and V5.5.6
    • Adds instanceUuid column tracking for provision NIC table
    • Adds isPrimaryProvisionNic boolean flags for both BareMetal2InstanceProvisionNicVO and BareMetal2ChassisNicVO tables
    • Ensures transactional safety with explicit error handling
  2. Ceph Primary Storage Enhancement (CephPrimaryStorageBase.java):

    • Introduces extension point mechanism for third-party clone support
    • For each CephPrimaryStorageCheckInstanceTypeExtensionPoint, checks if third-party clone is supported
    • If conditions met (root volume, baremetal2 VM type, third-party token): discovers Ceph mon IP, assigns token, sets SDK timeout, converts volume to block volume before cloning
    • Post-clone: populates block volume details and adjusts target volume path
    • Adds new fields to CloneRsp: volumeId and volumeStatus
  3. New Extension Point Interface (CephPrimaryStorageCheckInstanceTypeExtensionPoint.java):

    • isSupportCloneByThirdParty(String uuid) - check third-party clone support
    • convertToBlockVolume(VolumeVO vo) - convert to block volume
    • isBlockVolume(String uuid) - identify block volumes
    • populateBlockVolumeDetails(String uuid, String volumeId, String volumeStatus) - populate results

Related Issue History

  • TIC-3691: IO error issue when elastic bare metal + Ceph storage uses cloud disk as system disk. Status: Closed with unresolved risk assessment on production readiness.
  • ZSTAC-79783: Earlier version of this fix for version 4.8.30-zhongxinbao (merged per GitLab automation)
  • Referenced Confluence pages document detailed QA test plans and UI/backend implementation specs

Key Review Points

  • The extension point design allows flexible third-party Ceph integration without hardcoding
  • Schema migrations use defensive patterns (checks IF ... EXISTS before adding columns)
  • Volume transformation logic (regular cloud disk → block volume) is critical for third-party clone compatibility
  • Changes are targeted for 5.5.6 release branch per target PR branch specification
🔇 Additional comments (3)
conf/db/upgrade/V5.5.6__schema.sql (1)

1-74: 存储过程整体结构良好

整体设计遵循了良好的升级脚本实践:

  • 使用 INFORMATION_SCHEMA 检查列存在性确保幂等性
  • 使用事务 (START TRANSACTION / COMMIT) 保证原子性
  • 使用 EXIT HANDLER FOR SQLEXCEPTION 处理异常并回滚
  • 历史数据迁移逻辑正确(instanceUuiduuid 复制,isPrimaryProvisionNic 按条件初始化)

除上述外键删除的幂等性问题外,其余逻辑正确。

conf/db/upgrade/V4.8.0.8__schema.sql (1)

1-74: 两个版本分支使用相同的迁移逻辑

此文件 (V4.8.0.8__schema.sql) 与 V5.5.6__schema.sql 内容完全一致,用于在不同版本分支(4.8.x 和 5.5.x)中执行相同的 schema 升级。这符合跨版本 backport 的常规做法。

请确保对两个文件应用相同的修复。

plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (1)

487-517: CloneRsp 新增字段 OK,但建议确认序列化兼容性与空值语义
如果 agent 并非总是返回 volumeId/volumeStatus,建议在调用侧按“可空”处理并避免作为必填字段使用。

Comment on lines 30 to 34
ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;

CALL ADD_CONSTRAINT('BareMetal2InstanceProvisionNicVO', 'fkBareMetal2InstanceProvisionNicVOInstanceVO',
'instanceUuid', 'BareMetal2InstanceVO', 'uuid', 'CASCADE');
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

外键删除操作同样缺少存在性检查

V5.5.6__schema.sql 相同的问题:DROP FOREIGN KEY 操作缺少存在性检查,需要在删除前验证外键是否存在以确保脚本幂等性。

请参考 V5.5.6__schema.sql 中的修复建议,在此文件中应用相同的修改。

🤖 Prompt for AI Agents
In @conf/db/upgrade/V4.8.0.8__schema.sql around lines 30 - 34, The DROP FOREIGN
KEY for constraint fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO lacks an existence check and makes the
migration non-idempotent; change the direct ALTER TABLE ... DROP FOREIGN KEY to
first verify the constraint exists (e.g. query INFORMATION_SCHEMA or reuse the
repository's DROP_CONSTRAINT helper) and only execute the drop when present,
then proceed with the CALL ADD_CONSTRAINT(...) as before so the script is safe
to run multiple times.

Comment on lines +30 to +34
ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;

CALL ADD_CONSTRAINT('BareMetal2InstanceProvisionNicVO', 'fkBareMetal2InstanceProvisionNicVOInstanceVO',
'instanceUuid', 'BareMetal2InstanceVO', 'uuid', 'CASCADE');
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

外键删除操作缺少存在性检查,影响脚本幂等性

DROP FOREIGN KEY 操作在执行前没有检查外键是否存在。如果数据库处于部分升级状态(外键已被删除但列迁移未完成),此操作会失败并导致整个事务回滚。

建议在删除外键前先检查其是否存在,类似于列的存在性检查方式:

🔧 建议的修复方案
+    -- Check if foreign key exists before dropping
+    SELECT COUNT(*)
+        INTO @fk_exists
+        FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
+        WHERE TABLE_SCHEMA = 'zstack'
+          AND TABLE_NAME = 'BareMetal2InstanceProvisionNicVO'
+          AND CONSTRAINT_NAME = 'fkBareMetal2InstanceProvisionNicVOInstanceVO'
+          AND CONSTRAINT_TYPE = 'FOREIGN KEY';
+
+    IF @fk_exists > 0 THEN
         ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
         DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;
+    END IF;

Based on learnings,ZStack 升级脚本需要通过 information_schema 检查表和约束的存在性来确保脚本可以安全重复执行。

🤖 Prompt for AI Agents
In @conf/db/upgrade/V5.5.6__schema.sql around lines 30 - 34, The DROP FOREIGN
KEY for fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO lacks an existence check and breaks
idempotency; modify V5.5.6__schema.sql to query information_schema
(constraints/KEY_COLUMN_USAGE) to confirm the constraint exists before executing
ALTER TABLE `BareMetal2InstanceProvisionNicVO` DROP FOREIGN KEY
`fkBareMetal2InstanceProvisionNicVOInstanceVO`, and only run the DROP when the
check returns a positive match so the upgrade script is safe to re-run; keep the
subsequent CALL ADD_CONSTRAINT(...) as-is after the conditional drop.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (1)

2652-2715: 安全性:thirdparty token 通过 cmd 下发,建议明确避免日志泄露。
cmd.token 看起来是“平台 token”,通常属于敏感信息;建议在 AgentCommand.token 字段上加 @NoLogging(或确认现有链路不会把 cmd 序列化打印到日志)。

🤖 Fix all issues with AI agents
In @conf/db/upgrade/V5.5.6__schema.sql:
- Around line 30-31: ALTER TABLE DROP FOREIGN KEY is executed unconditionally
and will fail if fkBareMetal2InstanceProvisionNicVOInstanceVO doesn't exist;
modify V5.5.6__schema.sql to first check information_schema for a constraint
named fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO in schema zstack and only run ALTER TABLE
`zstack`.`BareMetal2InstanceProvisionNicVO` DROP FOREIGN KEY
`fkBareMetal2InstanceProvisionNicVOInstanceVO` when that check returns true,
ensuring the migration is idempotent and safe to re-run.

In
@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java:
- Around line 2675-2691: The code sets cmd.monIp in CephPrimaryStorageBase but
HttpCaller.doCall() unconditionally overwrites cmd.monIp, so third-party SDKs
cannot rely on the specified mon; fix by either making HttpCaller.doCall()
populate cmd.monIp only when it is null (check HttpCaller.doCall() and change
assignment to if (cmd.monIp == null) cmd.monIp = base.getSelf().getHostname();)
or introduce a distinct field (e.g., cmd.thirdPartyMonIp) set in
CephPrimaryStorageBase (where cmd.monIp is currently assigned) and update all
call sites and the agent/HttpCaller to use thirdPartyMonIp for SDK usage while
preserving monIp for HTTP routing; update references in CephPrimaryStorageBase
(cmd.monIp assignment near ext.convertToBlockVolume(...) and similar blocks at
the other noted location) and in HttpCaller.doCall()/agent handling to use the
new conditional or new field consistently.
- Around line 2704-2711: In the success-callback loop in CephPrimaryStorageBase
where you iterate CephPrimaryStorageCheckInstanceTypeExtensionPoint, replace the
current `if (!ext.isBlockVolume(...)) { break; }` with `continue` so one
non-matching extension doesn't stop others, and change the installPath rewrite
to only call makeVolumeInstallPathByTargetPool when ret.installPath is non-null
and not already a Ceph install path for the target pool (e.g., check
ret.installPath.startsWith("ceph://") or startsWith(targetCephPoolName + "/") or
a similar prefix/pattern); otherwise leave volumePath alone—use these checks
around the assignment that currently does `volumePath = ret.installPath == null
? volumePath : makeVolumeInstallPathByTargetPool(ret.installPath,
targetCephPoolName)`.
🧹 Nitpick comments (2)
conf/db/upgrade/V5.5.6__schema.sql (1)

9-15: DDL 语句会导致隐式 COMMIT,ROLLBACK 无法回滚 DDL 操作

MySQL/MariaDB 中 ALTER TABLEDROP FOREIGN KEY 等 DDL 语句会触发隐式 COMMIT。如果在 DDL 执行后发生错误,ROLLBACK 无法撤销已完成的 DDL 变更。

虽然存在性检查(通过 INFORMATION_SCHEMA)已确保脚本具备幂等性,但异常处理器给人一种"可完全回滚"的错觉。建议:

  1. 在注释中说明 DDL 的事务限制
  2. 或移除事务包装,依赖幂等性检查来保证可重复执行

基于 learnings,这是 ZStack 升级脚本中已知的限制。

plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (1)

487-517: CloneRsp 新增字段 OK,但建议统一字段可见性与序列化风格,避免“public 字段 + getter/setter”混用。
当前 volumeId/volumeStatuspublic,同时又提供 getter/setter;若 JSON 反序列化依赖 public 字段也许没问题,但风格不一致会增加后续维护成本。

建议改为 private 字段 + getter/setter(如不影响序列化)
 public static class CloneRsp extends AgentResponse {
     public Long size;
     public Long actualSize;
     public String installPath;
-    public String volumeId;
-    public String volumeStatus;
+    private String volumeId;
+    private String volumeStatus;
📜 Review details

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fb0ae3 and 7609ddf.

⛔ Files ignored due to path filters (3)
  • sdk/src/main/java/org/zstack/sdk/BareMetal2ChassisNicInventory.java is excluded by !sdk/**
  • sdk/src/main/java/org/zstack/sdk/BareMetal2InstanceInventory.java is excluded by !sdk/**
  • sdk/src/main/java/org/zstack/sdk/BareMetal2InstanceProvisionNicInventory.java is excluded by !sdk/**
📒 Files selected for processing (3)
  • conf/db/upgrade/V5.5.6__schema.sql
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageCheckInstanceTypeExtensionPoint.java
🧰 Additional context used
📓 Path-based instructions (3)
**/*.*

⚙️ CodeRabbit configuration file

**/*.*: - 代码里不应当有有中文,包括报错、注释等都应当使用正确的、无拼写错误的英文来写

Files:

  • conf/db/upgrade/V5.5.6__schema.sql
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
**/*.sql

⚙️ CodeRabbit configuration file

**/*.sql: - Review the SQL code, make sure has no errors and confirm that:

  • Upgrading scene has been carefully handled
  • Do not use DEFAULT 0000-00-00 00:00:00 , use DEFAULT CURRENT_TIMESTAMP instead
  • When NOT NULL exists, must use stored procedure or other functions to process historical data, this is very very important
  • 数据库记录中,如果字符串长度不可控,不要用 vchar,用 text 类型

Files:

  • conf/db/upgrade/V5.5.6__schema.sql
**/*.java

⚙️ CodeRabbit configuration file

**/*.java: ## 1. API 设计要求

  • API 命名:
    • API 名称必须唯一,不能重复。
    • API 消息类需要继承 APIMessage;其返回类必须继承 APIReplyAPIEvent,并在注释中用 @RestResponse 进行标注。
    • API 消息上必须添加注解 @RestRequest,并满足如下规范:
      • path:
        • 针对资源使用复数形式。
        • 当 path 中引用消息类变量时,使用 {variableName} 格式。
      • HTTP 方法对应:
        • 查询操作 → HttpMethod.GET
        • 更新操作 → HttpMethod.PUT
        • 创建操作 → HttpMethod.POST
        • 删除操作 → HttpMethod.DELETE
    • API 类需要实现 __example__ 方法以便生成 API 文档,并确保生成对应的 Groovy API Template 与 API Markdown 文件。

2. 命名与格式规范

  • 类名:

    • 使用 UpperCamelCase 风格。
    • 特殊情况:
      • VO/AO/EO 类型类除外。
      • 抽象类采用 AbstractBase 前缀/后缀。
      • 异常类应以 Exception 结尾。
      • 测试类需要以 TestCase 结尾。
  • 方法名、参数名、成员变量和局部变量:

    • 使用 lowerCamelCase 风格。
  • 常量命名:

    • 全部大写,使用下划线分隔单词。
    • 要求表达清楚,避免使用含糊或不准确的名称。
  • 包名:

    • 统一使用小写,使用点分隔符,每个部分应是一个具有自然语义的英文单词(参考 Spring 框架的结构)。
  • 命名细节:

    • 避免在父子类或同一代码块中出现相同名字的成员或局部变量,防止混淆。
    • 命名缩写:
      • 不允许使用不必要的缩写,如:AbsSchedulerJobcondiFu 等。应使用完整单词提升可读性。

3. 编写自解释代码

  • 意图表达:

    • 避免使用布尔型参数造成含义不明确。例如:
      • 对于 stopAgent(boolean ignoreError),建议拆分为不同函数(如 stopAgentIgnoreError()),或使用枚举表达操作类型。
    • 命名应尽量用完整的单词组合表达意图,并在名称中体现数据类型或用途(例如在常量与变量名称中,将类型词放在末尾)。
    • 避免使用魔法值(Magic Value):
      • 直接使用未经定义的数值或字符串(如 if (status == 5))应替换为枚举或常量。

      • 示例:

      • // 错误示例:魔法值

      • if (user.getStatus() == 5) { ... }

      • // 正确示例:常量或枚举

      • public static final int STATUS_ACTIVE = 5;

      • if (user.getStatus() == STATUS_ACTIVE) { ... }

      • // 或使用枚举

      • enum UserStatus { ACTIVE, INACTIVE }

  • 注释:

    • 代码应尽量做到自解释,对少于两行的说明可以直接写在代码中。
    • 对于较长的注释,需要仔细校对并随代码更新,确保内容正确。
    • 接口方法不应有多余的修饰符(例如 public),且必须配有有效的 Javadoc 注释。

4. 流程控制和结构优化

  • if...else 的使用:

    • 应尽量减少 if...else 结构的使用,建议:
      • 限制嵌套层级最多为两层,且内层不应再出现 else 分支。
      • 尽早返回(Early Return),将条件判断中的处理逻辑提前结束或抽成独立方法。
      • 使用 Java Stream 或 Lambda 表达式代替冗长的循环与条件判断。
  • 条件判断:

    • if 条件表达不宜过长或过于复杂,必要时可以将条件抽成 boolean 变量描述。
  • 代码块长度:
    ...

Files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
🧠 Learnings (9)
📚 Learning: 2025-08-03T04:10:21.683Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 0
File: :0-0
Timestamp: 2025-08-03T04:10:21.683Z
Learning: ZStack 数据库升级脚本 V4.10.16__schema.sql 中的 UPGRADE_VM_METADATA_TABLES_DIRECT 存储过程使用直接 RENAME TABLE 操作,不具备幂等性。需要通过 information_schema 检查表和约束的存在性来确保脚本可以安全重复执行。

Applied to files:

  • conf/db/upgrade/V5.5.6__schema.sql
📚 Learning: 2025-08-03T04:10:21.683Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 0
File: :0-0
Timestamp: 2025-08-03T04:10:21.683Z
Learning: 在 ZStack 数据库升级脚本中,直接使用 RENAME TABLE 不能保证幂等性。应该通过 information_schema.tables 检查表的存在性,只在源表存在且目标表不存在时才执行重命名操作,以确保升级脚本可以安全地重复执行。

Applied to files:

  • conf/db/upgrade/V5.5.6__schema.sql
📚 Learning: 2025-08-03T03:42:34.349Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2293
File: conf/db/upgrade/V4.10.16__schema.sql:69-82
Timestamp: 2025-08-03T03:42:34.349Z
Learning: 在 ZStack 项目中进行数据库 DDL 升级时,由于 MySQL/MariaDB 中的 RENAME TABLE、ALTER TABLE 等 DDL 语句会隐式 COMMIT,因此应使用临时表 + 数据复制的方式来确保升级可逆,而不是依赖事务的 ROLLBACK 机制。

Applied to files:

  • conf/db/upgrade/V5.5.6__schema.sql
📚 Learning: 2025-08-25T03:52:37.301Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2504
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotTreeBase.java:775-819
Timestamp: 2025-08-25T03:52:37.301Z
Learning: 在ZStack项目的VolumeSnapshotTreeBase类中,当通过dbf.findByUuid()方法获取VolumeVO对象时,需要进行null检查,因为该方法可能在找不到对应记录时返回null,直接使用可能导致NullPointerException。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-14T06:56:19.585Z
Learnt from: zstack-robot-2
Repo: MatheMatrix/zstack PR: 2435
File: storage/src/main/java/org/zstack/storage/snapshot/group/VolumeSnapshotGroupBase.java:47-47
Timestamp: 2025-08-14T06:56:19.585Z
Learning: 在VolumeSnapshotGroupBase.java中,VmInstanceResourceMetadataManager的注入和SKIP_RESOURCE_ROLLBACK标记虽然在当前版本中未被使用,但这些导入在大型重构PR中是有意为之的,用于保持代码一致性或为后续功能做准备。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-14T06:48:00.549Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2435
File: header/src/main/java/org/zstack/header/storage/snapshot/group/APICheckMemorySnapshotGroupConflictMsg.java:3-7
Timestamp: 2025-08-14T06:48:00.549Z
Learning: 在ZStack项目中,同一包内的类(如VolumeSnapshotGroupVO)不需要显式导入。同时,uuid()辅助方法在API消息类的__example__()方法中可以直接使用,无需静态导入,这是通过项目的构建系统或框架配置实现的。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-14T06:48:00.549Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2435
File: header/src/main/java/org/zstack/header/storage/snapshot/group/APICheckMemorySnapshotGroupConflictMsg.java:3-7
Timestamp: 2025-08-14T06:48:00.549Z
Learning: In the ZStack project, API message classes may not require explicit imports for VolumeSnapshotGroupVO references in APIParam annotations or for the uuid() helper method used in __example__() methods, even when these classes/methods are defined in other packages. The build system or framework appears to handle these references automatically.

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-07-22T02:30:46.123Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2311
File: plugin/sdnController/src/main/java/org/zstack/sdnController/hardwareVxlan/KVMRealizeHardwareVxlanNetworkBackend.java:52-53
Timestamp: 2025-07-22T02:30:46.123Z
Learning: In ZStack, backend realization code (such as KVMRealizeHardwareVxlanNetworkBackend) can safely assume the existence of the VO for the resource being operated on (e.g., HardwareL2VxlanNetworkVO for a given L2NetworkInventory), due to system lifecycle management and strong foreign key constraints in the database schema. Explicit null checks after findByUuid are not required in these contexts.

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
📚 Learning: 2025-08-12T05:52:18.323Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2418
File: compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:3873-3885
Timestamp: 2025-08-12T05:52:18.323Z
Learning: 在 compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java 的 setNoIpAllocationStaticIp() 流程中,静态 IP 标签删除已统一在 change-ip-in-database 步骤对 voRemoveSet 循环处理:对收集到的旧 IPv4/IPv6 UsedIpVO 均调用 ipOperator.deleteStaticIpByVmUuidAndL3Uuid(vmUuid, l3Uuid, IPv6NetworkUtils.ipv6AddessToTagValue(ip)) 清理对应的 static-ip 标签。因此无需在分支里分别删除 IPv4/IPv6 的标签。

Applied to files:

  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
🧬 Code graph analysis (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (2)
header/src/main/java/org/zstack/header/vm/VmInstanceSpec.java (2)
  • VmInstanceSpec (25-911)
  • ImageSpec (123-164)
core/src/main/java/org/zstack/core/db/Q.java (1)
  • Q (16-193)
🔍 Remote MCP AtlassianMCP

Relevant Jira context: ZSTAC-73396

  • Key: ZSTAC-73396 — Summary: "弹性裸金属支持使用云盘作为系统盘且支持HA" (support elastic bare-metal using cloud disk as root + HA). Status: Closed. Assignee: 沈今 (jin.shen).

  • Description highlights business impact (customer acceptance blocked, ~¥1M+ total, ~¥400k ZStack revenue) and blockers: IO errors when using cloud disk as root (TIC-3691), gateway single-point-of-failure (no HA group), and single-link network risk. Solution intent: enable high-speed cloud disk as bm root and HA; development tracked and planned (reference Confluence page linked in Jira).

  • Comments record related merge requests already merged for earlier branches (ZSTAC-79783 → 4.8.30-zhongxinbao) implementing thirdparty_ceph support across repos (premium, zstack-utility, zstack). This PR (targeting 5.5.6) appears to be the backport/sync of that work.

🔇 Additional comments (5)
conf/db/upgrade/V5.5.6__schema.sql (4)

47-52: isPrimaryProvisionNic 设置逻辑需确认业务合理性

将所有现有 BareMetal2InstanceProvisionNicVO 记录的 isPrimaryProvisionNic 设为 TRUE

请确认这是预期行为 —— 如果一个 instance 可能有多个 provision NIC,将全部设为 primary 可能不符合业务语义(通常只有一个应为 primary)。


61-67: BareMetal2ChassisNicVO 的更新逻辑正确

此处仅将 isProvisionNic = TRUE 的记录设置 isPrimaryProvisionNic = TRUE,逻辑合理。与上面的全量更新形成对比,这里的条件过滤更为精确。


1-74: 整体结构良好:使用存储过程处理 NOT NULL 列,符合编码规范

脚本使用存储过程处理历史数据,通过 INFORMATION_SCHEMA 进行存在性检查保证幂等性,符合 ZStack 数据库升级脚本的最佳实践。


17-38: 两个辅助存储过程已具备幂等性

ADD_COLUMNADD_CONSTRAINT 存储过程都通过 INFORMATION_SCHEMA 检查实现了幂等性:

  • ADD_COLUMN:通过检查 INFORMATION_SCHEMA.COLUMNS 确认列存在性,不存在时才执行 ALTER TABLE ADD COLUMN
  • ADD_CONSTRAINT:通过检查 INFORMATION_SCHEMA.TABLE_CONSTRAINTS 确认约束存在性,不存在时才执行 ALTER TABLE ADD CONSTRAINT

脚本可以安全地重复执行。

plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java (1)

57-63: 移除未使用的导入 SystemTagVOSystemTagVO_

搜索结果表明这两个导入仅出现在 import 语句中,代码里没有实际使用它们。与 VmInstanceVO/VmInstanceVO_ 有在后续代码中被使用不同,这两个导入应该被删除,以避免维护混淆。

⛔ Skipped due to learnings
Learnt from: zstack-robot-2
Repo: MatheMatrix/zstack PR: 2435
File: storage/src/main/java/org/zstack/storage/snapshot/group/VolumeSnapshotGroupBase.java:47-47
Timestamp: 2025-08-14T06:56:19.585Z
Learning: 在VolumeSnapshotGroupBase.java中,VmInstanceResourceMetadataManager的注入和SKIP_RESOURCE_ROLLBACK标记虽然在当前版本中未被使用,但这些导入在大型重构PR中是有意为之的,用于保持代码一致性或为后续功能做准备。
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2418
File: compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:3873-3885
Timestamp: 2025-08-12T05:52:18.323Z
Learning: 在 compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java 的 setNoIpAllocationStaticIp() 流程中,静态 IP 标签删除已统一在 change-ip-in-database 步骤对 voRemoveSet 循环处理:对收集到的旧 IPv4/IPv6 UsedIpVO 均调用 ipOperator.deleteStaticIpByVmUuidAndL3Uuid(vmUuid, l3Uuid, IPv6NetworkUtils.ipv6AddessToTagValue(ip)) 清理对应的 static-ip 标签。因此无需在分支里分别删除 IPv4/IPv6 的标签。
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2504
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotTreeBase.java:775-819
Timestamp: 2025-08-25T03:52:37.301Z
Learning: 在ZStack项目的VolumeSnapshotTreeBase类中,当通过dbf.findByUuid()方法获取VolumeVO对象时,需要进行null检查,因为该方法可能在找不到对应记录时返回null,直接使用可能导致NullPointerException。
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2418
File: compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:3856-3865
Timestamp: 2025-08-12T05:39:14.846Z
Learning: 在 compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java 的 setNoIpAllocationStaticIp 流程中,从数据库加载的 VmNicVO 实例其 getUsedIps() 为空时也不会为 null(返回空集合)。因此无需对 getUsedIps() 再做 Optional/空列表归一化的空指针保护;若找不到对应 NIC,使用 orElse(new VmNicVO()) 的约定允许后续逻辑通过 NPE 暴露问题,与项目既有约定一致。
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2763
File: sdk/src/main/java/org/zstack/sdk/UpdateHostKernelInterfaceAction.java:31-31
Timestamp: 2025-10-20T11:27:25.928Z
Learning: UpdateHostKernelInterface API (`sdk/src/main/java/org/zstack/sdk/UpdateHostKernelInterfaceAction.java`) 在 PR #2763 时尚未被实际使用,因此对其参数约束的变更(如 name 字段从必需改为可选)不会造成破坏性影响。
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2311
File: plugin/sdnController/src/main/java/org/zstack/sdnController/hardwareVxlan/KVMRealizeHardwareVxlanNetworkBackend.java:52-53
Timestamp: 2025-07-22T02:30:46.123Z
Learning: In ZStack, backend realization code (such as KVMRealizeHardwareVxlanNetworkBackend) can safely assume the existence of the VO for the resource being operated on (e.g., HardwareL2VxlanNetworkVO for a given L2NetworkInventory), due to system lifecycle management and strong foreign key constraints in the database schema. Explicit null checks after findByUuid are not required in these contexts.

Comment on lines +30 to +31
ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

DROP FOREIGN KEY 缺少存在性检查,可能导致执行失败

当外键 fkBareMetal2InstanceProvisionNicVOInstanceVO 不存在时(例如新数据库或脚本重复执行),此语句会抛出异常。

建议在执行前检查外键是否存在:

🔧 建议修复:添加外键存在性检查
    IF instanceUuid_exists = 0 THEN
        CALL ADD_COLUMN('BareMetal2InstanceProvisionNicVO', 'instanceUuid', 'VARCHAR(32)', FALSE, '');

        UPDATE `zstack`.`BareMetal2InstanceProvisionNicVO`
        SET `instanceUuid` = `uuid`;

+        -- Check if foreign key exists before dropping
+        IF EXISTS (
+            SELECT 1 FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
+            WHERE TABLE_SCHEMA = 'zstack'
+              AND TABLE_NAME = 'BareMetal2InstanceProvisionNicVO'
+              AND CONSTRAINT_NAME = 'fkBareMetal2InstanceProvisionNicVOInstanceVO'
+              AND CONSTRAINT_TYPE = 'FOREIGN KEY'
+        ) THEN
             ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
             DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;
+        END IF;

        CALL ADD_CONSTRAINT('BareMetal2InstanceProvisionNicVO', 'fkBareMetal2InstanceProvisionNicVOInstanceVO',
                                    'instanceUuid', 'BareMetal2InstanceVO', 'uuid', 'CASCADE');

        UPDATE `zstack`.`BareMetal2InstanceProvisionNicVO`
        SET `uuid` = REPLACE(UUID(), '-', '');
    END IF;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;
-- Check if foreign key exists before dropping
IF EXISTS (
SELECT 1 FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE TABLE_SCHEMA = 'zstack'
AND TABLE_NAME = 'BareMetal2InstanceProvisionNicVO'
AND CONSTRAINT_NAME = 'fkBareMetal2InstanceProvisionNicVOInstanceVO'
AND CONSTRAINT_TYPE = 'FOREIGN KEY'
) THEN
ALTER TABLE `zstack`.`BareMetal2InstanceProvisionNicVO`
DROP FOREIGN KEY `fkBareMetal2InstanceProvisionNicVOInstanceVO`;
END IF;
🤖 Prompt for AI Agents
In @conf/db/upgrade/V5.5.6__schema.sql around lines 30 - 31, ALTER TABLE DROP
FOREIGN KEY is executed unconditionally and will fail if
fkBareMetal2InstanceProvisionNicVOInstanceVO doesn't exist; modify
V5.5.6__schema.sql to first check information_schema for a constraint named
fkBareMetal2InstanceProvisionNicVOInstanceVO on table
BareMetal2InstanceProvisionNicVO in schema zstack and only run ALTER TABLE
`zstack`.`BareMetal2InstanceProvisionNicVO` DROP FOREIGN KEY
`fkBareMetal2InstanceProvisionNicVOInstanceVO` when that check returns true,
ensuring the migration is idempotent and safe to re-run.

Comment on lines +2660 to +2693
List<CephPrimaryStorageCheckInstanceTypeExtensionPoint> exts = pluginRgty.getExtensionList(CephPrimaryStorageCheckInstanceTypeExtensionPoint.class);
for (CephPrimaryStorageCheckInstanceTypeExtensionPoint ext : exts) {
Boolean result = ext.isSupportCloneByThirdParty(msg.getVolume().getVmInstanceUuid());
if (!result) {
break;
}
boolean isRootVolume = Q.New(VolumeVO.class)
.eq(VolumeVO_.uuid, msg.getVolume().getUuid())
.eq(VolumeVO_.type, VolumeType.Root)
.isExists();
boolean isBareMetal2Instance = Q.New(VmInstanceVO.class)
.eq(VmInstanceVO_.uuid, msg.getVolume().getVmInstanceUuid())
.eq(VmInstanceVO_.type, "baremetal2")
.isExists();
boolean hasToken = CephSystemTags.THIRDPARTY_PLATFORM.hasTag(msg.getPrimaryStorageUuid());
if (isRootVolume && isBareMetal2Instance && hasToken) {
CephPrimaryStorageVO cephPrimaryStorageVO = dbf.findByUuid(msg.getPrimaryStorageUuid(), CephPrimaryStorageVO.class);
String monIp = cephPrimaryStorageVO.getMons()
.stream()
.filter(v -> v.getStatus() == MonStatus.Connected)
.map(CephPrimaryStorageMonVO::getHostname)
.findAny()
.orElseThrow(() -> new OperationFailureException(
operr("all ceph mons of primary storage[uuid:%s] are not in Connected state", cephPrimaryStorageVO.getUuid())
));
cmd.token = CephSystemTags.THIRDPARTY_PLATFORM.getTokenByResourceUuid(msg.getPrimaryStorageUuid(),
CephSystemTags.THIRDPARTY_PLATFORM_TOKEN);
cmd.monIp = monIp;
cmd.tpTimeout = CephGlobalConfig.THIRD_PARTY_SDK_TIMEOUT.value(String.class);
VolumeVO vo = Q.New(VolumeVO.class)
.eq(VolumeVO_.uuid, msg.getVolume().getUuid()).find();
ext.convertToBlockVolume(vo);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

第三方 clone 扩展点循环的控制流有明显 bug:break 会阻断后续扩展点;且 Boolean 可能 NPE。

  • if (!result) { break; }:一旦第一个扩展返回 false,就不会再检查其他扩展(通常应 continue)。
  • Boolean result 直接 !result:返回 null 会 NPE。
  • 另外,isRootVolume/isBareMetal2Instance/hasToken 每次循环都会重复 DB 查询,建议提到循环外计算一次。
可行修复(continue + null-safe + 提前计算条件)
-                        List<CephPrimaryStorageCheckInstanceTypeExtensionPoint> exts = pluginRgty.getExtensionList(CephPrimaryStorageCheckInstanceTypeExtensionPoint.class);
-                        for (CephPrimaryStorageCheckInstanceTypeExtensionPoint ext : exts) {
-                            Boolean result = ext.isSupportCloneByThirdParty(msg.getVolume().getVmInstanceUuid());
-                            if (!result) {
-                                break;
-                            }
-                            boolean isRootVolume = Q.New(VolumeVO.class)
-                                    .eq(VolumeVO_.uuid, msg.getVolume().getUuid())
-                                    .eq(VolumeVO_.type, VolumeType.Root)
-                                    .isExists();
-                            boolean isBareMetal2Instance = Q.New(VmInstanceVO.class)
-                                    .eq(VmInstanceVO_.uuid, msg.getVolume().getVmInstanceUuid())
-                                    .eq(VmInstanceVO_.type, "baremetal2")
-                                    .isExists();
-                            boolean hasToken = CephSystemTags.THIRDPARTY_PLATFORM.hasTag(msg.getPrimaryStorageUuid());
-                            if (isRootVolume && isBareMetal2Instance && hasToken) {
+                        List<CephPrimaryStorageCheckInstanceTypeExtensionPoint> exts =
+                                pluginRgty.getExtensionList(CephPrimaryStorageCheckInstanceTypeExtensionPoint.class);
+
+                        boolean isRootVolume = Q.New(VolumeVO.class)
+                                .eq(VolumeVO_.uuid, msg.getVolume().getUuid())
+                                .eq(VolumeVO_.type, VolumeType.Root)
+                                .isExists();
+                        boolean isBareMetal2Instance = Q.New(VmInstanceVO.class)
+                                .eq(VmInstanceVO_.uuid, msg.getVolume().getVmInstanceUuid())
+                                .eq(VmInstanceVO_.type, "baremetal2")
+                                .isExists();
+                        boolean hasToken = CephSystemTags.THIRDPARTY_PLATFORM.hasTag(msg.getPrimaryStorageUuid());
+
+                        for (CephPrimaryStorageCheckInstanceTypeExtensionPoint ext : exts) {
+                            if (!Boolean.TRUE.equals(ext.isSupportCloneByThirdParty(msg.getVolume().getVmInstanceUuid()))) {
+                                continue;
+                            }
+                            if (isRootVolume && isBareMetal2Instance && hasToken) {
                                 CephPrimaryStorageVO cephPrimaryStorageVO = dbf.findByUuid(msg.getPrimaryStorageUuid(), CephPrimaryStorageVO.class);
                                 String monIp = cephPrimaryStorageVO.getMons()
                                         .stream()
                                         .filter(v -> v.getStatus() == MonStatus.Connected)
                                         .map(CephPrimaryStorageMonVO::getHostname)
                                         .findAny()
                                         .orElseThrow(() -> new OperationFailureException(
                                                 operr("all ceph mons of primary storage[uuid:%s] are not in Connected state", cephPrimaryStorageVO.getUuid())
                                         ));
                                 cmd.token = CephSystemTags.THIRDPARTY_PLATFORM.getTokenByResourceUuid(msg.getPrimaryStorageUuid(),
                                         CephSystemTags.THIRDPARTY_PLATFORM_TOKEN);
                                 cmd.monIp = monIp;
                                 cmd.tpTimeout = CephGlobalConfig.THIRD_PARTY_SDK_TIMEOUT.value(String.class);
                                 VolumeVO vo = Q.New(VolumeVO.class)
                                         .eq(VolumeVO_.uuid, msg.getVolume().getUuid()).find();
                                 ext.convertToBlockVolume(vo);
                             }
                         }

另外强烈建议把 "baremetal2" 抽成常量/枚举,避免魔法值扩散。

Comment on lines +2675 to +2692
if (isRootVolume && isBareMetal2Instance && hasToken) {
CephPrimaryStorageVO cephPrimaryStorageVO = dbf.findByUuid(msg.getPrimaryStorageUuid(), CephPrimaryStorageVO.class);
String monIp = cephPrimaryStorageVO.getMons()
.stream()
.filter(v -> v.getStatus() == MonStatus.Connected)
.map(CephPrimaryStorageMonVO::getHostname)
.findAny()
.orElseThrow(() -> new OperationFailureException(
operr("all ceph mons of primary storage[uuid:%s] are not in Connected state", cephPrimaryStorageVO.getUuid())
));
cmd.token = CephSystemTags.THIRDPARTY_PLATFORM.getTokenByResourceUuid(msg.getPrimaryStorageUuid(),
CephSystemTags.THIRDPARTY_PLATFORM_TOKEN);
cmd.monIp = monIp;
cmd.tpTimeout = CephGlobalConfig.THIRD_PARTY_SDK_TIMEOUT.value(String.class);
VolumeVO vo = Q.New(VolumeVO.class)
.eq(VolumeVO_.uuid, msg.getVolume().getUuid()).find();
ext.convertToBlockVolume(vo);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

convertToBlockVolume() 放在 clone 之前且无回滚,失败路径可能把 Volume 状态改脏。
这段在 NoRollbackFlow 内执行:一旦 httpCall(CLONE_PATH...) 失败,当前 flow 会 fail,但 DB 侧转换可能已发生;后续重试/清理可能遇到不一致状态。至少需要:

  • 要么把转换放到 clone 成功后;
  • 要么把该 flow 改成可 rollback,并在 rollback 里恢复转换(或让扩展点提供 revert)。

Comment on lines +2675 to +2691
if (isRootVolume && isBareMetal2Instance && hasToken) {
CephPrimaryStorageVO cephPrimaryStorageVO = dbf.findByUuid(msg.getPrimaryStorageUuid(), CephPrimaryStorageVO.class);
String monIp = cephPrimaryStorageVO.getMons()
.stream()
.filter(v -> v.getStatus() == MonStatus.Connected)
.map(CephPrimaryStorageMonVO::getHostname)
.findAny()
.orElseThrow(() -> new OperationFailureException(
operr("all ceph mons of primary storage[uuid:%s] are not in Connected state", cephPrimaryStorageVO.getUuid())
));
cmd.token = CephSystemTags.THIRDPARTY_PLATFORM.getTokenByResourceUuid(msg.getPrimaryStorageUuid(),
CephSystemTags.THIRDPARTY_PLATFORM_TOKEN);
cmd.monIp = monIp;
cmd.tpTimeout = CephGlobalConfig.THIRD_PARTY_SDK_TIMEOUT.value(String.class);
VolumeVO vo = Q.New(VolumeVO.class)
.eq(VolumeVO_.uuid, msg.getVolume().getUuid()).find();
ext.convertToBlockVolume(vo);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

cmd.monIp 在这里赋值很可能会被 HttpCaller.doCall() 覆盖,导致第三方 clone 依赖的 monIp 不可控。
HttpCaller.doCall() 每次发送前都会 cmd.monIp = base.getSelf().getHostname();,因此你在 clone flow 里提前写入的 cmd.monIp 可能根本不会到 agent(或变成随机挑选的 mon)。如果第三方 SDK 需要“指定的 monIp”,这会直接造成功能不稳定。

修复方向(示例):HttpCaller 仅在 monIp 为空时填充
-            cmd.monUuid = base.getSelf().getUuid();
-            cmd.monIp = base.getSelf().getHostname();
+            cmd.monUuid = base.getSelf().getUuid();
+            if (cmd.monIp == null) {
+                cmd.monIp = base.getSelf().getHostname();
+            }

如果 monIp 语义需要区分“ceph mon 用于 httpCall 的地址”和“第三方平台 SDK 需要的 monIp”,建议新增独立字段(例如 thirdPartyMonIp)而不是复用 monIp

Also applies to: 3624-3636

🤖 Prompt for AI Agents
In
@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
around lines 2675 - 2691, The code sets cmd.monIp in CephPrimaryStorageBase but
HttpCaller.doCall() unconditionally overwrites cmd.monIp, so third-party SDKs
cannot rely on the specified mon; fix by either making HttpCaller.doCall()
populate cmd.monIp only when it is null (check HttpCaller.doCall() and change
assignment to if (cmd.monIp == null) cmd.monIp = base.getSelf().getHostname();)
or introduce a distinct field (e.g., cmd.thirdPartyMonIp) set in
CephPrimaryStorageBase (where cmd.monIp is currently assigned) and update all
call sites and the agent/HttpCaller to use thirdPartyMonIp for SDK usage while
preserving monIp for HTTP routing; update references in CephPrimaryStorageBase
(cmd.monIp assignment near ext.convertToBlockVolume(...) and similar blocks at
the other noted location) and in HttpCaller.doCall()/agent handling to use the
new conditional or new field consistently.

Comment on lines +2704 to +2711
List<CephPrimaryStorageCheckInstanceTypeExtensionPoint> exts = pluginRgty.getExtensionList(CephPrimaryStorageCheckInstanceTypeExtensionPoint.class);
for (CephPrimaryStorageCheckInstanceTypeExtensionPoint ext : exts) {
if (!ext.isBlockVolume(msg.getVolume().getUuid())) {
break;
}
volumePath = ret.installPath == null ? volumePath : makeVolumeInstallPathByTargetPool(ret.installPath, targetCephPoolName);
ext.populateBlockVolumeDetails(msg.getVolume().getUuid(), ret.volumeId, ret.volumeStatus);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

成功回调里的扩展点循环同样不应 break;且 installPath 重写逻辑可能拼错路径。

  • if (!ext.isBlockVolume(...)) { break; }:同样会阻断后续扩展;更合理是 continue(除非你能保证扩展点只有一个)。
  • volumePath = ... makeVolumeInstallPathByTargetPool(ret.installPath, targetCephPoolName):如果 ret.installPath 已经是完整的 ceph://...pool/uuid,这里会生成错误路径。至少应先判断前缀并保底。
可行修复(continue + installPath 前缀判断)
                                 List<CephPrimaryStorageCheckInstanceTypeExtensionPoint> exts = pluginRgty.getExtensionList(CephPrimaryStorageCheckInstanceTypeExtensionPoint.class);
                                 for (CephPrimaryStorageCheckInstanceTypeExtensionPoint ext : exts) {
                                     if (!ext.isBlockVolume(msg.getVolume().getUuid())) {
-                                        break;
+                                        continue;
                                     }
-                                    volumePath = ret.installPath == null ? volumePath : makeVolumeInstallPathByTargetPool(ret.installPath, targetCephPoolName);
+                                    if (ret.installPath != null) {
+                                        volumePath = ret.installPath.startsWith("ceph://")
+                                                ? ret.installPath
+                                                : makeVolumeInstallPathByTargetPool(ret.installPath, targetCephPoolName);
+                                    }
                                     ext.populateBlockVolumeDetails(msg.getVolume().getUuid(), ret.volumeId, ret.volumeStatus);
                                 }
🤖 Prompt for AI Agents
In
@plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
around lines 2704 - 2711, In the success-callback loop in CephPrimaryStorageBase
where you iterate CephPrimaryStorageCheckInstanceTypeExtensionPoint, replace the
current `if (!ext.isBlockVolume(...)) { break; }` with `continue` so one
non-matching extension doesn't stop others, and change the installPath rewrite
to only call makeVolumeInstallPathByTargetPool when ret.installPath is non-null
and not already a Ceph install path for the target pool (e.g., check
ret.installPath.startsWith("ceph://") or startsWith(targetCephPoolName + "/") or
a similar prefix/pattern); otherwise leave volumePath alone—use these checks
around the assignment that currently does `volumePath = ret.installPath == null
? volumePath : makeVolumeInstallPathByTargetPool(ret.installPath,
targetCephPoolName)`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants