
Conversation

Contributor

@Gerrit91 Gerrit91 commented Oct 1, 2025

Description

Re-raising this from metal-stack/docs-archive#232. Existing comments are hard to carry over. :(

@metal-robot metal-robot bot added the area: documentation Affects the documentation area. label Oct 1, 2025
@metal-robot metal-robot bot added this to Development Oct 1, 2025

netlify bot commented Oct 1, 2025

Deploy Preview for metal-stack-io ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | e475078 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/metal-stack-io/deploys/697770d7a2ac140007d1188f |
| 😎 Deploy Preview | https://deploy-preview-122--metal-stack-io.netlify.app |

@Gerrit91 Gerrit91 added the triage This should be talked about in the next planning. label Nov 17, 2025
@iljarotar iljarotar moved this to In Progress in Development Nov 17, 2025
@metal-robot metal-robot bot removed the triage This should be talked about in the next planning. label Nov 17, 2025
@iljarotar iljarotar marked this pull request as ready for review November 17, 2025 13:26
@iljarotar iljarotar requested a review from a team as a code owner November 17, 2025 13:26
Contributor Author

Gerrit91 commented Dec 1, 2025

@izvyk Maybe this MEP is interesting for you to review. I would like to hear your feedback about it. Are you interested in taking a look?

@Gerrit91 Gerrit91 moved this from In Progress to Upcoming in Development Dec 1, 2025
@Gerrit91 Gerrit91 moved this from Upcoming to In Progress in Development Dec 1, 2025

izvyk commented Dec 3, 2025

> @izvyk Maybe this MEP is interesting for you to review. I would like to hear your feedback about it. Are you interested in taking a look?

Thank you, I am definitely interested!


## New Approach for Bootstrapping

After a server is mounted in a rack in the data center, its BMC is connected to a management switch. The BMC obtains an IP address via DHCP broadcast from a DNS server, typically running on an mgmt-server in the data center partition. The metal-bmc then periodically checks the DHCP lease list to discover new BMCs or update existing ones.

If I understand correctly, the IP is obtained via DHCP from the management server, which hosts both the DNS and DHCP services. I suggest clarifying that a bit.

Contributor Author

The DNS server is not always present; it is only used for another component called metal-image-cache, which syncs OS images into the partition in order to speed up machine provisioning and gain independence from the internet.

Contributor Author

I added a reference to what a mgmt-server is now.
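The periodic lease-list check described in the quoted section can be sketched roughly as follows. This is a minimal illustration only: the dnsmasq-style lease line format and the idea of scanning a lease file are assumptions for the sketch, not the actual metal-bmc implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Lease is a minimal view of a dnsmasq-style DHCP lease entry.
// The exact lease-file format is an assumption for illustration.
type Lease struct {
	MAC string
	IP  string
}

// parseLeases extracts MAC/IP pairs from lease-file content, where
// each line looks like: "<expiry> <mac> <ip> <hostname> <client-id>".
// A discovery loop would call this periodically and diff the result
// against already-known BMCs.
func parseLeases(content string) []Lease {
	var leases []Lease
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		leases = append(leases, Lease{MAC: fields[1], IP: fields[2]})
	}
	return leases
}

func main() {
	sample := "1760000000 ac:1f:6b:00:00:01 10.1.0.10 bmc-r01 *\n" +
		"1760000300 ac:1f:6b:00:00:02 10.1.0.11 bmc-r02 *\n"
	for _, l := range parseLeases(sample) {
		fmt.Printf("discovered BMC %s at %s\n", l.MAC, l.IP)
	}
}
```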

@Gerrit91 Gerrit91 left a comment

Thanks a lot for reading and your sensible inputs. I tried to integrate your points.




In order to minimize the BMC interface, we should try to bundle as much of the implementation as possible in a single microservice. This microservice should have a proto / gRPC API for access.

A suitable microservice is already in place on the mgmt-servers called the [metal-bmc](https://github.com/metal-stack/metal-bmc), which can be extended for this purpose. The metal-bmc will implement the server API. The API can be called by the metal-api (indirectly through NSQ), the metal-hammer and a metal-bmc CLI.
Contributor

We should not introduce new features to the metal-api; instead, the metal-apiserver must be used.

Calling the metal-bmc is not done via NSQ; it is done by hooking into a streaming service. Exposing this API also means opening a public port at which the metal-bmc is reachable. This is against our current policy, which tries as hard as possible not to have the partition visible at all.

The metal-hammer actually does not talk to the metal-bmc, and IMHO it is not good practice to have communication connections other than from a microservice to the apiserver.

Comment on lines +28 to +31
- Every new vendor has to be individually whitelisted in go-hal; a new board of a Supermicro server potentially requires a pull request.
- The current interface implements functions using different underlying drivers whenever it fits, so it is not obvious whether IPMI or Redfish is used; sometimes the information from the different protocols differs.
- There are almost no unit tests and no automated integration tests (aside from indirect integration testing through our release integration, which creates and deletes machines).
- The CLI does not implement the entire API interface; it is only implemented roughly. Using it for testing requires ad hoc code changes and recompilation.
Contributor

agreed

- There is no possibility for an operator to provide the user / password that a server was shipped with in order to initialize it with these credentials, such that the metal-hammer must be booted before the server can be managed through the BMC. This also implies that the implementation relies on inband access to work.
Contributor

The benefit is that no manual action is actually required once new machines are physically provisioned in the datacenter, other than powering them on.
With the described approach this would no longer work, because a list mapping machine UUIDs to BMC user / password pairs would have to be provided and entered manually.
This is difficult, because hardware vendors only deliver this BMC password list keyed by the MAC address of the BMC interface, not by the machine ID.
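The mapping problem described here (vendor lists keyed by BMC MAC, the machine table keyed by UUID) could be bridged once the BMC MAC has been discovered inband. A hypothetical sketch of that join, with all data and names invented for illustration:

```go
package main

import "fmt"

// vendorCredentials models a hypothetical vendor-delivered password
// list keyed by the BMC interface's MAC address, as described above.
var vendorCredentials = map[string]string{
	"ac:1f:6b:00:00:01": "factory-pass-1",
}

// resolveCredential joins an inband-discovered (machineUUID, bmcMAC)
// pair against the MAC-keyed vendor list, yielding UUID-keyed
// credentials that could then be stored in the BMC table.
func resolveCredential(machineUUID, bmcMAC string) (string, bool) {
	pass, ok := vendorCredentials[bmcMAC]
	if !ok {
		return "", false
	}
	fmt.Printf("machine %s -> BMC %s resolved\n", machineUUID, bmcMAC)
	return pass, true
}

func main() {
	if pass, ok := resolveCredential("uuid-123", "ac:1f:6b:00:00:01"); ok {
		fmt.Println("credential:", pass)
	}
}
```

The join only works if something (for example, the metal-hammer running inband) can report which BMC MAC belongs to which machine UUID; without that step the vendor list alone is insufficient, which is exactly the difficulty raised in the comment.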



In general, it should be preferred to run actions from remote (a.k.a. outband) in order to have the functionality easily accessible for other services. Another advantage is that heavy-weight proprietary tools like `sum` are bundled in only a single component. There are only a few exceptions where, for example, an IPMI inband connection is required. For this, we need to offer a special package whose purpose can be described as enabling a server to be managed from remote. This is explained in more detail in a later section of this MEP.
Contributor

If remote BMC connectivity is not enabled by default, there is no way of enabling it other than with ipmitool from within the machine.


The CLI of the new metal-bmc API must become a first-class citizen in order to simplify testing the API. The entire new API should be implemented generically such that operators can easily run commands against a BMC.

## Additions to the metal-api
Contributor

Suggested change
## Additions to the metal-api
## Additions to the metal-apiserver


When a machine gets connected to the leaf switches and boots for the first time, the metal-hammer is run through PXE boot.

The metal-hammer gets access to the BMC API as well as to the metal-api through the pixiecore. It will look up the BMC in the metal-api by the locally discovered UUID. If there is already a relation between the machine and the BMC, the metal-hammer does not need to do anything specific; it may call the new BMC API at any point during the provisioning sequence.
Contributor

Which calls are required from the metal-hammer to the metal-bmc? Please explain.



If there is no relation yet, the metal-hammer attempts to establish it using IPMI inband information. The metal-hammer tries to figure out the BMC MAC address and attempts to generate a privileged IPMI user and password. If this works, the metal-hammer updates the BMC table with working access credentials. This way, operators are not strictly required to manually insert connection data into the BMC table; the metal-hammer can generate it through inband capabilities. If it does not work, an operator has to provide credentials manually.
Contributor

But then the inband connection can not be disabled
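The decision flow in the quoted paragraph (reuse an existing machine-to-BMC relation, otherwise fall back to inband discovery, otherwise require operator action) can be sketched as follows. All function names and data are hypothetical stand-ins; in reality these steps would call the metal-api and the IPMI inband tooling.

```go
package main

import (
	"errors"
	"fmt"
)

// knownRelations stands in for the BMC table's machine-to-BMC mapping.
var knownRelations = map[string]string{"uuid-known": "10.1.0.10"}

// lookupBMCRelation checks whether a relation already exists.
func lookupBMCRelation(machineUUID string) (string, bool) {
	ip, ok := knownRelations[machineUUID]
	return ip, ok
}

// discoverInband stands in for reading the BMC MAC/IP and creating a
// privileged IPMI user via inband access from the metal-hammer.
func discoverInband(machineUUID string) (string, error) {
	return "10.1.0.99", nil
}

// ensureBMCRelation mirrors the described flow: reuse an existing
// relation if present, otherwise try to establish one inband, and
// only fail over to manual operator input as a last resort.
func ensureBMCRelation(machineUUID string) (string, error) {
	if ip, ok := lookupBMCRelation(machineUUID); ok {
		return ip, nil
	}
	ip, err := discoverInband(machineUUID)
	if err != nil {
		return "", errors.New("operator must provide credentials manually")
	}
	knownRelations[machineUUID] = ip
	return ip, nil
}

func main() {
	ip, _ := ensureBMCRelation("uuid-new")
	fmt.Println("BMC reachable at", ip)
}
```

Note that, as the review comment points out, this fallback only exists if the inband path remains available, so inband access cannot simply be disabled under this design.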


From here everything should work the same as before, but through remote access to the BMC API.

## New metalctl commands
Contributor

The whole idea of a new BMC table is nice!


Labels

area: documentation Affects the documentation area.

Projects

Status: In Progress

Development


4 participants