
Fixes #353 - Add health check command #431

Open
jakduch wants to merge 1 commit into theforeman:master from jakduch:feature/healthcheck

Conversation


@jakduch jakduch commented Mar 29, 2026

Adds a new foremanctl health command that performs post-install health checks on all Foreman services.

Checks performed:

  • Core services running (foreman, httpd, redis, postgresql)
  • Dynflow workers running (orchestrator, worker, worker-hosts-queue)
  • Pulp services running (pulp-api, pulp-content)
  • Candlepin service running
  • Foreman API responding (GET /api/v2/ping)
  • Foreman tasks status (via Katello ping)

The command reports individual results for each check and provides a summary. It exits non-zero when any check fails, making it suitable for scripting and CI use.

As discussed in the issue, this can also be integrated as a final step of deployment in the future.

Contributor

pablomh commented Mar 29, 2026

Could you please show the output? Thx!

Author

jakduch commented Mar 29, 2026

Example output when all checks pass:

$ foremanctl health

TASK [Check core services are running] *********
ok: [localhost] => (item=foreman) => Service foreman is running
ok: [localhost] => (item=httpd) => Service httpd is running
ok: [localhost] => (item=redis) => Service redis is running
ok: [localhost] => (item=postgresql) => Service postgresql is running

TASK [Check dynflow services are running] ******
ok: [localhost] => (item=dynflow-sidekiq@orchestrator) => Service dynflow-sidekiq@orchestrator is running
ok: [localhost] => (item=dynflow-sidekiq@worker) => Service dynflow-sidekiq@worker is running
ok: [localhost] => (item=dynflow-sidekiq@worker-hosts-queue) => Service dynflow-sidekiq@worker-hosts-queue is running

TASK [Check Pulp services are running] *********
ok: [localhost] => (item=pulp-api) => Service pulp-api is running
ok: [localhost] => (item=pulp-content) => Service pulp-content is running

TASK [Check Candlepin service is running] ******
ok: [localhost] => Service candlepin is running

TASK [Report Foreman API status] ***************
ok: [localhost] => Foreman API: OK

TASK [Report Foreman tasks status] *************
ok: [localhost] => Foreman tasks: ok

TASK [Health check summary] ********************
ok: [localhost] => All health checks passed.

Example output when a service is down:

$ foremanctl health

TASK [Check core services are running] *********
ok: [localhost] => (item=foreman) => Service foreman is running
ok: [localhost] => (item=httpd) => Service httpd is running
failed: [localhost] => (item=redis) => Service redis is not running
ok: [localhost] => (item=postgresql) => Service postgresql is running

...

TASK [Health check summary] ********************
ok: [localhost] => Health check issues found: Failed services: redis

TASK [Fail if any checks failed] ***************
fatal: [localhost]: FAILED! => Health check failed: 1 service(s) down, Foreman API OK

The command exits non-zero on failure, so it can be used in scripts like foremanctl health && echo "All good" || echo "Issues found".

Member

ehelms commented Mar 29, 2026

This is a good start. Can you look into following the conventions we've established for check_ roles (https://github.com/theforeman/foremanctl/tree/master/src/roles)? I also think you can drop the "post-install" nomenclature and just focus on this being a health check command.

@jakduch jakduch force-pushed the feature/healthcheck branch from d6a97eb to d702027 Compare March 29, 2026 16:01
Author

jakduch commented Mar 29, 2026

Good point, refactored to follow the check_ role conventions:

  • Added check_services role — verifies all Foreman services are running (core, dynflow, pulp, candlepin)
  • Added check_foreman_api role — checks API responds and Foreman tasks status via /api/v2/ping
  • Health playbook now uses checks/tasks/execute_check.yml to run them, same pattern as the existing checks role
  • Dropped "post-install" wording from metadata and description

Force-pushed.
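The check_services role described above is not shown in the diff excerpts here, but a task of that shape might look roughly like this (a hypothetical sketch; the service names come from the PR description, and the use of the systemd module's status output is illustrative, not the PR's actual implementation):

```yaml
# Hypothetical sketch of a check_services task.
# Queries each unit's status and fails the item if it is not active.
- name: Check core services are running
  ansible.builtin.systemd:
    name: "{{ item }}"
  register: svc
  failed_when: svc.status.ActiveState != "active"
  loop:
    - foreman
    - httpd
    - redis
    - postgresql
```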

Member

ehelms commented Mar 29, 2026

Design question for you:

  • Do we need both service and API checks? Shouldn't they tell us the same information?
  • If we check services, should those be conditional based on what features are enabled?
  • This new command could use some sort of validation

@jakduch jakduch force-pushed the feature/healthcheck branch from d702027 to 9ea20cc Compare March 29, 2026 19:33
Author

jakduch commented Mar 29, 2026

Good questions.

Service vs API checks - do we need both?

They operate at different layers. Service checks verify that systemd units are running - this works offline and does not depend on a functional API. The API check (/api/v2/ping) verifies end-to-end functionality including database connectivity and task runner status. A service can be running but not responding (misconfiguration, certificate issues), and conversely the API might not report on every systemd unit. That said, if the ping endpoint already covers everything we need, we could simplify to just the API check. Happy to go either way.

Conditional checks based on features - yes, agreed.

Updated Pulp and Candlepin checks to only run when content/ features are enabled (enabled_features | select("contains", "content/")). Same condition on the Katello tasks status check in check_foreman_api. Core services and dynflow are always checked. Force-pushed.

Validation - could you clarify what you have in mind?

An ansible-playbook --syntax-check for the new playbook, a test scenario in CI, or something else?

@jakduch jakduch force-pushed the feature/healthcheck branch 2 times, most recently from 2c44427 to 4c3bd96 Compare March 30, 2026 11:01
- "../../vars/base.yaml"
tasks:
- name: Execute health checks
ansible.builtin.include_tasks: ../../roles/checks/tasks/execute_check.yml
Member


I think this is a better pattern:

  ansible.builtin.include_role:
    name: common
    tasks_from: login

Author


Done - switched to include_role: name: checks, tasks_from: execute_check to reuse the existing checks role and avoid the relative path.
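In playbook form, the switch described above would amount to something like this (a sketch based on the reviewer's suggested pattern and the role names mentioned in this thread):

```yaml
- name: Execute health checks
  ansible.builtin.include_role:
    name: checks
    tasks_from: execute_check
```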

loop:
- pulp-api
- pulp-content
when: enabled_features | select('contains', 'content/') | list | length > 0
Member


Seems like we could use a better way to do this pattern of checking. Is there a filter or some other Ansible construct we can encapsulate feature checking logic in?

Author


Added a has_content_features filter to foremanctl.py - the condition is now just enabled_features | has_content_features.
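The filter's implementation is not shown in this thread; a minimal sketch of what it might look like in foremanctl.py, assuming the same substring semantics as the original `select('contains', 'content/')` condition:

```python
# Hypothetical sketch of the has_content_features filter; the actual
# implementation in foremanctl.py may differ.
def has_content_features(features):
    """Return True if any enabled feature belongs to the content/ group."""
    return any("content/" in feature for feature in features)
```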

Member

ehelms commented Mar 30, 2026

I would expect to add this to the end of our GitHub Action tests to verify the command works and ensure that it passes.

Author

jakduch commented Mar 30, 2026

Updated based on your feedback:

  1. include_role pattern - Replaced include_tasks with relative path by include_role: name: checks, tasks_from: execute_check. This reuses the existing checks role and avoids the ../../roles/ path.

  2. Feature check filter - Added has_content_features filter to foremanctl.py so the condition is now just enabled_features | has_content_features instead of the verbose select('contains', 'content/') | list | length > 0.

  3. CI validation - Created a separate PR Refs #353 - Add health check validation to CI #434 to add ./foremanctl health to the test workflow, to keep this PR focused on the command itself.

Member

ehelms commented Mar 30, 2026

3. CI validation - Created a separate PR Refs #353 - Add health check validation to CI #434 to add ./foremanctl health to the test workflow, to keep this PR focused on the command itself.

I can't know in the PR if this is working if the CI is split off to another PR :) Testing and changes to CI should generally be kept together with the change.

return compact_list(plugins)


def has_content_features(features):
Member


Do you foresee having a method per feature? Or should it be a more generic method that a feature can be passed in to?

Author


Good point - a generic filter makes more sense. I'll change it to has_feature_prefix(features, prefix) so it can be reused for any feature group, e.g. enabled_features | has_feature_prefix('content/').
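The generic variant could be sketched like this (hypothetical; the prefix semantics match the example `has_feature_prefix('content/')` given above, and the real code in foremanctl.py may differ):

```python
# Hypothetical sketch of a generic feature-prefix filter, reusable for
# any feature group (e.g. 'content/').
def has_feature_prefix(features, prefix):
    """Return True if any enabled feature starts with the given prefix."""
    return any(feature.startswith(prefix) for feature in features)
```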

Member

ehelms commented Mar 30, 2026

What are your thoughts on how foreman-proxy which is an optional service that can be present should be handled?

Author

jakduch commented Mar 30, 2026

Fair enough - I'll merge the CI change from #434 into this PR and close #434.

Author

jakduch commented Mar 30, 2026

foreman-proxy could be added to check_services with a condition similar to content features, e.g. when: 'foreman-proxy' in enabled_features. Same pattern as the existing deploy.yaml which already checks 'foreman-proxy' in enabled_features before including the proxy role. I'll add it.
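A sketch of how that conditional check might look, following the loop/when pattern from the Pulp check earlier in this thread (hypothetical; the failure handling is illustrative):

```yaml
# Hypothetical sketch: only check foreman-proxy when the feature is enabled.
- name: Check foreman-proxy service is running
  ansible.builtin.systemd:
    name: foreman-proxy
  register: proxy_svc
  failed_when: proxy_svc.status.ActiveState != "active"
  when: "'foreman-proxy' in enabled_features"
```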

Add a new foremanctl health command that verifies the state of all
Foreman services after installation or during troubleshooting.

Checks performed:
- Core services running (foreman, httpd, redis, postgresql)
- Dynflow workers running (orchestrator, worker, worker-hosts-queue)
- Pulp services running (pulp-api, pulp-content)
- Candlepin service running
- Foreman API responding (GET /api/v2/ping)
- Foreman tasks status (via Katello ping response)

Reports a summary of all failures and exits non-zero if any check
fails, making it suitable for scripting and CI use.
@jakduch jakduch force-pushed the feature/healthcheck branch from 15e6135 to 845cd50 Compare March 30, 2026 15:00