Add metrics via prometheus #2

Open
tlwr wants to merge 3 commits into vixus0:main from tlwr:main

@tlwr tlwr commented Jun 22, 2021

Disclaimer

I have not tested this as I have not been particularly energised to set up mocks, nor do I have access to an AWS k8s cluster

What

Prometheus can be enabled via flags with configurable address and port
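As a rough illustration of the flag-gated server described above, here is a minimal Python sketch using only the standard library. It is not the code in this PR: the flag names `--metrics-enabled`, `--metrics-address` and `--metrics-port`, their defaults, and the placeholder response body are all assumptions made for the example. Only the `/metrics` path comes from the commit message.

```python
import argparse
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_flags(argv):
    """Parse hypothetical metrics flags; the real skuttle flag names may differ."""
    parser = argparse.ArgumentParser(prog="skuttle-sketch")
    parser.add_argument("--metrics-enabled", action="store_true")
    parser.add_argument("--metrics-address", default="127.0.0.1")
    parser.add_argument("--metrics-port", type=int, default=9090)
    return parser.parse_args(argv)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"# metrics would be rendered here\n"  # placeholder payload
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging in this sketch.
        pass


def start_metrics_server(flags):
    """Start the server in a daemon thread, or return None when disabled."""
    if not flags.metrics_enabled:
        return None
    server = HTTPServer((flags.metrics_address, flags.metrics_port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Keeping the server optional means the controller behaves exactly as before unless the operator opts in.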

Prometheus exposes metrics for the binary itself, as well as three controller metrics:

  • skuttle_node_termination_errors_total
  • skuttle_node_termination_skips_total
  • skuttle_node_terminations_total

These metrics record errors when skuttling, skips (when nodes exist but are picked up by skuttle) and skuttles (node terminations)
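To make the three counters concrete, the following is a minimal Python sketch of labelled counters and the text they would expose. It is a stand-in written for illustration, not the Prometheus client used in this PR. The `LabelledCounter` class, the `record` helper and the outcome names `terminated`, `skipped` and `error` are hypothetical. The metric names follow the list above.

```python
from collections import Counter


class LabelledCounter:
    """A tiny stand-in for a Prometheus counter vector (not the real client)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self.values = Counter()

    def inc(self, **labels):
        # Labels that are not supplied default to the empty string.
        key = tuple(labels.get(n, "") for n in self.label_names)
        self.values[key] += 1

    def render(self):
        # Emit the HELP and TYPE lines followed by one line per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self.values):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self.values[key]}")
        return "\n".join(lines) + "\n"


LABELS = ("az", "region", "role")
TERMINATIONS = LabelledCounter(
    "skuttle_node_terminations_total",
    "Total number of EC2 instance terminations", LABELS)
SKIPS = LabelledCounter(
    "skuttle_node_termination_skips_total",
    "Total number of EC2 instance terminations skipped", LABELS)
ERRORS = LabelledCounter(
    "skuttle_node_termination_errors_total",
    "Total number of errors terminating EC2 instances", LABELS)


def record(outcome, **labels):
    """Increment exactly one counter per outcome (hypothetical outcome names)."""
    counter = {"terminated": TERMINATIONS, "skipped": SKIPS, "error": ERRORS}[outcome]
    counter.inc(**labels)
```

Each handled node increments exactly one of the three counters, which is what lets the counters be combined into a ratio later.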

Why

So we can have time-series charts showing EC2 node terminations, when skuttle decides to drop depth charges

How to use

When Prometheus is enabled, the exposed metrics look like the following:

# HELP skuttle_node_termination_skips_total Total number of EC2 instance terminations skipped
# TYPE skuttle_node_termination_skips_total counter
skuttle_node_termination_skips_total{az="az1",region="region1",role="unknown"} 7
skuttle_node_termination_skips_total{az="az2",region="region1",role="unknown"} 7
# HELP skuttle_node_terminations_total Total number of EC2 instance terminations
# TYPE skuttle_node_terminations_total counter
skuttle_node_terminations_total{az="az1",region="region1",role="master"} 7
skuttle_node_terminations_total{az="az2",region="region1",role="master"} 7
# HELP skuttle_node_termination_errors_total Total number of errors terminating EC2 instances
# TYPE skuttle_node_termination_errors_total counter
skuttle_node_termination_errors_total{az="az1",region="region1",role="worker"} 7
skuttle_node_termination_errors_total{az="az2",region="region1",role="worker"} 7

When a node is missing the label metadata that Kubernetes should set by default (refer to the code), the label values are empty and the metrics appear as:

# HELP skuttle_node_termination_skips_total Total number of EC2 instance terminations skipped
# TYPE skuttle_node_termination_skips_total counter
skuttle_node_termination_skips_total{az="",region="",role=""} 3
# HELP skuttle_node_terminations_total Total number of EC2 instance terminations
# TYPE skuttle_node_terminations_total counter
skuttle_node_terminations_total{az="",region="",role=""} 3
# HELP skuttle_node_termination_errors_total Total number of errors terminating EC2 instances
# TYPE skuttle_node_termination_errors_total counter
skuttle_node_termination_errors_total{az="",region="",role=""} 3
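The empty label values can be pictured as a lookup with an empty-string fallback. The Python sketch below shows that idea. It assumes the zone and region come from the well-known `topology.kubernetes.io/zone` and `topology.kubernetes.io/region` node labels. The role key `kubernetes.io/role` is purely an assumption for illustration, since the PR description does not say which key the code reads.

```python
# Well-known topology label keys from the Kubernetes documentation.
ZONE_LABEL = "topology.kubernetes.io/zone"
REGION_LABEL = "topology.kubernetes.io/region"
# Assumption: the key used for the role is not stated in this PR.
ROLE_LABEL = "kubernetes.io/role"


def metric_labels(node_labels):
    """Map a node's Kubernetes labels to metric labels, defaulting to ''."""
    return {
        "az": node_labels.get(ZONE_LABEL, ""),
        "region": node_labels.get(REGION_LABEL, ""),
        "role": node_labels.get(ROLE_LABEL, ""),
    }
```

For example, `metric_labels({})` yields empty strings for all three labels, which corresponds to the `az="",region="",role=""` series shown above.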

These metrics enable a skuttle SLI:

sum(rate(skuttle_node_terminations_total[15m]))
/
(
  sum(rate(skuttle_node_terminations_total[15m]))
  + sum(rate(skuttle_node_termination_skips_total[15m]))
  + sum(rate(skuttle_node_termination_errors_total[15m]))
)

The SLI could be broken down by region or availability zone, which may give insights for optimising spot instance placement.
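The arithmetic behind the SLI, and a breakdown by one label, can be sketched in Python as below. This is an illustration of the ratio only, not something Prometheus runs. The `sli_by` function and its input shape (lists of label and increase pairs over one shared window) are assumptions. Because every rate is an increase divided by the same window length, the ratio of rates equals the ratio of increases.

```python
from collections import defaultdict


def sli_by(group_label, terminations, skips, errors):
    """Return terminations / (terminations + skips + errors) per group.

    Each argument is a list of (labels_dict, increase) pairs measured over
    the same time window. Groups with no activity are left out, so there is
    no division by zero.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # [terminations, all outcomes]
    for series, is_success in ((terminations, True), (skips, False), (errors, False)):
        for labels, increase in series:
            bucket = totals[labels.get(group_label, "")]
            if is_success:
                bucket[0] += increase
            bucket[1] += increase
    return {group: ok / total for group, (ok, total) in totals.items() if total > 0}


# Made-up increases over one window, for illustration only.
terminations = [({"region": "region1", "az": "az1"}, 6),
                ({"region": "region1", "az": "az2"}, 2)]
skips = [({"region": "region1", "az": "az1"}, 1)]
errors = [({"region": "region1", "az": "az2"}, 1)]

by_region = sli_by("region", terminations, skips, errors)
by_az = sli_by("az", terminations, skips, errors)
```

Grouping by `az` instead of `region` is the same calculation with a different key, which mirrors adding a `by (az)` clause to each `sum` in the query.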

How to review

  • Fetch this code locally
  • Test it
  • Cherry pick onto main and GPG sign my commits with your key

tlwr added 2 commits June 22, 2021 08:35
Signed-off-by: toby lorne <toby@toby.codes>
adds an optional prometheus server which handles /metrics

adds flags for enabling prometheus server, configuring listen address
and port

adds 3 metrics:

- skuttle_node_termination_errors_total
- skuttle_node_termination_skips_total
- skuttle_node_terminations_total

this will enable monitoring node skuttling activity via prometheus

Signed-off-by: toby lorne <toby@toby.codes>
Owner

@vixus0 vixus0 left a comment

Looks good overall, just the topology label keys can be updated :)

Re: testing, I have a vague idea around integration testing with kind since you can define multiple nodes with labels in its configuration file. Just need to get round to setting it up.

Setting the topology labels is dependent on the cloud-controller, right? The Kubernetes docs suggest /zone and /region should be set but I'm not sure about /role. It's not in the "well known labels": https://github.com/kubernetes/api/blob/ccc65c06cccc78a07b45598ec7c135dca7d84ed2/core/v1/well_known_labels.go#L22

role -> instance type
role is not available, instance type is useful

referring to
https://kubernetes.io/docs/reference/labels-annotations-taints/

Signed-off-by: toby lorne <toby@toby.codes>
Co-authored-by: vixus0 <vixus0@gmail.com>