New RFE: Monitoring Kadalu Kubernetes Storage by aravindavk · Pull Request #25 · kadalu/rfcs

aravindavk · 2021-08-29T13:42:41Z

Signed-off-by: Aravinda Vishwanathapura <aravinda@kadalu.io>

aravindavk · 2021-08-29T13:44:13Z

@leelavg ^^

vatsa287

Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics and one for alerts and events.

aravindavk · 2021-08-29T15:49:41Z

> Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics and one for alerts and events.

Yeah, agreed. Multiple PRs may be required for the metrics alone: one for the framework and further PRs for each metric type.

vatsa287 · 2021-08-29T15:07:12Z

+- *Storage Units Utilization*
+- *Storage units/bricks CPU,Memory and Uptime metrics*
+- *CSI Provisioner CPU,Memory and Uptime metrics*
+- *CSI Node plugins CPU,Memory and Uptime metrics*


Any idea on how to deploy nodeplugin/exporter.py?

leelavg

To be frank, I'm not yet familiar with how Prometheus works or with its concepts, so please keep that in mind while addressing the comments.
I didn't review for grammatical errors; please fix them when you revisit the document, if you feel like it 😅

General Queries:

- Will we be storing collected metrics until Prometheus performs a scrape?
- Are we targeting full implementation for 0.8.6 itself?
- I hope we'll be reusing the existing exporter.py
- Will this RFC be amended with nested structures wrt metrics, or will that be documented as part of the implementation?

I might have more queries when I see the actual implementation; for now, addressing these will get me started on looking into Prometheus. Thanks.

leelavg · 2021-08-30T03:26:58Z

+
+Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all the resources available in Kadalu namespace. Additionally it fetches the nodes list and all the Storage information from the ConfigMap. With these information a few metrics will be derived as follows.
+
+- Number of Up CSI node plugins by comparing the list of nodes and the list returned by `get pods` command.


We might need to adjust the metrics, or add a note, to account for taints and tolerations on nodes.

Ack. To start with, we can show `up_node_plugins` or something similar.
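As a rough illustration of the `up_node_plugins` idea discussed above, the following Python sketch compares a node list with a pod list. The pod dictionaries (`name`, `node`, `phase`) are a simplified, hypothetical view of the `kubectl get pods` output, and the `kadalu-csi-nodeplugin` name prefix is an assumption, not something confirmed in this thread.

```python
from typing import Iterable


def up_node_plugins(
    node_names: Iterable[str],
    pods: Iterable[dict],
    prefix: str = "kadalu-csi-nodeplugin",  # assumed pod name prefix
) -> dict:
    """Compare cluster nodes with Running node-plugin pods.

    Each pod is a hypothetical, flattened dict with `name`, `node`
    and `phase` keys.
    """
    nodes = set(node_names)
    # Nodes hosting a Running pod whose name marks it as a node plugin.
    up_nodes = {
        pod["node"]
        for pod in pods
        if pod["name"].startswith(prefix) and pod["phase"] == "Running"
    }
    return {
        "up_node_plugins": len(up_nodes & nodes),
        # A node listed here is not necessarily unhealthy: taints may
        # legitimately keep the plugin pod away from it.
        "nodes_without_plugin": sorted(nodes - up_nodes),
    }
```

Reporting the nodes without a plugin separately, rather than flagging them as failures, leaves room for the taints and tolerations caveat raised in the review.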

leelavg · 2021-08-30T03:28:57Z

+
+Metrics related to the resource counts.
+
+- Number of Storages


Would "Number of Storage Pools" be a better phrase?

Will this be just a number, or a nested structure differentiating type, kadalu_format, etc.?

> Will this be just a number, or a nested structure differentiating type, kadalu_format, etc.?

The necessary labels should be present for Prometheus. With the JSON format, this need not be a separate metric; it can be derived from len(metrics.storages).

> Would "Number of Storage Pools" be a better phrase?

Ack
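To make the "derive it from `len(metrics.storages)`" point concrete, here is a small Python sketch. The JSON layout and the field names (`name`, `type`, `kadalu_format`) and their values are illustrative assumptions; the RFC does not define the structure in this thread.

```python
import json
from collections import Counter

# Hypothetical JSON as an exporter might return it.
SAMPLE = """
{"storages": [
  {"name": "pool1", "type": "Replica3", "kadalu_format": "native"},
  {"name": "pool2", "type": "Replica1", "kadalu_format": "native"},
  {"name": "pool3", "type": "Replica3", "kadalu_format": "non-native"}
]}
"""


def storage_pool_count(metrics: dict) -> int:
    """The plain count needs no separate metric: it is the list length."""
    return len(metrics["storages"])


def storage_pools_by_label(metrics: dict) -> Counter:
    """Nested view: count pools per (type, kadalu_format) pair."""
    return Counter(
        (s["type"], s["kadalu_format"]) for s in metrics["storages"]
    )


metrics = json.loads(SAMPLE)
```

With the per-pool fields kept in the JSON, both the single number and the breakdown by type or format can be computed by the consumer instead of being exported as separate metrics.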

leelavg · 2021-08-30T03:31:30Z

+
+==== Health Metrics
+
+Metrics related to the state of the resources.


Does this mean we'll make data available to the user from which the states below can be inferred?

The same applies to the remaining "How to" questions.

leelavg · 2021-08-30T03:34:01Z

+
+==== Events
+
+A few Events can be derived from the collected metrics by comparing with the latest data with the previously collected metrics. For example,


Will we be storing "previously collected metrics" to derive the events?

Not all historical data, only the previous cycle's metrics. This need not be persistent; the Operator will start fresh after a restart (so a few events may be missed when the Operator restarts).
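The "previous cycle only, not persistent" approach described above could look roughly like the following Python sketch. The class name, the event tuples, and the shape of the metrics (a flat mapping of resource name to state) are all assumptions made for illustration.

```python
from typing import Optional


class EventDeriver:
    """Keeps only the previous cycle's metrics, in memory.

    After a restart the first cycle has nothing to compare against,
    so changes that happened across the restart are not reported.
    """

    def __init__(self) -> None:
        self._previous: Optional[dict] = None

    def update(self, current: dict) -> list:
        """Compare `current` with the previous cycle and return events."""
        previous, self._previous = self._previous, dict(current)
        if previous is None:
            return []  # first cycle: start fresh, no events
        events = []
        for name, state in current.items():
            if name not in previous:
                events.append(("added", name, state))
            elif previous[name] != state:
                events.append(("state_changed", name, state))
        for name in sorted(previous.keys() - current.keys()):
            events.append(("removed", name, previous[name]))
        return events
```

Calling `update()` once per collection cycle is enough; nothing is written to disk, which matches the note that a few events may be missed on restart.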

leelavg · 2021-08-30T03:35:46Z

+- *Number of Storage pools*
+- *Number of PVs*
+- *Number of Storage Units/Bricks*
+- *Operator Health* - Operator is running or not


"Operator is running or not" with respect to the desired state, I guess?

leelavg · 2021-08-30T03:38:17Z

+- *Health of Metrics exporter*
+- *CSI Provisioner Health*
+- *CSI/Quotad health*
+- *CSI/Mounts health* (Based on expected number of Volumes in ConfigMap and number of mount processes). Gluster client process will continue to run even if all the bricks are down, it waits for the brick processes and re-connects as soon as they are available. Detect this by doing a regular IO from the mount or parsing the log files for `ENOTCONN` errors.


"regular IO from the mount"

Please clarify which mount will be used for performing this operation: the provisioner with some test directory, a new pod, etc.?

From the mount available in the CSI provisioner pod.
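A regular IO check from an existing mount might be sketched as below. This is only an illustration under assumptions: the probe file name is hypothetical, and the thread does not say which IO operation would be used. A small write, read and delete cycle is one option.

```python
import errno
import os


def check_mount(mount_path: str, probe_name: str = ".health-probe") -> str:
    """Do a tiny write/read/delete cycle on the mount.

    Returns "ok", "not_connected" (ENOTCONN, for example when all
    bricks are down) or "error" for any other OS-level failure.
    `probe_name` is a hypothetical file name.
    """
    probe = os.path.join(mount_path, probe_name)
    try:
        with open(probe, "w") as handle:
            handle.write("ok")
        with open(probe) as handle:
            handle.read()
        os.remove(probe)
        return "ok"
    except OSError as err:
        if err.errno == errno.ENOTCONN:
            return "not_connected"
        return "error"
```

Note that IO on an unresponsive mount can block, so a real check would probably need a timeout around this call; that part is left out of the sketch.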

leelavg · 2021-08-30T03:41:09Z

+- *Storage Units Utilization*
+- *Storage units/bricks CPU,Memory and Uptime metrics*
+- *CSI Provisioner CPU,Memory and Uptime metrics*
+- *CSI Node plugins CPU,Memory and Uptime metrics*


@vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as for the provisioner.

However, as the role and the containers in the pods are different, the same port mapping can be used.

leelavg · 2021-08-30T03:42:55Z

+
+[source,yaml]
+----
+      annotations:


Should the scrape interval be configurable by the user? Would another annotation suffice here?

Prometheus is pull-based, which means it calls the APIs and collects the metrics. Metric exporters should not have their own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling
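The pull model described above means the exporter gathers values only when it is asked. A minimal standard-library sketch, assuming a hypothetical `collect_metrics()` function, a made-up metric name and an arbitrary port, could look like this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics() -> dict:
    """Hypothetical collector; called once per scrape, not on a timer."""
    return {"kadalu_storage_pools": 2}


def render(metrics: dict) -> str:
    """Render name/value pairs in the Prometheus text format."""
    return "".join(
        f"{name} {value}\n" for name, value in sorted(metrics.items())
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Collection happens here, driven by the scrape request.
        body = render(collect_metrics()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8050) -> None:
    """Start the exporter; the port number is an arbitrary assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

How often the values are refreshed is then decided entirely by the scrape interval configured on the Prometheus side, so the exporter itself needs no interval setting.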

aravindavk · 2021-08-30T05:00:04Z

> To be frank, I'm not yet familiar with how Prometheus works or with its concepts, so please keep that in mind while addressing the comments.
> I didn't review for grammatical errors; please fix them when you revisit the document, if you feel like it 😅

Yeah, I wrote it in one go. I will review it once for grammatical errors.

> General Queries:
>
> Will we be storing collected metrics until Prometheus performs a scrape?

No. In the future, the Operator may collect the metrics at a periodic interval and store two values (current and previous).

> Are we targeting full implementation for 0.8.6 itself?

At least the framework and the basic metrics. Advanced metrics, events and alerts are for the future.

> I hope we'll be reusing the existing exporter.py

Yes. The Prometheus definitions will now move to operator/exporter. The code that collects metrics in csi/exporter will continue to be used in csi/exporter, but it will export in JSON format.

> Will this RFC be amended with nested structures wrt metrics, or will that be documented as part of the implementation?

I will add some more details soon.

> I might have more queries when I see the actual implementation; for now, addressing these will get me started on looking into Prometheus. Thanks.

Thanks.
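The split described in the replies above (CSI-side exporters returning JSON, with the Prometheus definitions living in the operator-side exporter) might be sketched as follows. The payload fields, the pod name and the `kadalu_csi` metric prefix are hypothetical, chosen only to illustrate the translation step.

```python
import json


def csi_exporter_payload() -> str:
    """What a CSI-side exporter might return: plain JSON, with no
    Prometheus-specific types. All field names are hypothetical."""
    return json.dumps(
        {
            "pod": "csi-provisioner-0",
            "cpu_pct": 1.5,
            "memory_bytes": 1048576,
            "uptime_seconds": 3600,
        }
    )


def to_prometheus(payload: str, prefix: str = "kadalu_csi") -> str:
    """Operator side: translate one JSON payload into the Prometheus
    text format, using the pod name as a label."""
    data = json.loads(payload)
    pod = data.pop("pod")
    return "".join(
        f'{prefix}_{key}{{pod="{pod}"}} {value}\n'
        for key, value in sorted(data.items())
    )
```

Keeping the CSI side format-neutral means the same JSON could also feed the event and alert logic without going through Prometheus.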

leelavg · 2021-08-30T05:21:04Z

Thanks for the info. Will get going 😄.

New RFE: Monitoring Kadalu Kubernetes Storage

2f5bac8

Signed-off-by: Aravinda Vishwanathapura <aravinda@kadalu.io>

aravindavk requested review from amarts and vatsa287 August 29, 2021 13:42

vatsa287 approved these changes Aug 29, 2021

vatsa287 reviewed Aug 29, 2021

leelavg reviewed Aug 30, 2021

vatsa287 mentioned this pull request Sep 1, 2021

across: Implement metrics API kadalu/kadalu#643

Merged

