New RFE: Monitoring Kadalu Kubernetes Storage #25
Signed-off-by: Aravinda Vishwanathapura <aravinda@kadalu.io>
@leelavg ^^
vatsa287 left a comment
Thanks @aravindavk for the detailed design. I think we can divide this into two PRs: one for metrics and one for alerts and events.
Yeah, agreed. Multiple PRs may be required for metrics alone: one for the framework and others for each metric type.
| - *Storage Units Utilization* | ||
| - *Storage units/bricks CPU,Memory and Uptime metrics* | ||
| - *CSI Provisioner CPU,Memory and Uptime metrics* | ||
| - *CSI Node plugins CPU,Memory and Uptime metrics* |
Any idea on how to deploy nodeplugin/exporter.py?
@vatsa287 I guess there's no separate deployment strategy for the nodeplugin; it'll be the same as the provisioner
- However, as the role and containers in the pods are different, the same port mapping can be used
To be frank, I'm not yet familiar with Prometheus workings/concepts, so please keep that in mind while addressing the comments.
I didn't review for grammatical errors; fix them when you revisit, if you feel like it 😅
General Queries:
- Will we be storing collected metrics until Prometheus performs a scrape?
- Are we targeting full implementation for 0.8.6 itself?
- I hope we'll be reusing the existing `exporter.py`
- Will this RFC be amended with nested structures wrt metrics, or will it be documented as part of the implementation?
I might have more queries when I see the actual implementation; for now, addressing these will get me started looking into Prometheus. Thanks.
| Kadalu Operator runs `kubectl get pods -n kadalu` to get the list of all the resources available in Kadalu namespace. Additionally it fetches the nodes list and all the Storage information from the ConfigMap. With these information a few metrics will be derived as follows. | ||
| - Number of Up CSI node plugins by comparing the list of nodes and the list returned by `get pods` command. |
- Might need to adjust the metrics, or add a note, to account for taints & tolerations on nodes
Ack. To start with, we can show `up_node_plugins` or something similar.
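A minimal sketch of how an `up_node_plugins` count could be derived by comparing the node list with the pods returned by `kubectl get pods -n kadalu`. The pod dictionary shape, the `kadalu-csi-nodeplugin` name prefix, and the function name are assumptions for illustration, not the actual implementation; taints and tolerations are not handled here.

```python
def count_up_node_plugins(nodes, pods, prefix="kadalu-csi-nodeplugin"):
    """Return (up, expected) counts of CSI node plugins.

    nodes: list of node names.
    pods:  list of dicts with the hypothetical keys "name", "node", "phase".
    prefix: assumed pod name prefix of the node plugin pods.
    """
    # Nodes that currently have a Running node plugin pod.
    running_on = {
        pod["node"]
        for pod in pods
        if pod["name"].startswith(prefix) and pod["phase"] == "Running"
    }
    up = sum(1 for node in nodes if node in running_on)
    # Expected count assumes one node plugin per node (no taints considered).
    return up, len(nodes)


# Sample data, made up for the example.
sample_nodes = ["node1", "node2", "node3"]
sample_pods = [
    {"name": "kadalu-csi-nodeplugin-aaaaa", "node": "node1", "phase": "Running"},
    {"name": "kadalu-csi-nodeplugin-bbbbb", "node": "node2", "phase": "Running"},
    {"name": "kadalu-csi-nodeplugin-ccccc", "node": "node3", "phase": "Pending"},
    {"name": "operator-ddddd", "node": "node1", "phase": "Running"},
]
up_node_plugins, expected_node_plugins = count_up_node_plugins(sample_nodes, sample_pods)
```

With tainted nodes, the expected count would need to exclude nodes where the plugin is not scheduled, which is the adjustment raised in the comment above.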
| Metrics related to the resource counts. | ||
| - Number of Storages |
- Number of Storage Pools might be a good phrase?
- Will this be just a number or a nested structure differentiating `type` and `kadalu_format` etc?
> Will this be just a number or a nested structure differentiating `type` and `kadalu_format` etc?
Necessary labels should be present for Prometheus. With the JSON format, this need not be a separate metric; it can be derived from `len(metrics.storages)`.
> Number of Storage Pools might be a good phrase?
Ack
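To illustrate the reply above, here is a rough sketch of one metrics structure serving both outputs: the JSON form, where the pool count is just the length of the list, and a labelled Prometheus line per pool. The metric name `kadalu_storage_info`, the field names, and the sample values are assumptions made for this example.

```python
import json

# Hypothetical JSON metrics payload; field names and values are assumed.
metrics = json.loads("""
{
  "storages": [
    {"name": "pool1", "type": "Replica3", "kadalu_format": "native"},
    {"name": "pool2", "type": "Replica1", "kadalu_format": "non-native"}
  ]
}
""")


def storage_pool_count(metrics):
    """JSON consumers can derive the count; no separate metric is needed."""
    return len(metrics["storages"])


def prometheus_lines(metrics):
    """One sample per pool, carrying type and kadalu_format as labels."""
    lines = []
    for storage in metrics["storages"]:
        labels = ",".join(
            f'{key}="{storage[key]}"' for key in ("name", "type", "kadalu_format")
        )
        lines.append(f"kadalu_storage_info{{{labels}}} 1")
    return lines


pool_count = storage_pool_count(metrics)
exposition = prometheus_lines(metrics)
```

On the Prometheus side, the pool count per `type` could then be obtained by aggregating over the labelled samples instead of exporting a separate counter.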
| ==== Health Metrics | ||
| Metrics related to the state of the resources. |
- Does this mean we'll make data available to the user from which below states can be inferred?
- Same for remaining `How to` questions
| ==== Events | ||
| A few Events can be derived from the collected metrics by comparing with the latest data with the previously collected metrics. For example, |
- Will we be storing "previously collected metrics" to derive the events?
Not all historical data, only the previous cycle's metrics. This need not be persistent; an Operator restart will start fresh (a few events may get missed on Operator restart).
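A small sketch of the idea described above: keep only the previous cycle's metrics in memory and emit events for whatever changed. The class name, the flat `resource -> state` dictionary, and the event tuple format are hypothetical and chosen only to keep the example short.

```python
class EventDetector:
    """Derives events by comparing the current cycle with the previous one.

    State lives in memory only, so a restart begins with no previous cycle
    and changes that happened across the restart are not reported.
    """

    def __init__(self):
        self.previous = None

    def update(self, current):
        """current: dict mapping a resource name to its state string."""
        events = []
        if self.previous is not None:
            for name, new_state in current.items():
                old_state = self.previous.get(name)
                if old_state != new_state:
                    events.append((name, old_state, new_state))
            # Resources that disappeared since the previous cycle.
            for name, old_state in self.previous.items():
                if name not in current:
                    events.append((name, old_state, None))
        # Keep a copy so later changes by the caller do not affect us.
        self.previous = dict(current)
        return events


detector = EventDetector()
first = detector.update({"storage-unit-1": "up", "csi-provisioner": "up"})
second = detector.update({"storage-unit-1": "down", "csi-provisioner": "up"})
```

The first call returns no events because there is nothing to compare against, which mirrors the "start fresh" behaviour after an Operator restart.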
| - *Number of Storage pools* | ||
| - *Number of PVs* | ||
| - *Number of Storage Units/Bricks* | ||
| - *Operator Health* - Operator is running or not |
- "Operator is running or not" with desired state ig?
| - *Health of Metrics exporter* | ||
| - *CSI Provisioner Health* | ||
| - *CSI/Quotad health* | ||
| - *CSI/Mounts health* (Based on expected number of Volumes in ConfigMap and number of mount processes). Gluster client process will continue to run even if all the bricks are down, it waits for the brick processes and re-connects as soon as they are available. Detect this by doing a regular IO from the mount or parsing the log files for `ENOTCONN` errors. |
"regular IO from the mount"
- Please clarify which mount will be used for performing this op, the provisioner with some test dir or a new pod etc?
From the mount available in the CSI provisioner pod.
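The "regular IO from the mount" check could look roughly like the sketch below: write and remove a small probe file under the mount path and report the error name if the IO fails. The function name and the probe file name are assumptions; per the RFC text, a Gluster client with all bricks down would be expected to surface as `ENOTCONN`, which this example has not been verified against.

```python
import errno
import os
import tempfile


def check_mount_health(mount_path, probe_name=".kadalu-health-probe"):
    """Return (healthy, error_name) after a small write/remove on the mount.

    probe_name is a hypothetical file name used only for this check.
    """
    probe = os.path.join(mount_path, probe_name)
    try:
        with open(probe, "w") as handle:
            handle.write("ok")
        os.remove(probe)
    except OSError as err:
        # Map the numeric errno (e.g. errno.ENOTCONN) to its symbolic name.
        return False, errno.errorcode.get(err.errno, "UNKNOWN")
    return True, None


# Usage against a local temporary directory standing in for a mount.
with tempfile.TemporaryDirectory() as tmp:
    healthy_result = check_mount_health(tmp)
    missing_result = check_mount_health(os.path.join(tmp, "missing"))
```

A caller could treat `ENOTCONN` specifically as "mount process is running but disconnected from the bricks" and any other error as a generic failure.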
| - *Storage Units Utilization* | ||
| - *Storage units/bricks CPU,Memory and Uptime metrics* | ||
| - *CSI Provisioner CPU,Memory and Uptime metrics* | ||
| - *CSI Node plugins CPU,Memory and Uptime metrics* |
| [source,yaml] | ||
| ---- | ||
| annotations: |
- Should the scrape interval be configurable by the user? Would another annotation suffice here?
Prometheus is pull-based, which means it calls the APIs and collects the metrics. Metric exporters should not have their own scrape interval: https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling
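As a rough illustration of the pull model, the sketch below collects metrics only when `/metrics` is requested, so the exporter needs no timer or interval of its own. It uses only the Python standard library; `collect_metrics`, the metric name, and port 8050 are hypothetical placeholders rather than what `exporter.py` actually does.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Hypothetical collector: gathers values at the moment it is called."""
    return {"kadalu_up_node_plugins": 2}


def render(metrics):
    """Render name/value pairs in the Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Collection happens here, driven by the scrape request itself.
        body = render(collect_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8050):
    """Start the exporter; the port number is an assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


sample_output = render(collect_metrics())
```

Because every scrape triggers a fresh collection, nothing has to be stored between scrapes, and the interval stays a Prometheus-side setting.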
Yeah, I wrote it in a flow. I will review once for grammatical errors.
Yes. Prometheus definitions will now move to
I will add some more details soon.
Thanks.
Thanks for the info. Will get going 😄.