feat: add cluster label to the elasticsearch_up metric #1413
Conversation
Signed-off-by: ruochen <wanxialianwei@gmail.com>
…pes metrics: add cluster label
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Pull request overview
This PR adds an Elasticsearch cluster identifier as a label so that "up"/node stats metrics can be distinguished across clusters when scraping multiple Elasticsearch endpoints.
Changes:
- Extends `serverInfo` to store `clusterName` and attempts to fetch it during `Gather()`.
- Adds a `cluster` label to the `elasticsearch` "up" sample.
- Threads `clusterName` into `collector.NewNodes(...)` and applies it as a const label on node-stats "up/scrape/failure" metrics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| inputs/elasticsearch/elasticsearch.go | Fetches cluster name, adds cluster label to elasticsearch up, and passes cluster name into node stats collector. |
| inputs/elasticsearch/collector/nodes.go | Extends NewNodes signature, adds const cluster labels to a few metrics, and introduces GetClusterName. |
| inputs/elasticsearch/collector/nodes_test.go | Updates NewNodes call site for the new function signature. |
```go
func GetClusterName(client *http.Client, user, password, s string) (string, error) {
	u, err := url.Parse(s)
	if err != nil {
		return "", fmt.Errorf("failed to parse URL %s: %s", s, err)
	}
	if user != "" && password != "" {
		u.User = url.UserPassword(user, password)
	}

	var cir ClusterInfoResponse
	res, err := client.Get(u.String())
	if err != nil {
		return "", fmt.Errorf("failed to get cluster info from %s: %s", u.String(), err)
	}
	defer func() {
		err = res.Body.Close()
		if err != nil {
			log.Println("failed to close http.Client, err: ", err)
		}
	}()
	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("HTTP Request failed with code %d", res.StatusCode)
	}
	bts, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	if err := json.Unmarshal(bts, &cir); err != nil {
		return "", err
	}
	return cir.ClusterName, nil
}
```
GetClusterName is new behavior and currently has no unit test coverage. Since this package already has HTTP fixture-based tests (e.g., nodes_test.go), consider adding a small test that validates cluster name extraction and non-200 handling.
```diff
 // Always gather node stats
-if err := inputs.Collect(collector.NewNodes(ins.Client, EsUrl, ins.AllNodes, ins.Node, ins.Local, ins.NodeStats), slist); err != nil {
+if err := inputs.Collect(collector.NewNodes(ins.Client, EsUrl, ins.AllNodes, ins.Node, ins.Local, ins.NodeStats, ins.serverInfo[ins.Node].clusterName), slist); err != nil {
```
ins.serverInfo is keyed by server URL (s), but here the lookup uses ins.Node (often a selector like "_local"), so clusterName will usually be empty and the cluster label on node stats metrics will be wrong. Use the current server key (s) when retrieving clusterName (and consider locking/using a local copy if needed).
Suggested change:
```diff
-if err := inputs.Collect(collector.NewNodes(ins.Client, EsUrl, ins.AllNodes, ins.Node, ins.Local, ins.NodeStats, ins.serverInfo[ins.Node].clusterName), slist); err != nil {
+clusterName := ""
+if info, ok := ins.serverInfo[s]; ok {
+	clusterName = info.clusterName
+}
+if err := inputs.Collect(collector.NewNodes(ins.Client, EsUrl, ins.AllNodes, ins.Node, ins.Local, ins.NodeStats, clusterName), slist); err != nil {
```
```go
if info.clusterName, err = collector.GetClusterName(ins.Client, ins.UserName, ins.Password, s); err != nil {
	slist.PushSample("elasticsearch", "up", 0, map[string]string{"address": s})
	log.Println("E! failed to get cluster name:", err)
}
```
GetClusterName adds an extra HTTP request per server per Gather, but the existing GetNodeID call already unmarshals a nodeStatsResponse that includes cluster_name (see nodeStatsResponse.ClusterName). To reduce overhead, consider returning the cluster name from the same request (e.g., a new helper that returns both nodeID and clusterName) and avoid this additional GET / call.
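A sketch of the combined approach, assuming the payload shape implied by `nodeStatsResponse.ClusterName`: since `cluster_name` sits alongside the node map, one decode can yield both identifiers. The helper name `getNodeIDAndClusterName` and the sample payload are hypothetical; the real helper would perform the single HTTP GET that `GetNodeID` already makes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeStatsResponse mirrors the relevant slice of the _nodes response:
// cluster_name plus a map keyed by node ID.
type nodeStatsResponse struct {
	ClusterName string                     `json:"cluster_name"`
	Nodes       map[string]json.RawMessage `json:"nodes"`
}

// getNodeIDAndClusterName extracts both identifiers from one decoded payload,
// avoiding the separate GET / that GetClusterName would add per Gather.
func getNodeIDAndClusterName(body []byte) (nodeID, clusterName string, err error) {
	var nsr nodeStatsResponse
	if err = json.Unmarshal(body, &nsr); err != nil {
		return "", "", err
	}
	for id := range nsr.Nodes { // a _local query returns a single entry
		nodeID = id
	}
	return nodeID, nsr.ClusterName, nil
}

func main() {
	body := []byte(`{"cluster_name":"prod-es","nodes":{"abc123":{}}}`)
	id, cluster, err := getNodeIDAndClusterName(body)
	fmt.Println(id, cluster, err == nil) // abc123 prod-es true
}
```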
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…pes metrics: add address label
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
```go
up: prometheus.NewGauge(prometheus.GaugeOpts{
	Name: prometheus.BuildFQName(namespace, "node_stats", "up"),
	Help: "Was the last scrape of the Elasticsearch nodes endpoint successful.",
	ConstLabels: prometheus.Labels{
		"cluster": clusterName,
		"address": url.String(),
	},
```
NewNodes sets address ConstLabels to url.String(). Since the URL passed in is configured with basic auth (EsUrl.User = url.UserPassword(...)), this will embed credentials (user/password or API key) into Prometheus labels/metrics output. Please ensure the address label is derived from a sanitized URL (e.g., clone the URL and clear User, or use Scheme+Host only) so secrets are never exported as labels.
```go
func GetClusterName(client *http.Client, user, password, s string) (string, error) {
	u, err := url.Parse(s)
	if err != nil {
		return "", fmt.Errorf("failed to parse URL %s: %s", s, err)
	}
	if user != "" && password != "" {
		u.User = url.UserPassword(user, password)
	}

	var cir ClusterInfoResponse
	res, err := client.Get(u.String())
	if err != nil {
		return "", fmt.Errorf("failed to get cluster info from %s: %s", u.String(), err)
	}
```
GetClusterName is newly introduced but doesn’t have unit test coverage. Since similar helpers (e.g., fetchAndDecodeNodeStats) are tested in nodes_test.go, please add tests for GetClusterName covering: successful parsing of cluster_name, non-200 responses, and basic auth handling.
```go
if info.clusterName, err = collector.GetClusterName(ins.Client, ins.UserName, ins.Password, s); err != nil {
	slist.PushSample("elasticsearch", "up", 0, map[string]string{"address": s})
	log.Println("E! failed to get cluster name:", err)
	return
}

slist.PushSample("elasticsearch", "up", 1, map[string]string{
	"address": s,
	"cluster": info.clusterName,
})
```
The elasticsearch up metric now uses different label sets depending on success vs failure: on failure you emit only {address: s}, but on success you emit {address: s, cluster: ...}. This creates separate time series for the same target and can leave stale series behind. Consider always including the cluster label (set to ""/"unknown" when unavailable) or keeping labels consistent across both paths.
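A minimal sketch of the consistent-label approach, using a hypothetical `upLabels` helper that always emits both labels and falls back to "unknown" when the cluster name is unavailable, so success and failure share one series identity:

```go
package main

import "fmt"

// upLabels builds the label set for the elasticsearch up sample; cluster is
// always present, defaulting to "unknown", so both the success and the
// failure path push the same label set. Hypothetical helper name.
func upLabels(address, clusterName string) map[string]string {
	if clusterName == "" {
		clusterName = "unknown"
	}
	return map[string]string{
		"address": address,
		"cluster": clusterName,
	}
}

func main() {
	fmt.Println(upLabels("http://es01:9200", ""))        // map[address:http://es01:9200 cluster:unknown]
	fmt.Println(upLabels("http://es01:9200", "prod-es")) // map[address:http://es01:9200 cluster:prod-es]
}
```

Both `PushSample` call sites would then pass `upLabels(s, info.clusterName)` instead of building the map inline.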
```go
if err := inputs.Collect(collector.NewNodes(ins.Client, EsUrl, ins.AllNodes, ins.Node, ins.Local, ins.NodeStats, ins.serverInfo[ins.Node].clusterName), slist); err != nil {
	log.Println("E! failed to collect nodes metrics:", err)
```
NewNodes(..., ins.serverInfo[ins.Node].clusterName) looks up serverInfo using ins.Node (e.g. "_local"), but serverInfo is populated with keys of the server URL string (s). This will typically pass an empty cluster name. Also, ins.serverInfo is only refreshed inside the earlier if ins.ClusterStats || len(ins.IndicesInclude) > 0 block; when that block is skipped, this read can use stale data from a previous run. Please derive the cluster name for the current server (s / EsUrl) with consistent key normalization and ensure it’s populated (or explicitly empty) on every Gather run before passing into NewNodes.
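A sketch of a URL-keyed, lock-guarded lookup with an explicit empty fallback; the `infoRegistry` type and its method are hypothetical, standing in for how `ins.serverInfo` could be consulted per server on every Gather:

```go
package main

import (
	"fmt"
	"sync"
)

// serverInfo mirrors the shape used in the PR.
type serverInfo struct {
	clusterName string
}

// infoRegistry guards a map keyed by the server URL string (the same key
// used when populating serverInfo), so concurrent Gather goroutines read
// safely and selectors like "_local" never match by accident.
type infoRegistry struct {
	mu    sync.RWMutex
	byURL map[string]serverInfo
}

func (r *infoRegistry) clusterNameFor(serverURL string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if info, ok := r.byURL[serverURL]; ok {
		return info.clusterName
	}
	return "" // explicitly empty rather than stale or mis-keyed
}

func main() {
	reg := &infoRegistry{byURL: map[string]serverInfo{
		"http://es01:9200": {clusterName: "prod-es"},
	}}
	fmt.Println(reg.clusterNameFor("http://es01:9200")) // prod-es
	fmt.Println(reg.clusterNameFor("_local") == "")     // true: a selector, not a URL key
}
```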