feat: 新增 dmesg 错误日志检查插件#1415
Conversation
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Signed-off-by: ruochen <wanxialianwei@gmail.com>
There was a problem hiding this comment.
Pull request overview
新增一个 dmesg input 插件,用于从 Linux 内核环形缓冲区(/dev/kmsg)读取并解析日志,并基于内置关键字与用户配置关键字上报错误命中指标。
Changes:
- 新增
/dev/kmsg读取与解析逻辑(dmesg decode)。 - 新增
dmesginput 插件:遍历 dmesg 文本并按关键字上报指标。 - 在 agent 自动注册列表中加入
inputs/dmesg。
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| inputs/dmesg/dmesg_decode.go | 新增 /dev/kmsg 读取与消息解析实现,并提供对外 API。 |
| inputs/dmesg/dmesg.go | 新增 dmesg input:关键字匹配并上报采样。 |
| agent/metrics_agent.go | 将 dmesg input 纳入自动注册。 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
inputs/dmesg/dmesg_decode.go
Outdated
| // Package dmesg provides interfaces to get log messages from linux kernel ring buffer like | ||
| // cmd util 'dmesg' by reading data from /dev/kmsg. | ||
| package dmesg |
There was a problem hiding this comment.
This package reads /dev/kmsg and uses linux-specific syscall behavior, but it has no build constraints. Because agent/metrics_agent.go always blank-imports this input (like other linux-only inputs), non-Linux builds can fail (e.g., “build constraints exclude all Go files” or missing syscalls). Add //go:build linux (and // +build linux) to the real implementation files and add a dmesg_nolinux.go stub with //go:build !linux, following the pattern used by inputs/conntrack and inputs/kernel.
inputs/dmesg/dmesg_decode.go
Outdated
|
|
||
| msg.Text = string(data[prefixEnd+1 : textEnd]) | ||
| if textEnd == dataLen-1 { | ||
| return nil |
There was a problem hiding this comment.
parseData returns nil when the log line ends right after the newline (textEnd == dataLen-1). For the common /dev/kmsg format where a message is just ...;text\n (no extra device-info lines), this drops valid messages so the plugin won’t ever match keywords. Return the parsed Msg even when there’s no device info (and leave DeviceInfo nil/empty).
| return nil | |
| // No additional device info lines; return the parsed message as-is. | |
| return &msg |
inputs/dmesg/dmesg_decode.go
Outdated
| val, _ := strconv.ParseInt(string(prefix), 10, 64) | ||
| msg.TsUsec = val | ||
| case 3: | ||
| msg.IsFragment = prefix[0] != '-' |
There was a problem hiding this comment.
Potential panic: msg.IsFragment = prefix[0] != '-' assumes prefix is non-empty. If the prefix section has consecutive commas (missing fields) this will be an index-out-of-range panic. Guard with len(prefix) > 0 before indexing.
| msg.IsFragment = prefix[0] != '-' | |
| if len(prefix) > 0 { | |
| msg.IsFragment = prefix[0] != '-' | |
| } |
inputs/dmesg/dmesg_decode.go
Outdated
| if info[0] != ' ' { | ||
| continue | ||
| } | ||
|
|
||
| kv := bytes.Split(info, []byte("=")) | ||
| if len(kv) != 2 { | ||
| continue | ||
| } | ||
|
|
||
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) |
There was a problem hiding this comment.
Potential panics while parsing device info: if info[0] != ' ' will panic on empty info, and bytes.Split(info, []byte("=")) can return more than 2 parts if the value contains '='. Add len(info) > 0 guard, use bytes.SplitN(..., 2), and consider trimming the leading space so map keys don’t include it.
| if info[0] != ' ' { | |
| continue | |
| } | |
| kv := bytes.Split(info, []byte("=")) | |
| if len(kv) != 2 { | |
| continue | |
| } | |
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) | |
| if len(info) == 0 || info[0] != ' ' { | |
| continue | |
| } | |
| kv := bytes.SplitN(info, []byte("="), 2) | |
| if len(kv) != 2 { | |
| continue | |
| } | |
| key := bytes.TrimSpace(kv[0]) | |
| msg.DeviceInfo[string(key)] = string(kv[1]) |
inputs/dmesg/dmesg_decode.go
Outdated
| buf := make([]byte, bufSize) | ||
| _, err := syscall.Read(int(fd), buf) | ||
| if err != nil { | ||
| syscallError = err | ||
| // EINVAL means buf is not enough, data would be truncated, but still can continue. | ||
| if !errors.Is(err, syscall.EINVAL) { | ||
| return true | ||
| } | ||
| } | ||
|
|
||
| if fetchRaw { | ||
| d.raw = append(d.raw, buf) | ||
| } else { | ||
| msg := parseData(buf) | ||
| if msg == nil { | ||
| continue | ||
| } | ||
| d.msg = append(d.msg, *msg) | ||
| } |
There was a problem hiding this comment.
syscall.Read’s return length is ignored, and the code appends/parses the full bufSize buffer (including trailing zero bytes). This can break parsing (e.g., textEnd == dataLen-1 never true) and makes RawDmesgWithBufSize return padded garbage. Use the returned n to slice (buf[:n]) before appending/parsing; also consider reusing the buffer to avoid allocating each loop iteration.
inputs/dmesg/dmesg.go
Outdated
| const ( | ||
| _ = iota // 跳过 0 和 1,让 iota 从 2 开始计数 | ||
| ErrCode1 // 2 | ||
| ErrCode2 // 3 | ||
| ErrCode3 // 4 | ||
| ErrCode4 // 5 |
There was a problem hiding this comment.
This iota block doesn’t do what the comment says: with _ = iota, ErrCode1 will be 1 (not 2), ErrCode2 will be 2, etc. If you need codes starting at 2, either skip two values or offset iota (e.g., ErrCode1 = iota + 2).
inputs/dmesg/dmesg.go
Outdated
| for _, d := range dmesgs { | ||
| slist.PushFront(types.NewSample(inputName, "up", 1, nil)) | ||
| for _, e := range errorList { |
There was a problem hiding this comment.
up is pushed once per dmesg message inside the loop, producing many duplicate up=1 samples per Gather. Move the success up=1 emission outside the loop so it’s reported once per collection interval.
inputs/dmesg/dmesg_decode.go
Outdated
| Level uint64 // SYSLOG lvel | ||
| Facility uint64 // SYSLOG facility |
There was a problem hiding this comment.
Typo in comment: SYSLOG lvel → SYSLOG level.
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Signed-off-by: ruochen <wanxialianwei@gmail.com>
Signed-off-by: ruochen <wanxialianwei@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _, err := syscall.Read(int(fd), buf) | ||
| if err != nil { | ||
| syscallError = err | ||
| // EINVAL means buf is not enough, data would be truncated, but still can continue. | ||
| if !errors.Is(err, syscall.EINVAL) { | ||
| return true | ||
| } | ||
| } | ||
|
|
||
| msg := parseData(buf) |
There was a problem hiding this comment.
The read loop ignores the byte count returned by syscall.Read and always passes the full fixed-size buffer to parseData. That means parseData is parsing trailing zero bytes and can mis-handle end-of-line logic. Capture n and call parseData(buf[:n]) (and consider skipping parse on non-nil errors like EINVAL if the message is known-truncated).
| _, err := syscall.Read(int(fd), buf) | |
| if err != nil { | |
| syscallError = err | |
| // EINVAL means buf is not enough, data would be truncated, but still can continue. | |
| if !errors.Is(err, syscall.EINVAL) { | |
| return true | |
| } | |
| } | |
| msg := parseData(buf) | |
| n, err := syscall.Read(int(fd), buf) | |
| if err != nil { | |
| syscallError = err | |
| // EINVAL means buf is not enough, data would be truncated, but still can continue. | |
| if errors.Is(err, syscall.EINVAL) { | |
| // Skip parsing known-truncated data. | |
| continue | |
| } | |
| return true | |
| } | |
| if n <= 0 { | |
| continue | |
| } | |
| msg := parseData(buf[:n]) |
| msg.DeviceInfo = make(map[string]string, 2) | ||
| deviceInfo := bytes.Split(data[textEnd+1:dataLen-1], []byte("\n")) | ||
| for _, info := range deviceInfo { | ||
| if info[0] != ' ' { |
There was a problem hiding this comment.
There is a potential panic in device-info parsing: if info[0] != ' ' will panic if info is empty (e.g., consecutive newlines or an empty trailer slice). Guard len(info) > 0 before indexing.
| if info[0] != ' ' { | |
| if len(info) == 0 || info[0] != ' ' { |
| ) | ||
|
|
||
| type Msg struct { | ||
| Level uint64 // SYSLOG lvel |
There was a problem hiding this comment.
Spelling: comment says "SYSLOG lvel"; should be "SYSLOG level".
| Level uint64 // SYSLOG lvel | |
| Level uint64 // SYSLOG level |
| func parseData(data []byte) *Msg { | ||
| msg := Msg{} | ||
|
|
||
| dataLen := len(data) | ||
| prefixEnd := bytes.IndexByte(data, ';') | ||
| if prefixEnd == -1 { | ||
| return nil | ||
| } | ||
|
|
||
| for index, prefix := range bytes.Split(data[:prefixEnd], []byte(",")) { | ||
| switch index { | ||
| case 0: | ||
| val, _ := strconv.ParseUint(string(prefix), 10, 64) | ||
| msg.Level = val & levelMask | ||
| msg.Facility = val & (^levelMask) | ||
| case 1: | ||
| val, _ := strconv.ParseUint(string(prefix), 10, 64) | ||
| msg.Seq = val | ||
| case 2: | ||
| val, _ := strconv.ParseInt(string(prefix), 10, 64) | ||
| msg.TsUsec = val | ||
| case 3: | ||
| msg.IsFragment = prefix[0] != '-' | ||
| case 4: | ||
| msg.Caller = string(prefix) | ||
| } | ||
| } | ||
|
|
||
| textEnd := bytes.IndexByte(data, '\n') | ||
| if textEnd == -1 || textEnd <= prefixEnd { | ||
| return nil | ||
| } | ||
|
|
||
| msg.Text = string(data[prefixEnd+1 : textEnd]) | ||
| if textEnd == dataLen-1 { | ||
| return nil | ||
| } | ||
|
|
||
| msg.DeviceInfo = make(map[string]string, 2) | ||
| deviceInfo := bytes.Split(data[textEnd+1:dataLen-1], []byte("\n")) | ||
| for _, info := range deviceInfo { | ||
| if info[0] != ' ' { | ||
| continue | ||
| } | ||
|
|
||
| kv := bytes.Split(info, []byte("=")) | ||
| if len(kv) != 2 { | ||
| continue | ||
| } | ||
|
|
||
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) | ||
| } | ||
|
|
||
| return &msg | ||
| } |
There was a problem hiding this comment.
This new parser/keyword matching logic isn’t covered by tests. Since it’s easy to regress and interacts with tricky /dev/kmsg formatting (prefix parsing, no-device-info lines, empty segments), please add unit tests for parseData (at least: normal line without device info, line with device info, and malformed/empty segments).
| package dmesg | ||
|
|
||
| import ( | ||
| "bytes" | ||
| "errors" | ||
| "log" | ||
| "os" | ||
| "strconv" | ||
| "strings" | ||
| "syscall" | ||
|
|
||
| "flashcat.cloud/categraf/config" | ||
| "flashcat.cloud/categraf/inputs" | ||
| "flashcat.cloud/categraf/types" | ||
| ) |
There was a problem hiding this comment.
This input reads /dev/kmsg and uses Linux-only syscall flags (e.g., O_NONBLOCK). Without Linux build tags plus a !linux stub file, the repo will fail to compile on non-Linux platforms (and the blank import in agent/metrics_agent.go will also break builds). Please add //go:build linux to this file and add a corresponding dmesg_* !linux stub (see inputs/conntrack/conntrack.go and inputs/conntrack/conntrack_nolinux.go for the established pattern).
| if textEnd == dataLen-1 { | ||
| return nil | ||
| } | ||
|
|
||
| msg.DeviceInfo = make(map[string]string, 2) | ||
| deviceInfo := bytes.Split(data[textEnd+1:dataLen-1], []byte("\n")) | ||
| for _, info := range deviceInfo { | ||
| if info[0] != ' ' { | ||
| continue | ||
| } | ||
|
|
||
| kv := bytes.Split(info, []byte("=")) | ||
| if len(kv) != 2 { | ||
| continue | ||
| } | ||
|
|
||
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) |
There was a problem hiding this comment.
parseData returns nil when the log line ends right after the message text (textEnd == dataLen-1). That drops the common /dev/kmsg case where there is no device-info trailer, so keyword detection will miss most messages. This function should return the parsed Msg even when there is no device info section.
| if textEnd == dataLen-1 { | |
| return nil | |
| } | |
| msg.DeviceInfo = make(map[string]string, 2) | |
| deviceInfo := bytes.Split(data[textEnd+1:dataLen-1], []byte("\n")) | |
| for _, info := range deviceInfo { | |
| if info[0] != ' ' { | |
| continue | |
| } | |
| kv := bytes.Split(info, []byte("=")) | |
| if len(kv) != 2 { | |
| continue | |
| } | |
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) | |
| // Only parse device info if there is data after the message text. | |
| if textEnd < dataLen-1 { | |
| msg.DeviceInfo = make(map[string]string, 2) | |
| deviceInfo := bytes.Split(data[textEnd+1:dataLen-1], []byte("\n")) | |
| for _, info := range deviceInfo { | |
| if len(info) == 0 || info[0] != ' ' { | |
| continue | |
| } | |
| kv := bytes.Split(info, []byte("=")) | |
| if len(kv) != 2 { | |
| continue | |
| } | |
| msg.DeviceInfo[string(kv[0])] = string(kv[1]) | |
| } |
| val, _ := strconv.ParseInt(string(prefix), 10, 64) | ||
| msg.TsUsec = val | ||
| case 3: | ||
| msg.IsFragment = prefix[0] != '-' |
There was a problem hiding this comment.
There is a potential panic in parseData: msg.IsFragment = prefix[0] != '-' assumes the 4th prefix field is non-empty. If the field is empty/malformed, this will index out of range. Add a length check before accessing prefix[0] and treat empty as non-fragment (or return nil).
| msg.IsFragment = prefix[0] != '-' | |
| if len(prefix) > 0 { | |
| msg.IsFragment = prefix[0] != '-' | |
| } else { | |
| // Treat empty or malformed fragment field as non-fragment. | |
| msg.IsFragment = false | |
| } |
| _ "flashcat.cloud/categraf/inputs/dcgm" | ||
| _ "flashcat.cloud/categraf/inputs/disk" | ||
| _ "flashcat.cloud/categraf/inputs/diskio" | ||
| _ "flashcat.cloud/categraf/inputs/dmesg" |
There was a problem hiding this comment.
Blank-importing the dmesg input will break non-Linux builds unless the input package is guarded by Linux build tags and provides a //go:build !linux stub file. Please ensure the dmesg input follows the same cross-platform pattern as other Linux-only inputs (e.g., inputs/conntrack).
| _ "flashcat.cloud/categraf/inputs/dmesg" |
| msgs := make([]Msg, 0) | ||
|
|
||
| var syscallError error = nil | ||
| err := ins.conn.Read(func(fd uintptr) bool { | ||
| for { | ||
| buf := make([]byte, defaultBufSize) | ||
| _, err := syscall.Read(int(fd), buf) | ||
| if err != nil { | ||
| syscallError = err | ||
| // EINVAL means buf is not enough, data would be truncated, but still can continue. | ||
| if !errors.Is(err, syscall.EINVAL) { | ||
| return true | ||
| } | ||
| } | ||
|
|
||
| msg := parseData(buf) | ||
| if msg == nil { | ||
| continue | ||
| } | ||
| msgs = append(msgs, *msg) | ||
| } | ||
| }) | ||
|
|
||
| // EAGAIN means no more data, should be treated as normal. | ||
| if syscallError != nil && !errors.Is(syscallError, syscall.EAGAIN) { | ||
| err = syscallError | ||
| } | ||
|
|
||
| if err != nil { | ||
| log.Println("Error reading from /dev/kmsg:", err) | ||
| slist.PushFront(types.NewSample(inputName, "up", 0, nil)) | ||
| return | ||
| } | ||
|
|
||
| slist.PushFront(types.NewSample(inputName, "up", 1, nil)) | ||
| for _, d := range msgs { | ||
| for keyword := range ins.errorList { | ||
| if strings.Contains(d.Text, keyword) { | ||
| ins.errorList[keyword]++ | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The Gather loop allocates a new 16KB buffer and appends Msg structs for every kmsg line, then does a second pass just to count keyword hits. This creates avoidable allocations and memory growth under heavy kernel logging. Consider reusing a single buffer (or a pool) and incrementing counters as each message is parsed, without storing all messages in msgs.
No description provided.