fix(perf_metrics): fix ambiguous upsert column reference on PG and harden flush retry #17
Merged
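The column names on the right-hand side of ON CONFLICT DO UPDATE in UpsertPerfMetric were not table-qualified. PostgreSQL (>=14.4) treats them as ambiguous between the target row and the excluded row (column reference is ambiguous), so every write that takes the DO UPDATE path fails. With multiple instances or write sources running concurrently, every conflicting bucket errors out and is retried indefinitely.
- model/perf_metric.go: qualify all 7 accumulated columns with the perf_metrics. table name (works on SQLite/MySQL/PostgreSQL), so concurrent upserts accumulate atomically and correctly.
- pkg/perf_metrics/flush.go: add a 24h age fallback to the flush failure branch; stuck buckets past that age are dropped, so memory no longer grows without bound and the log is no longer flooded.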
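Background / symptoms
Over two weeks, the production RDS (PostgreSQL 14.17) error log accumulated about 138,000 entries of the same error (column reference is ambiguous).
These account for 99.998% of all errors. The remainder are 2 occurrences of cached plan must not change result type and 1 username unique-constraint violation, all sporadic and harmless.
Root cause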
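The column names on the right-hand side of ON CONFLICT DO UPDATE in UpsertPerfMetric are not table-qualified (a bare generation_ms + ?). PostgreSQL >=14.4 treats such an unqualified reference as ambiguous between the target row and the excluded row (column reference is ambiguous), so every write that takes the DO UPDATE path fails.
- [...] INSERT, which does not trigger the error; once the row already exists (multiple write sources running concurrently), the statement takes the DO UPDATE path and always fails.
- [...] SQL_DSN, forming a second write source that competes with the production instance for the same (model,group,bucket) rows, so conflicts are continuous.
- The failure branch in flush.go (put the bucket back in memory, retry indefinitely, never delete the bucket) amplifies a single conflict into a continuous stream of errors and memory that only grows.
Impact scope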
Changes
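A minimal sketch of the table-qualified SET list, assuming the statement is assembled as a plain string with ? placeholders. `buildUpsertSet` and the `request_count` column are hypothetical; only `perf_metrics` and `generation_ms` appear in this PR. The left-hand side of each assignment stays bare, while the right-hand side names the table, so the reference can only mean the existing row.

```go
package main

import (
	"fmt"
	"strings"
)

// buildUpsertSet returns the SET list for an ON CONFLICT DO UPDATE clause.
// Each accumulated column is read through the table name on the right-hand
// side, which removes the ambiguity with the excluded row.
func buildUpsertSet(table string, cols []string) string {
	parts := make([]string, 0, len(cols))
	for _, c := range cols {
		// Before the fix the right-hand side was the bare "c + ?".
		parts = append(parts, fmt.Sprintf("%s = %s.%s + ?", c, table, c))
	}
	return strings.Join(parts, ", ")
}

func main() {
	// request_count is a made-up column name used only for illustration.
	set := buildUpsertSet("perf_metrics", []string{"generation_ms", "request_count"})
	fmt.Println(set)
	// prints: generation_ms = perf_metrics.generation_ms + ?, request_count = perf_metrics.request_count + ?
}
```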
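- model/perf_metric.go: qualify all 7 accumulated columns with the perf_metrics. table name. This works on SQLite/MySQL/PostgreSQL (Rule 2), and concurrent upserts accumulate atomically and correctly, so both single-instance and multi-node (CCE) deployments are safe. This fixes the root cause.
- pkg/perf_metrics/flush.go: add a 24h age fallback to the flush failure branch (reusing the existing threshold). Stuck buckets past that age are dropped, which avoids the memory leak and the endless log flood. Failures within 24h are still put back and retried, so no data is lost and brief DB hiccups are tolerated.
Verification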
- go build ./... : full build passes
- go vet ./model/ ./pkg/perf_metrics/ : passes
- perf_metrics.<col> + N accumulation works correctly
Release notes