diff --git a/k-shortest-path/.gitignore b/k-shortest-path/.gitignore new file mode 100644 index 0000000..fde2722 --- /dev/null +++ b/k-shortest-path/.gitignore @@ -0,0 +1,6 @@ +datasets/ +results/ +data-p/ +bulk-out/ +*.rdf.gz +*.tar.zst diff --git a/k-shortest-path/README.md b/k-shortest-path/README.md new file mode 100644 index 0000000..d90650d --- /dev/null +++ b/k-shortest-path/README.md @@ -0,0 +1,154 @@ +# k-shortest-path + +Local benchmark harness for comparing the four open Dgraph shortest-path +PRs that target [issue #9577](https://github.com/dgraph-io/dgraph/issues/9577) +(k-shortest-path returns incorrect paths when `maxfrontiersize` is hit). + +PRs under comparison: + +| PR | Family | Approach | +|---|---|---| +| #9576 | Backpressure | `lowWatermark = 0.6 × maxFrontierSize`; stop expanding when reached. Keeps the buggy `pq.Pop()`. | +| #9599 | Fix eviction | `TrimToMax()` (scan + `heap.Remove`), **push-then-trim** + regression unit test. | +| #9607 | Fix eviction | `removeMax()` (scan + `heap.Remove`), check-then-trim. | +| #9678 | Fix eviction | `removeMax()` + push-then-trim + `MaxFrontierSize > 0` guard; 24 unit + 10 integration tests. | + +This harness lives **outside** the dgraph repo on purpose — it carries a +multi-GB LDBC Graphalytics dataset that has no business in CI. + +The complementary CI-tier benchmark is in +`dgraph/systest/shortest-path/benchmark_test.go` and is wired into +`ci-dgraph-integration2-tests.yml`. 
+ +## Layout + +``` +shortest-path-bench/ +├── go.mod +├── docker-compose.yaml # Alpha + Zero for a single-node cluster +├── scripts/download-ldbc.sh # fetch a Graphalytics datagen dataset + reference +├── cmd/ +│ ├── convert/main.go # LDBC .v/.e → graph.rdf.gz + graph.schema (for `dgraph bulk`) +│ ├── bench/main.go # --mode=correctness | --mode=perf | --mode=kshortest +│ └── validate/main.go # standalone diff vs LDBC SSSP reference (0.01% ε) +├── internal/ +│ ├── ldbc/ # parsers for .v, .e, .properties, validation/.SSSP +│ ├── client/ # dgo client wrapper +│ └── stats/ # latency histogram, p50/p95/p99 +└── results/ # per-PR JSON outputs (gitignored) +``` + +The loader pipeline is built around `dgraph bulk`, not live-load: at the L +scale (~34M edges) live-load would take hours. `cmd/convert` emits a +blank-node RDF file + schema that `dgraph bulk` consumes in one pass. + +## Prerequisites + +- Docker + Docker Compose +- Go 1.26+ +- ~10 GB free disk for a single LDBC datagen-7_5-fb dataset +- A built `dgraph` binary on `$PATH` (the docker-compose pulls the standard + image; build a custom one per PR — see below) + +## Per-PR workflow + +Each PR run is one full pass through these steps. Results land in +`results//` and are aggregated by hand for now (step 6). + +```bash +# 0. one-time: pull and extract a Graphalytics dataset +./scripts/download-ldbc.sh datagen-7_5-fb + +# 1. check out the PR you want to benchmark +cd ~/workspace/dgraph +git fetch origin pull/9607/head:pr-9607 && git checkout pr-9607 +make docker-image # tags dgraph/dgraph:local + +# 2. convert LDBC files to Dgraph RDF + schema +cd ~/workspace/shortest-path-bench +go run ./cmd/convert -dataset ./datasets/datagen-7_5-fb +# → writes datasets/datagen-7_5-fb/dgraph/{graph.rdf.gz, graph.schema} + +# 3. 
bulk-load into a Zero, producing the alpha data directory +docker compose up -d zero +dgraph bulk \ + -f datasets/datagen-7_5-fb/dgraph/graph.rdf.gz \ + -s datasets/datagen-7_5-fb/dgraph/graph.schema \ + --zero localhost:5080 \ + --out datasets/datagen-7_5-fb/dgraph/bulk-out +docker compose down + +# 4. bring up Alpha with the bulk-loaded data +cp -r datasets/datagen-7_5-fb/dgraph/bulk-out/0/p ./data-p +DGRAPH_IMAGE=dgraph/dgraph:local DATA_DIR=$(pwd)/data-p docker compose up -d + +# 5a. correctness — sample N targets, diff against LDBC reference +go run ./cmd/bench \ + -mode correctness \ + -dataset ./datasets/datagen-7_5-fb \ + -alpha localhost:9080 \ + -targets 1000 \ + -maxfrontier 1000 \ + -out results/pr-9607/correctness.json + +# 5b. perf — sampled (src,dst) pairs at varying concurrency +go run ./cmd/bench \ + -mode perf \ + -dataset ./datasets/datagen-7_5-fb \ + -alpha localhost:9080 \ + -pairs 10000 \ + -concurrency 16 \ + -maxfrontier 1000 \ + -out results/pr-9607/perf.json + +# 6. aggregate the per-PR JSON files into a verdict matrix by hand or with +# a small jq/python script over results/*/*.json — left intentionally +# un-automated for now since the comparison is a one-shot exercise. +``` + +## Output + +### correctness.json +```jsonc +{ + "dataset": "datagen-7_5-fb", + "source_vertex": 1, + "targets_sampled": 1000, + "epsilon": 0.0001, + "passed": 982, + "failed": 18, + "infinity_mismatches": 0, + "first_failures": [ + {"vertex": 12345, "expected": 4.32, "got": 4.71, "rel_err": 0.0903} + ] +} +``` + +### perf.json +```jsonc +{ + "dataset": "datagen-7_5-fb", + "pairs": 10000, + "concurrency": 16, + "max_frontier": 1000, + "latency_ms": {"p50": 4.1, "p95": 18.3, "p99": 42.7}, + "qps": 1820, + "heap_max_mb": 612, + "errors": 0 +} +``` + +## How to read the verdict + +The interesting comparisons: + +- **Family A (#9599/#9607/#9678)** should all pass correctness; if any fails, + that PR has a latent bug. 
Within family A, p50/p95 diffs are mostly noise + unless the push-vs-check ordering matters at higher cap utilisation. +- **Family B (#9576)** is the discriminator: backpressure throttles + exploration, so it may show **better latency at the cap** but also a + higher correctness-failure count. The verdict depends on whether the + user-visible promise of `shortest` is "fastest answer" or "optimal answer". + +If correctness pass-rate is < 99% on any PR for a workload that real users +hit, that's a merge blocker regardless of perf. diff --git a/k-shortest-path/cmd/bench/kshortest.go b/k-shortest-path/cmd/bench/kshortest.go new file mode 100644 index 0000000..05bf72a --- /dev/null +++ b/k-shortest-path/cmd/bench/kshortest.go @@ -0,0 +1,373 @@ +package main + +import ( + "context" + "errors" + "log" + "math" + "math/rand" + "sort" + "strconv" + "strings" + "time" + + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/client" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/compare" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/ldbc" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/oracle" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/stats" +) + +// kshortest mode is the top-k correctness discriminator. For each sampled +// target it compares Dgraph's returned path-cost VECTOR against the gonum Yen +// oracle's, across a sweep of maxfrontiersize values. Comparing sorted costs +// (not path identity) is tie-robust; the oracle is validated (internal/oracle +// tests) and the loopless-not-disjoint semantics confirmed (cmd/handprobe). +// +// The output JSON keeps full per-target detail (oracle vs Dgraph vectors, +// verdict, latency, self-consistency, the SSSP distance) so any aggregate can +// be re-derived and any individual failure audited later. 
It also splits +// timeouts out of the correctness denominator: a query that never returned is +// NOT "wrong", so we report correct/returned separately from correct/targets. +// +// Run once per PR binary against its alpha; aggregate the JSONs across PRs. + +// kTarget is the full record of one (target, frontier) comparison. +type kTarget struct { + GID int64 `json:"gid"` + SSSPDist float64 `json:"sssp_dist"` + DstUID string `json:"dst_uid"` + OracleWeights []float64 `json:"oracle_weights"` + DgraphWeights []float64 `json:"dgraph_weights"` + PathCount int `json:"path_count"` + Verdict string `json:"verdict"` // ok|count_mismatch|weight_mismatch|timeout|error + WorstRelErr float64 `json:"worst_rel_err"` + SelfConsistent bool `json:"self_consistent"` + Loopless bool `json:"loopless"` + MaxWeightErr float64 `json:"max_weight_err"` + LatencyMS float64 `json:"latency_ms"` + Err string `json:"err,omitempty"` +} + +type kFrontierStat struct { + MaxFrontier int `json:"max_frontier"` // 0 = unlimited + Targets int `json:"targets"` + Returned int `json:"returned"` // queries that came back (not timeout/other error) + Correct int `json:"correct"` + // CorrectPct is correct/targets (timeouts counted as failures — pessimistic). + CorrectPct float64 `json:"correct_pct"` + // CorrectOfReturnedPct is correct/returned — the honest correctness among + // queries that actually finished. This is the number to compare across PRs. + CorrectOfReturnedPct float64 `json:"correct_of_returned_pct"` + CountMismatch int `json:"count_mismatch"` + WeightMismatch int `json:"weight_mismatch"` + SelfInconsistent int `json:"self_inconsistent"` + Timeouts int `json:"timeouts"` + OtherErrors int `json:"other_errors"` + Latency stats.Summary `json:"latency"` + TargetResults []kTarget `json:"target_results"` +} + +type kResult struct { + Label string `json:"label"` // e.g. 
"pr-9599@997d5dcb" + Dataset string `json:"dataset"` + Source int64 `json:"source_vertex"` + NumPaths int `json:"num_paths"` + EdgePred string `json:"edge_pred"` + MaxFrontierSweep []int `json:"max_frontier_sweep"` + Tolerance float64 `json:"tolerance"` + Seed int64 `json:"seed"` + BandLo float64 `json:"band_lo"` + BandHi float64 `json:"band_hi"` + GraphNodes int `json:"graph_nodes"` + Directed bool `json:"directed"` + CandidateTargets int `json:"candidate_targets"` + QualifiedTargets int `json:"qualified_targets"` + StartedUTC string `json:"started_utc"` + Frontiers []kFrontierStat `json:"frontiers"` +} + +type kCand struct { + gid int64 + dist float64 +} + +func runKShortest(ctx context.Context, cfg config, c *client.Client, ds *ldbc.Dataset, uidMap map[int64]string) { + numPaths := cfg.numPaths + if numPaths < 2 { + log.Printf("[kshortest] numpaths=%d; the top-k comparison is most meaningful at >=2 (set -numpaths 2)", numPaths) + } + frontiers := parseFrontiers(cfg.frontiers) + + source := ds.Properties.SourceVertex + if _, ok := uidMap[source]; !ok { + log.Fatalf("source vertex %d not in graph", source) + } + + // Build the oracle graph from the same .e stream + directedness the + // converter used, so the oracle traverses exactly what Dgraph traverses. + log.Printf("[kshortest] building oracle graph from %s (directed=%v)...", ds.EdgeFile, ds.Properties.Directed) + start := time.Now() + g := oracle.New() + directed := ds.Properties.Directed + if err := ldbc.ScanEdges(ds.EdgeFile, func(e ldbc.Edge) error { + if aerr := g.AddEdge(e.Src, e.Dst, e.Weight); aerr != nil { + return aerr + } + if !directed { + return g.AddEdge(e.Dst, e.Src, e.Weight) + } + return nil + }); err != nil { + log.Fatalf("build oracle graph: %v", err) + } + log.Printf("[kshortest] oracle graph: %d nodes in %s", g.Nodes(), time.Since(start).Round(time.Millisecond)) + + // Select candidate targets from a distance band, not uniformly at random. 
+ // Uniform random picks targets thousands of hops away, where the numpaths=2 + // frontier explodes (queries time out) and Yen precompute is slow. A + // near/moderate band keeps both fast while still exercising eviction — the + // regime where the bug lives. + candidates := bandedTargets(ds.SSSPRefFile, uidMap, source, cfg.bandLo, cfg.bandHi, cfg.targets, cfg.seed) + type pair struct { + gid int64 + dist float64 + uid string + oracle []float64 + } + var qualified []pair + log.Printf("[kshortest] precomputing oracle top-%d for %d candidate targets...", numPaths, len(candidates)) + preStart := time.Now() + for i, cand := range candidates { + vec, err := g.TopK(source, cand.gid, numPaths) + if err != nil || len(vec) < 2 { + continue + } + qualified = append(qualified, pair{gid: cand.gid, dist: cand.dist, uid: uidMap[cand.gid], oracle: vec}) + if (i+1)%50 == 0 { + log.Printf("[kshortest] oracle precompute %d/%d (%d qualified) elapsed=%s", + i+1, len(candidates), len(qualified), time.Since(preStart).Round(time.Second)) + } + } + if len(qualified) == 0 { + log.Fatal("[kshortest] no targets with >=2 oracle paths — try more -targets or a different -source") + } + log.Printf("[kshortest] %d/%d targets qualified (>=2 paths) in %s", + len(qualified), len(candidates), time.Since(preStart).Round(time.Second)) + + res := kResult{ + Label: cfg.label, + Dataset: ds.Name, + Source: source, + NumPaths: numPaths, + EdgePred: cfg.edgePred, + MaxFrontierSweep: frontiers, + Tolerance: cfg.tol, + Seed: cfg.seed, + BandLo: cfg.bandLo, + BandHi: cfg.bandHi, + GraphNodes: g.Nodes(), + Directed: directed, + CandidateTargets: len(candidates), + QualifiedTargets: len(qualified), + StartedUTC: time.Now().UTC().Format(time.RFC3339), + } + + srcUID := uidMap[source] + // Sweep largest frontier first: if a binary OOMs/hangs at a big frontier, + // you learn it before spending time on the cheaper ones. 
+ for _, fr := range frontiers { + stat := kFrontierStat{MaxFrontier: fr, Targets: len(qualified)} + rec := stats.New() + swStart := time.Now() + for i, p := range qualified { + sr, err := c.Shortest(ctx, client.ShortestOptions{ + SrcUID: srcUID, + DstUID: p.uid, + EdgePred: cfg.edgePred, + NumPaths: numPaths, + MaxFrontier: fr, + Timeout: cfg.timeout, + }) + tr := kTarget{ + GID: p.gid, SSSPDist: p.dist, DstUID: p.uid, + OracleWeights: p.oracle, + LatencyMS: float64(sr.Latency.Microseconds()) / 1000.0, + } + if err != nil { + rec.RecordError() + kind := classifyErr(err) + tr.Verdict = kind + tr.Err = err.Error() + if kind == "timeout" { + stat.Timeouts++ + } else { + stat.OtherErrors++ + } + stat.TargetResults = append(stat.TargetResults, tr) + continue + } + rec.Record(sr.Latency) + stat.Returned++ + tr.DgraphWeights = sr.Weights + tr.PathCount = sr.PathCount + tr.SelfConsistent = sr.SelfConsistent + tr.Loopless = sr.Loopless + tr.MaxWeightErr = sr.MaxWeightErr + if !sr.SelfConsistent || !sr.Loopless { + stat.SelfInconsistent++ + } + r := compare.Vectors(p.oracle, sr.Weights, cfg.tol) + tr.WorstRelErr = r.WorstRelErr + switch r.Verdict { + case compare.OK: + stat.Correct++ + tr.Verdict = "ok" + case compare.CountMismatch: + stat.CountMismatch++ + tr.Verdict = "count_mismatch" + case compare.WeightMismatch: + stat.WeightMismatch++ + tr.Verdict = "weight_mismatch" + } + stat.TargetResults = append(stat.TargetResults, tr) + // Heartbeat inside a frontier: a 100-target phase with 60s timeouts + // can run quiet for 15+ min otherwise, which looks like a hang. 
+ if (i+1)%25 == 0 { + log.Printf("[kshortest] frontier=%s progress %d/%d (correct=%d wt_mm=%d cnt_mm=%d self_bad=%d timeout=%d)", + frontierLabel(fr), i+1, len(qualified), stat.Correct, stat.WeightMismatch, + stat.CountMismatch, stat.SelfInconsistent, stat.Timeouts) + } + } + stat.Latency = rec.Summarize(time.Since(swStart)) + if stat.Targets > 0 { + stat.CorrectPct = 100 * float64(stat.Correct) / float64(stat.Targets) + } + if stat.Returned > 0 { + stat.CorrectOfReturnedPct = 100 * float64(stat.Correct) / float64(stat.Returned) + } + res.Frontiers = append(res.Frontiers, stat) + log.Printf("[kshortest] frontier=%-9s correct=%d/%d returned=%d (%.1f%% of returned) wt_mm=%d cnt_mm=%d self_bad=%d timeout=%d err=%d p50=%s p95=%s", + frontierLabel(fr), stat.Correct, stat.Targets, stat.Returned, stat.CorrectOfReturnedPct, + stat.WeightMismatch, stat.CountMismatch, stat.SelfInconsistent, stat.Timeouts, stat.OtherErrors, + stat.Latency.P50.Round(time.Millisecond), stat.Latency.P95.Round(time.Millisecond)) + // Persist after EACH frontier so a kill/drop mid-branch keeps the + // frontiers done so far, instead of losing the whole branch. + writeJSON(cfg.out, res) + } + + printKTable(res) +} + +// classifyErr distinguishes a query timeout (the binary didn't terminate within +// the deadline) from any other error. A timeout is NOT a wrong answer, so it's +// reported separately rather than folded into the correctness denominator. 
+func classifyErr(err error) string { + if err == nil { + return "" + } + if errors.Is(err, context.DeadlineExceeded) { + return "timeout" + } + s := strings.ToLower(err.Error()) + if strings.Contains(s, "deadline") || strings.Contains(s, "timeout") { + return "timeout" + } + return "error" +} + +func printKTable(res kResult) { + log.Printf("[kshortest] === %s %s source=%d numpaths=%d tol=%g qualified=%d ===", + res.Label, res.Dataset, res.Source, res.NumPaths, res.Tolerance, res.QualifiedTargets) + log.Printf("[kshortest] %-9s | %-12s | %-8s | %-6s | %-6s | %-8s | %-8s | %-7s | %-8s", + "frontier", "correct/ret", "ret%", "wt_mm", "cnt_mm", "self_bad", "timeout", "err", "p95") + for _, s := range res.Frontiers { + log.Printf("[kshortest] %-9s | %4d/%-7d | %7.1f%% | %-6d | %-6d | %-8d | %-8d | %-7d | %-8s", + frontierLabel(s.MaxFrontier), s.Correct, s.Returned, s.CorrectOfReturnedPct, + s.WeightMismatch, s.CountMismatch, s.SelfInconsistent, s.Timeouts, s.OtherErrors, + s.Latency.P95.Round(time.Millisecond)) + } +} + +func frontierLabel(fr int) string { + if fr <= 0 { + return "unlimited" + } + return strconv.Itoa(fr) +} + +func parseFrontiers(s string) []int { + var out []int + for _, tok := range strings.Split(s, ",") { + tok = strings.TrimSpace(tok) + if tok == "" { + continue + } + n, err := strconv.Atoi(tok) + if err != nil { + log.Fatalf("bad -frontiers value %q: %v", tok, err) + } + out = append(out, n) + } + if len(out) == 0 { + log.Fatal("-frontiers produced no values") + } + return out +} + +// bandedTargets selects up to n target vertices whose SSSP reference distance +// falls in the percentile window [bandLo, bandHi] of the distance-sorted +// reachable set. This avoids the far-target frontier explosion (which times out +// queries and makes Yen precompute slow) while still hitting multi-hop targets +// where eviction — and the bug — is exercised. Deterministic for a given seed. 
+func bandedTargets(ssspFile string, uidMap map[int64]string, source int64, bandLo, bandHi float64, n int, seed int64) []kCand { + ref, err := ldbc.ReadSSSP(ssspFile) + if err != nil { + log.Fatalf("[kshortest] banded target selection needs the SSSP reference (%s): %v", ssspFile, err) + } + reach := make([]kCand, 0, len(ref)) + for gid, d := range ref { + if gid == source || d <= 0 || math.IsInf(d, +1) { + continue + } + if _, ok := uidMap[gid]; ok { + reach = append(reach, kCand{gid, d}) + } + } + if len(reach) == 0 { + log.Fatal("[kshortest] no reachable targets in SSSP reference") + } + sort.Slice(reach, func(i, j int) bool { + if reach[i].dist != reach[j].dist { + return reach[i].dist < reach[j].dist + } + return reach[i].gid < reach[j].gid // stable tie-break for determinism + }) + + lo := int(bandLo * float64(len(reach))) + hi := int(bandHi * float64(len(reach))) + lo = max(lo, 0) + hi = min(hi, len(reach)) + if lo >= hi { // degenerate band — fall back to a single rank + lo = min(lo, len(reach)-1) + hi = lo + 1 + } + band := reach[lo:hi] + log.Printf("[kshortest] target band [%.4f,%.4f] = ranks %d..%d of %d reachable (dist %.0f..%.0f)", + bandLo, bandHi, lo, hi, len(reach), band[0].dist, band[len(band)-1].dist) + + idx := make([]int, len(band)) + for i := range idx { + idx[i] = i + } + rng := rand.New(rand.NewSource(seed)) + rng.Shuffle(len(idx), func(i, j int) { idx[i], idx[j] = idx[j], idx[i] }) + if n > 0 && n < len(idx) { + idx = idx[:n] + } + out := make([]kCand, 0, len(idx)) + for _, i := range idx { + out = append(out, band[i]) + } + return out +} diff --git a/k-shortest-path/cmd/bench/main.go b/k-shortest-path/cmd/bench/main.go new file mode 100644 index 0000000..f0f9634 --- /dev/null +++ b/k-shortest-path/cmd/bench/main.go @@ -0,0 +1,502 @@ +// Command bench is the workhorse: three modes, --mode=correctness, +// --mode=perf and --mode=kshortest (see kshortest.go), all running against +// a Dgraph cluster already populated by `dgraph bulk` (see cmd/convert). 
+// +// correctness mode: +// +// full SSSP-style sweep from the dataset's declared source vertex to a +// sampled set of targets, distances diffed against the LDBC reference +// output with 0.01% epsilon. The discriminator for any PR that throttles +// or evicts incorrectly under maxfrontiersize. +// +// perf mode: +// +// N (src,dst) pairs sampled uniformly from the vertex set, dispatched +// through a worker pool. Records p50/p95/p99 latency + QPS. +// +// Output is one JSON file per run, intended to be aggregated across PRs +// with `cmd/compare` (TODO) or eyeballed directly. +package main + +import ( + "context" + "encoding/json" + "flag" + "log" + "math" + "math/rand" + "os" + "path/filepath" + "runtime" + "sort" + "sync" + "time" + + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/client" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/ldbc" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/stats" +) + +const epsilon = 0.0001 // LDBC's 0.01% tolerance. 
+ +type config struct { + mode string + datasetDir string + alpha string + out string + maxFrontier int + numPaths int + edgePred string + timeout time.Duration + seed int64 + + // correctness flags + targets int + + // perf flags + pairs int + concurrency int + + // kshortest flags + frontiers string + tol float64 + bandLo float64 + bandHi float64 + label string + + // uid-map cache + uidMapCache string + refreshUIDMap bool +} + +func main() { + cfg := config{} + flag.StringVar(&cfg.mode, "mode", "perf", "correctness | perf | kshortest") + flag.StringVar(&cfg.datasetDir, "dataset", "", "path to extracted LDBC dataset") + flag.StringVar(&cfg.alpha, "alpha", "localhost:9080", "Dgraph alpha gRPC address") + flag.StringVar(&cfg.out, "out", "", "output JSON path (defaults to results/.json)") + flag.IntVar(&cfg.maxFrontier, "maxfrontier", 0, "maxfrontiersize for shortest queries (0 = unset)") + flag.IntVar(&cfg.numPaths, "numpaths", 1, "numpaths for shortest queries") + flag.StringVar(&cfg.edgePred, "edge", "connected", "DQL edge predicate for shortest") + flag.DurationVar(&cfg.timeout, "timeout", 60*time.Second, "per-query timeout") + flag.Int64Var(&cfg.seed, "seed", 1, "RNG seed for sampling") + flag.IntVar(&cfg.targets, "targets", 1000, "[correctness] number of target vertices to sample (-1 = all)") + flag.IntVar(&cfg.pairs, "pairs", 5000, "[perf] number of (src,dst) pairs") + flag.IntVar(&cfg.concurrency, "concurrency", runtime.GOMAXPROCS(0), "worker count (applies to both correctness and perf modes)") + flag.StringVar(&cfg.uidMapCache, "uidmap-cache", "", "path to cached uid map TSV (defaults to /dgraph/uid-map.tsv). 
Reused across PR-build swaps to skip the slow refetch.") + flag.BoolVar(&cfg.refreshUIDMap, "refresh-uidmap", false, "ignore the uid map cache and refetch from Dgraph (do this after a fresh bulk-load)") + flag.StringVar(&cfg.frontiers, "frontiers", "0,10000,5000,1000,500,200,100", "[kshortest] comma-separated maxfrontiersize sweep; 0 = unlimited") + flag.Float64Var(&cfg.tol, "tol", 0.0001, "[kshortest] relative tolerance for weight-vector comparison") + flag.Float64Var(&cfg.bandLo, "band-lo", 0.0005, "[kshortest] low edge of the SSSP-distance band to draw targets from (fraction of distance-sorted reachable vertices)") + flag.Float64Var(&cfg.bandHi, "band-hi", 0.01, "[kshortest] high edge of the SSSP-distance band; far targets blow up the numpaths=2 frontier and time out, so keep this small") + flag.StringVar(&cfg.label, "label", "", "[kshortest] free-form label embedded in the result JSON (e.g. branch@sha) so each file is self-identifying") + flag.Parse() + + if cfg.datasetDir == "" { + log.Fatal("-dataset is required") + } + if cfg.out == "" { + cfg.out = filepath.Join("results", cfg.mode+".json") + } + if err := os.MkdirAll(filepath.Dir(cfg.out), 0o755); err != nil { + log.Fatal(err) + } + + ds, err := ldbc.LoadDataset(cfg.datasetDir) + if err != nil { + log.Fatalf("load dataset: %v", err) + } + + c, err := client.Open(cfg.alpha) + if err != nil { + log.Fatalf("open client: %v", err) + } + defer c.Close() + + ctx := context.Background() + if cfg.uidMapCache == "" { + cfg.uidMapCache = filepath.Join(cfg.datasetDir, "dgraph", "uid-map.tsv") + } + + var uidMap map[int64]string + if !cfg.refreshUIDMap { + if m, lerr := client.LoadUIDMap(cfg.uidMapCache); lerr == nil { + log.Printf("[bench] loaded uid map from cache %s: %d entries", cfg.uidMapCache, len(m)) + uidMap = m + } + } + if uidMap == nil { + log.Printf("[bench] fetching graphalytics_id → uid map from Dgraph (will cache to %s)...", cfg.uidMapCache) + start := time.Now() + m, ferr := c.FetchUIDMap(ctx, 10000) + 
if ferr != nil { + log.Fatalf("uid map: %v", ferr) + } + log.Printf("[bench] uid map: %d entries in %s", len(m), time.Since(start).Round(time.Millisecond)) + uidMap = m + // Never cache an empty map: a transient "alpha serving no data" would + // otherwise poison the cache and make every subsequent branch fail. + if len(uidMap) == 0 { + log.Fatal("uid map empty — alpha has no graphalytics_id nodes (cluster not serving the bulk p/?). NOT caching the empty result; fix the cluster and retry.") + } + if err := os.MkdirAll(filepath.Dir(cfg.uidMapCache), 0o755); err == nil { + if serr := client.SaveUIDMap(cfg.uidMapCache, uidMap); serr != nil { + log.Printf("[bench] warning: failed to save uid map cache to %s: %v", cfg.uidMapCache, serr) + } else { + log.Printf("[bench] saved uid map cache to %s", cfg.uidMapCache) + } + } + } + if len(uidMap) == 0 { + log.Fatal("uid map empty — did you run cmd/convert and dgraph bulk?") + } + + switch cfg.mode { + case "correctness": + runCorrectness(ctx, cfg, c, ds, uidMap) + case "perf": + runPerf(ctx, cfg, c, uidMap) + case "kshortest": + runKShortest(ctx, cfg, c, ds, uidMap) + default: + log.Fatalf("unknown mode %q", cfg.mode) + } +} + +// ----------------------------------------------------------------------------- +// correctness mode +// ----------------------------------------------------------------------------- + +type correctnessFailure struct { + Vertex int64 + Expected float64 + Got float64 + RelErr float64 +} + +func (f correctnessFailure) MarshalJSON() ([]byte, error) { + type out struct { + Vertex int64 `json:"vertex"` + Expected any `json:"expected"` + Got any `json:"got"` + RelErr float64 `json:"rel_err,omitempty"` + } + return json.Marshal(out{ + Vertex: f.Vertex, Expected: jsonDist(f.Expected), Got: jsonDist(f.Got), RelErr: f.RelErr, + }) +} + +func jsonDist(d float64) any { + if math.IsInf(d, 1) { + return "Infinity" + } + if math.IsInf(d, -1) { + return "-Infinity" + } + return d +} + +type correctnessResult struct 
{ + Dataset string `json:"dataset"` + SourceVertex int64 `json:"source_vertex"` + MaxFrontier int `json:"max_frontier"` + Concurrency int `json:"concurrency"` + Sampled int `json:"targets_sampled"` + Epsilon float64 `json:"epsilon"` + Passed int `json:"passed"` + Failed int `json:"failed"` + InfMismatch int `json:"infinity_mismatches"` + QueryErrors int `json:"query_errors"` + Latency stats.Summary `json:"latency"` + FirstFails []correctnessFailure `json:"first_failures,omitempty"` +} + +func runCorrectness(ctx context.Context, cfg config, c *client.Client, ds *ldbc.Dataset, uidMap map[int64]string) { + if _, err := os.Stat(ds.SSSPRefFile); err != nil { + log.Fatalf("reference SSSP file missing at %s — extract validation/ archive first", ds.SSSPRefFile) + } + ref, err := ldbc.ReadSSSP(ds.SSSPRefFile) + if err != nil { + log.Fatalf("read reference: %v", err) + } + srcUID, ok := uidMap[ds.Properties.SourceVertex] + if !ok { + log.Fatalf("source vertex %d not in graph", ds.Properties.SourceVertex) + } + + targets := pickTargets(ref, uidMap, ds.Properties.SourceVertex, cfg.targets, cfg.seed) + log.Printf("[correctness] source=%d targets=%d maxfrontier=%d concurrency=%d", + ds.Properties.SourceVertex, len(targets), cfg.maxFrontier, cfg.concurrency) + + rec := stats.New() + res := correctnessResult{ + Dataset: ds.Name, + SourceVertex: ds.Properties.SourceVertex, + MaxFrontier: cfg.maxFrontier, + Concurrency: cfg.concurrency, + Sampled: len(targets), + Epsilon: epsilon, + } + + type queryResult struct { + tgt int64 + sr client.ShortestResult + err error + } + + jobs := make(chan int64, cfg.concurrency*2) + out := make(chan queryResult, cfg.concurrency*2) + var wg sync.WaitGroup + for w := 0; w < cfg.concurrency; w++ { + wg.Add(1) + go func() { + defer wg.Done() + for tgt := range jobs { + dstUID := uidMap[tgt] + sr, err := c.Shortest(ctx, client.ShortestOptions{ + SrcUID: srcUID, + DstUID: dstUID, + EdgePred: cfg.edgePred, + NumPaths: cfg.numPaths, + MaxFrontier: 
cfg.maxFrontier, + Timeout: cfg.timeout, + }) + out <- queryResult{tgt: tgt, sr: sr, err: err} + } + }() + } + go func() { + for _, t := range targets { + jobs <- t + } + close(jobs) + wg.Wait() + close(out) + }() + + startAll := time.Now() + lastLog := startAll + const logEvery = 10 + done := 0 + for qr := range out { + done++ + expected := ref[qr.tgt] + switch { + case qr.err != nil: + rec.RecordError() + res.QueryErrors++ + default: + rec.Record(qr.sr.Latency) + switch { + case math.IsInf(expected, +1) && math.IsInf(qr.sr.Distance, +1): + res.Passed++ + case math.IsInf(expected, +1) != math.IsInf(qr.sr.Distance, +1): + res.InfMismatch++ + res.Failed++ + if len(res.FirstFails) < 20 { + res.FirstFails = append(res.FirstFails, correctnessFailure{ + Vertex: qr.tgt, Expected: expected, Got: qr.sr.Distance, + }) + } + case relErr(expected, qr.sr.Distance) > epsilon: + res.Failed++ + if len(res.FirstFails) < 20 { + res.FirstFails = append(res.FirstFails, correctnessFailure{ + Vertex: qr.tgt, Expected: expected, Got: qr.sr.Distance, + RelErr: relErr(expected, qr.sr.Distance), + }) + } + default: + res.Passed++ + } + } + if done%logEvery == 0 || done == len(targets) || time.Since(lastLog) >= 10*time.Second { + elapsed := time.Since(startAll) + rate := float64(done) / elapsed.Seconds() + var eta time.Duration + if rate > 0 { + eta = time.Duration(float64(len(targets)-done)/rate) * time.Second + } + log.Printf("[correctness] %d/%d done elapsed=%s last_latency=%s rate=%.1f q/s eta=%s passed=%d failed=%d errors=%d", + done, len(targets), elapsed.Round(time.Second), + qr.sr.Latency.Round(time.Millisecond), rate, eta.Round(time.Second), + res.Passed, res.Failed, res.QueryErrors) + lastLog = time.Now() + } + } + res.Latency = rec.Summarize(time.Since(startAll)) + + writeJSON(cfg.out, res) + log.Printf("[correctness] passed=%d failed=%d inf_mismatch=%d errors=%d", + res.Passed, res.Failed, res.InfMismatch, res.QueryErrors) + log.Printf("[correctness] latency p50=%s p95=%s 
p99=%s", + res.Latency.P50, res.Latency.P95, res.Latency.P99) +} + +func pickTargets(ref map[int64]float64, uidMap map[int64]string, source int64, n int, seed int64) []int64 { + all := make([]int64, 0, len(ref)) + for v := range ref { + if v == source { + continue + } + if _, ok := uidMap[v]; !ok { + continue + } + all = append(all, v) + } + // Go's map iteration is randomized per program start, so we must sort the + // candidate slice before shuffling — otherwise the same -seed produces + // different target sets across runs, and per-PR comparisons aren't fair. + sort.Slice(all, func(i, j int) bool { return all[i] < all[j] }) + if n <= 0 || n >= len(all) { + return all + } + rng := rand.New(rand.NewSource(seed)) + rng.Shuffle(len(all), func(i, j int) { all[i], all[j] = all[j], all[i] }) + return all[:n] +} + +func relErr(expected, got float64) float64 { + if expected == 0 { + if got == 0 { + return 0 + } + return math.Abs(got) + } + return math.Abs(expected-got) / math.Abs(expected) +} + +// ----------------------------------------------------------------------------- +// perf mode +// ----------------------------------------------------------------------------- + +type perfResult struct { + Dataset string `json:"dataset"` + MaxFrontier int `json:"max_frontier"` + NumPaths int `json:"num_paths"` + Pairs int `json:"pairs"` + Concurrency int `json:"concurrency"` + Latency stats.Summary `json:"latency"` + HeapMaxMB uint64 `json:"heap_max_mb"` +} + +func runPerf(ctx context.Context, cfg config, c *client.Client, uidMap map[int64]string) { + ids := make([]int64, 0, len(uidMap)) + for k := range uidMap { + ids = append(ids, k) + } + // Sort for the same reason as in pickTargets: Go map iteration is + // randomized, so without sorting, the same -seed gives different pair + // samples across runs. 
+ sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] }) + if len(ids) < 2 { + log.Fatal("perf needs at least 2 vertices") + } + rng := rand.New(rand.NewSource(cfg.seed)) + pairs := make([][2]string, cfg.pairs) + for i := range pairs { + a := ids[rng.Intn(len(ids))] + b := ids[rng.Intn(len(ids))] + for a == b { + b = ids[rng.Intn(len(ids))] + } + pairs[i] = [2]string{uidMap[a], uidMap[b]} + } + + log.Printf("[perf] pairs=%d concurrency=%d maxfrontier=%d numpaths=%d", + cfg.pairs, cfg.concurrency, cfg.maxFrontier, cfg.numPaths) + + rec := stats.New() + jobs := make(chan [2]string, cfg.concurrency*2) + var wg sync.WaitGroup + var heapPeak uint64 + var heapMu sync.Mutex + + for w := 0; w < cfg.concurrency; w++ { + wg.Add(1) + go func() { + defer wg.Done() + for p := range jobs { + sr, err := c.Shortest(ctx, client.ShortestOptions{ + SrcUID: p[0], + DstUID: p[1], + EdgePred: cfg.edgePred, + NumPaths: cfg.numPaths, + MaxFrontier: cfg.maxFrontier, + Timeout: cfg.timeout, + }) + if err != nil { + rec.RecordError() + continue + } + rec.Record(sr.Latency) + } + }() + } + + stopHeap := make(chan struct{}) + go func() { + t := time.NewTicker(500 * time.Millisecond) + defer t.Stop() + var ms runtime.MemStats + for { + select { + case <-stopHeap: + return + case <-t.C: + runtime.ReadMemStats(&ms) + heapMu.Lock() + if ms.HeapAlloc > heapPeak { + heapPeak = ms.HeapAlloc + } + heapMu.Unlock() + } + } + }() + + startAll := time.Now() + for _, p := range pairs { + jobs <- p + } + close(jobs) + wg.Wait() + close(stopHeap) + wall := time.Since(startAll) + + // The sampler goroutine may still be updating heapPeak when stopHeap + // closes, so read it under heapMu. Note that runtime.ReadMemStats + // reports this harness process's heap, not the Alpha's. + heapMu.Lock() + peakBytes := heapPeak + heapMu.Unlock() + + res := perfResult{ + Dataset: filepath.Base(cfg.datasetDir), + MaxFrontier: cfg.maxFrontier, + NumPaths: cfg.numPaths, + Pairs: cfg.pairs, + Concurrency: cfg.concurrency, + Latency: rec.Summarize(wall), + HeapMaxMB: peakBytes / (1024 * 1024), + } + writeJSON(cfg.out, res) + log.Printf("[perf] qps=%.0f p50=%s p95=%s p99=%s heap=%dMB errors=%d", + res.Latency.QPS, res.Latency.P50, res.Latency.P95, res.Latency.P99, + 
res.HeapMaxMB, res.Latency.Errors) +} + +// writeJSON writes v atomically: encode to path + ".tmp", then rename over path. +// Atomicity matters because kshortest now writes after every frontier; a kill +// mid-encode must not corrupt or truncate the last-good result. +func writeJSON(path string, v any) { + tmp := path + ".tmp" + f, err := os.Create(tmp) + if err != nil { + log.Fatalf("create %s: %v", tmp, err) + } + enc := json.NewEncoder(f) + enc.SetIndent("", " ") + if err := enc.Encode(v); err != nil { + f.Close() + log.Fatalf("encode %s: %v", tmp, err) + } + if err := f.Close(); err != nil { + log.Fatalf("close %s: %v", tmp, err) + } + if err := os.Rename(tmp, path); err != nil { + log.Fatalf("rename %s -> %s: %v", tmp, path, err) + } +} diff --git a/k-shortest-path/cmd/convert/main.go b/k-shortest-path/cmd/convert/main.go new file mode 100644 index 0000000..5aa6dd1 --- /dev/null +++ b/k-shortest-path/cmd/convert/main.go @@ -0,0 +1,139 @@ +// Command convert turns an LDBC Graphalytics datagen dataset into Dgraph- +// loadable artefacts: a gzipped RDF file with weight facets, and a schema +// file. The output is consumed by `dgraph bulk` for the fastest possible +// initial load. 
+// +// convert -dataset ./datasets/datagen-7_5-fb \ +// -out ./datasets/datagen-7_5-fb/dgraph +// +// Produces: +// +// <out>/graph.rdf.gz: blank-node RDF for vertices + edges +// <out>/graph.schema: DQL schema with graphalytics_id index +// +// Then run: +// +// dgraph bulk -f graph.rdf.gz -s graph.schema --zero localhost:5080 --out ./bulk-out +package main + +import ( + "bufio" + "compress/gzip" + "flag" + "fmt" + "log" + "os" + "path/filepath" + "strconv" + "strings" + "time" + + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/client" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/ldbc" +) + +func main() { + dataset := flag.String("dataset", "", "path to extracted LDBC dataset (datasets/<name>/)") + outDir := flag.String("out", "", "output directory (defaults to <dataset>/dgraph)") + flag.Parse() + + if *dataset == "" { + log.Fatal("-dataset is required") + } + if *outDir == "" { + *outDir = filepath.Join(*dataset, "dgraph") + } + if err := os.MkdirAll(*outDir, 0o755); err != nil { + log.Fatal(err) + } + + ds, err := ldbc.LoadDataset(*dataset) + if err != nil { + log.Fatalf("load dataset: %v", err) + } + + if err := writeSchema(filepath.Join(*outDir, "graph.schema")); err != nil { + log.Fatalf("write schema: %v", err) + } + + rdfPath := filepath.Join(*outDir, "graph.rdf.gz") + start := time.Now() + nV, nE, err := writeRDF(rdfPath, ds) + if err != nil { + log.Fatalf("write rdf: %v", err) + } + log.Printf("[convert] wrote %d vertices, %d edge-triples to %s in %s", + nV, nE, rdfPath, time.Since(start).Round(time.Millisecond)) + log.Printf("[next] dgraph bulk -f %s -s %s --zero localhost:5080 --out %s", + rdfPath, + filepath.Join(*outDir, "graph.schema"), + filepath.Join(*outDir, "bulk-out")) +} + +func writeSchema(path string) error { + return os.WriteFile(path, []byte(client.Schema), 0o644) +} + +func writeRDF(path string, ds *ldbc.Dataset) (vertexCount, edgeCount int, err error) { + f, err := os.Create(path) + if err != nil { + return 0, 0, err + } 
+ defer f.Close() + gz := gzip.NewWriter(f) + defer gz.Close() + bw := bufio.NewWriterSize(gz, 1<<20) + defer bw.Flush() + + verts, err := ldbc.ReadVertices(ds.VertexFile) + if err != nil { + return 0, 0, fmt.Errorf("read vertices: %w", err) + } + for _, v := range verts { + // Vertex triples: graphalytics_id + dgraph.type. + if _, err := fmt.Fprintf(bw, "_:v%d <graphalytics_id> \"%d\"^^<xs:int> .\n", v, v); err != nil { + return 0, 0, err + } + if _, err := fmt.Fprintf(bw, "_:v%d <dgraph.type> \"Vertex\" .\n", v); err != nil { + return 0, 0, err + } + } + vertexCount = len(verts) + + // Stream edges; emit reverse triples too for undirected graphs. + directed := ds.Properties.Directed + err = ldbc.ScanEdges(ds.EdgeFile, func(e ldbc.Edge) error { + if err := writeEdgeTriple(bw, e.Src, e.Dst, e.Weight); err != nil { + return err + } + edgeCount++ + if !directed { + if err := writeEdgeTriple(bw, e.Dst, e.Src, e.Weight); err != nil { + return err + } + edgeCount++ + } + return nil + }) + if err != nil { + return vertexCount, edgeCount, fmt.Errorf("scan edges: %w", err) + } + return vertexCount, edgeCount, nil +} + +func writeEdgeTriple(bw *bufio.Writer, src, dst int64, weight float64) error { + _, err := fmt.Fprintf(bw, "_:v%d <connected> _:v%d (weight=%s) .\n", + src, dst, formatWeight(weight)) + return err +} + +// formatWeight emits a weight value Dgraph will always type as a float facet. +// strconv 'f' avoids scientific notation; appending ".0" for integer-valued +// floats prevents Dgraph from inferring an int facet for, e.g., weight=1.0. 
+func formatWeight(w float64) string { + s := strconv.FormatFloat(w, 'f', -1, 64) + if !strings.ContainsAny(s, ".eE") { + s += ".0" + } + return s +} diff --git a/k-shortest-path/cmd/handprobe/main.go b/k-shortest-path/cmd/handprobe/main.go new file mode 100644 index 0000000..5156571 --- /dev/null +++ b/k-shortest-path/cmd/handprobe/main.go @@ -0,0 +1,195 @@ +// Command handprobe settles the one semantic question the whole k-shortest +// comparison rests on: when Dgraph is asked for numpaths=2, does it return the +// 2nd-best LOOPLESS path (Yen's definition — what our oracle computes), or the +// 2nd-best node/edge-DISJOINT path (a different problem)? +// +// It loads a tiny hand graph into a FRESH (empty) alpha, where the two +// definitions give provably different answers, then runs numpaths=2 and prints +// what came back. Run it against one PR binary's empty alpha: +// +// go run ./cmd/handprobe -alpha localhost:9080 +// +// The graph (source=1, target=4): +// +// 1 ->2 ->3 ->4 weights 1,1,1 => path cost 3 (shortest) +// \-------->4 weight 5 (2 ->4) => 1-2-4 cost 6 (shares node 2 + edge 1->2 with the shortest) +// 1 ->5 ->4 weights 10,10 => 1-5-4 cost 20 (fully disjoint from the shortest) +// +// Interpretation of the numpaths=2 result: +// +// costs {3, 6} => LOOPLESS (Yen) => our gonum oracle is the right reference. PROCEED. +// costs {3, 20} => DISJOINT => oracle approach is wrong; STOP and rethink. +// costs {3} => only 1 path returned => check numpaths handling on this binary. +// +// It also dumps the raw _path_ JSON so the self-consistency parser (task #5) +// can be written against the real wire format of this exact binary, not a guess. 
+package main + +import ( + "context" + "encoding/json" + "flag" + "fmt" + "log" + "sort" + "time" + + "github.com/dgraph-io/dgo/v250" + "github.com/dgraph-io/dgo/v250/protos/api" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/client" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials/insecure" +) + +const handGraphRDF = ` +_:n1 <graphalytics_id> "1"^^<xs:int> . +_:n2 <graphalytics_id> "2"^^<xs:int> . +_:n3 <graphalytics_id> "3"^^<xs:int> . +_:n4 <graphalytics_id> "4"^^<xs:int> . +_:n5 <graphalytics_id> "5"^^<xs:int> . +_:n1 <connected> _:n2 (weight=1.0) . +_:n2 <connected> _:n3 (weight=1.0) . +_:n3 <connected> _:n4 (weight=1.0) . +_:n2 <connected> _:n4 (weight=5.0) . +_:n1 <connected> _:n5 (weight=10.0) . +_:n5 <connected> _:n4 (weight=10.0) . +` + +func main() { + alpha := flag.String("alpha", "localhost:9080", "Dgraph alpha gRPC address (point at a FRESH/empty alpha)") + numpaths := flag.Int("numpaths", 2, "numpaths for the probe query") + maxFrontier := flag.Int("maxfrontier", 0, "maxfrontiersize for the probe query (0 = unbounded)") + timeout := flag.Duration("timeout", 30*time.Second, "hard timeout on the shortest query so a hanging binary can't wedge the session") + flag.Parse() + + ctx := context.Background() + + conn, err := grpc.NewClient(*alpha, grpc.WithTransportCredentials(insecure.NewCredentials())) + if err != nil { + log.Fatalf("dial %s: %v", *alpha, err) + } + defer conn.Close() + dg := dgo.NewDgraphClient(api.NewDgraphClient(conn)) + + // Schema + load. Reuses the exact schema the real bench uses. + if err := dg.Alter(ctx, &api.Operation{Schema: client.Schema}); err != nil { + log.Fatalf("alter schema: %v", err) + } + txn := dg.NewTxn() + if _, err := txn.Mutate(ctx, &api.Mutation{SetNquads: []byte(handGraphRDF), CommitNow: true}); err != nil { + log.Fatalf("mutate hand graph: %v (is the alpha empty? re-run needs a fresh alpha)", err) + } + // Indexing is async after CommitNow in some builds; give it a beat. 
+ time.Sleep(500 * time.Millisecond) + + src := lookupUID(ctx, dg, 1) + dst := lookupUID(ctx, dg, 4) + fmt.Printf("src(gid=1)=%s dst(gid=4)=%s\n\n", src, dst) + + frontier := "" + if *maxFrontier > 0 { + frontier = fmt.Sprintf(", maxfrontiersize: %d", *maxFrontier) + } + q := fmt.Sprintf(` + { + path as shortest(from: %s, to: %s, numpaths: %d%s) { + connected @facets(weight) + } + result(func: uid(path)) { uid graphalytics_id } + }`, src, dst, *numpaths, frontier) + + // Hard deadline: if the binary hangs on numpaths>=2, the client gives up + // instead of blocking forever (which, with an unbounded query, can balloon + // alpha's memory and get the SSH session OOM-killed). + qctx, cancel := context.WithTimeout(ctx, *timeout) + defer cancel() + rtxn := dg.NewReadOnlyTxn().BestEffort() + resp, err := rtxn.Query(qctx, q) + _ = rtxn.Discard(ctx) + if err != nil { + log.Fatalf("shortest query (timeout=%s, maxfrontier=%d): %v\n"+ + " a timeout here means this binary does not terminate numpaths=%d on a 5-node graph; that is itself the finding.", + *timeout, *maxFrontier, err, *numpaths) + } + + fmt.Println("=== RAW _path_ JSON (write the self-consistency parser against THIS) ===") + fmt.Println(prettyJSON(resp.Json)) + + weights := parseWeights(resp.Json) + fmt.Printf("\n=== parsed per-path costs (sorted): %v\n\n", weights) + + fmt.Println("=== VERDICT ===") + switch { + case len(weights) >= 2 && approx(weights[0], 3) && approx(weights[1], 6): + fmt.Println("LOOPLESS (Yen). numpaths=2 returned {3, 6}, so the gonum oracle IS the right reference. PROCEED.") + case len(weights) >= 2 && approx(weights[0], 3) && approx(weights[1], 20): + fmt.Println("DISJOINT. numpaths=2 returned {3, 20}, so the oracle approach is WRONG for this semantics. STOP and rethink.") + case len(weights) == 1 && approx(weights[0], 3): + fmt.Println("ONLY 1 PATH. 
numpaths=2 returned a single path; investigate numpaths handling on this binary.") + default: + fmt.Printf("UNEXPECTED: %v; inspect the raw JSON above before drawing conclusions.\n", weights) + } +} + +func lookupUID(ctx context.Context, dg *dgo.Dgraph, gid int64) string { + q := fmt.Sprintf(`{ q(func: eq(graphalytics_id, %d)) { uid } }`, gid) + txn := dg.NewReadOnlyTxn().BestEffort() + resp, err := txn.Query(ctx, q) + _ = txn.Discard(ctx) + if err != nil { + log.Fatalf("lookup gid=%d: %v", gid, err) + } + var d struct { + Q []struct { + UID string `json:"uid"` + } `json:"q"` + } + if err := json.Unmarshal(resp.Json, &d); err != nil { + log.Fatalf("decode lookup gid=%d: %v", gid, err) + } + if len(d.Q) == 0 { + log.Fatalf("gid=%d not found; did the mutation/indexing complete?", gid) + } + return d.Q[0].UID +} + +func parseWeights(raw []byte) []float64 { + var decoded struct { + Paths []map[string]any `json:"_path_"` + } + if err := json.Unmarshal(raw, &decoded); err != nil { + return nil + } + var out []float64 + for _, p := range decoded.Paths { + if w, ok := p["_weight_"]; ok { + switch x := w.(type) { + case float64: + out = append(out, x) + case json.Number: + if f, err := x.Float64(); err == nil { + out = append(out, f) + } + } + } + } + sort.Float64s(out) + return out +} + +func approx(a, b float64) bool { + d := a - b + return d < 1e-6 && d > -1e-6 +} + +func prettyJSON(raw []byte) string { + var v any + if err := json.Unmarshal(raw, &v); err != nil { + return string(raw) + } + b, err := json.MarshalIndent(v, "", " ") + if err != nil { + return string(raw) + } + return string(b) +} diff --git a/k-shortest-path/cmd/prepare/main.go b/k-shortest-path/cmd/prepare/main.go new file mode 100644 index 0000000..9be6c7e --- /dev/null +++ b/k-shortest-path/cmd/prepare/main.go @@ -0,0 +1,464 @@ +// Command prepare turns a road-network dataset, fetched from a URL (or local +// path), into the LDBC layout the rest of the harness consumes: +// +// datasets/<name>/<name>.v vertex 
ids, one per line +// datasets/<name>/<name>.e "src dst weight", the edge list +// datasets/<name>/<name>.properties metadata (directed flag, source vertex) +// datasets/<name>/validation/<name>-SSSP Dijkstra reference from the source +// +// Two input formats, selected with -format: +// +// dimacs DIMACS challenge9 .gr (.gz ok): "p sp n m" + "a u v w" arcs. +// Arcs are already bidirectional, so the graph is treated as directed +// and edges are emitted as-is. Vertices are 1..n. (e.g. roadCOL, roadNY) +// csv a .zip containing edges.csv (source,target,travel_time,distance,cat) +// and nodes.csv (index,...). Undirected; weight = the distance column; +// self-loops and duplicate reverse edges dropped; both directions +// emitted. Matches the legacy csv_road_to_ldbc.py exactly. (e.g. NY) +// +// The SSSP reference is computed with gonum Dijkstra on the *same* graph the +// oracle builds for k-shortest, so the top-1 reference can never disagree with +// the oracle. Examples: +// +// prepare -format dimacs -name roadCOL -source 1 \ +// -url http://www.diag.uniroma1.it/challenge9/data/USA-road-t/USA-road-t.COL.gr.gz +// prepare -format csv -name roadNY -source 1 -url ./NY.csv.zip +package main + +import ( + "archive/zip" + "bufio" + "compress/gzip" + "flag" + "fmt" + "io" + "log" + "math" + "net/http" + "os" + "path/filepath" + "sort" + "strconv" + "strings" + "time" + + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/ldbc" + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/oracle" +) + +type dataset struct { + vertices []int64 // explicit vertex ids, sorted + edges []ldbc.Edge // unique edges as parsed (one per pair for undirected) + directed bool +} + +func main() { + url := flag.String("url", "", "source URL or local path to the dataset file") + format := flag.String("format", "", "dimacs | csv") + name := flag.String("name", "", "dataset name (output goes to <out>/<name>/)") + outDir := flag.String("out", "datasets", "output root directory") + source := flag.Int64("source", 1, 
"SSSP source vertex id") + flag.Parse() + + if *url == "" || *format == "" || *name == "" { + log.Fatal("-url, -format, and -name are all required") + } + + local, cleanup, err := fetch(*url) + if err != nil { + log.Fatalf("fetch %s: %v", *url, err) + } + defer cleanup() + + var ds dataset + switch *format { + case "dimacs": + ds, err = parseDIMACS(local) + case "csv": + ds, err = parseCSVZip(local) + default: + log.Fatalf("unknown -format %q (want dimacs|csv)", *format) + } + if err != nil { + log.Fatalf("parse %s: %v", *format, err) + } + log.Printf("[prepare] %s: %d vertices, %d unique edges, directed=%v", + *name, len(ds.vertices), len(ds.edges), ds.directed) + + root := filepath.Join(*outDir, *name) + if err := os.MkdirAll(filepath.Join(root, "validation"), 0o755); err != nil { + log.Fatal(err) + } + + if err := writeVertices(filepath.Join(root, *name+".v"), ds.vertices); err != nil { + log.Fatalf("write .v: %v", err) + } + nE, err := writeEdges(filepath.Join(root, *name+".e"), ds) + if err != nil { + log.Fatalf("write .e: %v", err) + } + if err := writeProperties(filepath.Join(root, *name+".properties"), *name, ds, nE, *source); err != nil { + log.Fatalf("write .properties: %v", err) + } + + // SSSP reference on the same effective graph the oracle/Dgraph see. 
+ log.Printf("[prepare] building graph + Dijkstra reference from source %d...", *source) + start := time.Now() + g := oracle.New() + for _, v := range ds.vertices { + g.AddNode(v) + } + if err := forEachEffectiveEdge(ds, func(s, t int64, w float64) error { + return g.AddEdge(s, t, w) + }); err != nil { + log.Fatalf("build graph: %v", err) + } + dist, err := g.SSSP(*source) + if err != nil { + log.Fatalf("sssp: %v", err) + } + if err := writeSSSP(filepath.Join(root, "validation", *name+"-SSSP"), ds.vertices, dist); err != nil { + log.Fatalf("write SSSP: %v", err) + } + reachable := 0 + for _, d := range dist { + if !math.IsInf(d, +1) { + reachable++ + } + } + log.Printf("[prepare] SSSP: %d/%d reachable in %s", reachable, len(ds.vertices), time.Since(start).Round(time.Millisecond)) + log.Printf("[prepare] done -> %s/", root) + log.Printf("[next] convert -dataset %s && dgraph bulk ...", root) +} + +// ----------------------------------------------------------------------------- +// fetch +// ----------------------------------------------------------------------------- + +// fetch returns a local file path for url. A url without a "://" scheme is +// treated as an existing local path (no copy). Remote urls are downloaded to a +// temp file. The returned cleanup removes any temp file. 
+func fetch(url string) (path string, cleanup func(), err error) { + noop := func() {} + if !strings.Contains(url, "://") { + if _, statErr := os.Stat(url); statErr != nil { + return "", noop, fmt.Errorf("local path: %w", statErr) + } + return url, noop, nil + } + resp, err := http.Get(url) + if err != nil { + return "", noop, err + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + return "", noop, fmt.Errorf("http %s: %s", url, resp.Status) + } + tmp, err := os.CreateTemp("", "prepare-*"+filepath.Ext(url)) + if err != nil { + return "", noop, err + } + log.Printf("[prepare] downloading %s ...", url) + if _, err := io.Copy(tmp, resp.Body); err != nil { + tmp.Close() + os.Remove(tmp.Name()) + return "", noop, err + } + tmp.Close() + return tmp.Name(), func() { os.Remove(tmp.Name()) }, nil +} + +// openMaybeGz opens path, transparently decompressing if it is gzip. +func openMaybeGz(path string) (io.ReadCloser, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + if strings.HasSuffix(path, ".gz") { + gz, err := gzip.NewReader(f) + if err != nil { + f.Close() + return nil, err + } + return gzReadCloser{gz: gz, f: f}, nil + } + return f, nil +} + +type gzReadCloser struct { + gz *gzip.Reader + f *os.File +} + +func (g gzReadCloser) Read(p []byte) (int, error) { return g.gz.Read(p) } +func (g gzReadCloser) Close() error { + _ = g.gz.Close() + return g.f.Close() +} + +// ----------------------------------------------------------------------------- +// DIMACS .gr +// ----------------------------------------------------------------------------- + +func parseDIMACS(path string) (dataset, error) { + r, err := openMaybeGz(path) + if err != nil { + return dataset{}, err + } + defer r.Close() + + var n int64 + var edges []ldbc.Edge + sc := bufio.NewScanner(r) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := sc.Text() + if line == "" { + continue + } + switch line[0] { + case 'c': + continue + case 'p': + // p sp + f 
:= strings.Fields(line) + if len(f) >= 3 { + n, err = strconv.ParseInt(f[2], 10, 64) + if err != nil { + return dataset{}, fmt.Errorf("problem line %q: %w", line, err) + } + } + case 'a': + f := strings.Fields(line) + if len(f) != 4 { + return dataset{}, fmt.Errorf("arc line %q: want 4 fields", line) + } + u, err1 := strconv.ParseInt(f[1], 10, 64) + v, err2 := strconv.ParseInt(f[2], 10, 64) + w, err3 := strconv.ParseFloat(f[3], 64) + if err1 != nil || err2 != nil || err3 != nil { + return dataset{}, fmt.Errorf("arc parse %q", line) + } + if u == v { + continue + } + edges = append(edges, ldbc.Edge{Src: u, Dst: v, Weight: w}) + } + } + if err := sc.Err(); err != nil { + return dataset{}, err + } + if n == 0 { + return dataset{}, fmt.Errorf("no problem line (p sp n m) found") + } + verts := make([]int64, n) + for i := int64(0); i < n; i++ { + verts[i] = i + 1 // DIMACS vertices are 1-indexed + } + // DIMACS arcs already encode both directions, so treat as directed and + // emit as-is (don't synthesize reverse edges). + return dataset{vertices: verts, edges: edges, directed: true}, nil +} + +// ----------------------------------------------------------------------------- +// CSV zip (edges.csv + nodes.csv) +// ----------------------------------------------------------------------------- + +func parseCSVZip(path string) (dataset, error) { + zr, err := zip.OpenReader(path) + if err != nil { + return dataset{}, fmt.Errorf("open zip: %w", err) + } + defer zr.Close() + + var edgesF, nodesF *zip.File + for _, f := range zr.File { + switch filepath.Base(f.Name) { + case "edges.csv": + edgesF = f + case "nodes.csv": + nodesF = f + } + } + if edgesF == nil || nodesF == nil { + return dataset{}, fmt.Errorf("zip must contain edges.csv and nodes.csv") + } + + // Vertex count = number of data rows in nodes.csv (0-indexed ids 0..N-1). 
+ nNodes, err := countCSVRows(nodesF) + if err != nil { + return dataset{}, fmt.Errorf("nodes.csv: %w", err) + } + + // Edges: undirected dedup, skip self-loops, weight = distance column (idx 3). + rc, err := edgesF.Open() + if err != nil { + return dataset{}, err + } + defer rc.Close() + seen := make(map[[2]int64]struct{}) + var edges []ldbc.Edge + var skipSelf, skipDup int + sc := bufio.NewScanner(rc) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := strings.TrimSpace(sc.Text()) + if line == "" || strings.HasPrefix(line, "#") { + continue + } + parts := strings.Split(line, ",") + if len(parts) < 4 { + continue + } + s, err1 := strconv.ParseInt(strings.TrimSpace(parts[0]), 10, 64) + t, err2 := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64) + w, err3 := strconv.ParseFloat(strings.TrimSpace(parts[3]), 64) // distance + if err1 != nil || err2 != nil || err3 != nil { + return dataset{}, fmt.Errorf("edge parse %q", line) + } + if s == t { + skipSelf++ + continue + } + key := [2]int64{s, t} + if s > t { + key = [2]int64{t, s} + } + if _, dup := seen[key]; dup { + skipDup++ + continue + } + seen[key] = struct{}{} + edges = append(edges, ldbc.Edge{Src: s, Dst: t, Weight: w}) + } + if err := sc.Err(); err != nil { + return dataset{}, err + } + log.Printf("[prepare] csv: %d unique edges (skipped %d self-loops, %d dup-reverse)", + len(edges), skipSelf, skipDup) + + verts := make([]int64, nNodes) + for i := int64(0); i < nNodes; i++ { + verts[i] = i + } + return dataset{vertices: verts, edges: edges, directed: false}, nil +} + +func countCSVRows(f *zip.File) (int64, error) { + rc, err := f.Open() + if err != nil { + return 0, err + } + defer rc.Close() + var n int64 + sc := bufio.NewScanner(rc) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := strings.TrimSpace(sc.Text()) + if line == "" || strings.HasPrefix(line, "#") { + continue + } + n++ + } + return n, sc.Err() +} + +// 
----------------------------------------------------------------------------- +// effective edges + writers +// ----------------------------------------------------------------------------- + +// forEachEffectiveEdge invokes fn for every directed edge actually present in +// the graph: as-is for directed datasets, both directions for undirected ones. +// This is the single source of truth shared by .e writing and graph building, +// so they can never diverge. +func forEachEffectiveEdge(ds dataset, fn func(s, t int64, w float64) error) error { + for _, e := range ds.edges { + if err := fn(e.Src, e.Dst, e.Weight); err != nil { + return err + } + if !ds.directed { + if err := fn(e.Dst, e.Src, e.Weight); err != nil { + return err + } + } + } + return nil +} + +func writeVertices(path string, verts []int64) error { + f, err := os.Create(path) + if err != nil { + return err + } + defer f.Close() + bw := bufio.NewWriter(f) + defer bw.Flush() + for _, v := range verts { + if _, err := fmt.Fprintf(bw, "%d\n", v); err != nil { + return err + } + } + return nil +} + +func writeEdges(path string, ds dataset) (int, error) { + f, err := os.Create(path) + if err != nil { + return 0, err + } + defer f.Close() + bw := bufio.NewWriter(f) + defer bw.Flush() + count := 0 + err = forEachEffectiveEdge(ds, func(s, t int64, w float64) error { + count++ + _, e := fmt.Fprintf(bw, "%d %d %s\n", s, t, strconv.FormatFloat(w, 'f', 6, 64)) + return e + }) + return count, err +} + +func writeProperties(path, name string, ds dataset, nEdges int, source int64) error { + content := fmt.Sprintf(`graph.%[1]s.vertex-file = %[1]s.v +graph.%[1]s.edge-file = %[1]s.e + +graph.%[1]s.meta.vertices = %[2]d +graph.%[1]s.meta.edges = %[3]d + +graph.%[1]s.directed = %[4]v + +graph.%[1]s.edge-properties.names = weight +graph.%[1]s.edge-properties.types = real + +graph.%[1]s.algorithms = sssp + +graph.%[1]s.sssp.weight-property = weight +graph.%[1]s.sssp.source-vertex = %[5]d +`, name, len(ds.vertices), nEdges, 
ds.directed, source) + return os.WriteFile(path, []byte(content), 0o644) +} + +func writeSSSP(path string, verts []int64, dist map[int64]float64) error { + f, err := os.Create(path) + if err != nil { + return err + } + defer f.Close() + bw := bufio.NewWriter(f) + defer bw.Flush() + // verts is already in id order from parsing; ensure deterministic output. + sort.Slice(verts, func(i, j int) bool { return verts[i] < verts[j] }) + for _, v := range verts { + d, ok := dist[v] + if !ok || math.IsInf(d, +1) { + if _, err := fmt.Fprintf(bw, "%d infinity\n", v); err != nil { + return err + } + continue + } + if _, err := fmt.Fprintf(bw, "%d %s\n", v, strconv.FormatFloat(d, 'f', 6, 64)); err != nil { + return err + } + } + return nil +} diff --git a/k-shortest-path/cmd/validate/main.go b/k-shortest-path/cmd/validate/main.go new file mode 100644 index 0000000..da0d0b2 --- /dev/null +++ b/k-shortest-path/cmd/validate/main.go @@ -0,0 +1,89 @@ +// Command validate is a standalone differ for two SSSP-format files. Useful +// when you've captured a Dgraph SSSP output to disk and want to diff it +// against the LDBC reference without spinning up the bench harness. 
+// +// validate -reference datasets/datagen-7_5-fb/validation/datagen-7_5-fb-SSSP \ +// -actual results/pr-9607/sssp.txt +package main + +import ( + "flag" + "fmt" + "log" + "math" + "os" + + "github.com/dgraph-io/dgraph-benchmarks/k-shortest-path/internal/ldbc" +) + +const epsilon = 0.0001 + +func main() { + ref := flag.String("reference", "", "LDBC reference SSSP output") + act := flag.String("actual", "", "actual SSSP output to validate") + maxShow := flag.Int("max-show", 10, "max failures to print") + flag.Parse() + if *ref == "" || *act == "" { + log.Fatal("-reference and -actual are both required") + } + + expected, err := ldbc.ReadSSSP(*ref) + if err != nil { + log.Fatalf("read reference: %v", err) + } + got, err := ldbc.ReadSSSP(*act) + if err != nil { + log.Fatalf("read actual: %v", err) + } + + if len(expected) != len(got) { + fmt.Fprintf(os.Stderr, + "WARN: vertex set sizes differ — expected=%d actual=%d\n", + len(expected), len(got)) + } + + var passed, failed int + var fails []string + for v, exp := range expected { + g, ok := got[v] + if !ok { + failed++ + if len(fails) < *maxShow { + fails = append(fails, fmt.Sprintf("vertex %d: missing from actual", v)) + } + continue + } + if math.IsInf(exp, +1) && math.IsInf(g, +1) { + passed++ + continue + } + if math.IsInf(exp, +1) != math.IsInf(g, +1) { + failed++ + if len(fails) < *maxShow { + fails = append(fails, fmt.Sprintf("vertex %d: inf mismatch exp=%v got=%v", v, exp, g)) + } + continue + } + denom := math.Abs(exp) + if denom == 0 { + denom = 1 + } + if math.Abs(exp-g)/denom > epsilon { + failed++ + if len(fails) < *maxShow { + fails = append(fails, fmt.Sprintf("vertex %d: exp=%v got=%v rel=%.4f", + v, exp, g, math.Abs(exp-g)/denom)) + } + continue + } + passed++ + } + + fmt.Printf("passed=%d failed=%d total=%d epsilon=%g\n", passed, failed, passed+failed, epsilon) + for _, s := range fails { + fmt.Println(" ", s) + } + if failed > 0 { + os.Exit(1) + } +} diff --git 
a/k-shortest-path/docker-compose.yaml b/k-shortest-path/docker-compose.yaml new file mode 100644 index 0000000..92bc5db --- /dev/null +++ b/k-shortest-path/docker-compose.yaml @@ -0,0 +1,33 @@ +services: + zero: + image: ${DGRAPH_IMAGE:-dgraph/dgraph:latest} + container_name: shortestpath-zero + command: dgraph zero --my=zero:5080 --replicas=1 + ports: + - "5080:5080" + - "6080:6080" + healthcheck: + test: ["CMD-SHELL", "curl -fsS http://localhost:6080/state || exit 1"] + interval: 5s + timeout: 5s + retries: 20 + + alpha: + image: ${DGRAPH_IMAGE:-dgraph/dgraph:latest} + container_name: shortestpath-alpha + command: > + dgraph alpha + --my=alpha:7080 + --zero=zero:5080 + --security whitelist=0.0.0.0/0 + ports: + - "8080:8080" + - "9080:9080" + depends_on: + zero: + condition: service_healthy + healthcheck: + test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"] + interval: 5s + timeout: 5s + retries: 30 diff --git a/k-shortest-path/go.mod b/k-shortest-path/go.mod new file mode 100644 index 0000000..678fc1f --- /dev/null +++ b/k-shortest-path/go.mod @@ -0,0 +1,17 @@ +module github.com/dgraph-io/dgraph-benchmarks/k-shortest-path + +go 1.26.3 + +require ( + github.com/dgraph-io/dgo/v250 v250.0.0 + gonum.org/v1/gonum v0.17.0 + google.golang.org/grpc v1.80.0 +) + +require ( + golang.org/x/net v0.49.0 // indirect + golang.org/x/sys v0.40.0 // indirect + golang.org/x/text v0.33.0 // indirect + google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 // indirect + google.golang.org/protobuf v1.36.11 // indirect +) diff --git a/k-shortest-path/go.sum b/k-shortest-path/go.sum new file mode 100644 index 0000000..4e98131 --- /dev/null +++ b/k-shortest-path/go.sum @@ -0,0 +1,48 @@ +github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= +github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= 
+github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/dgraph-io/dgo/v250 v250.0.0 h1:zkVj8EOgNOK3s5XFEK7CJKRdftWqg5K6qGs4HEH5TcY= +github.com/dgraph-io/dgo/v250 v250.0.0/go.mod h1:OVSaapUnuqaY4beLe98CajukINwbVm0JRNp0SRBCz/w= +github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI= +github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY= +github.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag= +github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE= +github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek= +github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps= +github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= +github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU= +github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= +github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= +github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ64= +go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y= +go.opentelemetry.io/otel v1.39.0 h1:8yPrr/S0ND9QEfTfdP9V+SiwT4E0G7Y5MO7p85nis48= +go.opentelemetry.io/otel v1.39.0/go.mod h1:kLlFTywNWrFyEdH0oj2xK0bFYZtHRYUdv1NklR/tgc8= +go.opentelemetry.io/otel/metric v1.39.0 h1:d1UzonvEZriVfpNKEVmHXbdf909uGTOQjA0HF0Ls5Q0= +go.opentelemetry.io/otel/metric v1.39.0/go.mod 
h1:jrZSWL33sD7bBxg1xjrqyDjnuzTUB0x1nBERXd7Ftcs= +go.opentelemetry.io/otel/sdk v1.39.0 h1:nMLYcjVsvdui1B/4FRkwjzoRVsMK8uL/cj0OyhKzt18= +go.opentelemetry.io/otel/sdk v1.39.0/go.mod h1:vDojkC4/jsTJsE+kh+LXYQlbL8CgrEcwmt1ENZszdJE= +go.opentelemetry.io/otel/sdk/metric v1.39.0 h1:cXMVVFVgsIf2YL6QkRF4Urbr/aMInf+2WKg+sEJTtB8= +go.opentelemetry.io/otel/sdk/metric v1.39.0/go.mod h1:xq9HEVH7qeX69/JnwEfp6fVq5wosJsY1mt4lLfYdVew= +go.opentelemetry.io/otel/trace v1.39.0 h1:2d2vfpEDmCJ5zVYz7ijaJdOF59xLomrvj7bjt6/qCJI= +go.opentelemetry.io/otel/trace v1.39.0/go.mod h1:88w4/PnZSazkGzz/w84VHpQafiU4EtqqlVdxWy+rNOA= +golang.org/x/net v0.49.0 h1:eeHFmOGUTtaaPSGNmjBKpbng9MulQsJURQUAfUwY++o= +golang.org/x/net v0.49.0/go.mod h1:/ysNB2EvaqvesRkuLAyjI1ycPZlQHM3q01F02UY/MV8= +golang.org/x/sys v0.40.0 h1:DBZZqJ2Rkml6QMQsZywtnjnnGvHza6BTfYFWY9kjEWQ= +golang.org/x/sys v0.40.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks= +golang.org/x/text v0.33.0 h1:B3njUFyqtHDUI5jMn1YIr5B0IE2U0qck04r6d4KPAxE= +golang.org/x/text v0.33.0/go.mod h1:LuMebE6+rBincTi9+xWTY8TztLzKHc/9C1uBCG27+q8= +gonum.org/v1/gonum v0.17.0 h1:VbpOemQlsSMrYmn7T2OUvQ4dqxQXU+ouZFQsZOx50z4= +gonum.org/v1/gonum v0.17.0/go.mod h1:El3tOrEuMpv2UdMrbNlKEh9vd86bmQ6vqIcDwxEOc1E= +google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 h1:sNrWoksmOyF5bvJUcnmbeAmQi8baNhqg5IWaI3llQqU= +google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516/go.mod h1:j9x/tPzZkyxcgEFkiKEEGxfvyumM01BEtsW8xzOahRQ= +google.golang.org/grpc v1.80.0 h1:Xr6m2WmWZLETvUNvIUmeD5OAagMw3FiKmMlTdViWsHM= +google.golang.org/grpc v1.80.0/go.mod h1:ho/dLnxwi3EDJA4Zghp7k2Ec1+c2jqup0bFkw07bwF4= +google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE= +google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git 
a/k-shortest-path/internal/client/client.go b/k-shortest-path/internal/client/client.go new file mode 100644 index 0000000..5e3aceb --- /dev/null +++ b/k-shortest-path/internal/client/client.go @@ -0,0 +1,335 @@ +// Package client wraps a dgo gRPC client with the few helpers shortest-path +// benchmarks need: schema setup, a graphalytics_id → uid map fetched in one +// query, and a Shortest(...) call that returns the total path weight. +package client + +import ( + "bufio" + "context" + "encoding/json" + "fmt" + "math" + "os" + "sort" + "strconv" + "strings" + "time" + + "github.com/dgraph-io/dgo/v250" + "github.com/dgraph-io/dgo/v250/protos/api" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials/insecure" +) + +// Schema is the DQL schema applied before any benchmark run. graphalytics_id +// is the LDBC vertex id; connected carries the weight as a facet. +const Schema = ` +graphalytics_id: int @index(int) . +connected: [uid] @reverse . +` + +type Client struct { + conn *grpc.ClientConn + dg *dgo.Dgraph +} + +// Open dials the Alpha at addr (typically "localhost:9080") with insecure +// credentials. +func Open(addr string) (*Client, error) { + conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials())) + if err != nil { + return nil, fmt.Errorf("grpc dial %s: %w", addr, err) + } + return &Client{ + conn: conn, + dg: dgo.NewDgraphClient(api.NewDgraphClient(conn)), + }, nil +} + +func (c *Client) Close() error { return c.conn.Close() } + +// Alter applies the package Schema. +func (c *Client) Alter(ctx context.Context) error { + return c.dg.Alter(ctx, &api.Operation{Schema: Schema}) +} + +// FetchUIDMap returns a graphalytics_id → uid mapping for every Vertex +// node in the database. Uses paginated reads so the query never returns +// more than `pageSize` results at once. 
+func (c *Client) FetchUIDMap(ctx context.Context, pageSize int) (map[int64]string, error) { + out := make(map[int64]string) + offset := 0 + for { + q := fmt.Sprintf(` + { + vertices(func: has(graphalytics_id), first: %d, offset: %d) { + uid + graphalytics_id + } + }`, pageSize, offset) + + txn := c.dg.NewReadOnlyTxn().BestEffort() + resp, err := txn.Query(ctx, q) + _ = txn.Discard(ctx) + if err != nil { + return nil, fmt.Errorf("uidmap page offset=%d: %w", offset, err) + } + + var page struct { + Vertices []struct { + UID string `json:"uid"` + GraphalyticsID int64 `json:"graphalytics_id"` + } `json:"vertices"` + } + if err := json.Unmarshal(resp.Json, &page); err != nil { + return nil, fmt.Errorf("uidmap decode: %w", err) + } + if len(page.Vertices) == 0 { + break + } + for _, v := range page.Vertices { + out[v.GraphalyticsID] = v.UID + } + if len(page.Vertices) < pageSize { + break + } + offset += pageSize + } + return out, nil +} + +// SaveUIDMap writes the graphalytics_id → uid map to a TSV file at path. +// Format: one line per entry, "<graphalytics_id>\t<uid>". This cache is valid +// as long as the underlying Dgraph p/ directory isn't re-bulk-loaded; UIDs +// persist across Alpha restarts and PR-build swaps. +func SaveUIDMap(path string, m map[int64]string) error { + f, err := os.Create(path) + if err != nil { + return fmt.Errorf("create %s: %w", path, err) + } + defer f.Close() + bw := bufio.NewWriter(f) + for gid, uid := range m { + if _, err := fmt.Fprintf(bw, "%d\t%s\n", gid, uid); err != nil { + return err + } + } + // Flush explicitly so a buffered write error is returned, not dropped by defer. + return bw.Flush() +} + +// LoadUIDMap reads a TSV cache file written by SaveUIDMap.
+func LoadUIDMap(path string) (map[int64]string, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + defer f.Close() + out := make(map[int64]string) + sc := bufio.NewScanner(f) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := sc.Text() + tab := strings.IndexByte(line, '\t') + if tab < 0 { + continue + } + gid, err := strconv.ParseInt(line[:tab], 10, 64) + if err != nil { + return nil, fmt.Errorf("parse graphalytics_id %q: %w", line[:tab], err) + } + out[gid] = line[tab+1:] + } + return out, sc.Err() +} + +// ShortestOptions parameterises a single shortest-path query. +type ShortestOptions struct { + SrcUID string + DstUID string + EdgePred string // DQL edge predicate, e.g. connected + NumPaths int + MaxFrontier int // <= 0 omits the cap + Timeout time.Duration +} + +// ShortestResult is the parsed outcome of a shortest-path query. +type ShortestResult struct { + // Distance is the smallest total path weight (the classic shortest + // distance). math.Inf if no path was returned. Kept for the SSSP top-1 + // correctness mode. + Distance float64 + // Weights is the per-path total weights from every entry in _path_, sorted + // non-decreasing. This is the weight VECTOR compared against the oracle's + // TopK output: comparing sorted costs (not path identity) is what makes the + // k-shortest correctness check robust to ties. Len == PathCount. + Weights []float64 + // Latency is the wall time spent in the gRPC round-trip. + Latency time.Duration + // PathCount is the number of distinct paths in the response (1 for the + // classic shortest, up to NumPaths for k-shortest). + PathCount int + // SelfConsistent is true when, for every returned path, the sum of the + // per-edge weight facets equals the path's reported _weight_ (within + // tolerance). This needs no oracle: a path whose own edges don't add up to + // its claimed cost is a bug. False if any path fails or can't be walked. 
+ SelfConsistent bool + // Loopless is true when no returned path revisits a uid. Dgraph claims to + // prune cyclical paths; a repeated uid would contradict that. + Loopless bool + // MaxWeightErr is the largest absolute |summed - reported| across paths, + // for diagnostics when SelfConsistent is false. + MaxWeightErr float64 +} + +// Shortest issues one `shortest` query and decodes total weight from the +// `_path_` response. Returns Distance=math.Inf if no path is found. +func (c *Client) Shortest(ctx context.Context, opts ShortestOptions) (ShortestResult, error) { + if opts.NumPaths < 1 { + opts.NumPaths = 1 + } + edge := opts.EdgePred + if edge == "" { + edge = "connected" + } + frontier := "" + if opts.MaxFrontier > 0 { + frontier = fmt.Sprintf(", maxfrontiersize: %d", opts.MaxFrontier) + } + q := fmt.Sprintf(` + { + path as shortest(from: %s, to: %s, numpaths: %d%s) { + %s @facets(weight) + } + result(func: uid(path)) { + uid + } + }`, opts.SrcUID, opts.DstUID, opts.NumPaths, frontier, edge) + + if opts.Timeout > 0 { + var cancel context.CancelFunc + ctx, cancel = context.WithTimeout(ctx, opts.Timeout) + defer cancel() + } + + txn := c.dg.NewReadOnlyTxn().BestEffort() + defer func() { _ = txn.Discard(ctx) }() + + start := time.Now() + resp, err := txn.Query(ctx, q) + elapsed := time.Since(start) + if err != nil { + return ShortestResult{Latency: elapsed, Distance: math.Inf(+1)}, err + } + + // Top-level _path_ array; each entry has a `_weight_` field. Pick the + // smallest weight as the shortest distance. 
+ var decoded struct { + Paths []map[string]any `json:"_path_"` + } + if err := json.Unmarshal(resp.Json, &decoded); err != nil { + return ShortestResult{Latency: elapsed, Distance: math.Inf(+1)}, fmt.Errorf("decode _path_: %w", err) + } + if len(decoded.Paths) == 0 { + return ShortestResult{Latency: elapsed, Distance: math.Inf(+1), PathCount: 0}, nil + } + + weights := make([]float64, 0, len(decoded.Paths)) + selfConsistent, loopless := true, true + maxErr := 0.0 + for _, p := range decoded.Paths { + reported, hasReported := toFloat(p["_weight_"]) + if hasReported { + weights = append(weights, reported) + } + // Walk the nested path to sum edge facets and check for loops. + summed, noLoop, ok := walkPath(p, edge) + if !ok || !noLoop { + loopless = loopless && noLoop + selfConsistent = false + continue + } + if hasReported { + if e := math.Abs(summed - reported); e > maxErr { + maxErr = e + } + // 1e-6 absolute plus a relative term absorbs float-facet summation + // noise without masking a real discrepancy. + if math.Abs(summed-reported) > 1e-6+1e-9*math.Abs(reported) { + selfConsistent = false + } + } + } + sort.Float64s(weights) + best := math.Inf(+1) + if len(weights) > 0 { + best = weights[0] + } + return ShortestResult{ + Distance: best, + Weights: weights, + Latency: elapsed, + PathCount: len(decoded.Paths), + SelfConsistent: selfConsistent, + Loopless: loopless, + MaxWeightErr: maxErr, + }, nil +} + +// walkPath descends the singly-nested path under the edge predicate, summing +// the per-edge weight facets (key "<edge>|weight", carried on each child node +// next to its uid) and checking that no uid repeats. Returns the summed edge +// weight, whether the path is loopless, and whether the structure parsed +// cleanly.
Matches the real _path_ wire format: +// +// {"_weight_":3,"uid":"0x1","connected":{"connected|weight":1,"uid":"0x2", +// "connected":{...}}} +func walkPath(root map[string]any, edge string) (total float64, noLoop bool, ok bool) { + facetKey := edge + "|weight" + seen := make(map[string]bool) + cur := root + if u, has := cur["uid"].(string); has { + seen[u] = true + } + for { + childRaw, has := cur[edge] + if !has { + return total, true, true // reached the leaf — clean linear path + } + child, isMap := childRaw.(map[string]any) + if !isMap { + // A linear path nests a single object; an array (branching) is + // unexpected for k-shortest and we flag it rather than guess. + if arr, isArr := childRaw.([]any); isArr && len(arr) == 1 { + child, isMap = arr[0].(map[string]any) + } + if !isMap { + return total, true, false + } + } + w, wok := toFloat(child[facetKey]) + if !wok { + return total, true, false + } + total += w + if u, has := child["uid"].(string); has { + if seen[u] { + return total, false, true // a uid repeated — not loopless + } + seen[u] = true + } + cur = child + } +} + +func toFloat(v any) (float64, bool) { + switch x := v.(type) { + case float64: + return x, true + case json.Number: + f, err := x.Float64() + return f, err == nil + default: + return 0, false + } +} diff --git a/k-shortest-path/internal/client/client_test.go b/k-shortest-path/internal/client/client_test.go new file mode 100644 index 0000000..af9dab9 --- /dev/null +++ b/k-shortest-path/internal/client/client_test.go @@ -0,0 +1,87 @@ +package client + +import ( + "encoding/json" + "math" + "testing" +) + +// The exact _path_ payload captured from a live PR-binary alpha via +// cmd/handprobe (numpaths=2 on the hand graph). The self-consistency walk is +// tested against this real wire format, not a guess. 
+const realPathJSON = `{ + "_path_": [ + {"_weight_": 3, "uid": "0x1", + "connected": {"connected|weight": 1, "uid": "0x2", + "connected": {"connected|weight": 1, "uid": "0x3", + "connected": {"connected|weight": 1, "uid": "0x4"}}}}, + {"_weight_": 6, "uid": "0x1", + "connected": {"connected|weight": 1, "uid": "0x2", + "connected": {"connected|weight": 5, "uid": "0x4"}}} + ] +}` + +func parsePaths(t *testing.T, raw string) []map[string]any { + t.Helper() + var d struct { + Paths []map[string]any `json:"_path_"` + } + if err := json.Unmarshal([]byte(raw), &d); err != nil { + t.Fatalf("unmarshal: %v", err) + } + return d.Paths +} + +func TestWalkPath_RealFormat(t *testing.T) { + paths := parsePaths(t, realPathJSON) + if len(paths) != 2 { + t.Fatalf("want 2 paths, got %d", len(paths)) + } + + // Path 0: 1->2->3->4, weights 1+1+1 = 3, loopless. + total, noLoop, ok := walkPath(paths[0], "connected") + if !ok || !noLoop { + t.Fatalf("path0: ok=%v noLoop=%v", ok, noLoop) + } + if math.Abs(total-3) > 1e-9 { + t.Fatalf("path0 summed weight: got %v want 3", total) + } + + // Path 1: 1->2->4, weights 1+5 = 6, loopless. + total, noLoop, ok = walkPath(paths[1], "connected") + if !ok || !noLoop { + t.Fatalf("path1: ok=%v noLoop=%v", ok, noLoop) + } + if math.Abs(total-6) > 1e-9 { + t.Fatalf("path1 summed weight: got %v want 6", total) + } +} + +// A path that revisits a uid must be flagged as not loopless. +func TestWalkPath_DetectsLoop(t *testing.T) { + const loopJSON = `{"_path_":[ + {"_weight_": 2, "uid": "0x1", + "connected": {"connected|weight": 1, "uid": "0x2", + "connected": {"connected|weight": 1, "uid": "0x1"}}} + ]}` + paths := parsePaths(t, loopJSON) + _, noLoop, ok := walkPath(paths[0], "connected") + if !ok { + t.Fatal("expected clean parse") + } + if noLoop { + t.Fatal("expected loop to be detected (0x1 repeats)") + } +} + +// A missing facet weight should fail the parse (can't verify consistency). 
+func TestWalkPath_MissingFacet(t *testing.T) { + const noFacet = `{"_path_":[ + {"_weight_": 1, "uid": "0x1", "connected": {"uid": "0x2"}} + ]}` + paths := parsePaths(t, noFacet) + _, _, ok := walkPath(paths[0], "connected") + if ok { + t.Fatal("expected ok=false when an edge facet is missing") + } +} diff --git a/k-shortest-path/internal/compare/compare.go b/k-shortest-path/internal/compare/compare.go new file mode 100644 index 0000000..0994766 --- /dev/null +++ b/k-shortest-path/internal/compare/compare.go @@ -0,0 +1,101 @@ +// Package compare diffs a Dgraph k-shortest weight vector against the oracle's. +// +// The comparison is on sorted total-path COSTS, never on path identity. With +// road-network weights many distinct paths tie on cost, and Dgraph and the +// oracle (gonum Yen) break ties differently — so requiring identical paths +// would manufacture false failures. The k-th smallest cost, however, is +// deterministic, so comparing the sorted cost vectors is both sound and +// tie-robust. +package compare + +import "math" + +// Verdict classifies a single (src,dst) comparison. +type Verdict int + +const ( + // OK: Dgraph returned the same number of paths as the oracle and every + // cost matches within tolerance. + OK Verdict = iota + // CountMismatch: Dgraph returned a different number of paths than the + // oracle — typically dropping paths under a frontier cap. + CountMismatch + // WeightMismatch: same count, but at least one cost differs beyond + // tolerance — Dgraph substituted a worse (or wrong) path for a real one. + WeightMismatch +) + +func (v Verdict) String() string { + switch v { + case OK: + return "ok" + case CountMismatch: + return "count_mismatch" + case WeightMismatch: + return "weight_mismatch" + default: + return "unknown" + } +} + +// Result is the outcome of comparing one weight vector pair. 
+type Result struct { + Verdict Verdict + OracleK int // number of paths the oracle found (== len(oracle)) + DgraphK int // number of paths Dgraph returned (== len(got)) + // WorstRelErr is the largest per-rank relative error among the ranks both + // vectors cover. It is 0 when every overlapping rank matches exactly. + WorstRelErr float64 +} + +// Correct reports whether the comparison passed. +func (r Result) Correct() bool { return r.Verdict == OK } + +// Vectors compares the oracle's expected cost vector against Dgraph's returned +// cost vector. Both must already be sorted non-decreasing (oracle.TopK and +// client.ShortestResult.Weights both guarantee this). tol is a relative +// tolerance (e.g. 1e-4); exact equality is checked first (see relErr) so that +// two empty vectors, or identical integer-valued costs, always pass. +func Vectors(oracle, got []float64, tol float64) Result { + r := Result{OracleK: len(oracle), DgraphK: len(got)} + if len(oracle) != len(got) { + // Still surface the worst error over the overlapping prefix, which is useful + // for spotting whether the paths Dgraph *did* return are also wrong.
+ r.WorstRelErr = worstRelErr(oracle, got) + r.Verdict = CountMismatch + return r + } + worst := worstRelErr(oracle, got) + r.WorstRelErr = worst + if worst > tol { + r.Verdict = WeightMismatch + return r + } + r.Verdict = OK + return r +} + +func worstRelErr(a, b []float64) float64 { + n := min(len(a), len(b)) + worst := 0.0 + for i := 0; i < n; i++ { + if e := relErr(a[i], b[i]); e > worst { + worst = e + } + } + return worst +} + +func relErr(expected, got float64) float64 { + if expected == got { // covers both-zero and both-Inf + return 0 + } + if math.IsInf(expected, 0) || math.IsInf(got, 0) { + return math.Inf(+1) + } + denom := math.Abs(expected) + if denom == 0 { + denom = 1 + } + return math.Abs(expected-got) / denom +} diff --git a/k-shortest-path/internal/compare/compare_test.go b/k-shortest-path/internal/compare/compare_test.go new file mode 100644 index 0000000..c224afe --- /dev/null +++ b/k-shortest-path/internal/compare/compare_test.go @@ -0,0 +1,68 @@ +package compare + +import "testing" + +const tol = 1e-4 + +func TestVectors_Exact(t *testing.T) { + r := Vectors([]float64{3, 6}, []float64{3, 6}, tol) + if !r.Correct() || r.Verdict != OK { + t.Fatalf("expected OK, got %v", r.Verdict) + } +} + +// Tie vector: identical costs at ranks 2 and 3. Must pass — this is the whole +// reason we compare costs not paths. +func TestVectors_Tie(t *testing.T) { + r := Vectors([]float64{3, 10, 10}, []float64{3, 10, 10}, tol) + if !r.Correct() { + t.Fatalf("tie vector should be OK, got %v (worst=%g)", r.Verdict, r.WorstRelErr) + } +} + +// Dgraph dropped a path under a frontier cap. 
+func TestVectors_CountMismatch(t *testing.T) { + r := Vectors([]float64{3, 6}, []float64{3}, tol) + if r.Verdict != CountMismatch { + t.Fatalf("expected count_mismatch, got %v", r.Verdict) + } + if r.OracleK != 2 || r.DgraphK != 1 { + t.Fatalf("counts wrong: oracle=%d dgraph=%d", r.OracleK, r.DgraphK) + } +} + +// Same count, but Dgraph substituted a worse second path (the STATUS.md bug +// shape: returns a path far longer than the true one). +func TestVectors_WeightMismatch(t *testing.T) { + r := Vectors([]float64{4058.67, 4100.0}, []float64{4058.67, 17636.65}, tol) + if r.Verdict != WeightMismatch { + t.Fatalf("expected weight_mismatch, got %v", r.Verdict) + } + if r.WorstRelErr <= tol { + t.Fatalf("worst rel err should exceed tol, got %g", r.WorstRelErr) + } +} + +// Within tolerance counts as OK (float noise on summed facets). +func TestVectors_WithinTolerance(t *testing.T) { + r := Vectors([]float64{4058.67}, []float64{4058.6701}, tol) + if !r.Correct() { + t.Fatalf("tiny diff should pass, got %v (worst=%g)", r.Verdict, r.WorstRelErr) + } +} + +// Both unreachable (empty vectors) is a correct match. +func TestVectors_BothEmpty(t *testing.T) { + r := Vectors(nil, nil, tol) + if !r.Correct() { + t.Fatalf("both empty should be OK, got %v", r.Verdict) + } +} + +// Oracle found a path, Dgraph found none: a count mismatch, not a silent pass. +func TestVectors_DgraphEmpty(t *testing.T) { + r := Vectors([]float64{3}, nil, tol) + if r.Verdict != CountMismatch { + t.Fatalf("expected count_mismatch, got %v", r.Verdict) + } +} diff --git a/k-shortest-path/internal/ldbc/parse.go b/k-shortest-path/internal/ldbc/parse.go new file mode 100644 index 0000000..34ccc7b --- /dev/null +++ b/k-shortest-path/internal/ldbc/parse.go @@ -0,0 +1,255 @@ +// Package ldbc parses the file artefacts of an LDBC Graphalytics datagen +// dataset: the .v vertex list, the .e edge list, the .properties metadata +// file, and the validation/<name>-SSSP reference output.
+// +// All parsers are streaming where useful; the .v and SSSP files comfortably +// fit in memory at M-scale (633K vertices ~ a few MB), but the .e file at +// 34M edges does not — use ScanEdges for that one. +package ldbc + +import ( + "bufio" + "fmt" + "io" + "math" + "os" + "path/filepath" + "strconv" + "strings" +) + +// Properties carries the subset of `.properties` fields we care about. +type Properties struct { + Directed bool + WeightedSSSP bool + SourceVertex int64 +} + +// ReadProperties parses a Java-properties style file. Only the keys we use +// are extracted; everything else is silently ignored. +func ReadProperties(path string) (Properties, error) { + f, err := os.Open(path) + if err != nil { + return Properties{}, err + } + defer f.Close() + + var p Properties + sc := bufio.NewScanner(f) + for sc.Scan() { + line := strings.TrimSpace(sc.Text()) + if line == "" || strings.HasPrefix(line, "#") { + continue + } + key, val, ok := strings.Cut(line, "=") + if !ok { + continue + } + key = strings.TrimSpace(key) + val = strings.TrimSpace(val) + switch { + case key == "graph.directed" || strings.HasSuffix(key, ".directed"): + p.Directed = strings.EqualFold(val, "true") + case key == "graph.weights" || strings.HasSuffix(key, ".sssp.weight-property"): + p.WeightedSSSP = true + case key == "algorithms.sssp.source-vertex" || strings.HasSuffix(key, ".sssp.source-vertex"): + n, err := strconv.ParseInt(val, 10, 64) + if err != nil { + return p, fmt.Errorf("parse sssp source: %w", err) + } + p.SourceVertex = n + case p.SourceVertex == 0 && (key == "algorithms.bfs.source-vertex" || strings.HasSuffix(key, ".bfs.source-vertex")): + // Fall back to BFS source-vertex when the dataset doesn't declare + // an SSSP source (e.g. unweighted graphs like cit-Patents). 
+ n, err := strconv.ParseInt(val, 10, 64) + if err != nil { + return p, fmt.Errorf("parse bfs source: %w", err) + } + p.SourceVertex = n + } + } + if err := sc.Err(); err != nil { + return p, err + } + return p, nil +} + +// ReadVertices reads the entire `.v` file into a slice. Each line is a +// single int64 vertex id. +func ReadVertices(path string) ([]int64, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + defer f.Close() + + var out []int64 + sc := bufio.NewScanner(f) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + s := strings.TrimSpace(sc.Text()) + if s == "" { + continue + } + n, err := strconv.ParseInt(s, 10, 64) + if err != nil { + return nil, fmt.Errorf("vertex parse %q: %w", s, err) + } + out = append(out, n) + } + return out, sc.Err() +} + +// Edge is a parsed line from the `.e` file. Weight is 1.0 for unweighted +// graphs (the third field is then absent). +type Edge struct { + Src int64 + Dst int64 + Weight float64 +} + +// ScanEdges streams the `.e` file. The callback is invoked once per edge; +// returning an error stops the scan. 
+func ScanEdges(path string, fn func(Edge) error) error { + f, err := os.Open(path) + if err != nil { + return err + } + defer f.Close() + + sc := bufio.NewScanner(f) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := strings.TrimSpace(sc.Text()) + if line == "" { + continue + } + fields := strings.Fields(line) + if len(fields) < 2 { + return fmt.Errorf("malformed edge line %q", line) + } + src, err := strconv.ParseInt(fields[0], 10, 64) + if err != nil { + return fmt.Errorf("edge src parse %q: %w", fields[0], err) + } + dst, err := strconv.ParseInt(fields[1], 10, 64) + if err != nil { + return fmt.Errorf("edge dst parse %q: %w", fields[1], err) + } + w := 1.0 + if len(fields) >= 3 { + w, err = strconv.ParseFloat(fields[2], 64) + if err != nil { + return fmt.Errorf("edge weight parse %q: %w", fields[2], err) + } + } + if err := fn(Edge{Src: src, Dst: dst, Weight: w}); err != nil { + return err + } + } + return sc.Err() +} + +// ReadSSSP loads a reference SSSP file produced by LDBC. Lines look like +// +// <vertex-id> <distance> +// +// Returns a map keyed by vertex id; unreachable vertices have value +// math.Inf(+1). +func ReadSSSP(path string) (map[int64]float64, error) { + f, err := os.Open(path) + if err != nil { + return nil, err + } + defer f.Close() + + out := make(map[int64]float64) + sc := bufio.NewScanner(f) + sc.Buffer(make([]byte, 1<<20), 1<<20) + for sc.Scan() { + line := strings.TrimSpace(sc.Text()) + if line == "" { + continue + } + fields := strings.Fields(line) + if len(fields) != 2 { + return nil, fmt.Errorf("malformed SSSP line %q", line) + } + vid, err := strconv.ParseInt(fields[0], 10, 64) + if err != nil { + return nil, fmt.Errorf("vertex parse %q: %w", fields[0], err) + } + var dist float64 + switch { + case strings.EqualFold(fields[1], "Infinity"): + dist = math.Inf(+1) + case fields[1] == "9223372036854775807": + // max int64: the LDBC BFS reference uses this as the unreachable marker.
+ dist = math.Inf(+1) + default: + dist, err = strconv.ParseFloat(fields[1], 64) + if err != nil { + return nil, fmt.Errorf("distance parse %q: %w", fields[1], err) + } + } + out[vid] = dist + } + return out, sc.Err() +} + +// Dataset captures the file paths (joined onto Root) for a single Graphalytics dataset. +// LoadDataset verifies VertexFile, EdgeFile and PropsFile exist; SSSPRefFile may be missing. +type Dataset struct { + Name string + Root string + VertexFile string + EdgeFile string + PropsFile string + SSSPRefFile string + Properties Properties +} + +// LoadDataset inspects `root` (typically datasets/<name>/) for the expected +// layout. It verifies file presence and reads `.properties` eagerly but does +// not touch `.v`, `.e`, or the reference file. If the dataset has no SSSP +// reference but does have a BFS reference, SSSPRefFile falls back to that, which is +// useful for unweighted graphs where BFS hop counts and SSSP distances align +// (i.e. when every edge weight is 1). +func LoadDataset(root string) (*Dataset, error) { + name := filepath.Base(strings.TrimRight(root, string(filepath.Separator))) + d := &Dataset{ + Name: name, + Root: root, + VertexFile: filepath.Join(root, name+".v"), + EdgeFile: filepath.Join(root, name+".e"), + PropsFile: filepath.Join(root, name+".properties"), + } + for label, p := range map[string]string{ + "vertex": d.VertexFile, "edge": d.EdgeFile, "properties": d.PropsFile, + } { + if _, err := os.Stat(p); err != nil { + return nil, fmt.Errorf("%s file %s: %w", label, p, err) + } + } + // Prefer SSSP reference; fall back to BFS if SSSP isn't present. + for _, suffix := range []string{"-SSSP", "-BFS"} { + candidate := filepath.Join(root, "validation", name+suffix) + if _, err := os.Stat(candidate); err == nil { + d.SSSPRefFile = candidate + break + } + } + if d.SSSPRefFile == "" { + // Set the canonical SSSP path even if missing, so error messages elsewhere + // point users at the expected location.
+ d.SSSPRefFile = filepath.Join(root, "validation", name+"-SSSP") + } + props, err := ReadProperties(d.PropsFile) + if err != nil { + return nil, err + } + d.Properties = props + return d, nil +} + +var _ io.Reader // silence unused-import on some builds diff --git a/k-shortest-path/internal/ldbc/parse_test.go b/k-shortest-path/internal/ldbc/parse_test.go new file mode 100644 index 0000000..9dcc5af --- /dev/null +++ b/k-shortest-path/internal/ldbc/parse_test.go @@ -0,0 +1,23 @@ +package ldbc + +import ( + "path/filepath" + "testing" +) + +func TestReadProperties_graphalyticsFormat(t *testing.T) { + path := filepath.Join("..", "..", "datasets", "datagen-7_5-fb", "datagen-7_5-fb.properties") + p, err := ReadProperties(path) + if err != nil { + t.Fatal(err) + } + if p.SourceVertex != 6 { + t.Fatalf("SourceVertex = %d, want 6", p.SourceVertex) + } + if p.Directed { + t.Fatal("expected undirected graph") + } + if !p.WeightedSSSP { + t.Fatal("expected weighted SSSP") + } +} diff --git a/k-shortest-path/internal/oracle/oracle.go b/k-shortest-path/internal/oracle/oracle.go new file mode 100644 index 0000000..47cddbb --- /dev/null +++ b/k-shortest-path/internal/oracle/oracle.go @@ -0,0 +1,135 @@ +// Package oracle is the ground truth for k-shortest-path correctness. It wraps +// gonum's YenKShortestPaths (v0.17, well past the v0.14 fix for the loop / +// missing-path bug) to compute, for a given (src,dst), the sorted vector of +// total path weights of the k shortest LOOPLESS paths. +// +// The central design choice — and the thing that makes the comparison robust +// to ties — is that we compare the *weight vector* [W1, W2, ...], not path +// identity. With road-network weights (especially after rounding), many +// distinct paths tie on total cost; which path realises a given cost is +// implementation-dependent and differs between Yen and Dgraph. But the k-th +// smallest *cost* is deterministic. 
So a correct comparison checks the sorted +// costs, never the node sequences. +// +// The graph is built to mirror exactly the edge set Dgraph traverses for the +// `connected` predicate: feed it the same `.e` stream the converter reads, with +// the same directedness rule (see cmd/convert: undirected datasets emit both +// directions). Parallel edges collapse to the minimum weight, matching +// Dgraph's set-valued [uid] predicate (one edge per (src,dst)). +package oracle + +import ( + "fmt" + "math" + "sort" + + "gonum.org/v1/gonum/graph" + "gonum.org/v1/gonum/graph/path" + "gonum.org/v1/gonum/graph/simple" +) + +// Graph is a weighted directed graph backing the oracle. +type Graph struct { + g *simple.WeightedDirectedGraph +} + +// New returns an empty oracle graph. self=0 (node-to-itself weight) and +// absent=+Inf (no edge) are the standard shortest-path conventions. +func New() *Graph { + return &Graph{g: simple.NewWeightedDirectedGraph(0, math.Inf(1))} +} + +// AddEdge inserts a directed edge src->dst with the given weight. Parallel +// edges collapse to the minimum weight so the oracle's path costs agree with +// Dgraph's single-edge-per-pair model. Self-loops are dropped — they cannot +// appear in a loopless path. YenKShortestPaths panics on negative weights, so +// reject them here with a clear error path instead. +func (gr *Graph) AddEdge(src, dst int64, weight float64) error { + if weight < 0 { + return fmt.Errorf("negative edge weight %g on %d->%d (Yen requires non-negative)", weight, src, dst) + } + if src == dst { + return nil + } + if e := gr.g.WeightedEdge(src, dst); e != nil { + if weight >= e.Weight() { + return nil // existing edge is at least as cheap; keep it + } + gr.g.RemoveEdge(src, dst) + } + gr.g.SetWeightedEdge(gr.g.NewWeightedEdge(simple.Node(src), simple.Node(dst), weight)) + return nil +} + +// AddNode ensures a vertex exists even if it has no edges. 
Needed so the SSSP +// reference lists isolated/unreachable vertices (as +Inf) rather than omitting +// them. +func (gr *Graph) AddNode(id int64) { + if gr.g.Node(id) == nil { + gr.g.AddNode(simple.Node(id)) + } +} + +// Nodes returns the number of vertices currently in the graph. +func (gr *Graph) Nodes() int { return gr.g.Nodes().Len() } + +// TopK returns the sorted (non-decreasing) total weights of up to k shortest +// loopless paths from src to dst. The slice has length <= k; a length shorter +// than k means fewer than k distinct loopless paths exist. An empty slice means +// dst is unreachable from src. +// +// This weight vector is the oracle's answer. It is deterministic regardless of +// how ties among equal-cost paths are broken. +func (gr *Graph) TopK(src, dst int64, k int) ([]float64, error) { + s := gr.g.Node(src) + if s == nil { + return nil, fmt.Errorf("src %d not in graph", src) + } + t := gr.g.Node(dst) + if t == nil { + return nil, fmt.Errorf("dst %d not in graph", dst) + } + paths := path.YenKShortestPaths(gr.g, k, math.Inf(1), s, t) + weights := make([]float64, 0, len(paths)) + for _, p := range paths { + w, err := gr.pathWeight(p) + if err != nil { + return nil, err + } + weights = append(weights, w) + } + sort.Float64s(weights) + return weights, nil +} + +// SSSP returns single-source shortest-path distances from src to every node, +// via gonum's Dijkstra on the same graph used for TopK. Unreachable nodes map +// to +Inf. This is the top-1 reference for the existing correctness mode, and +// computing it on the identical graph guarantees it agrees with the oracle. 
+func (gr *Graph) SSSP(src int64) (map[int64]float64, error) { + s := gr.g.Node(src) + if s == nil { + return nil, fmt.Errorf("src %d not in graph", src) + } + tree := path.DijkstraFrom(s, gr.g) + out := make(map[int64]float64) + nodes := gr.g.Nodes() + for nodes.Next() { + id := nodes.Node().ID() + out[id] = tree.WeightTo(id) + } + return out, nil +} + +// pathWeight sums the edge weights along a node sequence returned by Yen. +func (gr *Graph) pathWeight(nodes []graph.Node) (float64, error) { + total := 0.0 + for i := 0; i+1 < len(nodes); i++ { + w, ok := gr.g.Weight(nodes[i].ID(), nodes[i+1].ID()) + if !ok || math.IsInf(w, 1) { + return 0, fmt.Errorf("returned path traverses missing edge %d->%d", nodes[i].ID(), nodes[i+1].ID()) + } + total += w + } + return total, nil +} diff --git a/k-shortest-path/internal/oracle/oracle_test.go b/k-shortest-path/internal/oracle/oracle_test.go new file mode 100644 index 0000000..261831a --- /dev/null +++ b/k-shortest-path/internal/oracle/oracle_test.go @@ -0,0 +1,167 @@ +package oracle + +import ( + "math" + "testing" +) + +const eps = 1e-9 + +func vecEqual(a, b []float64) bool { + if len(a) != len(b) { + return false + } + for i := range a { + if math.Abs(a[i]-b[i]) > eps { + return false + } + } + return true +} + +func mustAdd(t *testing.T, g *Graph, src, dst int64, w float64) { + t.Helper() + if err := g.AddEdge(src, dst, w); err != nil { + t.Fatalf("AddEdge(%d,%d,%g): %v", src, dst, w, err) + } +} + +// Distinct top-2: 1-2-4 = 3 (shortest), 1-3-4 = 6 (second). No third path. +// Verifies the basic top-k cost vector and that asking for more than exists +// returns only what exists. 
+func TestTopK_Distinct(t *testing.T) { + g := New() + mustAdd(t, g, 1, 2, 1) + mustAdd(t, g, 2, 4, 2) + mustAdd(t, g, 1, 3, 1) + mustAdd(t, g, 3, 4, 5) + + got, err := g.TopK(1, 4, 2) + if err != nil { + t.Fatal(err) + } + if want := []float64{3, 6}; !vecEqual(got, want) { + t.Fatalf("top-2: got %v want %v", got, want) + } + + // k=5 but only 2 paths exist. + got, err = g.TopK(1, 4, 5) + if err != nil { + t.Fatal(err) + } + if want := []float64{3, 6}; !vecEqual(got, want) { + t.Fatalf("top-5 (only 2 exist): got %v want %v", got, want) + } +} + +// Tie case — the whole point of comparing weight vectors. Three paths: +// +// 10->40 direct = 3 (shortest) +// 10-20-40 = 10 +// 10-30-40 = 10 (ties with the above) +// +// Yen may return either tied path in either slot; the cost vector [3,10,10] is +// invariant. If this passes, ties cannot produce false correctness failures. +func TestTopK_Tie(t *testing.T) { + g := New() + mustAdd(t, g, 10, 40, 3) + mustAdd(t, g, 10, 20, 5) + mustAdd(t, g, 20, 40, 5) + mustAdd(t, g, 10, 30, 5) + mustAdd(t, g, 30, 40, 5) + + got, err := g.TopK(10, 40, 3) + if err != nil { + t.Fatal(err) + } + if want := []float64{3, 10, 10}; !vecEqual(got, want) { + t.Fatalf("tie top-3: got %v want %v", got, want) + } +} + +// Parallel edges collapse to the minimum weight: adding a more expensive 1->2 +// after the cheap one must not change the shortest cost. +func TestTopK_MinWeightDedup(t *testing.T) { + g := New() + mustAdd(t, g, 1, 2, 1) + mustAdd(t, g, 1, 2, 100) // more expensive parallel edge — must be ignored + mustAdd(t, g, 2, 4, 2) + + got, err := g.TopK(1, 4, 1) + if err != nil { + t.Fatal(err) + } + if want := []float64{3}; !vecEqual(got, want) { + t.Fatalf("min-weight dedup: got %v want %v", got, want) + } + + // Insertion order independence: cheaper edge added second must win too. 
+ g2 := New() + mustAdd(t, g2, 1, 2, 100) + mustAdd(t, g2, 1, 2, 1) // cheaper, added second + mustAdd(t, g2, 2, 4, 2) + got, err = g2.TopK(1, 4, 1) + if err != nil { + t.Fatal(err) + } + if want := []float64{3}; !vecEqual(got, want) { + t.Fatalf("min-weight dedup (reverse order): got %v want %v", got, want) + } +} + +// Unreachable target yields an empty vector, not an error. +func TestTopK_Unreachable(t *testing.T) { + g := New() + mustAdd(t, g, 1, 2, 1) + mustAdd(t, g, 3, 4, 1) // disconnected component + // node 4 must exist for the lookup; it does (added as edge endpoint). + got, err := g.TopK(1, 4, 2) + if err != nil { + t.Fatal(err) + } + if len(got) != 0 { + t.Fatalf("unreachable: got %v want empty", got) + } +} + +// Direction matters: an edge a->b does not imply b->a. Confirms we built a +// directed graph, matching Dgraph's directed `connected` traversal. +func TestTopK_Directed(t *testing.T) { + g := New() + mustAdd(t, g, 1, 2, 1) + got, err := g.TopK(2, 1, 1) // reverse direction — no edge + if err != nil { + t.Fatal(err) + } + if len(got) != 0 { + t.Fatalf("directed: 2->1 should be unreachable, got %v", got) + } +} + +// Mirrors cmd/handprobe's hand graph exactly. Confirms the oracle predicts the +// LOOPLESS top-2 = [3, 6]; the disjoint alternative (1-5-4 = 20) is correctly +// NOT in the top 2. This is the number the gate probe checks Dgraph against. 
+func TestTopK_HandprobeGraph(t *testing.T) { + g := New() + mustAdd(t, g, 1, 2, 1) + mustAdd(t, g, 2, 3, 1) + mustAdd(t, g, 3, 4, 1) + mustAdd(t, g, 2, 4, 5) + mustAdd(t, g, 1, 5, 10) + mustAdd(t, g, 5, 4, 10) + + got, err := g.TopK(1, 4, 2) + if err != nil { + t.Fatal(err) + } + if want := []float64{3, 6}; !vecEqual(got, want) { + t.Fatalf("handprobe top-2: got %v want %v (loopless)", got, want) + } +} + +func TestAddEdge_NegativeWeight(t *testing.T) { + g := New() + if err := g.AddEdge(1, 2, -1); err == nil { + t.Fatal("expected error on negative weight") + } +} diff --git a/k-shortest-path/internal/stats/stats.go b/k-shortest-path/internal/stats/stats.go new file mode 100644 index 0000000..caeeaa1 --- /dev/null +++ b/k-shortest-path/internal/stats/stats.go @@ -0,0 +1,87 @@ +// Package stats provides a tiny latency-recording utility with p50/p95/p99. +// Backed by a sorted slice; fine for sample counts up to a few million. +package stats + +import ( + "math" + "sort" + "sync" + "time" +) + +type Recorder struct { + mu sync.Mutex + samples []time.Duration + errs int64 +} + +func New() *Recorder { return &Recorder{} } + +func (r *Recorder) Record(d time.Duration) { + r.mu.Lock() + r.samples = append(r.samples, d) + r.mu.Unlock() +} + +func (r *Recorder) RecordError() { + r.mu.Lock() + r.errs++ + r.mu.Unlock() +} + +type Summary struct { + Count int `json:"count"` + Errors int64 `json:"errors"` + Min time.Duration `json:"min_ns"` + Max time.Duration `json:"max_ns"` + Mean time.Duration `json:"mean_ns"` + P50 time.Duration `json:"p50_ns"` + P95 time.Duration `json:"p95_ns"` + P99 time.Duration `json:"p99_ns"` + P999 time.Duration `json:"p999_ns"` + Duration time.Duration `json:"wall_ns"` + QPS float64 `json:"qps"` +} + +// Summarize returns percentile stats. Pass the wall-clock duration the +// recorder was active for so QPS can be computed. 
+func (r *Recorder) Summarize(wall time.Duration) Summary { + r.mu.Lock() + defer r.mu.Unlock() + + s := Summary{Count: len(r.samples), Errors: r.errs, Duration: wall} + if len(r.samples) == 0 { + return s + } + sort.Slice(r.samples, func(i, j int) bool { return r.samples[i] < r.samples[j] }) + s.Min = r.samples[0] + s.Max = r.samples[len(r.samples)-1] + + var sum int64 + for _, v := range r.samples { + sum += int64(v) + } + s.Mean = time.Duration(sum / int64(len(r.samples))) + s.P50 = quantile(r.samples, 0.50) + s.P95 = quantile(r.samples, 0.95) + s.P99 = quantile(r.samples, 0.99) + s.P999 = quantile(r.samples, 0.999) + if wall > 0 { + s.QPS = float64(len(r.samples)) / wall.Seconds() + } + return s +} + +func quantile(sorted []time.Duration, q float64) time.Duration { + if len(sorted) == 0 { + return 0 + } + idx := int(math.Ceil(q*float64(len(sorted)))) - 1 + if idx < 0 { + idx = 0 + } + if idx >= len(sorted) { + idx = len(sorted) - 1 + } + return sorted[idx] +} diff --git a/k-shortest-path/internal/stats/stats_test.go b/k-shortest-path/internal/stats/stats_test.go new file mode 100644 index 0000000..5a6f984 --- /dev/null +++ b/k-shortest-path/internal/stats/stats_test.go @@ -0,0 +1,37 @@ +package stats + +import ( + "testing" + "time" +) + +func TestSummarizePercentiles(t *testing.T) { + r := New() + for i := 1; i <= 100; i++ { + r.Record(time.Duration(i) * time.Millisecond) + } + s := r.Summarize(100 * time.Millisecond) + if s.Count != 100 { + t.Fatalf("count=%d", s.Count) + } + if s.P50 != 50*time.Millisecond { + t.Errorf("p50=%v want 50ms", s.P50) + } + if s.P95 != 95*time.Millisecond { + t.Errorf("p95=%v want 95ms", s.P95) + } + if s.P99 != 99*time.Millisecond { + t.Errorf("p99=%v want 99ms", s.P99) + } + if s.QPS <= 0 { + t.Errorf("qps=%v", s.QPS) + } +} + +func TestSummarizeEmpty(t *testing.T) { + r := New() + s := r.Summarize(time.Second) + if s.Count != 0 || s.QPS != 0 { + t.Fatalf("unexpected empty summary: %+v", s) + } +} diff --git 
a/k-shortest-path/scripts/bench.sh b/k-shortest-path/scripts/bench.sh new file mode 100755 index 0000000..d02868f --- /dev/null +++ b/k-shortest-path/scripts/bench.sh @@ -0,0 +1,566 @@ +#!/usr/bin/env bash +# bench.sh — end-to-end Dgraph k-shortest path correctness benchmark. +# +# ONE COMMAND from zero to results: downloads the dataset, builds each +# requested Dgraph branch, loads the data, sweeps maxfrontiersize values, and +# prints a correctness table comparing Dgraph's path-cost vector against the +# gonum Yen oracle. +# +# Usage: +# ./scripts/bench.sh --dataset-url [options] [branch ...] +# +# When no branches are given the default is "main". +# +# Examples: +# # Benchmark main on roadCOL: +# ./scripts/bench.sh --dataset-url https://example.com/roadCOL.gr.gz +# +# # Compare two branches (dataset already on disk, skip re-download): +# ./scripts/bench.sh --dataset-url main pr-9599 +# +# # Run detached — memory-capped, survives SSH disconnect: +# ./scripts/bench.sh --dataset-url --detach main pr-9599 +# +# # Force a fresh bulk-load even though p/ is present: +# ./scripts/bench.sh --dataset-url --force-bulk main + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +BENCH_DIR="$(cd "$SCRIPT_DIR/.." 
&& pwd)" + +# ── defaults ────────────────────────────────────────────────────────────────── +DATASET_URL="" +DATASET_FORMAT="dimacs" +DATASET_NAME="" +SOURCE_VERTEX="1" +DGRAPH_REPO="${DGRAPH_REPO:-$HOME/dgraph}" +DGRAPH_REMOTE="https://github.com/dgraph-io/dgraph.git" +ALPHA_DIR="${ALPHA_DIR:-$HOME/db}" +RESULTS_DIR="" +FRONTIERS="100,1000,2000" +TARGETS="30" +NUMPATHS="2" +TIMEOUT="60s" +SEED="1" +BANDLO="0.0005" +BANDHI="0.004" +TOL="0.0001" +MEMORY_MAX="${MEMORY_MAX:-48G}" +MEMORY_SWAP_MAX="${MEMORY_SWAP_MAX:-1G}" +FORCE_BULK=0 +DETACH=0 +RUN_TAG="" +CAPTURE_PPROF="${CAPTURE_PPROF:-1}" +BRANCHES=() +ORIG_ARGS=("$@") + +# ── helpers ─────────────────────────────────────────────────────────────────── +ts() { date +'%Y-%m-%d %H:%M:%S'; } +log() { echo "[$(ts)] $*"; } +die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; } +warn() { echo "[$(ts)] WARN: $*" >&2; } + +usage() { +cat <<'EOF' +Usage: ./scripts/bench.sh --dataset-url [options] [branch ...] + +Required (first run; skipped once the dataset is already on disk): + --dataset-url URL passed to cmd/prepare (download + extract). + Use a file:// URL for local archives. + +Dataset options: + --dataset-name Directory name under datasets/ (default: URL basename). + --dataset-format dimacs|csv Raw format for cmd/prepare (default: dimacs). + --source-vertex SSSP source vertex ID in .properties file (default: 1). + +Infrastructure: + --dgraph-repo Dgraph source tree; cloned if absent (default: $HOME/dgraph). + --dgraph-remote Clone remote (default: github.com/dgraph-io/dgraph). + --alpha-dir Alpha workspace for p/t/w/zw (default: $HOME/db). + --results-dir Output root (default: /results). + +Sweep tuning: + --frontiers Comma-separated maxfrontiersize values (default: 100,1000,2000). + --targets Banded target pairs per run (default: 30). + --numpaths k for k-shortest queries (default: 2). + --timeout Per-query wall-clock limit (default: 60s). + +Memory (--detach only): + --memory-max cgroup hard cap (default: 48G). 
+ --memory-swap-max cgroup swap cap (default: 1G). + +Run control: + --force-bulk Re-run bulk load even when p/ is present. + --detach Launch in a memory-capped systemd transient unit + (survives SSH disconnect); prints watch commands and exits. + -h, --help Print this message. +EOF +} + +# ── arg parsing ─────────────────────────────────────────────────────────────── +while [[ $# -gt 0 ]]; do + case "$1" in + --dataset-url) DATASET_URL="$2"; shift 2 ;; + --dataset-format) DATASET_FORMAT="$2"; shift 2 ;; + --dataset-name) DATASET_NAME="$2"; shift 2 ;; + --source-vertex) SOURCE_VERTEX="$2"; shift 2 ;; + --dgraph-repo) DGRAPH_REPO="$2"; shift 2 ;; + --dgraph-remote) DGRAPH_REMOTE="$2"; shift 2 ;; + --alpha-dir) ALPHA_DIR="$2"; shift 2 ;; + --results-dir) RESULTS_DIR="$2"; shift 2 ;; + --frontiers) FRONTIERS="$2"; shift 2 ;; + --targets) TARGETS="$2"; shift 2 ;; + --numpaths) NUMPATHS="$2"; shift 2 ;; + --timeout) TIMEOUT="$2"; shift 2 ;; + --seed) SEED="$2"; shift 2 ;; + --band-lo) BANDLO="$2"; shift 2 ;; + --band-hi) BANDHI="$2"; shift 2 ;; + --tol) TOL="$2"; shift 2 ;; + --memory-max) MEMORY_MAX="$2"; shift 2 ;; + --memory-swap-max) MEMORY_SWAP_MAX="$2"; shift 2 ;; + --force-bulk) FORCE_BULK=1; shift ;; + --detach) DETACH=1; shift ;; + --run-tag) RUN_TAG="$2"; shift 2 ;; # internal: set by --detach re-launch + -h|--help) usage; exit 0 ;; + --*) die "unknown option: $1 (see --help)" ;; + *) BRANCHES+=("$1"); shift ;; + esac +done + +[[ ${#BRANCHES[@]} -eq 0 ]] && BRANCHES=("main") + +# Derive dataset name from URL when not given explicitly. +if [[ -z "$DATASET_NAME" && -n "$DATASET_URL" ]]; then + DATASET_NAME="$(basename "$DATASET_URL")" + DATASET_NAME="${DATASET_NAME%%.*}" +fi + +RESULTS_DIR="${RESULTS_DIR:-$BENCH_DIR/results}" +KS_DIR="$RESULTS_DIR/kshortest" +LOG_DIR="$RESULTS_DIR/logs" +ZERO_DIR="${ZERO_DIR:-$ALPHA_DIR/zero-setup}" +ALPHA_DIR_PREFIX_ALLOW="$(dirname "$ALPHA_DIR")/" + +# Endpoints (overridable for non-localhost setups). 
+ALPHA_HTTP_URL="${ALPHA_HTTP_URL:-http://localhost:8080}" +ALPHA_HEALTH_URL="$ALPHA_HTTP_URL/health" +ALPHA_GRPC="${ALPHA_GRPC:-localhost:9080}" +ZERO_STATE_URL="${ZERO_STATE_URL:-http://localhost:6080/state}" +ZERO_GRPC_ADDR="${ZERO_GRPC_ADDR:-localhost:5080}" +ALPHA_HEALTH_TIMEOUT_SEC="${ALPHA_HEALTH_TIMEOUT_SEC:-300}" + +# alpha.sh uses DATASETS array in guard_rm_target to protect bulk p/ from rm. +DATASETS=("${DATASET_NAME:-_unset_}") +source "$SCRIPT_DIR/lib/alpha.sh" + +# ── systemd availability check (used by detach and ensure_zero) ─────────────── +HAVE_SYSTEMD=0 +if [[ "${USE_SYSTEMD:-1}" == "1" ]] \ + && command -v systemd-run >/dev/null 2>&1 \ + && sudo -n true 2>/dev/null; then + HAVE_SYSTEMD=1 +fi + +# ── detach: re-launch self inside a memory-capped systemd transient unit ────── +if [[ $DETACH -eq 1 ]]; then + [[ -n "$DATASET_NAME" ]] \ + || die "--dataset-url or --dataset-name is required when using --detach" + + LABEL=$(printf '%s' "${BRANCHES[*]}" | tr ' ' '-') + RUN_TAG="$(date +%Y%m%d-%H%M%S)" + mkdir -p "$RESULTS_DIR" + OUT="$RESULTS_DIR/sweep-$LABEL-$RUN_TAG.out" + : > "$OUT" + ln -sfn "$OUT" "$RESULTS_DIR/sweep-$LABEL-latest.out" + + # Stop any stray bench/alpha from a previous run. + # Zero is shared infra holding uid-lease state -- never touched here. + sudo systemctl stop 'dgraph-bench-*' 2>/dev/null || true + pkill -f 'cmd/bench' 2>/dev/null || true + pkill -f 'dgraph alpha' 2>/dev/null || true + sleep 2 + + # Forward all original args minus --detach; append internal --run-tag so + # the unit's JSON artifacts share the same tag as the .out file. 
+ fwd_args=() + for arg in "${ORIG_ARGS[@]}"; do + [[ "$arg" == "--detach" ]] || fwd_args+=("$arg") + done + fwd_args+=(--run-tag "$RUN_TAG") + + SCRIPT_PATH="$(realpath "${BASH_SOURCE[0]}")" + + if (( HAVE_SYSTEMD )); then + UNIT="dgraph-bench-$LABEL" + sudo systemctl reset-failed "$UNIT" 2>/dev/null || true + sudo systemd-run --unit="$UNIT" --collect \ + -p MemoryMax="$MEMORY_MAX" \ + -p MemorySwapMax="$MEMORY_SWAP_MAX" \ + -p OOMPolicy=continue \ + -p "User=$(id -un)" \ + -p "WorkingDirectory=$BENCH_DIR" \ + -p "StandardOutput=append:$OUT" \ + -p "StandardError=append:$OUT" \ + --setenv=HOME="$HOME" \ + --setenv=PATH="$PATH" \ + "$SCRIPT_PATH" "${fwd_args[@]}" + echo "[bench] launched as unit $UNIT (MemoryMax=$MEMORY_MAX MemorySwapMax=$MEMORY_SWAP_MAX)" + echo "[bench] watch: journalctl -u $UNIT -f" + echo "[bench] status: systemctl status $UNIT" + else + warn "systemd-run / passwordless sudo unavailable -- using nohup (no memory cap)" + warn "a runaway query can exhaust RAM; consider systemd or running on a cgroup-enabled VM" + nohup "$SCRIPT_PATH" "${fwd_args[@]}" >> "$OUT" 2>&1 & + disown + echo "[bench] launched (pid $!)" + fi + echo "[bench] run tag: $RUN_TAG" + echo "[bench] watch: tail -f $OUT" + echo "[bench] tail -f $RESULTS_DIR/sweep-$LABEL-latest.out" + echo "[bench] results: $KS_DIR/-${DATASET_NAME}-${RUN_TAG}.json (written per branch)" + exit 0 +fi + +# ═══════════════════════════════════════════════════════════════════════════════ +# Synchronous run +# ═══════════════════════════════════════════════════════════════════════════════ +RUN_TAG="${RUN_TAG:-$(date +%Y%m%d-%H%M%S)}" +LABEL=$(printf '%s' "${BRANCHES[*]}" | tr ' ' '-') +mkdir -p "$KS_DIR" "$LOG_DIR" +MASTER_LOG="$LOG_DIR/master-$RUN_TAG.log" +exec > >(tee -a "$MASTER_LOG") 2>&1 + +log "=================================================================" +log " bench.sh — k-shortest path correctness benchmark" +log "=================================================================" +log " run 
tag: $RUN_TAG" +log " branches: ${BRANCHES[*]}" +log " dataset: ${DATASET_NAME:-(not yet known)} format=$DATASET_FORMAT" +log " frontiers: $FRONTIERS" +log " targets: $TARGETS numpaths=$NUMPATHS timeout=$TIMEOUT" +log " dgraph repo: $DGRAPH_REPO" +log " alpha dir: $ALPHA_DIR" +log " results dir: $RESULTS_DIR" +log "=================================================================" + +trap 'stop_memlog; alpha_cleanup_on_exit' EXIT + +# ── phase 1: preflight ──────────────────────────────────────────────────────── +log "" +log "=== [1/5] preflight ===" +preflight_memory +for bin in dgraph go git make curl jq awk; do + command -v "$bin" >/dev/null || die "$bin not found on PATH" +done +[[ -d "$BENCH_DIR" ]] || die "BENCH_DIR not found: $BENCH_DIR" +( cd "$BENCH_DIR" && go build ./... ) 2>&1 \ + | tee "$LOG_DIR/build-bench-$RUN_TAG.log" \ + || die "bench tools failed to compile -- see $LOG_DIR/build-bench-$RUN_TAG.log" +log " tools OK" + +# ── phase 2: dgraph repo ────────────────────────────────────────────────────── +log "" +log "=== [2/5] dgraph repo ($DGRAPH_REPO) ===" +if [[ ! -d "$DGRAPH_REPO/.git" ]]; then + log " cloning $DGRAPH_REMOTE -> $DGRAPH_REPO (this may take a few minutes)" + git clone "$DGRAPH_REMOTE" "$DGRAPH_REPO" 2>&1 \ + | tee "$LOG_DIR/clone-$RUN_TAG.log" \ + || die "git clone failed -- see $LOG_DIR/clone-$RUN_TAG.log" +fi +( cd "$DGRAPH_REPO" + log " fetching origin..." + git fetch origin 2>&1 \ + | tee "$LOG_DIR/fetch-$RUN_TAG.log" \ + | tail -5 || warn "git fetch failed (offline?)" + git diff --quiet && git diff --cached --quiet \ + || die "dgraph working tree is dirty -- commit or stash before benchmarking" + for br in "${BRANCHES[@]}"; do + if ! 
git rev-parse --verify --quiet "$br" >/dev/null 2>&1; then + git branch "$br" "origin/$br" 2>/dev/null \ + || die "branch '$br' not found locally or on origin (check spelling or run git fetch)" + log " created local branch $br from origin/$br" + else + log " branch $br present" + fi + done ) +log " repo ready" + +# ── phase 3: dataset ────────────────────────────────────────────────────────── +log "" +log "=== [3/5] dataset ($DATASET_NAME) ===" +[[ -n "$DATASET_NAME" ]] || die "--dataset-url is required (cannot infer dataset name)" +DATASETS=("$DATASET_NAME") +ds_dir="$BENCH_DIR/datasets/$DATASET_NAME" + +if [[ ! -f "$ds_dir/$DATASET_NAME.properties" ]]; then + [[ -n "$DATASET_URL" ]] \ + || die "--dataset-url is required: dataset not found at $ds_dir" + log " running cmd/prepare (download + extract -- may take a few minutes)" + log " url=$DATASET_URL format=$DATASET_FORMAT name=$DATASET_NAME source=$SOURCE_VERTEX" + ( cd "$BENCH_DIR" && go run ./cmd/prepare \ + -format "$DATASET_FORMAT" \ + -name "$DATASET_NAME" \ + -url "$DATASET_URL" \ + -source "$SOURCE_VERTEX" \ + ) 2>&1 | tee "$LOG_DIR/prepare-$DATASET_NAME-$RUN_TAG.log" \ + || die "cmd/prepare failed -- see $LOG_DIR/prepare-$DATASET_NAME-$RUN_TAG.log" + [[ -f "$ds_dir/$DATASET_NAME.properties" ]] \ + || die "cmd/prepare completed but $ds_dir/$DATASET_NAME.properties not found" +else + log " already extracted at $ds_dir (skipping prepare)" +fi + +rdf="$ds_dir/dgraph/graph.rdf.gz" +schema="$ds_dir/dgraph/graph.schema" +if [[ ! -f "$rdf" || ! 
-f "$schema" ]]; then + log " running cmd/convert (raw -> RDF + schema)" + ( cd "$BENCH_DIR" && go run ./cmd/convert -dataset "$ds_dir" ) \ + 2>&1 | tee "$LOG_DIR/convert-$DATASET_NAME-$RUN_TAG.log" \ + || die "cmd/convert failed -- see $LOG_DIR/convert-$DATASET_NAME-$RUN_TAG.log" + [[ -f "$rdf" && -f "$schema" ]] \ + || die "cmd/convert did not produce $rdf and/or $schema" +else + log " RDF + schema already present (skipping convert)" +fi +log " dataset ready: $ds_dir" + +# ── phase 4: zero + bulk load ───────────────────────────────────────────────── +log "" +log "=== [4/5] zero + bulk load ===" + +zero_up() { curl -s -m 3 "$ZERO_STATE_URL" >/dev/null 2>&1; } + +ensure_zero() { + log " ensuring zero is running..." + if (( HAVE_SYSTEMD )); then + local dgraph_bin + dgraph_bin=$(command -v dgraph) || die "dgraph not on PATH" + mkdir -p "$ZERO_DIR" + local unit_file=/etc/systemd/system/dgraph-zero.service + # Write the unit file only when the content has changed (idempotent). + local desired + desired="[Unit] +Description=Dgraph Zero (bench infra — do not stop between branch runs) +After=network.target + +[Service] +Type=simple +User=$(id -un) +WorkingDirectory=$ZERO_DIR +ExecStart=$dgraph_bin zero --my=$ZERO_GRPC_ADDR --replicas=1 +Restart=on-failure +RestartSec=5 +MemoryMax=4G +OOMScoreAdjust=-500 +StandardOutput=append:$ZERO_DIR/zero.log +StandardError=append:$ZERO_DIR/zero.log + +[Install] +WantedBy=multi-user.target" + if [[ ! -f "$unit_file" ]] \ + || ! 
diff -q <(printf '%s\n' "$desired") "$unit_file" >/dev/null 2>&1; then + printf '%s\n' "$desired" | sudo tee "$unit_file" >/dev/null + sudo systemctl daemon-reload + log " wrote $unit_file" + fi + sudo systemctl enable dgraph-zero >/dev/null 2>&1 || true + if zero_up; then + if systemctl is-active --quiet dgraph-zero; then + log " zero running (dgraph-zero.service) -- leaving it alone" + else + log " zero running (legacy process) -- leaving it alone" + log " (dgraph-zero.service is enabled and takes over on next reboot)" + fi + else + pkill -f 'dgraph zero' 2>/dev/null || true + sleep 1 + sudo systemctl restart dgraph-zero + for _ in $(seq 1 15); do zero_up && break; sleep 2; done + zero_up || die "zero failed to start -- journalctl -u dgraph-zero -n 50" + log " zero started via dgraph-zero.service" + fi + else + # No systemd: start via nohup and leave it running. Zero is shared infra; + # it is NOT stopped on exit -- only alpha is cleaned up in the EXIT trap. + if zero_up; then + log " zero already running" + else + log " starting zero (nohup) in $ZERO_DIR" + pkill -f 'dgraph zero' 2>/dev/null || true + sleep 1 + mkdir -p "$ZERO_DIR" + ( cd "$ZERO_DIR" && nohup dgraph zero --my="$ZERO_GRPC_ADDR" --replicas=1 \ + > zero.log 2>&1 & ) + sleep 6 + zero_up || die "zero failed to start -- see $ZERO_DIR/zero.log" + log " zero started (nohup -- install systemd for reboot-proof operation)" + fi + fi +} + +ensure_zero + +bp=$(bulk_p_for "$DATASET_NAME") +if [[ -d "$bp" && $FORCE_BULK -eq 0 ]]; then + log " bulk p/ present: $bp ($(size_of "$bp"))" + log " skipping bulk load (pass --force-bulk to redo)" +else + if [[ $FORCE_BULK -eq 1 && -d "$bp" ]]; then + log " --force-bulk: removing existing $bp" + rm -rf "$bp" + fi + log " running dgraph bulk (this takes several minutes)..." 
+ ( cd "$ds_dir/dgraph" && dgraph bulk \ + -f "$rdf" \ + -s "$schema" \ + --zero "$ZERO_GRPC_ADDR" \ + --out bulk-out \ + ) 2>&1 | tee "$LOG_DIR/bulk-$DATASET_NAME-$RUN_TAG.log" \ + || die "dgraph bulk failed -- see $LOG_DIR/bulk-$DATASET_NAME-$RUN_TAG.log" + [[ -d "$bp" ]] || die "bulk completed but expected p/ not found at $bp" + log " bulk done: $bp ($(size_of "$bp"))" +fi +require_zero + +# ── phase 5: sweep ──────────────────────────────────────────────────────────── +log "" +log "=== [5/5] sweep (${#BRANCHES[@]} branch(es) × frontiers: $FRONTIERS) ===" +start_memlog "$LOG_DIR/memlog-$RUN_TAG.log" +stop_alpha + +first=1 +pprof_pid="" +for branch in "${BRANCHES[@]}"; do + log "" + log "─── branch: $branch ───" + + # Chain with &&: errexit is ignored inside a subshell used as an if-condition, + # so a failed checkout must short-circuit before make install runs. + if ! ( cd "$DGRAPH_REPO" && + git checkout "$branch" >/dev/null 2>&1 && + make install ) 2>&1 | tee "$LOG_DIR/build-$branch-$RUN_TAG.log"; then + warn "[$branch] build failed -- skipping (see $LOG_DIR/build-$branch-$RUN_TAG.log)" + continue + fi + + # Verify the installed binary is actually this branch's commit. + # Catches a stale PATH or a make install that silently used a cached binary. + want_sha=$(cd "$DGRAPH_REPO" && git rev-parse --short=9 "$branch") + bin_sha=$(dgraph version 2>/dev/null \ + | awk -F: '/Commit SHA-1/{gsub(/[[:space:]]/,"",$2); print $2}') + [[ -n "$bin_sha" ]] \ + || die "[$branch] cannot parse 'Commit SHA-1' from dgraph version (PATH/install problem)" + if [[ "$bin_sha" != "$want_sha"* && "$want_sha" != "$bin_sha"* ]]; then + die "[$branch] binary mismatch: branch=$want_sha binary=$bin_sha (stale PATH?)" + fi + label="${branch}@${bin_sha}" + log "[$branch] binary verified: $label" + + alpha_log="$LOG_DIR/alpha-$branch-$RUN_TAG.log" + stop_alpha + reset_data "$bp" + start_alpha "$alpha_log" + + if !
wait_alpha; then + warn "[$branch] alpha unhealthy after ${ALPHA_HEALTH_TIMEOUT_SEC}s -- skipping" + tail_log "[$branch] alpha" "$alpha_log" + stop_alpha + continue + fi + + # /health goes green before bulk tablets are queryable (alpha must load + # postings + register tablets with zero). Poll until data actually serves. + log "[$branch] waiting for bulk data to be served..." + data_ok=0 + for _ in $(seq 1 90); do + n=$(curl -s -m 5 -H 'Content-Type: application/dql' "$ALPHA_HTTP_URL/query" \ + -d '{ q(func: has(graphalytics_id)) { count(uid) } }' 2>/dev/null \ + | jq -r '.data.q[0].count // 0' 2>/dev/null) + if [[ "${n:-0}" =~ ^[0-9]+$ ]] && (( n > 0 )); then + data_ok=1; log "[$branch] data serving: $n nodes"; break + fi + sleep 2 + done + if (( data_ok == 0 )); then + warn "[$branch] bulk data not served within 180s -- skipping" + stop_alpha + continue + fi + + # UIDs are stable across branches (same dataset, same bulk load). Fetch + # the uid map only on the first branch; all others reuse the cache. + refresh="" + if (( first == 1 )); then refresh="-refresh-uidmap"; first=0; fi + + # CPU profile + goroutine dump taken ~40s into the run: on a binary that + # exercises eviction, pprof -top shows removeMax/pq.Pop/expandOut. + if [[ "${CAPTURE_PPROF:-1}" == "1" ]]; then + ( sleep 40 + curl -s "${ALPHA_HTTP_URL}/debug/pprof/profile?seconds=30" \ + -o "$KS_DIR/pprof-cpu-$branch-$RUN_TAG.prof" 2>/dev/null + curl -s "${ALPHA_HTTP_URL}/debug/pprof/goroutine?debug=2" \ + -o "$KS_DIR/pprof-goroutine-$branch-$RUN_TAG.txt" 2>/dev/null + ) & + pprof_pid=$! + fi + + out="$KS_DIR/${branch}-${DATASET_NAME}-${RUN_TAG}.json" + log "[$branch] bench -> $out" + if ! 
( cd "$BENCH_DIR" && go run ./cmd/bench \ + -mode kshortest \ + -dataset "$ds_dir" \ + -alpha "$ALPHA_GRPC" \ + -numpaths "$NUMPATHS" \ + -targets "$TARGETS" \ + -frontiers "$FRONTIERS" \ + -band-lo "$BANDLO" -band-hi "$BANDHI" \ + -tol "$TOL" -timeout "$TIMEOUT" -seed "$SEED" \ + -label "$label" \ + $refresh \ + -out "$out" \ + ) 2>&1 | tee "$LOG_DIR/bench-$branch-$RUN_TAG.log"; then + warn "[$branch] bench failed -- see $LOG_DIR/bench-$branch-$RUN_TAG.log" + apid=$(cat /tmp/dgraph-alpha.pid 2>/dev/null || true) + if [[ -z "$apid" ]] || ! kill -0 "$apid" 2>/dev/null; then + warn "[$branch] alpha is no longer running (OOM-killed or crashed mid-bench)" + tail_log "[$branch] alpha" "$alpha_log" + fi + fi + + [[ -n "$pprof_pid" ]] && { wait "$pprof_pid" 2>/dev/null || true; pprof_pid=""; } + stop_alpha +done + +# ── results ─────────────────────────────────────────────────────────────────── +log "" +log "=================================================================" +log " RESULTS: correct-of-returned% (r=returned t=timed-out)" +log "=================================================================" +printf '%-12s' "frontier" +for b in "${BRANCHES[@]}"; do printf ' %-22s' "$b"; done +echo +printf '%-12s' "------------" +for b in "${BRANCHES[@]}"; do printf ' %-22s' "----------------------"; done +echo +for fr in ${FRONTIERS//,/ }; do + flabel=$fr; [[ "$fr" == "0" ]] && flabel="unlimited" + printf '%-12s' "$flabel" + for b in "${BRANCHES[@]}"; do + f="$KS_DIR/${b}-${DATASET_NAME}-${RUN_TAG}.json" + if [[ -f "$f" ]]; then + cell=$(jq -r --argjson fr "$fr" \ + '.frontiers[]? 
| select(.max_frontier==$fr) + | "\(.correct_of_returned_pct|floor)%(r\(.returned) t\(.timeouts))"' \ + "$f" 2>/dev/null) + printf ' %-22s' "${cell:-NA}" + else + printf ' %-22s' "norun" + fi + done + echo +done + +log "" +log "artifacts: $KS_DIR/*-${DATASET_NAME}-${RUN_TAG}.json" +log "pprof: $KS_DIR/pprof-cpu--${RUN_TAG}.prof" +log " go tool pprof -top | grep -iE 'removeMax|pq.Pop|expandOut'" +log "master log: $MASTER_LOG" diff --git a/k-shortest-path/scripts/calibrate.sh b/k-shortest-path/scripts/calibrate.sh new file mode 100755 index 0000000..f8bc401 --- /dev/null +++ b/k-shortest-path/scripts/calibrate.sh @@ -0,0 +1,299 @@ +#!/usr/bin/env bash +# calibrate.sh -- empirical maxfrontiersize calibration, per dataset. +# +# WHY: a `maxfrontiersize` value that's too high never triggers eviction +# (bug-fix code path never exercised) and a value that's too low forces +# even a correct k-shortest implementation to fail (the cap is so tight +# the optimum can't be kept alive). The right value sits between the two, +# and it differs per dataset because frontier sizes scale with graph +# degree and diameter. KGS and datagen-7_5-fb cannot share a cap. +# +# WHAT: against a single known-correct branch (default pr-9678 -- the +# Family A PR with the most tests per README), this script sweeps cap +# values per dataset and reports the smallest cap where correctness +# still holds on numpaths=2. That's the recommended MAXFRONTIER_: +# tight enough to trigger eviction on as many queries as possible, +# loose enough that a real fix can still produce the right answer. +# +# USAGE: run AFTER probe.sh has identified at least one PR that passes +# numpaths=2. Set CALIBRATION_BRANCH to that PR. Re-run after any fresh +# bulk-load. +# +# OUTPUT: prints a per-dataset table and a copy-pasteable +# .env-style block of MAXFRONTIER_= recommendations. 
+# +# Override anything via env: +# BENCH_DIR DGRAPH_REPO ALPHA_DIR ALPHA_DIR_PREFIX_ALLOW +# CALIBRATION_BRANCH (default pr-9678) +# DATASETS_OVERRIDE (default "kgs datagen-7_5-fb") +# CAPS_OVERRIDE (space-separated; 0 means "unset/no cap") +# default "50 100 200 500 1000 2000 5000 10000 20000 50000 100000 0" +# TARGETS (default 20 -- per-cap sample size) +# NUMPATHS (default 2 -- bug-triggering value) +# SEED (default 1) +# TIMEOUT (default 5m per query) + +set -euo pipefail + +# ============================================================================ +# CONFIG +# ============================================================================ +BENCH_DIR="${BENCH_DIR:-/Users/shiva/workspace/shortest-path-bench}" +DGRAPH_REPO="${DGRAPH_REPO:-/Users/shiva/workspace/dgraph-scratch/dgraph}" +ALPHA_DIR="${ALPHA_DIR:-/Users/shiva/workspace/db}" +ALPHA_DIR_PREFIX_ALLOW="${ALPHA_DIR_PREFIX_ALLOW:-/Users/shiva/workspace/}" + +CALIBRATION_BRANCH="${CALIBRATION_BRANCH:-pr-9678}" + +DATASETS_STR="${DATASETS_OVERRIDE:-kgs datagen-7_5-fb}" +read -ra DATASETS <<< "$DATASETS_STR" + +# 0 sentinel = "unset" (no cap). Sweep is geometric from low to high. +CAPS_STR="${CAPS_OVERRIDE:-50 100 200 500 1000 2000 5000 10000 20000 50000 100000 0}" +read -ra CAPS <<< "$CAPS_STR" + +TARGETS="${TARGETS:-20}" +NUMPATHS="${NUMPATHS:-2}" +SEED="${SEED:-1}" +TIMEOUT="${TIMEOUT:-5m}" + +ALPHA_HTTP_URL="${ALPHA_HTTP_URL:-http://localhost:8080}" +ALPHA_HEALTH_URL="$ALPHA_HTTP_URL/health" +ALPHA_GRPC="${ALPHA_GRPC:-localhost:9080}" +ZERO_STATE_URL="${ZERO_STATE_URL:-http://localhost:6080/state}" +ALPHA_HEALTH_TIMEOUT_SEC="${ALPHA_HEALTH_TIMEOUT_SEC:-300}" + +RESULTS_DIR="${RESULTS_DIR:-$BENCH_DIR/results/calibrate}" +LOG_DIR="$RESULTS_DIR/logs" +mkdir -p "$RESULTS_DIR" "$LOG_DIR" + +# Shared Alpha lifecycle + safety helpers (stop_alpha, wait_alpha, reset_data, +# start_alpha, guard_rm_target, bulk_p_for, log/die/warn, etc.) 
+source "$(dirname "${BASH_SOURCE[0]}")/lib/alpha.sh" +trap alpha_cleanup_on_exit EXIT + +# ============================================================================ +# calibrate-specific helpers +# ============================================================================ +env_name_for() { + local ds="$1" + printf 'MAXFRONTIER_%s' "$(printf '%s' "$ds" | tr '[:lower:]-.' '[:upper:]__')" +} + +# ============================================================================ +# Pre-flight +# ============================================================================ +log "=================================================================" +log " calibrate.sh -- pick MAXFRONTIER per dataset" +log "=================================================================" +log "config:" +log " CALIBRATION_BRANCH = $CALIBRATION_BRANCH" +log " DATASETS = ${DATASETS[*]}" +log " CAPS sweep = ${CAPS[*]}" +log " TARGETS / cell = $TARGETS" +log " NUMPATHS = $NUMPATHS" +log " SEED = $SEED" +log " per-query TIMEOUT = $TIMEOUT" + +for bin in dgraph go git make curl jq awk; do + command -v "$bin" >/dev/null || die "$bin not on PATH" +done + +[[ -d "$BENCH_DIR" ]] || die "BENCH_DIR not found: $BENCH_DIR" +[[ -d "$DGRAPH_REPO" ]] || die "DGRAPH_REPO not found: $DGRAPH_REPO" +[[ -d "$ALPHA_DIR" ]] || die "ALPHA_DIR not found: $ALPHA_DIR" +[[ "$ALPHA_DIR" == "$ALPHA_DIR_PREFIX_ALLOW"* ]] \ + || die "ALPHA_DIR ($ALPHA_DIR) not under allowed prefix $ALPHA_DIR_PREFIX_ALLOW" +[[ "$ALPHA_DIR" != "/" ]] || die "ALPHA_DIR is /" +[[ "$ALPHA_DIR" != "$HOME" ]] || die "ALPHA_DIR equals HOME" + +curl -s -m 5 "$ZERO_STATE_URL" >/dev/null 2>&1 \ + || die "zero not reachable at $ZERO_STATE_URL -- start it before calibrating" + +( cd "$DGRAPH_REPO" + if ! git diff --quiet || ! 
git diff --cached --quiet; then + die "dgraph working tree dirty -- commit/stash before calibrating" + fi + git rev-parse --verify --quiet "$CALIBRATION_BRANCH" >/dev/null \ + || die "calibration branch '$CALIBRATION_BRANCH' not found in $DGRAPH_REPO" +) + +for ds in "${DATASETS[@]}"; do + ds_dir="$BENCH_DIR/datasets/$ds" + [[ -f "$ds_dir/$ds.properties" ]] || die "$ds_dir/$ds.properties missing" + bp=$(bulk_p_for "$ds") + [[ -d "$bp" ]] || die "bulk-loaded p/ missing for $ds at $bp" +done + +( cd "$BENCH_DIR" && go build ./... ) || die "bench failed to compile" + +# ============================================================================ +# Build calibration branch +# ============================================================================ +log "" +log "[build] checkout + make install for $CALIBRATION_BRANCH" +( cd "$DGRAPH_REPO" \ + && git checkout "$CALIBRATION_BRANCH" >/dev/null 2>&1 \ + && make install ) > "$LOG_DIR/build.log" 2>&1 \ + || die "build failed -- see $LOG_DIR/build.log" + +bin_branch=$(dgraph version 2>/dev/null | awk '/^Branch/ {print $3; exit}' || true) +if [[ -n "$bin_branch" && "$bin_branch" != "$CALIBRATION_BRANCH" ]]; then + log "[build] WARN: dgraph binary reports Branch='$bin_branch' (expected '$CALIBRATION_BRANCH')" +fi +log "[build] done" + +stop_alpha + +# ============================================================================ +# Calibration loop +# ============================================================================ +# RESULT[ds:cap] -> "passed failed errors p50_ms wall_s" +declare -A RESULT + +t0=$(date +%s) + +for ds in "${DATASETS[@]}"; do + bp=$(bulk_p_for "$ds") + log "" + log "=================================================================" + log " dataset: $ds (bulk p/ at $bp)" + log "=================================================================" + + # Bring Alpha up ONCE per dataset (data doesn't change as cap varies). 
+ stop_alpha + log "[$ds] reset_data + start alpha" + reset_data "$bp" + start_alpha "$LOG_DIR/alpha-$ds.log" + if ! wait_alpha; then + log "[$ds] alpha did not come up healthy; skipping dataset" + for cap in "${CAPS[@]}"; do RESULT["$ds:$cap"]="ALPHA_FAIL"; done + continue + fi + log "[$ds] alpha healthy" + + for cap in "${CAPS[@]}"; do + out="$RESULTS_DIR/cal-$ds-cap${cap}.json" + bench_log="$LOG_DIR/bench-$ds-cap${cap}.log" + log "[$ds cap=$cap] running bench (targets=$TARGETS numpaths=$NUMPATHS)" + + if ! ( cd "$BENCH_DIR" && go run ./cmd/bench \ + -mode correctness \ + -dataset "$BENCH_DIR/datasets/$ds" \ + -alpha "$ALPHA_GRPC" \ + -targets "$TARGETS" \ + -numpaths "$NUMPATHS" \ + -maxfrontier "$cap" \ + -seed "$SEED" \ + -timeout "$TIMEOUT" \ + -out "$out" ) > "$bench_log" 2>&1 + then + log "[$ds cap=$cap] bench invocation failed -- see $bench_log" + RESULT["$ds:$cap"]="BENCH_FAIL" + continue + fi + + passed=$(jq -r '.passed // 0' "$out" 2>/dev/null || echo 0) + failed=$(jq -r '.failed // 0' "$out" 2>/dev/null || echo 0) + errors=$(jq -r '.query_errors // 0' "$out" 2>/dev/null || echo 0) + p50_ns=$(jq -r '.latency.p50_ns // 0' "$out" 2>/dev/null || echo 0) + wall_ns=$(jq -r '.latency.wall_ns // 0' "$out" 2>/dev/null || echo 0) + p50_ms=$(( p50_ns / 1000000 )) + wall_s=$(awk -v n="$wall_ns" 'BEGIN { printf "%.1f", n/1e9 }') + RESULT["$ds:$cap"]="$passed $failed $errors $p50_ms $wall_s" + log "[$ds cap=$cap] passed=$passed/$TARGETS failed=$failed errors=$errors p50=${p50_ms}ms wall=${wall_s}s" + done + + stop_alpha +done + +t1=$(date +%s) +elapsed=$(( t1 - t0 )) + +# ============================================================================ +# Analysis: per dataset, smallest cap where passed=TARGETS, failed=0, errors=0 +# ============================================================================ +log "" +log "=================================================================" +log " calibration analysis (elapsed: ${elapsed}s)" +log 
"=================================================================" +log " sweep used branch '$CALIBRATION_BRANCH'; TARGETS=$TARGETS NUMPATHS=$NUMPATHS" + +declare -A RECOMMENDATION + +for ds in "${DATASETS[@]}"; do + echo "" + printf "Dataset: %s\n" "$ds" + printf " %-8s %-13s %-7s %-7s %-9s %-9s %s\n" \ + "cap" "passed/total" "failed" "errors" "p50_ms" "wall_s" "verdict" + printf " %-8s %-13s %-7s %-7s %-9s %-9s %s\n" \ + "--------" "-------------" "------" "------" "------" "------" "-------" + + # Sort caps numerically ascending. Treat "0" (unset) as largest sentinel. + sorted_caps=$(printf '%s\n' "${CAPS[@]}" | awk '{ if ($1==0) print 999999999, "0"; else print $1, $1 }' \ + | sort -n | awk '{print $2}') + + first_pass="" + while IFS= read -r cap; do + v="${RESULT[$ds:$cap]:-MISSING}" + case "$v" in + ALPHA_FAIL|BENCH_FAIL|MISSING) + cap_disp="$cap"; [[ "$cap" == "0" ]] && cap_disp="(none)" + printf " %-8s %-13s %-7s %-7s %-9s %-9s %s\n" \ + "$cap_disp" "-" "-" "-" "-" "-" "$v" + continue + ;; + esac + read -r p f e p50 wall <<< "$v" + cap_disp="$cap"; [[ "$cap" == "0" ]] && cap_disp="(none)" + verdict="✗ fails" + if (( p == TARGETS )) && (( f == 0 )) && (( e == 0 )); then + if [[ -z "$first_pass" ]]; then + verdict="✓ PASSING (smallest)" + first_pass="$cap" + else + verdict="✓ passing" + fi + fi + printf " %-8s %-13s %-7s %-7s %-9s %-9s %s\n" \ + "$cap_disp" "$p/$TARGETS" "$f" "$e" "$p50" "$wall" "$verdict" + done <<< "$sorted_caps" + + if [[ -z "$first_pass" ]]; then + echo " -> no cap passed correctness on this dataset" + echo " (calibration branch '$CALIBRATION_BRANCH' may not actually fix the bug for $ds," + echo " OR every cap in the sweep is too tight for this dataset's frontier scale)" + RECOMMENDATION["$ds"]="" + elif [[ "$first_pass" == "0" ]]; then + echo " -> recommended: leave $(env_name_for "$ds") UNSET (no cap)" + echo " (even the smallest cap tested broke correctness; the cap cannot be exercised" + echo " on this dataset without losing the 
optimum)" + RECOMMENDATION["$ds"]="UNSET" + else + echo " -> recommended: $(env_name_for "$ds")=$first_pass" + RECOMMENDATION["$ds"]="$first_pass" + fi +done + +# ============================================================================ +# Final copy-pasteable .env block +# ============================================================================ +echo "" +log "=================================================================" +log " copy-paste into your .env (or export before run-pr-comparison.sh)" +log "=================================================================" +echo "" +for ds in "${DATASETS[@]}"; do + val="${RECOMMENDATION[$ds]:-}" + name=$(env_name_for "$ds") + if [[ -z "$val" ]]; then + echo "# $name -- no recommendation (calibration inconclusive for $ds)" + elif [[ "$val" == "UNSET" ]]; then + echo "# $name -- intentionally unset (cap cannot be safely exercised on $ds)" + else + echo "export $name=$val" + fi +done +echo "" +log "raw json per (dataset, cap) in $RESULTS_DIR/" +log "alpha logs in $LOG_DIR/" diff --git a/k-shortest-path/scripts/download-ldbc.sh b/k-shortest-path/scripts/download-ldbc.sh new file mode 100755 index 0000000..22d35c0 --- /dev/null +++ b/k-shortest-path/scripts/download-ldbc.sh @@ -0,0 +1,105 @@ +#!/usr/bin/env bash +# Downloads and extracts an LDBC Graphalytics dataset (data archive + optional +# validation archive) into ./datasets//. +# +# Usage: +# ./scripts/download-ldbc.sh [validation-url] +# +# Examples: +# # Just the data archive (use this if validation is bundled inside, like the +# # test-sssp-* graphs): +# ./scripts/download-ldbc.sh \ +# https://example.org/path/to/kgs.tar.zst +# +# # Data + separate validation archive: +# ./scripts/download-ldbc.sh \ +# https://example.org/path/to/kgs.tar.zst \ +# https://example.org/path/to/kgs-validation.tar.zst +# +# Dataset names are derived from the data URL's basename. Whichever URL you +# point at is whichever URL you get — there is no hard-coded mirror. 
+# +# Browse the LDBC dataset catalog at: +# https://ldbcouncil.org/benchmarks/graphalytics/ + +set -euo pipefail + +if [[ $# -lt 1 ]]; then + cat >&2 <<EOF +Usage: $0 <data-url> [validation-url] + + data-url URL of the dataset tar.zst (required) + validation-url URL of the validation tar.zst (optional — many datasets + bundle SSSP/BFS/etc. reference outputs inside the data + archive, in which case omit this argument) + +Examples: + $0 https://example.org/path/to/kgs.tar.zst + $0 https://example.org/path/to/kgs.tar.zst https://example.org/path/to/kgs-validation.tar.zst +EOF + exit 2 +fi + +DATA_URL="$1" +VAL_URL="${2:-}" + +# Derive dataset name from data URL basename (strip .tar.zst or .tar.gz suffix). +DATA_BASENAME="$(basename "$DATA_URL")" +DATASET="${DATA_BASENAME%.tar.zst}" +DATASET="${DATASET%.tar.gz}" + +OUT_DIR="$(cd "$(dirname "$0")/.." && pwd)/datasets/$DATASET" +mkdir -p "$OUT_DIR" +cd "$OUT_DIR" + +download_one() { + local url="$1" + local fname + fname="$(basename "$url")" + if [[ -f "$fname" ]]; then + echo "[skip] $fname already present" + return 0 + fi + echo "[download] $url" + if ! curl -L --fail --connect-timeout 30 -o "$fname" "$url"; then + echo "[error] failed to download $url" >&2 + rm -f "$fname" + return 1 + fi +} + +extract_one() { + local fname="$1" + if [[ ! -f "$fname" ]]; then + return 0 + fi + echo "[extract] $fname" + tar --zstd -xf "$fname" +} + +# Download both archives (validation is optional). +download_one "$DATA_URL" +if [[ -n "$VAL_URL" ]]; then + download_one "$VAL_URL" +fi + +# Extract whatever landed on disk. +extract_one "$DATA_BASENAME" +if [[ -n "$VAL_URL" ]]; then + extract_one "$(basename "$VAL_URL")" +fi + +# Normalize: bench expects validation/-SSSP. If the SSSP reference was +# bundled at the top level of the data archive, move it into validation/. 
+SSSP_TOP="$DATASET-SSSP" +if [[ -f "$SSSP_TOP" ]]; then + mkdir -p validation + mv "$SSSP_TOP" validation/ + echo "[normalize] moved $SSSP_TOP into validation/" +fi + +echo "[done] dataset ready at $OUT_DIR" +ls -lh "$OUT_DIR" +echo +echo "Contents under validation/ (if any):" +ls -lh "$OUT_DIR/validation" 2>/dev/null || echo " (no validation/ subdirectory — SSSP reference may be elsewhere or absent)" diff --git a/k-shortest-path/scripts/lib/alpha.sh b/k-shortest-path/scripts/lib/alpha.sh new file mode 100644 index 0000000..2c5810e --- /dev/null +++ b/k-shortest-path/scripts/lib/alpha.sh @@ -0,0 +1,254 @@ +# scripts/lib/alpha.sh -- shared Dgraph Alpha lifecycle + safety helpers. +# +# SOURCE THIS, do not exec. Usage: +# +# set -euo pipefail +# BENCH_DIR=...; DGRAPH_REPO=...; ALPHA_DIR=...; ALPHA_DIR_PREFIX_ALLOW=... +# DATASETS=( kgs datagen-7_5-fb ) +# source "$(dirname "${BASH_SOURCE[0]}")/lib/alpha.sh" +# trap alpha_cleanup_on_exit EXIT +# +# Required env from the caller (set BEFORE calling any function below): +# BENCH_DIR -- for bulk_p_for() convention path +# ALPHA_DIR -- workspace for alpha (p/t/w/zw + logs) +# ALPHA_DIR_PREFIX_ALLOW -- safety whitelist for rm operations +# ALPHA_HEALTH_URL -- e.g. http://localhost:8080/health +# DATASETS -- bash array; guard_rm_target uses it to +# protect each dataset's bulk p/ from rm +# +# Optional env (defaults if unset): +# ALPHA_HEALTH_TIMEOUT_SEC -- 300 -- wait_alpha timeout +# ZERO_STATE_URL -- required for require_zero(no-arg form) +# BULK_P_ -- override per-dataset bulk p/ path +# (DS uppercased; -/. -> _; e.g. 
BULK_P_DATAGEN_7_5_FB) +# +# Functions provided: +# ts log die warn -- logging +# size_of -- human-readable du of one path +# bulk_p_for -- BULK_P_ env wins else convention path +# guard_rm_target -- die if path is unsafe to rm +# stop_alpha -- stop any running alpha (managed + stray) +# wait_alpha -- block until /health healthy, or timeout +# reset_data -- wipe ALPHA_DIR/{p,t,w,zw}, cp bulk_p -> p +# start_alpha -- launch alpha (background), record pid +# require_zero [url] -- die unless Zero is reachable +# to_seconds -- "5m"/"60s"/"90" -> integer seconds +# preflight_memory -- abort if RAM already low; warn if no swap +# start_memlog -- background free/RSS sampler (post-mortem data) +# stop_memlog -- kill the sampler (call from EXIT trap) +# tail_log