Performance optimizations: async refactor, structured errors, telemetry, and connection improvements #201

Open

jkurashcvx wants to merge 6 commits into josh from performance_optimizations

Conversation

@jkurashcvx

Changes (6 commits)

  1. Event-driven async refactor — Replace polling loops with event-driven architecture
  2. Refactor list_all_scaleset_vms — Use Resource Graph query instead of per-scaleset REST calls
  3. Structured error types, telemetry, logging — Add AzManagersError hierarchy, ManagerMetrics, structured log events
  4. Pre-validate connections — Validate worker connections before batching into addprocs_locked
  5. Run flush_wconfig_batch async — Unblock the event loop during batch registration
  6. Performance optimizations #2-6 — Additional perf improvements

All integration tests pass (10/10 hard; one soft failure in spot_eviction).

Josh added 6 commits May 5, 2026 17:46
- Phase 0: lock safety, errormonitor on orphaned tasks, protected counter
- Phase 1-2: Event types, central event loop, timer-based prune/clean (see the sketch below)
- Phase 3: Event-driven connection acceptance and batching
- Phase 4: Event-driven preempt monitoring
- Phase 5: Fix remaining polling loops
- Phase 6: Cleanup and hardening
- Fix nic_dic scoping bug, decouple spinner from event loop
- Add unit tests for event-driven async refactor
- Refactor: extract poll helpers in detached.jl
- rmproc: fire-and-forget VM deletion, unique basenames in test_detached
- feat: track VM deletions via DeletionStarted event
- test: add integration test for deletion tracking lifecycle
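
To make the event-driven shape concrete, here is a minimal sketch assuming hypothetical type and field names; PruneTick, CleanTick, and SocketAccepted come from the commit notes, while the intervals and handler bodies are placeholders rather than the actual AzManagers code:

```julia
# Hypothetical event types; the real AzManagers structs may differ.
abstract type ManagerEvent end
struct PruneTick      <: ManagerEvent end
struct CleanTick      <: ManagerEvent end
struct SocketAccepted <: ManagerEvent
    socket::IO
end

const EVENTS = Channel{ManagerEvent}(Inf)

# Timers replace the old polling loops: a tick is just an event in the queue.
prune_timer = Timer(_ -> put!(EVENTS, PruneTick()), 0; interval = 120)
clean_timer = Timer(_ -> put!(EVENTS, CleanTick()), 0; interval = 60)

# One task drains the channel, so handlers never race each other and the
# manager state needs no per-handler locking.
event_loop = errormonitor(@async for event in EVENTS
    if event isa PruneTick
        # prune_cluster() would run here
    elseif event isa CleanTick
        # delete_empty_scalesets() would run here
    elseif event isa SocketAccepted
        # spawn async validation of event.socket (see commit 4)
    end
end)
```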
- Replace per-scaleset list_scaleset_vms with single Resource Graph query (see the query sketch below)
- Handle Uniform VMSS via concurrent ARM calls in list_all_scaleset_vms
- Use ComputeResources table for Uniform VMSS VM listing
- Reduce prune timer default from 600s to 120s
- Add integration tests for prune and prune_loop
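
For reference, a hedged sketch of what the single call can look like from Julia; the endpoint and api-version are the public Resource Graph REST API, but the KQL, projected fields, and use of HTTP.jl/JSON here are illustrative rather than the code in this PR:

```julia
using HTTP, JSON

function list_all_scaleset_vms(subscription_id::AbstractString, token::AbstractString)
    # One KQL query over the ComputeResources table replaces N per-scaleset
    # ARM GETs; Uniform VMSS instances also surface in this table.
    query = """
        ComputeResources
        | where type =~ 'microsoft.compute/virtualmachinescalesets/virtualmachines'
        | project name, id
        """
    r = HTTP.post(
        "https://management.azure.com/providers/Microsoft.ResourceGraph/resources?api-version=2021-03-01",
        ["Authorization" => "Bearer $token", "Content-Type" => "application/json"],
        JSON.json(Dict("subscriptions" => [subscription_id], "query" => query)))
    JSON.parse(String(r.body))["data"]
end
```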
- New src/errors.jl: structured exception types (AzManagerError, WorkerJoinError, etc.)
- New src/telemetry.jl: ManagerMetrics struct with atomic counters, record_* functions (sketched below)
- Integrate telemetry into event loop, connections, scaleset operations
- Replace ad-hoc @warn/@error with structured error types
- Add unit tests for error types and telemetry
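
A sketch of the shapes involved; AzManagerError, WorkerJoinError, ManagerMetrics, and record_connection_validation_failed! are named in the commits, while every field below is an illustrative assumption:

```julia
abstract type AzManagerError <: Exception end

struct WorkerJoinError <: AzManagerError
    vm_name::String   # illustrative fields
    reason::String
end

Base.showerror(io::IO, e::WorkerJoinError) =
    print(io, "WorkerJoinError($(e.vm_name)): $(e.reason)")

# Atomic counters let any task call record_* without taking a lock.
Base.@kwdef struct ManagerMetrics
    workers_joined::Threads.Atomic{Int} = Threads.Atomic{Int}(0)
    connection_validation_failed::Threads.Atomic{Int} = Threads.Atomic{Int}(0)
end

record_connection_validation_failed!(m::ManagerMetrics) =
    Threads.atomic_add!(m.connection_validation_failed, 1)
```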
Move socket IO (cookie read + connection string parse) out of addprocs_locked
into an async validation step that runs before batching. Only validated
WorkerConfig objects enter the batch.

Changes:
- New ConnectionValidated event type holding a WorkerConfig
- SocketAccepted handler now spawns @async validate_connection with timeout (see the sketch after the Benefits list)
- ConnectionValidated handler does the batching (timer + batch_max logic)
- flush_wconfig_batch passes pre-built WorkerConfigs to addprocs
- launch() becomes trivial push of pre-built WorkerConfigs
- Remove launch_on_machine (logic moved to validate_connection)
- Add record_connection_validation_failed! telemetry counter
- Rename socket_batch -> wconfig_batch in AzManager struct

Benefits:
- Dead/stale sockets filtered before entering batch
- No IO inside worker_lock (less time holding the lock)
- Flush timer starts from first validated connection, not raw socket
- VMs that TCP-connect but crash before sending cookie are silently discarded
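
A hedged sketch of the validation step, reusing the hypothetical names from the sketches above (manager.events, manager.metrics, ManagerEvent); the cookie length is Distributed's internal constant, and the timeout and parsing details are illustrative:

```julia
using Distributed

struct ConnectionValidated <: ManagerEvent   # ManagerEvent from the earlier sketch
    wconfig::WorkerConfig
end

function handle_socket_accepted(manager, sock)
    errormonitor(@async begin
        validated = Ref(false)
        # Illustrative 10s timeout: closing the socket aborts the blocking reads.
        t = Timer(_ -> (validated[] || close(sock)), 10)
        try
            cookie  = String(read(sock, Distributed.HDR_COOKIE_LEN))
            connstr = readline(sock)
            # ... verify cookie, parse host/port out of connstr ...
            wconfig = WorkerConfig()
            wconfig.io = sock
            validated[] = true
            put!(manager.events, ConnectionValidated(wconfig))
        catch
            # VM TCP-connected but died before sending a valid cookie:
            # count it and drop the socket before it ever reaches the batch.
            record_connection_validation_failed!(manager.metrics)
            close(sock)
        finally
            close(t)
        end
    end)
end
```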
flush_wconfig_batch calls addprocs_locked, which can block for the full
worker_timeout duration. Running it in @async lets the event loop keep
processing prune ticks, clean ticks, and new ConnectionValidated events
while addprocs runs.

Safe because flush_wconfig_batch snapshots and clears wconfig_batch
synchronously before yielding.
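
A sketch of that ordering with illustrative field and keyword names; the essential point is that the snapshot and clear happen before the first yield point:

```julia
function flush_wconfig_batch(manager)
    batch = manager.wconfig_batch            # snapshot...
    manager.wconfig_batch = WorkerConfig[]   # ...and clear, with no yield between
    isempty(batch) && return
    # addprocs may block for up to worker_timeout; running it in a monitored
    # task keeps the event loop free (the kwarg name here is illustrative).
    errormonitor(@async addprocs(manager; sockets = batch))
end
```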
- Parallelize check_pending_deletions HTTP GETs with @sync/@async
- Separate PruneTick/CleanTick responsibilities (remove redundant list_all_scaleset_vms from CleanTick)
- Reduce tick intervals from 120s/60s to 30s/30s
- Replace serial scaleset_capacity ARM GETs in scaleset_sync with single Resource Graph query
- Remove redundant double-check in delete_empty_scalesets (scaleset_sync keeps capacity fresh)
- Rewrite prune_cluster as single-pass with O(1) lookups; fix pending_down iteration bug
- Fix DateTime parsing in prune_scalesets for Azure's variable-precision timestamps (see the sketch below)
- Drop legacy JULIA_AZMANAGERS_PENDING_CADENCE env var fallback
- Add local-only 100-node scale test (test/integration/test_scale.jl)
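
On the timestamp fix: Azure returns ISO 8601 times with anywhere from zero to seven fractional-second digits, while Julia's DateTime carries only milliseconds, so a fixed dateformat fails on some records. A sketch of one way to normalize before parsing (the helper name is hypothetical):

```julia
using Dates

function parse_azure_timestamp(s::AbstractString)
    m = match(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z?$", s)
    m === nothing && throw(ArgumentError("unrecognized timestamp: $s"))
    dt = DateTime(m.captures[1], dateformat"yyyy-mm-ddTHH:MM:SS")
    frac = m.captures[2]
    frac === nothing && return dt
    # Keep at most three fractional digits (milliseconds), padding short runs.
    dt + Millisecond(parse(Int, rpad(first(frac, 3), 3, '0')))
end

parse_azure_timestamp("2026-05-05T17:46:03.1234567Z")  # seven digits: ok
parse_azure_timestamp("2026-05-05T17:46:03Z")          # no fraction: ok
```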