Performance optimizations: async refactor, structured errors, telemetry, and connection improvements#201
Open
jkurashcvx wants to merge 6 commits intojoshfrom
Open
Performance optimizations: async refactor, structured errors, telemetry, and connection improvements#201jkurashcvx wants to merge 6 commits intojoshfrom
jkurashcvx wants to merge 6 commits intojoshfrom
Conversation
added 6 commits
May 5, 2026 17:46
- Phase 0: lock safety, errormonitor orphaned tasks, protect counter - Phase 1-2: Event types, central event loop, timer-based prune/clean - Phase 3: Event-driven connection acceptance and batching - Phase 4: Event-driven preempt monitoring - Phase 5: Fix remaining polling loops - Phase 6: Cleanup and hardening - Fix nic_dic scoping bug, decouple spinner from event loop - Add unit tests for event-driven async refactor - Refactor: extract poll helpers in detached.jl - rmproc: fire-and-forget VM deletion, unique basenames in test_detached - feat: track VM deletions via DeletionStarted event - test: add integration test for deletion tracking lifecycle
- Replace per-scaleset list_scaleset_vms with single Resource Graph query - Handle Uniform VMSS via concurrent ARM calls in list_all_scaleset_vms - Use ComputeResources table for Uniform VMSS VM listing - Reduce prune timer default from 600s to 120s - Add integration tests for prune and prune_loop
- New src/errors.jl: structured exception types (AzManagerError, WorkerJoinError, etc.) - New src/telemetry.jl: ManagerMetrics struct with atomic counters, record_* functions - Integrate telemetry into event loop, connections, scaleset operations - Replace ad-hoc @warn/@error with structured error types - Add unit tests for error types and telemetry
Move socket IO (cookie read + connection string parse) out of addprocs_locked into an async validation step that runs before batching. Only validated WorkerConfig objects enter the batch. Changes: - New ConnectionValidated event type holding a WorkerConfig - SocketAccepted handler now spawns @async validate_connection with timeout - ConnectionValidated handler does the batching (timer + batch_max logic) - flush_wconfig_batch passes pre-built WorkerConfigs to addprocs - launch() becomes trivial push of pre-built WorkerConfigs - Remove launch_on_machine (logic moved to validate_connection) - Add record_connection_validation_failed! telemetry counter - Rename socket_batch -> wconfig_batch in AzManager struct Benefits: - Dead/stale sockets filtered before entering batch - No IO inside worker_lock (less time holding the lock) - Flush timer starts from first validated connection, not raw socket - VMs that TCP-connect but crash before sending cookie are silently discarded
flush_wconfig_batch calls addprocs_locked which can block for the full worker_timeout duration. Running it in @async allows the event loop to continue processing prune ticks, clean ticks, and new ConnectionValidated events while addprocs runs. Safe because flush_wconfig_batch snapshots and clears wconfig_batch synchronously before yielding.
- Parallelize check_pending_deletions HTTP GETs with @sync/@async - Separate PruneTick/CleanTick responsibilities (remove redundant list_all_scaleset_vms from CleanTick) - Reduce tick intervals from 120s/60s to 30s/30s - Replace serial scaleset_capacity ARM GETs in scaleset_sync with single Resource Graph query - Remove redundant double-check in delete_empty_scalesets (scaleset_sync keeps capacity fresh) - Rewrite prune_cluster as single-pass with O(1) lookups; fix pending_down iteration bug - Fix DateTime parsing in prune_scalesets for Azure's variable-precision timestamps - Drop legacy JULIA_AZMANAGERS_PENDING_CADENCE env var fallback - Add local-only 100-node scale test (test/integration/test_scale.jl)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes (6 commits)
AzManagersErrorhierarchy,ManagerMetrics, structured log eventsaddprocs_lockedAll integration tests passing (10/10 hard, 1 soft-fail spot_eviction).