Performance optimizations: async refactor, structured errors, telemetry, and connection improvements #201

Open

jkurashcvx wants to merge 6 commits into josh from performance_optimizations

Conversation

@jkurashcvx

Changes (6 commits)

  1. Event-driven async refactor — Replace polling loops with event-driven architecture
  2. Refactor list_all_scaleset_vms — Use Resource Graph query instead of per-scaleset REST calls
  3. Structured error types, telemetry, logging — Add AzManagersError hierarchy, ManagerMetrics, structured log events
  4. Pre-validate connections — Validate worker connections before batching into addprocs_locked
  5. Run flush_wconfig_batch async — Unblock the event loop during batch registration
  6. Performance optimizations #2-6 — Additional perf improvements

All integration tests pass (10/10 hard; one soft failure in spot_eviction).

Josh added 6 commits May 5, 2026 17:46
- Phase 0: lock safety, errormonitor on orphaned tasks, protected counter
- Phase 1-2: Event types, central event loop, timer-based prune/clean (see the sketch below)
- Phase 3: Event-driven connection acceptance and batching
- Phase 4: Event-driven preempt monitoring
- Phase 5: Fix remaining polling loops
- Phase 6: Cleanup and hardening
- Fix nic_dic scoping bug, decouple spinner from event loop
- Add unit tests for event-driven async refactor
- Refactor: extract poll helpers in detached.jl
- rmproc: fire-and-forget VM deletion, unique basenames in test_detached
- feat: track VM deletions via DeletionStarted event
- test: add integration test for deletion tracking lifecycle
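
To make the event-driven shape concrete, here is a minimal sketch assuming hypothetical type and field names; PruneTick, CleanTick, and SocketAccepted come from the commit notes, while the intervals and handler bodies are placeholders rather than the actual AzManagers code:

```julia
# Hypothetical event types; the real AzManagers structs may differ.
abstract type ManagerEvent end
struct PruneTick      <: ManagerEvent end
struct CleanTick      <: ManagerEvent end
struct SocketAccepted <: ManagerEvent
    socket::IO
end

const EVENTS = Channel{ManagerEvent}(Inf)

# Timers replace the old polling loops: a tick is just an event in the queue.
prune_timer = Timer(_ -> put!(EVENTS, PruneTick()), 0; interval = 120)
clean_timer = Timer(_ -> put!(EVENTS, CleanTick()), 0; interval = 60)

# One task drains the channel, so handlers never race each other and the
# manager state needs no per-handler locking.
event_loop = errormonitor(@async for event in EVENTS
    if event isa PruneTick
        # prune_cluster() would run here
    elseif event isa CleanTick
        # delete_empty_scalesets() would run here
    elseif event isa SocketAccepted
        # spawn async validation of event.socket (see commit 4)
    end
end)
```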
- Replace per-scaleset list_scaleset_vms with single Resource Graph query (see the query sketch below)
- Handle Uniform VMSS via concurrent ARM calls in list_all_scaleset_vms
- Use ComputeResources table for Uniform VMSS VM listing
- Reduce prune timer default from 600s to 120s
- Add integration tests for prune and prune_loop
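
For reference, a hedged sketch of what the single call can look like from Julia; the endpoint and api-version are the public Resource Graph REST API, but the KQL, projected fields, and use of HTTP.jl/JSON here are illustrative rather than the code in this PR:

```julia
using HTTP, JSON

function list_all_scaleset_vms(subscription_id::AbstractString, token::AbstractString)
    # One KQL query over the ComputeResources table replaces N per-scaleset
    # ARM GETs; Uniform VMSS instances also surface in this table.
    query = """
        ComputeResources
        | where type =~ 'microsoft.compute/virtualmachinescalesets/virtualmachines'
        | project name, id
        """
    r = HTTP.post(
        "https://management.azure.com/providers/Microsoft.ResourceGraph/resources?api-version=2021-03-01",
        ["Authorization" => "Bearer $token", "Content-Type" => "application/json"],
        JSON.json(Dict("subscriptions" => [subscription_id], "query" => query)))
    JSON.parse(String(r.body))["data"]
end
```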
- New src/errors.jl: structured exception types (AzManagerError, WorkerJoinError, etc.)
- New src/telemetry.jl: ManagerMetrics struct with atomic counters, record_* functions (sketched below)
- Integrate telemetry into event loop, connections, scaleset operations
- Replace ad-hoc @warn/@error with structured error types
- Add unit tests for error types and telemetry
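
A sketch of the shapes involved; AzManagerError, WorkerJoinError, ManagerMetrics, and record_connection_validation_failed! are named in the commits, while every field below is an illustrative assumption:

```julia
abstract type AzManagerError <: Exception end

struct WorkerJoinError <: AzManagerError
    vm_name::String   # illustrative fields
    reason::String
end

Base.showerror(io::IO, e::WorkerJoinError) =
    print(io, "WorkerJoinError($(e.vm_name)): $(e.reason)")

# Atomic counters let any task call record_* without taking a lock.
Base.@kwdef struct ManagerMetrics
    workers_joined::Threads.Atomic{Int} = Threads.Atomic{Int}(0)
    connection_validation_failed::Threads.Atomic{Int} = Threads.Atomic{Int}(0)
end

record_connection_validation_failed!(m::ManagerMetrics) =
    Threads.atomic_add!(m.connection_validation_failed, 1)
```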
Move socket IO (cookie read + connection string parse) out of addprocs_locked
into an async validation step that runs before batching. Only validated
WorkerConfig objects enter the batch.

Changes:
- New ConnectionValidated event type holding a WorkerConfig
- SocketAccepted handler now spawns @async validate_connection with timeout (see the sketch after the Benefits list)
- ConnectionValidated handler does the batching (timer + batch_max logic)
- flush_wconfig_batch passes pre-built WorkerConfigs to addprocs
- launch() becomes trivial push of pre-built WorkerConfigs
- Remove launch_on_machine (logic moved to validate_connection)
- Add record_connection_validation_failed! telemetry counter
- Rename socket_batch -> wconfig_batch in AzManager struct

Benefits:
- Dead/stale sockets filtered before entering batch
- No IO inside worker_lock (less time holding the lock)
- Flush timer starts from first validated connection, not raw socket
- VMs that TCP-connect but crash before sending cookie are silently discarded
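
A hedged sketch of the validation step, reusing the hypothetical names from the sketches above (manager.events, manager.metrics, ManagerEvent); the cookie length is Distributed's internal constant, and the timeout and parsing details are illustrative:

```julia
using Distributed

struct ConnectionValidated <: ManagerEvent   # ManagerEvent from the earlier sketch
    wconfig::WorkerConfig
end

function handle_socket_accepted(manager, sock)
    errormonitor(@async begin
        validated = Ref(false)
        # Illustrative 10s timeout: closing the socket aborts the blocking reads.
        t = Timer(_ -> (validated[] || close(sock)), 10)
        try
            cookie  = String(read(sock, Distributed.HDR_COOKIE_LEN))
            connstr = readline(sock)
            # ... verify cookie, parse host/port out of connstr ...
            wconfig = WorkerConfig()
            wconfig.io = sock
            validated[] = true
            put!(manager.events, ConnectionValidated(wconfig))
        catch
            # VM TCP-connected but died before sending a valid cookie:
            # count it and drop the socket before it ever reaches the batch.
            record_connection_validation_failed!(manager.metrics)
            close(sock)
        finally
            close(t)
        end
    end)
end
```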
flush_wconfig_batch calls addprocs_locked, which can block for the full
worker_timeout duration. Running it in @async lets the event loop keep
processing prune ticks, clean ticks, and new ConnectionValidated events
while addprocs runs.

Safe because flush_wconfig_batch snapshots and clears wconfig_batch
synchronously before yielding.
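
A sketch of that ordering with illustrative field and keyword names; the essential point is that the snapshot and clear happen before the first yield point:

```julia
function flush_wconfig_batch(manager)
    batch = manager.wconfig_batch            # snapshot...
    manager.wconfig_batch = WorkerConfig[]   # ...and clear, with no yield between
    isempty(batch) && return
    # addprocs may block for up to worker_timeout; running it in a monitored
    # task keeps the event loop free (the kwarg name here is illustrative).
    errormonitor(@async addprocs(manager; sockets = batch))
end
```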
- Parallelize check_pending_deletions HTTP GETs with @sync/@async
- Separate PruneTick/CleanTick responsibilities (remove redundant list_all_scaleset_vms from CleanTick)
- Reduce tick intervals from 120s/60s to 30s/30s
- Replace serial scaleset_capacity ARM GETs in scaleset_sync with single Resource Graph query
- Remove redundant double-check in delete_empty_scalesets (scaleset_sync keeps capacity fresh)
- Rewrite prune_cluster as single-pass with O(1) lookups; fix pending_down iteration bug
- Fix DateTime parsing in prune_scalesets for Azure's variable-precision timestamps (see the sketch below)
- Drop legacy JULIA_AZMANAGERS_PENDING_CADENCE env var fallback
- Add local-only 100-node scale test (test/integration/test_scale.jl)
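
On the timestamp fix: Azure returns ISO 8601 times with anywhere from zero to seven fractional-second digits, while Julia's DateTime carries only milliseconds, so a fixed dateformat fails on some records. A sketch of one way to normalize before parsing (the helper name is hypothetical):

```julia
using Dates

function parse_azure_timestamp(s::AbstractString)
    m = match(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z?$", s)
    m === nothing && throw(ArgumentError("unrecognized timestamp: $s"))
    dt = DateTime(m.captures[1], dateformat"yyyy-mm-ddTHH:MM:SS")
    frac = m.captures[2]
    frac === nothing && return dt
    # Keep at most three fractional digits (milliseconds), padding short runs.
    dt + Millisecond(parse(Int, rpad(first(frac, 3), 3, '0')))
end

parse_azure_timestamp("2026-05-05T17:46:03.1234567Z")  # seven digits: ok
parse_azure_timestamp("2026-05-05T17:46:03Z")          # no fraction: ok
```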