Database connection pool health probe with adaptive max connection sizing #119
Open
ekwe7 wants to merge 3 commits into
closes #72
Summary
This PR introduces an adaptive database connection pool management system that continuously monitors pool health and dynamically adjusts maximum connection limits based on workload characteristics.
The implementation improves database efficiency by scaling connection capacity during traffic spikes while reducing unnecessary resource consumption during periods of low activity.
Problem
Database connection pools currently use static maximum connection limits.
This creates two operational challenges:
Under High Load
Connection pools can become exhausted
Query wait times increase
Application latency degrades
Request failures become more likely
Under Low Load
Excess connections remain allocated unnecessarily
Memory and database resources are wasted
Infrastructure efficiency decreases
Static sizing requires manual tuning and cannot adapt to changing traffic patterns.
Solution
This PR introduces a health-probe-driven adaptive scaling system that continuously evaluates connection pool performance and adjusts pool capacity automatically.
The system uses the following runtime metrics to determine when scaling actions are required:
Query latency distribution
Connection acquisition wait times
Pool utilization levels
Key Features
Connection Pool Health Probe
Implemented a health monitoring service that executes every 10 seconds.
The probe collects:
Query latency metrics
Connection wait times
Active pool utilization
Current pool size
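As an illustration of what one probe pass might compute, here is a minimal sketch. The `ProbeSample` type, the `collect_sample` helper, and the nearest-rank percentile method are assumptions made for this example; the PR does not state how percentiles are derived or how samples are stored.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty window."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass(frozen=True)
class ProbeSample:
    """Hypothetical snapshot produced by one probe run."""
    p95_query_latency_ms: float
    p90_wait_ms: float
    utilization: float  # active connections / pool size, in [0, 1]
    pool_size: int


def collect_sample(
    query_latencies_ms: Sequence[float],
    wait_times_ms: Sequence[float],
    active: int,
    pool_size: int,
) -> ProbeSample:
    """Summarize the raw observations gathered since the previous probe."""
    return ProbeSample(
        p95_query_latency_ms=percentile(query_latencies_ms, 95),
        p90_wait_ms=percentile(wait_times_ms, 90),
        utilization=active / pool_size if pool_size else 0.0,
        pool_size=pool_size,
    )
```

A caller would invoke `collect_sample` once per probe interval with the latencies and wait times recorded in that window.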
Adaptive Scaling Logic
Added PID-controller-based scaling behavior that:
Continuously evaluates pool health
Calculates optimal pool sizing targets
Avoids aggressive oscillation
Responds smoothly to changing workloads
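The PR does not spell out the controller's error signal or gains, so the following is only a sketch of how a textbook PID loop could drive pool sizing. It assumes the error is the measured p90 wait time minus a 50 ms setpoint, and that the raw output is limited to the ±5 adjustment step; `PIDController`, `step_limited_delta`, and the gain values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook PID loop; gains and setpoint are illustrative assumptions."""
    kp: float
    ki: float
    kd: float
    setpoint: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, measurement: float, dt: float) -> float:
        """Return the raw control output for one probe interval of length dt."""
        error = measurement - self.setpoint
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample to differentiate against
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def step_limited_delta(output: float, step: int = 5) -> int:
    """Limit the controller output to at most +/- step connections."""
    return max(-step, min(step, round(output)))
```

Limiting the output to one step per adjustment is one way to get the smooth, non-oscillating response described above, at the cost of slower convergence after a large spike.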
Scaling Constraints
The pool operates within the following bounds:
| Setting | Value |
| --- | --- |
| Minimum Connections | 5 |
| Maximum Connections | 200 |
| Adjustment Step | ±5 |
| Probe Interval | 10 seconds |
| Cooldown Period | 60 seconds |
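These bounds could be captured in a small configuration object. The sketch below uses the values from the table; the `ScalingConfig` and `apply_step` names are hypothetical and are not taken from the PR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    """Hypothetical container for the bounds listed in this PR."""
    min_connections: int = 5
    max_connections: int = 200
    step: int = 5
    probe_interval_s: float = 10.0
    cooldown_s: float = 60.0

    def __post_init__(self) -> None:
        if not 0 < self.min_connections <= self.max_connections:
            raise ValueError("require 0 < min_connections <= max_connections")
        if self.step <= 0:
            raise ValueError("step must be positive")


def apply_step(current: int, direction: int, cfg: ScalingConfig) -> int:
    """Move one step up (+1), down (-1), or hold (0), clamped to the bounds."""
    proposed = current + direction * cfg.step
    return max(cfg.min_connections, min(cfg.max_connections, proposed))
```

Clamping after the step means a pool at 198 connections grows to 200 rather than 203, and a pool at 7 shrinks to 5 rather than 2.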
Scale-Up Conditions
Scaling upward is triggered when p90 connection wait time exceeds 50ms, indicating connection contention and insufficient pool capacity.
Scale-Down Conditions
Scaling downward is allowed when p95 query latency remains below 100ms, indicating stable performance and excess connection capacity.
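The two threshold rules can be sketched as a single decision function. The PR does not say which rule wins when both conditions hold at once, so this example assumes scale-up takes precedence; the `Decision` enum and `decide` function are hypothetical names.

```python
from enum import Enum


class Decision(Enum):
    SCALE_UP = "scale_up"
    SCALE_DOWN = "scale_down"
    HOLD = "hold"


def decide(
    p90_wait_ms: float,
    p95_query_latency_ms: float,
    wait_threshold_ms: float = 50.0,
    latency_threshold_ms: float = 100.0,
) -> Decision:
    """Apply the scale-up and scale-down thresholds from this PR."""
    # Assumption: contention outranks the scale-down rule when both apply.
    if p90_wait_ms > wait_threshold_ms:
        return Decision.SCALE_UP
    if p95_query_latency_ms < latency_threshold_ms:
        return Decision.SCALE_DOWN
    return Decision.HOLD
```

Both comparisons are strict, so a p90 wait of exactly 50 ms does not trigger a scale-up and a p95 latency of exactly 100 ms does not permit a scale-down.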
Cooldown Protection
To prevent rapid oscillation, a 60-second cooldown is enforced between consecutive scaling actions in the same direction.
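A per-direction cooldown could look roughly like the sketch below. `CooldownGate` is a hypothetical helper, and the choice that a rejected attempt does not restart the timer is an assumption, since the PR does not describe that detail.

```python
from typing import Dict


class CooldownGate:
    """Allow at most one scaling action per direction per cooldown window."""

    def __init__(self, cooldown_s: float = 60.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_action: Dict[str, float] = {}  # direction -> timestamp

    def allow(self, direction: str, now: float) -> bool:
        """Return True and record the action if the cooldown has elapsed."""
        last = self._last_action.get(direction)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down; the timer is not reset
        self._last_action[direction] = now
        return True
```

In a running service, `now` would typically come from `time.monotonic()` so that wall-clock adjustments cannot shorten or extend the window. Because directions are tracked separately, a scale-down is not blocked by a recent scale-up.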
Observability
Added Prometheus metrics for monitoring adaptive behavior.
New gauges include:
Current pool size
Target pool size
Pool utilization percentage
Scaling decision state
These metrics provide visibility into runtime scaling behavior and support operational tuning.
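To show what the four gauges might look like when scraped, here is a dependency-free sketch that renders them in the Prometheus text exposition format. The metric names and the numeric encoding of the scaling decision are assumptions; the PR does not list them, and a real service would more likely register gauges through a Prometheus client library than format text by hand.

```python
from typing import Mapping, Tuple


def render_gauges(gauges: Mapping[str, Tuple[str, float]]) -> str:
    """Render name -> (help text, value) pairs as Prometheus gauge samples."""
    lines = []
    for name, (help_text, value) in gauges.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names; the PR does not list the real ones.
snapshot = {
    "db_pool_size": ("Current pool size", 25),
    "db_pool_target_size": ("Target pool size", 30),
    "db_pool_utilization_percent": ("Pool utilization percentage", 80.0),
    # Assumed encoding: -1 = scale down, 0 = hold, 1 = scale up.
    "db_pool_scaling_decision": ("Scaling decision state", 1),
}
text = render_gauges(snapshot)
```

Comparing the current and target gauges over time is one way to see how quickly the ±5 step converges after a spike.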
Testing
Integration Testing
Added automated tests that simulate:
Normal traffic conditions
Sudden load spikes
Sustained high utilization
Recovery after traffic reduction
Validation ensures:
Pool size increases during contention
Pool size decreases during low utilization
Scaling respects configured limits
Cooldown behavior prevents excessive adjustments
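The scenarios above can be illustrated with a toy simulation. This is not the PR's test suite: the load model (waits and latency jump whenever demand exceeds pool size), the `simulate` function, and the mapping of one tick to one 10-second probe are all assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Optional, Sequence


def simulate(
    demand_per_tick: Sequence[int],
    start_size: int = 10,
    lo: int = 5,
    hi: int = 200,
    step: int = 5,
    cooldown_ticks: int = 6,  # 60 s cooldown / 10 s probe interval
) -> List[int]:
    """Return the pool size after each probe tick under a toy load model."""
    size = start_size
    last_change: Dict[int, Optional[int]] = {+1: None, -1: None}
    history: List[int] = []
    for tick, demand in enumerate(demand_per_tick):
        # Hypothetical load model: contention appears once demand exceeds size.
        contended = demand > size
        p90_wait_ms = 100.0 if contended else 0.0
        p95_latency_ms = 150.0 if contended else 20.0

        if p90_wait_ms > 50.0:
            direction = +1
        elif p95_latency_ms < 100.0:
            direction = -1
        else:
            direction = 0

        if direction != 0:
            last = last_change[direction]
            if last is None or tick - last >= cooldown_ticks:
                new_size = max(lo, min(hi, size + direction * step))
                if new_size != size:
                    size = new_size
                    last_change[direction] = tick
        history.append(size)
    return history


# Normal traffic, a sustained spike, then recovery.
spike = simulate([8] * 3 + [60] * 40 + [8] * 40)
```

Because the scale-down rule only checks latency, this toy model shrinks the pool until contention reappears, so the size hovers near demand rather than settling exactly.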
Technical Bounds
Probe Configuration
Interval: 10 seconds
Pool Limits
Minimum Connections: 5
Maximum Connections: 200
Scaling Step
±5 connections per adjustment
Thresholds
Scale Down: p95 latency < 100ms
Scale Up: p90 wait time > 50ms
Cooldown
60 seconds between same-direction scaling actions
Acceptance Criteria
Health probe collects pool metrics
Query latency distribution monitored
Connection wait times monitored
Pool utilization monitored
PID-controller adjustment logic implemented
Dynamic pool sizing supported
Scaling bounded between 5 and 200 connections
Prometheus metrics added
Integration tests validate scaling behavior
Load spike auto-scaling verified