Skip to content

Comments

Nodepool Aware Validation#340

Open
paras-agarwal-meesho wants to merge 26 commits intodevelopfrom
feat/nodepool-aware-validation
Open

Nodepool Aware Validation#340
paras-agarwal-meesho wants to merge 26 commits intodevelopfrom
feat/nodepool-aware-validation

Conversation

@paras-agarwal-meesho
Copy link

@paras-agarwal-meesho paras-agarwal-meesho commented Feb 18, 2026

🔁 Pull Request Template – BharatMLStack

Context:

Objective: Ensure that every model is validated on the same node pool (hardware configuration) where it will ultimately be deployed. This guarantees hardware parity between validation and deployment.

Refer Solution Doc

Describe your changes:

  • Getting correct Test Deployable Id : GetTestDeployableId now return the nodepool aware test deployable id. Earlier it just distinguished based on machineType.
  • Nodepool->test Deployable Mapping: Sql function for fetching the test deployable id using nodepool from the service_depoyable_config table
  • Scaling: Scaling functions are created that use ArgoCD client to patch the min/max replica count.

Checklist before requesting a review

  • I have reviewed my own changes?
  • Relevant or critical functionality is covered by tests?
  • Monitoring needs have been evaluated?
  • Any necessary documentation updates have been considered?

📂 Modules Affected

  • horizon (Real-time systems / networking)
  • online-feature-store (Feature serving infra)
  • trufflebox-ui (Admin panel / UI)
  • infra (Docker, CI/CD, GCP/AWS setup)
  • docs (Documentation updates)
  • Other: ___________

✅ Type of Change

  • Feature addition
  • Bug fix
  • Infra / build system change
  • Performance improvement
  • Refactor
  • Documentation
  • Other: ___________

📊 Benchmark / Metrics (if applicable)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added deployable scaling capability with configurable minimum and maximum replica counts
    • Enhanced test deployable resolution with node pool lookup support
  • Database

    • Added deployable type classification to deployment configurations
    • Added source configuration and model name tracking fields for improved configuration management

@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

Walkthrough

This pull request introduces deployable type classification (test/target), adds ArgoCD-based replica scaling functionality, enhances test deployable ID resolution with node pool lookup fallback, and updates corresponding database schema, repository methods, and test infrastructure.

Changes

Cohort / File(s) Summary
DeployableType Field
horizon/internal/repositories/sql/servicedeployableconfig/table.go, horizon/internal/deployable/handler/modelhandler.go, horizon/internal/deployable/handler/predatorhandler.go, quick-start/db-init/scripts/init-mysql.sh
Added new DeployableType field to ServiceDeployableConfig struct with constants DeployableTypeTest and DeployableTypeTarget. Field is initialized to DeployableTypeTarget during deployable creation and persisted in database schema.
Deployable Scaling
horizon/pkg/argocd/hpa.go, horizon/internal/infrastructure/handler/handler.go, horizon/internal/infrastructure/controller/controller_test.go
Introduced new ScaleDeployable method that scales deployables by patching HPA resources, with fallback to KEDA ScaledObject. Includes input validation and error handling. Added corresponding mock method for testing.
Test Deployable Resolution
horizon/internal/predator/handler/predator_validation.go, horizon/internal/predator/handler/predator.go, horizon/internal/repositories/sql/servicedeployableconfig/sql.go
Enhanced test deployable ID resolution with new resolveTestDeployableID helper that attempts node pool lookup first, then falls back to machine-type-based selection. Added GetTestDeployableIDByNodePool repository method and improved error handling in validation logic. Updated error messaging to show numeric deployable ID.
Test Infrastructure Updates
horizon/internal/predator/handler/predator_test.go, horizon/internal/inferflow/handler/inferflow_test.go
Added mock implementations of GetTestDeployableIDByNodePool method returning not-found error for test scenarios. Added required gorm imports for error type support.
Database Schema
quick-start/db-init/scripts/init-mysql.sh
Added new columns source_config_id to inferflow_config and source_model_name to predator_config. Updated service_deployable_config inserts to include deployable_type value.

Sequence Diagram(s)

sequenceDiagram
    participant Client as API Client
    participant Handler as Infrastructure Handler
    participant ArgoCD as ArgoCD Service
    participant HPA as HPA Resource
    participant KEDA as KEDA ScaledObject

    Client->>Handler: ScaleDeployable(appName, env, minReplica, maxReplica)
    Handler->>Handler: Validate inputs & derive app name
    Handler->>ArgoCD: Patch HPA resource with replica counts
    
    alt HPA Patch Successful
        ArgoCD->>HPA: Apply replica configuration
        HPA-->>ArgoCD: Success
        ArgoCD-->>Handler: Success response
        Handler-->>Client: Return nil
    else HPA Patch Failed
        ArgoCD-->>Handler: Error
        Handler->>ArgoCD: Patch KEDA ScaledObject with replica counts
        
        alt KEDA Patch Successful
            ArgoCD->>KEDA: Apply replica configuration
            KEDA-->>ArgoCD: Success
            ArgoCD-->>Handler: Success response
            Handler-->>Client: Return nil
        else Both Failed
            ArgoCD-->>Handler: Error
            Handler-->>Client: Return combined error
        end
    end
Loading
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Krd Checker ⚠️ Warning PR description lacks required KRD link or KRD exemption with Pod Type and justified explanation. Add KRD link (format: KRD: [Google Doc URL]) or KRD exemption (Pod Type H0/H1/H2 + justified Why KRD not required section).
✅ Passed checks (1 passed)
Check name Status Explanation
Dynamic Configuration Validation ✅ Passed The pull request does not contain any changes to files matching the application-dyn-*.yml pattern.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
quick-start/db-init/scripts/init-mysql.sh (2)

605-639: ⚠️ Potential issue | 🟠 Major

Add deployable_type column to the table schema.

Line 606/630 insert deployable_type, but CREATE TABLE service_deployable_config doesn't define it, so the init script will fail at runtime. Add the column to the table definition to keep schema and seed data consistent.

🛠️ Suggested fix
   CREATE TABLE IF NOT EXISTS service_deployable_config (
     id int NOT NULL AUTO_INCREMENT,
     name varchar(255),
     host varchar(255) NOT NULL,
     active tinyint(1) DEFAULT 0,
     service enum('inferflow', 'predator', 'numerix'),
+    deployable_type enum('test','target') NOT NULL DEFAULT 'target',
     created_by varchar(255),
     updated_by varchar(255),
     created_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quick-start/db-init/scripts/init-mysql.sh` around lines 605 - 639, The seed
INSERTs include a deployable_type column but the service_deployable_config table
definition is missing that column; update the CREATE TABLE
service_deployable_config statement to add a deployable_type column with an
appropriate type (e.g., VARCHAR or ENUM) and default/null settings consistent
with other columns so the INSERTs (which reference deployable_type) succeed;
ensure any constraints or indexes match expected usage in the INSERTs.

714-717: ⚠️ Potential issue | 🟡 Minor

Lowercase the service name in seed data.

Line 716 uses 'INFERFLOW'; service names should be stored and compared in lowercase to avoid lookup mismatches.

🛠️ Suggested fix
-      'INFERFLOW',
+      'inferflow',

Based on learnings: Within the BharatMLStack horizon codebase, ensure that service names (inferflow, predator, numerix) are stored in lowercase in database tables and are consistently used in lowercase throughout the code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quick-start/db-init/scripts/init-mysql.sh` around lines 714 - 717, The seed
row in init-mysql.sh contains the service name 'INFERFLOW' which must be stored
in lowercase to avoid lookup mismatches; update the seed data entry that
currently uses 'INFERFLOW' to 'inferflow' (and scan sibling seed inserts for
other services like 'PREDATOR'/'NUMERIX' to make them lowercase too) so the
database, the seeding logic in init-mysql.sh, and any comparisons elsewhere
consistently use lowercase service names.
🧹 Nitpick comments (2)
horizon/internal/repositories/sql/servicedeployableconfig/sql.go (1)

117-123: Consider adding ORDER BY for deterministic results.

If multiple test deployables share the same nodeSelectorValue, Limit(1) without an explicit ordering will return a non-deterministic result. Consider ordering by id (or another stable field) to ensure consistent behavior.

♻️ Proposed fix
 	tx := r.db.
 		Model(&ServiceDeployableConfig{}).
 		Select("id").
 		Where("deployable_type = ?", DeployableTypeTest).
 		Where("JSON_UNQUOTE(JSON_EXTRACT(config, '$.nodeSelectorValue')) = ?", nodePool).
+		Order("id ASC").
 		Limit(1).
 		Scan(&id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/repositories/sql/servicedeployableconfig/sql.go` around
lines 117 - 123, The query selecting an id from ServiceDeployableConfig with
DeployableTypeTest and a JSON_EXTRACT filter uses Limit(1) but no deterministic
ordering; update the query that builds tx (the chain on
r.db.Model(&ServiceDeployableConfig{})...Scan(&id)) to include an explicit Order
clause (e.g., Order("id ASC") or another stable field) before Limit(1) so the
returned row is deterministic when multiple rows match the nodeSelectorValue.
horizon/internal/predator/handler/predator_validation.go (1)

183-187: Use appropriate log level for error conditions.

Line 186 logs an actual error using log.Info().Err(lookupErr). This could mask issues in log monitoring/alerting systems that filter by severity. Consider using log.Warn() since this is an unexpected failure (not just a "not found" case).

♻️ Proposed fix
 		} else if errors.Is(lookupErr, gorm.ErrRecordNotFound) {
 			log.Info().Str("nodePool", nodePool).Msgf("no test deployable for node pool %q (deployable_type=test, config.nodeSelectorValue=%q), using machine-type fallback", nodePool, nodePool)
 		} else {
-			log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup by node pool failed, using machine-type fallback")
+			log.Warn().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup by node pool failed, using machine-type fallback")
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/predator/handler/predator_validation.go` around lines 183 -
187, Replace the info-level logging of an actual error with a warning: in the
block handling lookupErr (the branch that currently calls
log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup
by node pool failed, using machine-type fallback")), change the logger to
log.Warn() while keeping the Err(lookupErr), Str("nodePool", nodePool) and the
same Msg text so unexpected lookup failures are surfaced at warn level; this
affects the code around the lookupErr handling in predator_validation.go where
lookupErr and nodePool are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@quick-start/db-init/scripts/init-mysql.sh`:
- Around line 605-639: The seed INSERTs include a deployable_type column but the
service_deployable_config table definition is missing that column; update the
CREATE TABLE service_deployable_config statement to add a deployable_type column
with an appropriate type (e.g., VARCHAR or ENUM) and default/null settings
consistent with other columns so the INSERTs (which reference deployable_type)
succeed; ensure any constraints or indexes match expected usage in the INSERTs.
- Around line 714-717: The seed row in init-mysql.sh contains the service name
'INFERFLOW' which must be stored in lowercase to avoid lookup mismatches; update
the seed data entry that currently uses 'INFERFLOW' to 'inferflow' (and scan
sibling seed inserts for other services like 'PREDATOR'/'NUMERIX' to make them
lowercase too) so the database, the seeding logic in init-mysql.sh, and any
comparisons elsewhere consistently use lowercase service names.

---

Nitpick comments:
In `@horizon/internal/predator/handler/predator_validation.go`:
- Around line 183-187: Replace the info-level logging of an actual error with a
warning: in the block handling lookupErr (the branch that currently calls
log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup
by node pool failed, using machine-type fallback")), change the logger to
log.Warn() while keeping the Err(lookupErr), Str("nodePool", nodePool) and the
same Msg text so unexpected lookup failures are surfaced at warn level; this
affects the code around the lookupErr handling in predator_validation.go where
lookupErr and nodePool are used.

In `@horizon/internal/repositories/sql/servicedeployableconfig/sql.go`:
- Around line 117-123: The query selecting an id from ServiceDeployableConfig
with DeployableTypeTest and a JSON_EXTRACT filter uses Limit(1) but no
deterministic ordering; update the query that builds tx (the chain on
r.db.Model(&ServiceDeployableConfig{})...Scan(&id)) to include an explicit Order
clause (e.g., Order("id ASC") or another stable field) before Limit(1) so the
returned row is deterministic when multiple rows match the nodeSelectorValue.

@pavan-adari-meesho
Copy link
Contributor

@paras-agarwal-meesho Take a pull from develop and fix any merge conflict that may come

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants