Nodepool Aware Validation by paras-agarwal-meesho · Pull Request #340 · Meesho/BharatMLStack

paras-agarwal-meesho · 2026-02-18T12:32:45Z

🔁 Pull Request Template – BharatMLStack

Context:

Objective: Ensure that every model is validated on the same node pool (hardware configuration) where it will ultimately be deployed. This guarantees hardware parity between validation and deployment.

Refer Solution Doc

Describe your changes:

Getting correct Test Deployable Id : GetTestDeployableId now return the nodepool aware test deployable id. Earlier it just distinguished based on machineType.
Nodepool->test Deployable Mapping: Sql function for fetching the test deployable id using nodepool from the service_depoyable_config table
Scaling: Scaling functions are created that use ArgoCD client to patch the min/max replica count.

Checklist before requesting a review

I have reviewed my own changes?
Relevant or critical functionality is covered by tests?
Monitoring needs have been evaluated?
Any necessary documentation updates have been considered?

📂 Modules Affected

horizon (Real-time systems / networking)
online-feature-store (Feature serving infra)
trufflebox-ui (Admin panel / UI)
infra (Docker, CI/CD, GCP/AWS setup)
docs (Documentation updates)
Other: ___________

✅ Type of Change

📊 Benchmark / Metrics (if applicable)

Summary by CodeRabbit

Release Notes

New Features
- Added deployable scaling capability with configurable minimum and maximum replica counts
- Enhanced test deployable resolution with node pool lookup support
Database
- Added deployable type classification to deployment configurations
- Added source configuration and model name tracking fields for improved configuration management

…e-validation

coderabbitai · 2026-02-18T12:33:04Z

Walkthrough

This pull request introduces deployable type classification (test/target), adds ArgoCD-based replica scaling functionality, enhances test deployable ID resolution with node pool lookup fallback, and updates corresponding database schema, repository methods, and test infrastructure.

Changes

Cohort / File(s)	Summary
DeployableType Field `horizon/internal/repositories/sql/servicedeployableconfig/table.go`, `horizon/internal/deployable/handler/modelhandler.go`, `horizon/internal/deployable/handler/predatorhandler.go`, `quick-start/db-init/scripts/init-mysql.sh`	Added new `DeployableType` field to `ServiceDeployableConfig` struct with constants `DeployableTypeTest` and `DeployableTypeTarget`. Field is initialized to `DeployableTypeTarget` during deployable creation and persisted in database schema.
Deployable Scaling `horizon/pkg/argocd/hpa.go`, `horizon/internal/infrastructure/handler/handler.go`, `horizon/internal/infrastructure/controller/controller_test.go`	Introduced new `ScaleDeployable` method that scales deployables by patching HPA resources, with fallback to KEDA ScaledObject. Includes input validation and error handling. Added corresponding mock method for testing.
Test Deployable Resolution `horizon/internal/predator/handler/predator_validation.go`, `horizon/internal/predator/handler/predator.go`, `horizon/internal/repositories/sql/servicedeployableconfig/sql.go`	Enhanced test deployable ID resolution with new `resolveTestDeployableID` helper that attempts node pool lookup first, then falls back to machine-type-based selection. Added `GetTestDeployableIDByNodePool` repository method and improved error handling in validation logic. Updated error messaging to show numeric deployable ID.
Test Infrastructure Updates `horizon/internal/predator/handler/predator_test.go`, `horizon/internal/inferflow/handler/inferflow_test.go`	Added mock implementations of `GetTestDeployableIDByNodePool` method returning not-found error for test scenarios. Added required gorm imports for error type support.
Database Schema `quick-start/db-init/scripts/init-mysql.sh`	Added new columns `source_config_id` to `inferflow_config` and `source_model_name` to `predator_config`. Updated `service_deployable_config` inserts to include deployable_type value.

Sequence Diagram(s)

sequenceDiagram
    participant Client as API Client
    participant Handler as Infrastructure Handler
    participant ArgoCD as ArgoCD Service
    participant HPA as HPA Resource
    participant KEDA as KEDA ScaledObject

    Client->>Handler: ScaleDeployable(appName, env, minReplica, maxReplica)
    Handler->>Handler: Validate inputs & derive app name
    Handler->>ArgoCD: Patch HPA resource with replica counts
    
    alt HPA Patch Successful
        ArgoCD->>HPA: Apply replica configuration
        HPA-->>ArgoCD: Success
        ArgoCD-->>Handler: Success response
        Handler-->>Client: Return nil
    else HPA Patch Failed
        ArgoCD-->>Handler: Error
        Handler->>ArgoCD: Patch KEDA ScaledObject with replica counts
        
        alt KEDA Patch Successful
            ArgoCD->>KEDA: Apply replica configuration
            KEDA-->>ArgoCD: Success
            ArgoCD-->>Handler: Success response
            Handler-->>Client: Return nil
        else Both Failed
            ArgoCD-->>Handler: Error
            Handler-->>Client: Return combined error
        end
    end

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Krd Checker	⚠️ Warning	PR description lacks required KRD link or KRD exemption with Pod Type and justified explanation.	Add KRD link (format: KRD: [Google Doc URL]) or KRD exemption (Pod Type H0/H1/H2 + justified Why KRD not required section).

✅ Passed checks (1 passed)

Check name	Status	Explanation
Dynamic Configuration Validation	✅ Passed	The pull request does not contain any changes to files matching the application-dyn-*.yml pattern.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

quick-start/db-init/scripts/init-mysql.sh (2)
605-639: ⚠️ Potential issue | 🟠 Major

Add deployable_type column to the table schema.

Line 606/630 insert deployable_type, but CREATE TABLE service_deployable_config doesn't define it, so the init script will fail at runtime. Add the column to the table definition to keep schema and seed data consistent.
🛠️ Suggested fix
   CREATE TABLE IF NOT EXISTS service_deployable_config (
     id int NOT NULL AUTO_INCREMENT,
     name varchar(255),
     host varchar(255) NOT NULL,
     active tinyint(1) DEFAULT 0,
     service enum('inferflow', 'predator', 'numerix'),
+    deployable_type enum('test','target') NOT NULL DEFAULT 'target',
     created_by varchar(255),
     updated_by varchar(255),
     created_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quick-start/db-init/scripts/init-mysql.sh` around lines 605 - 639, The seed
INSERTs include a deployable_type column but the service_deployable_config table
definition is missing that column; update the CREATE TABLE
service_deployable_config statement to add a deployable_type column with an
appropriate type (e.g., VARCHAR or ENUM) and default/null settings consistent
with other columns so the INSERTs (which reference deployable_type) succeed;
ensure any constraints or indexes match expected usage in the INSERTs.
714-717: ⚠️ Potential issue | 🟡 Minor

Lowercase the service name in seed data.

Line 716 uses 'INFERFLOW'; service names should be stored and compared in lowercase to avoid lookup mismatches.
🛠️ Suggested fix
-      'INFERFLOW',
+      'inferflow',
Based on learnings: Within the BharatMLStack horizon codebase, ensure that service names (inferflow, predator, numerix) are stored in lowercase in database tables and are consistently used in lowercase throughout the code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quick-start/db-init/scripts/init-mysql.sh` around lines 714 - 717, The seed
row in init-mysql.sh contains the service name 'INFERFLOW' which must be stored
in lowercase to avoid lookup mismatches; update the seed data entry that
currently uses 'INFERFLOW' to 'inferflow' (and scan sibling seed inserts for
other services like 'PREDATOR'/'NUMERIX' to make them lowercase too) so the
database, the seeding logic in init-mysql.sh, and any comparisons elsewhere
consistently use lowercase service names.

🧹 Nitpick comments (2)

horizon/internal/repositories/sql/servicedeployableconfig/sql.go (1)

117-123: Consider adding ORDER BY for deterministic results.

If multiple test deployables share the same nodeSelectorValue, Limit(1) without an explicit ordering will return a non-deterministic result. Consider ordering by id (or another stable field) to ensure consistent behavior.

♻️ Proposed fix

 	tx := r.db.
 		Model(&ServiceDeployableConfig{}).
 		Select("id").
 		Where("deployable_type = ?", DeployableTypeTest).
 		Where("JSON_UNQUOTE(JSON_EXTRACT(config, '$.nodeSelectorValue')) = ?", nodePool).
+		Order("id ASC").
 		Limit(1).
 		Scan(&id)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/repositories/sql/servicedeployableconfig/sql.go` around
lines 117 - 123, The query selecting an id from ServiceDeployableConfig with
DeployableTypeTest and a JSON_EXTRACT filter uses Limit(1) but no deterministic
ordering; update the query that builds tx (the chain on
r.db.Model(&ServiceDeployableConfig{})...Scan(&id)) to include an explicit Order
clause (e.g., Order("id ASC") or another stable field) before Limit(1) so the
returned row is deterministic when multiple rows match the nodeSelectorValue.

horizon/internal/predator/handler/predator_validation.go (1)

183-187: Use appropriate log level for error conditions.

Line 186 logs an actual error using log.Info().Err(lookupErr). This could mask issues in log monitoring/alerting systems that filter by severity. Consider using log.Warn() since this is an unexpected failure (not just a "not found" case).

♻️ Proposed fix

 		} else if errors.Is(lookupErr, gorm.ErrRecordNotFound) {
 			log.Info().Str("nodePool", nodePool).Msgf("no test deployable for node pool %q (deployable_type=test, config.nodeSelectorValue=%q), using machine-type fallback", nodePool, nodePool)
 		} else {
-			log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup by node pool failed, using machine-type fallback")
+			log.Warn().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup by node pool failed, using machine-type fallback")
 		}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/predator/handler/predator_validation.go` around lines 183 -
187, Replace the info-level logging of an actual error with a warning: in the
block handling lookupErr (the branch that currently calls
log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup
by node pool failed, using machine-type fallback")), change the logger to
log.Warn() while keeping the Err(lookupErr), Str("nodePool", nodePool) and the
same Msg text so unexpected lookup failures are surfaced at warn level; this
affects the code around the lookupErr handling in predator_validation.go where
lookupErr and nodePool are used.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@quick-start/db-init/scripts/init-mysql.sh`:
- Around line 605-639: The seed INSERTs include a deployable_type column but the
service_deployable_config table definition is missing that column; update the
CREATE TABLE service_deployable_config statement to add a deployable_type column
with an appropriate type (e.g., VARCHAR or ENUM) and default/null settings
consistent with other columns so the INSERTs (which reference deployable_type)
succeed; ensure any constraints or indexes match expected usage in the INSERTs.
- Around line 714-717: The seed row in init-mysql.sh contains the service name
'INFERFLOW' which must be stored in lowercase to avoid lookup mismatches; update
the seed data entry that currently uses 'INFERFLOW' to 'inferflow' (and scan
sibling seed inserts for other services like 'PREDATOR'/'NUMERIX' to make them
lowercase too) so the database, the seeding logic in init-mysql.sh, and any
comparisons elsewhere consistently use lowercase service names.

---

Nitpick comments:
In `@horizon/internal/predator/handler/predator_validation.go`:
- Around line 183-187: Replace the info-level logging of an actual error with a
warning: in the block handling lookupErr (the branch that currently calls
log.Info().Err(lookupErr).Str("nodePool", nodePool).Msg("Test deployable lookup
by node pool failed, using machine-type fallback")), change the logger to
log.Warn() while keeping the Err(lookupErr), Str("nodePool", nodePool) and the
same Msg text so unexpected lookup failures are surfaced at warn level; this
affects the code around the lookupErr handling in predator_validation.go where
lookupErr and nodePool are used.

In `@horizon/internal/repositories/sql/servicedeployableconfig/sql.go`:
- Around line 117-123: The query selecting an id from ServiceDeployableConfig
with DeployableTypeTest and a JSON_EXTRACT filter uses Limit(1) but no
deterministic ordering; update the query that builds tx (the chain on
r.db.Model(&ServiceDeployableConfig{})...Scan(&id)) to include an explicit Order
clause (e.g., Order("id ASC") or another stable field) before Limit(1) so the
returned row is deterministic when multiple rows match the nodeSelectorValue.

pavan-adari-meesho · 2026-02-20T10:48:30Z

@paras-agarwal-meesho Take a pull from develop and fix any merge conflict that may come

pavan-adari-meesho and others added 25 commits January 30, 2026 18:02

internal complete code migration

363041b

internal components interface methods

614d483

revert pre-commit config

c3762fd

removed redundant structs

bb796ed

error formatting fixes

f380623

predator sync

c7ee967

predator init fixes

1aa8d32

code rabbit issues

19c0d77

bulk delete changes

0423a9e

schema client separation and fixes

9754fcb

removing redundant functions and error formatting

b845141

minor bug fix

2199ed2

further refractoring for coderabbit changes

8a8157a

model name extraction fix

340983e

return error on no model files found

8ecb0ea

capitilization

649e89d

schema client refractor

4ec6b12

int to preprod name convention

1303975

etcd name fixes and refractors

7135c92

refractors and dev toggle script fix

a07f6f2

predator handler refractor and gcs client minor fixes

16aeb6c

inferflow refractor into multiple files

2968ade

feat: nodepool aware validation

1da7ceb

Merge remote-tracking branch 'origin/develop' into feat/nodepool-awar…

e3e294c

…e-validation

Merge Conflicts corrected

080fd4f

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

Merge branch 'develop' into feat/nodepool-aware-validation

e6a1bd5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Nodepool Aware Validation#340

Nodepool Aware Validation#340
paras-agarwal-meesho wants to merge 26 commits intodevelopfrom
feat/nodepool-aware-validation

paras-agarwal-meesho commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 18, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

pavan-adari-meesho commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

paras-agarwal-meesho commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔁 Pull Request Template – BharatMLStack

Context:

Describe your changes:

Checklist before requesting a review

📂 Modules Affected

✅ Type of Change

📊 Benchmark / Metrics (if applicable)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

pavan-adari-meesho commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

paras-agarwal-meesho commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 18, 2026 •

edited

Loading