Skip to content

Comments

Edit Instance Count Enabled#289

Open
paras-agarwal-meesho wants to merge 32 commits intodevelopfrom
feature/edit-instance-count
Open

Edit Instance Count Enabled#289
paras-agarwal-meesho wants to merge 32 commits intodevelopfrom
feature/edit-instance-count

Conversation

@paras-agarwal-meesho
Copy link

@paras-agarwal-meesho paras-agarwal-meesho commented Feb 13, 2026

🔁 Pull Request Template – BharatMLStack


📌 Summary

Adds instance-count edit support in predator (prod):
replaceInstanceCountInConfigPreservingFormat in predator_helpers.go updates the count inside instance_group in config.pbtxt
updateInstanceCountInConfigSource changes the config.pbtxt file in config-source
processEditGCSCopyStage in predator_approval.go applies it to GCS config-source before transfer.
Adds unit tests in predator_test.go for the helper.


📂 Modules Affected

  • horizon (Real-time systems / networking)
  • online-feature-store (Feature serving infra)
  • trufflebox-ui (Admin panel / UI)
  • infra (Docker, CI/CD, GCP/AWS setup)
  • docs (Documentation updates)
  • Other: ___________

✅ Type of Change

  • Feature addition
  • Bug fix
  • Infra / build system change
  • Performance improvement
  • Refactor
  • Documentation
  • Other: ___________

📊 Benchmark / Metrics (if applicable)

Summary by CodeRabbit

  • New Features

    • Support for updating instance counts during model config edits.
    • Better model name derivation for scale-up workflows with deployable tag handling.
  • Bug Fixes

    • Safer, format-preserving config modifications and improved error handling when editing or copying configs.
    • More robust model config parsing to prevent malformed updates.
  • Tests

    • Added extensive unit tests for config parsing, name derivation, and instance-count changes.

@coderabbitai
Copy link

coderabbitai bot commented Feb 13, 2026

Walkthrough

Adds GCS/model-name helper utilities, protobuf-based config text parsing and model name replacement, instance-count update logic integrated into edit flow, test coverage for helpers and replacements, and migrates local model config types to generated protogen types (proto go_package change).

Changes

Cohort / File(s) Summary
Instance Count Management
horizon/internal/predator/handler/predator_approval.go
Adds updateInstanceCountInConfigSource(bucket, basePath, modelName string, instanceCount int) error and calls it from the edit flow when configBucket, configPath present and payload.MetaData.InstanceCount > 0. Note: the function appears duplicated within the file.
Helpers and Utilities
horizon/internal/predator/handler/predator_helpers.go
New helper file with GCS URL parsing (parseGCSURL, extractGCSPath, extractGCSDetails), model-name derivation/reversion for ScaleUp (GetDerivedModelName, GetOriginalModelName), env check (isNonProductionEnvironment), and replaceInstanceCountInConfigPreservingFormat to update instance_count while preserving formatting.
Config Replacement Migration
horizon/internal/externalcall/gcs_client.go
Replaces regex/byte-manipulation approach with protobuf text parsing: adds exported ReplaceModelNameInConfig(data []byte, destModelName string) ([]byte, error) that unmarshals to protogen.ModelConfig, updates Name, and marshals back; adds input validation and error propagation.
Config Replacement Tests
horizon/internal/externalcall/gcs_client_test.go
Tests updated to call new ReplaceModelNameInConfig API, handle errors, and assert top-level/nested name changes or expected errors.
Upload Configuration
horizon/internal/predator/handler/predator_upload.go
Replaces in-method name replacement calls with externalcall.ReplaceModelNameInConfig(...) and adds error handling; updates validation to unmarshal into protogen.ModelConfig.
Model Type Consolidation
horizon/internal/predator/handler/model.go, horizon/internal/predator/handler/predator.go
Switches local ModelConfig/ModelInput/ModelOutput/ModelEnsembling usages to generated protogen types (proto go_package changed). Updates function signatures and internal references accordingly (e.g., ModelParamsResponse.EnsembleScheduling type).
Unit Tests & Mocks
horizon/internal/predator/handler/predator_test.go
Adds comprehensive unit tests for GCS parsing, model-name derivation/reversion, instance count replacement formatting, and request validation; includes a mock ServiceDeployableRepository implementation used by tests.
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Krd Checker ⚠️ Warning The pull request description does not contain either a valid KRD link or a valid KRD exemption as required by the custom check. Add either a KRD link (KRD: https://docs.google.com/...) or KRD exemption (Pod Type with justification) to the PR description.
✅ Passed checks (1 passed)
Check name Status Explanation
Dynamic Configuration Validation ✅ Passed PR contains no modifications to any application-dyn-*.yml files. All changes are Go source code, test files, and protobuf definitions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
horizon/internal/inferflow/handler/inferflow.go (2)

375-437: ⚠️ Potential issue | 🟠 Major

Fix nil-error dereference in ScaleUp latest-request lookup.
If GetLatestRequest returns a non-empty Error with err == nil, err.Error() will panic. Handle the two cases separately.

🛠️ Suggested fix
-	latestSourceRequest, err = m.GetLatestRequest(sourceConfigID)
-	if err != nil || latestSourceRequest.Error != emptyResponse {
-		return Response{}, errors.New("failed to get latest request for the source configID: " + sourceConfigID + ": " + err.Error())
-	}
+	latestSourceRequest, err = m.GetLatestRequest(sourceConfigID)
+	if err != nil {
+		return Response{}, errors.New("failed to get latest request for the source configID: " + sourceConfigID + ": " + err.Error())
+	}
+	if latestSourceRequest.Error != emptyResponse {
+		return Response{}, errors.New("failed to get latest request for the source configID: " + sourceConfigID + ": " + fmt.Sprint(latestSourceRequest.Error))
+	}

579-615: ⚠️ Potential issue | 🟡 Minor

Guard against empty deployable ID list.
If there are no deployable IDs, calling GetByIds may error or waste a DB call.

🔧 Suggested guard
-	serviceDeployables, err := m.ServiceDeployableConfigRepo.GetByIds(deployableIDs)
-	if err != nil {
-		return GetAllRequestConfigsResponse{}, errors.New("failed to get service deployable configs: " + err.Error())
-	}
-
-	deployableMap := make(map[int]string)
-	for _, sd := range serviceDeployables {
-		deployableMap[sd.ID] = sd.Name
-	}
+	deployableMap := make(map[int]string)
+	if len(deployableIDs) > 0 {
+		serviceDeployables, err := m.ServiceDeployableConfigRepo.GetByIds(deployableIDs)
+		if err != nil {
+			return GetAllRequestConfigsResponse{}, errors.New("failed to get service deployable configs: " + err.Error())
+		}
+		for _, sd := range serviceDeployables {
+			deployableMap[sd.ID] = sd.Name
+		}
+	}
🤖 Fix all issues with AI agents
In @.pre-commit-config.yaml:
- Line 7: Create the missing script referenced by the pre-commit config named
pre-commit-scripts/runner.sh: add a file with a valid shebang (e.g.
#!/usr/bin/env bash), implement a simple runner that forwards arguments to the
underlying hook logic (or invokes the repo's lint/test commands) and returns
appropriate exit codes, and ensure the file is made executable (chmod +x) so the
pre-commit entry in .pre-commit-config.yaml can run successfully.

In `@horizon/internal/externalcall/prometheus_client.go`:
- Around line 178-184: GetPredatorModelTraffic builds the prometheus query_range
URL without the required "/select/100" prefix; update the fmt.Sprintf in
GetPredatorModelTraffic to use "/select/100/prometheus/api/v1/query_range"
(keeping the same parameters: p.BaseURL, escapePrometheusQuery(query), start,
end, step) so the constructed URL matches other methods like GetModelNames and
routes correctly.

In `@horizon/internal/inferflow/handler/inferflow_helpers.go`:
- Around line 8-10: Wrap the call to
m.ServiceDeployableConfigRepo.GetById(deployableID) with the same repo-call
metrics pattern used elsewhere in this module: start a timer/observer before the
call, increment a "repo.get.count" metric (or the module's equivalent) and on
return observe latency; if err != nil increment the repo error counter (and
log/record the error via the module's error metric) before returning the wrapped
error. Use the same metric names and helper functions used for other DB calls in
this file to ensure consistency (reference the call
m.ServiceDeployableConfigRepo.GetById and the returned serviceDeployableConfig
variable when adding the timing/count/error instrumentation).

In `@horizon/internal/inferflow/handler/inferflow_review.go`:
- Around line 39-43: The code calls m.InferFlowRequestRepo.GetRequestByID into
tempRequest but only checks err; update the logic in the inferflow_review.go
block that uses tempRequest (type inferflow_request.Table) to detect a not-found
result (e.g., zero-value/empty Table) after GetRequestByID and return an
appropriate error Response (use the same error-wrapping style as existing
returns) instead of continuing; reference tempRequest, GetRequestByID and
m.InferFlowRequestRepo to locate the change and ensure you treat an empty
tempRequest as "request not found" and return early.

In `@horizon/internal/inferflow/handler/inferflow_validation.go`:
- Around line 16-20: The loop over ComponentConfig.FeatureComponents
dereferences featureComponent.FSRequest.Label without checking FSRequest for
nil, which can panic; update the loop in the inferflow validation (the block
iterating over ComponentConfig.FeatureComponents and using
featureComponent.FSRequest.Label) to first check if featureComponent.FSRequest
== nil and skip (or handle) that component before reading .Label (and optionally
log or record the missing FSRequest for debugging).

In `@horizon/internal/predator/handler/predator_approval.go`:
- Around line 254-313: processEditDBUpdateStage currently updates each model
config inline causing partial commits on mid-loop failures; wrap the entire DB
update phase in a single transaction so all model updates either commit or roll
back together. Modify Predator (processEditDBUpdateStage) to begin a transaction
from the repository/DB (use whatever transactional method your
PredatorConfigRepo exposes or add BeginTx/WithTx), perform all
GetActiveModelByModelName, marshal/update fields, and call
PredatorConfigRepo.Update within that transaction, and then commit at the end;
on any error ensure you rollback the transaction and return the error (so GCS
rollback can be coordinated), and remove early returns that bypass transaction
rollback. Ensure transaction covers the whole for-loop and that Update uses the
transaction context/handle.

In `@horizon/internal/predator/handler/predator_functional_testing.go`:
- Around line 25-44: The loop assumes each row v[batchIdx] has length
featureCount and can panic; add a bounds check before accessing
v[batchIdx][featureIdx] in the nested loop (use lenRow := len(v[batchIdx]) and
verify featureIdx < lenRow) and return a clear error if a row is shorter
(include batchIdx, lenRow and expected featureCount). Apply this check around
the access used by the FP16/FP32 branches (where
serializer.Float32ToFloat16Bytes, math.Float32bits and
binary.LittleEndian.PutUint32 are invoked) so you fail fast instead of risking
an index out of range panic.

In `@horizon/internal/repositories/sql/inferflow/request/repository.go`:
- Around line 128-130: The duplicate-check in DoesConfigIdExistWithRequestType
currently restricts matches to STATUS = 'PENDING APPROVAL', which can allow
duplicates because APPROVED (and other non-REJECTED) requests are ignored;
update the filter in InferflowRequest.DoesConfigIdExistWithRequestType to match
the original semantics used by Promote/ScaleUp handlers (e.g., use "STATUS !=
'REJECTED'" or an equivalent condition that excludes only REJECTED records) so
any active/non-rejected request for the same configID and requestType blocks a
new request; locate and update the SQL predicate in
DoesConfigIdExistWithRequestType accordingly and ensure Promote and ScaleUp
continue to rely on this method for duplicate detection.

In `@horizon/pkg/configschemaclient/client.go`:
- Around line 81-98: processNumerixInput currently splits numerixInput on "@"
and directly uses inputParts[1], which can panic if the split yields fewer than
2 parts; update processNumerixInput to validate inputParts length before
accessing index 1 (for example skip entries where strings.Split(numerixInput,
"@") returns < 2 or set a safe default for FeatureType), referencing the
ScoreMapping loop and the use of inputParts and SchemaComponents (FeatureName,
FeatureType, FeatureSize) to ensure you handle malformed numerixInput values
safely.

In `@horizon/pkg/configschemaclient/types.go`:
- Around line 112-129: The JSON tag for FSRequest.FeatureGroups uses camelCase
("featureGroups") instead of the repo's snake_case convention; update the struct
FSRequest in horizon/pkg/configschemaclient/types.go to use `feature_groups` for
the FeatureGroups field's json tag (i.e., FSRequest.FeatureGroups ->
json:"feature_groups"`) and ensure the same correction is applied to the
duplicate FSRequest definition in
horizon/internal/repositories/sql/inferflow/models.go so unmarshalling matches
expected sample JSON.
🟡 Minor comments (17)
horizon/internal/externalcall/prometheus_client.go-173-176 (1)

173-176: ⚠️ Potential issue | 🟡 Minor

Normalize serviceName to lowercase before building the query.
This keeps service labels consistent even if callers pass mixed-case values and aligns with existing conventions.

🔧 Proposed fix
 import (
 	"encoding/json"
 	"fmt"
 	"io"
 	"net/http"
 	"net/url"
 	"strconv"
+	"strings"
 	"sync"
 	"time"
 )
-	query := fmt.Sprintf(
+	service := strings.ToLower(serviceName)
+	query := fmt.Sprintf(
 		"sum by (model)(increase(nv_inference_count{service=\"%s\"}[1m]))",
-		serviceName,
+		service,
 	)

Based on learnings: Within the BharatMLStack horizon codebase, ensure that service names (inferflow, predator, numerix) are stored in lowercase in database tables and are consistently used in lowercase throughout the code.

horizon/internal/inferflow/handler/inferflow_helpers.go-10-10 (1)

10-10: ⚠️ Potential issue | 🟡 Minor

Fix duplicated word in the error message.

Minor typo: “service service”.

Proposed change
-		return "", fmt.Errorf("failed to fetch service service deployable config for name generation: %w", err)
+		return "", fmt.Errorf("failed to fetch service deployable config for name generation: %w", err)
horizon/internal/predator/README.md-7-15 (1)

7-15: ⚠️ Potential issue | 🟡 Minor

Specify a language for the fenced code block.

markdownlint flags this; adding a language keeps docs clean.

Proposed change
-```
+```text
 internal/predator/
 ├── controller/ (wires routes to handler)
 ├── handler/ (Config implementation + helpers)
 ├── route/ 
 ├── proto/ and generated Go
 ├── init.go
 └── README.md 
-```
+```
horizon/internal/inferflow/README.md-7-16 (1)

7-16: ⚠️ Potential issue | 🟡 Minor

Specify a language for the fenced code block.

markdownlint flags this; adding a language keeps docs clean.

Proposed change
-```
+```text
 internal/inferflow/
 ├── controller/   (wires routes to handler)
 ├── handler/      (Config implementation + helpers)
 ├── etcd/         (ETCD config read/write for inferflow)
 ├── route/
 ├── handler/proto/ and generated Go
 ├── init.go
 └── README.md
-```
+```
horizon/internal/repositories/sql/servicedeployableconfig/table.go-31-32 (1)

31-32: ⚠️ Potential issue | 🟡 Minor

Clarify nullability intent for DeployableTag.

OverrideTesting is correctly configured with default:false as Go's bool type is never null. However, DeployableTag lacks an explicit nullability constraint. If the field is optional, consider adding default:null or use *string to avoid empty-string ambiguity. If required, add not null for consistency with similar constraints elsewhere in the struct (e.g., Host).

horizon/internal/repositories/sql/inferflow/config/table.go-29-29 (1)

29-29: ⚠️ Potential issue | 🟡 Minor

Use *string to preserve NULL semantics for nullable column.

The database column source_config_id is defined as varchar(255) NULL, but the GORM struct maps it as string instead of *string. This mismatch means NULL values are converted to empty strings on read, and there's no way to distinguish between "never set" (NULL) and "explicitly empty" on write. While the current code shows no explicit NULL checks, this semantic loss could become a latent bug if future queries or validation logic needs to differentiate NULL from empty string.

Consider changing to:

SourceConfigID  *string  `gorm:"column:source_config_id"`
horizon/internal/inferflow/controller/controller.go-370-380 (1)

370-380: ⚠️ Potential issue | 🟡 Minor

Inconsistent error response format.

The error response on line 376 returns a plain string "Error getting feature schema" instead of using the structured handler.Response type with err.Error(). This is inconsistent with other handlers in this file (e.g., Onboard, Promote, Edit) and loses the actual error message.

Also, strings.TrimSpace is applied to version but not model_config_id, which may be inconsistent.

🔧 Proposed fix for consistent error handling
 func (c *V1) GetFeatureSchema(ctx *gin.Context) {
 	response, err := c.Config.GetFeatureSchema(handler.FeatureSchemaRequest{
-		ModelConfigId: ctx.Query("model_config_id"),
+		ModelConfigId: strings.TrimSpace(ctx.Query("model_config_id")),
 		Version:       strings.TrimSpace(ctx.Query("version")),
 	})
 	if err != nil {
-		ctx.JSON(api.NewBadRequestError(err.Error()).StatusCode, "Error getting feature schema")
+		ctx.JSON(api.NewBadRequestError(err.Error()).StatusCode, handler.Response{
+			Error: err.Error(),
+			Data:  handler.Message{Message: ""},
+		})
 		return
 	}
 	ctx.JSON(200, response)
 }
horizon/internal/inferflow/handler/inferflow_functional_testing.go-71-75 (1)

71-75: ⚠️ Potential issue | 🟡 Minor

Endpoint port stripping may fail for IPv6 addresses.

The logic strings.LastIndex(ep, ":") to find and strip ports will incorrectly handle IPv6 addresses like [::1]:8080 since colons appear within the address itself. If IPv6 endpoints are possible, consider using net.SplitHostPort for robust parsing.

horizon/internal/inferflow/handler/inferflow_functional_testing.go-60-60 (1)

60-60: ⚠️ Potential issue | 🟡 Minor

Typo in function name: ExecuteFuncitonalTestRequest

The function name contains a typo ("Funciton" instead of "Function"). This should be corrected to ExecuteFunctionalTestRequest for consistency and discoverability.

✏️ Proposed fix
-func (m *InferFlow) ExecuteFuncitonalTestRequest(request ExecuteRequestFunctionalTestingRequest) (ExecuteRequestFunctionalTestingResponse, error) {
+func (m *InferFlow) ExecuteFunctionalTestRequest(request ExecuteRequestFunctionalTestingRequest) (ExecuteRequestFunctionalTestingResponse, error) {
horizon/internal/inferflow/handler/inferflow_fetch.go-62-86 (1)

62-86: ⚠️ Potential issue | 🟡 Minor

Errors from GetConfig are silently ignored.

If m.infrastructureHandler.GetConfig() fails or returns an invalid config, no error is propagated. The caller receives a map that may be missing entries without knowing why. Consider collecting errors or at minimum logging failures.

🛠️ Suggested improvement
 func (m *InferFlow) batchFetchRingMasterConfigs(serviceDeployables map[int]*service_deployable_config.ServiceDeployableConfig) (map[int]infrastructurehandler.Config, error) {
 	ringMasterConfigs := make(map[int]infrastructurehandler.Config)
 	var mu sync.Mutex
 	var wg sync.WaitGroup
+	var errMu sync.Mutex
+	var errs []error

 	semaphore := make(chan struct{}, 10)

 	for id, deployable := range serviceDeployables {
 		wg.Add(1)
 		go func(deployableID int, sd *service_deployable_config.ServiceDeployableConfig) {
 			defer wg.Done()
 			semaphore <- struct{}{}
 			defer func() { <-semaphore }()

 			config := m.infrastructureHandler.GetConfig(sd.Name, inferflowPkg.AppEnv)
+			if config == (infrastructurehandler.Config{}) {
+				errMu.Lock()
+				errs = append(errs, fmt.Errorf("failed to get config for deployable %d (%s)", deployableID, sd.Name))
+				errMu.Unlock()
+				return
+			}

 			mu.Lock()
 			ringMasterConfigs[deployableID] = config
 			mu.Unlock()
 		}(id, deployable)
 	}

 	wg.Wait()
+	if len(errs) > 0 {
+		// Log or return aggregated errors based on requirements
+	}
 	return ringMasterConfigs, nil
 }
horizon/internal/predator/handler/predator_upload.go-187-200 (1)

187-200: ⚠️ Potential issue | 🟡 Minor

Misleading function name: validateVersionHasFiles only checks folder existence.

The function name suggests it validates that the version folder contains files, but it only calls CheckFolderExists. The error message at line 195 mentions "version folder 1/ is empty" but the actual check doesn't verify emptiness—it only checks if the folder path prefix exists.

🔧 Proposed fix to align name with behavior
-// validateVersionHasFiles checks if version folder has at least one non-empty file
-func (p *Predator) validateVersionHasFiles(srcBucket, versionPath string) error {
+// validateVersionFolderExists checks if the version folder exists in GCS
+func (p *Predator) validateVersionFolderExists(srcBucket, versionPath string) error {
 	exists, err := p.GcsClient.CheckFolderExists(srcBucket, versionPath)
 	if err != nil {
 		return fmt.Errorf("failed to check version folder contents: %w", err)
 	}

 	if !exists {
-		return fmt.Errorf("version folder 1/ is empty - must contain model files")
+		return fmt.Errorf("version folder 1/ not found - must exist with model files")
 	}

-	log.Info().Msg("Version folder 1/ contains files")
+	log.Info().Msg("Version folder 1/ exists")
 	return nil
 }
horizon/internal/predator/handler/predator_fetch.go-104-124 (1)

104-124: ⚠️ Potential issue | 🟡 Minor

Missing key existence check for deployableConfigs map lookup.

Line 124 accesses deployableConfigs[serviceDiscovery.ServiceDeployableID] without checking if the key exists. If the infrastructure handler call failed in batchFetchDeployableConfigs, this entry won't exist, and you'll get a zero-value Config silently.

🛡️ Proposed fix
+		// Get infrastructure config (HPA/replica info)
+		infraConfig, infraExists := deployableConfigs[serviceDiscovery.ServiceDeployableID]
+		if !infraExists {
+			continue // Skip if infrastructure config not found
+		}
-		// Get infrastructure config (HPA/replica info)
-		infraConfig := deployableConfigs[serviceDiscovery.ServiceDeployableID]
horizon/internal/predator/handler/predator_upload.go-377-395 (1)

377-395: ⚠️ Potential issue | 🟡 Minor

Use path.Dir() and path.Base() for more robust path manipulation.

The validation on line 382-384 ensures srcPath is non-empty, but the path extraction logic at lines 386-388 using strings.Split() and strings.TrimSuffix() is less robust than using standard library functions. The codebase already uses path.Dir() and path.Base() elsewhere (e.g., extractGCSDetails in the same file); apply the same pattern here:

srcBasePath := path.Dir(srcPath)
srcModelName := path.Base(srcPath)

This is more idiomatic and handles edge cases consistently.

horizon/internal/repositories/sql/inferflow/config/repository.go-163-176 (1)

163-176: ⚠️ Potential issue | 🟡 Minor

Remove the manual updated_at assignment from the map; the BeforeUpdate hook handles it.

GORM's Updates(map) method triggers the BeforeUpdate hook, and the hook's SetColumn() call overrides any value in the map. The line "updated_at": time.Now() in the map is therefore redundant and has no effect—the hook will set it to time.Now() regardless.

horizon/internal/predator/handler/predator_validation.go-278-307 (1)

278-307: ⚠️ Potential issue | 🟡 Minor

Comment/logic mismatch in health check.
The comment says “at least one pod healthy” but the code requires all deployments healthy. Update the comment or loosen the condition.

📝 Comment-only fix
-	// Consider deployment healthy if at least one pod is healthy and running
+	// Deployment considered healthy only when all deployments report Healthy
 	return false, nil
horizon/internal/jobs/bulkdeletestrategy/predator_service.go-181-272 (1)

181-272: ⚠️ Potential issue | 🟡 Minor

Guard empty discoveryConfigIds to avoid unnecessary repo calls.
If there are no active discovery configs, skipping the DB query keeps behavior predictable.

🔧 Suggested guard
-	predatorConfigList, err := predatorBulkDeleteRepos.predatorConfigRepo.FindByDiscoveryConfigIdsAndAge(discoveryConfigIds, zeroTrafficDays)
+	if len(discoveryConfigIds) == 0 {
+		log.Info().Msg("No active discovery configs found for deployable")
+		return nil, nil, parentToChildMapping, trafficData, nil
+	}
+	predatorConfigList, err := predatorBulkDeleteRepos.predatorConfigRepo.FindByDiscoveryConfigIdsAndAge(discoveryConfigIds, zeroTrafficDays)
horizon/internal/predator/handler/predator_validation.go-18-70 (1)

18-70: ⚠️ Potential issue | 🟡 Minor

Duplicate model detection won’t work with the current map-based check.
deployableModelMap deduplicates by key, so the later count always ends at 1 and duplicates slip through. Track duplicates before inserting into the map.

🛠️ Suggested fix
-	// Build maps from requested models
+	// Build maps from requested models + detect duplicates per deployable
+	modelNameCount := make(map[int]map[string]int)
 	for _, predatorConfig := range predatorConfigList {
 		// Get service deployable ID for this model
 		discoveryConfig, err := p.ServiceDiscoveryRepo.GetById(predatorConfig.DiscoveryConfigID)
 		if err != nil {
 			log.Error().Err(err).Msgf("failed to fetch discovery config for model %s", predatorConfig.ModelName)
 			return false, fmt.Errorf(errFailedToFetchDiscoveryConfigForModel, predatorConfig.ModelName, err)
 		}
+		if modelNameCount[discoveryConfig.ServiceDeployableID] == nil {
+			modelNameCount[discoveryConfig.ServiceDeployableID] = map[string]int{}
+		}
+		modelNameCount[discoveryConfig.ServiceDeployableID][predatorConfig.ModelName]++
+		if modelNameCount[discoveryConfig.ServiceDeployableID][predatorConfig.ModelName] > 1 {
+			return false, fmt.Errorf(errDuplicateModelNameInDeployable, predatorConfig.ModelName, discoveryConfig.ServiceDeployableID)
+		}
 
 		requestedModelMap[predatorConfig.ModelName] = predatorConfig
 		requestedDeployableMap[discoveryConfig.ServiceDeployableID] = true
🧹 Nitpick comments (22)
horizon/internal/externalcall/prometheus_client.go (2)

14-18: Add a short doc comment clarifying daysAgo semantics.

✍️ Proposed doc comment
 type PrometheusClient interface {
 	GetModelNames(serviceName string) ([]string, error)
 	GetInferflowConfigNames(serviceName string) ([]string, error)
 	GetNumerixConfigNames() ([]string, error)
+	// GetPredatorModelTraffic returns model traffic for the past N days (daysAgo must be > 0).
 	GetPredatorModelTraffic(serviceName string, daysAgo int) (map[string]PredatorModelTraffic, error)
 }

96-108: Keep Prometheus decode helpers unexported unless used outside the package.
If PredatorModelResponse is only used for JSON decoding here, unexport it to avoid widening the public surface.

♻️ Proposed change
-type PredatorModelResponse struct {
+type predatorModelResponse struct {
 	Status    string `json:"status"`
 	IsPartial bool   `json:"isPartial"`
 	Data      struct {
 		ResultType string `json:"resultType"`
 		Result     []struct {
 			Metric struct {
 				Model string `json:"model"`
 			} `json:"metric"`
 			Values [][]interface{} `json:"values"` // [[timestamp, "value"], ...]
 		} `json:"result"`
 	} `json:"data"`
 }
-	var pr PredatorModelResponse
+	var pr predatorModelResponse
inferflow/handlers/inferflow/inferflow.go (1)

69-69: Extract the init log message into a constant.

This literal was added inside a function; please hoist it to a const near the existing constant block.

Proposed change
 const (
 	UserId         = "USER-ID"
 	appVersionCode = "APP-VERSION-CODE"
 	appVersion     = "app_version_code"
 	defaultErrMsg  = "something went wrong!"
 	userId         = "user_id"
 	modelConfigId  = "model-config-id"
+	inferflowInitLogMsg = "Inferflow handler initialized"
 )
@@
-	logger.Info("Inferflow handler initialized")
+	logger.Info(inferflowInitLogMsg)

As per coding guidelines: "we should not use the literal values directly inside functions, they should be stored into separate variables and the variables should be used".

horizon/internal/externalcall/gcs_client_test.go (1)

65-119: Consolidate the repeated nil-client setup and error assertions into helpers.

This block repeats the same setup and error checks; a helper keeps tests shorter and aligned with the duplication guideline.

Proposed change (example)
+func newNilClient() *GCSClient {
+	return &GCSClient{client: nil, ctx: context.Background()}
+}
+
+func requireNotInitialized(t *testing.T, err error) {
+	require.Error(t, err)
+	assert.Contains(t, err.Error(), "not initialized")
+}
@@
-	g := &GCSClient{client: nil, ctx: context.Background()}
+	g := newNilClient()
 	folders, err := g.ListFolders("bucket", "prefix/")
-	require.Error(t, err)
+	requireNotInitialized(t, err)
 	assert.Nil(t, folders)
-	assert.Contains(t, err.Error(), "not initialized")

As per coding guidelines: "if duplicate code exists it should be moved to a common method".

horizon/internal/inferflow/handler/inferflow_helpers.go (1)

21-24: Hoist TTL values into a package-level variable/const.

Avoid literals inside the method by defining a shared slice.

Proposed change
+var loggingTTLValues = []int{30, 60, 90}
+
 func (m *InferFlow) GetLoggingTTL() (GetLoggingTTLResponse, error) {
 	return GetLoggingTTLResponse{
-		Data: []int{30, 60, 90},
+		Data: loggingTTLValues,
 	}, nil
 }

As per coding guidelines: "we should not use the literal values directly inside functions, they should be stored into separate variables and the variables should be used".

horizon/internal/inferflow/etcd/models.go (1)

81-87: Consider consolidating duplicate SeenScoreComponent definitions.

The SeenScoreComponent struct is defined identically in three locations:

  • horizon/internal/inferflow/etcd/models.go (this file)
  • horizon/internal/inferflow/handler/models.go
  • horizon/internal/repositories/sql/inferflow/models.go

This duplication could lead to drift if one definition is updated but others are not. Consider defining this type in a shared package and importing it where needed.

horizon/internal/repositories/sql/predatorconfig/sql.go (1)

155-168: Clarify the distinction between similar query methods.

This method (FindByDiscoveryConfigIdsAndAge) queries by created_at < cutoffDate AND active = true, while FindByDiscoveryIDsAndCreatedBefore (lines 126-135) queries by updated_at < cutoffDate without filtering on active.

The naming is similar but behavior differs significantly:

  • One filters by created_at with active = true
  • The other filters by updated_at without an active filter

Consider adding documentation to clarify the distinct use cases for each method, or unifying them if they serve the same purpose.

horizon/internal/repositories/sql/inferflow/models.go (1)

253-257: Consider documenting expected types for FeatureSize.

Using any for FeatureSize provides flexibility but loses type safety. If the expected types are known (e.g., int, []int, string), consider documenting them or using a more specific type/interface.

horizon/internal/predator/handler/predator_functional_testing.go (2)

205-217: Extract magic numbers as named constants.

The values 10 and 128 used to replace dynamic dimensions (-1) are magic numbers. Consider extracting them as named constants for clarity and maintainability.

♻️ Proposed refactor
+const (
+	defaultBatchDimension   = 10
+	defaultFeatureDimension = 128
+)
+
 	for i, dim := range result {
 		if dim == -1 {
 			if i == 0 {
-				result[i] = 10
+				result[i] = defaultBatchDimension
 			} else {
-				result[i] = 128
+				result[i] = defaultFeatureDimension
 			}
 			log.Debug().Msgf("Replaced dynamic dimension -1 at position %d with %d", i, result[i])

156-163: Data truncation is silent when batch size exceeds available data.

If end > len(dataSlice), the batch is silently skipped. This could mask issues where the expected batch count doesn't match the actual data. Consider logging a warning or returning an error.

♻️ Proposed fix to add warning
 	for i := int64(0); i < batchSize; i++ {
 		start := i * elementsPerBatch
 		end := start + elementsPerBatch
 		if end <= int64(len(dataSlice)) {
 			batch := dataSlice[start:end]
 			result = append(result, batch)
+		} else {
+			log.Warn().Msgf("Batch %d truncated: expected %d elements but only %d available", i, end, len(dataSlice))
 		}
 	}
horizon/internal/repositories/sql/discoveryconfig/sql.go (1)

124-126: ReactivateByIDTx doesn't update audit fields.

Unlike DeactivateByID which sets updated_by and updated_at, ReactivateByIDTx only sets active = true. For audit consistency, consider updating the timestamp and updater.

♻️ Proposed fix to add audit fields
-func (r *discoveryConfigRepo) ReactivateByIDTx(tx *gorm.DB, id int) error {
-	return tx.Model(&DiscoveryConfig{}).Where("id = ?", id).Update("active", true).Error
+func (r *discoveryConfigRepo) ReactivateByIDTx(tx *gorm.DB, id int, updatedBy string) error {
+	return tx.Model(&DiscoveryConfig{}).Where("id = ?", id).Updates(map[string]interface{}{
+		"active":     true,
+		"updated_by": updatedBy,
+		"updated_at": time.Now(),
+	}).Error
 }

Note: This would require updating the interface signature and all callers.

horizon/internal/repositories/sql/inferflow/request/repository.go (2)

96-106: Redundant error return and inconsistent error handling.

Line 105 returns result.Error which was already checked and is nil at that point. Additionally, returning an error when no records are found differs from other methods like GetApprovedRequestsByConfigID which return empty slices.

♻️ Proposed fix
 func (g *InferflowRequest) GetByConfigIDandVersion(configID string, version int) ([]Table, error) {
 	var tables []Table
 	result := g.db.Where("config_id = ? AND version = ? AND status = 'APPROVED'", configID, version).Find(&tables)
 	if result.Error != nil {
 		return nil, result.Error
 	}
 	if len(tables) == 0 {
 		return nil, errors.New("no request found with the given config_id and version")
 	}
-	return tables, result.Error
+	return tables, nil
 }

162-172: GetRequestByID returns empty struct on not found instead of an error.

Returning an empty Table{} with nil error when a record is not found differs from typical repository patterns (which return an error). Callers must check for an empty struct rather than an error, which could lead to subtle bugs.

Consider either:

  1. Returning a pointer (*Table) with nil when not found
  2. Returning gorm.ErrRecordNotFound as the error
horizon/internal/configs/app_config.go (1)

74-74: Inconsistent field alignment.

Line 74 has extra whitespace before the int type compared to adjacent fields (lines 73, 75, 76). This appears to be a formatting inconsistency.

🧹 Proposed formatting fix
-	BulkDeleteInferflowMaxInactiveDays        int  `mapstructure:"bulk_delete_inferflow_max_inactive_days"`
+	BulkDeleteInferflowMaxInactiveDays         int  `mapstructure:"bulk_delete_inferflow_max_inactive_days"`
horizon/internal/inferflow/handler/inferflow_constants.go (1)

16-17: Consider removing redundant boolean aliases.

activeTrue = true and activeFalse = false provide no abstraction benefit over using true/false directly. They add indirection without semantic value.

horizon/internal/inferflow/handler/inferflow_functional_testing.go (1)

160-161: Inconsistent logging: fmt.Println instead of zerolog.

The file imports and uses zerolog (e.g., line 131, 162), but lines 160 and 177 use fmt.Println for error logging. Use the structured logger consistently.

♻️ Proposed fix
 	if err != nil {
-		fmt.Println("Error getting inferflow config: ", err)
+		log.Error().Err(err).Msg("Error getting inferflow config")
 	} else if inferFlowConfig == nil {
 		err = m.InferFlowConfigRepo.Update(inferFlowConfig)
 		if err != nil {
-			fmt.Println("Error updating inferflow config: ", err)
+			log.Error().Err(err).Msg("Error updating inferflow config")
 		}

Also applies to: 177-178

horizon/internal/inferflow/handler/inferflow_test.go (1)

115-153: Consider consolidating mock repositories.

Both inferflow_test.go and predator_test.go define nearly identical mock implementations of ServiceDeployableRepository. Consider extracting a shared mock into a test helper package to reduce duplication.

horizon/internal/repositories/sql/inferflow/config/repository.go (1)

137-146: Consider extracting shared delete logic.

DeleteByConfigIDTx duplicates the logic from DeleteTx (lines 72-81). Consider extracting the common delete-by-config-id logic into a shared helper to reduce duplication.

♻️ Proposed refactor
+func deleteByConfigID(tx *gorm.DB, configID string) error {
+	result := tx.Where("config_id = ?", configID).Delete(&Table{})
+	if result.Error != nil {
+		return result.Error
+	}
+	if result.RowsAffected == 0 {
+		return errors.New("no config found with the given config_id")
+	}
+	return nil
+}

 func (g *InferflowConfig) DeleteTx(tx *gorm.DB, configID string) error {
-	result := tx.Where("config_id = ?", configID).Delete(&Table{})
-	if result.Error != nil {
-		return result.Error
-	}
-	if result.RowsAffected == 0 {
-		return errors.New("no config found with the given config_id")
-	}
-	return nil
+	return deleteByConfigID(tx, configID)
 }
 ...
 func (g *InferflowConfig) DeleteByConfigIDTx(tx *gorm.DB, configID string) error {
-	result := tx.Where("config_id = ?", configID).Delete(&Table{})
-	if result.Error != nil {
-		return result.Error
-	}
-	if result.RowsAffected == 0 {
-		return errors.New("no config found with the given config_id")
-	}
-	return nil
+	return deleteByConfigID(tx, configID)
 }
horizon/internal/predator/handler/predator_upload.go (1)

497-532: Duplicate implementation of replaceModelNameInConfigPreservingFormat.

This function duplicates the logic in horizon/internal/externalcall/gcs_client.go (replaceModelNameInConfig, lines 465-510). Consider extracting to a shared utility to avoid maintaining two implementations.

horizon/internal/predator/handler/predator_helpers.go (1)

14-34: Redundant length check in parseGCSURL.

Line 25 checks len(parts) < 1, but strings.SplitN(trimmed, "/", 2) will always return at least one element (the original string if no separator is found). This check is redundant.

♻️ Simplified version
 	trimmed := strings.TrimPrefix(gcsURL, gcsPrefix)
 	parts := strings.SplitN(trimmed, slashConstant, 2)
-	if len(parts) < 1 {
-		return constant.EmptyString, constant.EmptyString, false
-	}

 	bucket = parts[0]
+	if bucket == "" {
+		return constant.EmptyString, constant.EmptyString, false
+	}
 	if len(parts) == 2 {
 		objectPath = parts[1]
 	}
 	return bucket, objectPath, true
horizon/internal/externalcall/gcs_client.go (1)

465-510: Consider pre-compiling regex patterns.

The replaceModelNameInConfig function compiles regex patterns (namePattern and valuePattern) on every invocation. For frequently called functions, consider compiling these at package level.

♻️ Pre-compile regex patterns
+var (
+	namePattern  = regexp.MustCompile(`name\s*:\s*"([^"]+)"`)
+	valuePattern = regexp.MustCompile(`"([^"]+)"`)
+)

 func replaceModelNameInConfig(data []byte, destModelName string) []byte {
 	content := string(data)
 	lines := strings.Split(content, "\n")

 	for i, line := range lines {
 		trimmed := strings.TrimSpace(line)
 		if strings.HasPrefix(trimmed, "name:") {
 			leadingWhitespace := len(line) - len(strings.TrimLeft(line, " \t"))
 			if leadingWhitespace >= 2 {
 				continue
 			}

-			namePattern := regexp.MustCompile(`name\s*:\s*"([^"]+)"`)
 			matches := namePattern.FindStringSubmatch(line)
 			// ... rest of function
-				valuePattern := regexp.MustCompile(`"([^"]+)"`)
 				valueReplaced := valuePattern.ReplaceAllString(matched, fmt.Sprintf(`"%s"`, destModelName))
horizon/internal/predator/handler/predator_fetch.go (1)

27-56: Use batch fetch methods to avoid N+1 query pattern.

The current implementation fetches each discovery config and service deployable individually in a loop (lines 30 and 51). The repositories provide batch methods that should be used instead:

  • GetByDiscoveryIDs() for discovery configs
  • GetByIds() for service deployables

This will reduce database round-trips from N+1 to a single query per resource type.

@turbo-turtle-github
Copy link

⚠️ CI Workflow did not complete successfully ⚠️

Links:
No additional links available.

Reason:

running ci workflow has been cancelled due to the following reasons:
 1. New commit pushed to the PR 
 2. PR is closed or drafted 
A new workflow will be created for case 1 automatically.

📘 Please check the Turbo Turtle FAQs for more details.

👉 If the reason is not self-explanatory or not covered in the FAQ, please reach out to @devops-oncall in the #devops-tech channel.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
quick-start/db-init/scripts/init-mysql.sh (1)

687-705: ⚠️ Potential issue | 🟡 Minor

Service value should be lowercase.

This seed insert uses 'INFERFLOW', but service names must be stored lowercase. Please change to 'inferflow', and consider a simple assertion in this script to fail if any service value is not lowercase.

Based on learnings: “ensure that service names (inferflow, predator, numerix) are stored in lowercase in database tables and are consistently used in lowercase throughout the code.”

🤖 Fix all issues with AI agents
In `@horizon/internal/predator/handler/predator_approval.go`:
- Around line 196-201: The config-source update (call to
updateInstanceCountInConfigSource) is applied before downstream steps (GCS
transfer, DB update, restart) and is not rolled back on failure; either move the
update to after those steps succeed or implement a rollback: read and cache the
original config before calling updateInstanceCountInConfigSource (use the same
configBucket/configPath/modelName context and the local variable
transferredGcsModelData for flow), then if any subsequent operation (the GCS
transfer, DB update, or restart code paths that follow this block) returns an
error, call a restore function to write the cached original config back;
alternatively, defer performing updateInstanceCountInConfigSource until all
downstream operations succeed and then call it as the final step.
- Around line 223-252: The function updateInstanceCountInConfigSource should
validate the instanceCount parameter before touching GCS: add a defensive check
at the top of updateInstanceCountInConfigSource (before reading config) to
return a clear error if instanceCount is <= 0 (or otherwise invalid), so you
never write an invalid Triton config; reference the function name
updateInstanceCountInConfigSource and the downstream helper
replaceInstanceCountInConfigPreservingFormat when adding the guard and error
message.

Comment on lines +196 to +201
if configBucket != "" && configPath != "" && payload.MetaData.InstanceCount > 0 {
if err := p.updateInstanceCountInConfigSource(configBucket, configPath, modelName, payload.MetaData.InstanceCount); err != nil {
log.Error().Err(err).Msgf("Failed to update instance count in config-source for model %s", modelName)
return transferredGcsModelData, err
}
}
Copy link

@coderabbitai coderabbitai bot Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Config-source update lacks rollback on later failures.

If the instance_count update succeeds but the subsequent GCS transfer / DB update / restart fails, the config-source stays mutated while the request is marked failed. Consider deferring this update until success or capturing the original config and restoring it on any downstream failure.

🤖 Prompt for AI Agents
In `@horizon/internal/predator/handler/predator_approval.go` around lines 196 -
201, The config-source update (call to updateInstanceCountInConfigSource) is
applied before downstream steps (GCS transfer, DB update, restart) and is not
rolled back on failure; either move the update to after those steps succeed or
implement a rollback: read and cache the original config before calling
updateInstanceCountInConfigSource (use the same
configBucket/configPath/modelName context and the local variable
transferredGcsModelData for flow), then if any subsequent operation (the GCS
transfer, DB update, or restart code paths that follow this block) returns an
error, call a restore function to write the cached original config back;
alternatively, defer performing updateInstanceCountInConfigSource until all
downstream operations succeed and then call it as the final step.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutation is not harmful and rollback will cause more read writes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paras-agarwal-meesho, understood. I acknowledge your decision to proceed without rollback for the config-source update to minimize read/write operations.


✏️ Learnings added
Learnt from: paras-agarwal-meesho
Repo: Meesho/BharatMLStack PR: 289
File: horizon/internal/predator/handler/predator_approval.go:196-201
Timestamp: 2026-02-17T07:03:43.427Z
Learning: In the Predator handler (horizon/internal/predator/handler), config-source mutations (e.g., instance count updates) are intentionally not rolled back on downstream failures to avoid additional read/write operations, as these mutations are considered non-harmful even if the request ultimately fails.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines 223 to 252
func (p *Predator) updateInstanceCountInConfigSource(bucket, basePath, modelName string, instanceCount int) error {
if modelName == "" {
return fmt.Errorf("model name is empty, required to update instance count in config-source")
}

configPath := path.Join(basePath, modelName, configFile)
configData, err := p.GcsClient.ReadFile(bucket, configPath)
if err != nil {
return fmt.Errorf("failed to read config.pbtxt from config-source for model %s: %w", modelName, err)
}

var modelConfig ModelConfig
if err := prototext.Unmarshal(configData, &modelConfig); err != nil {
return fmt.Errorf("failed to parse config.pbtxt from config-source for model %s: %w", modelName, err)
}
if len(modelConfig.InstanceGroup) == 0 {
return fmt.Errorf("%s (model %s)", errNoInstanceGroup, modelName)
}

newConfigData, err := replaceInstanceCountInConfigPreservingFormat(configData, instanceCount)
if err != nil {
return fmt.Errorf("failed to update instance count in config for model %s: %w", modelName, err)
}
if err := p.GcsClient.UploadFile(bucket, configPath, newConfigData); err != nil {
return fmt.Errorf("failed to upload config.pbtxt to config-source for model %s: %w", modelName, err)
}

log.Info().Msgf("Updated instance_count to %d in config-source for model %s", instanceCount, modelName)
return nil
}
Copy link

@coderabbitai coderabbitai bot Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate instanceCount before writing.

The helper is public and should defensively guard against zero/negative values to avoid invalid Triton configs even if callers misbehave.

✅ Suggested guard
 func (p *Predator) updateInstanceCountInConfigSource(bucket, basePath, modelName string, instanceCount int) error {
 	if modelName == "" {
 		return fmt.Errorf("model name is empty, required to update instance count in config-source")
 	}
+	if instanceCount <= 0 {
+		return fmt.Errorf("instance count must be > 0 for model %s", modelName)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (p *Predator) updateInstanceCountInConfigSource(bucket, basePath, modelName string, instanceCount int) error {
if modelName == "" {
return fmt.Errorf("model name is empty, required to update instance count in config-source")
}
configPath := path.Join(basePath, modelName, configFile)
configData, err := p.GcsClient.ReadFile(bucket, configPath)
if err != nil {
return fmt.Errorf("failed to read config.pbtxt from config-source for model %s: %w", modelName, err)
}
var modelConfig ModelConfig
if err := prototext.Unmarshal(configData, &modelConfig); err != nil {
return fmt.Errorf("failed to parse config.pbtxt from config-source for model %s: %w", modelName, err)
}
if len(modelConfig.InstanceGroup) == 0 {
return fmt.Errorf("%s (model %s)", errNoInstanceGroup, modelName)
}
newConfigData, err := replaceInstanceCountInConfigPreservingFormat(configData, instanceCount)
if err != nil {
return fmt.Errorf("failed to update instance count in config for model %s: %w", modelName, err)
}
if err := p.GcsClient.UploadFile(bucket, configPath, newConfigData); err != nil {
return fmt.Errorf("failed to upload config.pbtxt to config-source for model %s: %w", modelName, err)
}
log.Info().Msgf("Updated instance_count to %d in config-source for model %s", instanceCount, modelName)
return nil
}
func (p *Predator) updateInstanceCountInConfigSource(bucket, basePath, modelName string, instanceCount int) error {
if modelName == "" {
return fmt.Errorf("model name is empty, required to update instance count in config-source")
}
if instanceCount <= 0 {
return fmt.Errorf("instance count must be > 0 for model %s", modelName)
}
configPath := path.Join(basePath, modelName, configFile)
configData, err := p.GcsClient.ReadFile(bucket, configPath)
if err != nil {
return fmt.Errorf("failed to read config.pbtxt from config-source for model %s: %w", modelName, err)
}
var modelConfig ModelConfig
if err := prototext.Unmarshal(configData, &modelConfig); err != nil {
return fmt.Errorf("failed to parse config.pbtxt from config-source for model %s: %w", modelName, err)
}
if len(modelConfig.InstanceGroup) == 0 {
return fmt.Errorf("%s (model %s)", errNoInstanceGroup, modelName)
}
newConfigData, err := replaceInstanceCountInConfigPreservingFormat(configData, instanceCount)
if err != nil {
return fmt.Errorf("failed to update instance count in config for model %s: %w", modelName, err)
}
if err := p.GcsClient.UploadFile(bucket, configPath, newConfigData); err != nil {
return fmt.Errorf("failed to upload config.pbtxt to config-source for model %s: %w", modelName, err)
}
log.Info().Msgf("Updated instance_count to %d in config-source for model %s", instanceCount, modelName)
return nil
}
🤖 Prompt for AI Agents
In `@horizon/internal/predator/handler/predator_approval.go` around lines 223 -
252, The function updateInstanceCountInConfigSource should validate the
instanceCount parameter before touching GCS: add a defensive check at the top of
updateInstanceCountInConfigSource (before reading config) to return a clear
error if instanceCount is <= 0 (or otherwise invalid), so you never write an
invalid Triton config; reference the function name
updateInstanceCountInConfigSource and the downstream helper
replaceInstanceCountInConfigPreservingFormat when adding the guard and error
message.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check is done at caller site and function wont be called anywhere else.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No description provided.

@paras-agarwal-meesho paras-agarwal-meesho changed the base branch from develop to dev/horizon-selfserve-migration February 13, 2026 12:44
@paras-agarwal-meesho paras-agarwal-meesho changed the base branch from dev/horizon-selfserve-migration to develop February 13, 2026 12:44
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
horizon/internal/predator/handler/predator_upload.go (1)

500-534: ⚠️ Potential issue | 🟡 Minor

Remove unused replaceModelNameInConfigPreservingFormat method and its associated test.

This method is only referenced in unit tests and is no longer called in production code—both copyConfigToProdConfigSource and copyConfigToNewNameInConfigSource now delegate to externalcall.ReplaceModelNameInConfig. Remove the method from predator_upload.go (lines 500-534) and the corresponding test case from predator_test.go (line 51).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/predator/handler/predator_upload.go` around lines 500 - 534,
Remove the unused helper replaceModelNameInConfigPreservingFormat from
predator_upload.go and delete its corresponding unit test in predator_test.go
(the test that targets the old formatter, referenced around the earlier test at
line 51), since production paths copyConfigToProdConfigSource and
copyConfigToNewNameInConfigSource now use externalcall.ReplaceModelNameInConfig;
ensure no other references to replaceModelNameInConfigPreservingFormat remain
(search for the function name) and run tests to confirm only the redundant test
was removed.
🧹 Nitpick comments (1)
horizon/internal/externalcall/gcs_client_test.go (1)

38-43: Consider adding test case for empty destination model name.

The ReplaceModelNameInConfig function validates that destModelName cannot be empty, but there's no test case covering this behavior.

💡 Suggested additional test case
 		{
 			name:          "invalid pbtxt returns error",
 			data:          []byte(`invalid_field: "x"`),
 			destModelName: "any",
 			expectError:   true,
 		},
+		{
+			name:          "empty dest model name returns error",
+			data:          []byte(`name: "old_model"` + "\n"),
+			destModelName: "",
+			expectError:   true,
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/externalcall/gcs_client_test.go` around lines 38 - 43, Add a
test case to hcs/internal/externalcall/gcs_client_test.go's table-driven tests
that exercises ReplaceModelNameInConfig with an empty destModelName (e.g., name
"empty destModelName returns error"), supply a valid config payload (or the
invalid pbtxt already used) and set expectError to true; this verifies
ReplaceModelNameInConfig enforces non-empty destModelName. Use the same test
runner that iterates over the table so the new entry follows the existing
pattern and asserts an error is returned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@horizon/internal/predator/handler/predator_upload.go`:
- Around line 500-534: Remove the unused helper
replaceModelNameInConfigPreservingFormat from predator_upload.go and delete its
corresponding unit test in predator_test.go (the test that targets the old
formatter, referenced around the earlier test at line 51), since production
paths copyConfigToProdConfigSource and copyConfigToNewNameInConfigSource now use
externalcall.ReplaceModelNameInConfig; ensure no other references to
replaceModelNameInConfigPreservingFormat remain (search for the function name)
and run tests to confirm only the redundant test was removed.

---

Duplicate comments:
In `@horizon/internal/predator/handler/predator_approval.go`:
- Around line 224-227: The updateInstanceCountInConfigSource function currently
checks modelName but not instanceCount; add a guard at the start of
updateInstanceCountInConfigSource(bucket, basePath, modelName string,
instanceCount int) to validate instanceCount > 0 (or whatever minimum is
required) and return a clear error (e.g., fmt.Errorf("invalid instanceCount:
%d", instanceCount)) when it's zero or negative so an invalid Triton config is
never written to GCS; place this check alongside the existing modelName check
and use the same error-return pattern.

---

Nitpick comments:
In `@horizon/internal/externalcall/gcs_client_test.go`:
- Around line 38-43: Add a test case to
hcs/internal/externalcall/gcs_client_test.go's table-driven tests that exercises
ReplaceModelNameInConfig with an empty destModelName (e.g., name "empty
destModelName returns error"), supply a valid config payload (or the invalid
pbtxt already used) and set expectError to true; this verifies
ReplaceModelNameInConfig enforces non-empty destModelName. Use the same test
runner that iterates over the table so the new entry follows the existing
pattern and asserts an error is returned.

Copy link
Contributor

@pavan-adari-meesho pavan-adari-meesho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a pull from develop and fix conflicts if any

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@horizon/internal/predator/handler/predator_approval.go`:
- Around line 224-259: The updateInstanceCountInConfigSource function currently
allows zero or negative instanceCount; add an input guard at the start of
updateInstanceCountInConfigSource to validate instanceCount > 0 and return a
clear error (e.g. fmt.Errorf("instanceCount must be positive, got %d",
instanceCount)) before any GCS reads or parsing; ensure the check sits
immediately after the modelName validation so functions like
p.GcsClient.ReadFile, prototext.Unmarshal, and UploadFile are not executed with
invalid counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants