AI-Dynamo updates. #791
base: main
Conversation
Important: Review skipped (draft detected). Check the settings in the CodeRabbit UI; this status message can be disabled there.
📝 Walkthrough: renames a public export.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed (inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 30
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tests/test_acceptance.py (1)

469-479: Missing required `name` field causes pipeline failure.
`GenAIPerf` inherits from `Benchmark`, which requires a `name` field. The pipeline error confirms:
GenAIPerf: field 'name' is required. ValidationError: 1 validation error for GenAIPerf; name field is missing.
Either provide `name` here, or fix the `GenAIPerf` class to have a default (preferred fix at the class level).
🐛 Workaround if class-level fix isn't applied
  genai_perf=GenAIPerf(
+     name="genai_perf",
      **{
          "streaming": True,

tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py (2)

89-121: Test expects report generation behavior that was removed.
This test calls `strategy.generate_report()` and then asserts that `report.csv` exists with specific content. However, `generate_report()` is now implemented as `pass` (no-op), so:
- The `report_file.is_file()` assertion on line 94 will fail
- All subsequent content assertions are unreachable
Either the test should be removed/updated to match the new no-op behavior, or `generate_report()` needs a proper implementation.

124-147: Metric tests expect different file naming and metric key formats.
The test fixture creates `profile_genai_perf.csv`, but `get_metric()` now looks for `genai_perf-report.csv`. Additionally, the tests use kebab-case metric names like `"output-token-throughput"`, but the implementation searches for exact matches in the CSV's `"Metric"` column (e.g., `"Output Token Throughput (tokens/sec)"`). These tests will return `METRIC_ERROR` instead of the expected values.

tests/slurm_command_gen_strategy/test_ai_dynamo_slurm_command_gen_strategy.py (1)

60-74: Missing required `name` field for `GenAIPerf`.
The pipeline failure indicates: GenAIPerf: missing required field 'name' in definition
This instantiation needs the required `name` field.
🐛 Workaround
  genai_perf=GenAIPerf(
+     name="genai_perf",
      **{
          "endpoint-type": "chat",

src/cloudai/systems/kubernetes/kubernetes_system.py (1)

379-389: Use list-based subprocess call for cp_logs_cmd.
Similar to the earlier commands, avoid `shell=True` for the log copy operation.
♻️ Suggested fix
- cp_logs_cmd = " ".join(
-     [
-         "kubectl",
-         "cp",
-         f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
-         str(job.test_run.output_path / "genai-perf"),
-     ]
- )
+ cp_logs_cmd = [
+     "kubectl",
+     "cp",
+     f"{self.default_namespace}/{frontend_pod}:{genai_perf_results_path}",
+     str(job.test_run.output_path / "genai-perf"),
+ ]
  logging.debug(f"Copying genai-perf results with command: {cp_logs_cmd}")
- p = subprocess.run(cp_logs_cmd, shell=True, capture_output=True, text=True)
+ p = subprocess.run(cp_logs_cmd, capture_output=True, text=True)
🤖 Fix all issues with AI agents
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 376-378: The current except block for lazy.k8s.client.ApiException
only logs the error and then allows execution to continue (risking copying
missing results); update the except to stop further processing by either
re-raising the caught exception (raise) or returning early from the enclosing
function/method so downstream code that copies results does not run; modify the
except block that catches lazy.k8s.client.ApiException (the one referencing
frontend_pod) to either raise the same exception (e) or return an appropriate
failure value immediately.
- Around line 311-316: The kubectl copy calls use subprocess.run with shell=True
and f-strings (wrapper_script_path, calc_csv_script_path, frontend_pod,
pod_wrapper_path, pod_calc_csv_path) which risks shell injection; change both
subprocess.run invocations in the method to pass the command as a list of
arguments (e.g. ["kubectl", "cp", wrapper_script_path,
f"{self.default_namespace}/{frontend_pod}:{pod_wrapper_path}"]) and remove
shell=True so Python executes the command directly; do the same for the calc CSV
call (use ["kubectl","cp", calc_csv_script_path,
f"{self.default_namespace}/{frontend_pod}:{pod_calc_csv_path}"]) while keeping
capture_output=True, text=True, and check=True.
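
As a reference for the two kubernetes_system.py items above, here is a minimal sketch of a list-based `kubectl cp` call that also fails fast instead of continuing after an error. The helper name, its parameters, and the return convention are illustrative assumptions, not the project's actual code:

```python
import logging
import subprocess


def copy_genai_perf_results(namespace: str, frontend_pod: str, results_path: str, dest: str) -> bool:
    """Hypothetical helper: list-based kubectl cp, no shell=True, fail fast on errors."""
    cp_cmd = [
        "kubectl",
        "cp",
        f"{namespace}/{frontend_pod}:{results_path}",
        dest,
    ]
    logging.debug("Copying genai-perf results with command: %s", cp_cmd)
    # Passing a list avoids shell interpolation of pod names or paths.
    proc = subprocess.run(cp_cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        logging.error("kubectl cp failed: %s", proc.stderr)
        return False  # return early so callers do not treat missing results as success
    return True
```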
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 41-59: The Benchmark base model declares a required field name:
str but subclasses GenAIPerf and LMBench don't provide defaults, causing
validation errors when instantiated without args; update the GenAIPerf and
LMBench classes to declare a default name (e.g., name: str = "GenAIPerf" and
name: str = "LMBench" or other canonical defaults) so Pydantic validation
succeeds, ensuring any logic using Benchmark.report_name continues to reference
the subclass name.
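
For the ai_dynamo.py item above, a small sketch of the class-level defaults the prompt asks for, assuming `Benchmark` is a plain Pydantic model with a required `name` (the extra fields and default strings are illustrative):

```python
from pydantic import BaseModel


class Benchmark(BaseModel):
    name: str  # required on the base class


class GenAIPerf(Benchmark):
    name: str = "genai_perf"  # default so GenAIPerf(**toml_args) validates without an explicit name
    streaming: bool = True


class LMBench(Benchmark):
    name: str = "lmbench"


# Both now construct cleanly without callers passing name.
print(GenAIPerf().name, LMBench().name)
```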
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 864-895: The issue is that the variable CUFILE_ENV_PATH_JSON on
line 867 is not quoted, which can cause word splitting if the path contains
spaces. To fix this, quote the variable in the redirection operator when writing
to the file in the setup_cufile function, changing the line to use
"$CUFILE_ENV_PATH_JSON" instead of $CUFILE_ENV_PATH_JSON to ensure correct
handling of paths with spaces.
- Around line 852-862: The command line in the genai-perf wrapper should quote
all variable expansions to avoid word-splitting for paths/args: wrap
${genai_perf_wrapper_script}, $RESULTS_DIR, ${genai_perf_report_file},
${calc_percentile_csv_script}, $(_gpus_per_node), the
dynamo_args/genai_perf_argument/genai_perf_args expansions, the redirection
target ${RESULTS_DIR}/genai_perf.log, and the touch "$DONE_MARKER" in quotes;
update the invocation in the block that builds the genai-perf command
(references: genai_perf_wrapper_script, RESULTS_DIR, genai_perf_report_file,
calc_percentile_csv_script, _gpus_per_node, dynamo_args["genai-perf-cmd"],
genai_perf_arguments, genai_perf_args["--extra-args"], DONE_MARKER) so every
variable is used as "${VAR}" (or the appropriate quoted form for
associative/array elements) and the log file path is quoted too.
- Around line 993-1018: In launch_lmbench, guard the pushd and quote expansions:
check pushd "$RESULTS_DIR" and if it fails log an error and exit or touch
"$DONE_MARKER" and return; after a successful pushd ensure you popd on exit;
quote all variable expansions like "${lmbench_dir}", "${RESULTS_DIR}",
"${log_file}", "${calc_percentile_csv_script}", and use proper quoting when
invoking the command and args (e.g. "$cmd" "${lmbench_arguments}"
"${lmbench_args["--extra-args"]}" or convert cmd/args into an array) to avoid
word-splitting and globbing and ensure safe handling of spaces in
launch_lmbench.
- Around line 898-914: The setup_kvbm function currently does an unsafe rm -rf
on DYN_KVBM_DISK_CACHE_DIR; before removing, validate and normalize the path
(use realpath) and ensure it is non-root, absolute, does not contain globs, and
is within an allowed base directory (e.g., a configured CACHE_BASE like
/var/cache or /tmp) or matches a safe whitelist; if validation fails, log an
error and exit instead of deleting. Also ensure all uses of
${DYN_KVBM_DISK_CACHE_DIR} are quoted to avoid word-splitting, and only perform
rm -rf after these checks succeed; keep the call to setup_cufile unchanged.
- Around line 561-563: The install step for "uv" currently pipes curl to sh and
blindly sources $HOME/.local/bin/env; change it to download the installer to a
temp file (using curl -fL), verify integrity (checksum or signature) before
executing, run the installer explicitly and check its exit status, and only
source $HOME/.local/bin/env after confirming the file exists and the install
succeeded; on any failure log an error via the existing log function and exit
with non-zero status. If possible prefer a package manager installation path
(apt/yum) or document verification requirements instead of curl | sh.
- Around line 1031-1047: The nvidia-smi invocation in function
log_gpu_utilization builds the output filename without quoting environment
variables, so paths containing spaces break; update the nvidia-smi -f argument
to quote the constructed path using the RESULTS_DIR and SLURM_NODEID variables
(e.g., use "${RESULTS_DIR}/gpu_utilization-${SLURM_NODEID}.csv") so the shell
treats it as a single filename when calling nvidia-smi in log_gpu_utilization.
- Around line 789-797: The launch_lmcache_controller function currently invokes
${dynamo_args["lmcache-controller-cmd"]} unquoted which can cause word-splitting
and broken argument parsing; update the call in launch_lmcache_controller to run
the command safely by quoting the variable or invoking it via bash -c (e.g.,
pass "${dynamo_args["lmcache-controller-cmd"]}" to bash -c or use eval-safe
invocation) so that dynamo_args["lmcache-controller-cmd"] with spaces/args is
executed correctly and output still redirected to
${RESULTS_DIR}/lmcache_controller.log.
- Around line 174-183: The min function calls use unquoted parameter expansions
which can cause word-splitting; update the two invocations that set
genai_perf_args["--warmup-request-count"] and genai_perf_args["--request-count"]
to quote their arguments (reference the genai_perf_args array keys
"--warmup-request-count" and "--request-count" and the dynamo_args keys
"min-warmup-request-count" and "min-request-count") so the calls become min
"${genai_perf_args[... ]}" "${dynamo_args[... ]}" ensuring safe expansion.
- Around line 959-991: In function launch_single_shot: remove or use the
computed but unused variable max_ctx_tokens_following (derived from isl and
cache-hit-pct) — either pass it into the python call as
--max_ctx_tokens_following or delete the computation; add robust error handling
around pushd/popd by checking their return status (or replace with a safe cd and
trap to ensure popd happens) so the script exits on directory change failure;
and ensure all variable expansions are quoted (e.g., $lmcache_path,
$RESULTS_DIR, $DONE_MARKER, $url, ${dynamo_args[…]} and any command
substitutions) to avoid word-splitting and globbing issues.
- Around line 1061-1063: The script unconditionally does cd $RESULTS_DIR before
launching log_gpu_utilization, which can fail silently; update the ai_dynamo.sh
around the cd to verify the directory exists and the cd succeeds (e.g., test for
existence or create it, then attempt cd and on failure log an error and exit
non‑zero) before starting log_gpu_utilization in the background so the logging
function isn't started in the wrong working directory; reference the RESULTS_DIR
variable, the cd invocation, and the log_gpu_utilization function when making
the change.
- Around line 799-811: The clear_lmcache function is defined but never used;
either remove it or call it during LMCache setup/benchmark startup to ensure a
clean cache state — for example invoke clear_lmcache from the LMCache
initialization path (e.g., inside start_lmcache, setup_lmcache, or before
run_benchmarks) making sure lmcache_config is available and handle curl
failures/log non-zero exits, or delete the clear_lmcache function to avoid dead
code if you don't want an automatic cache clear.
- Around line 916-957: In setup_lmcache, add a safety check before any
destructive rm -rf to validate storage_cachedir (e.g., ensure it is non-empty,
not "/" or ".", resolve with realpath and verify it is under an expected base
like $RESULTS_DIR or contains the frontend node) and abort the removal if the
check fails; also quote all variable expansions to prevent word-splitting and
globbing — specifically quote $lmcache_path in the uv pip install call, quote
"$storage_cachedir" in rm -rf and mkdir -p, quote "$LMCACHE_CONFIG_FILE" for
export/rm/-redirects, and quote "$key", "$val" and "$nkey" in the echo redirects
when writing the config; update function setup_lmcache and references to
lmcache_path, storage_cachedir, LMCACHE_CONFIG_FILE, and lmcache_config
accordingly.
- Around line 815-820: The prefill paths are intentionally disabled here:
replace the hardcoded want_prefill=0 and have_prefill=0 (which currently comment
out calls to _expected_ready_prefill and _count_initialized_prefill) with an
explanatory inline comment stating why prefill workers are disabled (e.g.,
"prefill disabled for single-node testing/temporarily pending refactor —
decode-only mode") or, if not intentional, re-enable/refactor by restoring the
real calls to _expected_ready_prefill and _count_initialized_prefill and
removing unused prefill code; reference the variables want_prefill/have_prefill
and the helper functions _expected_ready_prefill/_count_initialized_prefill and
the commented prefill launch to locate the relevant code to update.
- Around line 416-419: The appended replacement value can be split if not
properly quoted; update the append that builds result so the expanded
replacement from replace_placeholders (the val variable derived from arr[$key]
in the loop) is quoted when concatenating—use the existing result concatenation
that includes ${key} but ensure the replacement is inserted as "${val}"
(preserving spaces/special chars) so modify the code around result+="${key}
${val} " to quote the val expansion.
In `@src/cloudai/workloads/ai_dynamo/calc_percentile_csv.py`:
- Around line 61-65: The parse_float_safe function currently catches a broad
Exception; narrow it to only the expected float conversion failures by catching
ValueError and TypeError instead (i.e., change the except Exception to except
(ValueError, TypeError)) and keep the same behavior of returning float("nan") so
parse_float_safe(value: Any) still returns NaN for invalid inputs.
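
For reference, a narrowed version of `parse_float_safe` along the lines the prompt describes (a sketch, not the file's exact code):

```python
from typing import Any


def parse_float_safe(value: Any) -> float:
    """Return value as a float, or NaN when conversion is impossible."""
    try:
        return float(value)
    except (ValueError, TypeError):  # only the expected conversion failures
        return float("nan")


assert parse_float_safe("1.5") == 1.5
assert parse_float_safe(None) != parse_float_safe(None)  # NaN compares unequal to itself
```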
In `@src/cloudai/workloads/ai_dynamo/genai_perf_wrapper.sh`:
- Line 60: The assignment to profile_path uses an unquoted $result_dir which can
cause globbing and word-splitting for paths with spaces; update the find
invocation (the profile_path assignment) to quote $result_dir (and any other
shell expansions in that command) so the path is treated as a single argument
and safe from splitting/globbing.
- Around line 47-52: The code flattens the genai_perf_cmdline_args array into a
single string losing argument boundaries and then tries to execute that string;
instead keep it as an array and execute it via the array expansion. Replace the
assignment genai_perf_cmdline_args="${genai_perf_cmdline_args[*]}" (remove it),
change the log to safely show arguments (for example use printf '%q '
"${genai_perf_cmdline_args[@]}" or join with spaces for display), and launch the
command using the array expansion "${genai_perf_cmdline_args[@]}" so arguments
with spaces or special characters are preserved when executing.
- Around line 62-66: The script currently overwrites the percentile report
because the grep line (`grep ".*,.*,.*,.*" $profile_path >
$result_dir/$report_file`) uses a single >; change it to append (>>) so the
earlier python3 `${calc_percentile_csv_script} $profile_path -o
$result_dir/$report_file` output is preserved. Replace the integer division for
`output_tokens_per_second_per_gpu=$(( $output_tokens_per_second / $total_gpus
))` with a floating-point calculation (e.g., using `bc` or `awk`, or
`output_tokens_per_second_per_gpu=$(awk -v a="$output_tokens_per_second" -v
b="$total_gpus" 'BEGIN{printf "%.6f", a/b}')`) to avoid truncation, and keep the
final echo that appends the value to `$result_dir/$report_file`.
- Around line 41-43: In the case statement of genai_perf_wrapper.sh the default
branch "*)" currently silences unknown arguments; update that branch (the "*)"
handler) to report unrecognized options by printing an error to stderr
(including the offending "$1" value) and either show brief usage/help or suggest
correct flags, then exit non‑zero (e.g., exit 1) so typos or unsupported args
are caught instead of ignored.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 69-70: The generate_report method is currently a no-op and must be
implemented to satisfy test_ai_dynamo_generate_report: update the
ReportGenerationStrategy.generate_report implementation to create a report.csv
file (in the same output/location the tests expect) and write the expected CSV
headers and rows the test asserts against; use the class's available data
members or helper methods to collect the rows, open the file in write mode, and
write CSV-formatted content (including correct headers and values) so the test
can find the file and validate its contents.
- Around line 27-28: Remove the empty TYPE_CHECKING block because it contains
only a no-op pass; delete the lines containing "if TYPE_CHECKING:" and the
following "pass" in report_generation_strategy.py so there is no dead code
related to TYPE_CHECKING in the module.
- Line 23: Replace the direct pandas import with the project's lazy import
utility to avoid startup overhead: remove "import pandas as pd" and instead
reference LazyImports.pd (from src.cloudai.util.lazy_imports) wherever pandas is
used in ReportGenerationStrategy or this module; ensure you import the
LazyImports symbol (e.g., "from src.cloudai.util.lazy_imports import
LazyImports") and update any code that expects a module object to call
LazyImports.pd instead so pandas is loaded lazily only when needed.
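
A generic sketch of the lazy-import pattern the last item refers to; the real helper in cloudai may look different, so the class and attribute names here are assumptions:

```python
import importlib
from functools import cached_property


class _LazyImports:
    """Defers heavy imports until an attribute is first touched."""

    @cached_property
    def pd(self):
        return importlib.import_module("pandas")


lazy = _LazyImports()


def read_report(path: str):
    # pandas is only imported when a report is actually parsed
    return lazy.pd.read_csv(path)
```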
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 242-244: The log may report the wrong "from" value because
self.test_run.num_nodes is modified earlier; capture the original value into a
local (e.g., original_num_nodes = self.test_run.num_nodes) before any
assignments and use that in the logging.info call (log "Setting num_nodes from
%d to %d", original_num_nodes, total_nodes) right before you set
self.test_run.num_nodes = total_nodes in the method that contains the current
code (referencing self.test_run.num_nodes and total_nodes in
slurm_command_gen_strategy.py).
In `@tests/json_gen_strategy/test_ai_dynamo.py`:
- Line 49: The test instantiates GenAIPerf without the required Benchmark field
"name", causing validation to fail; update the GenAIPerf construction in
tests/json_gen_strategy/test_ai_dynamo.py (the genai_perf=GenAIPerf(...) site)
to pass a string name (e.g., a short descriptive identifier) so the Benchmark
validation succeeds and CI stops failing.
In
`@tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py`:
- Line 67: The GenAIPerf instantiation is missing the required Benchmark 'name'
argument; update the test where GenAIPerf() is constructed (the GenAIPerf
symbol) to pass a name string (e.g., GenAIPerf(name="genai_perf")) so the
required field from the Benchmark parent is provided and the test can construct
the object successfully.
In `@tests/test_calc_percentile_csv.py`:
- Line 18: The import statement "from pathlib import Path" in the test file is
unused; remove that import line so the test module no longer contains the unused
symbol Path and the CI lint error is resolved—locate the import at the top of
tests/test_calc_percentile_csv.py (the "from pathlib import Path" line) and
delete it.
    shopt -s nocasematch
    val=$(replace_placeholders "${arr[$key]}")
    result+="${key} ${val} "
done
🧹 Nitpick | 🔵 Trivial
Quote the replaced value to prevent word splitting.
The val variable should be quoted when appended to result to handle values containing spaces or special characters.
♻️ Proposed fix
  shopt -s nocasematch
  val=$(replace_placeholders "${arr[$key]}")
- result+="${key} ${val} "
+ result+="${key} \"${val}\" "
  def generate_report(self) -> None:
-     output_path = self.test_run.output_path
-     source_csv = next(output_path.rglob(CSV_FILES_PATTERN))
-     target_csv = output_path / "report.csv"
-
-     total_gpus = self._calculate_total_gpus()
-     if total_gpus is None:
-         logging.warning("gpus_per_node is None, skipping Overall Output Tokens per Second per GPU calculation.")
-         shutil.copy2(source_csv, target_csv)
-         return
-
-     sections = self._read_csv_sections(source_csv)
-     self._write_sections_with_metric(target_csv, sections, total_gpus)
+     pass
generate_report is now a no-op, breaking test assertions.
The test_ai_dynamo_generate_report test expects generate_report() to produce a report.csv file with specific content, but this method now does nothing. Either the test needs updating or the implementation needs to be completed.
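
If the no-op is not intentional, a minimal implementation could look like the sketch below. The helper signature and the section-by-section copy are assumptions based on the test behavior described above, not the project's actual API:

```python
import csv
from pathlib import Path


def generate_report(output_dir: Path, source_csv: Path) -> Path:
    """Hypothetical: rewrite the genai-perf CSV into report.csv, section by section."""
    target = output_dir / "report.csv"
    sections = source_csv.read_text().split("\n\n")
    with target.open("w", newline="") as fh:
        writer = csv.writer(fh)
        for i, section in enumerate(s for s in sections if s.strip()):
            if i:
                fh.write("\n")  # keep the blank line between CSV sections
            for row in csv.reader(section.splitlines()):
                writer.writerow(row)
    return target
```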
logging.info("Setting num_nodes from %d to %d", self.test_run.num_nodes, total_nodes)

self.test_run.num_nodes = total_nodes
🧹 Nitpick | 🔵 Trivial
Log message may be misleading.
The log says "Setting num_nodes from X to Y" but self.test_run.num_nodes was already set on line 234 or 238, so the "from" value already equals total_nodes in some paths. Consider logging the original value before any modifications.
♻️ Suggested improvement
+ original_num_nodes = self.test_run.num_nodes
+
  if prefill_nodes and decode_nodes:
      self.test_run.nodes = prefill_nodes.split(",") + decode_nodes.split(",") + self.test_run.nodes
      self.test_run.num_nodes = len(self.test_run.nodes)
      prefill_n = len(prefill_nodes.split(","))
      decode_n = len(decode_nodes.split(","))
  else:
      self.test_run.num_nodes = prefill_n + decode_n
  total_nodes = prefill_n + decode_n
- logging.info("Setting num_nodes from %d to %d", self.test_run.num_nodes, total_nodes)
-
- self.test_run.num_nodes = total_nodes
+ if original_num_nodes != total_nodes:
+     logging.info("Setting num_nodes from %d to %d", original_num_nodes, total_nodes)
+ self.test_run.num_nodes = total_nodes
tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py (resolved)
Force-pushed 0a6acd6 to 9ad787e
Signed-off-by: Kapil Arya <kapila@nvidia.com>
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/ai_dynamo/ai_dynamo.sh (1)

1072-1074: Add error handling for cd to workspace path.
Line 1073 changes to `${dynamo_args["workspace-path"]}` without checking whether the operation succeeds or quoting the path. This could silently fail if the directory doesn't exist.
♻️ Proposed fix
  if _is_vllm; then
-     cd ${dynamo_args["workspace-path"]}
+     cd "${dynamo_args["workspace-path"]}" || {
+         log "ERROR: Failed to change to workspace directory: ${dynamo_args["workspace-path"]}"
+         exit 1
+     }
  fi
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 74-76: The serializer _serialize_repo currently returns the
misleading literal "/invalid/repo/path" when repo is None; change it to return
None (or empty string if your schema requires) instead to avoid masking
configuration errors: replace the else branch with "None" (i.e. return
v.container_mount if v else None), update the return type to Optional[str] if
necessary, and adjust any callers/consumers that expect a string to handle None
(or validate/raise an explicit error inside _serialize_repo if you prefer
failing-fast).
- Around line 183-187: The GitRepo instances are using SSH-style URLs which
break CI without SSH keys; update the GitRepo constructor calls (e.g., the
variable repo of type Optional[GitRepo] and the other GitRepo usage around the
second occurrence) to use HTTPS-formatted repository URLs (e.g.,
https://github.com/<org>/<repo>.git) instead of "git@github.com:..."; ensure
both places where GitRepo(...) is invoked are changed so CI/CD environments can
clone without SSH auth.
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 448-476: Separate the declaration and assignment to address SC2155
in exit_on_error: instead of combining local and command substitution for fatal,
first declare the variable (local fatal) and then assign it with
fatal=$(_detect_fatal_once); locate this in the exit_on_error function where
fatal is set (currently using _detect_fatal_once) and make the two-step change
so shellcheck warning SC2155 is resolved.
- Around line 210-211: The case branch for genai_perf args has a prefix
mismatch: the pattern uses `--genai_perf-args-*` but the extraction uses
`${key#--genai-perf-args-}`, so `${key#...}` won't strip the actual prefix and
will leave a leftover in the index; update them to be consistent (either change
the case pattern to `--genai-perf-args-*` or change the extraction to
`${key#--genai_perf-args-}`) in the branch handling `genai_perf_args` (look for
the `genai_perf_args["${key#...}"]="$2"` line and the `--genai_perf-args-*`
case) so the array keys are computed correctly.
- Around line 550-560: The workload-type checks (_is_genai_perf_workload,
_is_lmbench_workload, _is_single_shot_workload) rely on
dynamo_args["workload-type"] which may be unset; initialize a sensible default
for dynamo_args["workload-type"] (e.g., "genai-perf" or empty string) at
dynamo_args population/initialization point so those predicate functions behave
predictably when CLI input is absent, and ensure any existing code that sets
dynamo_args still allows the default to be overridden.
In `@src/cloudai/workloads/ai_dynamo/calc_percentile_csv.py`:
- Around line 106-110: The current logic silently does nothing for completely
empty CSVs (no headers and no rows); detect this case early by checking rows
and/or the DictReader fieldnames and emit a user-friendly warning or logger
message before returning/continuing; update the code around the variables rows
and all_columns in calc_percentile_csv.py to log a clear warning (e.g., "empty
CSV: no headers or data") when both rows is empty and there are no fieldnames,
and then skip the numeric-summary loop that calls parse_float_safe and
append_summary so behavior is explicit and debuggable.
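
A sketch of the early check this item describes, assuming the file is read with `csv.DictReader` (the function name and surrounding flow are illustrative):

```python
import csv
import logging
from pathlib import Path


def summarize(path: Path) -> None:
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        rows = list(reader)
        if not rows and not reader.fieldnames:
            # make the empty-input case explicit instead of silently producing nothing
            logging.warning("empty CSV: no headers or data in %s", path)
            return
        # ... proceed with parse_float_safe / append_summary over rows ...
```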
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 37-47: The extract_metric_from_csv function assumes a "Metric"
column exists but only checks metric_type; update extract_metric_from_csv to
first validate that "Metric" is in df.columns (in addition to metric_type), and
if missing log a clear message including df.columns and return METRIC_ERROR;
then proceed to check metric_name in df["Metric"].values and return the metric
as before. Ensure you reference the existing symbols metric_type, metric_name,
METRIC_ERROR and use the same logging style as the other checks.
- Around line 55-56: The current metric parsing in report_generation_strategy.py
unpacks metric.split(":") into three variables which raises ValueError for
inputs with not exactly three parts; modify the parsing in the block handling
the variable metric (the code that currently does `benchmark_name, metric_name,
metric_type = metric.split(":")`) to safely split with at most two splits (e.g.
split(":", 2)) and then normalize missing parts with sensible defaults (e.g. set
metric_type to a default like "unknown" or None when absent) or validate and
reject malformed metrics with a clear error/log message; ensure the change is
applied wherever the metric variable is processed so downstream code that
expects benchmark_name, metric_name, and metric_type receives well-defined
values.
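
Both report_generation_strategy.py items above could be handled along the lines of the following sketch. `METRIC_ERROR` and the `benchmark:metric:type` string format are taken from the prompts; the function signatures and the sentinel value are assumptions:

```python
import logging

METRIC_ERROR = -1.0


def split_metric(metric: str) -> tuple[str, str, str]:
    """Split 'benchmark:name[:type]' without raising on missing parts."""
    parts = metric.split(":", 2)
    benchmark_name = parts[0]
    metric_name = parts[1] if len(parts) > 1 else ""
    metric_type = parts[2] if len(parts) > 2 else "unknown"
    return benchmark_name, metric_name, metric_type


def extract_metric_from_csv(df, metric_name: str, metric_type: str) -> float:
    # validate both the "Metric" column and the requested value column before indexing
    if "Metric" not in df.columns or metric_type not in df.columns:
        logging.error("Missing columns in report CSV: %s", list(df.columns))
        return METRIC_ERROR
    if metric_name not in df["Metric"].values:
        logging.error("Metric %s not found", metric_name)
        return METRIC_ERROR
    return float(df.loc[df["Metric"] == metric_name, metric_type].iloc[0])
```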
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 193-194: The assert messages incorrectly reference
dynamo.num_prefill_nodes and dynamo.num_decode_nodes; update the assertions in
slurm_command_gen_strategy.py that validate prefill_n and decode_n so the error
messages reflect the actual attributes (e.g., "prefill_worker.num_nodes must be
an integer" and "decode_worker.num_nodes must be an integer") or use the
variable names ("prefill_n must be an integer", "decode_n must be an integer")
for clarity; change the assert statements that check isinstance(prefill_n, int)
and isinstance(decode_n, int) to use one of these corrected messages.
In `@tests/ref_data/ai-dynamo.sbatch`:
- Line 31: Remove the duplicate `--export=ALL` flag from the sbatch command in
the script so only one `--export=ALL` remains; locate the second occurrence of
`--export=ALL` (the duplicate entry added near the sbatch options) and delete
it, leaving the original instance intact.
- Line 69: The generated SLURM argument construction in
slurm_command_gen_strategy.py (the list comprehension building args from
toml_args using prefix and the f-string at the comprehension) wraps values with
double quotes which breaks JSON-containing values; change the construction to
either wrap values in single quotes when the value contains double quotes or to
escape inner double quotes (e.g., detect if v contains '"' then use
f"{prefix}{k} '{v}'" or escape quotes) so that JSON like {"temperature": 0.7,
"max_tokens": 128} is emitted as a single shell-safe token (use the same change
where args is built and any helper that formats values).
In
`@tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py`:
- Around line 117-119: The test empties the wrong CSV due to a filename
mismatch: replace the hyphenated filename with the underscore variant so the
existing test CSV is cleared; specifically, update the write_text call that
currently targets (ai_dynamo_tr.output_path / "genai_perf-report.csv") to target
"genai_perf_report.csv" so the subsequent call to
strategy.get_metric("invalid-metric") exercises the empty-file error path
instead of creating a new file.
- Around line 86-91: The test currently only asserts True; instead validate a
concrete behavior of AIDynamoReportGenerationStrategy.generate_report: call
strategy.generate_report(), capture its return value and assert it is None (or
the expected sentinel), and/or assert no report file was created by checking the
expected report path derived from the strategy (use
AIDynamoReportGenerationStrategy.report_path or the strategy instance to
determine the output file) or check a relevant side-effect on the ai_dynamo_tr
fixture; update test_ai_dynamo_generate_report to replace the tautological
assert with one or two assertions verifying the method's return and absence (or
presence if expected) of the report file or expected state change.
@field_serializer("repo")
def _serialize_repo(self, v: GitRepo) -> str:
    return v.container_mount if v else "/invalid/repo/path"
Serializer returns misleading path for None repo.
When repo is None, the serializer returns "/invalid/repo/path" which could mask configuration issues downstream. Consider returning None or an empty string, or raising an error during serialization if the repo is unexpectedly missing.
🐛 Proposed fix
  @field_serializer("repo")
- def _serialize_repo(self, v: GitRepo) -> str:
-     return v.container_mount if v else "/invalid/repo/path"
+ def _serialize_repo(self, v: Optional[GitRepo]) -> Optional[str]:
+     return v.container_mount if v else None
repo: Optional[GitRepo] = GitRepo(
    url="git@github.com:LMCache/LMCache.git",
    commit="ab8530993992db873869ba882320953582d94309",
    mount_as="/git/LMCache",
)
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
rg -n "GitRepo\(" --type=py -A3 | grep -E "(url=|https://|git@)"Repository: NVIDIA/cloudai
Length of output: 1971
Use HTTPS URLs for GitRepo to improve CI/CD compatibility.
The SSH URLs at lines 184 and 214 should be changed to HTTPS format (https://github.com/...) to ensure compatibility in environments without SSH key authentication. This aligns with HTTPS URLs used elsewhere in the codebase (e.g., nemo_launcher.py, megatron_bridge.py, and in ai_dynamo.py at line 255).
--genai_perf-args-*)
    genai_perf_args["${key#--genai-perf-args-}"]="$2" ;;
Bug: Pattern-extraction mismatch for genai_perf-args.
The case pattern --genai_perf-args-* uses an underscore, but the extraction ${key#--genai-perf-args-} uses a hyphen. This mismatch means the extraction will not strip the prefix correctly, leaving part of the key in the array index.
🐛 Proposed fix
- --genai_perf-args-*)
-     genai_perf_args["${key#--genai-perf-args-}"]="$2" ;;
+ --genai-perf-args-*)
+     genai_perf_args["${key#--genai-perf-args-}"]="$2" ;;

Or if the underscore variant is intentional:
  --genai_perf-args-*)
-     genai_perf_args["${key#--genai-perf-args-}"]="$2" ;;
+     genai_perf_args["${key#--genai_perf-args-}"]="$2" ;;
function perform_exit()
{
    local exit_code=$1
    local sleep_before_exit="${dynamo_args["sleep-before-exit"]}"
    if [[ -n "${sleep_before_exit}" ]]; then
        log "Sleeping for ${sleep_before_exit} seconds before exit"
        sleep "${sleep_before_exit}"
    fi
    exit "${exit_code}"
}

exit_on_error() {
    local fatal=$(_detect_fatal_once)
    if [ -f "${DONE_MARKER}" ]; then
        log "DONE_MARKER found. Skipping error check."
        return
    fi
    if [[ "${fatal}" -gt 0 ]]; then
        log "FATAL: detected ${fatal} fatal error line(s). Writing ${FATAL_ERROR_MARKER} and terminating."
        sleep 1

        touch "${FATAL_ERROR_MARKER}"
        grep -E "${dynamo_args["worker-error-pattern"]}|UCX.*ERROR" "${RESULTS_DIR}"/*.log 2>/dev/null > "${FATAL_ERROR_MARKER}"
        # Try to stop background jobs for a cleaner exit, but do not loop
        kill $(jobs -p) 2>/dev/null || true
        # Exit non-zero so srun can retry
-       exit 1
+       perform_exit 1
    fi
}
🧹 Nitpick | 🔵 Trivial
LGTM with minor suggestion.
The error handling improvements are good:
- `perform_exit` provides a configurable exit delay for debugging
- Checking `DONE_MARKER` before error detection prevents false positives after successful runs

Minor: Line 460 could separate declaration and assignment per SC2155, but this is stylistic.
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 460-460: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 472-472: Quote this to prevent word splitting.
(SC2046)
_is_genai_perf_workload() {
    [[ "${dynamo_args["workload-type"]}" == "genai-perf" ]]
}

_is_lmbench_workload() {
    [[ "${dynamo_args["workload-type"]}" == "lmbench" ]]
}

_is_single_shot_workload() {
    [[ "${dynamo_args["workload-type"]}" == "single-shot" ]]
}
Missing default value for workload-type.
These workload type check functions depend on dynamo_args["workload-type"], but this key is not initialized with a default value. If not provided via CLI, none of the workload branches will execute on the frontend node.
Consider adding a default:
dynamo_args["workload-type"]="genai-perf" # or empty string if intentional🤖 Prompt for AI Agents
assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer"
assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer"
Assertion messages reference non-existent attribute names.
The messages say dynamo.num_prefill_nodes and dynamo.num_decode_nodes, but the actual attributes being checked are prefill_worker.num_nodes and decode_worker.num_nodes.
🐛 Proposed fix
- assert isinstance(prefill_n, int), "dynamo.num_prefill_nodes must be an integer"
- assert isinstance(decode_n, int), "dynamo.num_decode_nodes must be an integer"
+ assert isinstance(prefill_n, int), "prefill_worker.num_nodes must be an integer"
+ assert isinstance(decode_n, int), "decode_worker.num_nodes must be an integer"
--ntasks=2 \
--ntasks-per-node=1 \
--export=ALL,DYNAMO_FATAL_ERROR_FILE=fatal_error.marker \
--export=ALL \
🧹 Nitpick | 🔵 Trivial
Duplicate --export=ALL flag.
Line 31 duplicates --export=ALL which is already specified on line 23. While not harmful, it adds unnecessary redundancy to the command.
♻️ Suggested fix
--nodes=2 \
--ntasks=2 \
--ntasks-per-node=1 \
- --export=ALL \
  --output=__OUTPUT_DIR__/output/node-%n-stdout.txt \
--genai_perf-report-name "genai_perf_report.csv" \
--genai_perf-enabled "True" \
--genai_perf-streaming "True" \
--genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}" \
🧩 Analysis chain
🏁 Scripts executed: a series of rg/sed/cat/find/python commands against the NVIDIA/cloudai repository traced the `--genai_perf-extra-inputs` value from the `GenAIPerf` model in ai_dynamo.py, through the argument construction in slurm_command_gen_strategy.py, to the generated reference file tests/ref_data/ai-dynamo.sbatch, and simulated the resulting shell parsing to confirm the unescaped inner quotes break the argument. (Individual command outputs omitted.)
JSON argument in generated command has unescaped quotes that will cause shell parsing failures.
The --genai_perf-extra-inputs argument is generated by slurm_command_gen_strategy.py line 44, which wraps all values in double quotes without escaping inner quotes:
args = [f'{prefix}{k} "{v}"' for k, v in toml_args.items()]
When a JSON string like {"temperature": 0.7, "max_tokens": 128} is passed through this method, it produces:
--genai_perf-extra-inputs "{"temperature": 0.7, "max_tokens": 128}"
The shell will interpret the inner " characters as quote boundaries, breaking the argument parsing. Either use single quotes for JSON values or escape the inner double quotes:
- Single quotes: '{"temperature": 0.7, "max_tokens": 128}'
- Escaped quotes: "{\"temperature\": 0.7, \"max_tokens\": 128}"
Update the command generation logic in slurm_command_gen_strategy.py line 44 to handle JSON and other complex string values properly.
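
One way to make the generated arguments shell-safe is `shlex.quote`. A sketch of the idea: the `prefix` and `toml_args` names come from the review above, the rest is an assumption about how the values are assembled:

```python
import shlex

prefix = "--genai_perf-"
toml_args = {
    "extra-inputs": '{"temperature": 0.7, "max_tokens": 128}',
    "streaming": "True",
}

# shlex.quote wraps each value so embedded quotes and spaces survive shell parsing.
args = [f"{prefix}{k} {shlex.quote(str(v))}" for k, v in toml_args.items()]
print(" \\\n".join(args))
# --genai_perf-extra-inputs '{"temperature": 0.7, "max_tokens": 128}' \
# --genai_perf-streaming True
```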
def test_ai_dynamo_generate_report(slurm_system: SlurmSystem, ai_dynamo_tr: TestRun, csv_content: str) -> None:
    strategy = AIDynamoReportGenerationStrategy(slurm_system, ai_dynamo_tr)
    # The new implementation does not generate a report file
    strategy.generate_report()

    report_file = ai_dynamo_tr.output_path / "report.csv"
    assert report_file.is_file(), "Report CSV was not generated."

    report_content = report_file.read_text()

    def split_into_sections(content: str) -> list[str]:
        sections = content.split("\n\n")
        return [s.strip() for s in sections if s.strip()]

    def normalize_csv_section(section: str) -> str:
        return section.replace('"', "").strip()

    actual_sections = [normalize_csv_section(s) for s in split_into_sections(report_content)]
    expected_sections = [normalize_csv_section(s) for s in split_into_sections(csv_content)]

    # First section should match after normalization
    assert actual_sections[0] == expected_sections[0], "First section (metrics) does not match"

    # Second section should have our additional metric
    second_section_lines = actual_sections[1].split("\n")
    assert second_section_lines[0] == "Metric,Value", "Second section header does not match"
    assert second_section_lines[1] == "Output Token Throughput (tokens/sec),24", "Throughput line does not match"
    assert second_section_lines[2] == "Overall Output Tokens per Second per GPU,1.0", "Added metric line is incorrect"
    assert second_section_lines[3:] == ["Request Throughput (per sec),1.23", "Request Count (count),40.00"], (
        "Remaining lines do not match"
    )

    # Third section (GPU metrics) should be identical
    assert actual_sections[2] == expected_sections[2], "Third section (GPU metrics) does not match"
    # Just verify the method runs without error
    assert True
🧹 Nitpick | 🔵 Trivial
Consider adding a meaningful assertion.
The test only verifies that generate_report() runs without error. An assert True is a tautology. Consider asserting on a specific behavior or removing the redundant assertion.
♻️ Suggested improvement
 def test_ai_dynamo_generate_report(slurm_system: SlurmSystem, ai_dynamo_tr: TestRun, csv_content: str) -> None:
     strategy = AIDynamoReportGenerationStrategy(slurm_system, ai_dynamo_tr)
     # The new implementation does not generate a report file
-    strategy.generate_report()
-    # Just verify the method runs without error
-    assert True
+    # Verify the method runs without raising an exception
+    strategy.generate_report()  # No assertion needed - no exception = success
🤖 Prompt for AI Agents
In
`@tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py`
around lines 86-91, the test currently only asserts True; instead validate a
concrete behavior of AIDynamoReportGenerationStrategy.generate_report: call
strategy.generate_report(), capture its return value and assert it is None (or
the expected sentinel), and/or assert no report file was created by checking the
expected report path derived from the strategy (use
AIDynamoReportGenerationStrategy.report_path or the strategy instance to
determine the output file) or check a relevant side-effect on the ai_dynamo_tr
fixture; update test_ai_dynamo_generate_report to replace the tautological
assert with one or two assertions verifying the method's return and absence (or
presence if expected) of the report file or expected state change.
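For illustration, a minimal sketch of what such a strengthened test could look like, assuming generate_report() returns None and writes no report.csv (both assumptions, not verified against the strategy's actual contract):

def test_ai_dynamo_generate_report(slurm_system: SlurmSystem, ai_dynamo_tr: TestRun) -> None:
    strategy = AIDynamoReportGenerationStrategy(slurm_system, ai_dynamo_tr)

    # The no-op implementation should return None and leave no report.csv behind.
    assert strategy.generate_report() is None
    assert not (ai_dynamo_tr.output_path / "report.csv").exists()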
    # Empty the CSV file to test error handling
    (ai_dynamo_tr.output_path / "genai_perf-report.csv").write_text("")
    assert strategy.get_metric("invalid-metric") == METRIC_ERROR
Filename mismatch: genai_perf-report.csv vs genai_perf_report.csv.
Line 68 creates genai_perf_report.csv (underscore), but line 118 writes to genai_perf-report.csv (hyphen). This test writes to a non-existent file (creating a new empty one) rather than clearing the existing test data file. The subsequent assertion passes but doesn't test the intended error-handling path for an empty CSV.
🐛 Proposed fix
     # Empty the CSV file to test error handling
-    (ai_dynamo_tr.output_path / "genai_perf-report.csv").write_text("")
+    (ai_dynamo_tr.output_path / "genai_perf_report.csv").write_text("")
     assert strategy.get_metric("invalid-metric") == METRIC_ERROR
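To make this kind of mismatch harder to reintroduce, the filename could be defined once and reused by both the fixture and the test. A small sketch, using the hypothetical constant name GENAI_PERF_REPORT_CSV:

GENAI_PERF_REPORT_CSV = "genai_perf_report.csv"  # single source of truth for the report filename

# In the fixture that seeds the test data:
(ai_dynamo_tr.output_path / GENAI_PERF_REPORT_CSV).write_text(csv_content)

# In the error-handling test:
(ai_dynamo_tr.output_path / GENAI_PERF_REPORT_CSV).write_text("")
assert strategy.get_metric("invalid-metric") == METRIC_ERROR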
In
`@tests/report_generation_strategy/test_ai_dynamo_report_generation_strategy.py`
around lines 117 - 119, The test empties the wrong CSV due to a filename
mismatch: replace the hyphenated filename with the underscore variant so the
existing test CSV is cleared; specifically, update the write_text call that
currently targets (ai_dynamo_tr.output_path / "genai_perf-report.csv") to target
"genai_perf_report.csv" so the subsequent call to
strategy.get_metric("invalid-metric") exercises the empty-file error path
instead of creating a new file.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary
Several updates to bring the AI-Dynamo workload up to speed with the latest AI-Dynamo code base and use cases.
Test Plan