Revise the gen mixin class to support additional features, models #27
Conversation
Pull request overview
This PR expands the translation/generation infrastructure to support more models and workflows (notably vLLM), adds HPC run scripts/configs, and standardizes runtime output on Python logging.
Changes:
- Add vLLM launch/keepalive support and API key/base URL overrides in GeneratorMixin, plus token counting via tiktoken.
- Replace print() usage with structured logging across translation components and add a CLI --log-level flag.
- Add Perlmutter vLLM YAML configs and Slurm scripts for translation sweeps.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| uv.lock | Locks new dependency additions (notably tiktoken and its deps). |
| pyproject.toml | Adds tiktoken dependency for token counting / limits. |
| translate.sbatch | Adds a single-job translation launcher (currently has a CLI arg typo). |
| translate-sweep.sbatch | Adds Slurm array sweep driver for translation experiments. |
| translate-runner.sh | Adds per-array-task runner, including vLLM keepalive integration (currently mismatched with CLI type). |
| src/translate/translator.py | Introduces module logger for consistent logging. |
| src/translate/translate.py | Adds logging configuration + shared GeneratorMixin CLI flags (API key / base URL / vLLM server options). |
| src/translate/top_down_agentic/utils.py | Migrates utility prints to logging. |
| src/translate/top_down_agentic/top_down_agentic.py | Plumbs new GeneratorMixin args + replaces prints with logging; changes generation call parameters. |
| src/translate/top_down_agentic/dependency_agent.py | Migrates dependency tracing output from prints to logging. |
| src/translate/top_down_agentic/context_agent.py | Migrates to logging; changes context generation call parameters. |
| src/translate/top_down_agentic/chunk_agent.py | Migrates chunking status output from prints to logging. |
| src/translate/swe_agent/swe_agent_translator.py | Migrates SWE-agent workflow output from prints to logging. |
| src/translate/repo.py | Introduces module logger for consistent logging. |
| src/translate/naive/naive_translator.py | Plumbs new GeneratorMixin args + migrates to logging; changes multi-output generation behavior. |
| src/translate/generator_mixin.py | Major backend changes: OpenAI/vLLM client overrides, optional internal vLLM server launch, encoder/token-limit helpers, logging, and new request retry behavior. |
| config/perlmutter-vllm-qwen.yaml | Adds Perlmutter vLLM config for Qwen/Qwen3-Coder-Next. |
| config/perlmutter-vllm-oss.yaml | Adds Perlmutter vLLM config for openai/gpt-oss-120b. |
| config/perlmutter-vllm-nemo.yaml | Adds Perlmutter vLLM config for NVIDIA Nemotron + custom reasoning parser plugin. |
| config/perlmutter-vllm-glm.yaml | Adds Perlmutter vLLM config for GLM. |
| config/nano_v3_reasoning_parser.py | Adds custom reasoning parser plugin for vLLM/Nemotron. |
| CLAUDE.md | Adds repository guidance for Claude Code users. |
| .gitignore | Ignores local test outputs. |
```bash
# A keepalive ID unique to this array task causes GeneratorMixin to write a
# PID file for the vLLM server so it persists across all num_translations calls.
local keepalive_id="${SLURM_JOB_ID}_${slot}"

for i in $(seq 0 $((num_translations - 1))); do
    python src/translate/translate.py \
        -i "targets/${app_name}/${src_model}/" \
        -o ../restate-results/ \
        -c "targets/${app_name}/${dst_model}/" \
        --method "${method}" \
        --src-model "${src_model}" \
        --dst-model "${dst_model}" \
        -n 1 \
        --output-id "${i}" \
        --app-name "${app_name}" \
        --vllm-environment ../serve/.venv/ \
        --vllm-yaml-config config/perlmutter-vllm-oss.yaml \
        --vllm-keepalive-id "${keepalive_id}" \
        --naive-backend vllm \
        --naive-llm-name openai/gpt-oss-120b
```
keepalive_id is built as a string like "${SLURM_JOB_ID}_${slot}", but translate.py defines --vllm-keepalive-id as an int. This will fail argparse validation for job arrays. Either pass a pure integer here (e.g., ${SLURM_JOB_ID}${slot}) or change the CLI arg/type to accept strings.
```diff
@@ -346,32 +450,22 @@ async def _generate_vllm_async(
     self,
     prompts: List[str],
     system_prompt: str,
     max_new_tokens: int = 2048,
     temperature: Optional[float] = None,
     top_p: Optional[float] = None,
     **kwargs
 ) -> List[GenericResponse]:
-    """ Generate text using the vLLM via the OpenAI server API in async mode.
+    """ Generate text using the vLLM via the OpenAI completions API in async batch mode.
     """
     if not self._vllm_client:
         raise ValueError("vLLM (OpenAI) client not initialized.")

     # Adjust parameters for reasoning models
     temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
         temperature, top_p, max_new_tokens
     )
     is_reasoning = self._is_reasoning_model()

-    # Prepare prompts for reasoning models
+    # Merge system prompt into user prompts for reasoning models
     if is_reasoning and system_prompt is not None:
         prompts = [system_prompt + "\n" + p for p in prompts]

     completion = await self._vllm_client.completions.create(
         model=self._llm_name,
         prompt=prompts,
         temperature=temperature,
         top_p=top_p,
         max_tokens=max_new_tokens,
         **kwargs
     )
```
The vLLM sync path uses the OpenAI responses endpoint (client.responses.create), but the async batch path uses the completions endpoint (client.completions.create). Besides being inconsistent, the async path ignores the system prompt for non-reasoning models (it only merges it for reasoning models), so prompts will differ depending on batch size/async settings. Align both sync/async vLLM generation on a single endpoint and ensure the system prompt is always applied.
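A minimal sketch of one way to do this: keep the batched completions endpoint for the async path but apply the system prompt unconditionally so it matches the sync behavior. This assumes self._vllm_client is an openai.AsyncOpenAI instance pointed at the vLLM server and reuses the helper names from the diff above; it is not the PR's implementation.

```python
# Sketch only: always merge the system prompt, not only for reasoning models,
# so batched/async requests see the same effective prompt as the sync path.
async def _generate_vllm_async(self, prompts, system_prompt,
                               max_new_tokens=2048, temperature=None,
                               top_p=None, **kwargs):
    if not self._vllm_client:
        raise ValueError("vLLM (OpenAI) client not initialized.")

    temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
        temperature, top_p, max_new_tokens
    )

    # Apply the system prompt for every model, not just reasoning ones.
    if system_prompt:
        prompts = [system_prompt + "\n" + p for p in prompts]

    # Conversion of completion.choices into GenericResponse objects omitted.
    return await self._vllm_client.completions.create(
        model=self._llm_name,
        prompt=prompts,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_new_tokens,
        **kwargs,
    )
```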
```python
def _test_vllm_server(self) -> bool:
    """Test if the vLLM server is running."""
    base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    health_url = base + "/health"
    return (
        subprocess.run(
            ["curl", health_url],
            capture_output=True,
            text=True,
            check=False,
        ).returncode == 0
    )
```
_test_vllm_server() shells out to curl without any timeout flags. If DNS/routing hangs, this can block translation startup indefinitely and also assumes curl is installed in the runtime environment. Prefer doing the health check with a Python HTTP client (e.g., requests) and an explicit short timeout.
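A minimal sketch of that suggestion, assuming requests is (or can be made) available in the runtime environment; the 2-second timeout is an illustrative value, not something prescribed by the PR:

```python
import requests  # assumed available; not currently a declared dependency

def _test_vllm_server(self) -> bool:
    """Check the vLLM /health endpoint with a bounded timeout."""
    base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    try:
        # A short timeout keeps a hung DNS lookup or unreachable host from
        # blocking translation startup indefinitely.
        return requests.get(base + "/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```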
```python
parser.add_argument("--vllm-keepalive-id", type=int, default=None,
                    help="If set, write the vLLM server PID to a file with this ID instead of "
                         "terminating the server on exit.")
```
--vllm-keepalive-id is declared as type=int, but callers (e.g., the Slurm runner) may want to pass composite IDs like "${SLURM_JOB_ID}_${slot}" for uniqueness. With the current type, those values will be rejected by argparse. Consider making this a str (and updating downstream typing in translators/GeneratorMixin accordingly), since it's only used to name a PID file.
Suggested change:

```diff
-parser.add_argument("--vllm-keepalive-id", type=int, default=None,
-                    help="If set, write the vLLM server PID to a file with this ID instead of "
-                         "terminating the server on exit.")
+parser.add_argument("--vllm-keepalive-id", type=str, default=None,
+                    help="If set, write the vLLM server PID to a file with this identifier "
+                         "instead of terminating the server on exit.")
```
```diff
 if self._dry:
     self._handle_dry_run(prompt, fpath, output_fpaths)
     return

-responses = self.generate(
-    prompt,
-    temperature=DEFAULT_TEMPERATURE,
-    top_p=DEFAULT_TOP_P,
-    n=len(self._output_paths)
-)
+responses = [self.generate(prompt)[0] for _ in self._output_paths]

 self._process_responses(responses, prompt, fpath, output_fpaths, chunk, chunk_id)
```
NaiveTranslator now produces multiple outputs by calling self.generate(prompt) once per output path. This turns what used to be a single request with n=... into N separate backend requests, which is significantly slower/costlier and can interact badly with RPM/TPM limits. Consider batching via generate_async (repeat the same prompt N times for vLLM) or reintroducing an n parameter for backends that support it.
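A minimal sketch of the batched alternative, assuming generate_async accepts a list of prompts and returns one response per prompt (as in the async vLLM path above); DEFAULT_TEMPERATURE and DEFAULT_TOP_P are the constants the removed code used:

```python
# Sketch only: one batched request with the prompt repeated once per output
# path, instead of N sequential self.generate(prompt) calls.
responses = self.generate_async(
    [prompt] * len(self._output_paths),
    temperature=DEFAULT_TEMPERATURE,
    top_p=DEFAULT_TOP_P,
)
```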
```python
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
))
logger.addHandler(_handler)
```
The logger setup unconditionally adds a new StreamHandler each time main() runs. If this module is invoked multiple times in the same interpreter (tests, programmatic usage), logs will be duplicated. Guard with if not logger.handlers: (or clear existing handlers) before adding a handler.
Suggested change:

```diff
-_handler = logging.StreamHandler()
-_handler.setFormatter(logging.Formatter(
-    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
-))
-logger.addHandler(_handler)
+if not logger.handlers:
+    _handler = logging.StreamHandler()
+    _handler.setFormatter(logging.Formatter(
+        "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
+    ))
+    logger.addHandler(_handler)
```
```diff
 def _generate_openai(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
     """ Generate text using the OpenAI responses API.
     """
     if not self._openai_client:
         raise ValueError("OpenAI client not initialized.")

     messages = self._format_messages_list(prompt, system_prompt)
     gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
     if self._llm_name.endswith("-thinking"):
         gen_kwargs["model"] = self._llm_name.replace("-thinking", "")
         gen_kwargs["reasoning"] = {"effort": "high"}

-def _prepare_messages_for_reasoning(self, prompt: str, system_prompt: Optional[str]) -> List[Dict[str, str]]:
-    """Prepare messages for reasoning models by merging system prompt."""
-    if self._is_reasoning_model() and system_prompt is not None:
-        return self._format_messages_list(system_prompt + "\n" + prompt, None)
-    return self._format_messages_list(prompt, system_prompt)
     response = self._openai_client.responses.create(**gen_kwargs)

     return [GenericResponse(
         response.output_text,
         response.usage.input_tokens,
         response.usage.output_tokens,
     )]

-def _generate_vllm(
-    self,
-    prompt: str,
-    system_prompt: str,
-    max_new_tokens: int = 2048,
-    temperature: Optional[float] = None,
-    top_p: Optional[float] = None,
-    n: int = 1,
-    **kwargs
-) -> List[GenericResponse]:
-    """ Generate text using the vLLM via the OpenAI server API.
-
+def _generate_vllm(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
+    """ Generate text using the vLLM via the OpenAI responses API.
     """
+    from openai import APITimeoutError
     if not self._vllm_client:
         raise ValueError("vLLM (OpenAI) client not initialized.")

-    # Adjust parameters for reasoning models
-    temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
-        temperature, top_p, max_new_tokens
-    )
+    messages = self._format_messages_list(prompt, system_prompt)
+    gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
+    if "gpt-oss" in self._llm_name:
```
_generate_openai() / _generate_vllm() no longer set any explicit output token limit (previously max_new_tokens/max_tokens was passed). This means the backend default will apply, which can lead to unexpectedly truncated translations or unexpectedly large/expensive outputs depending on the provider defaults. Consider reinstating an explicit max-output-tokens parameter (and optionally exposing it via CLI/config) and passing it through gen_kwargs.
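A minimal sketch of one way to reinstate the cap, assuming a hypothetical self._max_output_tokens attribute populated from a new CLI flag or YAML config entry, and assuming the Responses API's max_output_tokens parameter is the intended knob:

```python
# Sketch only: pass an explicit output-token cap instead of relying on
# provider defaults. `self._max_output_tokens` is a hypothetical setting,
# not something present in this PR.
gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
if getattr(self, "_max_output_tokens", None) is not None:
    gen_kwargs["max_output_tokens"] = self._max_output_tokens
response = self._openai_client.responses.create(**gen_kwargs)
```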
```diff
 def _generate_translations(self, prompts: List[str]) -> Tuple[List[str], List[Union[str, None]]]:
     """Generate translations using the LLM."""
-    response_obs = self.generate_async(prompts, temperature=0.2, top_p=0.95)
+    response_obs = self.generate_async(prompts)
```
This call used to pass explicit sampling params (temperature, top_p) but now relies on backend defaults via generate_async(prompts).
If stable/low-variance translations are required, consider keeping the prior defaults (or making them configurable) and passing them through to GeneratorMixin via kwargs.
Suggested change:

```diff
-response_obs = self.generate_async(prompts)
+generate_kwargs = {}
+temperature = getattr(self, "_temperature", getattr(self, "temperature", None))
+if temperature is not None:
+    generate_kwargs["temperature"] = temperature
+top_p = getattr(self, "_top_p", getattr(self, "top_p", None))
+if top_p is not None:
+    generate_kwargs["top_p"] = top_p
+response_obs = self.generate_async(prompts, **generate_kwargs)
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ikhillan left a comment:
PR looks generally good. I think Copilot's comments are definitely things to consider so I won't approve for now and let Josh decide whether to incorporate Copilot's changes.
Changes made: