
Revise the gen mixin class to support additional features, models #27

Open
jhdavis8 wants to merge 10 commits into develop from gen-mixin-overhaul

Conversation

jhdavis8 (Collaborator) commented Apr 21, 2026

Changes made:

  • Add a CLAUDE.md file
  • Add vLLM config files for several additional models
  • Improve handling of token limits for reasoning models
  • Add an option to launch the vLLM server internally and optionally keep the server process running after translation (a hypothetical sketch of this keepalive mechanism follows the list)
  • Add options to override the API key and base URL for the OpenAI server
  • Use logging rather than print statements in all built-in translation methods and the generator mixin
  • Add batch scripts to help with running translations
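
A minimal, hypothetical sketch of the keepalive mechanism: record the launched vLLM server's PID in a file keyed by the keepalive ID, so later translation runs reuse the running server instead of launching and tearing one down each time. All names below are illustrative; the actual logic lives in GeneratorMixin.

import os
import subprocess
from typing import List

def launch_vllm_with_keepalive(cmd: List[str], keepalive_id: str, pid_dir: str = ".") -> int:
    """Launch a vLLM server once; reuse it across runs via a PID file."""
    pid_file = os.path.join(pid_dir, f"vllm_server_{keepalive_id}.pid")
    if os.path.exists(pid_file):
        # A server from an earlier run is assumed to still be alive; reuse it.
        with open(pid_file) as f:
            return int(f.read().strip())
    proc = subprocess.Popen(cmd)  # e.g., ["vllm", "serve", "<model-name>"]
    with open(pid_file, "w") as f:  # persist the PID so later runs skip the launch
        f.write(str(proc.pid))
    return proc.pid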

Copilot AI (Contributor) left a comment

Pull request overview

This PR expands the translation/generation infrastructure to support more models and workflows (notably vLLM), adds HPC run scripts/configs, and standardizes runtime output on Python logging.

Changes:

  • Add vLLM launch/keepalive support and API key/base URL overrides in GeneratorMixin, plus token counting via tiktoken (a short sketch follows this list).
  • Replace print() usage with structured logging across the translation components and add a CLI --log-level flag.
  • Add Perlmutter vLLM YAML configs and Slurm scripts for translation sweeps.
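
A minimal sketch of token counting with tiktoken, assuming the model name maps to a registered encoding and falling back to cl100k_base when it does not (the helper name is illustrative):

import tiktoken

def count_prompt_tokens(text: str, model: str) -> int:
    """Count tokens so a prompt can be checked against a model's context limit."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Local/vLLM model names are often unknown to tiktoken; use a default.
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))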

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 9 comments.

Summary per file:

  uv.lock: Locks new dependency additions (notably tiktoken and its dependencies).
  pyproject.toml: Adds the tiktoken dependency for token counting / limits.
  translate.sbatch: Adds a single-job translation launcher (currently has a CLI arg typo).
  translate-sweep.sbatch: Adds a Slurm array sweep driver for translation experiments.
  translate-runner.sh: Adds a per-array-task runner, including vLLM keepalive integration (currently mismatched with the CLI arg type).
  src/translate/translator.py: Introduces a module logger for consistent logging.
  src/translate/translate.py: Adds logging configuration plus shared GeneratorMixin CLI flags (API key / base URL / vLLM server options).
  src/translate/top_down_agentic/utils.py: Migrates utility prints to logging.
  src/translate/top_down_agentic/top_down_agentic.py: Plumbs new GeneratorMixin args and replaces prints with logging; changes generation call parameters.
  src/translate/top_down_agentic/dependency_agent.py: Migrates dependency tracing output from prints to logging.
  src/translate/top_down_agentic/context_agent.py: Migrates to logging; changes context generation call parameters.
  src/translate/top_down_agentic/chunk_agent.py: Migrates chunking status output from prints to logging.
  src/translate/swe_agent/swe_agent_translator.py: Migrates SWE-agent workflow output from prints to logging.
  src/translate/repo.py: Introduces a module logger for consistent logging.
  src/translate/naive/naive_translator.py: Plumbs new GeneratorMixin args and migrates to logging; changes multi-output generation behavior.
  src/translate/generator_mixin.py: Major backend changes: OpenAI/vLLM client overrides, optional internal vLLM server launch, encoder/token-limit helpers, logging, and new request retry behavior.
  config/perlmutter-vllm-qwen.yaml: Adds a Perlmutter vLLM config for Qwen/Qwen3-Coder-Next.
  config/perlmutter-vllm-oss.yaml: Adds a Perlmutter vLLM config for openai/gpt-oss-120b.
  config/perlmutter-vllm-nemo.yaml: Adds a Perlmutter vLLM config for NVIDIA Nemotron plus a custom reasoning parser plugin.
  config/perlmutter-vllm-glm.yaml: Adds a Perlmutter vLLM config for GLM.
  config/nano_v3_reasoning_parser.py: Adds a custom reasoning parser plugin for vLLM/Nemotron.
  CLAUDE.md: Adds repository guidance for Claude Code users.
  .gitignore: Ignores local test outputs.

Comment thread: translate.sbatch (outdated)

Comment thread: translate-runner.sh
Comment on lines +47 to +66
# A keepalive ID unique to this array task causes GeneratorMixin to write a
# PID file for the vLLM server so it persists across all num_translations calls.
local keepalive_id="${SLURM_JOB_ID}_${slot}"

for i in $(seq 0 $((num_translations - 1))); do
    python src/translate/translate.py \
        -i "targets/${app_name}/${src_model}/" \
        -o ../restate-results/ \
        -c "targets/${app_name}/${dst_model}/" \
        --method "${method}" \
        --src-model "${src_model}" \
        --dst-model "${dst_model}" \
        -n 1 \
        --output-id "${i}" \
        --app-name "${app_name}" \
        --vllm-environment ../serve/.venv/ \
        --vllm-yaml-config config/perlmutter-vllm-oss.yaml \
        --vllm-keepalive-id "${keepalive_id}" \
        --naive-backend vllm \
        --naive-llm-name openai/gpt-oss-120b
done

Copilot AI Apr 21, 2026


keepalive_id is built as a string like "${SLURM_JOB_ID}_${slot}", but translate.py defines --vllm-keepalive-id as an int. This will fail argparse validation for job arrays. Either pass a pure integer here (e.g., ${SLURM_JOB_ID}${slot}) or change the CLI arg/type to accept strings.

Comment on lines 409 to 470
@@ -346,32 +450,22 @@ async def _generate_vllm_async(
         self,
         prompts: List[str],
         system_prompt: str,
         max_new_tokens: int = 2048,
         temperature: Optional[float] = None,
         top_p: Optional[float] = None,
         **kwargs
     ) -> List[GenericResponse]:
-        """ Generate text using the vLLM via the OpenAI server API in async mode.
+        """ Generate text using the vLLM via the OpenAI completions API in async batch mode.
         """
         if not self._vllm_client:
             raise ValueError("vLLM (OpenAI) client not initialized.")

         # Adjust parameters for reasoning models
         temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
             temperature, top_p, max_new_tokens
         )
         is_reasoning = self._is_reasoning_model()

-        # Prepare prompts for reasoning models
+        # Merge system prompt into user prompts for reasoning models
         if is_reasoning and system_prompt is not None:
             prompts = [system_prompt + "\n" + p for p in prompts]

         completion = await self._vllm_client.completions.create(
             model=self._llm_name,
             prompt=prompts,
             temperature=temperature,
             top_p=top_p,
             max_tokens=max_new_tokens,
             **kwargs
         )

Copilot AI Apr 21, 2026


The vLLM sync path uses the OpenAI responses endpoint (client.responses.create), but the async batch path uses the completions endpoint (client.completions.create). Besides being inconsistent, the async path ignores the system prompt for non-reasoning models (it only merges it for reasoning models), so prompts will differ depending on batch size/async settings. Align both sync/async vLLM generation on a single endpoint and ensure the system prompt is always applied.
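
One way to align the two paths, sketched under the assumption that both move to the chat completions endpoint and share a message-building helper (the helper name is illustrative, not part of this PR):

def _build_vllm_messages(self, prompt: str, system_prompt: Optional[str]) -> List[Dict[str, str]]:
    """Apply the system prompt uniformly; merge it into the user turn for reasoning models."""
    if self._is_reasoning_model() and system_prompt is not None:
        return [{"role": "user", "content": system_prompt + "\n" + prompt}]
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages

Both paths would then call self._vllm_client.chat.completions.create(model=..., messages=...) with the same message list (the async path awaiting one call per prompt and gathering the results), so the system prompt is applied identically regardless of batch size or async settings.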

Comment on lines +235 to +248
def _test_vllm_server(self) -> bool:
"""Test if the vLLM server is running."""
base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
if base.endswith("/v1"):
base = base[:-3]
health_url = base + "/health"
return (
subprocess.run(
["curl", health_url],
capture_output=True,
text=True,
check=False,
).returncode == 0
)

Copilot AI Apr 21, 2026


_test_vllm_server() shells out to curl without any timeout flags. If DNS/routing hangs, this can block translation startup indefinitely and also assumes curl is installed in the runtime environment. Prefer doing the health check with a Python HTTP client (e.g., requests) and an explicit short timeout.
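
A minimal sketch of the suggested alternative, assuming requests is available in the runtime environment (it is not obviously a declared dependency here):

import requests

def _test_vllm_server(self) -> bool:
    """Return True if the vLLM /health endpoint responds within the timeout."""
    base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    try:
        return requests.get(base + "/health", timeout=5).ok
    except requests.RequestException:
        return False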

Comment on lines +76 to +78
parser.add_argument("--vllm-keepalive-id", type=int, default=None,
                    help="If set, write the vLLM server PID to a file with this ID instead of "
                         "terminating the server on exit.")

Copilot AI Apr 21, 2026


--vllm-keepalive-id is declared as type=int, but callers (e.g., the Slurm runner) may want to pass composite IDs like "${SLURM_JOB_ID}_${slot}" for uniqueness. With the current type, those values will be rejected by argparse. Consider making this a str (and updating downstream typing in the translators/GeneratorMixin accordingly), since it is only used to name a PID file.

Suggested change
-parser.add_argument("--vllm-keepalive-id", type=int, default=None,
-                    help="If set, write the vLLM server PID to a file with this ID instead of "
-                         "terminating the server on exit.")
+parser.add_argument("--vllm-keepalive-id", type=str, default=None,
+                    help="If set, write the vLLM server PID to a file with this identifier "
+                         "instead of terminating the server on exit.")

Comment on lines 527 to 533
         if self._dry:
             self._handle_dry_run(prompt, fpath, output_fpaths)
             return

-        responses = self.generate(
-            prompt,
-            temperature=DEFAULT_TEMPERATURE,
-            top_p=DEFAULT_TOP_P,
-            n=len(self._output_paths)
-        )
+        responses = [self.generate(prompt)[0] for _ in self._output_paths]

         self._process_responses(responses, prompt, fpath, output_fpaths, chunk, chunk_id)

Copilot AI Apr 21, 2026


NaiveTranslator now produces multiple outputs by calling self.generate(prompt) once per output path. This turns what used to be a single request with n=... into N separate backend requests, which is significantly slower/costlier and can interact badly with RPM/TPM limits. Consider batching via generate_async (repeat the same prompt N times for vLLM) or reintroducing an n parameter for backends that support it.
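
A minimal sketch of the batched alternative, assuming generate_async accepts a list of prompts and returns one response per prompt (as the other call sites in this PR suggest):

# One batched request with the prompt repeated, instead of N sequential calls.
prompts = [prompt] * len(self._output_paths)
responses = self.generate_async(prompts)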

Comment on lines +108 to +112
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter(
"%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
))
logger.addHandler(_handler)

Copilot AI Apr 21, 2026


The logger setup unconditionally adds a new StreamHandler each time main() runs. If this module is invoked multiple times in the same interpreter (tests, programmatic usage), logs will be duplicated. Guard with if not logger.handlers: (or clear existing handlers) before adding a handler.

Suggested change
-_handler = logging.StreamHandler()
-_handler.setFormatter(logging.Formatter(
-    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
-))
-logger.addHandler(_handler)
+if not logger.handlers:
+    _handler = logging.StreamHandler()
+    _handler.setFormatter(logging.Formatter(
+        "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
+    ))
+    logger.addHandler(_handler)

Comment on lines +388 to +418
     def _generate_openai(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
         """ Generate text using the OpenAI responses API.
         """
         if not self._openai_client:
             raise ValueError("OpenAI client not initialized.")

         messages = self._format_messages_list(prompt, system_prompt)
         gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
         if self._llm_name.endswith("-thinking"):
             gen_kwargs["model"] = self._llm_name.replace("-thinking", "")
             gen_kwargs["reasoning"] = {"effort": "high"}

-    def _prepare_messages_for_reasoning(self, prompt: str, system_prompt: Optional[str]) -> List[Dict[str, str]]:
-        """Prepare messages for reasoning models by merging system prompt."""
-        if self._is_reasoning_model() and system_prompt is not None:
-            return self._format_messages_list(system_prompt + "\n" + prompt, None)
-        return self._format_messages_list(prompt, system_prompt)
         response = self._openai_client.responses.create(**gen_kwargs)

         return [GenericResponse(
             response.output_text,
             response.usage.input_tokens,
             response.usage.output_tokens,
         )]

-    def _generate_vllm(
-        self,
-        prompt: str,
-        system_prompt: str,
-        max_new_tokens: int = 2048,
-        temperature: Optional[float] = None,
-        top_p: Optional[float] = None,
-        n: int = 1,
-        **kwargs
-    ) -> List[GenericResponse]:
-        """ Generate text using the vLLM via the OpenAI server API.
-
+    def _generate_vllm(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
+        """ Generate text using the vLLM via the OpenAI responses API.
         """
+        from openai import APITimeoutError
         if not self._vllm_client:
             raise ValueError("vLLM (OpenAI) client not initialized.")

-        # Adjust parameters for reasoning models
-        temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
-            temperature, top_p, max_new_tokens
-        )
+        messages = self._format_messages_list(prompt, system_prompt)
+        gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
+        if "gpt-oss" in self._llm_name:

Copilot AI Apr 21, 2026


_generate_openai() / _generate_vllm() no longer set any explicit output token limit (previously max_new_tokens/max_tokens was passed). This means the backend default will apply, which can lead to unexpectedly truncated translations or unexpectedly large/expensive outputs depending on the provider defaults. Consider reinstating an explicit max-output-tokens parameter (and optionally exposing it via CLI/config) and passing it through gen_kwargs.
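
A minimal sketch of reinstating an explicit cap, assuming a hypothetical _max_new_tokens attribute populated from CLI/config; max_output_tokens is the responses-API parameter name for this limit:

gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
# Cap output length explicitly instead of relying on provider defaults;
# callers can still override by passing max_output_tokens in kwargs.
gen_kwargs.setdefault("max_output_tokens", self._max_new_tokens)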

     def _generate_translations(self, prompts: List[str]) -> Tuple[List[str], List[Union[str, None]]]:
         """Generate translations using the LLM."""
-        response_obs = self.generate_async(prompts, temperature=0.2, top_p=0.95)
+        response_obs = self.generate_async(prompts)

Copilot AI Apr 21, 2026


This call used to pass explicit sampling params (temperature, top_p) but now relies on backend defaults via generate_async(prompts).
If stable/low-variance translations are required, consider keeping the prior defaults (or making them configurable) and passing them through to GeneratorMixin via kwargs.

Suggested change
-        response_obs = self.generate_async(prompts)
+        generate_kwargs = {}
+        temperature = getattr(self, "_temperature", getattr(self, "temperature", None))
+        if temperature is not None:
+            generate_kwargs["temperature"] = temperature
+        top_p = getattr(self, "_top_p", getattr(self, "top_p", None))
+        if top_p is not None:
+            generate_kwargs["top_p"] = top_p
+        response_obs = self.generate_async(prompts, **generate_kwargs)

@ikhillan (Collaborator) left a comment


PR looks generally good. Copilot's comments are definitely worth considering, so I won't approve for now; I'll let Josh decide whether to incorporate Copilot's suggested changes.
