Revise the gen mixin class to support additional features, models #27
Conversation
Pull request overview
This PR expands the translation/generation infrastructure to support more models and workflows (notably vLLM), adds HPC run scripts/configs, and standardizes runtime output on Python logging.
Changes:
- Add vLLM launch/keepalive support and API key/base URL overrides in GeneratorMixin, plus token counting via tiktoken.
- Replace print() usage with structured logging across translation components and add a CLI --log-level flag.
- Add Perlmutter vLLM YAML configs and Slurm scripts for translation sweeps.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| uv.lock | Locks new dependency additions (notably tiktoken and its deps). |
| pyproject.toml | Adds tiktoken dependency for token counting / limits. |
| translate.sbatch | Adds a single-job translation launcher (currently has a CLI arg typo). |
| translate-sweep.sbatch | Adds Slurm array sweep driver for translation experiments. |
| translate-runner.sh | Adds per-array-task runner, including vLLM keepalive integration (currently mismatched with CLI type). |
| src/translate/translator.py | Introduces module logger for consistent logging. |
| src/translate/translate.py | Adds logging configuration + shared GeneratorMixin CLI flags (API key / base URL / vLLM server options). |
| src/translate/top_down_agentic/utils.py | Migrates utility prints to logging. |
| src/translate/top_down_agentic/top_down_agentic.py | Plumbs new GeneratorMixin args + replaces prints with logging; changes generation call parameters. |
| src/translate/top_down_agentic/dependency_agent.py | Migrates dependency tracing output from prints to logging. |
| src/translate/top_down_agentic/context_agent.py | Migrates to logging; changes context generation call parameters. |
| src/translate/top_down_agentic/chunk_agent.py | Migrates chunking status output from prints to logging. |
| src/translate/swe_agent/swe_agent_translator.py | Migrates SWE-agent workflow output from prints to logging. |
| src/translate/repo.py | Introduces module logger for consistent logging. |
| src/translate/naive/naive_translator.py | Plumbs new GeneratorMixin args + migrates to logging; changes multi-output generation behavior. |
| src/translate/generator_mixin.py | Major backend changes: OpenAI/vLLM client overrides, optional internal vLLM server launch, encoder/token-limit helpers, logging, and new request retry behavior. |
| config/perlmutter-vllm-qwen.yaml | Adds Perlmutter vLLM config for Qwen/Qwen3-Coder-Next. |
| config/perlmutter-vllm-oss.yaml | Adds Perlmutter vLLM config for openai/gpt-oss-120b. |
| config/perlmutter-vllm-nemo.yaml | Adds Perlmutter vLLM config for NVIDIA Nemotron + custom reasoning parser plugin. |
| config/perlmutter-vllm-glm.yaml | Adds Perlmutter vLLM config for GLM. |
| config/nano_v3_reasoning_parser.py | Adds custom reasoning parser plugin for vLLM/Nemotron. |
| CLAUDE.md | Adds repository guidance for Claude Code users. |
| .gitignore | Ignores local test outputs. |
```bash
# A keepalive ID unique to this array task causes GeneratorMixin to write a
# PID file for the vLLM server so it persists across all num_translations calls.
local keepalive_id="${SLURM_JOB_ID}_${slot}"

for i in $(seq 0 $((num_translations - 1))); do
    python src/translate/translate.py \
        -i "targets/${app_name}/${src_model}/" \
        -o ../restate-results/ \
        -c "targets/${app_name}/${dst_model}/" \
        --method "${method}" \
        --src-model "${src_model}" \
        --dst-model "${dst_model}" \
        -n 1 \
        --output-id "${i}" \
        --app-name "${app_name}" \
        --vllm-environment ../serve/.venv/ \
        --vllm-yaml-config config/perlmutter-vllm-oss.yaml \
        --vllm-keepalive-id "${keepalive_id}" \
        --naive-backend vllm \
        --naive-llm-name openai/gpt-oss-120b
```
keepalive_id is built as a string like "${SLURM_JOB_ID}_${slot}", but translate.py defines --vllm-keepalive-id as an int. This will fail argparse validation for job arrays. Either pass a pure integer here (e.g., ${SLURM_JOB_ID}${slot}) or change the CLI arg/type to accept strings.
```diff
@@ -346,32 +450,22 @@ async def _generate_vllm_async(
     self,
     prompts: List[str],
     system_prompt: str,
     max_new_tokens: int = 2048,
     temperature: Optional[float] = None,
     top_p: Optional[float] = None,
     **kwargs
 ) -> List[GenericResponse]:
-    """ Generate text using the vLLM via the OpenAI server API in async mode.
+    """ Generate text using the vLLM via the OpenAI completions API in async batch mode.
     """
     if not self._vllm_client:
         raise ValueError("vLLM (OpenAI) client not initialized.")

     # Adjust parameters for reasoning models
     temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
         temperature, top_p, max_new_tokens
     )
     is_reasoning = self._is_reasoning_model()

-    # Prepare prompts for reasoning models
+    # Merge system prompt into user prompts for reasoning models
     if is_reasoning and system_prompt is not None:
         prompts = [system_prompt + "\n" + p for p in prompts]

     completion = await self._vllm_client.completions.create(
         model=self._llm_name,
         prompt=prompts,
         temperature=temperature,
         top_p=top_p,
         max_tokens=max_new_tokens,
         **kwargs
     )
```
The vLLM sync path uses the OpenAI responses endpoint (client.responses.create), but the async batch path uses the completions endpoint (client.completions.create). Besides being inconsistent, the async path ignores the system prompt for non-reasoning models (it only merges it for reasoning models), so prompts will differ depending on batch size/async settings. Align both sync/async vLLM generation on a single endpoint and ensure the system prompt is always applied.
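A minimal sketch of one way to do this: keep the batched completions endpoint for the async path but apply the system prompt unconditionally so it matches the sync behavior. This assumes self._vllm_client is an openai.AsyncOpenAI instance pointed at the vLLM server and reuses the helper names from the diff above; it is not the PR's implementation.

```python
# Sketch only: always merge the system prompt, not only for reasoning models,
# so batched/async requests see the same effective prompt as the sync path.
async def _generate_vllm_async(self, prompts, system_prompt,
                               max_new_tokens=2048, temperature=None,
                               top_p=None, **kwargs):
    if not self._vllm_client:
        raise ValueError("vLLM (OpenAI) client not initialized.")

    temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
        temperature, top_p, max_new_tokens
    )

    # Apply the system prompt for every model, not just reasoning ones.
    if system_prompt:
        prompts = [system_prompt + "\n" + p for p in prompts]

    # Conversion of completion.choices into GenericResponse objects omitted.
    return await self._vllm_client.completions.create(
        model=self._llm_name,
        prompt=prompts,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_new_tokens,
        **kwargs,
    )
```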
```python
def _test_vllm_server(self) -> bool:
    """Test if the vLLM server is running."""
    base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    health_url = base + "/health"
    return (
        subprocess.run(
            ["curl", health_url],
            capture_output=True,
            text=True,
            check=False,
        ).returncode == 0
    )
```
_test_vllm_server() shells out to curl without any timeout flags. If DNS/routing hangs, this can block translation startup indefinitely and also assumes curl is installed in the runtime environment. Prefer doing the health check with a Python HTTP client (e.g., requests) and an explicit short timeout.
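A minimal sketch of that suggestion, assuming requests is (or can be made) available in the runtime environment; the 2-second timeout is an illustrative value, not something prescribed by the PR:

```python
import requests  # assumed available; not currently a declared dependency

def _test_vllm_server(self) -> bool:
    """Check the vLLM /health endpoint with a bounded timeout."""
    base = (self._api_base_url or VLLM_BASE_URL).rstrip("/")
    if base.endswith("/v1"):
        base = base[:-3]
    try:
        # A short timeout keeps a hung DNS lookup or unreachable host from
        # blocking translation startup indefinitely.
        return requests.get(base + "/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```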
```python
parser.add_argument("--vllm-keepalive-id", type=int, default=None,
                    help="If set, write the vLLM server PID to a file with this ID instead of "
                         "terminating the server on exit.")
```
--vllm-keepalive-id is declared as type=int, but callers (e.g., the Slurm runner) may want to pass composite IDs like "${SLURM_JOB_ID}_${slot}" for uniqueness. With the current type, those values will be rejected by argparse. Consider making this a str (and updating downstream typing in translators/GeneratorMixin accordingly), since it's only used to name a PID file.
Suggested change:

```diff
-parser.add_argument("--vllm-keepalive-id", type=int, default=None,
-                    help="If set, write the vLLM server PID to a file with this ID instead of "
-                         "terminating the server on exit.")
+parser.add_argument("--vllm-keepalive-id", type=str, default=None,
+                    help="If set, write the vLLM server PID to a file with this identifier "
+                         "instead of terminating the server on exit.")
```
```diff
 if self._dry:
     self._handle_dry_run(prompt, fpath, output_fpaths)
     return

-responses = self.generate(
-    prompt,
-    temperature=DEFAULT_TEMPERATURE,
-    top_p=DEFAULT_TOP_P,
-    n=len(self._output_paths)
-)
+responses = [self.generate(prompt)[0] for _ in self._output_paths]

 self._process_responses(responses, prompt, fpath, output_fpaths, chunk, chunk_id)
```
NaiveTranslator now produces multiple outputs by calling self.generate(prompt) once per output path. This turns what used to be a single request with n=... into N separate backend requests, which is significantly slower/costlier and can interact badly with RPM/TPM limits. Consider batching via generate_async (repeat the same prompt N times for vLLM) or reintroducing an n parameter for backends that support it.
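A minimal sketch of the batched alternative, assuming generate_async accepts a list of prompts and returns one response per prompt (as in the async vLLM path above); DEFAULT_TEMPERATURE and DEFAULT_TOP_P are the constants the removed code used:

```python
# Sketch only: one batched request with the prompt repeated once per output
# path, instead of N sequential self.generate(prompt) calls.
responses = self.generate_async(
    [prompt] * len(self._output_paths),
    temperature=DEFAULT_TEMPERATURE,
    top_p=DEFAULT_TOP_P,
)
```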
```python
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
))
logger.addHandler(_handler)
```
The logger setup unconditionally adds a new StreamHandler each time main() runs. If this module is invoked multiple times in the same interpreter (tests, programmatic usage), logs will be duplicated. Guard with if not logger.handlers: (or clear existing handlers) before adding a handler.
Suggested change:

```diff
-_handler = logging.StreamHandler()
-_handler.setFormatter(logging.Formatter(
-    "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
-))
-logger.addHandler(_handler)
+if not logger.handlers:
+    _handler = logging.StreamHandler()
+    _handler.setFormatter(logging.Formatter(
+        "%(asctime)s [%(levelname)s] %(message)s", datefmt="%H:%M:%S"
+    ))
+    logger.addHandler(_handler)
```
```diff
 def _generate_openai(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
     """ Generate text using the OpenAI responses API.
     """
     if not self._openai_client:
         raise ValueError("OpenAI client not initialized.")

     messages = self._format_messages_list(prompt, system_prompt)
     gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
     if self._llm_name.endswith("-thinking"):
         gen_kwargs["model"] = self._llm_name.replace("-thinking", "")
         gen_kwargs["reasoning"] = {"effort": "high"}

-def _prepare_messages_for_reasoning(self, prompt: str, system_prompt: Optional[str]) -> List[Dict[str, str]]:
-    """Prepare messages for reasoning models by merging system prompt."""
-    if self._is_reasoning_model() and system_prompt is not None:
-        return self._format_messages_list(system_prompt + "\n" + prompt, None)
-    return self._format_messages_list(prompt, system_prompt)
     response = self._openai_client.responses.create(**gen_kwargs)

     return [GenericResponse(
         response.output_text,
         response.usage.input_tokens,
         response.usage.output_tokens,
     )]

-def _generate_vllm(
-    self,
-    prompt: str,
-    system_prompt: str,
-    max_new_tokens: int = 2048,
-    temperature: Optional[float] = None,
-    top_p: Optional[float] = None,
-    n: int = 1,
-    **kwargs
-) -> List[GenericResponse]:
-    """ Generate text using the vLLM via the OpenAI server API.
-
+def _generate_vllm(self, prompt: str, system_prompt: str, **kwargs) -> List[GenericResponse]:
+    """ Generate text using the vLLM via the OpenAI responses API.
     """
+    from openai import APITimeoutError
     if not self._vllm_client:
         raise ValueError("vLLM (OpenAI) client not initialized.")

-    # Adjust parameters for reasoning models
-    temperature, top_p, max_new_tokens = self._adjust_parameters_for_reasoning(
-        temperature, top_p, max_new_tokens
-    )
+    messages = self._format_messages_list(prompt, system_prompt)
+    gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
+    if "gpt-oss" in self._llm_name:
```
_generate_openai() / _generate_vllm() no longer set any explicit output token limit (previously max_new_tokens/max_tokens was passed). This means the backend default will apply, which can lead to unexpectedly truncated translations or unexpectedly large/expensive outputs depending on the provider defaults. Consider reinstating an explicit max-output-tokens parameter (and optionally exposing it via CLI/config) and passing it through gen_kwargs.
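A minimal sketch of one way to reinstate the cap, assuming a hypothetical self._max_output_tokens attribute populated from a new CLI flag or YAML config entry, and assuming the Responses API's max_output_tokens parameter is the intended knob:

```python
# Sketch only: pass an explicit output-token cap instead of relying on
# provider defaults. `self._max_output_tokens` is a hypothetical setting,
# not something present in this PR.
gen_kwargs: Dict[str, Any] = {"model": self._llm_name, "input": messages, **kwargs}
if getattr(self, "_max_output_tokens", None) is not None:
    gen_kwargs["max_output_tokens"] = self._max_output_tokens
response = self._openai_client.responses.create(**gen_kwargs)
```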
```diff
 def _generate_translations(self, prompts: List[str]) -> Tuple[List[str], List[Union[str, None]]]:
     """Generate translations using the LLM."""
-    response_obs = self.generate_async(prompts, temperature=0.2, top_p=0.95)
+    response_obs = self.generate_async(prompts)
```
This call used to pass explicit sampling params (temperature, top_p) but now relies on backend defaults via generate_async(prompts).
If stable/low-variance translations are required, consider keeping the prior defaults (or making them configurable) and passing them through to GeneratorMixin via kwargs.
Suggested change:

```diff
-response_obs = self.generate_async(prompts)
+generate_kwargs = {}
+temperature = getattr(self, "_temperature", getattr(self, "temperature", None))
+if temperature is not None:
+    generate_kwargs["temperature"] = temperature
+top_p = getattr(self, "_top_p", getattr(self, "top_p", None))
+if top_p is not None:
+    generate_kwargs["top_p"] = top_p
+response_obs = self.generate_async(prompts, **generate_kwargs)
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ikhillan left a comment:
PR looks generally good. I think Copilot's comments are definitely things to consider so I won't approve for now and let Josh decide whether to incorporate Copilot's changes.
Changes made: