Skip to content

Fix single-machine cross-process registrations#69

Open
alberthli wants to merge 12 commits intorai-opensource:mainfrom
alberthli:fix-task-registration
Open

Fix single-machine cross-process registrations#69
alberthli wants to merge 12 commits intorai-opensource:mainfrom
alberthli:fix-task-registration

Conversation

@alberthli
Copy link
Copy Markdown
Collaborator

@alberthli alberthli commented Jul 11, 2025

This PR addresses broken CLI-based custom registrations on a single machine for the following. This addresses #68.

  • Tasks
  • Task-specific overrides
  • Update CHANGELOG.md
  • Update docs to explain sharp edge that you shouldn't use relative imports. Also indicate that if you are using multiple machines, you should currently only use the hydra-based override system.

Previously, these overrides were only loosely tested in a single process, but the propagation to other processes was not investigated (we only investigated the hydra-based solution deeply). This PR does the following to resolve this issue:

  • It saves the custom task and config definitions to file
  • This file is opened by all other processes, retrieving the custom definition

This PR currently does NOT solve the following, but we will want to soon, especially for mobile robots:

  • custom task registrations across machines

This simple test case passes for me currently:

from dataclasses import dataclass

from judo.config import set_config_overrides
from judo.optimizers.cem import CrossEntropyMethodConfig
from judo.tasks import CylinderPush, CylinderPushConfig


@dataclass
class MyCylinderPushConfig(CylinderPushConfig):
    """Custom configuration for MyCylinderPush task."""

class MyCylinderPush(CylinderPush):
    """Custom task to test registration."""

set_config_overrides(
    "my_cylinder_push",
    CrossEntropyMethodConfig,
    {
        "num_nodes": 5,
    },
)

if __name__ == "__main__":
    from judo.cli import app
    from judo.tasks import get_registered_tasks, register_task

    # on running this app, you should see my_cylinder_push in the dropdown
    # you should also see that num_nodes is set to 5 for it, and all other
    # values are set to global defaults with no overrides (so the task will
    # not perform well, but it will run)
    register_task("my_cylinder_push", MyCylinderPush, MyCylinderPushConfig)
    tasks = get_registered_tasks()
    
    app()

To test on other custom code, do the following:

pip uninstall judo-rai  # uninstall any previous versions in your virtual env
pip install git+https://github.com/alberthli/judo.git@e2ebc4c

Copy link
Copy Markdown
Collaborator

@pculbertson pculbertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one needs to cook a bit longer, per slack discussion

Comment thread judo/config.py
return
try:
data = json.loads(_OVERRIDES_PATH.read_text())
except Exception:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: per the Google guide, I don't like catch-all except blocks. I also try to print or log the exception somewhere if I'm suppressing it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def get_run_id(env_var: str = "JUDO_TASK_RUN_ID") -> str:
"""Return a per-process run-ID and whether it was freshly created."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this docstring seems stale

Comment thread judo/tasks/__init__.py
_registered_tasks[name] = (task_cls, cfg_cls)


def _persist_ephemeral_registry() -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: persist doesn't really capture what this function is doing. maybe sync or update?

Comment thread judo/tasks/__init__.py
def register_task(name: str, task_type: Type[Task], task_config_type: Type[TaskConfig]) -> None:
"""Registers a new task."""
"""Register a new task with the Judo framework for this run only."""
_registered_tasks[name] = (task_type, task_config_type)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a _load_ephemeral_registry call here? Otherwise registry entries only get read once from file whenever __init__ is imported (per thread?) right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could change the logic of _persist_ephemeral_registry() to first read in what's currently in the file, merge it with what's in _registered_tasks in this thread + write it back?

directory.mkdir(parents=True, exist_ok=True)
path = directory / f"{prefix}_{run_id}.json"

# remove stale siblings
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should print a warning that you're doing this? I can't imagine why someone would do this, but if they happened to write an unrelated file {prefix}_{something_else}.json it's going to get deleted here. That'd be a hell of a thing to debug lol

Comment thread judo/tasks/__init__.py
return
try:
data = json.loads(_CUSTOM_REGISTRY_PATH.read_text())
except Exception:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same nit re: bare exceptions. not sure this should be suppressed if it errors...

Comment thread judo/tasks/__init__.py

for name, info in data.items():
# load Task class
task_mod = info.get("task_mod")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will always be None, right? I don't see task_mod anywhere else in the codebase - is it vestigial?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants