Getting started with the batch system:
Determined-AI User Guide

Introduction

We are currently using Determined AI to manage our GPU Cluster.

You can open the dashboard (a.k.a. the WebUI) at the following URL and log in:

https://gpu.lins.lab/

Determined is an open-source deep learning training platform (acquired by Hewlett Packard Enterprise in 2021) that helps researchers train models more quickly, share GPU resources easily, and collaborate more effectively.

Monitoring

You can check the real-time utilization of the cluster in the Grafana dashboard.

User Account

Ask for your account

You need to ask the system admin (Yufan Wang) for your user account.

Tips

  • Once you have your cluster account, you can configure your own job environment. Some guidelines can be found here.
  • We have a basic GPU job monitor, and the reserved container will be terminated if all GPUs are idle for 2 hours.
    • We report the container status through Slack. If you do not want these notifications in the Slack channel to disturb you, please consider this setting.

Authentication

WebUI

The WebUI will automatically redirect users to a login page if there is no valid Determined session established on that browser. After logging in, the user will be redirected to the URL they initially attempted to access.

CLI

Users can also interact with Determined using a command-line interface (CLI). The CLI is distributed as a Python wheel package; once the wheel has been installed, the CLI can be used via the det command.

You can use the CLI either on the login node or on your local development machine.

  1. Installation

    The CLI can be installed via pip:

    pip install determined
  2. (Optional) Configure environment variable

    If you are using your own PC, you need to add the environment variable DET_MASTER=10.0.2.168. If you are using the login node, no configuration is required, because the system administrator has already configured this globally on the login node.

    On Linux and other Unix-like systems, including macOS: if you use bash, append this line to the end of ~/.bashrc (most systems) or ~/.bash_profile (some macOS versions); if you use zsh, append it to the end of ~/.zshrc:

    export DET_MASTER=10.0.2.168

    On Windows, you can follow this tutorial.

  3. Log in

    In the CLI, the user login subcommand can be used to authenticate a user:

    det user login <username>

    Note: If you did not configure the environment variable, you need to specify the master's IP:

    det -m 10.0.2.168 user login <username>
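
The choice between the two login forms can be sketched as a small Python helper. This is only an illustration, not part of Determined: build_login_command is a hypothetical name, and it merely assembles the det command lines shown above without calling any Determined API.

```python
import os

# Master address from this guide; used only when DET_MASTER is not set.
DEFAULT_MASTER = "10.0.2.168"


def build_login_command(username, env=None):
    """Return the `det user login` command as an argument list.

    Hypothetical helper: it only mirrors the two commands in this guide.
    """
    env = os.environ if env is None else env
    command = ["det"]
    if not env.get("DET_MASTER"):
        # No DET_MASTER configured, so pass the master address explicitly.
        command += ["-m", DEFAULT_MASTER]
    command += ["user", "login", username]
    return command


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) if you want to actually log in.
    print(" ".join(build_login_command("alice")))
```

Returning a list rather than a string keeps the command safe to hand to subprocess.run without shell quoting.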

Changing passwords

Users have blank passwords by default. If desired, a user can change their own password using the user change-password subcommand:

det user change-password

Submitting Tasks

Diagram of submitting task

Task Configuration Template

Here is a template of a task configuration file, in YAML format:

description: <task_name>
resources:
    slots: 1
    resource_pool: 64c128t_512_3090
    shm_size: 4G
bind_mounts:
    - host_path: /home/<username>/
      container_path: /run/determined/workdir/home/
    - host_path: /labdata0/<project_name>/
      container_path: /run/determined/workdir/data/
environment:
    image: determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-0.19.4

Notes:

  • Replace <task_name>, <username>, and <project_name> with your own values.
  • resources.slots is the number of GPUs you are requesting, which is set to 1 here.
  • resources.resource_pool is the resource pool you are requesting to use. Currently, we have two resource pools: 64c128t_512_3090 and 64c128t_512_4090.
  • resources.shm_size is set to 4G by default. You may need a larger size if, for example, you use multiple dataloader workers in PyTorch.
  • bind_mounts maps your home directory and the dataset directory (/labdata0) into the container.
  • In environment.image, an official image by Determined AI is used. Determined AI provides Docker images that include common deep-learning libraries and frameworks. You can also build a custom image based on your project dependencies, which is discussed in this tutorial: Custom Containerized Environment
  • How bind_mounts works:

Storage Model
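
To illustrate the mapping, the sketch below translates a host path into the path the same file has inside the container, using the two bind_mounts entries from the template. The function name to_container_path, the user alice, and the project my_project are hypothetical; Determined performs the actual mounting, and this only shows where files end up.

```python
from pathlib import PurePosixPath

# (host_path, container_path) pairs from the template above, with the
# hypothetical values <username>=alice and <project_name>=my_project.
BIND_MOUNTS = [
    ("/home/alice/", "/run/determined/workdir/home/"),
    ("/labdata0/my_project/", "/run/determined/workdir/data/"),
]


def to_container_path(host_path, bind_mounts=BIND_MOUNTS):
    """Return where a host file appears in the container, or None if unmounted."""
    path = PurePosixPath(host_path)
    for host_root, container_root in bind_mounts:
        try:
            relative = path.relative_to(host_root)
        except ValueError:
            continue  # this mount does not cover the path
        return str(PurePosixPath(container_root) / relative)
    return None
```

For example, to_container_path("/home/alice/code/train.py") gives /run/determined/workdir/home/code/train.py, while a path outside both mounts gives None because it is not visible in the container.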

Submit

Save the YAML configuration to, let's say, test_task.yaml. You can start a Jupyter Notebook (Lab) environment or a simple shell environment. A notebook is a web interface and thus more user-friendly. However, you can use Visual Studio Code or PyCharm to connect to a shell environment[3], which brings more flexibility and productivity if you are familiar with these editors.

For a notebook (try to avoid running notebooks through Determined AI because of some privacy issues):

    det notebook start --config-file test_task.yaml

For shell (strongly recommended):

    det shell start --config-file test_task.yaml
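
If you create task files often, the template can be filled in programmatically. The following is a minimal sketch using only the Python standard library; render_task_config and its parameters are hypothetical names, and the field values are copied from the template above.

```python
from string import Template

# Same fields as the task configuration template in this guide.
TASK_TEMPLATE = Template("""\
description: $task_name
resources:
    slots: $slots
    resource_pool: $resource_pool
    shm_size: $shm_size
bind_mounts:
    - host_path: /home/$username/
      container_path: /run/determined/workdir/home/
    - host_path: /labdata0/$project_name/
      container_path: /run/determined/workdir/data/
environment:
    image: determinedai/environments:cuda-11.3-pytorch-1.10-tf-2.8-gpu-0.19.4
""")


def render_task_config(task_name, username, project_name, slots=1,
                       resource_pool="64c128t_512_3090", shm_size="4G"):
    """Return the YAML text for one task (hypothetical helper)."""
    return TASK_TEMPLATE.substitute(
        task_name=task_name, username=username, project_name=project_name,
        slots=slots, resource_pool=resource_pool, shm_size=shm_size,
    )


if __name__ == "__main__":
    # Redirect this output to test_task.yaml, then submit it with:
    #   det shell start --config-file test_task.yaml
    print(render_task_config("alice-debug", "alice", "my_project"))
```

Including your name in task_name, as in alice-debug, follows the naming request in this guide.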

To keep the cluster pleasant for everyone, please:

  • avoid being a root user in your tasks/pods/containers.
  • carefully check your code and avoid occupying too many CPU cores.
  • try to use OMP_NUM_THREADS=2 MKL_NUM_THREADS=2 python <your_code.py>.
  • include your name in your <task_name>.
  • ...
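
The thread limits can also be set from inside the script, which helps when you forget the prefix on the command line. This is a sketch under the assumption that the variables are set before NumPy, PyTorch, or any other library that reads them is imported; limit_cpu_threads is a hypothetical helper name.

```python
import os


def limit_cpu_threads(num_threads=2):
    """Set OMP_NUM_THREADS and MKL_NUM_THREADS unless they are already set."""
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        # setdefault keeps a value you exported on the command line.
        os.environ.setdefault(name, str(num_threads))


# Call this at the very top of your script, before importing numpy or torch,
# because such libraries typically read these variables when they load.
limit_cpu_threads(2)
```

Using setdefault means an explicit OMP_NUM_THREADS=... on the command line still takes precedence over the value in the script.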

Managing Tasks

You are encouraged to check out more operations of Determined AI in the API docs, e.g.,

  • det task
  • det shell open [task id]
  • det shell kill [task id]

Now you can see your task pending/running on the WebUI dashboard, where you can also manage your tasks.

Connect to a shell task

You can use Visual Studio Code or PyCharm to connect to a shell task.

You also need to install and use the determined package on your local computer to obtain the SSH IdentityFile, which the next section requires.

First-time setup of connecting VS Code to a shell task

  1. First, install the Remote-SSH extension.

  2. Check the UUID of your tasks:

    det shell list
  3. Get the ssh command for the task with the UUID above (it also generates an SSH IdentityFile on your PC):

    det shell show_ssh_command <UUID>

    The results should follow this pattern:

    ssh -o "ProxyCommand=<YOUR PROXY COMMAND>" \
        -o StrictHostKeyChecking=no \
        -tt \
        -o IdentitiesOnly=yes \
        -i <YOUR KEY PATH> \
        -p <YOUR PORT NUMBER> \
        <YOUR USERNAME>@<YOUR SHELL HOST NAME (UUID)>
  4. Add the shell task as a new SSH host:

    Click the SSH button in the bottom-left corner:

    SSH button

    Select Connect to Host -> + Add New SSH Host:

    SSH connect to host

    SSH add new host

    Paste the SSH command generated by det shell show_ssh_command above into the dialog window:

    SSH enter command

    Then choose your ssh configuration file to update:

    SSH select config

    You can continue to edit your ssh configuration file, e.g. add a custom name:

    SSH config before

    SSH config after
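
VS Code writes the entry for you, but it helps to see how the command options map to ssh_config keywords, for example when you prepare the entry by hand. The sketch below does this mapping with the Python standard library. ssh_command_to_config is a hypothetical helper, it handles only the options that appear in the pattern above, and any command you pass in a test should use placeholder values.

```python
import shlex


def ssh_command_to_config(command, alias=None):
    """Convert an `ssh ...` command line into an ssh_config Host entry."""
    # Join backslash-continued lines, then split like a shell would.
    tokens = shlex.split(command.replace("\\\n", " "))[1:]  # drop "ssh"
    options = []
    target = None
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-o":
            key, _, value = tokens[i + 1].partition("=")
            options.append((key, value))
            i += 2
        elif token == "-i":
            options.append(("IdentityFile", tokens[i + 1]))
            i += 2
        elif token == "-p":
            options.append(("Port", tokens[i + 1]))
            i += 2
        elif token == "-tt":
            options.append(("RequestTTY", "force"))  # -tt forces a TTY
            i += 1
        else:
            target = token  # the user@host argument
            i += 1
    user, _, host = target.rpartition("@")
    # The alias is the custom name you can pick, e.g. TestEnv.
    lines = [f"Host {alias or host}", f"    HostName {host}", f"    User {user}"]
    lines += [f"    {key} {value}" for key, value in options]
    return "\n".join(lines)
```

Passing alias="TestEnv" produces an entry whose Host line carries the custom name while HostName keeps the task UUID.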

Update the setup of connecting VS Code to a shell task

  1. Check the UUID of your tasks:

    det shell list
  2. Get the new ssh command:

    det shell show_ssh_command <UUID>
  3. In your SSH configuration file, replace the old UUID with the new one (with Ctrl + H):

    ssh config update
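
The same replacement can be scripted instead of done with Ctrl + H. This is a minimal sketch; replace_task_uuid and update_ssh_config are hypothetical helpers, and the UUIDs you pass in come from det shell list.

```python
from pathlib import Path


def replace_task_uuid(config_text, old_uuid, new_uuid):
    """Return the SSH config text with every old task UUID replaced."""
    if old_uuid not in config_text:
        raise ValueError(f"{old_uuid} not found in the SSH config")
    return config_text.replace(old_uuid, new_uuid)


def update_ssh_config(path, old_uuid, new_uuid):
    """Rewrite the SSH config file in place, keeping a .bak copy."""
    config_path = Path(path).expanduser()
    original = config_path.read_text()
    # Save the previous version next to the config before overwriting it.
    config_path.with_name(config_path.name + ".bak").write_text(original)
    config_path.write_text(replace_task_uuid(original, old_uuid, new_uuid))
```

A call such as update_ssh_config("~/.ssh/config", old_uuid, new_uuid) updates every occurrence, so the HostName and ProxyCommand lines stay consistent.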

Connect PyCharm to a shell task

  1. As of the current version, PyCharm lacks support for custom options in SSH commands via the UI. Therefore, you must provide them via an entry in your ssh_config file. You can generate this entry by following the steps in First-time setup of connecting VS Code to a shell task.

  2. In PyCharm, open Settings/Preferences > Tools > SSH Configurations.

  3. Select the plus icon to add a new configuration.

  4. Enter YOUR SHELL HOST NAME (UUID), YOUR PORT NUMBER (fill in 22 here), and YOUR USERNAME in the corresponding fields. (P.S. you can change YOUR SHELL HOST NAME (UUID) to the custom name configured in your SSH config, e.g. TestEnv, as shown above.)

  5. Switch the Authentication type dropdown to OpenSSH config and authentication agent.

  6. You can hit Test Connection to test it.

  7. Save the new configuration by clicking OK. Now you can continue to add Python Interpreters with this SSH configuration.

pycharm remote ssh

Port forwarding

You need to forward ports from the task container to your personal computer through the SSH tunnel (managed by Determined) when you want to use services such as TensorBoard that run in your task container.

Here is an example. First, launch a notebook or shell task with the proxy_ports configuration:

environment:
    proxy_ports:
      - proxy_port: 6006
        proxy_tcp: true

where 6006 is the port used by TensorBoard.

Then launch port forwarding on your personal computer with this command:

python -m determined.cli.tunnel --listener 6006 --auth 10.0.2.168 YOUR_TASK_UUID:6006

Remember to change YOUR_TASK_UUID to your task's UUID.

Now you can open TensorBoard (http://localhost:6006) in your web browser.
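
If the page does not load, it can help to check whether anything is listening on the forwarded port. The sketch below uses only the Python standard library; is_port_open is a hypothetical helper, and port 6006 is the TensorBoard port from the example above.

```python
import socket


def is_port_open(host="localhost", port=6006, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Tunnel is up: open http://localhost:6006")
    else:
        print("Nothing is listening on port 6006; is the tunnel command running?")
```

A successful connection only shows that the local listener is running; the service inside the container still has to be started on the proxied port.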

Reference: Exposing custom ports - Determined AI docs

Experiments

(TBA)

[Screenshots: experiments; hyperparameter tuning]