This repository was archived by the owner on Jan 7, 2025. It is now read-only.


Job management #240

Merged: jmancewicz merged 1 commit into NVIDIA:master from jmancewicz:job_management on Oct 22, 2015
Conversation

@jmancewicz
Contributor

Initial Job Management page.

There is no link to it

http://localhost:5000/job_management

@lukeyeager
Member

What's the status on this?

@jmancewicz
Contributor Author

The only comment from Andrews was to graph mAP rather than loss. But it sounds like the plan is to show loss unless mAP is available, so that sounds like an enhancement. It could show both if mAP is output.

This could be added to the head of the main page, with the dataset and model jobs just beneath.

@lukeyeager
Member

I think any version of this will be an improvement on the current home page. Can you rebase to fix the merge conflicts and let me know when this is ready for review?

@jmancewicz
Contributor Author

Rebased. I'll move it to the home page next.

@jmancewicz jmancewicz force-pushed the job_management branch 3 times, most recently from 30048c7 to 7838d29 Compare October 2, 2015 02:46
@jmancewicz
Contributor Author

@lukeyeager, can you have a look? I might move the contents of job_management.html into home.html rather than include it.

@lukeyeager
Member

Looks pretty good!

  1. The "GPUs" available widget doesn't update when a job starts/finishes. That's misleading.

     [Screenshot: job-mgmt-gpus-available]

  2. If you refresh the home page, all the loss graphs get reset to blank, even when there is some data to display.

     [Screenshot: job-mgmt-refresh-loss]

  3. The job page tells you how much time is remaining, while the home page tells you how long the job has been running.

     Job page: [Screenshot: job-mgmt-eta]

     Home page: [Screenshot: job-mgmt-ago]

Let's be consistent. How about doing what Google Maps does: tell you (1) the time remaining and (2) the estimated clock time at which it will finish.
http://2.bp.blogspot.com/-dJ7ajrgiRL8/U2E509S7EsI/AAAAAAAAC9I/ScmVKo03F98/s1600/Navigation+with+Lane+Guidance.png
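
A minimal sketch of the Google Maps style display suggested above, assuming progress is available as a fraction between 0 and 1 and that the remaining work proceeds at the average rate seen so far. The function name `estimate_finish` and its parameters are hypothetical and not part of DIGITS.

```python
from datetime import datetime, timedelta


def estimate_finish(started, now, progress):
    """Return (time_remaining, estimated_finish), or (None, None) if unknown.

    Hypothetical helper: `progress` is a fraction in (0, 1].
    """
    if not 0 < progress <= 1:
        return None, None
    elapsed = now - started
    # Linear extrapolation from the average rate observed so far.
    remaining = elapsed * ((1 - progress) / progress)
    return remaining, now + remaining


started = datetime(2015, 10, 2, 12, 0)
now = datetime(2015, 10, 2, 12, 15)
remaining, finish = estimate_finish(started, now, 0.25)
print(remaining, finish)
```

Both values come from the same extrapolation, so the job page and the home page could show them side by side without disagreeing.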

digits/job.py Outdated
Member

I'm confused by this function. If you have tasks A and B, and A is B's parent, wouldn't this function return [A,B,A]? Is that what you want? Don't you just want the list at job.tasks?

Contributor Author

Wait, so if A is B's parent, then it is also in the job's list of tasks? That sounds like the node is in the graph twice.

Member

Yes, all tasks are in the job.tasks list. The task.parents field is optional, and may describe dependencies between tasks. The tasks are not necessarily a fully-connected graph. We can change this behavior if you have a good reason for it.
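
To illustrate the point, here is a small sketch assuming a simplified `Task` class with an optional `parents` list. The class and the `tasks_recursively` function are hypothetical stand-ins, not the removed DIGITS method. Because every task already appears in the job's task list, walking the parents as well visits shared tasks more than once.

```python
class Task:
    """Hypothetical stand-in for a task with optional dependencies."""

    def __init__(self, name, parents=None):
        self.name = name
        self.parents = parents or []


def tasks_recursively(tasks):
    """Naive traversal: each task followed by its parents, recursively."""
    result = []
    for task in tasks:
        result.append(task)
        result.extend(tasks_recursively(task.parents))
    return result


a = Task('A')
b = Task('B', parents=[a])
job_tasks = [a, b]  # the job's list already contains every task

recursive_names = [t.name for t in tasks_recursively(job_tasks)]
flat_names = [t.name for t in job_tasks]
print(recursive_names, flat_names)
```

Under these assumptions the recursive walk yields A, B, A, while the plain list yields each task exactly once.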

Contributor Author

No need to change it at this point. I was just working with certain assumptions about the graph. I removed that method in job.py and task.py. So, I'm happy with that.

@jmancewicz
Contributor Author

@lukeyeager
I'll update the available GPUs, and keep the graphs after refresh.

The trick about time estimation is that the Jobs page shows the estimated time remaining for the task, not the job. Projecting the time for the job is not going to be very reliable. We could show just the last task, or display the times for each task.

@lukeyeager
Member

The trick about time estimation is that the Jobs page shows the estimated time remaining for the task, not the job. Projecting the time for the job is not going to be very reliable. We could show just the last task, or display the times for each task.

Oh right, I forgot about this problem. How ridiculous would it be for us to assume that all tasks take the same amount of time? Could we display the overall, naively-averaged progress on top, and the progress of each task below? That might look ugly, I'm just spitballing here.

I don't think we need to solve this before merging.
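
The naive averaging idea could be sketched as follows, assuming each task reports its progress as a fraction between 0 and 1 and that all tasks are weighted equally. The function `overall_progress` is hypothetical.

```python
def overall_progress(task_progress):
    """Average per-task progress, treating every task as equally long."""
    if not task_progress:
        return 0.0
    return sum(task_progress) / len(task_progress)


# Two finished tasks and one that is a quarter done.
per_task = [1.0, 1.0, 0.25]
overall = overall_progress(per_task)
print(overall, per_task)
```

The overall figure would go on top and the per-task values beneath it. The equal-weight assumption is exactly the weakness discussed above: a long training task counts the same as a short parsing task.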

@jmancewicz
Contributor Author

@lukeyeager, I emit gpu availability after resources are allocated or deallocated. That should be pretty solid, I think.
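
One way to picture this approach is a resource pool that notifies a listener on every allocation and release. This is only a sketch under assumed names (`GpuPool`, `on_change`); in the real application the callback would presumably push an update to the browser, which is not shown here.

```python
class GpuPool:
    """Hypothetical pool that reports availability whenever it changes."""

    def __init__(self, gpu_ids, on_change):
        self._free = set(gpu_ids)
        self._on_change = on_change

    def allocate(self, count):
        """Reserve `count` GPUs, or return None if not enough are free."""
        if count > len(self._free):
            return None
        taken = sorted(self._free)[:count]
        self._free.difference_update(taken)
        self._on_change(sorted(self._free))  # emit after allocating
        return taken

    def release(self, gpu_ids):
        self._free.update(gpu_ids)
        self._on_change(sorted(self._free))  # emit after deallocating


events = []
pool = GpuPool(['0', '1'], events.append)
taken = pool.allocate(1)
pool.release(taken)
print(events)
```

Emitting from the two places where the free set changes means the widget cannot drift out of date between job start and job finish.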

The sparkline is now drawn when the page is refreshed.

I've removed the get_tasks_recursively method, because job.tasks was the exhaustive task list.

I too would rather get this out and deal with the ETA issue in the future.

@lukeyeager
Member

Sounds good, thanks.

You've still got some tests failing on the Travis build. Let's get those sorted out and merge.

@jmancewicz jmancewicz force-pushed the job_management branch 2 times, most recently from 5c23342 to 91b3327 Compare October 13, 2015 06:05
@jmancewicz
Contributor Author

Quick, it's passing Travis!

Member

What does this check do?

Contributor Author

As I recall, I was getting a failure in tests that I wasn't getting in practice, and the error was inside Flask: flask._app_ctx_stack.top was None. It was probably late, and I couldn't ping you, so I did that. Let me see if it's still needed. If it still errors, I'll ping you.

Contributor Author

======================================================================
ERROR: digits.test_scheduler.TestSchedulerFlow.test_add_remove_job
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/jmancewicz/dev/digits/DIGITS/digits/test_scheduler.py", line 44, in test_add_remove_job
    assert self.s.add_job(job), 'failed to add job'
  File "/home/jmancewicz/dev/digits/DIGITS/digits/scheduler.py", line 183, in add_job
    html = flask.render_template('job_row.html', job = job)
  File "/usr/local/lib/python2.7/dist-packages/flask/templating.py", line 126, in render_template
    ctx.app.update_template_context(context)
AttributeError: 'NoneType' object has no attribute 'app'

I wasn't sure what was missing and it looks like I just committed the stopgap measure. What's the correct way to avoid that error? @lukeyeager

Member

Does this work?

with app.app_context():
    html = flask.render_template('job_row.html', job = job)

Examples:
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/job.py#L161-L162
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/task.py#L100-L105
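
A minimal, self-contained sketch of the suggestion above, assuming a standalone Flask app rather than DIGITS's own `digits.webapp.app`. The helper `render_outside_request` and the inline template are hypothetical; `render_template_string` stands in for `render_template('job_row.html', ...)` so that no template file is needed.

```python
import flask

app = flask.Flask(__name__)


def render_outside_request(name):
    """Render a template from code that runs outside any request."""
    # Pushing an application context gives the template machinery
    # access to the app, which is what the failing call was missing.
    with app.app_context():
        return flask.render_template_string('<td>{{ name }}</td>', name=name)


html = render_outside_request('job-1')
print(html)
```

This covers templates that only need the app itself. A template that builds URLs needs more than an application context.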

Contributor Author

Not quite. New error. There were some new caffe errors, so I rebuilt caffe and pycaffe. Incidentally, make -j10 doesn't work in caffe until the first few targets build. Looking into the new error.

======================================================================
ERROR: Failure: AttributeError ('module' object has no attribute 'scheduler')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/home/jmancewicz/dev/digits/DIGITS/digits/test_scheduler.py", line 7, in <module>
    from . import scheduler as _
  File "/home/jmancewicz/dev/digits/DIGITS/digits/scheduler.py", line 23, in <module>
    from digits.webapp import app
  File "/home/jmancewicz/dev/digits/DIGITS/digits/webapp.py", line 20, in <module>
    scheduler = digits.scheduler.Scheduler(config_value('gpu_list'))
AttributeError: 'module' object has no attribute 'scheduler'

Member

You're talking about this test?
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/test_scheduler.py#L42-L47

That's a pretty sloppy test - my bad.

You could also give Jobs a default type or make job_type() return None instead of throwing an error. Both seem easier than subclassing Job to me.
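
The second option might look like the sketch below. The class names and the `'Model'` return value are hypothetical simplifications, not the actual DIGITS classes; the point is only that a bare base-class instance, such as the one the test creates, no longer raises.

```python
class Job:
    """Hypothetical base class with a harmless default job type."""

    def job_type(self):
        # Return None instead of raising NotImplementedError, so code
        # that renders a plain Job (for example in a test) keeps working.
        return None


class ModelJob(Job):
    """Hypothetical subclass that reports a concrete type."""

    def job_type(self):
        return 'Model'


base_type = Job().job_type()
model_type = ModelJob().job_type()
print(base_type, model_type)
```

The trade-off is that callers must now handle `None`, whereas raising makes a missing override fail loudly.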

Contributor Author

  File "/home/jmancewicz/dev/digits/DIGITS/digits/templates/job_row.html", line 5, in top-level template code
    <td><h4 class="list-group-item-heading"><a href="{{ url_for(show_func, job_id=job.id()) }}">{{ job.name() }}  </a></h4></td>
  File "/usr/local/lib/python2.7/dist-packages/flask/helpers.py", line 287, in url_for
    raise RuntimeError('Application was not able to create a URL '

@lukeyeager, I think this is close to the error that caused me to add the line you questioned. url_for doesn't work in this test. From what I can see it's that the server is not running or the SERVER_NAME is not set. It feels like we talked about this, but I don't recall if there was a resolution.
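
As a sketch of why `url_for` fails in this situation: an application context alone carries no host information, so Flask cannot build a URL unless `SERVER_NAME` is configured or a request context is active. The example below assumes a standalone app with a hypothetical `show_job` route and pushes a test request context with `app.test_request_context()`. It is one possible workaround for tests, not necessarily what was merged.

```python
import flask

app = flask.Flask(__name__)


@app.route('/jobs/<job_id>')
def show_job(job_id):  # hypothetical route standing in for the job views
    return 'job ' + job_id


ROW = '''<a href="{{ url_for('show_job', job_id=job_id) }}">{{ name }}</a>'''


def render_row(job_id, name):
    """Render with a request context, so url_for can build a URL."""
    with app.test_request_context():
        return flask.render_template_string(ROW, job_id=job_id, name=name)


def render_row_app_context_only(job_id, name):
    """Render with only an application context and no SERVER_NAME."""
    with app.app_context():
        return flask.render_template_string(ROW, job_id=job_id, name=name)


html = render_row('abc', 'demo')
try:
    render_row_app_context_only('abc', 'demo')
    app_context_only_failed = False
except RuntimeError:
    app_context_only_failed = True
print(html, app_context_only_failed)
```

Setting `SERVER_NAME` in the test configuration would be the other route; which one fits depends on how the scheduler is exercised in the tests.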

Contributor Author

Yes, that's the test. I did the subclass, and ran into the above error.

Contributor Author

@lukeyeager, as far as imports go, the only import change from origin/master in webapp or scheduler is in scheduler.py

from digits.utils import subclass, override

Contributor Author

So I seem to be past whatever the import issue may have been, but url_for is not working, which is why I had bailed out if there was not an app context (if that is what that was).

@jmancewicz jmancewicz force-pushed the job_management branch 4 times, most recently from e789fbb to e46a4b9 Compare October 22, 2015 17:01
@lukeyeager
Member

The training loss spark line isn't showing up for me. Is it for you?

[Screenshot: job-mgmg-no-loss]

@jmancewicz
Contributor Author

Removing the 'ago' text, which shows how long ago the job started. There are potential time computation issues between client and server that need to be resolved. This will most likely return in the future.

jmancewicz added a commit that referenced this pull request Oct 22, 2015
@jmancewicz jmancewicz merged commit 87bd19e into NVIDIA:master Oct 22, 2015
@jmancewicz jmancewicz deleted the job_management branch October 23, 2015 17:17
Member

@jmancewicz did you mean to leave this route in? It throws an error on my machine when I try to access this url:

UndefinedError
'running_datasets' is undefined

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/share/digits/digits/views.py", line 181, in job_management
    running_job = running_datasets + running_models,
  File "/usr/lib/python2.7/dist-packages/flask/templating.py", line 128, in render_template
    context, ctx.app)
  File "/usr/lib/python2.7/dist-packages/flask/templating.py", line 110, in _render
    rv = template.render(context)
  File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 969, in render
    return self.environment.handle_exception(exc_info, True)
  File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 742, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/share/digits/digits/templates/job_management.html", line 204, in top-level template code
    {% block content %}
  File "/usr/share/digits/digits/templates/job_management.html", line 206, in block "content"
    {% set running_jobs = running_datasets + running_models %}
UndefinedError: 'running_datasets' is undefined

@jmancewicz
Contributor Author

Ah. Nope. That's left over from the first version.
