Job management by jmancewicz · Pull Request #240 · NVIDIA/DIGITS

jmancewicz · 2015-08-27T23:21:29Z

Initial Job Management page.

There is no link to it

lukeyeager · 2015-09-28T20:08:12Z

What's the status on this?

jmancewicz · 2015-09-29T20:31:05Z

The only comment from Andrews was to graph mAP rather than loss. But it sounds like the plan is to show loss unless mAP is available, which sounds like an enhancement. It could show both if mAP is output.

This could be added to the head of the main page, with the dataset and model jobs just beneath.

lukeyeager · 2015-09-29T21:33:28Z

I think any version of this will be an improvement on the current home page. Can you rebase to fix the merge conflicts and let me know when this is ready for review?

jmancewicz · 2015-09-30T03:17:21Z

Rebased. I'll move it to the home page next.

jmancewicz · 2015-10-02T02:48:31Z

@lukeyeager, can you have a look? I might move the contents of job_management.html into home.html rather than including it.

lukeyeager · 2015-10-02T18:24:44Z

Looks pretty good!

The "GPUs" available widget doesn't update when a job starts/finishes. That's misleading.

If you refresh the home page, all the loss graphs get reset to blank, even when there is some data to display.

The job page tells you how much time is remaining, while the home page tells you how long the job has been running.

Job page: [screenshot not preserved]

Home page: [screenshot not preserved]

Let's be consistent. How about doing what Google Maps does: tell you (1) the time remaining and (2) the estimated time at which the job will finish?
http://2.bp.blogspot.com/-dJ7ajrgiRL8/U2E509S7EsI/AAAAAAAAC9I/ScmVKo03F98/s1600/Navigation+with+Lane+Guidance.png
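A minimal sketch of the two numbers suggested above (time remaining and estimated finish time), assuming a job reports a completed fraction and that the remaining work proceeds at the average rate seen so far. The function name `estimate_finish` and its parameters are hypothetical and are not taken from the DIGITS code.

```python
from datetime import datetime, timedelta


def estimate_finish(started_at, now, progress):
    """Return (time_remaining, estimated_finish), or None if progress is unusable.

    progress is the completed fraction, expected in the range (0, 1].
    """
    if not 0 < progress <= 1:
        return None
    elapsed_s = (now - started_at).total_seconds()
    # Linear extrapolation: assume the rest runs at the average rate so far.
    remaining = timedelta(seconds=elapsed_s * (1 - progress) / progress)
    return remaining, now + remaining


started = datetime(2015, 10, 2, 12, 0)
now = datetime(2015, 10, 2, 12, 30)
remaining, finish = estimate_finish(started, now, 0.25)
print("%s remaining, done at about %s" % (remaining, finish.strftime("%H:%M")))
```

Showing both values from one calculation keeps the job page and the home page consistent, since each can render whichever half it needs.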

lukeyeager · 2015-10-02T18:30:25Z

digits/job.py

I'm confused by this function. If you have tasks A and B, and A is B's parent, wouldn't this function return [A,B,A]? Is that what you want? Don't you just want the list at job.tasks?

Wait, so if A is B's parent, then it is also in the job's list of tasks? That sounds like the node is in the graph twice.

Yes, all tasks are in the job.tasks list. The task.parents field is optional, and may describe dependencies between tasks. The tasks are not necessarily a fully-connected graph. We can change this behavior if you have a good reason for it.

No need to change it at this point. I was just working with certain assumptions about the graph. I removed that method in job.py and task.py. So, I'm happy with that.
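The duplication raised in this thread can be illustrated with a small sketch. The body of the removed method is not shown in the conversation, so `collect_recursively` below is only a hypothetical reconstruction of a traversal that follows `parents`; the `Task` class is a stand-in with just the two fields discussed.

```python
class Task:
    """Stand-in for a task: a name plus optional parent dependencies."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def collect_recursively(tasks):
    """Hypothetical traversal: list each task, then everything reachable via parents."""
    found = []
    for task in tasks:
        found.append(task)
        found.extend(collect_recursively(task.parents))
    return found


a = Task('A')
b = Task('B', parents=[a])
job_tasks = [a, b]  # the job's list already contains every task

names = [t.name for t in collect_recursively(job_tasks)]
print(names)                         # A appears twice
print([t.name for t in job_tasks])   # the flat list has no duplicates
```

Because every task is already in the job's list, walking `parents` on top of it only revisits nodes, which is consistent with the decision to drop the recursive method.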

jmancewicz · 2015-10-06T18:14:10Z

@lukeyeager
I'll update the available GPUs, and keep the graphs after refresh.

The trick about time estimation is that the Jobs page shows the estimated time remaining for the task, not the job. Projecting the time for the whole job is not going to be very reliable. We could show just the last task, or display the times for each task.

lukeyeager · 2015-10-07T21:04:51Z

> The trick about time estimation is that the Jobs page shows the estimated time remaining for the task, not the job. Projecting the time for the whole job is not going to be very reliable. We could show just the last task, or display the times for each task.

Oh right, I forgot about this problem. How ridiculous would it be for us to assume that all tasks take the same amount of time? Could we display the overall, naively-averaged progress on top, and the progress of each task below? That might look ugly, I'm just spitballing here.

I don't think we need to solve this before merging.
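The naive averaging floated above could look roughly like this. It is a sketch under the stated assumption that all tasks take the same amount of time; `overall_progress` is a hypothetical helper, not a DIGITS function.

```python
def overall_progress(task_progress):
    """Average per-task completed fractions, weighting every task equally."""
    if not task_progress:
        return 0.0
    return sum(task_progress) / len(task_progress)


# One finished task, one half done, one not started.
per_task = [1.0, 0.5, 0.0]
total = overall_progress(per_task)

print("overall: %d%%" % round(total * 100))
for index, fraction in enumerate(per_task):
    print("  task %d: %d%%" % (index, round(fraction * 100)))
```

The equal weighting is the weak point: a job whose last task dominates the run time would show misleadingly high overall progress early on.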

jmancewicz · 2015-10-07T23:54:41Z

@lukeyeager, I emit gpu availability after resources are allocated or deallocated. That should be pretty solid, I think.
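As an illustration of emitting availability on every allocation change, here is a minimal sketch. `GpuPool` and its `on_change` callback are hypothetical names; in the real application the callback would presumably push an update to the browser, while here it is any plain callable.

```python
class GpuPool:
    """Tracks free GPU ids and reports the free list after every change."""

    def __init__(self, gpu_ids, on_change):
        self._free = set(gpu_ids)
        self._on_change = on_change

    def allocate(self, count):
        """Reserve `count` GPUs; return their ids, or None if too few are free."""
        if count > len(self._free):
            return None
        taken = sorted(self._free)[:count]
        self._free.difference_update(taken)
        self._on_change(sorted(self._free))  # notify only after state changed
        return taken

    def release(self, gpu_ids):
        """Return GPUs to the pool and report the new free list."""
        self._free.update(gpu_ids)
        self._on_change(sorted(self._free))


events = []
pool = GpuPool([0, 1], events.append)
taken = pool.allocate(1)
pool.release(taken)
print(events)
```

Tying the notification to the two places where the free set changes means the widget cannot drift from the scheduler's own view of the resources.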

The sparkline is now drawn when the page is refreshed.

I've removed the get_tasks_recursively method, because job.tasks was the exhaustive task list.

I too would rather get this out and deal with the ETA issue in the future.

lukeyeager · 2015-10-08T02:04:43Z

Sounds good, thanks.

You've still got some tests failing on the Travis build. Let's get those sorted out and merge.

jmancewicz · 2015-10-13T06:33:21Z

Quick, it's passing travis!

lukeyeager · 2015-10-13T17:05:59Z

digits/scheduler.py

What does this check do?

As I recall, I was getting a failure in the tests that I wasn't getting in practice; the error came from Flask, where flask._app_ctx_stack.top was None. It was probably late and I couldn't ping you, so I added that check. Let me see if it's still needed. If it still errors, I'll ping you.

======================================================================
ERROR: digits.test_scheduler.TestSchedulerFlow.test_add_remove_job
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/jmancewicz/dev/digits/DIGITS/digits/test_scheduler.py", line 44, in test_add_remove_job
    assert self.s.add_job(job), 'failed to add job'
  File "/home/jmancewicz/dev/digits/DIGITS/digits/scheduler.py", line 183, in add_job
    html = flask.render_template('job_row.html', job = job)
  File "/usr/local/lib/python2.7/dist-packages/flask/templating.py", line 126, in render_template
    ctx.app.update_template_context(context)
AttributeError: 'NoneType' object has no attribute 'app'

I wasn't sure what was missing and it looks like I just committed the stopgap measure. What's the correct way to avoid that error? @lukeyeager

Does this work?

    with app.app_context():
        html = flask.render_template('job_row.html', job=job)

Examples:
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/job.py#L161-L162
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/task.py#L100-L105

Not quite; there's a new error. I also hit some new Caffe errors, so I rebuilt Caffe and pycaffe. Incidentally, make -j10 doesn't work in Caffe until the first few targets are built. I'm looking into the new error.

======================================================================
ERROR: Failure: AttributeError ('module' object has no attribute 'scheduler')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/home/jmancewicz/dev/digits/DIGITS/digits/test_scheduler.py", line 7, in <module>
    from . import scheduler as _
  File "/home/jmancewicz/dev/digits/DIGITS/digits/scheduler.py", line 23, in <module>
    from digits.webapp import app
  File "/home/jmancewicz/dev/digits/DIGITS/digits/webapp.py", line 20, in <module>
    scheduler = digits.scheduler.Scheduler(config_value('gpu_list'))
AttributeError: 'module' object has no attribute 'scheduler'

You're talking about this test?
https://github.com/NVIDIA/DIGITS/blob/v2.2.1/digits/test_scheduler.py#L42-L47

That's a pretty sloppy test - my bad.

You could also give Jobs a default type or make job_type() return None instead of throwing an error. Both seem easier than subclassing Job to me.

  File "/home/jmancewicz/dev/digits/DIGITS/digits/templates/job_row.html", line 5, in top-level template code
    <td><h4 class="list-group-item-heading"><a href="{{ url_for(show_func, job_id=job.id()) }}">{{ job.name() }} </a></h4></td>
  File "/usr/local/lib/python2.7/dist-packages/flask/helpers.py", line 287, in url_for
    raise RuntimeError('Application was not able to create a URL '

@lukeyeager, I think this is close to the error that caused me to add the line you questioned. url_for doesn't work in this test. From what I can see it's that the server is not running or the SERVER_NAME is not set. It feels like we talked about this, but I don't recall if there was a resolution.

Yes, that's the test. I did the subclass, and ran into the above error.

@lukeyeager, as far as imports go, the only import change from origin/master in webapp or scheduler is in scheduler.py

from digits.utils import subclass, override

So I seem to be past whatever the import issue may have been, but url_for is not working, which is why I had bailed out if there was not an app context (if that is what that was).
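For readers following the two errors in this thread, here is a small sketch of how Flask behaves when a template that calls `url_for` is rendered outside a running server. It assumes Flask is installed and no `SERVER_NAME` is configured; the route `show_job`, the inline template `ROW`, and the helper names are hypothetical stand-ins for `job_row.html` and the scheduler code. The thread does not say which fix DIGITS finally adopted, so this shows one option rather than the project's resolution.

```python
import flask

app = flask.Flask(__name__)


@app.route('/jobs/<job_id>')
def show_job(job_id):
    return job_id


# Inline stand-in for a row template that links to the job page.
ROW = """<a href="{{ url_for('show_job', job_id=job_id) }}">{{ name }}</a>"""


def render_row(job_id, name):
    return flask.render_template_string(ROW, job_id=job_id, name=name)


def render_in_request_context(job_id, name):
    # A test request context supplies the URL information url_for needs.
    with app.test_request_context():
        return render_row(job_id, name)


def fails_with_app_context_only(job_id, name):
    # An app context alone lets the template render, but url_for cannot
    # build a URL without a request or a configured SERVER_NAME.
    with app.app_context():
        try:
            render_row(job_id, name)
        except RuntimeError:
            return True
    return False


html = render_in_request_context('abc', 'My job')
print(html)
print(fails_with_app_context_only('abc', 'My job'))
```

In a unit test, wrapping the call in `app.test_request_context()` or setting `SERVER_NAME` in the app config are the usual ways to make URL generation work without starting the server.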

lukeyeager · 2015-10-22T17:06:19Z

The training loss spark line isn't showing up for me. Is it for you?

jmancewicz · 2015-10-22T18:46:40Z

Removing the 'ago' text, which shows how long ago the job started. There are potential time computation issues between client and server that need to be resolved. This will most likely return in the future.

Job management

lukeyeager · 2015-11-20T19:14:37Z

digits/views.py

@jmancewicz did you mean to leave this route in? It throws an error on my machine when I try to access this url:

UndefinedError 'running_datasets' is undefined
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/share/digits/digits/views.py", line 181, in job_management
    running_job = running_datasets + running_models,
  File "/usr/lib/python2.7/dist-packages/flask/templating.py", line 128, in render_template
    context, ctx.app)
  File "/usr/lib/python2.7/dist-packages/flask/templating.py", line 110, in _render
    rv = template.render(context)
  File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 969, in render
    return self.environment.handle_exception(exc_info, True)
  File "/usr/lib/python2.7/dist-packages/jinja2/environment.py", line 742, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/share/digits/digits/templates/job_management.html", line 204, in top-level template code
    {% block content %}
  File "/usr/share/digits/digits/templates/job_management.html", line 206, in block "content"
    {% set running_jobs = running_datasets + running_models %}
UndefinedError: 'running_datasets' is undefined

jmancewicz · 2015-11-20T19:58:46Z

Ah. Nope. That's left over from the first version.

jmancewicz force-pushed the job_management branch from cdeb5d1 to 04e0547 Compare August 28, 2015 17:27

jmancewicz added the UI label Aug 28, 2015

jmancewicz force-pushed the job_management branch from 07caa3c to e57e09d Compare September 3, 2015 17:39

lukeyeager mentioned this pull request Sep 21, 2015

Show DB backend on home and model creation pages #323

Merged

jmancewicz force-pushed the job_management branch 3 times, most recently from 30048c7 to 7838d29 Compare October 2, 2015 02:46

lukeyeager reviewed Oct 2, 2015

jmancewicz force-pushed the job_management branch from 7838d29 to e159fb2 Compare October 7, 2015 18:54

jmancewicz force-pushed the job_management branch from e159fb2 to 008a038 Compare October 7, 2015 23:44

jmancewicz force-pushed the job_management branch 2 times, most recently from 5c23342 to 91b3327 Compare October 13, 2015 06:05

lukeyeager reviewed Oct 13, 2015

jmancewicz force-pushed the job_management branch 4 times, most recently from e789fbb to e46a4b9 Compare October 22, 2015 17:01

Adding Job Management UI

2bb22ee

jmancewicz force-pushed the job_management branch from e46a4b9 to 2bb22ee Compare October 22, 2015 18:03

jmancewicz added a commit that referenced this pull request Oct 22, 2015

Merge pull request #240 from jmancewicz/job_management

87bd19e

Job management

jmancewicz merged commit 87bd19e into NVIDIA:master Oct 22, 2015

jmancewicz mentioned this pull request Oct 22, 2015

remove ago from running jobs #380

Merged

jmancewicz deleted the job_management branch October 23, 2015 17:17

lukeyeager reviewed Nov 20, 2015

lukeyeager mentioned this pull request Nov 20, 2015

remove job_management route #425

Merged

lukeyeager mentioned this pull request Dec 10, 2015

Fix typo in template #464

Merged

lukeyeager mentioned this pull request Feb 23, 2016

Remove update_ago_timer variable and update_ago function. #556

Closed

2 participants