Recursion #2

Open
acharal wants to merge 67 commits into r1.4 from r1.4_recursion

Conversation

acharal (Collaborator) commented Jul 3, 2018

Implementation of recursion using tags in a static dataflow graph.
It supports general recursion, employing Call/Return operators to create and manipulate the tags.

The purpose of this pull request is to track the full set of changes of the ongoing implementation and to enable review of the code.
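As a rough mental model of the tag mechanism, here is a standalone sketch: a Call operator retags a value into a fresh child frame, and the matching Return retags the result back into the caller's frame, so concurrent recursive calls stay distinguishable. All names here are illustrative, not the actual ops added by this PR.

#include <string>

// A value flowing through the static graph, paired with the tag that
// identifies its dynamic call frame.
struct Tagged {
  int value;
  std::string tag;  // e.g. "0/fib/7"
};

int next_call_id = 1;

// Call: enter a fresh frame derived from the caller's tag.
Tagged Call(const Tagged& in, const std::string& function_name) {
  return {in.value,
          in.tag + "/" + function_name + "/" + std::to_string(next_call_id++)};
}

// Return: drop the innermost frame, handing the result back to the caller.
Tagged Return(const Tagged& result, const std::string& caller_tag) {
  return {result.value, caller_tag};
}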

DelphianCalamity and others added 24 commits June 11, 2018 21:19
- 1.3 version doesn't have the 'CustomGraphOptimizer' feature,
  so I 'manually' registered the optimizer.

- 'function' optimizer does not exist either in 1.3, so
  'function_transformation' took its place.

- 'TESTS' directory contains some toy examples of recursion.
- writing the optimized graph within the core as an 'event',
  so that I can visualize it with TensorBoard.
- Merge / non-existing op (IdentityN for now) / has as many inputs as the call sites
- Enter/Exit ops are simple IdentityN nodes for now.
- Multiple Merge ops
- Enter/Exit ops are simple IdentityN nodes for now.
- I'm going to break Call/Return (IdentityN-like) nodes into
  multiple Call/Return (Identity-like) nodes, similar to
  iteration's ops, so that I can reuse existing code more easily and
  allow more parallelism.

- I'm going to add a 'NextCall' op as well, for the sake of consistency.
- I broke Call/Return (IdentityN-like) nodes into
  multiple Call/Return (Identity-like) nodes.

- I added a 'NextCall' op in transformation.
- removed NextCall op
- fixed convertToGraph / transformation now makes it to Executor
- Temporarily disabled constant folding optimizations
  and topological sort, as they are causing minor problems and need
  special handling.

- If a fetched output corresponded to a function-calling node that the
  transformation altered, then the fetch was invalid.
Remove unnecessary edits to files to make the changes that matter clearer.
s = GetNodeAttr(node->attrs(), "parallel_iterations", &parallel_iters);
DCHECK(s.ok()) << s;
int parallel_iters = 1;
if (node->op_def().name() != "Call") {
Collaborator Author:

Maybe this should change to !IsCall(node)

Owner:

ok

int parallel_iters;
s = GetNodeAttr(node->attrs(), "parallel_iterations", &parallel_iters);
DCHECK(s.ok()) << s;
int parallel_iters = 1;
Collaborator Author:

The original version does not set parallel_iters to 1. It does not assign it at all. Is this a bug fix?

Owner:

Nope. In case CreateFrameBlah was triggered by a Call node, we initialize the 'iterations' vector's size to 1, because we'll only ever need one 'IterationState'. In case of more recursive calls, new FrameStates will be created according to our approach instead.
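A sketch of that invariant, with illustrative names (not the actual executor fields):

#include <memory>
#include <vector>

struct IterationState { /* per-iteration input/output slots ... */ };

struct FrameState {
  std::vector<std::unique_ptr<IterationState>> iterations;

  // A frame created for a Call node holds exactly one iteration, since each
  // further recursive call gets its own child FrameState; a loop frame
  // pre-sizes its window to 'parallel_iterations'.
  FrameState(bool created_by_call, int parallel_iterations) {
    iterations.resize(created_by_call ? 1 : parallel_iterations);
  }
};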

@@ -0,0 +1,50 @@
/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
Collaborator Author:

Is this indeed code by the "TensorFlow Authors"?

Owner:

hehe, I was copy-pasting too much

}
}
TopologicalSort(optimized_graph);
//TopologicalSort(optimized_graph);
Collaborator Author:

Why is this commented out?

Owner:

"A topological ordering is possible if and only if the graph has no directed cycles"
they made it work with iteration cycles, I have to make it work with recursion cycles too.
It's a "Job To Do". Sorting isn't necessary, I guess they are doing it to optimize the algorithms that will follow, or maybe TensorBoard's graph representation..don't know.

Collaborator Author (Jul 3, 2018):

Ok, I created issue #3 for that.

acharal pushed a commit that referenced this pull request Jul 5, 2018
* Allreduce: Rebase to TF 1.3-rc1 (#3)

* Introduce MPI allreduce in a new contrib project.

This commit adds the tensorflow.contrib.mpi namespace and contrib
project, which has a variety of ops that work with MPI.

The MPI system works by starting a background thread which communicates
between the different processes at a regular interval and schedules
asynchronous reductions. At every tick, every rank will notify rank zero
of the tensors it is ready to reduce, signifying completion with an
empty DONE message. Rank zero will count how many ranks are ready to
reduce every tensor, and, whenever a tensor is ready to reduce (that is,
every rank is ready to reduce it), rank zero will issue a message to all
other ranks directing them to reduce that tensor.  This repeats for all
the tensors that are ready to reduce, after which rank zero sends all
other ranks a DONE message indicating that the tick is complete.

Reviewed-by: Joel Hestness <jthestness@gmail.com>
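A hedged sketch of the coordination logic described above, modeled in a single process without real MPI (the names and the simulation are illustrative, not the contrib/mpi code): rank zero tallies which ranks are ready to reduce each tensor and fires the reduction once every rank has reported in.

#include <iostream>
#include <map>
#include <string>

class Coordinator {
 public:
  explicit Coordinator(int world_size) : world_size_(world_size) {}

  // Called when some rank reports tensor 'name' as ready to reduce
  // (each rank ends its report with an empty DONE message).
  void TensorReady(const std::string& name) {
    if (++ready_counts_[name] == world_size_) {
      std::cout << "reduce " << name << "\n";  // stand-in for the directive
      ready_counts_.erase(name);
    }
  }

 private:
  int world_size_;
  std::map<std::string, int> ready_counts_;  // tensor name -> #ranks ready
};

int main() {
  Coordinator c(/*world_size=*/2);
  c.TensorReady("grad/w0");
  c.TensorReady("grad/w0");  // prints: reduce grad/w0
}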

* Allreduce/Allgather: Major changes and fixes (#2)

This commit constitutes many major updates to the TF MPI allreduce and
allgather ops. Specifically, the following changes are included in this
commit:
1) The allreduce and allgather ops had race conditions, which this commit
fixes. Specifically, the BackgroundThreadLoop previously allocated temporary
and output tensors after the main graph traversal thread had completed its
call to MPIAll*::ComputeAsync(). Unfortunately, the op kernel context's
memory allocator is only guaranteed to be valid during the ComputeAsync call.
This constraint requires ComputeAsync to allocate all tensors before
returning; otherwise, the memory allocator state may reflect allocations and
deallocations from further ops that can cause races for the memory locations.
To fix this, hoist the memory allocations to ComputeAsync. In this process,
introduce a collective op record, which tracks the parameters of the op (e.g.
input, output, and configuration); a sketch of such a record follows this
list.

2) Many models require the capability to allreduce or allgather int64
tensors. We add functionality to handle the long long data type (64-bit ints).

3) Eliminate the thread sleep. A major to-do item is to eliminate the need for
polling between coordinator threads and other ranks. This change will require
the coordinator rank to be able to wake up all other ranks when a collective
is ready to be performed, but also for all ranks (i.e. background threads) to
be woken up by graph traversal threads. In the meantime, remove the thread
sleep, because it introduces significant run time overhead (e.g. >20%) for
models with quick-running layers (e.g. few recurrent time-steps or few hidden
nodes per layer).
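A hedged sketch of a collective op record as described in (1): a struct, filled in during ComputeAsync while the kernel context's allocator is still valid, that carries everything the background thread needs later. Field names are illustrative, not the actual contrib/mpi definitions.

#include <functional>
#include <string>

struct Tensor;  // stand-in for tensorflow::Tensor

struct CollectiveOpRecord {
  std::string name;            // tensor name used to coordinate across ranks
  int rank = 0;                // MPI rank that owns this record
  const Tensor* input = nullptr;
  Tensor* output = nullptr;    // pre-allocated in ComputeAsync, not later
  std::function<void()> done;  // callback to signal completion of the op
};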

* mpi_ops.cc: Move toward more TF nature

This commit changes a few bits and pieces to align more closely with
TensorFlow structures and organization:

1) Use TF mutexes. TF mutexes provide nice scoping and management around
std::mutex, and using them is consistent with other TF code.

2) Remove thread sleep at MPI initialization time. Thread sleep should not
be used for polling activity. Instead, this commit replaces sleep-polling
with a condition variable: The compute graph traversal thread waits on the
condition variable until the background thread has completed initialization
and signals the graph traversal thread that initialization is complete (a
minimal sketch follows this list).

3) Slim MPI initialization check: Since TF permits many threads to be
traversing the compute graph concurrently (e.g. with
inter_op_parallelism_threads > 1), some graph traversal threads may not
have set their GPU device ID. If such a thread executes an MPI op, it would
fail the check in InitializedMPIOnSameDevice, because the background thread
would be controlling a GPU with ID other than the default (0). Since graph
traversal threads do not perform GPU activity, this GPU ID check was
unnecessary. Remove it and refactor to just check whether MPI is
initialized (IsMPIInitialized).
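A minimal sketch of the sleep-free handshake in (2), using standard C++ for illustration (the commit itself uses TF's mutex wrappers): the graph traversal thread blocks on a condition variable until the background thread marks initialization complete.

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool initialized = false;

void BackgroundThread() {
  // ... perform MPI initialization ...
  {
    std::lock_guard<std::mutex> lock(mu);
    initialized = true;
  }
  cv.notify_all();  // wake every waiting graph traversal thread
}

void GraphTraversalThread() {
  std::unique_lock<std::mutex> lock(mu);
  cv.wait(lock, [] { return initialized; });  // no polling, no sleep
  // ... safe to issue MPI ops from here on ...
}

int main() {
  std::thread bg(BackgroundThread), gt(GraphTraversalThread);
  bg.join();
  gt.join();
}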

* Rebase to TF 1.3.0-rc1 complete and tested

* Minor fixes

* Point MPI message proto at contrib/mpi package

* MPI Session: Fix graph handling

* Pylint fixes

* More pylint fixes

* Python 2 pylint fix

* MPI Collectives Ops: Fix coordinator shut down

* Update copyrights to 2017

* Remove MPIDataType and switch to TF DataType

* Add Allgather test, fix Allreduce test config

* Fix BUILD file for TF sanity checks

* Try guarding MPI collectives C++ files with TENSORFLOW_USE_MPI

The TF build system on Github tries to build C++ source files in
tensorflow/contrib/mpi_collectives even when configured with TF_NEED_MPI=0.
This leads to a build failure when the mpi_collectives C++ files try to link
against MPI third party headers, which are not set up. Since we are unable to
reproduce this in the contributor's build environment, we try guarding the
MPI collectives C++ code with defines for TENSORFLOW_USE_MPI, similar to
tensorflow/contrib/mpi.

* Comment formatting

Hopefully, this will trigger googlebot.
acharal and others added 14 commits July 5, 2018 16:45
- Call/Return frame_names are now small 1-2 character strings
- Removed ":0" suffix from each new generated frame
- added 'clock' in commit
- Dead Returns are no longer added to the ready queue

- flatmap outstanding_frames_ was moved from ExecutorState to
  FrameState. Now each parent frame holds its child frames.
  (Maybe now that there will be only a few elements in outstanding_frames,
  FlatMap could be changed into a potentially better structure.)

- The tags have the form: <frame_id: frame_name>
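A sketch of the resulting shape, with illustrative names (std types stand in for TF's gtl::FlatMap):

#include <string>
#include <unordered_map>

struct FrameState {
  int frame_id = 0;
  std::string frame_name;

  // Moved here from ExecutorState: each parent frame now owns the map of
  // its outstanding child frames, keyed by the child's frame name.
  std::unordered_map<std::string, FrameState*> outstanding_child_frames_;

  // The tag identifying this dynamic frame.
  std::string Tag() const {
    return "<" + std::to_string(frame_id) + ": " + frame_name + ">";
  }
};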
}

ExecutorState::~ExecutorState() {
for (auto name_frame : outstanding_frames_) {
Collaborator Author:

So, my guess is that we can't just delete this. I mean, we need to deallocate the child frames at some point, right? Eventually we should move the corresponding deallocation code to where it is appropriate.

Owner:

When each frame gets deleted by DeleteFrame(..), this piece of code gets executed:
parent_frame->outstanding_child_frames_.erase(frame_name);
deleting its "registration" inside the parent frame's outstanding frames.
I thought this gradually empties the maps no matter what. I don't think a frame can be deleted without having all of its child frames deleted first. I may be wrong. The root frame is not registered anywhere as "outstanding", as it has no parent.

Collaborator Author:

Ok, makes sense [that outstanding_frames_ only contains the root, which is not added after the modification]. However, DeleteFrame is only called by an Exit or a Return (in our case). That is, if the graph is not well-formed (meaning it does not have the appropriate number of Exit and Return nodes), there might be a memory leak.
We also need to call DeleteFrame(root_frame) at that very spot during the deallocation of ExecutorState, and possibly amend DeleteFrame to recursively delete the outstanding_child_frames; a sketch follows.
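A hedged sketch of that amendment (simplified and lock-free for illustration, not the executor's real code): DeleteFrame deletes any outstanding child frames first, then unregisters itself from its parent, so ~ExecutorState can simply call DeleteFrame(root_frame).

#include <string>
#include <unordered_map>
#include <utility>

struct FrameState {
  FrameState* parent_frame = nullptr;
  std::string frame_name;
  std::unordered_map<std::string, FrameState*> outstanding_child_frames_;
};

void DeleteFrame(FrameState* frame) {
  // Detach the child map first, so the recursive deletes below don't
  // mutate the container while we iterate it.
  auto children = std::move(frame->outstanding_child_frames_);
  frame->outstanding_child_frames_.clear();
  for (auto& name_child : children) DeleteFrame(name_child.second);
  // Unregister from the parent; the root frame has no such registration.
  if (frame->parent_frame != nullptr)
    frame->parent_frame->outstanding_child_frames_.erase(frame->frame_name);
  delete frame;
}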

// child frame is composed of the name of the parent frame, the iteration
// number at which the parent frame is creating the new frame, and the
// name of the new frame from nodedef.
gtl::FlatMap<string, FrameState*> outstanding_child_frames_ GUARDED_BY(mu_);
Collaborator Author:

This is declared as guarded by mu_, but the mutex mu_ lives in the ExecutorState. We possibly need to change this to use a local mutex in the FrameState, say FrameState::mu, and also change the corresponding locks that refer to mu_, so that the critical sections that access outstanding_child_frames_ are locked properly.
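A sketch of the per-frame locking suggested here, with std types standing in for TF's mutex and gtl::FlatMap (illustrative only):

#include <mutex>
#include <string>
#include <unordered_map>

struct FrameState {
  std::mutex mu;  // local mutex, replacing the ExecutorState-wide mu_

  // GUARDED_BY(mu) in TF annotation terms.
  std::unordered_map<std::string, FrameState*> outstanding_child_frames_;

  void RegisterChild(const std::string& name, FrameState* child) {
    std::lock_guard<std::mutex> lock(mu);  // was: a lock on ExecutorState mu_
    outstanding_child_frames_[name] = child;
  }

  void UnregisterChild(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu);
    outstanding_child_frames_.erase(name);
  }
};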

{
mutex_lock executor_lock(mu_);
outstanding_frames_.erase(frame_name);
if (frame->frame_id != 0) {
Collaborator Author (Jul 16, 2018):

Maybe just check parent_frame != nullptr?

acharal and others added 6 commits July 16, 2018 21:52
Apparently GetNodeAttr costs an arm and a leg.

It adds 2.0-2.2 sec in fib(30). A sketch of the caching fix follows.
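A hedged sketch of the fix (cf. the "Cache frame_name in NodeItem" commits below): resolve the attribute once while building the executor, so the hot execution path never calls GetNodeAttr. Field names are illustrative.

#include <string>

struct NodeItem {
  std::string frame_name;  // cached "frame_name" attr, resolved at setup
  bool is_call = false;    // cached op-type check instead of name compares
};

// Cold path, once per node while the executor is built:
//   GetNodeAttr(node->attrs(), "frame_name", &item->frame_name);
//   item->is_call = IsCall(node);
// Hot path, per execution: just read item->frame_name / item->is_call.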
* Use VLOG instead of printf

* Fix typo

* Comment out eventwriters for performance

* Print diagnostics in VLOG instead of stdout

* Change CopyArgTypeN to CopyArgType; rely on function overloading

* Use Merge with N>=2 arguments

* Oops, missing semicolon

* Remove commented code

* Change the MakeFrameName to produce more compact names

* Fix compilation error in executor

* Fix compilation error in executor

* Change outstanding_frames to use frame_id instead of frame_name

* Revert "Change outstanding_frames to use frame_id instead of frame_name"

This reverts commit 707e41d.

We revert since we didn't get any performance gain anyway.
We may come back to that during fine-tuning.

* Cache frame_name in NodeItem

* Cache frame_name in NodeItem

* Cache frame_name in NodeItem

* Cache frame_name in NodeItem

* Cache frame_name in NodeItem

* Change the implementation of BuildControlFlowInfo

Remove synonym frames and use two kinds of frames for calls:
static and dynamic, where a static frame corresponds to the static
information, which is per function, and a dynamic frame is per function
call (see the sketch below).
Squashed commit of the following:

commit d32d488
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Wed Jul 18 14:23:32 2018 +0300

    Use call_id as check of return

commit 390d2e7
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Wed Jul 18 14:22:22 2018 +0300

    Use call_id as check of return

commit 364db3a
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Wed Jul 18 13:58:01 2018 +0300

    Change outstanding_frames to use frame_id instead of frame_name

commit 7095ea2
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Wed Jul 18 13:49:30 2018 +0300

    Syntax error fixed

commit cf0c48b
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Wed Jul 18 13:05:56 2018 +0300

    Fix input_count to count correctly the inputs of calls

commit 78feb11
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 18:38:40 2018 +0300

    Fix error

commit 9aeec10
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 18:30:10 2018 +0300

    Fix syntax error

commit b286174
Merge: e92ef53 c7e335c
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 18:02:36 2018 +0300

    Merge branch 'r1.4_recursion_synonyms' into r1.4_recursion_opt_merge

commit c7e335c
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 11:59:41 2018 +0300

    Simplify BuildControlFlowInfo

commit 8423b3a
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 11:55:28 2018 +0300

    Simplify BuildControlFlowInfo

commit 8d80595
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 11:43:01 2018 +0300

    Refactor BuildControlFlowInfo

commit c252b6c
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Tue Jul 17 11:41:08 2018 +0300

    Refactor BuildControlFlowInfo

commit c53327e
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 23:46:07 2018 +0300

    Fix incompatible Return name

commit 564b1ab
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 21:55:55 2018 +0300

    Leave only call_id as dyn frame name

commit 8749ec8
Merge: 0fcd2b5 d034dc2
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 21:53:51 2018 +0300

    Merge branch 'r1.4_recursion' into r1.4_recursion_synonyms

commit 0fcd2b5
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 18:57:50 2018 +0300

    Fix frame creation

commit f8fca8a
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 16:23:52 2018 +0300

    Fix syntax error

commit 26b3393
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 16:20:47 2018 +0300

    Fix syntax error

commit f47322c
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 16:11:07 2018 +0300

    Fix syntax error

commit c278522
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 16:02:09 2018 +0300

    Fix some typos

commit aca0218
Author: Angelos Charalambidis <agcharal@gmail.com>
Date:   Mon Jul 16 15:40:34 2018 +0300

    Keep separate info for function name and call id.

    Record the function name, call id and arg id at function transformation
    time. This removes the need to compute synonym frames in every execution
    run. The function name is stored as "frame_name" and is used to index the
    static frame info, i.e. all calls of the function share the same static
    frame info.

    The runtime FrameState is named after the concatenation of the
    parent_id, function_name and call_id to ensure uniqueness.

    Hope it works..
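A sketch of that naming scheme (the separator and the example function name are illustrative):

#include <string>

std::string MakeRuntimeFrameName(int parent_id,
                                 const std::string& function_name,
                                 int call_id) {
  // e.g. parent 0, function "fib", call 7  ->  "0/fib/7"
  return std::to_string(parent_id) + "/" + function_name + "/" +
         std::to_string(call_id);
}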

* Exit has no frame_name (apparently)

* Quick Fix on Topological Sort