```diff
 let recv_poll_with_dls mchan dls =
   try
-    Ws_deque.pop (Array.unsafe_get mchan.channels dls.id)
+    Ws_deque.steal (Array.unsafe_get mchan.channels dls.id)
```
`Ws_deque.pop` returns the most recently pushed item (it behaves like a stack, see ocaml-multicore/saturn#38), and as such doesn't allow `yield` to run only after we've had a chance to go through the other tasks in the deque :/
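To make the distinction concrete, here is a minimal sketch of the two ends of a work-stealing deque. It assumes Saturn's `Ws_deque` API (`create`/`push`/`pop`/`steal`, with the owner pushing and popping at the same end); exact module paths and empty-deque behavior vary between library versions.

```ocaml
(* Sketch only: module path and signatures assumed from Saturn's Ws_deque. *)
let () =
  let d = Saturn.Ws_deque.create () in
  List.iter (Saturn.Ws_deque.push d) [ 1; 2; 3 ];
  (* The owner pops the most recently pushed item first (LIFO, stack order)... *)
  assert (Saturn.Ws_deque.pop d = 3);
  (* ...while a thief steals from the opposite end, taking the oldest item
     (FIFO order), which is what this PR relies on for fairness. *)
  assert (Saturn.Ws_deque.steal d = 1)
```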
Unfortunately this completely changes the scheduling strategy and makes it qualitatively worse for parallelism. For parallel programming you really want LIFO scheduling with work-stealing taking from the bottom of the stack (that has been proven to be optimal).
Per-worker FIFO scheduling improves fairness, but is actually not enough to make scheduling really fair, because different workers can still have different queue lengths. You could e.g. have a single worker that only runs a single task. That task gets 100% of a single domain. Another worker might have 100 tasks, so each task on that domain would get 1% of a single domain. That is rather unfair.
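The arithmetic behind that unfairness claim can be spelled out in a toy helper (hypothetical numbers, not a measurement): each task gets an equal slice of its own worker's domain, independent of how loaded the other workers are.

```ocaml
(* Toy model: a worker time-slices its domain evenly over its own tasks. *)
let share_per_task ~tasks_on_worker = 1.0 /. float_of_int tasks_on_worker

let () =
  (* Worker A runs 1 task: that task gets 100% of a domain. *)
  assert (share_per_task ~tasks_on_worker:1 = 1.0);
  (* Worker B runs 100 tasks: each task gets 1% of a domain. *)
  assert (share_per_task ~tasks_on_worker:100 = 0.01)
```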
I recently updated the sample multififo scheduler for Picos to balance the number of fibers across workers, which seems to substantially improve fairness in a test.
(This PR is more of a question mark: there's an unpleasant change needed to make this POC work, and perhaps domainslib shouldn't provide this primitive?)

When using a lockfree data structure to communicate between tasks, I didn't find a way to write the basic strategy "failed to make progress, retry later". With standard domains, we can use `Domain.cpu_relax ()` to spin-wait a bit... but with tasks this can deadlock the current domain (if we are waiting on a task scheduled on the same domain, we'll never give it a chance to run). Of course it (slowly) works if another domain steals our pending tasks, but all the domains may actually be stuck and unable to help.

I've included a small artificial example to demonstrate the issue :)
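The artificial example itself isn't shown in this excerpt; a minimal sketch of the failure mode, assuming Domainslib's `Task` API (`setup_pool`, `run`, `async`), could look like the following. With no extra domains, the spin loop occupies the only worker, so the task that would set the flag never gets scheduled.

```ocaml
(* Hedged sketch of the deadlock described above; do not run as-is,
   it loops forever on a single-domain pool. *)
let deadlock_demo () =
  let pool = Domainslib.Task.setup_pool ~num_domains:0 () in
  Domainslib.Task.run pool (fun () ->
      let flag = Atomic.make false in
      (* Scheduled on this same domain, but it never gets a chance to run: *)
      let _setter = Domainslib.Task.async pool (fun () -> Atomic.set flag true) in
      (* Spinning with cpu_relax never yields back to the task scheduler. *)
      while not (Atomic.get flag) do
        Domain.cpu_relax ()
      done);
  Domainslib.Task.teardown_pool pool
```

Stealing can rescue this only if another domain exists and is idle, which matches the "all the domains may actually be stuck" caveat above.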