socket failures that take hours to heal #33

@stopatz

I use a Wolfram session to compute the integrand in the Vegas algorithm in Python.

I use MPI to start one session per core on a high-performance cluster.

Before I start a session, I want to kill any stray Mathematica processes, so I use the kernel controller as follows:

controller = kernelcontroller.WolframKernelController(kernel='path', kernel_loglevel=1)

controller._kernel_stop()

Now, if I wait 10 minutes after this clean-up, my actual code

with WolframLanguageSession('path') as session:...

works fine most of the time.

But at seemingly random times, I get socket failures when I run the two-step process (clean-up, then run the session), with multiple instances of the following error message:

Socket exception: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
Failed to start.
Traceback (most recent call last):
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 435, in _kernel_start
    response = self.kernel_socket_in.recv_abortable(
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/zmqsocket.py", line 53, in recv_abortable
    raise SocketOperationTimeout(
wolframclient.evaluation.kernel.zmqsocket.SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.

To run my code again, I have to wait around 3 hours before rerunning my routine; otherwise the socket failure persists.
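For illustration, the "wait and try again" step could be automated with a retry loop that doubles the delay after each failed start. This is only a sketch: `start_with_backoff` and its parameters are hypothetical names, not part of wolframclient, and `start` stands for whatever function opens the session. It would not explain or remove the underlying failure; it only avoids restarting the job by hand.

```python
import time


def start_with_backoff(start, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call start() until it succeeds, doubling the wait after each failure.

    `start` is any zero-argument callable (hypothetical: a function that
    opens the Wolfram session). The last exception is re-raised if every
    attempt fails.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return start()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
            delay *= 2  # exponential backoff between attempts
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting; in a real run the default `time.sleep` would be used, probably with a much larger `base_delay`.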

So my questions are: i) is there a better way to kill stray processes than what I have used, ii) why am I getting the socket failures, and iii) is there a way to heal the socket failures faster?
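Regarding question i), one alternative that does not go through the controller's private `_kernel_stop()` method is to look for leftover kernel processes at the operating-system level. The sketch below is an assumption-laden illustration, not a wolframclient feature: it assumes a Linux node where `ps` is available and where the kernel's command name contains `WolframKernel`; both function names are hypothetical.

```python
import os
import signal
import subprocess


def parse_pids(ps_output, name):
    """Return PIDs from `ps -e -o pid= -o comm=` output whose command contains name."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # -> [pid, command name]
        if len(parts) == 2 and parts[0].isdigit() and name in parts[1]:
            pids.append(int(parts[0]))
    return pids


def terminate_stray(name="WolframKernel"):
    """Send SIGTERM to matching processes; the process name is an assumption."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = [pid for pid in parse_pids(out, name) if pid != os.getpid()]
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
    return pids
```

With MPI, calling `terminate_stray()` from every rank could kill kernels that other ranks on the same node have just started, so it would presumably have to run once per node before any session is opened.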
