Skip to content

Deadlock after fork when calling dgetrf_ #5520

@mattip

Description

@mattip

As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling dgetrf_ after a fork. I instrumented the calls to LOCK_COMMAND and UNLOCK_COMMAND in blas_server.c and I think the problem is in exec_blas_async. This is "new" after #5170.

Here is the main() of the test code

int main() {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];

    // array is an identity matrix
    double arr[200*200];
    for (int i = 0; i < m*n; i += n + 1) {
        arr[i] = 1.0;
    }

    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }

    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");

and here is what I see with debug printing (on OpenBLAS HEAD, using ``)

installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565

Note the call to LOCK_COMMAND in exec_blas_async, and then the call to blas_thread_init, which again tries to call LOCK_COMMAND. Boom.

#ifdef SMP_SERVER
// Handle lazy re-init of the thread-pool after a POSIX fork
LOCK_COMMAND(&server_lock);
if (unlikely(blas_server_avail == 0)) blas_thread_init();
UNLOCK_COMMAND(&server_lock);
#endif
BLASLONG i = 0;

I am not sure what the best way is to solve this. Note that the first thing blas_thread_init does is to check blas_server_avail (with no lock), so maybe the lock/unlock in exec_blas_async should be removed?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions