dtrmv performance penalty with small N #5402

@jschueller

Hello,

I have a piece of code that calls dtrmv in small dimensions (N<10) but many times (~1e6) to compute the Normal cumulative distribution function from the inverse Cholesky factor of its correlation matrix (openturns).
OpenBLAS seems to incur a penalty there, as it tries to start as many threads as possible (40 HT on my machine).

Consider the following reproducer:

#include <cblas.h>
#include <stdio.h>

int main()
{
  double A[25] = {1.1,2.0,1.0,-3.0,4.0,
                  -1.4,2.0,2.0,3.0, 0.0,
                  5.4, 3.45, -5.9, 0.0, 0.0,
                  7.1, 4.3, 0.0, 0.0, 0.0,
                  -8.2, 0.0, 0.0, 0.0, 0.0};
  const int N = 5;
  double X[5] = {1.0,2.0,1.0,-3.0,4.0};
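  /* Many tiny calls: with CblasLower + CblasUnit only the strictly lower
     triangle of A is read, and X is updated in place on every iteration. */
  for (unsigned int i = 0; i < 1000000; ++ i)
  {
    cblas_dtrmv(CblasRowMajor, CblasLower, CblasNoTrans, CblasUnit, N, A, N, X, 1);
  }
  for(int i=0; i<N; i++)
    printf("%g ", X[i]);

  printf("\n");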
  return 0;
}

With the default thread count (40) it takes ~15s but only ~0.1s with OMP_NUM_THREADS=1.
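
As a caller-side stopgap, a call this small can be replaced by a hand-written loop that never enters the threaded BLAS path. The sketch below is only an illustration, not OpenBLAS code: `small_trmv_lower_unit` is a hypothetical helper name, and it covers only the case used in the reproducer (row-major, lower triangular, no transpose, unit diagonal, unit stride on X).

```c
#include <stddef.h>

/* Hypothetical helper, not part of OpenBLAS or CBLAS.
   Computes x := A*x in place for a row-major, lower-triangular,
   unit-diagonal n-by-n matrix with leading dimension lda, i.e. the
   operation requested by
   cblas_dtrmv(CblasRowMajor, CblasLower, CblasNoTrans, CblasUnit,
               n, a, lda, x, 1). */
void small_trmv_lower_unit(int n, const double *a, int lda, double *x)
{
    /* Walk rows from last to first: row i only needs x[0..i-1],
       which have not been overwritten yet. */
    for (int i = n - 1; i >= 0; --i) {
        double acc = x[i];               /* unit diagonal: 1.0 * x[i] */
        for (int j = 0; j < i; ++j)
            acc += a[(size_t)i * lda + j] * x[j];
        x[i] = acc;
    }
}
```

In the reproducer, the call inside the loop would then read `small_trmv_lower_unit(N, A, N, X);`. This only sidesteps the problem for one parameter combination; it does not change how the library decides on threading.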

It looks like this needs something similar to what was done in #4585.

This is OpenBLAS 0.3.29 from Fedora Rawhide (with FlexiBLAS).

I also tried 0.3.30 from Arch Linux.

/cc @martin-frbg
