Skip to content

Latest commit

 

History

History
35 lines (26 loc) · 1.33 KB

File metadata and controls

35 lines (26 loc) · 1.33 KB

cppsim

Fast multi-threaded memory optimized tool to compute cosine similarity on very large matrices imported from NumPy

Run

Using cppsim is very straightforward.

$ ./cppsim 
Allowed options:
  -h [ --help ]         print usage message
  -i [ --input ] arg    pathname for input matrix
  -o [ --output ] arg   pathname for output matrix
  -d [ --dir ] arg      output directory for row vectors
  -s [ --split ]        split matrix into row vectors
  -u [ --users ]        path for users index file to compute similarities
  -t [ --threads ] arg  threads

There are two main ways to operate with cppsim. It is possible to compute the similarity matrix or split it into row vectors.

The former method needs the input and output matrices, as shown in the following example.

$ ./cppsim -i input.npy -o output.npy

The latter splits the matrix similarity into row vectors, one for each file. This method is preferred when the RAM you have is not enough to maintain the full similarity matrix. To split the output matrix into row vectors, run cppsim with -s arguments and specify with -d the directory in which to store the row vectors.

$ ./cppsim -i input.npy -s -d similarities 

If a users index file is specified, cppsim must be run with the -s option. In this case, it will compute similarities only for the specified users.