Skip to content

[Bug/Model Request]: LateInteractionTextEmbedding("colbert-ir/colbertv2.0") creates different size of embeddings for large set of documents #273

@shima-khoshraftar

Description

@shima-khoshraftar

What happened?

A bug happened!

I am using embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0") in fastembed==0.3.0 and
following the code in https://qdrant.github.io/fastembed/examples/ColBERT_with_FastEmbed/#colbert-in-fastembed.

I created embeddings for some documents. However, I got an error on this part of the code when running on large collection of documents:
sorted_indices = compute_relevance_scores(
np.array(query_embeddings[0]), np.array(document_embeddings), k=3
)

complaining that it can not create np.array from document_embeddings. Looking into it I realized that sizes of each document_embedding in document_embeddings are different. For instance for 442 documents, the first ~260 documents have embedding size of (182,128) and for the next half the document embedding size is (164, 128). I am wondering if you can help me with that. Thanks.

What Python version are you on? e.g. python --version

Python 3.12

Version

0.2.7 (Latest)

What os are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

Traceback (most recent call last):
    np.array(query_embeddings[0]), np.array(document_embeddings), k=3
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (442,) + inhomogeneous part.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions