Yep, you can always OCR -> chunk -> annotate the chunks with metadata attached to the documents: where they came from, creation timestamp, relationships to other docs captured via eventEffects, etc.
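The OCR -> chunk -> annotate step above could be sketched roughly like this (the `Chunk` class, chunk size, and metadata keys are all illustrative choices, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_document(ocr_text: str, source: str, created_at: str,
                   chunk_size: int = 200) -> list[Chunk]:
    """Split OCR output into fixed-size chunks and attach provenance metadata."""
    chunks = []
    for i in range(0, len(ocr_text), chunk_size):
        chunks.append(Chunk(
            text=ocr_text[i:i + chunk_size],
            # Provenance metadata travels with each chunk so retrieval
            # results can be traced back to the original document.
            metadata={"source": source, "created_at": created_at, "offset": i},
        ))
    return chunks
```

Relationships to other docs (the eventEffects part) would be additional metadata entries, e.g. a list of related document IDs per chunk.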

On the vision-model side (à la ColPali), you can start with single-vector models like https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct [there is also a 2B variant] and pass their embeddings in via CustomSpace in superlinked (or see if you can figure out how to use the ImageSpace to run it within your superlinked server). Of course, being single-vector, the quality is reduced - so make sure to retrieve more candidates and provide them to the vision-capable model (e.g. GPT-4o) that you use to answer.
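The "retrieve more candidates" idea is just over-fetching by cosine similarity before handing pages to the answering model. A minimal sketch, assuming the vectors were precomputed by a single-vector model such as gme-Qwen2-VL (toy 2-d vectors stand in for real embeddings here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_candidates(query_vec: list[float],
                     doc_vecs: list[list[float]],
                     k: int) -> list[int]:
    """Return indices of the k most similar document vectors.

    Because a single vector per page loses detail, k should be set
    larger than you would for a multi-vector retriever; the surplus
    candidates are then re-ranked by the vision-capable LLM.
    """
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]
```

In superlinked you would instead register the precomputed vectors in a CustomSpace and let its index do this search; the sketch only shows the over-fetch-then-rerank shape of the pipeline.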

Answer selected by octalpixel