Any pointers to search PDF documents and custom embedding? #140
Any pointers on searching PDF documents? I would love to see a recipe or example. Off the top of my head, I guess I would chunk the text, save it into the columnar data, and then integrate Superlinked to get the embeddings? A related train of thought: would it also be possible to use something like ColPali?
Replies: 1 comment 4 replies
Yep, you can always OCR -> chunk -> annotate the chunks with metadata attached to the documents (where they came from, creation timestamp, relationships to other docs captured via eventEffects, etc.).

On the vision-model side (à la ColPali), you can start with single-vector models like https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct [there is also a 2B variant] and pass their embeddings in via CustomSpace in Superlinked (or see if you can figure out how to use the ImageSpace to run it within your Superlinked server). Of course, being single-vector, the retrieval quality is lower, so make sure to retrieve more candidates and provide them to the vision-capable model (e.g. GPT-4o) that you use to answer the question. We found that even 10+ page screenshots are workable. We are actively working on making that experience (pool the late-interaction multi-vector representation into a single vector, then re-rank on the multi-vector representation) better in Superlinked so it works more out of the box :-)
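To make the "pool into a single vector, then re-rank on the multi-vector representation" idea concrete, here is a minimal pure-Python sketch. It is not Superlinked's API or implementation: the function names (`mean_pool`, `maxsim`, `search`) are hypothetical, mean pooling is just one possible pooling choice, the toy 2-D vectors stand in for real model embeddings, and the token/patch vectors are assumed to be L2-normalized already.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average a multi-vector representation into one L2-normalized vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    return [x / norm for x in mean] if norm > 0 else mean


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: each query vector takes its best-matching
    page vector, and the per-query maxima are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def search(
    query_vecs: Sequence[Vector],
    pages: Dict[str, Sequence[Vector]],
    n_candidates: int,
    top_k: int,
) -> List[str]:
    # Stage 1: cheap single-vector retrieval on pooled vectors.
    # (A real system would precompute the pooled page vectors and
    # store them in the vector index instead of pooling per query.)
    pooled_query = mean_pool(query_vecs)
    candidates = sorted(
        pages,
        key=lambda pid: dot(pooled_query, mean_pool(pages[pid])),
        reverse=True,
    )[:n_candidates]
    # Stage 2: re-rank only the candidates on the full multi-vector form.
    reranked = sorted(
        candidates,
        key=lambda pid: maxsim(query_vecs, pages[pid]),
        reverse=True,
    )
    return reranked[:top_k]


# Toy data: page "a" contains an exact match for the query, but pooling
# blurs it, so the single-vector stage ranks "b" above "a".
pages = {
    "a": [[1.0, 0.0], [-0.6, 0.8]],
    "b": [[0.8, 0.6], [0.8, 0.6]],
    "c": [[0.0, 1.0], [0.0, 1.0]],
}
query = [[1.0, 0.0]]

print(search(query, pages, n_candidates=2, top_k=2))  # ['a', 'b']
print(search(query, pages, n_candidates=1, top_k=2))  # ['b']
```

With only one candidate, the best page ("a") never reaches the re-ranking stage, which is why it helps to retrieve more candidates than you ultimately need when the first stage is single-vector.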