diff --git a/docs/src/index.md b/docs/src/index.md
index 7f418fa..0f1ac7d 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -208,3 +208,8 @@ You can now use these pids to see which documents match the best against your qu
 julia> print(readlines("1kcollection.txt")[pids[1]])
 Tl;dr - Yes, it sounds like a possible 1080 fox bait poisoning. Can't be sure though. The traditional fox bait is called 1080. That poisonous bait is still used in a few countries to kill foxes, rabbits, possums and other mammal pests. The toxin in 1080 is Sodium fluoroacetate. Wikipedia is a bit vague on symptoms in animals, but for humans they say: In humans, the symptoms of poisoning normally appear between 30 minutes and three hours after exposure. Initial symptoms typically include nausea, vomiting and abdominal pain; sweating, confusion and agitation follow. In significant poisoning, cardiac abnormalities including tachycardia or bradycardia, hypotension and ECG changes develop. Neurological effects include muscle twitching and seizures... One might safely assume a dog, especially a small Whippet, would show symptoms of poisoning faster than the 30 mins stated for humans. The listed (human) symptoms look like a good fit to what your neighbour reported about your dog. Strychnine is another commonly used poison against mammal pests. It affects the animal's muscles so that contracted muscles can no longer relax. That means the muscles responsible of breathing cease to operate and the animal suffocates to death in less than two hours. This sounds like unlikely case with your dog. One possibility is unintentional pet poisoning by snail/slug baits. These baits are meant to control a population of snails and slugs in a garden. Because the pelletized bait looks a lot like dry food made for dogs it is easily one of the most common causes of unintentional poisoning of dogs. 
The toxin in these baits is Metaldehyde and a dog may die inside four hours of ingesting these baits, which sounds like too slow to explain what happened to your dog, even though the symptoms of this toxin are somewhat similar to your case. Then again, the malicious use of poisons against neighbourhood dogs can vary a lot. In fact they don't end with just pesticides but also other harmful matter, like medicine made for humans and even razorblades stuck inside a meatball, have been found in baits. It is quite impossible to say what might have caused the death of your dog, at least without autopsy and toxicology tests. The 1080 is just one of the possible explanations. It is best to always use a leash when walking dogs in populated areas and only let dogs free (when allowed by local legislation) in unpopulated parks and forests and suchlike places.
 ```
+---
+
+## Tutorials
+
+- [Basic Retrieval Example](tutorials/basic_retrieval.md)
\ No newline at end of file
diff --git a/docs/src/tutorials/basic_retrieval.md b/docs/src/tutorials/basic_retrieval.md
new file mode 100644
index 0000000..32882a5
--- /dev/null
+++ b/docs/src/tutorials/basic_retrieval.md
@@ -0,0 +1,39 @@
+# Basic Retrieval Example
+
+This tutorial demonstrates how to use ColBERT.jl for simple document retrieval.
+
+---
+
+## Step 1: Prepare Dataset
+
+The dataset should be in TSV format:
+
+doc_id \t title \t body
+
+Example:
+
+1 Deep Learning Neural networks are powerful
+2 Machine Learning Supervised learning is common
+
+---
+
+## Step 2: Build Index
+
+```julia
+using ColBERT
+
+config = ColBERTConfig(
+    collection="sample.tsv",
+    index_path="index"
+)
+
+indexer = Indexer(config)
+index(indexer)
+```
+
+## Step 3: Retrieval
+
+After building the index, it can be used for efficient document retrieval.
+
+The querying interface may vary depending on the current implementation.
+Users can refer to the latest examples in the repository for performing search.
\ No newline at end of file
diff --git a/src/indexing.jl b/src/indexing.jl
index 9f40765..b93bb63 100644
--- a/src/indexing.jl
+++ b/src/indexing.jl
@@ -21,12 +21,32 @@ Type representing an ColBERT indexer.
 An [`Indexer`] wrapping a [`ColBERTConfig`](@ref) along with the trained ColBERT
 model.
 """
+function parse_tsv_line(line::String)
+    parts = split(line, '\t')
+
+    # Skip invalid lines
+    if length(parts) < 2
+        return nothing
+    end
+
+    # Combine title + body + extra fields
+    return strip(join(parts[2:end], " "))
+end
 function Indexer(config::ColBERTConfig)
     tokenizer, bert, linear = load_hgf_pretrained_local(config.checkpoint)
     bert = bert |> Flux.gpu
     linear = linear |> Flux.gpu
-    collection = config.collection isa String ? readlines(config.collection) :
-                 config.collection
+    collection =
+        if config.collection isa String
+            lines = readlines(config.collection)
+
+            [
+                doc for doc in (parse_tsv_line(line) for line in lines)
+                if doc !== nothing
+            ]
+        else
+            config.collection
+        end
     punctuations_and_padsym = [string.(collect("!\"#\$%&\'()*+,-./:;<=>?@[\\]^_`{|}~"));
                                tokenizer.padsym]
     skiplist = config.mask_punctuation ?
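
The parsing rule introduced in `src/indexing.jl` can be tried in isolation. The following is a minimal sketch, not the package's code: `parse_tsv_line_sketch` is a hypothetical stand-alone copy of the helper, and its `AbstractString` signature and `String(...)` conversion are assumptions that are not part of the patch. It shows that the `doc_id` field is discarded, the remaining fields are joined with single spaces, and any line without a tab is dropped.

```julia
# Hypothetical stand-alone copy of the patch's helper. The name, the
# AbstractString signature and the String(...) conversion are assumptions.
function parse_tsv_line_sketch(line::AbstractString)
    parts = split(line, '\t')

    # A line with fewer than two tab-separated fields has no text to index.
    length(parts) < 2 && return nothing

    # Drop the doc_id (first field) and join title, body and any extra fields.
    return String(strip(join(parts[2:end], " ")))
end

lines = [
    "1\tDeep Learning\tNeural networks are powerful",
    "2\tMachine Learning\tSupervised learning is common",
    "a plain line without tabs",
]

# Keep only the lines that parsed successfully, mirroring the patched constructor.
collection = [doc for doc in (parse_tsv_line_sketch(l) for l in lines) if doc !== nothing]

foreach(println, collection)
```

Two consequences of the patch as written follow from this behaviour. A plain one-document-per-line file (no tabs) would yield an empty collection, because every line fails the field-count check. Separately, the helper is added directly after the closing `"""` of the `Indexer` docstring, so Julia would attach that docstring to `parse_tsv_line` instead of `Indexer`; defining the helper above the docstring avoids this.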