simunic-cz commented on Apr 27, 2024
- Added the Czech stemming algorithm in SBL, copied from https://snowballstem.org
- Compiled Czech SBL into Rust, added support for Czech
- Updated the README file
|
Update: fixed incorrectly generated
|
Is this project abandoned?
|
@elvircrn pretty much 😅😅 I needed to add a stemmer here because the full-text search engine Tantivy depends on this package. But since it seems to be abandoned, I took inspiration from this project and created a new package, testuj-to/tantivy-stemmers, that incorporates many more languages and even exposes them as Cargo features, so you don't have to compile and bundle any languages you don't need.

While the testuj-to/tantivy-stemmers crate is primarily intended for use with the Tantivy engine, it's not necessarily dependent on it. I have seen that someone has used it without Tantivy: https://github.com/kemingy/tocken/blob/main/src/tokenizer.rs#L11

You can install whatever language (algorithm) you need as a Cargo feature:

```toml
[dependencies]
tantivy-stemmers = { version = "0.4.0", features = ["default", "norwegian_bokmal"] }
```

and then you should be able to just import and use the stemming function directly:

```rust
use tantivy_stemmers::algorithms::norwegian_bokmal as stem_norwegian;

fn main() {
    let input_phrase = "The quick brown fox jumps over the lazy dog";
    for word in input_phrase.split(" ") {
        let input = word.to_lowercase(); // Input must be a single word in lowercase
        /*
         * Pass the input as &str - all the algorithms are of type
         * pub type Algorithm = fn(&str) -> Cow<str>;
         */
        let output = stem_norwegian(input.as_str());
        println!("{}", output);
    }
}
```
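For anyone unfamiliar with the `fn(&str) -> Cow<str>` shape mentioned in the code comment above, here is a minimal, self-contained sketch of why that type is convenient. It does not use the tantivy-stemmers crate at all: `toy_stem`, `identity`, `pick` and `stem_phrase` are hypothetical helpers made up for illustration, and `toy_stem` is not a real Snowball algorithm. The point is only that plain function pointers can be chosen at runtime, and that `Cow` lets a stemmer borrow from its input when it merely trims the word and allocate only when it rewrites the ending.

```rust
use std::borrow::Cow;

// Same shape as the `Algorithm` alias quoted above, with the lifetime spelled out.
type Algorithm = for<'a> fn(&'a str) -> Cow<'a, str>;

// Hypothetical toy stemmer (NOT a real Snowball algorithm).
fn toy_stem(word: &str) -> Cow<'_, str> {
    if let Some(prefix) = word.strip_suffix("ies") {
        // Rewriting the ending needs a new String, so this branch allocates.
        Cow::Owned(format!("{}y", prefix))
    } else if let Some(prefix) = word.strip_suffix('s') {
        // Trimming only: borrow a slice of the input, no allocation.
        Cow::Borrowed(prefix)
    } else {
        Cow::Borrowed(word)
    }
}

// Hypothetical no-op algorithm that returns the word unchanged.
fn identity(word: &str) -> Cow<'_, str> {
    Cow::Borrowed(word)
}

// Because every algorithm has the same function-pointer type,
// one can be selected by name at runtime.
fn pick(name: &str) -> Option<Algorithm> {
    match name {
        "toy" => Some(toy_stem as Algorithm),
        "identity" => Some(identity as Algorithm),
        _ => None,
    }
}

// Lowercase each word, then apply the chosen algorithm to it.
fn stem_phrase(algorithm: Algorithm, phrase: &str) -> Vec<String> {
    phrase
        .split_whitespace()
        .map(|word| algorithm(&word.to_lowercase()).into_owned())
        .collect()
}

fn main() {
    let algorithm = pick("toy").expect("unknown algorithm name");
    for stem in stem_phrase(algorithm, "Ponies chase cats") {
        println!("{}", stem);
    }
}
```

With a real stemmer from the crate, the assumption is that the imported function could be passed wherever `Algorithm` is expected here, since the comment above says all algorithms share that signature.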
|
Ah, that Cargo feature is definitely useful. I went ahead and added a missing language to both rust-stemmers and tantivy-stemmers, but thought that tantivy-stemmers was abandoned in favor of rust-stemmers, since Tantivy depends on it. Seems like it's the other way around. :)
|
(thanks for the detailed answer)