Description:
I would like to propose a change to the data preprocessing workflow. Currently, there is a tendency to pass pre-tokenized (space-separated) strings into the model. However, to leverage the full semantic power of transformer-based embeddings (such as BERT), we should feed the raw, original text into the embedding model.
The Problem:
If we manually join tokens with spaces before encoding:
- We lose the structural integrity and contextual nuances of the original sentence.
- The embedding model's performance may degrade because it cannot "see" the natural flow of the language.
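As a minimal, self-contained sketch of the first point (the tokens and sentence below are hypothetical examples, not taken from the project's data):

```python
# Hypothetical pre-segmented tokens, as a word segmenter might produce them:
tokens = ["洛阳", "旅游", "景点"]

# Joining with spaces yields a string that differs from the original text:
joined = " ".join(tokens)    # "洛阳 旅游 景点"
original = "".join(tokens)   # "洛阳旅游景点"

# A subword tokenizer (e.g. BERT's) will split these two strings differently,
# so the embedding model never sees the sentence as it was actually written.
print(joined == original)  # False
```

The inserted spaces change the character sequence itself, which is exactly why the raw string, not the pre-joined one, should reach the embedding model.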
The Proposed Solution:
Instead of pre-splitting the text, we should pass the raw strings directly to the BERTopic pipeline. To handle specific tokenization needs (especially for Chinese text, such as the 'Luoyang Tourism' data), we should override the tokenizer inside the CountVectorizer model.
Code Example:
Instead of calling `" ".join(words)` before encoding, we tokenize inside the vectorizer:

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Segment Chinese text inside the vectorizer, not before embedding.
vectorizer_model = CountVectorizer(
    tokenizer=lambda x: jieba.lcut(x),
    vocabulary=custom_vocab,  # optional: restrict topic words to a custom vocabulary
    token_pattern=None        # suppress the warning about the unused default regex
)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(raw_documents)  # raw, unsegmented text
```