
Some Concerns about Use Cases #4

@yijiaxu13-jpg

Description

I would like to propose a change to the data preprocessing workflow. Currently, there is a tendency to pass pre-tokenized (space-separated) strings into the model. However, to leverage the full semantic power of transformer-based embeddings (like BERT), we should feed the raw, original text into the embedding model.

The Problem:
If we manually join tokens with spaces before encoding:

We lose the structural integrity and contextual nuances of the original sentence.

The embedding model's performance may degrade as it can't "see" the natural flow of the language.
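A minimal sketch of the issue above: joining tokens with spaces produces a string the embedding model never saw during pretraining, which is especially visible for Chinese, where the original text contains no spaces at all. (The token list below is illustrative of what a segmenter like `jieba.lcut` might return; it is not actual output from this project's data.)

```python
# Illustrative example (hypothetical segmentation of a Chinese sentence)
text = "洛阳旅游景点推荐"                      # raw sentence, no spaces
tokens = ["洛阳", "旅游", "景点", "推荐"]      # jieba-style segmentation

joined = " ".join(tokens)
print(joined)           # "洛阳 旅游 景点 推荐" - spaces inserted between words
print(joined == text)   # False: the string the embedder sees has been altered
```

The space-joined string is what the embedding model receives if we pre-tokenize, and it no longer matches the natural text distribution the model was trained on.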

The Proposed Solution:
Instead of pre-splitting the text, we should pass the raw strings directly to the BERTopic pipeline. To handle specific tokenization needs (especially for Chinese text like 'Luoyang Tourism' data), we should redefine the tokenizer within the CountVectorizer model.

Code Example:

Instead of using `" ".join(words)`, we use:

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Tokenize inside the vectorizer instead of pre-splitting the documents
vectorizer_model = CountVectorizer(
    tokenizer=lambda x: jieba.lcut(x),  # segment Chinese text at c-TF-IDF time
    vocabulary=custom_vocab,            # optional: restrict to a predefined vocabulary
    token_pattern=None                  # unused when a custom tokenizer is given
)

topic_model = BERTopic(vectorizer_model=vectorizer_model)
topic_model.fit_transform(raw_documents)  # passing raw text here
```
