Description:
I would like to propose a change to the data preprocessing workflow. Currently, there is a tendency to pass pre-tokenized (space-separated) strings into the model. However, to leverage the full semantic power of transformer-based embeddings (such as BERT), we should feed the raw, original text into the embedding model.
The Problem:
If we manually join tokens with spaces before encoding:
- We lose the structural integrity and contextual nuances of the original sentence.
- The embedding model's performance may degrade because it cannot "see" the natural flow of the language.
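As a minimal, self-contained sketch of the first point (the tokens and sentence below are hypothetical examples, not taken from the project's data):

```python
# Hypothetical pre-segmented tokens, as a word segmenter might produce them:
tokens = ["洛阳", "旅游", "景点"]

# Joining with spaces yields a string that differs from the original text:
joined = " ".join(tokens)    # "洛阳 旅游 景点"
original = "".join(tokens)   # "洛阳旅游景点"

# A subword tokenizer (e.g. BERT's) will split these two strings differently,
# so the embedding model never sees the sentence as it was actually written.
print(joined == original)  # False
```

The inserted spaces change the character sequence itself, which is exactly why the raw string, not the pre-joined one, should reach the embedding model.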
The Proposed Solution:
Instead of pre-splitting the text, we should pass the raw strings directly to the BERTopic pipeline. To handle specific tokenization needs (especially for Chinese text, such as the 'Luoyang Tourism' data), we should override the tokenizer inside the CountVectorizer model.
Code Example:
Instead of calling `" ".join(words)` before encoding, we tokenize inside the vectorizer:

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Segment Chinese text inside the vectorizer, not before embedding.
vectorizer_model = CountVectorizer(
    tokenizer=lambda x: jieba.lcut(x),
    vocabulary=custom_vocab,  # optional: restrict topic words to a custom vocabulary
    token_pattern=None        # suppress the warning about the unused default regex
)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(raw_documents)  # raw, unsegmented text
```