Note: BERTopic's visualizations require the model to be trained during the run, because a model saved to a pickle file loses this functionality. Training BERTopic within the notebook takes around 20 minutes when run on Google Colab. More details here.
Ensure that the following libraries are installed in a Python 3.10 environment:
- pandas
- numpy
- seaborn
- matplotlib
- nltk
- collections (part of the Python standard library; no installation needed)
- scikit-learn
- gensim
- tqdm
- plotly
- wordcloud
- octis
- sentence_transformers
- umap (installed from PyPI as umap-learn)
- hdbscan
- bertopic
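Before running the notebook, it can help to confirm that every library above is importable. The following is a minimal sketch, not part of the notebook itself: the helper name `missing_modules` is hypothetical, and the list uses import names, which differ from the package names in a few cases (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names (not necessarily the pip package names) for the libraries listed above.
REQUIRED_MODULES = [
    "pandas", "numpy", "seaborn", "matplotlib", "nltk", "collections",
    "sklearn", "gensim", "tqdm", "plotly", "wordcloud", "octis",
    "sentence_transformers", "umap", "hdbscan", "bertopic",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment.

    Hypothetical helper for illustration; it only checks that a module can be
    located, not that its version is compatible with the notebook.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name reported as missing needs to be installed before the notebook will run end to end.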
Open NMF_vs_BERT.ipynb and run all the cells of the notebook.
- NMF
Top 10 words for each of the topics:
Distributions of topics across all documents:

Visualized embedding of documents in 2-D space:

Visualized distribution of documents per topic based on newsgroup:

Topic Diversity: 0.98
Residual Score: 0.94
Topic Coherence: 0.46
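The top-10 word lists for NMF can be read off the topic-term weight matrix by ranking each topic's terms by weight. The sketch below illustrates the idea on toy data, using plain Python lists; the function name `top_words_per_topic`, the vocabulary, and the weights are all made up for illustration and are not taken from the notebook. With scikit-learn, the rows would typically come from a fitted model's `components_` attribute and the vocabulary from the vectorizer's `get_feature_names_out()`.

```python
def top_words_per_topic(topic_term_weights, vocabulary, n_top=10):
    """Return the n_top highest-weighted words for each topic.

    topic_term_weights: one row of term weights per topic
    vocabulary: the word for each column index
    """
    topics = []
    for weights in topic_term_weights:
        # Rank column indices from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:n_top]])
    return topics


# Toy data (illustrative only, not results from the notebook).
vocabulary = ["god", "game", "team", "church", "season"]
weights = [
    [0.9, 0.0, 0.1, 0.7, 0.0],  # a "religion"-like topic
    [0.0, 0.8, 0.6, 0.0, 0.5],  # a "sport"-like topic
]
example_topics = top_words_per_topic(weights, vocabulary, n_top=3)
print(example_topics)
```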
- BERTopic
Top 10 words for each of the topics:
Distributions of topics across all documents:

Visualized embedding of documents in 2-D space:

Visualized distribution of documents per topic based on newsgroup:

Topic Diversity: 0.99
Topic Coherence: 0.51
- Metric Comparison
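BERTopic scores slightly higher than NMF on both topic diversity (0.99 vs. 0.98) and topic coherence (0.51 vs. 0.46). Qualitatively, NMF appears to be better at extracting abstract themes, such as religion and sport, whereas BERTopic appears to be better at extracting more specific topics, such as Christianity and hockey.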
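For context on the diversity scores, topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics, so a value near 1 means the topics share almost no top words. The sketch below assumes that definition; the notebook's own computation (for example, via OCTIS) may differ in detail, and the function name and toy topics are illustrative only.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumes each topic is a list of words ordered from most to least relevant.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Toy topics (illustrative only): "team" appears in two topics,
# so 8 of the 9 top words are unique.
toy_topics = [
    ["god", "church", "faith"],
    ["game", "team", "season"],
    ["team", "hockey", "player"],
]
toy_score = topic_diversity(toy_topics, top_k=3)
print(round(toy_score, 2))
```

Under this definition, scores of 0.98 and 0.99 would both indicate very little overlap between the top words of different topics.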
