Project: Non-Negative Matrix Factorization (NMF) vs. BERTopic for Topic Modeling

Course: DSC 210: Numerical Linear Algebra for Data Science

Instructor: Dr. Tsui-wei Weng

Instructions:

Note: BERTopic's visualizations require the model to be created during the run, because saving it to a pickle file removes that functionality. Training BERTopic within the notebook takes around 20 minutes when run on Google Colab. More details here.

Ensure that the following libraries are installed in a Python 3.10 environment:
- pandas
- numpy
- seaborn
- matplotlib
- nltk
- collections (part of the Python standard library, so no installation is needed)
- scikit-learn
- gensim
- tqdm
- plotly
- wordcloud
- octis
- sentence_transformers
- umap (installed from PyPI as umap-learn)
- hdbscan
- bertopic

Open NMF_vs_BERT.ipynb and run all the cells of the notebook.
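
Before running the notebook, it can help to confirm that every library in the list is importable. The following is a minimal sketch, not part of the project itself: the helper `find_missing` and the list `REQUIRED_MODULES` are hypothetical names, and the entries are the usual import names for the listed libraries (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names assumed for the libraries listed in this README.
# scikit-learn is imported as "sklearn"; the rest match the list as written.
REQUIRED_MODULES = [
    "pandas", "numpy", "seaborn", "matplotlib", "nltk", "collections",
    "sklearn", "gensim", "tqdm", "plotly", "wordcloud", "octis",
    "sentence_transformers", "umap", "hdbscan", "bertopic",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only locates each top-level module without importing it, so it runs quickly even for heavy packages.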

Results:

NMF

Top 10 words for each of the topics:

Distributions of topics across all documents:

Visualized embedding of documents in 2-D space:

Visualized distribution of documents per topic based on newsgroup:

View the interactive plot

Topic Diversity: 0.98

Residual Score: 0.94

Topic Coherence: 0.46
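
For reference, topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics, so a value near 1 means the topics share very few top words. The sketch below illustrates that common definition in plain Python. It is an assumption that the notebook computes the metric this way (for instance through octis), and the function name `topic_diversity` and the toy topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered from most to least
    important for its topic.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics (not taken from the notebook's output).
toy_topics = [
    ["god", "church", "faith", "bible"],
    ["game", "team", "hockey", "season"],
    ["game", "windows", "file", "drive"],
]

# "game" appears in two topics, so 11 of the 12 top words are unique.
score = topic_diversity(toy_topics, top_k=4)
print(f"Topic diversity: {score:.2f}")
```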

BERTopic

Top 10 words for each of the topics:

Distributions of topics across all documents:

Visualized embedding of documents in 2-D space:

Visualized distribution of documents per topic based on newsgroup:

View the interactive plot

Topic Diversity: 0.99

Topic Coherence: 0.51

Metric Comparison

NMF appears to extract broader, more abstract themes, such as religion and sport, whereas BERTopic appears to extract more specific topics, such as Christianity and hockey.
