jjdrisco/Topic-Modeling_NMF-vs-BERTopic

Project: Non-Negative Matrix Factorization (NMF) vs. BERTopic for Topic Modeling

Course: DSC 210: Numerical Linear Algebra for Data Science

Instructor: Dr. Tsui-wei Weng

Instructions:

Note: BERTopic's visualizations require the model to be trained during the run, because saving the model to a pickle file removes that functionality. Training BERTopic within the notebook takes around 20 minutes when run on Google Colab. More details here.

  • Ensure that the following libraries are installed in a Python 3.10 environment:

    • pandas
    • numpy
    • seaborn
    • matplotlib
    • nltk
    • collections (part of the Python standard library; no installation needed)
    • scikit-learn
    • gensim
    • tqdm
    • plotly
    • wordcloud
    • octis
    • sentence_transformers
    • umap (published on PyPI as umap-learn)
    • hdbscan
    • bertopic
  • Open NMF_vs_BERT.ipynb and run all the cells of the notebook.
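
Before running the notebook, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the list uses import names, which are assumed to differ from the install names in one case (`sklearn` for scikit-learn).

```python
import importlib.util

# Import names assumed for the libraries listed in the instructions.
# Note: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "pandas", "numpy", "seaborn", "matplotlib", "nltk", "collections",
    "sklearn", "gensim", "tqdm", "plotly", "wordcloud", "octis",
    "sentence_transformers", "umap", "hdbscan", "bertopic",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat a broken or partially installed module as missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Running this in the same environment (or a Colab cell) as the notebook prints any modules that still need to be installed.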

Results:

  • NMF

Top 10 words for each of the topics: [screenshot]

Distributions of topics across all documents: [screenshot]

Visualized embedding of documents in 2-D space: [screenshot]

Visualized distribution of documents per topic based on newsgroup: [plot]

View the interactive plot

Topic Diversity: 0.98
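
For readers unfamiliar with the metric, topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics, so a value near 1 means the topics share almost no top words. The sketch below assumes that common definition; the notebook may compute it differently (for example through a library), and the function name and sample topics are illustrative only.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered from most to least
    representative. This assumes the common definition of the metric,
    not necessarily the exact computation used in the notebook.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Illustrative topics (made up for this example, not taken from the results).
example_topics = [
    ["god", "jesus", "church"],
    ["game", "team", "hockey"],
    ["team", "season", "player"],
]
score = topic_diversity(example_topics, top_k=3)
print(f"Topic diversity: {score:.2f}")
```

Here "team" appears in two topics, so 8 of the 9 top words are unique and the score falls slightly below 1.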

Residual Score: 0.94

Topic Coherence: 0.46

  • BERTopic

Top 10 words for each of the topics: [plot]

Distributions of topics across all documents: [screenshot]

Visualized embedding of documents in 2-D space: [screenshot]

Visualized distribution of documents per topic based on newsgroup: [plot]

View the interactive plot

Topic Diversity: 0.99

Topic Coherence: 0.51

  • Metric Comparison

[screenshot]

NMF appears to be better at extracting broad, abstract themes, such as religion and sport, whereas BERTopic appears to be better at extracting more specific topics, such as Christianity and hockey.
