A text classification project using scikit-learn's 20 Newsgroups dataset and a TF-IDF + Logistic Regression pipeline.
The model is trained on a 4-category subset of the 20 Newsgroups dataset:
- alt.atheism
- talk.religion.misc
- comp.graphics
- sci.space
The dataset is downloaded automatically via fetch_20newsgroups.
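As a rough illustration of how the subset above could be loaded and fed to a TF-IDF + Logistic Regression pipeline, here is a minimal sketch. The helper names (`load_split`, `build_pipeline`, `train_and_evaluate`) and the `max_iter=1000` setting are assumptions made for this example; they are not taken from `src/train.py`, which may be organised differently.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The four newsgroups listed above.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_split(subset):
    """Fetch one split ("train" or "test") restricted to CATEGORIES.

    The first call downloads the data and caches it locally.
    """
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES)


def build_pipeline():
    """TF-IDF features followed by a logistic regression classifier.

    max_iter=1000 is an assumed value, chosen only to give the solver
    room to converge; the project's script may use other settings.
    """
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


def train_and_evaluate():
    """Fit on the train split and return accuracy on the test split."""
    train = load_split("train")
    test = load_split("test")
    model = build_pipeline()
    model.fit(train.data, train.target)
    return model.score(test.data, test.target)
```

Calling `train_and_evaluate()` triggers the download on first use and returns the test accuracy as a float.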
- notebooks/newsgroups_model.ipynb – exploration and notebook walkthrough
- src/train.py – standalone training script
- outputs/confusion_matrix.png – confusion matrix
- outputs/newsgroups_text_model.joblib – saved text classification model
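Once the training script has produced the saved model file, it could be reused for prediction along these lines. This is a sketch under the assumption that the `.joblib` file holds a fitted scikit-learn pipeline that accepts raw strings; `predict_labels` is a hypothetical helper, not part of the project.

```python
import joblib

# Path from the file list above, relative to the project root.
MODEL_PATH = "outputs/newsgroups_text_model.joblib"


def predict_labels(texts, model_path=MODEL_PATH):
    """Load the saved model and return one predicted label per input text.

    Assumes the stored object exposes predict() on a list of raw strings,
    as a fitted TF-IDF + classifier pipeline would.
    """
    model = joblib.load(model_path)
    return model.predict(texts).tolist()
```

Only load `.joblib` files you trust, since they are pickle-based and unpickling can execute arbitrary code.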
python -m venv venv # or: python3 -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
python src/train.py