Newsgroup Topic Classification

An exercise in processing unstructured text data and engineering features from it

Description

The objective of this project was to experiment with handling unstructured text data for a classification task. The data comes from the open 20 Newsgroups dataset available through scikit-learn's dataset loaders, which contains over 18,000 newsgroup posts across 20 topics. I focus on classifying posts from newsgroups in four categories: atheism, religion, computer graphics, and space. Rather than using more sophisticated natural language processing techniques, I use a bag-of-words method to extract features from each newsgroup post.
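As a minimal sketch of the setup described above, the snippet below shows one way to select the four newsgroups and turn posts into bag-of-words counts. The category names, the `remove` argument, and the helper `load_posts` are my assumptions for illustration and may differ from what the repository does. The vectorization step runs on a tiny stand-in corpus so it works without downloading the data.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Assumed scikit-learn names for the four topics described above.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_posts(subset="train"):
    """Hypothetical helper: fetch the selected newsgroups.

    Downloads the dataset on first use, so it is defined here but not called.
    """
    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),  # assumption: strip metadata
    )


# Tiny stand-in corpus so the bag-of-words step runs offline.
toy_posts = [
    "The shuttle launch was delayed by weather",
    "Rendering a 3D image requires a graphics card",
    "The launch of the new graphics library",
]

# Each post becomes a row of word counts; word order is discarded.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_posts)  # sparse matrix: posts x vocabulary

# Look up how often "launch" occurs in each post.
launch_col = vectorizer.vocabulary_["launch"]
launch_counts = counts[:, launch_col].toarray().ravel()
print(launch_counts)
```

With the real data, `load_posts().data` would take the place of `toy_posts`, and `load_posts().target` would supply the labels.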

Techniques

Feature engineering:

Bag-of-words
count vectorization
TF-IDF vectorization
vocabulary limitation/expansion
stop word removal
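The feature engineering options listed above map fairly directly onto scikit-learn's vectorizer parameters. The following is a rough sketch on a made-up four-document corpus. The specific settings (`min_df=2`, the n-gram ranges, the `char_wb` analyzer) are illustrative assumptions, not the values used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents, used only to illustrate the options.
docs = [
    "god and faith in the church",
    "the orbit of the moon and the shuttle",
    "graphics and image rendering in the browser",
    "faith and belief without the church",
]

# Count vectorization: raw term counts over the full vocabulary.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs)

# TF-IDF vectorization: reweights counts so that terms appearing in
# many documents carry less weight.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

# Vocabulary limitation: remove English stop words and keep only terms
# that occur in at least two documents.
limited_vec = CountVectorizer(stop_words="english", min_df=2)
X_limited = limited_vec.fit_transform(docs)

# Vocabulary expansion: add word bigrams on top of single words.
bigram_vec = CountVectorizer(ngram_range=(1, 2))
X_bigram = bigram_vec.fit_transform(docs)

# Character-level features: n-grams of 3 to 5 characters inside word boundaries.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X_char = char_vec.fit_transform(docs)

for name, X in [
    ("count", X_count),
    ("tf-idf", X_tfidf),
    ("limited", X_limited),
    ("bigram", X_bigram),
    ("char", X_char),
]:
    print(f"{name:8s} features: {X.shape[1]}")
```

Count and TF-IDF vectorization share the same vocabulary and differ only in the cell values, while the other settings change the number of feature columns.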

Models used:

Naive Bayes
Logistic Regression
K-Nearest Neighbors
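A minimal sketch of how the three model families above can be fit on bag-of-words features follows. I assume the multinomial variant of Naive Bayes, which is the usual choice for count data, and the hyperparameters (`alpha`, `C`, `n_neighbors`) are placeholder values rather than the project's tuned settings. The posts and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented training and test posts for two of the topics.
train_posts = [
    "the shuttle reached orbit around the moon",
    "nasa plans a new rocket launch",
    "the satellite orbit decays over time",
    "render the image with a faster graphics card",
    "the polygon mesh needs texture mapping",
    "graphics software for image rendering",
]
train_labels = ["space", "space", "space", "graphics", "graphics", "graphics"]
test_posts = ["rocket launch to the moon", "texture mapping for a polygon image"]

# Placeholder hyperparameters; the project's values may differ.
models = {
    "naive_bayes": MultinomialNB(alpha=1.0),
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

predictions = {}
for name, model in models.items():
    # Each pipeline vectorizes the raw text, then fits the classifier.
    pipeline = make_pipeline(CountVectorizer(), model)
    pipeline.fit(train_posts, train_labels)
    predictions[name] = pipeline.predict(test_posts)
    print(name, list(predictions[name]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training posts only, which avoids leaking test vocabulary into the features.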

Tools

scikit-learn
NumPy
Jupyter notebooks

Outcome

Explored feature engineering in the bag-of-words paradigm, including count vs. TF-IDF vectorization, word vs. character feature representations, and limiting the vocabulary to terms that occur in at least n documents. Experimented with text pre-processing. Increased the F1-score of a logistic regression model from a baseline of 0.70 to 0.77.
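The kind of comparison described above can be sketched as scoring the same logistic regression model under different feature representations. The helper `score_pipeline`, the weighted F1 averaging, and the toy data are all my assumptions. The snippet illustrates the procedure only and does not reproduce the 0.70 and 0.77 scores.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def score_pipeline(vectorizer, train_x, train_y, dev_x, dev_y):
    """Hypothetical helper: fit logistic regression on the given features
    and return the weighted F1-score on the development posts."""
    pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    pipeline.fit(train_x, train_y)
    predicted = pipeline.predict(dev_x)
    # Assumption: weighted averaging across classes.
    return f1_score(dev_y, predicted, average="weighted", zero_division=0)


# Invented posts standing in for the training and development splits.
train_x = [
    "the shuttle reached orbit around the moon",
    "nasa plans a new rocket launch",
    "the satellite orbit decays over time",
    "render the image with a faster graphics card",
    "the polygon mesh needs texture mapping",
    "graphics software for image rendering",
]
train_y = ["space", "space", "space", "graphics", "graphics", "graphics"]
dev_x = ["a rocket in orbit", "rendering a texture on the mesh"]
dev_y = ["space", "graphics"]

# Candidate feature representations to compare against the count baseline.
candidates = {
    "count_words": CountVectorizer(),
    "tfidf_words": TfidfVectorizer(),
    "tfidf_chars": TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
}

scores = {
    name: score_pipeline(vec, train_x, train_y, dev_x, dev_y)
    for name, vec in candidates.items()
}
for name, score in scores.items():
    print(f"{name:12s} F1 = {score:.2f}")
```

Holding the classifier fixed and varying only the vectorizer makes it easier to attribute a change in F1-score to the features rather than to the model.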

More Information

More information can be found at the following links:

GitHub Repository: https://github.com/nsylva/topic_classification