Newsgroup Topic Classification

An exercise in processing unstructured text data and engineering features from it

Description

The objective of this project was to experiment with handling unstructured text data for a classification task. The data comes from the 20 newsgroups dataset available through scikit-learn, which contains over 18,000 newsgroup posts across 20 topics. I focus on classifying posts from four of those newsgroups: atheism, religion, computer graphics, and space. Rather than using more sophisticated natural language processing techniques, I use a bag-of-words method to extract features from each post.
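A minimal sketch of how the four-category subset might be loaded, assuming scikit-learn's fetch_20newsgroups loader. The newsgroup names are an assumed mapping of the four topics onto the dataset's category labels, and the remove setting and the helper name load_four_newsgroups are illustrative, not confirmed choices from the project.

```python
from sklearn.datasets import fetch_20newsgroups

# Assumed mapping of the four topics onto 20 newsgroups category names.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_four_newsgroups(subset="train"):
    """Return (texts, labels, label_names) for the four assumed categories.

    The first call downloads the dataset, so it needs network access.
    """
    bunch = fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        # Assumption: strip metadata so models learn from the post body only.
        remove=("headers", "footers", "quotes"),
    )
    return bunch.data, bunch.target, bunch.target_names


# Example (left commented out because it triggers a download):
# train_texts, train_labels, label_names = load_four_newsgroups("train")
```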

Techniques

Feature engineering:

  • Bag-of-words
  • Count vectorization
  • TF-IDF vectorization
  • Vocabulary limitation/expansion
  • Stop word removal
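The techniques listed above can be sketched on a toy corpus. This is an illustration only: the four sentences are made up, and the stop word list and min_df threshold are assumed settings, not the values used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-ins for newsgroup posts.
docs = [
    "the rocket reached orbit",
    "the shuttle reached the station in orbit",
    "the renderer draws the image",
    "the image is drawn with polygons",
]

# Bag-of-words with raw term counts: one column per vocabulary word.
counts = CountVectorizer()
X_counts = counts.fit_transform(docs)

# Same vocabulary, but counts are reweighted so that words appearing
# in many documents (such as "the") contribute less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Vocabulary limitation: drop English stop words and keep only words
# that occur in at least two documents (assumed threshold).
filtered = CountVectorizer(stop_words="english", min_df=2)
X_filtered = filtered.fit_transform(docs)

print("full vocabulary size:", X_counts.shape[1])
print("filtered vocabulary:", sorted(filtered.vocabulary_))
```

Each vectorizer returns a sparse document-term matrix with one row per post, so the three variants can be swapped in front of the same classifier.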

Models used:

  • Naive Bayes
  • Logistic Regression
  • K-Nearest Neighbors
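As a rough sketch, the three model families can be fit on the same bag-of-words features through a scikit-learn pipeline. The toy posts, labels, and hyperparameters (max_iter, n_neighbors) are assumptions for illustration, not the project's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up training posts for two of the four topics.
train_docs = [
    "the rocket reached orbit around the moon",
    "the shuttle docked with the station in orbit",
    "a probe was launched toward mars",
    "the telescope observed a distant galaxy",
    "the renderer draws the image with polygons",
    "shading and texture mapping in the image",
    "ray tracing produces realistic rendered images",
    "the polygons are clipped before rendering",
]
train_labels = ["space"] * 4 + ["graphics"] * 4

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

new_docs = ["the probe entered orbit", "shading the rendered polygons"]
predictions = {}
for name, model in models.items():
    # Vectorizer and classifier are fit together on the training posts.
    pipeline = make_pipeline(CountVectorizer(), model)
    pipeline.fit(train_docs, train_labels)
    predictions[name] = list(pipeline.predict(new_docs))

for name, labels in predictions.items():
    print(name, labels)
```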

Tools

  • scikit-learn
  • NumPy
  • Jupyter notebooks

Outcome

Explored feature engineering in the bag-of-words paradigm: count vs. TF-IDF vectorization, word vs. character feature representations, limiting features to those that occur in n documents, and so on. Experimented with text pre-processing. Raised the F1-score of a logistic regression model from a baseline of 0.70 to 0.77.
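A comparison like the word vs. character one described above could be set up as follows. This is a sketch on made-up data: the n-gram range, the weighted F1 average, and the helper name dev_f1 are assumptions, and the scores it prints are unrelated to the 0.70 and 0.77 reported for the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Made-up training and development posts.
train_docs = [
    "the rocket reached orbit around the moon",
    "the shuttle docked with the station in orbit",
    "a probe was launched toward mars",
    "the renderer draws the image with polygons",
    "shading and texture mapping in the image",
    "ray tracing produces realistic rendered images",
]
train_labels = ["space"] * 3 + ["graphics"] * 3
dev_docs = [
    "the probe reached mars orbit",
    "the station launched a rocket",
    "rendering polygons with texture",
    "ray traced images and shading",
]
dev_labels = ["space", "space", "graphics", "graphics"]


def dev_f1(vectorizer):
    """Fit logistic regression on the given features and score the dev set."""
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(train_docs, train_labels)
    # Assumption: weighted average across classes.
    return f1_score(dev_labels, model.predict(dev_docs), average="weighted")


# Word-level features vs. character n-grams within word boundaries.
word_f1 = dev_f1(TfidfVectorizer(analyzer="word"))
char_f1 = dev_f1(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)))

print("word-level F1:", round(word_f1, 2))
print("char-level F1:", round(char_f1, 2))
```

Keeping the classifier fixed and changing only the vectorizer isolates the effect of each feature representation on the score.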

More Information

More information can be found at the following link:

GitHub Repository: https://github.com/nsylva/topic_classification