How to perform topic modeling using NMF?

Why is it useful?

Let's say you are given 1 million questions from Quora and asked to categorize them by question type. This is simply an application of topic modeling.

You can do this in two ways:

Latent Dirichlet Allocation (LDA)
Non-Negative Matrix Factorization (NMF)

Using NMF:

Import the data.
For NMF, we can use TfidfVectorizer.
Call fit_transform on the TfidfVectorizer with the data to get a document-term matrix.
Once the data is ready, import NMF from sklearn.
Instantiate the NMF model with the number of topics we want to model.
For example, we can ask for 20 topics out of these 1 million questions,
or for 5 topics out of these 1 million questions.
Then fit the NMF model on the TF-IDF vectorized data.
At this point the topic modeling is done.
We can check the top words that carry the most weight in each topic.
However, NMF does not name the topics; we have to name each topic ourselves based on its top words.
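As a rough reference for what the fit step computes, the factorization can be written as below. The symbols V, W, H, n, m and k are notation introduced here for illustration and do not appear in the steps above.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m},\;
W \in \mathbb{R}_{\ge 0}^{n \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times m}
```

Here V is the TF-IDF document-term matrix (n documents, m vocabulary terms) and k is the number of topics. In scikit-learn, H is exposed as the fitted model's components_ attribute (one row per topic), and W is what transform returns (one row per document).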

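import pandas as pd

# quora_data is the path to the CSV file; it must be defined beforehand
quora = pd.read_csv(quora_data)

from sklearn.feature_extraction.text import TfidfVectorizer

# ignore words that appear in more than 92% of the questions or in fewer than 2 questions
tfidf = TfidfVectorizer(max_df=0.92, min_df=2, stop_words='english')

dtm = tfidf.fit_transform(quora['question'])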

Here we have the document-term matrix; we are now going to use the NMF model.

from sklearn.decomposition import NMF

# the constructor argument is n_components (no trailing underscore)
nmf_model = NMF(n_components=5, random_state=42)

nmf_model.fit(dtm)
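The number of topics is a choice (5 or 20 in the steps above). One rough way to compare a few candidates is to look at the reconstruction_err_ attribute of each fitted model. The sketch below does this on a hypothetical toy corpus; a lower error does not by itself mean more meaningful topics, so treat it as a hint rather than a rule.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the Quora questions.
docs = [
    "how can I learn python programming",
    "best books to learn programming",
    "best way to lose weight",
    "how to lose weight with diet",
    "which laptop should I buy for programming",
    "best laptop to buy in india",
]
dtm = TfidfVectorizer(stop_words="english").fit_transform(docs)

errors = {}
for k in (2, 3, 4):
    model = NMF(n_components=k, random_state=42, max_iter=500)
    model.fit(dtm)
    # components_ has one row per topic and one column per vocabulary term
    print(k, model.components_.shape)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"n_components={k}: reconstruction error {err:.3f}")
```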

feature_names = tfidf.get_feature_names_out()

for index, topic in enumerate(nmf_model.components_):
    # argsort is ascending, so the last 20 indices are the highest-weighted words
    print('The top 20 words for the topic #', index)
    print([feature_names[i] for i in topic.argsort()[-20:]])
    print('\n')

Here we can see the words in this fashion (within each list, the highest-weighted word comes last):

The top 20 words for the topic # 0
['start', 'lose', 'buy', 'laptop', 'movie', 'learning', 'time', 'weight', '2016', 'ways', 'english', 'language', 'movies', 'programming', 'book', 'books', 'india', 'learn', 'way', 'best']


The top 20 words for the topic # 1
['looking', 'relationship', 'use', 'person', 'new', 'exist', 'compare', 'look', 'cost', 'really', 'girl', 'love', 'time', 'long', 'sex', 'work', 'feel', 'like', 'mean', 'does']

Now it's up to us to decide what these words represent and to name each topic accordingly.
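When naming topics, it can help to see each word's weight next to it, strongest first. The helper below is a small sketch; top_words is a hypothetical function name, and the components array and names list are made-up values standing in for nmf_model.components_ and tfidf.get_feature_names_out().

```python
import numpy as np

def top_words(components, feature_names, n_top=10):
    """Return, for each topic, the n_top (word, weight) pairs, strongest first."""
    result = []
    for topic in components:
        # argsort is ascending, so reverse it to put the largest weights first
        top_idx = topic.argsort()[::-1][:n_top]
        result.append([(str(feature_names[i]), float(topic[i])) for i in top_idx])
    return result

# Made-up weights: 2 topics over a 4-word vocabulary.
components = np.array([
    [0.1, 0.9, 0.0, 0.5],
    [0.7, 0.0, 0.2, 0.05],
])
names = ["book", "learn", "love", "best"]

result = top_words(components, names, n_top=2)
for index, pairs in enumerate(result):
    print("topic", index, pairs)
```

With a fitted model, the call would be top_words(nmf_model.components_, tfidf.get_feature_names_out(), n_top=20).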

You can also attach the topic to the quora DataFrame we created in the steps above:

topic_results = nmf_model.transform(dtm)

topic_results.argmax(axis=1)

quora['topic'] = topic_results.argmax(axis=1)
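Here is a minimal end-to-end sketch of the assignment step on a hypothetical toy DataFrame. The questions and the extra weight column are illustrative assumptions, not part of the original data.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the Quora CSV.
quora = pd.DataFrame({"question": [
    "how can I learn python programming",
    "best books to learn programming",
    "best way to lose weight",
    "how to lose weight with diet",
]})

tfidf = TfidfVectorizer(stop_words="english")
dtm = tfidf.fit_transform(quora["question"])

nmf_model = NMF(n_components=2, random_state=42, max_iter=500)
nmf_model.fit(dtm)

# One row per question, one column per topic; larger means a stronger match.
topic_results = nmf_model.transform(dtm)

quora["topic"] = topic_results.argmax(axis=1)   # index of the strongest topic
quora["weight"] = topic_results.max(axis=1)     # how strong that match is

print(quora)
print(quora["topic"].value_counts())
```

Keeping the weight alongside the topic makes it easier to spot questions that do not fit any topic well, since argmax always returns some topic even when every weight is close to zero.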

If we further want to put a subject name on each topic number (the names are up to us), we can define a dictionary like:

mapr = {0: 'politics', 1: 'sports', 2: 'technology', 3: 'stocks', 4: 'games'}

quora['subject'] = quora['topic'].map(mapr)
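One thing to watch for: map leaves a missing value (NaN) for any topic number that is not a key in the dictionary. The sketch below uses a hypothetical toy DataFrame and an intentionally incomplete dictionary to show this, and fills the gaps with a placeholder label ("unlabeled" is an arbitrary choice).

```python
import pandas as pd

# Hypothetical toy data: topic 4 has no entry in the dictionary below.
quora = pd.DataFrame({"topic": [0, 1, 4, 2]})
mapr = {0: "politics", 1: "sports", 2: "technology"}

quora["subject"] = quora["topic"].map(mapr)

# Count the topics that did not get a subject name.
missing = int(quora["subject"].isna().sum())
print("unmapped rows:", missing)

# Replace the missing values with a placeholder label.
quora["subject"] = quora["subject"].fillna("unlabeled")
print(quora)
```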