Not all the information we need to process is labeled, and to apply supervised learning someone has to categorise the data first. Take movie reviews: if they are not labelled, we don't even have a train/test split. So for data that are not categorised, we apply unsupervised learning. The tough part in unsupervised learning is evaluating our models, since we don't have predefined test cases to evaluate against.
A good use case of unsupervised learning is categorising newspaper articles just by providing the content of the news on the page, and asking the algorithm to predict the topic the article belongs to.
How does unsupervised learning use Latent Dirichlet Allocation (LDA)?
Choose some fixed number k of topics to discover.
For LDA to work, you as the user need to decide up front how many topics are going to be discovered.
LDA learns the topic representation of each document and the words associated with each topic.
We go through each document and randomly assign each word in the document to one of the k topics.
Keep in mind that this random first pass already gives you both topic representations of all the documents and word distributions of all the topics.
Because everything was assigned at random on this first pass, these initial topics won't make sense.
So we iterate over every word in every document to improve these topics.
For every word w in every document d, and for each topic t, we calculate:
p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t
p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this particular word w
Then we reassign word w to a new topic, choosing topic t with probability:
p(topic t | document d) * p(word w | topic t)
This is essentially the probability that topic t generated word w.
After repeating these steps a large number of times, we reach a roughly steady state where the topic assignments are acceptable. So at the end, each document ends up assigned to topics.
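The reassignment loop described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy corpus, the choice of k = 2, and the smoothing constant 0.1 (standing in for the Dirichlet priors a real LDA sampler would use) are all made up for the example.

```python
import random
from collections import Counter

# Toy corpus and a hypothetical choice of k = 2 topics.
docs = [["ball", "game", "team", "score"],
        ["vote", "election", "party", "vote"],
        ["game", "score", "election", "team"]]
k = 2
random.seed(0)

# First pass: randomly assign every word in every document to one of the k topics.
assignments = [[random.randrange(k) for _ in doc] for doc in docs]

def counts():
    """Rebuild doc-topic and topic-word counts from the current assignments."""
    doc_topic = [Counter(z) for z in assignments]
    topic_word = [Counter() for _ in range(k)]
    for doc, z in zip(docs, assignments):
        for w, t in zip(doc, z):
            topic_word[t][w] += 1
    return doc_topic, topic_word

# Iterate: reassign each word to a topic with probability proportional to
# p(topic t | document d) * p(word w | topic t).
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic, topic_word = counts()
            # Exclude the word's own current assignment before scoring.
            t_old = assignments[d][i]
            doc_topic[d][t_old] -= 1
            topic_word[t_old][w] -= 1
            weights = []
            for t in range(k):
                # Proportion of words in d assigned to t (smoothed to avoid zeros).
                p_t_given_d = (doc_topic[d][t] + 0.1) / (len(doc) - 1 + 0.1 * k)
                # Proportion of topic t's assignments that are word w (smoothed).
                total_t = sum(topic_word[t].values())
                p_w_given_t = (topic_word[t][w] + 0.1) / (total_t + 0.1)
                weights.append(p_t_given_d * p_w_given_t)
            assignments[d][i] = random.choices(range(k), weights=weights)[0]
```

Rebuilding the counts inside the inner loop keeps the sketch easy to read; a real sampler would update the count tables incrementally instead.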
Then we can search for the words that have the highest probability of being assigned to each topic, and use them to interpret what the topic is about.
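Concretely, the topic-word counts left behind by the sampling procedure described above tell us which words dominate each topic. A small sketch, using a hypothetical hand-made count table in place of real sampler output:

```python
from collections import Counter

# Hypothetical counts of how many times each word ended up assigned to each topic.
topic_word = [
    Counter({"game": 12, "team": 9, "score": 8, "election": 1}),
    Counter({"vote": 11, "election": 10, "party": 7, "game": 2}),
]

# The highest-count words under each topic are what we inspect to name the topic.
for t, word_counts in enumerate(topic_word):
    top = [w for w, _ in word_counts.most_common(3)]
    print(f"topic {t}: {top}")
```

Here a human would read topic 0 as "sports" and topic 1 as "politics"; LDA itself never names the topics.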
Two important notes:
The user must decide on the number of topics present in the documents before even beginning the process.