Latent Dirichlet Allocation
Blueprint for the LDA Machine
Dirichlet Allocation
In this distribution we have a parameter `alpha`. Depending on the value of `alpha` we get different distributions: small values concentrate the probability mass at the corners of the simplex (sparse mixtures), while large values pull it toward the centre (near-uniform mixtures).
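As a quick illustration, here is a minimal sketch using NumPy (the dimension and `alpha` values are arbitrary): small `alpha` yields spiky samples, large `alpha` yields near-uniform ones.

```python
# Sketch: how `alpha` shapes samples from a Dirichlet distribution.
import numpy as np

rng = np.random.default_rng(0)

for alpha in (0.1, 1.0, 10.0):
    # Symmetric Dirichlet over 5 components with concentration `alpha`.
    samples = rng.dirichlet([alpha] * 5, size=3)
    print(f"alpha={alpha}:")
    print(np.round(samples, 3))
```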
We consider two Dirichlet distributions: one that associates documents with topics, and another that associates topics with words.
These two Dirichlet distributions are the parameters of the blueprint. We generate different documents by moving the points we draw from these two distributions.
How LDA works
We generate a document by first drawing its topic mixture, then drawing a topic for each word position and a word from that topic. Because the process is random, the probability of generating exactly the same article as one in the training data is very low.
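A minimal sketch of this generative story, assuming a toy vocabulary and made-up hyperparameters:

```python
# Sketch: LDA's generative process on a toy vocabulary.
import numpy as np

rng = np.random.default_rng(42)

n_topics = 3
vocab = ["ball", "game", "vote", "party", "gene", "cell"]
alpha, beta = 0.5, 0.5  # illustrative concentration parameters

# Topic-word distributions: one Dirichlet draw per topic.
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

def generate_document(n_words=8):
    # Document-topic distribution: one Dirichlet draw per document.
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topics)       # pick a topic
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from it
        words.append(vocab[w])
    return " ".join(words)

print(generate_document())
```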
Training LDA
- The number of topics is a hyperparameter (see the sketch below)
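As a rough sketch of training in practice, scikit-learn's `LatentDirichletAllocation` (which fits the model with variational inference rather than Gibbs sampling) exposes the topic count as `n_components`; the corpus here is illustrative.

```python
# Sketch: fitting LDA with a chosen number of topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the game ended with a late goal",
    "the party won the vote by a margin",
    "the gene affects how the cell divides",
]

counts = CountVectorizer().fit_transform(docs)

# n_components is the number of topics: a hyperparameter we must choose.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)
print(doc_topic.round(2))  # per-document topic proportions
```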
Gibbs Sampling
In the context of LDA, Gibbs sampling repeatedly re-tags each word with a topic, nudging the assignments so that the words in a document become monochromatic (belonging to a single topic) and the document itself becomes monochromatic.
Maximizing the probability in the LDA equation directly is very difficult, so we use Gibbs sampling instead.
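Here is a minimal sketch of a collapsed Gibbs sampler for LDA on a toy corpus (word ids, hyperparameters, and sweep count are all made up): each sweep removes a token's current topic from the counts, computes the conditional topic distribution given every other assignment, and resamples.

```python
# Sketch: collapsed Gibbs sampling for LDA on a toy corpus.
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 5]]  # word ids per document
n_topics, n_words = 2, 6
alpha, beta = 0.1, 0.01

# Count tables: document-topic, topic-word, per-topic totals.
ndk = np.zeros((len(docs), n_topics))
nkw = np.zeros((n_topics, n_words))
nk = np.zeros(n_topics)

# Random initial topic assignment for every token.
z = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):  # sweeps over the corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this token's current assignment from the counts.
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Conditional over topics, holding all other assignments fixed.
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
            k = int(rng.choice(n_topics, p=p / p.sum()))
            # Record and re-add the new assignment.
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Estimated topic-word distributions after sampling.
print((nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True))
```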
Gibbs sampling explained by ChatGPT
Gibbs sampling is a statistical algorithm used to generate samples from a probability distribution that might be too complex to calculate directly. It is often used in Bayesian inference, where the goal is to estimate the unknown parameters of a model given some observed data.
The idea behind Gibbs sampling is to iteratively sample from the conditional distributions of each variable in the model, while holding all other variables fixed. This means that we generate a sample for one variable at a time, based on the values of the other variables in the model.
The process starts with some initial values for all the variables in the model. Then, in each iteration, we pick a variable (in turn or at random) and update its value based on the current values of the other variables. We repeat this over all the variables until we have generated enough samples.
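As a concrete toy instance (not from the notes above), take a bivariate normal with correlation `rho`, where both conditional distributions are known in closed form:

```python
# Sketch: Gibbs sampling from a bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8
x, y = 0.0, 0.0  # arbitrary initial values
samples = []

for _ in range(5000):
    # Sample each variable from its conditional, holding the other fixed:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples[500:])  # drop burn-in
print(np.corrcoef(samples.T)[0, 1])  # should be close to rho
```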
Exploring Further
- How the length of the document is treated by LDA