Latent Dirichlet Allocation

Blueprint for the LDA Machine

  • LDA Blueprint

  • Probability of a document

  • LDA blueprint annotated with what each component does
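
For reference, the "probability of a document" that the blueprint computes is the joint probability of the smoothed LDA model, where θ_d is a document's topic mixture, φ_k is topic k's word distribution, and z are the per-word topic assignments:

```latex
P(W, Z, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} P(\varphi_k \mid \beta)
    \prod_{d=1}^{D} \Big( P(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \varphi_{z_{d,n}}) \Big)
```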

The Dirichlet Distribution

  • This distribution has a parameter `alpha`. Depending on the value of `alpha` we get differently shaped distributions (a sampling sketch follows this list).

  • Dirichlet distributions based on alpha values

  • We consider two Dirichlet distributions: one associates documents with topics, the other associates topics with words.

  • Two Dirichlet distributions

  • These two Dirichlet distributions are the parameters of the blueprint; we generate different documents by adjusting the points drawn from them.

  • Two Dirichlet distributions as knobs in the blueprint
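
A minimal numpy sketch of both ideas above; the topic count, vocabulary size, and alpha values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small alpha concentrates mass on a few topics; large alpha gives
# near-uniform draws over topics.
for alpha in [0.1, 1.0, 10.0]:
    theta = rng.dirichlet(alpha * np.ones(3))  # mixture over 3 topics
    print(f"alpha={alpha:5.1f} -> {np.round(theta, 3)}")

# The two Dirichlet distributions LDA uses as knobs:
n_topics, vocab_size = 3, 8
doc_topics = rng.dirichlet(0.5 * np.ones(n_topics))            # document -> topics
topic_words = rng.dirichlet(0.1 * np.ones(vocab_size),
                            size=n_topics)                     # each topic -> words
```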

How LDA works

  • LDA in action

  • We generate documents by assigning topics to documents and topics to words. The probability of generating exactly the same article as one in the training data is very low (a generative sketch follows this list).

  • Generating documents by assigning topics to documents and assigning topics to words

  • Comparing the probability of the generated document with the ground truth
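
The generative story above can be written as a short sketch; the topic count, vocabulary size, and document length are made-up values, not a real corpus:

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics, vocab_size, doc_len = 3, 8, 20
alpha, beta = 0.5, 0.1

# One topic-word distribution per topic, each a Dirichlet draw.
topic_words = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)

def generate_document():
    # Draw this document's topic mixture, then a topic and a word per position.
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)               # topic for this word slot
        words.append(rng.choice(vocab_size, p=topic_words[z]))
    return words

print(generate_document())  # a list of word ids; matching a real article is unlikely
```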

Training LDA

  • The number of topics is a hyperparameter.
  • Try to assign topics to documents and topics to words in such a way that each is as monochromatic (belongs to a single category) as possible; a minimal training sketch follows.
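
As a hedged illustration of training, here is a sketch with scikit-learn's `LatentDirichletAllocation`; the four-document corpus and `n_components=2` are made-up choices, and note that scikit-learn fits LDA with variational inference rather than Gibbs sampling:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks rose as markets rallied",
    "investors bought shares and bonds",
]

# Bag-of-words counts, then fit LDA; the number of topics is the hyperparameter.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # rows: documents, columns: topic weights
print(doc_topics.round(2))               # each row should lean toward one topic
```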

Gibbs Sampling

  • In the context of LDA, Gibbs sampling tries to tag the words in a document, and the document itself, so that each is as monochromatic (belonging to a single category) as possible.

  • Gibbs Sampling

  • Ensuring all topics are considered in Gibbs sampling

  • Assigning topics to documents based on the topics assigned to words

  • Directly maximizing the probability in the LDA equation is very difficult, so we use Gibbs sampling instead (see the sketch below).
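
A minimal collapsed Gibbs sampler for LDA, showing the "resample one word's topic given every other assignment" loop; the toy corpus, K, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

docs = [[0, 1, 2, 1], [2, 3, 3, 0], [4, 5, 4, 5]]   # word ids per document
K, V = 2, 6                  # topics, vocabulary size
alpha, beta = 0.5, 0.1       # Dirichlet hyperparameters

# Random initial topic per word, plus the count tables Gibbs updates.
z = [[int(rng.integers(K)) for _ in d] for d in docs]
ndk = np.zeros((len(docs), K))   # document-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):             # Gibbs sweeps over every word position
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove this word's current assignment from the counts...
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # ...then resample its topic conditioned on all other assignments.
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(z)  # topic assignments tend to become monochromatic per document
```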

Gibbs Sampling explained by ChatGPT

  • Gibbs sampling is a statistical algorithm used to generate samples from a probability distribution that might be too complex to calculate directly. It is often used in Bayesian inference, where the goal is to estimate the unknown parameters of a model given some observed data.

  • The idea behind Gibbs sampling is to iteratively sample from the conditional distributions of each variable in the model, while holding all other variables fixed. This means that we generate a sample for one variable at a time, based on the values of the other variables in the model.

  • The process starts with some initial values for all the variables in the model. Then, for each iteration of the algorithm, we randomly select one of the variables and update its value based on the values of the other variables in the model. We keep doing this for all the variables until we have generated enough samples.
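
Outside of LDA, the same "one variable at a time" loop is easy to see on a bivariate normal, where both conditionals are known in closed form; rho = 0.8 and the iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                      # target: bivariate normal, unit variances
x, y = 0.0, 0.0                # arbitrary starting values
samples = []

for _ in range(10_000):
    # Each conditional is itself normal: x | y has mean rho * y and
    # variance 1 - rho^2 (symmetrically for y | x), so each update is one draw.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # empirical correlation, close to rho
```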

Exploring Further

  • How the length of the document is treated by LDA
