Digging into the academic papers describing how automated topic discovery using topic modeling works can be a laborious and confusing process. However, you don't have to be well-versed in probability theory and statistics to gain an intuition into the goals and processes of topic modeling.
We will outline below one of the most popular topic modeling frameworks, Latent Dirichlet Allocation (LDA), and its most common estimation algorithm, "Gibbs Sampling."
This tutorial will walk you through the process to give you a better understanding of what is actually happening and how topics are formed. Throughout we will use a small nine-document collection, and limit our goals to obtaining two topics to make the process easy to follow.
But before we proceed with the HOW, let's examine the WHY.
Topic modeling is the process of using a computer to automatically discover themes (topics) and their prevalence in a document collection (i.e., a "corpus"). Unlike "search," which requires you to specify what you are trying to find, topic modeling is the inverse, which we call "discovery."
Over the past decade or so there have been major advances in the field of topic modeling, stimulated by the seminal papers "Latent Dirichlet Allocation" (2003) by David Blei, Andrew Ng, and Michael Jordan and "Finding Scientific Topics" (2004) by Thomas Griffiths and Mark Steyvers.
From these papers we learned about Latent Dirichlet Allocation (LDA) as a "generative probabilistic model" that leverages Bayesian probability in order to discover hidden (i.e., "latent") topics. The "generative" framing means that, for our purpose of discovering latent topics, it is useful to assume our source documents were generated by mixtures of topics. The topics determine which terms (words) are included in each document.
Now this may seem counter-intuitive, but remember that LDA is a MODEL of how things work, and like all models, it is a useful approximation of reality.
So far we have only discussed the first assumption for LDA, that the words in documents are assumed to have been generated by topics. Let's now outline all of the assumptions before we dive into a simple example, starting with what we mean by the term "distribution":
1. Each document is a distribution over topics.
2. Each topic is a distribution over words.
3. Each topic in a document is assigned a probability from 0 to 1. (You can also think of this as a percentage from 0% to 100%.)
4. Each word in a topic is assigned a probability from 0 to 1.
5. The sum of the probabilities for all topics assigned to a document must equal 1 (i.e., 100%).
6. The sum of the probabilities for all words assigned to a topic must equal 1.
7. Neither the order of the documents nor the order of the words in each document is considered significant.
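A few lines of Python make the "distribution" idea concrete. The numbers below are invented purely for illustration, not estimated from any corpus:

```python
# Illustrative only: a document's topic distribution and a topic's word
# distribution. Per assumptions 5 and 6, each must sum to 1.
doc_topics = {"Topic 1": 0.25, "Topic 2": 0.75}            # one document
topic_words = {"bank": 0.40, "money": 0.35, "loan": 0.25}  # one topic

assert abs(sum(doc_topics.values()) - 1.0) < 1e-9
assert abs(sum(topic_words.values()) - 1.0) < 1e-9
```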
First, some important definitions:
Topic: Each topic's distribution (i.e., an "array") has a placeholder for each word in the vocabulary that contains the probability (from 0 to 1) that the word is associated with the topic. We usually initially identify (i.e., "name") a topic using the top 10 or so words with the highest probabilities. We later add (usually with human input) a shorter, more concise name.

Corpus: A document "collection"; the complete body of text (usually) to be examined.

Vocabulary: The list of all unique words, excluding "stop words" (see definition below), in a corpus.

Document: A grouping of logically related content. In database terminology we often use the term "record." Depending upon your focus and goals, documents for topic modeling can vary significantly in length: a tweet; an email; a paragraph; a page; an article.

Word: A distinct vocabulary or "dictionary" item. For non-text topic modeling this might be a code, symbol, or unique genetic sequence.

Stop word: An unwanted word, i.e., a commonly used word that does not carry significant meaning for what you are studying: is, and, the, that, but, can, etc. Custom stop word lists are often used for specific domains.

Token: A specific instance or occurrence of a word within a document.
We are using an imaginary collection of email content we call "River Bank vs. Money Bank." (We build upon the example in the excellent 2007 paper "Probabilistic Topic Models" by Mark Steyvers and Tom Griffiths.)
| Document # | Document Text |
|---|---|
| 1 | I live on the river. Close to the river bank. A stream feeds the river. |
| 2 | I had an old boat on the river. I used the boat on the river every day. The river is great. |
| 3 | A storm pulled the boat from the river bank. The boat sank in the river. |
| 4 | I needed money from a bank loan to get a new boat. I went to the bank. The bank was open. |
| 5 | I told the bank that I need a new boat. |
| 6 | I called the bank every month about the money for the loan. I needed the money. |
| 7 | I waited a year for the bank to give me the money for the loan. |
| 8 | The bank called about the money. The bank said the money for the boat was approved. |
| 9 | I now take the boat out on the river every day. I tie the boat tight on the bank. |
Let's take a visual approach to understanding the model of generating documents from topics that is fundamental to LDA topic modeling theory. Now that we understand what we mean by a distribution in this context, let's simulate the process of building a document using the "generative model" process. We are going to leverage the nine-document corpus used in the rest of this tutorial.
Let's look at generating Document 8:
The bank called about the money. The bank said the money for the boat was approved.
Now remove the stop words, capitalization and punctuation:
bank called about money bank said money boat approved
Or, as a list:
["bank", "called", "about", "money", "bank", "said", "money", "boat", "approved"]
But when we start we don't have these word tokens, we just know we have nine slots we need to fill:
["__", "__", "__", "__", "__", "__", "__", "__"]
Step 1. For each token we need to populate in the document, select a topic.
Remember, the generative model assumes that all of the words in a document came via the topics within the document. In our example we are assuming each document was created using two topics. The following is the assumed distribution of documents by topic:
| Document # | Topic 1 Probability | Topic 2 Probability |
|---|---|---|
Wait, HOW do we select a topic?
Here comes the probabilistic part. We will be using a random number generator that will provide us with a number between 1 and 100 each time we need one:
How do we use this number to select a topic?
It is helpful to change the individual probabilities into equivalent relative proportions of numbers:
| Topic # | Start Number | End Number |
|---|---|---|
Choose a random number to select a topic
So let's say we watch the random number generator, and the next number that flips up is 87. We look at our table and see that Topic 2 covers the numbers from 63 to 100, so we choose Topic 2.
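This step can be sketched in Python: convert the topic probabilities into cumulative ranges on 1 to 100 and map the roll onto a topic. The 62/38 split mirrors the walkthrough, where Topic 2 covers 63 through 100:

```python
# Sketch: select a topic by mapping a roll from 1-100 onto cumulative ranges.
def pick_topic(roll, topic_probs):
    """Map a roll from 1 to 100 onto a topic via cumulative ranges."""
    cumulative = 0.0
    for topic, prob in sorted(topic_probs.items()):
        cumulative += prob * 100
        if roll <= cumulative:
            return topic
    return topic  # guard against floating-point rounding at the top of the range

print(pick_topic(87, {1: 0.62, 2: 0.38}))  # → 2, as in the walkthrough
```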
Step 2. Choose a word from the selected topic.
We previously selected a random number and came up with Topic 2. Now we need to look at the word probability distribution for Topic 2:
For comparison, let's look at the same chart for Topic 1:
Back to Topic 2. Let's create a table with the numbers in the chart:
| Word # | Word | Start # | End # | Topic 2 Probability |
|---|---|---|---|---|
Choose another random number to obtain a word.
We watch the random number generator again, and this time the number 12 flips up. We examine our table and see that this fits the range of the word bank.
We now have our first entry in the list of tokens to build our Document #8:
["bank"," "," "," "," "," "," "," "]
Continue this process until done.
We need to repeat this same process for eight more iterations to derive our simulated document. Then we need to do the same for all the other documents.
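The whole generative loop can be sketched as follows. The topic and word probabilities are invented for illustration (they are assumptions, not values from the tutorial's tables):

```python
import random

# Illustrative distributions: one document's topic mixture and each topic's
# word probabilities. These numbers are assumptions for the sketch.
doc_topic_probs = {1: 0.62, 2: 0.38}
topic_word_probs = {
    1: {"river": 0.4, "boat": 0.3, "bank": 0.2, "stream": 0.1},
    2: {"bank": 0.35, "money": 0.3, "loan": 0.2, "called": 0.15},
}

def generate_document(n_tokens, rng):
    """For each slot: pick a topic, then pick a word from that topic."""
    tokens = []
    for _ in range(n_tokens):
        topic = rng.choices(list(doc_topic_probs),
                            weights=list(doc_topic_probs.values()))[0]
        words = topic_word_probs[topic]
        tokens.append(rng.choices(list(words), weights=list(words.values()))[0])
    return tokens

print(generate_document(9, random.Random(42)))
```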
You have now experienced, visually, the "generative" process assumptions underlying the LDA topic model. By walking through it in detail you can also surmise some of its limitations. It is not perfect, but it turns out to be very useful.
OK, now I think I understand LDA, but what is this Gibbs Sampling thing?
In our example above we have illustrated the assumptions underlying LDA. What we haven't yet addressed is that an exact solution to the Bayesian probability formula requires calculations on the order of T^n, where T is the number of topics desired and n is the number of word tokens in the corpus. In our contrived example we would "only" need on the order of 2^69 calculations for an exact answer. What if you wanted to derive 10 topics and there are 1,000 tokens? Trying to solve this exactly would take on the order of 10^1000 calculations!
Luckily very clever people have figured out ways to do very good estimations, so you don't have to wait until after the end of the universe to get an answer. One of the most commonly used and fast estimation tools is called Gibbs Sampling.
Topic modeling with Gibbs Sampling happens in three steps
The three steps are:

1. Remove "stop words" from each document; only the remaining words are used in the next steps.
2. Randomly assign each remaining token to a topic (Topic 1 or Topic 2).
3. Reassign each token to a topic, one at a time, by calculating what are called "topic weights"; continue for the number of iterations requested.

Here are four "raw" sample documents from our corpus:
I live on the river. Close to the river bank. A stream feeds the river.
A storm pulled the boat from the river bank. The boat sank in the river.
I needed money from a bank loan to get a new boat. I went to the bank. The bank was open.
I told the bank that I need a new boat.
Step 3 above referred to "topic weights"...
Before we look at the calculation, we have to understand how topic weights are used. During each iteration, every token is reassigned to a topic using the following method:
1. Remove the current topic assignment of the token.
2. Calculate a topic weight for each possible topic.
3. Assign the token to a topic. The token is not simply assigned to the topic with the largest weight; instead, we generate a random number to choose the topic, where the weights determine the probability that each topic will be chosen.
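Step 3 can be sketched in Python: the topic is drawn with probability proportional to its weight, not chosen greedily. The weight values here are illustrative:

```python
import random

# Sketch of the reassignment rule: draw a topic with probability
# proportional to its weight (weights need not sum to 1).
def sample_topic(weights, rng):
    total = sum(weights.values())
    roll = rng.uniform(0, total)
    cumulative = 0.0
    for topic in sorted(weights):
        cumulative += weights[topic]
        if roll <= cumulative:
            return topic
    return max(weights)  # floating-point guard

counts = {1: 0, 2: 0}
rng = random.Random(7)
for _ in range(10_000):
    counts[sample_topic({1: 3.0, 2: 1.0}, rng)] += 1
print(counts)  # roughly 3:1 in favor of Topic 1
```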
For every iteration of the algorithm, the changing topic weight determines how likely it is for each token to be assigned to each topic. We are going to take a close look at the topic weight formula to understand how tokens are iteratively assigned into meaningful topics.
During every iteration, every token has a new topic weight calculated for each possible topic using the following formula:
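The original presents the formula as an image; what follows is the standard collapsed Gibbs sampling form from Griffiths and Steyvers (2004), which matches the description in this section:

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,\,j} + \beta}{n^{(\cdot)}_{-i,\,j} + W\beta}
\;\cdot\;
\frac{n^{(d_i)}_{-i,\,j} + \alpha}{n^{(d_i)}_{-i,\,\cdot} + T\alpha}
```

Here z_i is the topic of token i; n^{(w_i)}_{-i,j} counts the other tokens of word w_i assigned to topic j; n^{(·)}_{-i,j} counts all tokens assigned to topic j; n^{(d_i)}_{-i,j} counts tokens in document d_i assigned to topic j; W is the vocabulary size; T is the number of topics; and α and β are the hyperparameters. All counts exclude token i itself. The final denominator is the same for every topic, so it is often dropped.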
The equation includes a few constants, called hyperparameters.
Hyperparameters are small values that prevent the calculated weight from ever being zero, ensuring there is always at least a small chance that any token could be assigned to any topic.
For now, if we ignore the hyperparameters, the topic weight calculation is simply:
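Reconstructed here (the original renders it as an image), the simplified weight for topic j is:

```latex
\text{weight}(j) \;=\;
\frac{n^{(w_i)}_{-i,\,j}}{n^{(\cdot)}_{-i,\,j}}
\times n^{(d_i)}_{-i,\,j}
```

where n^{(w_i)}_{-i,j} is the number of other tokens of the same word assigned to topic j, n^{(·)}_{-i,j} is the total number of tokens assigned to topic j, and n^{(d_i)}_{-i,j} is the number of tokens in the current document assigned to topic j.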
So, it's actually a very simple calculation. All we do is:
1. Count the number of other instances of that word currently assigned to the topic.
2. Divide by the total number of tokens assigned to the topic.
3. Multiply by the number of tokens assigned to the topic in the current token's document.
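The three steps above can be sketched as a small Python function. The count dictionaries and the example values are hypothetical, and are assumed to already exclude the token being reassigned:

```python
# A sketch of the simplified topic weight (hyperparameters omitted).
def topic_weight(word, doc_id, topic, word_topic, topic_totals, doc_topic):
    same_word = word_topic.get((word, topic), 0)  # step 1: other tokens of this word in the topic
    total = topic_totals.get(topic, 0)            # step 2: all tokens in the topic
    in_doc = doc_topic.get((doc_id, topic), 0)    # step 3: the topic's tokens in this document
    # Multiply before dividing so the example values stay exact.
    return same_word * in_doc / total if total else 0.0

# Hypothetical counts while reassigning a "bank" token in Document 8:
word_topic = {("bank", 1): 2, ("bank", 2): 5}
topic_totals = {1: 10, 2: 20}
doc_topic = {(8, 1): 3, (8, 2): 6}
print(topic_weight("bank", 8, 1, word_topic, topic_totals, doc_topic))  # → 0.6
print(topic_weight("bank", 8, 2, word_topic, topic_totals, doc_topic))  # → 1.5
```

With these counts, Topic 2 would be chosen more often than Topic 1, but not always.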
If we look closely at this formula, we can see how tokens are sorted into increasingly specific topics.
This formula establishes two goals:
1. Tokens of the same word should be assigned to the same topic.
2. Tokens in the same document should be assigned to the same topic.
There is a tension between these two goals because documents are not just made up of repetitions of the same word. So, as we repeatedly apply this formula, words that frequently occur in the same documents (in our example corpus, "river" and "boat," as opposed to "river" and "money") will gradually be assigned to the same topics.
Using the chart below, toggle to see a "force-directed" layout that pulls tokens of the same word and tokens in the same document together.
This reveals the hidden structure of co-occurring words that sampling finds.
With a small enough corpus, we can watch the sampling process happen. You can examine the topic weight calculations and watch as individual tokens are reassigned into meaningful topics.
a. Below, we've saved the result of 10 iterations of Gibbs Sampling. (Iteration 0 is just the random assignment.)
b. You can step through by individual token or by iteration. Selecting an iteration skips to the end of that iteration.
c. The topic lists are just the most common words assigned to each topic. The top six words in each topic at any point during the iterations are displayed.
d. You can toggle "Show Weights" to view each token's current topic weight calculation. Raw topic weight calculations can vary greatly, so the displayed weights have been converted to probabilities by dividing each one by the sum of both. Additionally, the algorithm only calculates topic weights for each token once per iteration; however, we've displayed what the calculation would be at any point in time. This lets you see the effect that reassigning one token has on the weight calculations of the rest of the corpus. (Note: a token's own weights do not change when it is reassigned, because each token is subtracted from the counts for its own weight calculations.)
e. Notice the purpose of hyperparameters. In iteration 4, steps 49-50, all tokens of "money" were at that point assigned to Topic 2. However, because of the hyperparameters in the topic weight equation, there was still a small chance for this token to be assigned to Topic 1. The reassignment of this token greatly influenced the chance of other "money" tokens switching to Topic 1, because many of them were also in documents that had a moderate preference for Topic 1.
f. Beneath the "Interactive Gibbs Sampling Visualization" we've included charts of the most common word-to-topic assignments, and measures of how specific the assignments are during each iteration.