Automated Topic Discovery: A Tutorial
An approachable explanation of how Topic Modeling works

Introduction

Digging into the academic papers that describe how automated topic discovery with topic modeling works can be laborious and confusing. However, you don't have to be well-versed in probability theory and statistics to gain an intuition for the goals and processes of topic modeling.

We will outline below one of the most popular topic modeling frameworks, Latent Dirichlet Allocation (LDA), and its most common estimation algorithm, "Gibbs Sampling."

This tutorial will walk you through the process to give you a better understanding of what is actually happening and how topics are formed. Throughout, we will use a small nine-document collection and limit our goal to obtaining two topics, to make the process easy to follow.

But before we proceed with the HOW, let's examine the WHY.

The Problem

Let's ask ourselves how we would approach each of the following:

  • You are a Senate staffer and have been handed a 1,000 page bill without an index. How can you find the major themes and potential issues before the floor vote tomorrow?
  • A client of yours has a website with tremendously rich and diverse content that has been accumulating for over a decade. There is a search function, but no index. How can you most quickly build a useful index that provides insight into the diverse content available, and that can be updated automatically when needed?
  • You have been given the responsibility to review the funding for 10,000 research projects by a major foundation. For each project you have the title, summary text, and funding dollars. You need to estimate the total funding by topic, not by project.
  • You now have the gene sequencing from 10,000 patients, half with a key disease, and half without. Which genetic clusters might be related to this disease?
  • You have been handed a computer from a suspected terrorist and you want to quickly assess which themes, locations, and co-conspirators might be found.

The Solution

Topic modeling is the process of using a computer to automatically discover themes (topics) and their prevalence in a document collection (i.e., "Corpus"). Unlike "search," which requires you to specify what you are trying to find, topic modeling is the inverse, which we call "discovery."

Over the past decade or so there have been major advances in the field of topic modeling, stimulated by the seminal papers "Latent Dirichlet Allocation" (2003) by David Blei, Andrew Ng, and Michael Jordan and "Finding Scientific Topics" (2004) by Thomas Griffiths and Mark Steyvers.

From these papers we learned about Latent Dirichlet Allocation (LDA) as a "generative probabilistic model" that leverages Bayesian probability in order to discover hidden (i.e., "latent") topics. "Generative" means that, for our purpose of discovering latent topics, it is useful to assume that our source documents were generated by mixtures of topics. The topics determine which terms (words) are included in each document.

"Topics are the gatekeepers for the words in a document."

Now this may seem counter-intuitive, but remember that LDA is a MODEL of how things work, and like all models, it is a useful approximation of reality.

So far we have only discussed the first assumption for LDA, that the words in documents are assumed to have been generated by topics. Let's now outline all of the assumptions before we dive into a simple example. But first let's have an important discussion about the term "distribution":

Distribution. There is nothing magical about the term "distribution," but people often get a deer-in-the-headlights look when the term "probability distribution" is mentioned. (In the context of topic modeling we are focused on "discrete" distributions, meaning that each possible item can be listed.) So, to build a distribution all we need to do is assign a probability to each item in the list. (Implicit in that statement is the requirement that the sum of the individual probabilities for all possible outcomes must equal 1.)

For example, when you roll a single die, each of the six values (1 to 6) has probability 1/6. The probability distribution would look like the following (rounded to three decimal digits):

[.167, .167, .167, .167, .167, .167]

In topic modeling there are two key distributions we create:

Within a document: for each topic, the probability that the topic is relevant to the document.

Within a topic: for each word in the vocabulary, the probability that the word is relevant to the topic.
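To make this concrete, here is a minimal Python sketch (standard library only) of both ideas: a discrete distribution is just a probability assigned to each item, summing to 1, and sampling draws items in proportion to those probabilities. The document-topic values used here are hypothetical.

```python
import random

# A discrete distribution: a probability for each possible item, summing to 1.
die = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}    # the die example above
doc_topics = {"Topic 1": 0.62, "Topic 2": 0.38}            # hypothetical document-topic distribution

for dist in (die, doc_topics):
    assert abs(sum(dist.values()) - 1.0) < 1e-9            # the probabilities must sum to 1

# Sampling from a distribution draws items in proportion to their probabilities.
roll = random.choices(list(die), weights=list(die.values()))[0]
topic = random.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
print(roll, topic)
```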

The assumptions:

1. Each document is a distribution over topics.

2. Each topic is a distribution over words.

3. Each topic in a document is assigned a probability from 0 to 1. (You can also think of this as a percentage from 0% to 100%.)

4. Each word in a topic is assigned a probability from 0 to 1.

5. The sum of the probabilities for all topics assigned to a document must equal 1 (i.e., 100%).

6. The sum of the probabilities for all words assigned to a topic must equal 1.

7. Neither the order of the documents nor the order of the words in each document is considered significant.

How to generate a document

First some important definitions

Topic Modeling is a method for identifying (1) topics and their (2) distributions across the (3) documents in a (4) corpus.

diagram of topic modeling

Topic

A collection of words that occur frequently together. The formal definition for a topic is a distribution over a fixed vocabulary (Blei, 2012).

Each topic's distribution (i.e., an "array") has a placeholder for each word in the vocabulary that contains the probability (from 0 to 1) that the word is associated with the topic.

We usually initially identify (i.e., "name") a topic using the top 10 or so words with the highest probabilities. We later add (usually with human input) a shorter, more concise name.

Corpus

A document "collection": the complete body of text (usually) to be examined.

Vocabulary

The list of all unique words, excluding "stop words" (see definition below), in a corpus.

Document

A grouping of logically related content. In database terminology we often use the term "record." Depending upon your focus and goals, documents for topic modeling can significantly vary in length: a tweet; an email; a paragraph; a page; an article.

Word

A distinct vocabulary or "dictionary" item. For non-text topic modeling this might be a code, symbol, or unique genetic sequence.

Stop Word

An unwanted word, i.e., a commonly used word that does not have significant meaning for what you are studying: is, and, the, that, but, can, etc. Custom stop word lists are often used for specific domains.

Token

A specific instance or occurrence of a word within a document.

Example: The phrase, "duck duck goose" has three tokens, but only two words. There are two tokens of the word "duck."
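To make the distinction concrete, here is a tiny Python sketch, assuming simple whitespace tokenization:

```python
from collections import Counter

# Every occurrence is a token; the unique items form the vocabulary (the "words").
phrase = "duck duck goose"
tokens = phrase.split()               # ['duck', 'duck', 'goose'] -> 3 tokens
vocabulary = set(tokens)              # {'duck', 'goose'}         -> 2 words
print(len(tokens), len(vocabulary), Counter(tokens))
```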

Our document corpus

We are using an imaginary collection of email content we call "River Bank vs. Money Bank." (We build upon the example in the excellent 2007 paper "Probabilistic Topic Models" by Mark Steyvers and Tom Griffiths.)

Document # Document Text
1 I live on the river. Close to the river bank. A stream feeds the river
2 I had an old boat on the river. I used the boat on the river every day. The river is great.
3 A storm pulled the boat from the river bank. The boat sank in the river.
4 I needed money from a bank loan to get a new boat. I went to the bank. The bank was open.
5 I told the bank that I need a new boat.
6 I called the bank every month about the money for the loan. I needed the money.
7 I waited a year for the bank to give me the money for the loan.
8 The bank called about the money. The bank said the money for the boat was approved.
9 I now take the boat out on the river every day. I tie the boat tight on the bank.
Word cloud for the corpus

Generating our document corpus

Let's take a visual approach to understanding the model of generating documents from topics that is fundamental to LDA topic modeling theory. Now that we understand what we mean by a distribution in the context of topic modeling, let's simulate the process of building a document using the "generative model" process. We are going to leverage the nine-document corpus used in the rest of this tutorial.

Let's look at generating Document 8:

The bank called about the money. The bank said the money for the boat was approved.

Now remove the stop words, capitalization and punctuation:

bank called about money bank said money boat approved

Or, as a list:

["bank", "called", "about", "money", "bank", "said", "money", "boat", "approved"]

But when we start we don't have these word tokens, we just know we have nine slots we need to fill:

["__", "__", "__", "__", "__", "__", "__", "__"]

Step 1. For each token we need to populate in the document, select a topic.

Remember, the generative model assumes that all of the words in a document came via the topics within the document. In our example we are assuming each document was created using two topics. The following table shows the assumed topic distribution for each document:

Document # Topic 1 Probability Topic 2 Probability
1 0.12 0.88
2 0.33 0.67
3 0.62 0.38
4 1.00 0.00
5 1.00 0.00
6 0.88 0.12
7 1.00 0.00
8 0.89 0.11
9 0.33 0.67

Wait, HOW do we select a topic?

Here comes the probabilistic part. We will be using a random number generator that will provide us with a number between 1 and 100 each time we need one:

Our Random Number Generator

How do we use this number to select a topic?

It is helpful to convert the individual probabilities into equivalent ranges of numbers:

Topic # Start Number End Number
1 1 62
2 63 100

Choose a random number to select a topic

So let's say we watch the random number generator, and the next number that flips up is 87. We look at our table and see that Topic 2 covers the numbers from 63 to 100, so we choose Topic 2.
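In code, this range method might look like the following minimal sketch, using the 0.62/0.38 split from the table above:

```python
import random

# A minimal sketch of the range method shown above: convert the topic
# probabilities (0.62 and 0.38) into the ranges 1-62 and 63-100, then
# pick whichever range a random number from 1 to 100 falls into.
topic_probs = [("Topic 1", 0.62), ("Topic 2", 0.38)]

draw = random.randint(1, 100)            # e.g., 87
upper_bound = 0
for topic, prob in topic_probs:
    upper_bound += round(prob * 100)     # 62 after Topic 1, then 100 after Topic 2
    if draw <= upper_bound:
        print(draw, "->", topic)         # a draw of 87 falls in 63-100, so Topic 2
        break
```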

Step 2. Choose a word from the selected topic.

We previously selected a random number and came up with Topic 2. Now we need to look at the word probability distribution for Topic 2:

A chart of the word probability distribution for Topic 2.


For comparison, let's look at the same chart for Topic 1:

A chart of the word probability distribution for Topic 1.


Back to Topic 2. Let's create a table with the numbers in the chart:

All non-zero topic word distribution items for Topic 2.

Word # Word Start # End # Topic 2 Probability
1 bank 1 27 0.27
2 money 28 42 0.15
3 boat 43 52 0.10
4 loan 53 59 0.07
5 needed 60 64 0.05
6 called 65 69 0.05
7 about 70 74 0.05
8 close 75 77 0.03
9 feeds 78 80 0.03
10 get 81 83 0.03
11 went 84 86 0.03
12 open 87 88 0.02
13 told 89 90 0.02
14 month 91 92 0.02
15 waited 93 94 0.02
16 give 95 96 0.02
17 said 97 98 0.02
18 approved 99 100 0.02

Choose another random number to obtain a word.

We watch the random number generator again, and this time the number 12 flips up. We examine our table and see that this fits the range of the word bank.

We now have our first entry in the list of tokens to build our Document #8:

["bank"," "," "," "," "," "," "," "]

Continue this process until done.

We need to repeat this same process for eight more iterations to complete our simulated document. Then we need to do the same for all the other documents.
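Putting the two steps together, here is a minimal sketch of the whole generative loop. The topic split matches the ranges used above; the Topic 2 word probabilities are a truncated subset of the table (random.choices renormalizes the weights), and the Topic 1 values are hypothetical stand-ins, since only a chart is shown for that topic:

```python
import random

# A sketch of the two-step generative process for one simulated document.
doc_topic_probs = [0.62, 0.38]   # the same topic split used in the range table above

topic_word_probs = [
    {"river": 0.30, "boat": 0.20, "bank": 0.15, "stream": 0.10, "storm": 0.10},  # Topic 1 (hypothetical values)
    {"bank": 0.27, "money": 0.15, "boat": 0.10, "loan": 0.07, "called": 0.05},   # Topic 2 (truncated from the table)
]

def generate_document(num_tokens):
    tokens = []
    for _ in range(num_tokens):
        # Step 1: choose a topic according to the document's topic distribution.
        topic = random.choices([0, 1], weights=doc_topic_probs)[0]
        # Step 2: choose a word according to that topic's word distribution.
        words = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        tokens.append(random.choices(words, weights=weights)[0])
    return tokens

print(generate_document(9))   # e.g., ['bank', 'river', 'money', ...]
```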

You have now experienced, visually, the "generative" process assumptions underlying the LDA topic model. By walking through it in detail you can also surmise some of its limitations. It is not perfect, but it turns out to be very useful.

OK, now I think I understand LDA, but what is this Gibbs Sampling thing?

In our example above we have illustrated the assumptions underlying LDA. What we haven't yet addressed is that if we want an exact solution to the Bayesian probability formula, it will require calculations on the order of T^n, where T is the number of topics desired, and n is the number of word tokens in the corpus. In our contrived example we would "only" need 2^69 calculations for an exact answer. What if you wanted to derive 10 topics and there are 1,000 tokens? Trying to solve this exactly would take on the order of 10^1000 calculations!

Luckily very clever people have figured out ways to do very good estimations, so you don't have to wait until after the end of the universe to get an answer. One of the most commonly used and fast estimation tools is called Gibbs Sampling.
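If you would like to try LDA on this corpus without writing your own sampler, off-the-shelf implementations exist. Here is a minimal sketch using scikit-learn (assuming a recent version); note that scikit-learn estimates LDA with variational inference rather than Gibbs Sampling, but the underlying model is the same:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I live on the river. Close to the river bank. A stream feeds the river",
    "I had an old boat on the river. I used the boat on the river every day. The river is great.",
    "A storm pulled the boat from the river bank. The boat sank in the river.",
    "I needed money from a bank loan to get a new boat. I went to the bank. The bank was open.",
    "I told the bank that I need a new boat.",
    "I called the bank every month about the money for the loan. I needed the money.",
    "I waited a year for the bank to give me the money for the loan.",
    "The bank called about the money. The bank said the money for the boat was approved.",
    "I now take the boat out on the river every day. I tie the boat tight on the bank.",
]

# Count word occurrences, dropping a built-in list of common English stop words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic distributions

# Show the top words for each topic.
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:6]]
    print(f"Topic {t + 1}:", ", ".join(top))
```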

Here's how Gibbs Sampling works

The Process

Topic modeling with Gibbs Sampling happens in three steps
(Press the button below to walk through the steps.)

1. The Raw Documents

Below are four "raw" sample documents.
In the next image you will see the removal of "stop words."

2. The Documents with "Stop Words" Removed

Only the words in bold will be used for the next steps.
Stop words are ignored.

3. The Initial Random Assignment of Topics

Each token is first assigned to a topic (Topic 1 or Topic 2) at random. The algorithm then reassigns each token, one at a time, by calculating what are called "topic weights." These reassignments continue for the number of iterations requested.

I live on the river. Close to the river bank. A stream feeds the river.

A storm pulled the boat from the river bank. The boat sank in the river.

I needed money from a bank loan to get a new boat. I went to the bank. The bank was open.

I told the bank that I need a new boat.


When you got to Step 3, there was a reference to "Topic Weights"...

What is a Topic Weight?

Before we look at the calculation, we have to understand how topic weights are used. During each iteration, every token is reassigned to a topic using the following method:

1. Remove the current topic assignment of the token.

2. Calculate a topic weight for each possible topic.

3. The token is not simply assigned to the topic with the largest weight. Instead, we generate a random number to choose the topic, where the weights determine the probability that each topic will be chosen.

You can imagine this as choosing a marble out of a jar, with each topic weight represented by the proportion of marbles for that topic. In the illustration below there are 24 marbles, 16 of which are green and 8 of which are blue. Topic 1 therefore has a weight of 16/24 (2/3) and Topic 2 has a weight of 8/24 (1/3).

Illustration: topic weights as proportions of marbles in a jar

The Topic Weight Calculation

For every iteration of the algorithm, the changing topic weight determines how likely it is for each token to be assigned to each topic. We are going to take a close look at the topic weight formula to understand how tokens are iteratively assigned into meaningful topics.

During every iteration, every token has a new topic weight calculated for each possible topic using the following formula:

topic weight equation

The equation includes a few constants, called hyperparameters.

hyperparameters highlighted

Hyperparameters are small values that prevent the calculated weight from ever being zero, ensuring there is always at least a small chance that any token could be assigned to any topic.

We won't distract you with the fine details here. See the recommended resources page on this site for some excellent discussions.
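For reference, the pictured equation is a form of the weight used in standard collapsed Gibbs sampling for LDA. Reconstructed from the descriptions above and below, with our own symbol names, it looks roughly like this:

```latex
% Weight for assigning token i (an instance of word w, in document d) to topic t.
% Symbols (our own labels, not from the tutorial):
%   n_{w,t} : count of other tokens of word w currently assigned to topic t
%   n_t     : total count of tokens currently assigned to topic t
%   n_{d,t} : count of tokens in document d currently assigned to topic t
%   V       : vocabulary size;   alpha, beta : the hyperparameters
\[
  P(z_i = t \mid z_{-i}, w) \;\propto\;
  \frac{n_{w,t} + \beta}{n_t + V\beta} \times \bigl(n_{d,t} + \alpha\bigr)
\]
```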

For now, if we ignore the hyperparameters, the topic weight calculation is simply:

simplified topic weight equation

So, it's actually a very simple calculation. All we do is:

1. Count the number of other instances of that word currently assigned to the topic.

2. Divide by the total number of tokens assigned to the topic.

3. Multiply by the number of tokens assigned to the topic in the current token's document.
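In code, this simplified weight and the marble-jar reassignment described earlier might look like the following minimal sketch; the count names are our own, not from any particular library:

```python
import random

# A minimal sketch of reassigning one token, using the simplified weight above.
# In the full calculation the hyperparameters keep these weights from ever being zero.

def topic_weight(word_in_topic, topic_total, doc_topic):
    """(other instances of this word in the topic / tokens in the topic) * (tokens of this document in the topic)"""
    if topic_total == 0:
        return 0.0
    return (word_in_topic / topic_total) * doc_topic

def reassign(weights_by_topic):
    """Pick a topic at random, with probability proportional to its weight (the jar of marbles)."""
    topics = list(weights_by_topic)
    return random.choices(topics, weights=[weights_by_topic[t] for t in topics])[0]

# Hypothetical counts for a token of "bank" in a document with 5 tokens on Topic 1 and 3 on Topic 2.
weights = {
    "Topic 1": topic_weight(word_in_topic=2, topic_total=20, doc_topic=5),
    "Topic 2": topic_weight(word_in_topic=9, topic_total=25, doc_topic=3),
}
print(weights, "->", reassign(weights))
```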

If we look closely at this formula, we can see how tokens are sorted into increasingly specific topics.

A topic weight will be larger (meaning the token is more likely to be assigned to that topic) if:

  • The left side of the equation is larger: the topic is made up of a higher proportion of this word.
  • The right side of the equation is larger: the current document has a larger number of tokens assigned to the topic.

And, accordingly, when a token changes to another topic:

  • The left side of the equation gets larger for that topic: other tokens of that word are more likely to be assigned to that topic.
  • The right side of the equation gets larger for that topic: other tokens in that document are more likely to be assigned to that topic.
  • The left side of the equation gets smaller for other words: tokens not of that word are a little less likely to be assigned to that topic.

This formula establishes two goals:

1. Tokens of the same word should be assigned to the same topic.

2. Tokens in the same document should be assigned to the same topic.

There is a tension between these two goals because documents are not just made up of repetitions of the same word. So, as we repeatedly apply this formula, words that frequently occur in the same documents (in our example corpus, "river" and "boat," as opposed to "river" and "money") will gradually be assigned to the same topics.

If you're having trouble visualizing why this can eventually stabilize, looking at the underlying structure of co-occurring words can help.

Below, our example documents are displayed with their stop words removed.

Connections (lines) have been drawn between tokens of the same word (solid purple lines) and between tokens in the same document (dotted grey lines).

(The tokens are colored by their topic assignments after ten iterations of Gibbs Sampling.)

Using the chart below, toggle to see a "force-directed" layout that pulls tokens of the same word and tokens in the same document together.
This reveals the hidden structure of co-occurring words that sampling finds.

Show the "Force" Layout Show the Document Layout

Explore the Iterations

With a small enough corpus, we can watch the sampling process happen. You can examine the topic weight calculations and watch as individual tokens are reassigned into meaningful topics.

Notes and Instructions

a. Below, we've saved the result of 10 iterations of Gibbs Sampling. (Iteration 0 is just the random assignment.)

b. You can step through by individual token or by iteration. Selecting an iteration skips to the end of that iteration.

c. The topic lists are just the most common words assigned to each topic. The top six words in each topic at any point during the iterations are displayed.

d. You can toggle "Show Weights" to view each token's current topic weight calculation. Raw topic weight calculations can vary greatly, so the displayed weights have been converted to probabilities by dividing each one by the sum of both. Additionally, the algorithm only calculates topic weights for each token once per iteration; however, we've displayed what the calculation would be at any point in time. This lets you see the effect that reassigning one token has on the weight calculations of the rest of the corpus. (Note that a token's own weights do not change when it is reassigned, because each token is subtracted from the counts for its own weight calculation.)

e. Notice the purpose of hyperparameters. In iteration 4, steps 49-50, all tokens of "money" were at that point assigned to Topic 2. However, because of the hyperparameters in the topic weight equation, there was still a small chance for this token to be assigned to Topic 1. The reassignment of this token greatly influenced the chance of other tokens of "money" switching to Topic 1, because many of them were also in documents that had a moderate preference for Topic 1.

f. Beneath the "Interactive Gibbs Sampling Visualization" we've included charts of the most common word-to-topic assignments, and measures of how specific the assignments are during each iteration.

The Interactive Gibbs Sampling Visualization