Exploration Options

Choose one of our demonstration document collections, or upload your own document collection.

Processing large document collections can be very CPU and memory intensive. Therefore, please limit the size of your document collection if you wish to explore using a smartphone or a low-powered computer. (If you receive a "Kill/Wait" alert in your browser, select "Wait" to continue processing.)

The following factors affect both the memory and the elapsed time required for processing:

  • Number of documents
  • Number of tokens in each document (document size)
  • Size of the vocabulary (unique tokens)
  • Number of topics
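
To get a feel for how these factors combine, here is a rough back-of-the-envelope sketch. It assumes a Gibbs-style sampler that keeps a document-by-topic count table, a vocabulary-by-topic count table, and one topic assignment per token, all stored as 4-byte integers. That layout is an assumption made for illustration only; the actual algorithm and memory use of this tool may differ, and the function name is hypothetical.

```python
def estimate_count_table_bytes(num_docs, avg_tokens_per_doc, vocab_size,
                               num_topics, bytes_per_count=4):
    """Rough size of the tables an assumed Gibbs-style topic sampler would hold."""
    doc_topic = num_docs * num_topics * bytes_per_count             # documents x topics
    topic_word = vocab_size * num_topics * bytes_per_count          # vocabulary x topics
    assignments = num_docs * avg_tokens_per_doc * bytes_per_count   # one topic id per token
    return doc_topic + topic_word + assignments


# Two hypothetical collections: a small one and a larger one with more topics.
small = estimate_count_table_bytes(1_000, 200, 20_000, 10)
large = estimate_count_table_bytes(10_000, 200, 20_000, 50)
print(f"small: {small / 1e6:.1f} MB, large: {large / 1e6:.1f} MB")
```

Under these assumptions, every one of the four factors multiplies into at least one table, which is why reducing any of them helps on a low-powered device.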

We're now developing the option to run memory-intensive parts of the topic modeling process in our cloud. Stay tuned!

Select one of the preset files

Try your own files

FORMAT #1: Provide a "Documents" text (.txt) file with the following "header" text in the first line: #MULTIPLE-LINE-TEXT
(Make sure the hash symbol (#) is the first character in the line.)

Separate each document with the text: #DOC-END

This format is convenient for cutting and pasting from Word and other formats without the requirement of removing line or paragraph breaks. You can prepare your corpus using, for example, Microsoft Word, and simply use the "Save As" feature to save as a text file (.txt).

The following example includes three documents:

[Figure: example of the multiple-line text format]
Indentation and blank lines are not required.
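
If you prefer to generate a FORMAT #1 file with a script rather than by hand, the following minimal sketch shows one way to do it. The helper names and sample documents are hypothetical. It assumes that #DOC-END goes on its own line between documents and that no separator is needed after the last document; the parser accepts a trailing separator either way.

```python
HEADER = "#MULTIPLE-LINE-TEXT"   # must be the very first line of the file
SEPARATOR = "#DOC-END"           # assumed to sit on its own line between documents


def build_corpus(documents):
    """Join documents into one text body in the multiple-line format."""
    body = ("\n" + SEPARATOR + "\n").join(doc.strip() for doc in documents)
    return HEADER + "\n" + body + "\n"


def split_corpus(corpus_text):
    """Recover the individual documents from a multiple-line corpus."""
    lines = corpus_text.splitlines()
    if not lines or lines[0] != HEADER:
        raise ValueError("first line must be " + HEADER)
    documents, current = [], []
    for line in lines[1:] + [SEPARATOR]:   # sentinel flushes the last document
        if line.strip() == SEPARATOR:
            text = "\n".join(current).strip()
            if text:                       # ignore empty documents
                documents.append(text)
            current = []
        else:
            current.append(line)
    return documents


docs = [
    "First document.\nIt spans two lines.",
    "Second document.",
    "Third document.",
]
corpus_text = build_corpus(docs)
round_trip = split_corpus(corpus_text)
```

Writing `corpus_text` to a .txt file gives a three-document collection; `split_corpus` is only a quick check that the separators are where you expect them.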

FORMAT #2: Provide a "Documents" text (.txt) file with "header" text in the first line that specifies which fields are to be processed. The first character in the line must be a hash (#) symbol. Use a tab-delimited format (tabs between fields; a linefeed at the end of each line) for both the header line and all subsequent lines with data. The list below describes the eight fields that can be included. (Each field is optional except for "TEXT".)


#DOCID[TAB]YEAR[TAB]LOCATION[TAB]DOLLARS[TAB]CATEGORY[TAB]TITLE[TAB]TEXT[TAB]KEYWORDS[LF]

  1. DOCID: Your Record Identifier.
  2. YEAR: Four-digit year.
  3. LOCATION: Country, State, etc.
  4. DOLLARS: For topic-related financial analysis.
  5. CATEGORY: For more detailed analysis by a user-specified category.
  6. TITLE: Label to be associated with each document for report listing.
  7. TEXT: Source text (or other "tokens") to be parsed for the topic model.
  8. KEYWORDS: Optional list of keywords for display and searching in the Topix Network Explorer.
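
As an illustration of FORMAT #2, the sketch below builds a tab-delimited file body from a list of records. The function names and sample records are hypothetical, and it assumes that data columns appear in the same order as the header fields. Tabs and line breaks inside a value are collapsed to single spaces so they cannot be mistaken for field or record separators.

```python
def clean(value):
    """Collapse tabs, line breaks, and repeated spaces into single spaces."""
    return " ".join(str(value).split())


def build_tab_delimited(records, fields):
    """Return the header line plus one tab-delimited line per record."""
    if "TEXT" not in fields:
        raise ValueError('the "TEXT" field is required')
    lines = ["#" + "\t".join(fields)]          # hash must be the first character
    for record in records:
        lines.append("\t".join(clean(record.get(name, "")) for name in fields))
    return "\n".join(lines) + "\n"             # linefeed at the end of each line


fields = ["DOCID", "YEAR", "TITLE", "TEXT"]    # a subset of the eight fields
records = [
    {"DOCID": "A1", "YEAR": 2019, "TITLE": "Budget memo",
     "TEXT": "Funding\tfor the\nnew program."},
    {"DOCID": "A2", "YEAR": 2020, "TITLE": "Status report",
     "TEXT": "Program is on schedule."},
]
table_text = build_tab_delimited(records, fields)
```

Save `table_text` to a .txt file to upload it. Leaving out an optional field simply means leaving it out of both the header and the data lines.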
Specify a file that contains either an "Exclude List" (i.e., a stop list) or an "Include List." An Exclude List is for words/items to be EXCLUDED from processing. An Include List specifies the words/items to be INCLUDED in processing; all other words/items will be ignored. (Place each word/item on a separate line.)
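
The difference between the two list types can be sketched in a few lines. This is only an illustration of the behavior described above, not the tool's own code; the function names are hypothetical, and case-insensitive matching is an assumption.

```python
def load_word_list(text):
    """Read one word/item per line, skipping blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def filter_tokens(tokens, word_list, mode="exclude"):
    """Apply an Exclude List or an Include List to a sequence of tokens."""
    if mode == "exclude":
        return [t for t in tokens if t.lower() not in word_list]
    if mode == "include":
        return [t for t in tokens if t.lower() in word_list]
    raise ValueError("mode must be 'exclude' or 'include'")


tokens = ["The", "army", "and", "the", "exchange", "service"]
stop_words = load_word_list("the\nand\n\n")
kept_after_exclude = filter_tokens(tokens, stop_words, mode="exclude")
kept_after_include = filter_tokens(tokens, load_word_list("army\nservice\n"),
                                   mode="include")
```

With the Exclude List, everything except the listed words survives; with the Include List, only the listed words survive.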

Specify a "lexicon" file to enhance the accuracy of the NLP processing or to include domain-specific vocabulary items. Place each vocabulary item on a separate line in a text file, with a tab separating the item and the part of speech tag (or special function tag). For example, the line for defining the organization "Army and Air Force Exchange Service" would look like:

Army and Air Force Exchange Service[TAB]Organization

(Note that the special function tag "Organization" is capitalized. Capitalization of vocabulary items is ignored when matching.)

Part of speech tags you can use:  Noun, Verb, Adjective, Adverb
Additional special function tags:  Organization, Place
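
Before uploading a lexicon file, you may want to check it for formatting mistakes. The sketch below is a hypothetical validator, not part of the tool: it assumes exactly one tab between the item and its tag, accepts only the tags listed above, and stores items in lower case because capitalization of vocabulary items is ignored when matching.

```python
ALLOWED_TAGS = {"Noun", "Verb", "Adjective", "Adverb", "Organization", "Place"}


def parse_lexicon(text):
    """Map each vocabulary item (lower-cased) to its tag, or raise on a bad line."""
    lexicon = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                                  # skip blank lines
        item, tab, tag = line.partition("\t")         # split at the first tab
        item, tag = item.strip(), tag.strip()
        if not tab or not item or tag not in ALLOWED_TAGS:
            raise ValueError(f"line {number}: expected 'item<TAB>tag', got {line!r}")
        lexicon[item.lower()] = tag
    return lexicon


sample = "Army and Air Force Exchange Service\tOrganization\nrun\tVerb\n"
lexicon = parse_lexicon(sample)
```

A line with a misspelled or lower-case tag (for example "organization") is rejected here, which mirrors the note that tags are capitalized.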

Select the checkbox to run the process without advanced visualizations (faster).

Specify the Number of Topics to generate. (The default is 10.)
Specify the number of iterations to reach a steady state of topics. (The default is 50.)
(min 10 - max 120)
Provide a title for later identification of this document collection (corpus). It will be included in reports and in the first line of downloaded files.
(min 20 - max 300)
Provide a description of the document collection. It will be included in reports and in the first line of downloaded files.

You have a variety of options to derive the "tokens" (logical pieces) from your document collection. The default is to use all words as tokens. (This is the correct choice for non-English language text or special symbols.)

For English language text you have the option of including only specific parts of speech, such as nouns, verbs, adjectives, and adverbs, as well as other Natural Language Processing (NLP) options such as entity extraction for persons, places, organizations, locations, values and dates. These advanced NLP options are not perfect, but may provide insight into a collection that would not otherwise be possible.

Save    Load

Save all of your preferences for future retrieval, or restore preferences from a previously saved file.
Select the checkbox to also load the original files.

If you press START without specifying options, you will run the default demo document collection and exclude list with the default values for Topics and Iterations.