Choose one of our demonstration document collections, or upload your own document collection.

Processing large document collections can be very CPU- and memory-intensive. Starting with Topix 3.0, compute-intensive tasks are executed in the AWS cloud.

Select one of the preset files
Try your own files

Use the 'Pick File' button to select a file that contains the documents (corpus) to process. Acceptable file formats are those with the extensions TXT, CSV, TSV, or JSON. A file with a CSV or TSV extension is assumed to have a line feed separating documents, with the fields within each document comma-delimited for CSV and tab-delimited for TSV. A file with a TXT extension allows you to specify what separates each document, such as the special case where documents span multiple lines and two line feeds (i.e., a paragraph break) separate each document. You can also specify your own text to tell Topix how to split documents within the file.
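If you want to preview how a TXT file will divide into documents before uploading it, the splitting described above can be approximated locally. This is a minimal Python sketch, not Topix's own code: the function name split_documents is hypothetical, and it assumes the delimiter is matched as literal text and that empty pieces are discarded.

```python
def split_documents(raw_text, delimiter="\n\n"):
    """Split raw text into documents on a literal delimiter string.

    The default delimiter is a paragraph break (two line feeds).
    """
    # Normalize Windows line endings so a blank line is always "\n\n".
    normalized = raw_text.replace("\r\n", "\n")
    parts = normalized.split(delimiter)
    # Drop empty pieces left by leading, trailing, or repeated delimiters.
    return [part.strip() for part in parts if part.strip()]


sample = "First document,\nline two.\n\nSecond document.\n\nThird document."
docs = split_documents(sample)
for number, doc in enumerate(docs, start=1):
    print(number, repr(doc))
```

Passing your own delimiter text, for example split_documents(text, "##"), mirrors the option of specifying custom text to split on.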

For JSON data, Topix expects a well-formed JSON file which is an array of documents, with object keys within each array element specifying each field:
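As an illustration of that shape, the Python sketch below builds a small array of document objects and serializes it with the standard json module. The key names and values are made up for this example; since fields are matched to Topix fields in Step 2, your own key names may differ.

```python
import json

# Each array element is one document; each key names one field.
# The keys and values below are illustrative only.
documents = [
    {
        "DOCID": "doc-001",
        "TITLE": "First report",
        "TEXT": "Body text of the first document.",
    },
    {
        "DOCID": "doc-002",
        "TITLE": "Second report",
        "TEXT": "Body text of the second document.",
    },
]

# Serialize to a JSON string; write this string to a .json file to upload it.
corpus_json = json.dumps(documents, indent=2)
print(corpus_json)

# Round-trip check: a well-formed file parses back into a list of objects.
parsed = json.loads(corpus_json)
```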


Except in the case of loading a JSON file (which will bring you directly to Step 2), in Step 1 you will be prompted to provide the delimiter that separates each document, and optional field delimiters (if you have separate fields). In Step 2 you will match fields in your file to those available in Topix:


(The list below describes the nine fields available in Topix that can be mapped. Each field is optional except for "TEXT")

  1. DOCID: Your Identifier for each document. If you don't have one, Topix will create a sequential number.
  2. DATE: (For future use). Either YYYY, YYYY-MM, or YYYY-MM-DD formats. For time trend analysis.
  3. LOCATION: (For future use). Country, State, etc, for filtering and aggregation.
  4. DOLLARS: (For future use). For filtering or topic-related financial analysis.
  5. CATEGORY: (For future use). For filtering or aggregation.
  6. TITLE: A Title/Label associated with each document for report listings. It is included in document profiling and topic modeling by default.
  7. TEXT: Source text (or other "tokens") to be parsed for the topic model.
  8. TAGS: (For future use). An optional list of tags/keywords (tilde-delimited) for filtering in the Topix Network Explorer.
  9. AUTHOR: (For future use). An optional author name for filtering, aggregation, and network visualization.
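
Before uploading, you may want to check your own records against the two constraints stated in the list above: TEXT is required, and DATE uses one of three formats. The following Python sketch is one way to do that. The function name check_record and the dictionary-per-document layout are assumptions made for this example, and the pattern checks only the shape of a date, not whether the day exists in that month.

```python
import re

# Matches YYYY, YYYY-MM, or YYYY-MM-DD (shape only; "2021-02-31" still passes).
DATE_PATTERN = re.compile(
    r"[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?"
)


def check_record(record):
    """Return a list of problems found in one document record (a dict)."""
    problems = []
    text = record.get("TEXT")
    # TEXT is the only required field.
    if not isinstance(text, str) or not text.strip():
        problems.append("TEXT is required")
    # DATE is optional, but when present it must use a listed format.
    date = record.get("DATE")
    if date is not None and not DATE_PATTERN.fullmatch(str(date)):
        problems.append("DATE must be YYYY, YYYY-MM, or YYYY-MM-DD")
    return problems


print(check_record({"TEXT": "Quarterly summary", "DATE": "2021-07"}))
print(check_record({"TITLE": "No body", "DATE": "07/15/2021"}))
```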
Specify a text (.txt) file, with one word or phrase per line, that contains either a list of words to be excluded ("Exclude List" or "stopwords") or a list of words to be included ("Include List") as appropriate for the option you have chosen.

An Exclude List is for words/items to be EXCLUDED from processing, such as "a, the, of, etc." An Include List is for words/items to be INCLUDED in processing. Using an Include List provides you the ability to focus your corpus profiling on a specific domain or entities of interest.
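
The difference between the two list types can be sketched in a few lines of Python. This is an illustration of the idea rather than Topix's implementation: the function names are hypothetical, and case-insensitive matching is an assumption.

```python
def load_word_list(text):
    """Parse list-file contents: one word or phrase per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def filter_tokens(tokens, exclude=None, include=None):
    """Apply an Include List if given, otherwise an Exclude List."""
    if include is not None:
        # Include List: keep only the listed items.
        return [t for t in tokens if t.lower() in include]
    if exclude is not None:
        # Exclude List: drop the listed items, keep everything else.
        return [t for t in tokens if t.lower() not in exclude]
    return list(tokens)


tokens = ["The", "budget", "of", "the", "agency", "grew"]
stopwords = load_word_list("a\nthe\nof\n")
focus = load_word_list("budget\nagency\n")

print(filter_tokens(tokens, exclude=stopwords))
print(filter_tokens(tokens, include=focus))
```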

Specify a "lexicon" file to enhance the accuracy of the NLP processing or to include domain-specific vocabulary items. Place each vocabulary item on a separate line in a text file, with a tab separating the item and the part of speech tag (or special function tag). For example, the line for defining the organization "Army and Air Force Exchange Service" would look like:

Army and Air Force Exchange Service[TAB]Organization

(Note that the special function tag "Organization" is capitalized. Capitalization of vocabulary items is ignored when matching.)

Part of speech tags you can use:  Noun, Verb, Adjective, Adverb
Additional special function tags:  Organization, Place
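
To catch formatting mistakes in a lexicon file before uploading it, you could run a check along these lines. It is a hedged Python sketch: the function name parse_lexicon is hypothetical, and it assumes that tags must be spelled exactly as listed above and that blank lines are ignored.

```python
ALLOWED_TAGS = {"Noun", "Verb", "Adjective", "Adverb", "Organization", "Place"}


def parse_lexicon(text):
    """Parse lexicon contents: one 'item<TAB>tag' entry per line."""
    lexicon = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines (an assumption of this sketch)
        item, separator, tag = line.partition("\t")
        item, tag = item.strip(), tag.strip()
        if not separator or not item or tag not in ALLOWED_TAGS:
            raise ValueError(
                f"line {line_number}: expected 'item<TAB>tag', got {line!r}"
            )
        # Items are lowercased because their capitalization is ignored.
        lexicon[item.lower()] = tag
    return lexicon


entries = parse_lexicon("Army and Air Force Exchange Service\tOrganization\n")
print(entries)
```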

By selecting the checkbox you will focus on profiling your document corpus, skipping the topic modeling steps. This reduces processing time significantly. Use this option when your goal is not to generate topics, but to discover the basic composition of the corpus vocabulary using one of the Topix NLP (Natural Language Processing) options.

Specify the number of topics to generate. Choose more topics for finer granularity, fewer for a higher-level summary of the themes in your corpus. (The default is 10)
Specify the number of iterations (calculations) to fine-tune the discovery of topics. In general, the more iterations you select, the more stable the outcome. However, additional iterations take more time. Experiment! (The default is 50)
(min 10 - max 120)
Provide a title for later identification of this document collection (corpus). It will be included in reports and in the first line of downloaded files.
(min 20 - max 300)
Provide a description of the document collection and the processing options you choose for later reference. It will be included in reports and in the first line of downloaded files.

You have a variety of options to derive the "tokens" (logical pieces of interest) from your document collection. (The default is to use all words as tokens. This is the correct choice for non-English language text or special symbols.)

For English language text you have the option of including only specific parts of speech, such as nouns, verbs, adjectives, adverbs, and other Natural Language Processing (NLP) options such as entity extraction for persons, places, organizations, locations, values and dates. These advanced NLP options are not perfect, but may provide insight into a collection that would not otherwise be possible.

If you press START without specifying options, Topix will process the default demo document collection with the default exclude list and the default values for Topics and Iterations.