Workshop 1: NLTK & MALLET

We went over using NLTK to load texts and get frequency counts. Here is what we did:

# Import the libraries we need
import nltk, string
from nltk.corpus import stopwords

# Read the text:
millText = open('/home/share/texts/eliot/eliot_mill-on-the-floss.txt','r').read()
# The format of the preceding line is as follows.
# [variable name] = open('[path to file]','r'[for reading]).read()
# millText is now a variable containing the entire text of Eliot's Mill on the Floss
# in a single string. We now want to pre-process it and then tokenize it.

# To make it all lower case, we will call the lower() function which is
# available to all string objects.
millText = millText.lower()

# Now we'll remove punctuation

# First, let's collapse contractions by replacing apostrophes with an empty string.
# As noted, this is a bit clumsy, but it works.
millText = millText.replace("'","")

# Now, we use a for loop and the built-in set of punctuation from the string
# library to remove other punctuation, replacing it with a space.
for punct in string.punctuation:
    millText = millText.replace(punct, " ")

# Now we have finished pre-processing the text and are ready to tokenize it
# using NLTK's built in tokenizer.
mill_words = nltk.tokenize.word_tokenize(millText)

# The variable mill_words is now a list of words (not a single string). It is
# now suitable to be passed to NLTK's FreqDist function. We could also, at
# this stage, remove stop words in the following way:

# These next lines use a 'list comprehension' and NLTK's built in stopwords
# to remove stopwords from the list representing the text before we pass
# it to FreqDist to generate our frequency distribution. We build the stop
# list once, as a set, so that it is not reloaded for every word we check.
stop_set = set(stopwords.words('english'))
eliot_words = [w for w in mill_words if w not in stop_set]

# Get our word frequencies
freqs = nltk.FreqDist(eliot_words)

# TA DA! We have it.

# Now what?

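# Well, we look at the top ones:
# most_common() returns (word, count) pairs sorted by descending count, so
# the variable eliot_vocab now contains a list of the most frequently occurring
# terms in descending order.
eliot_vocab = [word for word, count in freqs.most_common()]
eliot_vocab[:50] # Top Fifty Words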

# We can get the frequency of specific terms, thus:
freqs['books'] # Returns the frequency of 'books'

# Or, have a look at frequent collocations.
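
For the collocation step, NLTK ships dedicated tools in its nltk.collocations module. As a rough illustration of the idea underneath, and not of NLTK's own scoring, here is a minimal sketch that simply counts adjacent word pairs using only the standard library. The function name top_bigrams and the sample sentence are made up for this example; in practice you would pass a token list such as mill_words.

```python
from collections import Counter

def top_bigrams(words, n=10):
    # Hypothetical helper for illustration: ranks adjacent word pairs by raw
    # frequency, which is cruder than the association measures a real
    # collocation finder would use.
    # zip(words, words[1:]) pairs each word with the word that follows it.
    pair_counts = Counter(zip(words, words[1:]))
    return pair_counts.most_common(n)

# Made-up sample tokens standing in for a tokenized novel
sample_words = "the mill on the floss stood by the mill on the river".split()
top_pairs = top_bigrams(sample_words, n=3)
```

Raw counts tend to favour pairs built from very common words, which is one reason to try this both before and after removing stopwords.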

To use MALLET, you need to put the texts you want in a directory. Then, tell MALLET to import those files into a data structure it can use; here is the code for reading in the gothic novel data we briefly looked at.

/mallet/bin/mallet import-dir --keep-sequence --input /home/share/texts/gothic/ --output gothic_data.mallet

Now, the file gothic_data.mallet is ready to be processed to “train topics”:

/mallet/bin/mallet train-topics --input gothic_data.mallet --num-topics 20 --optimize-interval 10 --output-state topic-state.gz --output-topic-keys gothic_keys.txt --output-doc-topics gothic_composition.txt

Inspect the file gothic_keys.txt for your “topics” (recall, a topic is just a set of terms which co-occur with a certain regularity; go in fear of abstractions!); inspect gothic_composition.txt for the distribution of topics across your documents. You can now tweak the process by changing --num-topics to look for more, or fewer, topics. You can also start reading more.
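
If you would rather inspect the topics from Python than in a text editor, the keys file can be read line by line. This is a minimal sketch that assumes each line of gothic_keys.txt holds three tab-separated fields (topic number, a weight, and the space-separated top terms). That layout is an assumption about your MALLET version, so check a line of your own file first. The function name parse_topic_keys and the sample lines are made up for illustration.

```python
def parse_topic_keys(lines):
    # Hypothetical helper: maps topic number -> list of top terms.
    # Assumed line layout: <topic number> TAB <weight> TAB <term term term ...>
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, _weight, terms = line.split("\t", 2)
        topics[int(topic_id)] = terms.split()
    return topics

# Made-up sample lines in the assumed layout, not real MALLET output
sample_lines = [
    "0\t0.25\tcastle night door chamber\n",
    "1\t0.10\tfather daughter letter heart\n",
]
topics = parse_topic_keys(sample_lines)

# With a real file you would instead do something like:
# with open('gothic_keys.txt') as f:
#     topics = parse_topic_keys(f)
```

Having the topics in a dictionary makes it easy to compare runs with different numbers of topics side by side.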

Let me know if you have questions. Happy hunting.
