From Plaintext to Map

A few weeks ago we looked at a variety of DH mapping projects, among them Matt Wilkens’s work. Matt Wilkens kindly shared some of his workflow with me via email; I’ve adapted it a bit and simplified it so that we can (relatively) easily go from raw text to map. What this method gains in simplicity, it loses in sophistication. We work entirely with plaintext and so there is no real metadata here. You have to juggle your files appropriately if you want to compare texts based on publication date, gender, genre, and so forth.

This method is also very basic. Its use of NER is perhaps its weakest point, and its naive geocoding is the next weakest; both of these weaknesses produce data that requires significant intervention by hand if you are interested in eliminating errors. If you have a larger dataset, you might consider heuristics such as removing all locations that occur infrequently. For specifics, read on.

Get Your Texts on the Server

We discussed this in our workshop. The key here is to have your texts somewhere on the server where you can work with them. They need to be in plaintext; if you have TEI or other XML, you’ll just need to strip out the tags and header data (we have a simple script to do this).
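If you are curious what that kind of tag stripping involves, here is a minimal Python sketch. It is not the script on the server: the function name strip_tei is made up for illustration, and it assumes the prose sits under a TEI <text> element with the metadata kept apart in <teiHeader>.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def strip_tei(xml_text):
    """Return the prose of a TEI/XML document as plain text.

    Assumption: the prose sits under the first <text> element and the
    metadata under <teiHeader>. If there is no <text> element, fall back
    to the whole document.
    """
    root = ET.fromstring(xml_text)
    body = next((el for el in root.iter() if local_name(el.tag) == "text"), root)
    # itertext() yields only character data, so every tag disappears here.
    raw = "".join(body.itertext())
    # Collapse runs of whitespace (this also flattens paragraph breaks).
    return " ".join(raw.split())
```

Because whitespace is collapsed, paragraph breaks are lost; for NER that should not matter much, but keep it in mind if you want to read the result.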

You can use sftp to move files from your computer to the server; wget can fetch texts from the web to the server (though it will not work in all cases). Or you can use RStudio (about which, see below) to upload them. (This last may be the easiest for folks uncomfortable with the command line… but you’re going to need to use the command line for the next step anyway.)

Process Each Text with the Stanford NER

Logged into the server via ssh, for each text, type:

/home/share/NER/stanford-ner-2012-11-11/ FILENAME.TXT > TAGGED.TXT

Replacing, of course, FILENAME.TXT with your filename; that > symbol redirects the output from the NER script into a file that you can name whatever you like.

The Stanford NER has now done its best to name the various entities; we want to extract all those tagged /LOCATION. Linux command-line tools can do this relatively easily. Here is the magical incantation:

grep -o '\w*/LOCATION' TAGGED.TXT  | sed -e 's/\/LOCATION//' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn > LOCATIONS-TALLY.TXT

Here we're once again using output redirection (>) to capture the output in a filename we specify. (If you're interested in the specifics of how this odd line works, ask me in person. There are, no doubt, better ways to do this; but this one works.)
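For anyone who would rather read Python than shell, here is a rough equivalent of that pipeline, with each stage marked in the comments. It is only a sketch: the function name tally_locations is mine, and it assumes (as the grep pattern does) that the tagged file marks entities as word/TAG.

```python
import re
from collections import Counter


def tally_locations(tagged_text):
    """Count tokens tagged /LOCATION, most frequent first."""
    # grep -o '\w*/LOCATION' plus sed: keep the word, drop the tag.
    words = re.findall(r"(\w*)/LOCATION", tagged_text)
    # tr 'A-Z' 'a-z': fold case so "London" and "london" count together.
    counts = Counter(word.lower() for word in words)
    # sort | uniq -c | sort -rn: order by count, highest first. Ties are
    # put in reverse alphabetical order here, as in the sample tally.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]), reverse=True)
    return [(count, word) for word, count in ordered]
```

Note that, like the shell version, this counts one word at a time, so a two-word place name ends up as two separate entries.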

You now have a file with a tally of the "locations" (that is, words that NER guesses are locations). It looks like this (this is the top of the list from Woolf's To the Lighthouse):

 8 england
 7 london
 4 rome
 4 road
 3 minta
 3 india
 3 cardiff
 3 brompton
 2 westmorland
 2 paris

This is a good stage in the process to intervene; you can go through this file and remove false positives. (NER, you'll recall, identified "Louis," the character from The Waves, as the second most frequently mentioned location in the novel; Google, you'll also recall, locates Louis in Arkansas.) You might here wonder about the location names that NER doesn't capture. That's a good thing to wonder about...
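If the tally is long, part of this pruning can be scripted. The sketch below is one possible approach, not part of the workflow on the server: filter_tally, stoplist, and min_count are hypothetical names, and the input is assumed to be the "COUNT WORD" lines shown above.

```python
def filter_tally(lines, stoplist=(), min_count=1):
    """Parse 'COUNT WORD' lines; drop stoplisted words and rare ones."""
    blocked = {word.lower() for word in stoplist}
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count, word = int(parts[0]), parts[1]
        if word in blocked or count < min_count:
            continue  # known false positive, or too infrequent to keep
        kept.append((count, word))
    return kept
```

A frequency cutoff is a blunt instrument: it removes rare false positives, but it removes rare real places too, so it probably suits large corpora better than a single novel.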

Geocode Your Locations

Here we use the Google Geocoding API to get coordinates (latitude and longitude data that we can actually plot on a map) from Google.

python /home/share/scripts/ LOCATIONS-TALLY.TXT

This script looks at each line in the locations tally file you generated using the above command, and then queries Google for the latitude and longitude. This can take a few seconds per query; the script will let you know how it is doing as it runs.

NOTE: Google's API weights its results by region. If you don't specify a region, it picks one based on your IP address and will therefore give you results that it considers most relevant to someone in the US. If you're working chiefly on British novels, this is not so good. We can override this by specifying a region. I'll be sprucing up this code to make this easier; but if you're interested, email me.

ALSO NOTE: The Google API is chiefly interested in the present; that is, it assumes you're looking for current info. Query Google for "Constantinople," and it will point you not to Istanbul, but to "Constantinople, 37160 La Celle-Saint-Avant, France." All this is noted and explained at greater length by Wilkens.

When you run this script, it'll look something like this:

Asking google about  rayleys ...  writing results.
Asking google about  queens ...  writing results.
Asking google about  panama ...  writing results.
Asking google about  padua ...  writing results.
Asking google about  neptune ...  writing results.
Asking google about  mucklebackit ... no result for  mucklebackit
Asking google about  mile ...  writing results.
Asking google about  mexico ...  writing results.
Asking google about  madrid ...  writing results.

The output is written to a file called LOCATIONS-TALLY_coors.csv. It contains the number of times the location occurs in the original plaintext file, the location name we queried, the latitude and longitude, and the place that Google associated it with. Here is the head of the file generated for To the Lighthouse:

frequency,place,latitude,longitude,Google Address
8,england,52.3555177,-1.1743197,England; UK
7,london,51.5112139,-0.1198244,London; UK
4,rome,41.8929163,12.4825199,Rome; Italy
4,road,28.3198138,70.1007323,Road; Sdiqbd; Pakistan
3,minta,43.9166667,-80.8666667,Minto; ON N0G; Canada
3,cardiff,51.481581,-3.17909,Cardiff; UK
3,brompton,-34.894581,138.5800061,Brompton SA 5007; Australia
2,westmorland,33.0372674,-115.6213817,Westmorland; CA; USA
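To make the geocoding step less of a black box, here is a hedged sketch of what a query and a CSV row like those above might involve. It is not the script on the server. The helper names are invented; the endpoint and the JSON fields (results, geometry.location, formatted_address) follow Google's public Geocoding API as I understand it, and the service today expects an API key, which is left as a parameter here.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_query(place, region=None, key=None):
    """Build the request URL; region (e.g. "uk") biases the results."""
    params = {"address": place}
    if region:
        params["region"] = region
    if key:
        params["key"] = key
    return GEOCODE_URL + "?" + urllib.parse.urlencode(params)


def first_result(payload):
    """Return (lat, lng, address) for the top hit, or None if there is none."""
    results = payload.get("results", [])
    if not results:
        return None
    top = results[0]
    location = top["geometry"]["location"]
    return location["lat"], location["lng"], top["formatted_address"]


def csv_row(frequency, place, payload):
    """Format one output line, or None when Google found nothing."""
    hit = first_result(payload)
    if hit is None:
        return None
    lat, lng, address = hit
    # Commas in the address would add columns, so swap them for semicolons.
    return "{},{},{},{},{}".format(frequency, place, lat, lng, address.replace(",", ";"))


def geocode(place, region=None, key=None):
    """Send the query over the network and decode the JSON reply."""
    with urllib.request.urlopen(build_query(place, region, key)) as response:
        return json.load(response)
```

Taking only the first result is exactly the naive part: Google's top hit is its best guess for a present-day user, not for a 1920s novel.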

I've deliberately not cleaned my data so that you can see the problems one might encounter. England looks good. So do London, Rome, India, and Cardiff. But there are some problems here. "Road" looks like a false positive from NER; and that "Minta" is NER picking up the character "Minta Doyle." We can go back into our NER-tagged data and see that the three confusing "Minta" instances are:

It scorched her, and Lily, looking at Minta, being charming to Mr. Ramsay at the other end of the table, flinched for her exposed to these fangs, and was thankful.

he would laugh at Minta, and she, Mrs. Ramsay saw, realising his extreme anxiety about himself, would, in her own way, see that he was taken care of, and praise him, somehow or other.

She kept looking at Minta, shyly, yet curiously, so that Mrs. Ramsay looked from one to the other and said, speaking to Prue in her own mind, You will be as happy as she is one of these days.

Minta Doyle gets mentioned far more than three times; indeed, NER identifies 73 occurrences of "Minta" as a person. Returning to the text, I find 76 occurrences of the string "Minta" overall; so, it seems, NER rightly realized that "Minta" is a named entity in each of its 76 occurrences; it misidentified it as a location, rather than a person, 3/76 (or roughly 4%) of the time. You'll note that on these three occasions, the pesky preposition "at" likely accounts for some of the confusion.
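A count like this can be pulled straight from the tagged file. The following is a small sketch, assuming the word/TAG format that the grep command above relies on; tag_counts is a name made up for this example.

```python
import re
from collections import Counter


def tag_counts(tagged_text, word):
    """Count how often `word` receives each NER tag (case-sensitive)."""
    # (?<!\w) keeps us from matching the tail end of a longer word.
    pattern = r"(?<!\w)" + re.escape(word) + r"/([A-Z]+)"
    return Counter(re.findall(pattern, tagged_text))
```

With the figures reported above, the share tagged as a location is 3 / (73 + 3), or about 0.039.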

There is another kind of error too. "Brompton" in that list comes from "Brompton Road" in London; but Google has located it in Australia. Figuring out what is wrong here will require returning to the original text (so get good with a text editor and/or some full text search tools *cough*emacs*cough*). You could simply delete wrong lines; or you could fix 'em.

Fixing them can be time intensive. Once you know what the place should be, you'll need to find the correct coordinates. This website will nicely help you convert an address to a set of lat/long coordinates. I've also created a simple wrapper script on the server. Just type:

/home/share/scripts/ LOCATION

And you should be good. You can add additional information to help Google along. So:

/home/share/scripts/ Brompton
Asking google about  Brompton ...

Brompton SA 5007, Australia

Brompton, Novi, MI 48374, USA

Brompton, Rochester Hills, MI 48309, USA

Brompton, North Yorkshire, UK

Brompton, Gurnee, IL 60031, USA

Brompton on Swale, North Yorkshire DL10, UK

Brompton, Rosedale, Auckland 0632, New Zealand

Brompton, QC, Canada

Brompton, Shropshire SY5, UK

Brompton, Chatham, Medway ME4, UK

Okay, no dice. Try again.

$ /home/share/scripts/ Brompton Road
Asking google about  Brompton Road ...

Brompton Road, London SW3, UK

Brompton Road, Houston, TX, USA

Brompton Road, Lochearn, MD 21207, USA

Brompton Road, Garden City, NY, USA

Brompton Road, Buffalo, NY 14221, USA

Brompton Road, Memphis, TN 38118, USA

Brompton Road, Tonawanda, NY 14150, USA

Brompton Road, Carmel, IN 46033, USA

Brompton Road, Limestone, NC, USA

Brompton Road, Great Neck, NY 11021, USA

Region codes should work too; i.e., /home/share/scripts/ Brompton Road region=uk.

And now you can paste the appropriate coordinates into your .csv data file.

Hand correcting the data from To the Lighthouse, I removed false positives; NER had identified 52 locations; I pared that down to 35 (reducing the data by, in effect, one third). I also corrected four of those coordinates.
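If you keep your decisions in a list rather than editing the file directly, the corrections can be reapplied whenever you regenerate the CSV. Here is a minimal sketch of that idea, assuming the column headings shown earlier (place, latitude, longitude, Google Address); correct_rows, remove, and fixes are hypothetical names.

```python
import csv
import io


def correct_rows(csv_text, remove=(), fixes=None):
    """Drop false positives and overwrite bad coordinates in the geocoded CSV.

    remove: place names to delete outright.
    fixes:  {place: (latitude, longitude, address)} replacements.
    """
    fixes = fixes or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        place = row["place"]
        if place in remove:
            continue  # a false positive such as "road" or "minta"
        if place in fixes:
            # Replace Google's guess with the coordinates you looked up.
            row["latitude"], row["longitude"], row["Google Address"] = fixes[place]
        writer.writerow(row)
    return out.getvalue()
```

The function works on text rather than filenames so that it is easy to try out; in practice you would read the CSV file in, pass its contents along, and write the result to a new file.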

Painful though it is, this sort of hand correction is valuable because it makes you rethink exactly what you're doing. NER extracted "Sofia" (that is, the Hagia Sophia in Istanbul) as well as "Sistine" (from the Sistine Chapel) and tagged them as locations; I simply removed these. But should these be plotted? What exactly are we looking for again? What about the reference to a "Panama Hat"? Does that, as NER suggests, count as a location?

Attempting to massage your results into your "intentions" makes you perhaps realize how ill-understood your intentions really were. So long as this (or any method) is just a black box, it is easy to ignore these questions. But dealing with the data directly, you may begin to wonder: Do I actually understand what I'm doing?

Plot Your Data with R

Whatever amount of massaging has been required, once you have a csv file (with proper column headings) and lat/long coordinates, you're ready to graph it. To do this, we'll be taking advantage of R and some R packages (maps, ggplot2, and some others, all installed on the server).

First, log into the R server with your credentials by going to the server ip:8787. This will give you a nice, web-based front-end for interacting with R. Make sure your R working directory is the same as the one where your csv files are. Then you can essentially paste the following code in the code window and run it, changing the filename (and adding more sets of points) as necessary:


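library(ggplot2)  # ggplot(), geom_polygon(), geom_point(), map_data()
library(maps)     # supplies the 'world' outlines that map_data() reads

mdat <- map_data('world')

points <- read.csv('woolf_lighthouse-corrected.csv')

ggplot() +
geom_polygon(data=mdat, aes(long, lat, group=group), fill="grey50") +
geom_point(aes(x=points$longitude, y=points$latitude), col="#00ff0055", size=points$frequency)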

In the above code, the parts you would want to change are chiefly the variable name and the name of the CSV file you're reading from (the one you generated and, perhaps, hand-pruned and cultivated). If you want to add another set of data, put a plus at the end of that geom_point line and add another geom_point line. Here, for instance, is complete code to graph locations not only from To the Lighthouse, but from Mrs Dalloway as well (the former in red; the latter in green).


library(ggplot2)
library(maps)

mdat <- map_data('world')

lighthouse <- read.csv('woolf_lighthouse-corrected.csv')
dalloway <- read.csv('woolf_dalloway-corrected.csv')

ggplot() +
geom_polygon(data=mdat, aes(long, lat, group=group), fill="grey50") +
geom_point(aes(x=lighthouse$longitude, y=lighthouse$latitude), col="#ff000055", size=lighthouse$frequency) +
geom_point(aes(x=dalloway$longitude, y=dalloway$latitude), col="#00ff0055", size=dalloway$frequency)

A Map of Locations from Mrs. Dalloway (Green) and To the Lighthouse (Red)


There is a better way to do this, combining all your data into a single frame, and then graphing it with a legend; but I'm still working out how to do that.
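One possible route to that single frame is to merge the CSV files before they ever reach R. The sketch below is in Python rather than R and is only a suggestion: the function name combine and the added column name novel are inventions, and it assumes every file has the same column headings. In ggplot2, mapping a column to colour inside aes() is what produces a legend, so a combined file with such a column should be enough to get one.

```python
import csv
import io


def combine(named_csv_texts):
    """Merge several geocoded CSVs into one, adding a 'novel' column.

    named_csv_texts: {label: csv_text}, e.g. {"lighthouse": "...", "dalloway": "..."}.
    """
    out = io.StringIO()
    writer = None
    for novel, csv_text in named_csv_texts.items():
        reader = csv.DictReader(io.StringIO(csv_text))
        if writer is None:
            # Take the column headings from the first file, with 'novel' in front.
            fieldnames = ["novel"] + reader.fieldnames
            writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
            writer.writeheader()
        for row in reader:
            row["novel"] = novel  # record which text this location came from
            writer.writerow(row)
    return out.getvalue()
```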

Let me know if you have any questions.
