Week 7: Crunchy

[Image: black box diagram, via Wikipedia: http://en.wikipedia.org/wiki/File:Blackbox.svg]

This week things get crunchy; lots of nGrams follow. We’ll try to tighten up people’s grasp of topic modelling in class.
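For anyone who wants to poke at topic modelling before class, here is a minimal sketch of an LDA run in Python. Everything in it is my own stand-in, not anything from the posts below: scikit-learn as the library, a four-document toy corpus, two topics. Real work would need a real corpus and real preprocessing.

```python
# A toy LDA run: tiny corpus, two topics. This only shows the moving parts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the dog was trained to fetch and sit",
    "training a puppy takes patience and treats",
    "capitalism and communism shaped nineteenth century debate",
    "economic theory of capitalism in the victorian press",
]

# LDA works on raw word counts, not tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each inferred topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```

Note what the demo does not do: nothing in the output tells you whether the topics mean anything. That is exactly the worry several of this week's posts circle around.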

  • Joseph’s post plumbs the space between signal and noise, asking how we might use machine learning techniques without simply turning them into confirmation engines. Bringing together Schmidt’s reservations about LDA with Heuser and Le-Khac’s concern that provocative data may too quickly be dismissed as erroneous, Joseph’s post ends with the question: mightn’t these sorts of apparent hiccups in the data be a spur not to better algorithms, but to closer reading?
  • This week Peter gets excited about Victorian pets and martial arts, before being disappointed on both counts. Along the way he is frustrated by processor architectures and statistics, and finds a great picture connecting suffragism and Jiu-Jitsu. Peter’s post nicely balances the challenges of borrowing our methods from elsewhere against the excitement of new methods.
    (I might have borrowed his concluding metaphor from his other area of concern in this post: pet ownership. Is training a classifier like training a puppy?) The comments thread here is already quite excellent.
  • Chris Barnes is interested in space. His post wonders about what attention to space can contribute to our understanding of texts and of space in texts. We’ll be spending more time with maps next week.
  • Jesse Menn worries about black boxes and data. When we start querying these sorts of databases, do we make a fetish of data we do not understand?
  • Peter D uses nGrams to consider capitalism and communism, in both English and French. The results are provocative, but for Peter, doubts remain: “I remain unconvinced that quantitative methods will necessarily produce more reliable information about history (literary or otherwise).” Is more reliable information what we’re looking for? (A sketch of the arithmetic behind such queries follows this list.)
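Since Peter D’s post turns on what an nGram plot actually measures, here is the arithmetic in miniature: the relative frequency of a term is its count divided by the total token count for that slice of the corpus. This is not the Ngram Viewer itself; the two-year “corpus” below is a made-up stand-in, where Peter’s real queries ran against the Google Books data.

```python
# Relative frequency, the quantity behind an nGram plot:
# occurrences of a term divided by total tokens in that year's text.
import re
from collections import Counter

def relative_frequency(text: str, term: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return counts[term] / len(tokens) if tokens else 0.0

# A toy stand-in for a year-sliced corpus.
corpus_by_year = {
    1848: "the spectre of communism haunts the capitals of europe",
    1867: "capital and the critique of capitalism occupied the press",
}

for year, text in sorted(corpus_by_year.items()):
    for term in ("capitalism", "communism"):
        print(year, term, f"{relative_frequency(text, term):.4f}")
```

The division is trivial; Peter’s doubt lives entirely in what stands behind corpus_by_year, and no amount of smoothing in the plot answers that question.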