Big Data and Distant Reading (no it’s not the band or reading from afar…)

Hello again lovely readers! This week is all about BIG data, distant reading, and text mining.

Distant Reading

When I first started reading the assignments for this week, I had no clue what distant reading meant, to the point where I made a joke about reading signs from far away.

Distant reading is actually a way of understanding literature or other texts not by studying a particular work, but by aggregating and analyzing massive amounts of data. So instead of a researcher looking at one novel or monograph to try to understand a period of time or a certain historiography, many books can be scanned and distantly read by a computer, which will then help identify broad patterns and trends. The concept of distant reading is based on the work of Franco Moretti, who argues that distant reading, as opposed to close reading, is the only way that the true scope and nature of literature can be understood. One of the most interesting things that Moretti and his team at the Lit Lab are working on is trying to find hidden aspects of plots within books by transforming them into networks. Moretti is testing this idea on Hamlet and has turned each character into a vertex in order to highlight the patterns within the narrative. This is what his network of Hamlet looks like.

http://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html?_r=0

While this is very interesting, the fact that Moretti is only looking at literature can be somewhat problematic. Literature does not exist in a vacuum and can help illuminate certain aspects of the culture in which it was produced, but it can also disguise other aspects of that culture. However, this is true of many sources used by historians, and the bias of any source needs to be addressed and considered within the analysis.
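
To make the network idea a little more concrete, here is a minimal Python sketch of how a play could be turned into a network. Fair warning: the scene list is toy data I made up for illustration, not Moretti's actual data, and linking characters who share a scene is my own simplifying assumption rather than his method. The function names `build_network` and `degree` are hypothetical too.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: which characters appear together in a scene.
# This is NOT Moretti's data, just an invented example.
scenes = [
    ["Hamlet", "Horatio", "Ghost"],
    ["Hamlet", "Claudius", "Gertrude"],
    ["Hamlet", "Ophelia"],
    ["Claudius", "Laertes"],
    ["Hamlet", "Horatio", "Laertes", "Claudius"],
]


def build_network(scenes):
    """Map each pair of characters to the number of scenes they share."""
    edges = Counter()
    for cast in scenes:
        # sorted() makes ("Claudius", "Hamlet") and ("Hamlet", "Claudius") the same edge
        for a, b in combinations(sorted(set(cast)), 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Count how many different characters each character is linked to."""
    links = Counter()
    for a, b in edges:
        links[a] += 1
        links[b] += 1
    return links


edges = build_network(scenes)
for name, count in degree(edges).most_common():
    print(name, count)
```

Each character is a vertex and each shared scene adds weight to the line between two characters. Even in this tiny example, the character with the most connections stands out right away, which is the kind of pattern a network view is meant to surface.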

Topic Modeling 

Topic modeling, according to Scott Weingart, is essentially a class of computer programs that “auto-magically” extract topics from texts. Each program runs an algorithm that reads texts and then spits out several lists of words that frequently occur together; each list can be seen as one of the topics of the texts. These lists can give the researcher a great place to start identifying patterns within a large set of documents, but the topics are all text and no subtext, so researchers need to be aware of this limitation. A really interesting project using topic modeling was conducted by Robert Nelson and the Digital Scholarship Lab at the University of Richmond. The project is entitled Mining the Dispatch and seeks to understand the social and political changes happening in Richmond during the Civil War; to do this, the researchers used topic modeling on the Richmond Daily Dispatch.

http://opinionator.blogs.nytimes.com/2011/05/29/of-monsters-men-and-topic-modeling/

The project then uses the resulting topics to generate graphs and charts that show the changing patterns within the news in Richmond. One of the most interesting topics showcased by this project is the one dealing with poetry, patriotism, and anti-northern diatribes.

Through the analysis of massive amounts of newspaper print, Nelson was able to see that Southerners justified the killing of Northerners through the use of vitriolic language, insisting that these men were not Christians, but instead infidels, demons, and beasts. This project was so interesting and really showcased how topic modeling and big data can highlight patterns that would otherwise be difficult to detect.
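
For anyone curious about what the “auto-magic” might look like under the hood, here is a bare-bones Python sketch of one common topic-modeling algorithm, LDA fitted with a collapsed Gibbs sampler. I am not claiming this is what Mining the Dispatch ran. The documents are made-up word lists, and the function names (`lda_gibbs`, `top_words`) and settings (`alpha`, `beta`, the number of topics) are my own assumptions for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy "articles", already split into words. Invented for illustration.
docs = [
    "army battle soldiers general battle".split(),
    "soldiers army enemy general".split(),
    "cotton market prices sale".split(),
    "sale market cotton tobacco prices".split(),
]


def lda_gibbs(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every word a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly re-draw each word's topic given all the other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=3):
    """List the n most frequent words assigned to each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        result.append([w for w in ranked if counts[w] > 0][:n])
    return result


model = lda_gibbs(docs)
for k, words in enumerate(top_words(model)):
    print("Topic", k, words)
```

Notice that the program only hands back lists of words. Deciding that one list is “about” war and another is “about” the market is still the researcher's job, which is exactly the text-but-no-subtext limitation mentioned above.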

Big Data

So what have we learned about big data? Well, for one thing, when researchers discuss big data, they mean huge, massive amounts of data. So much data that it would be nearly impossible for a human to analyze it in a timely fashion. In the digital age, computer programs are able to sift through such massive amounts of data, thus making projects like the ones mentioned above possible. Without these digital tools, it would be extremely difficult for historians and other researchers to identify and interpret the patterns that are being discovered. While some historians view the use of computing technology as tied to the idea of post-modernism and the collapse of the narrative form of history, I have to disagree. These tools make it possible for larger narratives to be discovered and discussed. Text mining, topic modeling, and the use of big data allow researchers to compare and synthesize sources that had previously been studied in isolation.

Big Data Projects

As part of the readings this week, I also looked at some digital history projects that utilized topic modeling, text mining, and lots of data. Of the projects I looked at, there were two that I absolutely loved. One, which I mentioned before, is Mining the Dispatch. This project was so interesting to me: the site was simple to navigate, and the graphs and charts were interactive and easy to understand. The section of the site dealing with the different topics really showcased the vast amount of data the project was able to use, and I think this site showcased the strengths of using topic modeling and text mining.

The other project that I really enjoyed this week, The Virginia Secession Convention, was also created by the Digital Scholarship Lab at the University of Richmond. This project analyzed the Virginia Secession Convention papers and provides amazing visualizations of the push for secession and of how the language of the convention changed. This site was interactive and easy to navigate, and I really enjoyed the maps.

Digital Tools in my Research

As part of the assignment this week, I also looked at some digital tools that make all this topic modeling and text mining possible. One of my favorites from this week is the Data for Research tool that is hosted by JSTOR. This tool allows researchers to use the JSTOR database for topic modeling and word frequency searches. I think this tool would be valuable in my research, because it would allow me to look at the frequency with which LGBT topics are discussed within certain journals and it would also allow me to see trends within the historiography of LGBT studies.
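
Here is a rough Python sketch of the kind of word frequency question I have in mind. To be clear, this does not use the Data for Research tool itself. The list of `(year, text)` records is made-up sample data, and both that format and the function name `term_rate_by_year` are my own assumptions.

```python
import re
from collections import defaultdict

# Hypothetical sample records: (publication year, article text). Invented for illustration.
articles = [
    (1985, "The history of the family and the state."),
    (1995, "Gay and lesbian history enters the journals."),
    (1995, "A lesbian community in a postwar city."),
    (2005, "Queer history, gay politics, and the archive."),
]

# Lowercase search terms; a real study would need a much more careful list.
terms = {"gay", "lesbian", "queer"}


def term_rate_by_year(articles, terms):
    """Return {year: matching words per 1,000 words} across all articles in that year."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        hits[year] += sum(1 for w in words if w in terms)
    return {year: 1000 * hits[year] / totals[year] for year in sorted(totals)}


rates = term_rate_by_year(articles, terms)
for year, rate in rates.items():
    print(year, round(rate, 1))
```

Dividing by the total number of words in each year matters, because a raw count would rise just from journals publishing more pages over time.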

Well that’s all for this week.

-E
