Text analysis is at the core of much of what DH does. To do distant reading, we need tools that statistically break down word use within a body of text. In Mathematica, we have so far learned how to break a piece of plain text down into word frequency lists and bags of words, and how to compare them with those of other documents.

So where have I begun to use this? My first experiment involved two important documents surrounding my object of study: the 9/11 Commission Report and the Wikipedia page on the September 11 attacks.

First I had to import these into Mathematica, both with the simple command Import["URL", "Plaintext"]. This command took an HTML page and a PDF and extracted the plain text from each. Excellent.

When I broke down the word frequencies, I found that a casual reader could get the gist of each document at a glance. That is all right for just two documents, but fascinating once you extrapolate it over a large corpus of texts. Finding the quick meaning of a document without reading it opens up a lot of possibilities.
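
As a rough illustration of the word-frequency step, here is a minimal sketch in Python rather than Mathematica. It is not the code used for this post: the stop-word list, the function name word_frequencies, and the sample sentence are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "was", "were", "that"}

def word_frequencies(text):
    """Lowercase the text, split it into words, drop stop words, and count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Made-up sample sentence standing in for an imported document.
sample = "The hijackers boarded the planes and the planes were diverted."
top_words = word_frequencies(sample).most_common(3)
print(top_words)
```

Even on one sentence, the most frequent remaining word hints at the subject, which is the "gist at a glance" effect described above.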

One example would be topic modeling. I am preparing a talk on it for April, so I am figuring out how to get topic modeling to work in Mathematica. The first step uses the bag-of-words function we covered in class. The next involves a Bayesian analysis that I am a little less sure about.

The comparison of texts also has a lot of uses. Comparing the Commission Report to the Wikipedia article revealed the much more technical nature of the Commission Report. Much more of it had to do with money and operations, while Wikipedia focused more on impacts and consequences. All of this was done without reading either text.
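
One simple way to make such a comparison concrete is to look at which words take up a larger share of one text than the other. The sketch below is in Python, not Mathematica, and only illustrates the idea. The function names and the two sample strings are invented and are not drawn from the real documents.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of all the words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def distinctive_words(text_a, text_b, top=3):
    """Words whose relative frequency in text_a most exceeds that in text_b."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    gaps = {w: freq_a[w] - freq_b.get(w, 0.0) for w in freq_a}
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:top]

# Made-up snippets standing in for the two real documents.
report_like = "funding transfers funding operations planning funding"
article_like = "impact memorial aftermath impact planning"
print(distinctive_words(report_like, article_like, top=1))
```

Using shares rather than raw counts keeps a long document from dominating a short one simply because it has more words.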

Next Steps:

I have a few ways I can progress with this text analysis in the future. With Mathematica's visualization tools, I could compare word usage between years in the Internet Archive's news database. The database has all the closed captioning text, and by finding the 9/11 anniversary tributes between 2009 and 2015, we can see how the memory of the attacks changed over that six-year period.

Another, much more detailed approach is to use Wikipedia's editing data. Wikipedia not only has its articles; every article also carries a complete history of its own edits. This means you can view any Wikipedia article at any point in its history. I could use this to build language charts of the article, finding when interesting keywords arrive and how the language changes as 9/11 goes from an event into memory, and then history.
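
A first pass at "when do interesting keywords arrive" could be as simple as scanning dated revisions in order and noting the first one that contains each keyword. This is a sketch in Python under heavy assumptions: the revisions are taken to be already downloaded as (date, text) pairs, and the dates and texts below are invented for illustration, not real Wikipedia history.

```python
import re

def first_appearance(revisions, keywords):
    """Map each keyword to the date of the earliest revision containing it (or None)."""
    first_seen = {k: None for k in keywords}
    for date, text in sorted(revisions):  # ISO dates sort chronologically as strings
        words = set(re.findall(r"[a-z']+", text.lower()))
        for k in keywords:
            if first_seen[k] is None and k in words:
                first_seen[k] = date
    return first_seen

# Invented revisions for illustration only; not real Wikipedia text or dates.
revisions = [
    ("2003-05-01", "The attacks led to new security measures."),
    ("2001-10-01", "The attacks happened this morning."),
    ("2011-09-11", "A memorial opened on the anniversary of the attacks."),
]
seen = first_appearance(revisions, ["attacks", "memorial", "museum"])
print(seen)
```

Keywords that never occur map to None, which makes it easy to tell "not yet in the article" apart from "present from the start".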