Next Steps

Another term ends, and I wanted to say a few last words about how I want to use these tools going forward. First of all, learning is a lifelong process, and after the course is finished, I have plans to keep learning and practicing Mathematica. As I begin real work with real data to answer historical questions, I think some of that theoretical power can be put to real use.

I also want to increase my theoretical knowledge of what's going on here. I have textbooks on linear algebra, pre-calculus, information retrieval, and Mathematica ready to go deeper. It's gonna be a fun run.



Machine Learning

So, so excited for Machine Learning. I think it will be the cornerstone of my research. My current plan is to do a batch analysis of the New York Times in the first year after the September 11th attacks, since 9/11 has so many terms and ways of writing about it. I will go through a handful of articles to train the machine on articles about the attacks, then set it loose on the stack, finding every 9/11-focused article in the pile, and use them for further analysis.
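The train-then-classify workflow is the kind of thing Mathematica's Classify handles, but the underlying idea is simple enough to sketch by hand. Here is a rough illustration in Python of a tiny bag-of-words Naive Bayes classifier; the sample articles and the "911"/"other" labels are invented stand-ins:

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count word frequencies and document totals per label."""
    counts = {}             # label -> Counter of words
    doc_totals = Counter()  # label -> number of training docs
    for text, label in labeled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        doc_totals[label] += 1
    return counts, doc_totals

def classify(text, counts, doc_totals):
    """Return the label with the highest smoothed log-probability."""
    vocab = {w for c in counts.values() for w in c}
    best_label, best_score = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        score = math.log(doc_totals[label] / sum(doc_totals.values()))
        for w in text.lower().split():
            score += math.log((c[w] + 1) / (total + len(vocab)))  # Laplace smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented hand-labeled training articles
training = [
    ("hijackers attacked the world trade center towers", "911"),
    ("rescue workers at ground zero after the attacks", "911"),
    ("the yankees won the baseball game last night", "other"),
    ("city council debates new parking rules downtown", "other"),
]
counts, totals = train(training)
print(classify("memorial for the trade center attacks", counts, totals))
```

On real data, the training set would be the handful of hand-labeled articles, and everything else in the stack would get run through the classifier.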

I could also make a graph with the X-axis being the chronology of NYT issues, and the Y-axis the number of articles about 9/11. I think one year might be interesting, but a decade might show how we remember and forget things as they get older. The peaks and valleys might be very surprising.

It's exciting stuff, and computer science is really digging deep into its implications these days.



Search Engines

So my project du jour is attempting to make a public history display for my Digital History course, and I am thinking that Mathematica might not be the place to do it.

I am trying to make a game based on the popular summer hit Her Story, which is basically playing a search engine. I am trying to find a way to build a search engine that looks through about 70 txt files and, when you search for a particular word, returns only the 5 most relevant results. Currently, Mathematica has some of that machinery built in, but the limitations have been a real pickle. It seems no one is interested in FEWER results.
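For what it's worth, the "top 5 only" behaviour is easy to get once you control the ranking yourself. Here is a sketch in Python (rather than Mathematica) of a tiny TF-IDF search over a handful of documents; the file names and contents are invented:

```python
import math
from collections import Counter

def build_index(docs):
    """docs: dict of filename -> text. Returns per-doc term counts and document frequencies."""
    term_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    df = Counter()
    for counts in term_counts.values():
        df.update(set(counts))
    return term_counts, df

def search(query, term_counts, df, k=5):
    """Score each document by summed TF-IDF of the query words; return the top k names."""
    n = len(term_counts)
    scores = {}
    for name, counts in term_counts.items():
        score = sum(counts[w] * math.log(n / df[w])
                    for w in query.lower().split() if df[w])
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = {
    "doc1": "the interview mentions the basement and the hammer",
    "doc2": "she talks about the mirror and the garden",
    "doc3": "the hammer comes up again in this interview",
    "doc4": "nothing relevant here at all",
}
term_counts, df = build_index(docs)
print(search("hammer", term_counts, df))
```

Capping the output is then just the `[:k]` slice, and over ~70 text files this brute-force approach would be plenty fast.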

My next move is to try out a Google Custom Search, upload the sources to webpages, and use that. It returns a limit of 10 results, which is a start, but I feel like I should go farther. There is some code tweaking possible, but it's all in JavaScript, so I will need to learn enough of it to do this. I feel like I am missing something, though, because as a concept it's such a simple thing, but finding the platform to do it is really rough.



Justin Trudeau's Government is most like X

I heard this a lot in the weeks following the election of Trudeau last month. Political pundits tried to weigh the seat situation in Ottawa and the provinces to establish what made this period of Canadian politics unique, or just like period X in the past. Once we learned about the cool things we can do with linear algebra, I began to suspect we could answer this with data.

Here is my plan, currently a work in progress. We worked with strings where each word was a nested list within a list of the whole work. What if we looked at it a little differently? Say we made a list where each "page" was a state of Canadian parliaments, i.e. every year a provincial or federal election takes place, and each word was a seat. Using the techniques we learned, we could find some cool pieces of data.

My thoughts so far are that we could compare any state of Canadian politics, find the other states closest to that state, and answer when the Canadian political scene actually did resemble the one we're studying. We can also use a TF-IDF score, and something I will try to make work called an anti-TF-IDF score, to find what makes this period unique. The anti-TF-IDF score would find words that are more common in the corpus than in the piece analysed, showing where there are parties with more or fewer seats than average.
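The "which past parliament is closest" question reduces to comparing vectors. A minimal sketch in Python, treating each federal election year as a vector of party seat counts (approximate figures, for illustration only):

```python
import math

# Approximate seat counts by party for three federal elections (illustrative)
states = {
    "2006": {"CPC": 124, "LPC": 103, "BQ": 51, "NDP": 29},
    "2011": {"CPC": 166, "NDP": 103, "LPC": 34, "BQ": 4, "GPC": 1},
    "2015": {"LPC": 184, "CPC": 99, "NDP": 44, "BQ": 10, "GPC": 1},
}

def cosine(a, b):
    """Cosine similarity between two party -> seats dictionaries."""
    dot = sum(a.get(p, 0) * b.get(p, 0) for p in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def most_similar(target, states):
    """The past state whose seat distribution is closest to the target's."""
    others = [k for k in states if k != target]
    return max(others, key=lambda k: cosine(states[target], states[k]))

print(most_similar("2015", states))
```

The same dictionaries could feed the TF-IDF and anti-TF-IDF scoring: instead of comparing whole vectors, score each party by how its seat share in one state differs from its average share across the corpus.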

I am curious whether this sounds at all interesting to you, and what pitfalls you might foresee. I am not an expert on Canadian politics, so thoughts would be appreciated.

Until next time!



Big Files

My recent work has been trying to manage large pieces of text (about eleven texts of 3-4 million words each) from my cognate project. In order to improve on my work, I will need to drastically improve at cleaning up text. At the moment, this corpus, a collection of New York Times articles that mention September 11th, has a lot of metadata at the top of each article that the database added on.

I believe there is a new way to structure data that might also help to do cool collocations within the database. My dream is to find when new words enter the lexicon of the 9/11 attacks, begin the periodization on the scale of the paper, and also discern those interesting connections I might have missed.

Solutions? Bursting the text file into individual articles. I actually had someone do this for me in R before, but I want to learn how to do it myself. Then I would need to find what I can do with about 5,000 short text files.
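The bursting step can be sketched with a regular-expression split. This assumes, hypothetically, that each article begins with a header line like "Document 12 of 5000"; the real pattern would have to match whatever metadata the database actually prepends:

```python
import re

# Invented miniature stand-in for the big corpus file
big_text = """Document 1 of 3
City Mourns at Ground Zero
...article text one...
Document 2 of 3
Markets Reopen Downtown
...article text two...
Document 3 of 3
Memorial Planned for Victims
...article text three..."""

# Split on the (hypothetical) header lines, then drop empty chunks
articles = [a.strip() for a in
            re.split(r"(?m)^Document \d+ of \d+\s*\n", big_text)
            if a.strip()]
print(len(articles))
```

Each chunk could then be written to its own numbered text file, giving the ~5,000 short files to play with.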

I think I am getting to a point where Mathematica is allowing me to do some really cool stuff, but there are a few tools I still need to work on. More as it develops!




This past class I saw a glimpse into what the future of my PhD will look like; after reading chapter 5 of the textbook, even more so. I want to talk about what I see as the major hurdles of my research, given what we've learned about scraping and data crunching.

1. Bandwidth and Copyright

These two words are likely to haunt my dissertation work. Luckily I pay the extra cash for unlimited bandwidth, but I am already seeing that when it comes to scraping databases, I will have to work around various degrees of grumpiness from system admins. Paywalls will also plague my work. Much of my material will still be in copyright, meaning that following the law will be harder than I expect. I should say these are better problems to have than the one most historians face: vast amounts of data simply not existing.

2. Cleaning data

This part I worry about. The data we worked with last night showed that things can get messy really quickly, and on the wild internet even more so. I have my work cut out for me, and had better get good at writing strong string-search patterns.
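By string-search patterns I mean things like regular expressions. A small Python sketch of the kind of cleanup involved; the boilerplate lines here are invented stand-ins for real database metadata:

```python
import re

# Invented stand-in for an article with database boilerplate on top
raw = """Copyright 2002 The Newspaper Company
All rights reserved.

The   city re\u2010opened schools  today."""

text = re.sub(r"(?m)^(Copyright .*|All rights reserved\.)\s*$", "", raw)  # strip boilerplate lines
text = text.replace("\u2010", "-")        # normalize odd Unicode hyphens
text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(text)
```

Each real corpus will need its own patterns, but the workflow (strip known boilerplate, normalize characters, collapse whitespace) stays the same.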



The New Project

Hello there Digital Research Methods readers,

I have a new project I have begun working on over in the Digital History course. My goal for the semester, if I can figure it out, is to build an educational history game engine in Mathematica. It's a simple concept for a game, but it has a proven precedent. More as it develops. I did a write-up over here.



Mo' Text Mo' Problems

My project this week was to try some of the methods we learned in class on a larger set. My object of study du jour seems to be the 9/11 Commission Report. I got some interesting data, but I think there is a piece missing. The first thing to mention is some sort of error in either the OCR of the PDF I converted, or something wonky with the PDF itself. I have a lot of capitalized words beginning with strange accented letters, and I am not sure of the process for cleaning up data yet, especially in lists. On top of that, the word list is huge. It will be my next mission to find out how I can either make sense of a large word list, or find a way to analyse it usefully.
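One plausible way to attack those strange accented capitals is Unicode normalization: decompose each word, then throw away anything that isn't a plain ASCII letter. A sketch in Python, with invented OCR artifacts standing in for the real ones:

```python
import unicodedata

# Word list with invented OCR artifacts mixed in:
# an accented capital, an "fi" ligature, and a stray Greek letter
words = ["\u00c3ttacks", "report", "\ufb01nding", "\u03a9commission", "interview"]

def clean(word):
    """Decompose accented/ligature characters, keep only ASCII letters."""
    normalized = unicodedata.normalize("NFKD", word)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    return ascii_only if ascii_only.isalpha() else None

cleaned = [c for w in words if (c := clean(w)) is not None]
print(cleaned)
```

This is lossy by design: words that dissolve entirely into non-ASCII junk get dropped rather than guessed at.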

I had much more success with the collocation. On top of expected bigrams like Clinton administration, I found new names that I can then use to build up my research. Here are the results:

{{"PM", "Page"}, {"report", "interrogation"}, {"FBI", 
  "report"}, {"New", "York"}, {"report", 
  "investigation"}, {"investigation", "interview"}, {"Saudi", 
  "Arabia"}, {"reports", "interrogations"}, {"President", 
  "Clinton"}, {"September", "11"}, {"President", "Bush"}, {"analytic",
   "report"}, {"national", "security"}, {"electronic", 
  "communication"}, {"NSC", "memo"}, {"intelligence", 
  "community"}, {"briefing", "materials"}, {"law", 
  "enforcement"}, {"American", "11"}, {"muscle", 
  "hijackers"}, {"interview", "June"}, {"interview", 
  "transcript"}, {"audio", "file"}, {"covert", "action"}, {"NSC", 
  "email"}, {"fire", "safety"}, {"CIA", "briefing"}, {"letterhead", 
  "memorandum"}, {"Al", "Qaeda"}, {"NSC", "staff"}, {"air", 
  "traffic"}, {"impact", "zone"}, {"summary", "re"}, {"repeater", 
  "channel"}, {"planes", "operation"}, {"Ladin", "unit"}, {"national",
   "intelligence"}, {"New", "Jersey"}, {"investigative", 
  "summary"}, {"intelligence", "agencies"}, {"FAA", 
  "report"}, {"flight", "training"}, {"interrogation", 
  "detainee"}, {"command", "post"}, {"embassy", "bombings"}, {"upper",
   "floors"}, {"FAA", "headquarters"}, {"found", 
  "evidence"}, {"terrorist", "attacks"}, {"East", "Africa"}, {"Rice", 
  "meeting"}, {"Qaeda", "operatives"}, {"Manila", 
  "air"}, {"interview", "10"}, {"CIA", "cable"}, {"investigation", 
  "interviews"}, {"reports", "investigation"}, {"FBI", 
  "reports"}, {"primary", "radar"}, {"civil", 
  "aviation"}, {"interview", "13"}, {"Clinton", 
  "administration"}, {"field", "offices"}, {"FBI", "agent"}, {"Qaeda",
   "members"}, {"secretary", "defense"}, {"FBI", 
  "letterhead"}, {"weapons", "mass"}, {"mass", 
  "destruction"}, {"traffic", "control"}, {"CIA", 
  "memo"}, {"training", "camps"}, {"Saudi", 
  "government"}, {"interview", "15"}, {"memorandum", 
  "investigation"}, {"Clinton", "meeting"}, {"radio", 
  "channels"}, {"Bush", "administration"}, {"flight", 
  "attendants"}, {"talking", "points"}, {"evacuation", 
  "order"}, {"telephone", "call"}, {"Saudi", "nationals"}, {"King", 
  "Fahd"}, {"information", "sharing"}, {"CIA", "FBI"}}

What sticks out? The term muscle hijackers is an important one. It turns out to be the operating term for the 9/11 hijackers who were there to subdue the passengers and crew, not the ones trained to fly the airplanes. King Fahd appears regularly; Fahd was the king of Saudi Arabia at the time. Also, look at how many bigrams are instantly recognizable as part of a mutually agreed-upon 9/11 lexicon: Al-Qaeda, the Bin Laden unit, training camps, weapons of mass destruction. Saudi Arabia is mentioned often, as are many of the different government organizations that the commission investigated. You can get some very interesting data with collocation.

Looking forward to tomorrow!



Rough Week

This week I found myself not getting as much Mathematica time as I'd like because of various forces. I will say, though, that this week and next feel very much like I am getting some strong first principles down. My main struggle for now is thinking about how we can take this data we're finding and translate it into information and research. We're getting data, and I am curious how to use it.



Picking Apart My Own Writing

Hello Internet,

This week I decided to apply the lessons I learned on analysing text to my own work. At the moment we're building up first principles, but I can definitely see that when I get better at things like pure functions, I will be able to do some powerful stuff.

What enamoured me about our class was the discussion of an author's "fingerprint": the way we find how writers write. I decided to use these tools to do a little bit of introspection, and see if maybe I have some cool quirks in my writing.

My corpus is a collection of five papers I've written since I started grad school. It includes three seminar papers and two MA papers. Though I have a definite research interest, these papers cover a few slightly different topics: 9/11 reconstruction at Ground Zero, Islamophobia in America in the short 21st century, free love and sex in the historiography of Emma Goldman, and a comparison of Occupy Wall Street with the Industrial Workers of the World.

Word Clouds

The first tool I played with is word clouds. I wanted to see if I could tell at a glance how my papers looked. I was rather unsurprised by the result, but it nonetheless gave me my first infographics on my papers.

So we can see that the biggest words in each essay pretty much identify the topic of the paper. Overall a happy result. I then googled the Join command in order to build a total word list and do a word cloud on the word frequencies all together.
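That Join-then-retally step has a direct analogue in Python's Counter, which I'll use here as a sketch; the per-paper counts are invented miniatures:

```python
from collections import Counter

# Invented miniature word-frequency lists, one per paper
paper1 = Counter({"memory": 12, "ground": 8, "zero": 8})
paper2 = Counter({"memory": 5, "islamophobia": 9})
paper3 = Counter({"goldman": 14, "memory": 3})

# Counter addition merges the lists and sums the shared words,
# the same effect as Join-ing the word lists and re-tallying
combined = paper1 + paper2 + paper3
print(combined.most_common(2))
```

The combined frequency list is what the total word cloud gets drawn from, which is also why the biggest corpus (the 9/11 papers) dominates it.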

Overall a disappointment here. Not much to glean except that my papers on 9/11 topics were the biggest.

Modal Verbs

The modal verbs didn't yield a ton of insight. The only thing is that maybe I overuse the word "would".


I decided to see what my writer's fingerprint might be with bigrams and trigrams. Comparing the 5 papers together, I found these as my common bigrams:

{{"according", "to"}, {"act", "of"}, {"and", "a"}, {"and", 
  "the"}, {"as", "a"}, {"as", "the"}, {"as", "well"}, {"at", 
  "the"}, {"because", "of"}, {"by", "the"}, {"by", 
  "their"}, {"desire", "to"}, {"does", "not"}, {"for", 
  "the"}, {"from", "a"}, {"from", "the"}, {"has", "its"}, {"in", 
  "the"}, {"into", "the"}, {"is", "a"}, {"is", "an"}, {"is", 
  "that"}, {"is", "the"}, {"it", "is"}, {"it", "to"}, {"much", 
  "more"}, {"of", "a"}, {"of", "all"}, {"of", "the"}, {"of", 
  "this"}, {"on", "a"}, {"on", "the"}, {"one", "of"}, {"over", 
  "the"}, {"rather", "than"}, {"than", "a"}, {"that", "is"}, {"that", 
  "it"}, {"that", "the"}, {"that", "this"}, {"the", "major"}, {"the", 
  "united"}, {"there", "are"}, {"they", "are"}, {"they", 
  "were"}, {"this", "is"}, {"to", "be"}, {"to", "the"}, {"to", 
  "their"}, {"tristan", "johnson"}, {"was", "the"}, {"way", 
  "to"}, {"well", "as"}, {"with", "a"}, {"would", "be"}}

Instantly popping out at me is how bad I am with the passive voice; "there is" is usually not a good bigram to see a lot of. "Desire" seems like an interesting word to then look up in my works, to see why such a term shows up in so many of my writings.
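The "common bigrams" computation boils down to intersecting per-paper bigram sets. A sketch in Python, with invented stand-in papers:

```python
def bigrams(text):
    """All adjacent word pairs in a text, as a set."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

# Invented miniature stand-ins for the five papers
papers = [
    "one of the major themes as well as the minor ones",
    "this is one of the cases as well as a counterexample",
    "as well as that it is one of the debates",
]

# Bigrams shared by every paper in the corpus
common = set.intersection(*(bigrams(p) for p in papers))
print(sorted(common))
```

Even in this toy version, the survivors of the intersection are the connective tissue ("one of", "as well as") rather than topic words, which matches the real list above.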


My trigrams list was a very small one:

{{"as", "well", "as"}, {"one", "of", "the"}}

"One of the" is the more interesting of the two. It seems it is my favourite way of introducing case studies or tangents. Overall, this gives me a little bit of a glimpse into what my writing style is like, and whether it tells me much about my own writer's psyche.



Using Text

Text analysis is the core of much of what DH does. In order to do works of distant reading, we must know these tools to statistically break down the word use within a body of text. In Mathematica, we have so far learned the process of breaking a piece of plain text down into word frequency lists and word bags, and comparing them with other documents.

So where have I begun to use this? The first playing around I did with this involved two important documents surrounding my object of study: the 9/11 Commission Report and the September 11th attacks Wikipedia page.

First I had to import these into Mathematica, both using the simple command Import["URL", "Plaintext"]. This command took both an HTML file and a PDF, and extracted the plain text from them. Excellent.

When I broke down word frequencies, I found that a casual reader could get the gist of what the documents were about at a glance. That's alright for just two documents, but fascinating when you can then extrapolate it over a large corpus of texts. Finding the quick meaning of a document without reading it opens up a lot of tools.

One example would be topic modeling. I am preparing a talk on it for April, so I am figuring out how to get topic modeling to work in Mathematica. The first step uses the bag-of-words function we learned in class; the next involves a Bayesian analysis that I am a little less sure about.

The comparison of texts also has a lot of uses. Comparing the commission report to the Wikipedia article revealed the much more technical nature of the report: much more had to do with money and operations, while Wikipedia focused more on impacts and consequences. All done without reading either text.
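That kind of comparison can be sketched as a difference in relative word frequencies between the two documents. In Python, with invented snippets standing in for the two real texts:

```python
from collections import Counter

def freqs(text):
    """Relative word frequencies of a text."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

# Invented snippets standing in for the two real documents
report = "funding operations funding budget operations agencies budget funding"
wiki = "memorial attacks memorial victims attacks response memorial attacks"

a, b = freqs(report), freqs(wiki)
words = set(a) | set(b)

# Sort so that words most over-represented in the report come first
# (alphabetical tiebreak keeps the order deterministic)
distinctive = sorted(words, key=lambda w: (b.get(w, 0) - a.get(w, 0), w))
print(distinctive[:3])
```

Reversing the sort gives the words that characterize the Wikipedia page instead, which is essentially the money-and-operations versus impacts-and-consequences contrast above.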

Next Steps:

I have a few ways I can progress with this text analysis in the future. With Mathematica's visualization tools, I could do a comparison of word usage between years in the Internet Archive's news database. The database has all the closed-captioning text, and by finding the 9/11 anniversary tributes between 2009 and 2015, we can see how the memory of the attacks changed over that six-year period.

Another way to do something much more detailed is to use the Wikipedia editing data. Wikipedia not only has its articles; every article contains a complete history of its own edits. This means you can view any Wikipedia article at any point in its history. I could use this to build language charts of the article, finding when interesting keywords arrive, and how the language changes as 9/11 goes from an event into memory, and then history.