My project this week was to try some of the methods we learned in class on a larger data set. My object of study du jour is the 9/11 Commission Report. I got some interesting data, but I think there is a piece missing. The first thing to mention is an error of some kind, either in the OCR of the PDF I converted or in the PDF itself. I have a lot of capitalized words beginning with strange accented letters, and I am not yet sure how to clean up data like this, especially in lists. On top of that, the word list is huge. My next mission is to find out how to make sense of a large word list, or at least how to analyse it in some useful way.
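
One way the cleanup could go, sketched in Python rather than in whatever tool produced the list: drop any token containing a non-ASCII character, drop very short words and common stopwords, and count what is left so the huge list can be read from the top down. The function names, the `min_length` cutoff, and the tiny `STOPWORDS` set are all made up for illustration, and the non-ASCII rule is only an assumption that accented letters in this English-language report are OCR debris. It would also throw away legitimately accented names.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "that", "was", "on", "for"}


def tokenize(text):
    """Split text into word tokens (runs of letters/digits, optional apostrophe part)."""
    return re.findall(r"[^\W_]+(?:'[^\W_]+)?", text)


def looks_like_ocr_noise(token):
    """Flag tokens containing any non-ASCII character.

    Assumption: accented letters in this report are mostly OCR artifacts.
    Genuine accented names would be discarded as well.
    """
    return not token.isascii()


def word_frequencies(text, min_length=3):
    """Count lowercased words, skipping suspected noise, short words, and stopwords."""
    counts = Counter()
    for token in tokenize(text):
        if looks_like_ocr_noise(token):
            continue
        word = token.lower()
        if len(word) < min_length or word in STOPWORDS:
            continue
        counts[word] += 1
    return counts
```

Calling `word_frequencies(report_text).most_common(50)` (with `report_text` standing in for the converted report) would then give a short ranked list to read instead of the whole vocabulary.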

I had much more success with collocation. On top of expected bigrams like Clinton administration, I found new names that I can use to build up my research. Here are the results:

{{"PM", "Page"}, {"report", "interrogation"}, {"FBI", 
  "report"}, {"New", "York"}, {"report", 
  "investigation"}, {"investigation", "interview"}, {"Saudi", 
  "Arabia"}, {"reports", "interrogations"}, {"President", 
  "Clinton"}, {"September", "11"}, {"President", "Bush"}, {"analytic",
   "report"}, {"national", "security"}, {"electronic", 
  "communication"}, {"NSC", "memo"}, {"intelligence", 
  "community"}, {"briefing", "materials"}, {"law", 
  "enforcement"}, {"American", "11"}, {"muscle", 
  "hijackers"}, {"interview", "June"}, {"interview", 
  "transcript"}, {"audio", "file"}, {"covert", "action"}, {"NSC", 
  "email"}, {"fire", "safety"}, {"CIA", "briefing"}, {"letterhead", 
  "memorandum"}, {"Al", "Qaeda"}, {"NSC", "staff"}, {"air", 
  "traffic"}, {"impact", "zone"}, {"summary", "re"}, {"repeater", 
  "channel"}, {"planes", "operation"}, {"Ladin", "unit"}, {"national",
   "intelligence"}, {"New", "Jersey"}, {"investigative", 
  "summary"}, {"intelligence", "agencies"}, {"FAA", 
  "report"}, {"flight", "training"}, {"interrogation", 
  "detainee"}, {"command", "post"}, {"embassy", "bombings"}, {"upper",
   "floors"}, {"FAA", "headquarters"}, {"found", 
  "evidence"}, {"terrorist", "attacks"}, {"East", "Africa"}, {"Rice", 
  "meeting"}, {"Qaeda", "operatives"}, {"Manila", 
  "air"}, {"interview", "10"}, {"CIA", "cable"}, {"investigation", 
  "interviews"}, {"reports", "investigation"}, {"FBI", 
  "reports"}, {"primary", "radar"}, {"civil", 
  "aviation"}, {"interview", "13"}, {"Clinton", 
  "administration"}, {"field", "offices"}, {"FBI", "agent"}, {"Qaeda",
   "members"}, {"secretary", "defense"}, {"FBI", 
  "letterhead"}, {"weapons", "mass"}, {"mass", 
  "destruction"}, {"traffic", "control"}, {"CIA", 
  "memo"}, {"training", "camps"}, {"Saudi", 
  "government"}, {"interview", "15"}, {"memorandum", 
  "investigation"}, {"Clinton", "meeting"}, {"radio", 
  "channels"}, {"Bush", "administration"}, {"flight", 
  "attendants"}, {"talking", "points"}, {"evacuation", 
  "order"}, {"telephone", "call"}, {"Saudi", "nationals"}, {"King", 
  "Fahd"}, {"information", "sharing"}, {"CIA", "FBI"}}

What sticks out? The term muscle hijackers is an important one. It turns out to be the working term for the 9/11 hijackers who were responsible for subduing the passengers and crew and were not trained to fly airplanes. King Fahd is a proper name rather than a title: Fahd was the king of Saudi Arabia at the time. Look, too, at how many bigrams are instantly recognizable as part of a mutually agreed upon 9/11 lexicon: Al-Qaeda, the Bin Laden unit, training camps, weapons of mass destruction. Saudi Arabia is mentioned often, as are many of the government organizations that the commission investigated. Collocation can surface some very interesting data.
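
For anyone curious how a bigram list like the one above can be produced, here is a minimal sketch in Python. It is not the method used for the results above, only one plausible approach: remove stopwords, count adjacent word pairs, and rank them by count, with an approximate pointwise mutual information (PMI) score as a tie-breaker. Pairs such as weapons/mass and mass/destruction in the list suggest stopwords were dropped before pairing, which is why the sketch does the same. The names `content_words` and `top_bigrams`, the `min_count` threshold, and the small `STOPWORDS` set are all hypothetical, and sentence boundaries are ignored for simplicity.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "that", "was", "on", "for"}


def content_words(text):
    """Return ASCII word tokens with stopwords removed (original case kept)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if w.lower() not in STOPWORDS]


def top_bigrams(text, n=10, min_count=2):
    """Rank adjacent word pairs by count, breaking ties by approximate PMI."""
    words = content_words(text)
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)

    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue
        # Approximate PMI: how much more often the pair occurs than it would
        # if the two words were independent of each other.
        pmi = math.log2(
            (count * total) / (unigram_counts[first] * unigram_counts[second])
        )
        scored.append(((first, second), count, pmi))

    scored.sort(key=lambda item: (-item[1], -item[2]))
    return scored[:n]
```

Raising `min_count` trims one-off pairs, which matters on a text this size. Ranking by PMI alone instead of raw count tends to favor rare pairs, so combining the two as above is a judgment call rather than a rule.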

Looking forward to tomorrow!