My recent work has involved trying to manage large pieces of text (about eleven texts of 3-4 million words each) from my cognate project. To improve on that work, I will need to get drastically better at cleaning up text. At the moment the corpus, which is a collection of New York Times articles that mention September 11th, has a lot of metadata at the top of each article that the database added on.
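
As a first pass at the cleanup, here is a minimal sketch in Mathematica. It assumes the metadata block is separated from the article body by the first blank line; that delimiter and the field names in the sample are placeholders, not the database's actual export format.

```mathematica
(* Hypothetical example: drop everything above the first blank line,
   on the assumption that the database's metadata block ends there. *)
stripMetadata[article_String] :=
  StringTrim@Last@StringSplit[article, RegularExpression["\\n[ \\t]*\\n"], 2];

(* Placeholder article text with made-up metadata fields. *)
sample = "HEADLINE: Example headline\nDATE: September 12, 2001\n\nThe body of the article begins here.";

stripMetadata[sample]
(* -> "The body of the article begins here." *)
```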

I believe there is a new way to structure the data that might also help me do some cool collocations within the corpus. My dream is to find when new words enter the lexicon of the 9/11 attacks, to begin periodizing at the scale of the paper, and to discern those interesting connections I might have missed.
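
To make the "new words" part concrete, here is a rough sketch, again in Mathematica and again built on assumptions: it supposes the burst articles live one per file in an articles/ folder, with the publication date at the start of each file name (e.g. 2001-09-12_0001.txt, a naming scheme I'm inventing for the example; in practice that date would have to be pulled out of each article's metadata), and records the first date each word appears.

```mathematica
(* Rough sketch: record the earliest date each word shows up, assuming
   one article per file, named "YYYY-MM-DD_NNNN.txt" (a hypothetical scheme). *)
files = Sort@FileNames["*.txt", "articles"];

firstSeen = <||>;
Do[
  date = StringTake[FileNameTake[file], 10];  (* e.g. "2001-09-12" *)
  words = DeleteDuplicates@TextWords@ToLowerCase@Import[file, "Text"];
  Do[If[! KeyExistsQ[firstSeen, w], firstSeen[w] = date], {w, words}],
  {file, files}];

(* Words whose first appearance falls outside September 2001. *)
Select[firstSeen, StringTake[#, 7] =!= "2001-09" &]
```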

Solutions? Bursting the text file into individual articles. I actually had someone do this for me in R before, but I want to learn how to do it myself. I would then need to figure out what I can do with about 5,000 short text files.
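
Here is the kind of thing I have in mind for the bursting step, sketched in Mathematica rather than R. It assumes the whole corpus sits in one plain-text file and that each article begins with a header line like "Document 37 of 5000"; the file name and that delimiter pattern are guesses that would need to be matched to whatever the database export actually looks like.

```mathematica
(* Sketch of bursting one big export file into per-article files.
   The corpus file name and the "Document N of M" delimiter are assumptions. *)
corpus = Import["nyt_sept11_corpus.txt", "Text"];

articles = Select[
   StringTrim /@
    StringSplit[corpus, RegularExpression["(?m)^Document \\d+ of \\d+.*$"]],
   StringLength[#] > 0 &];

(* Write each article to its own zero-padded file in articles/. *)
If[! DirectoryQ["articles"], CreateDirectory["articles"]];
Do[
  Export[
   FileNameJoin[{"articles", "article_" <> IntegerString[i, 10, 4] <> ".txt"}],
   articles[[i]], "Text"],
  {i, Length[articles]}]
```

Something like StringSplit keyed to whatever line reliably separates records should get me from one giant file to roughly 5,000 small files I can loop over.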

I think I am getting to a point where Mathematica is letting me do some really cool stuff, but there are a few tools I still need to work on. More as it develops!
