
June 17, 2006

Last Thursday was the first complete day that we dumped the news feed to disk. Turns out, we receive about 28,000 unique files per day. We were a little surprised at the sheer quantity of data, and we had to figure out how to start analyzing it. The first step, of course, was just to start opening files to get an idea of what they contained. At first we were pretty confused, but by now we have a fairly good idea. News arrives as an XML document with special NewsEdge-specific tags that contain metadata, unique identifiers, etc., all wrapped around standard NITF documents. Therefore, for the first stage of the project, we decided to focus on analyzing only the NewsEdge information, since that represents the value added by the special feed that we are subscribed to (over just subscribing to the AP and other providers individually).

Our other project for the day was to get some simple JavaScript scripts written and running on Chris’s machine that would analyze the news data. The reason Chris recommended JavaScript is that new versions of Rhino (a Java-based JS engine) support E4X (ECMAScript for XML), a recent extension to the standard that treats XML as a native object. This makes validating, constructing, and extracting information from XML documents much simpler. Plus, I think he just likes E4X. Thankfully, it wasn’t too hard to get started, especially since Chris gave us a book on JavaScript to read, and we were able to learn most of core JavaScript pretty quickly (the core being the only part that we need to use).

I’ve actually discovered that JavaScript is a pretty fun language. I haven’t used it much before, and I was really happy to learn that it’s very Scheme-like in that functions are treated as first-class objects. This means we can assign functions to variables and use them to construct new functions. You wouldn’t believe how much time and energy this one feature has ended up saving us over the last two weeks. There are a few subtleties to the scoping rules that I don’t have completely in my head yet, but I think I understand JavaScript really well. I would certainly be comfortable putting it on my resume even after only the last two weeks.
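
To make the first-class function idea concrete, here is a minimal sketch in core JavaScript. The names `trim`, `upper`, `compose`, and `makeCounter` and the sample string are invented for illustration; they are not taken from the actual analysis scripts.

```javascript
// Functions are ordinary values, so they can be stored in variables...
var trim = function (s) { return s.replace(/^\s+|\s+$/g, ""); };
var upper = function (s) { return s.toUpperCase(); };

// ...and handed to other functions that build new functions out of them.
function compose(f, g) {
  return function (x) { return f(g(x)); };
}

// normalize is a brand-new function: trim first, then upper-case.
var normalize = compose(upper, trim);

// A returned function also remembers the variables of the call that
// created it (a closure), which is where the scoping rules get subtle.
function makeCounter() {
  var count = 0;
  return function () {
    count += 1;
    return count;
  };
}

var nextId = makeCounter();
```

Here `normalize("  ap wire ")` gives `"AP WIRE"`, and each call to `nextId()` returns the next integer, starting at 1. A second counter made with `makeCounter()` would keep its own separate count.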

After work on Thursday, Matt and I just came home to eat. I had some leftover Chinese food from eating out with Chen that I still hadn’t had an opportunity to finish because Matt’s mom kept making us fresh meals. I decided to get rid of it one way or another Thursday night. Five days, though, is way too long to keep Chinese food in the refrigerator… After a few bites of the meal, I decided I couldn’t take any more. Matt offered me some of his enchiladas, which were way better…

—————————

I’m going to end here because after I saved part of this entry, our router at home stopped working and I haven’t updated for a long time (it’s working again now).
