This post's topic first came up on my 'to blog radar' when Ian Kennedy blogged about Mining the New York Times Archives a couple of weeks back, pointing out that NY Times online articles (the Times recently opened its archives to consumers) come with rich Metadata in each article: tags like bylines, the people the article is about, the companies, the regions and so on, which can be used to build applications that display and deliver content.
Dave Winer's smart little script to track NY Times article keywords, and his original post in which he basically asks for a standard in news content Metadata (well, he wants to look at the NY Times taxonomy, which he says could become a standard), prompted me again this evening to think about the value of Metadata in dealing with media content overload.
As Ian points out in his post, because he used to work with me at Factiva he knows quite well that the New York Times, along with many of the 10,000 sources we aggregate, sends us content that is rich in this type of Metadata. The Metadata they send is usually based on an established taxonomy that the Information Provider (IP) uses in its own content production process, and often it is still hand-coded by the editorial staff, although big media companies and consolidated services do also use automatic categorizers.
So before I go any further let me define one thing: Dow Jones produces content; the Factiva division is basically an Aggregator. We take in over 150,000 articles per day from thousands of news providers through our source processing workflow and apply structured Metadata to each article, which you can learn about in detail in this white paper. During that process we normalize all the content into a standard XML format and add additional Metadata to each article, as I describe below.
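Just to make the idea concrete, here is a minimal sketch of what that normalization step might look like. The element names, source code and sample article are entirely my own illustration, not Factiva's actual schema:

```python
import xml.etree.ElementTree as ET

def normalize_article(raw, source_code):
    """Wrap a raw provider article in one standard XML envelope.

    `raw` is a dict of whatever fields the provider sent after
    field-name mapping; the element names here are illustrative.
    """
    article = ET.Element("article", attrib={"source": source_code})
    for field in ("headline", "author", "date", "body"):
        el = ET.SubElement(article, field)
        el.text = raw.get(field, "")
    return ET.tostring(article, encoding="unicode")

xml = normalize_article(
    {"headline": "Markets rally", "author": "J. Doe",
     "date": "2007-09-26", "body": "Stocks rose today..."},
    source_code="NYT",
)
print(xml)
```

The point is simply that every article, whatever shape it arrived in, leaves this step looking the same to everything downstream.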
The content goes through various steps to ensure that the Metadata applied is useful downstream. There are Metadata fields like 'author', 'date' and 'source name' that we expect the IPs to send us; these are fairly straightforward, and we have mapping tables to figure out what each IP calls each field. There are also fields that are calculated automatically and cannot be disputed, for example 'wordcount' and 'language'.
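Those two steps are easy to picture in code. The sketch below assumes hypothetical per-provider mapping tables and field names; the real tables are obviously far bigger:

```python
# Hypothetical mapping tables: each Information Provider may use
# its own name for the same logical field.
FIELD_MAP = {
    "NYT": {"byline": "author", "pub_date": "date", "paper": "source_name"},
    "WSJ": {"writer": "author", "published": "date", "publication": "source_name"},
}

def map_fields(provider, raw):
    """Translate a provider's field names into the standard ones."""
    mapping = FIELD_MAP[provider]
    return {mapping.get(k, k): v for k, v in raw.items()}

def add_computed_fields(article):
    """Add fields that can be calculated directly from the text itself."""
    article["wordcount"] = len(article.get("body", "").split())
    return article

record = map_fields("NYT", {"byline": "J. Doe", "pub_date": "2007-09-26",
                            "body": "Stocks rose sharply today"})
record = add_computed_fields(record)
print(record["author"], record["wordcount"])  # J. Doe 4
```

Note that 'wordcount' never depends on what the IP claims; it is derived from the text, which is why it "cannot be disputed".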
Then there is Metadata that describes the 'aboutness' of the article. Some of it we map from what the IPs provide: for example, the NY Times might send us an article that they have tagged as being about Russia; we 'trust' the NY Times coding, so we apply our 'Russia' region tag to it.
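That 'trusted mapping' step is essentially a lookup from the IP's own tags into the aggregator's taxonomy. A tiny sketch, with made-up region codes (not Factiva's real ones):

```python
# Hypothetical mapping from one IP's tags to the aggregator's
# region codes; only tags we trust and recognize get mapped.
NYT_REGION_MAP = {
    "Russia": "RUSS",
    "United States": "USA",
}

def map_regions(ip_tags, mapping):
    """Apply the aggregator's region codes for every trusted IP tag."""
    return sorted(mapping[t] for t in ip_tags if t in mapping)

print(map_regions(["Russia", "Sports"], NYT_REGION_MAP))  # ['RUSS']
```

Tags that fall outside the mapping (like 'Sports' here, which isn't a region) simply pass through to other parts of the pipeline.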
The rest of the 'aboutness' is applied as Factiva Intelligent Indexing (FII), Factiva's core taxonomy, which covers Industry, Subject, Region and Companies (there is more detail on its application in this white paper). There are also many additional Metadata elements, such as people, brands, products, organizations and parts of speech (e.g. a quote), that can potentially be extracted from the content as well.
So as online media providers open their archives, I always get pings from friends and family about what that means for the Factiva 'business'. I always point to the value of the aggregation, normalization and Metadata additions to our content, which covers 22 languages. In addition, we provide many services to help Enterprises deliver that content to the users who need it, including licensing our taxonomy or helping clients build their own.
Winer's call for a publicly maintained taxonomy standard specifically for news is an interesting one, and one I thought about as a good use case for Freebase, perhaps, as more content is made available and more tools are built for users to create delivery mechanisms through mashups.
Afterthought: After I shut down the computer last night, I kept thinking about this. There is no reason why, if you are an ubergeek like Winer or Ian and have access to Factiva services in your enterprise, you can't build things like this, or even cooler ones. Get in touch with me if you are interested; I am looking for crazy ideas ;-)
photo attributed to denverjeffrey