Linked open data and DH: the shiny side

I had a lengthy conversation about linked open data today with one of the Sherman Centre’s graduate students, and it reminded me of some of the questions a few of my fellow CLIR postdocs asked at CNI, where, as I’ve noted elsewhere, there was a lot of excitement about RDF/linked open data/semantic web stuff.

Then, later this evening, I ran across the DPLA’s announcement about Heidrun and Krikri (I recommend watching the #code4lib presentation linked at the bottom for a more extensive description of the project). It might be a pretty exciting development, but to understand why, let me run through the standard points I think about whenever someone asks me “is RDF going to be the next big thing in digital humanities?”

RDF is great for encoding heterogeneous data, and humanities subjects have a lot of heterogeneous data.

By heterogeneous data, I mean data that doesn’t fit neatly in the rows and columns associated with traditional relational databases, like those built with MySQL.

My perennial example is that I thought encoding prices in MySQL would be easy — after all, every object I wanted to include would have a price, right? But no: some objects have standard prices, like 5 shillings, and other objects have prices like “6 pence plus beer.” Or potatoes. I use that example a lot because it makes people laugh when I explain that MySQL had a hard time encoding “plus beer,” but the truth is that encoding prices drawn from many texts means there is tremendous variety in what sort of information I actually have. For example, take this excerpt of text, re: prices for cinnamon, vs. this price for a bottle of currant wine. The cinnamon prices come with quantities and units, and it matters whether each price is for pure or impure cinnamon. The bottle of currant wine has only an approximate price — “a couple of shillings or so.” Triples are a much better structure for dealing with this sort of variability than endless stacked tables, or tables full of null values.
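To make that concrete, here is a minimal sketch of how those three prices might look as triples, written with Python’s rdflib. The namespace, property names, and item URIs are invented purely for illustration; they aren’t part of any existing vocabulary or of our project’s actual data model.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

# Hypothetical namespace, class, and property names -- invented for illustration
EX = Namespace("http://example.org/prices/")

g = Graph()
g.bind("ex", EX)

# A straightforward price: 5 shillings
spade = EX["item/spade"]
g.add((spade, RDF.type, EX.TradedItem))
g.add((spade, EX.priceShillings, Literal(5, datatype=XSD.integer)))

# A messier price: "6 pence plus beer" -- the in-kind part simply becomes
# another triple instead of an extra column full of nulls
labour = EX["item/day-labour"]
g.add((labour, RDF.type, EX.TradedItem))
g.add((labour, EX.pricePence, Literal(6, datatype=XSD.integer)))
g.add((labour, EX.priceInKind, Literal("beer")))

# An approximate price: "a couple of shillings or so"
wine = EX["item/currant-wine"]
g.add((wine, RDF.type, EX.TradedItem))
g.add((wine, EX.approxPriceShillings, Literal(2, datatype=XSD.integer)))
g.add((wine, RDFS.comment, Literal("a couple of shillings or so")))

print(g.serialize(format="turtle"))
```

Each item gets only the statements that actually apply to it, which is exactly the flexibility the relational version was missing.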

And there are plenty of humanities topics where this sort of flexibility might come in handy.

RDF seems to offer a foundation for widespread collaboration across digital humanities projects.

If you’re a digital humanities person working with RDF, there’s a decent chance that you’re thinking about building your own ontology — a vocabulary designed specifically to allow queries about your information. You might also be planning to mint URIs — identifiers that serve as hubs of information about … well, about any particular thing. Here’s one for John Lennon, at dbpedia (a structured-data extraction of Wikipedia). Notice that the page isn’t a biography of John Lennon in the traditional sense — it’s more like an anchor point for a bunch of structured data, and clicking on many of the links (like Rickenbacker 325) will get you to other URIs. You can get lost in dbpedia, much as you can in Wikipedia! It’s fun.
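If you’d rather poke at that structured data programmatically than click around, here’s a small sketch that asks dbpedia’s public SPARQL endpoint for a sample of the property/value pairs attached to the John Lennon URI, using the Python SPARQLWrapper library. Treat it as illustrative only: the query and the LIMIT are arbitrary, and the endpoint’s exact contents shift as Wikipedia does.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# dbpedia's public SPARQL endpoint
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# Ask for a sample of the property/value pairs hanging off the John Lennon URI
sparql.setQuery("""
    SELECT ?property ?value
    WHERE {
        <http://dbpedia.org/resource/John_Lennon> ?property ?value .
    }
    LIMIT 25
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["property"]["value"], "->", row["value"]["value"])
```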

What this sort of set-up means is that it might well make sense for people to develop ontologies and URIs for a particular area. Regency romance tropes? 18th century genres? Jon and I anticipate putting together terms and resources for dealing with pre-1970s currency, and assuming we succeed, I can imagine several instances in which other researchers and other projects might build on our dataset. That would make me really happy, both because it would make VP more sustainable and because it would mean the effort that’s gone into it was useful. And that sort of widespread collaboration feels like it might be the fulfillment of a lot of the enthusiasm that’s driven DH so far.
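To give a flavour of what “minting terms for a domain” might look like, here is a purely hypothetical sketch of a tiny pre-decimal currency vocabulary, again in rdflib. The namespace, class, and property names are made up for this post and are not the terms Jon and I might actually publish.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

# Invented namespace -- not a real, published vocabulary
CUR = Namespace("http://example.org/predecimal-currency#")

g = Graph()
g.bind("cur", CUR)

# A class for currency units and a property giving each unit's value in pence
g.add((CUR.CurrencyUnit, RDF.type, OWL.Class))
g.add((CUR.valueInPence, RDF.type, OWL.DatatypeProperty))

# Mint URIs for pound, shilling, and penny (240, 12, and 1 pence respectively)
for name, label, pence in [("pound", "pound sterling", 240),
                           ("shilling", "shilling", 12),
                           ("penny", "penny", 1)]:
    unit = CUR[name]
    g.add((unit, RDF.type, CUR.CurrencyUnit))
    g.add((unit, RDFS.label, Literal(label, lang="en")))
    g.add((unit, CUR.valueInPence, Literal(pence, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```

Once terms like these have stable URIs, another project’s data can point at them directly, which is the kind of reuse I’m hoping for.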

And actually, I think that’s a good point to end on for tonight. I’ll come back tomorrow to discuss the messy side, and how Heidrun might affect the utility of LOD for digital scholars.


