Linked open data and DH: the messy side

Yesterday evening, I wrote about the optimistic side of RDF — the things that make me excited, and that led me to see it as the right platform for Visible Prices. This morning, I want to address the not-so-great aspects of it — though I’m hopeful that they’re changing.

There’s a lot of discussion about linked open data in libraries — but libraries and DH researchers are trying to do different things with RDF, and thus face different challenges.

Libraries working to add linked open data functionality to their catalogs and repositories are trying to get into what I think of as the semantic web nightclub, where other institutions are pointing at your library’s data, and your library is pointing at their data (think 4 & 5 star data, as described here, or imagine having your library’s data integrated into URIs like the John Lennon and Rickenbacker 325 pages from yesterday’s post).

For libraries, as I understand it, the primary challenges are the accuracy of their cataloging data and the conversion of non-RDF data into an RDF format (which may or may not be easy, depending on how their data is structured). There are also questions about best practices for developing metadata for scholarly material in new formats (digital dissertations, partly-digital dissertations, etc.). (Librarian friends, feel free to jump in and correct me if I’m in error on any of this.)

I certainly don’t want to minimize or trivialize the work involved in linked open data for libraries, because it isn’t easy, especially given the quantity of records that libraries have to work with.

However, libraries face a different set of problems from the DH researcher who’s interested in building a project/dataset using RDF because the structure is so excellent for humanities data. Library records already include very similar information, so getting up to speed with linked open data is partly about making sure that everyone is using the same data structure, especially in the instances where not using the same data structure would result in bad info. Now, getting many people from multiple libraries to participate in the cycle of discerning the right choice, communicating it to teams, implementing it into each library’s systems: that’s serious work.

In contrast, the DH/DS researcher, while they probably deal with bibliographic information (and can thus make use of existing formats for encoding bibliographic info in RDF), is much more likely to be trying to develop a data structure for encoding information that has not been previously structured: for example, objects listed for sale in texts! Or the metal content in ancient pottery. Or 18th- and 19th-century technological objects, including their construction, purpose, etc.

The DH/DS researcher, then, is in a position where they need to figure out what work has been previously done, i.e., whether ontologies, predicates, or controlled vocabularies have already been developed that could be reused for their project, because the ethos of linked open data is that when possible, you make use of existing data structures. Doing so is what makes the great promise of the semantic web work: your data can be integrated and found in search queries because it has the predicates that allow it to fit into the graph. If you’ve worked with TEI before, then you understand the importance of making sure that your encoding method is compliant with the current guidelines, and you know where to find them, and you probably know how to get onto the TEI-L mailing list to ask questions about the usage you’re making, or trying to make.

But.

Imagine if the TEI guidelines, rather than being located at that one page, were scattered in fragments all over the web, being developed by people who might or might not be talking with each other in order to make decisions, because one is working with e-commerce, and another is working with paleontology, and why would they? They’re each trying to get something done, rather than get wrapped up in theory, or the quest for The One True Data Structure.

In the context of yesterday’s conversation, I went looking for RDF related to metallurgy, and quickly found this class developed for the purposes of e-commerce. It includes some terms (gr:weight, gr:condition) that might be applicable to structuring data about the metallurgical content of ancient pottery. Maybe. The question is whether those terms are being used in a way that’s compatible with my graduate student’s ancient pottery project. Could they be integrated into his data the way that various terms are integrated into the dbpedia John Lennon page? The short answer is that I would keep on looking for RDF developed for something closer to ancient metallurgy, rather than just yoinking e-commerce vocab in. Then again, weight is weight, and GoodRelations (the ontology with the gr: prefix) is an established ontology. Why overcomplicate things? These are the sorts of questions that an individual researcher has to wrestle with frequently. There are ontologies out there for lots of things. Scrolling down the dbpedia page, I noticed a prefix that I don’t remember seeing before, yago:, which sorts entities into categories of all sorts, including “Assassinated English People.”
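To make the "weight is weight" temptation concrete, here is a minimal sketch of what borrowing gr:weight for a potsherd might look like, written as plain Python that prints N-Triples-style statements. The example.org namespace, the sherd identifier, and the 18.5 gram figure are all made up for illustration. The use of gr:hasValue and gr:hasUnitOfMeasurement (with "GRM" as the unit code for grams) reflects my reading of the GoodRelations documentation, so treat it as an assumption to check against the spec rather than as the correct pattern.

```python
# GR is the published GoodRelations namespace. EX is a hypothetical project
# namespace invented for this illustration only.
GR = "http://purl.org/goodrelations/v1#"
EX = "http://example.org/pottery/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def uri(namespace, local_name):
    """Wrap a namespace + local name in angle brackets, N-Triples style."""
    return f"<{namespace}{local_name}>"


def literal(value, datatype):
    """Build a typed literal such as "18.5"^^<...#float>."""
    return f'"{value}"^^<{XSD}{datatype}>'


def describe_weight(subject, grams):
    """Return three triples stating an object's weight with borrowed gr: terms.

    Assumption: a weight is attached through an intermediate value node that
    carries the number and the unit code ("GRM" for grams).
    """
    value_node = f"_:weight_of_{subject}"
    return [
        (uri(EX, subject), uri(GR, "weight"), value_node),
        (value_node, uri(GR, "hasValue"), literal(grams, "float")),
        (value_node, uri(GR, "hasUnitOfMeasurement"), literal("GRM", "string")),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = describe_weight("sherd42", 18.5)
print(to_ntriples(triples))
```

The statements themselves are easy to write. The harder part is what they imply: as far as I can tell, GoodRelations defines gr:weight as a property of products and services, so a reasoner could reasonably conclude that the sherd is something being offered for sale. Whether that matters for a given project is exactly the judgment call described above.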

The recurring questions that you face, if you’re developing a linked open data database, are:

  • does vocabulary for X topic already exist?
  • if vocabulary sort of appropriate for X topic already exists, should I reuse it?
  • or should I be trying to create my own ontology?

These aren’t simple questions, and my experience is that you get better at answering them the more you know about RDF/OWL/linked open data, but there aren’t particularly easy tests that you can use right when you’re starting out. That makes the learning curve pretty steep for people who are working on building careers in DH. The 5 Star Open Data site lists, as a cost, under the 4 star standard:

⚠ You need to either find existing patterns to reuse or create your own.

That one little sentence can be extraordinarily misleading about the time and labour involved.
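One small, mechanical piece of that labour is simply keeping track of which of your predicates are borrowed and which are homemade. Here is a rough sketch, in plain Python, of tallying a dataset's predicates by vocabulary. The two namespaces in SHARED_VOCABULARIES are the published GoodRelations and Dublin Core terms namespaces. The example.org URIs, the tinContent predicate, and the data values are hypothetical placeholders, not a real ontology or real measurements.

```python
from collections import Counter

# Vocabularies maintained by other people. Extend this with whatever
# ontologies a project actually reuses.
SHARED_VOCABULARIES = {
    "gr": "http://purl.org/goodrelations/v1#",
    "dcterms": "http://purl.org/dc/terms/",
}


def vocabulary_of(predicate):
    """Return the prefix of the shared vocabulary a predicate uses, else 'local'."""
    for prefix, namespace in SHARED_VOCABULARIES.items():
        if predicate.startswith(namespace):
            return prefix
    return "local"


def audit(triples):
    """Count how many statements draw on each vocabulary."""
    return Counter(vocabulary_of(predicate) for _, predicate, _ in triples)


# Hypothetical data for illustration only.
triples = [
    ("http://example.org/pottery/sherd42",
     "http://purl.org/goodrelations/v1#weight", "_:w1"),
    ("http://example.org/pottery/sherd42",
     "http://purl.org/dc/terms/title", "Sherd 42"),
    ("http://example.org/pottery/sherd42",
     "http://example.org/pottery/tinContent", "0.07"),
]

report = audit(triples)
for name, count in sorted(report.items()):
    print(f"{name}: {count}")
```

A tally like this doesn't answer whether a borrowed term is the right fit, or whether a homemade one duplicates something that already exists. It only shows where those decisions have been made, which is a starting point for revisiting them.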

This post is getting long, so I suspect I’ll come back and talk about the messiness of linked open data platforms, and DPLA’s Heidrun on Monday — but I think the questions and scenarios I’ve discussed above provide you with some context about why RDF can be uniquely challenging for an individual researcher, as opposed to a library.

ETA: I realize that I never addressed why I think the situation might be improving! But I will get to that, and soon, too.

 


