Adventures in bibliographic encoding
Posted: May 24, 2015 | Author: Paige Morgan
If you click this link, click “Run”, wait a couple of seconds, and scroll down, you can see a Very Basic SPARQL Query of Visible Prices that returns two entries.
The query was written by my collaborator, Jon Crump — all I did was move it out of a Python script, and into the little repository I have over at Dydra.com. Small as it is, I’m still pleased, because for the last week-and-a-half, I’ve had time to work more steadily on VP than I did while the semester was still in session.
Working on VP, in this case, means dusting off my Python skills to work with rdflib; boning up on SPARQL, which I started learning at DHOXSS but didn’t use steadily, and so lost; and, most of all, wading through pages and pages of info and commentary on bibliographic encoding systems for linked open data, and thinking about what VP actually needs to encode and what it doesn’t. I’m very lucky to have Jon tackling the development of a basic encoding system and schema for entries (the price, the thing being sold, and their relationships). But because I want to be learning linked open data as well, and because bibliographic information is less free-form (there are more examples for me to look at), I wanted to take a whack at encoding my sources myself.
I really wasn’t sure, when I started, whether I would find that it was surprisingly simple — or much more thorny. So far, the answer is: it’s somewhere between those two extremes.
My priority is to provide users with enough information to find the snippet of text that I’m quoting in the database. Here are two cases of what I’m dealing with:
Case #1: The Lancet (1852)
The Lancet is a lovely source for all sorts of interesting prices (coffee, overcoats, shirts, straight razors, etc.), mainly because each weekly issue has between 2 and 5 pages of front and back matter, in a section called “The Lancet General Advertiser.” Because these pages are just the wrapper for the more serious content, they’re unnumbered. It doesn’t surprise me. I certainly can’t blame the Victorians for not anticipating the fact that I’d want to encode their classified ads in my project. Thus, the first issue of The Lancet in 1852 starts with page 1, and the final numbered page is page 30. The second issue starts the pagination with page 31. But between page 30 and page 31 are ten full pages: the back matter of Issue No. 1 and the front matter of Issue No. 2. (You can see this here: The Lancet, 1852, Vol. 1, digitized by Google Books)
Challenge: Describing the location of prices when the pages they appear on have no page numbers. Jon has suggested that I’ll want to include which column the ad is in, and its position in that column (e.g. 4th, 5th, 6th). I’m still pondering that.
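One way to picture that suggestion is a small record that locates an ad relative to the nearest numbered page, then by column and position. The Python sketch below is only an illustration: the class name `AdLocator`, its fields, and the sample values are all hypothetical, and none of it is part of the actual VP schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdLocator:
    """Hypothetical locator for an ad on an unnumbered advertiser page."""

    volume: int       # volume number within the year
    issue: int        # weekly issue the wrapper page belongs to
    after_page: int   # last numbered page before the unnumbered run
    leaf_offset: int  # 1 = first unnumbered page after `after_page`
    column: int       # column on that page, counted from the left
    position: int     # position of the ad within the column, from the top

    def citation(self) -> str:
        """Render a human-readable locator string."""
        return (
            f"Vol. {self.volume}, No. {self.issue}, "
            f"unnumbered p. {self.leaf_offset} after p. {self.after_page}, "
            f"col. {self.column}, ad {self.position}"
        )


# Made-up example values: an ad in the back matter that follows page 30.
ad = AdLocator(volume=1, issue=1, after_page=30, leaf_offset=2, column=1, position=4)
print(ad.citation())
```

Counting from the last numbered page keeps the locator usable in any digitized copy, since the numbered pages are the only fixed landmarks. It does assume that every copy preserves the wrapper pages in the same order, which may not hold.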
Case #2: General View of the Agriculture of …? (1794)
General View of Agriculture volumes seem to have been (I have not made an exhaustive study of the genre/series) the product of one or more wealthy gentlemen going round a particular county to gather various sorts of data from farmers and laborers, with the encouragement of the London Board of Agriculture. The result is sort of a hybrid of a survey, an almanac, and a scientific journal, with plenty of prices for labour and goods scattered throughout. (Sample volume: General View of the Agriculture of the County of Essex, digitized by Google Books).
At first, nothing seems terribly unusual — but if you scroll through the book, you’ll discover that after page 172, the pagination starts over, because contained in the same volume is the General View of the Agriculture of the County of Northampton, which, though published in the same year, has a different author and publisher. It’s followed by General Views for Worcester and Hertford, each written and published by different individuals.
Challenge: GV … Northampton has its standard bibliographic information, but that’s only the first layer. The second layer is about its location “within” GV … Essex. I’m a little concerned about people heading over to explore further and thinking that I’ve misled them, because there’s no indication that the book is actually several works bound together.
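The two layers could be modelled as one bound volume that contains several works, each keeping its own bibliographic description. The sketch below uses plain Python tuples as stand-in triples rather than a real RDF library. The `ex:` identifiers and the `ex:positionInVolume` property are made-up placeholders; `dcterms:title` and `dcterms:isPartOf` are real Dublin Core terms, though whether they are the right fit for VP is an open question. Only the two works whose full titles appear above are listed.

```python
# Stand-in triples as (subject, predicate, object) tuples; no RDF library needed.
# All "ex:" identifiers are hypothetical placeholders, not real URIs.
BOUND_VOLUME = "ex:gv-essex-bound-volume"

# Works bound together, in the order they appear in the digitized book.
works = [
    ("ex:gv-essex", "General View of the Agriculture of the County of Essex"),
    ("ex:gv-northampton", "General View of the Agriculture of the County of Northampton"),
]


def describe(works, volume):
    """Return triples linking each work to the physical volume that contains it."""
    triples = []
    for order, (work_id, title) in enumerate(works, start=1):
        triples.append((work_id, "dcterms:title", title))         # layer 1: the work itself
        triples.append((work_id, "dcterms:isPartOf", volume))     # layer 2: where it is bound
        triples.append((work_id, "ex:positionInVolume", order))   # assumed ordering property
    return triples


for subject, predicate, obj in describe(works, BOUND_VOLUME):
    print(subject, predicate, repr(obj))
```

Keeping the bound volume as its own resource means a link to the Google Books scan could hang off the volume, while page references stay attached to the individual work and its restarted pagination.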
These two cases present different challenges, but they have a common problem as well. I could treat them more like entries in an essay bibliography, and not actually worry about their provenance. That would be the simplest solution — but it goes against the fundamental ethos of linked open data, as I understand it. If I retrieved the source from an online repository, then I ought to indicate where it came from. Plus, it feels silly to be using information from books that are available from Google Books, and not providing a link to them.
This might get simpler when the ESTC transformation into a linked open data resource is complete, but we’re not there yet.