But where are the cartoons in the library’s digital version of the New Yorker? Or, on encoding stray bits of books.
Posted: July 20th, 2015 | Author: Paige Morgan
(a follow-up on this post)
I used to have a print subscription to the New Yorker, which I gave up because I never had time to read it, and I thought that I could just read the (free/tuition-paid) version of the magazine that my library had through ProQuest. To my sorrow, though, I discovered that the library subscription left off a good 1/3 of the magazine: the Shouts & Murmurs column, the Letters, the Talk of the Town, the poetry, and of course, the cartoons. Apparently, ProQuest’s metadata categories just didn’t include such things. (Looking at my current institutional library’s New Yorker subscription, they still don’t.) Today’s post is about a similar problem.
Goal: People should be able to find the prices listed in Visible Prices in the books where they’re printed, if they can access the book. This involves two areas of bibliographic information: one area pertains to the book itself, the other area pertains to the specific price. Let’s call these two areas book metadata and price metadata.
Example of book and price metadata together: This collected edition of The Lancet contains many issues, and many prices (for razors, tooth powder, etc.). The entries for these prices share some bibliographic info (each price is contained in the collected edition of The Lancet, digitized by Google Books, located at this URL, etc.), while each entry also needs price metadata of its own.
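A minimal sketch of that split, in Python. The class names, field names, and the URL are illustrative assumptions of mine, not the actual Visible Prices schema; the point is only that many price entries can point at one shared book record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookMetadata:
    """Information shared by every price found in one volume."""
    title: str
    digitized_by: str
    url: str


@dataclass(frozen=True)
class PriceMetadata:
    """Information specific to a single price listing."""
    item: str
    book: BookMetadata  # shared across entries, not copied


# Placeholder values, for illustration only.
lancet = BookMetadata(
    title="The Lancet (collected edition)",
    digitized_by="Google Books",
    url="https://example.org/lancet-collected",
)

prices = [
    PriceMetadata(item="razor", book=lancet),
    PriceMetadata(item="tooth powder", book=lancet),
]

# Every entry points at the same book record.
shared = all(p.book is lancet for p in prices)
```

Anything that differs from one price to the next (where in the volume it sits, for instance) would belong on the price record rather than the book record.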
Observation: There are existing ontologies that deal with much of the bibliographic information pertaining to any book metadata. Dublin Core Simple is probably the simplest encoding method; Bibframe is probably the most complex.
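To give a sense of how little Simple Dublin Core asks for, here is a rough sketch that builds a record with Python’s standard library. The element names (title, type, language, identifier) and the namespace URI come from Dublin Core itself; the wrapping `record` element, the helper name `dublin_core_record`, and the example values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen Simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> ET.Element:
    """Wrap Dublin Core name/value pairs in a hypothetical <record> element."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Placeholder values, for illustration only.
record = dublin_core_record({
    "title": "The Lancet",
    "type": "Text",
    "language": "en",
    "identifier": "https://example.org/lancet-collected",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Note that nothing in this record says where inside the volume a given price appears; it describes the book and stops there.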
Observation: Price metadata (or really, metadata for any specific feature found *inside* a book) has been less explored. The TEI Guidelines are probably the most advanced body of work in this area (specifically, section 10, Manuscript Description, which includes vocabulary for describing page layout). However, TEI structural markup is meant to allow document structures to be studied, more than it’s meant to allow users to locate a particular piece of information. There are also taxonomies within linked open data: the swpo ontology, for example, specifies that books contain chapters, and journals contain articles, but it’s not set up to deal with less traditional parts of books.
Question: Where are the places that prices show up?
- In prose — essays, chapters, newspaper and magazine articles, reports, letters.
- In advertisements.
- On title pages of books and newspapers.
The prose category is gigantic, but so far relatively manageable, since there’s a long tradition of essay and article-like objects being considered the important part of books and magazines. Being important means that they have page numbers, and often listings in a table of contents.
Advertisements are more of a challenge. In some periodicals, they’re in separate sections without page numbers. In other periodicals, they’re in separate sections that have their own page numbers, e.g., The Law Times jumps from advertisement sections paginated 39-40 to primary journal sections paginated 213-228. Sometimes these sections have their own title (e.g., “The Universal Advertising Sheet”), but other times, they don’t. Advertisements are listed in columns, and some documents have two columns, while others have three. I’ve been looking for a text or newspaper that has four columns, and haven’t run into any among 18th- and 19th-century British texts, but I’m sure they exist. Some periodicals have advertisement sections at the front and back; others have advertisements only at the front or the back, but not both. Google Books assigns page numbers to books that it digitizes, sometimes staying true to the pagination offered by the source, and other times providing its own pagination.
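To make the moving parts concrete, here is a sketch of the variables a location would have to juggle. It is not a proposed vocabulary; the class name `PageLocator` and all of its fields are hypothetical, chosen only to mirror the cases described above (titled or untitled sections, printed versus digitizer-assigned page numbers, columns).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageLocator:
    section_title: Optional[str]  # e.g. "The Universal Advertising Sheet", or None
    printed_page: Optional[str]   # page number as printed; None if unpaginated
    scan_page: Optional[str]      # page number assigned by the digitizer, if any
    column: Optional[int] = None  # 1-based column on the page

    def describe(self) -> str:
        """Build a human-readable location from whichever fields are known."""
        parts = []
        if self.section_title:
            parts.append(f'section "{self.section_title}"')
        if self.printed_page:
            parts.append(f"printed p. {self.printed_page}")
        elif self.scan_page:
            # Fall back to the digitizer's numbering for unpaginated sections.
            parts.append(f"scan p. {self.scan_page} (no printed page number)")
        if self.column is not None:
            parts.append(f"column {self.column}")
        return ", ".join(parts) if parts else "location not recorded"


paginated = PageLocator(None, "39", None, column=2)
unpaginated = PageLocator("The Universal Advertising Sheet", None, "412", column=3)
```

Even this small sketch needs three optional fields and a fallback rule, which hints at how quickly a fuller description of advertisement sections would grow.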
Challenge: What’s the best method for creating/encoding price metadata that is intelligible, given the complexity of the primary source material (and in some cases, the additional complexity imposed by digital instantiations)?
Solutions:
- Just include the URL, and let people search, and don’t worry about other minute particulars of price metadata. If I only wanted to include books that had been digitized, then in some ways this would be simpler: I would be content to link to the book (or provide its metadata for books behind paywalls), and would feel relatively confident that people would be able to locate the book, and then the price. At the moment, that is a workable temporary solution, since my test data set tends to come from digitized books.
- Pro: Quick! Easy!
- Con: Ineffective in the long run, because I fully expect to be including prices from books that haven’t yet been digitized.
- A comments field, where people can include helpful info about where to find the specific price listing.
- Pro: Highly flexible for the idiosyncratic and often eccentric arrangements of anthologies; doesn’t require developing a vocabulary or ontology specifically for describing the internal structures of books, magazines, newspapers, and books which contain magazines and/or newspapers. Easier to implement than solution #3, below.
- Con: More potential for reader-introduced confusion; requires extra effort to train potential users to produce comment content; possible extra complexity if situations arise where both a price and a book have complex enough features to require two comment fields.
- Develop an ontology and controlled vocabulary specifically for describing the internal structure of books, magazines, etc.; and encode prices using that.
- Pro: Other digital humanists might find this very useful for their own linked open data projects.
- Con: The range of practices for including, organizing, and/or paginating advertisements is so complex that the resulting taxonomy might be abstruse and all but incomprehensible to non-experts. At this point in time, I don’t think that there would be a big enough audience to work together to contribute to and develop such a vocabulary. I have expert knowledge of one part of book structures (i.e., how advertisements work), but I don’t want to put VP aside in order to try to gain knowledge of how to describe other aspects of internal structure. Also, I see ontologies and vocabularies serving as good solutions when they’re describing a fairly orderly set of choices; in contrast, I interpret the phenomena I see in terms of internal book structure as often slapdash, chaotic, bass-ackwards: organized by the capricious whims of individuals, rather than by any set standards.
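As a rough sketch of how the first two solutions could sit side by side in one record: the URL is optional (it only exists for digitized books), and the comment is free text. The names `PriceEntry` and `finding_hint` and the sample values are hypothetical, not part of Visible Prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriceEntry:
    item: str
    source_title: str
    url: Optional[str] = None      # solution 1: only for digitized books
    comment: Optional[str] = None  # solution 2: free-text note on where to look


def finding_hint(entry: PriceEntry) -> str:
    """Combine whatever location help the entry carries into one line."""
    hints = [f"{entry.item}, in {entry.source_title}"]
    if entry.url:
        hints.append(f"online at {entry.url}")
    if entry.comment:
        hints.append(f"note: {entry.comment}")
    return "; ".join(hints)


# An undigitized source: no URL, so the comment does all the work.
entry = PriceEntry(
    item="razor",
    source_title="The Lancet",
    comment="advertisement section at the back, second column",
)
hint = finding_hint(entry)
```

The comment field carries no structure at all, which is both its flexibility and its risk: nothing stops two contributors from describing the same page in incompatible ways.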
I’ve toyed with option #3; indeed, I have been very tempted by it, tempted enough to spend several hours looking at books that include advertisements. For the moment, though, my answer to the question “What Would Sir Tim Berners-Lee Do?” is to adhere to the Principle of Least Power. For now, a comments field is the better option, and it can be used to gather data that will be useful in iterating further, including, perhaps, toward some sort of controlled vocabulary or taxonomy. However, the question of which content is “important,” and which content is just insignificant wrapping, seems likely to recur in DH projects and linked open data, so the question of how to describe internal book organization and structure is far from settled.