A Wealth of Choices

I was thrilled to be part of the MEDEA Working Group April meeting — it’s the first time I’ve gotten to talk about Visible Prices to a whole group of people who were also working on economics-related DH! Here are the slides from that talk, with detailed notes in the notes field.


But where are the cartoons in the library’s digital version of the New Yorker? Or, on encoding stray bits of books.

(a follow-up on this post)

I used to have a print subscription to the New Yorker, which I gave up because I never had time to read it, and I thought that I could just read the (free/tuition-paid) version of the magazine that my library had through ProQuest. To my sorrow, though, I discovered that the library subscription left off a good 1/3 of the magazine: the Shouts & Murmurs column, the Letters, the Talk of the Town, the poetry, and of course, the cartoons. Apparently, ProQuest’s metadata categories just didn’t include such things. (Looking at my current institutional library’s New Yorker subscription, they still don’t.) Today’s post is about a similar problem.


Goal: People should be able to find the prices listed in Visible Prices in the books where they’re printed, if they can access the book. This involves two areas of bibliographic information: one pertains to the book itself, the other to the specific price. Let’s call these two areas book metadata and price metadata.

Example of book and price metadata together: This collected edition of The Lancet contains many issues, and many prices (for razors, tooth powder, etc.). Each of the entries for these prices shares some bibliographic info (the price is contained in the collected edition of The Lancet, digitized by Google Books, located at this URL), but each also needs metadata specific to where that particular price appears within the volume.

Observation: Existing ontologies already handle much of the bibliographic information needed for book metadata. Dublin Core Simple is probably the simplest encoding method; Bibframe is probably the most complex.
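Dublin Core Simple really is light enough to sketch in a few lines. Here is what a book-level record might look like expressed as plain subject-predicate-object triples; the entry URI and the values are hypothetical placeholders, not real VP records.

```python
# A minimal sketch of book metadata in Dublin Core Simple, as triples.
# The book URI and values are invented placeholders for illustration.

DC = "http://purl.org/dc/elements/1.1/"

book = "http://example.org/vp/book/lancet-1852-v1"  # hypothetical URI

book_metadata = [
    (book, DC + "title", "The Lancet, 1852, Vol. 1"),
    (book, DC + "date", "1852"),
    (book, DC + "type", "periodical, collected edition"),
    (book, DC + "source", "digitized by Google Books"),
]

# Every price entry drawn from this volume can point back at the same
# book record, so the book metadata is stated once rather than per price.
for subject, predicate, value in book_metadata:
    print(predicate.rsplit("/", 1)[-1], "->", value)
```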

Observation: Price metadata (or really, metadata for any specific feature found *inside* a book) has been less explored. The TEI Guidelines are probably the most advanced body of work in this area (specifically, section 10, Manuscript Description, which includes vocabulary for describing page layout). However, TEI structural markup exists to let document structures be studied, more than to help users locate a particular piece of information. There are taxonomies within linked open data (the swpo ontology specifies that books contain chapters, and journals contain articles), but they’re not set up to deal with less traditional parts of books.

Question: Where are the places that prices show up?

  • In prose — essays, chapters, newspaper and magazine articles, reports, letters.
  • In advertisements.
  • On title pages of books and newspapers.

The prose category is gigantic, but so far relatively manageable, since there’s a long tradition of essay and article-like objects being considered the important part of books and magazines. Being important means that they have page numbers, and often listings in a table of contents.

Advertisements are more of a challenge. In some periodicals, they’re in separate sections without page numbers. In other periodicals, they’re in separate sections that have their own page numbers, e.g., The Law Times jumps from advertisement sections paginated 39-40 to primary journal sections paginated 213-228. Sometimes these sections have their own title (e.g., “The Universal Advertising Sheet”), but other times, they don’t. Advertisements are listed in columns, and some documents have two columns, while others have three. I’ve been looking for a text or newspaper that has four columns, and haven’t run into any in 18th- and 19th-century British texts — but I’m sure they exist. Some periodicals have advertisement sections at the front and back; others have advertisements only at the front or the back, but not both. Google Books assigns page numbers to books that it digitizes — sometimes being true to the pagination offered by the source; other times, providing its own pagination.

Challenge: What’s the best method for creating/encoding price metadata that is intelligible, given the complexity of the primary source material (and, in some cases, the additional complexity imposed by digital instantiations)?

Solutions:

  • Just include the URL, and let people search, and don’t worry about other minute particulars of price metadata. If I only wanted to include books that had been digitized, then in some ways this would be simpler: I would be content to link to the book (or provide its metadata for books behind paywalls), and would feel relatively confident that people would be able to locate the book, and then the price. At the moment, that is a workable temporary solution, since my test data set tends to come from digitized books.
    • Pro: Quick! Easy!
    • Con: Ineffective in the long run, because I fully expect to be including prices from books that haven’t yet been digitized.
  • A comments field, where people can include helpful info about where to find the specific price listing.
    • Pro: Highly flexible for the idiosyncratic and often eccentric arrangements of anthologies; doesn’t require developing a vocabulary or ontology specifically for describing the internal structures of books, magazines, newspapers, and books which contain magazines and/or newspapers. Easier to implement than solution #3, below.
    • Con: More potential for reader-introduced confusion; requires extra effort to train potential users to produce comment content; possible extra complexity if situations arise where both a price and a book have complex enough features to require two comment fields.
  • Develop an ontology and controlled vocabulary specifically for describing the internal structure of books, magazines, etc.; and encode prices using that.
    • Pro: Other digital humanists might find this very useful for their own linked open data projects.
    • Con: The range of practices for including, organizing, and/or paginating advertisements is so complex that the resulting taxonomy might be abstruse and all but incomprehensible to non-experts. At this point in time, I don’t think there would be a big enough audience to work together to contribute to and develop such a vocabulary. I have expert knowledge in re: one part of book structures (i.e., how advertisements work), but I don’t want to put VP aside in order to gain the knowledge needed to describe other aspects of internal structure. Also, I see ontologies and vocabularies serving as good solutions when they’re describing a fairly orderly set of choices — and in contrast, I interpret the phenomena I see in terms of internal book structure as often slapdash, chaotic, bass-ackwards: organized by the capricious whims of individuals, rather than by any set standards.

I’ve toyed with option #3; indeed, I’ve been tempted enough to spend several hours looking at books that include advertisements. But for the moment, my answer to “What Would Sir Tim Berners-Lee Do?” is: adhere to the Principle of Least Power. A comments field is the better option for now, and it can be used to gather data that will be useful in iterating further, including, perhaps, toward some sort of controlled vocabulary or taxonomy. However, the question of which content is “important,” and which content is just insignificant wrapping, seems likely to recur in DH projects that use linked open data, so the question of how to describe internal book organization and structure is far from over.
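To make the comments-field option concrete, here is a hedged sketch of what one price entry might carry: a machine-readable link to its source book, plus a single free-text note about where to look inside it. The entry URI and the vp-style book URI are invented for illustration; only the rdfs:comment and dc:source properties are real, established vocabulary.

```python
# Sketch of the comments-field solution: one structured link to the
# source, one free-text note for the idiosyncratic location details.
# The entry and book URIs are hypothetical placeholders.

RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
DC_SOURCE = "http://purl.org/dc/elements/1.1/source"

entry = "http://example.org/vp/price/0001"

entry_triples = [
    (entry, DC_SOURCE, "http://example.org/vp/book/law-times-v1"),
    (entry, RDFS_COMMENT,
     "In the unpaginated advertising sheet bound before p. 213, "
     "second column, about halfway down."),
]

for _, predicate, value in entry_triples:
    print(predicate, "->", value)
```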

Post-Digital Humanities 2015 updates

Attending the Digital Humanities conference in Sydney this year was tremendously useful. Not only did I learn about several new DH linked open data projects that I might want to interface with in the future (or that are excellent models for the content that I ought to be producing on this site); I also came away with a goal for next year: have Visible Prices at a point where I can propose a pre-conference workshop for DH 2016 in Krakow. Right now, I anticipate the proposal looking a bit like the Jane-athon, but also (or alternatively) being a session where by working on and contributing to Visible Prices, participants will have the opportunity to learn about working with linked open data for DH projects, and take away a basic but solid understanding of:

  • how triples and graph databases work
  • major ontologies and vocabularies that DH projects would most likely utilize
  • what a specialized topic ontology looks like and involves

To that end, I’m pleased to note that Jon and I have an almost complete tool chain for VP: a Google Docs spreadsheet that feeds into a BRAT site where a select group of users will be able to mark up the text (example; ETA: here’s a quick video of what that markup process looks like) in order to identify the object(s) being priced and the prices (sometimes there are many!), and a script that transforms that markup into RDF and feeds it into a basic user interface where users will be able to query the data, but also add keywords and help normalize the currency values (e.g., tell the computer that 3s. 6d. = 3 shillings 6 pence). More on both the keywords and the currency normalization soon — but they’re a vital component of this prototype, both for dealing with the complexity of how price expressions are written, and for expanding beyond British pounds, shillings, and pence to include other currencies.
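To give a feel for the currency-normalization step, here is a toy sketch of what “tell the computer that 3s. 6d. = 3 shillings 6 pence” might involve. Real price expressions are far messier than this regex handles (hence the human-in-the-loop step), and the function name and output shape are my own, not the actual VP script.

```python
import re

def normalize_price(text):
    """Parse a simple shillings/pence expression like '3s. 6d.'.

    Only handles the plain s./d. pattern; an expression like
    '6d. plus beer' would still need human normalization.
    """
    shillings = re.search(r"(\d+)\s*s\.", text)
    pence = re.search(r"(\d+)\s*d\.", text)
    s = int(shillings.group(1)) if shillings else 0
    d = int(pence.group(1)) if pence else 0
    # 12 pence to the shilling gives a single comparable value.
    return {"shillings": s, "pence": d, "total_pence": 12 * s + d}

print(normalize_price("3s. 6d."))
# {'shillings': 3, 'pence': 6, 'total_pence': 42}
```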

We’ll have several pre-baked SPARQL queries that people can put in or adapt; more experienced SPARQLers will of course be able to accomplish even more. I’ll have a page where it’s possible to share useful SPARQL queries so that as people write them, others can use them.
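For readers who haven’t met SPARQL yet, the shape of those pre-baked queries can be mimicked in a few lines of plain Python: a SELECT is essentially pattern-matching over triples, with variables acting as wildcards. The data and predicate names below are invented examples, not the VP schema.

```python
# A toy triple store and pattern matcher. None plays the role of a
# SPARQL ?variable. Entries and predicates are invented.

triples = [
    ("vp:entry1", "vp:priceOf", "straight razor"),
    ("vp:entry1", "vp:hasPrice", "3s. 6d."),
    ("vp:entry2", "vp:priceOf", "tooth powder"),
    ("vp:entry2", "vp:hasPrice", "1s."),
]

def match(s=None, p=None, o=None):
    """Return every triple consistent with the given pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly: SELECT ?object WHERE { ?entry vp:priceOf ?object }
print([o for _, _, o in match(p="vp:priceOf")])
# ['straight razor', 'tooth powder']
```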

Once we pass the initial testing of the set-up, I’ll link the Google spreadsheet to a survey that will allow outside users to contribute prices that they’ve found.

There will still be more work to do: for example, I only recently learned about PeriodO — a linked open data gazetteer of period assertions, and I absolutely plan to incorporate its data into VP. Thus far, I’ve kept it simple, relying only on the date of publication as a chronological marker. That isn’t ideal, however, and PeriodO looks like precisely what I need.

Another major area of work will be the task of acquiring data in bulk from large online archives. More on that later — but the coming tool chain will be a huge advance on either of the two prototypes I built previously, and I’m so pleased that after being on the back burner behind my dissertation, the Demystifying workshops, and then my teaching/consulting work here at the Sherman Centre, I’m finally able to bring VP to the foreground.



Adventures in bibliographic encoding

If you click this link, then click “Run”, wait a couple seconds, and then scroll down, then you can see a Very Basic SPARQL Query of Visible Prices, that will return two entries.

The query was written by my collaborator, Jon Crump — all I did was move it out of a Python script, and into the little repository I have over at Dydra.com. Small as it is, I’m still pleased, because for the last week-and-a-half, I’ve had time to work more steadily on VP than I did while the semester was still in session.

Working on VP, in this case, means dusting off my Python skills to work with rdflib; boning up on SPARQL, which I started learning at DHOXSS but didn’t use steadily, and lost; but most of all, wading through pages and pages of info and commentary on bibliographic encoding systems for linked open data, and thinking about what VP actually needs to encode, and what it doesn’t. I’m very lucky to have Jon tackling the development of a basic encoding system and schema for entries (the price, the thing being sold, and their relationships) — but because I want to be learning linked open data as well, and because bibliographic information is less free-form — there are more examples for me to look at — I wanted to take a whack at encoding my sources myself.

I really wasn’t sure, when I started, whether I would find that it was surprisingly simple — or much more thorny. So far, the answer is: it’s somewhere between those two extremes.

My priority is to provide users with enough information to find the snippet of text that I’m quoting in the database. Here are two cases of what I’m dealing with:

Case #1: The Lancet (1852)

The Lancet is a lovely source for all sorts of interesting prices (coffee, overcoats, shirts, straight razors, etc.), mainly because each weekly issue has between two and five pages of front and back matter, in a section called “The Lancet General Advertiser.” Because these pages are just the wrapper for the more serious content, they’re unnumbered. It doesn’t surprise me. I certainly can’t blame the Victorians for not anticipating the fact that I’d want to encode their classified ads in my project. Thus, the first issue of The Lancet in 1852 starts with page 1, and the final numbered page is page 30. The second issue starts the pagination with page 31. But between page 30 and page 31 are ten full pages of the back matter of Issue No. 1 and the front matter of Issue No. 2. (You can see this here: The Lancet, 1852, Vol. 1, digitized by Google Books)

Challenge: Describing the location of prices when each entry has no page numbers. Jon has suggested that I’ll want to include which column the ad is in, and what position (i.e. 4th, 5th, 6th, etc.) — I’m still pondering that.
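If I do adopt Jon’s suggestion, the location record for one of those unpaginated ads might look something like this. Every field name here is an assumption I’ve invented for illustration, not a settled schema; the point is just that column plus position-in-column can substitute for the missing page number.

```python
# Hypothetical location record for an advertisement in an unnumbered
# section. All field names are invented for illustration.

ad_location = {
    "source": "The Lancet, 1852, Vol. 1",
    "issue": 1,
    "section": "The Lancet General Advertiser",
    "paginated": False,   # these leaves carry no page numbers
    "column": 2,          # second column on the leaf
    "position": 4,        # fourth advertisement from the top
}

print(ad_location["section"], "col.", ad_location["column"],
      "pos.", ad_location["position"])
```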

Case #2: General View of the Agriculture of …? (1794)

General View of Agriculture volumes seem to have been (I have not made an exhaustive study of the genre/series) the product of one or more wealthy gentlemen going round a particular county to gather various sorts of data from farmers and laborers, with the encouragement of the London Board of Agriculture. The result is sort of a hybrid of a survey, an almanac, and a scientific journal, with plenty of prices for labour and goods scattered throughout. (Sample volume: General View of the Agriculture of the County of Essex, digitized by Google Books).

At first, nothing seems terribly unusual — but if you scroll through the book, you’ll discover that after page 172, the pagination starts over, because contained in the same volume is the General View of the Agriculture of the County of Northampton, which, though published in the same year, has a different author and publisher. It’s followed by General Views for Worcester and Hertford, each written and published by different individuals.

Challenge: GV … Northampton has its standard bibliographic information, but that’s only the first layer. The second layer is about its location “within” GV … Essex. I’m a little concerned about people heading over to explore further and thinking that I’ve misled them, because there’s no indication that the book is actually several works bound together.
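One hedged way to express that second layer is dcterms:isPartOf, an existing Dublin Core Terms property for exactly this “contained within” relationship. The URIs below are placeholders I’ve invented; only the dcterms properties are real vocabulary.

```python
# The Northampton survey gets its own bibliographic record, plus one
# triple stating that it is bound inside the volume that Google Books
# catalogues under the Essex title. URIs are invented placeholders.

DCTERMS = "http://purl.org/dc/terms/"

volume = "http://example.org/vp/volume/gv-essex-1794"
northampton = "http://example.org/vp/work/gv-northampton-1794"

nested_triples = [
    (northampton, DCTERMS + "title",
     "General View of the Agriculture of the County of Northampton"),
    (northampton, DCTERMS + "date", "1794"),
    (northampton, DCTERMS + "isPartOf", volume),
]
```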

—–

These two cases present different challenges, but they have a common problem as well. I could treat them more like entries in an essay bibliography, and not actually worry about their provenance. That would be the simplest solution — but it goes against the fundamental ethos of linked open data, as I understand it. If I retrieved the source from an online repository, then I ought to indicate where it came from. Plus, it feels silly to be using information from books that are available from Google Books, and not providing a link to them.

This might get simpler when the ESTC transformation into a linked open data resource is complete, but we’re not there yet.


Herrenhausen Big Data Conference Poster

I am very excited to be presenting a poster and lightning talk at the Herrenhausen Conference on Big Data, and thought I’d share it here as well.

Since the poster is meant to introduce the project, some of it will be familiar to people who already know this project well. But just for you (well, and because of the visual background of the poster), I included a selection of prices from the data I’m currently working with; those are all along the right-hand side of the poster.

If you’re new to this project, then you may want to play with the most recent prototype I built, which uses the Alliance for Networking Visual Culture’s Scalar tool. I’m currently hard at work on the next prototype, working with an RDF professional through the support of a small EADH grant — and I hope to release it soon. I’ll be writing more about the process here.

In the meantime, enjoy the poster! (Just click to view it in a larger size, or, if you prefer, download the pdf).

ETA: Here’s the lightning talk that I gave to accompany it.


(I feel very lucky that someone made high resolution scans of pound notes available online, making possible both this poster, and another poster for DHOXSS 2013).



Linked Open Data and Digital Humanities: more messiness, but more possibilities, too

Last week, I wrote about a few of the things that make linked open data (aka RDF) so attractive for digital humanities projects, and some of the reasons that RDF (and its more complex sibling, OWL) are challenging platforms for researchers to work with.

Today, I want to address one more of the challenges, but also say a little more about DPLA’s Heidrun, and how it might make things better (as I understand what it’s trying to do).

For a particular computer function (or set of functions) to become widely used in DH, it needs to be supported by a range of different resource types:

  • detailed documentation of what to do and how to troubleshoot;
  • simpler, user-friendly tutorials that walk people through the process of getting started;
  • tools that simplify the function(s) enough for people to mess around, experiment, and easily show other people what they’ve been doing;
  • tools that allow the construction of a graphical user interface (GUI) so that people other than the creator(s) can play with the tool.

It’s the third of these that linked open data especially lacks right now. Protégé and WebProtégé (produced by Stanford) make it pretty easy to start adding classes and properties, sourcing from some of the more prevalent ontologies (skos, foaf, dc, etc.). Franz’s AllegroGraph also makes this process easy (though personally, I’ve found it a bit buggy to get working, along with its browser, Gruff). Jena and Marmotta (both Apache products) are large-scale triplestores (servers where you can store and query the triples you’ve created). I have yet to successfully get Jena going, though I did get Marmotta up and running without too much difficulty last weekend. There are other up-and-coming tools: Dydra and K-Infinity are both trying to make working with RDF easy for newbies.

Unfortunately, structuring your data and getting it into a triplestore is only part of the challenge. To query it (which is really the point of working with RDF, and which you need to do in order to make sure that your data structure works), you need to know SPARQL — but SPARQL will return a page of URIs (uniform resource identifiers, which often take the form of HTTP addresses). To get data out of your triplestore in a more user-friendly and readable format, you need to write a script in something like Python or Ruby. And that still isn’t any sort of graphical user interface for users who aren’t especially tech-savvy.
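That post-processing script can be tiny in principle: take the URIs a SELECT returns and swap in human-readable labels before showing anything to a user. Everything below (the URIs and the label text) is invented to illustrate the shape of the glue code, not drawn from a real triplestore.

```python
# Minimal URI-to-label lookup, the kind of glue a Python or Ruby
# script provides between raw SPARQL results and a readable display.

labels = {
    "http://example.org/vp/entry/42":
        "straight razor, 3s. 6d. (The Lancet, 1852)",
}

def readable(uri):
    """Use a label when one is known; fall back to the raw URI."""
    return labels.get(uri, uri)

select_results = ["http://example.org/vp/entry/42",
                  "http://example.org/vp/entry/99"]  # no label for this one

print([readable(u) for u in select_results])
# ['straight razor, 3s. 6d. (The Lancet, 1852)', 'http://example.org/vp/entry/99']
```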

In short: understanding the theory of RDF and linked open data isn’t too difficult. Understanding all the moving parts involved in implementation is much more hairy.

And: as smarter people than I have said, DH isn’t just about tech skill & knowledge. Part of the field’s vitality comes from humanists asking questions and wanting to do things, without the fetters of expertise structuring their idea of what is possible.

Even writing this post and the previous two, I’m aware of the possibility that what I really ought to do is go back, and start writing more simplified commentary on RDF and linked open data that helps digital humanists/digital scholars make more sense of all the implementation details: what I’ve learned so far, and what more I learn, as I learn it. I’m also aware that such commentary isn’t a tool — so it doesn’t really let people get their hands dirty and play around.

Anyways: DPLA’s Heidrun and Krikri are named after goats (curious animals that will try to eat anything), and they’re built to ingest metadata and integrate what they consume into DPLA’s linked open data structure. They’re intended to grab data from metadata hubs, like HathiTrust and the National Archives. There’s a good article in D-Lib Magazine titled “On Being a Hub” that explains more about the work involved; or you can read the DPLA’s guidelines for becoming a hub.

I have to admit — when I saw the announcement about Heidrun, I took “try to eat anything” too literally, and thought that the DPLA was working to ingest metadata from more than just official hubs. I was wrong, and even if I hadn’t been, Visible Prices is a long way from being ready to become a hub. However: I’m still very excited about Heidrun’s existence, because it looks like the DPLA is working on finding good ways to harvest and integrate rich and complex metadata from all sorts of cultural/heritage organizations — so, not just bibliographic metadata. Working towards that harvest and integration should raise awareness of existing ontologies that are constructed for or well-suited to humanities data — and quite possibly encourage the development of new ontologies, when appropriate.

And: the work involved in making Heidrun a success will, I think, be applicable/useful in developing the tools that digital humanists need to really start exploring the potential of linked open data. It would certainly be to DPLA’s advantage if more humanities and heritage professionals were able to develop confidence and competence with it, so that the DPLA would have more to ingest, and so that it would be more likely that the metadata being ingested was well-formed.

This is a tiny thing to be hopeful about, but I think it’s worth documenting: both to make the stakes/challenges of working with linked open data more transparent, and because I’m fascinated by the way that platform development often see-saws between massive organizational (enterprise-level) users and individual users. I’m not yet ready to write eloquently about that, but perhaps in 5-10 years I will be, and I want this post to be a record of my thoughts at this point in time.



Linked open data and DH: the messy side

Yesterday evening, I wrote about the optimistic side of RDF — the things that make me excited, and that led me to see it as the right platform for Visible Prices. This morning, I want to address the not-so-great aspects of it — though I’m hopeful that they’re changing.

There’s a lot of discussion about linked open data in libraries — but libraries and DH researchers are trying to do different things with RDF, and thus face different challenges.

Libraries working to add linked open data functionality to their catalogs and repositories are working to get into what I think of as the semantic web nightclub, where other institutions are pointing at your library’s data, and your library is pointing at their data (think 4 & 5 star data, as described here; or imagine having your library’s data integrated into URIs like the John Lennon and Rickenbacker 325 pages from yesterday’s post).

For libraries, as I understand it, the primary challenges involved are with the accuracy of their cataloging data, and the conversion of non-RDF data into an RDF format (which may or may not be easy, depending on how their data is structured). There are also questions about the best practices for developing metadata for scholarly material in new formats (digital dissertations, partly-digital dissertations, etc.) (Librarian friends, feel free to jump in and correct me if I’m in error on any of this).

I certainly don’t want to minimize or trivialize the work involved in linked open data for libraries, because it isn’t easy, especially given the quantity of records that libraries have to work with.

However, libraries face a different set of problems from the DH researcher who’s interested in building a project/dataset using RDF because the structure is so excellent for humanities data. Library records already include very similar information, so getting up to speed with linked open data is partly about making sure that everyone is using the same data structure, especially in the instances where not using the same data structure would result in bad info. Now, getting many people from multiple libraries to participate in the cycle of discerning the right choice, communicating it to teams, implementing it into each library’s systems: that’s serious work.

In contrast, the DH/DS researcher, while they probably deal with bibliographic information (and can thus make use of existing formats for encoding bibliographic info in RDF), is much more likely to be trying to develop a data structure for encoding information that has not previously been structured: for example, objects listed for sale in texts! Or the metal content in ancient pottery. Or 18th and 19th century technological objects, including their construction, purpose, etc.

The DH/DS researcher, then, is in a position where they need to figure out what work has been previously done — i.e., whether ontologies, predicates, controlled vocabularies have already been developed that could be reused for their project, because the ethos of linked open data is that when possible, you make use of existing data structure. Doing so is what makes the great promise of the semantic web work — that your data could be integrated and found in search queries because it has the predicates that allow it to fit into the graph. If you’ve worked with TEI before, then you understand the importance of making sure that your encoding method is compliant with the current guidelines, and you know where to find them, and you probably know how to get onto the TEI-L mailing list to ask questions about the usage you’re making, or trying to make.

But.

Imagine if the TEI guidelines, rather than being located at that one page, were scattered in fragments all over the web, being developed by people who might or might not be talking with each other in order to make decisions, because one is working with e-commerce, and another is working with paleontology, and why would they? They’re each trying to get something done, rather than get wrapped up in theory, or the quest for The One True Data Structure.

In the context of yesterday’s conversation, I went looking for RDF related to metallurgy, and quickly found this class developed for the purposes of e-commerce. It includes some terms (gr:weight, gr:condition) that might be applicable to structuring data about the metallurgical content of ancient pottery. Maybe. The question is whether those terms are being used in a way that’s compatible with my graduate student’s ancient pottery project. Could they be integrated into his data the way that various terms are integrated into the dbpedia John Lennon page? The short answer is that I would keep on looking for RDF developed for something closer to ancient metallurgy, rather than just yoinking e-commerce vocab in — but then again, weight is weight, and GoodRelations (the ontology with the gr: prefix) is an established ontology. Why overcomplicate things? These are the sorts of questions that an individual researcher has to wrestle with frequently.  There are ontologies out there for lots of things. Scrolling down the dbpedia page, I noticed a prefix that I don’t remember seeing before: yago: — which is an ontology for different roles of all sorts, including “Assassinated English People.”
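The reuse dilemma can be shown in two lines: the same measurement stated once with the borrowed GoodRelations term and once with a minted project-specific term. Both are syntactically fine triples; the judgment call is which graph other projects can join onto. The pottery namespace and sherd URI below are hypothetical, and note that GoodRelations actually models weight through a quantitative-value node, so even the “reuse” line is a simplification.

```python
# Borrowed vocabulary vs. minted vocabulary for one fact about one
# (hypothetical) pot sherd. The POT namespace is invented; GR is the
# real GoodRelations namespace, used here in simplified form.

GR = "http://purl.org/goodrelations/v1#"
POT = "http://example.org/pottery/ns#"

sherd = "http://example.org/pottery/sherd/17"

reused = (sherd, GR + "weight", "42.0 g")   # joins onto e-commerce data
minted = (sherd, POT + "weight", "42.0 g")  # private, but project-exact

print(reused[1])
print(minted[1])
```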

The recurring questions that you face, if you’re developing a linked open data database are:

  • does vocabulary for X topic already exist?
  • if vocabulary sort of appropriate for X topic already exists, should I reuse it?
  • or should I be trying to create my own ontology?

These aren’t simple questions, and my experience is that you get better at answering them the more you know about RDF/OWL/linked open data — but there aren’t particularly easy tests you can use right when you’re starting out. That makes the learning curve pretty steep for people who are working on building careers in DH. The 5 Star Open Data site lists this as a cost under the 4-star standard:

⚠ You need to either find existing patterns to reuse or create your own.

That one little sentence can be extraordinarily misleading about the time and labour involved.

This post is getting long, so I suspect I’ll come back and talk about the messiness of linked open data platforms, and DPLA’s Heidrun on Monday — but I think the questions and scenarios I’ve discussed above provide you with some context about why RDF can be uniquely challenging for an individual researcher, as opposed to a library.

ETA: I realize that I never addressed why I think the situation might be improving! But I will get to that, and soon, too.



Linked open data and DH: the shiny side

I had a lengthy conversation about linked open data today with one of the Sherman Centre’s graduate students, and it reminded me of some of the questions a few of my fellow CLIR postdocs asked at CNI, where, as I’ve noted elsewhere, there was a lot of excitement about RDF/linked open data/semantic web stuff.

Then, later this evening, I ran across the DPLA’s announcement about Heidrun and Krikri (I recommend watching the #code4lib presentation linked at the bottom for a more extensive description of the project.) It might be a pretty exciting development, but to understand why, let me run through the standard points I think about whenever someone asks me “is RDF going to be the next big thing in digital humanities?”

RDF is great for encoding heterogeneous data, and humanities subjects have a lot of heterogenous data.

By heterogeneous data, I mean data that doesn’t fit neatly in the rows and columns associated with traditional relational databases, like those built with MySQL.

My perennial example is that I thought encoding prices in MySQL would be easy — after all, every object I wanted to include would have a price, right? But no: some objects have standard prices, like 5 shillings, and other objects have prices like “6 pence plus beer.” Or potatoes. I use that example a lot because it makes people laugh when I explain that MySQL had a hard time encoding “plus beer,” but the truth is that encoding prices contained in many texts means that I have a tremendous variety in terms of what sort of information I actually have. For example, take this excerpt of text, re: prices for cinnamon, vs. this price for a bottle of currant wine. The cinnamon prices have quantities in units, and it matters whether each price is for pure or impure cinnamon. The bottle of currant wine has only an approximate price — “a couple of shillings or so.” Triples are a much better structure for dealing with this sort of uniqueness than endless stacked tables, or tables full of null values.
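Here is the “plus beer” problem sketched as triples rather than a table row: each entry asserts only the statements that apply to it, and nothing forces a NULL-filled “supplement” column onto every other price. The predicate names and the cinnamon figure are invented for illustration.

```python
# Heterogeneous prices as triples: optional facts are just extra
# statements, not extra columns. All values here are illustrative.

price_triples = [
    ("vp:wage1", "vp:hasPrice", "6d."),
    ("vp:wage1", "vp:hasSupplement", "beer"),
    ("vp:wine1", "vp:hasApproxPrice", "a couple of shillings or so"),
    ("vp:cinnamon1", "vp:hasPrice", "2s. per lb."),   # invented figure
    ("vp:cinnamon1", "vp:purity", "pure"),
]

# Only wage1 ever asserts a supplement; no other entry needs the field.
with_supplement = [s for s, p, o in price_triples
                   if p == "vp:hasSupplement"]
print(with_supplement)  # ['vp:wage1']
```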

And there are plenty of humanities topics where this sort of flexibility might come in handy.

RDF seems to present a foundation for widespread collaboration within digital humanities projects.

If you’re a digital humanities person working with RDF, there’s a decent chance that you’re thinking about building your own ontology — a vocabulary designed specifically to allow queries about your information. You might also be planning to mint URIs — identifiers that serve as hubs of information about … well, about any particular thing. Here’s one for John Lennon, at dbpedia (the linked data counterpart to Wikipedia). Notice that the page isn’t a biography for John Lennon in the traditional sense — it’s more like an anchor point for a bunch of structured data, and clicking on many of the links (like Rickenbacker 325) will get you to other URIs. You can get lost in dbpedia, much as you can in Wikipedia! It’s fun.

What this sort of set-up means is that it might well make sense for people to develop ontologies and URIs for a particular area. Regency romance tropes? 18th century genres? Jon and I anticipate putting together terms and resources for dealing with pre-1971 currency, and assuming we succeed, I can imagine other researchers and other projects building on our dataset. That would make me really happy, both from the perspective of making VP more sustainable, and from feeling like the effort that’s gone into it has been useful. And that sort of widespread collaboration feels like it might be the fulfillment of a lot of the enthusiasm that’s driven DH so far.

And actually, I think that’s a good point to end on for tonight. I’ll come back tomorrow to discuss the messy side, and how Heidrun might affect the utility of LOD for digital scholars.


February 2015 update: getting started with brat

Shortly after I wrote my last post here at the end of July, I was offered a post-doc at McMaster University, in the Sherman Centre for Digital Scholarship. Since then, I’ve started that position, and started working with Jon Crump, with funding from the EADH.

Here’s a tiny bit of what we’ve been up to so far:

1. Working on some sample data, and marking it up in the brat rapid annotation tool. Brat allows me to identify portions of text from the quotations (which are my basic raw data, since I’m not going to be encoding whole texts) as RDF subjects and objects, and connect them with predicates. The subjects, objects, and predicates are customized for Visible Prices, rather than predefined. (If you’re new to this project, and new to linked open data/RDF/semantic web stuff, then you might find my Scalar technical statement useful.)

Here’s what the annotation looks like, and how it’s progressing:

This is the first pass, from a few weeks ago. Notice that while I’ve identified the specific object being sold, and the price, the rest of the quotation is just hanging out, unmarked. That means that we haven’t defined the relationship of the rest of that information (the extraordinary Efficacy, etc.) to the ontology that we’re building. But that was just the first pass.

VP Brat shot 1

 

Here’s a more recent update, with new markup by Jon. In this instance, we’ve got the whole quotation included (though to be fair, this is a simpler quotation than VP5 above). Part of my homework is to go through and see about applying this approach to the other samples. Note: we may end up changing these subjects, objects, and predicates even further.

VP Brat shot 2

 

Brat seems like a great tool for this project so far. It’s easy to work with, and will export the annotation in RDF format. It’s also occurred to me that it could work well for crowdsourcing annotations, should I decide to make use of that in the future. (I might, but it won’t be in the immediate future. Maybe within a year, I think? I may crowdsource something else, however…).

Besides getting started with brat, I now have my own Github repository for this project, which is very exciting, in part because I set it up myself using the command line, rather than a GUI. (I know my way around a command line, but I’m not too sophisticated to feel pleased when something works exactly as I wanted it to.) My facility with both the command line and git is due in part to the tutorials at Team Treehouse, which I’ve found to be a) useful, b) wide-ranging in what they cover, and c) packaged in small sections, which means I can work on them when I have a spare few minutes.

Jon’s done more than get me set up with brat, but I’m going to save that for another post, so that I can get into the habit of writing about this more regularly (and get back into the blogging habit generally). So: more soon!

 


The Linked Open Data solution for pre-1971 British currency

One of the things I’m working on this summer is the graph structure for Visible Prices — the way that the information will be organized. “Graph” is the technical term semantic web programmers use for a set of triples that link together to describe objects and their relationships.

Just for giggles and gratitude, here are a couple of photographs from DHOXSS 2013, where my classmates and I started figuring out what the graph structure for VP might look like.

2013-07-10 17.36.20

2013-07-10 17.36.08

I’ll have more on the current structure soon, but this afternoon, I’ve actually been working on a related aspect of the project — namely, the dataset for pre-1971 British currency values. Today, the UK has a decimal currency similar to that of the US and other countries, with 100 pence to the pound/dollar/euro etc. Before 1971, however (and more importantly, in the 18th and 19th centuries), the values looked like this:

farthing: 1/4 penny

halfpenny: 2 farthings

penny: 2 halfpennies/4 farthings

sixpence: 6 pennies

shilling: 12 pence

pound: 20 shillings

guinea: 21 shillings
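The table above can be sketched as conversion code by normalizing every denomination to farthings, the smallest unit, so that prices written in different denominations can be compared directly. The unit names and function names here are my own, not anything from VP:

```python
# Farthing-equivalents for each pre-1971 denomination listed above.
FARTHINGS_PER = {
    "farthing": 1,
    "halfpenny": 2,
    "penny": 4,
    "sixpence": 24,    # 6 pennies
    "shilling": 48,    # 12 pence
    "pound": 960,      # 20 shillings
    "guinea": 1008,    # 21 shillings
}

def to_farthings(**amounts):
    """e.g. to_farthings(shilling=5, penny=6) -> 264"""
    return sum(FARTHINGS_PER[unit] * n for unit, n in amounts.items())

# With a common base unit, "everything that costs 5 shillings 6 pence"
# becomes a simple equality test, however the price was written:
assert to_farthings(shilling=5, penny=6) == to_farthings(penny=66)
```

A single canonical value per price is what makes the “pull back everything that costs 5 shillings” query tractable.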

Every entry that goes into Visible Prices has to have a price, and they all have to be connected together, so that I can pull back everything that costs 5 shillings, or 5 shillings 6 pence, or whatever.

Early on in the project, I thought I would deal with the different values by having the computer do a bunch of math. Then, when I thought I would build VP using MySQL, I figured I’d have a table of values, running from 1 farthing up to a not-quite-infinite upper limit. I thought I’d probably write a little program in Java or Python to generate the list of all the different prices, to save myself the trouble of writing them all out. Still, while the idea of having that table of values sounded great, I was always a little bit sad that it would just be a mass of values stuck on my server.
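That little value-generating program might look something like this — the £/s/d formatting choices and the one-pound ceiling are mine, purely for illustration:

```python
# Enumerate every amount from 1 farthing upward, rendered in
# pounds/shillings/pence notation.

def render(farthings):
    """Render a farthing count as pounds/shillings/pence, e.g. 264 -> '5s 6d'."""
    pounds, rest = divmod(farthings, 960)   # 240 pence x 4 farthings
    shillings, rest = divmod(rest, 48)      # 12 pence x 4 farthings
    pence, quarters = divmod(rest, 4)
    fraction = {0: "", 1: "¼", 2: "½", 3: "¾"}[quarters]
    parts = []
    if pounds:
        parts.append(f"£{pounds}")
    if shillings:
        parts.append(f"{shillings}s")
    if pence or fraction or not parts:
        pd = f"{pence}{fraction}" if pence else (fraction or "0")
        parts.append(pd + "d")
    return " ".join(parts)

# The full list, from 1 farthing up to a chosen ceiling (here, one pound):
amounts = [render(f) for f in range(1, 961)]
```

Generating the labels this way keeps the list consistent, and the upper limit is just a parameter rather than a hand-typed endpoint.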

That sadness was one of the reasons that I got really excited when I learned about linked open data — because the whole idea was that I could create a dataset for all those values, and make it something that other people and projects could use. After all — my project isn’t the only one working with economic information. (See Trading Consequences, and Kathryn Tomasek’s article in JDH on encoding financial records.) At the University of Washington, the Newbook Digital Text project encoding Emma B. Andrews’ diaries is also dealing with a bunch of different monetary amounts.

A shared dataset of amounts would help make projects that include economic information conversant with each other — and that would be brilliant.

However — in linked open data, it’s especially important not to reinvent the wheel. If a dataset already exists, then a duplicate version will only create redundancy and confusion. As a result, I spent most of this afternoon scouring the web to find out whether such a dataset already existed. While Wikipedia (whose content is republished as linked open data through DBpedia) includes pages on pre-decimal British coins, like this entry for the penny, and while that page includes the note that a penny is 1/240 of a pound sterling, that doesn’t fulfill my needs. It doesn’t have pages for four pence, or eight pence halfpenny. And the idea behind the semantic web isn’t that computers do math — it’s that users can point at a stable definition. (If this is new to you, then 5StarData has some great examples.)

The only way to find out whether something like this exists is to go looking — so I did: checking this registry, and Swoogle, which, though no longer active, is still a pretty useful place to look. I also did a number of keyword searches, which led me to ontologies for numismatics — but those focus on the physical features of coins, rather than on identifiers for their values.

Looking for vocabularies and datasets is entertaining: I found a vocab for whiskey, and another one for ancient wisdoms. There are even datasets specifically for food.

There is not, however, a dataset of currency amounts starting with a farthing, and working up. I wasn’t really surprised — I spent some time searching last summer as well, and didn’t find anything — but then I had to let VP go into hibernation so that I could focus on finishing the diss. I’ll do a bit more looking, and ask around, but it looks like I need to see about automating that list of amounts, and read up on best practices for creating URIs. I’ll say a little bit more about that in a few days.
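As a very rough sketch of where those two tasks meet — automating the list and minting URIs — each amount could get one stable identifier keyed to its value. The namespace and slug scheme below are purely hypothetical placeholders, not decisions the project has made:

```python
# One stable URI per amount, keyed to the value in farthings.
# The base namespace here is a placeholder, not a real project URI.
BASE = "http://example.org/currency/"

def amount_uri(farthings):
    """Mint an identifier for an amount, e.g. 264 farthings (5s 6d)."""
    return f"{BASE}f{farthings}"

# "5 shillings 6 pence" = 264 farthings:
print(amount_uri(264))   # http://example.org/currency/f264
```

Whatever the final scheme looks like, the point is that every amount resolves to one URI that other projects can point at, rather than recomputing the value themselves.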