Visible Prices: Technical Statement

(cross-posted from Visible Prices in Scalar)

Visible Prices is, in many ways, a simple project — so simple that when I first had the idea, about one-third of the people I spoke with said that surely, it had been done already. Another third counseled me not to tell anyone about it, lest Google or Microsoft steal my idea, and do it themselves. After all, it was just a database, making connections between things that cost the same amount of money. It shouldn’t be that hard. Surely, people suggested, there were existing software programs that could do it.
In fact, there weren’t software programs that would work, though I’ve certainly made an effort to find them. Part of the issue was that any platform would need to be equipped to deal with pre-decimal British currency values. However, there was – and is – a larger problem: most of the existing database and cataloging platforms for humanities projects are intended to work with bibliographic metadata. Instead, Visible Prices is intended to work with the contents of the texts themselves. Those contents don’t conform neatly to the existing metadata parameters. Thus, the collection has been, and continues to be, a project that needs to be built from scratch.

One of the major challenges of working with economic data is its heterogeneity. Some prices will be fixed, and others variable. Some wages will include non-monetary supplements: 6 shillings a day, plus beer and potatoes. XML-based encoding languages (which have been highly popular for digital humanities and scholarly editing projects) assume that the things you’re encoding (plays by Shakespeare, poems by late Victorians) will look relatively similar, so that the same tags can be used to describe them.

MySQL, and other relational databases, which use tables rather than markup language, assume that if you have a table with five columns describing certain qualities, most of the entries in the table should have data in each column. Otherwise, the structure becomes rickety: queries are harder to construct, and more likely to return errors or crash the database.

Choosing a platform has been a long process, because learning enough to evaluate how well a particular tool will work with the data is slow work. It seems probable to me that if I had been willing to limit my scope; say, to prices related to governesses’ salaries, or to prices in a particular author’s body of work, that MySQL or TEI might have worked more effectively. But a smaller project, while more immediately gratifying, wouldn’t have taught me nearly as much as I’ve learned in the past few years.

In July 2013, I attended the Digital Humanities Oxford Summer School, and took Kevin Page and John Pybus’ course in semantic web programming, focusing on two closely related specifications: OWL (Web Ontology Language) and RDF (Resource Description Framework). Both OWL and RDF are intended to model complex data. You’ve encountered them before: they provide the structure behind resources like Wikipedia, and the databases of music metadata that iTunes uses to identify your CDs. Semantic web description attempts to capture as much detail as it can, and to make it searchable.

The basic unit of RDF (and thus of OWL) is called a triple, and it contains a subject, a predicate, and an object. For example:

Subject: Jane Eyre | Predicate: hasAuthor | Object: Charlotte Bronte

Subject: Charlotte Bronte | Predicate: hasBirthdate | Object: April 21, 1816

Triples are linked together to form what semantic web programmers call a graph. (To outsiders, it looks more like a cluster). For example, the graph for Charlotte Bronte would involve both of the above triples (as well as several others, including triples that would tell you that Bronte has two sisters, used the pseudonym Currer Bell, etc.).
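The triples and graph described above can be sketched in plain Python. This is only a toy illustration, not the project’s actual code, and the predicate names (hasAuthor, hasPseudonym, and so on) are invented for the example rather than drawn from any existing ontology:

```python
# A toy model of RDF-style triples: each is a (subject, predicate, object) tuple.
# Predicate names here are illustrative, not from a real ontology.
triples = [
    ("Jane Eyre", "hasAuthor", "Charlotte Bronte"),
    ("Shirley", "hasAuthor", "Charlotte Bronte"),
    ("Charlotte Bronte", "hasBirthdate", "April 21, 1816"),
    ("Charlotte Bronte", "hasPseudonym", "Currer Bell"),
]

# The "graph" for Charlotte Bronte: every triple that mentions her
# as either subject or object.
bronte_graph = [t for t in triples if "Charlotte Bronte" in (t[0], t[2])]
```

Here all four triples touch Charlotte Bronte, so her graph gathers them all; in a real triple store the same linking happens across millions of statements.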

Triples are queried using SPARQL (SPARQL Protocol and RDF Query Language, pronounced “sparkle”) – a language that matches patterns against the subject, predicate, or object – any one, any two, or all three – and returns the information that matches.

So, you might write a SPARQL query that asks for all the novels that match the “hasAuthor” predicate with “Charlotte Bronte.” Alternately, you might write a query that asks for all the novels written by authors with pseudonyms; and you might specify that the pseudonym include the name “Bell.” This would return the Bronte sisters’ works – and the works of any other authors whose pseudonym included “Bell.”
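The pattern-matching idea behind such queries can be sketched in a few lines of Python. This is not SPARQL itself, just a simplified imitation of how a basic graph pattern works: `None` plays the role of a query variable that matches anything, while concrete values must match exactly.

```python
# A minimal sketch of SPARQL-style pattern matching over triples.
# None acts as a variable ("match anything"); other values must match exactly.
def match(triples, subject=None, predicate=None, obj=None):
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

triples = [
    ("Jane Eyre", "hasAuthor", "Charlotte Bronte"),
    ("Wuthering Heights", "hasAuthor", "Emily Bronte"),
    ("Emily Bronte", "hasPseudonym", "Ellis Bell"),
]

# "All the works that Charlotte Bronte authored":
novels = match(triples, predicate="hasAuthor", obj="Charlotte Bronte")
```

Real SPARQL adds joins across multiple patterns (for instance, linking a novel to its author and then to that author’s pseudonym), but each pattern works on this same fill-in-the-blanks principle.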

The advantage of OWL, and of other semantic web specifications, is that they can handle my highly heterogeneous data without crashing. They balance structure and flexibility when modeling data from many sources, even when those sources carry significantly different types of metadata, and they can represent detailed features of the source texts without sacrificing queryability. This makes OWL and semantic web programming a good fit for Visible Prices.

Like TEI, which encourages users to customize markup language for their needs and to develop new terms and categories, OWL allows users to develop new vocabularies for particular subjects and to share them, making them available to similar projects. This is why semantic web data is often referred to as “linked open data”: it is meant to be open and shareable. If another scholar developed a digital humanities project focusing only on Charlotte Bronte, they could make use of my data on the prices that appear in Bronte’s novels.

The vocabularies that semantic web programmers develop are called ontologies, because they define concepts and relationships within a specific area. If you’ve worked with metadata, then you may have made use of the Dublin Core ontology.

A semantic web database can also make an otherwise unusable set of data usable. One of the difficulties of building Visible Prices has been the non-decimal values of British currency before 1971. No existing database indexes these currency values, from one farthing upwards. As a result, the values you might see in texts (1 shilling, or 5 pounds, or 20 guineas) are arguably data – but they’re not good data, because they’re much harder to work with. Making such an index would be somewhat tedious (though much of the process could be automated) – but once created, it would transform currency amounts from unwieldy into usable objects.
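One way to picture such an index is as a normalization to the smallest unit. The sketch below is my own illustration, not part of the project’s implementation; it uses the standard pre-1971 relationships (4 farthings to the penny, 12 pence to the shilling, 20 shillings to the pound, 21 shillings to the guinea) to turn heterogeneous currency expressions into comparable integers.

```python
# Illustrative sketch: normalize pre-1971 British currency to farthings,
# the smallest unit, so different denominations become comparable.
# 4 farthings = 1 penny; 12 pence = 1 shilling;
# 20 shillings = 1 pound; 1 guinea = 21 shillings.
FARTHINGS_PER = {
    "farthing": 1,
    "penny": 4,
    "shilling": 48,   # 12 pence x 4 farthings
    "pound": 960,     # 20 shillings x 48 farthings
    "guinea": 1008,   # 21 shillings x 48 farthings
}

def to_farthings(amount, unit):
    """Convert a currency expression such as (5, "pound") to farthings."""
    return amount * FARTHINGS_PER[unit]

# "1 shilling", "5 pounds", and "20 guineas" become comparable integers:
values = [to_farthings(1, "shilling"),
          to_farthings(5, "pound"),
          to_farthings(20, "guinea")]
```

Once every price is expressed in a common unit, queries like “find everything that cost the same as a governess’s annual salary” reduce to simple integer comparisons.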

Creating the dataset for pre-1971 currency is part of my continuing work on the Visible Prices project. But I’ll also be working to create my own ontology that allows me to encode prices into my database. There are parts of my database that will make use of existing vocabularies, like Dublin Core. Other parts of it will require me to develop my own terms – for things like non-monetary wage supplements such as alcohol. Because my own semantic web programming knowledge is still relatively new, I’ll be consulting with a professional web ontologist this spring as I work out the structure. This work will be supported by a Small Project Grant from the European Office of Digital Humanities.

Developing my own ontology is a significant step forward for Visible Prices. It will allow me to populate the database, and set up an interface through which users can query my data. Once that database is set up, Visible Prices will be ready to grow at a much faster rate. At that point, I’ll be ready to seek large-scale grants for its ongoing expansion and support.


