Thursday 28 May 2015

Semantic Web for the Working Ontologist: chapter 8

Chapter 8 is titled RDFS-Plus. It introduces a selection of OWL – Web Ontology Language – keywords and demonstrates their use.
Clearly, OWL's originators aren't Winnie the Pooh fans, else they'd have stuck with the acronym WOL. Unless they were avoiding confusion with "Wake-on-LAN" – an Ethernet standard that allows computers to be turned on by a network message – and a bunch of other stuff (see Wikipedia).
It felt much easier to understand most of this than I expected - but I suspect that was partly down to the assistance I had this week from Paul Rissen, Data Architect for BBC News. Paul kindly gave up a lunchtime to explain stuff to me, show me how and where the BBC are using linked data, and how he taught himself to develop ontologies. He also introduced me to two useful resources – BBC Ontologies and BBC Things – which are hugely helpful for putting all this into context.

Anyway, the new functions covered in chapter 8 are…


equivalentClass
It's not unusual, when federating data from different sources, to discover different URIs referring to the same thing, or class of things. Specifying that two classes are equivalent indicates that they will always have the same members. One important point the authors make here is that "when two classes are equivalent, it only means the two classes have the same members. Other properties of the classes are not shared".
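In Turtle, with prefixes omitted and the class and individual names invented for illustration, such an equivalence and the inference it licenses might look like this:

```turtle
# Two data sources use different URIs for the same class of things
:Analyst owl:equivalentClass :Researcher .

# Asserted:
:Kildare rdf:type :Analyst .
# Inferred (equivalent classes always have the same members):
:Kildare rdf:type :Researcher .
```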

equivalentProperty
This, unsurprisingly, functions similarly to equivalentClass, but for properties of things rather than classes of things. So, for example, in the authors' example of library books, "borrows" is an owl:equivalentProperty of "checkedOut".
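A sketch in Turtle, borrowing the book's library example (the borrower and book names are mine, not the book's):

```turtle
:borrows owl:equivalentProperty :checkedOut .

# Asserted:
:Amit :borrows :MobyDick .
# Inferred:
:Amit :checkedOut :MobyDick .
```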

sameAs
Where equivalentClass and equivalentProperty are the ways of expressing and defining identity relationships between classes and properties of things, sameAs is for individuals. The examples given here include Shakespeare: the Shakespeare that Anne Hathaway married and who had three children is sameAs the Shakespeare who wrote plays and sonnets.
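Assuming two hypothetical namespaces that each coined their own Shakespeare URI, the identity would be asserted like this:

```turtle
# lit: and hist: are invented prefixes for two different data sources
lit:Shakespeare owl:sameAs hist:Shakespeare .
# Any triple asserted about one individual now also holds of the other.
```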

inverseOf
Used where one property is the inverse of another: the examples given include hasParent and hasChild and, for checking out of library books, :signedTo owl:inverseOf :signedOut. So, it specifies a mutual relationship between two properties.
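In Turtle (the individual names are invented), the declaration and the inference it licenses:

```turtle
:hasParent owl:inverseOf :hasChild .

# Asserted:
:Ann :hasParent :Beth .
# Inferred:
:Beth :hasChild :Ann .
```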

TransitiveProperty
In OWL, a transitive property is one that applies throughout a chain of relationships. So, as in the instance given here, ancestors are transitive, parents are not (although parents are a subset of ancestors). "owl:TransitiveProperty is a class of properties, so a model can assert that a property is a member of the class

:P rdf:type owl:TransitiveProperty."
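So, sketching the ancestor example in Turtle (the individual names are my own):

```turtle
:hasAncestor rdf:type owl:TransitiveProperty .

# Asserted:
:Carol :hasAncestor :Dave .
:Dave  :hasAncestor :Eve .
# Inferred (the chain collapses):
:Carol :hasAncestor :Eve .
```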

SymmetricProperty
"is an aspect of a single property… expressed in OWL as a Class." It applies in situations where, for instance, you need to specify that two people are married to each other, or are siblings. It applies in any case where a :isWhateverOf b means that b :isWhateverOf a.
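A minimal Turtle sketch, with invented names:

```turtle
:isMarriedTo rdf:type owl:SymmetricProperty .

# Asserted:
:Fred :isMarriedTo :Gina .
# Inferred:
:Gina :isMarriedTo :Fred .
```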

FunctionalProperty
"RDFS-Plus borrows the name functional to describe a property that, like a mathematical function, can only take one value for any particular individual." So, for an individual, age and National Insurance Number could each be a FunctionalProperty.

InverseFunctionalProperty
But FunctionalProperty isn't as useful as InverseFunctionalProperty, which is considered by some people "the most important modeling construct in RDFS-Plus, especially in situations in which a model is being used to manage data from multiple sources." It applies in a situation where "a single value of the property cannot be shared by two entities". National Insurance Number can be an InverseFunctionalProperty, but age can't. In RDFS-Plus, where two entities share the same InverseFunctionalProperty, the inference will be made not that there has been an error, but that the two entities are, in fact, the same.

Some properties – like National Insurance Number – can be both functional and inverse functional properties.
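In Turtle (the property name and the NI number are invented for illustration), declaring both characteristics and watching the sameAs inference fire:

```turtle
:hasNINumber rdf:type owl:FunctionalProperty ,
                      owl:InverseFunctionalProperty .

# Asserted, perhaps by two different data sources:
:PersonA :hasNINumber "QQ123456C" .
:PersonB :hasNINumber "QQ123456C" .
# Inferred by an RDFS-Plus reasoner (not an error - they must be one person):
:PersonA owl:sameAs :PersonB .
```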

ObjectProperty and DatatypeProperty 
I struggled to understand the distinction between these two from the details provided in the book, so resorted to Googling, which led me to a forum – http://stackoverflow.com – where someone had asked the difference between the two, and the answer was "Datatype properties relate individuals to literal data (e.g., strings, numbers, datetypes, etc.) whereas object properties relate individuals to other individuals". Which makes a lot of sense!
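That distinction in Turtle (the property and individual names are mine):

```turtle
:hasAuthor rdf:type owl:ObjectProperty .    # individual -> individual
:hasTitle  rdf:type owl:DatatypeProperty .  # individual -> literal

:MobyDick :hasAuthor :HermanMelville ;      # relates two individuals
          :hasTitle  "Moby-Dick" .          # relates an individual to a string
```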


Next week's chapter – "Using RDFS-Plus in the wild" – looks at real world applications. So, I'm hoping that should be comprehensible!

Until then…


Thursday 21 May 2015

Semantic Web for the Working Ontologist: chapter 7

I am now much, much further into a technical book than I've ever been before in my life. That feels like an achievement: even if I don't come away from this with any more than a fragmentary grasp of all things RDF and beyond…

So – this week's chapter is concerned with RDF Schema. It is all about sets which, somehow, feels more comfortable and intuitive than all the RDF graphs, even though I've forgotten all the mathematical set-related signs (a very long time ago I did a pure maths with stats A level very badly). It explores the following questions: "Which individuals are related to one another and how? How are the properties we use to define our individuals related to other sets of individuals and, indeed, to one another?" And it answers the question of how to express those relationships in a way that allows inferred triples to be constructed from asserted triples.

And "schema"? The term schema was, according to Oxford Dictionaries, coined in the 18th century. It's derived from the Greek "skhēma … [meaning] …'form, figure'", and the definitions the dictionary provides are
1. technical A representation of a plan or theory in the form of an outline or model:
2. Logic A syllogistic figure.
3. (In Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form. 

Or, in RDFS, according to Allemang and Hendler,  "The schema is information about the data." It is information about information. "The key idea of the schema in RDF", they continue, "is that it should help provide some sense of meaning to the data."

It does this "by specifying semantics using inference patterns". Which means that it expresses relationships in triples, using defined terms. So "the basic construct for specifying a set in RDFS is called an rdfs:Class." So for a subject, say :FloweringPlant, you'd have the predicate rdf:type and the object rdfs:Class. There's also rdfs:subClassOf, and, importantly – because it can include verbs as well as nouns – rdfs:subPropertyOf.
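So, in Turtle, the FloweringPlant example plus a subclass might read (:Rose and :PeaceRose are my own additions, not the book's):

```turtle
:FloweringPlant rdf:type rdfs:Class .
:Rose rdfs:subClassOf :FloweringPlant .

# Asserted:
:PeaceRose rdf:type :Rose .
# Inferred:
:PeaceRose rdf:type :FloweringPlant .
```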

"In general, rdfs:subPropertyOf allows a modeler to describe a hierarchy of related properties". The more specific a property, the lower down the hierarchy it sits; the more general, the higher up. So, "whenever any property in the tree holds between two entities, so does every property above it". In other words, if property 1 is a subproperty of property 2, then whenever property 1 holds between two entities, so does property 2 – and so does every property above property 2 in the hierarchy. And we only need to assert the most specific relationship in a triple for the rest to be inferred.
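A sketch of such a property hierarchy in Turtle (the names are mine):

```turtle
:hasMother rdfs:subPropertyOf :hasParent .
:hasParent rdfs:subPropertyOf :hasAncestor .

# Asserted:
:Ann :hasMother :Beth .
# Inferred, walking up the hierarchy:
:Ann :hasParent   :Beth .
:Ann :hasAncestor :Beth .
```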

"RDFS," say the authors, "'extends' RDF by introducing a set of distinguished resources into the language." I'm assuming that they're using "distinguished" to mean "distinct", rather than either "Very successful, authoritative, and commanding great respect" or "Dignified and noble in appearance or manner" (Oxford Dictionaries again). But what do I know?



Apparently this is an image of "Distinguished Gentleman's Ride London". Glossing over any issues with the word "gentleman" for the purposes of this caption, it's possible there may be something I don't understand about the word "distinguished".
Meanwhile, back in the book I'm supposed to be getting my head round, the authors are introducing the other key concepts for this chapter: rdfs:domain and rdfs:range. These describe a property "that determines class membership of individuals related by that property".

In the margin of p130 (for this, dear reader, is how far we have come), I've scrawled in big pencilled caps REREAD THIS, as I evidently failed to understand it first time round. This time it seems fairly clear. Essentially the use of the terms "domain" and "range" is inspired by their use in maths, where "the domain of a function is the set of values for which it is defined, and the range is the set of values it can take." In RDFS, "A property P can have an rdfs:domain and/or an rdfs:range." And the two terms provide information on how a property P is to be used: "domain refers to the subject of any triple that uses P as its predicate, and range refers to the object of any such triple".
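Back with the library books, a sketch of domain and range in Turtle (the class and individual names are invented for illustration):

```turtle
:borrows rdfs:domain :Patron ;
         rdfs:range  :Book .

# Asserted:
:Amit :borrows :MobyDick .
# Inferred:
:Amit     rdf:type :Patron .
:MobyDick rdf:type :Book .
```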

EDIT 28/05/2015. After reading this blog, Paul Rissen provided the following, much improved explanation for domain and range




It's also significant that there's no way in RDF to say that something isn't a member of a particular class: this means that "there is no notion of an incorrect or inconsistent inference". So modelers have to be careful in defining set relationships.

Most of the rest of this chapter is concerned with applying these concepts and terminologies: showing, for instance, how you can relate two entities by using triples to define each as a subset of the other – doing this both ways ensures that an item in one will automatically be considered a member of the other – or, if they're related but not hierarchically, by creating another entity (or set) that sits higher up the hierarchy and that the original two entities are subsets of.

Conceptually it's comparatively easy – applying it in IRL situations, I imagine, requires a lot of careful thought.

Next week… RDFS-Plus. So I guess that'll be like this week, only more so…




Thursday 14 May 2015

Semantic Web for the Working Ontologist: chapter 6

This week's chapter – it's nice and short: or, at least, short – is called "RDF and inferencing".

It covers the way that data modelers can ensure that, when someone searches the web overall – or a specific site – the results include the relevant examples of the thing(s) they're searching for, even when these haven't been mentioned, by name, in the search. The example the authors use here is a search for "shirts" that should return results including "Henleys" (which, apparently, are a type of shirt. Who knew?).

What's particularly significant about the Semantic Web approach, is that it enables a data modeler to create/define data so that "the data can describe something about the way they should be used". But, more than that, "given some stated information, we can determine other, related information that we can also  consider as if it had been stated".

This is inference: the ability to model data in such ways that we are "able to add relationships into the data that constrain how the data is used".

So,
IF
?Thing1 rdfs:subClassOf ?Thing2 .
AND
?x rdf:type ?Thing1 .
THEN
?x rdf:type ?Thing2 .
In other words (or to be more accurate, in words), if Thing1 is a subset of Thing2, and x is an example of Thing1, then x must also be an example of Thing2.

We won't need to specify anywhere that that x is an example of Thing2. The query engine will infer that itself.


This is not a remotely helpful illustration, as Thing 1 is clearly not a subclass (or subset) of Thing 2, but it is what the inference quoted above made me think of…

Anyway, this is one of the uses of CONSTRUCT (see last week's blog) in SPARQL, which "provides a precise and compact way to express inference rules". All of this means that SPARQL can form "the basis for an inference language", such as SPIN – SPARQL Inferencing Notation – which, according to its web page, "has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models".
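Assuming the usual rdf: and rdfs: prefixes, the IF/THEN rule above can be written as a CONSTRUCT query (my sketch, not the book's verbatim example):

```sparql
CONSTRUCT { ?x rdf:type ?Thing2 . }
WHERE {
  ?Thing1 rdfs:subClassOf ?Thing2 .
  ?x rdf:type ?Thing1 .
}
```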

Overall, "the strategy of basing the meaning of our terms on inferencing, provides a robust solution to understanding the meaning of novel combinations of terms". It means that any "deployment architecture" will require not merely the functionality of a Query Engine, but of something that functions as an Inference and Query Engine – something that works both with "triples that have been asserted", ie those stated explicitly in the data, and with triples that are inferred from those assertions (see above). Incidentally, when these relationships are represented graphically, the convention is to draw asserted triples with unbroken lines, and inferred triples with broken ones.

In some instances, the querying and inferencing are done by the query engine. In other formulations, "the data are preprocessed by an inferencing engine and then queried directly" - as sometimes it's "convenient to think about inferencing and queries as separate processes". This means that inferencing can happen at different points in the storing and querying process, depending on the implementation. The decisions around this have implications: inferring early in the process has consequences for storage, and forces choices about which inferred triples to retain and which to discard when the underlying data sources change. A "just in time inferencing" approach – where all inferencing happens only in response to queries – risks duplicating inference work.

And that is about that for this week. Next week we're moving on to RDF Schema. Which is obviously going to be a challenge, as I'm not sure I even understand the chapter title. Still, I've already made it to page 125, which makes this the longest relationship I've ever sustained with a technical manual…

Thursday 7 May 2015

Semantic Web for the Working Ontologist: chapter 5

Remarkably, I have managed to read my way through all 51 (okay 50-and-a-half) pages of chapter 5. I had to do it in very short bursts, as my ability to retain – or make sense of – even fairly straightforward information of this particular kind is extremely limited. I have no idea whether this is down to lack of experience and/or focused application, or something to do with the way my brain is made and functions.

Anyone care to enlighten me?

Chapter 5 is all about SPARQL, an acronym for SPARQL Protocol And RDF Query Language. Someone, somewhere made the brain-knotting decision to make the first word of the acronym the acronym itself, for reasons that can only be to make it a homophone of sparkle. Because otherwise it would be called PARQL. Or xPARQL, WHERE x = the first letter of something that sounds less Escher-like than SPARQL but makes more sense.

There follows 50 pages of basic SPARQL explanation and instruction, setting the scene with descriptions and illustrations of increasingly sophisticated and flexible tell-and-ask systems: from spreadsheets, via relational databases, and into RDF and SPARQL, starting from the simple statement that "the basic idea behind SPARQL [is] that you can write a question that looks a lot like the data, with a question word standing in for the thing you want to know."

Most of the chapter is taken up with introducing the basic vocabulary and syntax of writing queries in SPARQL. So, readers gain an understanding of
  • SELECT queries: these have two parts, "a set of question words, and a question pattern." In SPARQL, any word can be a question word, as long as it has a ? directly before it. For a question word to do its job, it needs to appear in a triple in the graph pattern. Essentially, "question words" = variables.
  • WHERE "indicates the selection pattern" and is written in what the authors call braces and I call curly brackets, because in my brain braces are what my daughter has on her teeth (I am, by-the-by, hugely impressed by the dexterity of orthodontists).
  • DISTINCT filters out duplicate results, and appears after SELECT.
  • FILTER which is used to define which query results will be retained, and which rejected "is a Boolean test, not a graph pattern". So, its operation is written in parentheses, rather than curly brackets. Also "you cannot reference a variable in a FILTER that hasn't already been referenced in the graph pattern" – ie the SELECT/WHERE part of the query.
  • OPTIONAL according to the W3C "tries to match a graph pattern, but doesn't fail the whole query if the optional match fails" (http://www.w3.org/2009/Talks/0615-qbe/). So, as in the example Allemang and Hendler give, if you want to find out the names of actors in a film, and when they died, the query won't exclude anyone who's still alive.
  • UNSAID enables you to exclude some data from the results: so, for example, if you wanted details of only the actors who were in a film and are still alive.
  • ASK appears at the beginning of a query, instead of SELECT and is used in instances where a yes/no answer is required.
  • CONSTRUCT – which also appears instead of SELECT –  "introduces a graph pattern to be used as a template in constructing a new graph". In other words, it creates relationships between data items that might not have been previously linked.
  • ORDER BY comes at the end of the graph pattern and does what you'd expect: ie allows you to specify how you would like query results ordered.
  • DESC if this appears after ORDER BY, it organises data in descending order (ascending is default)
  • COUNT, MIN, MAX, AVG and SUM enable data to be aggregated. They appear in parentheses, follow SELECT and are followed by the word AS "followed by a new variable, which will be bound to the aggregated value."
  • GROUP BY allows data to be grouped by a specified variable: this variable "must already have been bound in the graph pattern" - ie: must have been defined with a triple.
  • HAVING allows you to isolate specific results from the overall query results.
  • UNION "combines two graph patterns, resulting in the set union of all bindings made by each pattern".
  • SERVICE "followed by a URL for the SPARQL endpoint before a graph pattern" sends that part of the query to a remote SPARQL endpoint for evaluation – it's the mechanism behind federated queries.
  • GRAPH does the same sort of thing in the same sort of way, but restricts the pattern to a particular named graph within the dataset.
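To see several of these keywords in one place, here's a sketch of a query for living actors and how many films each appeared in. The property names (:starredIn, :hasName, :diedOn) are invented for illustration, and prefix declarations are omitted:

```sparql
SELECT DISTINCT ?name (COUNT(?film) AS ?appearances)
WHERE {
  ?actor :starredIn ?film ;
         :hasName   ?name .
  OPTIONAL { ?actor :diedOn ?death . }   # don't fail the query if no death date
  FILTER (!bound(?death))                # keep only the still-living
}
GROUP BY ?name
ORDER BY DESC(?appearances)
```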

Other important things
  • The order in which triples appear in a SPARQL query has no impact on the results, but can impact on the speed the results can be delivered, as it will vary the amount of data that needs to be processed. So "order triples in a query so that the fewest number of new variables are introduced in each new triple".
  • SPARQL was developed for publishing to the web. "A server for the SPARQL protocol is called a SPARQL Endpoint." This "accepts queries and returns results" and "is the most web-friendly way to provide access to RDF data". It is also "identified with a URL".
  • "The namespace dc stands for 'Dublin Core' a metadata standard used by many libraries worldwide"
  • SPARQL has less need for subqueries than most query languages, because graph patterns "can include arbitrary connections between variables and resource identifiers". But they can still sometimes come in handy.
  •  Assignments – which are "expressed as part of the SELECT clause" and aren't supported under SPARQL 1.0, but apparently will be (or are) under 1.1 – enable a query to compute "the value of a variable through some computation". In other words, it assigns "a value to that variable".
  • Queries can be federated: ie, an individual query can query multiple data sources.
So, having read that, could I now write a query in SPARQL? Er, no. But I might, slowly, hesitantly, and with repeated references back to this book or to the W3C SPARQL standard, be just about able to read an easy one… Which is progress.

And here are pictures of James Dean and Elizabeth Taylor who star in this chapter. No, really.