One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs while allowing them to map between formats using technologies like XSLT. However, XML technologies like XSLT focus on mapping one kind of syntax to another. There is another school of thought, from proponents of Semantic Web technologies such as RDF, OWL, and DAML+OIL, that higher-level mapping between the semantics of languages is a better approach.

In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactic mappings and semantic mappings. A great example of this is shown in the post entitled On the Quality of Metadata... by Stefano Mazzocchi, where he writes:

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: everytime we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function on how much thinking you put in the data mappings or ontology crosswalks, and talking to Ben helped me understand why.

First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata.

Note how this is nothing but a syllogism, yet it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".

The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?".. deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently and they are happy campers.

The problem Stefano has pointed out is that just being able to say that two items are semantically identical (i.e. an artist field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (i.e. alter artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.
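The point can be sketched in a few lines of code. This is a minimal illustration in Python (the dataset records and the `normalize_artist` helper are hypothetical, made up for this example): the semantic mapping says the two fields mean the same thing, but coherence only comes back once a syntactic rewrite is also applied.

```python
import re

# Two hypothetical, internally coherent datasets about music.
dataset_a = [{"artist": "Beatles, The", "song": "Yesterday"}]
dataset_b = [{"band name": "The Beatles", "song": "Let It Be"}]

def normalize_artist(name):
    """Syntactic mapping: rewrite 'ArtistName, The' as 'The ArtistName'."""
    match = re.fullmatch(r"(.+), (The)", name)
    return f"{match.group(2)} {match.group(1)}" if match else name

# Semantic mapping alone: 'band name' in B means the same as 'artist' in A...
merged = [{"artist": r["artist"], "song": r["song"]} for r in dataset_a] + \
         [{"artist": r["band name"], "song": r["song"]} for r in dataset_b]
# ...but the merged data is now incoherent: "Beatles, The" and "The Beatles"
# are treated as two different artists.

# Adding the syntactic normalization restores coherence.
coherent = [{**r, "artist": normalize_artist(r["artist"])} for r in merged]
```

After normalization every record agrees on "The Beatles"; without it, the merge is exactly the lower-quality result Stefano describes.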

The example I tend to draw on from my personal experience is mapping between different XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between the two formats isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or that <atom:author> owl:sameAs <author>. In both cases, an application that understands how to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax of the equivalent elements in the other (e.g. processing RFC 3339 dates as opposed to RFC 822 dates).
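The date example is concrete enough to show in code. Here is a small Python sketch (Python used purely for illustration) that converts an Atom 1.0 RFC 3339 timestamp into the RFC 822 form an RSS 2.0 <pubDate> expects: both strings denote the same instant, but neither parser can read the other's syntax.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# An Atom 1.0 <published> value in RFC 3339 syntax.
atom_published = "2006-01-21T15:18:37Z"

# Parse the RFC 3339 timestamp...
dt = datetime.strptime(atom_published, "%Y-%m-%dT%H:%M:%SZ") \
             .replace(tzinfo=timezone.utc)

# ...and re-serialize it as an RFC 822 date for an RSS 2.0 <pubDate>.
rss_pubdate = format_datetime(dt)
# e.g. "Sat, 21 Jan 2006 15:18:37 +0000"

# Round-tripping confirms the two syntaxes describe the same instant.
assert parsedate_to_datetime(rss_pubdate) == dt
```

An owl:sameAs assertion captures none of this conversion; it is pure syntactic mapping, and it has to happen somewhere.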

Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world. I've seen claims that simply using XML technologies for mapping between XML vocabularies means you will need N² transforms, as opposed to 2N transforms if using Semantic Web technologies (Stefano mentions this in his post, as has Ken Macleod in his post XML vs. RDF :: N × M vs. N + M). The explicit assumption here is that these vocabularies have similar data models and semantics, which should be true, since otherwise a mapping wouldn't be possible. The implicit assumption, however, is that the syntax of each vocabulary is practically identical (e.g. same naming conventions, same date formats, etc.), and this post provides a few examples where that is not the case.
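The arithmetic behind the transform-count argument is easy to check. A quick Python sketch (function names are mine, for illustration): with N formats, direct pairwise mapping needs a transform for every ordered pair of distinct formats, while routing through a shared model like RDF needs only one mapping into the hub and one out per format. (As argued above, each of those hub mappings may itself need both a semantic and a syntactic layer, which is what complicates the tidy count.)

```python
def pairwise_transforms(n):
    # One transform per ordered pair of distinct formats.
    return n * (n - 1)

def hub_transforms(n):
    # One transform into the shared model and one out, per format.
    return 2 * n

# The gap grows quickly: at N=3 the counts are 6 vs 6,
# at N=5 they are 20 vs 10, at N=10 they are 90 vs 20.
for n in (3, 5, 10):
    print(n, pairwise_transforms(n), hub_transforms(n))
```

Note the break-even point: below four formats the hub approach buys nothing, which is one reason small-scale integrations rarely bother with a shared model.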

What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...

Saturday, January 21, 2006 3:18:37 PM (GMT Standard Time, UTC+00:00)
Hi Dare, we meet again ;-)


You're right that syntax issues can be very problematic, and as Stefano points out there can be a knock-on effect to the semantics. But this kind of problem isn't new to Semantic Web technologies, there's exactly the same set of problems if you wanted to integrate data containing say RFC 3339 dates and RFC 822 dates in a relational database, object model or for that matter an XML DB. But the problems can be addressed at the syntax layer using technologies like XSLT (or programmatically) while maintaining a globally consistent data model. Right now the integration problem tends to be solved afresh for every connection between systems.


If everyone used exactly the same programming language, libraries and naming schemes then integration would be trivial. That's not going to happen in the foreseeable future, but cross-system sharing of a relatively minimal declarative data model is feasible. It's already happening with domain-specific languages (like HTML, RSS, etc.); RDF offers a way to generalise: a kind of web-friendly UML.


So the "date" attribute of something in your DB or XML might not match the "date" attribute of something in mine. But by using URIs to describe these attributes the data can be merged without clashes, fairly orthogonally from working out equivalences. The syntax issues don't go away, but at least there's a logical framework in which they can be addressed.


Weird hybrid of OWL and XSLT? Yep, there's GRDDL (see also micromodels). It does mean you need mappings between target formats and the RDF model. I don't think the absence of conventions between N systems invalidates the N × M vs. N + M angle. OK, strictly speaking it might be more accurate to call it 2(N + M), in that for every mapping to RDF you'll need both semantic and syntax layer mappings. But once the appropriate mapping has been determined, a single XSLT can be enough to transform every instance of the mapped data.


Anyhow, whether or not you favour the Semantic Web technologies, I think there's a lot of value in Stefano's post, this line is a beaut: "What's missing is the feedback loop: a way for people to inject information back into the system that keeps the system stable."
Saturday, January 21, 2006 3:21:15 PM (GMT Standard Time, UTC+00:00)
(Oops, sorry about the extra line breaks)
Saturday, January 21, 2006 4:56:42 PM (GMT Standard Time, UTC+00:00)
Yup. What has fascinated me about metadata interchange (what's a better term for metadata reconciliation across domains of use?) is how everyone steps over the lack of syntactic agreement for expression of common concepts, let alone agreement on the abstract concept to be preserved. We seem to be blind to the different canonical syntaxes, our automatic preferences, and the absence of magical interconversion.

I don't think forcing to a single at-the-surface canonical form will work in a distributed community effort, maybe not even for something as focused and rigorous as web services, so we have the problem that you hint at: allowing users of metadata to employ their "natural" expression forms and reconciling them under the covers. The challenge: reconciling term sets that are mutually incoherent in a way that allows all parties to have it their way. Sort of localization taken to the obvious dynamic extreme.

I'm not sure there is a simple-enough community participation effort that can create a semblance of coherence. Even having the canonical usages distinguished by, say, careful use of namespaces, seems too difficult. It blows my mind that Dublin Core doesn't deal with syntactic conventions at all, as far as I can tell, and that's a pretty great example of how invisible or swept-under-the-rug this problem is. I'm also reminded of the observation, attributed to Terry Winograd, that people are simply not going to provide (rigorous) metadata for much.

I fancy calling this problem and the term-use reconciliation problem the "Dueling Lexicographers" situation. My use case was people building glossaries and wanting to interchange and build on each other's definitions and sources in some sort of loosely-coupled federated activity.

My thought was that we would not come around to addressing metadata reconciliation until the obvious solution (simply defining the value of an attribute as text) began to hurt enough for us to work smarter. Meanwhile, I have been admiring how some systems, such as Outlook, incorporate nice filtering procedures, while otherwise just shaking my head a lot. I just realized I've been doing the head-shaking part for over 15 years now.

Prediction: We may give up on metadata coherence and use something like smart search systems to get us out of the mess. Someplace in there lies the previously-unsolved problem, but I think that will get us close to good enough without turning everyone at the edges into librarians and catalogers.
Sunday, January 22, 2006 6:18:03 PM (GMT Standard Time, UTC+00:00)
Thanks to a link about RSS guys being at last night's Naked Conversation launch party, I learned about SSE this morning: http://blogs.msdn.com/rssteam/archive/2005/12/01/498704.aspx

That sounds like a great laboratory for the [correction] "Feuding Lexicographers" scenario. I shall read deeper.
Friday, November 3, 2006 6:48:07 PM (GMT Standard Time, UTC+00:00)
A weird hybrid of OWL and XSLT?

We are working on something very similar on Sourceforge. It is exactly as Danny indicated; "a single XSLT can be enough to transform every instance of the mapped data."

We are convinced a simple and flexible framework must be built to make this exchange of semantic meaning possible and to automate the mapping of data tags between autonomous systems. This thin layer provides a common semantic vocabulary, human-readable descriptions (formally structured XSD annotations), and data structure sets to capture the underlying syntax data model.

This is all captured in an upper-level ontology called the Metadata Semantic Language. The MSL is a semantic markup language structured as a set of formal ontologies. It was designed to be similar to CIDOC-CRM and METS but delivered as a completely separate shim layer. It is not intended to be delivered, like OWL, as an embedded part of the XML, but instead to be a completely separate entity that provides mapping from the low-level syntax (whatever it might be) into the MSL standard vocabulary. Its only connection to the XML schema is that this separate mapping shim layer is advertised in the WSDL.

The OWL and MSL languages are similar in that both enable semantic interoperability; however, they are designed to solve very different problems. OWL is designed to provide descriptive metadata. It is a core set of markup language constructs for describing the properties and capabilities of Web Services. OWL allows service providers to define their services based on agreed-upon ontologies that describe the "real world" functions they provide. The MSL is not descriptive metadata but integration metadata. It provides a language and architecture that is used to integrate systems so that they can exchange XML data. The MSL provides a mapping language so that the low-level XML syntax can automatically be mapped between systems. The complexity of data integration is not well suited to OWL-style categorization. Data domains require a more concise model to capture explicit mapping constructs, creating tighter and more exact transformations.

MSL presents a formal vocabulary that is applied as an overlay to low-level XML syntax. This overlay markup provides a layer of semantic abstraction that captures the meaning of the customized, low-level, syntactic naming conventions. This layer of semantic abstraction is used to advertise proprietary low-level syntax in a higher-level commonly understood semantic vocabulary. The semantic language does not replace XML. Applications still use localized and otherwise customized XML markup languages. The semantic model maps the low-level syntax into the MSL to enable the automated transformation of one proprietary schema into another. It does this by mapping the proprietary markup into a shared semantic model that expresses the underlying meaning of the XML syntax in a commonly understood vocabulary. The semantic markup language is presented separately from the XML document and from the XSD. It is presented in a semantic model (called the smodel) the location of which will be advertised in the WSDL to provide a mechanism to automate syntax mapping.
Comments are closed.