November 15, 2003
@ 11:31 PM

We finally got around to adding some screen shots to the RSS Bandit wiki.

For those who are curious, there should be another release in the next couple of weeks. This should be mostly a bug fix release with a number of improvements in responsiveness of the GUI. The only noticeable new features should be a new preferences tab for adding search engines to the ones available from the search bar, the ability to apply themes to feed items from the preferences dialog without having to exit the dialog and the ability to search RSS items on disk.  

Hopefully, if I can get some cooperation from a couple of folks there also may be some changes to the subscription harmonization functionality.


 

Categories: RSS Bandit

Robert Scoble writes

Microsoft has 55,000 employees. $50 billion or so in the bank.

Yet what has gotten me to use the Web less and less lately? RSS 2.0.

Seriously. I rarely use the browser anymore (except to post my weblog since I use Radio UserLand).

See the irony there? Dave Winer (who at minimum popularized RSS 2.0) has done more to get me to move away from the Web than a huge international corporation that's supposedly focused on killing the Web.

Diego Duval responds

Robert: the web is not the browser.

Robert says that he's "using the web less and less" because of RSS. He's completely, 100% wrong.

RSS is not anti-web, RSS is the web at its best.

The web is a complex system, an interconnection of open protocols that run on any operating system
...
Let me say it again. The web is not the browser. The web is protocols and formats. Presentation is almost a side-effect.

Both of them have limited visions of what actually constitutes the World Wide Web. The current draft of the W3C's Architecture of the World Wide Web gives a definition of the Web that is more consistent with reality and highlights the limitations of both Diego's and Robert's opinions of what constitutes the WWW. The document currently states

The World Wide Web is a network-spanning information space consisting of resources, which are interconnected by links defined within that space. This information space is the basis of, and is shared by, a number of information systems. Within each of these systems, agents (e.g., browsers, servers, spiders, and proxies) provide, retrieve, create, analyze, and reason about resources.

This contradicts Robert's opinion that the Web is simply about HTML pages that you can view in a Web browser, and it contradicts Diego's statement that the Web is about "open" protocols that run on "any" operating system. There are a number of technologies that populate the Web whose "open-ness" some may question. I know better than to cast stones when I live in a glass house, but there are a few prominent examples that come to mind.

The way I read it, the Web is about URIs that identify resources that can be retrieved using HTTP by user agents. In this case, I agree with Diego that RSS 2.0 is all about the Web. A news aggregator is simply a Web agent that retrieves a particular Web resource (the RSS feed) at periodic intervals on behalf of the user using HTTP as the transfer protocol.
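To make the "aggregator as Web agent" point concrete, here is a minimal Python sketch of the request such an agent might build each time it polls a feed. The function name, the agent string and the example URL are hypothetical, and the use of conditional GET headers is an assumption about how a polite aggregator would behave rather than a description of any particular product.

```python
import urllib.request
from typing import Optional


def build_feed_request(feed_url: str,
                       etag: Optional[str] = None,
                       last_modified: Optional[str] = None) -> urllib.request.Request:
    """Build the HTTP GET an aggregator could send when polling a feed URI."""
    request = urllib.request.Request(feed_url, method="GET")
    # Hypothetical agent name, used only for illustration.
    request.add_header("User-Agent", "ExampleAggregator/0.1")
    # Conditional GET (an assumption): lets the server reply 304 Not Modified
    # when the feed has not changed since the previous poll.
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


# The request is only constructed here, not sent.
request = build_feed_request("http://example.org/rss.xml", etag='"abc123"')
```

Nothing in the sketch is specific to a browser: a URI, HTTP and a resource representation are all the agent needs.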


 

Categories: Ramblings

November 14, 2003
@ 04:22 PM

 Fumiaki Yoshimatsu writes

Why does someone still think that they have to write Unicode BOMs by themselves, digging deep inside XmlTextWriter.BaseStream and UnicodeEncoding.GetPreamble?  Encoding hint in the XML declarations and Unicode BOMs are all about XML 1.0 thing, but WriteStartElement and WriteStartDocument are not.  They are InfoSet thing, so they do not have anything to do with the serialization format.  Think about XmlNodeWriter for example.  Why does XmlNodeWriter NOT have any constructor that have a parameter of type Encoding?  Why does it always call XmlDocument.CreateXmlDeclaration with null as the second argument?

This is a common point of confusion for users of XML in the CLR. XmlNodeWriter doesn't have a parameter of type Encoding because it writes to an XmlDocument, which is stored in memory, and all strings in the CLR are in UTF-16 encoding. Setting the encoding only matters when saving the XmlDocument to a stream. As for having to dig into XmlTextWriter.BaseStream to set the encoding, I find this weird considering that the XmlTextWriter constructor offers a number of ways to specify the encoding when instantiating the class. Since XML 1.0 mandates that an XML document can only have one encoding, there is no reason for methods like WriteStartElement and WriteStartDocument to concern themselves with encoding issues.
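The same split between an in-memory tree and its serialized bytes can be sketched outside the CLR. The Python example below is only an analogy using the standard library's ElementTree (it is not the .NET API discussed above, and it assumes Python 3.8 or later for the xml_declaration argument): the tree is built without any encoding, and the declaration and byte order mark appear only when the serializer is asked for bytes.

```python
import codecs
import xml.etree.ElementTree as ET

# Build the document in memory; no encoding is chosen at this point.
root = ET.Element("greeting")
root.text = "caf\u00e9"

# The encoding only becomes relevant when the tree is serialized to bytes.
as_utf8 = ET.tostring(root, encoding="utf-8", xml_declaration=True)
as_utf16 = ET.tostring(root, encoding="utf-16", xml_declaration=True)

# The serializer, not the caller, emits the XML declaration and the UTF-16 BOM.
starts_with_bom = as_utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

The calling code never touches a preamble or a base stream; it only names the encoding once, at the point of serialization.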

If you really want to dive deep into issues involving specifying the encoding of XML documents and the CLR, take a look at this discussion in Robert McLaws's weblog.

PS: One of my pet peeves is the way people misuse the term XML infoset to mean "things in XML I don't care about" even though there is a precise definition (nay, an entire spec) that describes what it means. The document information item clearly has a [character encoding scheme] property, which means character encodings are an XML infoset thing.


 

Categories: XML

November 14, 2003
@ 05:11 AM

Irwando the Magnificent (king of SQLXML) just pointed me at iPocalypse Photoshop. A number of the pseudo-engravings are quite amusing; my favorites are "Stolen music is better than sex" and "once you've had small and white..."

The photoshopped iPods in the scenes from Eddie Murphy's Haunted Mansion are also worth a snicker or two.

 


 

It looks like it's confirmed that I'll be attending XML 2003.

Should be fun.


 

Categories: Ramblings

Oleg Tkachenko writes

Just found new beast in the Longhorn SDK documentation - OPath language:

The OPath language is the query language used to query for objects using an ObjectSpace. The syntax of OPath also allows you to query for objects using standard object oriented syntax. OPath enables you to traverse object relationships in a query as you would with standard object oriented application code and includes several operators for complex value comparisons.

Orders[Freight > 5].Details.Quantity > 50 OPath expression should remind you something familiar. Object-oriented XPath cross-breeded with SQL? Hmm, xml-dev flamers would love it.

The approach seems to be exactly opposite to ObjectXPathNavigator's one - instead of representing object graphs in XPathNavigable form, brand new query language is invented to fit the data model. Actually that makes some sense, XPath as XML-oriented query language can't fit all. I wonder what Dare think about it. More studying is needed, but as for me (note I'm not DBMS-oriented guy though) it's too crude yet

Oleg is right that an XML-oriented query language like XPath isn't a good fit for querying objects. There is definitely an impedance mismatch between XML and objects, and a good number of the differences were pointed out by Erik Meijer in his paper Programming with Circles, Triangles and Rectangles. A significant number of constructs and semantics of XPath simply don't make sense in a language designed to query objects. The primary construct in XPath is the location step, which consists of an axis, a node test and zero or more predicates, of which both the axis and the node test are out of place in an object query language.

From the XPath grammar, there are 13 axes, of which almost none besides self make sense for objects. They are listed below

[6]    AxisName    ::=    'ancestor'
| 'ancestor-or-self'
| 'attribute'
| 'child'
| 'descendant'
| 'descendant-or-self'
| 'following'
| 'following-sibling'
| 'namespace'
| 'parent'
| 'preceding'
| 'preceding-sibling'
| 'self'

The ones related to document order such as preceding, following, preceding-sibling and following-sibling don't really apply to objects since there is no concept of order amongst the properties and fields of a class. The attribute axis is similarly unrelated since there is no equivalent of the distinction between elements and attributes among the fields and properties of a class.

The axes related to document hierarchy such as parent, child, ancestor, descendant, etc. look like they may map to object-oriented concepts until one asks what exactly is meant to be the parent of an object. Is it the base class, or the object to which the current object belongs as a field or property? Most would respond that it is the latter. However, what happens when multiple objects have the same object as a field, which is often the case since object structures are graph-like, not tree-like as XML structures are? It also gets tricky when an object that is a field in one class is a member of a collection in another class. Is the object a child of the collection? If so, what is the parent of the object? If not, what is the relationship of the object to the collection? The questions can go on...
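The ambiguity is easy to reproduce. In the Python sketch below, every class and helper name is hypothetical and chosen only for illustration: one Address object is a field of a Customer and also a member of a collection in a Warehouse, so asking for "the parent" of the address yields two candidates, while the equivalent XML tree yields exactly one.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Address:
    city: str


@dataclass
class Customer:
    name: str
    address: Address


@dataclass
class Warehouse:
    code: str
    addresses: List[Address] = field(default_factory=list)


shared = Address("Seattle")
customer = Customer("Ann", shared)     # shared is a plain field here
warehouse = Warehouse("W1", [shared])  # and a collection member here


def owners_of(target, candidates):
    """Return every candidate that refers to target, directly or through a list."""
    found = []
    for candidate in candidates:
        for value in vars(candidate).values():
            members = value if isinstance(value, list) else [value]
            if any(member is target for member in members):
                found.append(candidate)
                break
    return found


# Two equally valid "parents" in the object graph.
object_parents = owners_of(shared, [customer, warehouse])

# In an XML tree the same question has exactly one answer.
doc = ET.fromstring("<customer><address city='Seattle'/></customer>")
address = doc.find("address")
xml_parents = [node for node in doc.iter() if address in list(node)]
```

Here object_parents holds both the customer and the warehouse, whereas xml_parents holds only the customer element.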

On the surface, the namespace axis sounds like it could map to concepts from object oriented programming since languages like C#, C++ and Java all have a concept of a "namespace". However, namespace nodes in the XPath data model have distinct characteristics with no counterpart in those languages (such as the fact that each element node in a document has a distinct set of namespace nodes regardless of whether each of these namespace nodes represents the same mapping of a prefix to a namespace URI).

A similar argument can also be made around node tests, which are the second primary construct in XPath location steps. A node test either specifies a name or a type of node to match. A number of XPath node types, such as comment and processing instruction nodes, don't have equivalents in the object oriented world. Other nodes such as text and element nodes are problematic when one begins to try to tie them in to the various axes such as the parent axis.

Basically, a significant amount of XPath is not really applicable to querying objects without changing the semantics of certain aspects of the language in a way that conflicts with how XPath is used when querying XML documents.

As for how this compares to my advocacy of XML to object mapping techniques such as the ObjectXPathNavigator, the answer is simple: XML is the universal data interchange format, and the software world is moving to a situation where all the major sources of important data can be accessed or viewed as XML, from office documents to network messages to information locked within databases. It makes sense then that in creating this universal data access layer one create a way for all interesting sources of data to be viewed as XML so they too can participate as input for data aggregation technologies such as XSLT or XQuery and enable the reuse of XML technologies for processing and manipulating them.


 

Categories: Life in the B0rg Cube | XML

November 11, 2003
@ 11:10 PM

I noticed the following RDF Interest Group IRC chat log discussing my recent post More on RDF, The Semantic Web and Perpetual Motion Machines in my referrer logs. I found the following excerpts quite illuminating

15:43:42 <f8dy> is owl rich enough to be able to say that my <pubDate>Tue, Nov 11, 2003</pubDate> is the same as your <dc:date>2003-11-11</dc:date>

15:44:35 <swh> shellac: I believe that XML datatypes are...

...

16:08:15 <f8dy> that vocabulary also uses dates, but it stores them in rfc822 format

16:08:51 <f8dy> 1. how do i programmatically determine this?

16:08:58 <JHendler> ah, but you cannot merge graphs on things without the same URI, unless you have some other way to do it

16:09:02 <f8dy> 2. how do i programmatically convert them to a format i understand?

...

16:09:40 <shellac> 1. use

...

16:10:13 <shellac> 1. use a xsd library

16:10:32 <shellac> 2. use an xsd library

...

16:11:08 <JHendler> n. use an xsd library :->

16:11:30 <shellac> the graph merge won't magically merge that information, true

16:11:34 <JHendler> F: one of my old advisors used to say the only thing better than a strong advocate is a weak critic

This argument cements my suspicion that using RDF and Semantic Web technologies is a losing proposition when compared to using XML-centric technologies for information interchange on the World Wide Web. It is quite telling that none of the participants who tried to counter my arguments gave a cogent response besides "use an xsd library" when in fact anyone with a passing knowledge of XSD would inform them that XSD only supports ISO 8601 dates and would barf on RFC 822 dates if asked to treat them as dates. In fact, this is a common complaint about XSD dates from our customers w.r.t. internationalization [that and the fact that decimals use a period instead of a comma as the delimiter for fractional digits].
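The mismatch is simple to demonstrate. The Python sketch below uses datetime.fromisoformat as a rough stand-in for an XSD date-time validator (an assumption: its accepted syntax is not identical to the xsd:dateTime lexical space, and the helper name is hypothetical), and shows that a parser expecting ISO 8601 rejects a well-formed RFC 822 date outright.

```python
from datetime import datetime


def is_iso_8601_datetime(text: str) -> bool:
    """True if text parses as an ISO 8601 date-time (stand-in for an xsd:dateTime check)."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


# The kind of value found in dc:date.
iso_ok = is_iso_8601_datetime("2003-11-11T16:08:15+00:00")
# The kind of value found in pubDate.
rfc822_ok = is_iso_8601_datetime("Tue, 11 Nov 2003 16:08:15 GMT")
```

Under these assumptions iso_ok is True and rfc822_ok is False, which is why "use an xsd library" does not answer the question of converting one format to the other.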

Even in this simple case of mapping equivalent elements (dc:date and pubDate), the Semantic Web advocates cannot explain how their vaunted ontologies solve a problem that the average RSS aggregator author solves in about 5 minutes of coding using off-the-shelf XML tools. It is easy to say philosophically that dc:date and pubDate are equivalent because, after all, they are both dates, but it is another thing to write code that knows how to treat them uniformly. I am quite surprised that such a straightforward real-world example cannot be handled by Semantic Web technologies. Clay Shirky's The Semantic Web, Syllogism, and Worldview makes even more sense now.

One of my co-workers recently called RDF an academic plaything. After seeing how many of its advocates ignore the difficult real-world problems faced by software developers and computer users today while pretending that obtuse solutions to trivial problems are important, I've definitely lost any interest I had left in investigating the Semantic Web any further.


 

Categories: XML

November 11, 2003
@ 03:40 PM

From the Memory Hole

The Memory Hole posted an extract from an essay by George Bush Sr. and Brent Scowcroft, in which they explain why they didn't have the military push into Iraq and topple Saddam during Gulf War 1. Although there are differences between the Iraq situations in 1991 and 2002-3, Bush's key points apply to both.

But a funny thing happened. Fairly recently, Time pulled the essay off of their site. It used to be at this link, which now gives a 404 error. If you go to the table of contents for the issue in which the essay appeared (2 March 1998), "Why We Didn't Remove Saddam" is conspicuously absent.

Ever since September 11, 2001, the news has sounded more and more like excerpts from George Orwell's 1984. All is not lost though; it has been heartening to see that some teachers are using this incident as a way to teach their students about media literacy. My favorite is Rewriting History: The Dangers of Digitized Research by Peg Hesketh.


 

Categories: Ramblings

I always love the Top 50 IRC Quotes. Warning, some of them are a bit risqué.


 

Categories: Ramblings

My post from yesterday garnered a couple of responses from the RDF crowd who questioned the viability of the approaches I described. Below I take a look at some of their arguments and relate them to practical examples of exchanging information using XML I have encountered in my regular development cycle.  

Shelley Powers writes

One last thing: I wanted to also comment on Dare Obasanjo's post on this issue. Dare is saying that we don't need RDF because we can use transforms between different data models; that way everyone can use their own XML vocabulary. This sounds good in principle, but from previous experience I've had with this type of effort in the past, this is not as trivial as it sounds. By not using an agreed on model, not only do you now have to sit down and work out an agreement as to differences in data, you also have to work out the differences in the data model, too. In other words -- you either pay upfront, once; or you keep paying in the end, again and again. Now, what was that about a Perpetual Motion Machine, Dare?

In responding to Shelley's post it is easier for me if I use a concrete example. RSS Bandit uses a custom format that I came up with for describing a user's list of subscribed feeds. However, in the wild, other news aggregators use differing formats such as OPML and OCS. To ensure that users who've used other aggregators can try out RSS Bandit without having to manually enter all their feeds, I support importing feed subscription lists in both the OPML and OCS formats even though these are distinct from the format and data model I use internally. This importation is done by applying an XSLT stylesheet to the input OPML or OCS file to convert it to my internal format, then converting that XML into the RSS Bandit object model. The stylesheets took me about 15 to 30 minutes each to write. This is the XML-based solution.
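To give a feel for how small such an import step is, here is a Python sketch of the same idea. It is not the XSLT stylesheet described above: the standard library has no XSLT processor, so the transform is written directly against ElementTree, and the target feeds/feed/title/link vocabulary is a made-up placeholder rather than the actual internal feed list format. The sample OPML is also invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample input, using the common OPML outline attributes.
OPML_SAMPLE = """<opml version="1.1">
  <body>
    <outline text="Example Blog" xmlUrl="http://example.org/rss.xml"/>
    <outline text="Folder">
      <outline title="Nested Feed" xmlUrl="http://example.com/feed"/>
    </outline>
  </body>
</opml>"""


def opml_to_feed_list(opml_text: str) -> ET.Element:
    """Convert OPML outlines into a hypothetical <feeds> subscription list."""
    feeds = ET.Element("feeds")
    for outline in ET.fromstring(opml_text).iter("outline"):
        url = outline.get("xmlUrl")
        if not url:
            continue  # outlines without xmlUrl are treated as folders, not feeds
        feed = ET.SubElement(feeds, "feed")
        title = outline.get("title") or outline.get("text") or url
        ET.SubElement(feed, "title").text = title
        ET.SubElement(feed, "link").text = url
    return feeds


feed_list = opml_to_feed_list(OPML_SAMPLE)
links = [feed.findtext("link") for feed in feed_list]
titles = [feed.findtext("title") for feed in feed_list]
```

The result is an XML tree in the target vocabulary that can then be loaded into an application's own object model, which is the second half of the import described above.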

Folks like Shelley believe my problem could be better solved by RDF and other Semantic Web technologies. For example, if my internal format was RDF/XML and I was trying to import an RDF-based format such as OCS then instead of using a language like XSLT that performs a syntactic transform of one XML format to the other I'd use an ontology language such as OWL to map between the data models of my internal format and OCS. This is the RDF-based solution.

Right off the bat, it is clear that both approaches share certain drawbacks. In both cases, I have to come up with a transformation from one representation of a feed list to another. Ideally, for popular formats there would be standard transformations described by others to move from one popular format to another (e.g. I don't have to write a transformation for WordML to HTML but do for WordML to my custom document format), so developers who stick to popular formats simply have to locate the transformation as opposed to actually authoring it themselves.

However, there are further drawbacks to using the semantics-based approach compared to the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elements (e.g. the equivalent of element renaming, such as stating that a url and a link element are equivalent), an ontology language is insufficient whereas a Turing-complete transformation language like XSLT is sufficient. A good example of this comes from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted: the first is the pubDate element, which is described as containing a string in the RFC 822 format, while the other is the dc:date element, which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation has been applied if one wants an application to treat pubDate and dc:date as equivalent. So instead of making one pass with an XSLT stylesheet to perform the transformation as in the XML-based solution, two transformation techniques will be needed in the RDF-based solution, and it is quite likely that one of them would be XSLT.
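The syntactic step is the part that takes a few minutes of code. The Python sketch below is not the actual RSS Bandit implementation; the function name and sample feed are hypothetical, and treating a date without a time zone as UTC is an assumption. It reads whichever of the two elements an item carries and converts both syntaxes into one comparable value.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Dublin Core element namespace, in ElementTree's {uri}name notation.
DC = "{http://purl.org/dc/elements/1.1/}"


def item_date(item: ET.Element) -> Optional[datetime]:
    """Return the item's publication date in UTC, whichever element carries it."""
    pub_date = item.findtext("pubDate")
    if pub_date:
        parsed = parsedate_to_datetime(pub_date.strip())  # RFC 822 syntax
    else:
        dc_date = item.findtext(DC + "date")
        if not dc_date:
            return None
        parsed = datetime.fromisoformat(dc_date.strip())  # ISO 8601 syntax
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)


rss = ET.fromstring("""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>a</title><pubDate>Tue, 11 Nov 2003 16:08:15 GMT</pubDate></item>
    <item><title>b</title><dc:date>2003-11-11T16:08:15+00:00</dc:date></item>
  </channel>
</rss>""")

dates = [item_date(item) for item in rss.iter("item")]
```

Both items come out as the same point in time. Note that datetime.fromisoformat only accepts a subset of ISO 8601 on older Python versions, so a production aggregator would likely need a more forgiving parser.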

The other practical concern is that I already know XSLT and have good books to choose from to learn about it such as Michael Kay's XSLT : Programmer's Reference and Jeni Tennison's XSLT and XPath On The Edge as well as mailing lists such as xsl-list where experts can help answer tough questions.

From where I sit, picking an XML-based solution over an RDF-based one when it comes to dealing with issues involving interchange of XML documents just makes a lot more sense. I hope this post helps clarify my original points.

Ken MacLeod also wrote

In his article, Dare suggests that XSLT can be used to transform to a canonical format, but doesn't suggest what that format should be or that anyone is working on a common, public repository of those transforms.

The transformation is to whatever target format the consumer is comfortable dealing with. In RSS Bandit the transformations are OCS/OPML to my internal feed list format and RSS 1.0 to RSS 2.0. There is no canonical transformation to one Über XML format that will solve everyone's problems. As for keeping a common, public repository of such transformations, that is an interesting idea which I haven't seen anyone propose in the past. A publicly accessible database of XSLT stylesheets for transforming between RSS 1.0 and RSS 2.0, WordML to HTML, etc. would be a useful addition to the XML community.

Sam Ruby muddies the waters in his post Blind Spots and subsequent comments in that thread by confusing the use cases around XML as a data interchange format and XML as a data storage format. My comments above have been about XML as a data interchange format. I'll probably post more in the future about RDF vs. XML as a data storage format, using the thread in Sam's blog for context.


 

Categories: XML