A recent comment on the Groklaw blog entitled "Which Binary Key?" claims that one needs a "binary key" to consume XML produced by Microsoft Office 2003. Specifically, the post claims:
No_Axe speaks as if MS Office 12 had already been released and everyone was using it. He assumes everyone knows the binary key is gone. Yet Microsoft is saying that MS Office 12 is more or less a year away from release. So who really knows when and if the binary key has been dropped? All i know is that MSXML 12 is not available today. And that MSXML 2003 has a binary key in the header of every file.
...
So let me close with this last comment on the fabled “binary key”. In March of 2005, when phase II of the ODF TC work was complete, and the specification had been prepared for both OASIS and ISO ratification, the ODF TC took up the issue of “compliance and conformance” testing. Specifically, we decided to start work on a compliance testing suite that would be useful for developers and application providers to perfect their implementations of ODF. Guess who's XML file format was the first test target? Right. And guess what the problem is with MSXML? Right. It's the binary key. We can't do even a simple transformation between MSXML and ODF!

As someone who's used the XML features of Excel and Word, I know for a fact that you don't need a "binary key" to process the files using traditional XML tools. Brian Jones, who works on a number of the XML features in Office, has a post entitled The myth of the Binary Key in which he mentions various parts of the Office XML formats that may confuse one into thinking they are some sort of "binary key", such as namespace URIs, processing instructions and Base64-encoded binary data. All of these are standard aspects of XML that one typically doesn't see in simple uses of the technology such as RSS feeds.
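
To make that concrete, here is a minimal sketch in Python of an off-the-shelf XML parser reading a document that contains all three features. The sample document is hypothetical: the namespace URI `urn:example:doc`, the processing instruction target and the element names are invented for illustration and are not the real Office schema.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical document: namespace URI, processing instruction and element
# names are made up for this example; they are NOT the real Office schema.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?example-application progid="Example.Document"?>\n'
    '<d:document xmlns:d="urn:example:doc">\n'
    '  <d:text>Hello</d:text>\n'
    '  <d:binData>SGVsbG8sIHdvcmxkIQ==</d:binData>\n'
    '</d:document>\n'
)

# Prefix-to-URI mapping used when looking up namespaced elements.
NS = {"d": "urn:example:doc"}


def read_sample(xml_text):
    """Parse the document and return (root tag, text, decoded binary data)."""
    root = ET.fromstring(xml_text)
    text = root.find("d:text", NS).text
    # Base64 content is ordinary character data until we choose to decode it.
    blob = base64.b64decode(root.find("d:binData", NS).text)
    return root.tag, text, blob


if __name__ == "__main__":
    print(read_sample(SAMPLE))
```

Nothing here needs a key of any kind: the processing instruction is skipped by the parser, the namespace is just part of the element name, and the Base64 payload decodes with a standard library call.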

Since I used to work on the XML team, there is one thing I want to add to Brian's list which often confuses people trying to process XML: the Unicode byte order mark (BOM). This is often at the beginning of documents saved in UTF-16 or UTF-8 encoding on Windows. As the Wikipedia entry on BOMs states:

In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string, to indicate that the encoded characters that follow it use big-endian byte order; or it is expressed as the byte sequence FF FE to indicate little-endian order.

Whilst UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.

I wouldn't be surprised if the alleged "binary key" was just a byte order mark which caused problems when trying to process the XML file using non-Unicode savvy tools. I suspect some of the ODF folks who had problems with the XML file would get some use out of Sam Ruby's Just Use XML talk at this year's XML 2005 conference. 
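
For the curious, here is a rough sketch in Python of what I suspect happens. It is only an illustration under my own assumptions: the helper names `sniff_bom` and `naive_decode` and the sample document are hypothetical, and I only check for the UTF-8 and UTF-16 marks quoted above. A tool that treats the bytes as ISO-8859-1 sees three stray characters in front of the XML declaration and gives up, whereas a conforming parser handed the raw bytes copes with the BOM on its own.

```python
import codecs
import xml.etree.ElementTree as ET

# Byte order marks discussed above, paired with the encoding they signal.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

# Hypothetical sample: a tiny UTF-8 document saved with a leading BOM.
SAMPLE_WITH_BOM = (
    codecs.BOM_UTF8
    + b'<?xml version="1.0" encoding="UTF-8"?><doc>hi</doc>'
)


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def naive_decode(data):
    """Mimic a non-Unicode-savvy tool: read every byte as ISO-8859-1."""
    return data.decode("iso-8859-1")


if __name__ == "__main__":
    print(sniff_bom(SAMPLE_WITH_BOM))
    # The BOM bytes show up as three junk characters before "<?xml".
    print(repr(naive_decode(SAMPLE_WITH_BOM)[:8]))
    # Given the raw bytes, the parser handles the BOM itself.
    print(ET.fromstring(SAMPLE_WITH_BOM).text)
    try:
        ET.fromstring(naive_decode(SAMPLE_WITH_BOM))
    except ET.ParseError as err:
        print("naive pipeline fails:", err)
```

If this is what bit the ODF folks, the fix is to hand the parser bytes rather than pre-decoded text, or to strip the mark before decoding.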


 

Tuesday, October 18, 2005 8:47:05 PM (GMT Daylight Time, UTC+01:00)
Don't mind me, just leaving a few URIs for later tracking...

http://java2.5341.com/msg/6066.html
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1762
http://www.mail-archive.com/xerces-c-dev@xml.apache.org/msg15322.html