Thanks to the recent news of the US Department of Justice's requests for information from the major web search engines, I've seen a number of people express surprise and dismay that online services track information that they'd consider private. A term that I've seen bandied about a lot recently is Personally Identifiable Information (PII), which I'd never heard before starting work at MSN.

The Wikipedia definition for Personally Identifiable Information (PII) states

In information security and privacy, personally identifiable information or personally identifying information (PII) is any piece of information which can potentially be used to uniquely identify, contact, or locate a single person.

Items which might be considered PII include, but are not limited to, a person's:

Information that is not generally considered personally identifiable, because many people share the same trait, includes:

  • First or last name, if common
  • Country, state, or city of residence
  • Age, especially if non-specific
  • Gender or race
  • Name of the school they attend or workplace
  • Grades, salary, or job position
  • Criminal record

When a person wishes to remain anonymous, descriptions of them will often employ several of the above, such as "a 34-year-old black man who works at Target". Note that information can still be private, in the sense that a person may not wish for it to become publicly known, without being personally identifiable. Moreover, sometimes multiple pieces of information, none of which are PII, may uniquely identify a person when brought together; this is one reason that multiple pieces of evidence are usually presented at criminal trials. For example, there may be only one Inuit person named Steve in the town of Lincoln Park, Michigan.

In addition, there is the notion of sensitive PII. This is information that can be linked to a person and that the person desires to keep private due to its potential for abuse. Examples of "sensitive PII" are a person's medical/health conditions; racial or ethnic origin; political, religious or philosophical beliefs or affiliations; trade union membership or sex life.

Many online services such as MSN have strict rules about when PII should be collected from users, how it must be secured and under what conditions it can be shared with other entities. However, many Internet users don't understand that they disclose PII when using online services. Not only is there explicit collection of PII, such as when users provide their name, address and credit card information to online stores, but PII is also often collected implicitly in ways even savvy users fail to consider. For example, most Web servers log the IP addresses of incoming HTTP requests, which can in many cases be used to identify users. It's easy to forget that practically every website you visit stores your IP address somewhere on its servers as soon as you hit the site. Other examples aren't so obvious. There was a recent article on Boing Boing entitled Data Mining 101: Finding Subversives with Amazon Wishlists which showed how to obtain sensitive PII, such as people's political beliefs, from their wishlists on Amazon.com. A few years ago I read a blog post entitled Pets Considered Harmful which showed how one could obtain sensitive PII, such as someone's email password, simply by learning the name of the person's pet from their blog, since "What is the name of your cat?" was a security question GMail used to let people change their passwords.
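
To make the server-log point concrete, here is a minimal sketch of the kind of logging almost every web server does by default. It assumes a Node.js-style HTTP API purely for illustration; Apache, IIS and friends record the same information in their access logs without anyone writing any code at all:

// illustrative sketch: every incoming request leaves a trail in the server's logs
var http = require("http");

http.createServer(function (request, response) {
    // the visitor's IP address, plus what they asked for, ends up on disk
    console.log(new Date().toISOString() + " " +
                request.connection.remoteAddress + " " +
                request.method + " " + request.url);
    response.end("Hello");
}).listen(8080);

Multiply that by every page you load and every search you run, and the amount of implicit PII scattered across other people's servers adds up quickly.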

The reason I bring this stuff up is that I've seen people like Robert Scoble make comments about wanting "a button to click that shows everything that’s being collected from their experience". This really shows a lack of understanding of PII. Would such a button prevent users from revealing their political affiliations in their Amazon wishlists or giving would-be email account hijackers the keys to their accounts by blogging about their pets? I doubt it.

The problem is that most people don't realize that they've revealed too much information about themselves until something bad happens. Unfortunately, by then it is usually too late to do anything about it. If you are an Internet user, you should be cognizant of the amount of PII you are giving away by using web applications like search engines, blogs, email, instant messaging, online stores and even social bookmarking services.

Be careful out there.


 

Categories: Current Affairs

January 26, 2006
@ 06:54 PM

From the press release Microsoft Expands Internet Research Efforts With Founding of Live Labs we learn

REDMOND, Wash. — Jan. 25, 2006 —Microsoft Corp. today announced the formation of Microsoft® Live Labs, a research partnership between MSN® and Microsoft Research. Under the leadership of Dr. Gary William Flake, noted industry technologist and Microsoft technical fellow, Live Labs will consist of a dedicated group of researchers from MSN and Microsoft Research that will work with researchers across Microsoft and the academic research community. Live Labs will provide consistency in vision, leadership and infrastructure as well as a nimble applied research environment that fosters rapid innovations.

"Live Labs is a fantastic alliance between some of the best engineering and scientific talent in the world. It will be the pre-eminent applied research laboratory for Internet technologies," Flake said. “This is a very exciting opportunity for researchers and technologists to have an immediate impact on the next evolution of Microsoft's Internet products and services and will help unify our customers' digital world so they can easily find information, pursue their interests and enrich their lives."

The Live Labs — a confederation of dedicated technologists and affiliated researchers in pre-existing projects from around Microsoft — will focus on Internet-centric applied research programs including rapidly prototyping and launching of emerging technologies, incubating entirely new inventions, and improving and accelerating Windows Live™ offerings. This complements the company’s continuing deep investment in basic research at Microsoft Research and product development at MSN.

Ray Ozzie, Craig Mundie and David Vaskevitch, Microsoft’s chief technical officers, will serve as the Live Labs Advisory Board. Ozzie sees Live Labs as an agile environment for fast-tracking research from the lab into people’s hands. "Live Labs is taking an exciting approach that is both organic and consumer-driven," Ozzie said. "Within the context of a broad range of rich usage scenarios for Windows Live, the labs will explore new ways of bringing content, commerce and community to the Internet."

You can check out the site at http://labs.live.com/. It's unclear to me why we felt we had to apply the "Live" brand to what seems to be a subsection of http://research.microsoft.com/. I guess "Live" is going to be the new ".NET" and before the end of the year everything at Microsoft will have a "Live" version.

*sigh*


 

Categories: Windows Live

Today while browsing the Seattle Post Intelligencer, I saw an article with the headline Google agrees to censor results in China which began

SAN FRANCISCO -- Online search engine leader Google Inc. has agreed to censor its results in China, adhering to the country's free-speech restrictions in return for better access in the Internet's fastest growing market.

The Mountain View, Calif.-based company planned to roll out a new version of its search engine bearing China's Web suffix ".cn," on Wednesday. A Chinese-language version of Google's search engine has previously been available through the company's dot-com address in the United States. By creating a unique address for China, Google hopes to make its search engine more widely available and easier to use in the world's most populous country.
...
To obtain the Chinese license, Google agreed to omit Web content that the country's government finds objectionable. Google will base its censorship decisions on guidance provided by Chinese government officials.

Although China has loosened some of its controls in recent years, some topics, such as Taiwan's independence and 1989's Tiananmen Square massacre, remain forbidden subjects.

Google officials characterized the censorship concessions in China as an excruciating decision for a company that adopted "don't be evil" as a motto. But management believes it's a worthwhile sacrifice.

"We firmly believe, with our culture of innovation, Google can make meaningful and positive contributions to the already impressive pace of development in China," said Andrew McLaughlin, Google's senior policy counsel.

Google's decision rankled Reporters Without Borders, a media watchdog group that has sharply criticized Internet companies including Yahoo and Microsoft Corp.'s MSN.com for submitting to China's censorship regime.

No comment.


 

Brian Jones has a blog post entitled Corel to support Microsoft Office Open XML Formats which begins

Corel has stated that they will support the new XML formats in Wordperfect once we release Office '12'. We've already seen other applications like OpenOffice and Apple's TextEdit support the XML formats that we built in Office 2003. Now as we start providing the documentation around the new formats and move through Ecma we'll see more and more people come on board and support these new formats. Here is a quote from Jason Larock of Corel talking about the formats they are looking to support in coming versions (http://labs.pcw.co.uk/2006/01/new_wordperfect_1.html):

Larock said no product could match Wordperfect's support for a wide variety of formats and Corel would include OpenXML when Office 12 is released. "We work with Microsoft now and we will continue to work with Microsoft, which owns 90 percent of the market. We would basically cut ourselves off if we didn't support the format."

But he admitted that X3 does not support the Open Document Format (ODF), which is being proposed as a rival standard, "because no customer that we are currently dealing with has asked us to do so."

X3 does however allow the import and export of portable document format (pdf) files, something Microsoft has promised for Office 12.

I mention this article because I wanted to again stress that even our competitors will now have clear documentation that allows them to read and write our formats. That isn't really as big of a deal though as the fact that any solution provider can do this. It means that the documents can now be easily accessed 100 years from now, and start to play a more meaningful role in business processes.

Again I want to extend my kudos to Brian and the rest of the folks on the Office team who have been instrumental in the transition of the Microsoft Office file formats from proprietary binary formats to open XML formats.


 

Categories: Mindless Link Propagation | XML

Sunava Dutta on the Internet Explorer team has written about their support for a Native XMLHTTPRequest object in IE 7. He writes

I’m excited to mention that IE7 will support a scriptable native version of XMLHTTP. This can be instantiated using the same syntax across different browsers and decouples AJAX functionality from an ActiveX enabled environment.

What is XMLHTTP?

XMLHTTP was first introduced to the world as an ActiveX control in Internet Explorer 5.0. Over time, this object has been implemented by other browsing platforms, and is the cornerstone of “AJAX” web applications. The object allows web pages to send and receive XML (or other data) via the HTTP protocol. XMLHTTP makes it possible to create responsive web applications that do not require redownloading the entire page to display new data. Popular examples of AJAX applications include the Beta version of Windows Live Local, Microsoft Outlook Web Access, and Google’s GMail.

Charting the changes: XMLHTTP in IE7 vs. IE6

In IE6 and below, XMLHTTP is implemented as an ActiveX object provided by MSXML.

In IE7, XMLHTTP is now also exposed as a native script object. Users and organizations that choose to disable ActiveX controls can still use XMLHTTP based web applications. (Note that an organization may use Group Policy or IE Options to disable the new native XMLHTTP object if desired.) As part of our continuing security improvements we now allow clients to configure and customize a security policy of their choice and simultaneously retain functionality across key AJAX scenarios.

IE7’s implementation of the XMLHTTP object is consistent with that of other browsers, simplifying the task of cross-browser compatibility.  Using just a bit of script, it’s easy to build a function which works with any browser that supports XMLHTTP:

if (window.XMLHttpRequest) {
    // If IE7, Mozilla, Safari, etc.: use the native object
    var xmlHttp = new XMLHttpRequest();
}
else if (window.ActiveXObject) {
    // ...otherwise, use the ActiveX control for IE5.x and IE6
    var xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
}

Note that IE7 will still support the legacy ActiveX implementation of XMLHTTP alongside the new native object, so pages currently using the ActiveX control will not require rewrites.
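
To round out the example, here is a rough sketch of how the object created above is typically used to make an asynchronous request; the URL is just a placeholder:

// minimal usage sketch: fetch a resource asynchronously using the xmlHttp object from above
xmlHttp.onreadystatechange = function () {
    // readyState 4 means the response has been completely received
    if (xmlHttp.readyState == 4 && xmlHttp.status == 200) {
        alert(xmlHttp.responseText);
    }
};
xmlHttp.open("GET", "/some/data.xml", true);
xmlHttp.send(null);

The same script runs unchanged whether the object came from the native implementation or the ActiveX control, which is the whole point of the cross-browser check.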

I wonder if anyone else sees the irony in Internet Explorer copying features from Firefox which were originally copied from IE?


 

Categories: Web Development

Brad Fitzpatrick, founder of LiveJournal, has a blog post entitled Firefox bugs where he talks about some of the issues that led to the recent account hijackings on the LiveJournal service.

What I found most interesting were Brad's comments on Bug# 324253 - Do Something about the XSS issues that -moz-binding introduces in the Firefox bugzilla database. Brad wrote

Hello, this is Brad Fitzpatrick from LiveJournal.

Just to clear up any confusion: we do have a very strict HTML sanitizer. But we made the decision (years ago) to allow users to host CSS files offsite because... why not? It's just style declarations, right?

But then came along behavior, expression, -moz-binding, etc, etc...

Now CSS is full of JavaScript. Bleh.

But Internet Explorer has two huge advantages over Mozilla:

-- HttpOnly cookies (Bug 178993), which LiveJournal sponsored for Mozilla, over a year ago. Still not in tree.

-- same-origin restrictions, so an offsite behavior/binding can't mess with the calling node's DOM/Cookies/etc.

Either one of these would've saved our ass.

Now, I understand the need to innovate and add things like -moz-bindings, but please keep in mind the authors of webapps, who are fighting a constant battle to improve their HTML sanitizers against new features which are added to browsers.

What we'd REALLY love is some document meta tag or HTTP response header that declares the local document safe from all external scripts. HttpOnly cookies are such a beautiful idea, we'd be happy with just that, but Comment 10 is also a great proposal... being able to declare the trust level, effectively, of external resources. Then our HTML cleaner would just insert/remove the untrusted/trusted, respectively.

Cross-site scripting attacks are a big problem for websites that allow users to provide HTML input. LiveJournal isn't the only major blogging site to have been hit by them; last year the 'samy is my hero' worm hit MySpace and caused some downtime for the service.

What I find interesting from Brad's post is how, on the one hand, having richer features in browsers is desirable (e.g. embedded JavaScript in CSS) and, on the other, it becomes a burden for developers building web apps, who now have to worry that even stylesheets can contain malicious code.
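
To make the burden concrete, here is a rough sketch of the extra checking an HTML sanitizer now has to run over user-supplied CSS; the function name is mine and the list of dangerous constructs is illustrative rather than exhaustive:

// rough sketch: flag CSS constructs that can end up executing script (illustrative, not exhaustive)
function cssLooksDangerous(css) {
    var suspicious = [
        /expression\s*\(/i,    // IE's CSS expressions
        /-moz-binding\s*:/i,   // Mozilla XBL bindings
        /behavior\s*:/i,       // IE HTC behaviors
        /javascript\s*:/i      // javascript: URLs inside url(...)
    ];
    for (var i = 0; i < suspicious.length; i++) {
        if (suspicious[i].test(css)) {
            return true;
        }
    }
    return false;
}

Every new script-capable CSS feature a browser ships is another pattern that every sanitizer on the web has to learn about after the fact.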

The major browser vendors really need to do a better job here. I totally agree with one of the follow-up comments in the bug, which stated "If Moz & Microsoft can agree on SSL/anti-phishing policy and an RSS icon, is consensus on scripting security policy too hard to imagine?" Collaborating on simple stuff like which orange icon to use for subscribing to feeds is nice, but areas like Web security could do with more standardization across browsers. I wonder if the WHATWG is working on standardizing anything in this area...


 

Categories: Web Development

January 23, 2006
@ 10:42 PM

I don't usually spam folks with links to amusing video clips that are making the rounds in email inboxes, but the video of Aussie comedy group Tripod performing their song "Make You Happy Tonight" struck a chord with me because I did what the song talks about this weekend.

The game in question was Star Wars: Knights of the Old Republic II. :)


 

One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs but allows them to map between these formats using technologies like XSLT. However, XML technologies like XSLT focus on mapping one kind of syntax to another. There is another school of thought from proponents of Semantic Web technologies like RDF, OWL and DAML+OIL, which holds that higher-level mapping between the semantics of languages is a better approach.

In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactical mappings and semantic mappings. A great example of this is shown in the post entitled On the Quality of Metadata... by Stefano Mazzocchi where he writes

One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: every time we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function on how much thinking you put in the data mappings or ontology crosswalks, and talking to Ben helped me understand why.

First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata.

Note how this is nothing but a syllogism, yet it's something that, rationally or not, comes up all the time.

This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles".

The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song.

At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librarians would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist names, what did you expect?".. deep inside, they are saying the same thing: you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier. The librarian shakes the semantic web advocate's hand, nodding vehemently and they are happy campers.

The problem Stefano has pointed out is that just being able to say that two items are semantically identical (i.e. an artist field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (i.e. alter artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.
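
To make that concrete, here is a deliberately simplistic sketch of the syntactic fix-up the merged dataset would need even after the two fields have been declared semantically equivalent:

// simplistic sketch: rewrite "Beatles, The" as "The Beatles" when merging the two datasets
function normalizeArtistName(name) {
    var match = name.match(/^(.+),\s*(The)$/i);
    if (match) {
        return match[2] + " " + match[1];   // "Beatles, The" -> "The Beatles"
    }
    return name;                            // leave every other name untouched
}

Real catalogues need far more rules than this, which is exactly the point: the semantic assertion alone doesn't produce a clean merge.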

The example I tend to draw from my personal experience is mapping between different XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between the two formats isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or <atom:author> owl:sameAs <author>. In both cases, an application that understands how to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax of the equivalent elements in the other (e.g. RFC 3339 dates as opposed to RFC 822 dates).
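
For example, a feed reader that wants to map <atom:published> onto <pubDate> still needs a syntactic shim along these lines. This is a rough sketch; toUTCString() produces output that is close to, but not strictly, RFC 822, so a careful implementation would format the fields explicitly:

// rough sketch: convert an RFC 3339 date (atom:published) into an RFC 822-style date (pubDate)
function rfc3339ToRfc822(value) {
    var date = new Date(value);   // recent JavaScript engines parse RFC 3339 strings directly
    return date.toUTCString();    // e.g. "Thu, 26 Jan 2006 18:54:00 GMT"
}

rfc3339ToRfc822("2006-01-26T18:54:00Z");   // -> "Thu, 26 Jan 2006 18:54:00 GMT"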

Proponents of Semantic Web technologies tend to gloss over these harsh realities of mapping between vocabularies in the real world. I've seen claims that simply using XML technologies for mapping between XML vocabularies means you will need on the order of N² transforms, as opposed to needing only 2N transforms if using Semantic Web technologies; with ten vocabularies that works out to roughly ninety pairwise transforms versus twenty (Stefano mentions this in his post, as has Ken Macleod in his post XML vs. RDF :: N × M vs. N + M). The explicit assumption here is that these vocabularies have similar data models and semantics, which should be true, otherwise a mapping wouldn't be possible. However, the implicit assumption is that the syntax of each vocabulary is practically identical (e.g. same naming conventions, same date formats, etc.), and this post has provided a few examples where that is not the case.

What I'd be interested in seeing is whether there is a way to get some of the benefits of Semantic Web technologies while acknowledging the need for syntactical mappings as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...


 

Categories: Web Development | XML

January 21, 2006
@ 02:26 AM

There's been a bunch of speculation about the recent DOJ requests for logs from the major search engines. Ken Moss of the MSN Search team tells their side of the story in his post Privacy and MSN Search. He writes

There’s been quite a frenzy of speculation over the past 24 hours regarding the request by the government for some data in relation to a child online protection lawsuit. Obviously privacy and child protection are both super important topics – so I’m glad this discussion is happening.

Some facts have been reported, but mostly I’ve seen a ton of speculation reported as facts.   I wanted to use this blog post to clarify some facts and to share with you what we are thinking here at MSN Search.

Let me start with this core principle statement: privacy of our customers is non-negotiable and something worth fighting to protect.

Now, on to the specifics.  

Over the summer we were subpoenaed by the DOJ regarding a lawsuit. The subpoena requested that we produce data from our search service. We worked hard to scope the request to something that would be consistent with this principle. The applicable parties to the case received this data, and the parties agreed that the information specific to this case would remain confidential. Specifically, we produced a random sample of pages from our index and some aggregated query logs that listed queries and how often they occurred. Absolutely no personal data was involved.

With this data you:

  • CAN see how frequently some query terms occurred.
  • CANNOT look up an IP and see what they queried.
  • CANNOT look for users who queried for both “TERM A” and “TERM B”.

At MSN Search, we have strict guidelines in place to protect the privacy of our customers’ data, and I think you’ll agree that privacy was fully protected. We tried to strike the right balance in a very sensitive matter.

I've been surprised at how much rampant speculation from blogs has been reported in mainstream media articles as facts without people getting information directly from the source.


 

Categories: Current Affairs

A user of RSS Bandit recently forwarded me a discussion on the atom-syntax mailing list which criticized some of our design decisions. In an email in the thread entitled Reader 'updated' semantics Tim Bray wrote

On Jan 10, 2006, at 9:07 AM, James M Snell wrote:

In RSS there is definite confusion on what constitutes an update. In
Atom it is very clear. If <updated> changes, the item has been updated.
No controversy at all.

Indeed. There's a word for behavior of RssBandit and Sage: WRONG. Atom provides a 100% unambiguous way to say "this is the same entry, but it's been changed and the publisher thinks the change is significant." Software that chooses to hide this fact from users is broken - arguably dangerous to those who depend on receiving timely and accurate information - and should not be used. -Tim

People who write technology specifications often have good intentions, but unfortunately they frequently aren't implementers of the specs they create. This leads to disconnects between what is actually in the spec and reality.

The problem with updates to blog posts is straightforward. There are minor updates which don't warrant signalling to the user, such as typos being fixed (e.g. "12 of 13 miner survive mine collapse" changed to "12 of 13 miners survive mine collapse"), and those which do, because they significantly change the story (e.g. "12 of 13 miners survive mine collapse" changed to "12 of 13 miners killed in mine collapse").

James Snell is right that it is ambiguous how to detect this in RSS but not in Atom due to the existence of the atom:updated element. The Atom spec states

The "atom:updated" element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value.

On paper it sounds like this solves the problem, and on paper it does. However, for this to work correctly, weblog software now needs to include an option such as 'Indicate that this change is significant' when users edit posts. Without such an option, the software cannot correctly support the atom:updated element. Since I haven't found any mainstream tools that support this functionality, I haven't bothered to implement a feature which is likely to annoy users more often than be useful, since many people edit their blog posts in ways that don't warrant alerting the reader.
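
For what it's worth, the reader-side logic the spec implies is simple enough. Here is a rough sketch that assumes the entry's atom:id and atom:updated values have already been parsed out of the old and new copies of the feed:

// rough sketch: decide whether an entry should be flagged as significantly updated,
// assuming atom:id and atom:updated have already been extracted from both copies
function isSignificantUpdate(storedEntry, fetchedEntry) {
    if (storedEntry.id != fetchedEntry.id) {
        return false;   // different entries, so this isn't an update at all
    }
    // the publisher is supposed to bump atom:updated only for changes it considers significant
    return new Date(fetchedEntry.updated) > new Date(storedEntry.updated);
}

The hard part isn't this comparison; it's getting publishing tools to set atom:updated only when a change really is significant in the first place.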

However I do plan to add features for indicating when posts have changed in unambiguous scenarios such as when new comments are added to a blog post of interest to the user. The question I have for our users is how would you like this indicated in the RSS Bandit user interface?


 

Categories: RSS Bandit