February 11, 2004
@ 04:02 PM

One of the big problems with arguing about metadata is that one person's data is another person's metadata. I was reading Joshua Allen's blog post entitled Trolling EFNet, or Promiscuous Memories, where he wrote:

  • Some people deride "metacrap" and complain that "nobody will enter all of that metadata".  These people display a stunning lack of vision and imagination, and should be pitied.  Simply by living their lives, people produce immense amounts of metadata about themselves and their relationships to things, places, and others that can be harvested passively and in a relatively low-tech manner.
  • Being able to remember what we have experienced is very powerful.  Being able to "remember" what other people have experienced is also very powerful.  Language improved our ability to share experiences to others, and written language made it possible to communicate experiences beyond the wall of death, but that was just the beginning.  How will your life change when you can near-instantly "remember" the relevant experiences of millions of other people and in increasingly richer detail and varied modality?
From my perspective it seems Joshua is confusing data and metadata. If I had a video camera attached to my forehead recording everything I saw, then the actual audiovisual content of the files on my hard drive is the data, while the metadata is information such as what date it was, where I was and who I saw. Basically, metadata is data about data. The interesting thing about metadata is that if we have enough good-quality metadata then we can do things like near-instantly "remember" the relevant experiences of ourselves and millions of other people. It won't matter that all my experiences are cataloged and stored on a hard drive if the retrieval process isn't automated (e.g. I can 'search' for experiences by who they were shared with, where they occurred or when they occurred), leaving me to fast forward through gigabytes of video data. The metadata ideal would be that all this extra, descriptive information would be attached to my audiovisual experiences stored on disk so I could quickly search for “videos from conversations with my boss in October, 2003”.
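
To make the retrieval idea concrete, here is a minimal Python sketch of searching recordings by their metadata rather than by scanning the video itself. The `Recording` class, its fields, the file paths and the `search` function are all invented for illustration, and the sketch assumes the metadata already exists and is accurate, which is exactly the hard part.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Recording:
    """One stored experience: the file is the data, the other fields are metadata."""
    path: str                                       # where the audiovisual data lives
    recorded_on: date                               # when it happened
    place: str                                      # where it happened
    people: set[str] = field(default_factory=set)   # who was there


def search(recordings, person=None, place=None, year=None, month=None):
    """Return the recordings whose metadata matches every criterion that was given."""
    results = []
    for rec in recordings:
        if person is not None and person not in rec.people:
            continue
        if place is not None and rec.place != place:
            continue
        if year is not None and rec.recorded_on.year != year:
            continue
        if month is not None and rec.recorded_on.month != month:
            continue
        results.append(rec)
    return results


# Hypothetical library of recordings; every value here is made up.
LIBRARY = [
    Recording("video/0001.avi", date(2003, 10, 3), "office", {"boss"}),
    Recording("video/0002.avi", date(2003, 10, 17), "cafeteria", {"boss", "coworker"}),
    Recording("video/0003.avi", date(2003, 11, 2), "office", {"boss"}),
    Recording("video/0004.avi", date(2003, 10, 20), "home", {"friend"}),
]

# "Videos from conversations with my boss in October, 2003"
october_boss = search(LIBRARY, person="boss", year=2003, month=10)
print([rec.path for rec in october_boss])
```

The query itself is trivial once the fields are filled in; none of the video data is ever read.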

This is where metacrap comes in. From Cory Doctorow's excellent article entitled Metacrap:

    A world of exhaustive, reliable metadata would be a utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities.

This applies to Joshua's vision as well. Data acquisition is easy; anyone can walk around with a camcorder or digital camera today recording everything they can. Effectively tagging the content so it can be categorized in a way that lets you do interesting things with it search-wise is infeasible. Cory's article does a much better job than I can of explaining the many different ways this is infeasible; cameras with datestamps and built-in GPS are just the tip of the iceberg. I can barely remember dates once the event didn't happen in the recent past and wasn't a special occasion. As for built-in GPS, until the software is smart enough to convert longitude and latitude coordinates to “that Chuck E Cheese in Redmond” it only solves problems for geeks, not regular people. I'm sure technology will get better, but metacrap is and may always be an insurmountable problem on a global network like the World Wide Web without lots of standardization.
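
The coordinate-to-place-name step can be sketched as a nearest-neighbour lookup against a table of named places. This is only a rough Python illustration: the `PLACES` table, its approximate coordinates, the `max_km` cutoff and both function names are assumptions made up for the example, and a table of city names is still a long way from "that Chuck E Cheese in Redmond".

```python
import math

# Hypothetical gazetteer; the coordinates are approximate and for illustration only.
PLACES = {
    "Redmond, WA": (47.674, -122.121),
    "Seattle, WA": (47.606, -122.332),
    "Bellevue, WA": (47.610, -122.201),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_place(point, places=PLACES, max_km=5.0):
    """Return the name of the closest known place, or None if nothing is within max_km."""
    name, coords = min(places.items(), key=lambda item: haversine_km(point, item[1]))
    if haversine_km(point, coords) > max_km:
        return None
    return name


# A camera's raw GPS stamp becomes something a person might actually search for.
print(nearest_place((47.67, -122.12)))
```

The arithmetic is the easy part. Someone still has to build and maintain the table of names, and agree on what the names are.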


    Wednesday, February 11, 2004 4:50:44 PM (GMT Standard Time, UTC+00:00)
    We seem to agree that metadata is just data. In my post specifically I was addressing the people who say that "nobody will enter all of that metadata", and leaving out of scope discussions about the quality or trustworthiness of the metadata. If I succeeded in getting across the point that only a retard would think that metadata has to be entered by hand, then I accomplished my goal for yesterday -- I'm just getting started. Now briefly, with respect to the whole "quality" thing, I would point out that all of those arguments apply to any sort of data, and in fact you are right that data acquisition is no utopia. But Doctorow is setting up a strawman by talking about "A world of exhaustive, reliable metadata". That's like talking about a world of exhaustive, reliable data. We don't have it, and we don't need it. People have corrupt data and selective memories, but nobody is saying that processing data or remembering things is useless. I also don't think that standardization is the biggest deal. GPS is standardized and quite reliable; RFID will soon be ubiquitous, and so on -- the standardization that matters most is already happening as part of "normal" data acquisition, and I don't think metadata is that significantly different. Trust metrics and filtering noise will be more difficult. But in any case, the fact that some things are difficult does not mean that there is not a whole universe of other things that can be done which are very useful.
    Wednesday, February 11, 2004 6:40:30 PM (GMT Standard Time, UTC+00:00)
    If metadata is produced in the woods and nobody captures it, did it make a noise?

    Dare said it best: one person's data is another person's meta-data. What about the 5 peta-bytes of data generated last year (the sum of all words spoken by all people across all time)? What about the same amount * N that is being produced this year? Next year? This is a universe of data/meta-data ... but to what end?

    For over 20 years, AI researchers have known that meta-data needs to be in the context of the problem/search space being attacked. Meta-data outside of that context is ... metacrap. It's not a strawman. I believe it's a universal axiom.

    In the context of smart-searching of feeds and blogs -- you've started to isolate the context. You *can* auto-tag certain types of things (but there are potential copyright/infringement issues). For example, I'm interested in building a service that does relatively simple re-tagging of feeds (along the lines of some of the work that Jon Udell has been doing over the past year around XPath/XQuery). The more limited the domain, the easier it is to bound the problem, develop appropriate heuristics, etc. For example, one could easily envision a way to modify RSSBandit to add additional CSS class declarations around un-tagged code blocks (finding VB, Python, Java, C#). But these are unnatural languages, and writing the grammar checker/finder is relatively easy. Unbounded problems like the CYC project that was (is) attempting to codify all knowledge ... well, it's an unbounded problem and subject to natural language (i.e. human) interpretation. If only everything could be restructured in a universal language (first attempted 300+ years ago and then again 100 years ago). That would make it easier now, wouldn't it.
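
A rough Python sketch of that bounded code-block tagging idea follows. It is not how RSSBandit works: the `HINTS` patterns, the `lang-` class prefix and both function names are invented for illustration, and it assumes untagged code sits in bare `<pre>` elements with unescaped content.

```python
import re

# Hypothetical per-language hints; a real detector would need many more patterns.
HINTS = {
    "python": [r"^\s*def \w+\(.*\):", r"^\s*elif\b", r"\bself\."],
    "vb": [r"\bEnd Sub\b", r"\bDim \w+ As \w+", r"\bEnd Function\b"],
    "java": [r"\bSystem\.out\.println\b", r"\bpublic static void main\b", r"^\s*import java\."],
    "csharp": [r"\bConsole\.WriteLine\b", r"^\s*using System", r"\bnamespace \w+"],
}


def guess_language(code):
    """Return the language whose hints match most often, or None if no hint matches."""
    scores = {
        lang: sum(bool(re.search(pattern, code, re.MULTILINE)) for pattern in patterns)
        for lang, patterns in HINTS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the language listed first in HINTS
    return best if scores[best] > 0 else None


# Only bare <pre> elements are considered; blocks that already carry attributes are left alone.
UNTAGGED_PRE = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)


def tag_code_blocks(html):
    """Add a CSS class to each bare <pre> block whose language can be guessed."""
    def add_class(match):
        lang = guess_language(match.group(1))
        if lang is None:
            return match.group(0)
        return f'<pre class="lang-{lang}">{match.group(1)}</pre>'

    return UNTAGGED_PRE.sub(add_class, html)


print(tag_code_blocks('<p>Example:</p><pre>Dim total As Integer</pre>'))
```

The heuristics stay manageable only because the target is four programming languages with rigid syntax; the same approach does not stretch to free-form prose.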

    The alternative of course (I guess it's the retarded method) is to hand-code this stuff. Guess what? A great many blog authors do this (e.g. each has their own unique categorization approach -- some with hierarchies, some without). Creating a standard won't solve the fact that Dare and many others already have great commentary that won't meet the to-be-agreed-upon standard of what classifications should have been used ... blah blah blah. I've said enough (meta)crap for now.
    Wednesday, February 11, 2004 10:59:45 PM (GMT Standard Time, UTC+00:00)
    I find the metadata in Dare's and Cory's RSS feeds quite acceptable.

    But I really do suggest you look into RDF some more. Considerably more sophisticated applications of metadata are feasible, useful and already being deployed. Check these testimonials:
    Thursday, February 12, 2004 12:41:12 AM (GMT Standard Time, UTC+00:00)
    Phil, I still think it's a strawman. Everyone knows that categorization is a hard problem, and Bayesian classifiers still suck. I wasn't talking about blog categorization in my post. But RFID and GPS are pretty good; much better than the manual alternatives in fact. There are tons of amazing things that can be done without requiring categorization; I never suggested that writing a blog would be easier; there are much more important things than blogs in the world, and progress can be made on those things regardless of whether Cory likes Bayesian classifiers or whatever.
    Thursday, February 12, 2004 12:45:45 PM (GMT Standard Time, UTC+00:00)
    I'm intrigued ... I wasn't actually thinking of Bayesian stuff; I came out of the case-based reasoning camp in AI circles. The work I did way back then was at the representational level and we called the stuff that we did 'semantic networks'. We had lots of link-types between nodes and it was the hand-coding of the link-types and the corresponding rules that was the 'judgement part'. Perhaps that's the source of my bias around hand-coding, the difficulty, and, oh yeah, the incredibly powerful, near-magic-like reasoning capabilities that could be accomplished in limited contexts. What I found 20 years ago was that when the representations and rules were maintained by small groups who shared similar world views (and language), you could evolve systems that had few false positives in the search results. [Note: I've not been involved in this type of stuff directly since the late 80's and haven't followed any of the research, etc.]

    But as soon as others joined the team and brought different viewpoints to the table ... well ... the systems degraded. This was all done in Lisp, a language that not only allowed for very flexible representational capabilities, but also encouraged introspection and extension.

    I'd like to understand where you're coming from on the RFID and GPS front as I'm directly involved in productizing some new software around these technologies.