A few years ago Joel Spolsky wrote a widely quoted blog post which stated:

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. "Google uses Bayesian filtering the way Microsoft uses the if statement," he said. That's true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn't.

This morning I fired up my favorite RSS reader and saw danah boyd's entry entitled "quality of Google searches?" where she pointed out:

It's also annoying that they've stopped correcting my atrocious spelling. I mean, it's all fine and well that lots of people in the blogosphere can't spell in exactly the same way that i can't spell, but the #1 type of search i do everyday is spell check. I throw something god-awful like Cziskentmihalyi into the engine knowing that it'll return Csikszentmihalyi. This still works quite well for names but it's stopped working for lots of regular words that i just can't spell to save my life. How pathetic is it that i've started opening up Word for the little red squigglies instead of relying on search? Or maybe both practices are weird...

This seems like a predictable problem. There are lots of commonly misspelled words and in many online communities people have simply given up on correct spelling (heck, I've now grown used to the fact that computer geeks have decided that the correct spelling of ridiculous is rediculous). Thus it is quite likely that a frequently misspelled word eventually occurs so much in the wild that it is considered a valid word. Maybe Google needs some if statements in their code after all, instead of blindly trusting the popularity contest that is Bayesian analysis. :)
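To make the failure mode concrete, here is a minimal sketch of a purely frequency-driven spelling corrector. It is not Google's actual algorithm (which isn't public) and it is a simplification of real Bayesian correction; the names `edits1`, `correct` and `min_count`, and all the word counts, are made up for illustration. The only rule is "a word seen often enough is a word", which is exactly the assumption that breaks when a misspelling gets popular.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts, min_count=50):
    """Accept word if the corpus has seen it at least min_count times.

    Otherwise suggest the most frequent accepted word one edit away,
    or give the input back unchanged if there is no such word.
    """
    if counts[word] >= min_count:
        return word
    candidates = [w for w in edits1(word) if counts[w] >= min_count]
    return max(candidates, key=lambda w: counts[w]) if candidates else word


# Hypothetical counts: the misspelling starts rare, then becomes common.
early_web = Counter({"ridiculous": 1000, "rediculous": 3})
later_web = Counter({"ridiculous": 1000, "rediculous": 400})

print(correct("rediculous", early_web))  # suggests the dictionary spelling
print(correct("rediculous", later_web))  # misspelling is now "a word"
```

Nothing in the code changed between the two calls, only the data did. That is the trade-off: the statistics give you name correction for free, but they also inherit whatever habits the corpus picks up.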

The interesting thing is that one could argue that if a particular spelling of a word becomes popular then it automatically is "correct" since that is how the English language has evolved over time anyway.


 

Categories: Competitors/Web Companies

Thursday, May 10, 2007 7:31:26 PM (GMT Daylight Time, UTC+01:00)
Google does Bayesian analysis of non-English languages the way Microsoft does if statements of the English language.

It might be true that Bayesian analysis is not the only thing one should use, but it does not follow that if statements are the immediate fallback. It makes more sense to seed your data set with some high-quality data, for example WordNet, and then rate that data higher.
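A rough sketch of that idea, under some loud assumptions: `curated` is a tiny hand-written set standing in for a trusted source such as WordNet (this does not call any WordNet API), the web counts are invented, and `boost` is an arbitrary weight chosen for the example.

```python
def weighted_score(word, web_counts, curated, boost=10.0):
    """Web frequency (plus one, so unseen words still score) times a
    boost factor when the word appears in the curated word list."""
    base = web_counts.get(word, 0) + 1
    return base * (boost if word in curated else 1.0)


def best_spelling(variants, web_counts, curated, boost=10.0):
    """Pick the variant with the highest weighted score."""
    return max(variants,
               key=lambda w: weighted_score(w, web_counts, curated, boost))


# Hypothetical data: the misspelling is four times more common on the web.
web_counts = {"ridiculous": 1000, "rediculous": 4000}
curated = {"ridiculous"}  # stand-in for a trusted dictionary
variants = ["ridiculous", "rediculous"]

print(best_spelling(variants, web_counts, curated))             # boosted
print(best_spelling(variants, web_counts, curated, boost=1.0))  # raw counts
```

The statistics still do the work, so there are no per-word if statements, but the trusted entries get a head start that a popular misspelling has to overcome by a wide margin.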

bryan
Thursday, May 10, 2007 9:22:19 PM (GMT Daylight Time, UTC+01:00)
The funniest thing about Joel's article is that he 'proves' his point by searching for "Splosky" on Google (click the "spell checking" link in "Look at how Google does spell checking:..." or go to http://www.google.com/search?hl=en&q=splosky&btnG=Google+Search ).

At the time the article was written, Google would actually suggest that you should be searching for Spolsky instead. In the meantime, so many people got it wrong that now Google happily returns pages containing the incorrect spelling.

For those who haven't taken an algorithm class in their life, an algorithm must be rigorously specified, finite and consistent (every execution of the algorithm with the same input set must provide the same output set). I'd hardly rely on something that takes the same input and produces two different outputs for spell checking.

PS: I configured my IE to block all cookies by default. When trying to post a comment to your blog, I'm unable to do so, presumably because of this setting. I was able to post from Firefox (which is configured to save all cookies, and delete them at exit time). Maybe you want to pass this on to the author of your blog software.
Haha
Friday, May 11, 2007 12:17:53 AM (GMT Daylight Time, UTC+01:00)
Is it possible this is because lots of common misspellings are now covered with Adword spam sites? Google has a vested interest in getting people onto those sites.
Anon
Friday, May 11, 2007 8:16:38 AM (GMT Daylight Time, UTC+01:00)
Also, regardless of the correct spelling, do you want to use the spelling which returns 20000 pages or the one which returns 1000 pages? You might just want the incorrect spelling if it gives you better (quantitative) results; otoh, the correctly spelled results might be better qualitative-wise ;)
Friday, May 11, 2007 11:15:25 AM (GMT Daylight Time, UTC+01:00)
Consider that spelling is a matter of fashion, for it hath evolved through time. There ifn't one right way of spelling.
Friday, May 11, 2007 2:04:12 PM (GMT Daylight Time, UTC+01:00)
The thing I do not like about the common search engines is that they do not recognize documents with similar content. It happens often on the Web that a post or document is spread out over more than 50 websites. Now that is great for the author but not for the searcher, because it blows up your search result unnecessarily. With InfoCodex this will not happen, because the linguistic database recognizes similar documents and puts them into groups.

http://www.ywesee.com/pmwiki.php/Ywesee/InfoCodexProcedure
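For readers wondering what grouping similar documents can look like in code, here is a generic sketch using word shingles and Jaccard similarity. This is a swapped-in textbook technique, not InfoCodex's method (which relies on a linguistic database); the function names, the shingle size `k` and the `threshold` value are all assumptions made for the example.

```python
def shingles(text, k=3):
    """Set of k-word sequences in text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.5, k=3):
    """Greedy grouping: each document joins the first group whose
    representative (its first member) is similar enough, otherwise
    it starts a new group. Returns lists of document indices."""
    groups = []
    for index, doc in enumerate(docs):
        doc_shingles = shingles(doc, k)
        for group in groups:
            if jaccard(doc_shingles, group["rep"]) >= threshold:
                group["members"].append(index)
                break
        else:
            groups.append({"rep": doc_shingles, "members": [index]})
    return [group["members"] for group in groups]


docs = [
    "Google uses Bayesian filtering the way Microsoft uses the if statement",
    "google uses bayesian filtering the way microsoft uses the if statement today",
    "spelling is a matter of fashion",
]
print(group_similar(docs))
```

Comparing every document against every group representative is fine for a handful of results but does not scale to a whole index; this sketch only illustrates the grouping idea.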

Three things a modern Search engine should do:

1. Automatically classify a document according to its content.
2. Automatically generate an abstract of a document.
3. Generate a Heat-Map of the Contents of a Search Result.

http://www.ywesee.com/uploads/Main/InfoCodex_22.2.2007.pdf
Friday, May 11, 2007 2:33:50 PM (GMT Daylight Time, UTC+01:00)
You should have seen the argument I had with my English teacher over the spelling of program when used to mean computer program. In British English the non-computer word is programme, but I've never seen anyone say computer programme. How could a kid know more than her?

Oh yes, we're happy to talk about dialog boxes, imagine if they were called dialogue boxes. We hang onto colour though :-)
Friday, May 11, 2007 8:40:05 PM (GMT Daylight Time, UTC+01:00)
Ah, now I can finally understand why GMail's spell checking feature recommends words that don't exist.
Oh well, together with Firefox's V2 spell checker I almost manage to get my spelling sorted.