While I was in the cafeteria with Mike Vernal this afternoon, I bumped into some members of the Windows Desktop Search team. They mentioned that they'd heard I had decided to go with Lucene.NET for the search feature of RSS Bandit instead of using WDS. Much to my surprise, they were quite supportive of my decision and agreed that Lucene.NET is a better solution for my particular problem than relying on WDS. In addition, they brought an experienced perspective to a question that Torsten and I had begun to ask ourselves: how to deal with languages other than English.

When building a search index, the indexer has to know which stop words it shouldn't index (e.g. a, an, the) and have some knowledge about word boundaries. Where things get tricky is that a user can receive content in multiple languages. You may receive email in Japanese from some friends and in English from others. Similarly, you could subscribe to some feeds in French and others in Chinese. Our original thinking was that we would have to figure out the language of each feed and build a separate search index for each language. This approach seemed error-prone for a number of reasons:

  1. Many feeds don't provide information about what language they are in
  2. People tend to mix different languages in their speech and writing. Spanglish anyone?
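
To make the stop word problem concrete, here is a minimal sketch in Python. It is not RSS Bandit's code (which is .NET and uses Lucene.NET); the `STOP_WORDS` table and the `index_terms` function are hypothetical names made up for illustration, and the word lists are deliberately tiny.

```python
import re

# Tiny illustrative stop-word lists (assumption); real analyzers ship
# much larger, carefully curated lists per language.
STOP_WORDS = {
    "en": {"a", "an", "the"},
    "fr": {"le", "la", "les", "un", "une"},
}

def index_terms(text, language):
    """Return the terms a naive indexer would keep for `text`,
    treating it as written in `language`."""
    stop_words = STOP_WORDS.get(language, set())
    # \w+ splits on whitespace and punctuation. That is only adequate for
    # languages that separate words with spaces, not Chinese or Japanese.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(index_terms("The cat sat on a mat", "en"))
print(index_terms("La vie en rose", "en"))
print(index_terms("La vie en rose", "fr"))
```

Treating the French phrase as English keeps "la" in the index, while treating it as French drops it. That is the kind of mismatch we would get whenever the guessed language of a feed is wrong or the text mixes languages.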

The Windows Desktop Search folks advised that instead of building a complicated solution that wasn't likely to work correctly in the general case, we should consider simply choosing the indexer based on the locale/language of the operating system. This is already how we determine which language to display in the UI. We have also considered allowing users to change the UI language in the future, which would also affect the search indexer if we chose this approach. The approach assumes that people read feeds primarily in the language they chose for their operating system. This seems like a valid assumption, but I'd like to hear from RSS Bandit users whether this is indeed the case.
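
The locale-based choice could look roughly like the following Python sketch. Again, this is an illustration under stated assumptions rather than what RSS Bandit does: `SUPPORTED_LANGUAGES`, `DEFAULT_LANGUAGE`, `language_from_locale` and `indexer_language` are hypothetical names, and the set of supported languages is invented.

```python
import locale

# Hypothetical set of languages we have analyzers for (assumption).
SUPPORTED_LANGUAGES = {"en", "fr", "de", "fi"}
DEFAULT_LANGUAGE = "en"

def language_from_locale(locale_name):
    """Map a locale name such as 'fi_FI' or 'de-DE' to an indexer language,
    falling back to English when the locale is unknown or unsupported."""
    if not locale_name:
        return DEFAULT_LANGUAGE
    code = locale_name.replace("-", "_").split("_")[0].lower()
    return code if code in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE

def indexer_language():
    """Pick the indexer language from the process locale.

    locale.getlocale() may return (None, None), and on some platforms it
    returns names that are not of the 'xx_YY' form; both cases fall back
    to the default language here.
    """
    return language_from_locale(locale.getlocale()[0])

print(language_from_locale("fi_FI"))
print(language_from_locale("ja_JP"))
```

The fallback matters because the guess is only as good as the locale setting: a user whose locale has no matching analyzer still gets a working, if imperfect, index.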

If you use the search features of RSS Bandit, I'd appreciate getting your feedback on this issue.


 

Thursday, 01 June 2006 07:23:17 (GMT Daylight Time, UTC+01:00)
The assumption is not correct. At a number of large companies, the operating system is English, but the locale is set according to where people work. For example, my Windows XP system is English, but the locale is Finnish. I'm reading English and Finnish feeds.

It's no big deal, though, and I propose that you go ahead and implement this feature. It will work fine for a large number of users.
Apo
Thursday, 01 June 2006 08:04:03 (GMT Daylight Time, UTC+01:00)
Agree with Apo: same here. I think we should always use the default one (English) plus the language of the OS/Bandit UI locale, for cases where the feed or item is published in that language.
Thursday, 01 June 2006 10:27:13 (GMT Daylight Time, UTC+01:00)
OS locale is German; 4/5 of my feeds are English.

WM_FYI
-thomas woelfer
Thursday, 01 June 2006 10:46:39 (GMT Daylight Time, UTC+01:00)
Same here. I'm French. I run a US or French version of XP/W2k3/whatever, and 80% of my feeds are in English, 19.5% in French and one or two in Italian. I think a better solution would be to have a per-feed locale and to index accordingly. This would have the unfortunate side effect of requiring several distinct indexes, but the search results could aggregate hits from all of them.
yann schwartz
Thursday, 01 June 2006 19:48:10 (GMT Daylight Time, UTC+01:00)
Ditto. I haven't found a single (interesting) developer blog in Dutch, so all my subscriptions are in English. My OS and regional settings at home are Dutch, but at work the OS is English with Dutch regional settings.

Is there an official Dutch version of RSS Bandit yet? [rant]Mind you, I usually hate the developer-supplied translations you see for most OSS projects. The Dutch tend to be awful at spelling; among software developers it's even worse.[/rant]
Ruben
Friday, 02 June 2006 08:34:34 (GMT Daylight Time, UTC+01:00)
I'm from Germany and 2/3 of my feeds are in English, because most developer feeds are in English.
Lars
Monday, 05 June 2006 23:00:40 (GMT Daylight Time, UTC+01:00)
Soy de Mexico,la mayoria de las subscripciones que tengo son en ingles,sin embargo,tambien tengo en Español,Japones y Chino.
Y quisiera agregar mas subscripciones en Mongol o Polaco, y Aleman, hoy en dia es normal tener contactos en todo el mundo.

I am from Mexico. Most of my feeds are in English, but I also have some in Spanish, Japanese and Chinese.
I would like to add more feeds in Mongolian or Polish, and German. Nowadays it is normal to have contacts all over the world.
German
Sunday, 18 June 2006 16:09:03 (GMT Daylight Time, UTC+01:00)
... and another German user with mainly English feeds ;)

Guess we have a problem here. However, there probably is no solution, as German feeds at least often contain English quotes... Let the user choose which language the feed is in, and if they don't, use the default.

Or start to build an online feed database with information about the language.
Sunday, 05 November 2006 23:21:22 (GMT Standard Time, UTC+00:00)
Most of my feeds are also in English, 90% I'd say, even though I'm Spanish.