API Hall of Shame: Lucene's IndexReader and IndexWriter

March 5, 2007

@ 06:20 PM

For the current release of RSS Bandit we decided to forgo our homegrown solution for providing search over a user's subscribed feeds and go with Lucene.NET. The search capabilities are pretty cool but the provided APIs leave a lot to be desired. The only major problem we encountered with Lucene.NET is that concurrency issues are commonplace. Since most of the problems seemed to occur when multiple threads tried to modify the search index at once, we decided to protect against this by having only one thread that modifies the Lucene index.

This is where programming with Lucene.NET takes on Daily WTF proportions.

WTF #1: There are two classes used for modifying the Lucene index. This means you can't just create a singleton and protect access to it from multiple threads. Instead one must keep instances of two different types around and make sure that whenever one instance is open, the other is closed.

WTF #2: Although the classes are called IndexReader and IndexWriter, they are both used for editing the search index. There's a fricking Delete() method on a class named IndexReader.

Code Taken from Lucene Examples

public void DeleteDocument(int docNum)
{
	lock (directory)
	{
		AssureOpen();
		// Deletes go through IndexReader, so any open writer is closed first.
		CreateIndexReader();
		indexReader.DeleteDocument(docNum);
	}
}

void CreateIndexReader()
{
	if (indexReader == null)
	{
		// Only one of the two may be open at a time.
		if (indexWriter != null)
		{
			indexWriter.Close();
			indexWriter = null;
		}
		indexReader = IndexReader.Open(directory);
	}
}

void AddDocument(Document doc)
{
	lock (directory)
	{
		AssureOpen();
		// Adds go through IndexWriter, so any open reader is closed first.
		CreateIndexWriter();
		indexWriter.AddDocument(doc);
	}
}

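void CreateIndexWriter()
{
	if (indexWriter == null)
	{
		if (indexReader != null)
		{
			indexReader.Close();
			indexReader = null;
		}
		// The line creating the writer was missing from this listing. It is
		// restored here on the assumption that "analyzer" is a field of the
		// enclosing class; false opens the existing index rather than creating one.
		indexWriter = new IndexWriter(directory, analyzer, false);
	}
}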

As lame as this is, Lucene.NET is probably the best way to add desktop search capabilities to your .NET Framework application. I've heard they've created an IndexModifier class in newer versions of the API so some of this ugliness is hidden from application developers. How anyone thought it was OK to ship with this kind of API ugliness in the first place is beyond me.

Categories: Programming


Monday, 05 March 2007 19:40:39 (GMT Standard Time, UTC+00:00)

I haven't used Lucene.Net for some time, but I think some of this API weirdness is due to them maintaining parity (more or less) with the original Java implementation.
Don't know what their excuse is though.

Andy Pook

Monday, 05 March 2007 20:15:20 (GMT Standard Time, UTC+00:00)

According to the book Windows Developer Power Tools, Lucene.NET attempts to keep *exact* method-by-method parity with the Java Lucene, as Andy pointed out, even keeping method implementations as similar as possible. The key difference is that they follow the .NET coding conventions rather than Java's (such as Pascal-casing methods and properties).

At some point, I hope they branch out like NUnit did and take advantage of the differences between the platforms. But I understand why they took this approach, as it gave early users confidence in the codebase since it was pretty much the same code as the well-tested Java version.

Haacked

Monday, 05 March 2007 20:50:08 (GMT Standard Time, UTC+00:00)

It's hardly something that can be blamed on maintaining parity with Java - it's not like it's a great API there either, you know. It should be fixed in both.

IndexModifier is no panacea. I was using (Java) Lucene last year for the first time in several years, and noticed with some relief that IndexModifier had since been added.

Then I noticed that the performance was abysmal. On inspection, it turns out that IndexModifier is just a dumb wrapper for IndexReader and IndexWriter, and automatically closes one before opening the other, depending on which operation you perform. Most of our index changes were updating existing documents, and a series of delete, add, delete, add, calls were killing us (because creating an IndexReader is expensive, and IndexModifier was doing it every time we called delete). It's basically what your sample code above does.
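The cost pattern described above can be illustrated with a toy model. This is a minimal Java sketch, not Lucene code: `SwitchingModifier`, `SwitchCostDemo`, and the counters are hypothetical names that only imitate a wrapper which closes the writer before opening a reader, and the reader before opening a writer. It counts how often each side is reopened when updates arrive as delete, add, delete, add.

```java
// Toy model only: no Lucene classes are used and every name is hypothetical.
class SwitchingModifier {
    private boolean readerOpen = false;
    private boolean writerOpen = false;
    int readerOpens = 0; // times a (notionally expensive) reader was opened
    int writerOpens = 0; // times a writer was opened

    void delete(String id) {
        if (!readerOpen) {
            writerOpen = false; // the writer must be closed before reading
            readerOpen = true;
            readerOpens++;
        }
        // A real wrapper would now delete the document through the reader.
    }

    void add(String id) {
        if (!writerOpen) {
            readerOpen = false; // the reader must be closed before writing
            writerOpen = true;
            writerOpens++;
        }
        // A real wrapper would now append the document through the writer.
    }
}

class SwitchCostDemo {
    // Updates n documents as delete, add, delete, add, ...
    static SwitchingModifier interleaved(int n) {
        SwitchingModifier m = new SwitchingModifier();
        for (int i = 0; i < n; i++) {
            m.delete("doc" + i);
            m.add("doc" + i);
        }
        return m;
    }

    public static void main(String[] args) {
        SwitchingModifier m = interleaved(1000);
        System.out.println("reader opens: " + m.readerOpens); // prints "reader opens: 1000"
        System.out.println("writer opens: " + m.writerOpens); // prints "writer opens: 1000"
    }
}
```

In this model every single update pays for a fresh reader, which is the step the comment identifies as expensive.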

I coped with it all by having one shared IndexReader object used by all search threads, while updates were performed by a separate process (not for any particular reason; it was a web app) using its own IndexReader for deletes, and IndexWriter for writes. (Obviously I also had to do two passes over the updated objects, so I could call delete, delete, delete, add, add, add, rather than delete, add, delete, add, delete, add.) When the update was complete, the update thread notified the search component that it needed to close its IndexReader and open a new one (because an IndexReader never sees index changes that were performed after it was opened).
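The arrangement described above might be sketched roughly as follows. This is a toy Java model under stated assumptions, not the commenter's code and not the Lucene API: `Doc`, `ToyIndex`, `SearchComponent`, `applyBatch`, and `reopen` are all hypothetical names. A "reader" here is simply a point-in-time copy of the document list, which stands in for the fact that an open IndexReader never sees later changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: hypothetical names, no Lucene classes.
final class Doc {
    final String id;
    final String text;
    Doc(String id, String text) { this.id = id; this.text = text; }
}

class ToyIndex {
    // Adding never overwrites an older entry, so stale versions must be deleted.
    private final List<Doc> docs = new ArrayList<>();

    // A "reader" is a snapshot; changes made after this call are invisible to it.
    synchronized List<Doc> openReader() {
        return Collections.unmodifiableList(new ArrayList<>(docs));
    }

    // Two passes over the batch: every delete first, then every add.
    synchronized void applyBatch(Map<String, String> updates) {
        docs.removeIf(d -> updates.containsKey(d.id));              // delete, delete, delete
        updates.forEach((id, text) -> docs.add(new Doc(id, text))); // add, add, add
    }
}

class SearchComponent {
    private final ToyIndex index;
    private final AtomicReference<List<Doc>> reader;

    SearchComponent(ToyIndex index) {
        this.index = index;
        this.reader = new AtomicReference<>(index.openReader());
    }

    // Called by the updater once a whole batch has been applied.
    void reopen() {
        reader.set(index.openReader());
    }

    // All search threads share whichever snapshot is current.
    String find(String id) {
        for (Doc d : reader.get()) {
            if (d.id.equals(id)) return d.text;
        }
        return null;
    }
}

class BatchedUpdateDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.applyBatch(Collections.singletonMap("a", "v1"));
        SearchComponent search = new SearchComponent(index);

        index.applyBatch(Collections.singletonMap("a", "v2"));
        System.out.println(search.find("a")); // prints "v1": the old snapshot is still in use
        search.reopen();                      // the updater's "batch complete" notification
        System.out.println(search.find("a")); // prints "v2"
    }
}
```

Swapping the snapshot through an `AtomicReference` is one way to let in-flight searches finish against the old reader while new searches pick up the new one; a real application would also need to close the old reader once nothing is using it.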

Adam Fitzpatrick

Monday, 05 March 2007 20:51:20 (GMT Standard Time, UTC+00:00)

They're not exactly _generous_ in following .NET coding conventions. None of the collections implement a single relevant interface, there aren't any indexers for obvious things like the fields of a document, nothing that requires explicit cleanup is disposable etc etc.

I've always had the impression that Lucene.NET was generated by a Java-to-C# tool, with a tiny bit of extra tweaking. And I don't see what having method parity with the Java version is giving anyone... it's not as if the Javadoc is particularly useful.

It's yucky. I wish it wasn't so terribly useful. :)

Thom

Monday, 05 March 2007 21:29:23 (GMT Standard Time, UTC+00:00)

Well I'm not trying to say they *should* continue to do so. Just saying they are. And as long as they stick to that, then to get these things fixed, it would require getting the Java side to fix it.

Personally, I think they've reached the point where their product is mature enough to stand on its own. Time to ditch Java parity and clean that sucker up! ;)

haacked@gmail.com (Haacked)

Tuesday, 06 March 2007 08:37:54 (GMT Standard Time, UTC+00:00)

By the way, Dare, what happened to the last sentence of this entry? It appears as a single run-on sentence both in Windows Live Mail desktop (my current RSS reader) and in IE7 (which widens the page by a *lot*).

Damit

Tuesday, 06 March 2007 17:56:53 (GMT Standard Time, UTC+00:00)

The presence of IndexReader.delete() traces back to the origins of Lucene. IndexReader and IndexWriter were written, then deletion was added. IndexWriter should really have been called IndexAppender. The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. Adding delete() to IndexWriter in the obvious way (as experience with IndexModifier shows) had bad performance implications. So while the API was ugly, the alternatives were unusable. The issue has been discussed at length for years, but no acceptable solution had been developed.

Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions. This took a lot of cleverness. Much more cleverness than, e.g., nominating some useful and widely used code to a "Hall of Shame".

Someone above states "Time to ditch Java parity and clean that sucker up!". That's wonderful! Please help. Lucene.Net is mostly a one-man project. It's been languishing in Apache's incubator for over a year, due to lack of help. If you find it useful, your energy would be better put to helping out directly, rather than offering advice in blog comments. Then things like the clever enhancement described above, made in Java, will be available to you sooner.

Doug Cutting

Wednesday, 07 March 2007 14:14:03 (GMT Standard Time, UTC+00:00)

George Aroush has done an astounding job on his own by providing a .Net port of an amazing Java library. The Lucene.Net project needs some serious help from people used to working in a collaborative team environment.

Lucene is so good I'm happy to forgive its oddities, but I'd love for some .Net experts to join in and create a pure .Net branch that supports disposables and collections. But most importantly, any branch must also support the excellent contributions made to the Java version, which are also manually ported by George.

Doug

Wednesday, 07 March 2007 14:28:21 (GMT Standard Time, UTC+00:00)

In addition to Doug's well articulated comment, please note that the official Lucene.Net website is: http://incubator.apache.org/lucene.net/

Join the mailing list, bring this discussion to the list, contribute to the project, and in time you will be voted in as a committer. If we have more than just me doing the port, then Lucene.Net will be only weeks behind its Java release instead of months like it is now, and we will have time to address those issues related to .NET.

Those are important steps for Lucene.Net to graduate from incubation.

George Aroush

Monday, 12 March 2007 01:03:21 (GMT Standard Time, UTC+00:00)

Hi, all,
Just want to say that having parity with the Java APIs does have its value, at least to our company. Our company is developing a portable search engine on top of Lucene that will work on both Java and .NET platforms. It's basically a wrapper around Lucene written in constrained Java that will be translated automatically by a Java-to-C# translator we've developed. If the Java and .NET versions of the APIs differ too much, then we will have to have an abstraction layer on top of the platform-specific Lucene.

The Java and .NET equivalence is very important to us and is the main reason we picked Lucene over other search engines for the company-wide shared search component that's to be embedded in both our desktop and web applications. While I know you can still keep functional parity with a different .NET API (same index structure, same search results given the same query, etc.), having the same APIs as Java definitely makes writing our wrapper component a lot easier.

James Shaw

Comments are closed.

Dare Obasanjo's weblog

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson
