I’ve been looking for site indexing code to support searching my web site. Yes, you can certainly use Google for some of my content. But Google won’t typically see content that’s not linked in. Then there’s the data hidden away in databases instead of HTML.
I’ve looked around, and nothing available right now really excites me. I want something fairly lightweight, fast, and capable of indexing content from my HTML, PHP, blogs and photos. I also want something that integrates easily into the look and feel of my site, both today and two years from now.
Conceptually, I like Xapian. There’s also the ‘omega’ package built on Xapian. However, omega’s ‘omindex’ doesn’t work for me: I’ve got considerable textual content in PHP files, and ‘omindex’ skips PHP. It also doesn’t handle WordPress blogs, gallery3 databases, etc. I’m also not terribly fond of omega’s output (though that’s not difficult to change).
I’ve decided I’ll write my own indexer and search front end, using Xapian as the back end. In fact, I’ve already written the code to index HTML and PHP files, and I have a design for the code to index my blog posts. I also have a test search program that emits simple HTML. All looks good so far; after indexing a handful of pages, searches yield appropriate document weights and rankings.
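To give a rough idea of the text-extraction step for HTML and PHP files, here is a minimal sketch in Python using only the standard library. It is not my actual indexer code; the names `PHP_BLOCK`, `TextExtractor` and `extract_text` are made up for illustration. It assumes that only the static text outside of PHP blocks is worth indexing, and it will be fooled by a `?>` that appears inside a PHP string.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper: matches <?php ... ?>, <?= ... ?> and <? ... ?> blocks,
# including a trailing block that is never closed before end of file.
PHP_BLOCK = re.compile(r"<\?(?:php|=)?.*?(?:\?>|\Z)", re.DOTALL | re.IGNORECASE)


class TextExtractor(HTMLParser):
    """Collects the <title> text and the visible text, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore JavaScript and CSS
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def extract_text(source):
    """Return (title, body_text) for the contents of an HTML or PHP file."""
    html = PHP_BLOCK.sub(" ", source)  # drop PHP code, keep the static markup
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text is ready for a term generator.
    title = " ".join(" ".join(parser.title_parts).split())
    body = " ".join(" ".join(parser.text_parts).split())
    return title, body
```

The returned title and body are plain strings, so they could then be handed to whatever does the term generation (in Xapian’s case, presumably a `TermGenerator` via `index_text`), with the title stored separately for display in search results.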