Nobal Tech: Lucene revisited

Lucene is an open-source full-text search library that makes it easy to add search functionality to an application or website. Want to understand Lucene in 5 minutes? Go here. The following slide provides a quick review of Lucene.

Figure: Steps in building applications using Lucene [Source: IBM]

Lucene Introduction

Why Lucene? From this DOC:

Incremental versus batch indexing
Data sources
Indexing Control
File Format
Content Tagging
Stop Word Processing
Stemming
Query Features
Concurrency
Non-English Support

Go through this document, which presents the fundamental concepts of Lucene, e.g. Index, Document, Field, Term, Segment and Query Term. I recommend it for beginners.

Searching and Indexing

Lucene achieves fast search responses because, instead of searching the text directly, it searches an index. This is the equivalent of finding the pages in a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
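To make the page->words versus word->pages idea concrete, here is a minimal sketch in plain Python. It is not Lucene's implementation or API. The names build_inverted_index and search, and the sample pages, are made up for illustration, and the "analysis" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Invert a page -> text mapping into a word -> page numbers mapping."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        # Naive tokenization: lowercase and split on whitespace (an assumption;
        # a real search library applies far richer text analysis).
        for word in text.lower().split():
            index[word].add(page_number)
    # Sort the page numbers so that results are stable and easy to read.
    return {word: sorted(page_numbers) for word, page_numbers in index.items()}


def search(index, keyword):
    """Look the keyword up in the index instead of scanning every page."""
    return index.get(keyword.lower(), [])


# Hypothetical sample data: page number -> text on that page.
pages = {
    1: "Lucene builds an index",
    2: "the index maps words to pages",
    3: "Lucene searches the index",
}

index = build_inverted_index(pages)
print(search(index, "lucene"))  # [1, 3]
print(search(index, "pages"))   # [2]
```

The lookup cost depends on the size of the index entry, not on the total amount of text, which is the reason the book-index analogy above holds.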

Lucene's Drawback and Nutch

Lucene provides a powerful indexing and search library that can serve as the base for an online search engine. On its own, however, the library does not include any form of web crawling or HTML parsing, and both are necessary to build a fully functional online search engine. Several projects have built on Lucene with the intent of adding this missing functionality. One of the most notable of these efforts is Nutch, a SourceForge.net project.
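As a rough illustration of the HTML parsing gap, the sketch below pulls the visible text out of an HTML page so that it could then be handed to an indexer. It uses only Python's standard html.parser module and is not how Nutch does it. The TextExtractor class, the html_to_text function and the sample markup are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page's visible text as a single space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Lucene</h1><p>Full-text <b>search</b> library</p></body></html>"
)
print(html_to_text(sample))  # Lucene Full-text search library
```

A crawler would add the other missing piece: fetching pages and following links, then passing each page through a step like this before indexing.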

