Lucene revisited

Lucene is an open-source full-text search library that makes it easy to add search functionality to an application or website. Want to understand Lucene in 5 minutes? Go here. The following slide provides a quick review of Lucene.
Figure: Steps in building applications using Lucene [Source: IBM]

Why Lucene? From this DOC:
  • Incremental versus batch indexing
  • Data sources
  • Indexing Control
  • File Format
  • Content Tagging
  • Stop Word Processing
  • Stemming
  • Query Features
  • Concurrency
  • Non-English Support
Go through this document, which presents the fundamental concepts of Lucene, e.g. Index, Document, Field, Term, Segment and Query Term. I recommend it for beginners.

Searching and Indexing 
Lucene achieves fast search responses because, instead of searching the text directly, it searches an index. This is equivalent to finding the pages in a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
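
To make the page->words versus word->pages idea concrete, here is a minimal sketch of a toy inverted index in plain Java. It does not use Lucene at all, and the class and method names (InvertedIndexSketch, addPage, search) are made up for illustration. Lucene's real index is far more elaborate (segments, term statistics, compression), so treat this only as a picture of the basic data structure.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each word to the sorted set of page numbers containing it.
// Illustration only; this is not how Lucene stores its index internally.
public class InvertedIndexSketch {
    // word -> pages (the keyword-centric structure)
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Index one page: split its text into lowercase words and record the page under each word.
    public void addPage(int pageNumber, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(pageNumber);
        }
    }

    // Look up a word in the index without scanning any page text.
    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage(1, "Lucene is a search library");
        index.addPage(2, "An index makes search fast");
        index.addPage(3, "Lucene builds an inverted index");

        System.out.println(index.search("lucene")); // [1, 3]
        System.out.println(index.search("search")); // [1, 2]
        System.out.println(index.search("index"));  // [2, 3]
    }
}
```

The cost of a lookup depends on the number of distinct words rather than on the total amount of text, which is why searching the index is so much faster than scanning every page.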

Lucene's Drawback and Nutch
Lucene provides a powerful indexing and search library that may be used as a base for online search engines. On its own, however, the library doesn't include any web crawling or HTML parsing abilities, which are necessary to create a fully functional online search engine. Several projects have extended Lucene with the intent of adding this missing functionality. One of the most notable of these efforts is Nutch, which began as a SourceForge.net project and is now an Apache project.

More Resources:
  1. Lucene QUERY SYNTAX
  2. Lucene QUERY SYNTAX I
  3. Luke - Lucene INDEX TOOLBOX
  4. Lucene BASICS
  5. Lucene ISSUES
  6. Lucene 3.0 API Documentation
  7. Advanced Lucene
  8. BEST TUTORIAL@ IBM
