FullText Search Solutions

Types of FullText Search Solutions

  • Special Database Features
    • MySQL Full Text Search, Sienna
    • Solutions exists for PostgreSQL, Oracle and many others
  • Home baked database based solutions
    • Using stored procedures or set of queries and keyword tables for search
  • External Full Text Search Solutions
    • Lucene, Sphinx, Mnogosearch etc.

My Thoughts on Performance

  • Lucene and Sphinx both can handle large data sizes
  • Lucene has more features, dynamic updates
  • For the lamp developer use the Zend Framework’s Lucene Search which is based on Lucene but it will not work on php 4.
  • speed of fulltext search in lucene is much faster as compared to mysql
  • lucene is much more complex to use as compared to mysql.
  • lucene does not allow us to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.
  • lucene requires an object of the index to perform the search. We will know about it when we use the api. Whenever we add a new document to the index, a new object of the index has to be created to include the new document in the index. But creation of a new object is not a major overhead. Though it does slow down the searching process to some extent.
  • MySQL FullText Search is good for Small (relatively) data sizes

MySQL FullText Search and Updates

  • Simple: In mysql, we can simply mark an index on a text/varchar column as fulltext, and our work is done. All we need to do next is to fire MATCH AGAINST queries. Adding and modification of indexes is handled by mysql internally as and when new data is added.
  • MyISAM FullText Search is based on BTREE
    • Special form of BTREE index
  • Each word is index entry
  • Updating text with 1000 words – 1000 key entries needs to be updated
    • A lot of random IO if index is not in memory
  • Index degradation by fragmentation
    • Run OPTIMIZE TABLE for best performance
  • Minimum Length: With mysql we have the minimum length of word to be indexed which is by default 4. So all words which have less than 4 characters will not be indexed. What will we do if we want to index words like “php”, “asp”, “c”? We will have to decrease the minimum length from 4 to 1. And this will increase the index size drastically and slow down all our searches as a consequence. There are no such issues in lucene.

Lucene

Lucene is a free/open source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP.

  • Popular full text search Library written in Java
    • http://lucene.apache.org/
    • Clucene – C port exists, but is not current
    • Is not specially tailored for indexing databases
    • Some coding is needed for integration
  • Dynamic index changes possible
  • Very Advanced query language
    • Wildcard searches:
    • Search in Fields: title:”The Right Way” AND text:go
    • Proximity Searches, Fuzzy Searches etc
  • Supports attributes (indexed and non indexed)
  • Some CJK Support
  • Easily integrates with Java

Sphinx Search

  • Designed for indexing Database content
  • Focuses
    • High performance
    • Search Quality
    • Ease of use
  • Supports multi-node clustering out of box
  • Support Multiple Attributes
  • Different sort modes (relevance, data etc)
  • Supports trend queries
  • Support for snippets
  • Client available as MySQL Storage Engine plugin
  • Number of limitations
  • no partial word searches, hard to update indexes

Tbgsearch

  • Very fast for medium size data sizes
  • Only boolean searches available

.