search – Jit Varish Tiwari

Types of FullText Search Solutions

Special Database Features
- MySQL Full Text Search, Sienna
- Solutions exists for PostgreSQL, Oracle and many others
Home baked database based solutions
- Using stored procedures or set of queries and keyword tables for search
External Full Text Search Solutions
- Lucene, Sphinx, Mnogosearch etc.

My Thoughts on Performance

Lucene and Sphinx both can handle large data sizes
Lucene has more features, dynamic updates
For the lamp developer use the Zend Framework’s Lucene Search which is based on Lucene but it will not work on php 4.
speed of fulltext search in lucene is much faster as compared to mysql
lucene is much more complex to use as compared to mysql.
lucene does not allow us to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.
lucene requires an object of the index to perform the search. We will know about it when we use the api. Whenever we add a new document to the index, a new object of the index has to be created to include the new document in the index. But creation of a new object is not a major overhead. Though it does slow down the searching process to some extent.
MySQL FullText Search is good for Small (relatively) data sizes

MySQL FullText Search and Updates

Simple: In mysql, we can simply mark an index on a text/varchar column as fulltext, and our work is done. All we need to do next is to fire MATCH AGAINST queries. Adding and modification of indexes is handled by mysql internally as and when new data is added.

MyISAM FullText Search is based on BTREE
- Special form of BTREE index
Each word is index entry
Updating text with 1000 words – 1000 key entries needs to be updated
- A lot of random IO if index is not in memory
Index degradation by fragmentation
- Run OPTIMIZE TABLE for best performance
Minimum Length: With mysql we have the minimum length of word to be indexed which is by default 4. So all words which have less than 4 characters will not be indexed. What will we do if we want to index words like “php”, “asp”, “c”? We will have to decrease the minimum length from 4 to 1. And this will increase the index size drastically and slow down all our searches as a consequence. There are no such issues in lucene.

Lucene

Lucene is a free/open source information retrieval library, originally implemented in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP.

Popular full text search Library written in Java
- http://lucene.apache.org/
- Clucene – C port exists, but is not current
- Is not specially tailored for indexing databases
- Some coding is needed for integration
Dynamic index changes possible
Very Advanced query language
- Wildcard searches:
- Search in Fields: title:”The Right Way” AND text:go
- Proximity Searches, Fuzzy Searches etc
Supports attributes (indexed and non indexed)
Some CJK Support
Easily integrates with Java

Sphinx Search

Designed for indexing Database content
- http://www.sphinxsearch.com
Focuses
- High performance
- Search Quality
- Ease of use
Supports multi-node clustering out of box
Support Multiple Attributes
Different sort modes (relevance, data etc)
Supports trend queries
Support for snippets
Client available as MySQL Storage Engine plugin
Number of limitations
no partial word searches, hard to update indexes

Tbgsearch

Vector based Search System
- http://www.tbg.nu/tbgsearch/
Written in Perl, Stores data in MySQL Database

Very fast for medium size data sizes
Only boolean searches available