‘Hey guys, is it about that bearded old Sphinx? Come on!’ you may cry.
Really, why not a modern ElasticSearch, Solr, or PostgreSQL?
There are two main reasons:
- Efficiency. We rely on this brilliant quality in every project we develop. ‘Efficiency’ means doing a lot of work with less effort. Sphinx is written in C++, uses less memory, and searches faster than Java-based engines. We lose cross-platform portability but gain speed.
- A medium-sized database. We knew the project would rather not operate on big data, but it would work with tables containing millions of rows.
That is why the decision in favour of Sphinx was made 7 years ago, and it is still valid today.
How we chose it
The project (where Sphinx was applied later) started in 2009 with a large database of emails, contacts, and various business transaction records. It was built with PHP and MySQL. The project was meant to be long-lived, and the database was growing fast. After a couple of years we began to feel the lack of a good full-text search. At that time MySQL’s built-in full-text indexes were not an option for us, since full-text search for InnoDB tables only arrived in v5.6. After a quick review of some open-source full-text search tools, we decided to use Sphinx. At that time, Sphinx was in fact the only product with stable, fast search. For example, according to this nice research that compared Sphinx, Solr, and Ferret, the last two were slower and, moreover, needed more memory.
The most obvious advantages of Sphinx were, and still are:
- Fast indexing.
- Low memory usage, plus the ability to cap how much memory the engine may use.
- Tunable result ranking. By default results are ranked by relevance, but the ordering can be changed.
- Useful auxiliary tools (e.g. sending emails on daemon events or slow queries, and tools for monitoring the service).
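As an illustration of the ranking point: searchd speaks a MySQL-compatible protocol (SphinxQL, by default on port 9306), and the ranker can be overridden per query. A minimal sketch, assuming a hypothetical customer_search index:

```sql
-- Connect with any MySQL client, e.g.: mysql -h 127.0.0.1 -P 9306
-- Hypothetical index name; WEIGHT() returns the computed relevance.
SELECT id, WEIGHT() AS w
FROM customer_search
WHERE MATCH('invoice')
LIMIT 10
OPTION ranker=bm25;  -- pure BM25 instead of the default proximity_bm25
```

Results come back ordered by relevance by default, so swapping the ranker is enough to change how that relevance is computed.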
What about disadvantages? There are a few, but they are minor for us:
- No built-in ‘did-you-mean’ service. However, since the engine indexes each word in its dictionary form, words sharing a root (‘coding’ and ‘code’) are matched naturally.
- No partial index updates for text fields, so the data has to be re-indexed regularly.
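The root-matching behaviour mentioned above comes from Sphinx’s morphology setting (e.g. morphology = stem_en, an English Porter stemmer applied both at indexing and at query time). The idea can be sketched with a deliberately toy stemmer, which is not the real Porter algorithm:

```python
# Toy illustration of stemming-based matching (NOT the real Porter algorithm):
# both the indexed word and the query word are reduced to a root before comparison.
def toy_stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Strip one common suffix, keeping at least a 3-letter root.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def matches(indexed_word: str, query_word: str) -> bool:
    # Two words match when they collapse to the same root.
    return toy_stem(indexed_word.lower()) == toy_stem(query_word.lower())
```

With morphology enabled, ‘coding’ and ‘code’ both collapse to one root, so a search for either finds the other without any spell-suggestion machinery.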
The present
Today the project contains over 40 tables, and the biggest one has 10,180,000+ rows. The total size of the database is 100+ GB. Full-text search runs on a quarter of all the project’s tables. Sphinx uses 3-stage indexing: a delta index once per minute, a merge of the delta into the main index every 10 minutes, and a full data reindex once per day (the listings below show examples).
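The delta/main split is typically implemented with the classic ‘main+delta’ scheme from the Sphinx documentation: a helper counter table remembers the last fully indexed id, and the delta source selects only newer rows. A sketch with made-up table, index, and path names:

```
# sphinx.conf sketch (hypothetical names). Assumes a helper table:
#   CREATE TABLE sph_counter (counter_id INT PRIMARY KEY, max_doc_id INT);
source messages_main
{
    type     = mysql
    sql_host = localhost
    sql_user = sphinx
    sql_pass = secret
    sql_db   = project

    # Full pass: remember the highest id we have indexed...
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM messages
    sql_query     = SELECT id, subject, body FROM messages \
        WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}

# ...and the delta source picks up only rows added since then.
source messages_delta : messages_main
{
    sql_query_pre =
    sql_query     = SELECT id, subject, body FROM messages \
        WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}

index messages_main
{
    source = messages_main
    path   = /var/lib/sphinx/messages_main
}

index messages_delta : messages_main
{
    source = messages_delta
    path   = /var/lib/sphinx/messages_delta
}

# Cron then drives the three stages, roughly:
#   * * * * *     indexer --rotate messages_delta
#   */10 * * * *  indexer --rotate --merge messages_main messages_delta
#   0 4 * * *     indexer --rotate messages_main
```

The counter bookkeeping around --merge is simplified here; in a real setup the counter has to be advanced when the delta is merged, otherwise already-merged rows get re-collected into the next delta.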
The Sphinx tool is good and quiet; we simply do not notice it working. A typical query against the database executes in milliseconds.
```
indexing index 'customer_search'...
collected 21 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 21 docs, 301 bytes
total 0.056 sec, 5335 bytes/sec, 372.26 docs/sec
```
Listing of delta indexing for one table

```
indexing index 'delta_10min_messages'...
collected 5 docs, 0.0 MB
collected 80 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 5 docs, 10295 bytes
total 2.415 sec, 4262 bytes/sec, 2.07 docs/sec
total 44 reads, 0.000 sec, 0.6 kb/call avg, 0.0 msec/call avg
total 14 writes, 0.000 sec, 5.6 kb/call avg, 0.0 msec/call avg
```
Listing of merge indexing for one table

```
indexing index 'company_search'...
collected 5933 docs, 0.1 MB
sorted 0.0 Mhits, 100.0% done
total 5933 docs, 109919 bytes
total 0.138 sec, 791108 bytes/sec, 42700.96 docs/sec
```
Listing of full data reindex for one table
The near future
Sphinx is not very popular nowadays. Almost every comparable tool, be it ElasticSearch, Solr, or even MySQL’s built-in full-text search, beats it in popularity. But it is still alive. Sphinx 3.0 is coming, and its developers promise to deliver a good set of features: RT indexes, a REST API, a thread pool, and more. We hope Sphinx will not just chase the trend line but will stay as lightweight and simple as it is.