Sphinx Full Text Search Server
By Andrew Kandels
Search Server
Sphinx is an open source full text search server, designed from the ground up with performance, relevance (a.k.a. search quality), and integration simplicity in mind.
• Craigslist serves 200 million queries/day
• Used by Slashdot, Mozilla, Meetup
• Scales to billions of documents (distributed)
• Supports almost any data source (SQL, XML, etc.)
• Batch and real-time indexes
What is a Search Server?
Sphinx is like a database because…
• It has a schema
• It has field types (integer, boolean, strings, dates)
• It responds to queries (SQL, API):
SELECT * FROM Books WHERE MATCH('a rose by any other name')
Documents
Sphinx indexes data from just about any source.
SELECT CONCAT(a.first_name, ' ', a.last_name) AS full_name,
       COUNT(b.book_id) AS num_books,
       MIN(b.publish_date) AS first_published
FROM author a
INNER JOIN book b ON a.author_id = b.author_id
GROUP BY a.author_id, a.first_name, a.last_name
<?xml version="1.0"?>
<author>
  <id>1433</id>
  <name>Mark Twain</name>
  <books>
    <book>A Connecticut Yankee in King Arthur’s Court</book>
  </books>
</author>
How it Works
Sphinx parses plain text queries and answers with rows.
Search
@author_id 15 "Mark Twain" king << arthur
Results
1. document=1433, weight=1692, createdAt=Jan 1 1889
Relevance
Only the strongest will survive, but relevance is in the eye of the beholder. Some factors include:
• How many times did our keywords match?
• How many times did they repeat in the query?
• How frequently do keywords appear?
• Do keywords in the document appear in the same order as the query?
• Did we match exactly, or is it a stemmed match?
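A toy sketch of how two of these factors (match count and keyword order) could be combined into a weight. This is not Sphinx's ranker; the `score` function, the sample documents and the bonus of 2 are assumptions made for illustration.

```python
def score(query: str, document: str) -> int:
    """Toy relevance score: keyword hits plus a bonus for in-order matches."""
    q_terms = query.lower().split()
    d_terms = document.lower().split()

    # Factor: how many times do the query keywords occur in the document?
    hits = sum(d_terms.count(term) for term in q_terms)

    # Factor: do adjacent query keywords appear adjacent and in order?
    d_pairs = set(zip(d_terms, d_terms[1:]))
    in_order = sum(1 for pair in zip(q_terms, q_terms[1:]) if pair in d_pairs)

    return hits + 2 * in_order  # the bonus of 2 is an arbitrary assumption


# Hypothetical documents, keyed by document id.
docs = {
    1: "mark twain wrote about king arthur",
    2: "arthur the king met twain and mark",
}
ranked = sorted(docs, key=lambda doc_id: score("mark twain", docs[doc_id]), reverse=True)
```

Both documents contain both keywords, but only document 1 has them in query order, so it ranks first.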
B-Tree Index
User
First Name  Last Name  City        State  Notes
Allison     Janney     Baltimore   MA     Cregg
John        Spencer    Des Moines  IA     McGarry
Bradley     Whitford   Newport     VA     Lyman
Martin      Sheen      Seattle     WA     Bartlett
Janel       Moloney    Hollywood   CA     Moss
Richard     Schiff     Lincoln     NE     Ziegler
Index (Last Name (4))
Row #  Contents
1      Jann
5      Molo
6      Schi
4      Shee
2      Spen
3      Whit
A B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
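A minimal sketch of the prefix index above. It uses a sorted list with binary search as a stand-in for a real B-tree: lookups are logarithmic in both, but a sorted list does not give logarithmic inserts. The `lookup` helper is hypothetical.

```python
from bisect import bisect_left

# Row number -> last name, taken from the User table.
users = {1: "Janney", 2: "Spencer", 3: "Whitford", 4: "Sheen", 5: "Moloney", 6: "Schiff"}

# The index keeps the first 4 characters of each last name, in sorted order.
index = sorted((name[:4], row) for row, name in users.items())
keys = [key for key, _ in index]


def lookup(last_name: str) -> list[int]:
    """Return the row numbers whose indexed prefix matches, via binary search."""
    prefix = last_name[:4]
    pos = bisect_left(keys, prefix)
    rows = []
    while pos < len(index) and keys[pos] == prefix:
        rows.append(index[pos][1])
        pos += 1
    return rows
```

Because only a 4-character prefix is indexed, a match still has to be checked against the full row value.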
Logical Queries
Logical conditions return a boolean result based on an expression:
country = 'United States'
AND num_published >= 50
AND (author_id = 5 OR author_id = 8 OR author_id = 10)
Logical queries can be complex and typically evaluate against the whole value of a column.
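The same condition, sketched as a plain predicate to show that every comparison works on a whole column value. The author records are hypothetical; only the column names come from the condition above.

```python
# Hypothetical author records.
authors = [
    {"author_id": 5, "country": "United States", "num_published": 72},
    {"author_id": 8, "country": "United States", "num_published": 12},
    {"author_id": 9, "country": "United States", "num_published": 90},
    {"author_id": 10, "country": "Canada", "num_published": 55},
]


def matches(row: dict) -> bool:
    """Each comparison tests a whole column value and yields True or False."""
    return (
        row["country"] == "United States"
        and row["num_published"] >= 50
        and row["author_id"] in (5, 8, 10)
    )


selected = [row["author_id"] for row in authors if matches(row)]
```

Only author 5 satisfies all three conditions. Nothing here looks inside a value, which is the gap full text search fills.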
Stemming
Stemming (a.k.a. morphology) is the process for reducing inflected or derived words to their stem, base or root form.
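For example, “fishing”, “fished” and “fisher” all reduce to the stem “fish”. Synonyms are a different case: “dove” and “pigeon” are different words that can mean the same thing, and Sphinx maps them with word forms rather than stemming.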
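A toy suffix-stripping stemmer, to make the idea concrete. It is not the algorithm Sphinx ships; the suffix list and the three-character minimum are assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "er", "s")  # assumed list, checked in this order


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: stem(w) for w in ["fish", "fishing", "fished", "fisher", "catches"]}
```

Irregular forms such as "caught" pass through unchanged; handling them needs a dictionary rather than suffix rules.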
Tokenizing
Sphinx breaks down documents into keywords. This is called tokenization.
Word-breaking characters (such as &, + or -) can be given exception cases so that keywords like AT&T, C++ or T-Mobile stay intact.
Short words can be ignored (anything shorter than min_word_len, which defaults to 1 and so indexes everything), but a placeholder position is saved to support proximity and phrase searching.
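A small tokenizer sketch covering both points: exception keywords survive intact, and short words are dropped while still consuming a position. The exception list and the threshold of 3 are assumptions for this example, not Sphinx settings.

```python
import re

EXCEPTIONS = {"at&t", "c++", "t-mobile"}  # hypothetical exception list
MIN_WORD_LEN = 3                          # assumed threshold for this sketch

# Try the exceptions first, then fall back to plain runs of letters and digits.
PATTERN = re.compile("|".join(re.escape(e) for e in sorted(EXCEPTIONS)) + r"|[a-z0-9]+")


def tokenize(text: str) -> list[tuple[int, str]]:
    """Return (position, keyword) pairs; short words keep their position but are dropped."""
    tokens = []
    for position, match in enumerate(PATTERN.finditer(text.lower()), start=1):
        word = match.group()
        if word in EXCEPTIONS or len(word) >= MIN_WORD_LEN:
            tokens.append((position, word))
    return tokens
```

For "A man caught a fish" the kept keywords land on positions 2, 3 and 5, so a phrase search still knows that "man" and "caught" are adjacent.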
Full Text Index
Inversion
Document
A man caught a fish
Index (Full Text)
[spacer]
man, person, human, being
caught, catch, catcher, catching, catches
[spacer]
fish, fishing, fished, fisher
Metadata
man     2  1
caught  3  1
fish    5  1
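A minimal inverted index sketch for the document above. Reading the metadata columns as word position and document id is an assumption, as are the `build_index` name and the minimum word length of 3.

```python
from collections import defaultdict


def build_index(documents: dict[int, str], min_word_len: int = 3) -> dict[str, list[tuple[int, int]]]:
    """Map each keyword to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if len(word) >= min_word_len:  # short words still consume a position
                index[word].append((doc_id, position))
    return dict(index)


index = build_index({1: "A man caught a fish"})
```

A query for "fish" is then a dictionary lookup instead of a scan over every document.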
Full Text Queries
Full text queries search across multiple columns, or within the contents of a column. This is also known as keyword searching.
Boolean Search           fiction (Twain | Dickens)
Phrase Search            "Mark Twain"
Field-Based Search       @author_id 15
Proximity Search         "fear itself"~2, fear << itself
Substring Search         @author[4] Mark
Quorum Search            "the world is a wonderful place"/3
Same Sentence/Paragraph  fear SENTENCE itself
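Two of these operators, sketched in plain Python to show what they test. This illustrates the semantics only and is not how Sphinx evaluates queries; both function names are hypothetical.

```python
def quorum_match(query: str, document: str, threshold: int) -> bool:
    """True if at least `threshold` distinct query keywords occur in the document."""
    doc_words = set(document.lower().split())
    found = {word for word in query.lower().split() if word in doc_words}
    return len(found) >= threshold


def phrase_match(phrase: str, document: str) -> bool:
    """True if the phrase keywords occur consecutively and in order."""
    words = phrase.lower().split()
    doc_words = document.lower().split()
    return any(
        doc_words[i:i + len(words)] == words
        for i in range(len(doc_words) - len(words) + 1)
    )
```

With a quorum of 3, "what a wonderful world" matches the six-word query because it shares three of its keywords.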
Getting Sphinx
Download it from http://www.sphinxsearch.com (RPM, DEB, Tarball)
Important Files and Binaries
A successful Sphinx installation will yield the following:
searchd      The search daemon; answers queries
indexer      Collects documents and builds the index
search       Performs a search (useful for debugging)
sphinx.conf  Defines your data and configures your indexes and daemon
Sphinx.conf
Defaults to /etc/sphinx/sphinx.conf, but can exist anywhere.
It can even be executable:
#!/usr/bin/env php
source mysource
{
    type     = mysql
    sql_host = <?php echo DB_HOST; ?>
}
Sphinx.conf Blocks
The contents of sphinx.conf consist of several named blocks:
source   Defines your data source and queries
index    Defines which sources to index and the index parameters
indexer  Configures the indexer utility
searchd  Configures the search daemon
Source
Define the connection to your database and query in the source block.
source filmssource
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = sakila
    sql_query = \
        SELECT f.film_id, f.title, f.description, \
            f.release_year, f.rating, l.name as language \
        FROM film f \
        INNER JOIN language l \
            ON l.language_id = f.language_id
    sql_attr_uint   = release_year
    sql_attr_string = rating
    sql_attr_string = language
}
Index
Define which sources to include and index parameters:
index films
{
    source         = filmssource
    charset_type   = utf-8
    path           = /home/andrew/sphinx/films
    stopwords      = /home/andrew/sphinx/stopwords.txt
    enable_star    = 1
    min_word_len   = 2
    min_prefix_len = 0
    min_infix_len  = 2
}
Indexer (optional)
Configure the indexing process which runs occasionally as a batch:
indexer
{
    mem_limit = 256M
}
Searchd (optional)
Configure the search daemon (searchd) which answers queries:
searchd
{
    listen          = localhost:9312
    listen          = localhost:9306:mysql41
    log             = /home/andrew/sphinx.log
    read_timeout    = 8
    max_children    = 30
    pid_file        = /home/andrew/sphinx.pid
    max_matches     = 25
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old      = 1
}
stopwords.txt
To generate stopwords from your data, use the indexer binary:
indexer --config /path/to/sphinx.conf --buildstops /path/to/stopwords.txt 25
of
who
must
in
and
the
mad
An
Builds a stopwords.txt file with the 25 most commonly found words. Use --buildfreqs to include counts.
Stopwords can dramatically reduce the index size and time-to-build, but it’s a good idea to inspect the output before using it!
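Conceptually, building a stopword list is a word frequency count. The sketch below shows that idea in plain Python; it is not the indexer's implementation, and `build_stops` and the sample texts are hypothetical.

```python
from collections import Counter


def build_stops(texts: list[str], limit: int) -> list[tuple[str, int]]:
    """Return the `limit` most frequent words with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(limit)


# Hypothetical document texts.
texts = [
    "a drama of a mad scientist",
    "a tale of a robot",
    "the robot and the dog",
]
top = build_stops(texts, 4)
```

Note that "robot" makes the top four here. That is the reason to inspect the list before using it: a frequent word can still be one people search for.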
Build your Index
To generate your index, use the indexer binary:
indexer --config /path/to/sphinx.conf --all --rotate
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx.conf'...
indexing index 'films'...
collected 1000 docs, 0.1 MB
sorted 0.3 Mhits, 100.0% done
total 1000 docs, 108077 bytes
total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg
Start the Server
Start the server by executing the searchd binary:
searchd --config /path/to/sphinx.conf
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx.conf'...
listening on 127.0.0.1:9312
listening on 127.0.0.1:9306
precaching index 'films'
precached 1 indexes in 0.001 sec
Run a Search
Test your index by running a search:
search --limit 3 robot
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
index 'films': query 'robot ': returned 77 matches of 77 total in 0.000 sec
displaying matches:
1. document=138, weight=1612, release_year=2006, rating=R, language=English
2. document=920, weight=1612, release_year=2006, rating=G, language=English
3. document=6, weight=1581, release_year=2006, rating=PG, language=English
words:
1. 'robot': 77 documents, 79 hits
MySQL Interface
You can query Sphinx using the MySQL protocol:
mysql -h127.0.0.1 -P 9306
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.4-release (r3135)
Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Queries are written in SphinxQL, which is much like SQL:
mysql> SELECT * FROM films WHERE MATCH('robot') ORDER BY release_year DESC LIMIT 5;
+------+--------+--------------+--------+----------+
| id   | weight | release_year | rating | language |
+------+--------+--------------+--------+----------+
|    6 |   1581 |         2006 | PG     | English  |
|   16 |   1581 |         2006 | NC-17  | English  |
|   25 |   1581 |         2006 | G      | English  |
|   42 |   1581 |         2006 | NC-17  | English  |
|   61 |   1581 |         2006 | G      | English  |
+------+--------+--------------+--------+----------+
5 rows in set (0.00 sec)
Additional metrics can also be retrieved:
mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 77    |
| total_found   | 77    |
| time          | 0.000 |
| keyword[0]    | robot |
| docs[0]       | 77    |
| hits[0]       | 79    |
+---------------+-------+
6 rows in set (0.00 sec)
You can even do grouping:
mysql> SELECT rating, COUNT(*) AS num_movies, MIN(release_year) AS first_year FROM films GROUP BY rating ORDER BY num_movies DESC;
+------+--------+--------------+--------+------------+--------+
| id   | weight | release_year | rating | first_year | @count |
+------+--------+--------------+--------+------------+--------+
|    7 |      1 |         2006 | PG-13  |       2006 |    223 |
|    3 |      1 |         2006 | NC-17  |       2006 |    210 |
|    8 |      1 |         2006 | R      |       2006 |    195 |
|    1 |      1 |         2006 | PG     |       2006 |    194 |
|    2 |      1 |         2006 | G      |       2006 |    178 |
+------+--------+--------------+--------+------------+--------+
5 rows in set (0.00 sec)
Other Applications
Sphinx does more than just full text search. It has other practical applications as well:
• Metrics and Reporting
• Data Warehouse
• Materialized Views
• Operational Data Store
• Offloading Queries
Quick and Dirty PHP
Integrate Sphinx by using any MySQL driver (like PDO).
SphinxAPI
Or use a native extension like SphinxClient for PHP.
Download it here: http://pecl.php.net/sphinx
Indexing Strategies
Sphinx supports several types of indexes:
• Disk
• In-memory
• Distributed
• Real-time
Main+delta Batch Indexes
Disk indexes often use the main+delta(s) strategy:
• One or more delta indexes collect new data as often as every minute.
• Larger batch indexes rebuild daily, weekly or even less frequently.
Disk indexes have the following benefits:
• They can be re-indexed online without interruption (--rotate)
• They can be distributed over filesystems and hardware
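A rough sketch of how a main+delta search can be combined: query both indexes and let a document in the delta hide its stale copy in main (Sphinx calls this mechanism a kill-list). The dictionaries and the `search` helper are hypothetical stand-ins for real indexes.

```python
def search(main: dict[int, str], delta: dict[int, str], keyword: str) -> list[int]:
    """Search both indexes; documents present in the delta override main."""
    results = []
    for doc_id, text in delta.items():
        if keyword in text.lower().split():
            results.append(doc_id)
    for doc_id, text in main.items():
        # Skip main's copy when a newer version exists in the delta.
        if doc_id not in delta and keyword in text.lower().split():
            results.append(doc_id)
    return sorted(results)


main = {1: "robot wars", 2: "old robot title"}       # rebuilt rarely
delta = {2: "renamed title", 3: "new robot film"}    # rebuilt often
found = search(main, delta, "robot")
```

Document 2 was renamed since the last main rebuild, so it no longer matches "robot" even though its old text is still sitting in main.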
The End
Andrew Kandels
Website: http://andrewkandels.com
Twitter: @andrewkandels
Facebook/G+: No thanks
There’s a book!