Download - Full Text Search In PostgreSQL
Practical full-text search in PostgreSQLBill KarwinPostgreSQL Conference West 09 • 2009/10/17
Me
• 20+ years experience
• Application/SDK developer• Support, Training, Proj Mgmt• C, Java, Perl, PHP
• SQL maven
• MySQL, PostgreSQL, InterBase• Zend Framework• Oracle, SQL Server, IBM DB2, SQLite
• Community contributor
Full Text Search
Text search
• Web applications demand speed
• Let’s compare 5 solutions for text search
Sample data
• StackOverflow.com Posts
• Data dump exported September 2009
• 1.2 million tuples
• ~850 MB
StackOverflow ER diagram
Naive SearchingSome people, when confronted with a problem,
think “I know, I’ll use regular expressions.” Now they have two problems.
— Jamie Zawinsky
Performance issue
• LIKE with wildcards:
SELECT * FROM PostsWHERE body LIKE ‘%postgresql%’
• POSIX regular expressions:
SELECT * FROM PostsWHERE body ~ ‘postgresql’
time: 91 sec
time: 105 sec
Why so slow?
CREATE TABLE telephone_book ( full_name VARCHAR(50));
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES (‘Riddle, Thomas’), (‘Thomas, Dean’);
Why so slow?
• Search for all with last name “Thomas”
SELECT * FROM telephone_bookWHERE full_name LIKE ‘Thomas%’
• Search for all with first name “Thomas”
SELECT * FROM telephone_bookWHERE full_name LIKE ‘%Thomas’
uses index
doesn’t use index
Indexes don’t help searching for substrings
☞
Accuracy issue
• Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.:
body LIKE ‘%one%’
• Regular expressions in PostgreSQLsupport escapes for word boundaries:
body ~ ‘\yone\y’
Solutions
• Full-Text Indexing in the RDBMS
• Sphinx Search
• Apache Lucene
• Inverted Index
• Search Engine Service
PostgreSQL Text-Search
PostgreSQL Text-Search
• Since PostgreSQL 8.3
• TSVECTOR to represent text data
• TSQUERY to represent search predicates
• Special indexes
PostgreSQL Text-Search:
Basic Querying
SELECT * FROM Posts WHERE to_tsvector(title || ‘ ’ || body || ‘ ’ || tags) @@ to_tsquery(‘postgresql & performance’);
text-search matching operator
PostgreSQL Text-Search:
Basic Querying
SELECT * FROM Posts WHERE title || ‘ ’ || body || ‘ ’ || tags @@ ‘postgresql & performance’;
time with no index: 8 min 2 sec
PostgreSQL Text-Search:
Add TSVECTOR column
ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
UPDATE Posts SET PostText = to_tsvector(‘english’, title || ‘ ’ || body || ‘ ’ || tags);
Special index types
• GIN (generalized inverted index)
• GiST (generalized search tree)
PostgreSQL Text-Search:
Indexing
CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);
time: 39 min 36 sec
PostgreSQL Text-Search:
Querying
SELECT * FROM Posts WHERE PostText @@ ‘postgresql & performance’;
time with index: 20 milliseconds
PostgreSQL Text-Search:
Keep TSVECTOR in sync
CREATE TRIGGER TS_PostText BEFORE INSERT OR UPDATE ON PostsFOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( PostText, ‘english’, title, body, tags);
Lucene
Lucene
• Full-text indexing and search engine
• Apache Project since 2001
• Apache License
• Java implementation
• Ports exist for C, Perl, Ruby, Python, PHP, etc.
Lucene:
How to use
1. Add documents to index
2. Parse query
3. Execute query
Lucene:
Creating an index
• Programmatic solution in Java...
time: 8 minutes 55 seconds
Lucene:
Indexing
String url = "jdbc:postgresql:stackoverflow"; Properties props = new Properties(); props.setProperty("user", "postgres"); Class.forName("org.postgresql.Driver"); Connection con = DriverManager.getConnection(url, props);
Statement stmt = con.createStatement(); String sql = "SELECT PostId, Title, Body, Tags FROM Posts"; ResultSet rs = stmt.executeQuery(sql);
Date start = new Date();
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
run any SQL query
open Lucene index writer
Lucene:
Indexing
while (rs.next()) { Document doc = new Document();
doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO)); doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);}
writer.optimize();writer.close();
loop over SQL result
each row is a Document
with four Fields
finish and close index
Lucene:
Querying
• Parse a Lucene queryString[] fields = new String[3];fields[0] = “title”; fields[1] = “body”; fields[2] = “tags”;
Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse(‘performance’);
• Execute the querySearcher s = new IndexSearcher(indexName);
Hits h = s.search(q);
time: 80 milliseconds
parse search query
define fields
Sphinx Search
Sphinx Search
• Embedded full-text search engine
• Started in 2001
• GPLv2 license
• Good database integration
Sphinx Search:
How to use
1. Edit configuration file
2. Index the data
3. Query the index
4. Issues
Sphinx Search:
sphinx.conf
source stackoverflowsrc{ type = pgsql sql_host = localhost sql_user = postgres sql_pass = xxxx sql_db = stackoverflow sql_query = SELECT PostId, Title, Body, Tags FROM Posts sql_query_info = SELECT * FROM Posts WHERE PostId=$id}
Sphinx Search:
sphinx.conf
index stackoverflow{ source = stackoverflowsrc path = /opt/local/var/db/sphinx/stackoverflow}
Sphinx Search:
Building index
indexer -c sphinx.conf stackoverflow
collected 1242365 docs, 720.5 MBsorted 88.3 Mhits, 100.0% donetotal 1242365 docs, 720452944 bytestotal 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec
time: 5 min 57 sec
Sphinx Search:
Querying index
search -c sphinx.conf -i stackoverflow -b “sql & performance”
time: 8 milliseconds
Sphinx Search:
Issues
• Index updates are as expensive as rebuilding the index from scratch
• Maintain “main” index plus “delta” index for recent changes
• Merge indexes periodically
• Not all data fits into this model
Inverted Index
Inverted index
TagTypesPosts Tags
intersection of words / Posts
searchable words
Inverted index:
Updated ER Diagram
Inverted index:
Data definitionCREATE TABLE TagTypes ( TagId SERIAL PRIMARY KEY, Tag VARCHAR(50) NOT NULL);
CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);
CREATE TABLE Tags ( PostId INT NOT NULL, TagId INT NOT NULL, PRIMARY KEY (PostId, TagId), FOREIGN KEY (PostId) REFERENCES Posts (PostId), FOREIGN KEY (TagId) REFERENCES TagTypes (TagId));
CREATE INDEX Tags_PostId_index ON Tags(PostId);CREATE INDEX Tags_TagId_index ON Tags(TagId);
Inverted index:
Indexing
INSERT INTO Tags (PostId, TagId) SELECT p.PostId, t.TagId FROM Posts p JOIN TagTypes t ON (p.Tags LIKE ‘%<’ || t.Tag || ‘>%’);
90 seconds per tag!!
Inverted index:
Querying
SELECT p.* FROM Posts pJOIN Tags t USING (PostId)JOIN TagTypes tt USING (TagId)WHERE tt.Tag = ‘performance’;
40 milliseconds
Search Engine Services
Search engine services:
Google Custom Search Engine
• http://www.google.com/cse/
• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html
even big web sites use this solution
Search engine services:
Is it right for you?
• Your site is public and allows external index
• Search is a non-critical feature for you
• Search results are satisfactory
• You need to offload search processing
Comparison: Time to Build Index
LIKE predicate none
PostgreSQL / GIN 40 min
Sphinx Search 6 min
Apache Lucene 9 min
Inverted index high
Google / Yahoo! offline
Comparison: Index Storage
LIKE predicate none
PostgreSQL / GIN 532 MB
Sphinx Search 533 MB
Apache Lucene 1071 MB
Inverted index 101 MB
Google / Yahoo! offline
Comparison: Query Speed
LIKE predicate 90+ sec
PostgreSQL / GIN 20 ms
Sphinx Search 8 ms
Apache Lucene 80 ms
Inverted index 40 ms
Google / Yahoo! *
Comparison: Bottom-Line
LIKE predicate none none 11,250x SQL
PostgreSQL / GIN 7x 5.3x 2.5x RDBMS
Sphinx Search 1x * 5.3x 1x 3rd party
Apache Lucene 1.5x 10x 10x 3rd party
Inverted index high 1x 5x SQL
Google / Yahoo! offline offline * Service
indexing storage query solution
Copyright 2009 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/
You are free to share - to copy, distribute and transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build
upon this work.