Download - Full Text Search In PostgreSQL

Practical full-text search in PostgreSQLBill KarwinPostgreSQL Conference West 09 • 2009/10/17

Me

• 20+ years experience

• Application/SDK developer• Support, Training, Proj Mgmt• C, Java, Perl, PHP

• SQL maven

• MySQL, PostgreSQL, InterBase• Zend Framework• Oracle, SQL Server, IBM DB2, SQLite

• Community contributor

Full Text Search

Text search

• Web applications demand speed

• Let’s compare 5 solutions for text search

Sample data

• StackOverflow.com Posts

• Data dump exported September 2009

• 1.2 million tuples

• ~850 MB

StackOverflow ER diagram

Naive SearchingSome people, when confronted with a problem,

think “I know, I’ll use regular expressions.” Now they have two problems.

— Jamie Zawinsky

Performance issue

• LIKE with wildcards:

SELECT * FROM PostsWHERE body LIKE ‘%postgresql%’

• POSIX regular expressions:

SELECT * FROM PostsWHERE body ~ ‘postgresql’

time: 91 sec

time: 105 sec

Why so slow?

CREATE TABLE telephone_book ( full_name VARCHAR(50));

CREATE INDEX name_idx ON telephone_book (full_name);

INSERT INTO telephone_book VALUES (‘Riddle, Thomas’), (‘Thomas, Dean’);

Why so slow?

• Search for all with last name “Thomas”

SELECT * FROM telephone_bookWHERE full_name LIKE ‘Thomas%’

• Search for all with first name “Thomas”

SELECT * FROM telephone_bookWHERE full_name LIKE ‘%Thomas’

uses index

doesn’t use index

Indexes don’t help searching for substrings

☞

Accuracy issue

• Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.:

body LIKE ‘%one%’

• Regular expressions in PostgreSQLsupport escapes for word boundaries:

body ~ ‘\yone\y’

Solutions

• Full-Text Indexing in the RDBMS

• Sphinx Search

• Apache Lucene

• Inverted Index

• Search Engine Service

PostgreSQL Text-Search

PostgreSQL Text-Search

• Since PostgreSQL 8.3

• TSVECTOR to represent text data

• TSQUERY to represent search predicates

• Special indexes

PostgreSQL Text-Search:

Basic Querying

SELECT * FROM Posts WHERE to_tsvector(title || ‘ ’ || body || ‘ ’ || tags) @@ to_tsquery(‘postgresql & performance’);

text-search matching operator


Basic Querying

SELECT * FROM Posts WHERE title || ‘ ’ || body || ‘ ’ || tags @@ ‘postgresql & performance’;

time with no index: 8 min 2 sec


Add TSVECTOR column

ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;

UPDATE Posts SET PostText = to_tsvector(‘english’, title || ‘ ’ || body || ‘ ’ || tags);

Special index types

• GIN (generalized inverted index)

• GiST (generalized search tree)


Indexing

CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);

time: 39 min 36 sec


Querying

SELECT * FROM Posts WHERE PostText @@ ‘postgresql & performance’;

time with index: 20 milliseconds


Keep TSVECTOR in sync

CREATE TRIGGER TS_PostText BEFORE INSERT OR UPDATE ON PostsFOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( PostText, ‘english’, title, body, tags);

Lucene

Lucene

• Full-text indexing and search engine

• Apache Project since 2001

• Apache License

• Java implementation

• Ports exist for C, Perl, Ruby, Python, PHP, etc.

Lucene:

How to use

1. Add documents to index

2. Parse query

3. Execute query

Lucene:

Creating an index

• Programmatic solution in Java...

time: 8 minutes 55 seconds

Lucene:

Indexing

String url = "jdbc:postgresql:stackoverflow"; Properties props = new Properties(); props.setProperty("user", "postgres"); Class.forName("org.postgresql.Driver"); Connection con = DriverManager.getConnection(url, props);

Statement stmt = con.createStatement(); String sql = "SELECT PostId, Title, Body, Tags FROM Posts"; ResultSet rs = stmt.executeQuery(sql);

Date start = new Date();

IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

run any SQL query

open Lucene index writer

Lucene:

Indexing

while (rs.next()) { Document doc = new Document();

doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO)); doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));

writer.addDocument(doc);}

writer.optimize();writer.close();

loop over SQL result

each row is a Document

with four Fields

finish and close index

Lucene:

Querying

• Parse a Lucene queryString[] fields = new String[3];fields[0] = “title”; fields[1] = “body”; fields[2] = “tags”;

Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse(‘performance’);

• Execute the querySearcher s = new IndexSearcher(indexName);

Hits h = s.search(q);

time: 80 milliseconds

parse search query

define fields

Sphinx Search

Sphinx Search

• Embedded full-text search engine

• Started in 2001

• GPLv2 license

• Good database integration

Sphinx Search:

How to use

1. Edit configuration file

2. Index the data

3. Query the index

4. Issues

Sphinx Search:

sphinx.conf

source stackoverflowsrc{ type = pgsql sql_host = localhost sql_user = postgres sql_pass = xxxx sql_db = stackoverflow sql_query = SELECT PostId, Title, Body, Tags FROM Posts sql_query_info = SELECT * FROM Posts WHERE PostId=$id}

Sphinx Search:

sphinx.conf

index stackoverflow{ source = stackoverflowsrc path = /opt/local/var/db/sphinx/stackoverflow}

Sphinx Search:

Building index

indexer -c sphinx.conf stackoverflow

collected 1242365 docs, 720.5 MBsorted 88.3 Mhits, 100.0% donetotal 1242365 docs, 720452944 bytestotal 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec

time: 5 min 57 sec

Sphinx Search:

Querying index

search -c sphinx.conf -i stackoverflow -b “sql & performance”

time: 8 milliseconds

Sphinx Search:

Issues

• Index updates are as expensive as rebuilding the index from scratch

• Maintain “main” index plus “delta” index for recent changes

• Merge indexes periodically

• Not all data fits into this model

Inverted Index

Inverted index

TagTypesPosts Tags

intersection of words / Posts

searchable words

Inverted index:

Updated ER Diagram

Inverted index:

Data definitionCREATE TABLE TagTypes ( TagId SERIAL PRIMARY KEY, Tag VARCHAR(50) NOT NULL);

CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);

CREATE TABLE Tags ( PostId INT NOT NULL, TagId INT NOT NULL, PRIMARY KEY (PostId, TagId), FOREIGN KEY (PostId) REFERENCES Posts (PostId), FOREIGN KEY (TagId) REFERENCES TagTypes (TagId));

CREATE INDEX Tags_PostId_index ON Tags(PostId);CREATE INDEX Tags_TagId_index ON Tags(TagId);

Inverted index:

Indexing

INSERT INTO Tags (PostId, TagId) SELECT p.PostId, t.TagId FROM Posts p JOIN TagTypes t ON (p.Tags LIKE ‘%<’ || t.Tag || ‘>%’);

90 seconds per tag!!

Inverted index:

Querying

SELECT p.* FROM Posts pJOIN Tags t USING (PostId)JOIN TagTypes tt USING (TagId)WHERE tt.Tag = ‘performance’;

40 milliseconds

Search Engine Services

Search engine services:

Google Custom Search Engine

• http://www.google.com/cse/

• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html

even big web sites use this solution

http://www.google.com/cse/

http://www.google.com/cse/

http://www.karwin.com/demo/gcse-demo.html




Search engine services:

Is it right for you?

• Your site is public and allows external index

• Search is a non-critical feature for you

• Search results are satisfactory

• You need to offload search processing

Comparison: Time to Build Index

LIKE predicate none

PostgreSQL / GIN 40 min

Sphinx Search 6 min

Apache Lucene 9 min

Inverted index high

Google / Yahoo! offline

Comparison: Index Storage

LIKE predicate none

PostgreSQL / GIN 532 MB

Sphinx Search 533 MB

Apache Lucene 1071 MB

Inverted index 101 MB

Google / Yahoo! offline

Comparison: Query Speed

LIKE predicate 90+ sec

PostgreSQL / GIN 20 ms

Sphinx Search 8 ms

Apache Lucene 80 ms

Inverted index 40 ms

Google / Yahoo! *

Comparison: Bottom-Line

LIKE predicate none none 11,250x SQL

PostgreSQL / GIN 7x 5.3x 2.5x RDBMS

Sphinx Search 1x * 5.3x 1x 3rd party

Apache Lucene 1.5x 10x 10x 3rd party

Inverted index high 1x 5x SQL

Google / Yahoo! offline offline * Service

indexing storage query solution

Copyright 2009 Bill Karwin

www.slideshare.net/billkarwin

Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and transmit this work, under the following conditions:

Attribution. You must attribute this work to Bill Karwin.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build

upon this work.

http://www.karwin.com

http://www.karwin.com

http://creativecommons.org/licenses/by-nc-nd/3.0/

http://creativecommons.org/licenses/by-nc-nd/3.0/

Download - Full Text Search In PostgreSQL

Top Related