In Search Of: Integrating Site Search (PHP Barcelona)

In Search Of... Integrating Site Search
Ian Barber, @ianbarber, http://phpir.com
Friday, 29 October 2010

Upload: ian-barber

Post on 17-May-2015

DESCRIPTION

Despite being a key method of navigation on many sites, search functionality often gets the short end of the stick in development, either by handing the job over to Google or just enabling full text search on the appropriate column in the database. In this talk we will look at how full text search actually works, how to integrate local text search engines into your PHP application, and how it’s possible to actually provide better and more relevant results than Google itself, at least for your own site.

TRANSCRIPT

Page 1: In Search Of: Integrating Site Search (PHP Barcelona)

In Search Of...

Ian Barber
@ianbarber
http://phpir.com

integrating site search

Friday, 29 October 2010

Page 2: In Search Of: Integrating Site Search (PHP Barcelona)

How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions

Page 3: In Search Of: Integrating Site Search (PHP Barcelona)


Page 4: In Search Of: Integrating Site Search (PHP Barcelona)

Diagram: documents pass through an Analyser into the Index; queries pass through a Query Parser to the Index, which returns results.

Page 5: In Search Of: Integrating Site Search (PHP Barcelona)

Tokenisation

“With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”

Page 6: In Search Of: Integrating Site Search (PHP Barcelona)

PHP Tokenisation

    function tokenise($string) {
        $string = strtolower($string);
        preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
        return $matches[0];
    }
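
To see why the sample sentence is awkward, it helps to run the tokeniser over part of it. The sketch below repeats the slide's function and feeds it a shortened version of the sentence (an assumption made for brevity, with plain ASCII apostrophes). Note how "AT&T's" and "F.B.I" fall apart into single letters and "$1.1" becomes two separate "1" tokens.

```php
<?php
// Same tokeniser as on the slide: lowercase, then capture runs of word characters.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0]; // each entry is array(token, byte offset)
}

// Shortened sample sentence (illustrative assumption, not the full slide text).
$tokens = tokenise("With AT&T's help, the F.B.I recovered \$1.1 million");

// Keep only the token text to see how punctuation splits the terms.
$terms = array_column($tokens, 0);
echo implode(' | ', $terms), "\n";
```

A production analyser would usually special-case abbreviations, currency and apostrophes rather than rely on `\w+` alone.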

Page 7: In Search Of: Integrating Site Search (PHP Barcelona)

Document Term Pairs

    Document ID    Term
    1              the
    1              best
    1              of
    1              the
    ...            ...
    204            and
    204            what
    204            would

Page 8: In Search Of: Integrating Site Search (PHP Barcelona)

Inverted Index

    Term     Documents
    best     1 (4, 16), 4 (422), 129 (344) ...
    what     24 (50, 98), 75 (33, 208) ...
    would    99 (32, 599), 201 (344) ...
    ...      ...
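
Going from document/term pairs to an inverted index can be sketched in a few lines of PHP. This is a minimal illustration, not the internals of any particular engine: `buildIndex()` and the two sample documents are hypothetical, and positions are counted in words, which is an assumption about what the numbers in brackets on the slide represent.

```php
<?php
// Build term => array(docID => array(positions)) from docID => text.
function buildIndex(array $docs) {
    $index = array();
    foreach ($docs as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $m);
        foreach ($m[0] as $position => $term) {
            $index[$term][$docID][] = $position; // 0-based word position
        }
    }
    ksort($index); // keep the terms in dictionary order
    return $index;
}

// Hypothetical two-document collection.
$index = buildIndex(array(
    1   => 'the best of the best',
    204 => 'and what would be best',
));

print_r($index['best']);
```

Looking up a term is now a single array access, instead of a scan over every document.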

Page 9: In Search Of: Integrating Site Search (PHP Barcelona)

Boolean Query Merge

Query: Best Western Hotel

    best       1 4 129 298 305 338
    western    4 95 194 204 298 305
    hotel      2 40 200 298 355 402
    working    4 298 305

Result: Document 298
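
The merge can be sketched as a pairwise intersection of sorted postings lists. The function name `intersect()` is hypothetical; the postings are the ones from the slide. Intersecting "best" with "western" gives the working set, and intersecting that with "hotel" gives the final result.

```php
<?php
// Intersect two sorted postings lists by walking both in step.
function intersect(array $a, array $b) {
    $out = array();
    $i = $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $out[] = $a[$i]; // document contains both terms
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $out;
}

// Postings lists from the slide.
$postings = array(
    'best'    => array(1, 4, 129, 298, 305, 338),
    'western' => array(4, 95, 194, 204, 298, 305),
    'hotel'   => array(2, 40, 200, 298, 355, 402),
);

$working = intersect($postings['best'], $postings['western']);
$result  = intersect($working, $postings['hotel']);
echo implode(', ', $result), "\n";
```

Because both lists are sorted, each merge takes one pass over them rather than a nested loop.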

Page 10: In Search Of: Integrating Site Search (PHP Barcelona)

(Slide filled with repeated Lorem ipsum placeholder text.)

Page 11: In Search Of: Integrating Site Search (PHP Barcelona)

TF-IDF

    function getWeight($docID, $term, $total) {
        $tf = count($term[$docID]);
        $idf = log($total / count($term), 2);
        return $tf * $idf;
    }
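
A small worked call makes the two factors concrete. The sketch below repeats the slide's function and applies it to the three "best" postings shown on the inverted index slide. Treating those three as the complete postings list, and the collection size of 12, are assumptions chosen so the arithmetic is easy to follow.

```php
<?php
// The slide's weighting function: $term is one term's postings,
// mapping docID => list of positions.
function getWeight($docID, $term, $total) {
    $tf  = count($term[$docID]);           // occurrences in this document
    $idf = log($total / count($term), 2);  // rarer terms get a larger factor
    return $tf * $idf;
}

// Postings for 'best' as shown on the inverted index slide.
$best  = array(1 => array(4, 16), 4 => array(422), 129 => array(344));
$total = 12; // hypothetical number of documents in the collection

$w1 = getWeight(1, $best, $total); // tf = 2, idf = log2(12 / 3) = 2
$w4 = getWeight(4, $best, $total); // tf = 1, same idf
echo $w1, "\n", $w4, "\n";
```

Document 1 scores twice as high as document 4 purely because the term occurs twice there; a term found in every document would get an idf of zero.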

Page 12: In Search Of: Integrating Site Search (PHP Barcelona)

Document Vector

             socket   what   heavy   steel   ...
    Doc 1    0.02     0.3    0.001   0       ...
    Doc 2    0        0      0       0       ...
    Doc 3    0.001    0.2    0       0       ...
    Doc 4    0        0      0.002   0.003   ...

Page 13: In Search Of: Integrating Site Search (PHP Barcelona)

Ranked Query Merge

    best       23     42     179    246    333    703
    weight     0.008  0.002  0.023  0.039  0.014  0.001

    western    42     88     120    179    246    798
    weight     0.003  0.004  0.023  0.001  0.034  0.004

    1 - 246: 0.073
    2 - 179: 0.024
    3 - 120: 0.023
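
The ranking on this slide can be reproduced by summing each document's weights across the query terms and sorting by the total. The sketch below uses the slide's numbers; the function name `rankDocuments()` and the array layout are assumptions for illustration.

```php
<?php
// Per-term postings with precomputed weights, copied from the slide.
$index = array(
    'best' => array(
        23 => 0.008, 42 => 0.002, 179 => 0.023,
        246 => 0.039, 333 => 0.014, 703 => 0.001,
    ),
    'western' => array(
        42 => 0.003, 88 => 0.004, 120 => 0.023,
        179 => 0.001, 246 => 0.034, 798 => 0.004,
    ),
);

// Sum the weights of every query term for each document.
function rankDocuments(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        $postings = isset($index[$term]) ? $index[$term] : array();
        foreach ($postings as $docID => $weight) {
            $previous = isset($scores[$docID]) ? $scores[$docID] : 0;
            $scores[$docID] = $previous + $weight;
        }
    }
    arsort($scores); // highest score first, docIDs kept as keys
    return $scores;
}

// Keep the top three, preserving the docID keys.
$top = array_slice(rankDocuments(array('best', 'western'), $index), 0, 3, true);
foreach ($top as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

Unlike the boolean merge, a document does not need to contain every term: document 120 only matches "western" but still makes the top three.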

Page 14: In Search Of: Integrating Site Search (PHP Barcelona)

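PHP Similarity

    function score($queryString, $index) {
        $matches = array();
        // tokenise() returns array(term, offset) pairs; only the term is needed
        foreach (tokenise($queryString) as $token) {
            $qterm = $token[0];
            if (!isset($index[$qterm])) {
                continue; // term not in the index
            }
            foreach ($index[$qterm] as $id => $posting) {
                if (!isset($matches[$id])) {
                    $matches[$id] = 0;
                }
                $matches[$id] += $posting['score'];
            }
        }
        arsort($matches); // sorts in place and returns a bool, not the array
        return $matches;
    }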

Page 15: In Search Of: Integrating Site Search (PHP Barcelona)

Integrating Search

Page 16: In Search Of: Integrating Site Search (PHP Barcelona)

MySQL Full Text Search

    CREATE TABLE example (
        id INT(11) NOT NULL auto_increment,
        title VARCHAR(255),
        content TEXT,
        PRIMARY KEY(id),
        FULLTEXT(title,content)
    ) Engine=MyISAM;

    INSERT INTO example (title, content) VALUES
        ('Mikko & Bacon','Mikko loves bacon'),
        ('Marcello & Bacon','Marcello hates bacon'),
        ('Jo & Sausages','Johanna loves sausages'),
        ('Hollywood & Garlic','Lorenzo hates garlic'),
        ('James & Cheddar','James is keen on cheeses');

Page 17: In Search Of: Integrating Site Search (PHP Barcelona)

MySQL FTI Query

    SELECT * FROM example
    WHERE MATCH(title,content) AGAINST('loves bacon');

    +----+------------------+------------------------+
    | id | title            | content                |
    +----+------------------+------------------------+
    |  1 | Mikko & Bacon    | Mikko loves bacon      |
    |  2 | Marcello & Bacon | Marcello hates bacon   |
    |  3 | Jo & Sausages    | Johanna loves sausages |
    +----+------------------+------------------------+
    3 rows in set (0.00 sec)

Page 18: In Search Of: Integrating Site Search (PHP Barcelona)

Sphinx
http://www.sphinxsearch.com

Page 19: In Search Of: Integrating Site Search (PHP Barcelona)

Sphinx Configuration

    source posts
    {
        type     = mysql
        sql_host = localhost
        sql_user = user
        sql_pass = password
        sql_db   = search

        sql_query = \
            SELECT id, title, content FROM example;
        sql_attr_multi = uint tag from query; \
            SELECT example_id, tag_id FROM tags;
    }

Page 20: In Search Of: Integrating Site Search (PHP Barcelona)

    index posts
    {
        source         = posts
        path           = /var/data/sphinx/example
        morphology     = stem_en

        min_word_len   = 3
        min_prefix_len = 3
        min_infix_len  = 0
        enable_star    = 1
    }

Page 21: In Search Of: Integrating Site Search (PHP Barcelona)

Stemming

    happening -> happen
    happened  -> happen
    happens   -> happen

http://tartarus.org/~martin/PorterStemmer
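
The idea behind stemming can be shown with a deliberately naive suffix stripper. To be clear, this is not the Porter algorithm linked above, which applies several ordered rule steps with conditions on the remaining stem; `naiveStem()` is a hypothetical helper that only handles the three endings from the slide.

```php
<?php
// Naive suffix stripper: NOT the Porter algorithm, just enough to show the idea.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when a stem of at least three letters is left behind.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens') as $word) {
    echo $word, ' -> ', naiveStem($word), "\n";
}
```

All three forms collapse to the same index term, so a query for any one of them matches documents containing the others. In practice the engine's own stemmer (for example `morphology = stem_en` in Sphinx) should be used at both index and query time.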

Page 22: In Search Of: Integrating Site Search (PHP Barcelona)

Command Line Searching

    indexer --config /etc/sphinx.conf --all
    search --config /etc/sphinx.conf love bacon

    displaying matches:
    1. document=1, weight=3, tag=(1,2)
        id=1
        title=Mikko & Bacon
        content=Mikko loves bacon
    words:
    1. 'love': 2 documents, 2 hits
    2. 'bacon': 2 documents, 4 hits

    searchd --config /etc/sphinx.conf

Page 23: In Search Of: Integrating Site Search (PHP Barcelona)

Sphinx From PHP

    $cl = new SphinxClient();
    $cl->SetServer('localhost', 3312);
    $cl->SetMatchMode(SPH_MATCH_ANY);

    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);

    $cl->SetFilter('tag', array(1));
    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);

Page 24: In Search Of: Integrating Site Search (PHP Barcelona)

Swish-E
http://swish-e.org

    pecl install swish-beta

Page 25: In Search Of: Integrating Site Search (PHP Barcelona)

Filesystem Index With Swish-E

fs-swish-e.conf

    IndexDir /var/data/documents
    IndexFile fs-swish-e.index
    IndexOnly .doc .docx .pdf
    FuzzyIndexingMode Stemming_en1

    FileFilter .pdf /usr/local/bin/swish_filter.pl
    FileFilter .doc /usr/local/bin/swish_filter.pl

    /usr/local/bin/swish-e -S fs -c fs-swish-e.conf

Page 26: In Search Of: Integrating Site Search (PHP Barcelona)

Crawling Content

www-swish-e.conf

    IndexDir /usr/local/lib/swish-e/spider.pl
    IndexFile www-swish-e.index
    SwishProgParameters default http://phpir.com/

    FuzzyIndexingMode Stemming_en1
    DefaultContents HTML

    /usr/local/bin/swish-e -S prog -c www-swish-e.conf

Page 27: In Search Of: Integrating Site Search (PHP Barcelona)

Swish-E With Multiple Indices

    $swish = new Swish('www-swish-e.index fs-swish-e.index');
    $search = $swish->prepare();

    $queryStr = 'search string goes here';
    $result = $search->execute($queryStr);
    $total = $result->hits;

    while ($r = $result->nextResult()) {
        echo $r->swishdocpath; // url
    }

Page 28: In Search Of: Integrating Site Search (PHP Barcelona)

Lucene

Page 29: In Search Of: Integrating Site Search (PHP Barcelona)

Build Index

    $index = Zend_Search_Lucene::create('idx');
    foreach ($documents as $title => $content) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(
            Zend_Search_Lucene_Field::Text('title', $title));
        $doc->addField(
            Zend_Search_Lucene_Field::UnStored('content', $content));
        $index->addDocument($doc);
    }

Page 30: In Search Of: Integrating Site Search (PHP Barcelona)

Query Zend Search Lucene

    $results = $index->find('loves bacon');
    foreach ($results as $result) {
        echo $result->score, " ";
        echo $result->title, "\n";
    }

    Output:
    0.81656279309067 Mikko and Bacon
    0.24800278854758 Marcello & Bacon

Page 31: In Search Of: Integrating Site Search (PHP Barcelona)

Index HTML

    $file = file_get_contents($url);

    $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);

    $doc->addField(
        Zend_Search_Lucene_Field::Text('url', $url));
    $index->addDocument($doc);

Page 32: In Search Of: Integrating Site Search (PHP Barcelona)

Solr
http://lucene.apache.org/solr/

Page 33: In Search Of: Integrating Site Search (PHP Barcelona)

Solr Search Index

    $options = array(
        'hostname' => 'localhost',
        'port' => 8983
    );

    $client = new SolrClient($options);
    $doc = new SolrInputDocument();
    $doc->addField('id', $id);
    $doc->addField('cat', $category);
    $doc->addField('title', $title);
    $doc->addField('text', $text);
    $response = $client->addDocument($doc);
    $client->commit();

Page 34: In Search Of: Integrating Site Search (PHP Barcelona)

Solr Search Client

    $client = new SolrClient($options);

    $query = new SolrQuery('bacon');
    $response = $client->query($query);
    $r = $response->getResponse();

    foreach ($r['response']['docs'] as $d) {
        echo $d->title[0] . "\n";
    }

Page 35: In Search Of: Integrating Site Search (PHP Barcelona)

Xapian
http://xapian.org

Page 36: In Search Of: Integrating Site Search (PHP Barcelona)

Xapian In PHP

    $db = new XapianWritableDatabase('idx', Xapian::DB_CREATE_OR_OPEN);
    $i = new XapianTermGenerator();
    $i->set_stemmer(new XapianStem("english"));

    $doc = new XapianDocument();
    $doc->set_data($content);
    $doc->add_value(1, $title);

    $i->set_document($doc);
    $i->index_text($content);
    $db->add_document($doc);

Page 37: In Search Of: Integrating Site Search (PHP Barcelona)

Xapian Search In PHP

    $database = new XapianDatabase('idx');
    $enquire = new XapianEnquire($database);
    $qp = new XapianQueryParser();
    $qp->set_stemmer(new XapianStem("english"));
    $qp->set_database($database);
    $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
    $query = $qp->parse_query($queryString);

    $enquire->set_query($query);

Page 38: In Search Of: Integrating Site Search (PHP Barcelona)

    $matches = $enquire->get_mset(0, 10);

    $i = $matches->begin();
    while (!$i->equals($matches->end())) {
        $n = $i->get_rank() + 1;
        $data = $i->get_document()->get_data();
        $title = $i->get_document()->get_value(1);
        $score = $i->get_percent();
        $i->next();
    }

Page 39: In Search Of: Integrating Site Search (PHP Barcelona)

Improving Results

Page 40: In Search Of: Integrating Site Search (PHP Barcelona)

Anchor Text

Page 41: In Search Of: Integrating Site Search (PHP Barcelona)

Parse Anchor Text

    $p = file_get_contents('http://phpir.com');

    libxml_use_internal_errors(true);
    // loadHTML() is an instance method; calling it statically fails on PHP 8
    $dom = new DOMDocument();
    $dom->loadHTML($p);
    $links = $dom->getElementsByTagName('a');

    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        $text = $link->nodeValue;
    }

Page 42: In Search Of: Integrating Site Search (PHP Barcelona)

Zone Weighting

Page 43: In Search Of: Integrating Site Search (PHP Barcelona)

ZSL Zone Weighting

    $doc = new Zend_Search_Lucene_Document();

    $tfield = Zend_Search_Lucene_Field::Text('title', $title);
    $tfield->boost = 1.3;
    $doc->addField($tfield);

    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('content', $content));

    $index->addDocument($doc);
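
What a zone boost does to a score can be sketched without any search library. The version below is a simplification and not how Zend Search Lucene combines boosts internally: it assumes the score is simply the boost multiplied by the term count, summed over zones. `zoneScore()` and the second sample document are hypothetical; the 1.3 title boost is the one from the slide.

```php
<?php
// Score = sum over zones of (zone boost x occurrences of the term in that zone).
function zoneScore($term, array $zones, array $boosts) {
    $score = 0.0;
    foreach ($zones as $zone => $text) {
        preg_match_all('/\w+/', strtolower($text), $m);
        $counts = array_count_values($m[0]);
        $tf     = isset($counts[$term]) ? $counts[$term] : 0;
        $boost  = isset($boosts[$zone]) ? $boosts[$zone] : 1.0; // unboosted zones count once
        $score += $boost * $tf;
    }
    return $score;
}

$boosts = array('title' => 1.3);
$a = array('title' => 'Mikko & Bacon', 'content' => 'Mikko loves bacon');
$b = array('title' => 'Breakfast', 'content' => 'Bacon, and more bacon');

$scoreA = zoneScore('bacon', $a, $boosts); // 1.3 * 1 + 1.0 * 1
$scoreB = zoneScore('bacon', $b, $boosts); // 1.3 * 0 + 1.0 * 2
echo $scoreA, "\n", $scoreB, "\n";
```

Both documents mention the term twice, but the one with the term in its title ranks higher, which is the effect zone weighting is meant to produce.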

Page 44: In Search Of: Integrating Site Search (PHP Barcelona)

Document Authority

Page 45: In Search Of: Integrating Site Search (PHP Barcelona)

Document Weights in ZSL

    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::Text('title', $title));
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('content', $content));

    $doc->boost = 1 + ($numComments / 100);

    $index->addDocument($doc);

Page 46: In Search Of: Integrating Site Search (PHP Barcelona)

Using Search

Page 47: In Search Of: Integrating Site Search (PHP Barcelona)

Summaries & Highlighting

Page 48: In Search Of: Integrating Site Search (PHP Barcelona)

Sphinx Extract & Highlight

    $cl = new SphinxClient();
    $cl->SetServer("localhost", 3312);
    $q = 'bacon';
    $r = $cl->Query($q);
    $text = array();
    foreach ($r["matches"] as $doc => $info) {
        $text[$doc] = getTextFromDB($doc);
    }

    // BuildExcerpts() returns one highlighted snippet per document
    $extracts = $cl->BuildExcerpts($text, 'posts', $q);
    foreach ($extracts as $extract) {
        echo $extract;
    }
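
For engines that do not build excerpts themselves, a rough equivalent can be written by hand. The sketch below is a hypothetical helper, `excerpt()`, not part of Sphinx: it cuts a window of text around the first hit, escapes it for HTML, and wraps every hit in `<b>` tags. It works on bytes, so it is only a starting point for multibyte text.

```php
<?php
// Cut a window of text around the first hit and wrap each hit in <b> tags.
function excerpt($text, $term, $radius = 30) {
    $pos = stripos($text, $term);
    if ($pos === false) {
        // No hit: fall back to the opening of the text.
        return htmlspecialchars(substr($text, 0, $radius * 2));
    }
    $start   = max(0, $pos - $radius);
    $snippet = substr($text, $start, strlen($term) + $radius * 2);
    $snippet = htmlspecialchars($snippet); // escape before adding our own markup
    // preg_quote() stops characters in the term being read as regex syntax.
    $pattern = '/' . preg_quote(htmlspecialchars($term), '/') . '/i';
    return preg_replace($pattern, '<b>$0</b>', $snippet);
}

$text = 'Mikko loves bacon, and bacon loves Mikko.';
echo excerpt($text, 'bacon', 8), "\n";
echo excerpt($text, 'bacon'), "\n";
```

With a small radius only the first hit fits in the window; with the default radius the whole sentence is returned and both hits are highlighted.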

Page 49: In Search Of: Integrating Site Search (PHP Barcelona)


Page 50: In Search Of: Integrating Site Search (PHP Barcelona)

Xapian Spelling Correction

Indexer

    $indexer = new XapianTermGenerator();
    $indexer->set_database($database);
    $indexer->set_flags(XapianTermGenerator::FLAG_SPELLING);

Searcher

    $queryString = "strreplace or str_cmp";
    $q = new XapianQueryParser();
    $q->set_database($database);
    $query = $q->parse_query($queryString,
        XapianQueryParser::FLAG_SPELLING_CORRECTION);
    echo "Did you mean: " . $q->get_corrected_query_string() . "\n";

Page 51: In Search Of: Integrating Site Search (PHP Barcelona)

Spelling Correction Output

    php xapsearch.php

    Did you mean: str_replace or strcmp

    4644 results found for “strreplace or str_cmp”:
    1: 2% docid=572 [phpdocs/html/cc.license.html]
    2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
    3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
    4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
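
The same "did you mean" behaviour can be approximated in plain PHP with the built-in `levenshtein()` function. This is a stand-in technique, not how Xapian finds its corrections: `suggest()` is a hypothetical helper that picks the indexed term with the smallest edit distance, and the term list is invented for the example.

```php
<?php
// Suggest the closest known term by edit distance (a stand-in, not Xapian's method).
function suggest($word, array $dictionary, $maxDistance = 2) {
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best; // null when nothing is within $maxDistance edits
}

// Hypothetical list of indexed terms.
$terms = array('str_replace', 'strcmp', 'strlen', 'substr');

echo suggest('strreplace', $terms), "\n";
echo suggest('str_cmp', $terms), "\n";
```

Comparing against every term is fine for a small vocabulary; a real index narrows the candidates first, which is one reason to let the search engine do this.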

Page 52: In Search Of: Integrating Site Search (PHP Barcelona)

Results Sorting

Page 53: In Search Of: Integrating Site Search (PHP Barcelona)

Sorting in ZSL

    $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');

    $results = $index->find($q, 'title');
    foreach ($results as $result) {
        echo '<h3>', $result->title, "</h3>\n";
        $doc = getDocumentFromDB($result->did);
        echo $q->htmlFragmentHighlightMatches($doc);
    }

Page 54: In Search Of: Integrating Site Search (PHP Barcelona)

Faceted Search

Page 55: In Search Of: Integrating Site Search (PHP Barcelona)

Faceted Search In Solr

    $client = new SolrClient($options);
    $query = new SolrQuery('bacon');
    // Facets must be enabled before the query is sent
    $query->setFacet(true);
    $query->addFacetField('cat');
    $response = $client->query($query);
    $r = $response->getResponse();
    $f = $r['facet_counts']['facet_fields'];
    foreach ($f['cat'] as $facet => $count) {
        echo $facet . " " . $count . "\n";
    }
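
Conceptually, a facet is just a count of matching documents per field value. The sketch below shows that in plain PHP over an in-memory result set; `facetCounts()` and the sample results are hypothetical, and Solr computes the same numbers inside the index rather than over fetched rows.

```php
<?php
// Count how many matching documents fall under each value of a field.
function facetCounts(array $docs, $field) {
    $counts = array();
    foreach ($docs as $doc) {
        if (!isset($doc[$field])) {
            continue; // documents without the field are not counted
        }
        $value = $doc[$field];
        $counts[$value] = (isset($counts[$value]) ? $counts[$value] : 0) + 1;
    }
    arsort($counts); // most common value first
    return $counts;
}

// Hypothetical result set for the query 'bacon'.
$results = array(
    array('title' => 'Mikko & Bacon',    'cat' => 'food'),
    array('title' => 'Marcello & Bacon', 'cat' => 'food'),
    array('title' => 'Bacon numbers',    'cat' => 'maths'),
);

$facets = facetCounts($results, 'cat');
foreach ($facets as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
```

The counts are typically shown next to the results as links that re-run the query with a filter on that value.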

Page 56: In Search Of: Integrating Site Search (PHP Barcelona)

More Like This

Page 57: In Search Of: Integrating Site Search (PHP Barcelona)

More Like This

    $rset = new XapianRset();
    $rset->add_document(5959); // str_replace
    $e = $enquire->get_eset(40, $rset);

    $t = $e->begin();
    for ($t; !$t->equals($e->end()); $t->next()) {
        $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight()));
    }

    $query = new XapianQuery(XapianQuery::OP_OR, $qs);

Page 58: In Search Of: Integrating Site Search (PHP Barcelona)

More Like This Example

    php xapsim.php

    1656 results found:
    1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
    2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
    3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
    4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]

Page 59: In Search Of: Integrating Site Search (PHP Barcelona)

Search Performance

Page 60: In Search Of: Integrating Site Search (PHP Barcelona)

Index Updates

Diagram: existing documents are held in the Main index; new documents go into a Delta index; queries search Delta and Main together; the Delta is then merged into Main.
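
The main plus delta scheme can be sketched with two in-memory arrays. Everything here is illustrative: `searchBoth()` and `mergeDelta()` are hypothetical names, the indices map a term to docID => score, and the rule that a delta entry replaces the older copy in main is an assumption about how updated documents should be handled.

```php
<?php
// Query both indices; a hit in the delta replaces the older copy in main.
function searchBoth($term, array $main, array $delta) {
    $mainHits  = isset($main[$term]) ? $main[$term] : array();
    $deltaHits = isset($delta[$term]) ? $delta[$term] : array();
    $hits = $deltaHits + $mainHits; // array union: delta wins on shared docIDs
    arsort($hits);                  // best score first
    return $hits;
}

// Fold the delta into main, after which the delta can be emptied.
function mergeDelta(array $main, array $delta) {
    foreach ($delta as $term => $postings) {
        $existing = isset($main[$term]) ? $main[$term] : array();
        $main[$term] = $postings + $existing;
    }
    return $main;
}

// Hypothetical term => array(docID => score) indices.
$main  = array('bacon' => array(1 => 0.4, 2 => 0.1));
$delta = array('bacon' => array(2 => 0.6, 9 => 0.3), 'garlic' => array(9 => 0.2));

$before = searchBoth('bacon', $main, $delta);
$main   = mergeDelta($main, $delta);
$delta  = array();
$after  = searchBoth('bacon', $main, $delta);
print_r($after);
```

Results are the same before and after the merge; the point of the small delta is that it is cheap to rebuild often, while the large main index is only rewritten at merge time.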

Page 61: In Search Of: Integrating Site Search (PHP Barcelona)

Search Speed

Zend Search Lucene

    $index = Zend_Search_Lucene::open('index');
    $index->optimize();

Sphinx

    indexer --merge main delta --rotate

Solr

    $client = new SolrClient($options);
    $client->optimize();

Xapian

    xapian-compact xapindex xapindex2

Page 62: In Search Of: Integrating Site Search (PHP Barcelona)

Distributing Search

Diagram: the Application queries several Indexes, each built from part of the document collection.

Page 63: In Search Of: Integrating Site Search (PHP Barcelona)

Large Scale Search

http://www.nutch.org

http://hadoop.apache.org

Page 64: In Search Of: Integrating Site Search (PHP Barcelona)

Image Credits

Title: http://www.flickr.com/photos/generated/2084287794/
What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx: http://www.flickr.com/photos/generated/2084287794/
Lucene: http://www.flickr.com/photos/mypanda/7731447/
Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
Solr: http://www.flickr.com/photos/m-j-s/2724756177/
Xapian: http://www.flickr.com/photos/olibac/3522056495/
Using Search: http://www.flickr.com/photos/eneas/175027945/
Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/

Page 65: In Search Of: Integrating Site Search (PHP Barcelona)

Questions?


Page 66: In Search Of: Integrating Site Search (PHP Barcelona)

Thank You!

Ian Barber
@ianbarber
http://phpir.com