In Search Of... (Dutch PHP Conference 2010)


Upload: ian-barber

Post on 17-May-2015


DESCRIPTION

DPC version of my In Search Of search introduction talk

TRANSCRIPT

Page 1: In Search Of... (Dutch PHP Conference 2010)

In Search Of...

Ian Barber
@ianbarber

http://phpir.com
[email protected]

http://joind.in/1556

integrating site search

Page 2: In Search Of... (Dutch PHP Conference 2010)

How Search Works
Integrating Search
Improving Results
Using Search
Questions

Page 3: In Search Of... (Dutch PHP Conference 2010)


Page 4: In Search Of... (Dutch PHP Conference 2010)

[Diagram: Documents, Analyser, Index, Query Parser, Queries, Results]

Page 5: In Search Of... (Dutch PHP Conference 2010)

Tokenisation

“With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”

Page 6: In Search Of... (Dutch PHP Conference 2010)

PHP Tokenisation

function tokenise($string) {
    $string = strtolower($string);
    // PREG_OFFSET_CAPTURE makes each match a (token, offset) pair
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

Page 7: In Search Of... (Dutch PHP Conference 2010)

Document Term Pairs

Document ID  Term
1            the
1            best
1            of
1            the
...          ...
204          and
204          what
204          would
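
A minimal sketch of how pairs like these could be produced, assuming the same lower-case and \w+ splitting as tokenise(). The termPairs() helper and the two sample texts are hypothetical, made up only to mirror the table.

```php
<?php
// Hypothetical helper: emit one (document ID, term) pair per token occurrence.
function termPairs(array $documents): array {
    $pairs = array();
    foreach ($documents as $docID => $text) {
        // Same lower-case + \w+ split as tokenise(), without offsets.
        preg_match_all('/\w+/', strtolower($text), $matches);
        foreach ($matches[0] as $term) {
            $pairs[] = array($docID, $term);
        }
    }
    return $pairs;
}

// Illustrative documents only; the text is invented for this sketch.
$pairs = termPairs(array(
    1   => 'The best of the',
    204 => 'and what would',
));
foreach ($pairs as $pair) {
    echo $pair[0], ' ', $pair[1], "\n";
}
```

Sorting these pairs by term and grouping them gives the inverted index on the next slide.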

Page 8: In Search Of... (Dutch PHP Conference 2010)

Inverted Index

Term   Documents
best   1 (4, 16), 4 (422), 129 (344) ...
what   24 (50, 98), 75 (33, 208) ...
would  99 (32, 599), 201 (344) ...
...    ...
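
One possible way to build such an index in PHP, sketched under the assumption that the numbers in brackets are word positions. buildIndex() and the sample documents are hypothetical names and data, not part of the talk.

```php
<?php
// Hypothetical helper: map term => docID => list of word positions.
function buildIndex(array $documents): array {
    $index = array();
    foreach ($documents as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $matches);
        foreach ($matches[0] as $position => $term) {
            $index[$term][$docID][] = $position;
        }
    }
    ksort($index); // keep terms in sorted order, like a dictionary
    return $index;
}

// Illustrative documents only.
$index = buildIndex(array(
    1 => 'what would be best',
    2 => 'best of the best',
));
print_r($index['best']);
```

Each posting list here has the shape getWeight() expects later in the deck: count($index['best']) is the number of documents containing the term, and count($index['best'][2]) is the term frequency in document 2.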

Page 9: In Search Of... (Dutch PHP Conference 2010)

Boolean Query Merge

Query: Best Western Hotel

Result: Document 298

best     1 4 129 298 305 338
western  4 95 194 204 298 305
hotel    2 40 200 298 355 402
working  4 298 305
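
The merge can be sketched as a pairwise intersection of sorted posting lists, using the document IDs from the slide. intersect() is a hypothetical helper; the "working" row appears to be the intermediate result of best AND western.

```php
<?php
// Intersect two posting lists that are sorted by ascending document ID.
function intersect(array $a, array $b): array {
    $result = array();
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i]; // document contains both terms
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $result;
}

$best    = array(1, 4, 129, 298, 305, 338);
$western = array(4, 95, 194, 204, 298, 305);
$hotel   = array(2, 40, 200, 298, 355, 402);

$working = intersect($best, $western);  // 4, 298, 305
$result  = intersect($working, $hotel); // 298
echo 'Result: Document ', implode(', ', $result), "\n";
```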

Page 10: In Search Of... (Dutch PHP Conference 2010)

[Slide filled with repeated Lorem ipsum placeholder text]

Page 11: In Search Of... (Dutch PHP Conference 2010)

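TF-IDF

// $term is the posting list for one term: docID => positions
function getWeight($docID, $term, $total) {
    $tf = count($term[$docID]);           // occurrences in this document
    $idf = log($total / count($term), 2); // rarer terms weigh more
    return $tf * $idf;
}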
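
A worked example of getWeight(), using only the three postings shown for 'best' on the inverted index slide. The collection size of 12 documents is an assumption, picked so the IDF comes out as a round number.

```php
<?php
// getWeight() as on the slide: $term is one posting list (docID => positions).
function getWeight($docID, $term, $total) {
    $tf  = count($term[$docID]);
    $idf = log($total / count($term), 2);
    return $tf * $idf;
}

// Postings for 'best' as shown on the inverted index slide.
$best = array(1 => array(4, 16), 4 => array(422), 129 => array(344));

// Assumed collection size: 12 documents, 3 of which contain 'best'.
$weightDoc1 = getWeight(1, $best, 12); // tf = 2, idf = log2(12 / 3) = 2
$weightDoc4 = getWeight(4, $best, 12); // tf = 1, idf = 2
echo $weightDoc1, ' ', $weightDoc4, "\n";
```

Document 1 scores twice as high as document 4 because the term occurs twice in it; a term found in every document would get an IDF of zero.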

Page 12: In Search Of... (Dutch PHP Conference 2010)

Document Vector

        socket  what  heavy  steel  ...
Doc 1   0.02    0.3   0.001  0      ...
Doc 2   0       0     0      0      ...
Doc 3   0.001   0.2   0      0      ...
Doc 4   0       0     0.002  0.003  ...

Page 13: In Search Of... (Dutch PHP Conference 2010)

Ranked Query Merge

best     23     42     179    246    333    703
weight   0.008  0.002  0.023  0.039  0.014  0.001

western  42     88     120    179    246    798
weight   0.003  0.004  0.023  0.001  0.034  0.004

1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
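
The ranking can be reproduced by summing, per document, the weights of every query term it contains. A minimal sketch using the weights from the slide; the $index layout (term => docID => weight) is an assumption made for illustration.

```php
<?php
// Posting lists from the slide: docID => precomputed weight.
$index = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Sum the weights of every query term a document contains.
$scores = array();
foreach (array('best', 'western') as $term) {
    foreach ($index[$term] as $docID => $weight) {
        $scores[$docID] = ($scores[$docID] ?? 0) + $weight;
    }
}
arsort($scores); // highest score first, document IDs kept as keys

$rank = 1;
foreach (array_slice($scores, 0, 3, true) as $docID => $score) {
    echo $rank++, ' - ', $docID, ': ', round($score, 3), "\n";
}
```

Unlike the boolean merge, a document does not need every term to be returned: document 120 only contains 'western' but still ranks third.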

Page 14: In Search Of... (Dutch PHP Conference 2010)

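PHP Similarity

function score($queryString, $index) {
    $query = tokenise($queryString);
    $matches = array();
    foreach ($query as $qterm) {
        $term = $qterm[0]; // tokenise() returns (token, offset) pairs
        if (!isset($index[$term])) {
            continue; // term not in the index
        }
        foreach ($index[$term] as $id => $posting) {
            if (!isset($matches[$id])) {
                $matches[$id] = 0;
            }
            $matches[$id] += $posting['score'];
        }
    }
    arsort($matches); // sorts in place and returns a bool
    return $matches;
}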

Page 15: In Search Of... (Dutch PHP Conference 2010)


Integrating Search

Page 16: In Search Of... (Dutch PHP Conference 2010)

MySQL Full Text Search

CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;

INSERT INTO example (title, content) VALUES
    ('Mikko & Bacon','Mikko loves bacon'),
    ('Marcello & Bacon','Marcello hates bacon'),
    ('Jo & Sausages','Johanna loves sausages'),
    ('Hollywood & Garlic','Lorenzo hates garlic'),
    ('James & Cheddar','James is keen on cheeses');

Page 17: In Search Of... (Dutch PHP Conference 2010)

MySQL FTI Query

SELECT * FROM example
WHERE MATCH(title,content) AGAINST('loves bacon');

+----+------------------+------------------------+
| id | title            | content                |
+----+------------------+------------------------+
|  1 | Mikko & Bacon    | Mikko loves bacon      |
|  2 | Marcello & Bacon | Marcello hates bacon   |
|  3 | Jo & Sausages    | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)

Page 18: In Search Of... (Dutch PHP Conference 2010)


Sphinx http://www.sphinxsearch.com

Page 19: In Search Of... (Dutch PHP Conference 2010)

Sphinx Configuration

source posts
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = password
    sql_db   = search

    sql_query = \
        SELECT id, title, content FROM example;
    sql_attr_multi = uint tag from query; \
        SELECT example_id, tag_id FROM tags;
}

Page 20: In Search Of... (Dutch PHP Conference 2010)

index posts
{
    source     = posts
    path       = /var/data/sphinx/example
    morphology = stem_en

    min_word_len   = 3
    min_prefix_len = 3
    min_infix_len  = 0
    enable_star    = 1
}

Page 21: In Search Of... (Dutch PHP Conference 2010)

Stemming

happening - happen
happened - happen
happens - happen

http://tartarus.org/~martin/PorterStemmer
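
To illustrate the idea only, here is a deliberately naive suffix stripper. It is not the Porter algorithm linked above, which applies a much longer set of rules; naiveStem() and its three suffixes are assumptions chosen to cover the words on the slide.

```php
<?php
// Naive suffix stripping for illustration only; this is NOT the Porter algorithm.
function naiveStem(string $word): string {
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Require a stem of at least three letters before stripping.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens') as $word) {
    echo $word, ' - ', naiveStem($word), "\n";
}
```

The same stemmer has to be applied at index time and at query time, otherwise a query for 'happened' would never meet the indexed term 'happen'.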

Page 22: In Search Of... (Dutch PHP Conference 2010)

Command Line Searching

indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon

displaying matches:
1. document=1, weight=3, tag=(1,2)
    id=1
    title=Mikko & Bacon
    content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits

searchd --config /etc/sphinx.conf

Page 23: In Search Of... (Dutch PHP Conference 2010)

Sphinx From PHP

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);

$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);

$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);

Page 24: In Search Of... (Dutch PHP Conference 2010)

Swish-E
http://swish-e.org

pecl install swish-beta

Page 25: In Search Of... (Dutch PHP Conference 2010)

Filesystem Index With Swish-E

fs-swish-e.conf

IndexDir /var/data/documents
IndexFile fs-swish-e.index
IndexOnly .doc .docx .pdf
FuzzyIndexingMode Stemming_en1

FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl

/usr/local/bin/swish-e -S fs -c fs-swish-e.conf

Page 26: In Search Of... (Dutch PHP Conference 2010)

Crawling Content

www-swish-e.conf

IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile www-swish-e.index
SwishProgParameters default http://phpir.com/

FuzzyIndexingMode Stemming_en1
DefaultContents HTML

/usr/local/bin/swish-e -S prog -c www-swish-e.conf

Page 27: In Search Of... (Dutch PHP Conference 2010)

Swish-E With Multiple Indices

$swish = new Swish(
    'www-swish-e.index fs-swish-e.index');
$search = $swish->prepare();

$queryStr = 'search string goes here';
$result = $search->execute($queryStr);
$total = $result->hits;

while ($r = $result->nextResult()) {
    echo $r->swishdocpath; // url
}

Page 28: In Search Of... (Dutch PHP Conference 2010)


Lucene

Page 29: In Search Of... (Dutch PHP Conference 2010)

Build Index

$index = Zend_Search_Lucene::create('idx');
foreach ($documents as $title => $content) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::Text('title', $title));
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('content', $content));
    $index->addDocument($doc);
}

Page 30: In Search Of... (Dutch PHP Conference 2010)

Query Zend Search Lucene

$results = $index->find('loves bacon');
foreach ($results as $result) {
    echo $result->score, " ";
    echo $result->title, "\n";
}

Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon

Page 31: In Search Of... (Dutch PHP Conference 2010)

Index HTML

$file = file_get_contents($url);

$doc = Zend_Search_Lucene_Document_Html::loadHTML($file);

$doc->addField(
    Zend_Search_Lucene_Field::Text('url', $url));
$index->addDocument($doc);

Page 32: In Search Of... (Dutch PHP Conference 2010)


Solr http://lucene.apache.org/solr/

Page 33: In Search Of... (Dutch PHP Conference 2010)

Solr Search Index

$options = array(
    'hostname' => 'localhost',
    'port' => 8983
);

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();

Page 34: In Search Of... (Dutch PHP Conference 2010)

Solr Search Client

$client = new SolrClient($options);

$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();

foreach ($r['response']['docs'] as $d) {
    echo $d->title[0] . "\n";
}

Page 35: In Search Of... (Dutch PHP Conference 2010)

Xapian
http://xapian.org

Page 36: In Search Of... (Dutch PHP Conference 2010)

Xapian In PHP

$db = new XapianWritableDatabase(
    'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));

$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);

$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);

Page 37: In Search Of... (Dutch PHP Conference 2010)

Xapian Search In PHP

$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
    XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);

$enquire->set_query($query);

Page 38: In Search Of... (Dutch PHP Conference 2010)

$matches = $enquire->get_mset(0, 10);

$i = $matches->begin();
while (!$i->equals($matches->end())) {
    $n = $i->get_rank() + 1;
    $data = $i->get_document()->get_data();
    $title = $i->get_document()->get_value(1);
    $score = $i->get_percent();
    $i->next();
}

Page 39: In Search Of... (Dutch PHP Conference 2010)


Improving Results

Page 40: In Search Of... (Dutch PHP Conference 2010)


Anchor Text

Page 41: In Search Of... (Dutch PHP Conference 2010)

Parse Anchor Text

$p = file_get_contents('http://phpir.com');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p); // loadHTML() is an instance method
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}

Page 42: In Search Of... (Dutch PHP Conference 2010)


Zone Weighting

Page 43: In Search Of... (Dutch PHP Conference 2010)

ZSL Zone Weighting

$doc = new Zend_Search_Lucene_Document();

$tfield = Zend_Search_Lucene_Field::Text('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);

$doc->addField(
    Zend_Search_Lucene_Field::UnStored('content', $content));

$index->addDocument($doc);
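
The effect of a zone boost can be sketched without any library. This is not how Zend Search Lucene computes its scores; zoneScore() is a hypothetical scorer that simply counts term hits per zone and scales them, reusing the 1.3 title boost from the slide.

```php
<?php
// Hypothetical scorer: count query-term hits per zone and scale by a zone boost.
function zoneScore(array $zones, array $boosts, string $term): float {
    $score = 0.0;
    foreach ($zones as $zone => $text) {
        preg_match_all('/\w+/', strtolower($text), $matches);
        $hits = count(array_keys($matches[0], strtolower($term), true));
        $score += $hits * ($boosts[$zone] ?? 1.0); // unlisted zones get no boost
    }
    return $score;
}

$boosts = array('title' => 1.3, 'content' => 1.0); // 1.3 as on the slide

// Illustrative documents only: same term, different zone.
$inTitle = zoneScore(
    array('title' => 'Mikko & Bacon', 'content' => 'Mikko loves pork'),
    $boosts, 'bacon');
$inContent = zoneScore(
    array('title' => 'Mikko & Pork', 'content' => 'Mikko loves bacon'),
    $boosts, 'bacon');

echo $inTitle, ' ', $inContent, "\n";
```

A match in the title outranks the same match in the body, which is the point of weighting zones differently.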

Page 44: In Search Of... (Dutch PHP Conference 2010)


Document Authority

Page 45: In Search Of... (Dutch PHP Conference 2010)

Document Weights in ZSL

$doc = new Zend_Search_Lucene_Document();
$doc->addField(
    Zend_Search_Lucene_Field::Text('title', $title));
$doc->addField(
    Zend_Search_Lucene_Field::UnStored('content', $content));

$doc->boost = 1 + ($numComments / 100);

$index->addDocument($doc);

Page 46: In Search Of... (Dutch PHP Conference 2010)


Using Search

Page 47: In Search Of... (Dutch PHP Conference 2010)


Summaries & Highlighting

Page 48: In Search Of... (Dutch PHP Conference 2010)

Sphinx Extract & Highlight

$cl = new SphinxClient();
$cl->SetServer("localhost", 3312);
$q = 'bacon';
$r = $cl->Query($q);
$text = array();
foreach ($r["matches"] as $doc => $info) {
    $text[$doc] = getTextFromDB($doc);
}

$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach ($extracts as $extract) {
    echo $extract;
}
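
For setups without a search server that builds excerpts, highlighting can be approximated in plain PHP. A minimal sketch, assuming whole-word, case-insensitive matching; highlight() is a hypothetical helper and does no excerpt trimming.

```php
<?php
// Hypothetical helper: wrap whole-word matches of the query terms in <b> tags.
function highlight(string $text, string $query): string {
    preg_match_all('/\w+/', $query, $matches);
    if (count($matches[0]) === 0) {
        return $text; // nothing to highlight
    }
    // Escape each term so it is safe inside the regular expression.
    $terms = array_map(function ($term) {
        return preg_quote($term, '/');
    }, $matches[0]);
    $pattern = '/\b(' . implode('|', $terms) . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

echo highlight('Mikko loves bacon', 'bacon'), "\n";
```

This does not know about stemming, so a query for 'love' would not highlight 'loves'; the search engine's own excerpt builder can do that because it shares the index's analyser.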

Page 49: In Search Of... (Dutch PHP Conference 2010)
Page 50: In Search Of... (Dutch PHP Conference 2010)

Xapian Spelling Correction

Indexer

$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
    XapianTermGenerator::FLAG_SPELLING);

Searcher

$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
    XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
    $q->get_corrected_query_string() . "\n";

Page 51: In Search Of... (Dutch PHP Conference 2010)

Spelling Correction Output

php xapsearch.php

Did you mean: str_replace or strcmp

4644 results found for “strreplace or str_cmp”:
1: 2% docid=572 [phpdocs/html/cc.license.html]
2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
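
A similar suggestion can be sketched with PHP's built-in levenshtein(). This is a swapped-in technique, not what Xapian does internally; suggest() and the tiny dictionary are assumptions for illustration, where a real dictionary would come from the indexed terms.

```php
<?php
// Hypothetical helper: suggest the closest dictionary word by edit distance.
function suggest(string $word, array $dictionary, int $maxDistance = 2): string {
    $best = $word; // fall back to the original word
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Illustrative dictionary only.
$dictionary = array('str_replace', 'strcmp', 'str_repeat', 'or');

$corrected = array();
foreach (explode(' ', 'strreplace or str_cmp') as $word) {
    $corrected[] = suggest($word, $dictionary);
}
echo 'Did you mean: ', implode(' ', $corrected), "\n";
```

Comparing against every dictionary word is only workable for small vocabularies, which is one reason to let the search engine maintain its own spelling data.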

Page 52: In Search Of... (Dutch PHP Conference 2010)


Results Sorting

Page 53: In Search Of... (Dutch PHP Conference 2010)

Sorting in ZSL

$q = Zend_Search_Lucene_Search_QueryParser::parse('search string');

$results = $index->find($q, 'title');
foreach ($results as $result) {
    echo '<h3>', $result->title, "</h3>\n";
    $doc = getDocumentFromDB($result->did);
    echo $q->htmlFragmentHighlightMatches($doc);
}

Page 54: In Search Of... (Dutch PHP Conference 2010)


Faceted Search

Page 55: In Search Of... (Dutch PHP Conference 2010)

Faceted Search In Solr

$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// Facet options must be set before the query is sent
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach ($f['cat'] as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
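
What a facet count represents can be shown in plain PHP: group the matching documents by a field and count each value. The result set and its category names below are made up for this sketch; Solr computes the same kind of counts inside the index.

```php
<?php
// Hypothetical result set: each matching document carries a 'cat' field.
$results = array(
    array('title' => 'Mikko & Bacon',    'cat' => 'meat'),
    array('title' => 'Marcello & Bacon', 'cat' => 'meat'),
    array('title' => 'Bacon & Cheddar',  'cat' => 'cheese'),
);

// Count how many matching documents fall into each category value.
$facets = array_count_values(array_column($results, 'cat'));
arsort($facets); // most common category first

foreach ($facets as $facet => $count) {
    echo $facet, ' ', $count, "\n";
}
```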

Page 56: In Search Of... (Dutch PHP Conference 2010)


More Like This

Page 57: In Search Of... (Dutch PHP Conference 2010)

More Like This

$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);

$t = $e->begin();
for ($t; !$t->equals($e->end()); $t->next()) {
    $qs[] = new XapianQuery(
        $t->get_term(), intval($t->get_weight()));
}

$query = new XapianQuery(XapianQuery::OP_OR, $qs);

Page 58: In Search Of... (Dutch PHP Conference 2010)

More Like This Example

php xapsim.php

1656 results found:
1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]

Page 59: In Search Of... (Dutch PHP Conference 2010)

Image Credits

Title http://www.flickr.com/photos/generated/2084287794/
What Do You Want http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx http://www.flickr.com/photos/generated/2084287794/
Lucene http://www.flickr.com/photos/mypanda/7731447/
Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/
Solr http://www.flickr.com/photos/m-j-s/2724756177/
Xapian http://www.flickr.com/photos/olibac/3522056495/
Using Search http://www.flickr.com/photos/eneas/175027945/
Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/

Page 60: In Search Of... (Dutch PHP Conference 2010)

Questions?


Page 61: In Search Of... (Dutch PHP Conference 2010)

Thank You!

Ian Barber
@ianbarber

http://phpir.com
[email protected]

http://joind.in/1556