text indexing and search libraries for php - zoë slattery - barcelona php conference 2008

Page 1: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Can you be dynamic and fast?

“Miss Marple and the case of the Missing MIPS”

Zoë Slattery

Page 2: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Page 3: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Index and search

● The problem of finding relevant information is not new.
    – 3000 years BC [1]
    – Vannevar Bush, As We May Think, 1945.

● Today applications that search the Web must be able to provide instant access to > 10 billion documents

● Many applications need some form of search, e.g. searching your hard drive, email, ...

1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.

Page 4: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Options for information retrieval

● Search engines
    – Nutch, SearchBlox.....

● Information Retrieval libraries
    – Three with broadly similar features

Library   Implementation language   Language bindings              Language ports        License
Egothor   Java                      None                           None                  BSD like
Xapian    C++                       Perl, Python, PHP, Java, TCL   None                  GPL
Lucene    Java                      None                           C++, Perl, PHP, C#    Apache 2

Page 5: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Lucene [2]

[Diagram: a Lucene application]

● Gather data (DB, Web, filesystem)
● Index documents (into the index)
● Get user query (from the user)
● Search index
● Present search results (to the user)

2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.

Page 6: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Lucene indexing

1. Documents
       "Oh for a muse of fire that would ascend the brightest heaven of invention....."

   Analysis

2. Token stream
       [fire]   [ascend]  [bright]  [heaven]

   Index creation

3. Inverted index
       Terms      Documents
       fire       Henry V, Scouting for boys...
       ascend     Aerospace, Henry V...
       ...

   Optimise

4. Optimised inverted index
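The steps above can be sketched in a few lines of PHP. This is only a minimal illustration, not how Lucene or Zend Search Lucene is implemented: the function names `analyse` and `buildInvertedIndex`, the stop-word list and the second sample document are all assumptions made for the example. No stemming is done, so "brightest" is not reduced to "bright" as it is on the slide.

```php
<?php
// Hypothetical analyser: lower-case the text, keep runs of letters, drop stop words.
function analyse(string $text, array $stopWords = ['a', 'for', 'of', 'oh', 'that', 'the', 'would']): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return array_values(array_diff($matches[0], $stopWords));
}

// Hypothetical index builder: map each term to the documents that contain it,
// with the documents listed in input order.
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        foreach (array_unique(analyse($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep the terms in sorted order
    return $index;
}

// Sample data: the first text is the quotation from the slide, the second is invented.
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend on a column of fire',
];

$index = buildInvertedIndex($documents);
print_r($index);
```

Looking up a term is then a single array access, for example `$index['fire']`, which is what makes an inverted index fast to search.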

Page 7: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Page 8: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Indexing speed

              Time to index/seconds   Time to optimise/seconds   Total time/seconds
Java + JIT    4                       0.3                        4.3
Java          32                      3                          35
PHP           167                     43                         210

Benchmark:
● 17.4 MB, 814 files of PHP source code
● Linux/Thinkpad T60

Ouch! Nearly 50 times as fast in Java

Page 9: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Why is the performance so bad?

First make sure we are comparing the same thing:

➢ Analyser
    ➢ Java Lucene has many analysers

➢ Limits on terms
    ➢ Java stops looking at 10,000 terms

➢ Scoring
    ➢ Java rounds down, PHP rounds to closest

➢ Compare indexes using Luke

Page 10: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Analysis - Java

Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
    StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
    StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Analyzing "XY&Z Corporation - xyz@example.com"
    StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
    SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
    StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

Page 11: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Analysis - PHP

Analysing "A Quick Brown Fox jumped over the Lazy Dog"
    Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
    Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Analysing "XY&Z Corporation - xyz@example.com"
    Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
    Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
    Short words filter: [xy] [corporation] [xyz] [example] [com]
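The three PHP filters can be imitated with plain functions. This is a sketch under stated assumptions, not the Zend Search Lucene code: the function names are hypothetical, the tokeniser is assumed to split on anything that is not an ASCII letter, the stop-word list holds only "a" and "the", and a "short" word is assumed to be one of fewer than two characters.

```php
<?php
// Hypothetical default filter: runs of ASCII letters, lower-cased.
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

// Hypothetical stop words filter: drop tokens found in the stop-word list.
function stopWordsFilter(array $tokens, array $stopWords = ['a', 'the']): array
{
    return array_values(array_diff($tokens, $stopWords));
}

// Hypothetical short words filter: drop tokens shorter than $minLength characters.
function shortWordsFilter(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter($tokens, fn(string $t): bool => strlen($t) >= $minLength));
}

// Format a token list the way the slides show it.
function show(array $tokens): string
{
    return implode(' ', array_map(fn(string $t): string => '[' . $t . ']', $tokens));
}

$tokens = lowerCaseTokens('A Quick Brown Fox jumped over the Lazy Dog');
echo 'Default (lower case) filter: ', show($tokens), "\n";
echo 'Stop words filter: ', show(stopWordsFilter($tokens)), "\n";
echo 'Short words filter: ', show(shortWordsFilter($tokens)), "\n";
```

Under these assumptions the three lines printed match the token lists shown for the first sentence.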

Page 12: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Compare indexes

[Screenshots: the Java index and the PHP index]

Same 663 terms

Page 13: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
    – Part one
    – Part two

● Conclusions

Page 14: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Execution profiles

● Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations

● Profiling tools (all open source)

– Java
    ● Eclipse TPTP

– PHP
    ● Xdebug
    ● KCachegrind

– System
    ● Sysprof
    ● vmstat, iostat

Page 15: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Java profile

Page 16: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Small problems with TPTP...

                  Time to index/seconds   Time to optimise/seconds   % time in indexing
Java              2.3                     0.3                        88
Java + profile    687258                  673851                     50

● Invasive and slow. Takes 600,000 times as long to execute
● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh specific scripts)
● Output file is machine readable only

But – it's free, open source and it works well enough.

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

Page 17: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

PHP profile

Page 18: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

No problems with this tool

                 Time to index/seconds   Time to optimise/seconds   % time in indexing
PHP              5                       3                          63
PHP + profile    70                      55                         56

● Not as invasive as the Java tool, but it still adds to the run time and distorts the results slightly
● Results easy to display with KCachegrind
● Output file is readable

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

Page 19: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

The normalize() function

Sum( ) = 2.92;  

18.99 – 2.92 = 16.07 

Page 20: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Micro benchmark

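<?php
        // Token and LowerCase are the classes under test.
        require_once "Token.php";
        require_once "LowerCase.php";

        $token = new Token("GO", 105, 107);
        $filter = new LowerCase();

        // Call normalize() ten million times so the call overhead dominates.
        for ($i = 0; $i < 10000000; $i++) {
                $norm_token = $filter->normalize($token);
        }
?>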

Page 21: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

normalize() opcodes

compiled vars:  !0 = $srcToken, !1 = $newToken
line     #  op                   ext  return   operands
----------------------------------------------------------------------------
11       0  RECV                     1
13       1  ZEND_FETCH_CLASS         :0 'Token'
         2  NEW                      $1 :0
         3  ZEND_INIT_METHOD_CALL    !0, 'getTermText'
         4  DO_FCALL_BY_NAME         0
         5  SEND_VAR_NO_REF          $3
         6  DO_FCALL                 1 'strtolower'
         7  SEND_VAR_NO_REF          $4
14       8  ZEND_INIT_METHOD_CALL    !0, 'getStartOffset'
         9  DO_FCALL_BY_NAME         0
        10  SEND_VAR_NO_REF          $6
15      11  ZEND_INIT_METHOD_CALL    !0, 'getEndOffset'
        12  DO_FCALL_BY_NAME         0
        13  SEND_VAR_NO_REF          $8
        14  DO_FCALL_BY_NAME         3
        15  ASSIGN                   !1, $1
16    ......

Page 22: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

System profile

1. Convert to lower case
2. Look up opcodes

Page 23: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

How Xdebug works

[Diagram: script execution]

ZEND_INIT_METHOD_CALL
    ● Convert function name to lower case
    ● Look up function in function table

DO_FCALL_BY_NAME
    Call out to profiler – start time
    Execute function
    Call out to profiler – end time
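The start-time and end-time call-outs can be imitated in user-land PHP. This is only an analogy, assuming nothing about Xdebug's internals beyond what the slide shows: Xdebug hooks into the engine itself, whereas the hypothetical `profiledCall()` helper and `$profile` array below wrap a single call by hand.

```php
<?php
// Hypothetical user-land analogue of a profiler's start/end call-outs.
$profile = [];

function profiledCall(array &$profile, string $label, callable $fn, ...$args)
{
    $start = hrtime(true);              // "call out to profiler": start time, in nanoseconds
    $result = $fn(...$args);            // execute function
    $elapsed = hrtime(true) - $start;   // "call out to profiler": end time

    // Accumulate call count and total time per label.
    $profile[$label]['calls'] = ($profile[$label]['calls'] ?? 0) + 1;
    $profile[$label]['ns']    = ($profile[$label]['ns'] ?? 0) + $elapsed;

    return $result;
}

for ($i = 0; $i < 1000; $i++) {
    $lower = profiledCall($profile, 'strtolower', 'strtolower', 'GO');
}

printf("%d calls, %d ns in total\n", $profile['strtolower']['calls'], $profile['strtolower']['ns']);
```

The wrapper itself costs time on every call, which is the same kind of distortion the profiler adds to the measurements.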

Page 24: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

The normalize() function

Sum( ) = 2.92;  

18.99 – 2.92 = 16.07 is consumed in setting up functions to be run

Page 25: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Why is function calling faster in Java?

● Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time.

● First time a function is called Java caches a reference to it in a virtual dispatch table. After that function calls are fast.

● In PHP, code can be added during execution, for example, create_function() and types are not known till code is executed. This makes keeping virtual dispatch tables much more difficult.
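A short sketch of the dynamic behaviour described above. The names `describe`, `$callee` and `$makeSuffixer` are made up for the example, and closures are used in place of create_function(), which was removed in PHP 8.0. In each case the function being called is only known once the script is running.

```php
<?php
// A function can be declared conditionally, so it does not exist until this branch runs.
$verbose = true;
if ($verbose) {
    function describe(string $s): string
    {
        return 'token: ' . $s;
    }
}

// The call target can be a string built at run time, so it must be looked up by name.
$callee = 'str' . 'toupper';
$upper = $callee('go');

// Closures also add new code while the script is executing.
$makeSuffixer = fn(string $suffix): Closure => fn(string $s): string => $s . $suffix;
$addBang = $makeSuffixer('!');

echo describe($upper), "\n";
echo $addBang($upper), "\n";
```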

Page 26: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
    – Part one
    – Part two

● Conclusions

Page 27: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

PHP profile

Page 28: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());

    $newToken->setPositionIncrement($srcToken->getPositionIncrement());

    return $newToken;
}

Page 29: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}

normalize() recoded....
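The two versions can be compared in a self-contained micro benchmark. This is a sketch, not the code that was measured for the talk: the `Token` class is a hypothetical stand-in for the Zend Search Lucene token class, and `normalizeCopy`, `normalizeInPlace` and `timeIt` are names invented for the example.

```php
<?php
// Hypothetical stand-in for the token class used on the slides.
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $text, int $start, int $end)
    {
        $this->termText = $text;
        $this->startOffset = $start;
        $this->endOffset = $end;
    }

    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $n): void { $this->positionIncrement = $n; }
}

// Original style: build a second Token through several accessor calls.
function normalizeCopy(Token $src): Token
{
    $new = new Token(strtolower($src->getTermText()), $src->getStartOffset(), $src->getEndOffset());
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

// Recoded style: lower-case the term text of the existing Token.
function normalizeInPlace(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

// Time $iterations calls of $fn, each on a freshly created Token.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn(new Token('GO', 105, 107));
    }
    return microtime(true) - $start;
}

$iterations = 100000;
printf("copy:     %.3f s\n", timeIt('normalizeCopy', $iterations));
printf("in place: %.3f s\n", timeIt('normalizeInPlace', $iterations));
```

The in-place version modifies its argument, which is safe at the call site shown on the slide because the Token passed in is created just for that call. The timings depend on the PHP version and the machine.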

Page 30: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

After fix

Page 31: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Performance improvement?

              Time to index/seconds   Time to optimise/seconds   Total time/seconds
Java + JIT    4                       0.3                        4.3
Java          32                      3                          35
PHP           167                     43                         210
PHP + fix     151                     43                         194

9.5 % improvement in indexing time (167 to 151 seconds)

Page 32: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
    – Part one
    – Part two

● Conclusions

Page 33: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Conclusions

● Two reasons why the PHP implementation of Lucene is slow:
    – Function calling overhead in PHP
    – Inefficient code in the analyser [3]
    – These are the main two, there are others....

● Dynamic and fast?
    – Hard to get to the same execution speed as Java – but possible to get closer.
    – But development speed is much better [4] – what speed do you care about?
    – Better not to use Java coding style (lots of methods that do nothing)

● So which implementation of Lucene should I use?
    – it depends.....

3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.

Page 34: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Options for PHP 

[Flowchart: decision tree with Y/N branches]

Questions:
● Do you care about speed?
● Only need basic features?
● Can support Java environment?
● Use a Web Service?

Outcomes:
● Use Zend Search Lucene
● Use Lucene via a Java bridge
● Use SOLR as web service
● No Lucene solution today [5]

5. http://pecl.php.net/package/clucene

Page 35: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Acknowledgements

● Rob Young's presentation [6] to the London PHP user group.

● Members of the PHP internals community, in particular Scott MacVicar, Derick Rethans and Dmitry Stogov.

6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)

Page 36: Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Other useful links

● http://www.egothor.org/
● http://xapian.org/
● http://lucene.apache.org/
● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
● http://www.derickrethans.nl/vld.php
● http://lucene.apache.org/nutch/
● http://www.searchblox.com/
● http://www.xdebug.org/
● http://www.eclipse.org/tptp/
● http://www.getopt.org/luke/
● http://www.projectzero.org
● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
● http://php-java-bridge.sourceforge.net/doc/
● http://www.zend.com/en/products/platform/product-comparison/java-bridge
● http://lucene.apache.org/solr/
● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html