PyCon India 2012: Rapid development of website search in Python

Rapid development of website search in Python. PyCon India, Bangalore, Sept ’12. Chetan Giridhar

Upload: chetan-giridhar

Post on 15-Jun-2015


Page 1: PyCon India 2012: Rapid development of website search in python

Rapid development of website search in Python

PyCon India, Bangalore, Sept ’12

Chetan Giridhar

Page 2

For whom!

If you’re
• an experienced developer who has implemented search solutions
• currently dirtying your hands
• prototyping website search for your startup
• dreading to learn Java
• just curious..

Page 3

Think web development

• Core functionality
• Design patterns
• Web Interface
• Usability
• Scalability
• Performance
• …?

Page 4

Search

• Often considered – ‘good to have’
• Enhances user experience
  – Focused information
  – Relevance
  – Interaction
  – Ranked searching

Page 5

Typical Search Engine

• Designing a schema
  – Convert your data into Documents and store them in the index
  – A Document is a set of fields
  – A Field is a name=value pair: {title = "python", content = "computer", tag = "language"}
• Analyzers
  – "Parse" each field of your data into indexable "tokens" or keywords
  – For "Welcome to Pycon", an analyzer produces the list ["welcome", "to", "Pycon"]
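The schema and analyzer ideas above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the Whoosh or PyLucene API: `analyze` is a hypothetical stand-in for a real analyzer, and unlike the slide's example output it lowercases every token, so "Pycon" becomes "pycon".

```python
import re


def analyze(text):
    """Hypothetical analyzer: split text into lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A document is a set of name=value fields, as in the slide's example.
document = {"title": "python", "content": "computer", "tag": "language"}

# Analyze every field into index-able tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))  # ['welcome', 'to', 'pycon']
```

Real analyzers usually add further steps such as stop-word removal or stemming; this sketch only tokenizes and lowercases.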

Page 6

Typical Search Engine

• Indexing
  – Adding documents to the index
• Query and query parsers
  – Prepare query
  – Parse
  – Analyze
• Searching
  – Lookup index

Page 7

Indexing & Committing

[Diagram. Labels: Input files; Schema based document (Field1, Field2, Field3); Analyzer; Index Writer; In-memory Index; Committed]
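The indexing and committing flow named on this slide can be sketched roughly as follows. `ToyIndexWriter` is a hypothetical class written for illustration; it is not the writer from Whoosh or PyLucene. It assumes that documents are buffered in memory and only become part of the searchable index once `commit()` is called.

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical analyzer: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


class ToyIndexWriter:
    """Illustrative writer: buffers documents until commit()."""

    def __init__(self, schema_fields):
        self.schema_fields = set(schema_fields)
        self.pending = []                   # in-memory buffer
        self.committed = defaultdict(set)   # token -> set of document ids
        self.doc_count = 0

    def add_document(self, **fields):
        # Reject fields that the schema does not declare.
        unknown = set(fields) - self.schema_fields
        if unknown:
            raise ValueError("fields not in schema: %s" % sorted(unknown))
        self.pending.append(fields)

    def commit(self):
        # Analyze buffered documents and merge them into the committed index.
        for fields in self.pending:
            doc_id = self.doc_count
            self.doc_count += 1
            for value in fields.values():
                for token in analyze(value):
                    self.committed[token].add(doc_id)
        self.pending = []


writer = ToyIndexWriter(["title", "content"])
writer.add_document(title="Whoosh", content="pure Python search library")
writer.add_document(title="PyLucene", content="Python wrapper around Lucene")
visible_before = dict(writer.committed)  # nothing is searchable yet
writer.commit()
```

Separating the two steps is what lets the benchmark later in the deck time indexing and committing independently.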

Page 8

Searching

[Diagram. Labels: Input query; Query Parser; Analyzer; Index Searcher; Index; Results]
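A rough sketch of the search path named on this slide: the input query is parsed and analyzed, then looked up in the index. Everything here is hypothetical (the `index` contents, `parse_query`, and `search`), and the AND semantics for multi-word queries is an assumption made for the example, not a statement about how Whoosh or PyLucene parse queries by default.

```python
import re


def analyze(text):
    """Hypothetical analyzer: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# A tiny committed index: token -> set of document ids (made-up data).
index = {
    "python": {0, 1},
    "search": {0},
    "library": {0},
    "lucene": {1},
}
titles = {0: "Whoosh", 1: "PyLucene"}


def parse_query(query_text):
    """Run the raw query through the same analyzer used at index time."""
    return analyze(query_text)


def search(index, terms):
    """Return ids of documents that contain every term (AND semantics)."""
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


hits = search(index, parse_query("Python Search"))
results = [titles[doc_id] for doc_id in sorted(hits)]
print(results)  # ['Whoosh']
```

Using the same analyzer for documents and queries matters: otherwise "Python" in a query would never match the lowercased token "python" stored in the index.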

Page 9

Development: Considerations

• Sourcing input data set
• Handling input queries
• How to search
  – Search engines
• How to display results
  – Customization

Page 10

Development: Options

• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• Elastic Search
• Whoosh
• Lucene: Pylucene

Page 11

Talking Pylucene & Whoosh

• Pythonic APIs
• Deployment
  – Large scale and medium sized web sites
• Rapid
  – Minimal installation
  – Clear Documentation
  – Quick Setup
  – Ease of Integration

Page 12

Pylucene

• Pylucene: Python wrappers to Lucene
• The de-facto standard for search engine library
• Lucene: an open source, pure Java, search engine library
• Embeds a Java VM with Lucene into a Python process

Page 13

Pylucene

• Simple API
• High performance indexing
• Scalable to millions of documents
• Efficient and feature rich search algorithms
• Cross platform

Page 14

Whoosh

• Whoosh is a search engine library
• Fast indexing and search
  – One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependency
• Active development and support

Page 15

Whoosh

• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature rich
• Intuitive APIs

Page 16

PyLucene            Whoosh
Document, Field     fields.Schema
IndexWriter         index.Index
QueryParser         qparser.QueryParser
Analyzer            analysis.Analyzer
IndexSearcher       searching.Searcher

Page 17

Designing search in websites

Search design should be:
• An independent component
• Pluggable
• Platform independent
• Assume minimal external dependency
• Easily extendible
• Seamless integration
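One way to read "independent" and "pluggable" is to hide the engine behind a small interface, so the website code never talks to a specific library. The sketch below assumes that reading; `SearchBackend`, `InMemoryBackend`, and `SiteSearch` are hypothetical names for illustration and are not the design of the talk's Search.py.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that every engine adapter implements."""

    @abstractmethod
    def index(self, doc_id, text):
        """Add one document to the engine."""

    @abstractmethod
    def search(self, query):
        """Return the ids of matching documents."""


class InMemoryBackend(SearchBackend):
    """Toy backend using substring matching; a real adapter would wrap an engine."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self.docs.items() if needle in text)


class SiteSearch:
    """Website-facing component; depends only on the interface."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/about", "About PyCon India")
site.add_page("/talks", "Talks on search in Python")
print(site.find("python"))  # ['/talks']
```

Swapping engines then means writing another `SearchBackend` subclass and passing it to `SiteSearch`, with no change to the calling code.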

Page 18

Search.py

fsMgr

Page 19

Demo

Page 20

Comparing Engines

• Basis of comparison
  – Indexing, Committing and Searching
• Dataset
  – 1 GB data
  – ~5000 files
  – File size ranging from 1 KB to 50 MB
• Setup
  – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
  – 3 GB RAM
  – Ubuntu Release 12.04 (precise) 32-bit
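A comparison along these three phases could be timed with a small harness like the one below. This is a sketch under assumptions, not the benchmark code used for the talk: `index_documents`, `commit`, and `search` are hypothetical stand-ins for an engine's phases, and the two sample strings stand in for the real dataset.

```python
import time


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the three phases of an engine.
def index_documents(docs):
    return [doc.lower().split() for doc in docs]


def commit(buffered):
    committed = {}
    for doc_id, tokens in enumerate(buffered):
        for token in tokens:
            committed.setdefault(token, set()).add(doc_id)
    return committed


def search(committed, term):
    return committed.get(term, set())


docs = ["Rapid development of website search", "Website search in Python"]
buffered, t_index = timed(index_documents, docs)
committed, t_commit = timed(commit, buffered)
hits, t_search = timed(search, committed, "search")

timings = {"index": t_index, "commit": t_commit, "search": t_search}
for phase, seconds in timings.items():
    print("%-6s %.6f s" % (phase, seconds))
```

For numbers worth comparing, each phase would normally be repeated several times on the full dataset and averaged, since a single run is sensitive to disk caching and other system noise.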

Page 21

Indexing

[Bar chart "Time to Index": time (s) for pylucene and whoosh; axis from 0 to 500]

Page 22

Committing

[Bar chart "Time to Commit": time (s) for pylucene and whoosh; axis from 0 to 300]

Page 23

Searching

[Bar chart "Time to Search": time (s) for pylucene and whoosh; axis from 0 to 0.01]

Page 24

Recommendations

• Search Engine Library
  – No one solution fits all problems
  – Search engine abstraction is the key
  – Scalability is critical
  – Rapid to set up, develop and tweak
• Understand and use

Page 25

Web development in Python

Getting rapid and easier by the day
• Web frameworks
  – Django, Pylons
• Http Servers
  – Tornado, Gunicorn
• Support for SQL/NoSQL databases
  – MySQL-python, pymongo
• Template Engines
  – Cheetah, jinja2
• Search
  – Pylucene, Whoosh

Page 26

References

• Whoosh
  – https://bitbucket.org/mchaput/whoosh/wiki/Home
• Pylucene
  – http://lucene.apache.org/pylucene/
  – http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy
  – http://code.google.com/p/xappy/
• ElasticSearch
  – http://www.elasticsearch.org/guide/reference/api/

Page 27

References

• Chetan’s tech space
  – http://technobeans.com
• Vishal’s technical blog
  – http://freethreads.net

Page 28

Q and A

Page 29

Backup

Page 30

Whoosh v/s Haystack v/s Xapian

• Whoosh is suitable for a small project. Limited scalability for search and indexing
  – A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed; has external dependency

Page 31

Lucene v/s Database search

• There are a number of query types that RDBMSs in general do not support without vendor extensions:
  – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
  – Word stemming queries, which consider "take," "took," and "taken" to be identical
  – Sound-like queries, which consider "cat" and "kat" to be identical
  – Synonym queries, which consider "jump," "hop," and "leap" to be identical
  – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
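To make the fuzzy-query bullet concrete, here is a small sketch that uses `difflib` from the Python standard library to match "wuzzy" against indexed terms. It only illustrates the idea: `difflib` scores string similarity in its own way and is not the algorithm Lucene or Whoosh use, and `indexed_terms` and `fuzzy_lookup` are made up for the example.

```python
import difflib

# Hypothetical vocabulary of terms already in the index.
indexed_terms = ["fuzzy", "python", "search"]


def fuzzy_lookup(term, vocabulary, cutoff=0.6):
    """Return indexed terms whose similarity to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)


matches = fuzzy_lookup("wuzzy", indexed_terms)
print(matches)  # ['fuzzy']
```

A plain SQL `WHERE term = 'wuzzy'` comparison would return nothing here, which is the gap the slide points at.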

Page 32

Typical search engine

• Indexing
  – Convert files to a format for quick look up
  – Fast random access to stored words
• Searching
  – Specify keywords
• Displaying
  – Lookup documents that are relevant
  – Ranking
  – Different types of queries
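Ranking can be illustrated with the simplest possible score: how often the query terms occur in each document. This is a deliberately naive sketch with made-up documents; real engines such as Lucene and Whoosh use more refined scoring that also weighs how rare a term is and how long a document is.

```python
import re
from collections import Counter


def analyze(text):
    """Hypothetical analyzer: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Made-up documents for illustration.
documents = {
    "a.txt": "Python search with Whoosh",
    "b.txt": "Search, search and more search in Python",
    "c.txt": "Learning Java",
}


def rank(query, documents):
    """Order matching documents by total query-term frequency, highest first."""
    terms = analyze(query)
    scored = []
    for name, text in documents.items():
        counts = Counter(analyze(text))
        score = sum(counts[term] for term in terms)
        if score:
            scored.append((score, name))
    # Sort by descending score, then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


ranking = rank("search", documents)
print(ranking)  # ['b.txt', 'a.txt']
```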

Page 33

Advanced Searching

• Morelikethis
• didyoumean