pycon india 2012: rapid development of website search in python

Rapid development of website search in Python

PyCon India, Bangalore, Sept’ 12

Chetan Giridhar

For whom!

If you’re
• an experienced developer who has implemented search solutions
• currently dirtying your hands
• prototyping website search for your startup
• dreading to learn Java
• just curious

Think web development

• Core functionality
• Design patterns
• Web interface
• Usability
• Scalability
• Performance
• …?

Search

• Often considered ‘good to have’
• Enhances user experience
• Focused information
• Relevance
• Interaction
• Ranked searching

Typical Search Engine

• Designing a schema
  – Convert your data into documents and store them in the index
  – A document is a set of fields
  – A field is a name=value pair
    {title = "python", content = "computer", tag = "language"}
• Analyzers
  – "Parse" each field of your data into indexable "tokens" or keywords
  – Given "Welcome to Pycon", an analyzer will produce the list ["welcome", "to", "Pycon"]
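The schema and analyzer ideas on this slide can be tried out in a few lines of plain Python. This is only a toy sketch, not the API of Whoosh or Pylucene: `analyze`, `document` and `tokens_by_field` are hypothetical names, and the sketch assumes an analyzer that splits on non-alphanumeric characters and lowercases every token.

```python
import re


def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


# A document is a set of fields; each field is a name=value pair.
document = {"title": "python", "content": "computer", "tag": "language"}

# Run the analyzer over each field to get its indexable tokens.
tokens_by_field = {name: analyze(value) for name, value in document.items()}

print(analyze("Welcome to Pycon"))
print(tokens_by_field)
```

Because this sketch lowercases everything, "Pycon" comes out as "pycon". Real analyzers are configurable and may also drop stop words or apply stemming.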

Typical Search Engine

• Indexing
  – Adding documents to the index
• Query and query parsers
  – Prepare query
  – Parse
  – Analyze
• Searching
  – Lookup index
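A minimal sketch of these three stages, assuming a simple inverted index (a map from token to the set of documents containing it). `ToyIndex` and `analyze` are hypothetical names for illustration, not classes from either library, and the query semantics (all tokens must match) is an assumption.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids
        self.documents = {}               # document id -> original text

    def add_document(self, doc_id, text):
        """Indexing: analyze the text and record where each token occurs."""
        self.documents[doc_id] = text
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Searching: analyze the query, then look every token up in the index."""
        tokens = analyze(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = ToyIndex()
index.add_document(1, "Welcome to Pycon")
index.add_document(2, "Python search with Whoosh")
index.add_document(3, "Pycon talk on search")
print(sorted(index.search("pycon search")))
```

The same analyzer is applied to documents and to queries, so that "Pycon" in a document and "pycon" in a query end up as the same token.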

Indexing & Committing

[Diagram: Input files, Schema based document (Field1, Field2, Field3), Analyzer, Index Writer, In-memory Index, Committed]
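The split between adding documents and committing them can be sketched as follows. This is a toy model under stated assumptions: `ToyIndexWriter` is a hypothetical class, the "committed" index is just a dict, and the only behavior it illustrates is that buffered documents become searchable after `commit()`.

```python
class ToyIndexWriter:
    """Buffers analyzed documents in memory until commit() is called."""

    def __init__(self, committed_index):
        self.committed_index = committed_index  # token -> set of document ids
        self.pending = []                       # (doc_id, tokens) not yet visible

    def add_document(self, doc_id, fields):
        """Analyze every field value and keep the tokens in the in-memory buffer."""
        tokens = []
        for value in fields.values():
            tokens.extend(value.lower().split())
        self.pending.append((doc_id, tokens))

    def commit(self):
        """Merge the buffered documents into the committed index."""
        for doc_id, tokens in self.pending:
            for token in tokens:
                self.committed_index.setdefault(token, set()).add(doc_id)
        self.pending = []


committed = {}
writer = ToyIndexWriter(committed)
writer.add_document("a.txt", {"title": "Python", "content": "search engine"})
visible_before = "python" in committed
writer.commit()
visible_after = "python" in committed
print(visible_before, visible_after)
```

Timing the add and commit steps separately is what makes it possible to compare engines on indexing and committing as two distinct costs.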

Searching

[Diagram: Input query, Query Parser, Analyzer, Index Searcher, Index, Results]
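A rough sketch of the query side, assuming a `field:term` query syntax with a default field for bare terms. The functions `parse_query` and `search`, the syntax, and the tiny hand-built index are all hypothetical, chosen only to show the parser and the searcher as separate steps.

```python
def parse_query(query, default_field="content"):
    """Turn 'field:term' pairs and bare terms into (field, token) clauses."""
    clauses = []
    for part in query.split():
        field, sep, term = part.partition(":")
        if not sep:
            # No explicit field: search the default field.
            field, term = default_field, part
        # Analyze the term the same way indexed text was analyzed (lowercase).
        clauses.append((field, term.lower()))
    return clauses


def search(index, query):
    """Return the ids of documents that match every clause of the query."""
    matches = None
    for clause in parse_query(query):
        ids = index.get(clause, set())
        matches = ids if matches is None else matches & ids
    return matches or set()


# Hand-built index: (field, token) -> set of document ids.
index = {
    ("title", "python"): {1, 2},
    ("content", "computer"): {1},
    ("tag", "language"): {1, 3},
}
print(parse_query("title:Python computer"))
print(search(index, "title:Python computer"))
```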

Development : Considerations

• Sourcing input data set
• Handling input queries
• How to search
  – Search engines
• How to display results
  – Customization

Development: Options

• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• Elastic Search
• Whoosh
• Lucene: Pylucene

Talking Pylucene & Whoosh

• Pythonic APIs
• Deployment
  – Large scale and medium sized web sites
• Rapid
  – Minimal installation
  – Clear documentation
  – Quick setup
  – Ease of integration

Pylucene

• Pylucene: Python wrappers to Lucene
  – The de facto standard search engine library
• Lucene: an open source, pure Java search engine library
• Embeds a Java VM with Lucene into a Python process

Pylucene

• Simple API
• High performance indexing
• Scalable to millions of documents
• Efficient and feature rich search algorithms
• Cross platform

Whoosh

• Whoosh is a search engine library
• Fast indexing and search
  – One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependency
• Active development and support

Whoosh

• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature rich
• Intuitive APIs

PyLucene            Whoosh
Document, Field     fields.Schema
IndexWriter         index.Index
QueryParser         qparser.QueryParser
Analyzer            analysis.Analyzer
IndexSearcher       searching.Searcher

Designing search in websites

Search design should be:
• An independent component
• Pluggable
• Platform independent
• Minimal in its external dependencies
• Easily extensible
• Seamless to integrate

[Diagram: Search.py, fsMgr]
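One way to read "independent" and "pluggable" is that the website talks to a small interface and the engine sits behind it. The sketch below is an assumption about how that could look, not the design from the talk: `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and the in-memory backend does naive substring matching only so that the example runs without any search library.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Interface every engine adapter would implement."""

    @abstractmethod
    def index(self, doc_id, text):
        """Add one document to the engine."""

    @abstractmethod
    def search(self, query):
        """Return a list of matching document ids."""


class InMemoryBackend(SearchBackend):
    """Stand-in engine: naive substring matching over stored text."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(doc_id for doc_id, text in self._docs.items() if needle in text)


class SiteSearch:
    """What the website calls; it never imports an engine directly."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)


site = SiteSearch(InMemoryBackend())
site.add_page("/about", "About PyCon India")
site.add_page("/talks", "Talks on Python search")
print(site.find("python"))
```

Swapping engines then means writing another `SearchBackend` subclass that wraps the chosen library, while `SiteSearch` and the rest of the site stay unchanged.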

Comparing Engines

• Basis of comparison
  – Indexing, Committing and Searching
• Dataset
  – 1 GB data
  – ~5000 files
  – File sizes ranging from 1 KB to 50 MB
• Setup
  – Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
  – 3 GB RAM
  – Ubuntu Release 12.04 (precise) 32-bit
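The slides do not show the benchmark harness, so the following is only a sketch of how per-phase timings of this kind could be collected. `time_phase`, `build_index` and `lookup` are hypothetical helpers, the synthetic documents stand in for the real dataset, and the toy index has no separate commit step.

```python
import time


def time_phase(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def build_index(documents):
    """Toy indexing phase: token -> set of document ids."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def lookup(index, term):
    """Toy search phase: a single index lookup."""
    return index.get(term.lower(), set())


# Synthetic stand-in for the input files.
documents = {n: "sample text number %d about python" % n for n in range(1000)}

index, index_seconds = time_phase(build_index, documents)
hits, search_seconds = time_phase(lookup, index, "python")
print("index: %.4fs, search: %.6fs, hits: %d" % (index_seconds, search_seconds, len(hits)))
```

With a real engine, the same wrapper would be applied once each to the add-documents, commit and search calls, ideally averaged over several runs.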

Indexing

[Bar chart: Time to Index, time (s), pylucene vs. whoosh; axis from 0 to 500]

Committing

[Bar chart: Time to Commit, time (s), pylucene vs. whoosh; axis from 0 to 300]

Searching

[Bar chart: Time to Search, time (s), pylucene vs. whoosh; axis from 0 to 0.01]

Recommendations

• Search engine library
  – No one solution fits all problems
  – Search engine abstraction is the key
  – Scalability is critical
  – Rapid to set up, develop and tweak
• Understand and use

Web development in Python

Getting faster and easier by the day
• Web frameworks
  – Django, Pylons
• HTTP servers
  – Tornado, Gunicorn
• Support for SQL/NoSQL databases
  – MySQL-python, pymongo
• Template engines
  – Cheetah, jinja2
• Search
  – Pylucene, Whoosh

References

• Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
• Pylucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy: http://code.google.com/p/xappy/
• ElasticSearch: http://www.elasticsearch.org/guide/reference/api/


References

• Chetan’s tech space: http://technobeans.com
• Vishal’s technical blog: http://freethreads.net


Q and A

Backup

Whoosh v/s Haystack v/s Xapian

• Whoosh is suitable for a small project. Limited scalability for search and indexing
  – A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed; has external dependency

Lucene v/s Database search

• There are a number of query types that RDBMSs in general do not support without vendor extensions:
  – Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
  – Word stemming queries, which consider "take," "took," and "taken" to be identical
  – Sound-like queries, which consider "cat" and "kat" to be identical
  – Synonym queries, which consider "jump," "hop," and "leap" to be identical
  – Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
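To make the fuzzy query case concrete, here is a small sketch that treats two terms as a fuzzy match when their edit (Levenshtein) distance is within a threshold. Edit distance is one common way to define fuzziness; the function names and the `max_edits=1` default are assumptions for illustration, not how any particular engine implements it.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, candidate, max_edits=1):
    """True when candidate is within max_edits single-character edits of term."""
    return edit_distance(term, candidate) <= max_edits


print(edit_distance("fuzzy", "wuzzy"))
print(fuzzy_match("fuzzy", "wuzzy"))
print(fuzzy_match("fuzzy", "python"))
```

Stemming, sound-like and synonym queries need different techniques (stemmers, phonetic codes, synonym tables) and are not covered by this sketch.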

Typical search engine

• Indexing
  – Convert files to a format for quick look up
  – Fast random access to stored words
• Searching
  – Specify keywords
• Displaying
  – Lookup documents that are relevant
  – Ranking
  – Different types of queries
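As a rough illustration of ranking, the sketch below orders matching documents by how many times they contain the query terms. This raw term-count score is a deliberately simple assumption; search engines typically use more refined relevance formulas. `rank` and the sample documents are hypothetical.

```python
from collections import Counter


def rank(documents, query):
    """Return ids of matching documents, highest term count first."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "a": "python search in python",
    "b": "search engines",
    "c": "java",
}
print(rank(documents, "python search"))
```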

Advanced Searching

Morelikethis

didyoumean
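A "did you mean" suggestion can be approximated with the standard library alone. This sketch assumes the suggestion is the indexed word most similar to the misspelled term, using `difflib.get_close_matches`; the `did_you_mean` helper, the sample vocabulary and the 0.7 cutoff are hypothetical choices, and the built-in spelling features of real engines work differently.

```python
import difflib


def did_you_mean(term, vocabulary, cutoff=0.7):
    """Return the closest known word to term, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# In practice the vocabulary would be the set of tokens stored in the index.
vocabulary = ["python", "pycon", "search", "whoosh", "lucene"]

print(did_you_mean("pythn", vocabulary))
print(did_you_mean("xyz", vocabulary))
```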
