pycon india 2012: rapid development of website search in python
Rapid development of website search in Python
PyCon India, Bangalore, Sept ’12
Chetan Giridhar
For whom!
If you are:
• an experienced developer who has implemented search solutions
• currently getting your hands dirty
• prototyping website search for your startup
• dreading having to learn Java
• just curious
Think web development
• Core functionality
• Design patterns
• Web interface
• Usability
• Scalability
• Performance
• …?
Search
• Often considered "good to have"
• Enhances user experience
• Focused information
• Relevance
• Interaction
• Ranked searching
Typical Search Engine
• Designing a schema
– Convert your data into documents and store them in the index
– A document is a set of fields
– A field is a name=value pair, e.g. {title = "python", content = "computer", tag = "language"}
• Analyzers
– Analyzers "parse" each field of your data into indexable "tokens" or keywords
– Given "Welcome to Pycon", an analyzer produces the list ["welcome", "to", "pycon"]
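The schema and analyzer ideas above can be sketched in a few lines of plain Python. This is only an illustration, not the analyzer of Pylucene or Whoosh: the `analyze` function and the regular expression are assumptions made for this sketch, and it lowercases every token.

```python
import re

def analyze(text):
    """Hypothetical analyzer: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

# A document is a set of fields; each field is a name=value pair.
document = {"title": "python", "content": "computer", "tag": "language"}

tokens = analyze("Welcome to Pycon")
print(tokens)  # ['welcome', 'to', 'pycon']

# Analyze every field of the document into indexable tokens.
field_tokens = {name: analyze(value) for name, value in document.items()}
print(field_tokens["content"])  # ['computer']
```

Real analyzers usually add more steps, such as stop-word removal or stemming.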
Typical Search Engine
• Indexing
– Adding documents to the index
• Query and query parsers
– Prepare query
– Parse
– Analyze
• Searching
– Look up the index
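The three steps above (indexing, query parsing, searching) can be sketched with a toy inverted index. `TinyIndex` and `analyze` are hypothetical names for this sketch; real engines store far more than a set of document ids per token.

```python
import re
from collections import defaultdict

def analyze(text):
    """Hypothetical analyzer: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]

class TinyIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, **fields):
        # Indexing: analyze every field and record where each token occurs.
        for value in fields.values():
            for token in analyze(value):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Query parsing: run the query through the same analyzer.
        tokens = analyze(query)
        if not tokens:
            return []
        # Searching: look up each token and keep documents matching all of them.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(hits)

index = TinyIndex()
index.add_document(1, title="python", content="computer language")
index.add_document(2, title="java", content="computer platform")

print(index.search("computer"))           # [1, 2]
print(index.search("Computer language"))  # [1]
```

Using the same analyzer for documents and queries is what lets "Computer" match "computer".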
Indexing & Committing
[Diagram. Labels: Input files; Schema based document (Field1, Field2, Field3); Analyzer; Index Writer; In-memory Index; Committed]
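The difference between adding documents and committing them can be sketched as a writer that buffers documents in memory until `commit()` is called. `BufferedWriter` is a hypothetical class for this sketch, not the writer of either library.

```python
class BufferedWriter:
    """Toy index writer: documents stay pending until commit()."""

    def __init__(self, committed):
        self.committed = committed  # what searchers are allowed to see
        self.pending = {}           # in-memory additions, not yet visible

    def add_document(self, doc_id, fields):
        self.pending[doc_id] = fields

    def commit(self):
        # Publish all pending documents at once, then clear the buffer.
        self.committed.update(self.pending)
        self.pending.clear()

committed = {}
writer = BufferedWriter(committed)
writer.add_document(1, {"title": "python"})

visible_before = sorted(committed)  # nothing is searchable yet
writer.commit()
visible_after = sorted(committed)   # the document is now searchable

print(visible_before, visible_after)  # [] [1]
```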
Searching
[Diagram. Labels: Input query; Query Parser; Analyzer; Index Searcher; Index; Results]
Development : Considerations
• Sourcing the input data set
• Handling input queries
• How to search
• Search engines
• How to display results
• Customization
Development: Options
• Apache Solr: Sunburnt
• Haystack
• Xapian: Xappy
• Elastic Search
• Whoosh
• Lucene: Pylucene
Talking Pylucene & Whoosh
• Pythonic APIs
• Deployment
– Large scale and medium sized web sites
• Rapid
– Minimal installation
– Clear documentation
– Quick setup
– Ease of integration
Pylucene
• Pylucene: Python wrappers to Lucene
• The de facto standard search engine library
• Lucene: an open source, pure Java search engine library
• Embeds a Java VM with Lucene into a Python process
Pylucene
• Simple API
• High performance indexing
• Scalable to millions of documents
• Efficient and feature rich search algorithms
• Cross platform
Whoosh
• Whoosh is a search engine library
• Fast indexing and search
• One of the fastest Python search engines
• 100% Python code
• Extensible code
• No external dependency
• Active development and support
Whoosh
• Easy to set up
• Neutral to web frameworks
• Powerful query language
• Feature rich
• Intuitive APIs
PyLucene          Whoosh
Document, Field   fields.Schema
IndexWriter       index.Index
QueryParser       qparser.QueryParser
Analyzer          analysis.Analyzer
IndexSearcher     searching.Searcher
Designing search in websites
Search design should be:
• An independent component
• Pluggable
• Platform independent
• Built with minimal external dependencies
• Easily extensible
• Seamlessly integrated
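One way to keep search independent and pluggable is to hide the engine behind a small interface. The sketch below is an assumption about how that could look; `SearchBackend`, `InMemoryBackend` and `SiteSearch` are hypothetical names, and a real backend would wrap Pylucene or Whoosh instead of a dict.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Hypothetical interface the website codes against."""

    @abstractmethod
    def index(self, doc_id, text):
        """Make a document searchable."""

    @abstractmethod
    def search(self, query):
        """Return matching document ids."""

class InMemoryBackend(SearchBackend):
    """Stand-in engine using substring matching over a dict."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(d for d, text in self._docs.items() if needle in text)

class SiteSearch:
    """The website only knows the interface, so engines can be swapped."""

    def __init__(self, backend):
        self.backend = backend

    def add_page(self, url, body):
        self.backend.index(url, body)

    def find(self, query):
        return self.backend.search(query)

site = SiteSearch(InMemoryBackend())
site.add_page("/talks/search", "Rapid development of website search in Python")
site.add_page("/talks/java", "Learning Java")
print(site.find("python"))  # ['/talks/search']
```

Swapping engines then means writing another `SearchBackend` subclass; `SiteSearch` stays unchanged.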
Search.py
fsMgr
Demo
Comparing Engines
• Basis of comparison: indexing, committing and searching
• Dataset: 1 GB of data, ~5000 files, file sizes ranging from 1 KB to 50 MB
• Setup: Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2, 3 GB RAM, Ubuntu Release 12.04 (precise) 32-bit
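A comparison like this needs each phase timed separately. The harness below is a sketch of one way to do that, not the script used for the talk; `timed` and the three stand-in functions are hypothetical and would be replaced by calls into the engine under test.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-ins for an engine's index, commit and search steps.
pending, committed = [], []

def index_files(texts):
    pending.extend(text.lower() for text in texts)
    return len(pending)

def commit():
    committed.extend(pending)
    pending.clear()
    return len(committed)

def search(term):
    return [text for text in committed if term in text]

_, t_index = timed(index_files, ["Python search", "Java search"])
_, t_commit = timed(commit)
hits, t_search = timed(search, "python")

print(f"index {t_index:.6f}s, commit {t_commit:.6f}s, search {t_search:.6f}s")
```

Timing the phases separately matters because an engine can be fast at indexing and slow at committing, or the reverse.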
Indexing
[Bar chart: Time to Index, time (s), pylucene vs. whoosh; axis from 0 to 500]
Committing
[Bar chart: Time to Commit, time (s), pylucene vs. whoosh; axis from 0 to 300]
Searching
[Bar chart: Time to Search, time (s), pylucene vs. whoosh; axis from 0 to 0.01]
Recommendations
• Search engine library
• No one solution fits all problems
• Search engine abstraction is the key
• Scalability is critical
• Rapid to set up, develop and tweak
• Understand and use
Web development in Python
• Getting faster and easier by the day
• Web frameworks: Django, Pylons
• HTTP servers: Tornado, Gunicorn
• Support for SQL/NoSQL databases: MySQL-python, pymongo
• Template engines: Cheetah, jinja2
• Search: Pylucene, Whoosh
References
• Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
• Pylucene: http://lucene.apache.org/pylucene/ and http://lucene.apache.org/core/3_6_1/api/all/index.html
• Xappy: http://code.google.com/p/xappy/
• ElasticSearch: http://www.elasticsearch.org/guide/reference/api/
References
• Chetan’s tech space: http://technobeans.com
• Vishal’s technical blog: http://freethreads.net
Q and A
Backup
Whoosh v/s Haystack v/s Xapian
• Whoosh is suitable for a small project. Limited scalability for search and indexing
– A good beginning
• Haystack is appropriate with Django
• Xapian is ultra fast, but is not as feature rich as Solr
• Lucene is not distributed; has external dependency
Lucene v/s Database search
• There are a number of query types that RDBMSs in general do not support without vendor extensions:
– Fuzzy queries, in which "fuzzy" and "wuzzy" are considered matches
– Word stemming queries, which consider "take," "took," and "taken" to be identical
– Sound-like queries, which consider "cat" and "kat" to be identical
– Synonym queries, which consider "jump," "hop," and "leap" to be identical
– Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
• More disappointingly, SQL search results are not ranked by match-relevance scores. The SQL standard is simply not intended for full-text querying.
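A fuzzy query can be illustrated with edit distance: two terms match when a small number of single-character edits turns one into the other. This is a sketch of the idea only, not of how Lucene implements fuzzy queries; `edit_distance`, `fuzzy_match` and the `max_edits` parameter are names chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def fuzzy_match(a, b, max_edits=1):
    """Treat two terms as a match if they are within max_edits edits."""
    return edit_distance(a, b) <= max_edits

print(fuzzy_match("fuzzy", "wuzzy"))  # True
print(fuzzy_match("cat", "kat"))      # True
print(fuzzy_match("python", "java"))  # False
```

Stemming and synonym queries need language knowledge (word forms, a thesaurus) and cannot be reduced to edit distance in this way.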
Typical search engine
• Indexing
– Convert files to a format for quick look up
– Fast random access to stored words
• Searching
– Specify keywords
• Displaying
– Look up documents that are relevant
– Ranking
– Different types of queries
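Ranking can be sketched by scoring each document on how often the query terms occur in it. Term-frequency scoring is a deliberately simple stand-in chosen for this sketch; real engines use more elaborate relevance formulas. `analyze` and `rank` are hypothetical names.

```python
import re
from collections import Counter

def analyze(text):
    """Hypothetical analyzer: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def rank(documents, query):
    """Return ids of matching documents, highest term-frequency score first."""
    terms = analyze(query)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(analyze(text))
        score = sum(counts[term] for term in terms)
        if score:  # keep only documents that are relevant at all
            scored.append((score, doc_id))
    # Sort by descending score; break ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

docs = {
    "a": "python search in python",
    "b": "search with java",
    "c": "cooking recipes",
}
print(rank(docs, "python search"))  # ['a', 'b']
```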
Advanced Searching
• More like this
• Did you mean
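A "did you mean" suggestion can be approximated by comparing a misspelled term against the indexed vocabulary. The sketch below uses the standard library's `difflib` rather than either engine's own spelling support; `did_you_mean` and the sample vocabulary are assumptions made for this example.

```python
import difflib

def did_you_mean(term, vocabulary, cutoff=0.6):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

vocabulary = ["python", "pylucene", "whoosh"]
print(did_you_mean("pythn", vocabulary))  # python
print(did_you_mean("zzz", vocabulary))    # None
```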