How SOLR Search Works

Rajat Jain - 20th Dec, 2016

Upload: atlogys-technical-consulting

Post on 21-Jan-2017


Page 1: How Solr Search Works

How SOLR Search Works

Rajat Jain - 20th Dec, 2016

Page 2: How Solr Search Works

Agenda

• What do you mean by Search?

• Search Requirements

• Comparison of SOLR with SQL/NoSQL

• SOLR Architecture

• SOLR Usage in Trellis

• How Google Search Works

• Other Search Technologies

Page 3: How Solr Search Works

What do you mean by Search?

Page 6: How Solr Search Works

Search Requirements

• Text Search – e.g. “Architects”

• Filters – e.g. “In New Delhi”, “iOS”

• Sorting – e.g. “Best Match”, “Highest Rating”, etc.

• And More:

  • Facets

  • Stemming

  • Fuzzy Matching

  • Image Search, etc.
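
The requirements above map fairly directly onto Solr's standard query parameters (`q`, `fq`, `sort`, `facet`, `facet.field`). A minimal sketch of building such a request URL, assuming a hypothetical core named `professionals` on a local Solr instance and hypothetical fields `city` and `rating`:

```python
from urllib.parse import urlencode

def build_select_url(base_url, text, filters=(), sort=None, facet_fields=()):
    """Build a Solr /select URL from the basic search requirements."""
    params = [("q", text)]                        # full-text query
    params += [("fq", f) for f in filters]        # one fq parameter per filter
    if sort:
        params.append(("sort", sort))             # e.g. "rating desc"
    if facet_fields:
        params.append(("facet", "true"))          # switch faceting on
        params += [("facet.field", f) for f in facet_fields]
    return base_url + "/select?" + urlencode(params)

url = build_select_url(
    "http://localhost:8983/solr/professionals",   # hypothetical core name
    "architects",
    filters=['city:"New Delhi"'],                 # hypothetical field
    sort="rating desc",                           # hypothetical field
    facet_fields=["city"],
)
print(url)
```

Keeping filters in `fq` rather than folding them into `q` means they restrict the result set without influencing the relevance score.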

Page 7: How Solr Search Works

Search Requirements

• Full Text Search

• Fast reads (writes can be slower)

• Various Combinations of Filters

• Various Combinations of Sorting

• Non-Features:

  • Real-time – some staleness is usually not a problem

  • Data Integrity – the index is usually not the primary data store, so it can be ‘lossy’

Page 8: How Solr Search Works

Search Requirements – Faceted Search

• A type of filtering with suggestions

• In most cases, the suggestions are sorted by the number of matching results

• Helps the user narrow down the search without having to ‘guess’ how to narrow it
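
Conceptually, a facet is a count of matching documents per field value, shown most frequent first. A minimal sketch over an in-memory result list (the documents and the `city` field are made up for illustration; Solr computes these counts from its index rather than by scanning results like this):

```python
from collections import Counter

def facet_counts(docs, field):
    """Count matching documents per value of `field`, most common first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()

# Hypothetical search results for the query "architects"
results = [
    {"name": "A", "city": "New Delhi"},
    {"name": "B", "city": "Mumbai"},
    {"name": "C", "city": "New Delhi"},
    {"name": "D", "city": "Pune"},
    {"name": "E", "city": "New Delhi"},
    {"name": "F", "city": "Mumbai"},
]

for value, count in facet_counts(results, "city"):
    print(f"{value} ({count})")
```

Each printed line is one suggestion the user can click to narrow the search.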

Page 9: How Solr Search Works

Conventional Storage for Search

• SQL (MySQL)

  • Relational Tables

  • Normalized Data

  • Assumes keys / indexes are used for reads & writes

  • Optimized for reads, writes & transactional data (ACID transactions)

  • Lots of security, etc.

  • Table data stored in the file system

  • Indexing – individual columns or sets of columns

  • Full-text search – a relatively recent addition (full-text index)

Page 10: How Solr Search Works

Conventional Storage for Search

• NoSQL (think MongoDB)

  • Key-Value Pairs

  • De-normalized Data

  • Unstructured Data

  • Optimized for reads – writes can be slightly slower (for transactional workloads)

  • Data stored in the file system

  • Indexing – individual fields

  • Full-text search – has built-in support

Page 11: How Solr Search Works

Advantages of SOLR over MySQL/NoSQL

• Inverted Index

• Mind-blowing Text-analysis / stemming / scoring / fuzziness

• Weighting fields / boosting – custom scoring functions

• Single document concept – no relations (in general)

• Faceting support out of the box

• Optimized for search and search alone (at scale without performance drop)

Page 12: How Solr Search Works

SOLR Architecture – Indexing

• Take a ‘document’ and its fields

• For each field, apply a set of filters / tokenizers

• Convert the field text to individual tokens

• Update the ‘inverted’ index based on the tokens

• In general, the index also keeps track of statistics for the various terms

• A separate index per field
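
The indexing steps above can be sketched as a toy in-memory inverted index. This is an illustration only, not how Lucene stores its index: the `analyze` function, the stopword list, and the `InvertedIndex` class are all hypothetical stand-ins for Solr's configurable analyzer chain.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "in", "of", "the"}  # illustrative only

def analyze(text):
    """Tokenize on non-alphanumerics, lowercase, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

class InvertedIndex:
    def __init__(self):
        # field -> term -> {doc_id: term frequency}
        self.postings = defaultdict(lambda: defaultdict(dict))
        self.doc_count = 0  # a simple index-level statistic

    def add(self, doc_id, doc):
        """Analyze every field of `doc` and update that field's postings."""
        self.doc_count += 1
        for field, text in doc.items():
            for token in analyze(text):
                field_postings = self.postings[field][token]
                field_postings[doc_id] = field_postings.get(doc_id, 0) + 1

index = InvertedIndex()
index.add(1, {"title": "Architects in New Delhi", "body": "Residential architects"})
index.add(2, {"title": "iOS developers", "body": "Developers of iOS apps in Delhi"})

print(dict(index.postings["title"]["delhi"]))  # documents whose title contains "delhi"
print(dict(index.postings["body"]["delhi"]))   # documents whose body contains "delhi"
```

Because postings are kept per field, a term found in the title can later be weighted differently from the same term found in the body.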

Page 13: How Solr Search Works

SOLR Architecture - Indexing

[Diagram: Solr indexing pipeline]

• Update handlers (HTTP POST): XML Update Handler (/update), CSV Update Handler (/update/csv), XML Update with custom processor chain (/update/xml), Extracting RequestHandler for PDF, Word, … (/update/extract)

• Data Import Handler (pull): database pull from a SQL DB, RSS pull from an RSS feed, simple transforms

• Update Processor Chain (per handler): Remove Duplicates processor, Logging processor, Index processor, Custom Transform processor

• Text Index Analyzers

• Lucene Index

Page 14: How Solr Search Works

SOLR Architecture – Searching

• User enters query

• Parse the query, i.e. apply the required filters and tokenizers

• Converted to tokens

• Parallel search across multiple indexes (per field)

• Score all the documents

• Sort in async fashion
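
The search steps above can be sketched over a toy single-field index. The scoring here is a simple TF-IDF stand-in chosen for brevity, not Lucene's actual scoring formula, and the documents are made up for illustration.

```python
import math
import re
from collections import defaultdict

def analyze(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical documents
docs = {
    1: "Architects in New Delhi",
    2: "Interior designers and architects",
    3: "iOS developers in New Delhi",
}

# term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in analyze(text):
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def search(query, top_n=10):
    """Score documents by summed tf * idf over the query tokens."""
    scores = defaultdict(float)
    for token in analyze(query):           # same analysis as at index time
        postings = index.get(token, {})
        if not postings:
            continue
        idf = math.log(1 + len(docs) / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

for doc_id, score in search("architects delhi"):
    print(doc_id, round(score, 3), docs[doc_id])
```

Document 1 matches both query terms, so it ranks above the two documents that match only one.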

Page 15: How Solr Search Works

SOLR Architecture - Full

Page 16: How Solr Search Works

SOLR Architecture – Updating Index

• Types of Index Updates

  • Instant Index

  • Incremental Indexing

  • Full Indexing

• Index Update Strategies

  • Instant / incremental indexing cannot happen continuously

  • Too many updates cause performance degradation

  • Run a full index periodically to optimize the index

Page 17: How Solr Search Works

SOLR Architecture – Scalability

• Sharding

  • Splitting collections across servers – search in parallel

• Replication

  • More than one copy of the data for failover

• SolrCloud

  • Using ZooKeeper for managing clusters
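
The idea behind sharding can be sketched in two parts: route each document to a shard by hashing its id, then merge the per-shard top hits into one global result. This is a simplified illustration; the MD5 hash is a stand-in chosen for determinism, not the hash SolrCloud uses, and the shard results below are made up.

```python
import hashlib
import heapq

def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def merge_top_n(per_shard_results, n):
    """Merge per-shard lists of (score, doc_id) into a global top n."""
    all_hits = (hit for hits in per_shard_results for hit in hits)
    return heapq.nlargest(n, all_hits)

# Hypothetical top hits returned by two shards for the same query
shard_a = [(0.9, "doc7"), (0.4, "doc2")]
shard_b = [(0.8, "doc5"), (0.7, "doc1")]

print(shard_for("doc7", 4))                 # shard number in the range 0..3
print(merge_top_n([shard_a, shard_b], 3))   # global top 3 across both shards
```

In a real cluster the shards would be queried in parallel; here they are plain lists so that the merge step stays visible.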

Page 18: How Solr Search Works

SOLR Architecture – Other Features

• Stemming

  • Identify the root word and its variations, e.g. "stems", "stemmer", "stemming", "stemmed" are all based on "stem"

• Fuzzy Matching

  • Similar Words / Misspellings

  • Edit Distance

• NLP

  • Identify Entities / Nouns in Search Query

  • OpenNLP Plugin for SOLR

• And much more…
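
Stemming and edit distance can both be sketched in a few lines. The `naive_stem` function below is a deliberately crude suffix stripper written for this example, not one of the stemmers Solr ships with; `edit_distance` is the standard Levenshtein distance that fuzzy matching is built on.

```python
def naive_stem(word):
    """Strip one common English suffix. Illustrative only."""
    word = word.lower()
    for suffix in ("ing", "ed", "er", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undouble a trailing doubled consonant: "stemm" -> "stem"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

for w in ("stems", "stemmer", "stemming", "stemmed"):
    print(w, "->", naive_stem(w))

print(edit_distance("architect", "arhitect"))
```

A fuzzy match then amounts to accepting index terms whose edit distance from the query term is within a small threshold.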

Page 19: How Solr Search Works

SOLR Usage in Trellis

• Architecture

  • Data-in from MySQL

• Index Update Strategy

• AutoComplete

• Basic Search

• Advanced Search

• Filters / Sorting / Facets & More

• Demo (Incl. Config Files)

Page 20: How Solr Search Works

How Google Search Works

• Crawling

  • Robots.txt

• Indexing

  • Multiple Indexes – Instant / Daily / Weekly / Long Tail

• Searching

  • NLP, Stemming, Auto-correct, etc.

• Ranking – PageRank

• Video - https://www.youtube.com/watch?v=BNHR6IQJGZs
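
For the ranking step, the published PageRank idea can be sketched as a simple power iteration. This is the textbook formulation on a made-up three-page graph, not a description of how Google ranks pages today; the damping factor of 0.85 and the iteration count are conventional choices assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}      # start with a uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a links to b and c, b links to c, c links to a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page `c` ends up ranked highest because it receives links from both other pages.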

Page 21: How Solr Search Works

Other Search Technologies

• Elasticsearch

  • Much newer than Solr

  • Built-in scalability

  • Uses the same Lucene library as its base

  • JSON instead of XML

  • Good for analytical querying

• Others

  • Splunk

  • Sphinx

Page 22: How Solr Search Works

That’s All Folks

References

• SOLR Home Page - http://lucene.apache.org/solr/

• Tutorials

  • http://www.solrtutorial.com/index.html

  • https://lucene.apache.org/solr/4_10_0/tutorial.html

• Just Google the rest!!