osscon: big search 4 big data

Post on 27-Jan-2015

DESCRIPTION

At Basis Technology's Open Source Search conference I talked about a project I did this past year and about the lessons we learned, both good and bad.

TRANSCRIPT

Big Search w/ Big Data Principles

Basis Technology Open Source Search 2012

Eric Pugh | epugh@o19s.com | @dep4b

Tuesday, October 2, 2012

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

CO-AUTHOR

2nd edition!

Telling some war stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

Not an intro to SolrCloud!

• Great tutorials given by Tomás Fernández Löbbe from LucidWorks yesterday!

Background for Client X’s Project

• Big Data is any data set that is primarily at rest due to the difficulty of working with it.

• 100’s of millions of documents to search

• Limited selection of tools available.

• Aggressive timeline.

• All the data must be searched per query.

• On Solr 3.x line

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

Boy meets Girl Story

[Diagram: Metadata and Content Files, an Ingest Pipeline, and four Solr instances]

Bash Rocks

• Remote Solr stop/start scripts

• Remote Indexer stop/start scripts

• Performance Monitoring

• Content Extraction scripts (+Java)

• Ingestor Scripts (+Java)

• Artifact Deployment (CM)

Make it easy to change approach

Make it easy to change sharding

public void run(Map options, List<SolrInputDocument> docs)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  // Load the strategy by class name so the sharding approach can be swapped without touching this code.
  IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
      "com.o19s.solr.ModShardIndexStrategy").newInstance();
  indexStrategy.configure(options);
  for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
  }
}
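The talk does not show what ModShardIndexStrategy does internally. As a rough sketch of what a mod-based strategy might look like, the code below hashes each document ID and takes it modulo the shard count. `ModShardRouter`, `shardFor`, `urlFor`, the hashing choice, and the `run2_3` core name are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the real ModShardIndexStrategy is not shown in the talk.
final class ModShardRouter {
    private final List<String> shardUrls;

    ModShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shardUrls = shardUrls;
    }

    // Map a document ID to a shard index in [0, shardCount).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Return the URL of the shard that should receive this document.
    String urlFor(String docId) {
        return shardUrls.get(shardFor(docId, shardUrls.size()));
    }

    public static void main(String[] args) {
        // Shard URLs are illustrative; run2_3 is an assumed core name.
        ModShardRouter router = new ModShardRouter(Arrays.asList(
                "http://search1.o19s.com:8983/solr/run2_1",
                "http://search1.o19s.com:8983/solr/run2_2",
                "http://search1.o19s.com:8983/solr/run2_3"));
        System.out.println("DOC45242 -> " + router.urlFor("DOC45242"));
    }
}
```

The same ID always lands on the same shard, but changing the shard count remaps most documents, which is one reason to keep the strategy pluggable.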

Separate JVM from Solr Cores

• Step 1: Fire up empty Solr instances on all the servers (nohup &).

• Step 2: Verify they started cleanly.

• Step 3: Create cores (curl "http://search1.o19s.com:8983/solr/admin/cores?action=create&name=run2")

• Step 4: Create an “aggregator” core, passing in the URLs of the cores (&property.shards=).
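For step 4, the aggregator core needs a comma-separated list of the shard cores. The sketch below builds such a list for a set of hosts that all run the same core name. `ShardList` and `shardsValue` are hypothetical helpers, and the `http://` prefix simply follows the URL form used elsewhere in this deck; check what your Solr version expects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds the value passed as the shards property of an aggregator core.
final class ShardList {
    // Produce one core URL per host and join them with commas.
    static String shardsValue(List<String> hosts, int port, String coreName) {
        return hosts.stream()
                .map(host -> "http://" + host + ":" + port + "/solr/" + coreName)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "search1.o19s.com", "search2.o19s.com", "search3.o19s.com");
        System.out.println("property.shards=" + shardsValue(hosts, 8983, "run2"));
    }
}
```

Generating the list from the host inventory means a change in the number of servers only touches one place.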

Go Wide Quickly

[Diagram: groups of shards start on one host, search1.o19s.com, on ports 8983, 8984, and 8985, then spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com, each on port 8983]

Simple Pipeline

• Simple pipeline

• mv is atomic
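The "mv is atomic" point can be sketched in Java as a write-then-rename handoff between pipeline stages: a file is written in a staging directory and only renamed into the directory the next stage watches once it is complete. `AtomicHandoff`, `publish`, and the directory and file names are hypothetical. Note that a rename is only atomic within a single file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of a pipeline handoff built on an atomic rename.
final class AtomicHandoff {
    // Write into the staging directory first, then rename into the ready directory,
    // so a consumer polling readyDir never sees a half-written file.
    static Path publish(Path stagingDir, Path readyDir, String name, String content)
            throws IOException {
        Path staged = stagingDir.resolve(name + ".tmp");
        Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException rather than falling back
        // to a copy when the move cannot be done atomically (e.g. across file systems).
        return Files.move(staged, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pipeline");
        Path staging = Files.createDirectory(root.resolve("staging"));
        Path ready = Files.createDirectory(root.resolve("ready"));
        Path done = publish(staging, ready, "DOC45242.xml", "<doc/>");
        System.out.println("published " + done);
    }
}
```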

Don’t Move Files

• SCP across machines is slow/error prone

• NFS share, single point of failure.

• Clustered file system like GFS (Global File System) can have “fencing” issues

• HDFS shines here.

• ZooKeeper shines here.

Can you test your changes?

JVM tuning is a black art

-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC

Run, don’t Walk

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

Using Solr as key/value store

[Diagram: Metadata and Content Files, an Ingest Pipeline, four Solr instances, and a Solr Key/Value Cache]

Using Solr as key/value store

• Thousands of queries per second without real-time get:

http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html

• How fast with real-time get?

http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html

Push schema definition to the application

• Not “schema less”

• Just different owner of schema!

• Schema may have common set of fields like id, type, timestamp, version

• Nothing required.

q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
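A field name like intensity_i suggests dynamic fields, where the suffix tells Solr the type. A minimal sketch of how an application that owns the schema might derive those names from its own records is shown below. The suffixes follow the `*_i`, `*_l`, `*_d`, `*_b`, `*_dt`, and `*_s` patterns from Solr's example schema. Whether Client X's schema used exactly these patterns is an assumption, and `DynamicFields`, `fieldName`, and `toSolrFields` are hypothetical names.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the application, not schema.xml, decides field names and types.
final class DynamicFields {
    // Pick a dynamic-field suffix from the Java type of the value.
    static String fieldName(String name, Object value) {
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Boolean) return name + "_b";
        if (value instanceof Date) return name + "_dt";
        return name + "_s"; // fall back to a string field
    }

    // Rename every key in an application record to its typed Solr field name.
    static Map<String, Object> toSolrFields(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            out.put(fieldName(e.getKey(), e.getValue()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("intensity", 70);
        record.put("location", "Main St");
        System.out.println(toSolrFields(record));
    }
}
```

Common fields such as id, type, and timestamp would stay as fixed names in the schema; only the application-specific fields go through the suffix mapping.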

Don’t do expensive things in Solr

• Tika content extraction aka Solr Cell

• UpdateRequestProcessorChain

Beware JavaBin

[Diagram: Metadata and Content Files, an Ingest Pipeline, four Solr instances, and a Solr Key/Value Cache, with parts labeled Solr 3.4 and Solr 4]

Which SolrJ version do I use?

No JavaBin

• Avoid Jarmageddon

• Reflection? Ugh.

Give me /update/avro!

Avro!

• Supports serialization of data readable from multiple languages

• It’s smart XML, w/o the XML!

• Handles older and newer versions of an object (schema evolution)

• Compact and fast to read.

Avro!

[Diagram: Metadata and Content Files, an Ingest Pipeline, four Solr instances, and a Solr Key/Value Cache, with a .avro label]

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

Upgrade Lucene Indexes Easily

• Don’t reindex!

• Try out new versions of Lucene based search engines.

David Lyle

java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir

Indexing is Easy and Quick

CHEAP AND CHEERFUL

NRT versus BigData

The tension between scale and update rate

[Chart: scale runs from 10 million to 100’s of millions of documents, with a region marked “Bad Place”]

Grim Reaper

Delayed Replication

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/replication</str>
    <str name="pollInterval">36:00:00</str>
  </lst>
</requestHandler>

Enable/Disable

• SOLR-3301

Enable/Disable

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">MY HARD QUERY</str>
    <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
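With a healthcheckFile configured, the ping handler reports the server as unavailable while that file is missing, so a load balancer can be steered by creating or deleting it. The sketch below toggles such a file from Java. `HealthcheckToggle` and its methods are hypothetical, and where Solr resolves server-enabled.txt on disk depends on your installation, so the path used here is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: take a Solr node in or out of rotation by toggling its healthcheck file.
final class HealthcheckToggle {
    // Enable the node: make sure the healthcheck file exists.
    static void enable(Path healthcheckFile) throws IOException {
        if (Files.notExists(healthcheckFile)) {
            Files.createFile(healthcheckFile);
        }
    }

    // Disable the node: remove the file; does nothing if it is already gone.
    static void disable(Path healthcheckFile) throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    static boolean isEnabled(Path healthcheckFile) {
        return Files.exists(healthcheckFile);
    }

    public static void main(String[] args) throws IOException {
        // The directory is a stand-in; point this at wherever your core looks for the file.
        Path file = Files.createTempDirectory("solr-data").resolve("server-enabled.txt");
        enable(file);
        System.out.println("enabled: " + isEnabled(file));
        disable(file);
        System.out.println("enabled: " + isEnabled(file));
    }
}
```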

Provisioning

• Chef/Puppet

• ZooKeeper

• Have you versioned everything you need to build an index over again?

TRADITIONAL ENVIRONMENT

POOLED ENVIRONMENT

think Cloud!

Do I need Failover?

• Can I build quickly?

• Do I have a reliable cluster of servers?

• Am I spread across data centers?

• Failover is sooo 90’s...

Telling some stories

• Prototyping

• Application Development

• Maintaining Your Big Search Indexes

One more thought...

Measuring the impact of our algorithm changes is just getting harder with Big Data.

Project SolrPanl

Thank you!

Questions?

• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com
