Solr + Hadoop = Big Data Search
Mark Miller, Cloudera

TRANSCRIPT

Page 1: Solr+Hadoop = Big Data Search

1

Solr + Hadoop = Big Data Search
Mark Miller

Page 2: Solr+Hadoop = Big Data Search

2

Who Am I?

Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member.

First job out of college was in the newspaper archiving business.

First full-time employee at LucidWorks - a startup around Lucene/Solr.

Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.

Page 3: Solr+Hadoop = Big Data Search

3

Lucene: a very fast and feature-rich ‘core’ search engine library.

Compact and powerful, Lucene is an extremely popular full-text search library.

Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.

Just the core - either you write the ‘glue’ or use a higher-level search engine built with Lucene.

Page 4: Solr+Hadoop = Big Data Search

4

Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.

- Wikipedia

Page 5: Solr+Hadoop = Big Data Search

5

Search on Hadoop History

Katta

Blur

SolBase

HBASE-3529

SOLR-1301

SOLR-1045

Ad-Hoc

Page 6: Solr+Hadoop = Big Data Search

6

Family Tree

...

Page 7: Solr+Hadoop = Big Data Search

7

Strengthen the Family Bonds

No need to build something radically new - we have the pieces we need.

Focus on integration points.

Create high-quality, first-class integrations and contribute the work to the projects involved.

Focus on integration and quality first - then performance and scale.

Page 8: Solr+Hadoop = Big Data Search

8

SolrCloud

Page 9: Solr+Hadoop = Big Data Search

9

Solr Integration

Read and write directly to HDFS

First-class custom Directory support in Solr

Support Solr replication on HDFS

Other improvements around usability and configuration

Page 10: Solr+Hadoop = Big Data Search

10

Read and Write Directly to HDFS

Lucene did not historically support append-only filesystems.

“Flexible Indexing” brought in support for append-only filesystems.

Lucene supports append-only filesystems by default since 4.2.

Page 11: Solr+Hadoop = Big Data Search

11

Lucene Directory Abstraction

It’s how Lucene interacts with index files. Solr uses the Lucene library and offers DirectoryFactory.

class Directory {
    listAll();
    createOutput(file, context);
    openInput(file, context);
    deleteFile(file);
    makeLock(file);
    clearLock(file);
    ...
}
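To illustrate the abstraction (this example is not from the talk), here is a minimal sketch of reading an index through the Directory API, using Lucene 4.x-era signatures and a placeholder path; swapping FSDirectory for another Directory implementation (for example an HDFS-backed one) leaves the rest of the code unchanged:

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class DirectoryDemo {
        public static void main(String[] args) throws Exception {
            // Open a Directory over a local path (Lucene 4.x: FSDirectory.open takes a File).
            Directory dir = FSDirectory.open(new File("/data/solr/index"));

            // An index is just the set of files the Directory exposes.
            for (String name : dir.listAll()) {
                System.out.println(name);
            }

            // Readers and searchers are built on the Directory abstraction,
            // not on a concrete filesystem.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            System.out.println("maxDoc: " + searcher.getIndexReader().maxDoc());
        }
    }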

Page 12: Solr+Hadoop = Big Data Search

12

Putting the Index in HDFS

Solr relies on the filesystem cache to operate at full speed.

HDFS is not known for its random-access speed.

Apache Blur had already solved this with an HdfsDirectory that works on top of a BlockDirectory.

The “block cache” caches the hot blocks of the index off-heap (direct byte arrays) and takes the place of the filesystem cache (a toy sketch of the idea follows below).

We contributed back optional ‘write’ caching.
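The following toy sketch is not the actual BlockDirectory/BlockCache code (class and method names here are made up for illustration); it only shows the core idea: hot index blocks are copied into direct ByteBuffers, which live outside the JVM heap and so stand in for the OS filesystem cache without adding GC pressure:

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ToyBlockCache {
        // Cache key is "fileName@blockNumber"; values are off-heap buffers.
        private final Map<String, ByteBuffer> cache = new ConcurrentHashMap<String, ByteBuffer>();

        public void put(String fileName, long blockNumber, byte[] blockData) {
            ByteBuffer buf = ByteBuffer.allocateDirect(blockData.length); // allocated off the Java heap
            buf.put(blockData);
            buf.flip();
            cache.put(fileName + "@" + blockNumber, buf);
        }

        public ByteBuffer get(String fileName, long blockNumber) {
            return cache.get(fileName + "@" + blockNumber); // null means a cache miss: read from HDFS
        }
    }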

Page 13: Solr+Hadoop = Big Data Search

13

Putting the Transaction Log in HDFS

HdfsUpdateLog added - extends UpdateLog.

Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary.

Same extensive testing as used on UpdateLog.

Page 14: Solr+Hadoop = Big Data Search

14

Running Solr on HDFS

Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.

Set LockType to ‘hdfs’.

Use an UpdateLog dataDir location that begins with ‘hdfs:/’.

Or:

    java -Dsolr.directoryFactory=HdfsDirectoryFactory \
         -Dsolr.lockType=solr.HdfsLockFactory \
         -Dsolr.updatelog=hdfs://host:port/path \
         -jar start.jar

Page 15: Solr+Hadoop = Big Data Search

15

Solr Replication on HDFS

While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.

Most glaring, only a local filesystem based Directory would work with replication.

There were also other more minor areas that relied on a local filesystem Directory implementation.

Page 16: Solr+Hadoop = Big Data Search

16

Future Solr Replication on HDFS

Take advantage of the “distributed filesystem” and allow for something similar to HBase regions.

If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up, if it has the capacity.

[Diagram: three Solr nodes sharing index data stored in HDFS]

Page 17: Solr+Hadoop = Big Data Search

17

MR Index Building

Scalable index creation via MapReduce.

Many initial ‘homegrown’ implementations sent documents from the reducer to SolrCloud over HTTP.

To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.

The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.

Page 18: Solr+Hadoop = Big Data Search

18

MR Index Building

[Diagram: mappers parse input into indexable documents; arbitrary reducing steps of indexing and merging follow; end reducers for shard 1 and shard 2 index the documents into index shard 1 and index shard 2.]

Page 19: Solr+Hadoop = Big Data Search

19

SolrCloud Aware

Can ‘inspect’ ZooKeeper to learn about the Solr cluster (sketched in code below):

Which URLs to GoLive to.

The schema to use when building indexes.

Match the hash -> shard assignments of the Solr cluster.
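As a rough illustration of that “inspect ZooKeeper” step (not code from the talk; it assumes the Solr 4.x-era SolrJ API, and the ZooKeeper address and collection name are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ClusterInspect {
        public static void main(String[] args) throws Exception {
            // Connect through ZooKeeper, the same way the MR job discovers the cluster.
            CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181/solr");
            server.connect();

            ClusterState state = server.getZkStateReader().getClusterState();
            for (Slice slice : state.getSlices("collection1")) {
                for (Replica replica : slice.getReplicas()) {
                    // Shard name plus the base URL of each replica hosting it.
                    System.out.println(slice.getName() + " -> "
                        + replica.getStr(ZkStateReader.BASE_URL_PROP));
                }
            }
            server.shutdown();
        }
    }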

Page 20: Solr+Hadoop = Big Data Search

20

GoLive

After building your indexes with MapReduce, how do you deploy them to your Solr cluster?

We want it to be easy - so we built the GoLive option.

GoLive allows you to easily merge the indexes you have created atomically into a live, running Solr cluster.

Paired with the ZooKeeper-aware ability, this allows you to simply point your MapReduce job at your Solr cluster: it will automatically discover how many shards to build and which HDFS locations to deliver the final indexes to (an example invocation follows below).
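For example, a MapReduceIndexerTool run with GoLive enabled looks roughly like the command below. This invocation is illustrative rather than from the slides; the jar name, HDFS paths, ZooKeeper address, and collection name are placeholders that vary by release and cluster:

    hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
        --morphline-file morphline.conf \
        --output-dir hdfs://namenode:8020/tmp/outdir \
        --zk-host zkhost1:2181/solr \
        --collection collection1 \
        --go-live \
        hdfs://namenode:8020/indir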

Page 21: Solr+Hadoop = Big Data Search

21

Flume Solr Sink

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

- Apache Flume website

Page 22: Solr+Hadoop = Big Data Search

22

Flume Solr Sink

[Diagram: logs and other sources flow through Flume Agents into both HDFS and Solr.]

Page 23: Solr+Hadoop = Big Data Search

23

SolrCloud Aware

Can ‘inspect’ ZooKeeper to learn about the Solr cluster:

Which URLs to send data to.

The schema for the collection being indexed to.

Page 24: Solr+Hadoop = Big Data Search

24

HBase Integration

Collaboration between NGData & Cloudera

NGData are the creators of the Lily data management platform

Lily HBase Indexer

Service which acts as an HBase replication listener

HBase replication features, such as filtering, are supported

Replication updates trigger indexing of updates (rows)

Integrates the Morphlines library for ETL of rows

AL2 licensed on GitHub: https://github.com/ngdata

Page 25: Solr+Hadoop = Big Data Search

25

HBase Integration

[Diagram: interactive load is written to HBase (backed by HDFS); replication triggers the indexer(s), which update a cluster of Solr servers.]

Page 26: Solr+Hadoop = Big Data Search

26

Morphlines

A morphline is a configuration file that allows you to define ETL transformation pipelines.

Extract content from input files, transform content, load content (e.g. to Solr).

Uses Tika to extract content from a large variety of input documents.

Part of the CDK (Cloudera Development Kit).

Page 27: Solr+Hadoop = Big Data Search

27

Morphlines

[Diagram: syslog -> Flume Agent with Solr Sink -> readLine command -> grok command -> loadSolr command -> Solr]

Open source framework for simple ETL

Ships as part of the Cloudera Development Kit (CDK)

It’s a Java library

AL2 licensed on GitHub: https://github.com/cloudera/cdk

Similar to Unix pipelines

Configuration over coding

Supports common Hadoop formats: Avro, SequenceFile, text, etc.

Page 28: Solr+Hadoop = Big Data Search

28

Morphlines

Integrate with and load into Apache Solr

Flexible log file analysis

Single-line records, multi-line records, CSV files

Regex-based pattern matching and extraction

Integration with Avro

Integration with Apache Hadoop SequenceFiles

Integration with SolrCell and all Apache Tika parsers

Auto-detection of MIME types from binary data using Apache Tika

Page 29: Solr+Hadoop = Big Data Search

29

Morphlines

Scripting support for dynamic Java code

Operations on fields for assignment and comparison

Operations on fields with list and set semantics

if-then-else conditionals

A small rules engine (tryRules)

String and timestamp conversions

slf4j logging

Yammer metrics and counters

Decompression and unpacking of arbitrarily nested container file formats

Etc.

Page 30: Solr+Hadoop = Big Data Search

30

Morphlines Example Config

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input

<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record

syslog_pri: 164
syslog_timestamp: Feb  4 10:46:14
syslog_hostname: syslog
syslog_program: sshd
syslog_pid: 607
syslog_message: listening on 0.0.0.0 port 22
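Because morphlines are a Java library, a config like the one above can also be compiled and driven from your own code. The sketch below is illustrative rather than from the talk, and assumes the CDK-era package names (the library later moved to the Kite SDK under org.kitesdk.morphline); the config file name and morphline id match the example config:

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import com.cloudera.cdk.morphline.api.Command;
    import com.cloudera.cdk.morphline.api.MorphlineContext;
    import com.cloudera.cdk.morphline.api.Record;
    import com.cloudera.cdk.morphline.base.Compiler;
    import com.cloudera.cdk.morphline.base.Fields;
    import com.cloudera.cdk.morphline.base.Notifications;

    public class MorphlineDemo {
        public static void main(String[] args) throws Exception {
            // Compile the pipeline defined in the config file (id "morphline1").
            MorphlineContext context = new MorphlineContext.Builder().build();
            Command morphline = new Compiler().compile(
                new File("morphline.conf"), "morphline1", context, null);

            // Feed one raw syslog line in as an attachment, much as the Flume sink would.
            Record record = new Record();
            String line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22";
            record.put(Fields.ATTACHMENT_BODY,
                new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8)));

            // Runs readLine -> grok -> loadSolr; returns false if any command rejects the record.
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            System.out.println("processed: " + success);
        }
    }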

Page 31: Solr+Hadoop = Big Data Search

31

Hue Integration

Hue: simple UI

Navigated, faceted drill-down

Customizable display

Full-text search, standard Solr API and query language

Page 32: Solr+Hadoop = Big Data Search

32

Cloudera Search

https://ccp.cloudera.com/display/SUPPORT/Downloads

Or Google “cloudera search download”

Page 33: Solr+Hadoop = Big Data Search

Mark Miller, Cloudera - @heismark