solrcloud on hadoop

47
1 SolrCloud on Hadoop Cleveland Big Data and Hadoop User Group, January 2014 Alex Moundalexis [email protected] @technmsg

Upload: alex-moundalexis

Post on 06-May-2015

1.409 views

Category:

Technology


2 download

DESCRIPTION

An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.

TRANSCRIPT

Page 1: SolrCloud on Hadoop

1

SolrCloud  on  Hadoop  Cleveland  Big  Data  and  Hadoop  User  Group,  January  2014  Alex  Moundalexis  [email protected]    @technmsg  

Page 2: SolrCloud on Hadoop

Disclaimer  

•  Technologies,  not  products  •  Cloudera  builds  things  soHware  

•  most  donated  to  Apache  •  some  closed-­‐source  

•  I  will  likely  menMon  “Cloudera  Something”  •  Cloudera  “products”  I  reference  are  open  source  

•  Apache  Licensed  •  Source  code  is  on  GitHub  

•  hRps://github.com/cloudera  

2

Page 3: SolrCloud on Hadoop

What  This  Talk  Isn’t  About  

•  Deploying  •  Puppet,  Chef,  Ansible,  homegrown  scripts,  intern  labor  

•  Sizing  &  Tuning  •  Depends  heavily  on  data  and  workload  

•  Coding  •  Unless  you  count  XML  or  CSV  

•  Algorithms  

3

Page 4: SolrCloud on Hadoop

4

Quick  and  dirty,  more  Mme  for  use  cases.  

The  Apache  Hadoop  Ecosystem  

Page 5: SolrCloud on Hadoop

Why  “Ecosystem?”  

•  In  the  beginning,  just  Hadoop  •  HDFS  •  MapReduce  

•  Today,  dozens  of  interrelated  components  •  I/O  •  Processing  •  Specialty  ApplicaMons  •  ConfiguraMon  •  Workflow  

5

Page 6: SolrCloud on Hadoop

ParMal  Ecosystem  

6

Hadoop  

external  system  

RDBMS  /  DWH  

web  server  

device  logs  

API  access  

log  collecMon  

DB  table  import  

batch  processing  

machine  learning  

external  system  

API  access  

user  

RDBMS  /  DWH  

DB  table    export  

BI  tool  +  JDBC/ODBC  

Search  

SQL  

Page 7: SolrCloud on Hadoop

HDFS  

•  Distributed,  highly  fault-­‐tolerant  filesystem  •  OpMmized  for  large  streaming  access  to  data  •  Based  on  Google  File  System  

•  hRp://research.google.com/archive/gfs.html  

7

Page 8: SolrCloud on Hadoop

Lots  of  Commodity  Machines  

8

Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Page 9: SolrCloud on Hadoop

MapReduce  (MR)  

•  Programming  paradigm  •  Batch  oriented,  not  realMme  •  Works  well  with  distributed  compuMng  •  Lots  of  Java,  but  other  languages  supported  •  Based  on  Google’s  paper  

•  hRp://research.google.com/archive/mapreduce.html  

9

Page 10: SolrCloud on Hadoop

Under  the  Covers  

Page 11: SolrCloud on Hadoop

You specify map() and reduce() functions. ���

���The framework does the

rest. 60

Page 12: SolrCloud on Hadoop

Apache  HBase  

•  Random,  realMme  read/write  access  •  Key/value  columnar  store  •  (b|tr)illions  of  rows/columns  •  Based  on  Google  BigTable  

•  hRp://research.google.com/archive/bigtable.html  

12

Page 13: SolrCloud on Hadoop

Cloudera  Hue  

•  Hadoop  User  Experience  •  Hadoop  is  largely  command  line  •  Hue  provides  a  UI  for  end-­‐users  •  SDK  to  build  your  own  apps  on  top  

13

Page 14: SolrCloud on Hadoop

Apache  Tika  

•  Content  analysis  toolkit  •  Simply  put,  a  lot  of  parsers  •  Detect/extract  metadata/text  from  documents  

•  HTML  •  XML  •  Office  •  PDF  •  mbox  •  More…  

14

Page 15: SolrCloud on Hadoop

Apache  ZooKeeper  

•  Distributed  systems  are  HARD  •  Everyone  was  trying  to  implement  the  same  subsystems  •  Bugs  leads  to  race  condiMons,  other  bad  things  

•  ZK:  Highly  reliable  distributed  coordinaMon  services  •  ConfiguraMon  •  Naming  •  SynchronizaMon  •  Group  Services  

15

Page 16: SolrCloud on Hadoop

Cloudera  Morphlines  

•  In-­‐memory  transformaMons  •  Load,  parse,  transform,  process  •  Records  as  name-­‐value  pairs  w/  opMonal  blob/pojo  objects  

•  Java  library,  embedded  in  your  codebase  •  Used  to  ETL  data  from  Flume  and  MR  into  Solr  

•  Was  part  of  CDK,  now  part  of  Kite  •  hRp://kitesdk.org  

16

Page 17: SolrCloud on Hadoop

Apache  Lucene  

•  Java-­‐based  index  and  search  •  ranked  or  sorted  results  •  hits  streamed  through  QP  

•  mem(results)  <  mem(collecMon)  

•  rich/extensible  query  operators  •  bool,  phrase,  range,  span,  spaMal  

•  Features  •  spellchecking  •  hit  highlighMng  •  tokenizaMon  

17

Page 18: SolrCloud on Hadoop

Apache  Solr  

•  Enterprise  search  plaporm  •  Based  on  Apache  Lucene  

•  Full-­‐text  search  •  FaceMng  •  NRT  indexing  •  UI  

18

Page 19: SolrCloud on Hadoop

Apache  Solr  –  Simple  Indexing  via  CLI    

$  java  -­‐jar  post.jar  solr.xml  money.xml  SimplePostTool:  version  1.4  SimplePostTool:  POSTing  files  to  http://localhost:8983/solr/update..  SimplePostTool:  POSTing  file  solr.xml  SimplePostTool:  POSTing  file  money.xml  SimplePostTool:  COMMITting  Solr  index  changes..    $  post.sh  *.xml  

19

Page 20: SolrCloud on Hadoop

Apache  Solr  –  Document  money.xml  

<add>  <doc>      <field  name="id">USD</field>      <field  name="name">One  Dollar</field>      <field  name="manu">Bank  of  America</field>      <field  name="manu_id_s">boa</field>      <field  name="cat">currency</field>      <field  name="features">Coins  and  notes</field>      <field  name="price_c">1,USD</field>      <field  name="inStock">true</field>  </doc>    <doc>      <field  name="id">EUR</field>      <field  name="name">One  Euro</field>  

20

Page 21: SolrCloud on Hadoop

Apache  Solr  –  More  Advanced  Indexing  

•  From  DB,  using  Data  Import  Handler  (DIH)  •  Load  a  CSV  file  •  POST  JSON  documents  •  Index  binary  documents  (uses  Tika)  •  SolrJ  for  programmaMc  document  creaMon  

21

Page 22: SolrCloud on Hadoop

Apache  Solr  –  Querying  

•  HTTP  GET  •  hRp://solr:8983/solr/collecMon1/select/  

•  Examples  •  ?q=Mmestamp:[*  TO  NOW]  •  ?q=-­‐instock:false  •  ?q={!lucene  q.op=AND  df=text}myfield:foo  +bar  -­‐bat  

22

Page 23: SolrCloud on Hadoop

Apache  Solr  –  Querying  

•  HTTP  GET  •  hRp://solr:8983/solr/collecMon1/select/?q=video  

•  Examples  •  &fl=name,id          (return  only  name  and  id  fields)  •  &fl=name,id,score        (return  relevancy  score  as  well)  •  &fl=*,score            (return  all  fields  +  relevancy  score)  •  &sort=price  desc&fl=name,id,price        (sort  by  price  desc)    •  &wt=json                (return  response  in  JSON  format)  

23

Page 24: SolrCloud on Hadoop

What  the  Heck  is  FaceMng?    

•  Generate  counts  for  properMes  or  categories  •  Links  allow  drill-­‐down  or  refine  search  results  

   

What?  

24

Page 25: SolrCloud on Hadoop

Facets  on  Amazon.com    

25

Page 26: SolrCloud on Hadoop

Apache  Solr  –  Facets  at  Query  Time    

•  HTTP  GET  •  hRp://solr:8983/solr/collecMon1/select/?q=video  •  All  docs,  count  by  category  q=*:*&facet=true&facet.field=cat  

•  All  docs,  count  by  category  and  in-­‐stock  status  q=*:*&facet=true&facet.field=cat&facet.field=inStock  

•  Docs  matching  “ipod”,  count  by  price  (above/below  $100)  q=ipod&facet=true&facet.query=price:[0  TO  100]&facet.query=price:[100  TO  *]      

26

Page 27: SolrCloud on Hadoop

Apache  Solr  –  Querying  via  UI  

 

27

Page 28: SolrCloud on Hadoop

Apache  SolrCloud  

•  IntegraMon  of  Solr  +  ZooKeeper  •  Provides  for  shard  failover  

28

Page 29: SolrCloud on Hadoop

Cloudera  Search  

•  Based  on  Apache  Solr  (incl  Lucene  and  SolrCloud)  •  Fault-­‐tolerance:  collecMons  backed  by  HDFS  or  HBase  •  IntegraMon  galore:  

•  HBase/Flume/MapReduce  w/  Lucene  •  Hue  w/  Solr  •  Avro  w/  Tika  •  HDFS  w/  Solr/Lucene  •  Sentry  w/  Solr    

29

Page 30: SolrCloud on Hadoop

Cloudera  Search  +  Hue  

30  

Page 31: SolrCloud on Hadoop

Cloudera  Search  +  Hue  

31  

Page 32: SolrCloud on Hadoop

32

Apologies,  I  swiped  some  preRy  slides  from  markeMng…  

Why  Search?  

Page 33: SolrCloud on Hadoop

Search  Design  Strategy  

33

One  pool  of  data  

One  security  framework  

One  set  of  system  resources  

One  management  interface  

An  Integrated  Part  of  the  Hadoop  System  

Storage  

Integra5on  

Resource  Management  

Metad

ata  

Batch  Processing  MAPREDUCE,  HIVE  &  PIG  

HDFS   HBase  

TEXT,  RCFILE,  PARQUET,  AVRO,  ETC.   RECORDS  

Engines  

InteracMve  SQL  

CLOUDERA  IMPALA  

InteracMve  Search  CLOUDERA  SEARCH  

Machine  Learning  MAHOUT  

Math  &  Sta5s5cs  

SAS,  R    

Page 34: SolrCloud on Hadoop

Benefits  of  Search  IntegraMon  

34

Improved  Big  Data  ROI  §  An  interacMve  experience  without  technical  knowledge  §  Single  data  set  for  mulMple  compuMng  frameworks  

Faster  Time  to  Insight  §  Exploratory  analysis,  esp.  unstructured  data  §  Broad  range  of  indexing  opMons  to  accommodate  needs  

Cost  Efficiency  §  Single  scalable  plaporm;  no  incremental  investment  §  No  need  for  separate  systems,  storage  

Solid  Founda5ons  &  Reliability  §  Solr  in  producMon  environments  for  years  §  Hadoop-­‐powered  reliability  and  scalability  

Page 35: SolrCloud on Hadoop

35

Some  quick  examples.  

Search  Use  Cases  

Page 36: SolrCloud on Hadoop

Search  Use  Cases  

36

Offer  easy  access  to  non-­‐technical  resources  

Explore  data  prior  to  processing  and  modeling  

Gain  immediate  access  and  find  correlaMons  in  mission-­‐criMcal  data  

Powerful,  proven  search  capabili5es  that  let  organiza5ons:  

Page 37: SolrCloud on Hadoop

Monsanto  

37

Scalable,  efficient  image  search  for  analysis  and  research  

Track  plant  characterisMcs  throughout  their  lifecycle  

Before:  Manual  aRribute  extracMon  and  search  queries  within  database  

Now:  Parse  and  index  images  at  acquisiMon  and  on  demand,  index  archived  images  in  batch  

Page 38: SolrCloud on Hadoop

38

Cloudera:  Internal  Field  Portal  

Custom  Aggregated  Search  

Page 39: SolrCloud on Hadoop

Cloudera  –  Internal  Field  Portal  

•  Single  stop  for  field  engineers  •  Mailing  lists:  public,  private  •  Tickets:  support,  development,  public  ASF  •  Customer  data:  accounts,  clusters,  KB  arMcles  •  Customer  Clusters:  configs,  audits,  logs,  events  •  Books  and  papers  •  Discussion  forums  

•  Dogfooding,  yes  • Makes  my  life  easier  

39

Page 40: SolrCloud on Hadoop

Cloudera  –  Internal  Field  Portal  

40  

Page 41: SolrCloud on Hadoop

Cloudera  –  Internal  Field  Portal  

•  Varied  fetchers/observers  for  web/API  content  •  Content  is  retrieved  via  Flume,  Sqoop  

•  Search  indexes  and  replicates  into  HBase  •  Each  collecMon  has  collecMon-­‐specific  filters/fields  •  Provides  Mtle,  content  snippet,  link  to  original  

• Morphlines  extracts  books  and  papers  using  Tika  •  Impala  for  analyMcs  

•  Future:  Use  MapReduce  to  ingest  logs  

41

Page 42: SolrCloud on Hadoop

42

ParMng  thoughts…  in  no  parMcular  order.  

Summary  

Page 43: SolrCloud on Hadoop

Search  Simplifies  InteracMon  

43

Explore  

Navigate  

Correlate  Experts  know  MapReduce.  Savvy  people  know  SQL.    

Everyone  knows  Search.  

Page 44: SolrCloud on Hadoop

Summary  

•  With  Hadoop,  it  depends.  •  The  tools  are  out  there.  •  Open  source  soHware,  hooray!  

•  Many  interconnected  pieces  •  Many  unexplored  opportuniMes  •  A  thriving  community  awaits  you…  

•  Data  can  make  a  difference.  •  Search  allows  everyone  to  interact  with  data.  

•  This  is  a  Big  Deal.  

44

Page 45: SolrCloud on Hadoop

What’s  Next?  

•  Search  examples  •  hRp://blog.cloudera.com/blog/category/search/  

•  Cloudera  provides  pre-­‐loaded  VMs  •  hRp://Mny.cloudera.com/quickstartvm  

•  Clone  our  repos!  •  hRps://github.com/cloudera  

45

Page 46: SolrCloud on Hadoop

46

Preferably  related  to  the  talk…  

QuesMons?  

Page 47: SolrCloud on Hadoop

47

Thank  You!  Alex  Moundalexis  [email protected]  @technmsg    We’re  hiring,  kids!  Well,  not  kids.