adding&search&as&afirstclass&ci2zen& to&hadoop& ·...

23
1 Adding Search as a First Class Ci2zen to Hadoop Wolfgang Hoschek ([email protected]) Sr. SoAware Engineer @ Cloudera Search & Hadoop PlaEorm Team CERN Seminar, Geneva

Upload: others

Post on 24-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

1

Adding  Search  as  a  First  Class  Ci2zen  to  Hadoop  Wolfgang  Hoschek  ([email protected])  Sr.  SoAware  Engineer  @  Cloudera  Search  &  Hadoop  PlaEorm  Team  CERN  Seminar,  Geneva  

Page 2: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

The  Enterprise  Data  Hub  

Unified  Scale-­‐out  Storage  For  Any  Type  of  Data  

Elas2c,  Fault-­‐tolerant,  Self-­‐healing,  In-­‐memory  capabili2es  

Resource  Management  

Online  NoSQL    DBMS  

Analy>c    MPP  DBMS  

Search    Engine  

Batch    Processing  

Stream    Processing  

Machine    Learning  

SQL   Streaming   File  System  (NFS)  

System    

Managem

ent  Data    

Managem

ent  

Metadata,  Security,  Audit,  Lineage  

2  

•  Mul2ple  processing  frameworks  

•  One  pool  of  data  •  One  set  of  system  

resources  •  One  management  

interface  •  One  security  

framework  

Page 3: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Search  Simplifies  Interac2on  -­‐  to  Everyone!  

Explore  

Navigate  

Correlate  Experts  know  MapReduce.  Savvy  people  know  SQL.    

Everyone  knows  Search!  

3  

Page 4: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Open  Source  •  100%  Apache,  100%  Solr  •  Standard  Solr  APIs  

What  is  Cloudera  Search?  

Interac>ve  search  for  Hadoop  •  Full-­‐text  and  faceted  naviga2on  •  Batch,  near  real-­‐2me,  and  on-­‐demand  indexing  

4

Apache  Solr  integrated  with  CDH  •  Established,  mature  search  with  vibrant  community  •  Incorporated  as  part  of  the  Hadoop  ecosystem  

•  Apache  Flume,  Apache  HBase  •  Apache  MapReduce,  Kite  Morphlines  •  Apache  Spark,  Apache  Crunch  

Page 5: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

HDFS  

Online  Streaming  Data  End  User  Client  App  

(e.g.  Hue)  

Flum

e  

Raw,  filtered,  or  annotated  data  

SolrCloud  Cluster(s)  NRT  Data  indexed  w/  Morphlines  

Indexed  data  

Spark  &  MapReduce  Batch  Indexing  w/  Morphlines  

GoLive  updates  

HBase  Cluster  

NRT  Replica2on  Events  indexed  w/  Morphlines  

OLTP  Data  

Cloudera  Manager  

Search  queries  

Cloudera  Search  Architecture  Overview  

5  

Page 6: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Challenges  

•  Scalable/Reliable  Index  Storage  •  Near  Real  Time  (NRT)  indexing  •  Scalable  Batch  Indexing  •  Usability  

6  

Page 7: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Apache  Lucene/Solr  

•  Lucene  -­‐  full  text  search  library  •  Solr  –  search  service  on  Lucene  •  SolrCloud  –  distributed  search    

•  Par22oned  and  replicated  inverted  index  •  Low  latency,  scalable,  reliable,  HA,  secure  •  Uses  Zookeeper  

•  Solr  on  HDFS  •  Scalable,  cost-­‐efficient  index  storage  •  Higher  availability  •  Search  and  process  data  in  one  plaEorm  

7  

Page 8: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Near  Real  Time  Indexing  with  Apache  Flume  

Log  File   Solr  and  Flume  •  Reliable  streaming  data  inges2on  at  scale  

•  Indexing  at  data  ingest  •  Flexible  ETL  via  Morphlines  •  Packaged  as  “Flume  Morphline  Solr  Sink”  

HDFS  

Flume  Agent  

Indexer  w/  Morphlines  

Other  Log  File  

Flume  Agent  

Indexer  w/  Morphlines  

8  

SolrCloud   SolrCloud  

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf

Flume.conf

Page 9: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Searchable  Real-­‐Time  Data  –  Indexing  HBase  

HDFS  

HBase  

Interac2ve  OLTP  up

dates  

Lily  HBase  Indexer(s)  

w/  Morphlines  Tr

iggers  on  

updates   Solr  server  

Solr  server  Solr  server  Solr  server  SolrCloud  

9  

<indexer table=“myHBaseTable" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"> <param name="morphlineFile" value="/path/to/extractHBaseCell.conf"/>

hbase-indexer.conf

•  Updates  Solr  index  immediately  on  HBase  cell  update  •  Service  that  acts  as  a  HBase  replica2on  listener  •  Non-­‐intrusive,  Flexible,  Scalable,  Reliable  •  Par22oned  and  Replicated  Indexing  w/  Failover,  HA  •  hmp://github.com/ngdata/hbase-­‐indexer  

Page 10: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Scalable  Batch  ETL  &  Indexing  

Index  shard  

Files  

Index  shard  

Indexer  w/  Morphlines  

Files  or  HBase  tables  

Solr  server  

Indexer  w/  Morphlines  

Solr  server  

10

HDFS  

Solr  and  MapReduce  •  Flexible,  scalable,  reliable  batch  indexing  

•  On-­‐demand  indexing,  cost-­‐efficient  re-­‐indexing  

•  Start  serving  new  indices  without  down2me  

•  “MapReduceIndexerTool”  •  “HBaseMapReduceIndexerTool”  •  “CrunchIndexerTool  on  MR”  

Solr  and  Spark  •  “CrunchIndexerTool    on  Spark”  

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

Page 11: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Scalable  Batch  Indexing  

11

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

S0_0_0

Extractors(Mappers)

Leaf Shards(Reducers)

Root Shards(Mappers)

S0_0_1S0S0_1_0

S0_1_1

S1_0_0

S1_0_1S1S1_1_0

S1_1_1

Input Files

...

...

...

...

•  Morphline  runs  inside  Mapper  •  Can  exploit  all  reducer  slots  even  if    

#reducers  >>  #solrShards  

Page 12: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Streaming  ETL  (Extract,  Transform,  Load)  

Kite  Morphlines  •  Consume  any  kind  of  data  from  any  

kind  of  data  source,  process  and  load  into  Solr,  HDFS,  HBase  or  anything  else  

•  Simple  and  flexible  data  transforma2on  •  Extensible  set  of  transf.  commands  •  Reusable  across  mul2ple  workloads  •  For  Batch  &  Near  Real  Time  •  Configura2on  over  coding    

•  reduces  2me  &  skills  •  ASL  licensed  on  github  

hmps://github.com/kite-­‐sdk/kite  

syslog   Flume  Agent  

Solr  sink  

Command:  readLine  

Command:  grok  

Command:  loadSolr  

Solr  

Event  

Record  

Record  

Record  

Document  

12  

Page 13: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Morphline  Example  –  syslog  with  grok  

morphlines  :  [    {        id  :  morphline1        importCommands  :  [”org.kitesdk.**",  "org.apache.solr.**"]        commands  :  [            {  readLine  {}  }                                                    {                  grok  {                      dic2onaryFiles  :  [/tmp/grok-­‐dic2onaries]                                                                                  expressions  :  {                          message  :  """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_2mestamp}  %{SYSLOGHOST:syslog_hostname}  %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?:  %{GREEDYDATA:syslog_message}"""                    }                }            }            {  loadSolr  {}  }                    ]    }  ]  

Example  Input  <164>Feb    4  10:46:14  syslog  sshd[607]:  listening  on  0.0.0.0  port  22  Output  Record  syslog_pri:164  syslog_2mestamp:Feb    4  10:46:14  syslog_hostname:syslog  syslog_program:sshd  syslog_pid:607  syslog_message:listening  on  0.0.0.0  port  22.      13

Page 14: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Morphline  Example  -­‐  Escape  to  Java  Code  

morphlines : [ { id : morphline1 importCommands : [”org.kitesdk.**”] commands : [ { java { code: """ List tags = record.get("tags"); if (!tags.contains("hello")) { return false; } tags.add("world"); return child.process(record); """ } } ] } ]

14

Page 15: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Current  Morphline  Command  Library  

•  Supported  Data  Formats  •  Text:  Single-­‐line  record,  mul2-­‐line  records,  CSV,  CLOB  •  Apache  Avro,  Parquet  files  •  Apache  Hadoop  Sequence  Files  •  Apache  Hadoop  RCFiles  •  JSON  •  XML,  XPath,  XQuery  •  Via  Apache  Tika:  HTML,  PDF,  MS-­‐Office,  Images,  Audio,  Video,  Email  •  HBase  rows/cells      •  Via  pluggable  commands:  Your  custom  data  formats  

•  Regex  based  pamern  matching  and  extrac2on    •  Flexible  log  file  analysis  •  Integrate  with  and  load  data  into  Apache  Solr  •  Scrip2ng  support  for  dynamic  Java  code  •  Etc,  etc,  etc    

15

Page 16: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Morphline  Plugin  Commands  

•  Easy  to  add  new  I/O  &  transforma2on  cmds    •  Integrate  exis2ng  func2onality  and  third  party  systems  

•  Implement  Java  interface  Command  or  subclass  AbstractCommand

•  Add  it  to  Java  classpath  •  No  registra2on  or  other  administra2ve  ac2on  required  

16

Page 17: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Morphline  Performance  and  Scaling  

•  The  run2me  compiles  morphline  on  the  fly    •  The  run2me  processes  all  commands  of  a  given  morphline  in  the  same  thread  

•  Piping  a  record  from  one  command  to  another  is  fast  •  just  a  cheap  Java  method  call  •  no  queues,  no  handoffs  among  threads,  no  context  switches,  and  no  serializa2on  between  commands  

•  For  scalability,  deploy  many  morphline  instances  on  a  cluster  in  many  Flume  agents  and  MapReduce  tasks  and  Spark  tasks  

17

Page 18: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Security  

Cloudera  Search  +  Sentry  •  Kerberos  •  Encryp2on  on  the  wire  and  at  rest  •  Cluster  level  access  control  •  Index  level  access  control  •  Document  level  access  control  •  Field  level  access  control  

18  

Page 19: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Customizable  Hue  UI  •  Navigated,  faceted  drill  down  •  Full  text  search,  standard  Solr  API  and  query  language  

19   hmp://gethue.com  

Page 20: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Simplified  Management  

Cloudera  Manager  •  Install,  configure,  deploy  Solr  services  on  the  cluster  

•  Unified  management  and  monitoring  

•  Resource  management  

20  

Page 21: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Simplified  Configura2on  

21

Page 22: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

Conclusions  

•  Experts  know  MapReduce.  Savvy  people  know  SQL.  Everyone  knows  Search!  

•  Integrated  Search  seamlessly  into  Hadoop  •  Cloudera  Search  available  for  CDH  4  &  CDH5  

•  Free  Download    

22

Page 23: Adding&Search&as&aFirstClass&Ci2zen& to&Hadoop& · The&Enterprise&DataHub& UnifiedScaleoutStorage’ For&Any&Type&of&Data& Elas2c,&Faulttolerant,&SelfOhealing,&InOmemory&capabili2es

©2014 Cloudera, Inc. All rights reserved.