search on hadoop

38
1 Finding a needle in a stack of needles adding Search to the Hadoop Ecosystem Patrick Hunt (@phunt) Big Data Gurus Meetup July 2013

Upload: elephantscale

Post on 06-May-2015

228 views

Category:

Technology


3 download

DESCRIPTION

Patrick Hunt (Cloudera) talks about search capabilities of Hadoop

TRANSCRIPT

Page 1: Search On Hadoop

1

Finding  a  needle  in  a  stack  of  needles  -­‐  adding  Search  to  the  Hadoop  Ecosystem  

Patrick  Hunt  (@phunt)  Big  Data  Gurus  Meetup  July  2013  

Page 2: Search On Hadoop

Agenda  

•  Big  Data  and  Search  –  seIng  the  stage  •  Cloudera  Search’s  Architecture  •  Component  deep  dive  •  Early  performance  insights  • What’s  next?  

Feel  free  to  ask  quesQons  as  we  go!  

Page 3: Search On Hadoop

Why  Search?  

An  Integrated  Part  of  the  Hadoop  System  

One  pool  of  data  

One  security  framework  

One  set  of  system  resources  

One  management  interface  

Page 4: Search On Hadoop

Search  Simplifies  InteracQon  

•  User  Goals  •  Explore  •  Navigate  •  Correlate  

•  Experts  know  MapReduce  •  Savvy  people  know  SQL  •  Everyone  knows  Search!  

Page 5: Search On Hadoop

Benefits  of  Search  

•  Improved  Big  Data  ROI  •  An  interacQve  experience  without  technical  knowledge  •  Single  data  set  for  mulQple  compuQng  frameworks  

•  Faster  Qme  to  insight  •  Exploratory  analysis,  esp.  unstructured  data  •  Broad  range  of  indexing  opQons  to  accommodate  needs  

•  Cost  efficiency  •  Single  scalable  pla`orm;  no  incremental  investment  •  No  need  for  separate  systems,  storage  

•  Solid  foundaQons  and  reliability  •  Apache  Solr  in  producQon  environments  for  years  •  Hadoop-­‐powered  reliability  and  scalability  

Page 6: Search On Hadoop

What  is  Cloudera  Search?  

•  Full-­‐text,  interacQve  search  and  faceted  navigaQon  •  Batch,  near  real-­‐Qme,  and  on-­‐demand  indexing  •  Apache  Solr  integrated  with  CDH  

•  Established,  mature  search  with  vibrant  community  •  Separate  runQme  like  MapReduce,  Impala  •  Incorporated  as  part  of  the  Hadoop  ecosystem  

•  Open  Source  •  100%  Apache,  100%  Solr  •  Standard  Solr  APIs  

Page 7: Search On Hadoop

Cloudera  Search  Components  

•  Refresher  –  HDFS/MR/Lucene/Solr/SolrCloud  •  HDFSDirectoryFactory/HDFSDirectory  •  BlockDirectory/BlockDirectoryCache  •  Near  Real  Time  (NRT)  indexing  

•  Apache  Flume  MorphlineSolrSink  •  Lily  HBase  Indexer  

•  Batch  –  MapReduce  Indexer  •  ETL  –  Cloudera  Morphlines  •  Hue  Search  ApplicaQon  

Page 8: Search On Hadoop

Apache  Hadoop  

•  Apache  HDFS  •  Distributed  file  system  •  High  reliability  •  High  throughput  

•  Apache  MapReduce  •  Parallel,  distributed  programming  model  •  Allows  processing  of  large  datasets  •  Fault  tolerant  

Page 9: Search On Hadoop

Apache  Lucene  

•  Full  text  search  •  Indexing  •  Query  

•  TradiQonal  inverted  index  •  Batch  and  Incremental  indexing  • We  are  using  version  4  (4.3  currently)  

Page 10: Search On Hadoop

Apache  Solr  

•  Search  service  built  using  Lucene  •  Ships  with  Lucene  (same  TLP  at  Apache)  

•  Provides  XML/HTTP/JSON/Python/Ruby/…  APIs  •  Indexing  •  Query  •  AdministraQve  interface  •  Also  rich  web  admin  GUI  via  HTTP  

Page 11: Search On Hadoop

Apache  SolrCloud  

•  Provides  distributed  Search  capability  •  Part  of  Solr  (not  a  separate  library/codebase)  •  Shards  -­‐  both  verQcally  and  horizontally  scaleable    

•  Horizontally  –  parQQon  index  for  size  •  VerQcally  –  replicate  for  query  performance  

•  Uses  ZooKeeper  for  coordinaQon  •  No  split-­‐brain  issues  •  Simplifies  operaQons  

Page 12: Search On Hadoop

Distributed  Search  on  Hadoop  

Flume  Hue  UI  

Custom  UI  

Custom  App  

Solr  

Solr  

Solr  

SolrCloud  query  

query  

query  

index  

Hadoop  Cluster  

MR  

HDFS  

index  

HBase  index  

Page 13: Search On Hadoop

High  Level  View  

13  

HDFS  

Lucene  

Solr  

ZooKeeper  

SolrCloud  

Querying  API   Indexing  API  

Solr  on  HDFS  •  Scalable,  cost-­‐efficient  index  storage  

•  Higher  availability  •  Search  and  process  data  in  one  pla`orm  

Page 14: Search On Hadoop

Cloudera  Upstream  ContribuQons  

•  SOLR-­‐3911  -­‐  Directory/DirectoryFactory  now  first  class  •  Solr  ReplicaQon  now  uses  Directory  abstracQon  •  Solr  Admin  UI  no  longer  assumes  local  directory  access  

•  SOLR-­‐4916  –  support  for  reading/wriQng  Solr  index  files  and  transacQon  log  files  to/from  HDFS  

•  HDFSDirectoryFactory/HDFSDirectory  implementaQon  

•  SOLR-­‐4655  -­‐  The  Overseer  should  assign  node  names  by  default.  •  SOLR-­‐3706  -­‐  Ship  setup  to  log  with  log4j  •  SOLR-­‐4494  -­‐  Clean  up  and  polish  CollecQons  API  •  SOLR-­‐4718  -­‐Improvements  to  configurability  

•  ConfiguraQon  now  enQrely  through  ZooKeeper    (opQonal)  

•  Many  more  improvements/cleanup/hardening/…  

Page 15: Search On Hadoop

Lucene  Directory  abstracQon  

•  It’s  how  Lucene  interacts  with  index  files  •  Solr  uses  it  too,  but  spory  prior  to  4.x    Class Directory { listAll(); createOutput(file, context); openInput(file, context); deleteFile(file); makeLock(file); clearLock(file); …

}

Page 16: Search On Hadoop

HDFSDirectory  

•  Originally  implemented  against  Lucene  3  by  Blur  •  Cloudera  ported  to  Lucene  4  and  now  upstream  

•  Solr  trunk  and  version  4.4  (upcoming)  •  Uses  the  HDFS  Client  API   import org.apache.hadoop.fs.FileSystem; public IndexInput openInput(file, context){ … _inputStream = fileSystem.open(path, bufferSize); …

}  

Page 17: Search On Hadoop

HDFSDirectoryFactory  

•  Enables  plugin  of  HDFSDirectory  into  Solr  •  Configurable  through  solrconfig.xml  

•  Also  handles  •  Directory  configuraQon  •  ComposiQng  of  Directory(s)  

•  NRTCachingDirectory  •  BlockDirectory/BlockDirectoryCache  

Page 18: Search On Hadoop

BlockDirectory/BlockDirectoryCache  

•  In  memory  cache  of  index  file  blocks  •  Caches  on  read,  in  some  cases  on  write  

•  Compensate  for  less  effecQve  file  system  cache  •  Uses  DirectByteBuffer,  not  JVM  heap  (default)  •  Size  configurable  by  user  

Page 19: Search On Hadoop

Near  Real  Time  Indexing  with  Flume  

Log  File   Solr  and  Flume  •  Data  ingest  at  scale  •  Flexible  extracQon  and  mapping  

•  Indexing  at  data  ingest  

HDFS  

Flume  Agent  

Indexer  

Other  Log  File  

Flume  Agent  

Indexer  

19  

Page 20: Search On Hadoop

Apache  Flume  -­‐  MorphlineSolrSink  

•  A  Flume  Source…  •  Receives/gathers  events    

•  A  Flume  Channel…  •  Carries  the  event  –  MemoryChannel  or  reliable  FileChannel    

•  A  Flume  Sink…  •  Sends  the  events  on  to  the  next  locaQon  

•  Flume  MorphlineSolrSink  •  Integrates  Cloudera  Morphlines  library  

•  ETL,  more  on  that  in  a  bit  •  Does  batching  •  Results  sent  to  Solr  for  indexing  

Page 21: Search On Hadoop

Near  Real  Time  indexing  of  Apache  HBase  

HDFS  

HBase  

interacQve  load  

Indexer(s)  

Triggers  on  

updates   Solr  server  

Solr  server  Solr  server  Solr  server  Solr  server  

Search  

+   =  planet-­‐sized  tabular  data  immediate  access  &  updates  fast  &  flexible  informaFon  discovery  

B IG  DATA  DATAMANAGEMENT  

Page 22: Search On Hadoop

Lily  HBase  Indexer  

•  CollaboraQon  between  NGData  &  Cloudera  •  NGData  are  creators  of  the  Lily  data  management  pla`orm  

•  Lily  HBase  Indexer  •  Service  which  acts  as  a  HBase  replicaQon  listener  

•  HBase  replicaQon  features,  such  as  filtering,  supported  

•  ReplicaQon  updates  trigger  indexing  of  updates  (rows)  •  Integrates  Cloudera  Morphlines  library  for  ETL  of  rows  •  AL2  licensed  on  github  hrps://github.com/ngdata  

Page 23: Search On Hadoop

Scalable  Batch  Indexing  

Index  shard  

Files  

Index  shard  

Indexer  

Files  

Solr  server  

Indexer  

Solr  server  

23

HDFS  

Solr  and  MapReduce  •  Flexible,  scalable  batch  indexing  

•  Start  serving  new  indices  with  no  downQme  

•  On-­‐demand  indexing,  cost-­‐efficient  re-­‐indexing  

Page 24: Search On Hadoop

Scalable  Batch  Indexing  

24

Mapper:  Parse  input  into  

indexable  document  

Mapper:  Parse  input  into  

indexable  document  

Mapper:  Parse  input  into  

indexable  document  

Index  shard  1  

Index  shard  2  

Arbitrary  reducing  steps  of  indexing  and  merging  

End-­‐Reducer  (shard  1):  Index  document  

End-­‐Reducer  (shard  2):  Index  document  

Page 25: Search On Hadoop

MapReduce  Indexer  

MapReduce  Job  with  two  parts    

1)  Scan  HDFS  for  files  to  be  indexed  •  Much  like  Unix  “find”  –  see  HADOOP-­‐8989  •  Output  is  NLineInputFormat’ed  file  

2)  Mapper/Reducer  indexing  step  •  Mapper  extracts  content  via  Cloudera  Morphlines  •  Reducer  indexes  documents  via  embedded  Solr  server  •  Originally  based  on  SOLR-­‐1301  

•  Many  modificaQons  to  enable  linear  scalability  

Page 26: Search On Hadoop

MapReduce  Indexer  “golive”  

•  Cloudera  created  this  to  bridge  the  gap  between  NRT  (low  latency,  expensive)  and  Batch  (high  latency,  cheap  at  scale)  indexing  

•  Results  of  MR  indexing  operaQon  are  immediately  merged  into  a  live  SolrCloud  serving  cluster  

•  No  downQme  for  users  •  No  NRT  expense  •  Linear  scale  out  to  the  size  of  your  MR  cluster  

Page 27: Search On Hadoop

Cloudera  Morphlines  

•  Open  Source  framework  for  simple  ETL  •  Ships  as  part  Cloudera  Developer  Kit  (CDK)  

•  It’s  a  Java  library  •  AL2  licensed  on  github  hrps://github.com/cloudera/cdk  

•  Similar  to  Unix  pipelines  •  ConfiguraQon  over  coding  •  Supports  common  Hadoop  formats  

•  Avro  •  Sequence  file  •  Text  •  Etc…  

Page 28: Search On Hadoop

Cloudera  Morphlines  Architecture  

Solr  

Solr  

Solr  

SolrCloud  

Logs,  tweets,  social  media,  html,  

images,  pdf,  text….    

Anything  you  want  to  index  

Flume,  MR  Indexer,  HBase  indexer,  etc...    Or  your  applicaQon!  

Morphline  Library  

Morphlines  can  be  embedded  in  any  applicaQon…  

Page 29: Search On Hadoop

ExtracQon  and  Mapping  

•  Simple  and  flexible  data  transformaQon    

•  Reusable  across  mulQple  index  workloads  

•  Over  Qme,  extend  and  re-­‐use  across  pla`orm  workloads  

syslog   Flume  Agent  

Solr  sink  

Command:  readLine  

Command:  grok  

Command:  loadSolr  

Solr  

Event  

Record  

Record  

Record  

Document  

Morph

line  Library  

Page 30: Search On Hadoop

Current  Command  Library  

•  Integrate  with  and  load  into  Apache  Solr  •  Flexible  log  file  analysis  •  Single-­‐line  record,  mulQ-­‐line  records,  CSV  files    •  Regex  based  parern  matching  and  extracQon    •  IntegraQon  with  Avro    •  IntegraQon  with  Apache  Hadoop  Sequence  Files  •  IntegraQon  with  SolrCell  and  all  Apache  Tika  parsers    •  Auto-­‐detecQon  of  MIME  types  from  binary  data  using  Apache  Tika  

Page 31: Search On Hadoop

Current  Command  Library  (cont)  

•  ScripQng  support  for  dynamic  java  code    •  OperaQons  on  fields  for  assignment  and  comparison  •  OperaQons  on  fields  with  list  and  set  semanQcs    •  if-­‐then-­‐else  condiQonals    •  A  small  rules  engine  (tryRules)  •  String  and  Qmestamp  conversions    •  slf4j  logging  •  Yammer  metrics  and  counters    •  Decompression  and  unpacking  of  arbitrarily  nested  container  file  formats  

•  Etc…  

Page 32: Search On Hadoop

Morphline  Example  –  syslog  with  grok  

morphlines  :  [    {        id  :  morphline1        importCommands  :  ["com.cloudera.**",  "org.apache.solr.**"]        commands  :  [            {  readLine  {}  }                                                    {                  grok  {                      dicQonaryFiles  :  [/tmp/grok-­‐dicQonaries]                                                                                  expressions  :  {                          message  :  """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_Qmestamp}  %{SYSLOGHOST:syslog_hostname}  %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?:  %{GREEDYDATA:syslog_message}"""                    }                }            }            {  loadSolr  {}  }                    ]    }  ]  

Example  Input  <164>Feb    4  10:46:14  syslog  sshd[607]:  listening  on  0.0.0.0  port  22  Output  Record  syslog_pri:164  syslog_Qmestamp:Feb    4  10:46:14  syslog_hostname:syslog  syslog_program:sshd  syslog_pid:607  syslog_message:listening  on  0.0.0.0  port  22.      

Page 33: Search On Hadoop

Simple,  Customizable  Search  Interface  

Hue  •  Simple  UI  •  Navigated,  faceted  drill  down  

•  Customizable  display  •  Full  text  search,  standard  Solr  API  and  query  language  

Page 34: Search On Hadoop

Performance  

•  Cloudera  internal  tesQng  results  •  Cisco  WebEx  results  from  Hadoop  Summit  2013  

Page 35: Search On Hadoop

Cloudera  Internal  TesQng  

• We’ve  looked  at  •  NRT  and  Batch  indexing  •  Query  performance  

•  Performance  has  been  similar  to  Solr  on  local  disk  •  Indexing/query  operaQons  are  typically  CPU  bound  •  Caching  obviously  plays  a  big  factor  for  queries  •  Limited  use  cases  explored  –  public  beta  helping  here!  

Page 36: Search On Hadoop

Details  shared  by  WebEx  at  2013  Summit  

•  Cisco  presented  on  their  use  of  Flume,  Cloudera  Search,  and  Cloudera  Morphlines  

•  Indexing  log  events  in  Near  Real  Time  via  Flume  •  Cisco  UCS  C240  M3  servers  

•  2  quad  cores  @2.3ghz  •  16gb  RAM  •  12  x  3TB  storage  

•  Ingest  rate  •  70k  events/sec,  1.2  TB/day  inbound  

Page 37: Search On Hadoop

What’s  next  

•  Usability  –  “solrctl”  •  Security  

•  Index,  Document  and  (eventually)  Field  level  security  

•  Lots  of  scalability/performance  work  to  be  done  •  What  are  the  best  Solr/Lucene  seIngs  for  HDFS?  •  InvesQgate  short  circuit  HDFS  reads  •  BlockDirectoryCache  tuning  •  HDFS  block  affinity  

• More  sophisQcated  index  management  •  Take  advantage  of  collecQon  alias  support  (SOLR-­‐4497)  

Page 38: Search On Hadoop

Conclusion  

•  Cloudera  Search  now  in  public  beta  •  Free  Download    •  Extensive  documentaQon  •  Send  your  quesQons  and  feedback  to  search-­‐[email protected]  

•  Take  the  Search  online  training  

•  Cloudera  Manager  Standard  (i.e.  the  free  version)  •  Simple  management  of  Search  •  Free  Download  

•  QuickStart  VM  also  available!