lucene rev preso cope real time searching of big data with solr and hadoop

Upload: trongho1

Post on 03-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    1/34

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    2/34

    OpenLogic, Inc.

    Agenda

    Introduction

    The Problem

    The Solution

    Lessons Learned

    Public Clouds &Big Data

    Final Thoughts

    Q & A

    2

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    3/34

    OpenLogic, Inc.

    Introduction

    Rod CopeCTO & Founder of OpenLogic

    25 years of software development experience

    IBM Global Services, Anthem, General Electric

    OpenLogicOpen Source Provisioning, Support, and Governance Solutions

    Certified library w/SLA support on 500+ Open Source packageshttp://olex.openlogic.com

    Over 200 Enterprise customers

    3

    http://olex.openlogic.com/http://olex.openlogic.com/http://olex.openlogic.com/
  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    4/34

    OpenLogic, Inc.

    The Problem

    Big DataAll the worlds Open SourceSoftware

    Metadata, code, indexes

    Individual tables contain manyterabytes

    Relational databases arentscale-free

    Growing every day

    Need real-time random access to all data

    Long-running and complex analysis jobs

    4

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    5/34

    OpenLogic, Inc.

    The Solution

    Hadoop, HBase, and SolrHadoop distributed file system

    HBase NoSQL data store column-oriented

    Solr search server based on Lucene

    All are scalable, flexible, fast, well-supported,used in production environments

    And a supporting cast of thousandsStargate, MySQL, Rails, Redis, Resque,

    Nginx, Unicorn, HAProxy, Memcached,Ruby, JRuby, CentOS,

    5

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    6/34

    OpenLogic, Inc.6

    Solution Architecture

    Live

    replication

    Livereplication

    Livereplication(3x)

    Livereplication

    Data LANInternet Application LANCaching and load balancing not shown

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    7/34OpenLogic, Inc.

    Getting Data out of HBase

    HBase!NoSQLThink hash table, not relational database

    Scanning vs. querying

    How do find my data if primary key wont cut it?

    Solr to the rescueVery fast, highly scalable search server with built-in sharding

    and replication based on Lucene

    Dynamic schema, powerful query language, faceted search,

    accessible via simple REST-like web API w/XML, JSON, Ruby,and other data formats

    7

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    8/34OpenLogic, Inc.

    Solr

    Sharding

    Query any server it executes the same query against all otherservers in the group

    Returns aggregated result to original caller

    Async replication (slaves poll their masters)

    Can use repeaters if replicating across data centersOpenLogic

    Solr farm, sharded, cross-replicated, fronted with HAProxyLoad balanced writes across masters, reads across masters and slaves

    Be careful not to over-commit

    Billions of lines of code in HBase, all indexed in Solr for real-timesearch in multiple ways

    Over 20 Solr fields indexed per source file

    8

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    9/34OpenLogic, Inc.

    Solr Implementation Sharding + Replication

    9

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    10/34OpenLogic, Inc.

    Write Flow

    10

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    11/34OpenLogic, Inc.

    Read Flow

    11

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    12/34OpenLogic, Inc.

    Delete Flow

    12

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    13/34OpenLogic, Inc.

    Write Flow - Failover

    13

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    14/34

    OpenLogic, Inc.

    Read Flow - Failover

    14

    Masters

    Slaves

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    15/34

    OpenLogic, Inc.

    Loading Big Data

    Experiment with different Solr merge factors

    During huge loads, it can help to use a higher factor for loadperformance

    Minimize index manipulation gymnastics

    Start with something like 25

    When youre done with the massive initial load/import, turn itback down for search performanceMinimize number of queries

    Start with something like 5

    Example:

    curl http://solr1:8080/solr/master/update?optimize=true&maxSegments=5

    This can take a few minutes, so you might need to adjust various timeouts

    Note that a small merge factor will hurt indexing performance if youneed to do massive loads on a frequent basis or continuous indexing

    15

    http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5
  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    16/34

    OpenLogic, Inc.

    Loading Big Data (cont.)

    Test your write-focused load balancingLook for large skews in Solr index size

    Note: you may have to commit, optimize, write again, andcommit before you can really tell

    Make sure your replication slaves are keeping upUsing identical hardware helpsIf index directories dont look the same, something is wrong

    16

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    17/34

    OpenLogic, Inc.

    Loading Big Data (cont.)

    Dont commit to Solr too frequentlyIts easy to auto-commit or commit after every record

    Doing this 100s of times per second will take Solr down,especially if you have serious warm up queries configured

    Avoid putting large values in HBase (> 5MB)Works, but may cause instability and/or performance issuesRows and columns are cheap, so use more of them instead

    17

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    18/34

    OpenLogic, Inc.

    Loading Big Data (cont.)

    Dont use a single machine to load the cluster

    You might not live long enough to see it finish

    At OpenLogic, we spread raw source data acrossmany machines and hard drives via NFS

    Be very careful with NFS configuration can hang machines

    Load data into HBase via Hadoop map/reduce jobsTurn off WAL for much better performance

    put.setWriteToWAL(false)

    Index in Solr as you goGood way to test your load balancing

    write schemes and replication set up

    This will find your weak spots!

    18

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    19/34

    OpenLogic, Inc.

    Scripting Languages Can Help

    Writing data loading jobs can be tedious

    Scripting is faster and easier than writing Java

    Great for system administration tasks, testing

    Standard HBase shell is based on JRuby

    Very easy Map/Reduce jobs with J/Ruby andWukong

    Used heavily at OpenLogic

    Productivity of RubyPower of Java Virtual Machine

    Ruby on Rails, Hadoop integration, GUI clients

    19

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    20/34

    OpenLogic, Inc.

    Java (27 lines)

    20

    public class Filter { public static void main( String[] args ) {

    List list = new ArrayList(); list.add( "Rod" );list.add( "Neeta" );

    list.add( "Eric" );list.add( "Missy" );

    Filter filter = new Filter();

    List shorts = filter.filterLongerThan( list, 4 ); System.out.println( shorts.size() );

    Iterator iter = shorts.iterator(); while ( iter.hasNext() ) { System.out.println( iter.next() ); }}

    public List filterLongerThan( List list, int length ) { List result = new ArrayList(); Iterator iter = list.iterator(); while ( iter.hasNext() ) { String item = (String) iter.next(); if ( item.length()

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    21/34

    OpenLogic, Inc.

    Scripting languages (4 lines)

    Groovy

    list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.find_all { |name| name.size 2

    -> Rod Eric

    list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.findAll { name -> name.size() println name }

    -> 2 -> Rod Eric

    JRuby

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    22/34

    OpenLogic, Inc.

    Cutting Edge

    HadoopSPOF around Namenode, append functionality

    HBaseBackup, replication, and indexing solutions

    in fluxSolr

    Several competing solutions around cloud-likescalability and fault-tolerance, includingZooKeeper and Hadoop integration

    No clear winner, none quite ready for production

    22

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    23/34

    OpenLogic, Inc.

    Configuration is Key

    Many moving parts

    Its easy to let typos slip throughConsider automated configuration

    via Chef, Puppet, or similar

    Pay attention to the details

    Operating system max open files,sockets, and other limits

    Hadoop and HBase configurationhttp://wiki.apache.org/hadoop/Hbase/Troubleshooting

    Solr merge factor and normsDont starve HBase or Solr for memory

    Swapping will cripple your system

    23

    http://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshooting
  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    24/34

    OpenLogic, Inc.

    Commodity Hardware

    Commodity hardware != 3 year old desktop

    Dual quad-core, 32GB RAM, 4+ disks

    Dont bother with RAID on Hadoop data disksBe wary of non-enterprise drives

    Expect ugly hardware issues at some point

    24

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    25/34

    OpenLogic, Inc.

    OpenLogics Hadoop Implementation

    Environment100+ CPU cores

    100+ Terabytes of disk

    Machines dont have identity

    Add capacity by plugging innew machines

    Why not Amazon EC2?Great for computational bursts

    Expensive for long-term storage of Big DataNot yet consistent enough for mission-critical usage of HBase

    25

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    26/34

    OpenLogic, Inc.

    OpenLogics Hadoop and Solr Deployment

    Dual quad-core and dual hex-core

    Dell boxes32-64GB RAM

    ECC (highly recommended by Google)

    6 x 2TB enterprise hard drives

    RAID 1 on two of the drivesOS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.

    Key source data backups

    Hadoop datanode gets remaining drivesRedundant enterprise switches

    Dual- and quad-gigabit NICs

    26

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    27/34

    OpenLogic, Inc.

    Public Clouds and Big Data

    Amazon EC2EBS Storage

    100TB * $0.10/GB/month = $120k/year

    Double Extra Large instances13 EC2 compute units, 34.2GB RAM

    20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year3 year reserved instances

    20 * 4k = $80k up front to reserve

    (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/yearto operate

    Totals for 20 virtual machines1styear cost: $120k + $80k + $86k = $286k2nd& 3rdyear costs: $120k + $86k = $206k

    Average: ($286k + $206k + $206k) / 3 = $232k/year

    27

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    28/34

    OpenLogic, Inc.

    Private Clouds and Big Data

    Buy your own20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k

    Over 33 EC2 compute units each

    Total: $53k/year (amortized over 3 years)

    28

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    29/34

    OpenLogic, Inc.

    Public Clouds are Expensive for Big Data

    Amazon EC2

    20 instances * 13 EC2 compute units =260 EC2 compute units

    Cost: $232k/year

    Buy your own20 machines * 33 EC2 compute units =

    660 EC2 compute units

    Cost: $53k/year

    Does not include hosting & maintenance costs

    Dont think system administration goes awayYou still own all the instances monitoring, debugging, support

    29

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    30/34

    OpenLogic, Inc.

    Expect Things to Fail A Lot

    Hardware

    Power supplies, hard drivesOperating System

    Kernel panics, zombie processes,dropped packets

    Software ServersHadoop datanodes, HBase regionservers,

    Stargate servers, Solr servers

    Your Code and DataStray Map/Reduce jobs, strange corner

    cases in your data leading to programfailures

    30

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    31/34

    OpenLogic, Inc.

    Not Possible Without Open Source

    31

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    32/34

    OpenLogic, Inc.

    Not Possible Without Open Source

    Hadoop, HBase, Solr

    Apache, Tomcat, ZooKeeper,HAProxy

    Stargate, JRuby, Lucene,

    Jetty, HSQLDB, GeronimoApache Commons, JUnit

    CentOS

    Dozens more

    Too expensive to build or buy everything

    32

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    33/34

    OpenLogic, Inc.

    Final Thoughts

    You can host your own big data

    Tools are available today that didnt exist a few years agoFast to prototype production

    readiness takes time

    Expect to invest in training and support

    HBase and Solr are fast100+ random queries/sec per instance

    Give them memory and stand back

    HBase scales, Solr scales (to a point)Dont worry about outgrowing a few machines

    Do worry about outgrowing a rack of Solr instancesLook for ways to partition your data other than automatic sharding

    33

  • 8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

    34/34

    Q & A

    Any questions for [email protected]

    Slides: http://www.openlogic.com/presentations/rod* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com

    http://www.openlogic.com/presentations/rodmailto:[email protected]://www.openlogic.com/presentations/rodhttp://www.openlogic.com/presentations/rodmailto:[email protected]:[email protected]