scaling big data search with solr and hbase

35
Scaling Big Data Search with Solr and HBase Rod Cope, CTO & Founder OpenLogic, Inc.

Upload: morrison

Post on 24-Feb-2016

24 views

Category:

Documents


0 download

DESCRIPTION

Scaling Big Data Search with Solr and HBase. Rod Cope, CTO & Founder OpenLogic, Inc. Agenda. Introduction The Problem The Solution Details Final Thoughts Q & A. Introduction. Rod Cope CTO & Founder of OpenLogic 25 years of software development experience - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Scaling Big Data Search with  Solr and HBase

Scaling Big Data Search with Solr and HBaseRod Cope, CTO & FounderOpenLogic, Inc.

Page 2: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Agenda

IntroductionThe ProblemThe SolutionDetailsFinal ThoughtsQ & A

2

Page 3: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Introduction

Rod CopeCTO & Founder of OpenLogic25 years of software development experienceIBM Global Services, Anthem, General ElectricWriting book: “Cloud Computing in Action: Innovating with Open Source” for Manning

OpenLogicOpen Source Support, Governance, and Scanning SolutionsCertified library w/SLA support on 650+ Open Source packages

http://olex.openlogic.com

Over 200 Enterprise customers

3

Page 4: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

The Problem

“Big Data”All the world’s Open Source SoftwareMetadata, code, indexesIndividual tables contain manyterabytesRelational databases aren’t scale-free

Growing every dayNeed real-time random access to all dataLong-running and complex analysis jobs

4

Page 5: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

The Solution

Hadoop, HBase, and SolrHadoop – distributed file systemHBase – “NoSQL” data store – column-orientedSolr – search server based on LuceneAll are scalable, flexible, fast, well-supported,used in production environments

And a supporting cast of thousands…Stargate, MySQL, Rails, Redis, Resque, Nginx, Unicorn, HAProxy, Memcached,Ruby, JRuby, CouchDB, CentOS, …

5

Page 6: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc. 6

Solution Architecture

RailsRailsRuby

on Rails RailsRailsResque Workers

RailsRailsSolr

Web Browser Rails

MySQL

RailsRailsHBase

Nginx & Unicorn

Livereplication

Livereplication

Livereplication(3x)

RailsRailsStargate

Redis

Livereplication

Data LANInternet Application LAN

RailsRailsMaven Repo

Scanner Client

Maven Client

Caching and load balancing not shown

Page 7: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Hadoop/HBase Implementation

Private Cloud100+ CPU cores100+ Terabytes of diskMachines don’t have identityAdd capacity by plugging innew machines

Why not EC2?Great for computational burstsExpensive for long-term storage of Big DataNot yet consistent enough for mission-critical usage of HBase

7

Page 8: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Public Clouds and Big Data

Amazon EC2EBS Storage

100TB * $0.10/GB/month = $120k/yearDouble Extra Large instances

13 EC2 compute units, 34.2GB RAM20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year3 year reserved instances

20 * 4k = $80k up front to reserve(20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate

Totals for 20 virtual machines1st year cost: $120k + $80k + $86k = $286k2nd & 3rd year costs: $120k + $86k = $206kAverage: ($286k + $206k + $206k) / 3 = $232k/year

8

Page 9: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Private Clouds and Big Data

Buy your own20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k

Over 33 EC2 compute units each

Total: $53k/year (amortized over 3 years)

9

Page 10: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Public Clouds are Expensive for Big DataAmazon EC2

20 instances * 13 EC2 compute units = 260 EC2 compute unitsCost: $232k/year

Buy your own20 machines * 33 EC2 compute units = 660 EC2 compute unitsCost: $53k/yearDoes not include hosting & maintenance costs

Don’t think system administration goes awayYou still “own” all the instances – monitoring, debugging, support

10

Page 11: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Getting Data out of HBase

HBase NoSQLThink hash table, not relational databaseScanning vs. querying

How do find my data if primary key won’t cut it?Solr to the rescue

Very fast, highly scalable search server with built-in sharding and replication – based on LuceneDynamic schema, powerful query language, faceted search, accessible via simple REST-like web API w/XML, JSON, Ruby, and other data formats

11

Page 12: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

SolrSharding

Query any server – it executes the same query against all other servers in the groupReturns aggregated result to original caller

Async replication (slaves poll their masters)Can use repeaters if replicating across data centers

OpenLogicSolr farm, sharded, cross-replicated, fronted with HAProxy

Load balanced writes across masters, reads across masters and slaves

Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple waysOver 20 Solr fields indexed per source file

12

Page 13: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Solr Implementation – Sharding + Replication

13

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 14: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Solr Implementation – Sharding + Replication

14

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 15: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Write Example

15

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 16: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Read Example

16

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 17: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Delete Example

17

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 18: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Write Example - Failover

18

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 19: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Machine 2 Machine 3 Machine 26Machine 1

Read Example - Failover

19

Solr Core A

Solr Core Z’

Solr Core B

Solr Core A’

Solr Core C

Solr Core B’

Solr Core Z

Solr Core Y’

Masters

Slaves

HAProxyHAProxy

Page 20: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Configuration is Key

Many moving partsIt’s easy to let typos slip throughConsider automated configurationvia Chef, Puppet, or similar

Pay attention to the detailsOperating system – max open files, sockets, and other limitsHadoop and HBase configuration

http://wiki.apache.org/hadoop/Hbase/Troubleshooting

Solr merge factor and norms

Don’t starve HBase or Solr for memorySwapping will cripple your system

20

Page 21: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Commodity Hardware

“Commodity hardware” != 3 year old desktopDual quad-core, 32GB RAM, 4+ disksDon’t bother with RAID on Hadoop data disks

Be wary of non-enterprise drives

Expect ugly hardware issues at some point

21

Page 22: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

OpenLogic’s Hadoop and Solr Deployment

Dual quad-core and dual hex-core Dell boxes32-64GB RAM

ECC (highly recommended by Google)

6 x 2TB enterprise hard drivesRAID 1 on two of the drives

OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.Key “source” data backups

Hadoop datanode gets remaining drivesRedundant enterprise switchesDual- and quad-gigabit NIC’s

22

Page 23: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Expect Things to Fail – A Lot

HardwarePower supplies, hard drives

Operating SystemKernel panics, zombie processes, dropped packets

Software ServersHadoop datanodes, HBase regionservers, Stargate servers, Solr servers

Your Code and DataStray Map/Reduce jobs, strange corner cases in your data leading to program failures

23

Page 24: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Cutting Edge

HadoopSPOF around Namenode, append functionality

HBaseBackup, replication, and indexing solutions in flux

Solr Several competing solutions around cloud-like scalability and fault-tolerance, including ZooKeeper and Hadoop integration

SolrCloud, Katta, Elastic CloudNo clear winner, none quite ready for production

24

Page 25: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Loading Big Data

Experiment with different Solr merge factorsDuring huge loads, it can help to use a higher factor for load performance

Minimize index manipulation gymnasticsStart with something like 25

When you’re done with the massive initial load/import, turn it back down for search performance

Minimize number of queriesStart with something like 5Example:

curl http://solr1:8080/solr/master/update?optimize=true&maxSegments=5This can take a few minutes, so you might need to adjust various timeouts

Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing

25

Page 26: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Loading Big Data (cont.)

Test your write-focused load balancingLook for large skews in Solr index sizeNote: you may have to commit, optimize, write again, and commit before you can really tell

Make sure your replication slaves are keeping upUsing identical hardware helpsIf index directories don’t look the same, something is wrong

26

Page 27: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Loading Big Data (cont.)

Don’t commit to Solr too frequentlyIt’s easy to auto-commit or commit after every recordDoing this 100’s of times per second will take Solr down, especially if you have serious warm up queries configured

Avoid putting large values in HBase (> 5MB)Works, but may cause instability and/or performance issuesRows and columns are cheap, so use more of them instead

27

Page 28: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Loading Big Data (cont.)

Don’t use a single machine to load the clusterYou might not live long enough to see it finish

At OpenLogic, we spread raw source data across many machines and hard drives via NFS

Be very careful with NFS configuration – can hang machines

Load data into HBase via Hadoop map/reduce jobsTurn off WAL for much better performance

put.setWriteToWAL(false)

Index in Solr as you goGood way to test your load balancingwrite schemes and replication set up

This will find your weak spots!

28

Page 29: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Scripting Languages Can Help

Writing data loading jobs can be tediousScripting is faster and easier than writing JavaGreat for system administration tasks, testingStandard HBase shell is based on JRubyVery easy Map/Reduce jobs with J/Ruby and WukongUsed heavily at OpenLogic

Productivity of RubyPower of Java Virtual MachineRuby on Rails, Hadoop integration, GUI clients

29

Page 30: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Java (27 lines)

30

public class Filter { public static void main( String[] args ) {

List list = new ArrayList(); list.add( "Rod" ); list.add( "Neeta" ); list.add( "Eric" ); list.add( "Missy" );

Filter filter = new Filter(); List shorts = filter.filterLongerThan( list, 4 ); System.out.println( shorts.size() );

Iterator iter = shorts.iterator(); while ( iter.hasNext() ) { System.out.println( iter.next() ); }}

public List filterLongerThan( List list, int length ) { List result = new ArrayList(); Iterator iter = list.iterator(); while ( iter.hasNext() ) { String item = (String) iter.next(); if ( item.length() <= length ) { result.add( item ); } } return result;}

}

Page 31: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Scripting languages (4 lines)

31

Groovy

list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.find_all { |name| name.size <= 4 }puts shorts.sizeshorts.each { |name| puts name }

-> 2 -> Rod Eric

JRuby

list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.findAll { name -> name.size() <= 4 }println shorts.sizeshorts.each { name -> println name }

-> 2 -> Rod Eric

Page 32: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Not Possible Without Open Source

32

Page 33: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Not Possible Without Open Source

Hadoop, HBase, SolrApache, Tomcat, ZooKeeper, HAProxyStargate, JRuby, Lucene, Jetty, HSQLDB, GeronimoApache Commons, JUnitCentOSDozens more

Too expensive to build or buy everything

33

Page 34: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Final ThoughtsYou can host Big Data in your own cloud

Tools are available today that didn’t exist a few years agoFast to prototype – production readiness takes timeExpect to invest in training and support

HBase and Solr are fast100+ random queries/sec per instanceGive them memory and stand back

HBase scales, Solr scales (to a point)Don’t worry about outgrowing a few machinesDo worry about outgrowing a rack of Solr instances

Look for ways to partition your data other than “automatic” sharding

34

Page 35: Scaling Big Data Search with  Solr and HBase

OpenLogic, Inc.

Q & A

Any questions for [email protected]

Slides: http://www.openlogic.com/downloads/presentations.php* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com