lucene rev preso cope real time searching of big data with solr and hadoop

8/12/2019 Lucene Rev Preso Cope Real Time Searching of Big Data With Solr and Hadoop

1/34


2/34

OpenLogic, Inc.

Agenda

Introduction

The Problem

The Solution

Lessons Learned

Public Clouds &Big Data

Final Thoughts

Q & A

2


3/34

OpenLogic, Inc.

Introduction

Rod CopeCTO & Founder of OpenLogic

25 years of software development experience

IBM Global Services, Anthem, General Electric

OpenLogicOpen Source Provisioning, Support, and Governance Solutions

Certified library w/SLA support on 500+ Open Source packageshttp://olex.openlogic.com

Over 200 Enterprise customers

3
http://olex.openlogic.com/http://olex.openlogic.com/http://olex.openlogic.com/


4/34

OpenLogic, Inc.

The Problem

Big DataAll the worlds Open SourceSoftware

Metadata, code, indexes

Individual tables contain manyterabytes

Relational databases arentscale-free

Growing every day

Need real-time random access to all data

Long-running and complex analysis jobs

4


5/34

OpenLogic, Inc.

The Solution

Hadoop, HBase, and SolrHadoop distributed file system

HBase NoSQL data store column-oriented

Solr search server based on Lucene

All are scalable, flexible, fast, well-supported,used in production environments

And a supporting cast of thousandsStargate, MySQL, Rails, Redis, Resque,

Nginx, Unicorn, HAProxy, Memcached,Ruby, JRuby, CentOS,

5


6/34

OpenLogic, Inc.6

Solution Architecture

Live

replication

Livereplication

Livereplication(3x)

Livereplication

Data LANInternet Application LANCaching and load balancing not shown


7/34OpenLogic, Inc.

Getting Data out of HBase

HBase!NoSQLThink hash table, not relational database

Scanning vs. querying

How do find my data if primary key wont cut it?

Solr to the rescueVery fast, highly scalable search server with built-in sharding

and replication based on Lucene

Dynamic schema, powerful query language, faceted search,

accessible via simple REST-like web API w/XML, JSON, Ruby,and other data formats

7


8/34OpenLogic, Inc.

Solr

Sharding

Query any server it executes the same query against all otherservers in the group

Returns aggregated result to original caller

Async replication (slaves poll their masters)

Can use repeaters if replicating across data centersOpenLogic

Solr farm, sharded, cross-replicated, fronted with HAProxyLoad balanced writes across masters, reads across masters and slaves

Be careful not to over-commit

Billions of lines of code in HBase, all indexed in Solr for real-timesearch in multiple ways

Over 20 Solr fields indexed per source file

8


9/34OpenLogic, Inc.

Solr Implementation Sharding + Replication

9

Masters

Slaves


10/34OpenLogic, Inc.

Write Flow

10

Masters

Slaves



Read Flow

11

Masters

Slaves



Delete Flow

12

Masters

Slaves



Write Flow - Failover

13

Masters

Slaves


14/34

OpenLogic, Inc.

Read Flow - Failover

14

Masters

Slaves


15/34

OpenLogic, Inc.

Loading Big Data

Experiment with different Solr merge factors

During huge loads, it can help to use a higher factor for loadperformance

Minimize index manipulation gymnastics

Start with something like 25

When youre done with the massive initial load/import, turn itback down for search performanceMinimize number of queries

Start with something like 5

Example:

curl http://solr1:8080/solr/master/update?optimize=true&maxSegments=5

This can take a few minutes, so you might need to adjust various timeouts

Note that a small merge factor will hurt indexing performance if youneed to do massive loads on a frequent basis or continuous indexing

15
http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5http://solr1:8080/solr/master/update?optimize=true&maxSegments=5


16/34

OpenLogic, Inc.

Loading Big Data (cont.)

Test your write-focused load balancingLook for large skews in Solr index size

Note: you may have to commit, optimize, write again, andcommit before you can really tell

Make sure your replication slaves are keeping upUsing identical hardware helpsIf index directories dont look the same, something is wrong

16


17/34

OpenLogic, Inc.


Dont commit to Solr too frequentlyIts easy to auto-commit or commit after every record

Doing this 100s of times per second will take Solr down,especially if you have serious warm up queries configured

Avoid putting large values in HBase (> 5MB)Works, but may cause instability and/or performance issuesRows and columns are cheap, so use more of them instead

17


18/34

OpenLogic, Inc.


Dont use a single machine to load the cluster

You might not live long enough to see it finish

At OpenLogic, we spread raw source data acrossmany machines and hard drives via NFS

Be very careful with NFS configuration can hang machines

Load data into HBase via Hadoop map/reduce jobsTurn off WAL for much better performance

put.setWriteToWAL(false)

Index in Solr as you goGood way to test your load balancing

write schemes and replication set up

This will find your weak spots!

18


19/34

OpenLogic, Inc.

Scripting Languages Can Help

Writing data loading jobs can be tedious

Scripting is faster and easier than writing Java

Great for system administration tasks, testing

Standard HBase shell is based on JRuby

Very easy Map/Reduce jobs with J/Ruby andWukong

Used heavily at OpenLogic

Productivity of RubyPower of Java Virtual Machine

Ruby on Rails, Hadoop integration, GUI clients

19


20/34

OpenLogic, Inc.

Java (27 lines)

20

public class Filter { public static void main( String[] args ) {

List list = new ArrayList(); list.add( "Rod" );list.add( "Neeta" );

list.add( "Eric" );list.add( "Missy" );

Filter filter = new Filter();

List shorts = filter.filterLongerThan( list, 4 ); System.out.println( shorts.size() );

Iterator iter = shorts.iterator(); while ( iter.hasNext() ) { System.out.println( iter.next() ); }}

public List filterLongerThan( List list, int length ) { List result = new ArrayList(); Iterator iter = list.iterator(); while ( iter.hasNext() ) { String item = (String) iter.next(); if ( item.length()


21/34

OpenLogic, Inc.

Scripting languages (4 lines)

Groovy

list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.find_all { |name| name.size 2

-> Rod Eric

list = ["Rod", "Neeta", "Eric", "Missy"]shorts = list.findAll { name -> name.size() println name }

-> 2 -> Rod Eric

JRuby


22/34

OpenLogic, Inc.

Cutting Edge

HadoopSPOF around Namenode, append functionality

HBaseBackup, replication, and indexing solutions

in fluxSolr

Several competing solutions around cloud-likescalability and fault-tolerance, includingZooKeeper and Hadoop integration

No clear winner, none quite ready for production

22


23/34

OpenLogic, Inc.

Configuration is Key

Many moving parts

Its easy to let typos slip throughConsider automated configuration

via Chef, Puppet, or similar

Pay attention to the details

Operating system max open files,sockets, and other limits

Hadoop and HBase configurationhttp://wiki.apache.org/hadoop/Hbase/Troubleshooting

Solr merge factor and normsDont starve HBase or Solr for memory

Swapping will cripple your system

23
http://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshootinghttp://wiki.apache.org/hadoop/Hbase/Troubleshooting


24/34

OpenLogic, Inc.

Commodity Hardware

Commodity hardware != 3 year old desktop

Dual quad-core, 32GB RAM, 4+ disks

Dont bother with RAID on Hadoop data disksBe wary of non-enterprise drives

Expect ugly hardware issues at some point

24


25/34

OpenLogic, Inc.

OpenLogics Hadoop Implementation

Environment100+ CPU cores

100+ Terabytes of disk

Machines dont have identity

Add capacity by plugging innew machines

Why not Amazon EC2?Great for computational bursts

Expensive for long-term storage of Big DataNot yet consistent enough for mission-critical usage of HBase

25


26/34

OpenLogic, Inc.

OpenLogics Hadoop and Solr Deployment

Dual quad-core and dual hex-core

Dell boxes32-64GB RAM

ECC (highly recommended by Google)

6 x 2TB enterprise hard drives

RAID 1 on two of the drivesOS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.

Key source data backups

Hadoop datanode gets remaining drivesRedundant enterprise switches

Dual- and quad-gigabit NICs

26


27/34

OpenLogic, Inc.

Public Clouds and Big Data

Amazon EC2EBS Storage

100TB * $0.10/GB/month = $120k/year

Double Extra Large instances13 EC2 compute units, 34.2GB RAM

20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year3 year reserved instances

20 * 4k = $80k up front to reserve

(20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/yearto operate

Totals for 20 virtual machines1styear cost: $120k + $80k + $86k = $286k2nd& 3rdyear costs: $120k + $86k = $206k

Average: ($286k + $206k + $206k) / 3 = $232k/year

27


28/34

OpenLogic, Inc.

Private Clouds and Big Data

Buy your own20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k

Over 33 EC2 compute units each

Total: $53k/year (amortized over 3 years)

28


29/34

OpenLogic, Inc.

Public Clouds are Expensive for Big Data

Amazon EC2

20 instances * 13 EC2 compute units =260 EC2 compute units

Cost: $232k/year

Buy your own20 machines * 33 EC2 compute units =

660 EC2 compute units

Cost: $53k/year

Does not include hosting & maintenance costs

Dont think system administration goes awayYou still own all the instances monitoring, debugging, support

29


30/34

OpenLogic, Inc.

Expect Things to Fail A Lot

Hardware

Power supplies, hard drivesOperating System

Kernel panics, zombie processes,dropped packets

Software ServersHadoop datanodes, HBase regionservers,

Stargate servers, Solr servers

Your Code and DataStray Map/Reduce jobs, strange corner

cases in your data leading to programfailures

30


31/34

OpenLogic, Inc.

Not Possible Without Open Source

31


32/34

OpenLogic, Inc.

Not Possible Without Open Source

Hadoop, HBase, Solr

Apache, Tomcat, ZooKeeper,HAProxy

Stargate, JRuby, Lucene,

Jetty, HSQLDB, GeronimoApache Commons, JUnit

CentOS

Dozens more

Too expensive to build or buy everything

32


33/34

OpenLogic, Inc.

Final Thoughts

You can host your own big data

Tools are available today that didnt exist a few years agoFast to prototype production

readiness takes time

Expect to invest in training and support

HBase and Solr are fast100+ random queries/sec per instance

Give them memory and stand back

HBase scales, Solr scales (to a point)Dont worry about outgrowing a few machines

Do worry about outgrowing a rack of Solr instancesLook for ways to partition your data other than automatic sharding

33


34/34

Q & A

Any questions for [email protected]

Slides: http://www.openlogic.com/presentations/rod* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com
http://www.openlogic.com/presentations/rodmailto:[email protected]://www.openlogic.com/presentations/rodhttp://www.openlogic.com/presentations/rodmailto:[email protected]:[email protected]

lucene rev preso cope real time searching of big data with solr and hadoop

Documents