Source: files.meetup.com/17453062/bdx2015


Lessons Learned from Building a Big Data Technology Stack

Haggai Shachar Director, Data Services haggais@liveperson.com

{ name: "Haggai Shachar",
  work: [
    { employer: "LivePerson", title: "Director, Data Services" },
    { employer: "NuConomy", title: "Co-Founder, CTO" },
    { employer: "Israeli Intelligence Corps", title: "n/a" } ],
  likes: [ "data", "machine learning", "cycling", "diving" ],
  wife: "Orit",
  kids: [ { gender: "female", age: -0.2, name: undefined } ],
  todos: [ "buy a stroller" ] }

Hello World!

LivePerson ("you do something with chat, right?")

1990s: Click-to-Chat, user initiated

2000: Proactive, based on real-time behavior

2010: Real-time prediction, multichannel

Predictive Intelligence

Today: Engage everywhere (Web, Social, Native Apps, SMS, Email)

40 TB raw data

22M interactions

2B visits

* monthly figures

LivePerson Data stack

LiveEngage Console

MONITORING and CHAT/VOICE systems

Batch track / Real-Time track → APACHE KAFKA

STORM

COMPLEX EVENT PROCESSING

PERPETUAL STORE

BUSINESS INTELLIGENCE

ANALYTICAL DB

Serving layer (data producers)
▪ Monitoring and engagement systems

Middleware using Kafka
▪ Batch track
▪ (Near) real-time track

CEP using Storm
▪ Real-time computation
▪ Real-time data aggregation

Rich business intelligence
▪ Pre-defined dashboards
▪ Drill down to the record level
▪ Ad-hoc and self-service BI

Data repositories
▪ DSPT, Analytics, RT Aggregation
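The real-time data aggregation done in the CEP tier can be pictured as a tumbling-window counter. Below is a toy Python sketch of that idea; the class and field names are illustrative, not LivePerson's actual Storm topology:

```python
from collections import defaultdict

class WindowedCounter:
    """Toy tumbling-window aggregator, similar in spirit to a Storm bolt
    maintaining real-time counters (all names here are illustrative)."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.counts = defaultdict(int)  # (window_start, key) -> count

    def on_event(self, timestamp_ms, key):
        # Align the event timestamp to the start of its window.
        window_start = timestamp_ms - (timestamp_ms % self.window_ms)
        self.counts[(window_start, key)] += 1

    def count(self, window_start, key):
        return self.counts[(window_start, key)]

agg = WindowedCounter(window_ms=60_000)
agg.on_event(5_000, "visits")    # falls in the window starting at 0
agg.on_event(59_999, "visits")   # same window
agg.on_event(60_001, "visits")   # next window
```

A real deployment would shard these counters across Storm bolts and evict old windows; the sketch only shows the aggregation step itself.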


LiveEngage backoffice

RT REPOSITORIES

Forget data, let's talk cars: what's the ultimate vehicle?

1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write-optimized vs read-optimized models
5. Freshness vs correctness

Lessons Learned

Since the beginning of mankind <-> ~2004

LL#1 choosing the right tool

2004 - Today

1. Problem fit
2. Scaling fit
3. Query language (SQL is not going anywhere)
4. Aggregation framework
5. By-key R/W throughput
6. Community

LL#1 choosing the right tool

          | Scaling | Query language | Aggregation framework | By-key throughput | Community
Hadoop    | Great   | MR, Hive       | Robust but slow       | n/a               | Huge
Cassandra | Great   | CQL, Thrift    | Sucks                 | Awesome           | Big
MySQL     | Medium  | SQL            | Ok                    | Ok                | Huge
Vertica   | Good    | SQL, R         | Awesome               | Ok                | Small

▪ 150 developers
▪ 20 scrum teams
▪ 50 services
▪ 3 floors
▪ 4 development languages (Java, Scala, Python, Javascript)
▪ 3-5 deployments a week
▪ Marketing terms keep on changing

LL#2 Organization-wide data model

Tower of Babel by Pieter Bruegel the Elder Jacob's Ladder by William Blake

OR

Apache Avro to the rescue
▪ A schema-based serialization/deserialization framework
▪ Strong Hadoop integration & efficient storage
▪ Backward & forward compatibility
▪ Rich data structures (primitives, records, maps, arrays, enums)
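Avro's backward and forward compatibility comes from schema resolution: a reader can fill fields the writer omitted using defaults declared in the schema. Here is a toy Python sketch of that mechanism; the `Visit` record and the `resolve` helper are invented for illustration (real Avro performs this resolution during deserialization):

```python
# Hypothetical Avro schemas; the "default" on the new field is what makes
# v2 readers compatible with data written under v1.
schema_v1 = {"type": "record", "name": "Visit",
             "fields": [{"name": "visitor_id", "type": "string"}]}

schema_v2 = {"type": "record", "name": "Visit",
             "fields": [{"name": "visitor_id", "type": "string"},
                        {"name": "channel", "type": "string",
                         "default": "web"}]}

def resolve(record, reader_schema):
    """Toy schema resolution: apply defaults for fields the writer
    did not include; fail if a required field has no default."""
    out = dict(record)
    for field in reader_schema["fields"]:
        if field["name"] not in out:
            if "default" not in field:
                raise ValueError(f"missing field {field['name']!r}")
            out[field["name"]] = field["default"]
    return out

old_record = {"visitor_id": "v-42"}          # written with schema_v1
new_view = resolve(old_record, schema_v2)    # read with schema_v2
```

The same default mechanism, applied in the other direction, lets old readers skip fields added by newer writers, which is the forward-compatibility half of the story.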

LL#2 Organization-wide data model

LL#2 Organization-wide data model

                     | Protobuf        | Thrift              | Avro
Created              | 2001 (2008)     | 2007                | 2009
Creator / Maintainer | Google / Google | Facebook / Apache   | Doug Cutting / Apache
Hadoop support       | No              | No                  | Yes
Used by              | Google          | Facebook, Cassandra | Hadoop, LivePerson
Language support     | Good            | Great               | Good

LL#3 Producers / Consumers decoupling

▪ Flexibility of development / deployment
▪ One publisher, multiple subscribers

PRODUCER

MULTIPLE CONSUMERS

▪ Predicting the exact future architecture and project needs is hard

▪ Use a middleware layer to simplify the interface between producers and consumers.

▪ Happily extend and modify each of the tiers independently

LL#3 Decouple producers from consumers

[Diagram: Producer | Producer | Producer → middleware → Hadoop, Storm, external consumers]

Apache Kafka

▪ Distributed pub-sub system
▪ Developed at LinkedIn, maintained by Apache
▪ Very high throughput (~300K messages/sec)
▪ Horizontally scalable
▪ Multiple subscribers per topic
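The multiple-subscribers property above can be sketched with a toy in-memory broker. This is emphatically not Kafka's API, just an illustration of the two ideas the deck leans on: each consumer group keeps its own offset into an append-only log, so subscribers don't steal messages from each other, and consumption is rewindable:

```python
from collections import defaultdict

class ToyBroker:
    """Minimal in-memory pub-sub sketch (NOT Kafka) showing how one
    producer can feed multiple independent consumers per topic."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only log
        self.offsets = {}                 # (topic, group) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, group):
        # Each group reads from its own offset; polling one group
        # does not consume messages on behalf of another.
        offset = self.offsets.get((topic, group), 0)
        messages = self.topics[topic][offset:]
        self.offsets[(topic, group)] = len(self.topics[topic])
        return messages

broker = ToyBroker()
broker.publish("visits", {"visitor_id": "v-1"})
hadoop_batch = broker.poll("visits", group="hadoop")  # batch track
storm_rt = broker.poll("visits", group="storm")       # real-time track
```

Resetting an entry in `offsets` would replay the log from that point, which is the rewindable consumption that the log aggregators below lack.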

Message queues (ActiveMQ, TIBCO)
• Low throughput
• Secondary indexes
• Tuned for low latency

Log aggregators (Flume, Scribe)
• Focus on HDFS
• Push model
• No rewindable consumption

KAFKA

Apache Kafka

Writers like
▪ Writing fast

LL#4 Write Optimized vs Read Optimized

Readers like
▪ Pre-defined aggregations
▪ Denormalized dimensions
▪ Data duplication

Not all data needs are made equal
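The trade-off can be sketched in a few lines of Python: a write-optimized path that only appends raw events, and a read-optimized, pre-aggregated view derived from it. All names are illustrative, not the deck's actual schema:

```python
# Write path: cheap appends, no aggregation work at write time.
events = []

def write(event):
    events.append(event)

def build_read_view():
    """Read path: pre-compute the aggregation readers will ask for,
    trading duplication and staleness for fast queries."""
    view = {}
    for e in events:
        view[e["page"]] = view.get(e["page"], 0) + 1
    return view

write({"page": "/home"})
write({"page": "/home"})
write({"page": "/pricing"})
view = build_read_view()
```

Keeping both representations means paying for storage and for keeping the view current, which is exactly why writers and readers pull the model in opposite directions.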

LL#5: Freshness vs Correctness

Freshness first
▪ High freshness is the key
▪ Minor inaccuracy is acceptable
▪ Fire & forget or eventually consistent
▪ NoSQL

Correctness first
▪ It's all about accuracy
▪ Billable data
▪ Batch oriented
▪ Transactional
▪ RDBMS
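A minimal sketch of the trade-off, assuming a fire-and-forget real-time counter that may drift, reconciled by a slower but exact batch job (the class and method names are invented for illustration):

```python
class VisitCounter:
    """Toy freshness-vs-correctness model: a real-time counter that can
    miss updates, backed by a durable raw log the batch job recounts."""

    def __init__(self):
        self.rt_count = 0     # fresh, may be slightly inaccurate
        self.raw_events = []  # source of truth for the batch job

    def track(self, event, rt_delivered=True):
        self.raw_events.append(event)   # the durable write always happens
        if rt_delivered:                # the real-time update may be lost
            self.rt_count += 1

    def batch_reconcile(self):
        """Batch pass: recompute the exact count from the raw log."""
        self.rt_count = len(self.raw_events)

c = VisitCounter()
c.track("visit-1")
c.track("visit-2", rt_delivered=False)  # real-time update dropped
fresh = c.rt_count        # fresh answer, off by one
c.batch_reconcile()
exact = c.rt_count        # exact answer after the batch pass
```

This mirrors the two tracks in the architecture: the real-time path answers in milliseconds at 99.9% accuracy, while the batch path delivers 100% accuracy hours later.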

[Diagram: monitoring and chat/voice systems feed APACHE KAFKA (300K events/sec) on two tracks:
▪ Real-time track → STORM CEP → RT REPOSITORIES: real-time counters, 99.9% accuracy, ~300ms latency
▪ Batch track → PERPETUAL STORE → ANALYTICAL DB: raw data & aggregations, 100% accuracy, ~2h latency]

LL#4 Write Optimized vs Read Optimized
LL#5: Freshness vs Correctness

Read optimized

1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write-optimized vs read-optimized models
5. Freshness vs correctness

So, what did we have?

I’m Data

We do cool stuff, come work with us! haggais@liveperson.com 054-7000814
