
Page 1: Atlanta Hadoop Users Group July 2013

Our Hadoop Journey
Chris Curtin, Head of Technical Research
Atlanta Hadoop Users Group, July 2013

Page 2: Atlanta Hadoop Users Group July 2013

About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term 'SaaS' was being used
• Prior to Silverpop: real-time control systems, factory automation and warehouse management
• Always looking for technologies and algorithms to help with our challenges
• Car nut

Page 3: Atlanta Hadoop Users Group July 2013

Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – go to Careers under About

Page 4: Atlanta Hadoop Users Group July 2013

About Silverpop
• Founded in late 1999; Atlanta based, with offices in London, Germany, and Irvine, California
• Digital marketing technology provider, unifying marketing automation, email, mobile and social
• Track billions of contact events, execute on those events, send billions of emails
• Clients are in marketing departments

Page 5: Atlanta Hadoop Users Group July 2013

Challenge from the Business
• Engage allows clients to define their own database schema for contact records
• No two clients' schemas are the same
• Schemas often change weekly/monthly
• Contact records are 'point in time'
• Users want to report on the value of a contact record when the activity occurred

Page 6: Atlanta Hadoop Users Group July 2013

Example
• How well did my marketing campaign to my loyalty clients do last quarter?
• Easy question, hard answer
  – A contact's 'level' changes throughout the year (Silver to Gold)
  – Some piece of data wasn't known at the time of the email send, but is now
  – What do you want to pivot on? Level? Age? Source Code? Time in database?

Page 7: Atlanta Hadoop Users Group July 2013

Technical Solutions
• Traditional data warehouse
• Queries against OLTP or OLAP stores
• Customer-specific databases

Page 8: Atlanta Hadoop Users Group July 2013

Hadoop
• Started working on an R&D project in 2008
• First raw map/reduce
• Some Pig
• Some Hive/HBase
• (and several start-ups long since dead …)
• Flexible schema caused problems with most of them

Page 9: Atlanta Hadoop Users Group July 2013

First 'Real' Application
• Pivot reports against flexible schemas
• Per contact, not aggregate
• Let the user select any communication(s) and see which user attributes are available to use as pivots
• Pivot data is at the time of the communication, not current values (slow-moving data)
• Could be against anything from a few thousand events to billions

Page 10: Atlanta Hadoop Users Group July 2013

First 'Real' Challenges
• Flexible schema meant HBase, Hive etc. wouldn't work easily
• Flexible schema meant Pig scripts were difficult to maintain (even when generating them on the fly)
• Need to coordinate multiple steps OUTSIDE of the Hadoop process
• UI
• Resource allocation and control

Page 11: Atlanta Hadoop Users Group July 2013

Cascading
• Answered a number of problems
• Allowed integration with other platforms, even between M/R jobs (a sketch of these lookups follows this list)
  – MySQL to find the list of supported columns
  – HDFS to find the actual files on disk
  – JMS for job sourcing/status updates (not implemented)
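
A minimal sketch of those pre-flow lookups, assuming a hypothetical client_columns table in MySQL and a per-client event directory in HDFS (neither is Silverpop's actual schema or layout):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlowInputs {

    /** Ask MySQL which columns this client's schema currently supports. */
    public static List<String> supportedColumns(String jdbcUrl, long clientId) throws Exception {
        List<String> columns = new ArrayList<String>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT column_name FROM client_columns WHERE client_id = ?")) {
            ps.setLong(1, clientId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    columns.add(rs.getString(1));
                }
            }
        }
        return columns;
    }

    /** Find the event files actually on disk for this client before wiring up the flow. */
    public static List<Path> eventFiles(String rootDir, long clientId) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> paths = new ArrayList<Path>();
        for (FileStatus status : fs.listStatus(new Path(rootDir + "/" + clientId))) {
            if (!status.isDirectory()) {
                paths.add(status.getPath());
            }
        }
        return paths;
    }
}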

Page 12: Atlanta Hadoop Users Group July 2013

Cascading Dynamic Schema Solution
• Allows the definition of the schema at run time
• Allows the definition of steps at run time
  – One report may have 10 mailings, another 10,000
  – 10,000 mailings can't be run in parallel, so programmatically create temporary results

Page 13: Atlanta Hadoop Users Group July 2013

Sample Cascading Code
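
The code shown on this slide is not captured in the transcript. Below is a rough stand-in sketch of the pattern the previous slide describes, using the Cascading 2.x Hadoop APIs; the field names, paths, and the 'sourceCode' pivot are invented for illustration and are not the original code:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class DynamicPivotFlow {

    public static void main(String[] args) {
        // In the real system the column list would come from MySQL (previous slides);
        // it is hard-coded here to keep the sketch self-contained.
        String[] clientColumns = { "contactId", "mailingId", "eventType", "sourceCode" };

        // Schema is only known at run time, so Fields are built from the column list.
        Fields eventFields = new Fields(clientColumns);

        // args[0] = input directory of event files, args[1] = output directory.
        Tap source = new Hfs(new TextDelimited(eventFields, "\t"), args[0]);
        Tap sink = new Hfs(new TextDelimited(new Fields("sourceCode", "count"), "\t"),
                           args[1], SinkMode.REPLACE);

        // Steps are also assembled at run time: group on whatever pivot the user picked.
        Pipe pivot = new Pipe("pivot");
        pivot = new GroupBy(pivot, new Fields("sourceCode"));
        pivot = new Every(pivot, new Count());

        Properties props = new Properties();
        AppProps.setApplicationJarClass(props, DynamicPivotFlow.class);

        Flow flow = new HadoopFlowConnector(props).connect("pivot-report", source, sink, pivot);
        flow.complete();
    }
}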

Page 14: Atlanta Hadoop Users Group July 2013

Client Response
• Either got it immediately or didn't see the need for something this flexible
• Found a reason to talk to others in the organization to find other pivot fields
• Most common use case: behaviors based on Source Code
• Turned out to be a weekly/monthly report, not a day-to-day tool
• Some used it ad hoc, but to build a requirement for their BI teams

Page 15: Atlanta Hadoop Users Group July 2013

Profiling Application
• Retention is a big theme in marketing
• Looking at a single mailing/ad buy etc. shows aggregates about that slice of time, but they are misleading:
  – Is the 20% who opened that email the same 20% as last week?
  – For people in my database for 6 months, how often do they interact with my marketing?
  – What is a typical interaction rate for my database?
  – How many times on average does a contact interact with me in a month? Who is outside of that rate?
• Instead of looking across communications, we now needed to look at each contact (a toy sketch of this per-contact view follows)
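
As a toy illustration of the per-contact questions above, this small sketch counts interactions per contact per month and flags contacts well outside the average; the Event shape and the 2x threshold are invented for the example and are not the production job:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InteractionProfile {

    static class Event {
        final long contactId;
        final String month;   // e.g. "2013-06"
        Event(long contactId, String month) { this.contactId = contactId; this.month = month; }
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(1, "2013-06"), new Event(1, "2013-06"),
            new Event(2, "2013-06"), new Event(2, "2013-07"),
            new Event(3, "2013-07"));

        // Interactions per (contact, month) bucket.
        Map<String, Integer> perContactMonth = new HashMap<String, Integer>();
        for (Event e : events) {
            String key = e.contactId + ":" + e.month;
            Integer current = perContactMonth.get(key);
            perContactMonth.put(key, current == null ? 1 : current + 1);
        }

        // Typical (average) monthly interaction rate across the database.
        double total = 0;
        for (int count : perContactMonth.values()) total += count;
        double average = total / perContactMonth.size();
        System.out.printf("average interactions per contact per month: %.2f%n", average);

        // Who is outside of that rate? (arbitrary 2x threshold for the sketch)
        for (Map.Entry<String, Integer> entry : perContactMonth.entrySet()) {
            if (entry.getValue() > 2 * average) {
                System.out.println(entry.getKey() + " is well above the average");
            }
        }
    }
}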

Page 16: Atlanta Hadoop Users Group July 2013

New Technical Challenges
• Previous report could be broken into specific steps to reduce the volume of events before the 'heavy' math was done
• New report needs to look at all events together
• Quickly overwhelmed the scheduler

Page 17: Atlanta Hadoop Users Group July 2013

Hadoop Challenges
• No schema – external store of mappings
• No appending in HDFS – daily integration could be 10MM rows for a communication, or 5
• 'Lots of small files' – thousands of clients with thousands of communications means millions of files
• ETL from Oracle meant concatenating files weekly to keep the file count down
• Single point of failure (NameNode) took a long time to recover
• Non-batch processes – how to schedule jobs on demand?
• Hadoop Job History – memory vs. concurrent job tradeoffs

Page 18: Atlanta Hadoop Users Group July 2013

MapR
• Eventually settled on MapR M3
  – Large number of files was the main driver
  – NFS mount is a nice feature
  – Cascading works
• Not without issues
  – Found several bugs around Volumes in HDFS and log retention that we had to work around (later fixed)
  – Can't copy between volumes using HDFS commands
  – More complicated for operations to manage (had a CLDB failure that took a day to recover, mostly us trying to figure out what to do)

Page 19: Atlanta Hadoop Users Group July 2013

Misc. Technical Information
• Fair Scheduler
  – Our scheduling logic knows how many queues there are and controls how many jobs can be submitted at the same time
• MapR ExpressLane is useful for small jobs
  – Our scheduler knows it is a small job, so it lets MapR take it
• MapR's NFS mount is great
  – Write directly to it from Java apps instead of the HDFS API
  – Concatenating daily files is a simple Java app now (see the sketch below)
  – (Still don't append to files in HDFS, but could)
• Nagios for monitoring
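
A minimal sketch of that 'simple Java app' idea, assuming the cluster is mounted via MapR's NFS gateway so the files can be treated as ordinary local files; the /mapr/... paths are illustrative only:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class ConcatDailyFiles {

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: one daily directory per client, concatenated into a weekly file.
        File dailyDir = new File("/mapr/cluster/events/client-123/daily");
        File weekly = new File("/mapr/cluster/events/client-123/weekly/2013-07-01.tsv");

        File[] dailyFiles = dailyDir.listFiles();
        if (dailyFiles == null) return;
        Arrays.sort(dailyFiles);   // keep the output in date order

        byte[] buffer = new byte[64 * 1024];
        try (FileOutputStream out = new FileOutputStream(weekly)) {
            for (File daily : dailyFiles) {
                try (FileInputStream in = new FileInputStream(daily)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }
}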

Page 20: Atlanta Hadoop Users Group July 2013

Cluster Details
• 5 nodes
  – 1 admin, 4 workers
  – 8-core Xeon, 16 GB
  – 5 TB usable per box assigned to MapR
• Had 9 nodes, reduced to 5
  – Cluster was mostly idle due to users' submittal patterns (heavy on Tuesdays and the 7th day of the month)
  – Delay to end users was minimal when we reduced the number of machines

Page 21: Atlanta Hadoop Users Group July 2013

Closing the Loop
• Next logical step was for clients to ask to target the contacts
• The volume of data didn't make that easily possible
• Integrating from Hadoop back to Oracle became an ETL project
  – Export from Oracle was a single dump; import would be a job per client
• Automation of reports (and emailing results) was the 2nd most asked-for feature
• Lots of support required to know what to do with the results
  – No easy 'go do this when you see this in the reports'

Page 22: Atlanta Hadoop Users Group July 2013

Current Status
• Dozens of monthly users
• Some optimizations to toss data early in the import step for clients not using the tool
• Packaging and pricing is vexing the product marketing team
• Runs lights-out unless the ETL process breaks

Page 23: Atlanta Hadoop Users Group July 2013

Business Challenges
• Lots of cool ideas we came up with, even implemented a few
• But end users didn't know what to do with the data
• 'SaaS-ifying' is proving difficult
  – Multi-tenancy resource management is not available
  – How to price? The end report may have 20 rows but processed 1BN rows to get there
• If I hear 'do you do big data' one more time …

Page 24: Atlanta Hadoop Users Group July 2013

Things We Are Watching
• Real-time tools on top of Hadoop (Drill, Impala)
• Storm inside of YARN
• Storm in general
• Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB

Page 25: Atlanta Hadoop Users Group July 2013

Information
• Slides: http://www.slideshare.net/chriscurtin
• Me: [email protected], @ChrisCurtin on Twitter

Page 26: Atlanta Hadoop Users Group July 2013

Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com/marketing-company/careers/open-positions.html