![Page 1: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/1.jpg)
eHarmony in Cloud
Brian Ko
![Page 2: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/2.jpg)
eHarmony
• Online subscription-based matchmaking service
• Available in United States, Canada, Australia and United Kingdom.
• On average, 236 members in the US marry every day.
• More than 20 million registered users.
![Page 3: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/3.jpg)
Why Cloud?
• Problem exceeds the limits of the data center and data warehouse environment.
• Leverage EC2 and Hadoop to scale data
![Page 4: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/4.jpg)
Finding Matches
• Model Creation
![Page 5: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/5.jpg)
Finding Matches
• Matching
![Page 6: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/6.jpg)
Finding Matches
• Predictive Model Scores
![Page 7: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/7.jpg)
Requirements
• All matches, scores, and user information should be archived daily
• Ready for 10X growth
• Possible O(n²) problem
• Need to support a set of models that keeps becoming more complex
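To see how the O(n²) concern interacts with the 10X growth target, here is a minimal sketch that counts candidate pairs under the assumption that every user is compared with every other user. The function name `candidate_pairs` and the all-pairs assumption are illustrative only; the slides do not say that matching actually scores every pair.

```python
def candidate_pairs(n_users: int) -> int:
    """Number of unordered user pairs if every user is compared with every other."""
    return n_users * (n_users - 1) // 2


if __name__ == "__main__":
    base = 20_000_000   # registered users quoted earlier in the deck
    grown = 10 * base   # the "10X growth" requirement
    ratio = candidate_pairs(grown) / candidate_pairs(base)
    print(f"pairs now:    {candidate_pairs(base):.3e}")
    print(f"pairs at 10X: {candidate_pairs(grown):.3e}")
    print(f"growth in pairs: about {ratio:.0f}x")
```

Under this assumption, ten times as many users means roughly a hundred times as many pairs, which is why a quadratic workload can outgrow a fixed-size environment quickly.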
![Page 8: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/8.jpg)
Challenge
• Current architecture is multi-tiered with a relational back-end
• Scoring is DB join intensive
• Data need constant archiving
– Matches, match scores, user attributes at time of match creation
– Model validation is done at a later time across many days
• Need a non-DB solution
![Page 9: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/9.jpg)
Solution
• Hadoop: open-source Java implementation of Google’s MapReduce framework
– Distributes work across vast amounts of data
– Hadoop Distributed File System (HDFS) provides reliability through replication
– Automatic re-execution on failure/distribution
– Scale horizontally on commodity hardware
![Page 10: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/10.jpg)
• Simple Storage Service (S3) provides cheap, unlimited storage.
• Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand.
![Page 11: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/11.jpg)
MapReduce
• A large server farm can use MapReduce to process huge datasets.
• Map step
– Master node takes the input
– Chops it up into smaller sub-problems
– Distributes those to worker nodes
• Reduce step
– Master node takes the answers to all the sub-problems
– Combines them in a way to get the output
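The two steps described on this slide can be sketched on a single machine. This is an illustration under stated assumptions, not the talk's code: threads stand in for worker nodes, the input is a hypothetical list of `(user_a, user_b)` match pairs, and the names `split_input`, `count_matches`, `combine`, and `run` are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_input(records, n_chunks):
    """Master: chop the input into roughly equal sub-problems."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_matches(chunk):
    """Worker: solve one sub-problem (count matches per user in this chunk)."""
    counts = Counter()
    for user_a, user_b in chunk:
        counts[user_a] += 1
        counts[user_b] += 1
    return counts


def combine(partials):
    """Master: merge the workers' partial answers into the final output."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(records, n_workers=3):
    chunks = split_input(records, n_workers)
    # Threads play the role of worker nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_matches, chunks))
    return combine(partials)


if __name__ == "__main__":
    matches = [("ann", "bob"), ("ann", "cy"), ("dee", "bob"), ("ann", "eli")]
    print(dict(run(matches)))
```

The result does not depend on how the input is chunked, because merging counters is associative and commutative. That property is what lets the work be distributed freely.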
![Page 12: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/12.jpg)
Why Hadoop
• Mapper and Reducer are written by you
• Hadoop provides
– Parallelization
– Shuffle and sort
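The division of labor on this slide can be illustrated with a small sketch. The `mapper` and `reducer` below are the parts a developer would write; `run_job` is a hypothetical, in-memory stand-in for the shuffle and sort that the framework performs, and is not Hadoop's API. The record layout `(user_id, score)` is also an assumption made for the example.

```python
from itertools import groupby
from operator import itemgetter


# --- written by you ---------------------------------------------------
def mapper(record):
    """Emit (key, value) pairs; here one (user_id, score) per match row."""
    user_id, score = record
    yield user_id, score


def reducer(key, values):
    """Receive every value for one key; here, keep the best score."""
    return key, max(values)


# --- stand-in for what the framework does -----------------------------
def run_job(records, mapper, reducer):
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle and sort: bring all pairs with the same key together.
    mapped.sort(key=itemgetter(0))
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]
```

For example, `run_job([("u1", 0.4), ("u2", 0.9), ("u1", 0.7)], mapper, reducer)` groups both `u1` rows before calling the reducer, so the reducer never has to search for related records itself.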
![Page 13: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/13.jpg)
Actual Process
• Upload to S3 and start EC2 Cluster
![Page 14: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/14.jpg)
Actual Process
• Process and archive
![Page 15: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/15.jpg)
Amazon Elastic MapReduce
• It is a web service
• EC2 cluster is managed for you behind the scenes
• Starts Hadoop implementation of the MapReduce framework on Amazon EC2
• Each step can read and write data directly from and to S3
• Based on Hadoop 0.18.3
![Page 16: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/16.jpg)
Elastic MapReduce
• No need to explicitly allocate, start and shut down EC2 instances
• Individual jobs were managed by a remote script running on master node (no longer required)
• Jobs are arranged into a job flow, created with a single command
• Status of a job flow and all its steps is accessible through a REST service
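The idea of a job flow whose steps each report a status can be modeled in a few lines. This is a conceptual sketch only: `State`, `Step`, and `JobFlow` are hypothetical names and do not correspond to the Elastic MapReduce API, and the step names in the usage note are invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Step:
    name: str
    state: State = State.PENDING


@dataclass
class JobFlow:
    name: str
    steps: list = field(default_factory=list)

    def run(self, actions):
        """Run steps in order; stop at the first failure."""
        for step in self.steps:
            step.state = State.RUNNING
            try:
                actions[step.name]()
                step.state = State.COMPLETED
            except Exception:
                step.state = State.FAILED
                break

    def status(self):
        """Per-step status, the kind of view a status service could expose."""
        return {step.name: step.state.value for step in self.steps}
```

A flow built as `JobFlow("nightly", [Step("score"), Step("archive")])` and run with a dictionary of callables keyed by step name reports each step as completed, failed, or still pending, so a caller can see where a run stopped.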
![Page 17: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/17.jpg)
Before Elastic MapReduce
• Allocate/Verify cluster
• Push application to cluster
• Run a control script on the master
• Kick off each job step on the master
• Create and detect a job completion token
• Shut the cluster down
![Page 18: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/18.jpg)
After Elastic MapReduce
• With Elastic MapReduce we can do all this with a single local command
• Uses jar and conf files stored on S3
• Various monitoring tools for EC2 and S3 are provided
![Page 19: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/19.jpg)
Development Environment
• Cheap to set up on Amazon
• Quick setup: number of servers is controlled by a config variable
• Identical to production
• Separate development account recommended
![Page 20: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/20.jpg)
Cost comparison
• Average EC2 and S3 cost
– Each run is 2 to 3 hours
– $1200/month for EC2
– $100/month for S3
• Projected in-house cost
– $5000/month for a local cluster of 50 nodes running 24/7
– A new company needs to add data center and operation personnel expense
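The arithmetic behind this comparison is short enough to spell out. The three dollar figures come from the slide; the variable names are illustrative, and the in-house figure excludes the data center and personnel expenses that the slide says a new company would have to add.

```python
EC2_PER_MONTH = 1200       # USD, from the slide
S3_PER_MONTH = 100         # USD, from the slide
IN_HOUSE_PER_MONTH = 5000  # USD, projected 50-node local cluster running 24/7

cloud_per_month = EC2_PER_MONTH + S3_PER_MONTH
monthly_saving = IN_HOUSE_PER_MONTH - cloud_per_month
saving_fraction = monthly_saving / IN_HOUSE_PER_MONTH

print(f"cloud total:    ${cloud_per_month}/month")
print(f"monthly saving: ${monthly_saving}")
print(f"saving:         {saving_fraction:.0%} of the in-house projection")
```

On these numbers the cloud setup costs $1300 per month against a projected $5000, a difference of $3700 per month.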
![Page 21: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/21.jpg)
Summary
• Dev tools are really easy to work with and just work right out of the box
• Standard Hadoop AMI worked great
• Easy to write unit tests for MapReduce
• Hadoop community support is great
• EC2/S3/EMR are cost effective
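One reason MapReduce code tends to be easy to unit test is that a mapper is an ordinary function from one input record to key/value pairs, so it can be exercised without a cluster. The sketch below uses Python for brevity even though the jobs in the talk ran on Java Hadoop; `score_mapper` and its tab-separated input format are hypothetical.

```python
import unittest


def score_mapper(line):
    """Hypothetical mapper: parse 'user_id<TAB>score' into a (key, value) pair."""
    user_id, score = line.rstrip("\n").split("\t")
    return user_id, float(score)


class ScoreMapperTest(unittest.TestCase):
    def test_parses_key_and_value(self):
        self.assertEqual(score_mapper("u42\t0.75\n"), ("u42", 0.75))

    def test_rejects_malformed_line(self):
        # A line without a tab cannot be unpacked into two fields.
        with self.assertRaises(ValueError):
            score_mapper("no-tab-here")


if __name__ == "__main__":
    unittest.main()
```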
![Page 22: eHarmony in Cloud](https://reader035.vdocuments.us/reader035/viewer/2022062501/568160e5550346895dd01953/html5/thumbnails/22.jpg)
The End
5 minutes of question time starts now!