eharmony in the cloud
DESCRIPTION
This is a lightning presentation given by Brian Ko, a member of my development team. It is a recap of a presentation he attended at JavaOne 2009.
TRANSCRIPT
eHarmony in the Cloud
Brian Ko
eHarmony
• Online subscription-based matchmaking service
• Available in the United States, Canada, Australia, and the United Kingdom.
• On average, 236 members in the US marry every day.
• More than 20 million registered users.
1
Why Cloud?
• Problem exceeds the limits of the data center and data warehouse environment.
• Leverage EC2 and Hadoop to scale data
2
Finding a Match
3
Finding a Match
• Model Creation
• Matching
4
Finding a Match
• Predictive Model Scores
5
Requirements
• All the matches, scores, and user information should be archived daily
• Ready for 10X growth
• Possible O(n²) problem
• Need to support a set of models that is becoming more complex
6
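The O(n²) concern above comes from pairwise matching: scoring every candidate pair among n users means n(n−1)/2 comparisons, so the 10X growth target multiplies the scoring work by roughly 100X. A minimal sketch of that arithmetic (class and method names are mine, not from the talk):

```java
public class PairwiseMatching {
    // Number of candidate pairs among n users: n choose 2 = n(n-1)/2.
    static long candidatePairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Growing the user base 10X grows the pairs to score ~100X,
        // which is why a join-heavy relational back-end struggles.
        System.out.println(candidatePairs(1_000_000));   // 499999500000
        System.out.println(candidatePairs(10_000_000));  // 49999995000000
    }
}
```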
Challenge
• Current architecture is multi-tiered with a relational back-end
• Scoring is DB join intensive
• Data needs constant archiving
– Matches, match scores, user attributes at time of match creation
– Model validation is done at a later time across many days
• Need a non-DB solution
7
Solution: Hadoop
• Open-source Java implementation of Google’s MapReduce framework
– Distributes work across vast amounts of data
– Hadoop Distributed File System (HDFS) provides reliability through replication
– Automatic re-execution and redistribution of work on failure
– Scales horizontally on commodity hardware
8
Amazon Web Services
• Simple Storage Service (S3) provides cheap, effectively unlimited storage.
• Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand.
9
MapReduce
• A large server farm can use MapReduce to process huge datasets.
• Map step
– Master node takes the input
– Chops it up into smaller sub-problems
– Distributes those to worker nodes
• Reduce step
– Master node takes the answers to all the sub-problems
– Combines them to produce the output
10
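The map → shuffle/sort → reduce flow above can be sketched in a single process. This is plain Java doing word counting (the canonical MapReduce example), not Hadoop API code; it only illustrates the data movement the slide describes, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map step: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle/sort: group values by key (Hadoop does this between map and
    // reduce). Reduce step: sum the grouped counts for each word.
    static Map<String, Integer> run(List<String> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) ->
            result.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

In a real cluster the map calls run in parallel on worker nodes and the grouping happens across the network; here both are collapsed into one JVM for clarity.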
Why Hadoop?
• You write the Mapper and Reducer
• Hadoop provides
– Parallelization
– Shuffle and sort
11
Actual Process
• Upload to S3 and start EC2 Cluster
13
Actual Process
• Process and archive
14
Amazon Elastic MapReduce
• It is a web service
• EC2 cluster is managed for you behind the scenes
• Starts Hadoop implementation of the MapReduce framework on Amazon EC2
• Each step can read and write data directly from and to S3
• Based on Hadoop 0.18.3
15
Elastic MapReduce
• No need to explicitly allocate, start and shutdown EC2 instances
• Individual jobs used to be managed by a remote script running on the master node; this is no longer required
• Jobs are arranged into a job flow, created with a single command
• Status of a job flow and all its steps are accessible by a REST service
16
Before Elastic MapReduce
• Allocate/Verify cluster
• Push application to cluster
• Run a control script on the master
• Kick off each job step on the master
• Create and detect a job completion token
• Shut the cluster down
17
After Elastic MapReduce
• With Elastic MapReduce we can do all this with a single local command
• Uses jar and conf files stored on S3
• Various monitoring tools for EC2 and S3 are provided
18
Development Environment
• Cheap to set up on Amazon
• Quick setup - Number of servers is controlled by a config variable
• Identical to production
• Separate development account recommended
19
Cost comparison
• Average EC2 and S3 cost
– Each run is 2 to 3 hours
– $1200/month for EC2
– $100/month for S3
• Projected in-house cost
– $5000/month for a local cluster of 50 nodes running 24/7
– A new company would also need to add data-center and operations personnel expense
20
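The slide's figures imply roughly a 4X monthly saving even before staffing costs. A quick arithmetic check (the monthly amounts come from the slide; the helper method is hypothetical):

```java
public class CostCheck {
    // Monthly cloud cost: EC2 plus S3 (figures from the slide).
    static int cloudMonthly() {
        return 1200 + 100; // $1200 for EC2 + $100 for S3
    }

    public static void main(String[] args) {
        int inHouse = 5000; // 50-node local cluster running 24/7
        System.out.println("Cloud:    $" + cloudMonthly() + "/month");
        System.out.println("In-house: $" + inHouse + "/month");
        System.out.println("Savings:  $" + (inHouse - cloudMonthly())
                + "/month, before data-center and operations staff");
    }
}
```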
Summary
• Dev tools are really easy to work with and just work right out of the box
• Standard Hadoop AMI worked great
• Easy to write unit tests for MapReduce
• Hadoop community support is great.
• EC2/S3/EMR are cost effective
The End
5 minutes of question time
starts now!