Download - MarketShare Big Data Analytics
![Page 1: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/1.jpg)
© 2011 MarketShare. All Rights Reserved. Confidential & Proprietary.
MarketShare Big Data AnalyticsAn Big Data Analytics architecture for the cloud
![Page 2: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/2.jpg)
>50 TB analytics pipeline on Amazon Web Services using Hadoop
With and Without Buzzwords
Big Data analytics on the Cloud
![Page 3: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/3.jpg)
3
• Cloud architecture evolution
• A real life big data workflow on the cloud
• Fixing issues in big data workflow
MarketShare: Big Data Analytics on the Cloud
MarketShare confidential and proprietary
* Source: Interbrand 2011 report
![Page 4: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/4.jpg)
Cloud + Big Data
![Page 5: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/5.jpg)
Traditional 3 tier architecture
ApplicationServer
User
AppServer
Database Server
Master
Eliminate accessibility restrictions
![Page 6: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/6.jpg)
Moving to web based applications
ApplicationServer(s)
User
AppServer
Database Server(s)
Master
WebServer
Corporate Data Center
Laptop
StorageServer(s)
Eliminate Hardware Expenditure
![Page 7: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/7.jpg)
Moving to the cloud
User
WebServer
Amazon Web Services Cloud
Laptop
EC2 Sharded Instancewith MySQL
EC2 Instance
AppServer
EC2 Instance
Eliminate Distributed Storage
![Page 8: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/8.jpg)
Moving big data to Hadoop
User
WebServer
Amazon Web Services Cloud
Laptop
EC2 Instance
AppServer
EC2 Instance
EC2 Instancewith MySQLEliminate distributed systems mgmt. HDFS Cluster
![Page 9: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/9.jpg)
Compute Elasticity
User
WebServer
Amazon EC2 permanent Instances
Laptop
EC2 Instance
AppServer
EC2 Instance
EC2 Instancewith MySQL
Amazon Elastic MapReduce
Amazon EC2 On Demand Instances
On demand hadoop instances
![Page 10: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/10.jpg)
Storage Elasticity
User
WebServer
Amazon EC2 permanent Instances
Laptop
EC2 Instance
AppServer
EC2 Instance
EC2 Instancewith MySQL
Amazon Elastic MapReduce
Amazon EC2 On Demand Instances
Managed Database ServicesAmazon Simple Storage Service
(S3)
Amazon Managed Storage
RDS Database Instance
![Page 11: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/11.jpg)
Network Elasticity
UserLaptop
WebServer
Amazon EC2 permanent Instances
EC2 Instance
AppServer
EC2 Instance Amazon Elastic MapReduce
Amazon EC2 On Demand Instances
Network ElasticityAmazon Simple Storage Service
(S3)
Amazon Managed Storage
RDS Database Instance
Elastic LoadBalancer
WebServer
Amazon EC2 permanent Instances
EC2 Instance
AppServer
EC2 Instance
![Page 12: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/12.jpg)
Cloud = Managed Storage + Network Elasticity + On Demand Compute
Defining the Cloud
![Page 13: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/13.jpg)
An Example Big Data Workflow
![Page 14: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/14.jpg)
© 2008 jovianDATA , Proprietary & Confidential, DO NOT Copy or Distribute jovianDATATM
Customer Data
Repository
Raw Data Repositor
y
Generic Schema Funnel LFIS
LUIS
FTP
Transformation
Validation
Sequencing Validation
Summarization
Validation
transformation
Report Generation
and Visualization
StatsStatistical Analysis
Raw Data Repository
SCP
hadoop fs -put
GUID propagation
Cleaning & truncation
Joining with dimensions
Transform in Generic Schema
Provision Cluster
Terminate Cluster
Generic Schema--Exception Handling!<cmd>
Add Nodes
Rebalance Partitions
Restart
MR job not launched
Mapred job timeout 95% done
Hadoop stopped responding
Provision Cluster
Estimate Cluster Size
Verify Cluster Topology
Fix Cluster & Deploy Binaries
....................................
....................................
-- GUID Propagation
CREATE TABLE userid_guid_map AS SELECT distinct user_id, guid from dbact_guid;
CREATE TABLE mapcount AS SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map GROUP BY user_id;
CREATE TABLE impr_guid AS SELECTa.*, m.guidFROM impr i JOIN userid_guid_map m ON (i.user_id=m.user_id);
SELECT COUNT(*) FROM impr_guid;
....................................
....................................
....................................
Transformation Validation
![Page 15: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/15.jpg)
The big picture
![Page 16: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/16.jpg)
© 2011 MarketShare. All Rights Reserved. Confidential & Proprietary.
Data Management on the CloudBig Data + Dynamic Provisioning
![Page 17: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/17.jpg)
17
A dynamically provisioned commodity cluster of virtual machines with the following characteristics• Infinite
• A large number of nodes can be commissioned in minutes• Taxi Meter
• Most services are billed at an hourly usage level
Defining the cloud for databases
![Page 18: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/18.jpg)
18
• When should resources be added to a data processing system?• Partition management for Low Cost on the cloud
• How many of these resources should be permanent?• Materialization with Intermittent Scalability
• Where should these resources be added in the stack?• Replication to Improve Query Performance
3 new problems for database query processing
![Page 19: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/19.jpg)
Transforming Data to Actionable Insights
HighMediumLow
Engagement
Campaign Heat MapCube Materialization
Publishers Geograp
hy
Time Incremental updates Multi-dimensional
indexes Multi-dimensional
partitions
![Page 20: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/20.jpg)
Generic Schema
Dimension Level Level Level LevelTIME YEAR MONTH DAY HOUR
TIME_WEEKLY YEAR DAY_NAME
GEO COUNTRY STATE
PUBLISHER PUBLISHER SITE_NAME SITE_TYPE
PUBLISHER_PLACEMENT SITE_NAME PLACEMENT
CAMPAIGN CAMPAIGN_NAME CAMPAIGN_DESC INDUSTRY_SEGMENT
AD SIZE HEIGHT WIDTH
ADVERTISER ADVERTISER_NAME
FLIGHT FLIGHT_NAME FLIGHT_START_DATE FLIGHT_END_DATE FLIGHT_CREATIVE_ID
RID
USERID GENDER AGE_BUCKET AGE
![Page 21: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/21.jpg)
Life of a Query
Access Partitions
OptimizeQuery
IdentifyPartitions
Query
ConsolidateResults
21
![Page 22: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/22.jpg)
Tuple Access Layer YEAR MONTH DAY HOUR COUNTRY WEEK STATE PUBLISHE
RSITE_NAME
CAMPAIGN_NAME
CAMPAIGN_DESC
SIZE HEIGHT FLIGHT_NAME
FLIGHT_START_DATE
FLIGHT_END_DATE
FLIGHT_CREATIVE_ID
2009 June 3 1 US * * AOL Autos
* * * * * Ford * * Banner
MONTH DAY HOUR COUNTRY STATE WEEK
June 3 1 US * *
YEAR PUBLISHER SITE_NAME CAMPAIGN_NAME
CAMPAIGN_DESC
SIZE HEIGHT FLIGHT_NAME
FLIGHT_START_DATE
FLIGHT_END_DATE
FLIGHT_CREATIVE_ID
2009 AOL Autos
* * * * * Ford * * Banner
Locate the partition for the fast dimension values
Note here that STATE is set to ‘*’ but it has already been materializedTherefore, we eliminate any sum across all states
On the local partition, do the aggregation to calculate the ‘*’
![Page 23: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/23.jpg)
24
5 types of partitions maintained in the cloud
Exclusive
EC2 nodes that are allocated for specific keys
Permanent
EC2 nodes that are permanently allocated to service queries
Archive
S3 storage that is not accessible directly by the query engine
Intermittent
EC2 nodes that are allocated by the loading engine
Temporary
EC2 nodes that host temporary replicas
![Page 24: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/24.jpg)
25
Taxi MeterMaking resources transient
![Page 25: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/25.jpg)
Usage Patterns• Usage patterns vary throughout the day and throughout the week• A couple of periods of heavy usage daily, followed by moderate to low usage
4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm
Data LoadingCube Materialization
Users review standard reports looking for campaign exceptions
Users run ad hoc queries to understand campaign exceptions
Users make any necessary campaign adjustments
Quick review of key reports by users prior to heading home
Com
putin
g R
esou
rces
![Page 26: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/26.jpg)
Traditional Computing Approach• Traditional computing approach buys enough computing resources to meet peak usage
demand• Even many cloud “solutions” provide only the peak computing power option with no way
to dynamically reallocate the computing resources to match the current usage demand• Result: Substantial waste in computing resources and money
4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm
Com
putin
g R
esou
rces
$$ $$ $$ $$
Maximum Computing Resources
![Page 27: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/27.jpg)
“Adaptive” Computing Economics• Finely matching computing resources to user usage patterns can provide a 50% to 90%
cost savings versus the traditional computing resource allocation approach• Result: Lower cost with improvements in availability and performance
4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm
Com
putin
g R
esou
rces
$$ $$ $$ $$
Maximum Computing Resources
Adaptive Computin
g Resources
![Page 28: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/28.jpg)
29
Intermittent scalabilityUsing large number of nodes during load time
![Page 29: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/29.jpg)
Managing CapEx with Role Based Clusters
SINGLECLUSTER FOR
DATA CLEANSING, LOAD AND QUERY
15TB100 NODES
Monthly Cost = $28,800
![Page 30: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/30.jpg)
Role Based Clusters
BUILD CUBE
HIBERNATE CUBE
QUERY READY PARTITIONS
UI Ad Server Data, Search Engine Data
DATA CLEANSING
2 hours daily for load on 10 nodesQuery on 5 nodes
Monthly Cost = $2,052
![Page 31: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/31.jpg)
32
Selective replication for hot partitions
![Page 32: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/32.jpg)
Partition level query slowdown• Dynamic statistics
• The query execution system logs status for each partition• If a particular partition is regularly lagging behind, it is marked for replication
• Static statistics• The query execution system identifies skews in specific partitions• Partitions with size skew etc are marked for replication
Operational (EC2)
P1 P2
P3 P4 P5 P6
P1 P2 P3 P4
P5 P6
Partition Size Average Execution Time
P1 1MB 1.2s
P2 2MB
P3 1.5MB 60s
…
![Page 33: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/33.jpg)
Fixing partition level slowdown• If the query execution system detects SLA violations• Adds two new temporary nodes (Temp 1)• Creates new replicas for the ‘hot’ partitions
Operational (EC2)
P1
Node 1
P2
P3 P4 P5
Node 2
P6
P1 P2 P3
Node 3
P4
P5 P6
Temporary (EC2)
P3
Temp 1
P6
P3
P6
![Page 34: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/34.jpg)
Key level query slowdown• Key Level Dynamic statistics
• A particular key takes time for materializing various facets of the cube
Operational (EC2)
P1 P2
P3 P4 P5 P6
P1 P2 P3 P4
P5 P6
Keys Size Average Execution Time
P3. K1 20MB 120s
…
![Page 35: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/35.jpg)
Fixing partition level slowdown• If the query execution system detects SLA violations for a particular key• Adds a new temporary node (Temp 2)• Denormalizes the key such that all data for that key is materialized
Operational (EC2)
P1
Node 1
P2
P3 P4 P5
Node 2
P6
P1 P2 P3
Node 3
P4
P5 P6
Temporary (EC2)
KN
Materialized
Node 5
![Page 36: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/36.jpg)
37
Partitions can be in 5 different states
Isolated (EC2)
Operational (EC2)
Data Retrieval Module
Access replicas if base partition is overwhelmed
Access base partitions is not materialized
Load (temporary EC2)
Create replicas if partition is hot
Isolate keys on separate machines
Retrieve data from materialized
Post load, archive the partitions
Replicated (temporary EC2)
Archive (S3/EBS)
![Page 37: MarketShare Big Data Analytics](https://reader036.vdocuments.us/reader036/viewer/2022081503/56816649550346895dd9c3ac/html5/thumbnails/37.jpg)
• Lots of challenges in cloud + modeling• Collaboration opportunities• We are hiring!
Next Steps