utilizing aster ncluster to support processing in excess of 100 billion rows per month
DESCRIPTION
TRANSCRIPT
Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month
A Plan for Large Scale Data Analytics
Will Duckworth, Director Software Engineering ([email protected])
2© comScore, Inc. Proprietary and Confidential.
Agenda
comScore – Introduction and Technology
MM360 Initiative
The Challenge
Our Analysis
Plans
3© comScore, Inc. Proprietary and Confidential.
comScore, Inc.
Founded in 1999
Publically traded on NASDAQ (SCOR)
Acquired MediaMetrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup
in 2010
Corporate headquarters: Reston, VA
– Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris
– 500+ full-time employees
Experienced senior leadership team with a unique record of innovation in the market research industry
More than 1200 clients across many industries
4© comScore, Inc. Proprietary and Confidential.
Advising Hundreds of Leading Businesses(partial list)
Internet Agencies Telecom Financial Retail Travel CPG Pharma Technology
5© comScore, Inc. Proprietary and Confidential.
Powerful Platform: Massive Database and Cost Effective Technology Infrastructure
5
Database andComputational Infrastructure
Patents
■ 3 Issued■ 24 Pending
ContinuousOperation
■ 24/7■ 99.99% Uptime
Cost Effective
System Capex< $7M/Year
ProprietaryTechnology with
Strong IP Protection
Highly Scalable,Distributed Processing
Architecture
MassiveOperational Scale
■ 1,000 TB of storage
■ 1,100 Servers■ 30 TB per month
Largest Windows Data Warehouse in the World
Sophisticated Technology to Keep Up With Internet Advancements
6© comScore, Inc. Proprietary and Confidential.
6
Even for us this is getting big…
0
2,000
4,000
6,000
8,000
10,000
12,000
6/24/2009 7/24/2009 8/24/2009 9/24/2009 10/24/2009 11/24/2009 12/24/2009 1/24/2010 2/24/2010 3/24/2010
Mill
ions
New Rows per Day (panel vs. non-panel)
beacon panel
7© comScore, Inc. Proprietary and Confidential.
Where we come from …
Our skill set came from a need to measure Win32
We chose technologies and built a core team around our mandate to have accurate consumer Internet measurement– All Intel Based– 2/3 Microsoft OS, 1/3 Linux OS– C++ Now very much a “best tool for the job” organization
8© comScore, Inc. Proprietary and Confidential.
MM360 Initiative
8
9© comScore, Inc. Proprietary and Confidential.
Internet = “The Most Measurable Medium”
How many real users?
What kind of users are they?
Which request is a valid Page View?
How long did the users spend on my site?
100% Accurate count of server requests, but…
10© comScore, Inc. Proprietary and Confidential.
Basic Problem with Servers: No Unique User ID
Unique User = Cookie ID (if Cookies can be set)or IP Address + User Agent
Web Analytics Approximation
Sounds Simple, But Major Problems:
Cookies are deleted frequently, and the same person can be counted multiple times
IP Addresses change frequently causing inflation of user counts
In any case, servers identify a machine (or a browser), which can represent multiple persons or a fraction of the usage of a single person
11© comScore, Inc. Proprietary and Confidential.
Media Metrix 360: Key Benefits for Participating Sites
Comprehensive coverage: 100% of activity– New “Universe Report” covers mobile and public machines– Census-adjusted metrics in current Media Metrix reports (Home and
Work)– Coverage Calculation for beaconing sites
Improved coverage of At-Work population
Harmonization / Reconciliation of panel vs. server
More granularity
More timely reporting
Transparency
12© comScore, Inc. Proprietary and Confidential.
The Challenge
12
13© comScore, Inc. Proprietary and Confidential.
Goals
Be able to scale to support an initial monthly volume of 160 Billion records– Store 3 months of data online
Be able to add incrementally to the environment to support growth
Support advanced analytics– 150 analysts
Support end user access to record level data, preferably through a SQL interface
Support the storage of row level data
Have yesterdays data available today
14© comScore, Inc. Proprietary and Confidential.
Existing Internal Systems
NGUA– Ability to run specific queries for a given time period very quickly because all processing is
parallelized– Currently holding 560+ days of data; 800B+ rows.– All traffic for a machine for a month – 1 minute run time (140k records)– All traffic for pizzahut.com for a month – 4 minutes run time (1.9 million records)– All traffic from google.com where toys is in the URL – 1 hour 15 minutes (400k records)
■ Fusion– Primary System used for processing and providing the data behind the majority of
comScore’s products and analysis– Runs on 32 servers– For one month we read over 8TB of compressed log files with over 40B rows– Produces 1.3 B rows and 120 GB of output for load into a DW– Can turn around the processing in less than 8 hours
Both systems leverage the same core concepts of locality to data and distributed processing
15© comScore, Inc. Proprietary and Confidential.
Aster Data nCluster
Current Aster environments– Dev: 1 Queen; 3 Workers; 650+GB total storage– Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage
Plans– Building new Prod environment 1 Queen, 70 workers and 10
Loaders / Staging servers– 350TB total storage– 432 Cores
16© comScore, Inc. Proprietary and Confidential.
Aster Data nCluster
Table design is key with data of this size– What is the end user going to do 80% of the time? On the web, no matter how clean you think your data set
is there are still going to be issues– 6 Sigma on 10 billion records a day is still nearly 35,000
“bad rows”– Staging Servers Looking at using Aster-Hadoop Data Connector for
integration with in-house Hadoop environment– Aster Data for the analysts– Hadoop for the developers
17© comScore, Inc. Proprietary and Confidential.
Critical Cost Drivers to factor into the Analysis
Data Centers– Power is the big issue at data centers today. All allocations
for power and space are based on the number of circuits and the cost per circuit are all expected to rise
Servers– Even high end servers have reached relative commodity
prices if you stay to the 2U footprint and standard components