utilizing aster ncluster to support processing in excess of 100 billion rows per month

17
Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month A Plan for Large Scale Data Analytics Will Duckworth, Director Software Engineering ([email protected])

Upload: teradata-aster

Post on 31-Oct-2014

2.078 views

Category:

Documents


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

A Plan for Large Scale Data Analytics

Will Duckworth, Director Software Engineering ([email protected])

Page 2: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

2© comScore, Inc. Proprietary and Confidential.

Agenda

comScore – Introduction and Technology

MM360 Initiative

The Challenge

Our Analysis

Plans

Page 3: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

3© comScore, Inc. Proprietary and Confidential.

comScore, Inc.

Founded in 1999

Publically traded on NASDAQ (SCOR)

Acquired MediaMetrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup

in 2010

Corporate headquarters: Reston, VA

– Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris

– 500+ full-time employees

Experienced senior leadership team with a unique record of innovation in the market research industry

More than 1200 clients across many industries

Page 4: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

4© comScore, Inc. Proprietary and Confidential.

Advising Hundreds of Leading Businesses(partial list)

Internet Agencies Telecom Financial Retail Travel CPG Pharma Technology

Page 5: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

5© comScore, Inc. Proprietary and Confidential.

Powerful Platform: Massive Database and Cost Effective Technology Infrastructure

5

Database andComputational Infrastructure

Patents

■ 3 Issued■ 24 Pending

ContinuousOperation

■ 24/7■ 99.99% Uptime

Cost Effective

System Capex< $7M/Year

ProprietaryTechnology with

Strong IP Protection

Highly Scalable,Distributed Processing

Architecture

MassiveOperational Scale

■ 1,000 TB of storage

■ 1,100 Servers■ 30 TB per month

Largest Windows Data Warehouse in the World

Sophisticated Technology to Keep Up With Internet Advancements

Page 6: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

6© comScore, Inc. Proprietary and Confidential.

6

Even for us this is getting big…

0

2,000

4,000

6,000

8,000

10,000

12,000

6/24/2009 7/24/2009 8/24/2009 9/24/2009 10/24/2009 11/24/2009 12/24/2009 1/24/2010 2/24/2010 3/24/2010

Mill

ions

New Rows per Day (panel vs. non-panel)

beacon panel

Page 7: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

7© comScore, Inc. Proprietary and Confidential.

Where we come from …

Our skill set came from a need to measure Win32

We chose technologies and built a core team around our mandate to have accurate consumer Internet measurement– All Intel Based– 2/3 Microsoft OS, 1/3 Linux OS– C++ Now very much a “best tool for the job” organization

Page 8: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

8© comScore, Inc. Proprietary and Confidential.

MM360 Initiative

8

Page 9: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

9© comScore, Inc. Proprietary and Confidential.

Internet = “The Most Measurable Medium”

How many real users?

What kind of users are they?

Which request is a valid Page View?

How long did the users spend on my site?

100% Accurate count of server requests, but…

Page 10: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

10© comScore, Inc. Proprietary and Confidential.

Basic Problem with Servers: No Unique User ID

Unique User = Cookie ID (if Cookies can be set)or IP Address + User Agent

Web Analytics Approximation

Sounds Simple, But Major Problems:

Cookies are deleted frequently, and the same person can be counted multiple times

IP Addresses change frequently causing inflation of user counts

In any case, servers identify a machine (or a browser), which can represent multiple persons or a fraction of the usage of a single person

Page 11: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

11© comScore, Inc. Proprietary and Confidential.

Media Metrix 360: Key Benefits for Participating Sites

Comprehensive coverage: 100% of activity– New “Universe Report” covers mobile and public machines– Census-adjusted metrics in current Media Metrix reports (Home and

Work)– Coverage Calculation for beaconing sites

Improved coverage of At-Work population

Harmonization / Reconciliation of panel vs. server

More granularity

More timely reporting

Transparency

Page 12: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

12© comScore, Inc. Proprietary and Confidential.

The Challenge

12

Page 13: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

13© comScore, Inc. Proprietary and Confidential.

Goals

Be able to scale to support an initial monthly volume of 160 Billion records– Store 3 months of data online

Be able to add incrementally to the environment to support growth

Support advanced analytics– 150 analysts

Support end user access to record level data, preferably through a SQL interface

Support the storage of row level data

Have yesterdays data available today

Page 14: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

14© comScore, Inc. Proprietary and Confidential.

Existing Internal Systems

NGUA– Ability to run specific queries for a given time period very quickly because all processing is

parallelized– Currently holding 560+ days of data; 800B+ rows.– All traffic for a machine for a month – 1 minute run time (140k records)– All traffic for pizzahut.com for a month – 4 minutes run time (1.9 million records)– All traffic from google.com where toys is in the URL – 1 hour 15 minutes (400k records)

■ Fusion– Primary System used for processing and providing the data behind the majority of

comScore’s products and analysis– Runs on 32 servers– For one month we read over 8TB of compressed log files with over 40B rows– Produces 1.3 B rows and 120 GB of output for load into a DW– Can turn around the processing in less than 8 hours

Both systems leverage the same core concepts of locality to data and distributed processing

Page 15: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

15© comScore, Inc. Proprietary and Confidential.

Aster Data nCluster

Current Aster environments– Dev: 1 Queen; 3 Workers; 650+GB total storage– Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage

Plans– Building new Prod environment 1 Queen, 70 workers and 10

Loaders / Staging servers– 350TB total storage– 432 Cores

Page 16: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

16© comScore, Inc. Proprietary and Confidential.

Aster Data nCluster

Table design is key with data of this size– What is the end user going to do 80% of the time? On the web, no matter how clean you think your data set

is there are still going to be issues– 6 Sigma on 10 billion records a day is still nearly 35,000

“bad rows”– Staging Servers Looking at using Aster-Hadoop Data Connector for

integration with in-house Hadoop environment– Aster Data for the analysts– Hadoop for the developers

Page 17: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month

17© comScore, Inc. Proprietary and Confidential.

Critical Cost Drivers to factor into the Analysis

Data Centers– Power is the big issue at data centers today. All allocations

for power and space are based on the number of circuits and the cost per circuit are all expected to rise

Servers– Even high end servers have reached relative commodity

prices if you stay to the 2U footprint and standard components