building a big data solution


Upload: james-serra

Post on 14-Jul-2015


Page 1: Building a Big Data Solution

Building a Big Data solution

“Building an Effective Data Warehouse Architecture with Hadoop, the cloud, and MPP”

James Serra, Big Data Evangelist

Microsoft

[email protected]

Page 2: Building a Big Data Solution

Other Presentations

Building an Effective Data Warehouse Architecture: Reasons for building a DW and the various approaches and DW concepts (Kimball vs. Inmon)

Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud and MPP): Explains what Big Data is, its benefits including use cases, and how Hadoop, the cloud, and MPP fit in

Finding business value in Big Data (What exactly is Big Data and why should I care?): Very similar to “Building a Big Data Solution” but target audience is business users/CxO instead of architects

How does Microsoft solve Big Data?: Covers the Microsoft products that can be used to create a Big Data solution

Modern Data Warehousing with the Microsoft Analytics Platform System: The next step in data warehouse performance is APS, an MPP appliance

Power BI, Azure ML, Azure HDInsight, Azure Data Factory, etc.: Deep dives into the various Microsoft Big Data related products

Page 3: Building a Big Data Solution

About Me

Business Intelligence Consultant, in IT for 28 years

Microsoft, Big Data Evangelist

Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer

Been perm, contractor, consultant, business owner

Presenter at PASS Business Analytics Conference and PASS Summit

MCSE for SQL Server 2012: Data Platform and BI

Blog at JamesSerra.com

SQL Server MVP

Author of book “Reporting with Microsoft SQL Server 2012”

Page 4: Building a Big Data Solution

I tried building a Big Data solution…

And ended up passed-out drunk in a Denny’s parking lot

Let’s prevent that from happening…

Page 5: Building a Big Data Solution

Agenda

Review of Building an Effective Data Warehouse Architecture

Overview of Big Data and Analytics

Use cases

Data Lake

Hadoop and its role

IoT and real-time data

Modern data warehouse

Federated querying

DW and the cloud

Symmetric Multiprocessing (SMP) vs. Massively Parallel Processing (MPP)

Page 6: Building a Big Data Solution

Review of Building an Effective Data Warehouse Architecture

Page 7: Building a Big Data Solution

What is a Data Warehouse and why use one?

A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". It is NOT to be used for OLTP applications.

Reasons for a data warehouse:

Reduce stress on production system

Optimized for read access, sequential disk scans

Integrate many sources of data

Keep historical records (no need to save hardcopy reports)

Restructure/rename tables and fields, model data

Protect against source system upgrades

Use Master Data Management, including hierarchies

No IT involvement needed for users to create reports

Improve data quality and plug holes in source systems

One version of the truth

Easy to create BI solutions on top of it (e.g., SSAS cubes)

Previous presentation “Building an Effective Data Warehouse Architecture”:

http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/532

http://www.slideshare.net/jamserra/data-warehouse-architecture-16065902

Page 8: Building a Big Data Solution

Why use a Data Warehouse?

Legacy applications + databases = chaos

Production Control

MRP

Inventory Control

Parts Management

Logistics

Shipping

Raw Goods

Order Control

Purchasing

Marketing

Finance

Sales

Accounting

Management Reporting

Engineering

Actuarial

Human Resources

Continuity

Consolidation

Control

Compliance

Collaboration

Enterprise data warehouse = order

Single version of the truth

Enterprise Data Warehouse

Every question = decision

Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not do before

Page 9: Building a Big Data Solution

Data Warehouse Hybrid Model

Advice: Use SQL Server Views to interface between each level in the model

In the DW Bus Architecture, each data mart could be a schema (broken out by business process subject areas), all in one database.

Another option is to have each data mart in its own database with all databases on one server or spread among multiple servers.

Also, the staging areas, CIF, and DW Bus can all be on the same powerful server (MPP)
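
The advice to put views between levels can be sketched as follows. This is only a minimal illustration that uses Python's built-in sqlite3 module in place of SQL Server; the table, view, and column names (stg_orders, v_orders, and so on) are hypothetical and not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging level: raw table as loaded from a source system (hypothetical names).
conn.execute("CREATE TABLE stg_orders (ord_id INTEGER, cust_nm TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [(1, "Acme", 100.0), (2, "Acme", 50.0), (3, "Globex", 75.0)],
)

# The view is the interface the next level reads from. If the staging table
# is later restructured, only the view definition has to change.
conn.execute(
    """
    CREATE VIEW v_orders AS
    SELECT ord_id AS order_id, cust_nm AS customer_name, amt AS amount
    FROM stg_orders
    """
)

# The downstream level queries the view, never the staging table directly.
totals = conn.execute(
    "SELECT customer_name, SUM(amount) FROM v_orders "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(totals)
```

Because consumers only see v_orders, renaming a staging column means redefining one view rather than touching every report.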

Page 10: Building a Big Data Solution

Data Warehouse Architecture

How does “Big Data” change this architecture?

Page 11: Building a Big Data Solution

Overview of Big Data and Analytics

Page 12: Building a Big Data Solution

What differentiates today’s thriving organizations?

Data.

Page 13: Building a Big Data Solution

What is Big Data, really?

Data in all forms & sizes is being generated faster than ever before

Capture & combine it for new insights & better, faster decisions

Page 14: Building a Big Data Solution

Harness the growing and changing nature of data

Collect any data: streaming, structured, unstructured

Challenge is combining transactional data stored in relational databases with less structured data

Big Data = All Data

Get the right information to the right people at the right time in the right format

Page 15: Building a Big Data Solution

An illustration of the velocity of data created

Kalakota, R. (2012, October 22). Sizing “Mobile + Social” Big Data Stats. Retrieved from http://practicalanalytics.wordpress.com/

Page 16: Building a Big Data Solution

The three V’s

Page 17: Building a Big Data Solution

Technology innovation accelerates value

Chart (axes: Value vs. Innovation). Labels: Transactional systems, ETL, Operational reporting, Complex implementations, Enterprise data warehouse, Spreadmarts, Siloed data, Dashboards, Ad hoc analysis, OLAP, In-memory, Any data, Hadoop, Machine learning, Internet of Things

Page 18: Building a Big Data Solution

Discover and connect

Answering new questions

Value

Page 19: Building a Big Data Solution


Put data to work for everyone in your organization

Inspire innovation

Accelerate decision-making

Learn from & share insights

Page 20: Building a Big Data Solution

Embrace Big Data across your business

Sales: Improve revenue performance

HR: Maximize employee engagement

Marketing: Build deeper customer relationships

Finance: Impact your company’s bottom line

(Sample dashboard charts: Units Sold, Discounts, and Profit before Tax; Revenue and Target by Region; Departments Headcount; XT2000 Status List)

Page 21: Building a Big Data Solution


The Data Divide

80% of data stored

70% of data generated by customers

<0.5% being operationalized

0.5% being analyzed

3% prepared for analysis

Page 22: Building a Big Data Solution

Major Fail

Gartner: “Through 2017, 60% of big-data projects will fail to go beyond piloting and experimentation”

Paradigm4: 76% of those who have used Hadoop or Apache Spark complained of significant limitations

Page 23: Building a Big Data Solution

Analytics Solution

Capture and integrate data from multiple internal and external sources

Derive insight from data with rich, interactive dashboards and reports using the tools you know

Put insight into action to increase efficiency and constituent satisfaction

Page 24: Building a Big Data Solution

Advanced Analytics Defined

Page 25: Building a Big Data Solution

The end result of Big Data - Icing on the cake

Page 26: Building a Big Data Solution

Use Cases

Page 27: Building a Big Data Solution

Let’s set off light bulbs in your head

Page 28: Building a Big Data Solution

Data Analytics is needed everywhere

• Recommendation engines

• Smart meter monitoring

• Equipment monitoring

• Advertising analysis

• Life sciences research

• Fraud detection

• Healthcare outcomes

• Weather forecasting for business planning

• Oil & Gas exploration

• Social network analysis

• Churn analysis

• Traffic flow optimization

• IT infrastructure & Web App optimization

• Legal discovery and document archiving

• Intelligence Gathering

• Location-based tracking & services

• Pricing Analysis

• Personalized Insurance

Page 29: Building a Big Data Solution

Personalized Insurance

Personalized policies can reduce costs & better meet customer needs

Insurance companies can help (and some have already started helping) their customers with truly personalized insurance plans tailored to their needs and risks

Insurance companies can collect real-time data from in-car sensors and combine it with geolocation and in-house systems. With information such as distance and speed, they can provide personalized insurance offers based on driving amount, risk, and other factors, for a truly personalized plan that may often save drivers money.

$1,600/yr.: US national avg. car insurance premium

Page 30: Building a Big Data Solution

Recommendation Engines

Significantly improve up-sell and cross-sell opportunities

Retailers can use customer purchase & rating information to serve recommendations to current customers, based on similarities across many dimensions

The vast and ever-growing amount of customer purchase, rating and click data can all be collected and managed with a Hadoop-based solution, to pinpoint preferences based on purchase history and demographics, and to serve useful and compelling cross-sell and up-sell recommendations.

158: Items sold/second by Amazon.com on 11/29/2010 (Cyber Monday)

Page 31: Building a Big Data Solution

Pricing Analysis

Significantly improve sales and customer satisfaction

Retailers can use customer past purchase, preference, and demographic information to serve real-time custom pricing and instant discounts when near the store.

Retailers – whether large, small, online or in-store – can improve margins with more detailed pricing analysis. When a customer is in range of a transaction (either in the store, online or perhaps passing by), make personalized offers, real-time price quotes, or other frequent-buyer perks to help bring more customers to the store and improve repeat business.

Up to 30%: Additional price Mac users accepted for travel from Orbitz

Page 32: Building a Big Data Solution

Using Big Data to complete the picture

Page 33: Building a Big Data Solution

Data Lake

Page 34: Building a Big Data Solution

What is a data lake?

A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

• A place to store unlimited amounts of data in any format inexpensively

• Allows collection of data that you may or may not use later: “just in case”

• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”

• Complements EDW and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW

• Frees up expensive EDW resources (storage and processing), especially for data refinement

• Allows for data exploration to be performed without waiting for the EDW team to model and load the data

• Some processing is better done on Hadoop than in ETL tools like SSIS

• Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
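
The “schema on read” idea above can be illustrated with a small sketch. It assumes raw events are stored as JSON lines; the field names, sample records, and the helper read_with_schema are hypothetical and exist only for illustration.

```python
import json

# Raw events land in the lake exactly as produced; no schema is enforced on write.
raw_lines = [
    '{"device": "meter-1", "kwh": "3.5", "ts": "2015-07-01T10:00:00"}',
    '{"device": "meter-2", "kwh": 1.25}',
    '{"user": "u42", "clicked": "banner-7"}',  # a different shape entirely
]

def read_with_schema(lines, schema):
    """Apply a schema at query time: keep records that have every field
    named in the schema and coerce each field to the requested type."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}

# The schema is chosen by the reader, at the moment the question is asked.
meter_schema = {"device": str, "kwh": float}
readings = list(read_with_schema(raw_lines, meter_schema))
total_kwh = sum(r["kwh"] for r in readings)
print(readings, total_kwh)
```

A different reader could apply a clickstream schema to the same raw lines without anything being reloaded, which is the “just in time” property described above.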

Page 35: Building a Big Data Solution

Current state of a data warehouse

Traditional Approaches

DATA SOURCES: CRM, ERP, OLTP, LOB

ETL

DATA WAREHOUSE: Star schemas, views, other read-optimized structures

BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards

MONITORING AND TELEMETRY

• Well manicured, often relational sources

• Known and expected data volume and formats

• Little to no change

• Complex, rigid transformations

• Required extensive monitoring

• Transformed historical data into read structures

• Flat, canned or multi-dimensional access to historical data

• Many reports, multiple versions of the truth

• 24 to 48h delay

Page 36: Building a Big Data Solution

Current state of a data warehouse

Traditional Approaches

DATA SOURCES: CRM, ERP, OLTP, LOB

ETL

DATA WAREHOUSE: Star schemas, views, other read-optimized structures

BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards

MONITORING AND TELEMETRY

Pressures: INCREASING DATA VOLUME, NON-RELATIONAL DATA, INCREASE IN TIME, STALE REPORTING

• Increase in variety of data sources

• Increase in data volume

• Increase in types of data

• Pressure on the ingestion engine

• Complex, rigid transformations can no longer keep pace

• Monitoring is abandoned

• Delay in data, inability to transform volumes, or react to new sources

• Repair, adjust and redesign ETL

• Reports become invalid or unusable

• Delay in preserved reports increases

• Users begin to “innovate” to relieve starvation

Page 37: Building a Big Data Solution

Data Lake Transformation (ELT not ETL)

New Approaches

• All data sources are considered

• Leverages the power of on-prem technologies and the cloud for storage and capture

• Native formats, streaming data, big data

• Extract and load, no/minimal transform

• Storage of data in near-native format

• Orchestration becomes possible

• Streaming data accommodation becomes possible

• Refineries transform data on read

• Produce curated data sets to integrate with traditional warehouses

• Users discover published data sets/services using familiar tools

DATA SOURCES: CRM, ERP, OLTP, LOB

FUTURE DATA SOURCES

NON-RELATIONAL DATA

EXTRACT AND LOAD

DATA LAKE

DATA REFINERY PROCESS (TRANSFORM ON READ): Transform relevant data into data sets

OTHER REFINERY PROCESSES

DATA WAREHOUSE: Star schemas, views, other read-optimized structures

BI AND ANALYTICS: Discover and consume predictive analytics, data sets and other reports

Page 38: Building a Big Data Solution

Hadoop and its role

Page 39: Building a Big Data Solution

What is Hadoop?


Distributed, scalable system on commodity HW

Composed of a few parts:

HDFS – Distributed file system

MapReduce – Programming model

Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm

Main players are Hortonworks, Cloudera, MapR

WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
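
The MapReduce programming model named above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count example; it does not use Hadoop itself, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

# Each string stands in for one input split stored on a different node.
splits = ["big data is big", "data lake holds raw data"]
mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a real cluster the map calls run on the nodes that hold each split and the reduce calls run in parallel per key; the batch nature of that pipeline is why the warning above about real-time analysis applies.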

Diagram (Hadoop cluster of compute & storage nodes). Layers: Core Services, Operational Services, Data Services, Load & Extract. Components: HDFS, SQOOP, FLUME, NFS, WebHDFS, OOZIE, AMBARI, YARN, MAP REDUCE, HIVE & HCATALOG, PIG, HBASE, FALCON

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware

Page 40: Building a Big Data Solution

Hortonworks Data Platform 2.3

Simply put, Hortonworks ties all the open source products together (22)

Page 41: Building a Big Data Solution

The real cost of Hadoop

http://www.wintercorp.com/tcod-report/

Page 42: Building a Big Data Solution

Use cases using Hadoop and a DW in combination

Bringing islands of Hadoop data together

Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage)

Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use)

Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)

Page 43: Building a Big Data Solution

IoT and real-time data

Page 44: Building a Big Data Solution

What is the Internet of Things?

Things, Connectivity, Data, Analytics

IoT = sensor-acquired data

Page 45: Building a Big Data Solution

What is the Internet of Things (IoT)?

Internet-connected devices that can perceive the environment in some way, share their data, and communicate with you. IoT is just a catch-all term for ways of using machine-generated data to create something useful.

- Has a processor and a sensor to collect information

- Examples: heart monitoring implants, biochip transponders on farm animals, automobiles with built-in sensors, field operation devices that assist firefighters in search and rescue

- Excludes computers, tablets, and smartphones

- But it is in the sphere of business intelligence that IoT will really make a difference

Cool possibilities

- When a milk carton is almost empty it will ping you when you are near a store

- An alarm clock that signals your coffee maker to start brewing when you wake up

- An embedded chip that monitors your vital signs and notifies a medical provider if they exceed a limit

Gartner: 10 billion devices connected to the internet today, 26B by 2020

At some point in the future, nearly every manmade object will contain a device that transmits data!

Page 46: Building a Big Data Solution

Modern Data Warehouse

Page 47: Building a Big Data Solution

Modern Data Warehouse

Think about future needs:

• Increasing data volumes

• Real-time performance

• New data sources and types

• Cloud-born data

• Multi-platform solution

• Hybrid architecture

Page 48: Building a Big Data Solution

Modern Data Warehouse Defined

Page 49: Building a Big Data Solution

Modern Data Warehouse

The Dream

Page 50: Building a Big Data Solution

The Reality

Page 51: Building a Big Data Solution

Federated Querying

Page 52: Building a Big Data Solution

Federated Querying

Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse.

A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology
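
A toy version of this model is sketched here. It assumes an in-memory SQLite table standing in for a relational source and a list of JSON lines standing in for files in Hadoop; the function federated_page_views and all table and field names are hypothetical.

```python
import json
import sqlite3

# Source 1: a relational store (in-memory SQLite stands in for SQL Server, Oracle, etc.).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2: non-relational data left where it sits (JSON lines stand in for Hadoop files).
clickstream = [
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/signup"}',
    '{"customer_id": 2, "page": "/pricing"}',
]

def federated_page_views(db, raw_clicks):
    """Answer one question across both sources without copying either
    into the other: page views per customer name."""
    names = dict(db.execute("SELECT id, name FROM customers"))
    views = {}
    for line in raw_clicks:
        click = json.loads(line)
        name = names[click["customer_id"]]
        views[name] = views.get(name, 0) + 1
    return views

result = federated_page_views(db, clickstream)
print(result)
```

The caller asks one question and gets one result set; where each piece of data lives is hidden behind the function, which is the role the query model plays in a real federated engine.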

Page 53: Building a Big Data Solution

Select… Result set

Federated Querying

Relational Data

DB2

Oracle

MongoDB

SQL Server

Query Model

Non-Relational Data

Cloudera CDH Linux

Hortonworks HDP

Windows Azure HDInsight

Page 54: Building a Big Data Solution

DW and the Cloud

Page 55: Building a Big Data Solution

Can I use the cloud with my DW?

• Public and private cloud

• Cloud-born data vs on-prem born data

• Transfer cost from/to cloud and on-prem

• Sensitive data on-prem, non-sensitive in cloud

• Look at hybrid solutions

Page 56: Building a Big Data Solution

TDWI Best Practices Report (2015)

Page 57: Building a Big Data Solution

SMP vs MPP

Page 58: Building a Big Data Solution

SMP vs MPP

MPP - Massively Parallel Processing

• Uses many separate CPUs running in parallel to execute a single program

• Shared Nothing: Each CPU has its own memory and disk (scale-out)

• Segments communicate using high-speed network between nodes

SMP - Symmetric Multiprocessing

• Multiple CPUs used to complete individual processes simultaneously

• All CPUs share the same memory, disks, and network controllers (scale-up)

• All SQL Server implementations up until now have been SMP

• Mostly, the solution is housed on a shared SAN
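
The shared-nothing MPP model can be sketched as hash distribution followed by local aggregation and a final merge. This is a single-process simulation only: the loop over nodes stands in for nodes working in parallel, and NODE_COUNT, the sample sales rows, and all function names are hypothetical.

```python
from collections import defaultdict

NODE_COUNT = 4  # hypothetical number of compute nodes

def distribute(rows, key, node_count):
    """Hash-distribute rows so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[hash(row[key]) % node_count].append(row)
    return nodes

def local_sum(node_rows):
    """Runs on one node against its own rows only (no shared memory or disk)."""
    partial = defaultdict(float)
    for row in node_rows:
        partial[row["region"]] += row["amount"]
    return partial

def combine(partials):
    """Control-node step: merge the small partial results sent over the network."""
    total = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)

sales = [
    {"order_id": i, "region": "North" if i % 2 else "South", "amount": 10.0}
    for i in range(1, 9)
]
nodes = distribute(sales, "order_id", NODE_COUNT)
totals = combine(local_sum(rows) for rows in nodes)
print(totals)
```

Adding nodes shrinks each node's slice, which is the scale-out property; an SMP system would instead run the whole aggregation against one shared pool of memory and disk.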

Page 59: Building a Big Data Solution

DW SCALABILITY SPIDER CHART

Axes: “Data Volume”, “Query Data Volume”, “Query Concurrency”, “Query Complexity”, “Query Freedom”, “Data Freshness”, “Mixed Workload”, “Schema Sophistication”

Scale labels: 10 TB, 50 TB, 100 TB, 500 TB, 5 PB; 100, 1,000, 10,000; MB’s, GB’s, TB’s; 3-5 Way Joins, 5-10 Way Joins, Joins + OLAP operations + Aggregation + Complex “Where” constraints + Views + Parallelism; Simple Star, Multiple Integrated Stars, Normalized, Multiple Integrated Stars and Normalized; Batch Reporting/Repetitive Queries, Ad Hoc Queries, Data Analysis/Mining; Weekly Load, Daily Load, Near Real Time Data Feeds; Strategic, Strategic/Tactical, Strategic/Tactical/Loads, Strategic/Tactical/Loads/SLA

MPP – Multidimensional Scalability

SMP – Tunable in one dimension at the cost of other dimensions

The spiderweb depicts important attributes to consider when evaluating Data Warehousing options.

Big Data support is newest dimension.

Page 60: Building a Big Data Solution

When do you need an MPP solution?

• We need at least 3x query performance improvement

• We are near disk capacity and see a lot of growth in the upcoming years

• We need to support queries during our maintenance window

• We need to load data outside of our maintenance window

• We will spend a lot of money for FusionIO cards, SSDs, more SAN space, more memory, faster CPUs

Page 61: Building a Big Data Solution

Summary

• We live in an increasingly data-intensive world

• Much of the data stored online and analyzed today is more varied than the data stored in recent years

• More of our data arrives in near-real time

This presents a large business opportunity. Are you ready for it?

Page 62: Building a Big Data Solution

Resources

The Modern Data Warehouse: http://bit.ly/1xuX4Py

Fast Track Data Warehouse Reference Architecture for SQL Server 2014: http://bit.ly/1xuX9m6

Should you move your data to the cloud? http://bit.ly/1xuXbKU

Presentation slides for Modern Data Warehousing: http://bit.ly/1xuXcP5

Presentation slides for Building an Effective Data Warehouse Architecture: http://bit.ly/1xuXeX4

Hadoop and Data Warehouses: http://bit.ly/1xuXfu9

What is the Microsoft Analytics Platform System (APS)? http://bit.ly/1xuXipO

Parallel Data Warehouse (PDW) benefits made simple: http://bit.ly/1xuXlSy

What is Advanced Analytics? http://bit.ly/1LDklkB

Page 63: Building a Big Data Solution

Q & A?

James Serra, Big Data Evangelist

Email me at: [email protected]

Follow me at: @JamesSerra

Link to me at: www.linkedin.com/in/JamesSerra

Visit my blog at: JamesSerra.com (where this slide deck will be posted)