wednesday, october 6, 2010 · presentation outline 1. defining the platform bi: science for...

Wednesday, October 6, 2010

TRANSCRIPT

Page 1: Wednesday, October 6, 2010 · Presentation Outline 1. Defining the Platform BI: Science for Profit Need tools for whole research cycle SQL Server 2008 R2: defining the platform

Evolving a New Analytical Platform
What Works and What’s Missing

Jeff Hammerbacher
Chief Scientist, Cloudera
October 10, 2010

My Background
Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
  ▪ Nearly 30 amazing engineers and data scientists
  ▪ Several open source projects and research papers
▪ Founder of Cloudera
  ▪ Chief Scientist
▪ Also, check out the book “Beautiful Data”

Presentation Outline
▪ 1. Defining the Platform
  ▪ BI: Science for Profit
  ▪ Need tools for whole research cycle
  ▪ SQL Server 2008 R2: defining the platform
▪ 2. State of the Platform Ecosystem
▪ 3. Foundations for a New Implementation
  ▪ HDFS and MapReduce
  ▪ Evolution of Hadoop
▪ 4. Future Developments
▪ Questions and Discussion

1. Defining the Platform

BI is looking more like science (for profit)

Jim Gray: Science entering Fourth Paradigm

“We have to do better at producing tools to support the whole research cycle”

RDBMS only a small part of this tool set

Example: SQL Server 2008 R2

RDBMS: SQL Server
ETL: SQL Server Integration Services
Reporting: SQL Server Reporting Services
Analysis: SQL Server Analysis Services
Search: Full-Text Search
CEP: StreamInsight
OLAP: PowerPivot
MDM: Master Data Services
Collaboration: SharePoint

What do we call this unified suite?

LAMP Stack for Analytical Data Management

For today: Analytical Data Platform

2. The State of the Platform Ecosystem

Who makes up the platform ecosystem?

Platform Providers
Infrastructure Providers
Application Developers
End Users
Content Providers

What is new about the ecosystem today?

Content Providers
1. > 95% of enterprise data is unstructured
2. Data volumes growing rapidly

Infrastructure Providers
1. Cloud
2. Warehouse-Scale Computers

Platform Providers
1. Open source
2. Driven by consumer web properties

Application Developers
1. Data Scientists
2. Diversity of languages

End Users
1. Browser is the client
2. Tell a story about the business

3. Foundations for a New Implementation

New foundations: HDFS and MapReduce

2005: Doug/Mike start project inside Nutch

2006: Doug joins Yahoo!

2007: Make Hadoop scale
Jim Gray’s “Fourth Paradigm” lecture
Yahoo! makes Pig open source
Randy Bryant’s “DISC” lecture
Powerset makes HBase open source

2008: Make Hadoop fast
First Hadoop Summit
Yahoo! wins Daytona terabyte sort benchmark
Yahoo! builds production webmap with Hadoop
Facebook makes Hive open source
“MapReduce: A Major Step Backwards”

2009: Insert Hadoop into the enterprise
Cloudera releases CDH
First Hadoop World NYC
Yahoo! sorts a petabyte with Hadoop
Cloudera adds training, support, services
“The Unreasonable Effectiveness of Data”

2010: Integrate Hadoop into the enterprise
IBM announces InfoSphere BigInsights
Yahoo! completes enterprise-class security
Datameer and Karmasphere funded
Quest, Talend, Netezza, and more integrate
Cloudera releases Cloudera Enterprise

Hadoop will be an Analytical Data Platform

4. Future Developments

Capture: Web and Intranet Documents

Curate: Unified Metadata

Curate: Workflow and Scheduling

Curate: Indexes and Materialized Views

Curate: Learn Structure from Data

Analyze: Mesos-enabled frameworks

Analyze: Link working set and historical data

Analyze: Iterative in-memory analysis

Analyze: Low-latency queries on Avro data

All behind a single user interface

Hue
Making Many Computers Feel Like One

(c) 2010 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
