Boston Hadoop Meetup, April 26 2012

The Proliferation of Database Systems and the Data Silo Problem. @daniel_abadi, Yale University / Hadapt. April 26th, 2012

Upload: daniel-abadi

Post on 07-May-2015

DESCRIPTION

Daniel Abadi presentation at the Boston Hadoop Meetup held on April 26, 2012.

TRANSCRIPT

Page 1: Boston Hadoop Meetup, April 26 2012

The Proliferation of Database Systems and the Data Silo Problem

@daniel_abadi

Yale University / Hadapt

April 26th, 2012

Page 2: Boston Hadoop Meetup, April 26 2012

In The Old Days …

Database

Page 3: Boston Hadoop Meetup, April 26 2012

In The Old Days …

Data Warehouse

Database

Database

Database

External Data

ETL Tools

Data Integration Tools

MDM Tools

Data Governance Tools

Page 4: Boston Hadoop Meetup, April 26 2012

One Size Does Not Fit All

Transactional Databases
– Single digit millisecond latencies, and high throughput
– Store data in rows
– Heavy on flash and main memory
– Indexing is very important
– High availability extremely important

Page 5: Boston Hadoop Meetup, April 26 2012

One Size Does Not Fit All

Analytical Databases
– Single digit second latencies (and higher)
– Store data in columns
– Scale out commodity hardware
– Still need magnetic disk
– Indexing less important
– High availability less important
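The row versus column contrast on these two slides can be sketched in a few lines. This is only an illustration of the two layouts, not how any particular product stores data; the table, its column names, and the helper functions are all hypothetical.

```python
# Hypothetical three-record table; the column names are illustrative only.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.5},
    {"order_id": 3, "customer": "a", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def get_order(index):
    """Transactional-style access: fetch one whole record from the row layout."""
    return rows[index]


def total_amount():
    """Analytical-style access: scan a single attribute in the column layout."""
    return sum(columns["amount"])
```

In the row layout a single record comes back in one lookup, while the aggregate only has to touch the `amount` list in the column layout and never reads the other attributes.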

Page 6: Boston Hadoop Meetup, April 26 2012

One Size Does Not Fit All

Streaming Databases
– Continuous queries
– Data flows through the system
– Network latencies are paramount
– Drop data to deal with load
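"Drop data to deal with load" can be pictured as a bounded buffer that refuses new tuples once it is full. The slide does not say which tuples a streaming system drops, so the drop-newest policy below is an assumption, and `SheddingBuffer` is a hypothetical name used only for this sketch.

```python
from collections import deque


class SheddingBuffer:
    """Bounded buffer that drops incoming items when full (assumed policy)."""

    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity
        self.dropped = 0  # number of items shed so far

    def offer(self, item):
        """Accept the item if there is room; otherwise drop it and count it."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def poll(self):
        """Hand the oldest buffered item to the query, or None if empty."""
        return self.items.popleft() if self.items else None


# Five tuples arrive before the consumer reads anything; capacity is three.
buffer = SheddingBuffer(capacity=3)
accepted = [buffer.offer(i) for i in range(5)]
```

The producer is never blocked here, which is the point of the trade: latency stays bounded and the query runs on a subset of the input.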

Page 7: Boston Hadoop Meetup, April 26 2012

Therefore, in my PhD years alone …

Aurora and Borealis projects became StreamBase

C-Store project became Vertica

H-Store project became VoltDB

Page 8: Boston Hadoop Meetup, April 26 2012

Right Tool for the Job

Page 9: Boston Hadoop Meetup, April 26 2012

What We Have Now …

Transactional DBMS

Transactional DBMS

Web DBMS (like MySQL)

Web Logs

Reporting and Dashboarding Data Warehouse

Analytical Datamart

High Performance Column-Store

Analytical DBMS

NoSQL NewSQL

Streaming DBMS

Hadoop

OLAP Database

Page 12: Boston Hadoop Meetup, April 26 2012

What This Leads To…

Very little data provenance

Data silos

Non-identical data copies

Not even close to a single version of the truth

Page 13: Boston Hadoop Meetup, April 26 2012

A Potential Way Towards a Solution

Hadoop

Data Streaming (Hstreaming, Flume)

NoSQL & Simple Xacts & Short Request Processing (HBase, Brisk)

Data Analysis DBMS (Hive, Hadapt)

Page 14: Boston Hadoop Meetup, April 26 2012

What This Has Potential to Enable

Fewer data silos

Increased data provenance

Reduced systems management overhead

Better resource utilization and management

Page 15: Boston Hadoop Meetup, April 26 2012

But We Still Need

Hadoop-based data integration tools

MDM and data governance tools for Hadoop

Data provenance tracking across Hadoop projects