Boston Hadoop Meetup, April 26, 2012
DESCRIPTION
Daniel Abadi's presentation at the Boston Hadoop Meetup, held on April 26, 2012.
TRANSCRIPT
The Proliferation of Database Systems and the Data Silo Problem
@daniel_abadi
Yale University / Hadapt
April 26th, 2012
In The Old Days …
Database
In The Old Days …
Data Warehouse
Database
Database
Database
External Data
ETL Tools
Data Integration Tools
MDM Tools
Data Governance Tools
One Size Does Not Fit All
Transactional Databases
– Single-digit millisecond latencies and high throughput
– Store data in rows
– Heavy on flash and main memory
– Indexing is very important
– High availability extremely important
One Size Does Not Fit All
Analytical Databases
– Single-digit second latencies (and higher)
– Store data in columns
– Scale out on commodity hardware
– Still need magnetic disk
– Indexing less important
– High availability less important
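The "store data in rows" versus "store data in columns" contrast on these two slides can be illustrated with a minimal Python sketch. The table, its field names, and its values are invented for illustration; real systems add compression, paging, and indexes that are not modeled here.

```python
# Hypothetical three-row "orders" table; names and values are illustrative only.
rows = [
    {"order_id": 1, "customer": "ann", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 45},
    {"order_id": 3, "customer": "ann", "amount": 25},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in row_store:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def lookup_order(row_store, order_id):
    """Transactional-style access: fetch one whole record.

    The row layout keeps all fields of the record together.
    """
    for row in row_store:
        if row["order_id"] == order_id:
            return row
    return None


def total_amount(column_store):
    """Analytical-style access: aggregate one attribute over every record.

    The column layout lets the scan read only the "amount" values.
    """
    return sum(column_store["amount"])
```

A point lookup such as `lookup_order(rows, 2)` favors the row layout, while a scan such as `total_amount(columns)` touches a single column and never reads `order_id` or `customer`.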
One Size Does Not Fit All
Streaming Databases
– Continuous queries
– Data flows through the system
– Network latencies are paramount
– Drop data to deal with load
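As a rough illustration of a continuous query that drops data to deal with load, the Python sketch below keeps a running average over recent tuples and discards new arrivals when its input buffer is full. The class name, buffer size, and window size are hypothetical and are not taken from Aurora, Borealis, or any other system named in the talk.

```python
from collections import deque


class SheddingAverage:
    """Continuous query: running average over the most recent tuples.

    Hypothetical sketch: when the input buffer is full, new arrivals are
    dropped (load shedding) rather than blocking the data source.
    """

    def __init__(self, buffer_size, window_size):
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.window = deque(maxlen=window_size)  # oldest values fall out
        self.dropped = 0

    def arrive(self, value):
        """Called by the data source; drops the tuple if the buffer is full."""
        if len(self.buffer) >= self.buffer_size:
            self.dropped += 1
            return False
        self.buffer.append(value)
        return True

    def process_one(self):
        """Consume one buffered tuple and return the updated average."""
        if not self.buffer:
            return None
        self.window.append(self.buffer.popleft())
        return sum(self.window) / len(self.window)
```

The data is never stored and queried later; the query stays in place and the tuples flow past it. The `dropped` counter makes the cost of shedding visible, since results computed under overload are approximate.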
Therefore, in my PhD years alone …
Aurora and Borealis projects became StreamBase
C-Store project became Vertica
H-Store project became VoltDB
Right Tool for the Job
What We Have Now …
Transactional DBMS
Transactional DBMS
Web DBMS (like MySQL)
Web Logs
Reporting and Dashboarding Data Warehouse
Analytical Datamart
High Performance Column-Store
Analytical DBMS
NoSQL
NewSQL
Streaming DBMS
Hadoop
OLAP Database
What This Leads To …
Very little data provenance
Data silos
Non-identical data copies
Not even close to a single version of the truth
A Potential Way Towards a Solution
Hadoop
Data Streaming (HStreaming, Flume)
NoSQL & Simple Xacts & Short Request Processing (HBase, Brisk)
Data Analysis DBMS (Hive, Hadapt)
What This Has Potential to Enable
Fewer data silos
Increased data provenance
Reduced systems management overhead
Better resource utilization and management
But We Still Need
Hadoop-based data integration tools
MDM and data governance tools for Hadoop
Data provenance tracking across Hadoop projects