kansas city big data: the future of insights - keynote: "big data technologies and...
DESCRIPTION
Kansas City IT Professionals, a grassroots tech community of 9,000+ members, held an event on August 30th, 2012 entitled Big Data: The Future Of Insights (see: http://kcitp.me/M67S9M). The event consisted of 2 keynotes and a panel with expert data scientists, engineers, and data analysts from companies like Adknowledge and Cerner. This talk, entitled "Big Data Technologies and Tools", was delivered by Ryan Brush, Distinguished Engineer with Cerner.
TRANSCRIPT
![Page 1: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/1.jpg)
Big Data Technologies and Techniques
Ryan Brush
Distinguished Engineer, Cerner Corporation
@ryanbrush
![Page 2: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/2.jpg)
Relational Databases are Awesome
![Page 3: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/3.jpg)
Relational Databases are Awesome
Atomic, transactional updates
Declarative queries
Guaranteed consistency
Easy to reason about
Long track record of success
![Page 4: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/4.jpg)
Relational Databases are Awesome
…so use them!
![Page 5: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/5.jpg)
Relational Databases are Awesome
…so use them!
But…
![Page 6: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/6.jpg)
Those advantages have a cost
Global, atomic state means global, atomic coordination
Coordination does not scale linearly
![Page 7: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/7.jpg)
The costs of coordination
Remember the network effect?
![Page 8: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/8.jpg)
The costs of coordination
2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
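The figures on this slide match the number of distinct pairs among n nodes, n(n-1)/2. A minimal sketch that reproduces them (the function name `channels` is ours, not from the talk):

```python
def channels(nodes: int) -> int:
    """Number of pairwise communication channels among `nodes` participants."""
    return nodes * (nodes - 1) // 2


# Reproduce the node counts used on the slide.
for n in (2, 5, 12, 25):
    print(f"{n} nodes = {channels(n)} channels")
```

The quadratic growth is the point: doubling the nodes roughly quadruples the channels that may need coordinating.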
![Page 9: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/9.jpg)
So we better be able to scale
![Page 10: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/10.jpg)
The costs of coordination
Databases have optimized this in many clever ways, but a limit on scalability still exists
![Page 11: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/11.jpg)
Let’s look at some ways to scale
![Page 12: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/12.jpg)
Bulk processing billions of records
![Page 13: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/13.jpg)
Bulk processing billions of records
Data aggregation and storage
![Page 14: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/14.jpg)
Bulk processing billions of records
Data aggregation and storage
Real-time processing of updates
![Page 15: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/15.jpg)
Bulk processing billions of records
Data aggregation and storage
Real-time processing of updates
Serving data for:
Online Apps
Analytics
![Page 16: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/16.jpg)
Let’s start with scalability of bulk processing
![Page 17: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/17.jpg)
Quiz: which one is scalable?
![Page 18: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/18.jpg)
Quiz: which one is scalable?
1000-node Hadoop cluster where jobs depend on a common process
![Page 19: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/19.jpg)
Quiz: which one is scalable?
1000-node Hadoop cluster where jobs depend on a common process
1000 Windows ME machines running independent Excel macros
![Page 20: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/20.jpg)
Quiz: which one is scalable?
1000-node Hadoop cluster where jobs depend on a common process
1000 Windows ME machines running independent Excel macros
![Page 21: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/21.jpg)
Independence → Parallelizable
![Page 22: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/22.jpg)
Independence → Parallelizable
Parallelizable → Scalable
![Page 23: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/23.jpg)
“Shared Nothing” architectures are the most scalable…
![Page 24: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/24.jpg)
“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
![Page 25: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/25.jpg)
“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
![Page 26: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/26.jpg)
The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
![Page 27: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/27.jpg)
Amdahl’s Law
$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$
S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
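To see what the law implies in practice, this sketch evaluates its standard form, S = 1 / ((1 - P) + P / N), using the symbols defined on the slide. The function name and the sample values of P and N are our own illustration, not figures from the talk.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed improvement S for parallelizable ratio p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial share (1 - p) is unaffected by n; only p is divided up.
    return 1.0 / ((1.0 - p) + p / n)


# Even with 95% of the work parallelizable, speedup stays below 1 / (1 - p) = 20.
for processors in (10, 100, 1000):
    print(processors, round(amdahl_speedup(0.95, processors), 2))
```

The serial part sets the ceiling, which is why the talk stresses keeping the vast majority of the work independent.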
![Page 28: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/28.jpg)
MapReduce Primer
[Diagram: Input Data is divided into Splits 1 … N; each split feeds one of Mappers 1 … N (Map Phase); the Shuffle routes mapper output to Reducers 1 … N (Reduce Phase)]
![Page 29: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/29.jpg)
MapReduce Example: Word Count
[Diagram: Books are the input; in the Map Phase each mapper counts words per book; the Shuffle routes the counts by word; in the Reduce Phase the reducers sum words A-C, sum words D-E, …, sum words W-Z]
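The word count flow can be imitated in a single process. This is a sketch of the idea, not Hadoop code: all function names are hypothetical, and reducers are chosen by a simple function of each word's first character as a stand-in for the alphabetical ranges (A-C, D-E, …) on the slide.

```python
from collections import Counter, defaultdict


def map_book(text):
    """Map phase: count words within one book, independently of other books."""
    return Counter(text.lower().split())


def shuffle(per_book_counts, num_reducers):
    """Shuffle: route every (word, count) pair to a reducer chosen by the word."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for counts in per_book_counts:
        for word, count in counts.items():
            # Assumption: partition on the first character, so a given word
            # always lands on the same reducer.
            partitions[ord(word[0]) % num_reducers][word].append(count)
    return partitions


def reduce_partition(partition):
    """Reduce phase: sum the per-book counts for each word in this partition."""
    return {word: sum(counts) for word, counts in partition.items()}


books = ["the cat sat", "the dog sat down", "a cat and a dog"]
mapped = [map_book(book) for book in books]
reduced = [reduce_partition(p) for p in shuffle(mapped, num_reducers=3)]

# Combining reducer outputs is the small remaining serial step.
totals = {}
for part in reduced:
    totals.update(part)
print(totals)
```

Each `map_book` call touches only its own book, and each reducer touches only its own words, so both phases can run in parallel.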
![Page 30: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/30.jpg)
Notice there is still a serial part of the problem: the output of the reducers must be combined
![Page 31: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/31.jpg)
Notice there is still a serial part of the problem: the output of the reducers must be combined
…but this is much smaller, and can be handled by a single process
![Page 32: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/32.jpg)
Also notice that the network is a shared resource when processing big data
![Page 33: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/33.jpg)
Also notice that the network is a shared resource when processing big data
So rather than moving data to computation, we move computation to data.
![Page 34: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/34.jpg)
MapReduce Data Locality
[Diagram: the MapReduce Primer layout (Input Data, Splits 1 … N, Mappers 1 … N in the Map Phase, Shuffle, Reducers 1 … N in the Reduce Phase), with a marker showing which parts share a physical machine]
![Page 35: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/35.jpg)
Data locality is only guaranteed in the Map phase
![Page 36: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/36.jpg)
Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
![Page 37: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/37.jpg)
Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
![Page 38: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/38.jpg)
MapReduce Gone Wrong
[Diagram: the word count job (Books, count words per book in the Map Phase, Shuffle, sum words A-C, D-E, …, W-Z in the Reduce Phase), with a remote Word Addition Service added that the job's tasks call]
![Page 39: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/39.jpg)
Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
![Page 40: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/40.jpg)
So for data processing, prefer embedded libraries over remote services
![Page 41: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/41.jpg)
So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
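One way to picture this advice: call the remote service once per mapper to prime a local cache, then do all per-record work in-process. The classes and the lookup data below are hypothetical stand-ins (a counter replaces a real network call), not part of the talk or of any Hadoop API.

```python
class FakeRemoteService:
    """Stand-in for a network service; it only counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def fetch_all_codes(self):
        self.calls += 1
        return {"a": "alpha", "b": "beta"}


class CachingMapper:
    def __init__(self, service):
        # One remote call per mapper, made up front to prime a local cache.
        self.codes = service.fetch_all_codes()

    def map(self, record):
        # Per-record work is a local dictionary lookup, never a network hop.
        return self.codes.get(record, "unknown")


service = FakeRemoteService()
mapper = CachingMapper(service)
results = [mapper.map(r) for r in ["a", "b", "c"] * 1000]
print(service.calls)
```

With this shape, the load on the service grows with the number of mappers rather than with the number of records.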
![Page 42: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/42.jpg)
Joining a billion records
Word counts are great, but many real-world problems mean bringing together multiple datasets.
So how do we “join” with MapReduce?
![Page 43: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/43.jpg)
Map-Side Joins
[Diagram: Data Set 1 is divided into Splits 1-3, each read by its own mapper (Mappers 1-3) in the Map Phase; every mapper also holds a copy of Data Set 2; mapper output passes through the Shuffle to Reducers 1, 2, … in the Reduce Phase]
When joining one big input to a small one, simply copy the small data set to each mapper
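A map-side join can be sketched in plain Python as follows. The sample records and the function name are made up for illustration; in a real job the small data set would be shipped to every mapper by the framework rather than passed as an argument.

```python
def map_side_join(big_split, small_table):
    """Join one split of the big input against an in-memory copy of the small input."""
    for key, value in big_split:
        if key in small_table:
            # The lookup is local to the mapper, so no shuffle is needed to join.
            yield key, value, small_table[key]


# The small data set fits in memory and is copied to every mapper.
small = {1: "Kansas", 2: "Missouri"}

# The big data set is processed one split per mapper.
splits = [[(1, "a"), (3, "b")], [(2, "c"), (1, "d")]]

joined = [row for split in splits for row in map_side_join(split, small)]
print(joined)
```

Because each split is joined independently, the join scales with the number of mappers as long as the small side really does fit in memory.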
![Page 44: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/44.jpg)
Merge in Reducer
[Diagram: Data Set 1 and Data Set 2 are each divided into Splits 1-3; in the Map Phase every split is grouped by key; the Shuffle delivers the grouped records to Reducers 1 … N in the Reduce Phase]
Route common items to the same reducer
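When both inputs are large, the join happens in the reducer. The sketch below imitates that in one process: mappers tag each record with its source, the shuffle routes equal keys to the same reducer, and the reducer pairs the two sides. All names and sample records are hypothetical, and integer keys are used so that `hash(key)` is stable between runs.

```python
from collections import defaultdict


def tag_records(records, source):
    """Map phase: emit (key, (source, value)) so the reducer can tell inputs apart."""
    return [(key, (source, value)) for key, value in records]


def shuffle(tagged, num_reducers):
    """Route all records that share a key to the same reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged_value in tagged:
        reducers[hash(key) % num_reducers][key].append(tagged_value)
    return reducers


def reduce_join(grouped):
    """Reduce phase: pair every left value with every right value for each key."""
    for key, values in grouped.items():
        left = [v for source, v in values if source == "left"]
        right = [v for source, v in values if source == "right"]
        for l_value in left:
            for r_value in right:
                yield key, l_value, r_value


data_set_1 = [(1, "a"), (2, "b"), (3, "c")]
data_set_2 = [(1, "x"), (3, "y"), (3, "z")]

tagged = tag_records(data_set_1, "left") + tag_records(data_set_2, "right")
joined = sorted(row for r in shuffle(tagged, num_reducers=2) for row in reduce_join(r))
print(joined)
```

Unlike the map-side join, every record crosses the shuffle here, so the network cost is higher.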
![Page 45: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/45.jpg)
Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
Crunch!
Use one!
![Page 46: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/46.jpg)
MapReduce and MPP Databases
![Page 47: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/47.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
![Page 48: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/48.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
![Page 49: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/49.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
| Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL |
![Page 50: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/50.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
| Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL |
| Poor support for iterative operations | Good support of iterative operations |
![Page 51: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/51.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
| Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL |
| Poor support for iterative operations | Good support of iterative operations |
| Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data |
![Page 52: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/52.jpg)
| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
| Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL |
| Poor support for iterative operations | Good support of iterative operations |
| Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data |
| Poor interactive query support | Good interactive query support |
![Page 53: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/53.jpg)
MapReduce MPP Databases
…are complementary!
![Page 54: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/54.jpg)
MapReduce MPP Databases
…are complementary!
Map/Reduce to clean, normalize, reconcile and codify data to load into a MPP system for interactive analysis
![Page 55: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/55.jpg)
Bulk processing of millions of records
Data aggregation and storage
![Page 56: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/56.jpg)
Hadoop Distributed Filesystem
Scales to many petabytes
![Page 57: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/57.jpg)
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
![Page 58: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/58.jpg)
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
![Page 59: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/59.jpg)
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
All blocks written in triplicate
![Page 60: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/60.jpg)
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
All blocks written in triplicate
Write and append only – no random updates!
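The block-and-replica bookkeeping described here can be modeled with a toy placement function. This is only a sketch: the round-robin placement, the block size, and the node names are assumptions made for illustration, and real HDFS chooses replica locations with its own placement policy.

```python
def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Return {block_index: [data nodes holding a replica]} for one file."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Assumption: simple round-robin, so replicas land on distinct nodes.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


# A 200-unit file with 64-unit blocks needs 4 blocks, each stored three times.
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(file_size=200, block_size=64, data_nodes=nodes)
for block, replicas in placement.items():
    print(block, replicas)
```

The returned mapping plays the role of the name node's metadata: it records which blocks make up the file and where each copy lives.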
![Page 61: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/61.jpg)
HDFS Writes
[Diagram: the Client asks the Name Node to look up a Data Node, then writes a block to it; the block is replicated to other Data Nodes (Data Node 1, Data Node 2, …, Data Node N)]
![Page 62: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/62.jpg)
HDFS Reads
[Diagram: the Client asks the Name Node to look up block locations, then reads the blocks from the Data Nodes (Data Node 1, Data Node 2, …, Data Node N)]
![Page 63: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/63.jpg)
HDFS Shortcomings
No random reads
No random writes
Doesn’t deal with many small files
![Page 64: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/64.jpg)
HDFS Shortcomings
No random reads
No random writes
Doesn’t deal with many small files
Enter HBase
“Random Access To Your Planet-Size Data”
![Page 65: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/65.jpg)
HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes log to sorted files
![Page 66: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/66.jpg)
HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes log to sorted files
Files accessible as tables, split across many regions, hosted by region servers
![Page 67: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/67.jpg)
HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes log to sorted files
Files accessible as tables, split across many regions, hosted by region servers
Preserves scalability, data locality, and Map/Reduce features of Hadoop
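The idea of emulating random writes on append-only storage can be shown with a toy store: every write is appended to a log and kept in memory, and the buffered writes are periodically flushed to an immutable sorted file. This is a heavily simplified sketch with hypothetical names and a made-up flush threshold, not HBase's actual implementation or API.

```python
import bisect


class ToyStore:
    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only log of writes not yet flushed
        self.buffer = {}         # recent writes held in memory
        self.sorted_files = []   # immutable sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # log first, so the write can be replayed
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered data out as one sorted, never-modified file.
        self.sorted_files.append(sorted(self.buffer.items()))
        self.buffer = {}
        self.wal = []  # flushed entries no longer need replaying

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # Newest file wins, so an "update" is just a later write.
        for sorted_file in reversed(self.sorted_files):
            keys = [k for k, _ in sorted_file]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return sorted_file[i][1]
        return None


store = ToyStore()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    store.put(k, v)
print(store.get("a"), store.get("b"), store.get("z"))
```

No file is ever modified in place, which is what lets this pattern sit on top of a write-and-append-only filesystem.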
![Page 68: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/68.jpg)
Use HBase when:
You have noisy, semi-structured data
![Page 69: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/69.jpg)
Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
![Page 70: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/70.jpg)
Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
You need to handle huge write loads
![Page 71: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/71.jpg)
Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
You need to handle huge write loads
You need a scalable key/value store
![Page 72: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/72.jpg)
But there are drawbacks:
Limited schema support
Limited atomicity guarantees
No built-in secondary indexes
HBase is a great tool for many jobs, but not every job
![Page 73: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/73.jpg)
The data store should align with the needs of the application
![Page 74: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/74.jpg)
So a pattern is emerging:
[Diagram: sources (Millennium, CCDs, Claims, HL7) flow through Collection, Aggregation, and Processing (MapReduce Jobs, on Hadoop with HBase) into Storage (HBase, MPP, Relational, Document Store)]
![Page 75: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/75.jpg)
But we have a potential bottleneck
[Diagram: sources (Millennium, CCDs, Claims, HL7) flow through Collection, Aggregation, and Processing (MapReduce Jobs, on Hadoop with HBase) into Storage (HBase, MPP, Relational, Document Store)]
![Page 76: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/76.jpg)
Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import
Oracle Loader for Hadoop
HBase HFile Import
Bulk Loads for MPP
![Page 77: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/77.jpg)
And we’re missing an important piece:
[Diagram: sources (Millennium, CCDs, Claims, HL7) flow through Collection, Aggregation, and Processing (MapReduce Jobs, on Hadoop with HBase) into Storage (HBase, MPP, Relational, Document Store)]
![Page 78: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/78.jpg)
And we’re missing an important piece:
[Diagram: sources (Millennium, CCDs, Claims, HL7) flow through Collection, Aggregation, and Processing into Storage (HBase, MPP, Relational, Document Store); Processing now shows Realtime Processing alongside Map/Reduce Jobs (batch), on Hadoop with HBase]
![Page 79: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/79.jpg)
How do we make it fast?
Speed Layer
Batch Layer
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
![Page 80: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/80.jpg)
How do we make it fast?
| Batch Layer | Speed Layer |
| --- | --- |
| High latency (minutes or hours to process) | Low latency (seconds to process) |
| Move computation to data | Move data to computation |
| Years of data | Hours of data |
| Bulk loads | Incremental updates |
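One common way to combine the two layers is to answer queries from a batch view plus a small incremental view that covers only the data the last batch run has not seen. The sketch below uses event counts as the example; the function names and sample events are hypothetical, and the talk does not prescribe this particular merge.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer: recompute counts from all stored data (slow, bulk)."""
    return Counter(historical_events)


def update_speed_view(speed_view, event):
    """Speed layer: fold one new event into the recent counts (fast, incremental)."""
    speed_view[event] += 1


def query(key, batch_view, speed_view):
    """Serve a result that merges the batch view with recent updates."""
    return batch_view[key] + speed_view[key]


batch_view = build_batch_view(["login", "view", "login"])

speed_view = Counter()
for event in ["login", "purchase"]:  # arrived after the last batch run
    update_speed_view(speed_view, event)

print(query("login", batch_view, speed_view))
```

When the next batch run finishes, the speed view can be reset, so any mistake made in the incremental path is eventually overwritten by the recomputed batch result.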
![Page 81: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/81.jpg)
How do we make it fast?
| Batch Layer | Speed Layer |
| --- | --- |
| Hadoop | Storm |
| MapReduce | Complex Event Processing |
![Page 82: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/82.jpg)
And now, the challenge…
![Page 83: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/83.jpg)
Process all data overnight
![Page 84: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/84.jpg)
Process all data overnight
Quickly create new data models
Simple correction of any bugs
Fast iteration cycles mean fast innovation
Much easier to understand and work with
![Page 85: Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c664734a7959f3208b4571/html5/thumbnails/85.jpg)
Questions?