hadoop 201 -- deeper into the elephant
![Page 1: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/1.jpg)
Deeper into the elephant: a whirlwind tour of the Hadoop ecosystem
Roman Shaposhnik Director of Open Source @Pivotal
(Twitter: @rhatr)
![Page 2: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/2.jpg)
Who’s this guy?
• Director of Open Source, building a team of OS contributors
• Apache Software Foundation guy (VP of Apache Incubator, ASF member, committer on Hadoop, Giraph, Sqoop, etc.)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
![Page 3: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/3.jpg)
Agenda
![Page 4: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/4.jpg)
Agenda
![Page 5: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/5.jpg)
Long, long time ago…
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce
![Page 6: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/6.jpg)
In a blink of an eye:
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 7: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/7.jpg)
Genesis of Hadoop
• Google papers on GFS and MapReduce
• A subproject of Apache Nutch
• A bet by Yahoo!
![Page 8: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/8.jpg)
Data brings value
• What features to add to the product
• Data analysis must enable decisions
• V3: volume, velocity, variety
![Page 9: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/9.jpg)
Big Data brings big value
![Page 10: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/10.jpg)
![Page 11: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/11.jpg)
Entering: Industrial Data
![Page 12: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/12.jpg)
Hadoop’s childhood
• HDFS: Hadoop Distributed Filesystem
• MapReduce: computational framework
![Page 13: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/13.jpg)
![Page 14: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/14.jpg)
HDFS: not a POSIX fs
• Huge blocks: 64 MB (128 MB)
• Mostly immutable files (append, truncate)
• Streaming data access
• Block replication
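To make the block size and replication bullets concrete, here is a small arithmetic sketch. The 128 MB block size comes from the slide; the 1 GB file size and the replication factor of 3 are assumptions picked for illustration, and the class and method names are hypothetical.

```java
public class HdfsBlockMath {
    // Number of blocks a file is split into: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster. A partially filled last block only
    // occupies as much disk as the data it holds, so this is size * replication.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1024 * mb;   // assumed 1 GB file
        long block = 128 * mb;   // block size from the slide
        int replication = 3;     // assumed replication factor
        long blocks = blockCount(file, block);
        System.out.println("blocks: " + blocks);
        System.out.println("block replicas: " + blocks * replication);
        System.out.println("raw MB stored: " + rawBytes(file, replication) / mb);
    }
}
```

Large blocks keep the per-file metadata small: under these assumptions a 1 GB file is tracked as a handful of blocks rather than hundreds of thousands of 4 KB pages.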
![Page 15: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/15.jpg)
How do I use it?
$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
![Page 16: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/16.jpg)
Principle #1
HDFS is the datalake
![Page 17: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/17.jpg)
Pivotal’s Focus on Data Lakes
[Diagram: data lake architecture]
Traditional Data Sources: ERP, HR, SFDC
New Data Sources/Formats: Machine
In-Memory Parallel Ingest
Data Lake: Raw “untouched” Data, Processed Data
Data Management (Search Engine)
ELT Processing with Hadoop: HDFS, MapReduce/SQL/Pig/Hive
Analytical Data Marts/Sandboxes
Existing EDW / Datamarts
In-Memory Services
BI / Analytical Tools
Security and Control
Business Users
Callouts: “Finally! I now have full transparency on the data with amazing speed!” “All data is now accessible!” “I can now afford ‘Big Data’”
![Page 18: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/18.jpg)
HDFS enables the stack
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 19: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/19.jpg)
Principle #2
Apps share their internal state
![Page 20: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/20.jpg)
MapReduce
• Batch oriented (long jobs; final results)
• Brings the computation to the data
• Very constrained programming model
• Embarrassingly parallel programming model
• Used to be the only game in town for compute
![Page 21: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/21.jpg)
MapReduce Overview
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value: Serializable
• Logical Phases: Input, Map, Shuffle, Reduce, Output
![Page 22: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/22.jpg)
Map
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
![Page 23: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/23.jpg)
Shuffle
• Input: List(Key2, Value2)
• Output
• Sort(Partition(List(Key2, List(Value2))))
• Provided by Hadoop : Several Customizations Possible
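The Sort(Partition(...)) step can be sketched in plain Java. The partition formula below mirrors the one used by Hadoop's default HashPartitioner (clear the sign bit of the key's hash, then take the modulus); everything else, including the class name `ShuffleSketch`, the use of `String` keys and the sample data, is an assumption made for illustration and not Hadoop's actual shuffle implementation.

```java
import java.util.*;

public class ShuffleSketch {
    // Pick a reducer for a key: non-negative hash modulo the number of partitions.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Partition the map output, then group values by key in sorted key order.
    // Returns one sorted map per partition, i.e. the input each reducer would see.
    static List<SortedMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numPartitions) {
        List<SortedMap<String, List<Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            parts.get(partition(kv.getKey(), numPartitions))
                 .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                 .add(kv.getValue());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical map output: one (word, 1) pair per token
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String k : new String[] {"a", "b", "c", "a", "c", "a"}) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(k, 1));
        }
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because every occurrence of a key hashes to the same partition, a single reducer is guaranteed to see all values for that key, which is what makes per-key aggregation possible.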
![Page 24: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/24.jpg)
Reduce
• Input: List(Key2, List(Value2))
• Output: List(Key3, Value3)
• Aggregations
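The Map, Shuffle and Reduce signatures above can be tied together with a single-process sketch. This is not how Hadoop executes a job (there is no distribution, spilling or fault tolerance); it only shows the data shapes at each phase. The class name `MiniMapReduce` and the two input lines are hypothetical.

```java
import java.util.*;

public class MiniMapReduce {
    // Map: one input value (a line of text) -> List(Key2, Value2).
    // The input key (Key1) is omitted because word count ignores it.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Reduce: (Key2, List(Value2)) -> Value3, here a sum aggregation.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static SortedMap<String, Integer> run(List<String> lines) {
        // Shuffle: group all map output by key, sorted by key.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // Reduce each key group independently.
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input; prints {a=3, b=1, c=2}
        System.out.println(run(Arrays.asList("a b c", "a c a")));
    }
}
```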
![Page 25: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/25.jpg)
Anatomy of MapReduce
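[Diagram: word count data flow, HDFS → mappers → reducers → HDFS]
Input (HDFS): d a c | a b c
Mapper output: a 1 b 1 c 1 | a 1 c 1 a 1
Reducer input: a 1 1 1 | b 1 | c 1 1
Output (HDFS): a 3 b 1 c 2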
![Page 26: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/26.jpg)
MapReduce DataFlow
![Page 27: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/27.jpg)
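How do I use it?
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  // Output objects are reused across calls to avoid per-record allocation
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (token, 1) for every whitespace-separated token in the line
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}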
![Page 28: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/28.jpg)
How do I use it?
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all counts emitted for this word
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
![Page 29: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/29.jpg)
How do I run it?
$ hadoop jar hadoop-examples.jar wordcount \
    input \
    output
![Page 30: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/30.jpg)
Principle #3
MapReduce is the assembly language of Hadoop
![Page 31: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/31.jpg)
Hadoop’s childhood
• Compact (pretty much a single jar)
• Challenged in scalability and SPOFs
• Extremely batch oriented
• Hard for non-Java programmers
![Page 32: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/32.jpg)
Then, something happened
![Page 33: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/33.jpg)
Hadoop 1.0
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce
![Page 34: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/34.jpg)
Hadoop 2.0
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce Tez Hamster
YARN
![Page 35: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/35.jpg)
Hadoop 2.0
• HDFS 2.0
• Yet Another Resource Negotiator (YARN)
• MapReduce is just an “application” now
• Tez is another “application”
• Pivotal’s Hamster (OpenMPI) yet another one
![Page 36: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/36.jpg)
MapReduce 1.0
Job Tracker
Task Tracker (HDFS)
Task Tracker (HDFS)
task1 task1 task1 task1 task1
task1 task1 task1 task1 taskN
![Page 37: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/37.jpg)
YARN (AKA MR2.0)
Resource Manager
Job Tracker
task1 task1 task1 task1 task1 Task Tracker
![Page 38: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/38.jpg)
YARN (AKA MR2.0)
Resource Manager
Job Tracker
task1 task1 task1 task1 task1 Task Tracker
![Page 39: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/39.jpg)
YARN
• Yet Another Resource Negotiator
• Resource Manager
• Node Managers
• Application Masters
• Specific to paradigm, e.g. MR Application master (aka JobTracker)
![Page 40: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/40.jpg)
YARN: beyond MR
Resource Manager
MPI
MPI
![Page 41: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/41.jpg)
Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop provides: resource scheduling, process monitoring, distributed file system
• Open MPI provides: process launching, communication, I/O forwarding
![Page 42: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/42.jpg)
Hamster Components
• Hamster Application Master
• Gang Scheduler, YARN Application Preemption
• Resource Isolation (lxc Containers)
• ORTE: Hamster Runtime
• Process launching, Wireup, Interconnect
![Page 43: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/43.jpg)
Hamster Architecture
![Page 44: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/44.jpg)
Hadoop 2.0
HDFS
ASF Projects FLOSS Projects Pivotal Products
MapReduce Tez Hamster
YARN
![Page 45: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/45.jpg)
Hadoop ecosystem
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 46: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/46.jpg)
There’s way too much stuff
• Tracking dependencies
• Integration testing
• Optimizing the defaults
• Rationalizing the behaviour
![Page 47: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/47.jpg)
Wait! We’ve seen this!
GNU Software Linux kernel
![Page 48: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/48.jpg)
Apache Bigtop
Hadoop ecosystem (HBase, Pig, Hive)
Hadoop (HDFS, YARN, MR)
![Page 49: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/49.jpg)
Principle #4
Apache Bigtop is how the Hadoop distros get defined
![Page 50: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/50.jpg)
The ecosystem
• Apache HBase
• Apache Crunch, Pig, Hive and Phoenix
• Apache Giraph
• Apache Oozie
• Apache Mahout
• Apache Sqoop and Flume
![Page 51: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/51.jpg)
Apache HBase
• Small mutable records vs. HDFS files
• HFiles kept in HDFS
• Memcached for HDFS
• Built on HDFS and Zookeeper
• Google’s Bigtable
![Page 52: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/52.jpg)
HBase data model
• Driven by the original Webtable use case:
Row key: com.cnn.www
content: → <html>...
anchor:a.com → CNN
anchor:b.com → CNN.co
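One way to picture the Webtable layout is as a sorted, nested map: row key, then column (family:qualifier), then timestamp, then value. The sketch below models only that shape in memory; it is not the HBase client API, and the class name `WebtableSketch`, its methods and the sample cells are hypothetical.

```java
import java.util.*;

public class WebtableSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final SortedMap<String, SortedMap<String, SortedMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column,
                    c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Latest version of a cell, or null if the cell does not exist.
    String get(String row, String column) {
        SortedMap<String, SortedMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        SortedMap<Long, String> versions = columns.get(column);
        return versions.get(versions.firstKey());
    }

    // Range scan over the sorted row keys: [startRow, stopRow)
    Set<String> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow).keySet();
    }

    public static void main(String[] args) {
        WebtableSketch table = new WebtableSketch();
        table.put("com.cnn.www", "content:", 1L, "<html>v1");
        table.put("com.cnn.www", "content:", 2L, "<html>v2");
        table.put("com.cnn.www", "anchor:a.com", 1L, "CNN");
        table.put("com.example.www", "content:", 1L, "<html>");
        System.out.println(table.get("com.cnn.www", "content:"));
        System.out.println(table.scan("com.cnn", "com.cno"));
    }
}
```

Storing the domain reversed (com.cnn.www) means pages from the same site sort next to each other, so a range scan over one site touches a contiguous run of rows.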
![Page 53: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/53.jpg)
How do I use it?
// Write one cell: row "row", column "family:qualifier", value "data"
HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);
![Page 54: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/54.jpg)
Dataflow model
HBase
HDFS
Producer Consumer
![Page 55: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/55.jpg)
When do I use it?
• Serving up large amounts of data
• Fast random access
• Scan operations
![Page 56: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/56.jpg)
Principle #5
HBase: when you need OLAP + OLTP
![Page 57: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/57.jpg)
What if it’s OLTP?
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 58: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/58.jpg)
GemFire XD
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, GemFire XD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 59: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/59.jpg)
GemFire XD: a better HBase?
• Closed source but extremely mature
• SQL/Objects/JSON data model
• High concurrency, high update load
• Mostly selective point queries (no scans)
• Tiered storage architecture
![Page 60: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/60.jpg)
YCSB Benchmark; Throughput is 2-12X
[Charts: throughput (ops/sec, axis 0 to 800,000) for HBase and for GemFire XD across YCSB workloads AU, BU, CU, D, FU and LOAD; series labeled 4, 8, 12 and 16]
![Page 61: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/61.jpg)
YCSB Benchmark; Latency is 2X – 20X better
[Charts: latency (μsec, axis 0 to 14,000) for HBase and for GemFire XD; series labeled 4, 8, 12 and 16]
![Page 62: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/62.jpg)
Principle #6
There are always 3 implementations
![Page 63: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/63.jpg)
Querying data
• MapReduce: “an assembly language”
• Apache Pig: a data manipulation DSL (now Turing complete!)
• Apache Hive: a batch-oriented SQL on top of Hadoop
![Page 64: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/64.jpg)
How do I use Pig?
grunt> A = load './input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
![Page 65: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/65.jpg)
How do I use Hive?
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
![Page 66: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/66.jpg)
Can we short Oracle now?
• No indexing
• Batch oriented scheduling
• Optimization for long running queries
• Metadata management is still in flux
![Page 67: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/67.jpg)
[Close to] real-time SQL
• Impala (inspired by Google’s F1)
• Hive/Tez (AKA Stinger)
• Facebook’s Presto (Hive’s lineage)
• Pivotal’s HAWQ
![Page 68: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/68.jpg)
HAWQ
• GreenPlum MPP database core
• True ANSI SQL support
• HDFS storage backend
• Parquet support
![Page 69: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/69.jpg)
Principle #7
SQL on Hadoop
![Page 70: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/70.jpg)
Feeding the elephant
![Page 71: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/71.jpg)
Getting data in: Flume
• Designed for collecting log data
• Flexible deployment topology
![Page 72: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/72.jpg)
Sqoop: RDBMS connection
• Sqoop 1
• A MapReduce tool
• Must use Oozie for workflows
• Sqoop 2
• Well, 0.99.x really
• A standalone service
![Page 73: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/73.jpg)
Spring XD
• Unified, distributed, extensible system for data ingestion, real-time analytics and data export
• Apache Licensed, not ASF
• A runtime service, not a library
• AKA “Oozie + Flume + Sqoop + Morphlines”
![Page 74: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/74.jpg)
How do I use it?
# deployment: ./xd-singlenode
$ ./xd-shell
xd:> hadoop config fs --namenode hdfs://nn:8020
xd:> stream create --definition "time | hdfs" --name ticktock
xd:> stream destroy --name ticktock
![Page 75: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/75.jpg)
Feeding the Elephant
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, GemFire XD, SpringXD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 76: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/76.jpg)
Spark the disruptor
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), GemFire XD, SpringXD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 77: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/77.jpg)
What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)
![Page 78: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/78.jpg)
Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
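The "automatically rebuilt on failure" bullet rests on lineage: a dataset remembers the recipe that produced it rather than the materialized data. The sketch below imitates that idea in a single JVM with Java streams. It is not Spark's API and has no partitions or real failure handling; the class `LineageSketch` and the sample log lines are hypothetical.

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

public class LineageSketch<T> {
    // The lineage: a recipe that can recompute the data from its source at any time.
    private final Supplier<Stream<T>> recipe;

    private LineageSketch(Supplier<Stream<T>> recipe) {
        this.recipe = recipe;
    }

    static <T> LineageSketch<T> fromSource(List<T> source) {
        return new LineageSketch<T>(source::stream);
    }

    // Transformations are lazy: they only extend the recipe, nothing is computed yet.
    LineageSketch<T> filter(Predicate<T> predicate) {
        return new LineageSketch<T>(() -> recipe.get().filter(predicate));
    }

    <R> LineageSketch<R> map(Function<T, R> fn) {
        return new LineageSketch<R>(() -> recipe.get().map(fn));
    }

    // An action runs the whole recipe. Calling it again recomputes from the source,
    // which is how a lost result can be rebuilt without having been checkpointed.
    List<T> collect() {
        return recipe.get().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "info boot ok", "warning disk full", "warning net slow");
        LineageSketch<String> warnings = LineageSketch.fromSource(log)
                .filter(line -> line.contains("warning"))
                .map(line -> line.split(" ")[1]);
        System.out.println(warnings.collect());
        System.out.println(warnings.collect()); // recomputed from the source
    }
}
```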
![Page 79: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/79.jpg)
RDDs
warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))
HadoopRDD: path = hdfs://
FilteredRDD: contains…
MappedRDD: split…
![Page 80: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/80.jpg)
Parallel operators
• map, reduce
• sample, filter
• groupBy, reduceByKey
• join, leftOuterJoin, rightOuterJoin
• union, cross
![Page 81: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/81.jpg)
An alternative backend
• Shark: a Hive on Spark (now Spark SQL)
• Spork: a Pig on Spark
• MLlib: machine learning on Spark
• GraphX: Graph processing on Spark
• Also featuring its own streaming engine
![Page 82: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/82.jpg)
How do I use it?
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
![Page 83: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/83.jpg)
Principle #8
Spark is the technology of 2014
![Page 84: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/84.jpg)
Where’s the cloud?
![Page 85: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/85.jpg)
What’s new?
• True elasticity
• Resource partitioning
• Security
• Data marketplace
• Data leaks/breaches
![Page 86: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/86.jpg)
Hadoop Maturity
ETL Offload: Accommodate massive data growth with existing EDW investments
Data Lakes: Unify Unstructured and Structured Data Access
Big Data Apps: Build analytic-led applications impacting top line revenue
Data-Driven Enterprise
App Dev and Operational Management on HDFS
Data Architecture
![Page 87: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/87.jpg)
Pivotal HD on Pivotal CF
• Enterprise PaaS Management System
• Flexible multi-language ‘buildpack’ architecture
• Deployed applications enjoy built-in services
• On-Premise Hadoop as a Service
• Single cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop Clusters
• Speeds up time-to-value
![Page 88: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/88.jpg)
Pivotal Data Fabric Evolution
Analytic Data Marts
SQL Services
Operational Intelligence
In-Memory Database
Run-Time Applications
Data Staging Platform
Data Mgmt. Services
Pivotal Data Platform
Stream Ingestion
Streaming Services
Software-Defined Datacenter
New Data-fabrics
In-Memory Grid
...ETC
![Page 89: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/89.jpg)
Principle #9
Hadoop in the Cloud is one of many distributed frameworks
![Page 90: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/90.jpg)
2014 is the year of Hadoop
[Diagram: Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Giraph, Crunch, Mahout, HBase, Phoenix, SolrCloud, Sqoop, Flume, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, GemFire XD, SpringXD, Command Center
Coordination and workflow management: ZooKeeper, Oozie
Hadoop UI: Hue
![Page 91: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/91.jpg)
A NEW PLATFORM FOR A NEW ERA
![Page 92: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/92.jpg)
Credits
• Apache Software Foundation
• Milind Bhandarkar
• Konstantin Boudnik
• Robert Geiger
• Susheel Kaushik
• Mak Gokhale
![Page 93: Hadoop 201 -- Deeper into the Elephant](https://reader033.vdocuments.us/reader033/viewer/2022042717/55d578aabb61eba92f8b45da/html5/thumbnails/93.jpg)
Questions?