TRANSCRIPT
Hadoop Ecosystem
Ryan Brown, Red Hat, Inc.
Agenda
● MapReduce
● Storage
● The Ecosystem (Spark, Hive, etc)
● The Real World
“Big Data”
● Well, it depends
● Petabyte and beyond
● Capture, curation, storage, search, and analysis
● Unstructured
● Too big for one machine
Brief History
● Hadoop is based on Google papers from the early 2000s
– Google File System (GFS)
– MapReduce
● Distribute data at write time; reads are distributed by default
MapReduce
● Built for parallelism
● Split input
● Process small pieces
● Merge results from each
● Final output
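The split → process → merge flow above can be sketched as a toy word count in plain Python. This is illustrative only, not the Hadoop API: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names standing in for what the framework does between stages.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each mapper sees one input split and emits (word, 1) pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Between map and reduce, the framework groups values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer merges the values for its keys into final output.
    return {key: sum(values) for key, values in groups.items()}

splits = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = reduce_phase(shuffle(mapped))  # e.g. counts["the"] == 3
```

In real Hadoop each phase runs on many machines; the shared-nothing property shows up here as mappers that never see each other's splits.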
Infrastructure
● Commodity servers, commodity prices
● … lots and lots of them
● Plan for failure
● … and growth
M/R Concepts
● “Shared-nothing” computation
● Minimum of communication
● Master node restarts failed tasks
● Splits tasks between nodes
HDFS Concepts
● Files are split into big (128MB+) blocks
● Blocks are replicated (3x by default)
● Spindles, platters, and performance
● Best for big sequential reads
● Write only once
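A minimal sketch of the block-splitting and 3x replication above, in plain Python. The round-robin placement and the `place_blocks` helper are assumptions for illustration; real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default 128 MB blocks
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, datanodes):
    """Split a file into blocks and pick REPLICATION nodes per block.
    Round-robin placement is a toy stand-in for HDFS's rack-aware policy."""
    n_blocks = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    placement = []
    for b in range(n_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement.append(replicas)
    return placement

# A 300 MB file needs 3 blocks; each block gets 3 replicas across 5 nodes.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
layout = place_blocks(300 * 1024 * 1024, nodes)
```

With big blocks, a sequential read streams 128 MB off one spindle before seeking, which is why HDFS favors large sequential reads over random access.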
Reading Data
● Client: NameNode, who has foo.txt?
● NN: Nodes 7, 11, and 187
● Client: 7, give me foo.txt
● 7: Blocks 1-10 of foo.txt
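The exchange above, as a toy lookup in Python. The `namenode` dict and `read_file` function are hypothetical; the point is that the NameNode serves only metadata, and block data flows straight from a DataNode to the client.

```python
# Toy NameNode metadata: file -> which DataNodes hold it, and its block list.
namenode = {"foo.txt": {"nodes": [7, 11, 187], "blocks": list(range(1, 11))}}

def read_file(path):
    meta = namenode[path]        # Client -> NameNode: who has foo.txt?
    node = meta["nodes"][0]      # NameNode: nodes 7, 11, and 187; client picks one
    return node, meta["blocks"]  # Client -> DataNode 7: give me the blocks

node, blocks = read_file("foo.txt")  # blocks 1-10 come from node 7
```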
SQL-likes
● Hive
● Pig
● Impala
Pig Latin
● Scripting-like language
● Mature (err...kind of old)
● Built for analysts, can be used interactively
● Scripts compile down to MapReduce jobs
Hive
● SQL-ish language called HiveQL
● Metadata is stored separately from data

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Impala
● Newer than Hive and Pig
● Faster to run queries
● Queries data in HDFS and HBase (a BigTable clone)
● SQL-92 compliant, making it great for BI and easy to integrate
Under the Hood
● Accepts queries on any DataNode
● Query planner
● Coordinates all other DataNodes with relevant data
● Gathers results to go back over SQL connection
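The scatter-gather pattern described above can be sketched in plain Python. This is a toy model, not Impala's internals: `run_fragment` and `coordinate` are hypothetical names for the per-node query fragment and the coordinating node.

```python
def run_fragment(local_rows, predicate):
    # Each DataNode runs its fragment against only its local data.
    return [row for row in local_rows if predicate(row)]

def coordinate(query_predicate, cluster):
    # The node that accepted the query plans it, fans fragments out to
    # every DataNode with relevant data, then gathers partial results.
    partials = [run_fragment(rows, query_predicate) for rows in cluster.values()]
    merged = [row for part in partials for row in part]
    return sorted(merged)  # final merge before going back over the SQL connection

cluster = {"dn1": [3, 9, 12], "dn2": [7, 15], "dn3": [1, 20]}
result = coordinate(lambda x: x > 8, cluster)  # [9, 12, 15, 20]
```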
Impala
Hadoop Today
● YARN, Mesos, and other schedulers
● Federated NameNodes
● Huge tool ecosystem
– Sqoop
– Flume
– Impala
YARN
● MapReduce: Reloaded
YARN
● Splits up the JobTracker from Hadoop v1
● Monitoring, resource management, and scheduling are now separate daemons
● Application Masters negotiate with Resource Manager for capacity
Spark
● Began ~2009
● Targeted for Scala and Python
● On YARN, Application Masters negotiate with the Resource Manager for capacity
● Jobs are set up as a DAG
● “Micro-batching”
Spark vs. M/R
● Spark tries to work in memory
● MapReduce is slower for multi-pass work
● Spark jobs work great as a team
● Spark is better for streaming
● MapReduce looks more like “traditional” ETL
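The DAG-of-lazy-transformations idea can be sketched in pure Python. This `Dataset` class is a toy stand-in for an RDD, not the PySpark API: chained calls only record a plan, and nothing executes (or leaves memory) until an action runs.

```python
class Dataset:
    """Toy RDD: chained transformations build a plan (a small DAG of steps);
    nothing runs until an action like collect() is called."""
    def __init__(self, data, plan=()):
        self.data = data
        self.plan = plan  # deferred steps, executed in order on collect()

    def map(self, fn):
        return Dataset(self.data, self.plan + (("map", fn),))

    def filter(self, fn):
        return Dataset(self.data, self.plan + (("filter", fn),))

    def collect(self):
        rows = list(self.data)
        for kind, fn in self.plan:  # multi-pass work stays in memory
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

result = Dataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
# result == [0, 4, 16, 36, 64]
```

MapReduce would write intermediate results to HDFS between the map and filter passes; keeping them in memory is why Spark wins on multi-pass work.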
Kafka Streaming
● Append-only logs
● Shard by stream
● Replay when consumers fail
● Connects real-time systems to batch systems
Kafka Streaming
● Does not process data
● Not coupled to the Hadoop ecosystem
● Can act as a key-value store
● Keeps up performance when consumers fall behind and catch up
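The append-only log and replay-on-failure behavior above can be sketched in plain Python. This `Log` class is a toy model, not the real Kafka client API; `poll` and `rewind` are hypothetical names for fetching from and resetting a consumer's offset.

```python
class Log:
    """Toy append-only log: producers append, each consumer tracks an
    offset, and rewinding the offset replays records after a failure."""
    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)  # append-only: records are never mutated

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch

    def rewind(self, consumer, offset=0):
        # Replay from an earlier offset, e.g. after a consumer crash.
        self.offsets[consumer] = offset

log = Log()
for event in ["click", "view", "buy"]:
    log.append(event)

first = log.poll("billing")     # ["click", "view", "buy"]
log.rewind("billing")           # consumer failed; replay everything
replayed = log.poll("billing")  # the same three events again
```

Because the log itself never processes data, the same records can feed a real-time consumer and a batch job at different offsets.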
Flink Combo
● Streaming OR batch
● Streams in Scala, Python, or Java
● Tables for SQL-ish access
● Graph processing with Gelly
Flink Combo
● One-stop framework
● Caches intermediate data
● Jobs can be chained like Spark
Hadoop-ables
● Risk calculations
● Recommendations
● Transaction analysis
● Failure prediction
● Customer churn analysis
● Fraud/threat detection
● Word count – haha
Questions?