big data storage options for hadoop
TRANSCRIPT
-
8/10/2019 Big Data Storage Options for Hadoop
Big Data Storage Options for Hadoop
Sam Fineberg/HP Storage Division
-
2013 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
Member companies and individual members may use this material in presentations and literature under the following conditions:
  Any slide or slides used must be reproduced in their entirety without modification
  The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
-
Abstract
Big Data Storage Options for Hadoop
The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. In this presentation, Hadoop will be looked at from a storage perspective. The tutorial will describe the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities.
-
Overview
Introduction
  What is Hadoop
  What is MapReduce
  How does Hadoop use storage
  Distributed filesystem concepts
Storage options
  Native Hadoop HDFS
    On direct attached storage
    On networked (SAN) storage
  Alternative distributed filesystems
  Cloud object storage
  Emerging options
-
What is Hadoop?
A scalable fault-tolerant distributed system for data storage and processing
Core Hadoop has two main components
  MapReduce: fault-tolerant distributed processing
    Programming model for processing sets of data
    Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
  Hadoop Distributed File System (HDFS): high-bandwidth clustered storage
    Distributed file system optimized for large files
Operates on unstructured and structured data
A large and active ecosystem
Written in Java
Open source under the friendly Apache License
http://hadoop.apache.org
-
What is MapReduce?
A method for distributing a task across multiple nodes
Each node processes data stored on that node
Consists of two developer-created phases
  1. Map
  2. Reduce
In between Map and Reduce is the Shuffle and Sort
-
MapReduce
[Figure: MapReduce overview. Source: Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]
-
MapReduce Operation
[Figure: MapReduce operation for the example question "What was the max temperature for the last century?"]
-
Key MapReduce Terminology and Concepts
A user runs a client program (typically a Java application) on a client computer
The client program submits a job to Hadoop
The job is sent to the JobTracker process on the Master Node
Each Slave Node runs a process called the TaskTracker
The JobTracker instructs TaskTrackers to run and monitor tasks
A task attempt is an instance of a task running on a slave node
There will be at least as many task attempts as there are tasks which need to be performed, since a failed task is retried as a new attempt
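The relationship between tasks and task attempts can be illustrated with a small sketch. This is a toy model, not Hadoop's scheduler: the function name `run_job`, the task names, the `max_attempts` limit, and the failure model (a listed task fails exactly once) are all assumptions made for illustration.

```python
def run_job(tasks, fails_first_attempt, max_attempts=4):
    """Run every task, retrying failures; return the attempt count per task."""
    attempts = {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            attempts[task] = attempt
            # Hypothetical failure model: a task listed in
            # `fails_first_attempt` fails once, then succeeds on retry.
            failed = task in fails_first_attempt and attempt == 1
            if not failed:
                break
    return attempts


tasks = ["map-0", "map-1", "map-2", "reduce-0"]
attempts = run_job(tasks, fails_first_attempt={"map-1"})
total_attempts = sum(attempts.values())

# Four tasks, one of which needed a second attempt.
print(total_attempts, len(tasks))
```

With no failures the two counts are equal; every retry adds one attempt without adding a task.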
-
MapReduce in Hadoop
[Figure: MapReduce in Hadoop. Labels: Task Tracker, Input (HDFS), Output (HDFS), Mapper, Reducer, Worker = Tasks. Source: Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]
-
MapReduce: Basic Concepts
Each Mapper processes a single input split from HDFS
Hadoop passes one record at a time to the developer's Map code
Each record has a key and a value
Intermediate data is written by the Mapper to local disk (not HDFS) on each of the individual cluster nodes
  Intermediate data is not reliable or globally accessible
During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer
Reducer is passed each key and a list of all its values
Output from Reducers is written to HDFS
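The flow above can be sketched in a few lines of plain Python, using the maximum-temperature question from the earlier slide. This is a single-process illustration of the programming model only, not Hadoop's API: the function names, the "year,temperature" record format, and the sample readings are made up for the example.

```python
from collections import defaultdict


def map_record(_key, line):
    """Map: parse one 'year,temperature' record, emit (year, temperature)."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle_and_sort(pairs):
    """Group all values by intermediate key; return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_max(key, values):
    """Reduce: receive one key and the list of all its values."""
    return key, max(values)


# Each inner list stands in for one input split handled by one Mapper.
splits = [
    ["1949,111", "1950,0", "1950,22"],
    ["1949,78", "1950,-11"],
]

intermediate = []
for split in splits:
    for offset, line in enumerate(split):
        # One record at a time is passed to the map function.
        intermediate.extend(map_record(offset, line))

result = dict(reduce_max(k, vs) for k, vs in shuffle_and_sort(intermediate))
print(result)
```

In a real cluster the `intermediate` list would be spread over the local disks of the mapper nodes, and the grouping step would move data across the network to the reducers.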
-
What is a Distributed File System?
A distributed file system is a file system that allows access to files from multiple hosts across a network
  A network filesystem (NFS/CIFS) is a type of distributed file system more tuned for file sharing than distributed computation
  Distributed computing applications, like Hadoop, utilize a tightly coupled distributed file system
Tightly coupled distributed filesystems
  Provide a single global namespace across all nodes
  Support multiple initiators, multiple disk nodes, multiple access to files (file parallelism)
  Examples include HDFS, GlusterFS, pNFS, as well as many commercial and research systems
-
Hadoop Distributed File System - HDFS
Architecture
  Java application, not deeply integrated with the server OS
  Layered on top of a standard FS (e.g., ext, xfs, etc.)
  Must use Hadoop or a special library to access HDFS files
  Shared-nothing, all nodes have direct attached disks
  Write-once filesystem: must copy a file to modify it
HDFS basics
  Data is organized into files & directories
  Files are divided into 64-128 MB blocks, distributed across nodes
  Block placement is handled by the NameNode
    Placement coordinated with job tracker: writes always co-located, reads co-located with computation whenever possible
  Blocks replicated to handle failure, replica blocks can be used by compute tasks
  Checksums used to ensure data integrity
  Replication: one and only strategy for error handling, recovery and fault tolerance
    Self healing
    Makes multiple copies (typically 3)
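As a rough worked example of the block and replication arithmetic above, the sketch below computes how many blocks a file occupies and how much raw disk it consumes. The function name `hdfs_footprint` and the 1000 MB file are assumptions for illustration; the block size and replication factor are the typical values quoted on the slide.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # Blocks are stored as ordinary files on the local file system, so a
    # short final block only consumes its actual length.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Halving the block size to 64 MB doubles the number of blocks the NameNode has to track for the same file, while the raw capacity consumed stays the same.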
-
HDFS on local DAS
A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS)
  Disks are running a standard file system (e.g., ext, xfs, etc.)
  HDFS blocks are stored as files in a special directory
  Disks attached directly, for example, with SAS or SATA
  No storage is shared, disks only attach to a single node
The most common use case for Hadoop
  Original design point for Hadoop/HDFS
  Can work with cheap unreliable hardware
  Some very large systems utilize this model
-
HDFS on local DAS
[Figure: HDFS on local DAS. Compute nodes are part of HDFS, with data spread across nodes; nodes communicate using the HDFS protocol.]
-
HDFS File Write Operation
[Figure: HDFS file write operation with 3-way replication. Image source: Hadoop: The Definitive Guide, Tom White, O'Reilly]
-
HDFS File Read Operation
[Figure: HDFS file read operation. Image source: Hadoop: The Definitive Guide, Tom White, O'Reilly]
-
HDFS on local DAS - Pros and Cons
Pros
  Writes are highly parallel
    Large files are broken into many parts, distributed across the cluster
  Three copies of any file block, one written local, two remote
    Not a simple round-robin scheme, tuned for Hadoop jobs
  Job tracker attempts to make reads local
    If possible, tasks scheduled in same node as the needed file segment
    Duplicate file segments are also readable, can be used for tasks too
Cons
  Not a replacement for general purpose storage
    Not a kernel-based POSIX filesystem
    Incompatible with standard applications and utilities (but future versions of Hadoop are adding support for other application models)
  High replication cost compared with RAID/shared disk
  The NameNode keeps track of data location
    SPOF - location data is critical and must be protected
    Scalability bottleneck (everything has to be in memory)
    Improvements to NameNode are in the works
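The "one local, two remote" placement and the preference for local reads can be sketched as follows. This is a deliberately simplified toy: the real HDFS placement policy is tuned for Hadoop jobs and is not reproduced here, and the function names, node names, and random choice of remote nodes are assumptions for illustration.

```python
import random


def place_replicas(writer, nodes, replication=3, rng=random):
    """Simplified placement: one local copy plus copies on distinct remote nodes."""
    remote_candidates = [n for n in nodes if n != writer]
    return [writer] + rng.sample(remote_candidates, replication - 1)


def pick_task_node(replica_nodes, free_nodes):
    """Prefer a free node that already stores the block, else read remotely."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, True  # data-local task
    return sorted(free_nodes)[0], False  # any free node, remote read


nodes = ["node1", "node2", "node3", "node4", "node5"]
replicas = place_replicas("node2", nodes, rng=random.Random(0))
node, is_local = pick_task_node(replicas, free_nodes={"node2", "node5"})
print(replicas[0], node, is_local)
```

Because any of the three replicas can serve a task, the scheduler has three chances to find a free node that holds the data, which is one reason replica blocks help compute as well as reliability.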
-
Other HDFS storage options
HDFS on Storage Area Network (SAN) attached storage
A lot like DAS
  Disks are logical volumes in storage array(s), accessed across a SAN
  HDFS doesn't know the difference
  Still appears like a locally attached disk
SAN attached arrays aren't the same as DAS
  Array has its own cache, redundancy, replication, etc.
  Any node on the SAN can access any array volume
  So a new node can be assigned to a failed node's data
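The failover idea in the last two points can be pictured as a change to a volume-to-node mapping. The sketch below is only a model of that bookkeeping: the function name `fail_over`, the volume names, and the node names are hypothetical, and no real SAN or array management API is involved.

```python
def fail_over(volume_owner, failed_node, spare_node):
    """Return a new volume-to-node map with the failed node's volumes reassigned."""
    return {
        volume: (spare_node if owner == failed_node else owner)
        for volume, owner in volume_owner.items()
    }


# Hypothetical layout: each array volume is presented to exactly one node.
volume_owner = {"lun0": "node1", "lun1": "node2", "lun2": "node2"}
after = fail_over(volume_owner, failed_node="node2", spare_node="node9")
print(after)
```

With direct attached disks the same failure would leave the data unreachable until the node is repaired, which is why HDFS on DAS relies on replicas held by other nodes.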
-
HDFS with SAN Storage
[Figure: HDFS with SAN storage. Storage arrays connect to the compute nodes of the Hadoop cluster over an iSCSI or FC SAN.]
-
HDFS File Write Operation
[Figure: HDFS file write operation on SAN storage, with array replication. The array can provide redundancy, so there is no need to replicate data across data nodes.]
-
HDFS File Read Operation
[Figure: HDFS file read operation on SAN storage. Array redundancy means there is only a single source for data.]
-
SAN for Hadoop Storage
Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN
  Looks like local storage to data nodes
  Hadoop still utilizes HDFS
Pros
  All the normal advantages of arrays
    RAID, centralized caching, thin provisioning, other advanced array features
    Centralized management, easy redistribution of storage
  Retains advantages of HDFS (as long as array is not over-utilized)
  Easy failover when compute node dies, can eliminate or reduce 3-way replication
Cons
  Cost? It depends
  Unless multiple arrays are used, scale is limited
    And with multiple arrays, management and cost advantages are reduced
  Still have HDFS complexity and manageability issues
-
Other distributed filesystems
Kernel-based tightly coupled distributed file system
  Kernel-based, i.e., no special access libraries, looks like a normal local file system
  These filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments
    Many commercial and research examples
    Unlike HDFS, not originally designed for Hadoop
Location awareness is part of the file system, no NameNode
  Works better if functionality is exposed to Hadoop
Compute nodes may or may not have local storage
  Compute nodes are part of the storage cluster, but may be diskless, i.e., equal access to files and global namespace
  Can tie the filesystem's location awareness into task tracker to reduce remote storage access
Remote storage is accessed using a filesystem specific inter-node protocol
  Single network hop due to filesystem's location awareness
-
Tightly coupled DFS for Hadoop
General purpose shared file system
  Implemented in the kernel, single namespace, compatible with most applications (no special library or language)
Data is distributed across local storage node disks
Architecturally like HDFS
  Can utilize same disk options as HDFS
    Including shared nothing DAS and SAN storage
  Some can also support shared SAN storage where raw volumes can be accessed by multiple nodes
    Failover model where only one node actively uses a volume, other can take over after failure
    Multiple initiator model where multiple nodes actively use a volume
Shared nothing option has similar cost/performance to HDFS on DAS
-
Distributed FS - local disks
[Figure: Distributed FS with local disks. Compute nodes are part of the DFS, with data spread across nodes; nodes communicate using the distributed FS inter-node protocol.]
-
Distributed FS - remote disks
[Figure: Distributed FS with remote disks. Compute nodes are distributed FS clients and scale-out nodes are distributed FS servers; they communicate using the distributed FS inter-node protocol.]
-
Remote DFS Write Operation
[Figure: remote DFS write operation]
Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
-
Local DFS Write Operation
[Figure: local DFS write operation]
-
Remote DFS Read Operation
[Figure: remote DFS read operation]
-
Local DFS Read Operation
[Figure: local DFS read operation]
-
Tightly coupled DFS for Hadoop
Pros
  Shared data access, any node can access any data like it is local
  POSIX compatible, works for non-Hadoop apps just like a local file system
  Centralized management and administration
  No NameNode, may have a better block mapping mechanism
  Compute in-place, same copy can be served via NFS/CIFS
  Many of the performance benefits of HDFS
Cons
  HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS
    Large file striping is not regular, based on compute distribution
    Copies are simultaneously readable
  Strict POSIX compliance leads to unnecessary serialization
    Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don't overlap
    Need to relax POSIX compliance for large files, or just stick with many smaller files
  Some DFSs have scaling limitations that are worse than HDFS, not designed for thousands of nodes
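The point about block-boundary accesses can be checked with a short sketch: if every task works on its own block-aligned byte range, no two ranges overlap, so serializing access to the whole file protects nothing. The function names and the small file and block sizes are assumptions chosen to keep the example readable.

```python
def block_ranges(file_size, block_size):
    """Split [0, file_size) into half-open, block-aligned byte ranges."""
    return [
        (start, min(start + block_size, file_size))
        for start in range(0, file_size, block_size)
    ]


def overlaps(a, b):
    """True if half-open ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


ranges = block_ranges(file_size=300, block_size=128)
any_overlap = any(
    overlaps(a, b) for i, a in enumerate(ranges) for b in ranges[i + 1:]
)
print(ranges, any_overlap)
```

A file system that locks per byte range, or relaxes its consistency rules for large files, can let these accesses proceed in parallel.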
-
Cloud Object Storage for Hadoop
Uses a REST API like CDMI, S3, or Swift
  HTTP based protocol, data is remote
  Objects are write once, read many, streaming access
  Objects have some stored metadata
Data is stored in cloud object storage
  Could be local or across internet
  Cheap, high volume
  Systems utilize triple redundancy or erasure coding for reliability
Often uses Hadoop S3 connector
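The cost difference between triple redundancy and erasure coding comes down to simple arithmetic, sketched below. The function names are made up for the example, and the 10 data + 4 parity layout is a hypothetical geometry, not a figure from any particular object store.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)


def erasure_coding_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data with k data + m parity fragments."""
    return (data_fragments + parity_fragments) / data_fragments


triple = replication_overhead(3)
coded = erasure_coding_overhead(10, 4)  # hypothetical 10+4 layout
print(triple, coded)
```

Under these assumptions the erasure-coded layout survives the loss of any four fragments while storing less than half the raw data that three full copies require, at the price of extra computation and network traffic on rebuild.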
-
Hadoop on Object Storage
[Figure: Hadoop accessing cloud object storage over REST/HTTP]
-
Hadoop Write on Object Storage
[Figure: Hadoop write on object storage]
-
Hadoop Read on Object Storage
[Figure: Hadoop read on object storage]
-
Object Storage for Hadoop
Pros
  Low cost, high volume, reliable storage
    Good location for infrequently used WORM data
  Public cloud options
  Scalable storage
  Data can easily be shared between Hadoop and other applications
Cons
  All data is remote, which hurts performance
    No data/compute colocation
  Limited capabilities, though a good match for Hadoop
  High disk cost if triple redundancy is used
Good choice for large infrequently accessed WORM items that may need to be accessed by non-Hadoop jobs as well
-
Emerging Options
New options are emerging in the storage research community
  Caching from enterprise storage
  Mirror to enterprise storage, NAS/NFS
  SSD
Improvements to HDFS
  HA options
  Access to non-Hadoop jobs
Bottom line
  The limitations of HDFS are known
  Work is ongoing to improve Hadoop storage options
-
Summary
Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured information
The default way to store data for Hadoop is HDFS on local direct attached disks
Alternatives to this architecture include SAN array storage, tightly-coupled general purpose DFS, and cloud object storage
  They can provide some significant advantages
  However, they aren't without their downsides; it's hard to beat a filesystem designed specifically for Hadoop
Which one is best for you?
  Depends on what is most important: cost, manageability, compatibility with existing infrastructure, performance, scale,
-
Attribution & Feedback
Please send any questions or comments regarding this SNIA Tutorial to [email protected]
The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.
Authorship History
  Sam Fineberg, August 2012
  Updates:
    Sam Fineberg, February 2013
    Sam Fineberg, March 2013
Additional Contributors
  Rob Peglar
  Joseph White
  Chris Santilli