Hadoop Enhancements Using Next-Gen Intel® Platform Technologies

Anoop Sam John - PMC member for Apache HBase
Rakesh R - Committer for Apache ZooKeeper and PMC member for Apache BookKeeper

Upload: bigdata-meetup-kochi

Post on 08-Jan-2017


Page 1: Hadoop enhancements using next gen IA technologies

Hadoop Enhancements Using Next-Gen Intel® Platform Technologies

Anoop Sam John - PMC member for Apache HBase

Rakesh R - Committer for Apache ZooKeeper and PMC member for Apache BookKeeper

Page 2: Hadoop enhancements using next gen IA technologies

Hadoop Enhancements Using Next-Gen Intel® Platform Technologies

Anoop Sam John, Rakesh R

Page 3: Hadoop enhancements using next gen IA technologies

About Us

• Anoop Sam John
  - PMC member for Apache HBase and Phoenix
  - [email protected]
  - https://www.linkedin.com/in/anoopsamjohn

• Rakesh R
  - Committer for Apache ZooKeeper and BookKeeper
  - Apache Hadoop contributor
  - [email protected]
  - https://www.linkedin.com/in/rakeshadr

Page 4: Hadoop enhancements using next gen IA technologies

Agenda

• Intel Enhancements on Hadoop platform
• HDFS
  - Erasure coding using ISA-L library
  - Encryption using AES-NI
• HBase
  - Go Big Cache

Page 5: Hadoop enhancements using next gen IA technologies

HDFS – Distributed FileSystem

“Between the birth of the world and 2003, there were 5 Exabytes of information created. We now create 5 Exabytes every two days.”

Eric Schmidt, Executive Chairman of Alphabet, Inc.

Massive amount of data from many sources

Page 6: Hadoop enhancements using next gen IA technologies

HDFS – Current Replication Strategy

• Inherits 3-way replication from Google File System to increase data availability
  - 3x storage overhead
• Expensive for:
  - Massive amounts of data
  - Geo-distributed data recovery

[Figure: 3X replication. DFSClient writes data; the block is replicated (r1, r2, r3) across datanodes in Rack-1 and Rack-2.]

Page 7: Hadoop enhancements using next gen IA technologies

HDFS – Erasure Coding

• k data blocks + m parity blocks (k + m)
  - Example: Reed-Solomon 6 + 3
• Save disk space
• 1.5x storage overhead

Sample codec library (XOR based):

X  Y  X ⊕ Y
0  0  0
0  1  1
1  0  1
1  1  0

(X and Y are data bits; X ⊕ Y is the parity bit.)

[Figure: a codec library produces blocks b1 to b9 (6 data blocks, 3 parity blocks), placed on D1 to D9.]
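The XOR truth table on this slide can be turned into a small worked example. The sketch below is illustrative only and is not the HDFS codec: it computes one XOR parity block over a few data blocks and rebuilds a single lost block from the survivors. The function names `xor_parity` and `recover_missing` are made up for this example. A single XOR parity tolerates only one lost block, whereas Reed-Solomon 6 + 3 tolerates three.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the missing block.
    return xor_parity(list(surviving_blocks) + [parity])


data = [b"HDFS", b"data", b"blok"]   # three toy data blocks
parity = xor_parity(data)            # one parity block

# Pretend the second block was lost and rebuild it.
rebuilt = recover_missing([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Recovery works because XOR is its own inverse: any one block equals the XOR of all the others plus the parity.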

Page 8: Hadoop enhancements using next gen IA technologies

Durability & Efficiency

                      3-way data replication    Erasure coding: RS-(6,3)
Data Durability       2                         3
Storage efficiency    1/3 (33.33%)              6/9 (67%)

Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage holds useful data?

[Figure: 3-way data replication. Replica1, Replica2, Replica3 on Datanode1, Datanode2, Datanode3; labels: useful data, redundant data.]

[Figure: RS-(6,3) erasure coding. Blocks b1 to b9 (6 data blocks, 3 parity blocks) on D1 to D9.]

• Released version – Apache Hadoop 3.0.0-alpha1
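The durability and efficiency figures on this slide follow from simple arithmetic, sketched below. The helper names `replication_profile` and `erasure_coding_profile` are hypothetical. The sketch assumes each replica or block sits on a separate node, so one failure costs at most one copy or block.

```python
from fractions import Fraction


def replication_profile(replicas):
    """Durability and efficiency when every block is stored `replicas` times."""
    return {
        "durability": replicas - 1,               # copies that may be lost
        "efficiency": Fraction(1, replicas),      # useful data / total stored
        "storage_factor": Fraction(replicas, 1),  # total stored / useful data
    }


def erasure_coding_profile(k, m):
    """Durability and efficiency for k data blocks plus m parity blocks."""
    return {
        "durability": m,                          # any m blocks may be lost
        "efficiency": Fraction(k, k + m),
        "storage_factor": Fraction(k + m, k),
    }


three_way = replication_profile(3)
rs_6_3 = erasure_coding_profile(6, 3)

assert three_way["durability"] == 2 and three_way["efficiency"] == Fraction(1, 3)
assert rs_6_3["durability"] == 3 and rs_6_3["efficiency"] == Fraction(2, 3)
assert rs_6_3["storage_factor"] == Fraction(3, 2)   # the 1.5x figure
```

RS-(6,3) tolerates one more failure than 3-way replication while storing half as many bytes per byte of useful data (1.5x versus 3x).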

Page 9: Hadoop enhancements using next gen IA technologies

Microbenchmark: Codec Calculation

[Figure: codec calculation throughput, in MB per second. Image courtesy Cloudera.]

High performance

• New Intel architecture solutions for storage (ISA-L)
• Intel® Intelligent Storage Acceleration Library provides a solution to deploy EC with better performance. https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version

Page 10: Hadoop enhancements using next gen IA technologies

HDFS – Encryption

Page 11: Hadoop enhancements using next gen IA technologies

HDFS – Encryption

• Data sensitivity and privacy management are very important for big data analytics
• Encryption is a regulatory requirement for many business sectors
  - Finance
  - Government
  - Healthcare etc.

[Figure: DFSClient, KMS, and HDFS cluster. Labels: per-file key operations, encryption key ops, data ops (read/write encrypted data), encryption library, data in-transit, data at-rest (disk).]

• Released version – Apache Hadoop 2.6.0

Page 12: Hadoop enhancements using next gen IA technologies

Encryption Algorithm

• Data encryption/decryption is costly
• Encryption ciphers
  - AES-CTR (Advanced Encryption Standard - Counter Mode) is the most popular
  - 128-, 192-, or 256-bit keys
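To show how counter mode is structured, here is a minimal sketch. Python's standard library has no AES, so the block function below is a stand-in built from HMAC-SHA256 truncated to 16 bytes. It is NOT AES and must not be used for real encryption. The names `keystream_block` and `ctr_xor` are hypothetical. The sketch shows two properties of CTR mode: encryption and decryption are the same XOR operation, and any byte offset can be processed without reading what came before it.

```python
import hashlib
import hmac

BLOCK_SIZE = 16  # AES block size in bytes; the stand-in is truncated to match


def keystream_block(key, nonce, counter):
    """Stand-in for AES_encrypt(key, nonce || counter). This is NOT real AES."""
    message = nonce + counter.to_bytes(8, "big")
    return hmac.new(key, message, hashlib.sha256).digest()[:BLOCK_SIZE]


def ctr_xor(key, nonce, data, offset=0):
    """Encrypt or decrypt `data`, which starts at byte `offset` of the stream."""
    out = bytearray(len(data))
    block = b""
    for i, byte in enumerate(data):
        position = offset + i
        # Fetch a new keystream block at the start and at each block boundary.
        if i == 0 or position % BLOCK_SIZE == 0:
            block = keystream_block(key, nonce, position // BLOCK_SIZE)
        out[i] = byte ^ block[position % BLOCK_SIZE]
    return bytes(out)


key = b"k" * 16      # a 128-bit key (illustrative value)
nonce = b"n" * 8     # per-file nonce (illustrative value)
plaintext = b"forty bytes of sample data for CTR mode!"

ciphertext = ctr_xor(key, nonce, plaintext)
assert ctr_xor(key, nonce, ciphertext) == plaintext  # same call decrypts

# Decrypt only the tail, starting at byte 20, without touching bytes 0..19.
assert ctr_xor(key, nonce, ciphertext[20:], offset=20) == plaintext[20:]
```

Being able to start at any offset is what makes a counter-mode cipher a natural fit for files that are read with seeks.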

Page 13: Hadoop enhancements using next gen IA technologies

Encryption AES-CTR

• Two implementations of AES-CTR
  1. JCE (Java Cryptography Extension) software implementation
  2. OpenSSL hardware-accelerated AES-NI (Intel® Advanced Encryption Standard New Instructions) implementation
• AES-NI available in Westmere (2010) and newer Intel CPUs
• AES-NI further optimized in Haswell (2013)

https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni

Page 14: Hadoop enhancements using next gen IA technologies

Microbenchmark: Encrypt/Decrypt 1GB Byte array

Test Environment:
  - Run locally on a single Haswell machine
  - Single threaded, excluding HDFS overheads (checksumming, network, copies)

[Figure: benchmark results. Image courtesy Cloudera.]

Page 15: Hadoop enhancements using next gen IA technologies

Apache Commons Crypto

• Cryptographic layer is incubated as a new Apache component: http://commons.apache.org/proper/commons-crypto/
• Apache Commons Crypto has also been integrated with Apache Spark for shuffle encryption.

Page 16: Hadoop enhancements using next gen IA technologies

HBase

Page 17: Hadoop enhancements using next gen IA technologies

HBase

• NoSQL database in the Hadoop ecosystem
• Accumulates writes in memory and flushes to HDFS
• Caches data
• Reduced read latency
• Better read throughput
• Memory-hungry processes
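The write and read paths listed above can be pictured with a toy model. This is a sketch only and does not reflect HBase's actual classes or APIs: `TinyStore` and its fields are hypothetical, the "files" are in-memory dictionaries standing in for immutable files on HDFS, and the cache holds individual values rather than blocks.

```python
from collections import OrderedDict


class TinyStore:
    """Toy model: buffer writes in memory, flush to 'files', cache reads."""

    def __init__(self, flush_threshold=3, cache_capacity=2):
        self.flush_threshold = flush_threshold
        self.cache_capacity = cache_capacity
        self.memstore = {}           # in-memory write buffer
        self.flushed_files = []      # stand-in for immutable files on HDFS
        self.cache = OrderedDict()   # LRU read cache
        self.file_reads = 0
        self.cache_hits = 0

    def put(self, key, value):
        self.memstore[key] = value
        self.cache.pop(key, None)    # drop any stale cached value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.flushed_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:                 # newest data first
            return self.memstore[key]
        if key in self.cache:                    # then the read cache
            self.cache.move_to_end(key)
            self.cache_hits += 1
            return self.cache[key]
        for flushed in reversed(self.flushed_files):   # newest file first
            if key in flushed:
                self.file_reads += 1
                self.cache[key] = flushed[key]
                if len(self.cache) > self.cache_capacity:
                    self.cache.popitem(last=False)     # evict least recent
                return flushed[key]
        return None


store = TinyStore(flush_threshold=2)
store.put("a", "1")
store.put("b", "2")              # reaches the threshold and triggers a flush
assert store.get("a") == "1"     # first read goes to the flushed file
assert store.get("a") == "1"     # second read is served from the cache
assert (store.file_reads, store.cache_hits) == (1, 1)
```

The write buffer and the read cache both live in memory, which is why the process is described as memory hungry.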

Page 18: Hadoop enhancements using next gen IA technologies

Big Memory

• Hadoop platforms are no longer only for commodity hardware.
• Systems are moving towards faster CPUs and bigger memories

Big Data => Big Storage + Big Memory

• Non-volatile memory technology
• 3D XPoint™ DIMMs from Intel®
• Higher memory capability
• Lower cost vs DDR

Page 19: Hadoop enhancements using next gen IA technologies

HBase – Go Big Cache

[Figure: Client, HBase, and HDFS; reads are served from a cache of data held in JVM off-heap memory.]

• JVM GC tuning continues to be a challenge with larger heaps (new GC algos)
• Much bigger cache in off-heap memory for faster random reads
• More predictable latency
• Building blocks for supporting 3D XPoint™ products

Page 20: Hadoop enhancements using next gen IA technologies

HBase – Go Big Cache

[Figures: performance before off-heaping; performance after off-heaping. Image courtesy Alibaba Inc.]

• Alibaba adopted this feature for their 1600-node cluster.
• Used in the Double 11 online sale

Page 21: Hadoop enhancements using next gen IA technologies

Questions

Page 22: Hadoop enhancements using next gen IA technologies

Thank You

https://software.intel.com/en-us/bigdata/apache-big-data-stack

Page 24: Hadoop enhancements using next gen IA technologies

Backup slides

Page 25: Hadoop enhancements using next gen IA technologies

Microbenchmark: Codec Calculation

[Figure: codec calculation throughput, in MB per second.]

Page 26: Hadoop enhancements using next gen IA technologies

Microbenchmark: Codec Calculation

[Figure: codec calculation throughput, in MB per second.]