big data technologies and hadoop infrastructure

Roman Nikitchenko, 09.05.2014 BIG.DATA technologies & HADOOP infrastructure


DESCRIPTION

Big Data technologies are bringing new complexity, new tasks and new opportunities to the world. How good is Apache Hadoop for all of this?

TRANSCRIPT

Page 1: Big data technologies and Hadoop infrastructure

Roman Nikitchenko, 09.05.2014

BIG.DATA technologies & HADOOP infrastructure

Page 2: Big data technologies and Hadoop infrastructure

www.vitech.com.ua

Agenda

Hadoop is causing real changes in the big data industry

What technology is behind this name?

Why is Hadoop such a promising solution?

BIG DATA APPROACH

HADOOP ENVIRONMENT

INDUSTRY FACE IS CHANGING

Page 3: Big data technologies and Hadoop infrastructure


No escape for you ;-)

Page 4: Big data technologies and Hadoop infrastructure


What is BIG DATA?

● Really BIG DATA things: photo banks, video storage, historical measurements.

● Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks.

● Realtime data: measurements and monitoring, gaming.

● Intensive processing: science, modelling.

● High volumes of small things: social networks, healthcare

BIG DATA IS EVERYWHERE

Page 5: Big data technologies and Hadoop infrastructure


BIG DATA in just 3 words

Indeed any real big data is just about DIGITAL LIFE FOOTPRINT

Page 6: Big data technologies and Hadoop infrastructure


WORLD is big data itself

Yet to remember....

WORLD ITSELF CAN BE DIGITIZED TOO

● Earth weather and environment: realtime, really big data volumes, high potential for processing, lots of things to be analysed, historical data.

● Space: unlimited potential for analysis; the ocean is a yet-unknown volume.

● The Internet of things is going to be a digital world itself.

● ???

Page 7: Big data technologies and Hadoop infrastructure


So...

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

Page 8: Big data technologies and Hadoop infrastructure


But how can I handle big data?

… BUT HOW TO HANDLE IT?

BIG DATA

Page 9: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

NO BACKUPS

Page 10: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

SIMPLE BUT RELIABLE

● Really big amounts of data are to be stored in a reliable manner.

● Storage is to be simple, recoverable and cheap.

Page 11: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

DECENTRALIZED

● No single point of failure.

● Scalable as close to linear as possible.

● No manual actions to recover in case of failures.

Page 12: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

SIMPLE TO USE

● Complexity is to be buried inside.

● Interface is to be functional and compatible between versions.

Page 13: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

TOOLS TO BE CLOSE TO WORK

● Process data on the same nodes where it is stored.

● Distributed storage — distributed processing.

Page 14: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

SHARE LOAD

● Work is to be balanced.

● Data placement is to be appropriate to balanced work.

● Amount of work is to be balanced in accordance with resources.

Page 15: Big data technologies and Hadoop infrastructure


Solution requirements in general

WHAT FINALLY DO WE NEED?

● CPU+HDD in one place

● Cluster of replaceable nodes

● Lots of storage space

● A way to control resources and balance load

● Everything is to be relatively simple and affordable

x MAX = BIG DATA

Page 16: Big data technologies and Hadoop infrastructure


… and what is the solution?

HADOOP magic is here!

Page 17: Big data technologies and Hadoop infrastructure


What is it?

What is HADOOP?

● Hadoop is an open source framework for big data: both distributed storage and processing.

● Hadoop is reliable and fault tolerant without relying on hardware for these properties.

● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.

Page 18: Big data technologies and Hadoop infrastructure


Facts and trends

● 2004: inspired by the Google MapReduce idea. Originally named after the creator's son's toy elephant.

● On June 13, 2012 Facebook announced their Hadoop cluster has 100 PB of data. On November 8, 2012 they announced the warehouse grows by roughly half a PB per day.

● On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster.

Page 19: Big data technologies and Hadoop infrastructure


Hadoop: classical picture

Hadoop historical top view

● HDFS serves as file system layer

● MapReduce originally served as distributed processing framework.

● The native client API is Java, but there are lots of alternatives.

● This is only the initial architecture; it is now more complex.

Page 20: Big data technologies and Hadoop infrastructure


HDFS top view

● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.

● Actual work is performed by data nodes.

Page 21: Big data technologies and Hadoop infrastructure


HDFS files handling

● Files are stored in large enough blocks. Every block is replicated to several data nodes.

● Replication is tracked by the namenode. Clients only locate blocks using the namenode; the actual load is taken by the datanodes.

● A datanode failure leads to replication recovery. The namenode can be backed by a standby scheme.
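The block placement and recovery scheme above can be sketched in miniature. This is a toy simulation in plain Python, not real HDFS code; the names (`ToyNamenode`, `on_datanode_failure`) and the tiny block size are invented for illustration (real HDFS blocks default to 64–128 MB):

```python
import itertools

BLOCK_SIZE = 4       # toy value; real HDFS blocks are 64-128 MB
REPLICATION = 3      # HDFS default replication factor

class ToyNamenode:
    """Keeps only the 'directory': which datanodes hold each block."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}                       # (filename, block#) -> replica nodes
        self._cycle = itertools.cycle(self.datanodes)

    def write(self, filename, data):
        """Split data into blocks; place each replica on a distinct node."""
        n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        for idx in range(n_blocks):
            targets = []
            while len(targets) < REPLICATION:
                node = next(self._cycle)
                if node not in targets:
                    targets.append(node)
            self.block_map[(filename, idx)] = targets

    def on_datanode_failure(self, dead):
        """Re-replicate every block that lost its replica on the dead node."""
        self.datanodes.remove(dead)
        self._cycle = itertools.cycle(self.datanodes)
        for replicas in self.block_map.values():
            if dead in replicas:
                replicas.remove(dead)
                spare = next(n for n in self.datanodes if n not in replicas)
                replicas.append(spare)

nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4"])
nn.write("file.txt", "0123456789")    # 10 bytes -> 3 toy blocks
nn.on_datanode_failure("dn2")         # every block is back to 3 replicas
```

In real HDFS, clients then read blocks directly from the datanodes: the namenode only answers the 'where' questions, which is why it stays lightweight.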

Page 22: Big data technologies and Hadoop infrastructure


HDFS properties

● Designed for throughput, not for latency.

● Blocks are expected to be large. There is an issue with lots of small files.

● Write once, read many times ideology.

● Only append, no 'edit' ability.

● Special tools like Apache HBase are required to implement OLTP.

HDFS is ...

Page 23: Big data technologies and Hadoop infrastructure


MapReduce framework model

● 2-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner.

● A large class of jobs can be adapted, but not all of them.
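The two steps can be shown with the classic word-count example. This is a plain-Python sketch of the programming model only — the real framework runs map and reduce tasks distributed across the cluster, with the shuffle done for you between phases — and the function names here are invented for the illustration:

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each input record into (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (the framework does this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "data lake"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "cluster": 1, "lake": 1}
```

Jobs that fit this shape — independent per-record transforms followed by per-key aggregation — distribute well; jobs with global or iterative state are the ones that do not adapt easily.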

Page 24: Big data technologies and Hadoop infrastructure


MapReduce service: top view

● One JobTracker with redundancy possible.

● Multiple TaskTrackers doing the actual work.

● Ideology is similar to HDFS handling.

● HDFS is usually used as storage on all phases.

MapReduce service

Page 25: Big data technologies and Hadoop infrastructure


Technology: Hadoop 2.0 concept

● A new component (YARN) forms the resource management layer and completes a real distributed data OS.

● MapReduce is from now on only one among other YARN applications.

Page 26: Big data technologies and Hadoop infrastructure


YARN: notable addition

● Resource manager dispatches client requests.

● Node managers manage node resources.

● Any application is set of containers including application master.

YARN service
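A rough way to picture the container model: an application asks the resource manager for containers (the first one being its application master), and grants come out of the node managers' free capacity. The sketch below is plain Python with invented names (`ToyResourceManager`), not the YARN API, and it has no real scheduling policy or rollback:

```python
class ToyResourceManager:
    """Grants containers out of per-node free capacity (memory units)."""

    def __init__(self, node_capacity):
        self.free = dict(node_capacity)       # node manager -> free units

    def allocate(self, requests):
        """Grant one container per requested size, least-loaded node first.

        Returns a list of (node, size) grants, or None if a request
        cannot be satisfied (no rollback in this toy).
        """
        grants = []
        for size in requests:
            node = max(self.free, key=self.free.get)   # most free capacity
            if self.free[node] < size:
                return None
            self.free[node] -= size
            grants.append((node, size))
        return grants

rm = ToyResourceManager({"nm1": 8, "nm2": 8})
# one application = application master (1 unit) + two worker containers (4 each)
app = rm.allocate([1, 4, 4])
# app == [("nm1", 1), ("nm2", 4), ("nm1", 4)]
```

The point of the abstraction is that nothing here is MapReduce-specific: any application that can express its needs as containers can run on the same cluster.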

Page 27: Big data technologies and Hadoop infrastructure


YARN: notable addition

● Better resource balance for heterogeneous clusters and multiple applications.

● Dynamic applications over static services.

● A much wider application model than simple MapReduce: things like Spark or Tez.

Why is YARN SO important?

Page 28: Big data technologies and Hadoop infrastructure


Hadoop current picture

● HDFS2 is now about storage, and YARN is about processing resources.

● Lots of things can be done on top of this data OS, starting from traditional MapReduce. Now there are lots of alternatives.

Page 29: Big data technologies and Hadoop infrastructure


Just several items around

Infrastructure

● HBase: Scalable structured data storage for large tables.

● Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

● Mahout: A scalable machine learning and data mining library.

● Pig: A high-level data-flow language and execution framework for parallel computation.

● ZooKeeper: A high-performance distributed coordination service.

Page 30: Big data technologies and Hadoop infrastructure


Most important concept

The world's first ever DATA OS

A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

Page 31: Big data technologies and Hadoop infrastructure


Big data industry is changing.

HADOOP has influence on the face of the whole BIG DATA INDUSTRY

Page 32: Big data technologies and Hadoop infrastructure


New concepts

DATA LAKE

Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

Page 33: Big data technologies and Hadoop infrastructure


New concepts

ENTERPRISE DATA HUB

Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.

Page 34: Big data technologies and Hadoop infrastructure


Trends

Big data is going BIGGER

● SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.

● Memory- and SSD-based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.

● Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about staging a 300-node cluster in companies like eBay?

● Production clusters go beyond 4000 nodes (up to 10K). Node failure nearly every day.

Page 35: Big data technologies and Hadoop infrastructure


Trends

● A typical node is expected to include at least 64 GB of memory.

● Starting from 4 x 2 TB drives for storage; 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node.

● 10 and more CPU cores; 2 CPUs is the normal approach.

● SSDs are starting to be widely used not only for OS and caching but for data itself.

● Main outcome: the per-node cost model is changing.

HARDWARE IS GETTING CHEAPER

Page 36: Big data technologies and Hadoop infrastructure


Most important concept

● You need to limit the things you are guessing about

Page 37: Big data technologies and Hadoop infrastructure


For whom the bell tolls?

Old way

● Make assumptions about the data you need.

● Make assumptions about the data model.

● Make assumptions about the algorithms you need.

● Get confirmation of your initial guess about the result. Are you surprised?

New way

● Get as much data as you can.

● Detect the data model based on a set of algorithms with an extensive approach.

● Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set.

● Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?

Page 38: Big data technologies and Hadoop infrastructure


Major Hadoop distributions

● HortonWorks is 'purely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.

● Cloudera is stable enough but not stale: Hadoop 2.3 with YARN, HBase 0.96.x. Balance.

● MapR focuses on per-node performance, but they are slightly outdated in terms of functionality, and their distribution costs. For cases where node performance is a high priority.

● Intel is a newcomer to this market. Not for the near future.

Page 39: Big data technologies and Hadoop infrastructure


Questions and discussion