big data technologies and hadoop infrastructure

Roman Nikitchenko, 09.05.2014 BIG.DATA technologies & HADOOP infrastructure


DESCRIPTION

Big Data technologies are bringing new complexity, new tasks and new opportunities to the world. How good is Apache Hadoop for all of this?

TRANSCRIPT

Page 1: Big data technologies and Hadoop infrastructure

Roman Nikitchenko, 09.05.2014

BIG.DATA technologies & HADOOP infrastructure

Page 2: Big data technologies and Hadoop infrastructure

www.vitech.com.ua

Agenda

Hadoop is causing real changes in the big data industry

What technology is behind this name?

Why is Hadoop such a promising solution?

BIG DATA APPROACH

HADOOP ENVIRONMENT

INDUSTRY FACE IS CHANGING

Page 3: Big data technologies and Hadoop infrastructure


No escape for you ;-)

Page 4: Big data technologies and Hadoop infrastructure


What is BIG DATA?

● Really BIG DATA things: photo banks, video storage, historical measurements.

● Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks.

● Realtime data: measurements and monitoring, gaming.

● Intensive processing: science, modelling.

● High volumes of small things: social networks, healthcare

BIG DATA IS EVERYWHERE

Page 5: Big data technologies and Hadoop infrastructure


BIG DATA in just 3 words

Indeed any real big data is just about DIGITAL LIFE FOOTPRINT

Page 6: Big data technologies and Hadoop infrastructure


WORLD is big data itself

Yet to remember....

WORLD ITSELF CAN BE DIGITIZED TOO

● Earth weather and environment: realtime, really big data volumes, high potential for processing, lots of things to be analysed, historical data.

● Space: unlimited potential for analysis; the ocean is a yet-unknown volume.

● The Internet of things is going to be a digital world itself.

● ???

Page 7: Big data technologies and Hadoop infrastructure


So...

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

Page 8: Big data technologies and Hadoop infrastructure


But how can I handle big data?

… BUT HOW TO HANDLE IT?

BIG DATA

Page 9: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

NO BACKUPS

Page 10: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

SIMPLE BUT RELIABLE

● Really big amounts of data are to be stored in a reliable manner.

● Storage is to be simple, recoverable and cheap.

Page 11: Big data technologies and Hadoop infrastructure


BIG DATA storage: requirements

DECENTRALIZED

● No single point of failure.

● Scalable as close to linear as possible.

● No manual actions to recover in case of failures.

Page 12: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

SIMPLE TO USE

● Complexity is to be buried inside.

● Interface is to be functional and compatible between versions.

Page 13: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

TOOLS TO BE CLOSE TO WORK

● Process data on the same nodes where it is stored.

● Distributed storage — distributed processing.

Page 14: Big data technologies and Hadoop infrastructure


BIG DATA processing: requirements

SHARE LOAD

● Work is to be balanced.

● Data placement is to be appropriate to balanced work.

● Amount of work is to be balanced in accordance with resources.

Page 15: Big data technologies and Hadoop infrastructure


Solution requirements in general

WHAT FINALLY DO WE NEED?

● CPU+HDD in one place

● Cluster of replaceable nodes

● Lots of storage space

● A way to control resources and balance load

● Everything is to be relatively simple and affordable

x MAX = BIG DATA

Page 16: Big data technologies and Hadoop infrastructure


… and what is the solution?

HADOOP magic is here!

Page 17: Big data technologies and Hadoop infrastructure


What is it?

What is HADOOP?

● Hadoop is an open source framework for big data: both distributed storage and processing.

● Hadoop is reliable and fault tolerant without relying on hardware for these properties.

● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.

Page 18: Big data technologies and Hadoop infrastructure


Facts and trends

● 2004: inspired by the Google MapReduce idea. Originally named after the creator's son's toy elephant.

● On June 13, 2012 Facebook announced their Hadoop cluster has 100 PB of data. On November 8, 2012 they announced the warehouse grows by roughly half a PB per day.

● On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster.

Page 19: Big data technologies and Hadoop infrastructure


Hadoop: classical picture

Hadoop historical top view

● HDFS serves as file system layer

● MapReduce originally served as distributed processing framework.

● The native client API is Java, but there are lots of alternatives.

● This is only the initial architecture; it is now more complex.

Page 20: Big data technologies and Hadoop infrastructure


HDFS top view

● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.

● Actual work is performed by data nodes.

Page 21: Big data technologies and Hadoop infrastructure


HDFS files handling

● Files are stored in large enough blocks. Every block is replicated to several data nodes.

● Replication is tracked by the namenode. Clients only locate blocks using the namenode; the actual load is taken by the datanodes.

● A datanode failure leads to replication recovery. The namenode can be backed by a standby scheme.
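The block placement and recovery scheme above can be sketched in miniature. This is a toy simulation in plain Python, not real HDFS code; the names (`ToyNamenode`, `on_datanode_failure`) and the tiny block size are invented for illustration (real HDFS blocks default to 64–128 MB):

```python
import itertools

BLOCK_SIZE = 4       # toy value; real HDFS blocks are 64-128 MB
REPLICATION = 3      # HDFS default replication factor

class ToyNamenode:
    """Keeps only the 'directory': which datanodes hold each block."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}                       # (filename, block#) -> replica nodes
        self._cycle = itertools.cycle(self.datanodes)

    def write(self, filename, data):
        """Split data into blocks; place each replica on a distinct node."""
        n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        for idx in range(n_blocks):
            targets = []
            while len(targets) < REPLICATION:
                node = next(self._cycle)
                if node not in targets:
                    targets.append(node)
            self.block_map[(filename, idx)] = targets

    def on_datanode_failure(self, dead):
        """Re-replicate every block that lost its replica on the dead node."""
        self.datanodes.remove(dead)
        self._cycle = itertools.cycle(self.datanodes)
        for replicas in self.block_map.values():
            if dead in replicas:
                replicas.remove(dead)
                spare = next(n for n in self.datanodes if n not in replicas)
                replicas.append(spare)

nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4"])
nn.write("file.txt", "0123456789")    # 10 bytes -> 3 toy blocks
nn.on_datanode_failure("dn2")         # every block is back to 3 replicas
```

In real HDFS, clients then read blocks directly from the datanodes: the namenode only answers the 'where' questions, which is why it stays lightweight.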

Page 22: Big data technologies and Hadoop infrastructure


HDFS properties

● Designed for throughput, not for latency.

● Blocks are expected to be large. There is an issue with lots of small files.

● Write once, read many times ideology.

● Only append, no 'edit' ability.

● Special tools like Apache HBase are required to implement OLTP.

HDFS is ...

Page 23: Big data technologies and Hadoop infrastructure


MapReduce framework model

● 2-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner.

● A large class of jobs can be adapted, but not all of them.
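The two steps can be shown with the classic word-count example. This is a plain-Python sketch of the programming model only — the real framework runs map and reduce tasks distributed across the cluster, with the shuffle done for you between phases — and the function names here are invented for the illustration:

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each input record into (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (the framework does this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "data lake"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "cluster": 1, "lake": 1}
```

Jobs that fit this shape — independent per-record transforms followed by per-key aggregation — distribute well; jobs with global or iterative state are the ones that do not adapt easily.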

Page 24: Big data technologies and Hadoop infrastructure


MapReduce service: top view

● One JobTracker with redundancy possible.

● Multiple TaskTrackers doing the actual work.

● Ideology is similar to HDFS handling.

● HDFS is usually used as storage on all phases.

MapReduce service

Page 25: Big data technologies and Hadoop infrastructure


Technology: Hadoop 2.0 concept

● A new component (YARN) forms the resource management layer and completes a real distributed data OS.

● MapReduce is from now on only one among other YARN applications.

Page 26: Big data technologies and Hadoop infrastructure


YARN: notable addition

● Resource manager dispatches client requests.

● Node managers manage node resources.

● Any application is set of containers including application master.

YARN service
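A rough way to picture the container model: an application asks the resource manager for containers (the first one being its application master), and grants come out of the node managers' free capacity. The sketch below is plain Python with invented names (`ToyResourceManager`), not the YARN API, and it has no real scheduling policy or rollback:

```python
class ToyResourceManager:
    """Grants containers out of per-node free capacity (memory units)."""

    def __init__(self, node_capacity):
        self.free = dict(node_capacity)       # node manager -> free units

    def allocate(self, requests):
        """Grant one container per requested size, least-loaded node first.

        Returns a list of (node, size) grants, or None if a request
        cannot be satisfied (no rollback in this toy).
        """
        grants = []
        for size in requests:
            node = max(self.free, key=self.free.get)   # most free capacity
            if self.free[node] < size:
                return None
            self.free[node] -= size
            grants.append((node, size))
        return grants

rm = ToyResourceManager({"nm1": 8, "nm2": 8})
# one application = application master (1 unit) + two worker containers (4 each)
app = rm.allocate([1, 4, 4])
# app == [("nm1", 1), ("nm2", 4), ("nm1", 4)]
```

The point of the abstraction is that nothing here is MapReduce-specific: any application that can express its needs as containers can run on the same cluster.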

Page 27: Big data technologies and Hadoop infrastructure


YARN: notable addition

● Better resource balance for heterogeneous clusters and multiple applications.

● Dynamic applications over static services.

● A much wider application model than simple MapReduce: things like Spark or Tez.

Why is YARN SO important?

Page 28: Big data technologies and Hadoop infrastructure


Hadoop current picture

● HDFS2 is now about storage, and YARN is about processing resources.

● Lots of things can be done on top of this data OS, starting from traditional MapReduce. Now there are lots of alternatives.

Page 29: Big data technologies and Hadoop infrastructure


Just several items around

Infrastructure

● HBase: Scalable structured data storage for large tables.

● Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

● Mahout: A scalable machine learning and data mining library.

● Pig: A high-level data-flow language and execution framework for parallel computation.

● ZooKeeper: A high-performance distributed coordination service.

Page 30: Big data technologies and Hadoop infrastructure


Most important concept

The world's first ever DATA OS

A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

Page 31: Big data technologies and Hadoop infrastructure


Big data industry is changing.

HADOOP has influence on the face of the whole BIG DATA INDUSTRY

Page 32: Big data technologies and Hadoop infrastructure


New concepts

DATA LAKE

Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

Page 33: Big data technologies and Hadoop infrastructure


New concepts

ENTERPRISE DATA HUB

Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.

Page 34: Big data technologies and Hadoop infrastructure


Trends

Big data is going BIGGER

● SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.

● Memory- and SSD-based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.

● Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about staging a 300-node cluster in companies like eBay?

● Production clusters go beyond 4000 nodes (up to 10K). Node failure nearly every day.

Page 35: Big data technologies and Hadoop infrastructure


Trends

● A typical node is expected to include at least 64 GB of memory.

● Starting from 4 x 2 TB drives for storage; 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node.

● 10 and more CPU cores; 2 CPUs is the normal approach.

● SSDs are starting to be widely used not only for OS and caching but for data itself.

● Main outcome: the per-node cost model is changing.

HARDWARE IS GETTING CHEAPER

Page 36: Big data technologies and Hadoop infrastructure


Most important concept

● You need to limit the things you are guessing about

Page 37: Big data technologies and Hadoop infrastructure


For whom the bell tolls?

Old way

● Make assumptions about the data you need.

● Make assumptions about the data model.

● Make assumptions about the algorithms you need.

● Get confirmation of your initial guess about the result. Are you surprised?

New way

● Get as much data as you can.

● Detect the data model based on a set of algorithms with an extensive approach.

● Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set.

● Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?

Page 38: Big data technologies and Hadoop infrastructure


Major Hadoop distributions

● HortonWorks is 'purely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.

● Cloudera is stable enough but not stale: Hadoop 2.3 with YARN, HBase 0.96.x. Balance.

● MapR focuses on per-node performance, but they are slightly outdated in terms of functionality, and their distribution costs. For cases where node performance is a high priority.

● Intel is a newcomer to this market. Not for the near future.

Page 39: Big data technologies and Hadoop infrastructure


Questions and discussion