
Page 1: Big Data and Hadoop Overview

Big Data and Hadoop Overview

Shreekanth Vankamamidi

Page 2: Big Data and Hadoop Overview

Introduction

1990: A typical hard drive could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around five minutes.

2013: One-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. The problem grows bigger still as data gets even larger, i.e. "Big Data". For instance:

• The New York Stock Exchange generates about one terabyte of new trade data per day.

• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.

• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.

Gartner predicts 800% data growth over the next five years.

Page 3: Big Data and Hadoop Overview

Big Data

Big Data is the point at which traditional data management tools and practices no longer meet the demands of the size, diversity, and pace of new data.

Big Data is all about storing and accessing large amounts of structured and unstructured data. However, where to put that data and how to access it have become the biggest challenges for enterprises looking to leverage the information

Structured data: data with a defined format, such as an XML document or database tables.

Semi-structured data: data that may have a schema, but one that is often ignored; for example a spreadsheet, in which the structure is a grid of cells, although the cells themselves can hold any form of data.

Unstructured data: data without any particular internal structure, such as plain text or an image file. About 80% of Big Data is unstructured.

Big Data is not always about size. Size is one attribute, but data becomes "big" when it challenges the constraints of a system's capabilities or of business needs.

Page 4: Big Data and Hadoop Overview

A Big Data solution deals with:

• High volume
• Velocity
• Variety of data

Big Data attributes:

• Size
• Speed at which the data is generated
• Number and variety of sources that create it

Data by itself need not be big at all, but small data can fall under the Big Data classification by volume when many somehow-related fragments of it are created and aggregated at a high rate; for example, smart meter data.

Big Data challenges:

• Collecting
• Analyzing
• Understanding

Big Data Classification

Page 5: Big Data and Hadoop Overview

Big Data Products

Even though Big Data sizes are a constantly moving target, there are leaders in the Big Data movement addressing these challenges, including:

• Amazon
• Hadoop
• Cloudera (built on top of Hadoop)
• 10gen (MongoDB)
• Greenplum
• Couchbase
• Hadapt (built on top of Hadoop)
• Hortonworks
• Karmasphere

Page 6: Big Data and Hadoop Overview

What is Hadoop?

The Hadoop framework provides parallel processing capabilities with a reliable, shared storage and analysis system.

Open Source Project

Written in Java

Optimized to handle:

• Massive amounts of data through parallelism
• A variety of data (structured, unstructured, and semi-structured)

Great performance

Reliability provided through replication

Not for OLTP (ODBC/JDBC), not for OLAP, good for Big Data

• Shared storage: HDFS stores Big Data
• Analysis: MapReduce retrieves and analyzes Big Data

It is predicted that by 2015, more than half of the world’s data will be processed by Apache Hadoop

Page 7: Big Data and Hadoop Overview

Hadoop Projects

HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
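As an illustration of how this is typically used, here is a minimal sketch that queries Hive from Java over JDBC. It assumes a HiveServer2 endpoint is available; the host, port, credentials, and the trades table are all hypothetical, and Hive turns the query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (ships with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // Hive translates this SQL-like query into MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) FROM trades GROUP BY category");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}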

Page 8: Big Data and Hadoop Overview

HDFS (Hadoop Distributed File System)

HDFS is a distributed, scalable, and portable filesystem for the Hadoop framework. A Hadoop cluster typically has a single namenode, plus a cluster of datanodes that form the HDFS storage layer.

Page 9: Big Data and Hadoop Overview

HDFS (Hadoop Distributed File System)

Blocks: A disk has a block size, which is the minimum amount of data that it can read or write.

• Disk block size: 512 bytes
• Filesystem block size: a few kilobytes
• HDFS block size: 64 MB by default

HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly longer than the time to seek to the start of the block.
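As a rough illustration (assuming a typical disk seek time of about 10 ms, together with the 100 MB/s transfer rate mentioned in the introduction): reading one 64 MB block costs a single ~10 ms seek plus roughly 640 ms of transfer, so seeking is under 2% of the total time. If the same 64 MB were instead spread across many small blocks, each needing its own seek, seek time would quickly dominate.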

An HDFS cluster has two types of nodes operating in a master-worker pattern:

• Name Node: It maintains the filesystem tree and the metadata for all the files and directories in the tree.

• Data Nodes: They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
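To make the division of labor concrete, below is a minimal sketch of reading a file with the HDFS Java API; the file path is hypothetical, and the cluster address is assumed to come from the standard configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical file path, for illustration only.
        Path file = new Path("/data/sample.txt");
        // Opening the file consults the namenode for block locations;
        // the bytes themselves are streamed from the datanodes.
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}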

Page 10: Big Data and Hadoop Overview

HDFS High Availability

In earlier releases the namenode was a single point of failure. The 0.23 release series of Hadoop remedies this situation by adding support for HDFS high availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration.

Failover

The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active.

Fencing

The HA implementation employs a range of fencing mechanisms, including killing the namenode's process, revoking its access to the shared storage directory (typically by using a vendor-specific NFS command), and disabling its network port via a remote management command.

HDFS (Hadoop Distributed File System)

Page 11: Big Data and Hadoop Overview

Map-Reduce

It works by breaking the processing into two phases: a map phase and a reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
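As a concrete example, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class names are illustrative, and a complete job would also need a driver that sets the input and output paths. The map function emits a (word, 1) pair for every word it sees; the reduce function sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each input line, emit a (word, 1) pair per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key: word, value: 1
            }
        }
    }
}

// Reduce phase: sum all the 1s emitted for each distinct word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // key: word, value: total
    }
}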

Page 12: Big Data and Hadoop Overview

Understanding HDFS & Map Reduce

Page 13: Big Data and Hadoop Overview

Understanding HDFS & Map Reduce

Page 14: Big Data and Hadoop Overview

PIG & HIVE

MapReduce is great, but writing jobs directly against it is a bit like programming in assembly language.

Pig was developed at Yahoo! to spare developers from writing tedious MapReduce jobs; Hive, described earlier, plays a similar role with its SQL-based query language.

Page 15: Big Data and Hadoop Overview

HBase

HDFS is a distributed file system that is well suited to the storage of large files, but it does not provide fast individual record lookups within those files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion: HBase internally puts your data in indexed "StoreFiles" that live on HDFS, enabling high-speed lookups. HBase is a good fit when all of your data has already been moved into Hadoop and you want real-time access to it. A minimal sketch of such a lookup follows.
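Below is a sketch of a point lookup using the HBase Java client API of that era; the table name, row key, column family, and qualifier are all hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetExample {
    public static void main(String[] args) throws Exception {
        // Reads cluster settings (e.g., ZooKeeper quorum) from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table and row key, for illustration only.
        HTable table = new HTable(conf, "users");
        Get get = new Get(Bytes.toBytes("user#12345"));
        Result result = table.get(get);
        // Fetch one cell: column family "info", qualifier "email".
        byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
        System.out.println(email == null ? "not found" : Bytes.toString(email));
        table.close();
    }
}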

Page 16: Big Data and Hadoop Overview

When to use HBase

Scenario: Data of a few thousand to a few million rows
HBase: No. RDBMS: Yes.

Scenario: Data of hundreds of millions or billions of rows
HBase: Yes. RDBMS: No.

Scenario: Need for extra features that an RDBMS provides, such as typed columns, secondary indexes, transactions, and advanced query languages
HBase: No. RDBMS: Yes.
Comments: To use HBase, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example; consider moving from an RDBMS to HBase a complete redesign rather than a port.

Scenario: Need for substantial dedicated hardware
HBase: Yes. RDBMS: Not required.
Comments: For HBase, make sure you have enough hardware. Even HDFS doesn't do well with anything less than five datanodes (due to things such as HDFS block replication, which defaults to 3), plus a namenode.

Page 17: Big Data and Hadoop Overview

Conclusion

As experts say, the question is no longer "Why should I care about Big Data?" but rather "How can I get closer to Big Data and start taking advantage of it?"

Courtesy: http://oreilly.com/, http://emc.com