Big Data and Hadoop
TRANSCRIPT
Contents
What is Big Data?
Limitations of the existing solutions
How Hadoop solves the problem
Introduction to Hadoop
Hadoop Eco-System
Hadoop Main Components
MapReduce Execution
File Read and Write
Sentiment Analysis
Big Data
Extremely large datasets (data in TBs and PBs):
Facebook has the world’s largest Hadoop cluster, with 400 TB of data in 2011 (currently 22 PB), and generates 20 TB of data per day.
NYSE generates 1 TB of data per day.
The Internet Archive stores around 2 PB of data and is growing at a very fast rate. The WayBack Machine is an example of such an archive: it is a digital archive of the WWW and other information on the internet. Its intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down.
Mostly unstructured data (roughly an 80:20 split of unstructured to structured).
web.archive.org
http://web.archive.org/web/20140626111157/http://thapar.edu/index.asp
Limitations to the existing solutions
Slow to process. Transfer rate and seek time of common storage devices:
IDE drive – 75 MB/s, 10 ms seek
SATA drive – 300 MB/s, 8.5 ms seek
SSD – 800 MB/s, 2 ms seek
Scaling is expensive.
Unreliable machines: risk of data loss.
Infrastructure Providers
Hadoop solves the problem
Introduction to Hadoop
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on clusters of computers.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large scale computations to be trivially parallelized across large clusters of servers.
Doug Cutting, an employee at Yahoo, realized the importance of this paper and applied its ideas to handle extremely large search problems.
In 2005, he created the open-source Hadoop framework that allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware.
Hadoop main components
Two main components:
HDFS – Hadoop Distributed File System (storage): data is distributed across nodes (DataNodes), and the NameNode tracks block locations; self-healing, high-bandwidth, clustered storage.
MapReduce (processing): splits a job into tasks across processors; the JobTracker manages the TaskTrackers.
Modes of working
Three modes:
Standalone mode (default): Hadoop does not use HDFS; files are stored on the local file system. Helpful for debugging.
Pseudo-distributed mode (single-node cluster): the configuration files are set so that all daemons run on a single node; replication factor R = 1.
Fully distributed mode: Hadoop at full scale, on clusters of up to thousands of nodes. Use this mode when working on large data.
Replication and Block Size
The default replication factor is 3 and the default block size is 64 MB (128 MB is recommended). Both can be updated by changing the configuration files.
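As a sketch of how these settings can be changed, the overrides below would go in hdfs-site.xml (property names follow the Hadoop 1.x convention mentioned alongside the 64 MB default; the values shown are illustrative):

```xml
<!-- Illustrative hdfs-site.xml overrides: replication factor and block size. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <!-- 128 MB, expressed in bytes -->
    <value>134217728</value>
  </property>
</configuration>
```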
MapReduce Programming Model
Example of MapReduce
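The classic example is word count. The single-process sketch below, written with only the standard library so it runs outside a cluster, mimics the three phases: map emits (word, 1) pairs, shuffle groups the pairs by key, and reduce sums the counts per word. In a real Hadoop job these phases run in Mapper and Reducer classes distributed across the cluster; the class and method names here are just illustrative.

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-process sketch of the MapReduce word-count example.
public class WordCount {
    // Map phase: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data and hadoop", "hadoop handles big data");

        // Shuffle phase: group the mapped pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> pair : map(line))
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());

        // Emit one (word, total) line per key.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
}
```

Running this prints each distinct word with its total count, e.g. "hadoop" and "big" each appear twice in the sample input.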
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Developed by Facebook. HiveQL is an SQL-like query language. Hive queries are first converted into MapReduce jobs at the backend, and are therefore slower than running a hand-written MapReduce program.
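As a sketch of what HiveQL looks like, the query below expresses the same word-count idea declaratively (the table and column names are illustrative assumptions); Hive compiles such a query into MapReduce jobs behind the scenes.

```sql
-- Illustrative HiveQL: word count over a table of text lines.
CREATE TABLE docs (line STRING);

SELECT word, COUNT(*) AS freq
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```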
Twitter Sentiment Analysis
Java program to get tweets
Sample Data
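The transcript does not include the scoring code itself, so the following is a hypothetical sketch of the simplest scoring step in a tweet sentiment analysis: count matches against small positive and negative word lexicons. The word lists and class name are illustrative assumptions, not from the original talk.

```java
import java.util.*;

// Hypothetical naive sentiment scorer: lexicon word lists are illustrative.
public class SentimentDemo {
    static final Set<String> POSITIVE = Set.of("good", "great", "love", "happy");
    static final Set<String> NEGATIVE = Set.of("bad", "hate", "sad", "awful");

    // Score a tweet: > 0 positive, < 0 negative, 0 neutral.
    static int score(String tweet) {
        int s = 0;
        for (String w : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(w)) s++;   // +1 per positive word
            if (NEGATIVE.contains(w)) s--;   // -1 per negative word
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("I love big data, it is great")); // prints 2
        System.out.println(score("hate this awful traffic"));      // prints -2
    }
}
```

At Hadoop scale, the same scoring logic would sit inside the map phase of a MapReduce job over the collected tweets.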
Big Data – The Road Ahead of Us
Huge repositories of structured and unstructured data across various digital platforms and social media.
Analysing them requires going beyond traditional database methods.
Big data promises growth and long-term sustainability.
Threats: data integrity, security breaches.