big data and hadoop

Upload: rahul-johari

Post on 11-Feb-2017

Category:

Data & Analytics


Contents

What is Big Data?
Limitations to the existing solutions
How Hadoop solves the problem
Introduction to Hadoop
Hadoop Eco-System
Hadoop main Components
MapReduce execution
File Read and Write
Sentiment Analysis

Big Data

Extremely large datasets (data measured in TBs and PBs).

Facebook has the world’s largest Hadoop cluster: 400 TB of data in 2011, currently 22 PB, with about 20 TB of new data generated per day.

NYSE generates 1 TB of data per day.

The Internet Archive stores around 2 PB of data and is growing very fast. Its Wayback Machine is a digital archive of the WWW and other information on the internet; the intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down.

Unstructured data (roughly 80:20, unstructured to structured)

web.archive.org: http://web.archive.org/web/20140626111157/http://thapar.edu/index.asp

Limitations to the existing solutions

Slow to process. Transfer rate and seek time of common storage:
IDE drive – 75 MB/s, 10 ms
SATA drive – 300 MB/s, 8.5 ms
SSD – 800 MB/s, 2 ms

Scaling is expensive.

Unreliable machines: risk of data loss.
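
To make "slow to process" concrete, the sketch below estimates how long one sequential pass over a dataset would take at the transfer rates listed above, and how that changes when the same data is spread over many drives read in parallel. The 1 TB dataset size and the 100-drive figure are illustrative assumptions, not numbers from the slides, and the function name `scan_seconds` is made up for this example.

```python
# Transfer rates (MB/s) are the ones quoted above; everything else is assumed.
TRANSFER_MB_PER_S = {"IDE": 75, "SATA": 300, "SSD": 800}

DATASET_MB = 1_000_000  # assumption: a 1 TB dataset, taking 1 TB = 10**6 MB


def scan_seconds(dataset_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Time for one sequential pass when the data is split evenly over `drives`."""
    return dataset_mb / (rate_mb_per_s * drives)


if __name__ == "__main__":
    for name, rate in TRANSFER_MB_PER_S.items():
        single = scan_seconds(DATASET_MB, rate)
        parallel = scan_seconds(DATASET_MB, rate, drives=100)  # assumed cluster size
        print(f"{name}: {single / 60:.0f} min on one drive, "
              f"{parallel:.0f} s across 100 drives")
```

The estimate ignores seek time and network overhead; it only shows why reading from many disks at once is attractive for data of this size.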

Infrastructure Providers

Hadoop solves the problem

Introduction to Hadoop

Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets (Big Data) on computer clusters.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.

In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large scale computations to be trivially parallelized across large clusters of servers.

Doug Cutting, who later joined Yahoo, recognized the importance of this paper and applied its ideas to extremely large search problems.

In 2005, he created the open-source Hadoop framework that allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware.

Hadoop main components

Two main components:

HDFS – Hadoop Distributed File System (storage): data is distributed across nodes (DataNodes); the NameNode tracks block locations; self-healing, high-bandwidth clustered storage.

MapReduce (processing): splits a task across processors; the JobTracker manages the TaskTrackers.

Modes of working

Three modes:

Standalone mode (default): Hadoop does not use HDFS to store files, only the local file system; helpful for debugging.

Pseudo-distributed mode (single-node cluster): the configuration files are set up to run everything on a single node, with replication factor R = 1.

Distributed mode: Hadoop at full scale, on clusters that can consist of thousands of nodes; use this mode when working on large data.

Replication and Block Size

Default replication factor is 3 and default block size is 64 MB (128 MB recommended). Both can be changed by editing the configuration files.
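
As a rough illustration of what these two settings imply, the sketch below computes how many blocks a file is split into and how much raw disk space its replicas occupy. The 1024 MB file size and the function name `hdfs_footprint` are assumptions made for the example. In a typical installation the settings themselves live in `hdfs-site.xml` as `dfs.replication` and `dfs.blocksize` (`dfs.block.size` in older releases).

```python
import math
from typing import Tuple

BLOCK_SIZE_MB = 64  # default block size quoted above
REPLICATION = 3     # default replication factor quoted above


def hdfs_footprint(file_mb: float,
                   block_mb: int = BLOCK_SIZE_MB,
                   replication: int = REPLICATION) -> Tuple[int, int, float]:
    """Return (blocks, block replicas, raw MB stored) for one file.

    The last block only occupies as much disk as the data it holds,
    so raw storage is simply file size times replication factor.
    """
    blocks = math.ceil(file_mb / block_mb)
    return blocks, blocks * replication, file_mb * replication


if __name__ == "__main__":
    # Assumed example: a 1024 MB file under both block sizes mentioned above.
    for block_mb in (64, 128):
        print(block_mb, hdfs_footprint(1024, block_mb=block_mb))
```

A larger block size leaves the raw storage unchanged but halves the number of blocks the NameNode has to track, which is one common argument for the 128 MB recommendation.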

MapReduce Programming Model

Page 20: Big data and hadoop

Example of MapReduce
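
The slide's own example is not preserved in this transcript. As a stand-in, here is the customary word-count illustration of the map, shuffle and reduce steps, written as a single-process Python simulation. It does not use the Hadoop API; the function names and the sample input are made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data and hadoop", "hadoop stores big data"]  # made-up input
    print(word_count(sample))
```

On a real cluster the mappers and reducers would run on different nodes and the shuffle would move data over the network; here everything runs in one process to keep the data flow visible.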

Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Developed by Facebook.

HiveQL – SQL-like query language.

Hive queries are first converted into MapReduce (MR) jobs at the backend, and are therefore typically slower than running a hand-written MR program directly.
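
To illustrate what "converted into MR" means, the sketch below expresses a HiveQL-style aggregation as a map step and a reduce step in plain Python. The table `tweets`, its `lang` column, and the query in the comment are hypothetical; the code mimics the shape of the translation, not Hive's actual query planner.

```python
# Hypothetical query being mimicked:
#   SELECT lang, COUNT(*) FROM tweets GROUP BY lang;
from itertools import groupby
from operator import itemgetter


def map_row(row):
    """Map: use the GROUP BY column as the key and emit 1 per row."""
    return row["lang"], 1


def run_group_by_count(rows):
    """Sort by key (standing in for the shuffle), then reduce each group by summing."""
    mapped = sorted(map(map_row, rows), key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(mapped, key=itemgetter(0))}


if __name__ == "__main__":
    rows = [{"lang": "en"}, {"lang": "hi"}, {"lang": "en"}]  # made-up rows
    print(run_group_by_count(rows))
```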

Twitter Sentiment Analysis

Java program to get tweets

Sample Data

Big Data – The road ahead

Huge repositories of structured and unstructured data across various digital platforms and social media.

Analysis goes beyond traditional database methods.

Big data promises growth and long-term sustainability.

Threats: data integrity, security breaches.