Big Data and Hadoop
TRANSCRIPT
Contents
What is Big Data?
Limitations of the existing solutions
How Hadoop solves the problem
Introduction to Hadoop
Hadoop Eco-System
Hadoop Main Components
MapReduce Execution
File Read and Write
Sentiment Analysis
Big Data
Extremely large datasets (data in TBs and PBs):
Facebook has the world’s largest Hadoop cluster, with 400 TB of data in 2011 (currently 22 PB), and generates 20 TB of data per day.
NYSE generates 1 TB of data per day.
The Internet Archive stores around 2 PB of data and is growing at a very fast rate. The WayBack Machine is an example of such an archive: it is a digital archive of the WWW and other information on the internet. Its intent is to capture and archive content that would otherwise be lost whenever a site is changed or closed down.
Mostly unstructured data (roughly an 80:20 split of unstructured to structured).
web.archive.org
http://web.archive.org/web/20140626111157/http://thapar.edu/index.asp
Limitations to the existing solutions
Slow to process. Transfer rate and seek time of common storage devices:
IDE drive – 75 MB/s, 10 ms seek
SATA drive – 300 MB/s, 8.5 ms seek
SSD – 800 MB/s, 2 ms seek
Scaling is expensive.
Unreliable machines: risk of data loss.
Infrastructure Providers
Hadoop solves the problem
Introduction to Hadoop
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets (Big Data) on clusters of computers.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large scale computations to be trivially parallelized across large clusters of servers.
Doug Cutting, an employee at Yahoo, realized the importance of this paper and applied its ideas to handle extremely large search problems.
In 2005, he created the open-source Hadoop framework that allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware.
Hadoop main components
Two main components:
HDFS – Hadoop Distributed File System (storage): data is distributed across nodes (DataNodes), and the NameNode tracks block locations; self-healing, high-bandwidth, clustered storage.
MapReduce (processing): splits a job into tasks across processors; the JobTracker manages the TaskTrackers.
Modes of working
Three modes:
Standalone mode (default): Hadoop does not use HDFS; files are stored on the local file system. Helpful for debugging.
Pseudo-distributed mode (single-node cluster): the configuration files are set so that all daemons run on a single node; replication factor R = 1.
Fully distributed mode: Hadoop at full scale, on clusters of up to thousands of nodes. Use this mode when working on large data.
Replication and Block Size
The default replication factor is 3 and the default block size is 64 MB (128 MB is recommended). Both can be updated by changing the configuration files.
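As a sketch of how these settings can be changed, the overrides below would go in hdfs-site.xml (property names follow the Hadoop 1.x convention mentioned alongside the 64 MB default; the values shown are illustrative):

```xml
<!-- Illustrative hdfs-site.xml overrides: replication factor and block size. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <!-- 128 MB, expressed in bytes -->
    <value>134217728</value>
  </property>
</configuration>
```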
MapReduce Programming Model
Example of MapReduce
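The classic example is word count. The single-process sketch below, written with only the standard library so it runs outside a cluster, mimics the three phases: map emits (word, 1) pairs, shuffle groups the pairs by key, and reduce sums the counts per word. In a real Hadoop job these phases run in Mapper and Reducer classes distributed across the cluster; the class and method names here are just illustrative.

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-process sketch of the MapReduce word-count example.
public class WordCount {
    // Map phase: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data and hadoop", "hadoop handles big data");

        // Shuffle phase: group the mapped pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> pair : map(line))
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());

        // Emit one (word, total) line per key.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
}
```

Running this prints each distinct word with its total count, e.g. "hadoop" and "big" each appear twice in the sample input.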
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Developed by Facebook. HiveQL is an SQL-like query language. Hive queries are first converted into MapReduce jobs at the backend, and are therefore slower than running a hand-written MapReduce program.
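As a sketch of what HiveQL looks like, the query below expresses the same word-count idea declaratively (the table and column names are illustrative assumptions); Hive compiles such a query into MapReduce jobs behind the scenes.

```sql
-- Illustrative HiveQL: word count over a table of text lines.
CREATE TABLE docs (line STRING);

SELECT word, COUNT(*) AS freq
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```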
Twitter Sentiment Analysis
Java program to get tweets
Sample Data
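The transcript does not include the scoring code itself, so the following is a hypothetical sketch of the simplest scoring step in a tweet sentiment analysis: count matches against small positive and negative word lexicons. The word lists and class name are illustrative assumptions, not from the original talk.

```java
import java.util.*;

// Hypothetical naive sentiment scorer: lexicon word lists are illustrative.
public class SentimentDemo {
    static final Set<String> POSITIVE = Set.of("good", "great", "love", "happy");
    static final Set<String> NEGATIVE = Set.of("bad", "hate", "sad", "awful");

    // Score a tweet: > 0 positive, < 0 negative, 0 neutral.
    static int score(String tweet) {
        int s = 0;
        for (String w : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(w)) s++;   // +1 per positive word
            if (NEGATIVE.contains(w)) s--;   // -1 per negative word
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("I love big data, it is great")); // prints 2
        System.out.println(score("hate this awful traffic"));      // prints -2
    }
}
```

At Hadoop scale, the same scoring logic would sit inside the map phase of a MapReduce job over the collected tweets.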
Big Data – The Road Ahead of Us
Huge repositories of structured and unstructured data across various digital platforms and social media.
Analysing them requires going beyond traditional database methods.
Big data promises growth and long-term sustainability.
Threats: data integrity, security breaches.