BIG DATA ANALYTICS WITH HADOOP
By Viplav Mandal, Computer Science and Engg.
Volume 1
AGENDA
INTRODUCTION
BIG DATA REQUIREMENTS
DATA GROWTH
HADOOP
HADOOP HISTORY
HADOOP DEVELOPMENT HISTORY
HADOOP TOOLS
REFERENCES
INTRODUCTION
Big Data: what does it mean?
Volume
- Big data comes at large scale: terabytes, even petabytes
- Records, transactions, tables, files
Velocity
- Data flows continuously; time-sensitive, streaming
- Batch, real-time, streams, historic
Variety
- Big data includes structured, semi-structured, and unstructured data of all kinds
- Text, files, logs, XML, audio, video, streams, flat files, etc.
Veracity
- Quality, consistency, reliability, and provenance of data
- Good, bad, incomplete, undefined
The ratio?
- 20% structured
- 80% unstructured
Big Data Requirements
Technology requirements?
- No particular technology stack required
- Fresher or experienced, it doesn't matter
- Better to know (but not required): Java & Linux
Hardware requirements?
- No need to purchase anything
- Better to have: a 64-bit machine
What about growth?
Big data users
Hadoop
A large-scale distributed processing Apache framework
- Created by Doug Cutting
- No high-end servers or machines required, only commodity servers
- Open-source framework and implementation of Google's MapReduce
- Efficient, reliable, easy to use
- Stores and processes large amounts of data
- Performance, storage, and processing scale linearly
- Simple core, modular and extensible
- Manageable and self-healing
- Cost-effective hardware
Hadoop History
2002-2004 – Doug Cutting starts working on Nutch
2003-2004 – Google publishes the GFS and MapReduce white papers
2004 – Doug Cutting adds DFS and MapReduce support to Nutch
Yahoo! hires Doug Cutting and builds a team to develop Hadoop
2007 – The New York Times converts 4 TB of archive using Hadoop on 100 Amazon EC2 instances
Web-scale deployments at Yahoo!, Facebook, Twitter
May 2009 – Yahoo! does the fastest sort of 1 TB, in 62 seconds, over 146 nodes
Hadoop development history
HADOOP TOOLS
APACHE HIVE HDFS SQOOP MAPREDUCE
APACHE HIVE: SQL ON HADOOP
An open-source (OSS) data warehouse built on top of Hadoop
First Apache Hive release in 2009
Initial goal: write MapReduce jobs in SQL
- Most queries ran from minutes to hours
- Primarily used for batch processing
Hive: a single tool for all SQL use cases
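To make the "MapReduce jobs in SQL" idea concrete, here is a toy sketch (plain Python, not Hive itself) of what an aggregate query compiles down to: a map phase, a shuffle, and a reduce phase. The table name `page_views` and its rows are invented for the example.

```python
# Hypothetical HiveQL:  SELECT user, COUNT(*) FROM page_views GROUP BY user;
# Below, the same computation spelled out as map -> shuffle -> reduce.

from collections import defaultdict

page_views = [("alice", "/home"), ("bob", "/about"), ("alice", "/docs")]

# Map: emit (key, 1) for every row, like a Hive-generated mapper would.
mapped = [(user, 1) for user, _url in page_views]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum each group, producing the GROUP BY result.
result = {user: sum(counts) for user, counts in groups.items()}
print(result)  # {'alice': 2, 'bob': 1}
```

This three-phase shape is why early Hive queries ran in minutes to hours: every query paid the cost of launching batch MapReduce jobs.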
HDFS: master and slaves
The slaves (DataNodes) store blocks of data and serve them to clients; the master (NameNode) manages all metadata.
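The master/slave split above can be sketched in a few lines of Python. This is not real HDFS: the "NameNode" here is a dict holding only metadata (which blocks make up a file and where each replica lives), and the "DataNodes" are dicts holding the bytes. The block size and replication factor are tiny illustrative values.

```python
# A minimal sketch of the HDFS master/slave design, with made-up sizes.
BLOCK_SIZE = 4    # real HDFS defaults to 128 MB
REPLICATION = 2   # real HDFS defaults to 3

datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}  # block_id -> bytes
namenode = {}  # filename -> [(block_id, [replica locations])]

def put(filename, data):
    """Split data into blocks, store replicas on DataNodes, record metadata."""
    blocks = []
    names = list(datanodes)
    for i in range(0, len(data), BLOCK_SIZE):
        block_no = i // BLOCK_SIZE
        block_id = f"{filename}_blk{block_no}"
        chunk = data[i:i + BLOCK_SIZE]
        # Place REPLICATION copies round-robin across the DataNodes.
        targets = [names[(block_no + r) % len(names)] for r in range(REPLICATION)]
        for dn in targets:
            datanodes[dn][block_id] = chunk
        blocks.append((block_id, targets))
    namenode[filename] = blocks  # only metadata lives on the master

def get(filename):
    """Ask the NameNode for metadata, then read each block from any replica."""
    return b"".join(datanodes[locs[0]][bid] for bid, locs in namenode[filename])

put("log.txt", b"hello hdfs!")
print(get("log.txt"))  # b'hello hdfs!'
```

Note how reads and writes touch the master only for metadata; the bulk data moves directly between client and DataNodes, which is what lets storage and throughput scale linearly with nodes.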
SQOOP
"Couldn't I just do this with a shell script?" – What year are you living in, 2001? No, there is a better way.
Structured data already captured in databases should be usable alongside unstructured data in Hadoop
Tedious "glue" code is otherwise necessary to wrap database records for consumption in Hadoop
Large amounts of log data to process
Apache open-source software; a bulk data transfer tool
- Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
- Populate tables in HDFS, Hive, and HBase
- Integrate with Oozie
- Support plugins via a connector-based architecture
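For a feel of the hand-written "glue" code that Sqoop automates, here is a sketch that pulls rows out of a relational table and writes them as comma-delimited records, the default layout a Sqoop import lands in an HDFS directory. The `employees` table and its rows are made up, and in-memory sqlite3 stands in for a real RDBMS.

```python
# Manual DB -> delimited-text "import", the kind of glue Sqoop replaces.
import csv, io, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Ada", "eng"), (2, "Grace", "eng"), (3, "Edgar", "sales")])

# Dump every row as one comma-delimited record per line.
out = io.StringIO()
writer = csv.writer(out)
for row in conn.execute("SELECT id, name, dept FROM employees ORDER BY id"):
    writer.writerow(row)

records = out.getvalue().splitlines()
print(records)  # ['1,Ada,eng', '2,Grace,eng', '3,Edgar,sales']
```

Sqoop does this in parallel MapReduce tasks, handles type mapping and incremental imports, and can write straight into Hive or HBase tables instead of plain files.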
MAP REDUCE
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004).
Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
- Petabytes of data
- Thousands of nodes
- Computational processing occurs on both:
  - Unstructured data: file systems
  - Structured data: databases
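The programming model described above can be shown in a single-process sketch: the user supplies only a map function and a reduce function, and the framework handles splitting, shuffling, and collecting. Word count is the canonical example; this is an illustration of the model, not the Google or Hadoop API.

```python
# A minimal, single-process sketch of the MapReduce programming model.
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # User-supplied mapper: emit (word, 1) for every word in the split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # User-supplied reducer: combine all values seen for one key.
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: run the mapper over each input split (serially here;
    # a real cluster runs these in parallel across thousands of nodes).
    intermediate = chain.from_iterable(map_fn(x) for x in inputs)
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(lines, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Because mappers are independent and reducers only see grouped keys, the framework can rerun any failed task on another node, which is where the fault tolerance on commodity hardware comes from.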
REFERENCES
https://www.google.co.in/search?q=HADOOP&ie=utf-8&oe=utf-8&gws_rd=cr&ei=13jTVcm2E8m3uQSkzp3wBg
https://hadoop.apache.org/
http://www.slideshare.net/linuxpham/hadoop-at-gnt-2012
http://www.slideshare.net/narangv43/seminar-presentation-hadoop
https://www.google.co.in/search?q=hadoop+books&ie=utf-8&oe=utf-8&gws_rd=cr&ei=nXnTVeSAMY2-uASdoo7ACA
http://www.fromdev.com/2014/07/Best-Hadoop-Books.html
Any Queries?
THANK YOU