big data and hadoop introduction
BIG DATA AND HADOOP INTRODUCTION
NGUYEN PHAN DZUNG
SEPTEMBER 2015
AGENDA
- Objectives
- Contents:
  • Big data
  • Apache Hadoop
  • Examples using Hadoop
- Demo
- Q&A
- References
Security Classification: Internal
Objectives
• Big data overview.
• Apache Hadoop common architecture:
  – Read/write a file in Hadoop File System
  – How Hadoop MapReduce tasks work
  – Hadoop 1 & 2 difference
• Develop a MapReduce job using Hadoop
• Apply Hadoop in the real world
Big data introduction
Big data – Information explosion
Big data – Definition
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
- Gartner
Big data – The 3Vs
• Volume:
  – Google receives over 2 million search queries every minute.
  – Transactional and sensor data are stored every fraction of a second.
• Variety:
  – YouTube and Facebook generate video, audio, image and text data.
  – Over 200 million emails are sent every minute.
• Velocity:
  – Experiments at CERN generate colossal amounts of data.
  – Particles collide 600 million times per second.
  – Their Data Center processes about one petabyte of data every day.
Big data – Challenges
• Difficulty identifying the right data and determining how best to use it.
• Struggling to find the right talent.
• Data access and connectivity obstacles.
• Data technology landscape is evolving extremely fast.
• Finding new ways of collaborating across functions and businesses.
• Security concerns.
Big data – Landscape
Big data – Plays a part in a firm’s revenue
Apache Hadoop introduction
Apache Hadoop – What?
• It is a software platform that:
  – allows us to easily write and run data-related applications
  – facilitates processing and manipulating massive amounts of data
  – makes that processing conveniently scalable
Apache Hadoop – Brief history
Apache Hadoop – Characteristics
• Reliable shared storage (HDFS) and analysis system (MapReduce).
• Highly scalable.
• Cost effective, as it can work with commodity hardware.
• Highly flexible: can process both structured and unstructured data.
• Built-in fault tolerance.
• Write once and read multiple times.
• Optimized for large and very large data sets.
Apache Hadoop – Design principles
• Moving computation is cheaper than moving data
• Hardware will fail, manage it
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
Apache Hadoop – Core architecture (1)
Apache Hadoop – Core architecture (2)
Apache Hadoop – HDFS architecture
Apache Hadoop – HDFS architecture - Replication
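The slide's diagram is not reproduced in this transcript, so the following is only a rough sketch of the arithmetic behind block replication. It assumes the stock Hadoop 2 defaults of a 128 MB block size and a replication factor of 3; neither value is stated on the slides, and both are configurable per cluster. The function name `hdfs_footprint` is hypothetical.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate how one file is laid out in HDFS.

    Returns (block_count, replica_count, raw_bytes_stored).
    The defaults are assumptions (common Hadoop 2 settings), not values
    taken from the slides.
    """
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Every block is stored `replication` times on different datanodes.
    replicas = blocks * replication
    # The last block only occupies as much disk as it has data, so raw
    # usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024 * MB)
    print(blocks, replicas, raw // MB)
```

Under these assumptions a 1024 MB file splits into 8 blocks, stored as 24 block replicas that take 3072 MB of raw disk across the cluster.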
Apache Hadoop – HDFS architecture – Secondary namenode
Apache Hadoop – HDFS – Read a file
Apache Hadoop – HDFS – Write a file (1)
Apache Hadoop – HDFS – Write a file (2)
How MapReduce pattern works
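The slide's diagram is not reproduced in this transcript. As a stand-in, here is a minimal single-process sketch of the map, shuffle and reduce phases, using the classic word-count example. It is plain Python and does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real Hadoop job would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    # Keys are sorted before reducing, mirroring the sorted reducer input.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["big data and hadoop", "hadoop stores big data"]
    print(word_count(sample))
    # {'and': 1, 'big': 2, 'data': 2, 'hadoop': 2, 'stores': 1}
```

The point of the pattern is that `map_phase` and `reduce_phase` each see only one record or one key at a time, which is what lets a framework distribute them without the user managing the execution details.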
Apache Hadoop – Running jobs in Hadoop 1
Apache Hadoop – Running jobs in Hadoop 1 – How it works
Apache Hadoop – Running jobs in Hadoop 2
Apache Hadoop – Running jobs in Hadoop 2 – How it works
Apache Hadoop – Using
• When to use Hadoop (scenarios include):
  – Analytics
  – Search
  – Data retention
  – Log file processing
  – Analysis of text, image, audio, and video content
  – Recommendation systems, as in e-commerce websites
• When not to use Hadoop:
  – Low-latency or near real-time data access.
  – A large number of small files to be processed.
  – Multiple-writer scenarios, or scenarios requiring arbitrary writes or writes at arbitrary positions within a file.
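To illustrate why a large number of small files is a poor fit, here is a rough back-of-the-envelope sketch. It relies on two assumptions that do not come from the slides: a 128 MB block size, and the commonly quoted rule of thumb that each file and each block costs about 150 bytes of namenode memory. The function names are hypothetical and the figures are estimates, not measurements.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_objects(total_bytes, file_size_bytes, block_size_bytes=128 * MB):
    """Count the file and block objects needed to hold `total_bytes`
    of data split into equally sized files (directories are ignored)."""
    files = (total_bytes + file_size_bytes - 1) // file_size_bytes
    # Even a tiny file occupies at least one block entry.
    blocks_per_file = max(
        1, (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    )
    return files + files * blocks_per_file


def estimated_heap_bytes(total_bytes, file_size_bytes):
    """Approximate namenode memory for the same data set."""
    return namenode_objects(total_bytes, file_size_bytes) * BYTES_PER_OBJECT


if __name__ == "__main__":
    TB = 1024 * 1024 * MB
    for size in (1 * MB, 1024 * MB):
        print(size // MB, "MB files:", namenode_objects(TB, size), "objects")
```

Under these assumptions, 1 TB stored as 1 MB files needs about 2.1 million namenode objects (roughly 300 MB of memory), while the same data stored as 1024 MB files needs 9,216. The data volume is identical; only the metadata load differs.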
Apache Hadoop – Ecosystem
Examples using Hadoop
Examples using Hadoop – A retail management system
Examples using Hadoop – SQL Server and Hadoop
Real world applications/solutions using Hadoop – MS HDInsight
Real world applications/solutions using Hadoop – Case studies
Demo
Q & A
References
- http://hadoop.apache.org
- Hadoop in Action, Chuck Lam
- Hadoop: The Definitive Guide, Tom White
- http://www.bigdatanews.com/
- http://stackoverflow.com
- http://codeproject.com
- Hadoop 2 Fundamentals, LiveLessons
Thank you for your attention!