big data and hadoop introduction

BIG DATA AND HADOOP INTRODUCTION NGUYEN PHAN DZUNG SEPTEMBER 2015

Upload: dzung-nguyen

Post on 13-Jan-2017


TRANSCRIPT

Page 1: Big data and Hadoop introduction

BIG DATA AND HADOOP INTRODUCTION

NGUYEN PHAN DZUNG

SEPTEMBER 2015

Page 2: Big data and Hadoop introduction

AGENDA

- Objectives
- Contents:
  • Big data
  • Apache Hadoop
  • Examples using Hadoop
- Demo
- Q&A
- References

Page 3: Big data and Hadoop introduction

Security Classification: Internal

Objectives


• Big data overview
• Apache Hadoop common architecture:
  – Reading and writing a file in the Hadoop file system (HDFS)
  – How Hadoop MapReduce tasks work
  – Differences between Hadoop 1 and Hadoop 2
• Develop a MapReduce job using Hadoop
• Apply Hadoop in the real world

Page 4: Big data and Hadoop introduction

Big data introduction

Page 5: Big data and Hadoop introduction

Big data – Information explosion

Page 6: Big data and Hadoop introduction

Big data – Definition

“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”
- Gartner

Page 7: Big data and Hadoop introduction

Big data – The 3Vs

• Volume:
  – Google receives over 2 million search queries every minute.
  – Transactional and sensor data are stored every fraction of a second.
• Variety:
  – YouTube and Facebook generate video, audio, image and text data.
  – Over 200 million emails are sent every minute.
• Velocity:
  – Experiments at CERN generate colossal amounts of data.
  – Particles collide 600 million times per second.
  – Their Data Center processes about one petabyte of data every day.

Page 8: Big data and Hadoop introduction

Big data – Challenges

• Difficulty identifying the right data and determining how best to use it.
• Struggling to find the right talent.
• Data access and connectivity obstacles.
• The data technology landscape is evolving extremely fast.
• Finding new ways of collaborating across functions and businesses.
• Security concerns.

Page 9: Big data and Hadoop introduction

Big data – Landscape

Page 10: Big data and Hadoop introduction

Big data – Plays a part in a firm’s revenue

Page 11: Big data and Hadoop introduction

Apache Hadoop introduction

Page 12: Big data and Hadoop introduction

Apache Hadoop – What?

• It is a software platform that:

– lets us easily write and run data-related applications

– facilitates processing and manipulating massive amounts of data

– makes that processing conveniently scalable

Page 13: Big data and Hadoop introduction

Apache Hadoop – Brief history

Page 14: Big data and Hadoop introduction

Apache Hadoop – Characteristics

• Reliable shared storage (HDFS) and analysis system (MapReduce).
• Highly scalable.
• Cost effective, as it can work with commodity hardware.
• Highly flexible: can process both structured and unstructured data.
• Built-in fault tolerance.
• Write once and read multiple times.
• Optimized for large and very large data sets.

Page 15: Big data and Hadoop introduction

Apache Hadoop – Design principles

• Moving computation is cheaper than moving data
• Hardware will fail, manage it
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model

Page 16: Big data and Hadoop introduction

Apache Hadoop – Core architecture (1)

Page 17: Big data and Hadoop introduction

Apache Hadoop – Core architecture (2)

Page 18: Big data and Hadoop introduction

Apache Hadoop – HDFS architecture

Page 19: Big data and Hadoop introduction


Apache Hadoop – HDFS architecture - Replication
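
As a rough illustration of what replication costs in storage, the sketch below estimates how many blocks and block replicas a file occupies. It assumes a 128 MB block size and a replication factor of 3, which are the Hadoop 2 defaults for `dfs.blocksize` and `dfs.replication`; neither value is stated on the slide, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
# Assumed values: Hadoop 2 defaults, not taken from the slide.
BLOCK_SIZE = 128 * 1024 * 1024  # dfs.blocksize, in bytes
REPLICATION = 3                 # dfs.replication


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    The last block of a file only occupies as many bytes as it holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = -(-file_size_bytes // block_size)  # ceiling division on integers
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    blocks, replicas, raw = hdfs_footprint(one_gib)
    print(blocks, replicas, raw // one_gib)  # prints: 8 24 3
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and uses about 3 GiB of raw disk across the cluster.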


Page 20: Big data and Hadoop introduction

Apache Hadoop – HDFS architecture – Secondary namenode

Page 21: Big data and Hadoop introduction

Apache Hadoop – HDFS – Read a file

Page 22: Big data and Hadoop introduction

Apache Hadoop – HDFS – Write a file (1)

Page 23: Big data and Hadoop introduction

Apache Hadoop – HDFS – Write a file (2)

Page 24: Big data and Hadoop introduction


How MapReduce pattern works
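
The pattern can be sketched with the classic word-count example. The code below is plain Python that imitates the map, shuffle and reduce phases inside a single process. It does not use the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over the input lines in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    # Keys are reduced in sorted order, mirroring the sorted input a reducer sees.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["big data and hadoop", "hadoop stores big data"]
    print(word_count(sample))
    # prints: {'and': 1, 'big': 2, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network; here everything runs sequentially so that only the data flow is visible.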


Page 25: Big data and Hadoop introduction

Apache Hadoop – Running jobs in Hadoop 1

Page 26: Big data and Hadoop introduction

Apache Hadoop – Running jobs in Hadoop 1 – How it works

Page 27: Big data and Hadoop introduction

Apache Hadoop – Running jobs in Hadoop 2

Page 28: Big data and Hadoop introduction

Apache Hadoop – Running jobs in Hadoop 2 – How it works

Page 29: Big data and Hadoop introduction

Apache Hadoop – Using

• When to use Hadoop. It can be used in various scenarios, including:
  – Analytics
  – Search
  – Data retention
  – Log file processing
  – Analysis of text, image, audio and video content
  – Recommendation systems, as in e-commerce websites
• When not to use Hadoop:
  – Low-latency or near real-time data access.
  – A large number of small files to be processed.
  – Multiple-writer scenarios, or scenarios requiring arbitrary writes or writes in the middle of files.

Page 30: Big data and Hadoop introduction

Apache Hadoop – Ecosystem

Page 31: Big data and Hadoop introduction

Examples using Hadoop

Page 32: Big data and Hadoop introduction

Examples using Hadoop – A retail management system

Page 33: Big data and Hadoop introduction

Examples using Hadoop – SQL Server and Hadoop

Page 34: Big data and Hadoop introduction

Real world applications/solutions using Hadoop – MS HDInsight

Page 35: Big data and Hadoop introduction

Real world applications/solutions using Hadoop – Case studies

Page 36: Big data and Hadoop introduction

Demo

Page 37: Big data and Hadoop introduction

Q & A

Page 38: Big data and Hadoop introduction

References

- http://hadoop.apache.org
- Hadoop in Action – Chuck Lam
- Hadoop: The Definitive Guide – Tom White
- http://www.bigdatanews.com/
- http://stackoverflow.com
- http://codeproject.com
- Hadoop 2 Fundamentals – LiveLessons

Page 39: Big data and Hadoop introduction

Thank you for your attention!