using hadoop for big data

Hadoop for (Young) Data Scientist Komes Chandavimol and Team Data Science Lab, Thailand [email protected]

Upload: data-science-thailand

Post on 21-Apr-2017


Hadoop for (Young) Data Scientist

Komes Chandavimol and Team, Data Science Lab, Thailand

[email protected]

Agenda

• Big Data, Analytics and Data Science

• Hadoop + Spark Workshops

• Sharing Experience: Hadoop (Real) Use Cases

• Hadoop + Spark Trends

Big Data, Analytics and Data Science

Big Data

http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427

https://www.domo.com/learn/data-never-sleeps-3-0

Internet of Things

http://topmanagement.com.mx/innovacion-social-y-empresarial-objetivo-de-hitachi/

http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427

https://www.domo.com/learn/data-never-sleeps-3-0

The Growth of Data

http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427

https://www.domo.com/learn/data-never-sleeps-3-0

What is Big Data?

http://blogs.forrester.com/category/hadoop

http://solutions.forrester.com/Global/FileLib/webinars/Big_Data_-_Gold_Rush_or_Illusion.pdf

The Big Data Tools

http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/

Traditional Data Management Architecture

http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/

New Data Management Architecture

http://www.kdnuggets.com/2014/05/big-data-landscape-v30-analyzed.html

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now

Data Lake

How does the Data Lake work?

http://www.clearpeaks.com/blog/category/tableau

Traditional Enterprise Data Warehouse

What do you consume from the Data Lake?

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now


Volume? Variety? Velocity?

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now


Value

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now

Big Data + Analytics = Value

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now

Big Data Analytics

http://hortonworks.com/blog/big-data-refinery-fuels-next-generation-data-architecture/

Big Data Analytics

http://dataofthings.blogspot.com/2014/04/the-bbbt-sessions-hortonworks-big-data.html

Big Data Analytics

http://www.gartner.com/it-glossary/predictive-analytics

How to do Big Data Analytics?

https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now

Data Science Experience Sharing, Big Data Challenge #2, Bangkok, Thailand

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

What is Data Science?

The Rise of the Data Scientist

http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/

2009

https://hbr.org/

http://hrb.org

http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

2014

The Rise of the Data Scientist

http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

2014

The Data Science

The Solution: Data Science Team


Data Science Team

Doing Data Science by O'Neil et al (2013)


Data Science Team

Analyzing the Analyzers, Harris (2013)

Data Science Team: Data Scientist & Data Engineer

http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html

Data Science Team: Data Scientist & Data Engineer

http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html

https://www.facebook.com/DataScienceTh/posts/931828353527079:0


Data Science Professionals

http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html

How to build a DS Team?

∗ Build an in-house team

• Train existing employees

• Train existing employees and hire experts

• Hire experts

∗ Outsource requirements to private DS consultants

• Outsource comprehensive DS strategy development

• Outsource DS solutions to specific problems

∗ Leverage cloud-based platform solutions

Data Science for Dummies, Pierson (2015)

Machine Learning

"Improving performance in some task with experience." Tom Mitchell (1998)

"The field of study that gives computers the ability to learn without being explicitly programmed." Arthur Samuel (1959)

Wikipedia, Data Visualization for Dummies (2014)

Data Points: Visualization That Means Something (2013)

Machine Learning deals with systems that can learn from data.

Machine Learning Discovery

• Class Discovery

• Correlation Discovery

• Novelty (Surprise) Discovery

• Association (or Link) Discovery

KirkBorne-workshop-ODSC2016.pdf

The XYZ of Data Science

Smart X:

• Smart Cities

• Smart Highways

• Smart Supply Chain

Precision Y:

• Precision Medicine

• Precision Farming

• Precision Pricing

Personalized Z:

• Personalized Health

• Personalized Learning

• Personalized Shopping Experience

KirkBorne-Workshop-ODSC2016.pdf

Intelligence at the edge of the network… at the point of data collection

DataInquest – Predictive Analytics and Data Science Bootcamp

Data Science is a Team Sport

http://www.ibmbigdatahub.com/blog/why-data-science-team-sport


How to Start?


Hadoop + Spark Workshops

Workshop #1 Installing HDFS and YARN


Workshop #2 WordCount
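
The slide gives only the workshop title, so the following is a minimal sketch of the general WordCount idea rather than the workshop's actual code. It imitates the map, shuffle, and reduce phases in plain Python, with no Hadoop involved. The function names and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data big value", "data lake"]  # hypothetical input
    print(word_count(sample))
```

On a real cluster the shuffle step is performed by the framework between the map and reduce tasks; here it is written out only to make the data flow visible.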


Workshop #3 WordCount (Streaming)
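
Hadoop Streaming lets the mapper and reducer be ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The slide does not show the workshop code, so the sketch below is an assumption: two Python generators that follow that line protocol, plus a hypothetical `simulate` helper that stands in for the framework's sort between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Hypothetical local stand-in for the job: map, sort, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in simulate(["a b a", "b c"]):  # hypothetical input
        print(out)
```

In an actual streaming job each function would be wrapped in its own script that iterates over `sys.stdin` and prints each result line; the sort between them is done by Hadoop.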

Workshop #4 WordCount (Frequency Sort)
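
One common way to sort word counts by frequency in MapReduce is a second pass that swaps key and value, so that the count becomes the sort key. The slide does not say which approach the workshop used; the sketch below only illustrates that swap-and-sort idea in plain Python. The function name and the tie-breaking rule (alphabetical) are assumptions.

```python
def sort_by_frequency(counts):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    # Swap to (count, word) so the count drives the ordering.
    swapped = [(count, word) for word, count in counts.items()]
    swapped.sort(key=lambda cw: (-cw[0], cw[1]))
    # Swap back to (word, count) for the final output.
    return [(word, count) for count, word in swapped]


if __name__ == "__main__":
    example = {"data": 3, "big": 3, "lake": 1}  # hypothetical counts
    for word, count in sort_by_frequency(example):
        print(word, count)
```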


Workshop #5 Setup Cloudera QuickStart

Workshop #6 Exploring HBase data in Hue

Workshop #7 Design a schema for quick Twitter relationship lookup

Workshop #8 Design a schema for IoT log (Smart Meter)
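
The slide does not show the schema that was designed in the workshop. As an illustration only, a frequently used HBase pattern for time-series logs is a row key made of the device id followed by a reversed timestamp, so that the newest reading of each meter sorts first. The separator, the 13-digit bound, and the function name below are all assumptions.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for epoch milliseconds (13 digits)


def row_key(meter_id: str, epoch_ms: int) -> str:
    """Build 'meter#reversed_timestamp' so newer readings sort first."""
    if not 0 <= epoch_ms <= MAX_TS:
        raise ValueError("timestamp outside the assumed 13-digit range")
    reversed_ts = MAX_TS - epoch_ms
    # Zero-pad so that string order matches numeric order.
    return f"{meter_id}#{reversed_ts:013d}"


if __name__ == "__main__":
    older = row_key("m001", 1_492_732_800_000)  # hypothetical reading
    newer = row_key("m001", 1_492_732_860_000)  # one minute later
    print(sorted([older, newer]))  # the newer key is listed first
```

Prefixing with the meter id keeps all readings of one meter adjacent, which suits a scan for a single meter; whether that fits the workshop's access pattern is not stated in the slides.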

Workshop #9 Create an HBase table for smart meter data


Workshop #10 Bank Customer Snapshot

Workshop #10.1 Create Hive Tables

10.2 Create External Hive Tables

10.3 Create External Hive Tables

10.4 Partition

Workshop #11 Sqoop

Workshop spk1 WordCount

Workshop spk2 WordCount

Workshop spk3 WordCount


Workshop spk4 SparkSQL + ML

Sharing Experience: Hadoop (Real) Use Cases

Top Performers Use Analytics 5 Times More Than Lower Performers

Source: Analytics: The New Path to Value, a joint MIT Sloan Management Review and IBM Institute for Business Value study. Copyright © Massachusetts Institute of Technology 2010.

Revenue - Cost = Profit

Monitoring and Maintenance

Data sources: IoT sensors in factory

Data products: predictive maintenance models

http://www.electrex.it/en/news/600-automated-energy-management-system-a-enms-for-cement-production-plants.html

Customer Engagement + Location

Data sources: Mobile App, Loyalty Program, GIS

Data products: Buying behavior analysis, coupon-response model, location visualization

http://www.fastcompany.com/3020859/most-creative-people/how-chinas-one-child-policy-forced-starbucks-to-rethink-its-beijing-sto

Fuel Saving

Data sources: Telematics (sensor), GPS

Data products: Prescriptive analytics – route optimization, predictive maintenance (parts/malfunction)

http://www.cnet.com/news/ups-turns-data-analysis-into-big-savings/

Fraud Detection

Data sources: historical pattern of transaction data

Data products: predictive models – fraud/non-fraud

https://bluefishway.com/2013/09/13/panic-oh-no-not-again/

HR Analytics – Google Hiring

Data sources: Historical hiring attributes

Data products: Predictive model – recruiting high performers

Behavioral Test

Situational Test

GPA

Brain Teaser

Good School

Average ROI of Analytics/Data Science


Hadoop + Spark Trends