Why Hadoop for data science?

Ofer Mendelevitch, PASS BA Conference, April 2013

© Hortonworks Inc. 2013

Upload: hortonworks

Post on 05-Dec-2014




A brief history of Apache Hadoop

• Focus on INNOVATION, 2005: Yahoo! creates a team under E14 to work on Hadoop
• Focus on OPERATIONS, 2008: the Yahoo! team extends its focus to operations to support multiple projects and growing clusters
• STABILITY, 2011: Hortonworks is created to focus on “Enterprise Hadoop”, starting with 24 key Hadoop engineers from Yahoo!

Timeline milestones (2004 to 2013): Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform


Core Hadoop: HDFS & MapReduce

Deliver high-scale storage & processing

• HDFS: distributed, self-healing data store

• MapReduce: a distributed computation framework that handles the complexities of distributed programming
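To make the map and reduce roles concrete, here is a minimal single-process sketch of a word count. It only imitates the programming model in plain Python; it does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions rather than anything from the slides.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster the mapper and reducer would run on many nodes and the framework would perform the shuffle; the author of the job writes only the two functions.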


Keys to Hadoop’s power

• Computation co-located with data
– Data and computation system co-designed and co-developed to work together

• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software

• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives


HDP: Enterprise-Ready Hadoop

Hortonworks Data Platform (HDP)

• Hadoop Core (distributed storage & processing): HDFS, MapReduce
• Platform Services (enterprise readiness): HA, DR, snapshots, security, …
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services (manage & operate at scale): Oozie, Ambari
• Deployment: OS / VM, cloud, appliance


What is a data product?


“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”


Example 1: Google AdWords


Example 2: People you may know


Example 3: Spell correction


What is data science?


#1: Extracting deep meaning from data (data mining; finding “gems” in data)


Common data science tasks

Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns

Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
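As one small illustration of a descriptive task, affinity analysis can start from simple pair counts. The sketch below assumes the input is a list of "baskets" (lists of item names); the function name `cooccurrence_counts` and the basket representation are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts
```

Pairs with high counts are candidates for co-occurrence patterns; a real analysis would usually normalize them (for example by item frequency) before drawing conclusions.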


What is data science?

#2: Building data products (delivering gems on a regular basis)

[Diagram labels: Pre-process, Build model, SQL; Periodic batch processing; Online serving]


Reason #1: Explore full datasets


Explore large datasets directly with Hadoop

• Analysis cycle: Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate
• Full dataset stored on Hadoop
• Researcher laptop: R, Matlab, SAS, etc.


Integrate Hadoop in your data analysis flow

• Exploratory data analysis on full dataset
– Simple statistics: mean, median, quantile, etc.
– Pre-processing: grep, regex, etc.

• Ad-hoc sampling / filtering
– Random: with or without replacement
– Sample by unique key
– K-fold cross-validation
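Sampling by unique key and assigning cross-validation folds can both be done with a stable hash of the key, which needs no coordination between nodes. The following is a minimal sketch under that assumption; the helper names (`_bucket`, `sample_by_key`, `fold_of`) and the choice of MD5 are illustrative, not something the slides prescribe.

```python
import hashlib


def _bucket(key, buckets, salt=""):
    """Map a key to a stable bucket in [0, buckets) using an MD5 digest."""
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def sample_by_key(records, key_fn, percent):
    """Keep every record whose key falls in the first `percent` of 100 buckets.

    All records sharing a key are kept or dropped together.
    """
    return [r for r in records if _bucket(key_fn(r), 100) < percent]


def fold_of(key, k):
    """Assign a key to one of k cross-validation folds (0 .. k-1)."""
    return _bucket(key, k, salt="fold:")
```

Because the decision depends only on the key, every mapper reaches the same answer for the same user or document, so the sample stays consistent across the whole dataset.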


Reason #2: Mine larger datasets


More data -> better outcomes

Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009


Learning algorithms with large datasets…

Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…

Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
– Recommendation: Alternating Least Squares (ALS)
– Clustering: K-means
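K-means is a convenient example of an algorithm that parallelizes well: each node can summarize its own partition, and only small partial sums need to be merged. The sketch below simulates this in a single Python process, with each inner list standing in for one node's partition. The function names and the partition layout are assumptions for illustration, not a Hadoop or Mahout API.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map side: per-centroid partial sums and counts for one data partition."""
    partial = {}
    for point in points:
        idx = nearest(point, centroids)
        total, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce side: merge partial sums and emit the updated centroids."""
    merged = {}
    for partial in partials:
        for idx, (total, count) in partial.items():
            acc_total, acc_count = merged.get(idx, ([0.0] * len(total), 0))
            merged[idx] = ([a + t for a, t in zip(acc_total, total)], acc_count + count)
    # A centroid that attracted no points keeps its previous position.
    return [
        tuple(v / merged[i][1] for v in merged[i][0]) if i in merged else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds over the partitions."""
    for _ in range(iterations):
        partials = [map_partition(part, centroids) for part in partitions]
        centroids = reduce_partials(partials, centroids)
    return centroids
```

Only the per-centroid sums and counts cross the network in each round, which is why the data itself can stay distributed across the cluster.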


Reason #3: Large-scale data preparation


80% of data science work is data preparation

From raw data to processed data:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization


Hadoop is ideal for batch data preparation and cleanup of large datasets
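A few of these preparation steps (stripping HTML, term normalization, document vector generation) can be sketched per document with the Python standard library. This is a minimal illustration of the kind of function a batch job would apply to every record; the names `strip_html`, `normalize_terms` and `document_vector` are assumptions, and real pipelines typically need more careful parsing.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (minimal: no script/style filtering)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of `raw` with the markup removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw_html):
    """Term-frequency vector for one raw HTML document."""
    return Counter(normalize_terms(strip_html(raw_html)))
```

Each document is processed independently, so this kind of work maps naturally onto a map-only batch job over the full dataset.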


Reason #4: Accelerate data-driven innovation


Barriers to speed with traditional data architectures

• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation

Timeline:
– Start: “I need new data”
– Schema change project
– 6 months: “Finally, we start collecting”
– 9 months: “Let me see… is it any good?”


“Schema on read” means faster time-to-innovation

• Hadoop uses “schema on read”
• Low barrier for data-driven innovation

Timeline:
– Start: “I need new data” / “Let’s just put it in a folder on HDFS”
– 3 months: “Let me see… is it any good?”
– 6 months: “My model is awesome!”
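The idea behind “schema on read” can be shown in a few lines: raw records are stored untouched, and each analysis applies whatever schema it needs at read time. The sketch below uses an in-memory list of JSON lines as a stand-in for files in an HDFS folder; `RAW_LINES`, `read_with_schema` and the field names are hypothetical.

```python
import json

# Raw events are kept exactly as they arrived; no schema was enforced on write.
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "referrer": "ad"}',
    'not valid json',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading: pick and cast fields, skip unparseable rows."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # Fields missing from a record come back as None instead of failing.
        yield {
            name: cast(record[name]) if record.get(name) is not None else None
            for name, cast in schema.items()
        }
```

Two analyses can read the same raw lines with different schemas, for example `{"user": str, "ms": int}` for latency work and `{"user": str, "referrer": str}` for attribution, without any change to the stored data.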


Summary

Why use Hadoop for data science?
1. Data exploration with full datasets
2. Mine larger datasets
3. Pre-processing at scale
4. Faster data-driven cycles


Quick start: Hortonworks Sandbox

• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on, step-by-step tutorials

• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your data to explore and investigate your use cases

• ZERO to big data in 15 minutes

Download Hortonworks Sandbox: www.hortonworks.com/sandbox

Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/


Thank you!

Any questions?

Ofer Mendelevitch

Director, Data Sciences @ Hortonworks

[email protected]

@ofermend, @hortonworks

Come visit us @ Booth S5

We’re hiring!