why hadoop for data science?

© Hortonworks Inc. 2013

Why Hadoop for data science?

Ofer Mendelevitch
PASS BA Conference, April 2013


A brief history of Apache Hadoop

Timeline, 2004 to 2013:

• 2005, focus on INNOVATION: Yahoo! creates a team under E14 to work on Hadoop
• 2008, focus on OPERATIONS: Yahoo! team extends its focus to operations to support multiple projects & growing clusters
• 2011, focus on STABILITY: Hortonworks is created to focus on “Enterprise Hadoop”; it starts with 24 key Hadoop engineers from Yahoo!

Milestones along the timeline: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform


Core Hadoop: HDFS & MapReduce

Deliver high-scale storage & processing

• HDFS: distributed, self-healing data store

• MapReduce: distributed computation framework that handles the complexities of distributed programming
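To make the MapReduce model concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the shuffle step that Hadoop performs across nodes is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The programmer writes only the mapper and the reducer; on a real cluster the framework takes care of splitting the input, moving intermediate pairs between nodes, and retrying failed tasks.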


Keys to Hadoop’s power

• Computation co-located with data
– Data and computation system co-designed and co-developed to work together

• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software

• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives


HDP: Enterprise-Ready Hadoop

HORTONWORKS DATA PLATFORM (HDP)

• HADOOP CORE: Distributed Storage & Processing
– HDFS
– MAP REDUCE

• PLATFORM SERVICES: Enterprise Readiness
– HA, DR, Snapshots, Security, …

• DATA SERVICES: Store, Process and Access Data
– HCATALOG
– HIVE
– PIG
– HBASE
– SQOOP
– FLUME

• OPERATIONAL SERVICES: Manage & Operate at Scale
– OOZIE
– AMBARI

• Deployment options: OS / VM, Cloud, Appliance


What is a data product?

“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”


Example 1: Google Adwords


Example 2: People you may know


Example 3: spell correction


What is data science?

#1: Extracting deep meaning from data (data mining; finding “gems” in data)


Common data science tasks

Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns

Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference


What is data science?

#2: Building data products (delivering gems on a regular basis)

[Diagram: periodic batch processing (pre-process, build model) and online serving; labels also include SQL]


Reason #1: Explore full datasets



Explore large datasets directly with Hadoop

[Diagram: analysis cycle of Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate]

• Full dataset stored on Hadoop
• Researcher laptop: R, Matlab, SAS, etc.


Integrate Hadoop in your data analysis flow

• Exploratory data analysis on full dataset
– Simple statistics: mean, median, quantile, etc.
– Pre-processing: grep, regex, etc.

• Ad-hoc sampling / filtering
– Random: with or without replacement
– Sample by unique key
– K-fold cross-validation
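One common way to sample by unique key and to split data into k folds is to hash the key into a fixed number of buckets, so that every node makes the same decision without coordination. The sketch below assumes that approach; the helper names (`_bucket`, `sample_by_key`, `assign_fold`) and the record layout are hypothetical, and it runs in a single process rather than on a cluster.

```python
import hashlib


def _bucket(key, buckets, salt=""):
    """Map a key to a stable bucket in [0, buckets) using an MD5 digest."""
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def sample_by_key(records, key_fn, percent):
    """Keep all records of roughly `percent`% of the distinct keys."""
    return [r for r in records if _bucket(key_fn(r), 100) < percent]


def assign_fold(key, k):
    """Assign a key to one of k cross-validation folds, deterministically."""
    return _bucket(key, k, salt="fold:")


if __name__ == "__main__":
    # 50 hypothetical users with 10 events each.
    events = [{"user": f"u{i % 50}", "event": i} for i in range(500)]
    sample = sample_by_key(events, lambda r: r["user"], percent=20)
    users = {r["user"] for r in sample}
    print(f"{len(users)} users sampled, {len(sample)} events kept")
    print("fold of u1:", assign_fold("u1", k=5))
```

Because the decision depends only on the key, a sampled user keeps all of their records, and the same record always lands in the same fold no matter which node processes it.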


Reason #2: Mine larger datasets



More data -> better outcomes

Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009


Learning algorithms with large datasets…

Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…

Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
– Recommendation: Alternating Least Squares (ALS)
– Clustering: K-means
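As an illustration of what a distributed/parallel K-means could look like, the sketch below expresses one iteration as a map step over data partitions and a reduce step that merges partial sums. It is a single-process simulation under assumed names (`map_partition`, `reduce_partials`, `kmeans`), not the implementation of any particular Hadoop library; each list in `partitions` stands in for the data held on one node.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Mapper: for one partition, emit cluster id -> (coordinate sums, count)."""
    partial = {}
    for point in points:
        cid = nearest(point, centroids)
        sums, count = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reducer: merge partial sums from all partitions into new centroids."""
    totals = {}
    for partial in partials:
        for cid, (sums, count) in partial.items():
            t_sums, t_count = totals.get(cid, ([0.0] * len(sums), 0))
            totals[cid] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
    new_centroids = []
    for cid, old in enumerate(centroids):
        if cid in totals:
            sums, count = totals[cid]
            new_centroids.append(tuple(s / count for s in sums))
        else:
            new_centroids.append(tuple(old))  # empty cluster keeps its centroid
    return new_centroids


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the centroids."""
    for _ in range(iterations):
        partials = [map_partition(part, centroids) for part in partitions]
        centroids = reduce_partials(partials, centroids)
    return centroids


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans(partitions, [(1.0, 1.0), (9.0, 9.0)]))
```

Only the small per-cluster sums and counts travel between the map and reduce steps, so the full dataset never has to fit in one machine's memory.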


Reason #3: Large-scale data preparation



80% of data science work is data preparation

Raw Data -> Processed Data:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Term normalization
• Document vector generation
• Sampling, filtering
• Joins


Hadoop is ideal for batch data preparation and cleanup of large datasets
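A minimal sketch of three of the preparation steps listed above (stripping HTML, term normalization, document vector generation), written for a single document. The function names are hypothetical and the rules are deliberately simple: text inside script or style tags is not filtered out, and normalization is just lower-casing plus splitting on non-alphanumeric characters. On Hadoop, the same function would be applied to every document in parallel.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of raw HTML with all tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw_html):
    """Build a term-frequency vector (term -> count) for one document."""
    return Counter(normalize_terms(strip_html(raw_html)))


if __name__ == "__main__":
    print(document_vector("<p>Hadoop, <b>hadoop</b> and Data!</p>"))
```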


Reason #4: Accelerate data-driven innovation



Barriers to speed with traditional data architectures

• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation

Timeline:
• Start: “I need new data”; schema change project begins
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”


“Schema on read” means faster time-to-innovation

• Hadoop uses “schema on read”
• Low barrier for data-driven innovation

Timeline:
• Start: “I need new data”; “Let’s just put it in a folder on HDFS”
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
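The idea behind “schema on read” can be sketched in a few lines: raw records are stored untouched, and each analysis applies its own schema when it reads them. The example assumes JSON lines as the raw format; the field names, `RAW_LINES`, and `read_with_schema` are all hypothetical and chosen only for illustration.

```python
import json

# Raw records as they might sit in a folder on HDFS: stored as-is, no fixed schema.
RAW_LINES = [
    '{"user": "u1", "ts": "2013-04-10", "clicks": "3", "referrer": "search"}',
    '{"user": "u2", "ts": "2013-04-11", "clicks": "5"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading raw JSON lines.

    Fields missing from a record are returned as None instead of failing.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two analyses read the same raw data with different schemas.
clicks_schema = {"user": str, "clicks": int}
referrer_schema = {"user": str, "referrer": str}

if __name__ == "__main__":
    print(list(read_with_schema(RAW_LINES, clicks_schema)))
    print(list(read_with_schema(RAW_LINES, referrer_schema)))
```

Adding the `referrer` field required no change to stored data or to the first analysis; only the new reader needs to know about it.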


Summary

Why use Hadoop for data science?
1. Data exploration with full datasets
2. Mine larger datasets
3. Pre-processing at scale
4. Faster data-driven cycles


Quick start: Hortonworks Sandbox

• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on step-by-step tutorials

• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your data to explore and investigate your use cases

• ZERO to big data in 15 minutes

Download Hortonworks Sandbox: www.hortonworks.com/sandbox

Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/


Thank you!

Any Questions?

Ofer Mendelevitch

Director, Data Sciences @ Hortonworks

[email protected]

@ofermend, @hortonworks

Come visit us @ Booth S5

We’re hiring!

