Hortonworks Big Data & Hadoop

© Hortonworks Inc. 2013 Big Data, Data Science & Hadoop Ofer Mendelevitch San Francisco Bay Area Microsoft Business Intelligence User Group May 2013

Upload: mark-ginnebaugh

Post on 26-Jan-2015


Category:

Business



DESCRIPTION

Presenter: Ofer Mendelevitch of Hortonworks

Learn the benefits of big data for data scientists, and how Hadoop and HDInsight fit into the modern data architecture and enable data-driven products. You'll learn:

* What data science actually means
* The term "data products"
* The benefits of using big data for data scientists
* How Hadoop helps data scientists work with big data
* About HDInsight, the big data platform from Microsoft and Hortonworks

TRANSCRIPT

Page 1: Hortonworks Big Data & Hadoop

© Hortonworks Inc. 2013

Big Data, Data Science & Hadoop

Ofer Mendelevitch

San Francisco Bay Area Microsoft Business Intelligence User Group

May 2013

Page 2: Hortonworks Big Data & Hadoop


Who am I?

Director of Data Sciences @ Hortonworks

• Data science with Hadoop

• Professional services

Previously…

A Chess Dad

Page 4: Hortonworks Big Data & Hadoop

Gartner’s 3 V’s of big data:

• Volume: size of the data

• Velocity: ingest speed, response latency

• Variety: diverse sources; format, structure; data quality

Page 5: Hortonworks Big Data & Hadoop


What Makes Up Big Data?

[Figure: data volume grows from megabytes through gigabytes and terabytes to petabytes as data variety and complexity increase]

ERP: purchase detail, purchase record, payment record

CRM: segmentation, offer details, customer touches, support contacts

WEB: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels

BIG DATA: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates

Transactions + Interactions + Observations = BIG DATA

Page 6: Hortonworks Big Data & Hadoop

Where is all this data coming from?

• Sensors/devices

• Online: social, forums, etc.

• Event logs

• Etc.

But also:

• Data that was “thrown away” previously

Page 7: Hortonworks Big Data & Hadoop

Ok… so what is big data?

I like a quote from Michael Franklin (UCB):

“Big Data is any data that is expensive to manage and hard to extract value from”

It’s a relative term.

Today’s big data may be tomorrow’s small data.

Page 9: Hortonworks Big Data & Hadoop

What is a data product?

“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”

Page 10: Hortonworks Big Data & Hadoop


Example 1: Google Adwords

Page 11: Hortonworks Big Data & Hadoop


Example 2: People you may know

Page 12: Hortonworks Big Data & Hadoop


Example 3: spell correction

Page 14: Hortonworks Big Data & Hadoop

What is data science?

#1: Extracting deep meaning from data (data mining; finding “gems” in data)

Page 15: Hortonworks Big Data & Hadoop

What is data science?

#2: Building data products (delivering gems on a regular basis)

Pre-process Build model SQL

Periodic batch processing

Online serving

Page 16: Hortonworks Big Data & Hadoop

Common data science tasks

Descriptive

• Clustering: detect natural groupings

• Outlier detection: detect anomalies

• Affinity analysis: co-occurrence patterns

Predictive

• Classification: predict a category

• Regression: predict a value

• Recommendation: predict a preference

Page 18: Hortonworks Big Data & Hadoop

A brief history of Apache Hadoop

[Timeline: 2004, 2006, 2008, 2010, 2012, 2013]

• Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop

• Focus on OPERATIONS. 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters

• STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo

Milestones shown on the timeline: Apache Project Established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform

Page 19: Hortonworks Big Data & Hadoop

HDP: Enterprise-Ready Hadoop

[Diagram: Hortonworks Data Platform (HDP) stack]

• Hadoop Core: distributed storage & processing (HDFS, MapReduce)

• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)

• Data Services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)

• Operational Services: manage & operate at scale (Oozie, Ambari)

• Runs on: OS / VM, cloud, appliance

Page 20: Hortonworks Big Data & Hadoop


Core Hadoop: HDFS & Map Reduce

Deliver high-scale storage & processing

• HDFS: distributed, self-healing data store

• Map-reduce: distributed computation framework that handles the complexities of distributed programming

Page 21: Hortonworks Big Data & Hadoop


Keys to Hadoop’s power

• Computation co-located with data

– Data and computation system co-designed and co-developed to work together

• Process data in parallel across thousands of “commodity” hardware nodes

– Self-healing; failure handled by software

• Designed for one write and multiple reads

– There are no random writes

– Optimized for minimum seek on hard drives
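As a rough illustration of the map-reduce model described above, the following single-process Python sketch mimics the three phases (map, shuffle, reduce) for a word count. It does not use Hadoop itself; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count`, as well as the sample input, are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each mapper would run next to the data block it
    # reads; here everything runs sequentially in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data science with big data"]
    print(word_count(sample))
```

The programmer only writes the mapper and the reducer; splitting the input, moving intermediate pairs between nodes, and retrying failed tasks are the parts a framework such as Hadoop takes care of.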

Page 22: Hortonworks Big Data & Hadoop

Inside HDP for Windows

Hortonworks Data Platform (HDP) for Windows

• 100% open source enterprise Hadoop

• Component and version compatible with Microsoft HDInsight

• Availability

– Beta release available now

– GA early 2Q 2012

[Diagram: Hortonworks Data Platform (HDP) for Windows stack]

• Hadoop Core: distributed storage & processing (HDFS, WebHDFS, MapReduce)

• Platform Services

• Data Services: store, process and access data (HCatalog, Hive, Pig, Sqoop)

• Operational Services: manage & operate at scale (Oozie)

Page 23: Hortonworks Big Data & Hadoop


Seamless Interoperability with Your Microsoft Tools

• Integrated with Microsoft tools for native big data analysis

– Bi-directional connectors for SQL Server and SQL Azure through Sqoop

– Excel ODBC integration through Hive

• Addressing demand for Hadoop on Windows

– Ideal for Windows customers with Hadoop operational experience

• Enables all common Hadoop workloads

– Data refinement and ETL offload for high-volume data landing

– Data exploration for discovery of new business opportunities

[Diagram: three layers]

• Applications: Microsoft applications

• Data systems: Hortonworks Data Platform for Windows

• Data sources: traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media); mobile data; OLTP, POS systems

Page 25: Hortonworks Big Data & Hadoop

Data Science, now with more data…

Page 26: Hortonworks Big Data & Hadoop

Benefits of Hadoop for data science

Benefit #1: Explore full datasets

Page 27: Hortonworks Big Data & Hadoop

Explore large datasets directly with Hadoop

[Diagram: iterative analysis cycle with the steps Acquire, Clean Data, Visualize/Grok, Model, and Measure/Evaluate]

Full dataset stored on Hadoop

Researcher laptop: R, Matlab, SAS, etc.

Page 28: Hortonworks Big Data & Hadoop

Integrate Hadoop in your data analysis flow

• Full dataset resides in Hadoop

• Typical Hadoop tasks:

– Simple statistics: mean, median, correlation

– Text pre-processing: grep, regex, NLP

– Dimensionality reduction: PCA, SVD, clustering, etc.

– Random sampling: with or without replacement, by unique

– K-fold cross-validation
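Two of the tasks listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the partitions are ordinary in-memory lists standing in for data blocks on a cluster, and the helper names `partial_stats`, `combine_mean` and `reservoir_sample` are invented for this example. The mean is computed from small per-partition summaries, and the sample is drawn without replacement in a single pass (reservoir sampling).

```python
import random


def partial_stats(partition):
    """Per-partition work (the 'map' side): return (count, sum)."""
    return len(partition), sum(partition)


def combine_mean(partials):
    """Merge the per-partition summaries (the 'reduce' side) into a mean."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


def reservoir_sample(stream, k, seed=0):
    """Uniform sample of k items without replacement, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5], [6]]
    print(combine_mean([partial_stats(p) for p in partitions]))
    print(reservoir_sample(range(1000), 5))
```

Only the small `(count, sum)` pairs travel between nodes, which is why a statistic like the mean is cheap to compute over the full dataset before pulling a sample down to a laptop.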

Page 29: Hortonworks Big Data & Hadoop

Benefits of Hadoop for data science

Benefit #2: Mine larger datasets

Page 30: Hortonworks Big Data & Hadoop


More data -> better outcomes

Banko & Brill, 2001

Halevy, Norvig & Pereira, 2009

Page 31: Hortonworks Big Data & Hadoop

Learning algorithms with large datasets…

Challenges:

• Data won’t fit in memory

• Learning takes a lot longer…

Using Hadoop:

• Distribute data across nodes in the Hadoop cluster

• Implement a distributed/parallel algorithm
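One common way to parallelize a learning algorithm is to let every partition compute a partial result that is then summed. The sketch below applies this idea to gradient descent for a one-variable linear model. It is a single-process illustration, not a Hadoop job: the partitions are plain lists, and `partial_gradient`, `fit_linear`, the learning rate and the toy data are assumptions chosen for the example.

```python
def partial_gradient(partition, w, b):
    """Work done on one node: gradient sums over its local records only."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def fit_linear(partitions, learning_rate=0.1, steps=2000):
    """Fit y = w * x + b by batch gradient descent over partitioned data."""
    w = b = 0.0
    for _ in range(steps):
        # "Map": each partition computes its partial gradient independently.
        partials = [partial_gradient(p, w, b) for p in partitions]
        # "Reduce": sum the partials, then take one averaged gradient step.
        n = sum(count for _, _, count in partials)
        w -= learning_rate * sum(gw for gw, _, _ in partials) / n
        b -= learning_rate * sum(gb for _, gb, _ in partials) / n
    return w, b


if __name__ == "__main__":
    # Toy data generated from y = 2x + 1, split across two "nodes".
    partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
    print(fit_linear(partitions))
```

No node ever needs the whole dataset in memory; each one sees only its own partition and returns three numbers per iteration.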

Page 32: Hortonworks Big Data & Hadoop

Benefits of Hadoop for data science

Benefit #3: Large-scale data preparation

Page 33: Hortonworks Big Data & Hadoop

80% of data science work is data preparation

[Diagram: raw data becomes processed data through the following steps]

• Strip away HTML/PDF/DOC/PPT

• Entity resolution

• Document vector generation

• Sampling, filtering

• Joins

• Term normalization

Page 34: Hortonworks Big Data & Hadoop

Hadoop is ideal for batch data preparation and cleanup of large datasets
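To make a few of these preparation steps concrete, here is a small sketch using only the Python standard library. It strips HTML markup, normalizes terms, and builds a bag-of-words document vector for a single document. The helper names (`strip_html`, `normalize_terms`, `document_vector`) and the very simple normalization rule are assumptions for illustration; in a batch setting a function like `document_vector` would be applied to each document independently, which is what makes the work easy to spread across a cluster.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of raw HTML, with tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw_html):
    """Bag-of-words term counts for one raw document."""
    return Counter(normalize_terms(strip_html(raw_html)))


if __name__ == "__main__":
    print(document_vector("<p>Big <b>Data</b>, big value!</p>"))
```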

Page 35: Hortonworks Big Data & Hadoop

Benefits of Hadoop for data science

Benefit #4: Accelerate data-driven innovation

Page 36: Hortonworks Big Data & Hadoop

Barriers to speed with traditional data architectures

• RDBMS uses “schema on write”; change is expensive

• High barrier for data-driven innovation

[Timeline marks: Start, 6 months, 9 months; labeled “Schema change project”]

• “I need new data”

• “Finally, we start collecting”

• “Let me see… is it any good?”

Page 37: Hortonworks Big Data & Hadoop

“Schema on read” means faster time-to-innovation

• Hadoop uses “schema on read”

• Low barrier for data-driven innovation

[Timeline marks: Start, 3 months, 6 months]

• “I need new data”

• “Let’s just put it in a folder on HDFS”

• “Let me see… is it any good?”

• “My model is awesome!”
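The idea behind “schema on read” can be sketched without any Hadoop tooling: the raw records are stored untouched, and a schema is applied only when they are read. In the example below, the tab-separated sample records, the field names and the `read_with_schema` helper are all hypothetical; this is plain Python, not Hive or HCatalog syntax.

```python
def read_with_schema(raw_lines, schema, delimiter="\t"):
    """Apply (name, type) pairs to raw delimited text at read time."""
    for line in raw_lines:
        fields = line.rstrip("\n").split(delimiter)
        # zip stops at the shorter sequence, so a schema may cover
        # only the leading fields of each record.
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Raw records exactly as they were dropped into storage.
RAW = ["2013-05-01\talice\t3\t19.99\n", "2013-05-02\tbob\t1\t5.00\n"]

# First question: only who showed up and when.
visits_schema = [("date", str), ("user", str)]

# Later question: a richer interpretation of the same files, no rewrite needed.
orders_schema = [("date", str), ("user", str), ("items", int), ("total", float)]

if __name__ == "__main__":
    print(list(read_with_schema(RAW, visits_schema)))
    print(list(read_with_schema(RAW, orders_schema)))
```

Changing the question means changing the schema passed to the reader, not migrating the stored data, which is the contrast with “schema on write” drawn on the previous slide.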

Page 38: Hortonworks Big Data & Hadoop


Quick start: Hortonworks Sandbox

• What is it

– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform

– A personal Hadoop environment

– An integrated learning environment with frequently and easily updatable hands-on step-by-step tutorials

• What it does

– Dramatically accelerates the process of learning Apache Hadoop

– Accelerates and validates the use of Hadoop within your unique data architecture

– Lets you use your data to explore and investigate your use cases

• ZERO to big data in 15 minutes

Download Hortonworks Sandbox: www.hortonworks.com/sandbox

Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/

Page 39: Hortonworks Big Data & Hadoop

Hadoop Summit

Architecting the Future of Big Data

• June 26-27, 2013, San Jose Convention Center

• Co-hosted by Hortonworks & Yahoo!

• Theme: Enabling the Next Generation Enterprise Data Platform

• 90+ Sessions and 7 Tracks

• Community Focused Event

– Sessions selected by a Conference Committee

– Community Choice allowed public to vote for sessions they want to see

• Pre-event training classes

– Apache Hadoop Essentials: A Technical Understanding for Business Users

– Understanding Microsoft HDInsight and Apache Hadoop

– Developing Solutions with Apache Hadoop – HDFS and MapReduce

– Applying Data Science using Apache Hadoop

• 10% discount code: 13DiscHUG10

hadoopsummit.org

Page 40: Hortonworks Big Data & Hadoop


Thank you!

Any Questions?

Ofer Mendelevitch

Director, Data Sciences @ Hortonworks

[email protected]

@ofermend, @hortonworks

We’re hiring!