2013 dec 9 data marketing 2013 - hadoop
DESCRIPTION
Data Marketing 2013 Presentation of Hadoop. The paradigm shift in 45 minutes or less. No, really.
TRANSCRIPT
ELEPHANT AT THE DOOR: HADOOP AND NEXT GENERATION DATA
Adam Muise – Solution Architect, Hortonworks
Who am I?
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
We do Hadoop successfully.
Support
Professional Services
Training
We do Hadoop successfully everywhere.
We do Hadoop successfully, everywhere, with partners.
What is Hadoop? What is everyone talking about?
Data
“Big Data” is the marketing term of the decade in IT
What lurks behind the hype is the democratization of Data.
You need data.
But what do you do with your data now?
We are obsessive compulsive about collecting and structuring our data.
Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
You need data. Your customers expect you to know what they want before they do.
Let’s talk challenges…
[Graphic: the word “Volume” repeated across a series of slides]
Storage, Management, Processing all become challenges with Data at Volume
Traditional technologies adopt a divide, drop, and conquer approach
The solution?
[Diagram: data divided among separate stores: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
Ummm…you dropped something
[Diagram: loose data alongside the EDW, Yet Another EDW, Analytical DB, OLTP, and Another EDW stores]
Analyzing the data usually raises more interesting questions…
…which leads to more data
Wait, you’ve seen this before.
[Diagram: data and the “Analytics Sausage Factory”]
Data begets Data.
What keeps us from our Data?
“Prices, Stupid passwords, and Boring Statistics.” - Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Diagram: separate data silos: EDW, Accounts, Customers, Web Properties]
… Data likes to be together.
[Diagram: EDW, Accounts, Customers, Web Properties]
Data likes to socialize too.
[Diagram: EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, CDR, Weather Data]
New types of data don’t quite fit into your pristine view of the world.
[Diagram: “My Little Data Empire” beside Logs, Machine Data, and question marks]
To resolve this, some people take hints from Lord Of The Rings...
…and create One-Schema-To-Rule-Them-All…
[Diagram: EDW with a single Schema]
…but that has its problems too.
[Diagram: EDW and Schema fed by ETL jobs]
Fragile workflows make supporting the analytical models you want expensive and time-consuming.
[Diagram: EDW and Schema fed by ETL jobs]
What do you want to do with data?
Marketing Analytics needs data. Work with the population, not just a sample.
Your segmentation today.
Male
Female
Age: 25-30
Town/City
Middle Income Band
Product Category Preferences
Your segmentation with better data.
Male
Female
Age: 27 but feels old
GPS coordinates
$65-68k per year
Product recommendations
Tea Party Hippie
Looking to start a business
Walking into Starbucks right now…
A depressed Toronto Maple Leafs fan
Products left in basket indicate drunk Amazon shopper
Gene Expression for Risk Taker
Thinking about a new house
Unhappy with his cell phone plan
Pregnant
Spent 25 minutes looking at tea cozies
Pick up all of that data that was prohibitively expensive to store and use.
Why do viewer surveys…
…when raw data can tell you what button on the remote was pressed during what commercial for the entire viewer population?
To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
………
Hadoop was created because traditional technologies never cut it for the Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn
Traditional architecture didn’t scale enough…
[Diagram: three stacks, each with App servers over DB servers over a SAN]
Databases can become bloated and useless
Traditional architectures cost too much at that volume…
$/TB
$pecial Hardware
$upercomputing
So what is the answer?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes]
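A minimal sketch of what "resilient" and "self-healing" could mean for a distributed file system: keep several copies of every block on different nodes, and re-copy when a node dies. The function names `place_blocks` and `heal` and the node names are hypothetical and are not Hadoop APIs. The replication factor of 3 mirrors the HDFS default. The round-robin placement is an assumption made only to keep the example short.

```python
import itertools

REPLICATION = 3  # mirrors the HDFS default replication factor


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    ring = itertools.cycle(nodes)
    for block in blocks:
        holders = set()
        while len(holders) < replication:
            holders.add(next(ring))
        placement[block] = holders
    return placement


def heal(placement, dead_node, live_nodes, replication=REPLICATION):
    """Forget a failed node, then copy under-replicated blocks to other live nodes."""
    for holders in placement.values():
        holders.discard(dead_node)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no-op if this node already holds the block
    return placement


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["block-a", "block-b"], nodes)
heal(placement, dead_node="n1", live_nodes=["n2", "n3", "n4"])
```

After `heal` runs, every block is back to three copies and none of them lives on the failed node. Cheap commodity disks plus copies is the trade that makes the storage cost-efficient.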
It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: a grid of Storage nodes, each paired with Processing]
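A rough sketch of "taking tasks to the data", in the map/reduce style: each node counts words in the lines it already stores, and only the small partial counts travel to be merged. This is a single-process illustration, not the Hadoop MapReduce API. The names `map_local`, `reduce_counts`, `run_job`, and the node names are hypothetical.

```python
from collections import Counter


def map_local(lines):
    """Map task: runs where the data lives and emits partial word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce task: merges the small partial results from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(data_by_node):
    """Run one map task per node on its local lines, then reduce the results."""
    partials = [map_local(lines) for lines in data_by_node.values()]
    return reduce_counts(partials)


# Hypothetical cluster: each node already holds its own slice of the data.
cluster = {
    "node1": ["data begets data"],
    "node2": ["Data likes to be together"],
}
word_counts = run_job(cluster)
```

In a real cluster the map tasks would run in parallel on separate machines. The point of the design is that the raw lines never leave the node that stores them.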
It would probably run on commodity hardware, virtualized machines, and common OS platforms
[Diagram: a grid of Storage nodes, each paired with Processing]
It would probably be open source so innovation could happen as quickly as possible
It would need a critical mass of users
Hadoop 2 just hit the ground: Introducing YARN
YARN lets you run more data apps than ever before
[Diagram: HDFS2, YARN, and the apps it runs: MapReduce V2, MapReduce V?, STORM, MPI, Giraph, HBase, Tez … and more]
YARN turns Hadoop into a smart phone: An App Ecosystem
hortonworks.com/yarn/
YARN: Yeah, we did that too.
hortonworks.com/yarn/
Apache Hadoop
Flume Ambari
HBase Falcon
MapReduce HDFS
Sqoop HCatalog
Pig
Hive
Storm YARN
Hortonworks Data Platform
Flume Ambari
HBase Falcon
MapReduce HDFS
Sqoop HCatalog
Pig
Hive
Storm YARN
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Data Operating System for the Enterprise
© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation