Hortonworks: Big Data & Hadoop
DESCRIPTION
Presenter: Ofer Mendelevitch of Hortonworks
Learn the benefits of big data for data scientists, and how Hadoop and HDInsight fit into the modern data architecture and enable data-driven products. You'll learn:
* What data science actually means
* The term "data products"
* The benefits of using big data for data scientists
* How Hadoop helps data scientists work with big data
* About HDInsight, the big data platform from Microsoft and Hortonworks
TRANSCRIPT
© Hortonworks Inc. 2013
Big Data, Data Science & Hadoop
Ofer Mendelevitch
San Francisco Bay Area Microsoft Business Intelligence User Group
May 2013
Who am I?
Director of Data Sciences @ Hortonworks
• Data science with Hadoop
• Professional services
Previously…
A Chess Dad
Gartner’s 3 V’s of big data:
• Volume: size of the data
• Velocity: ingest speed, response latency
• Variety: diverse sources; format, structure; data quality
What Makes Up Big Data?
[Figure: nested tiers labeled ERP, CRM, WEB and BIG DATA; data sizes Megabytes, Gigabytes, Terabytes, Petabytes; axis “Increasing Data Variety and Complexity”]
Labels in the figure: Purchase detail, Purchase record, Payment record, Offer details, Support Contacts, Customer Touches, Segmentation, Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels, User Generated Content, Mobile Web, SMS/MMS, Sentiment, External Demographics, HD Video, Audio, Images, Speech to Text, Product/Service Logs, Social Interactions & Feeds, Business Data Feeds, User Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Transactions + Interactions + Observations = BIG DATA
Where is all this data coming from?
• Sensors/devices
• Online: social, forums, etc.
• Event logs
• Etc.
But also:
• Data that was “thrown away” previously
Ok… so what is big data?
I like a quote from Michael Franklin (UCB):
“Big Data is any data that is expensive to manage and hard to extract value from”
It’s a relative term.
Today’s big data may be tomorrow’s small data.
What is a data product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
Example 1: Google Adwords
Example 2: People you may know
Example 3: spell correction
What is data science?
#1: Extracting deep meaning from data (data mining; finding “gems” in data)
© Hortonworks Inc. 2013 Page 15
What is data science?
#2: Building data products (Delivering gems on a regular basis)
[Figure labels: Pre-process, Build model, SQL; Periodic batch processing; Online serving]
© Hortonworks Inc. 2013 Page 16
Common data science tasks
Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
© Hortonworks Inc. 2013 Page 17
© Hortonworks Inc. 2013
A brief history of Apache Hadoop
[Timeline figure, 2004 to 2013, with the labels: Yahoo! begins to operate at scale; Enterprise Hadoop; Apache Project Established; Hortonworks Data Platform]
• Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
• Focus on OPERATIONS. 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
• STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)
• HADOOP CORE (distributed storage & processing): HDFS, MAP REDUCE
• PLATFORM SERVICES (enterprise readiness): HA, DR, Snapshots, Security, …
• DATA SERVICES (store, process and access data): HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME
• OPERATIONAL SERVICES (manage & operate at scale): OOZIE, AMBARI
• OS / VM, Cloud, Appliance
Core Hadoop: HDFS & Map Reduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• Map-reduce: distributed computation framework that handles the complexities of distributed programming
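A minimal sketch of the map-reduce model, run in a single Python process rather than on a cluster. The function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, not Hadoop APIs; on a real cluster the framework runs the mappers and reducers on many nodes and performs the shuffle itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce locally (hypothetical driver, no cluster)."""
    emitted = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "hadoop stores big data"]
    print(word_count(sample))
```

The mapper and reducer only ever see one line or one key at a time, which is what lets a framework spread them across nodes without the programmer handling distribution.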
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software
• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives
Inside HDP for Windows
Hortonworks Data Platform (HDP) For Windows
• 100% Open Source Enterprise Hadoop
• Component and version compatible with Microsoft HDInsight
• Availability
– Beta release available now
– GA early 2Q 2012
HORTONWORKS DATA PLATFORM (HDP) For Windows
• HADOOP CORE (distributed storage & processing): HDFS, WEBHDFS, MAP REDUCE
• PLATFORM SERVICES
• DATA SERVICES (store, process and access data): HCATALOG, HIVE, PIG, SQOOP
• OPERATIONAL SERVICES (manage & operate at scale): Oozie
Seamless Interoperability with Your Microsoft Tools
• Integrated with Microsoft tools for native big data analysis
– Bi-directional connectors for SQL Server and SQL Azure through SQOOP
– Excel ODBC integration through Hive
• Addressing demand for Hadoop on Windows
– Ideal for Windows customers with Hadoop operational experience
• Enables all common Hadoop workloads
– Data refinement and ETL offload for high-volume data landing
– Data exploration for discovery of new business opportunities
[Figure: three layers]
• APPLICATIONS: Microsoft Applications
• DATA SYSTEMS: HORTONWORKS DATA PLATFORM For Windows
• DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensor data, social media); MOBILE DATA; OLTP, POS SYSTEMS
Data Science, now with more data…
Benefits of Hadoop for data science
Benefit #1: Explore full datasets
Explore large datasets directly with Hadoop
[Figure: data analysis cycle with the steps Acquire; Clean Data; Visualize, Grok; Model; Measure/Evaluate]
• Full dataset stored on Hadoop
• Researcher laptop: R, Matlab, SAS, etc.
Integrate Hadoop in your data analysis flow
• Full dataset resides in Hadoop
• Typical Hadoop tasks:
– Simple statistics: mean, median, correlation
– Text pre-processing: grep, regex, NLP
– Dimensionality reduction: PCA, SVD, clustering, etc.
– Random sampling: with or without replacement, by unique
– K-fold cross-validation
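A hedged sketch of how three of these tasks can be phrased so they work one partition or one record at a time, which is the shape a Hadoop job needs. Everything here is plain Python with hypothetical helper names (partial_stats, combine_mean, reservoir_sample, fold_of); it illustrates the idea and is not code from the talk.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def partial_stats(values: Iterable[float]) -> Tuple[int, float]:
    """Per-partition step: return (count, sum) for one chunk of the data."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
    return count, total


def combine_mean(partials: Sequence[Tuple[int, float]]) -> float:
    """Combine step: merge per-partition (count, sum) pairs into one mean."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(total for _, total in partials)
    if total_count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total_sum / total_count


def reservoir_sample(stream: Iterable[float], k: int, seed: int = 0) -> List[float]:
    """Sample k items without replacement in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)
        else:
            slot = rng.randint(0, index)  # randint is inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


def fold_of(record_id: int, k: int) -> int:
    """Assign a record to one of k cross-validation folds from its ID."""
    return record_id % k


if __name__ == "__main__":
    partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(combine_mean([partial_stats(p) for p in partitions]))
    print(reservoir_sample(range(1000), k=5))
    print([fold_of(i, 5) for i in range(10)])
```

The mean is computed from small (count, sum) pairs, so only those pairs need to travel between nodes, not the data itself.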
Benefits of Hadoop for data science
Benefit #2: Mine larger datasets
More data -> better outcomes
Banko & Brill, 2001
Halevy, Norvig & Pereira, 2009
Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
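One common way to make a learning algorithm parallel is to have each partition compute a partial gradient and then sum the partials. The sketch below does this for a one-feature linear regression in a single Python process; the partitions stand in for data blocks on different nodes, and the names (partial_gradient, train) and the choice of gradient descent are assumptions for illustration, not the method from the talk.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y)


def partial_gradient(partition: Sequence[Point], w: float, b: float) -> Tuple[float, float, int]:
    """Per-partition step: summed gradient of 0.5 * squared error, plus row count."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def train(partitions: Sequence[Sequence[Point]], steps: int = 2000, lr: float = 0.05) -> Tuple[float, float]:
    """Driver: each step combines the partial gradients and updates the model."""
    w = b = 0.0
    for _ in range(steps):
        # On a cluster, these calls would run in parallel, one per data block.
        partials: List[Tuple[float, float, int]] = [
            partial_gradient(partition, w, b) for partition in partitions
        ]
        n = sum(count for _, _, count in partials)
        grad_w = sum(gw for gw, _, _ in partials) / n
        grad_b = sum(gb for _, gb, _ in partials) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Points on the line y = 2x + 1, split across three hypothetical nodes.
    data = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)], [(4.0, 9.0), (5.0, 11.0)]]
    print(train(data))
```

Only the small gradient tuples are combined centrally, so the full dataset never has to fit in one machine's memory.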
Benefits of Hadoop for data science
Benefit #3: Large-scale data preparation
80% of data science work is data preparation
[Figure: Raw Data turned into Processed Data through these steps]
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
Hadoop is ideal for batch data preparation and cleanup of large datasets
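A small sketch of three of the preparation steps named on the previous slide (stripping HTML, term normalization, document vector generation), applied to one document at a time. It uses only the Python standard library; the function names are hypothetical, and the normalization rule (lowercase, keep runs of letters and digits) is an assumed simplification.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(raw: str) -> str:
    """Strip away HTML markup, keeping only the visible text."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text: str) -> List[str]:
    """Term normalization (assumed rule): lowercase, letters and digits only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw_html: str) -> Dict[str, int]:
    """Document vector generation: term -> count for one document."""
    return dict(Counter(normalize_terms(strip_html(raw_html))))


if __name__ == "__main__":
    print(document_vector("<p>Big data, <b>BIG</b> value</p>"))
```

Each document is processed independently of the others, so the same function could run as the map step of a batch job over a large corpus.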
Benefits of Hadoop for data science
Benefit #4: Accelerate data-driven innovation
Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
[Timeline figure (Start, 6 months, 9 months): “I need new data” → Schema change project → “Finally, we start collecting” → “Let me see… is it any good?”]
“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
[Timeline figure (Start, 3 months, 6 months): “I need new data” → “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”]
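A toy illustration of “schema on read”: the raw records are stored exactly as they arrive, and a schema is applied only when the data is read. The record layout, the field names and the read_with_schema helper are all hypothetical; an in-memory list of JSON lines stands in for files in an HDFS folder.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Raw lines kept as-is; the second record carries a field the first one lacks.
RAW_LINES = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "5", "device": "mobile"}',
]

Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply a schema at read time: pick the fields, cast them, default to None."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


if __name__ == "__main__":
    schema_v1: Schema = {"user": str, "clicks": int}
    print(list(read_with_schema(RAW_LINES, schema_v1)))

    # A new question needs a new field: change the schema, not the stored data.
    schema_v2: Schema = {"user": str, "clicks": int, "device": str}
    print(list(read_with_schema(RAW_LINES, schema_v2)))
```

Adding the device field required no migration of the stored lines, which is the low barrier the slide contrasts with a schema change project.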
Quick start: Hortonworks Sandbox
• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on, step-by-step tutorials
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Use your data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox: www.hortonworks.com/sandbox
Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/
Hadoop Summit
Architecting the Future of Big Data
• June 26-27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation Enterprise Data Platform
• 90+ Sessions and 7 Tracks
• Community Focused Event
– Sessions selected by a Conference Committee
– Community Choice allowed public to vote for sessions they want to see
• Pre-event training classes
– Apache Hadoop Essentials: A Technical Understanding for Business Users
– Understanding Microsoft HDInsight and Apache Hadoop
– Developing Solutions with Apache Hadoop – HDFS and MapReduce
– Applying Data Science using Apache Hadoop
• 10% discount code: 13DiscHUG10
hadoopsummit.org
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
@ofermend, @hortonworks
We’re hiring!