why hadoop for data science?
© Hortonworks Inc. 2013
Why Hadoop for data science?
Ofer Mendelevitch
PASS BA Conference, April 2013
A brief history of Apache Hadoop
[Timeline, 2004 to 2013. Milestones shown: Apache Project Established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform]
• Focus on INNOVATION, 2005: Yahoo! creates team under E14 to work on Hadoop
• Focus on OPERATIONS, 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters
• STABILITY, 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo!
Core Hadoop: HDFS & Map Reduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• Map-reduce: distributed computation framework that handles the complexities of distributed programming
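As a rough illustration of the map-reduce model named above, here is a minimal single-process sketch in plain Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the framework, not user code, performs the shuffle and spreads the work across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
```

The programmer writes only the mapper and the reducer; partitioning, grouping and failure handling are the "complexities of distributed programming" the framework takes on.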
Keys to Hadoop’s power
• Computation co-located with data
  – Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of “commodity” hardware nodes
  – Self-healing; failure handled by software
• Designed for one write and multiple reads
  – There are no random writes
  – Optimized for minimum seek on hard drives
HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)
• HADOOP CORE: Distributed Storage & Processing (HDFS, MAP REDUCE)
• PLATFORM SERVICES: Enterprise Readiness (HA, DR, Snapshots, Security, …)
• DATA SERVICES: Store, Process and Access Data (HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME)
• OPERATIONAL SERVICES: Manage & Operate at Scale (OOZIE, AMBARI)
• Runs on: OS / VM, Cloud, Appliance
What is a data product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
Example 1: Google Adwords
Example 2: People you may know
Example 3: spell correction
What is data science?
#1: Extracting deep meaning from data (data mining; finding “gems” in data)
Common data science tasks
Descriptive
• Clustering: Detect natural groupings
• Outlier detection: Detect anomalies
• Affinity analysis: Co-occurrence patterns
Predictive
• Classification: Predict a category
• Regression: Predict a value
• Recommendation: Predict a preference
What is data science?
#2: Building data products (delivering gems on a regular basis)
[Diagram: Pre-process, Build model, SQL; labeled “Periodic batch processing” and “Online serving”]
Reason #1: Explore full datasets
Why Hadoop for data science?
Explore large datasets directly with Hadoop
[Diagram: analysis cycle of Acquire, Clean Data, Visualize / Grok, Model, Measure / Evaluate]
• Full dataset stored on Hadoop
• Researcher laptop: R, Matlab, SAS, etc.
Integrate Hadoop in your data analysis flow
• Exploratory data analysis on full dataset
  – Simple statistics: mean, median, quantile, etc.
  – Pre-processing: grep, regex, etc.
• Ad-hoc sampling / filtering
  – Random: with or without replacement
  – Sample by unique key
  – K-fold cross-validation
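The sampling strategies listed above can be sketched in a few lines of plain Python. This is an illustrative, single-machine sketch under assumptions, not code from the talk: the helper names (`sample_by_key`, `reservoir_sample`, `fold_of`) and the `events` data are hypothetical, and on a cluster each function would typically run inside a mapper.

```python
import hashlib
import random


def sample_by_key(records, key_fn, fraction):
    """Sample by unique key: keep ALL records of a key, or none of them.

    The key is hashed into 1000 buckets; a key is kept when its bucket
    falls inside the requested fraction, so the decision is deterministic.
    """
    kept = []
    for record in records:
        digest = hashlib.md5(str(key_fn(record)).encode("utf-8")).hexdigest()
        if int(digest, 16) % 1000 < fraction * 1000:
            kept.append(record)
    return kept


def reservoir_sample(records, k, seed=0):
    """Random sample of k records without replacement, in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def fold_of(index, n_folds):
    """Assign a record to one of n_folds cross-validation folds by position."""
    return index % n_folds


# Hypothetical data: 50 users with 3 events each.
events = [{"user": u, "n": n} for u in range(50) for n in range(3)]
by_user = sample_by_key(events, lambda r: r["user"], 0.2)
random_sample = reservoir_sample(events, 10)
folds = [fold_of(i, 5) for i in range(len(events))]
```

Hashing the key rather than drawing a random number per record means a sampled user keeps all of their events, which matters when the model is built per user.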
Reason #2: Mine larger datasets
Why Hadoop for data science?
More data -> better outcomes
Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009
Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
  – Recommendation: Alternating Least Squares (ALS)
  – Clustering: K-means
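To make the "distributed/parallel algorithm" idea concrete, here is a minimal sketch of how one K-means iteration can be phrased as a map step and a reduce step. It runs in a single Python process and is only an illustration under assumptions: the function names and the toy `points` are hypothetical, and a production implementation would run the mapper on each node's share of the data.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def kmeans_map(point, centroids):
    """Mapper: emit (cluster index, (point, 1)) for one data point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reducer: average all points assigned to one cluster."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round; returns the updated centroids."""
    grouped = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        grouped[key].append(value)
    # A cluster that received no points keeps its previous centroid.
    return [kmeans_reduce(grouped[i]) if i in grouped else centroids[i]
            for i in range(len(centroids))]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):
    centroids = kmeans_iteration(points, centroids)
```

Only the small list of centroids has to be shared with every mapper; the data points themselves never leave the node that stores them, which is what lets the approach handle data that will not fit in one machine's memory.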
Reason #3: Large-scale data preparation
Why Hadoop for data science?
80% of data science work is data preparation
Raw Data -> Processed Data:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
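Three of the preparation steps above (stripping markup, term normalization, document vector generation) can be sketched for a single document using only the Python standard library. This is a hedged illustration, not the speaker's pipeline: the names `TextExtractor`, `strip_html`, `normalize_terms` and `document_vector` are hypothetical, and the normalization rule (lowercase, alphanumeric tokens only) is an assumption chosen for brevity.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Strip away HTML markup and return the remaining text."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Term normalization (assumed rule): lowercase, alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw):
    """Document vector generation: term -> count for one document."""
    return Counter(normalize_terms(strip_html(raw)))


vector = document_vector("<p>Hadoop stores <b>Big</b> data. Big DATA!</p>")
```

Each document is processed independently of the others, so this kind of work parallelizes naturally as a map-only batch job.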
Hadoop is ideal for batch data preparation and cleanup of large datasets
Reason #4: Accelerate data-driven innovation
Why Hadoop for data science?
Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
[Timeline: Start, 6 months, 9 months; includes a “Schema change project” phase]
  – “I need new data”
  – “Finally, we start collecting”
  – “Let me see… is it any good?”
“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
[Timeline: Start, 3 months, 6 months]
  – “I need new data”
  – “Let’s just put it in a folder on HDFS”
  – “Let me see… is it any good?”
  – “My model is awesome!”
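A small sketch may help show what “schema on read” means in practice. The raw records are stored exactly as they arrive, and each reader supplies its own schema when it reads them. Everything here is hypothetical and for illustration only: `RAW_LINES` stands in for files in an HDFS folder, and `read_with_schema` is not a Hadoop or Hive API.

```python
import json

# Hypothetical raw records, stored as-is with no schema enforced at write time.
RAW_LINES = [
    '{"user": "a", "clicks": "3", "country": "US"}',
    '{"user": "b", "clicks": "5"}',
    '{"user": "c", "clicks": "7", "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Fields missing from a
    record become None; fields not named in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: (cast(record[name]) if name in record else None)
               for name, cast in schema.items()}


# Two different "schemas" over the same stored bytes, with no migration step.
clicks_view = list(read_with_schema(RAW_LINES, {"user": str, "clicks": int}))
device_view = list(read_with_schema(RAW_LINES, {"user": str, "device": str}))
```

When a new field such as `device` starts appearing in the data, nothing already stored has to change; only readers that care about it add it to their schema.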
Summary
Why use Hadoop for data science?
1. Data exploration with full datasets
2. Mine larger datasets
3. Pre-processing at scale
4. Faster data-driven cycles
Quick start: Hortonworks Sandbox
• What is it
  – A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
  – A personal Hadoop environment
  – An integrated learning environment with frequently, easily updatable hands-on step-by-step tutorials
• What it does
  – Dramatically accelerates the process of learning Apache Hadoop
  – Accelerates and validates the use of Hadoop within your unique data architecture
  – Lets you use your data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox: www.hortonworks.com/sandbox
Sign up for Training for in-depth learning: hortonworks.com/hadoop-training/
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
@ofermend, @hortonworks
Come visit us @ Booth S5
We’re hiring!