Big Data 2.0: How Spark Technologies Are Reshaping the World of Big Data Analytics
Post on 07-Jan-2017
Big Data 2.0: How Spark Technologies Are Reshaping the World of Big Data Analytics
Presented by: Lillian Pierson, P.E.
Today’s webinar
Apache Spark: journey from “Hadoop ecosystem component” to “big data platform”
The story of how Spark began
Is Spark a data engineering or data science platform?
Who is using Spark and for what?
Got Spark skills? Here’s why you should
Apache Spark: Journey from “Hadoop Ecosystem Component” to “Big Data Platform”
What is Spark?
“In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage, processing, and analysis is fast enough to generate data analytics in real time, derived from streaming data sources.” – Excerpt from my book, Big Data/Hadoop for Dummies
Why in-memory applications?
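To make the point in the excerpt concrete, here is a plain-Python sketch (not Spark code) of why in-memory iteration beats disk-based MapReduce for iterative workloads: MapReduce goes back to disk on every pass, while an in-memory engine loads the working set once and reuses it. The `read_from_disk` helper and the read counter are illustrative stand-ins, not anything from Spark or Hadoop.

```python
# Count simulated disk reads for 5 iterations of the same computation.
disk_reads = {"count": 0}

def read_from_disk():
    disk_reads["count"] += 1
    return list(range(10))          # stand-in for a large HDFS dataset

def one_pass(data):
    return sum(data)                # stand-in for one map/reduce pass

# MapReduce style: every iteration re-reads the input from disk.
for _ in range(5):
    one_pass(read_from_disk())
mapreduce_reads = disk_reads["count"]   # 5 disk reads

# In-memory style: load once, cache, iterate on the cached copy.
disk_reads["count"] = 0
cached = read_from_disk()
for _ in range(5):
    one_pass(cached)
in_memory_reads = disk_reads["count"]   # 1 disk read
```

Five iterations cost five disk round-trips the MapReduce way, but only one when the dataset is cached, which is the gap that widens as iterations (e.g. machine-learning training loops) multiply.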
From Hadoop ecosystem component…
◦ HDFS
◦ MapReduce 2.0
◦ YARN

From Hadoop ecosystem component…
◦ HDFS
◦ Spark alongside MapReduce 2.0
◦ YARN

To big data platform
◦ HDFS
◦ Spark alongside MapReduce 2.0 on YARN

To big data platform
◦ Spark-as-a-Service
Spark’s 4 submodules
◦ Spark SQL
◦ MLlib
◦ GraphX
◦ Streaming
Spark SQL module
DataFrames
◦ Spark SQL: SQL
◦ Hive: HiveQL
◦ Both run on the Spark processing engine
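A plain-Python sketch (not the Spark API) of the DataFrame idea underneath Spark SQL: named columns over rows, queried declaratively, with the same engine answering whether the query arrives as SQL or as method calls. The `select_where` helper and the sample rows are invented for illustration.

```python
# Toy rows standing in for a DataFrame: each row has named columns.
rows = [
    {"name": "Ada",  "lang": "Scala",  "years": 5},
    {"name": "Bo",   "lang": "Python", "years": 2},
    {"name": "Cleo", "lang": "Scala",  "years": 7},
]

def select_where(rows, column, predicate, keep):
    """A toy 'SELECT keep FROM rows WHERE predicate(column)'."""
    return [{k: r[k] for k in keep} for r in rows if predicate(r[column])]

# Equivalent to: SELECT name FROM rows WHERE lang = 'Scala'
scala_users = select_where(rows, "lang", lambda v: v == "Scala", ["name"])
# scala_users == [{"name": "Ada"}, {"name": "Cleo"}]
```

In real Spark SQL, a query string and the equivalent DataFrame method chain compile to the same optimized plan; here both would just call the same helper.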
MLlib module
◦ Data analysis
◦ Statistics
◦ Machine learning
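A plain-Python sketch of the pattern MLlib-style distributed statistics rely on: each partition computes a small summary (count, sum), and only the summaries are reduced into one global result, so the raw data never has to move. The partition layout here is invented for illustration.

```python
# Three "partitions" of a dataset, as they might sit on three workers.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Map step: each partition reports a tiny (count, sum) summary.
partials = [(len(p), sum(p)) for p in partitions]

# Reduce step: combine the summaries into global count and total.
count, total = (sum(c) for c in zip(*partials))

mean = total / count   # global mean computed without centralizing the data
```

The same combine-the-summaries trick extends to variance, gradients, and most of the statistics MLlib computes at scale.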
GraphX module
Graph data storage and processing
◦ GraphX: in-memory graph data processing
◦ HDFS: graph data storage
Streaming module
◦ Continuously streaming data
◦ Discretized data streams (DStreams)
◦ Micro-batch processing

DStreams and micro-batch architecture: a DStream is a sequence of RDDs, one per batch interval (RDD @ time 1, RDD @ time 2, RDD @ time 3)
Source: http://www.slideshare.net/skpabba/hadoop-and-spark
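The micro-batch idea in that diagram can be sketched in plain Python (no Spark API): a continuous stream of timestamped events is discretized into fixed-interval batches, and each batch is then processed like a small RDD. The event list and one-second interval are invented example values.

```python
# Simulated stream of (timestamp_seconds, value) events.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
interval = 1.0   # batch interval in seconds

# Discretize: bucket each event into its batch interval.
batches = {}
for ts, value in events:
    batches.setdefault(int(ts // interval), []).append(value)

# Each micro-batch plays the role of "RDD @ time n" in the diagram;
# here we just count events per batch.
counts = {t: len(vals) for t, vals in sorted(batches.items())}
# counts == {0: 2, 1: 1, 2: 1}
```

Shrinking the interval trades latency for per-batch overhead, which is exactly the tuning knob Spark Streaming exposes.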
Basic Spark architecture, from top to bottom:
◦ Spark core libraries: Spark SQL, MLlib, GraphX, Streaming (one processing module each)
◦ Single abstraction layer
◦ Resource manager (YARN)
◦ Data storage layer (HDFS)
◦ Physical hardware
Changes with Spark 2.0
◦ Spark 1.0: RDD API
◦ Spark 1.3: RDD API + DataFrame API
◦ Spark 1.6: RDD API + DataFrame API + Dataset API
◦ Spark 2.0: RDD API + Dataset API (the DataFrame API is unified with the Dataset API: a DataFrame is now a Dataset of Rows)
Changes with Spark 2.0
◦ Structured stream processing, built on the DataFrame and Dataset APIs
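The idea behind structured stream processing can be sketched in plain Python (not the Spark API): treat the stream as an unbounded table and keep a running aggregate that is folded forward as each micro-batch of new rows arrives. The word-count example and `update` helper are invented for illustration.

```python
from collections import Counter

# Running state: a word count over everything seen so far.
running = Counter()

def update(running, batch):
    """Fold a new micro-batch of words into the running aggregate."""
    running.update(batch)
    return dict(running)

# Two micro-batches arrive; the aggregate is updated incrementally,
# never recomputed from scratch.
update(running, ["spark", "sql"])
state = update(running, ["spark", "streaming"])
# state == {"spark": 2, "sql": 1, "streaming": 1}
```

This is the mental model Structured Streaming asks for: you declare the query over the whole (unbounded) table, and the engine maintains the incremental state for you.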
The story of how Spark began
Taking things from the beginning… 2009
◦ Began at UC Berkeley, alongside the Mesos project
◦ Built for interactive, iterative, in-memory parallel processing, driven by machine learning requirements
◦ Integrates with the Hadoop ecosystem
Dr. Ion Stoica, Computer Science Professor, UC Berkeley
Databricks… the cutting edge of Spark
◦ Delivers Apache Spark-as-a-Service
◦ Most popular solution for deploying Spark in the cloud
Dr. Ion Stoica, Executive Chairman, Databricks
Databricks… the cutting edge of Spark
Spark on an as-needed basis
Automates:
◦ Cluster building and configuration
◦ Security
◦ Process monitoring
◦ Resource monitoring
Notebooks:
◦ For data analysis and machine learning using Python, R, and Scala
Data visualization capabilities:
◦ Data visualization and dashboard design options
Is Spark a data engineering or data science platform?
◦ Data engineering components and technologies
◦ Data science components and technologies
Spark’s data engineering elements
Automate cluster sizing and configuration requirements
Data storage: HDFS
Resource management:
◦ Spark Standalone
◦ Apache Mesos
◦ Hadoop YARN
Spark’s data engineering elements
Spark Streaming submodule – reuse the same code you use for batch processing, but get real-time results!
Integrates with big data sources, like:
◦ HDFS
◦ Flume
◦ Kafka
◦ Twitter
◦ ZeroMQ
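The "reuse your batch code" point above can be sketched in plain Python (not the Spark Streaming API): one transformation function runs unchanged over a full batch dataset and over each micro-batch of a stream, and the results line up. The log-filtering example is invented for illustration.

```python
def transform(records):
    """Batch logic: keep error lines, normalized to lowercase."""
    return [r.lower() for r in records if "error" in r.lower()]

# Batch mode: the whole dataset at once.
batch_data = ["ERROR disk", "ok", "Error net"]
batch_result = transform(batch_data)

# Streaming mode: the same records arrive as three micro-batches,
# and the SAME function is applied to each batch as it lands.
stream = [["ERROR disk"], ["ok"], ["Error net"]]
stream_result = [r for chunk in stream for r in transform(chunk)]

# Same code path, same answer:
# batch_result == stream_result == ["error disk", "error net"]
```

That single-code-path property is the practical draw: logic debugged against historical data can be pointed at a live source without a rewrite.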
Doing data science with Spark
Useful for machine learning and analysis of big data
Build big data analytics products
Programmable in Python, R, Scala, and SQL
Submodules:
◦ SQL and DataFrames
◦ MLlib for machine learning
◦ GraphX for in-memory big (graph) data computations
Doing data science with Spark
Spark integrates with the following data sources and formats:
◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase
◦ BI tools: Tableau, Qlik, ZoomData, etc. (through JDBC)
Who is using Spark and for what?
◦ Automatic Labs
◦ LendUp
◦ Sellpoints
◦ Findify
Automatic Labs on Databricks
Making cars smarter with real-time analytics
Connect to, and make smart use of, your car’s data
Automatic Labs on Databricks
Automatic apps do things like:
◦ Decoding engine problems
◦ Locating parked cars
◦ Crash detection and response
◦ Low fuel warnings, etc.
Automatic is using Spark to make cars smarter with real-time analytics
During product development, Automatic needs to query, explore, and visualize large amounts of data, QUICKLY. By moving this work over to Spark, Automatic was able to:
◦ Validate products in days, not weeks
◦ Complete complex queries in minutes
◦ Free up 1 full-time data scientist
◦ Save $10K/month on infrastructure costs
LendUp on Databricks
Improving the lending process and experience
“Moving up the LendUp Ladder means earning access to more money, at better rates, for longer periods of time.” – LendUp
LendUp on Databricks
LendUp uses Spark for:
◦ Feature engineering at scale
◦ Fast model building and testing
By using Spark to do this work, LendUp is able to:
◦ Build more accurate models, faster
◦ Offer more lines of credit
◦ Develop new products more quickly
◦ Increase in-house productivity of data science team
Sellpoints on Databricks
Increasing ROI on ad spend
Sellpoints offers services in:
◦ Identifying qualified shoppers
◦ Driving traffic
◦ Increasing sales conversion
By moving to Databricks, Sellpoints was able to:
◦ Productize a new predictive analytics offering, improving ad spend ROI threefold compared to competitive offerings
◦ Reduce the time and effort required to deliver actionable insights to the business team while lowering costs.
◦ Improve productivity of the engineering and data science team by eliminating the time spent on DevOps and maintaining open source software.
Findify on Databricks
Improving the shopping experience for ecommerce customers
Uses machine learning to continually improve search accuracy
By moving to Databricks, Findify was able to:
◦ Focus on development instead of infrastructure – allowing them to complete feature development projects faster and reduce customer frustration over delayed analytics
◦ Focus on building innovative features – because the managed Spark platform eliminated time spent on DevOps and infrastructure issues
Got Spark skills? Here’s why you should
◦ Impact on salary
◦ Training issues and opportunities
How much do Spark skills pay?
2015 Data Science Salary Survey, by O’Reilly – annual salary increase associated with each skill:
◦ Spark skills: +$11,000
◦ Scala programming: +$4,000
◦ Basic exploratory analysis (>4 hr/wk): +$4,600
◦ D3.js skills: +$8,000
Getting training and experience in Spark
◦ $149.50 – sale until March 30 only, discount code ‘SPRING50’
Getting training and experience in Spark
Get hands-on training in the following areas:
◦ Using RDDs
◦ Writing applications using Scala
◦ Spark SQL
◦ Spark Streaming
◦ Machine learning in Spark (MLlib)
◦ Spark GraphX
◦ Spark project implementation
Download these slides
Why Data Science from Simplilearn
Key features:
◦ 40 hours of real-life industry project experience
◦ 25 hours of high-quality e-learning
◦ Visualize and optimize data effectively using the built-in tools in R, SAS, and Excel
◦ 48 hours of live instructor-led online sessions
◦ Get proficient in using R, SAS, and Excel to model data and predict solutions to business problems
◦ Master the concepts of statistical analysis like linear & logistic regression, cluster analysis & forecasting
Our journey so far
◦ Project Management
◦ Digital Marketing
◦ Big Data & Analytics
◦ Business Productivity Tools
◦ Quality Management
◦ Virtualization and Cloud Computing
◦ IT Security
◦ Financial Management
◦ CompTIA Certification
◦ IT Hardware and N/W
◦ ERP
◦ IT Services and Architecture
◦ Agile and Scrum Certification
◦ OS and Database
◦ Web and App Programming
Simplilearn: World’s Largest Certification Training Destination
One of the largest collections of accredited certification training in the world.