Post on 27-May-2020
© Cloudera, Inc. All rights reserved.
Analytics with Spark
Chen, Jianzhong
Cloudera
Agenda
• What Cloudera does for the Spark ecosystem
• Advanced Analytics with Spark
Spark Engineering at Cloudera
• Cloudera embraced Spark in early 2014
• Joint engineering with Intel to broaden the Spark ecosystem
• Hive-on-Spark
• Pig-on-Spark
• Spark-over-YARN
• Spark Streaming Reliability
• General Spark Optimization
Hive on Spark
• Technology
• Hive: “standard” SQL tool in Hadoop
• Spark: next-gen distributed processing framework
• Hive + Spark
• Performance
• Minimal feature gap
• Industry
• Many customers are heavily invested in Hive
• They want to leverage the Spark engine
Design Principles
• No or limited impact on Hive’s existing code path
• Maximize code reuse
• Minimal feature customization
• Low future maintenance cost
Class Hierarchy
• TaskCompiler: MapRedCompiler, TezCompiler, SparkCompiler
• Task: MapRedTask, TezTask, SparkTask
• Work: MapRedWork, TezWork, SparkWork
• A TaskCompiler generates Tasks; each Task is described by a Work
Work – Metadata for Task
• MapReduceWork contains one MapWork and an optional ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks
Query: select name, sum(value) as v from dec group by name order by v;
• MapReduce: MR Job 1 (MapWork1 → ReduceWork1), then MR Job 2 (MapWork2 → ReduceWork2)
• Spark: one Spark Job (MapWork1 → ReduceWork1 → ReduceWork2)
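To make the contrast concrete, here is a minimal Python sketch of work metadata held as a graph, using the query above. The names `Work`, `WorkGraph`, `connect`, and `topological_order` are hypothetical illustrations and are not Hive's actual classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Work:
    """Hypothetical stand-in for a MapWork or ReduceWork."""
    name: str
    kind: str  # "map" or "reduce"


@dataclass
class WorkGraph:
    """Hypothetical stand-in for SparkWork: a DAG of map and reduce works."""
    edges: dict = field(default_factory=dict)  # parent Work -> list of children

    def connect(self, parent, child):
        self.edges.setdefault(parent, []).append(child)
        self.edges.setdefault(child, [])

    def topological_order(self):
        # Kahn's algorithm: repeatedly emit works with no unprocessed parents.
        indegree = {w: 0 for w in self.edges}
        for children in self.edges.values():
            for c in children:
                indegree[c] += 1
        ready = [w for w, d in indegree.items() if d == 0]
        order = []
        while ready:
            w = ready.pop(0)
            order.append(w)
            for c in self.edges[w]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
        return order


# group by + order by as ONE graph: map -> reduce (group by) -> reduce (order by)
m1 = Work("MapWork1", "map")
r1 = Work("ReduceWork1", "reduce")
r2 = Work("ReduceWork2", "reduce")
spark_work = WorkGraph()
spark_work.connect(m1, r1)
spark_work.connect(r1, r2)
plan = [w.name for w in spark_work.topological_order()]
```

Because the second reduce stage hangs directly off the first, no intermediate job boundary is needed, which is the point of the single Spark job in the comparison.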
Spark Client and Spark Context
• Spark Client
• Talks to the Spark cluster
• Supports local, local-cluster, standalone, yarn-cluster, and yarn-client modes
• Job submission, monitoring, error reporting, statistics, metrics, counters
• Spark Context
• Core of Spark client
• Heavyweight, not thread-safe
• Designed for a single-user application
• Doesn’t work in a multi-session environment
Remote Spark Context
• Created and lives outside HiveServer2
• In yarn-cluster mode, Spark context lives in application master (AM)
• Otherwise, Spark context lives in a separate process (other than HS2)
[Diagram: User 1 and User 2 each connect to their own session (Session 1, Session 2) in HiveServer2; each session has its own AM (RSC) in the YARN cluster, which spans Node 1, Node 2, and Node 3.]
Data Processing via Spark
• Treat a table as a HadoopRDD (input RDD)
• Apply the function that wraps MR’s map-side processing
• Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)
• Apply the function that wraps MR’s reduce-side processing
Spark Plan
• MapInput – encapsulates a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing
Query: Select name, sum(value) as v from dec group by name order by v;
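The plan for this query can be traced with a small pure-Python emulation of the four stages. This is only a sketch: the sample rows and the helper names `map_tran` and `shuffle_by_key` are hypothetical, and plain Python collections stand in for Spark RDDs.

```python
from collections import defaultdict

# Hypothetical sample rows for table `dec`: (name, value).
rows = [("b", 5), ("a", 1), ("b", 2), ("a", 3), ("c", 9)]


def map_tran(row):
    """Map-side processing: emit (group key, value)."""
    name, value = row
    return (name, value)


def shuffle_by_key(pairs):
    """Stand-in for a groupByKey-style shuffle: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


# MapInput -> MapTran
mapped = [map_tran(r) for r in rows]
# ShuffleTran (group by name) -> ReduceTran (sum per name)
sums = [(name, sum(values)) for name, values in shuffle_by_key(mapped)]
# ShuffleTran (sort by v, like sortByKey on v) -> ReduceTran (emit rows in order)
result = sorted(sums, key=lambda kv: kv[1])
```

Here `result` holds one `(name, v)` pair per name in ascending order of `v`, matching `group by name order by v`.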
Current Status
• All functionality in Hive is implemented
• First round of optimization is completed
• Map join, SMB
• Split generation and grouping
• CBO, vectorization
• More optimization and benchmarking coming
• Beta in CDH
• http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/
• http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
Advanced Analytics with Spark
• Written by the Cloudera data science team
• First book to bridge ML with the Hadoop ecosystem
• Focuses on use cases and examples rather than being a manual
• Targeted at data scientists solving real-world analysis problems
• Generally available in May 2015
Analyzing Big Data
• Build a model to detect credit card fraud using thousands of features and billions of transactions
• Intelligently recommend millions of products to millions of users
• Estimate financial risk through simulations of portfolios including millions of instruments
• Easily manipulate data from thousands of human genomes to detect genetic associations with disease
Challenges of Data Science
• Data preprocessing
• Fast-moving data of many kinds from multiple sources requires a powerful data pipeline
• Iteration
• Fundamental part of data science
• Faster loading of data from disk helps considerably
• From lab to production
• Make data useful to non-data scientists
• Models become part of the production service and may need to be rebuilt periodically or even in real time.
Value at Risk
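• VaR (Value at Risk)
• The maximum potential loss that a position, portfolio, or institution may incur from changes in market risk factors such as interest rates and exchange rates, over a given holding period and at a given confidence level.
• For example, with a holding period of 1 day and a confidence level of 99%, a computed VaR of RMB 100,000 means there is a 99% probability that the bank's portfolio will not lose more than RMB 100,000 in one day.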
• Builds on portfolio theory introduced by Harry Markowitz in 1952 (Nobel Prize in Economics, 1990)
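The definition above can be written compactly. The notation is an assumption and does not appear in the slides: $L$ is the portfolio loss over the holding period and $\alpha$ is the confidence level.

```latex
\mathrm{VaR}_{\alpha}(L) = \inf \{\, \ell \in \mathbb{R} : P(L > \ell) \le 1 - \alpha \,\}
```

With $\alpha = 0.99$ and a 1-day holding period, a VaR of RMB 100,000 says that $P(L > 100{,}000) \le 0.01$.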
Illustration for VaR
Methods for Calculating VaR
• Variance-Covariance
• Historical Simulation
• Monte Carlo Simulation
Estimating through Monte Carlo Simulation
[Diagram: pipeline with stages Normalize Input, Modeling, Sampling, and Monte Carlo Simulation; inputs are instruments and market factors (Bonds, Oil, SNP).]
Monte Carlo Simulation with Spark
• Normalize data
• Fill in missing values
• Transform the historical prices into two-week returns
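A minimal sketch of this normalization step on a single price series follows. The fill strategy (carry the last seen price forward, and use the first known price for a leading gap) and the 10-trading-day window are assumptions, and the helper names and sample prices are hypothetical.

```python
def fill_missing(prices):
    """Replace None gaps with the last seen price.

    A leading gap is filled with the first known price (assumed strategy).
    """
    filled = list(prices)
    last = next(p for p in filled if p is not None)
    for i, p in enumerate(filled):
        if p is None:
            filled[i] = last
        else:
            last = p
    return filled


def two_week_returns(prices, window=10):
    """Fractional return over a sliding window of `window` trading days."""
    return [(prices[i + window] - prices[i]) / prices[i]
            for i in range(len(prices) - window)]


# Hypothetical daily closing prices with two missing days.
prices = [100.0, None, 102.0, 101.0, None, 103.0,
          104.0, 105.0, 104.0, 106.0, 110.0, 108.0]
clean = fill_missing(prices)
returns = two_week_returns(clean)
```

Each entry of `returns` compares a price with the price 10 trading days earlier, so a series of 12 prices yields two overlapping two-week returns.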
Modeling
• Define factor features
• Use a regression model to compute factor weights
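As an illustration of computing factor weights by regression, the sketch below fits ordinary least squares for one instrument by solving the normal equations. The choice of plain linear features, the two factors, and the synthetic noise-free data are assumptions; `solve` and `fit_factor_weights` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_factor_weights(factor_rows, instrument_returns):
    """Least squares fit; returns [intercept, w1, w2, ...]."""
    xs = [[1.0] + list(row) for row in factor_rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, instrument_returns))
           for i in range(k)]
    return solve(xtx, xty)


# Hypothetical two-week returns of two market factors.
factors = [(0.01, 0.02), (-0.02, 0.01), (0.03, -0.01),
           (0.00, 0.03), (-0.01, -0.02)]
# Synthetic instrument returns built as 0.001 + 0.5*f1 - 0.2*f2 (no noise).
rets = [0.001 + 0.5 * f1 - 0.2 * f2 for f1, f2 in factors]
weights = fit_factor_weights(factors, rets)
```

Since the synthetic returns are an exact linear function of the factors, the fit recovers the generating intercept and weights up to rounding. Richer factor features would simply add columns to each row.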
Sampling
• Take the correlation information between the factors into account
• If S&P is down, the Dow is likely to be down as well
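One common way to respect correlation when sampling is to draw from a multivariate normal distribution using a Cholesky factor of the covariance matrix. The sketch below does this for two factors; the normal assumption, the variances, and the 0.9 correlation are all illustrative assumptions, and the helper names are hypothetical.

```python
import math
import random


def cholesky_2x2(var_a, var_b, cov_ab):
    """Lower-triangular factor (l11, l21, l22) of a 2x2 covariance matrix."""
    l11 = math.sqrt(var_a)
    l21 = cov_ab / l11
    l22 = math.sqrt(var_b - l21 * l21)
    return l11, l21, l22


def sample_factor_returns(rng, means, chol):
    """Draw one correlated pair of factor returns."""
    l11, l21, l22 = chol
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # independent normals
    return (means[0] + l11 * z1, means[1] + l21 * z1 + l22 * z2)


rng = random.Random(42)
# Standard deviation 0.02 for both factors, correlation 0.9 (assumed values).
chol = cholesky_2x2(var_a=0.0004, var_b=0.0004, cov_ab=0.00036)
samples = [sample_factor_returns(rng, (0.0, 0.0), chol) for _ in range(5000)]
# Fraction of trials in which both factors move in the same direction.
same_sign = sum(1 for a, b in samples if a * b > 0) / len(samples)
```

With a strong positive correlation, the two sampled factors fall or rise together in most trials, which is the behaviour the S&P and Dow example describes.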
Simulation
• Broadcast instruments to each node
• Parallelize trial computation across workers
• Compute the trial returns
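The pattern can be sketched locally without a cluster: give each partition its own seed, run its share of trials independently, and concatenate the results. Everything here is an assumption for illustration: the instrument weights, the independent normal factors (a simplification that ignores factor correlation), and the helper name `trial_returns`.

```python
import random

# Hypothetical instruments: (intercept, weight on factor 1, weight on factor 2).
# In Spark this small read-only list is what would be broadcast to each node.
INSTRUMENTS = [(0.001, 0.5, -0.2), (0.0, 1.1, 0.3), (0.002, -0.4, 0.8)]


def trial_returns(seed, num_trials, instruments):
    """Run `num_trials` trials with a private RNG; return one total per trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        # Simplification: factors sampled independently for brevity.
        f1, f2 = rng.gauss(0.0, 0.02), rng.gauss(0.0, 0.02)
        results.append(sum(b + w1 * f1 + w2 * f2 for b, w1, w2 in instruments))
    return results


# One seed per partition; a local loop stands in for parallel workers.
seeds = range(8)
trials_per_partition = 250
all_trials = [r for seed in seeds
              for r in trial_returns(seed, trials_per_partition, INSTRUMENTS)]
```

In PySpark the same shape would plausibly use `sc.broadcast` for the instruments and `sc.parallelize(seeds, len(seeds)).flatMap(...)` for the trials; distinct seeds keep the partitions from repeating each other's random draws.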
Evaluate the risk value
• VaR
• Return the cutoff value
• CVaR (Conditional Value at Risk)
• The average loss in the tail beyond the cutoff
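Both measures can be read off the sorted trial returns. The sketch below assumes that the worst `tail` fraction of trials defines the cutoff; the function names and the sample returns are hypothetical.

```python
def value_at_risk(trial_returns, tail=0.05):
    """Return at the edge of the worst `tail` fraction of trials (the cutoff)."""
    ordered = sorted(trial_returns)
    cutoff_index = max(int(len(ordered) * tail) - 1, 0)
    return ordered[cutoff_index]


def conditional_value_at_risk(trial_returns, tail=0.05):
    """Average return over the worst `tail` fraction of trials."""
    ordered = sorted(trial_returns)
    n = max(int(len(ordered) * tail), 1)
    return sum(ordered[:n]) / n


# Hypothetical returns from 20 simulated trials.
trials = [-0.10, -0.08, -0.05, -0.02, 0.00, 0.01, 0.02, 0.03, 0.04, 0.05,
          0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15]
var_10 = value_at_risk(trials, tail=0.10)
cvar_10 = conditional_value_at_risk(trials, tail=0.10)
```

With 20 trials and a 10% tail, the two worst trials form the tail: VaR is the better of the two (the cutoff) and CVaR is their average, so CVaR is never better than VaR.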
Q&A
Thank you
jianzhong.chen@cloudera.com