Accelerating Real-Time Analytics with Spark 10082015… · 1 ©2015 Talend Inc...

Post on 08-Oct-2020


TRANSCRIPT

1 ©2015 Talend Inc

Accelerating Real-Time Analytics with Spark, October 8, 2015

Housekeeping

Audio – Streamed via media player, turn volume up

Submit questions for Q&A via Group Chat widget

Download slides and event materials

Hashtag: #stratahadoop


Your  Speakers  Today    

Sean Owen, Director of Data Science, Cloudera, EMEA

Yann Delacourt, Director, Big Data Product Management, Talend


Agenda

• Apache Spark, its architecture and benefits
• Spark's architecture, deployment strategies and use cases
• Spark's impact on data science, analytics and machine learning
• How to move data scientists' work to IT production
• Best practices for large Spark deployments
• Mastering Spark's complexity

5  ©  Cloudera,  Inc.  All  rights  reserved.  

Accelerating Real-Time Analytics with Apache Spark

Sean Owen, Director of Data Science, Cloudera, EMEA


What  is  Apache  Spark?  

Spark is a general-purpose computational framework with more flexibility than MapReduce

• Leverages distributed memory
• Full directed graph expressions for data-parallel computations
• Improved developer experience
• Linear scalability, data locality
• Fault tolerance


The  Spark  Ecosystem  &  Hadoop  

[Diagram: layered stack, top to bottom]

• Spark libraries: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
• Processing engines: Spark, Impala, MR, Search, Others
• Resource management: YARN
• Storage: HDFS, HBase


Apache Spark: Flexible, in-memory data processing for Hadoop

Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible, Extensible API
• APIs for different types of workloads: batch, streaming, machine learning, graph

Fast Batch & Stream Processing
• In-memory processing and caching


Easy Development: Use Interactively

• Interactive exploration of data for data scientists
• No need to develop "applications"
• Developers can prototype applications on a live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...

scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>


Easy Development: Expressive API

•  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  sample

•  take

•  first

•  partitionBy

•  mapWith

•  pipe

•  save

•  …

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip
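To give a feel for what a few of the operations listed above compute, here is a minimal single-process sketch in plain Python. It is not Spark code: `reduce_by_key` is a hypothetical local stand-in for `reduceByKey`, and the word list is made up for illustration. In Spark the same chain would run partition by partition across the cluster.

```python
from functools import reduce


def reduce_by_key(pairs, combine):
    """Hypothetical local stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Made-up input, standing in for the lines of a distributed data set.
words = ["spark", "hadoop", "spark", "yarn", "hdfs", "spark", "yarn"]

pairs = map(lambda w: (w, 1), words)                    # map: word -> (word, 1)
long_pairs = filter(lambda kv: len(kv[0]) > 4, pairs)   # filter: keep words longer than 4 characters
counts = reduce_by_key(long_pairs, lambda a, b: a + b)  # reduceByKey: sum the 1s per word
total = reduce(lambda a, b: a + b, counts.values())     # reduce: collapse to a single number

print(counts, total)
```

The point of the sketch is the shape of the program: small functions chained through a handful of generic operators, which is what the list above is advertising.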


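Example: Logistic Regression

from math import exp
import numpy

# `spark` is the SparkContext; readPoint, D and iterations are defined elsewhere.
data = spark.textFile(...).map(readPoint).cache()  # parse once, keep in memory
w = numpy.random.rand(D)  # random initial weights
for i in range(iterations):
    # Sum the per-point gradients of the logistic loss, assuming labels y in {-1, +1}.
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)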
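The slide's loop needs a live SparkContext and an input file. The following is a rough single-process sketch of the same map-then-reduce gradient step using only the standard library, so it can be run anywhere. The four sample points, the `step` size and the helper names (`dot`, `point_gradient`, `add_vectors`, `train`) are assumptions made for illustration; they do not come from the talk.

```python
import math
import random
from functools import reduce


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def point_gradient(w, point):
    """Gradient of log(1 + exp(-y * w.x)) for one (x, y) point, with y in {-1, +1}."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add_vectors(u, v):
    return [a + b for a, b in zip(u, v)]


def train(points, dims, iterations=50, step=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]  # random initial weights
    for _ in range(iterations):
        # "map" each point to its gradient, then "reduce" by vector addition.
        gradient = reduce(add_vectors, map(lambda p: point_gradient(w, p), points))
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w


# Made-up, linearly separable sample: (features, label).
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w = train(points, dims=2)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(w, predictions)
```

Every iteration rereads the same `points`, which is why caching the parsed data set in memory pays off so much for this kind of algorithm.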


Spark  Takes  Advantage  of  Memory  

Resilient Distributed Datasets (RDD)

• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Can fall back to disk when the data set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provide fault tolerance through the concept of lineage
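The lineage idea in the bullets above can be sketched in a few lines of plain Python. This is a toy, single-process model and not Spark's implementation: the `LineageDataset` class, its `evict` method and the `recomputations` counter are hypothetical names invented here. Each data set remembers its parent and the transformation that produced it, so a lost cached copy can be rebuilt instead of restored from a replica.

```python
class LineageDataset:
    """Toy stand-in for an RDD: it records how to rebuild itself from its parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # stable storage (only set on the root data set)
        self.parent = parent        # lineage: the data set this one derives from
        self.transform = transform  # lineage: how to derive it
        self._keep = False          # whether cache() was requested
        self._data = None           # the cached rows, if any
        self.recomputations = 0

    def map(self, fn):
        return LineageDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._keep = True
        return self

    def evict(self):
        """Simulate losing the cached copy, for example when a node fails."""
        self._data = None

    def collect(self):
        if self._keep and self._data is not None:
            return self._data  # served from memory
        self.recomputations += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())  # replay the lineage
        if self._keep:
            self._data = rows
        return rows


base = LineageDataset(source=range(10))
even_squares = base.map(lambda n: n * n).filter(lambda n: n % 2 == 0).cache()

first = even_squares.collect()   # computed from the source, then cached
even_squares.evict()             # the cached copy is lost
second = even_squares.collect()  # rebuilt by replaying map and filter
print(first, second, even_squares.recomputations)
```

Nothing is copied ahead of time to survive the loss; the recipe alone is enough to get the same rows back.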

 


Fast Processing Using RAM, Operator Graphs

In-Memory Caching
• Data partitions read from RAM instead of disk

Operator Graphs
• Scheduling optimizations
• Fault tolerance

[Diagram: operator graph of RDDs A: through F: connected by map, join, filter, groupBy and take operations; the legend marks RDDs and cached partitions]


Data Science: Batteries Included

MLlib
• Existing, mature Spark ML subproject
• Covers the basics well
  • Decision trees, SVM, LR
  • ALS, SVD
  • K-means
  • … and more
• Stand-alone implementations
• Algorithms only

ML "Pipelines"
• Beta "MLlib 2.0"
• Emulates scikit-learn APIs
• Pipelines, not just algos
  • Feature engineering
  • Transformation
  • Ensembles
• Unified architecture
• Spark 1.4+
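As a rough illustration of "pipelines, not just algos", here is a small plain-Python sketch of the fit/transform pattern that scikit-learn popularized. It does not use the Spark ML API: `Lowercase`, `VocabularyIndexer` and this `Pipeline` class are hypothetical names written for this example, and the sample strings are made up.

```python
class Lowercase:
    """Transformer: stateless, so fit() has nothing to learn."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyIndexer:
    """Estimator: fit() learns a word -> index mapping, transform() applies it."""

    def fit(self, rows):
        words = sorted({w for r in rows for w in r.split()})
        self.index = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, rows):
        # Words never seen during fit() are dropped.
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]


class Pipeline:
    """Chains stages so feature engineering and modelling are fitted as one unit."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)  # each stage sees the previous stage's output
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([Lowercase(), VocabularyIndexer()]).fit(["Spark on YARN", "Spark Streaming"])
encoded = model.transform(["spark streaming on yarn", "unknown spark"])
print(encoded)
```

The fitted pipeline is a single object, so the same preprocessing is applied to new data as was applied during training.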


Faster Iterative ML Algorithms (Data Fits in Memory)

[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark]

• MapReduce: 110 s/iteration
• Spark: first iteration 80 s, further iterations 1 s due to caching
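The per-iteration figures quoted on this slide can be turned into totals with simple arithmetic. The sketch below assumes a purely linear model built from those quoted numbers (110 s per MapReduce iteration; 80 s for the first Spark iteration and 1 s for each later one); it is not a benchmark, and the function names are made up for illustration.

```python
def mapreduce_seconds(iterations, per_iteration=110):
    """Every iteration rereads its input from disk, so the cost is flat per iteration."""
    return per_iteration * iterations


def spark_seconds(iterations, first=80, later=1):
    """The first iteration loads and caches the data; later ones read from memory."""
    if iterations <= 0:
        return 0
    return first + later * (iterations - 1)


# Iteration counts taken from the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_seconds(n), spark_seconds(n))
```

Under these assumptions the gap widens with every iteration, because the fixed loading cost is paid once rather than on every pass.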


Cloudera Customer Use Cases

Core Spark
• Financial Services: portfolio risk analysis; ETL pipeline speed-up; 20+ years of stock data
• Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
• ERP: optical character recognition and bill classification
• Data Services: trend analysis; document classification (LDA); fraud analytics

Spark Streaming
• Financial Services: online fraud detection
• Health: incident prediction for sepsis
• Retail: online recommendation systems; real-time inventory management
• Ad Tech: real-time ad performance analysis


Uniting Spark and Hadoop: The One Platform Initiative

Investment Areas

• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.


Detailed Roadmap: One Platform Initiative

[Legend: completed work / planned future work]

Management
• Spark on YARN integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
• Container resizing
• Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters

Security
• Kerberos integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel's advanced encryption libraries
• Enable column- and view-level security

Scale
• Revamp Scheduler handling of node failure
• Sort-based shuffle improvements
• Task scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs

Streaming
• Zero data loss with Spark Streaming resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (business users)
• New streaming-specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates


The Bad News

Spark is a Developer Framework
• Spark means writing code
• And deploying it
• And monitoring it
• Workflow orchestration is hard
• Oozie? Luigi?
• Custom scripts

Data is Still Fickle
• Data quality is still hard
• Spark still can't automatically find and clean bad records
• Feature engineering = ETL
• Data integration is still hard
• Read / write the right formats
• "Publish" to BI tools


Accelerating Real-Time Analytics with Spark

Yann Delacourt, Director of Big Data Product Management, Talend


A Modern Data Platform for All Your Integration Needs

• Application Integration
• Cloud Integration
• Data Integration
• Big Data Integration
• Master Data Management

Integrate anything. Operate in real time. Act with insight.


1st Data Integration Platform on Apache Spark

[Diagram: a data fabric, with a Developer Studio and Web UI, connecting big data, on-premise apps, cloud apps, IoT sensors, customers and suppliers]

Introducing Talend Real-time Big Data: 1st Data Integration Platform on Spark

• Visually develop jobs that run 100% on Spark
• 5X faster using independent benchmarks
• 10X developer productivity gained over hand-coding Spark
• 100X faster with in-memory processing
• Over 100 new drag-n-drop Spark components
  • HDFS, RDBMS, NoSQL, cloud storage, transformation, messaging, in-memory analytics & machine learning recommendations, and much more
• In-memory data caching & "windowed" computations
• Click to enable Spark Streaming for real-time data processing
• Convert Talend MapReduce jobs to Spark with the click of a button, future-proofing your investment

Benefits: Make decisions faster. Tremendous developer productivity.
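For readers unfamiliar with the "windowed" computations mentioned on this slide, the idea can be sketched in plain Python as a running total over the most recent events. This is a single-process illustration only, not Talend or Spark Streaming code: the `SlidingWindowSum` class, the 10-second window and the sample events are assumptions, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        # Drop events that have slid out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, expired = self.events.popleft()
            self.total -= expired
        return self.total


window = SlidingWindowSum(window_seconds=10)
# Made-up (timestamp, value) events.
totals = [window.add(t, v) for t, v in [(0, 1), (5, 2), (12, 3), (14, 4), (30, 5)]]
print(totals)
```

Keeping only the events inside the window in memory is what makes this kind of computation cheap enough to update on every incoming record.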

Enabling Intelligent Data Pipelining

Lambda Architecture: Batch, Real-time, Query

• A single solution to address
  • Bulk/batch
  • Real-time
  • Streaming & IoT data
  • Machine learning
• Provides fast data access through NoSQL
• One tool for Hadoop, Spark, traditional ETL/ELT and NoSQL integration

Benefits: Developer productivity. Business agility.
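The Lambda architecture named on this slide answers a query by combining a pre-computed view built from all historical data with a real-time view built from recent events. A minimal plain-Python sketch of that merge follows. The page-view events and the function names (`build_batch_view`, `update_speed_view`, `query`) are hypothetical, and in-memory counters stand in for the NoSQL serving store.

```python
from collections import Counter


def build_batch_view(all_events):
    """Batch layer: recompute page counts from the full history."""
    return Counter(event["page"] for event in all_events)


def update_speed_view(speed_view, event):
    """Speed layer: fold one new event into the real-time view."""
    speed_view[event["page"]] += 1


def query(page, batch_view, speed_view):
    """Serving layer: combine the pre-computed and real-time views."""
    return batch_view[page] + speed_view[page]


# Made-up history already processed by the batch layer.
history = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
batch_view = build_batch_view(history)

# Made-up events that arrived after the last batch run.
speed_view = Counter()
for event in [{"page": "/home"}, {"page": "/checkout"}]:
    update_speed_view(speed_view, event)

print(query("/home", batch_view, speed_view))
```

When the next batch run absorbs the recent events, the real-time view can be reset, so the speed layer only ever has to cover the gap since the last batch.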

[Diagram: Lambda architecture. Sources: IoT, web logs, ERP, DBMS/EDW, legacy. Speed layer: incremental data, sliding window analytics, apply learning, real-time views. Batch layer: all data, learning on past data, pre-computed views. Serving layer: NoSQL, query.]


Easily  Convert  MapReduce  to  Spark!  

Your  Job  Now  5X  Faster  

MapReduce  (runs  on  disk)  

Spark  (runs  on  disk  and  in-­‐memory)  

One  Click  


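Spark/Talend Enabled Use Cases: Examples

Digital Economy
• Data Discovery (Interactive): Web Analytics
• Better Decisions (Batch): Click-Stream Analysis
• Real-Time Action (Streaming and Machine Learning): Real-Time Web Traffic Optimization (retargeting & recommendations)

Retail
• Data Discovery (Interactive): SCM Analytics
• Better Decisions (Batch): Find Purchase Correlation
• Real-Time Action (Streaming and Machine Learning): Real-Time Promotion & Coupon Optimization

Financial Services
• Data Discovery (Interactive): EDW
• Better Decisions (Batch): Fraud Detection Learning on Massive Data Volume
• Real-Time Action (Streaming and Machine Learning): Highly Scalable Trading, Risk Management & Real-Time Fraud Detection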


Talend Success

Challenge:
• Ever-increasing Big Data velocity
• Many last-minute cart abandonments
• Hard to optimize pricing

Why Talend:
• Is the central integration tool within their Business Intelligence (BI) organization
• Integrates clickstreams from the last 6 months

Value:
• Leftover merchandise reduced by 20%
• Can predict abandoned shopping carts in real time with 90% accuracy
• Optimize pricing and stock pricing


Industrial Internet

Challenge:
• Needed to migrate 800 ETL jobs to an "Industrial Internet"
• Improve service levels by providing data and analytics in the cloud

Solution:
• Integrate big data, small data, and transactional data with high quality
• Talend Big Data, Data Quality, Master Data Management

Value:
• Provide a collaborative, prescriptive, and predictive environment
• Improved customer satisfaction, improved productivity per turbine
• Predict failures & reduce inventory
• Arm sales with competitive intelligence


From Zero to Big Data in 10 Minutes

Download free: www.talend.com/download

•  Get up and running in minutes, not weeks, with a big data Sandbox and demos

•  Includes: Sentiment analysis, ETL Offload, Log file analysis, Recommendation engine

•  Start working with Talend, Hadoop & NoSQL today!



The conference for and by Data Scientists, from startup to enterprise wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more

When: Thursday, October 22, 2015

Where: Broadway Studios, San Francisco


Q&A  
