advanced data science with apache spark-(reza zadeh, stanford)

Reza Zadeh Introduction to Distributed Optimization @Reza_Zadeh | http://reza-zadeh.com

Upload: spark-summit

Post on 14-Aug-2015

402 views

Category:

Data & Analytics

0 download

Report

Download

Tags:

Embed Size (px):

TRANSCRIPT

Reza Zadeh

Introduction to Distributed Optimization

@Reza_Zadeh | http://reza-zadeh.com

Page 2: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Optimization At least two large classes of optimization problems humans can solve:"

»  Convex »  Spectral

Page 3: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Optimization Example: Gradient Descent

Page 4: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Logistic Regression Already saw this with data scaling Need to optimize with broadcast

Page 5: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Model Broadcast: LR

Page 6: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Model Broadcast: LR

Use via .value

Call sc.broadcast

Page 7: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Separable Updates Can be generalized for »  Unconstrained optimization »  Smooth or non-smooth

»  LBFGS, Conjugate Gradient, Accelerated Gradient methods, …

Page 8: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Optimization Example: Spectral Program

Page 9: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Spark PageRank Given directed graph, compute node importance. Two RDDs: »  Neighbors (a sparse graph/matrix)

»  Current guess (a vector) "Using cache(), keep neighbor list in RAM

Page 10: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing

Neighbors (id, edges)

Ranks (id, rank)

join

partitionBy

join join …

Page 11: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Spark PageRank

Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra

Page 12: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Partitioning for PageRank Recall from first lecture that network bandwidth is ~100× as expensive as memory bandwidth

One way Spark avoids using it is through locality-aware scheduling for RAM and disk

Another important tool is controlling the partitioning of RDD contents across nodes

Page 13: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Spark PageRank Given directed graph, compute node importance. Two RDDs: »  Neighbors (a sparse graph/matrix)

»  Current guess (a vector)

Best representation for vector and matrix?

Page 14: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

PageRank

Page 15: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Execution

Page 16: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Solution

Page 17: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

New Execution

Page 18: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

How it works Each RDD has an optional Partitioner object" Any shuffle operation on an RDD with a Partitioner will respect that Partitioner"

Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set

Page 19: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Examples

Page 20: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Main Conclusion Controlled partitioning can avoid unnecessary all-to-all communication, saving computation

Repeated joins generalizes to repeated Matrix Multiplication, opening many algorithms from Numerical Linear Algebra

Page 21: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Performance

Page 22: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

RDD partitioner

Page 23: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Custom Partitioning

Bharath Ramsundar and Reza Bosagh Zadeh - BALKA-BOOK · TensorFlow for Deep Learning From Linear Regression to Reinforcement Learning ... В книге рассмотрены базовые

Arab Zadeh Sp Paper

Computer Vision Made Simple - matroid.com · Reza Zadeh Elva Stainthorp Pete Sonsini Ryan Wong Danny Jeck Julian Bouma ErfanNoury Michael Brown Michael Duignan Michael Suyat Kathie

Fuzzy Rules 1965 paper: “Fuzzy Sets” (Lotfi Zadeh) Apply natural language terms to a formal system of mathematical logic zadeh

Matrix Computations and Optimization in Apache Spark · Matrix Computations and Optimization in Apache Spark Reza Bosagh Zadeh Stanford and Matroid 475 Via Ortega Stanford, CA 94305

Aziza Mustafa Zadeh - Always - Sheet Music

Reza Bosagh Zadeh (Carnegie Mellon) Shai Ben-David (Waterloo) UAI 09, Montréal, June 2009 A UNIQUENESS THEOREM FOR CLUSTERING

2nd Week - Zadeh Operators & Implication Function

Reza Alirezaei , SharePoint MVP /MCTS Development Horizon, Corp. blogs.devhorizon/reza

Reza khairunnisa

Bellman Zadeh(1)

Fuzzy Sets - Zadeh - 1965

Aziza Moustafa Zadeh Exprompt

Yaghoub Zadeh Zohreh

Zadeh L.A. Fuzzy Sets 1965

PREDICTING MARKET VOLATILITY FROM FEDERAL ...rezab/papers/finlp_slides.pdfPREDICTING MARKET VOLATILITY FROM FEDERAL RESERVE BOARD MEETING MINUTES Reza Bosagh Zadeh and Andreas Zollmann

Meray Khawab Reza Reza by Maha Malik.urduinpage.com

Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

20120712 Cloud Beyond the Hype, Mina Zadeh

Abbas Zadeh

Generalized Low Rank Models - arXiv.org e-Print …Generalized Low Rank Models Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd May 6, 2015. (Original version posted September

Covariance Matrices & All-pairs Similaritystanford.edu/~rezab/dao/slides/lec10.pdf · Covariance Matrices and All-pairs similarity Reza Zadeh Introduction First Pass DIMSUM Analysis

rezayousefzadeh.comrezayousefzadeh.com/cv-reza-yousefzadeh.pdf · REZA YOUSEF ZADEH DEVELOPER / PROGRAMMER PERSONAL INFORMATION Age : 36 Birthdate : 1983/09/01 Gender: Male Citizenship

TRAINING AND QUALIFICATIONS FOR CLUSTER MANAGERS Reza Zadeh, European Foundation for Clusters and Competitiveness Rome, 7 November 2012

Factorbird: a Parameter Server Approach to Distributed Matrix Factorization Sebastian Schelter, Venu Satuluri, Reza Zadeh Distributed Machine Learning

IMF Working Paper...Authorized for distribution by Reza Vaez-Zadeh May 2002 Abstract The views expressed in this Working Paper are those of the author(s) and do not necessarily represent

FusionNet: 3D Object Classiﬁcation Using Multiple Data ... · Vishakh Hegde Stanford and Matroid [email protected] Reza Zadeh Stanford and Matroid [email protected] Abstract High-quality

Reza Revandy

Reza Rejaie CIS Dept. @ UO1 Prof. Reza Rejaie Computer & Information Science University of Oregon reza Fall 2002 Multimedia

Lotfi A. Zadeh: A Short, Sweet, and Personal Remembrance ...Enric Trillas, Lotfi Zadeh, Sergei Ovchinnikov, Elie Sanchez 2 My initial acquaintance with Professor Lotfi A. Zadeh, did

Catherine M. Zadeh Look Book 2012

On the Precision of Social and Information Networksrezab/papers/precision.pdfOn the Precision of Social and Information Networks Reza Bosagh Zadeh Stanford University [email protected]

Generalized Low Rank Models - sribd.cn · Generalized Low Rank Models Madeleine Udell Cornell ORIE Based on joint work with Stephen Boyd, Corinne Horn, Reza Zadeh, Nathan Kallus,

Aziza Mustafa Zadeh - 4 short pieces from Contraststviel.free.fr/Piano/Originaux/Aziza Mustafa Zadeh - 4 short pieces... · Title: Aziza Mustafa Zadeh - 4 short pieces from Contrasts

Conjuntos Borrosos(Fuzzy sets )zadeh

advanced data science with apache spark-(reza zadeh, stanford)

Data & Analytics

optimization example

rdd partitioner

distributed optimization

pagerank recall

partitioning of rdd

custom partitioning

model broadcast

directed graph