Advanced Data Science with Apache Spark (Reza Zadeh, Stanford)
![Page 1: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/1.jpg)
Reza Zadeh
Introduction to Distributed Optimization
@Reza_Zadeh | http://reza-zadeh.com
![Page 2: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/2.jpg)
Optimization
At least two large classes of optimization problems humans can solve:
» Convex
» Spectral
![Page 3: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/3.jpg)
Optimization Example: Gradient Descent
![Page 4: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/4.jpg)
Logistic Regression
Already saw this with data scaling
Need to optimize with broadcast
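A minimal sketch of the pattern this slide points at, in plain Python rather than Spark: gradient descent for logistic regression where each partition computes a partial gradient against the same weight vector, and the driver sums the partials. Passing `w` to every partition stands in for a broadcast variable (in Spark, created with sc.broadcast and read through .value). The names `partition_gradient` and `train`, the step size, the iteration count, and the toy data are all illustrative assumptions, not taken from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partition_gradient(points, w):
    """Sum the logistic-loss gradients of all points in one partition."""
    grad = [0.0] * len(w)
    for x, y in points:  # y is 0 or 1
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def train(partitions, dim, step=0.5, iterations=200):
    """Gradient descent; step and iterations are arbitrary illustrative values."""
    w = [0.0] * dim
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Every partition reads the same w: the role a broadcast variable plays.
        partials = [partition_gradient(p, w) for p in partitions]
        # The driver adds the partial gradients and takes one step.
        total = [sum(g[j] for g in partials) for j in range(dim)]
        w = [wj - step * gj / n for wj, gj in zip(w, total)]
    return w


# Toy data in two partitions: features are (bias, x), label is 1 when x > 0.
partitions = [
    [((1.0, -2.0), 0), ((1.0, -1.0), 0)],
    [((1.0, 1.0), 1), ((1.0, 2.0), 1)],
]
w = train(partitions, dim=2)
predictions = [sigmoid(w[0] + w[1] * x) > 0.5 for x in (-1.5, 1.5)]
```

Only the weight vector travels to the partitions on each iteration, and only one partial gradient per partition travels back; the training points themselves stay where they are.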
![Page 5: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/5.jpg)
Model Broadcast: LR
![Page 6: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/6.jpg)
Model Broadcast: LR
Call sc.broadcast
Use via .value
![Page 7: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/7.jpg)
Separable Updates
Can be generalized for
» Unconstrained optimization
» Smooth or non-smooth
» LBFGS, Conjugate Gradient, Accelerated Gradient methods, …
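One way to read "separable" here, written out as a sketch: the objective is a sum of per-example terms, so its gradient is a sum of per-partition partial gradients that can be computed independently and then added. The symbols below ($f_i$, the partitions $P_k$, the step size $\alpha$) are notation assumed for illustration and do not appear on the slide; for a non-smooth $f_i$, a subgradient would take the place of $\nabla f_i$.

```latex
f(w) = \sum_{i=1}^{n} f_i(w) = \sum_{k=1}^{K} \sum_{i \in P_k} f_i(w)
\qquad\Longrightarrow\qquad
\nabla f(w) = \sum_{k=1}^{K} g_k(w),
\quad g_k(w) = \sum_{i \in P_k} \nabla f_i(w)

w_{t+1} = w_t - \alpha \, \nabla f(w_t)
```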
![Page 8: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/8.jpg)
Optimization Example: Spectral Program
![Page 9: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/9.jpg)
Spark PageRank
Given directed graph, compute node importance. Two RDDs:
» Neighbors (a sparse graph/matrix)
» Current guess (a vector)
Using cache(), keep neighbor list in RAM
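The two datasets described above can be sketched in plain Python as two dictionaries keyed by node id: a fixed neighbor list and a rank vector that is rebuilt on every iteration. This is only an illustration of the data flow, not Spark code; the damping factor 0.85, the iteration count, and the three-node graph are assumptions added for the example, and the sketch assumes every node has at least one outgoing edge.

```python
def pagerank(neighbors, iterations=20, damping=0.85):
    """Iterative PageRank over a dict of node -> list of outgoing neighbors."""
    ranks = {node: 1.0 for node in neighbors}  # current guess (a vector)
    for _ in range(iterations):
        contribs = {node: 0.0 for node in neighbors}
        # Looking up ranks[node] for each neighbor list plays the role of
        # joining the neighbors dataset with the ranks dataset on node id.
        for node, outs in neighbors.items():
            share = ranks[node] / len(outs)
            for dest in outs:
                contribs[dest] += share
        ranks = {
            node: (1 - damping) + damping * total
            for node, total in contribs.items()
        }
    return ranks


# Neighbor lists never change between iterations, which is why they are
# worth keeping in memory; only the ranks are replaced each round.
neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(neighbors)
```

In this toy graph, node c receives contributions from both a and b, so it ends up with a higher rank than b.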
![Page 10: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/10.jpg)
Spark PageRank
Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing
[Diagram: Neighbors (id, edges) is partitioned once with partitionBy, then joined with Ranks (id, rank) on every iteration: join, join, join, …]
![Page 11: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/11.jpg)
Spark PageRank
Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra
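The connection to matrix multiplication can be made explicit with a short sketch. The notation is assumed for illustration and is not on the slide: $M$ is the link matrix of the graph, $r^{(t)}$ is the rank vector after $t$ iterations, and $d$ is a damping factor.

```latex
M_{ij} =
\begin{cases}
  1/\mathrm{outdeg}(j) & \text{if there is an edge } j \to i \\
  0 & \text{otherwise}
\end{cases}
\qquad
r^{(t+1)} = (1 - d)\,\mathbf{1} + d\, M\, r^{(t)}
```

Each iteration is one sparse matrix-vector product, so the join between the neighbor lists and the ranks is computing $M r^{(t)}$ one row group at a time.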
![Page 12: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/12.jpg)
Partitioning for PageRank
Recall from first lecture that network bandwidth is ~100× as expensive as memory bandwidth
One way Spark avoids using it is through locality-aware scheduling for RAM and disk
Another important tool is controlling the partitioning of RDD contents across nodes
![Page 13: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/13.jpg)
Spark PageRank
Given directed graph, compute node importance. Two RDDs:
» Neighbors (a sparse graph/matrix)
» Current guess (a vector)
Best representation for vector and matrix?
![Page 14: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/14.jpg)
PageRank
![Page 15: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/15.jpg)
Execution
![Page 16: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/16.jpg)
Solution
![Page 17: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/17.jpg)
New Execution
![Page 18: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/18.jpg)
How it works
Each RDD has an optional Partitioner object
Any shuffle operation on an RDD with a Partitioner will respect that Partitioner
Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set
![Page 19: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/19.jpg)
Examples
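As one possible illustration (not necessarily the example shown on this slide), the effect of a shared partitioner can be simulated in plain Python. When two keyed datasets are placed by the same hash function, matching keys already sit in the same partition, so a join needs no data movement. The helpers `hash_partition`, `records_to_move`, and `local_join` are hypothetical names invented for this sketch, and the four-node dataset is made up.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def records_to_move(parts, num_partitions):
    """Count records a shuffle to hash partitioning would have to relocate."""
    return sum(
        1
        for index, part in enumerate(parts)
        for key, _ in part
        if hash(key) % num_partitions != index
    )


def local_join(left_parts, right_parts):
    """Join two co-partitioned datasets partition by partition."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = dict(right)
        joined.extend(
            (key, (value, lookup[key])) for key, value in left if key in lookup
        )
    return joined


neighbors = [(1, [2, 3]), (2, [3]), (3, [1]), (4, [1])]
ranks = [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)]

# Without a partitioner, records sit wherever they were loaded.
loaded = [neighbors[:2], neighbors[2:]]
# Partitioned once up front, every record is already where a hash join expects it.
placed = hash_partition(neighbors, 2)
rank_parts = hash_partition(ranks, 2)
joined = local_join(placed, rank_parts)
```

Here half of the records in `loaded` would have to move before a hash join, while none of the records in `placed` would, which is the saving that repeats on every PageRank iteration.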
![Page 20: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/20.jpg)
Main Conclusion
Controlled partitioning can avoid unnecessary all-to-all communication, saving computation
Repeated joins generalize to repeated Matrix Multiplication, opening many algorithms from Numerical Linear Algebra
![Page 21: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/21.jpg)
Performance
![Page 22: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/22.jpg)
RDD partitioner
![Page 23: Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)](https://reader035.vdocuments.us/reader035/viewer/2022062515/55cd1812bb61eb367a8b45fc/html5/thumbnails/23.jpg)
Custom Partitioning
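The slide content is not in the transcript, so the following is only a guess at the kind of thing custom partitioning enables: a partition function that looks at part of the key, so that related keys land together. This hypothetical `domain_partition` function sends all URLs from the same host to the same partition, which would keep most links of a web graph within one partition. The function name, the example URLs, and the choice of crc32 as a stable hash are all assumptions.

```python
import zlib
from urllib.parse import urlparse


def domain_partition(url, num_partitions):
    """Map a URL to a partition using only its host name."""
    host = urlparse(url).netloc
    # crc32 gives a hash that is stable across runs; Python's built-in
    # hash() of a string can change from one interpreter run to the next.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
]
assignment = {url: domain_partition(url, 4) for url in urls}
```

In PySpark, a one-argument wrapper around a function like this could be supplied as the `partitionFunc` argument of `partitionBy`; that wiring is not shown here.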