unsupervised learning with apache spark
TRANSCRIPT
![Page 1: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/1.jpg)
![Page 2: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/2.jpg)
● Data scientist at Cloudera● Recently lead Apache Spark development at
Cloudera● Before that, committing on Apache Hadoop● Before that, studying combinatorial
optimization and distributed systems at Brown
![Page 3: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/3.jpg)
![Page 4: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/4.jpg)
![Page 5: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/5.jpg)
![Page 6: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/6.jpg)
![Page 7: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/7.jpg)
![Page 8: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/8.jpg)
![Page 9: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/9.jpg)
![Page 10: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/10.jpg)
![Page 11: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/11.jpg)
![Page 12: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/12.jpg)
![Page 13: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/13.jpg)
![Page 14: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/14.jpg)
● How many kinds of stuff are there?● Why is some stuff not like the others?
● How do I contextualize new stuff?● Is there a simpler way to represent this stuff?
![Page 15: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/15.jpg)
● Learn hidden structure of your data● Interpret new data as it relates to this
structure
![Page 16: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/16.jpg)
● Clustering○ Partition data into categories
● Dimensionality reduction○ Find a condensed representation of your
data
![Page 17: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/17.jpg)
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
![Page 18: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/18.jpg)
![Page 19: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/19.jpg)
![Page 20: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/20.jpg)
![Page 21: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/21.jpg)
bigfile.txt lines
val lines = sc.textFile (“bigfile.txt”)
numbers
Partition
Partition
Partition
Partition
Partition
Partition
HDFS
sum
Driver
val numbers = lines.map ((x) => x.toDouble) numbers.sum()
![Page 22: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/22.jpg)
![Page 23: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/23.jpg)
bigfile.txt lines
val lines = sc.textFile (“bigfile.txt”)
numbers
Partition
Partition
Partition
Partition
Partition
Partition
HDFS
sum
Driver
val numbers = lines.map ((x) => x.toInt) numbers.cache()
.sum()
![Page 24: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/24.jpg)
bigfile.txt lines numbers
Partition
Partition
Partition
sum
Driver
![Page 25: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/25.jpg)
![Page 26: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/26.jpg)
![Page 27: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/27.jpg)
Discrete Continuous
Supervised Classification● Logistic regression (and
regularized variants)● Linear SVM● Naive Bayes● Random Decision Forests
(soon)
Regression● Linear regression (and
regularized variants)
Unsupervised Clustering● K-means
Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
![Page 28: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/28.jpg)
Discrete Continuous
Supervised Classification● Logistic regression (and
regularized variants)● Linear SVM● Naive Bayes● Random Decision Forests
(soon)
Regression● Linear regression (and
regularized variants)
Unsupervised Clustering
● K-meansDimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
![Page 29: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/29.jpg)
![Page 30: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/30.jpg)
![Page 31: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/31.jpg)
● Anomalies as data points far away from any cluster
![Page 32: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/32.jpg)
![Page 33: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/33.jpg)
![Page 34: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/34.jpg)
![Page 35: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/35.jpg)
![Page 36: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/36.jpg)
![Page 37: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/37.jpg)
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map( _.split(' ').map(_.toDouble))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters,
numIterations)
![Page 38: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/38.jpg)
● Alternate between two steps:○ Assign each point to a cluster based on
existing centers○ Recompute cluster centers from the
points in each cluster
![Page 39: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/39.jpg)
![Page 40: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/40.jpg)
![Page 41: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/41.jpg)
![Page 42: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/42.jpg)
![Page 43: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/43.jpg)
![Page 44: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/44.jpg)
● Alternate between two steps:○ Assign each point to a cluster based on
existing centers■ Process each data point independently
○ Recompute cluster centers from the points in each cluster■ Average across partitions
![Page 45: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/45.jpg)
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
val k = centers.length
val dims = centers(0).vector.length
val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
val counts = Array.fill(k)(0L)
points.foreach { point =>
val (bestCenter, cost) = KMeans.findClosest(centers, point)
costAccum += cost
sums(bestCenter) += point.vector
counts(bestCenter) += 1
}
val contribs = for (j <- 0 until k) yield {
(j, (sums(j), counts(j)))
}
contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
![Page 46: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/46.jpg)
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
val (sum, count) = totalContribs(j)
if (count != 0) {
sum /= count.toDouble
val newCenter = new BreezeVectorWithNorm(sum)
if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
changed = true
}
centers(j) = newCenter
}
j += 1
}
if (!changed) {
logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
![Page 47: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/47.jpg)
![Page 48: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/48.jpg)
● K-Means is very sensitive to initial set of center points chosen.
● Best existing algorithm for choosing centers is highly sequential.
![Page 49: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/49.jpg)
![Page 50: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/50.jpg)
● Start with random point from dataset● Pick another one randomly, with probability
proportional to distance from the closest already chosen
● Repeat until initial centers chosen
![Page 51: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/51.jpg)
● Initial cluster has expected bound of O(log k) of optimum cost
![Page 52: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/52.jpg)
● Requires k passes over the data
![Page 53: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/53.jpg)
● Do only a few (~5) passes● Sample m points on each pass● Oversample● Run K-Means++ on sampled points to find
initial centers
![Page 54: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/54.jpg)
![Page 55: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/55.jpg)
![Page 56: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/56.jpg)
![Page 57: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/57.jpg)
![Page 58: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/58.jpg)
![Page 59: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/59.jpg)
![Page 60: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/60.jpg)
![Page 61: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/61.jpg)
![Page 62: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/62.jpg)
![Page 63: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/63.jpg)
Discrete Continuous
Supervised Classification● Logistic regression (and
regularized variants)● Linear SVM● Naive Bayes● Random Decision Forests
(soon)
Regression● Linear regression (and
regularized variants)
Unsupervised Clustering● K-means
Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
![Page 64: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/64.jpg)
● Select a basis for your data that○ Is orthonormal
○ Maximizes variance along its axes
![Page 65: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/65.jpg)
![Page 66: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/66.jpg)
● Find dominant trends
![Page 67: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/67.jpg)
● Find a lower-dimensional representation that lets you visualize the data
● Feature learning - find a representation that’s good for clustering or classification
● Latent Semantic Analysis
![Page 68: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/68.jpg)
val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// compute the top 5 principal components
val principalComponents =
mat.computePrincipalComponents(5)
// project data into subspace
val transformed = data.map(_.toBreeze *
mat.toBreeze)
![Page 69: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/69.jpg)
● Center data● Find covariance matrix● Its eigenvectors are the principal
components
![Page 70: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/70.jpg)
Datam
n
Covariance Matrix
n
n
![Page 71: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/71.jpg)
Data
m
n
Data
Data
Data
Data
Data
![Page 72: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/72.jpg)
Data
m
n
Data
Data
Data
Data
Data
n
n
n
n
...
![Page 73: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/73.jpg)
Data
m
n
Data
Data
Data
Data
Data
n
n
n
n
... ...
![Page 74: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/74.jpg)
n
n
![Page 75: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/75.jpg)
n
n
![Page 76: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/76.jpg)
n
n
![Page 77: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/77.jpg)
def computeGramianMatrix (): Matrix = {
val n = numCols().toInt
val nt: Int = n * (n + 1) / 2
// Compute the upper triangular part of the gram matrix.
val GU = rows.aggregate( new BDV[Double](new Array[Double](nt)))(
seqOp = (U, v) => {
RowMatrix.dspr( 1.0, v, U.data)
U
},
combOp = (U1, U2) => U1 += U2
)
RowMatrix.triuToFull(n, GU.data)
}
![Page 78: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/78.jpg)
n
n
![Page 79: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/79.jpg)
● n^2 must fit in memory
![Page 80: Unsupervised Learning with Apache Spark](https://reader034.vdocuments.us/reader034/viewer/2022042611/58f9ade0760da3da068b9cd1/html5/thumbnails/80.jpg)
● n^2 must fit in memory● Not yet implemented: EM algorithm can do it
with O(kn), where k is the number of principal components