![Page 1: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/1.jpg)
Recap
• Curse of dimension
– Data tends to fall into the "rind" of space
– Variance gets larger (uniform cube)
– Data points fall farther apart from each other (uniform cube)
– Statistics become unreliable, histograms are hard to build, etc. → use simple models
![Page 2: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/2.jpg)
Recap
• Multivariate Gaussian
– By translation and rotation, it turns into a product of independent normal distributions
– MLE of the mean: $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$
– MLE of the covariance: $\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$
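A minimal MATLAB sketch of these estimates on synthetic data (the 500-point, 2-D data set and its scaling are illustrative assumptions, not from the slides):

X = randn(500, 2) * [2 0; 0.5 1];              % synthetic 2-D Gaussian samples (assumed)
mu_hat    = mean(X, 1);                        % MLE of the mean: the sample average
Xc        = X - repmat(mu_hat, size(X,1), 1);  % center the data
Sigma_hat = (Xc' * Xc) / size(X,1);            % MLE of the covariance (note 1/N, not 1/(N-1))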
![Page 3: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/3.jpg)
Be cautious…
Data may not be in one blob; we need to separate the data into groups:
Clustering
![Page 4: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/4.jpg)
CS 498 Probability & Statistics
Clustering methods
Zicheng Liao
![Page 5: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/5.jpg)
What is clustering?
• “Grouping”
– A fundamental part of signal processing
• “Unsupervised classification”
• Assign the same label to data points that are close to each other
Why?
![Page 6: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/6.jpg)
We live in a universe full of clusters
![Page 7: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/7.jpg)
Two (types of) clustering methods
• Agglomerative/Divisive clustering
• K-means
![Page 8: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/8.jpg)
Agglomerative/Divisive clustering
Agglomerative clustering
(Bottom up)
Hierarchical cluster tree
Divisive clustering
(Top down)
![Page 9: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/9.jpg)
Algorithm
![Page 10: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/10.jpg)
Agglomerative clustering: an example
• "Merge clusters bottom up to form a hierarchical cluster tree"
Animation from Georg Berber, www.mit.edu/~georg/papers/lecture6.ppt
![Page 11: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/11.jpg)
Dendrogram
>> X = rand(6, 2); %create 6 points on a plane
>> Z = linkage(X); %Z encodes a tree of hierarchical clusters
>> dendrogram(Z); %visualize Z as a dendrogram
(Dendrogram: the vertical axis shows the merge distance, e.g. $d(1,2)$.)
![Page 12: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/12.jpg)
Distance measure
• Popular choices: Euclidean, Hamming, correlation, cosine, …
• A metric satisfies
– $d(x, y) \ge 0$, with $d(x, y) = 0$ iff $x = y$
– $d(x, y) = d(y, x)$ (symmetry)
– $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
• Critical to clustering performance
• No single answer; it depends on the data and the goal
• Use data whitening when we know little about the data (see the sketch below)
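A minimal whitening sketch in MATLAB (the data matrix and its scales are assumed for illustration): each column is shifted to zero mean and scaled to unit variance so that no feature dominates the Euclidean distance (full whitening would also decorrelate the features).

X  = [randn(200,1)*10, randn(200,1)*0.1];      % two features on very different scales (assumed)
mu = mean(X, 1);  sd = std(X, 0, 1);
Xw = (X - repmat(mu, size(X,1), 1)) ./ repmat(sd, size(X,1), 1);  % zero mean, unit variance per column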
![Page 13: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/13.jpg)
Inter-cluster distance
• Treat each data point as a single cluster
• Only need to define an inter-cluster distance
– The distance between one set of points and another set of points
• 3 popular inter-cluster distances
– Single-link
– Complete-link
– Averaged-link
![Page 14: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/14.jpg)
Single-link
• Minimum of all pairwise distances between points from the two clusters
• Tends to produce long, loose clusters
![Page 15: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/15.jpg)
Complete-link
• Maximum of all pairwise distances between points from the two clusters
• Tends to produce tight clusters
![Page 16: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/16.jpg)
Averaged-link
• Average of all pairwise distances between points from the two clusters (the three linkages are compared in the sketch below):
$$D(C_1, C_2) = \frac{1}{N} \sum_{p_i \in C_1,\; p_j \in C_2} d(p_i, p_j)$$
where $N$ is the number of point pairs.
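A small MATLAB sketch (Statistics Toolbox assumed) comparing the three inter-cluster distances on the same synthetic data; linkage accepts 'single', 'complete', and 'average' for exactly these choices.

X  = [randn(20,2); randn(20,2) + 5];           % two loose blobs (assumed data)
Zs = linkage(X, 'single');                     % single-link: minimum pairwise distance
Zc = linkage(X, 'complete');                   % complete-link: maximum pairwise distance
Za = linkage(X, 'average');                    % averaged-link: mean pairwise distance
subplot(1,3,1); dendrogram(Zs); title('single');
subplot(1,3,2); dendrogram(Zc); title('complete');
subplot(1,3,3); dendrogram(Za); title('average');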
![Page 17: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/17.jpg)
How many clusters are there?
• Intrinsically hard to know
• The dendrogram gives insight into it
• Choose a threshold to split the dendrogram into clusters (see the sketch below)
(Figure: a dendrogram cut by a horizontal threshold line; the vertical axis is distance.)
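As a sketch of the thresholding step (Statistics Toolbox assumed; the threshold 0.5 is illustrative), cluster can cut the tree returned by linkage at a chosen distance:

X = rand(30, 2);                                              % assumed random 2-D data
Z = linkage(X, 'average');
labels = cluster(Z, 'cutoff', 0.5, 'criterion', 'distance');  % cut every link above distance 0.5
% labels(i) is the flat cluster index of point i after the cut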
![Page 18: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/18.jpg)
An example
do_agglomerative.m
![Page 19: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/19.jpg)
Divisive clustering
• "Recursively split a cluster into smaller clusters"
• It is hard to choose where to split: a combinatorial problem
• Can be easier when the data has a special structure (e.g. a pixel grid)
![Page 20: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/20.jpg)
K-means
• Partition the data into clusters such that:
– Clusters are tight (the distance to the cluster center is small)
– Every data point is closer to its own cluster center than to all other cluster centers (a Voronoi diagram)
[figures excerpted from Wikipedia]
![Page 21: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/21.jpg)
Formulation
• Find K clusters that minimize:
$$\Phi(C, x) = \sum_{i=1}^{K} \sum_{x_j \in C_i} (x_j - \mu_i)^T (x_j - \mu_i)$$
where $\mu_i$ is the center of cluster $C_i$
• Two parameters to choose: the assignment of points to clusters and the cluster centers
• Finding the globally optimal solution is NP-hard
• An iterative procedure finds a local minimum
![Page 22: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/22.jpg)
K-means algorithm
1. Choose the cluster number K
2. Initialize the cluster centers, either:
   a. Randomly select K data points as cluster centers, or
   b. Randomly assign the data to clusters and compute the cluster centers
3. Iterate:
   a. Assign each point to the closest cluster center
   b. Update the cluster centers (take the mean of the data in each cluster)
4. Stop when the assignment doesn't change
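A minimal from-scratch MATLAB sketch of the loop above (the two-blob data, K = 2, and the random-point initialization are assumptions; empty clusters are not handled):

X = [randn(50,2); randn(50,2) + 4];            % assumed synthetic data: two blobs
K = 2;
centers = X(randperm(size(X,1), K), :);        % 2a. pick K random data points as centers
assign  = zeros(size(X,1), 1);
while true
    D = zeros(size(X,1), K);                   % squared distance to every center
    for k = 1:K
        d = X - repmat(centers(k,:), size(X,1), 1);
        D(:,k) = sum(d.^2, 2);
    end
    [~, newAssign] = min(D, [], 2);            % 3a. closest center for every point
    if isequal(newAssign, assign), break; end  % 4. stop when the assignment is unchanged
    assign = newAssign;
    for k = 1:K
        centers(k,:) = mean(X(assign == k, :), 1);  % 3b. new center = mean of its points
    end
end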
![Page 23: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/23.jpg)
Illustration
1. Randomly initialize 3 cluster centers (circles)
2. Assign each point to the closest cluster center
3. Update the cluster centers
4. Re-iterate steps 2 and 3
[figures excerpted from Wikipedia]
![Page 24: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/24.jpg)
Example
do_Kmeans.m (shows step-by-step updates and the effect of the cluster number)
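The course script is not reproduced here; below is a hedged sketch of what such a demo might show, using MATLAB's built-in kmeans and gscatter (Statistics Toolbox) on assumed synthetic data with three true groups:

X = [randn(100,2); randn(100,2) + 4; 0.5*randn(100,2) + repmat([4 -2], 100, 1)];
for K = 2:5
    idx = kmeans(X, K);                        % run K-means with K clusters
    subplot(2, 2, K-1);
    gscatter(X(:,1), X(:,2), idx);             % color points by cluster assignment
    title(sprintf('K = %d', K));
end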
![Page 25: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/25.jpg)
Discussion
• How to choose the cluster number K?
– No exact answer; guess from the data (with visualization)
– Or define a cluster quality measure and optimize it
K = 2? K = 3? K = 5?
![Page 26: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/26.jpg)
Discussion
• Convergence to a local minimum can produce counterintuitive clusterings
[figures excerpted from Wikipedia]
![Page 27: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/27.jpg)
Discussion
• Favors spherical clusters; poor results for long/loose/stretched clusters
(Figure: input data, with color indicating true labels, vs. K-means results)
![Page 28: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/28.jpg)
Discussion
• The cost is guaranteed to decrease at every step
– Assigning each point to the closest cluster center minimizes the cost for the current cluster centers
– Choosing the mean of each cluster as the new cluster center minimizes the squared distance for the current assignment
• Terminates after a finite number of iterations (each iteration takes polynomial time)
![Page 29: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/29.jpg)
Summary
• Clustering: grouping "similar" data together
• A world full of clusters/patterns
• Two algorithms
– Agglomerative/divisive clustering: hierarchical cluster tree
– K-means: vector quantization
![Page 30: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/30.jpg)
CS 498 Probability & Statistics
Regression
Zicheng Liao
![Page 31: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/31.jpg)
Example-I
• Predict the stock price at time t+1
(Figure: stock price plotted against time)
![Page 32: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/32.jpg)
Example-II
• Fill in missing pixels in an image: inpainting
![Page 33: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/33.jpg)
Example-III
• Discover relationships in data
(Figure: amount of hormone in devices from 3 production lots, plotted against time in service)
![Page 34: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/34.jpg)
Example-III
• Discover relationships in data
![Page 35: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/35.jpg)
Linear regression
• Input: $\{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$
• Linear model with Gaussian noise: $y = x^T\beta + \xi$, with $x^T = (x_1, x_2, \dots, x_N, 1)$
– $x$: explanatory variable (with a constant 1 appended)
– $y$: dependent variable
– $\beta$: parameter of the linear model
– $\xi$: zero-mean Gaussian random variable
![Page 36: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/36.jpg)
Parameter estimation
• MLE of the linear model with Gaussian noise:
maximize the likelihood function $\prod_{i=1}^{M} \exp\!\big(-\tfrac{(y_i - x_i^T\beta)^2}{2\sigma^2}\big)$ (up to a constant),
which is equivalent to
minimize $\sum_{i=1}^{M} (y_i - x_i^T\beta)^2$
[Least squares, Carl F. Gauss, 1809]
![Page 37: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/37.jpg)
Parameter estimation
• Closed-form solution
Cost function:
$$\Phi(\beta) = \sum_{i=1}^{M} (y_i - x_i^T\beta)^2 = (y - X\beta)^T (y - X\beta), \qquad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_M^T \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix}$$
Set the gradient to zero:
$$\frac{\partial \Phi(\beta)}{\partial \beta} = X^T X \beta - X^T y, \qquad X^T X \beta - X^T y = 0 \quad \text{(normal equation)}$$
$$\beta = (X^T X)^{-1} X^T y$$
(Expensive to compute the matrix inverse in high dimensions; see the sketch below.)
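A minimal MATLAB sketch of the closed-form fit on synthetic data (the coefficients and noise level are illustrative); the backslash operator solves the normal equation without forming the inverse explicitly:

t = (1:50)';                                   % assumed explanatory variable
X = [t, ones(50,1)];                           % design matrix with a constant column
y = -0.05*t + 34 + randn(50,1);                % assumed noisy linear data
beta = (X' * X) \ (X' * y);                    % solve X'X * beta = X'y  (normal equation)
% equivalently: beta = X \ y, the numerically preferred form in MATLAB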
![Page 38: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/38.jpg)
Gradient descent
• http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=02.5-LinearRegressionI-GradientDescentForLinearRegression&speed=100 (Andrew Ng)
$$\frac{\partial \Phi(\beta)}{\partial \beta} = X^T X \beta - X^T y$$
Init: $\beta^{(0)}$ (e.g. all zeros)
Repeat: $\beta^{(t+1)} = \beta^{(t)} - \eta \left( X^T X \beta^{(t)} - X^T y \right)$
Until convergence.
(The cost is convex, so with a suitable step size $\eta$ gradient descent converges to the global minimum; see the sketch below.)
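A minimal gradient-descent sketch in MATLAB (the step size eta, the synthetic data, and the stopping rule are all assumptions of this sketch; too large a step size diverges):

rng(0);                                        % reproducible synthetic data (assumed)
x = randn(100,1);  X = [x, ones(100,1)];
y = 3*x + 2 + 0.1*randn(100,1);
beta = zeros(2,1);                             % init
eta  = 1e-3;                                   % step size (assumed)
for iter = 1:10000
    grad = X'*X*beta - X'*y;                   % gradient of the cost (up to a constant factor)
    beta = beta - eta * grad;                  % descent step
    if norm(grad) < 1e-8, break; end           % stop when the gradient is (nearly) zero
end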
![Page 39: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/39.jpg)
Example
do_regression.m
![Page 40: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/40.jpg)
Interpreting a regression
$y = -0.0574\,t + 34.2$
![Page 41: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/41.jpg)
Interpreting a regression
• The residual has zero mean
• The residual has zero correlation with the explanatory variables
![Page 42: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/42.jpg)
Interpreting the residual
![Page 43: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/43.jpg)
Interpreting the residual
• The residual $e = y - X\beta$ has zero mean
• $e$ is orthogonal to every column of $X$
– $e$ is also de-correlated from every column of $X$
• $e$ is orthogonal to the regression vector $X\beta$
– $e$ is also de-correlated from the regression vector $X\beta$ (follows the same line of derivation)
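These properties are easy to check numerically; a small sketch on assumed synthetic data:

x = randn(100,1);  X = [x, ones(100,1)];       % design matrix with a constant column (assumed)
y = 3*x + 2 + 0.1*randn(100,1);
beta = X \ y;                                  % least-squares fit
e = y - X*beta;                                % residual
mean(e)                                        % ~0: zero mean (the constant column is in X)
X' * e                                         % ~[0; 0]: orthogonal to every column of X
(X*beta)' * e                                  % ~0: orthogonal to the regression vector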
![Page 44: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/44.jpg)
How good is a fit?
• Information ~ variance
• The total variance decomposes into regression variance and error variance:
$$\mathrm{var}[y] = \mathrm{var}[X\beta] + \mathrm{var}[e]$$
(because $X\beta$ and $e$ have zero covariance)
• How good is a fit: how much of the variance is explained by the regression
![Page 45: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/45.jpg)
How good is a fit?
• The R-squared measure
– The percentage of variance explained by the regression:
$$R^2 = \frac{\mathrm{var}[X\beta]}{\mathrm{var}[y]}$$
– Used in hypothesis tests for model selection
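A one-line computation of this measure on assumed synthetic data (var here is MATLAB's sample variance; the same normalization appears in numerator and denominator, so it matches the slide's var[·]):

x = randn(100,1);  X = [x, ones(100,1)];
y = 3*x + 2 + 0.5*randn(100,1);                % assumed noisy linear data
beta = X \ y;
R2 = var(X*beta) / var(y)                      % fraction of the variance explained by the fit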
![Page 46: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/46.jpg)
Regularized linear regression
• Cost: $\Phi(\beta) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$ (the second term penalizes large values in $\beta$)
• Closed-form solution:
$$\beta = (X^T X + \lambda I)^{-1} X^T y$$
• Gradient descent
Init: $\beta^{(0)}$
Repeat: $\beta^{(t+1)} = \beta^{(t)} - \eta \left( X^T X \beta^{(t)} - X^T y + \lambda \beta^{(t)} \right)$
Until convergence.
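A minimal ridge-regression sketch (lambda = 0.1 and the nearly collinear data are illustrative assumptions; following the slide's formula the constant column is penalized too, though in practice it is often left unpenalized):

x = randn(100,1);
X = [x, x + 1e-3*randn(100,1), ones(100,1)];   % two nearly collinear columns plus a constant
y = X(:,1) + X(:,2) + 1 + 0.1*randn(100,1);
lambda = 0.1;                                  % assumed regularization weight
beta_ls    = (X'*X) \ (X'*y);                  % plain least squares: large, unstable coefficients
beta_ridge = (X'*X + lambda*eye(size(X,2))) \ (X'*y);  % regularized: small, stable coefficients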
![Page 47: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/47.jpg)
Why regularization?
• Handle small eigenvalues
– Avoid dividing by small values by adding the regularizer:
$$\beta = (X^T X)^{-1} X^T y \quad \longrightarrow \quad \beta = (X^T X + \lambda I)^{-1} X^T y$$
![Page 48: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/48.jpg)
Why regularization?
• Avoid over-fitting
– Small parameters → a simpler model → less prone to over-fitting
(Figure: two fits to the same $x$–$y$ data. The over-fit curve is hard to generalize to new data; smaller parameters make a simpler, better model.)
![Page 49: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/49.jpg)
L1 regularization (Lasso)
• Cost: $\Phi(\beta) = (y - X\beta)^T (y - X\beta) + \lambda \lVert\beta\rVert_1$
– Some features may be irrelevant but still get a small non-zero coefficient in the least-squares solution $\beta$
– L1 regularization pushes small values of $\beta$ to exactly zero
– "Sparse representation"
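A hedged sketch using MATLAB's lasso function (Statistics and Machine Learning Toolbox; the synthetic data and the single lambda value are assumptions): the coefficients of the irrelevant features are driven to zero.

rng(0);
X = randn(100, 5);                             % 5 features; only the first two matter (assumed)
y = 2*X(:,1) - 3*X(:,2) + 0.1*randn(100,1);
B = lasso(X, y, 'Lambda', 0.1);                % L1-regularized fit at one lambda
B'                                             % the 3 irrelevant coefficients come out as zero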
![Page 50: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/50.jpg)
How does it work?
– When $\beta_i$ is small, the L1 penalty $\lambda\lvert\beta_i\rvert$ is much larger than the squared penalty $\lambda\beta_i^2$, so the coefficient is pushed all the way to zero
– Causes trouble in optimization (the gradient is discontinuous at zero)
![Page 51: Recap](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681649b550346895dd679ad/html5/thumbnails/51.jpg)
Summary
• Linear regression
– Linear model + Gaussian noise
– Parameter estimation by MLE → least squares
– Solve the least-squares problem by the normal equation
– Or by gradient descent for high dimensions
• How to interpret a regression model
– The $R^2$ measure
• Regularized linear regression
– Squared norm
– L1 norm: Lasso