Thesis Proposal
Learning with Sparsity: Structures, Optimization and Applications
Xi Chen
Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani
Machine Learning Department, Carnegie Mellon University
Modern Data Analysis
Gene expression data for tumor classification. Characteristics: high-dimensional; very few samples; complex structure.
Climate data. Characteristics: dynamic complex structure.
Web-text data. Characteristics: both high-dimensional and massive in amount; structures over word features (e.g., synonyms).
Challenges: high dimensions; complex & dynamic structures.
Solutions: Sparse Learning
Sparse regression for feature selection & prediction: smooth convex loss + L1-regularization [Tibshirani 96]
Incorporating structural prior knowledge: structured penalties (e.g., group, hierarchical tree, graph) [Jenatton et al., 09; Peng et al., 09; Tibshirani et al., 05; Friedman et al., 10; Kim et al., 10]
Nonparametric sparse regression: flexible models, e.g., the additive model [Ravikumar et al., 09]
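As a concrete sketch of the first bullet, the snippet below fits an L1-regularized regression on synthetic high-dimensional data (the data, the choice of scikit-learn's Lasso, and the regularization level are all illustrative assumptions, not the exact setup used in the thesis):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic high-dimensional data: 50 samples, 200 features,
# only the first 5 features are truly relevant (illustrative setup).
rng = np.random.RandomState(0)
X = rng.randn(50, 200)
beta_true = np.zeros(200)
beta_true[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ beta_true + 0.1 * rng.randn(50)

# The L1 penalty drives most coefficients exactly to zero,
# performing feature selection and prediction jointly.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(len(selected))  # number of selected features: a small subset of 200
```

The key point is that the nonsmooth L1 penalty yields exact zeros, which a plain L2 penalty would not.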
Sparse Learning in Graphical Models
Undirected Graphical Model (Markov Random Fields)
Learn Sparse Structure of Graphical Models
Examples: gene graph; pairwise model for images
Graphical Lasso (gLasso) [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]
Iterated Lasso [Meinshausen and Bühlmann, 06]
Forest Density Estimator [Liu et al., 10]
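A minimal sketch of structure learning with graphical lasso, using scikit-learn's GraphicalLasso on data drawn from a chain-structured Gaussian (the chain precision matrix and the penalty level are illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic data from a sparse Gaussian: chain-structured precision matrix.
rng = np.random.RandomState(0)
p = 10
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=500)

# gLasso: maximize the Gaussian log-likelihood with an L1 penalty on the
# precision matrix, yielding a sparse estimated graph.
est = GraphicalLasso(alpha=0.1).fit(X)
est_prec = est.precision_
n_edges = np.sum(np.abs(est_prec[np.triu_indices(p, k=1)]) > 1e-8)
print(n_edges)  # far fewer than the p*(p-1)/2 = 45 possible edges
```

The zero pattern of the estimated precision matrix is read off as the estimated graph.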
Thesis Overview: High-dimensional Sparse Learning with Structures

1. Sparse single/multi-task regression with general structured penalties
   Challenge: computation
   Completed work: unified optimization framework: Smoothing Proximal Gradient [UAI 11, AOAS]
   Future work: (1) online learning for massive data; (2) incorporating structured penalties in other models (e.g., PCA, CCA)

2. Learning sparse structures for undirected graphical models
   Existing: static or time-varying graphs. Challenge: dynamic structures
   Completed work: conditional Gaussian graphical model: (1) kernel smoothing method for spatial-temporal graphs [AAAI 10]; (2) partition-based method [NIPS 10]
   Future work: relax the conditional Gaussian assumption: continuous & discrete data

3. Nonparametric sparse regression
   Existing: additive models. Challenges: (1) generalized models, (2) structures
   Completed work: (1) Generalized Forward Regression [NIPS 09]; (2) Penalized Tree Regression [NIPS 10]
   Future work: incorporating rich structures

Application areas: tumor classification using gene expression data [UAI 11, AOAS], climate data analysis [AAAI 10, NIPS 10], web-text mining [ICDM 10, SDM 10]
Roadmap
Smoothing Proximal Gradient for Structured Sparse Regression
Structure Learning in Graphical Models
Nonparametric Sparse Regression
Summary and Timeline
Q & A
Useful Structures and Structured Penalty
Group structure (group-wise selection) [Yuan 06; Peng et al., 09; Kim et al., 10; Bach et al., 09]
Application: pathway selection for gene-expression data in tumor classification
Hierarchical tree structure. Example: WordNet
Useful Structures and Structured Penalties
Graph structure (to enforce smoothness)
Piece-wise constant [Tibshirani et al., 05]
Graph smoothness [Kim et al., 10]
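To make the graph-guided penalty concrete, the sketch below evaluates a fusion penalty that sums |beta_i - beta_j| over the edges of a graph, so that connected features are encouraged to share values (the 4-node chain graph and coefficient vectors are toy assumptions):

```python
import numpy as np

def fused_penalty(beta, edges):
    """Graph-guided fusion penalty: sum of |beta_i - beta_j| over edges.
    Encourages coefficients of connected features to be similar."""
    return sum(abs(beta[i] - beta[j]) for i, j in edges)

# Toy example: a 4-node chain graph (assumed structure for illustration).
edges = [(0, 1), (1, 2), (2, 3)]
beta_smooth = np.array([1.0, 1.0, 1.0, 1.0])   # constant over the graph
beta_rough = np.array([1.0, -1.0, 1.0, -1.0])  # oscillates over the graph

print(fused_penalty(beta_smooth, edges))  # 0.0 -- no penalty
print(fused_penalty(beta_rough, edges))   # 6.0 -- heavily penalized
```

Minimizing a loss plus this penalty therefore pushes the solution toward piece-wise constant behavior along the graph.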
Challenge
Goal: a unified, efficient, and scalable optimization framework for solving all these structured penalties, for both single-task and multi-task regression.
Difficulty: the structured penalties are nonsmooth and nonseparable.
Existing Optimization
Interior-point method (IPM) for second-order cone programming (SOCP) / quadratic programming (QP)
  Poor scalability: solves a huge Newton linear system at each iteration
Subgradient descent (first-order method)
  Slow convergence
Block coordinate descent
  Optimizes one block of variables at a time while keeping the others fixed
  Cannot be applied to nonseparable penalties
Proximal gradient descent (first-order method) [Nesterov 07; Beck and Teboulle, 09]
  Proximal operator: prox(v) = argmin_x (1/2)||x - v||^2 + λΩ(x)
  Cannot be applied to complex structured penalties: no exact closed-form solution for the proximal operator
Augmented Lagrangian / alternating direction methods
  No convergence-rate result; global convergence guaranteed only for 2 blocks
  Solves a large-scale linear system at each iteration
Overview: Smoothing Proximal Gradient (SPG)
First-order method (only gradient information): fast and scalable; but there is no exact solution for the proximal operator.
Idea:
1) Reformulate the structured penalty via its dual norm
2) Introduce its smooth approximation [Nesterov 05]
3) Plug the smooth approximation back into the original problem and solve it by accelerated proximal gradient methods
Convergence result: an ε-accurate solution in O(1/ε) iterations.
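The smoothing step can be illustrated on the simplest nonsmooth penalty, |x| = max over |a| <= 1 of a*x (a sketch only; the thesis applies the same construction to structured penalties through their dual norms):

```python
import numpy as np

def smoothed_abs(x, mu):
    """Nesterov smoothing of |x| = max_{|a|<=1} a*x.
    Subtracting the strongly convex term (mu/2)*a^2 inside the max gives a
    smooth approximation with a closed form (the Huber function):
      x^2/(2*mu)  if |x| <= mu,   |x| - mu/2  otherwise.
    The approximation error is at most mu/2 everywhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= mu, x ** 2 / (2 * mu), np.abs(x) - mu / 2)

mu = 0.5
xs = np.array([-2.0, -0.25, 0.0, 0.25, 2.0])
print(smoothed_abs(xs, mu))               # close to |x|, differentiable at 0
print(np.abs(xs) - smoothed_abs(xs, mu))  # gap always lies in [0, mu/2]
```

Smaller mu means a tighter approximation but a larger Lipschitz constant of the gradient, which is exactly the trade-off behind the O(1/ε) rate.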
Why Is the Approximation Smooth?
Geometric interpretation: the penalty is the uppermost line (pointwise maximum) over a family of linear functions, which is nonsmooth; subtracting a strongly convex term inside the maximum makes the resulting uppermost envelope smooth.
Smoothing Proximal Gradient (SPG)
Original problem: convex smooth loss + nonsmooth penalty with a complex structure.
Approximated problem: smooth function + nonsmooth part with good separability.
Gradient of the approximation: available in closed form via Danskin's theorem.
[Nesterov 07; Beck and Teboulle, 09]
Proximal Operator: Soft-thresholding
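For the separable L1 part that remains after smoothing, the proximal operator has the classical closed-form soft-thresholding solution (a minimal sketch with toy input values):

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam*||.||_1:
    argmin_x 0.5*||x - v||^2 + lam*||x||_1, solved coordinate-wise in
    closed form by shrinking each entry toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, -0.5, 1.2, -2.0, 0.1])
print(soft_threshold(v, 1.0))  # entries with |v_i| <= 1 are set exactly to 0
```

This exact, cheap proximal step is what makes each iteration of the accelerated proximal gradient method inexpensive.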
Convergence Rate

Method      | Convergence rate | Per-iteration time / storage
SPG         | O(1/ε)           | cheap (gradient)
Subgradient | O(1/ε²)          | cheap (gradient)
IPM         | O(log(1/ε))      | expensive (Newton linear system / Hessian)
Multi-Task Extension
The same smoothing proximal gradient scheme carries over from single-task to multi-task regression.
Simulation Study
Multi-task graph-guided fused lasso
[Figure: SNP sequence (ACGTTTTACTGTACAATTTAC) as inputs; gene-expression data as multi-task outputs]
Biological Application
Breast cancer tumor classification: gene expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic; logistic regression loss).
Canonical pathways from MSigDB containing 637 groups of genes.
Training : test = 2 : 1. Important pathways identified: proteasome, nicotinate (ENPP1).
SPG for overlapping group lasso: the regularization path (20 parameters) is computed in 331 seconds.
Proposed Research
More applications for SPG
Web-scale learning: massive amounts of data
  Inputs arrive sequentially at a high rate
  Need to provide real-time service
Solution: stochastic optimization for online learning
  Complex structured penalty: smoothing technique
  Simple penalty with good separability: closed-form solution for the proximal operator
  E.g., low rank + sparse
Proposed Research
Stochastic Optimization
Structured sparsity beyond regression
  Canonical correlation analysis and its application in genome-wide association studies
Deterministic setting: the full objective over all samples is available; stochastic setting: only noisy per-sample gradients are available.
Existing methods: RDA [Xiao 10], accelerated stochastic gradient descent [Lan et al., 10]; both ruin the sparsity pattern of the iterates.
Goal: sparsity-preserving stochastic optimization for large-scale online learning
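The sparsity issue can be seen in a single update (a toy sketch: the gradient values are made up for illustration, and the soft-thresholding step stands in for sparsity-preserving schemes such as RDA; this is not the method proposed in the thesis):

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

beta = np.zeros(5)                              # current sparse iterate
g = np.array([0.02, -0.01, 0.005, 0.0, -0.3])   # a stochastic gradient (toy values)
step, lam = 1.0, 0.05

# Subgradient step: every coordinate with a nonzero gradient moves off zero,
# so the sparsity pattern of the iterate is immediately lost.
beta_sub = beta - step * g
# Proximal (soft-thresholding) step: coordinates whose update is smaller than
# step*lam are set exactly to zero, so the iterate stays sparse.
beta_prox = soft(beta - step * g, step * lam)

print(np.sum(beta_sub != 0))   # 4 nonzeros
print(np.sum(beta_prox != 0))  # 1 nonzero (only the large-gradient coordinate)
```

In an online run, the plain stochastic subgradient iterates are almost never exactly zero, which is why sparsity-preserving updates are needed.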
Roadmap
Smoothing Proximal Gradient for Structured Sparse Regression
Structure Learning in Graphical Models
Nonparametric Sparse Regression
Summary and Timeline
Q & A
Gaussian Graphical Model
Graphical Lasso (gLasso) [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]
Challenge: dynamic graph structure
Idea: Graph-Valued Regression
Multivariate regression estimates the conditional mean of Y given X; an undirected graphical model [Lauritzen 96] captures the conditional independence structure among the components of Y.
Graph-valued regression: estimate the graph over Y as a function of x, G(x) [Zhou et al., 08; Song et al., 09]
Input data: pairs (x_i, y_i), i = 1, ..., n
Applications for higher-dimensional X
X: patient symptom characterization; Y: gene expression levels
Kernel Smoothing Estimator
Under a conditional Gaussian assumption, weight the samples by a kernel centered at x and plug the locally weighted covariance into gLasso to obtain the estimated graph Ĝ(x).
Cons: (1) unstable when the dimension of x is high; (2) computationally heavy and difficult to analyze; (3) hard to visualize.
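A minimal sketch of the kernel smoothing estimator, with scikit-learn's graphical_lasso standing in for the per-point graph estimator (the Gaussian kernel, bandwidth, penalty level, and toy data are all illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def kernel_graph_estimate(X, Y, x0, h, alpha=0.1):
    """Kernel smoothing estimator of the graph at x0: weight the samples
    by a Gaussian kernel centered at x0 in the X-space, form the locally
    weighted covariance of Y, and run graphical lasso on it to obtain a
    sparse estimate of G(x0)."""
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * h ** 2))
    w = w / w.sum()
    mu = w @ Y                               # locally weighted mean
    Yc = Y - mu
    S = (Yc * w[:, None]).T @ Yc             # locally weighted covariance
    _, precision = graphical_lasso(S, alpha=alpha)
    return np.abs(precision) > 1e-8          # adjacency (plus diagonal)

# Toy setup (illustrative): x in R^1, graph over 5 Y-variables.
rng = np.random.RandomState(0)
X = rng.rand(300, 1)
Y = rng.randn(300, 5)
adj = kernel_graph_estimate(X, Y, x0=np.array([0.5]), h=0.2)
print(adj.shape)  # (5, 5)
```

Repeating this at every query point x0 makes the instability and the computational cost noted above apparent.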
Partition-Based Estimator: Graph-Optimized CART (Go-CART)
Based on CART (Classification and Regression Trees) [Breiman 84; Tibshirani et al., 09]
For graphical models, it is difficult to search for the best split point, which motivates restricting to dyadic splits.
Dyadic Partitioning Tree (DPT) [Scott and Nowak, 04]
Assumptions and notation: each split divides a cell at the midpoint along one coordinate, so the candidate partitions form a tree of dyadic rectangles.
Graph-Optimized CART (Go-CART)
Go-CART: penalized risk minimization estimator
Go-CART: held-out risk minimization estimator
  Split the data into a training set and a held-out set.
Practical algorithm: greedy learning using held-out data
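One greedy step of the held-out procedure can be sketched as follows (a hedged illustration: scikit-learn's GraphicalLasso stands in for the per-partition graph estimator, and the 1-D covariate, dyadic split at 1/2, penalty level, and toy data are assumptions):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def heldout_loglik(model, Y):
    """Average Gaussian log-likelihood of held-out samples under a fitted
    graphical lasso model."""
    p = Y.shape[1]
    prec = model.precision_
    Yc = Y - model.location_
    _, logdet = np.linalg.slogdet(prec)
    quad = np.einsum('ij,jk,ik->i', Yc, prec, Yc)
    return np.mean(0.5 * logdet - 0.5 * quad - 0.5 * p * np.log(2 * np.pi))

def fit(Ysub):
    return GraphicalLasso(alpha=0.05).fit(Ysub)

rng = np.random.RandomState(0)
n = 400
X = rng.rand(n, 1)                       # covariate in [0, 1]
left = X[:, 0] < 0.5                     # true graph changes at x = 0.5
Y = np.empty((n, 2))
Y[left] = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], int(left.sum()))
Y[~left] = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], int((~left).sum()))

# Split the data into a training half (fit graphs) and a held-out half
# (score candidate splits), as in the held-out risk estimator.
tr = rng.rand(n) < 0.5

# Greedy dyadic step: compare a single global graph against splitting the
# X-space at 1/2 and fitting a separate graph in each half; keep the split
# if the held-out log-likelihood improves.
root = heldout_loglik(fit(Y[tr]), Y[~tr])
ho_l, ho_r = (~tr) & left, (~tr) & ~left
split = (heldout_loglik(fit(Y[tr & left]), Y[ho_l]) * ho_l.sum()
         + heldout_loglik(fit(Y[tr & ~left]), Y[ho_r]) * ho_r.sum()) / (~tr).sum()
print(split > root)  # the split should win, since the graph changes with x
```

Recursing this accept/reject decision inside each accepted cell yields the greedy tree-growing procedure.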
Statistical Property
We do not assume that the underlying partition is dyadic.
Oracle risk and oracle inequality: bound the oracle excess risk.
Adding the assumption that the underlying partition is dyadic yields tree-partitioning consistency (a finer partition than the true one may be obtained).
Real Climate Data Analysis
Data description: 125 locations in the U.S.; 1990–2002 (13 years); monthly observations of 18 variables/factors.
Variables | Type
CO2, CH4, H2, CO | Greenhouse gases
Precipitation (PRE); Vapor (VAP); Cloud Cover (CLD); Wet Days (WET); Frost Days (FRS) | Weather
Avg. Temp. (TMP); Diurnal Temp. Range (DTR); Min. Temp. (TMN); Max. Temp. (TMX) | Temperature
Global Radiation (GLO); Direct Radiation (DIR); Extraterrestrial Global Radiation (ETR); Extraterrestrial Direct Normal Radiation (ETRN) | Solar radiation
Ultraviolet irradiance (UV) | Aerosol index
[Lozano et al., 09, IBM]
Real Climate Data Analysis
Observations: (1) With graphical lasso, no edge connects the greenhouse gases (CO2, CH4, CO, H2) with the solar radiation factors (GLO, DIR), which contradicts the IPCC report; with Go-CART, such edges are present. (2) Graphs along the coasts are sparser than those in the mainland.
Proposed Research
Limitations of Go-CART:
(1) Conditional Gaussian assumption
(2) Only handles continuous Y; for discrete Y, an approximate likelihood is needed
Forest graphical model:
  The density only involves univariate and bivariate marginals
  Compute the mutual information for each pair of variables
  Greedily learn the tree structure via the Chow-Liu algorithm
  Handles both continuous and discrete data
Forest-valued regression [Chow and Liu, 68; Tan et al., 09; Liu et al., 11]
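The Chow-Liu step above can be sketched as: estimate pairwise mutual information, then take a maximum-weight spanning tree (the equal-frequency discretization, bin count, and toy Markov chain below are illustrative assumptions; the forest density estimator itself uses nonparametric bivariate density estimates):

```python
import numpy as np
from itertools import combinations
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(X, n_bins=4):
    """Chow-Liu sketch: estimate pairwise mutual information from
    discretized data, then take the maximum-weight spanning tree
    (computed as a minimum spanning tree on negated weights)."""
    n, p = X.shape
    D = np.empty((n, p), dtype=int)
    for j in range(p):  # equal-frequency discretization (illustrative choice)
        cuts = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        D[:, j] = np.searchsorted(cuts, X[:, j])
    W = np.zeros((p, p))
    for i, j in combinations(range(p), 2):
        joint = np.histogram2d(D[:, i], D[:, j], bins=n_bins)[0] / n
        pi, pj = joint.sum(axis=1), joint.sum(axis=0)
        nz = joint > 0
        W[i, j] = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz]))
    tree = minimum_spanning_tree(-W).toarray()
    return sorted(zip(*np.nonzero(tree)))

# Toy Markov chain X0 - X1 - X2 - X3: each variable is a noisy copy of its
# predecessor, so the true tree is the chain itself.
rng = np.random.RandomState(0)
x = [rng.randn(2000)]
for _ in range(3):
    x.append(x[-1] + 0.5 * rng.randn(2000))
edges = chow_liu_tree(np.column_stack(x))
print(edges)  # expect the chain edges (0,1), (1,2), (2,3)
```

Because only univariate and bivariate quantities are needed, the same recipe extends to discrete data directly.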
Roadmap
Smoothing Proximal Gradient for Structured Sparse Regression
Structure Learning in Graphical Models
Nonparametric Sparse Regression
Summary and Timeline
Q & A
Nonparametric Regression
From parametric models to additive models [Hastie et al., 90], sparse additive models [Ravikumar et al., 09], and generalized nonparametric models that capture interactions between variables.
Bottleneck: computation
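For concreteness, here is a sketch of fitting an additive model by backfitting (the Nadaraya-Watson kernel smoother, bandwidth, and synthetic two-variable data are illustrative stand-ins for the smoothers used in practice):

```python
import numpy as np

def backfit_additive(X, y, n_iter=20, bandwidth=0.1):
    """Backfitting for an additive model y = sum_j f_j(x_j) + noise:
    cycle over components, smoothing the partial residual against each
    variable with a Nadaraya-Watson kernel smoother."""
    n, p = X.shape
    F = np.zeros((n, p))                      # current component fits
    for _ in range(n_iter):
        for j in range(p):
            r = y - y.mean() - F.sum(axis=1) + F[:, j]   # partial residual
            d = (X[:, j][:, None] - X[:, j][None, :]) ** 2
            K = np.exp(-d / (2 * bandwidth ** 2))
            F[:, j] = (K @ r) / K.sum(axis=1)            # kernel smooth
            F[:, j] -= F[:, j].mean()                    # identifiability
    return F

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.sin(2 * np.pi * X[:, 0]) + (2 * X[:, 1] - 1) ** 2 + 0.1 * rng.randn(200)
F = backfit_additive(X, y)
pred = y.mean() + F.sum(axis=1)
print(np.mean((pred - y) ** 2))  # residual MSE far below the variance of y
```

The inner smoothing loop over all components is exactly the computational bottleneck that motivates the greedy methods below.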
My Work and Proposed Research
Greedy learning methods:
  Additive Forward Regression (AFR): a generalization of Orthogonal Matching Pursuit [Tropp et al., 06] to the nonparametric setting
  Generalized Forward Regression (GFR)
Penalized regression tree method
Proposed research:
  Formulate functional forms for structured penalties
  Develop efficient algorithms for the corresponding nonparametric structured sparse regression
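The greedy forward idea can be sketched as follows (a hedged illustration, not the exact AFR procedure: a polynomial fit stands in for a nonparametric smoother, and the synthetic data are assumptions):

```python
import numpy as np

def smooth_fit(x, r, deg=5):
    """Univariate smoother (polynomial fit as a simple stand-in for a
    nonparametric smoother such as local linear regression)."""
    return np.polyval(np.polyfit(x, r, deg), x)

def additive_forward_regression(X, y, k):
    """Greedy sketch: at each step, smooth the current residual against
    every unused variable and add the variable whose component function
    reduces the residual most (an OMP-style rule in the nonparametric
    setting)."""
    n, p = X.shape
    residual = y - y.mean()
    active = []
    for _ in range(k):
        best_j, best_fit, best_rss = None, None, np.inf
        for j in range(p):
            if j in active:
                continue
            fj = smooth_fit(X[:, j], residual)
            rss = np.sum((residual - fj) ** 2)
            if rss < best_rss:
                best_j, best_fit, best_rss = j, fj, rss
        active.append(best_j)
        residual = residual - best_fit
    return active

# Toy additive model: y depends only on x0 and x1 among 8 candidates.
rng = np.random.RandomState(0)
X = rng.rand(500, 8)
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.randn(500)
active = additive_forward_regression(X, y, k=2)
print(sorted(active))  # expect [0, 1]: the truly relevant variables
```

Each step costs one univariate smooth per candidate variable, avoiding the full backfitting cycle.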
Roadmap
Smoothing Proximal Gradient for Structured Sparse Regression
Structure Learning in Graphical Models
Nonparametric Sparse Regression
Summary and Timeline
Q & A
Summary and Timeline
Acknowledgements
My committee members: Jaime Carbonell (advisor), Tom Mitchell, Larry Wasserman, Robert Tibshirani
Thanks also to: Eric P. Xing, John Lafferty, Seyoung Kim, Manuel Blum, Aarti Singh, Jeff Schneider, Javier Pena, Han Liu, Qihang Lin, Junming Yin, Liang Xiong, Tzu-Kuo Huang, Min Xu, Mladen Kolar, Yan Liu, Jingrui He, Yanjun Qi, Bing Bai
IBM Fellowship
Feedback: Xi Chen ([email protected])