
Multi-task Additive Models for Robust Estimation and Automatic Structure Discovery

Yingjie Wang1, Hong Chen2*, Feng Zheng3, Chen Xu4, Tieliang Gong4,5, Yanhong Chen6

1 College of Informatics, Huazhong Agricultural University, China
2 College of Science, Huazhong Agricultural University, China
3 Department of Computer Science and Engineering, Southern University of Science and Technology, China
4 Department of Mathematics and Statistics, University of Ottawa, Canada
5 School of Computer Science and Technology, Xi'an Jiaotong University, China
6 National Space Science Center, Chinese Academy of Sciences, China

Abstract

Additive models have attracted much attention for high-dimensional regression estimation and variable selection. However, the existing models are usually limited to the single-task learning framework under the mean squared error (MSE) criterion, where the utilization of variable structure depends heavily on a priori knowledge among variables. For high-dimensional observations in real environments, e.g., Coronal Mass Ejections (CMEs) data, the learning performance of previous methods may be degraded seriously due to complex non-Gaussian noise and the insufficiency of a priori knowledge on variable structure. To tackle this problem, we propose a new class of additive models, called Multi-task Additive Models (MAM), by integrating the mode-induced metric, the structure-based regularizer, and additive hypothesis spaces into a bilevel optimization framework. Our approach does not require any a priori knowledge of variable structure and is suitable for high-dimensional data with complex noise, e.g., skewed noise, heavy-tailed noise, and outliers. A smooth iterative optimization algorithm with convergence guarantees is provided to implement MAM efficiently. Experiments on simulations and CMEs analysis demonstrate the competitive performance of our approach for robust estimation and automatic structure discovery.

1 Introduction

Additive models [14], as a nonparametric extension of linear models, have been extensively investigated in the machine learning literature [1, 5, 34, 44]. The attractive properties of additive models include the flexibility of function representation, the interpretability of prediction results, and the ability to circumvent the curse of dimensionality. Typical additive models are usually formulated under Tikhonov regularization schemes and fall into two categories: one focuses on recognizing dominant variables without considering interactions among the variables [21, 28, 29, 46], and the other aims to screen informative variables at the group level, e.g., groupwise additive models [4, 42].

Although these existing models have shown promising performance, most of them are limited to the single-task learning framework under the mean squared error (MSE) criterion. In particular, the groupwise additive models depend heavily on a priori knowledge of the variable structure. In this paper, we consider a problem commonly encountered in multi-task learning, in which all tasks share an underlying variable structure and involve data with complex non-Gaussian noises, e.g., skewed noise, heavy-tailed noise, and outliers.

*Corresponding author. Email: chenh@mail.hzau.edu.cn

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

[Figure 1 shows three panels: (a) the data-generating process y_i^(t) = f^(t)(x_i^(t)) + ε for datasets S^(t) = {(x_i^(t), y_i^(t))}_{i=1}^n, t = 1, ..., T, with an intrinsic group structure G_1, ..., G_L (plus inactive variables) shared by all tasks; (b) sparse single-task models toward the conditional mean that take the group structure as prior information; (c) the proposed multi-task models toward the conditional mode, which recover the variable structure from data.]

Figure 1: A schematic comparison between previous groupwise models and MAM. (a) Data-generating process. (b) Mean-based groupwise models with a priori knowledge of group structure, e.g., group lasso [33, 43] and group additive models [42]. (c) Mode-based MAM for robust estimation and automatic structure discovery.

The main motivation of this paper is described in Figure 1. As shown in Figure 1(a), the intrinsic variable structure for generating data is encoded by several variable groups G_1, G_2, ..., G_L, where some groups also contain inactive variables. For each task t ∈ {1, ..., T}, the output is related to different dominant groups, e.g., G_1 and G_2 for the first task. With a priori knowledge of the group structure, single-task groupwise models shown in Figure 1(b) aim to estimate the conditional mean independently, e.g., group lasso [13, 22, 33, 43] and group additive models [4, 16, 42]. All of the above models are formulated based on a priori knowledge of the group structure and a Gaussian noise assumption. However, these requirements are difficult to satisfy in real applications, e.g., Coronal Mass Ejections (CMEs) analysis [20].

To relax the dependence on a priori structure and Gaussian noise, this paper proposes a class of Multi-task Additive Models (MAM) by integrating an additive hypothesis space, a mode-induced metric [6, 41, 10], and a structure-based regularizer [12] into a bilevel learning framework. The bilevel learning framework is a special kind of mathematical program closely related to the optimization schemes in [7, 12]. A brief overview of MAM is shown in Figure 1(c). The proposed MAM can achieve robust estimation under complex noise and realize data-driven variable structure discovery. The main contributions of this paper are summarized below.

• Model: A new class of multi-task additive models is formulated by bringing four distinct concepts (i.e., multi-task learning [2, 9], sparse additive models [3, 4, 18, 42], the mode-induced metric [10, 38], and the bilevel learning framework [12, 32]) together in a coherent way to realize robust and interpretable learning. As far as we know, these issues have not been unified in a similar fashion before.

• Optimization: An optimization algorithm is presented for the non-convex and non-smooth MAM by integrating Half-Quadratic (HQ) optimization [24] and the dual Forward-Backward algorithm with Bregman distance (DFBB) [37] into Prox-SAGA [30]. In theory, we provide the convergence analysis of the proposed optimization algorithm.

• Effectiveness: Empirical effectiveness of the proposed MAM is supported by experimental evaluations on simulated data and CMEs data. Experimental results demonstrate that MAM can identify the variable structure automatically and estimate the intrinsic function efficiently even if the datasets are contaminated by non-Gaussian noise.


Table 1: Algorithmic properties (✓: has the given property; ✗: does not have the given property).

|                                 | RMR [38]     | GroupSpAM [42] | CGSI [26]    | BiGL [12]    | MAM (ours)   |
| Hypothesis Space                | Linear       | Additive       | Additive     | Linear       | Additive     |
| Learning Task                   | Single-task  | Single-task    | Single-task  | Multi-task   | Multi-task   |
| Evaluation Criterion            | Mode-induced | Mean-induced   | Mean-induced | Mean-induced | Mode-induced |
| Objective Function              | Nonconvex    | Convex         | Convex       | Convex       | Nonconvex    |
| Robust Estimation               | ✓            | ✗              | ✗            | ✗            | ✓            |
| Sparsity on Grouped Features    | ✗            | ✓              | ✓            | ✓            | ✓            |
| Sparsity on Individual Features | ✓            | ✗              | ✗            | ✗            | ✓            |
| Variable Structure Discovery    | ✗            | ✗              | ✓            | ✓            | ✓            |

Related works: There are some works on automatic structure discovery in additive models [26, 40] and partially linear models [19, 45]. Different from our MAM, these approaches are formulated under the single-task framework and the MSE criterion, which are sensitive to non-Gaussian noise and cannot tackle multi-task structure discovery directly. While some mode-based approaches have been designed for robust estimation, e.g., regularized modal regression (RMR) [38], none of them consider automatic structure discovery. Recently, an extension of group lasso was formulated for variable structure discovery [12]. Although this approach can induce data-driven sparsity at the group level, it is limited to linear mean regression and ignores the sparsity with respect to individual features. To better highlight the novelty of MAM, its algorithmic properties are summarized in Table 1, compared with RMR [38], Group Sparse Additive Models (GroupSpAM) [42], Capacity-based group structure identification (CGSI) [26], and Bilevel learning of Group Lasso (BiGL) [12].

2 Multi-task Additive Models

2.1 Additive models

We first recall some background on additive models [14, 42, 44]. For the sake of readability, the necessary notations are summarized in Supplementary Material A.

Let X ⊂ R^P be the input space and Y ⊂ R be the corresponding output set. We consider the following data-generating model:

Y = f^*(X) + \varepsilon,   (1)

where X ∈ X, Y ∈ Y, ε is a random noise, and f^* is the ground-truth function. For simplicity, denote ρ(X, Y) as the intrinsic distribution generated by (1). Under the Gaussian noise assumption, i.e., E(ε|X = x) = 0, a large family of nonparametric regression methods aims to estimate the conditional mean function f^*(x) = E(Y|X = x). However, nonparametric regression may suffer from a slow convergence rate due to the so-called curse of dimensionality [18, 34]. This motivates the research on additive models [14, 29] to remedy this problem.

Additive Models [14, 29]: Let the input space be X = (X_1, ..., X_P)^T ⊂ R^P, and let the hypothesis space with additive structure be defined as

\mathcal{H} = \Big\{ f : f(u) = \sum_{j=1}^{P} f_j(u_j),\ f_j \in \mathcal{H}_j,\ u = (u_1, \ldots, u_P)^T,\ u_j \in \mathcal{X}_j \Big\},

where H_j is the component function space on X_j. Usually, additive models aim to find the minimizer of E(Y − f(X))^2 in H. Moreover, groupwise additive models have been proposed with the help of a priori knowledge of the variable groups, e.g., GroupSpAM [42] and GroupSAM [4].

Let {G_1, G_2, ..., G_L} be a partition of the variable indices {1, ..., P} such that G_l ∩ G_j = ∅ for all l ≠ j and ∪_{l=1}^{L} G_l = {1, ..., P}. In essence, the main purpose of GroupSpAM [42] is to search for the minimizer of

\mathbb{E}(Y - f(X))^2 + \sum_{l=1}^{L} \tau_l \sqrt{\sum_{j \in G_l} \mathbb{E}[f_j^2(u_j)]} \quad \text{over all } f = \sum_{l=1}^{L} \sum_{j \in G_l} f_j \in \mathcal{H},

where τ_l is the corresponding weight for group G_l, 1 ≤ l ≤ L.

2.2 Mode-induced metric

Beyond the Gaussian noise assumption in [16, 29, 42], we impose a weaker assumption on ε, i.e., arg max_{t∈R} p_{ε|X}(t) = 0, where p_{ε|X} denotes the conditional density function of ε given X. In theory, this zero-mode assumption allows for more complex cases, e.g., Gaussian noise, heavy-tailed noise, skewed noise, or outliers.

Denote p(Y|X = x) as the conditional density function of Y given X = x. By taking the mode on both sides of (1), we obtain the conditional mode function

f^*(x) = \arg\max_{t \in \mathbb{R}} p(t \mid X = x),   (2)

where arg max_{t∈R} p(t|X = x) is assumed to be unique for any x ∈ X. There are direct and indirect strategies for estimating f^* [31]. Generally, the direct approaches are intractable since the conditional mode function cannot be elicited directly [15], while indirect estimators based on kernel density estimation (KDE) have shown promising performance [6, 10, 38, 41].

Now we introduce a mode-induced metric [10, 38] associated with KDE. For any measurable function f: X → R, the mode-induced metric is

\mathcal{R}(f) = \int_{\mathcal{X}} p_{Y|X}(f(x) \mid X = x)\, d\rho_X(x),   (3)

where ρ_X is the marginal distribution of ρ with respect to X. As discussed in [10], f^* is the maximizer of the mode-induced metric R(f). According to Theorem 5 in [10], we have R(f) = p_{E_f}(0), where p_{E_f}(0) is the density of the error random variable E_f = Y − f(X) evaluated at zero.

Define a modal kernel φ such that, for all u ∈ R, φ(u) = φ(−u), φ(u) > 0, and ∫_R φ(u) du = 1. Typical examples of modal kernels include the Gaussian kernel, the Logistic kernel, and the Epanechnikov kernel. Given {(x_i, y_i)}_{i=1}^{n} ⊂ X × Y, an empirical version of R(f) obtained via KDE [10, 27] is defined as

\mathcal{R}^{\sigma}_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} \phi\Big(\frac{y_i - f(x_i)}{\sigma}\Big),   (4)

where σ is a positive bandwidth. Then denote the data-free robust metric w.r.t. R^σ_emp(f) as

\mathcal{R}^{\sigma}(f) = \frac{1}{\sigma} \int_{\mathcal{X} \times \mathcal{Y}} \phi\Big(\frac{y - f(x)}{\sigma}\Big)\, d\rho(x, y).   (5)

Theorem 10 in [10] states that R^σ(f) tends to R(f) as σ → 0.
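To make (4) concrete, here is a minimal sketch (in Python/NumPy rather than the MATLAB used for the experiments) of evaluating the empirical mode-induced metric with a Gaussian modal kernel; the function names and toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_modal_kernel(u):
    """Gaussian modal kernel: symmetric, positive, and integrates to one."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def empirical_mode_metric(y, f_x, sigma):
    """Empirical mode-induced metric (4): large values mean the residuals
    y - f(x) concentrate around zero, i.e., f tracks the conditional mode.
    (Rescaling by 1/sigma gives the population counterpart (5).)"""
    return np.mean(gaussian_modal_kernel((y - f_x) / sigma))

# Toy usage: a constant predictor equal to the conditional mode of skewed data.
rng = np.random.default_rng(0)
y = 1.0 + rng.chisquare(df=2, size=200)      # skewed noise whose mode is near 0
print(empirical_mode_metric(y, np.full(200, 1.0), sigma=2.0))
```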

2.3 Mode-induced group additive models

Here we form the additive hypothesis space based on smoothing splines [16, 23, 29, 46]. Let {ψ_jk : k = 1, 2, ...} be bounded and orthonormal basis functions on X_j. Then the component function space can be defined as B_j = { f_j : f_j = \sum_{k=1}^{\infty} \beta_{jk}\psi_{jk}(\cdot) } with coefficients β_jk, j = 1, ..., P. Truncating these basis expansions to a finite dimension d yields

\bar{\mathcal{B}}_j = \Big\{ \bar{f}_j : \bar{f}_j = \sum_{k=1}^{d} \beta_{jk}\psi_{jk}(\cdot) \Big\}.

Denote \|f\|_2 = \sqrt{\int f^2(x)\,dx}. It has been shown that \|f_j - \bar{f}_j\|_2^2 = O(1/d^4) when B_j is a second-order Sobolev ball [46]. The mode-induced Group Additive Models (mGAM) can be formulated as

\hat{f} = \arg\max_{f = \sum_{j=1}^{P} \bar{f}_j,\ \bar{f}_j \in \bar{\mathcal{B}}_j} \ \mathcal{R}^{\sigma}_{emp}(f) - \lambda\,\Omega(f),   (6)

where λ is a positive regularization parameter and the structure-based regularizer is

\Omega(f) = \sum_{l=1}^{L} \tau_l \sqrt{\sum_{j \in G_l} \|f_j\|_2^2} = \sum_{l=1}^{L} \tau_l \sqrt{\sum_{j \in G_l} \sum_{k=1}^{d} \beta_{jk}^2}

with group weights τ_l. Denote Ψ_i = (ψ_11(x_i1), ..., ψ_1d(x_i1), ..., ψ_P1(x_iP), ..., ψ_Pd(x_iP)) and β = (β_11, ..., β_1d, ..., β_P1, ..., β_Pd)^T ∈ R^{Pd}. Given observations {(x_i, y_i)}_{i=1}^{n} with x_i = (x_i1, ..., x_iP)^T ∈ R^P, the mGAM can be represented as

\hat{f} = \sum_{j=1}^{P} \hat{f}_j = \sum_{j=1}^{P} \sum_{k=1}^{d} \hat{\beta}_{jk}\psi_{jk}(\cdot)

with

\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^{Pd}} \ \frac{1}{n} \sum_{i=1}^{n} \phi\Big(\frac{y_i - \Psi_i\beta}{\sigma}\Big) - \lambda \sum_{l=1}^{L} \tau_l \sqrt{\sum_{j \in G_l} \sum_{k=1}^{d} \beta_{jk}^2}.   (7)

Remark 1: The mGAM is a robust extension of GroupSpAM from mean regression to mode regression. When each group G_l, l ∈ {1, ..., L}, is a singleton, our mGAM reduces to a robust version of SpAM [29] obtained by replacing the MSE with the robust mode-induced metric (3). In particular, our mGAM is consistent with RMR [38] when each group is a singleton and all component functions are linear.
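As an illustration of how the objective in (7) is assembled, the sketch below builds a truncated basis expansion and evaluates the mGAM objective for a given coefficient vector. It is a simplified stand-in (monomial basis instead of smoothing splines, Gaussian modal kernel, unit group weights) and not the solver used in the paper.

```python
import numpy as np

def basis_expansion(X, d=3):
    """Truncated basis expansion Psi_i: simple monomials per coordinate stand in
    for the orthonormal spline basis psi_jk used in the paper."""
    n, P = X.shape
    cols = [X[:, j:j + 1] ** k for j in range(P) for k in range(1, d + 1)]
    return np.hstack(cols)                       # shape (n, P*d), column order (j, k)

def mgam_objective(beta, Psi, y, groups, d=3, lam=0.01, sigma=2.0):
    """Objective in (7): empirical mode-induced fit minus the group regularizer."""
    residuals = (y - Psi @ beta) / sigma
    fit = np.mean(np.exp(-0.5 * residuals ** 2) / np.sqrt(2 * np.pi))   # Gaussian modal kernel
    beta_jk = beta.reshape(-1, d)                # row j holds (beta_j1, ..., beta_jd)
    penalty = sum(np.sqrt(np.sum(beta_jk[g] ** 2)) for g in groups)     # tau_l = 1
    return fit - lam * penalty

# Toy usage with P = 4 variables in L = 2 groups.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 4)), rng.normal(size=50)
Psi = basis_expansion(X, d=3)
print(mgam_objective(np.zeros(Psi.shape[1]), Psi, y, groups=[[0, 1], [2, 3]]))
```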

2.4 Multi-task additive models

To reduce the dependency of mGAM on a priori structure information, this section formulates MAM by learning an augmented mGAM within a multi-task bilevel framework [11, 12, 25].

Let T be the number of tasks. Let X^(t) = (X_1^(t), ..., X_P^(t))^T ⊂ R^P and Y^(t) ⊂ R be the input space and the output space, respectively, associated with the t-th task. Suppose that observations S^(t) = {(x_i^(t), y_i^(t))}_{i=1}^{2n} ⊂ X^(t) × Y^(t) are drawn from an unknown distribution ρ^(t)(x, y). Without loss of generality, we split each S^(t) into a training set S_train^(t) and a validation set S_val^(t) with the same sample size n for the subsequent analysis.

To quantify the groups G_1, ..., G_L, we introduce the following unit simplex:

\Theta = \Big\{ \vartheta = (\vartheta_1, \ldots, \vartheta_L) \in \mathbb{R}^{P \times L} \ \Big|\ \sum_{l=1}^{L} \vartheta_{jl} = 1,\ 0 \le \vartheta_{jl} \le 1,\ j = 1, \ldots, P \Big\},

where each element ϑ_jl can be viewed as the probability that the j-th variable belongs to group G_l. Ideally, ϑ_jl = 1 ⇒ j ∈ G_l and ϑ_jl = 0 ⇒ j ∉ G_l. However, we cannot mine the sparsity within each group, since Σ_{l=1}^{L} ϑ_jl = 1 for every j = 1, ..., P. Inspired by [35], we introduce ν = (ν_1, ..., ν_P)^T ∈ [0, 1]^P to screen the main effect variables across all tasks, where ν_j ≠ 0 means the j-th variable is effective.

Denote Ψ_i^(t) = (ψ_11(x_i1^(t)), ..., ψ_1d(x_i1^(t)), ..., ψ_P1(x_iP^(t)), ..., ψ_Pd(x_iP^(t))). Given {S_val^(t)}_{t=1}^{T} and {S_train^(t)}_{t=1}^{T}, our MAM can be formulated as the following bilevel optimization scheme.

Outer Problem (based on the validation sets S_val^(t)):

(\hat{\vartheta}, \hat{\nu}) \in \arg\max_{\vartheta \in \Theta,\ \nu \in [0,1]^P} \sum_{t=1}^{T} U(\beta^{(t)}(\vartheta), \nu) \quad \text{with} \quad U(\beta^{(t)}(\vartheta), \nu) = \frac{1}{n} \sum_{i=1}^{n} \phi\Big(\frac{y_i^{(t)} - \Psi_i^{(t)} T_\nu\, \beta^{(t)}(\vartheta)}{\sigma}\Big),

where T_ν is a linear operator for screening the main effect variables across all tasks, such that T_ν β^(t)(ϑ) = (ν_1 β_11^(t)(ϑ), ..., ν_1 β_1d^(t)(ϑ), ..., ν_P β_P1^(t)(ϑ), ..., ν_P β_Pd^(t)(ϑ))^T ∈ R^{Pd}, and β(ϑ) = (β^(t)(ϑ))_{1≤t≤T} is the maximizer of the following augmented mGAM.

Inner Problem (based on the training sets S_train^(t)):

\beta(\vartheta) = \arg\max_{\beta} \sum_{t=1}^{T} J(\beta^{(t)}) \quad \text{with} \quad J(\beta^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \phi\Big(\frac{y_i^{(t)} - \Psi_i^{(t)} \beta^{(t)}}{\sigma}\Big) - \frac{\mu}{2}\|\beta^{(t)}\|_2^2 - \lambda \sum_{l=1}^{L} \tau_l \|T_{\vartheta_l} \beta^{(t)}\|_2,

where T_{ϑ_l} β^(t) = (ϑ_1l β_11^(t), ..., ϑ_1l β_1d^(t), ..., ϑ_Pl β_P1^(t), ..., ϑ_Pl β_Pd^(t))^T ∈ R^{Pd} is used for identifying which variables belong to the l-th group, and the penalty term (μ/2)‖β^(t)‖_2^2, with a tending-to-zero parameter μ, ensures the strong convexity needed for optimization.
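For concreteness, below is a minimal sketch of evaluating the inner objective J(β^(t)) for one task, assuming a Gaussian modal kernel and unit group weights τ_l; in the paper this objective is maximized by the HQ-DFBB solver rather than merely evaluated, so this is only an illustration of the objective's structure.

```python
import numpy as np

def inner_objective(beta_t, Psi_t, y_t, theta, d=3, lam=0.01, mu=1e-3, sigma=2.0):
    """Inner objective J(beta^(t)) of MAM: mode-induced fit, a small strongly
    concave (quadratic) term, and the group penalty weighted by the simplex
    variable theta of shape (P, L); column l plays the role of T_{theta_l}."""
    residuals = (y_t - Psi_t @ beta_t) / sigma
    fit = np.mean(np.exp(-0.5 * residuals ** 2) / np.sqrt(2 * np.pi))
    beta_jk = beta_t.reshape(-1, d)                  # (P, d) coefficient block
    penalty = 0.0
    for l in range(theta.shape[1]):                  # tau_l = 1 for all groups
        weighted = theta[:, l:l + 1] * beta_jk       # T_{theta_l} beta^(t)
        penalty += np.linalg.norm(weighted)
    return fit - 0.5 * mu * np.sum(beta_t ** 2) - lam * penalty
```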

Finally, the multi-task additive models (MAM) can be represented as

\hat{f}^{(t)} = \sum_{j=1}^{P} \sum_{k=1}^{d} \hat{\nu}_j\, \hat{\beta}_{jk}^{(t)}(\hat{\vartheta})\, \psi_{jk}(\cdot), \quad t = 1, \ldots, T.

Let ϑ^Thr and ν^Thr be the thresholded counterparts of ϑ̂ and ν̂, respectively. Similar to [12], ϑ^Thr is determined by assigning each feature to its most dominant group. For any j = 1, ..., P, ν_j^Thr is determined by a threshold u, i.e., ν_j^Thr = 0 if ν̂_j ≤ u and ν_j^Thr = 1 otherwise. Then the data-driven variable structure can be obtained via S = (ϑ_l^Thr ⊙ ν^Thr)_{1≤l≤L}, where ⊙ denotes the Hadamard product.

Remark 2: If the hyper-parameter ν ≡ 1_P, the sparsity w.r.t. individual features is not taken into account by MAM. In this setting, our MAM is essentially a robust and nonlinear extension of BiGL [11] obtained by incorporating the mode-induced metric and an additive hypothesis space.

Remark 3: Indeed, mGAM with an oracle variable structure is the baseline of MAM. In other words, the inner problem with the estimated variable structure S aims to approximate the mGAM.
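The thresholding step that turns (ϑ̂, ν̂) into the data-driven structure S can be sketched as follows; the array shapes and the default threshold u = 0.5 follow the description above, while the function name and toy values are illustrative.

```python
import numpy as np

def extract_structure(theta_hat, nu_hat, u=0.5):
    """Data-driven structure S from the learned simplex variable theta_hat (P, L)
    and the screening vector nu_hat (P,): each feature is assigned to its most
    dominant group, and features with nu_hat <= u are marked inactive."""
    P, L = theta_hat.shape
    theta_thr = np.zeros((P, L), dtype=int)
    theta_thr[np.arange(P), np.argmax(theta_hat, axis=1)] = 1   # dominant group
    nu_thr = (nu_hat > u).astype(int)                           # hard threshold
    return theta_thr * nu_thr[:, None]                          # Hadamard product per group

# Toy usage: 4 features, 2 groups; the last feature is screened out.
theta_hat = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
nu_hat = np.array([0.9, 0.8, 0.7, 0.2])
print(extract_structure(theta_hat, nu_hat))
```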

Algorithm 1: Prox-SAGA for MAM

Input: Data {S_train^(t), S_val^(t)}_{t=1}^{T}; maximum number of iterations Z; number of groups L; step sizes η_ϑ and η_ν; initializations ϑ^(0), ν^(0); λ, μ; modal kernel φ; bandwidth σ; weights τ_l, l = 1, ..., L.
Initialization: a_t = ϑ^(0), c_t = ν^(0) for t = 1, ..., T; g_ϑ^(0) = 0_{P×L}; g_ν^(0) = 0_P.
for z = 0, 1, ..., Z − 1 do
  1. Randomly pick a set B^(z) ⊂ {1, ..., T}; denote its cardinality by |B^(z)|.
  2. Compute β^(k)(ϑ^(z)) based on S_train^(k) for all k ∈ B^(z): β^(k)(ϑ^(z)) = HQ-DFBB(ϑ^(z), λ, σ, μ, τ, S_train^(k)).
  3. Update ϑ based on S_val^(k):
     3.1) G_ϑ = (1/|B^(z)|) Σ_{k∈B^(z)} ( h_ϑ(β^(k)(ϑ^(z)), ν^(z)) − h_ϑ(β^(k)(a_k), ν^(z)) )
     3.2) ϑ̃^(z) = g_ϑ^(z) + G_ϑ
     3.3) ϑ^(z+1) = P_Θ(ϑ^(z) − η_ϑ ϑ̃^(z))
     3.4) g_ϑ^(z+1) = g_ϑ^(z) + (|B^(z)|/T) G_ϑ
     3.5) a_k = ϑ^(z) for every k ∈ B^(z)
  4. Update ν based on S_val^(k):
     4.1) G_ν = (1/|B^(z)|) Σ_{k∈B^(z)} ( h_ν(β^(k)(ϑ^(z)), ν^(z)) − h_ν(β^(k)(ϑ^(z)), c_k) )
     4.2) ν̃^(z) = g_ν^(z) + G_ν
     4.3) ν^(z+1) = P_ν(ν^(z) − η_ν ν̃^(z))
     4.4) g_ν^(z+1) = g_ν^(z) + (|B^(z)|/T) G_ν
     4.5) c_k = ν^(z) for every k ∈ B^(z)
end for
Output: ϑ̂ = ϑ^(Z), ν̂ = ν^(Z), β^(t)(ϑ̂), t = 1, ..., T.
Prediction functions: f̂^(t) = Σ_{j=1}^{P} Σ_{k=1}^{d} ν̂_j β̂_jk^(t)(ϑ̂) ψ_jk(·), t = 1, ..., T.
Variable structure: S = (ϑ_l^Thr ⊙ ν^Thr)_{1≤l≤L}.

3 Optimization Algorithm

To implement the non-convex and non-smooth MAM, we employ the Prox-SAGA algorithm [30] with a simplex projection and a box projection [8]. For simplicity, we define two partial derivative calculators:

-\sum_{t=1}^{T} \frac{\partial U(\beta^{(t)}(\vartheta), \nu)}{\partial \nu} = \sum_{t=1}^{T} h_\nu(\beta^{(t)}(\vartheta), \nu), \qquad -\sum_{t=1}^{T} \frac{\partial U(\beta^{(t)}(\vartheta), \nu)}{\partial \vartheta} = \sum_{t=1}^{T} h_\vartheta(\beta^{(t)}(\vartheta), \nu).

It is straightforward to compute Σ_{t=1}^{T} h_ν(β^(t)(ϑ), ν), since the parameter ν appears explicitly only in the outer problem. The optimization parameter ϑ enters implicitly via the solution β(ϑ) of the inner problem. Hence, computing Σ_{t=1}^{T} h_ϑ(β^(t)(ϑ), ν) requires a smooth algorithm, HQ-DFBB (combining HQ [24] and DFBB [37]), for the solution β(ϑ). Due to space limitations, the optimization details, including HQ-DFBB and the two partial derivative calculators, are provided in Supplementary Material B. Let P_Θ be the projection onto the unit simplex Θ and P_ν be the box projection onto [0, 1]^P. The general procedure of Prox-SAGA is summarized in Algorithm 1.

Remark 4: From Theorem 2.1 in [12] and Theorem 4 in [30], we know that Algorithm 1 converges only if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problem. A detailed convergence analysis of HQ-DFBB is provided in Supplementary Material C.
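The two projections used by Algorithm 1 admit simple implementations. The sketch below uses the standard sort-based projection onto the simplex (the method of [8] is a faster alternative) applied row-wise to ϑ, and clipping for the box projection of ν; shapes and names are illustrative assumptions.

```python
import numpy as np

def project_simplex_rows(theta):
    """Project each row of theta (P, L) onto the unit simplex, so that every
    feature's group memberships are nonnegative and sum to one."""
    P, L = theta.shape
    out = np.empty_like(theta)
    for j in range(P):
        v = np.sort(theta[j])[::-1]                 # sort descending
        css = np.cumsum(v) - 1.0
        rho = np.nonzero(v - css / np.arange(1, L + 1) > 0)[0][-1]
        out[j] = np.maximum(theta[j] - css[rho] / (rho + 1), 0.0)
    return out

def project_box(nu):
    """Box projection of the screening vector nu onto [0, 1]^P."""
    return np.clip(nu, 0.0, 1.0)

# Toy usage for one Prox-SAGA projection step.
theta = project_simplex_rows(np.random.default_rng(2).normal(size=(5, 3)))
nu = project_box(np.array([1.3, -0.2, 0.6, 0.9, 0.4]))
```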

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data. All experiments are implemented in MATLAB 2019b on an Intel Core i7 with 16 GB memory.

4.1 Simulated data analysis

Baselines: The proposed MAM is compared with BiGL [11] in terms of variable structure recovery and prediction ability. In addition, we also consider several baselines, including Lasso [36], RMR [38], mGAM, Group Lasso (GL) [43], and GroupSpAM [42]. Note that the oracle variable structure is given as a priori knowledge when implementing mGAM, GL, and GroupSpAM.

Oracle variable structure: Set the number of tasks T = 500, the dimension P = 50 for each task, and the actual number of groups L* = 5. We denote the indices of the l-th group by G_l = {1 + (l − 1)(P/L*), ..., l(P/L*)} for all l ∈ {1, ..., L*}. In addition, we randomly pick V ⊂ {1, ..., P} to generate sparse (inactive) features across all tasks. For each j ∈ {1, ..., P} and l ∈ {1, ..., L*}, the oracle variable structure S* is defined as S*_jl = 1 if j ∈ V^c ∩ G_l and 0 otherwise.
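A sketch of constructing this oracle structure S* (contiguous equal-sized groups plus a random inactive set V) is given below; the random seed and argument defaults are illustrative assumptions.

```python
import numpy as np

def oracle_structure(P=50, L_star=5, n_inactive=5, seed=0):
    """Oracle structure S*: contiguous equal-sized groups of the P variables,
    with a random subset V of variables marked inactive across all tasks."""
    rng = np.random.default_rng(seed)
    group_size = P // L_star
    inactive = rng.choice(P, size=n_inactive, replace=False)       # the set V
    S_star = np.zeros((P, L_star), dtype=int)
    for l in range(L_star):
        members = np.arange(l * group_size, (l + 1) * group_size)  # group G_{l+1}
        S_star[members, l] = 1
    S_star[inactive, :] = 0                                        # j in V => all-zero row
    return S_star, inactive

S_star, V = oracle_structure()
```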

Parameter selection: For the hyper-parameters shared by BiGL and MAM, we set Z = 3000, μ = 10^{-3}, M = 5, Q = 100, and σ = 2. We search the regularization parameter λ in the range {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}}. Here we assume the actual number of groups is known, i.e., L = L*. The weight for each group is set to τ_l = 1 for all l ∈ {1, ..., L}. Following the same strategy as in [11], we choose the initialization ϑ^(0) = P_Θ((1/L)·1_{P×L} + 0.01·N(0_{P×L}, I_{P×L})) ∈ R^{P×L} and ν^(0) = (0.5, ..., 0.5)^T ∈ R^P.

Evaluation criteria: Denote f̂^(t) and f*^(t) as the estimator and the ground-truth function, respectively, 1 ≤ t ≤ T. The evaluation criteria used here include the Average Squared Error, ASE = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\|\hat{f}^{(t)} - y^{(t)}\|_2^2; the True Deviation, TD = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\|\hat{f}^{(t)} - f^{*(t)}\|_2^2; Variable Structure Recovery, S = (ν^Thr ⊙ ϑ_l^Thr)_{1≤l≤L} with the hard threshold value u = 0.5; and the Width of Prediction Intervals (WPI) and Sample Coverage Probability (SCP) at the 10% confidence level. Specifically, WPI and SCP are designed in [41] for comparing the widths of prediction intervals at the same confidence level (see Section 3.2 in [41] for more details).
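A minimal sketch of the ASE and TD computations, assuming the fitted values, observed outputs, and ground-truth function values are stored as T × n arrays:

```python
import numpy as np

def ase_and_td(f_hat, y, f_star):
    """ASE: mean squared distance of fitted values to observed outputs;
    TD: mean squared distance to the ground-truth values; both averaged over
    the T tasks. All inputs have shape (T, n)."""
    n = y.shape[1]
    ase = np.mean(np.sum((f_hat - y) ** 2, axis=1) / n)
    td = np.mean(np.sum((f_hat - f_star) ** 2, axis=1) / n)
    return ase, td
```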

Data sets: The training set, validation set, and test set are all drawn from y^(t) = f*^(t)(u^(t)) + ε with the same sample size n = 50 for each task, where u^(t) = (u_1, ..., u_P)^T ∈ R^P is randomly drawn from the Gaussian distribution N(0_P, (1/2) I_P). The noise ε follows Gaussian noise N(0, 0.05), Student noise t(2), Chi-square noise χ²(2), and Exponential noise Exp(2), respectively. We randomly pick G^(t) ⊂ {G_1, ..., G_L} s.t. |G^(t)| = 2 and consider the following examples of ground-truth functions f*^(t), 1 ≤ t ≤ T.

Example A [12] (linear component functions): f^{*(t)}(u^{(t)}) = \sum_{G_l \in G^{(t)}} \sum_{j \in G_l \cap V^c} u_j^{(t)} \beta_j^{(t)}, where the true regression coefficient β_j^(t) = 1 if j ∈ G_l ∩ V^c, and β_j^(t) = 0 otherwise.

Example B (nonlinear additive functions): Denote f*_1(u) = 2.5 sin(u), f*_2(u) = 2u, f*_3(u) = 2e^u − e^{−1} − 1, f*_4(u) = 8u², and f*_5(u) = 3 sin(2e^u). The nonlinear additive function is f^{*(t)}(u^{(t)}) = \sum_{G_l \in G^{(t)}} \sum_{j \in G_l \cap V^c} f^*_l(u_j^{(t)}).

Here, the spline basis matrices for MAM, mGAM, and GroupSpAM are constructed with d = 3. In the data-generating process, we consider two cases for the number of inactive variables, i.e., |V| = 0 and |V| = 5. Due to space limitations, we only present the results with Gaussian noise and Student noise in Table 2 and Figure 2.
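For reference, a sketch of the data-generating process for one task (Example B style) is given below; it follows the description above but treats 0.05 as the Gaussian noise variance, reuses the contiguous group layout from the oracle-structure sketch, and uses otherwise illustrative defaults.

```python
import numpy as np

def generate_task(S_star, groups, n=50, noise="gaussian", seed=0):
    """Simulated data for one task: two random dominant groups, the nonlinear
    components of Example B applied to active variables, plus additive noise."""
    rng = np.random.default_rng(seed)
    P = S_star.shape[0]
    f = [lambda u: 2.5 * np.sin(u), lambda u: 2 * u,
         lambda u: 2 * np.exp(u) - np.exp(-1) - 1,
         lambda u: 8 * u ** 2, lambda u: 3 * np.sin(2 * np.exp(u))]
    U = rng.normal(scale=np.sqrt(0.5), size=(n, P))          # N(0_P, (1/2) I_P)
    picked = rng.choice(len(groups), size=2, replace=False)  # dominant groups G^(t)
    y = np.zeros(n)
    for l in picked:
        for j in groups[l]:
            if S_star[j, l] == 1:                            # skip inactive variables
                y += f[l](U[:, j])
    eps = rng.normal(scale=np.sqrt(0.05), size=n) if noise == "gaussian" \
        else rng.standard_t(df=2, size=n)
    return U, y + eps

# Toy usage: 10 variables in 5 groups of 2, all active.
groups = [np.arange(2 * l, 2 * l + 2) for l in range(5)]
S_star = np.zeros((10, 5), dtype=int)
for l, g in enumerate(groups):
    S_star[g, l] = 1
U, y = generate_task(S_star, groups)
```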


Figure 2: Variable structure discovery for Example A under different noise and sparsity index V (white pixels mark the grouped variables; red pixels mark the inactive variables). Top left: Gaussian noise with |V| = 0. Top right: Student noise with |V| = 0. Bottom left: Gaussian noise with |V| = 5. Bottom right: Student noise with |V| = 5.

The remaining results, as well as several evaluations on the impact of the hyper-parameters, are provided in Supplementary Material D.1. From the reported results, even without the structure information, the proposed MAM provides regression estimation competitive with mGAM (which is given the a priori structure) and usually achieves better performance than these competitors when the noise is non-Gaussian. Note that the actual number of groups is assumed to be known in the current evaluations, i.e., L = L*. In Supplementary Material D.1, we further verify the effectiveness of MAM for the general setting L > L*.

Table 2: Performance comparisons on Example A (top) and Example B (bottom) w.r.t. different criteria. Each cell reports ASE / TD / WPI (SCP).

Example A:
Methods | |V| = 0 (Gaussian noise) | |V| = 5 (Gaussian noise) | |V| = 0 (Student noise) | |V| = 5 (Student noise)
MAM   | 0.0920 / 0.0914 / 0.0679 (0.1015) | 0.0801 / 0.0794 / 0.0595 (0.1014) | 1.0204 / 0.2467 / 0.1817 (0.1014) | 1.4279 / 0.2269 / 0.1796 (0.1020)
BiGL  | 0.0715 / 0.0701 / 0.0616 (0.1014) | 0.0799 / 0.0793 / 0.0553 (0.1027) | 1.1097 / 0.3550 / 0.2061 (0.1023) | 1.4961 / 0.3544 / 0.2093 (0.1017)
mGAM  | 0.0894 / 0.0885 / 0.0651 (0.1011) | 0.0795 / 0.0788 / 0.0611 (0.1016) | 1.0132 / 0.2441 / 0.1803 (0.1015) | 1.3828 / 0.2242 / 0.1725 (0.1026)
GL    | 0.0683 / 0.0661 / 0.0620 (0.1014) | 0.0708 / 0.0684 / 0.0535 (0.1011) | 1.3252 / 0.3163 / 0.1935 (0.1020) | 1.4811 / 0.3576 / 0.1976 (0.1020)
Lasso | 0.2145 / 0.2138 / 0.1011 (0.1013) | 0.2204 / 0.2131 / 0.1080 (0.1017) | 3.8012 / 3.2412 / 0.5269 (0.1021) | 4.2121 / 3.5645 / 0.4899 (0.1025)
RMR   | 0.2201 / 0.2196 / 0.1081 (0.1020) | 0.2224 / 0.2165 / 0.1108 (0.1029) | 1.9723 / 1.3241 / 0.3324 (0.1022) | 2.1518 / 1.3504 / 0.3451 (0.1024)

Example B:
Methods   | |V| = 0 (Gaussian noise) | |V| = 5 (Gaussian noise) | |V| = 0 (Student noise) | |V| = 5 (Student noise)
MAM       | 0.7981 / 0.7972 / 0.2252 (0.1044) | 0.7827 / 0.7826 / 0.2155 (0.1039) | 0.9772 / 0.6943 / 0.2220 (0.1038) | 0.9145 / 0.6749 / 0.2218 (0.1040)
BiGL      | 0.8307 / 0.8307 / 0.2318 (0.1029) | 0.7998 / 0.7934 / 0.2250 (0.1027) | 1.0727 / 0.7525 / 0.2414 (0.1030) | 1.0004 / 0.7405 / 0.2365 (0.1034)
mGAM      | 0.7969 / 0.7967 / 0.2249 (0.1042) | 0.7787 / 0.7786 / 0.2121 (0.1037) | 0.9665 / 0.6842 / 0.2203 (0.1038) | 0.9135 / 0.6753 / 0.2226 (0.1037)
GroupSpAM | 0.7914 / 0.7916 / 0.2101 (0.1033) | 0.7604 / 0.7599 / 0.2085 (0.1022) | 0.9965 / 0.7142 / 0.2303 (0.1048) | 0.9521 / 0.7077 / 0.2294 (0.1025)
GL        | 0.8081 / 0.8080 / 0.2241 (0.1028) | 0.7741 / 0.7708 / 0.2177 (0.1029) | 1.0501 / 0.7306 / 0.2335 (0.1030) | 0.9591 / 0.7202 / 0.2363 (0.1029)
Lasso     | 2.5779 / 2.4777 / 0.3956 (0.1027) | 2.6801 / 2.6799 / 0.4259 (0.1020) | 3.8645 / 3.7612 / 0.5218 (0.1020) | 3.7443 / 3.5036 / 0.4963 (0.1024)
RMR       | 1.5160 / 1.5061 / 0.3237 (0.1028) | 1.7654 / 1.6798 / 0.3184 (0.1020) | 1.9448 / 1.7294 / 0.3587 (0.1025) | 2.1085 / 2.0077 / 0.3329 (0.1030)

4.2 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System, and it is crucial to forecast the physical parameters related to CMEs. Although machine learning approaches have been applied to these tasks recently [20, 39], there is no work on interpretable prediction with data-driven structure discovery. Interplanetary CMEs (ICMEs) data are provided in the Richardson and Cane List (http://www.srl.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm). From this source, we collect 137 ICME observations from 1996 to 2016. The features of CMEs are provided in the SOHO LASCO CME Catalog (https://cdaw.gsfc.nasa.gov/CME_list). In-situ solar wind parameters can be downloaded from OMNIWeb Plus (https://omniweb.gsfc.nasa.gov); the in-situ solar wind parameters at Earth are used to represent the unknown solar wind plasma [20]. A total of 21 features are chosen as inputs by combining the features of CMEs and the in-situ solar wind parameters. Five physical parameter prediction tasks are considered as outputs, including CMEs arrival time, Mean ICME speed, Maximum solar wind speed, Increment in solar wind speed, and Mean magnetic field strength. We split the data of each task into training, validation, and test sets (with ratio 2:2:1) and adopt the same settings as in the simulations. Table 3 demonstrates that MAM enjoys smaller average absolute error than the competitors. In addition, the estimated structure (via MAM) is described in Figure 3. From Figure 3 and Table 3, we see that group G1 (including Mass, MPA, Solar wind speed, and Vy) and group G2 (including Acceleration and Linear Speed) are significant for most tasks. In particular, G2 and G7 (2nd-order Speed at final height) can be characterized as factors that reflect the CME speed. Table 3 shows that the groups G2 and G7 play an important role in CME arrival time prediction, which is consistent with the results in [20]. In addition, the impact of the hyper-parameters is displayed in Supplementary Material D.2 due to space limitations. Overall, the proposed MAM achieves promising performance on prediction and structure discovery.

Table 3: Average absolute error (AAE) and dominant groups for each task.

Tasks   | CMEs arrival time  | Mean ICME speed        | Maximum solar wind speed | Increment in solar wind speed | Mean magnetic field strength
Methods | AAE (h) / Groups   | AAE (km/s) / Groups    | AAE (km/s) / Groups      | AAE (km/s) / Groups           | AAE (nT) / Groups
MAM     | 9.07 / G1, G2, G7  | 45.41 / G1, G2, G3, G6 | 59.32 / G1, G2           | 65.38 / G1, G2, G3            | 3.47 / G1
BiGL    | 11.09 / –          | 53.75 / –              | 46.51 / –                | 89.97 / –                     | 5.21 / –
Lasso   | 12.16 / –          | 62.56 / –              | 59.81 / –                | 85.34 / –                     | 4.38 / –
RMR     | 12.02 / –          | 62.23 / –              | 51.90 / –                | 86.13 / –                     | 3.98 / –

[Figure 3 displays the estimated structure over the input features (CPA, Angular Width, Acceleration, Linear Speed, 2nd-order Speed (20 Rs), Mass, Kinetic Energy, MPA, Field magnitude average, Bx, By, Bz, Solar wind speed, Vx, Vy, Vz, Proton density, Temperature, Flow pressure, Plasma beta) against group indices 1-10.]

Figure 3: Variable structure S (white pixels = the grouped variables; red pixels = the inactive variables).

5 Conclusion

This paper proposes multi-task additive models to achieve robust estimation and automatic structure discovery. As far as we know, it is novel to explore robust interpretable machine learning by integrating modal regression, additive models, and multi-task learning together. A computing algorithm and empirical evaluations are provided to support its effectiveness. In the future, it would be interesting to investigate robust additive models for overlapping variable structure discovery [17].

Broader Impact

The positive impacts of this work are two-fold: 1) Our algorithmic framework paves a new way for mining the intrinsic feature structure among high-dimensional variables and may be a stepping stone to further explore data-driven structure discovery with overlapping groups. 2) Our MAM can be applied to other fields, e.g., gene expression analysis and drug discovery. However, there is also a risk of unstable estimation when facing ultra high-dimensional data.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 11671161, 12071166, 61972188, and 41574181, the Fundamental Research Funds for the Central Universities (Program No. 2662019FW003), and NSERC Grant RGPIN-2016-05024. We are grateful to the anonymous NeurIPS reviewers for their constructive comments.

References

[1] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv:2004.13912v1, 2020.

[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[3] H. Chen, G. Liu, and H. Huang. Sparse shrunk additive models. In International Conference on Machine Learning (ICML), 2020.

[4] H. Chen, X. Wang, C. Deng, and H. Huang. Group sparse additive machine. In Advances in Neural Information Processing Systems (NIPS), pages 198–208, 2017.

[5] H. Chen, Y. Wang, F. Zheng, C. Deng, and H. Huang. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2020.3005144, 2020.

[6] Y. C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489–514, 2016.

[7] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256, 2007.

[8] L. Condat. Fast projection onto the simplex and the l1-ball. Mathematical Programming, 158(1-2):575–585, 2016.

[9] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.

[10] Y. Feng, J. Fan, and J. Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.

[11] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pages 1563–1572, 2018.

[12] J. Frecon, S. Salzo, and M. Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems (NIPS), pages 8301–8311, 2018.

[13] J. H. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.

[14] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. London: Chapman and Hall, 1990.

[15] C. Heinrich. The mode functional is not elicitable. Biometrika, 101(1):245–251, 2014.

[16] J. Huang, J. L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.

[17] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), pages 433–440, 2009.

[18] K. Kandasamy and Y. Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning (ICML), pages 69–78, 2016.

[19] X. Li, L. Wang, and D. Nettleton. Sparse model identification and learning for ultra-high-dimensional additive partially linear models. Journal of Multivariate Analysis, 173:204–228, 2019.

[20] J. Liu, Y. Ye, C. Shen, Y. Wang, and R. Erdélyi. A new tool for CME arrival time prediction using machine learning algorithms: CAT-PUMA. The Astrophysical Journal, 855(2):109–118, 2018.

[21] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. The Annals of Statistics, 46(2):781–813, 2018.

[22] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.

[23] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.

[24] M. Nikolova and M. K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM Journal on Scientific Computing, 27(3):937–966, 2005.

[25] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient based bilevel optimization with nonsmooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.

[26] C. Pan and M. Zhu. Group additive structure identification for kernel nonparametric regression. In Advances in Neural Information Processing Systems (NIPS), pages 4907–4916, 2017.

[27] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

[28] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2):389–427, 2012.

[29] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. Journal of the Royal Statistical Society: Series B, 71:1009–1030, 2009.

[30] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1145–1153, 2016.

[31] T. W. Sager. Estimation of a multivariate mode. Annals of Statistics, 6(4):802–812, 1978.

[32] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.

[33] N. Simon, J. H. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.

[34] C. J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.

[35] G. Swirszcz and A. C. Lozano. Multi-level lasso for sparse multi-task regression. In International Conference on Machine Learning (ICML), pages 595–602, 2012.

[36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.

[37] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.

[38] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1448–1458, 2017.

[39] Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi. CME arrival time prediction using convolutional neural network. The Astrophysical Journal, 881(1):15, 2019.

[40] Y. Wu and L. A. Stefanski. Automatic structure recovery for additive models. Biometrika, 102(2):381–395, 2015.

[41] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013.

[42] J. Yin, X. Chen, and E. P. Xing. Group sparse additive models. In International Conference on Machine Learning (ICML), pages 871–878, 2012.

[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[44] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564–2593, 2016.

[45] H. H. Zhang, G. Cheng, and Y. Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099–1112, 2011.

[46] T. Zhao and H. Liu. Sparse additive machine. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1435–1443, 2012.

Page 2: Multi-task Additive Models for Robust Estimation and Automatic … · group structure identification (CGSI)[26], and Bilevel learning of Group Lasso (BiGL) [12]. 2 Multi-task Additive

(a) Generating data

y(1)1 y(2)

1 y(T )1

y(1)2

y(1)n

y(2)2

y(2)n

y(T )2

y(T )n

y(t)i = f (t)(x(t)

i ) + ϵ Data generating model for each observation i and task t

⋯ ⋯

(b) Groupwise models with priori structure information

OutputsG1G2

GL

⋯ +

(c) Our MAM

+Outputs

y(1)1 y(2)

1 y(T )1

y(1)2

y(1)n

y(2)2

y(2)n

y(T )2

y(T )n

⋯x(t)

12 x(t)13 x(t)

14 x(t)1(Pminus1) x(t)

1P

x(t)22 x(t)

23 x(t)24 x(t)

25 x(t)2(Pminus1) x(t)

2P

⋯ ⋯ ⋯ ⋯ ⋯ ⋯x(t)

n2 x(t)n3 x(t)

n4 x(t)n5 x(t)

n(Pminus1) x(t)nP

x(t)11

x(t)21

x(t)n1

x(t)15

G1 G2 GL⋯Inactive variables

t = 1T

y(1)1 y(2)

1 y(T )1

y(1)2

y(1)n

y(2)2

y(2)n

y(T )2

y(T )n

Group structure

(Intrinsic variable structure shared by all tasks) Outputs associated with variable structure

Datasets

Sparse models toward conditional mean

Multi-task models toward conditional mode

G1G2

GL

Inactive variables

Variable Structure

Datasets

x(t)i = (x(t)

i1 ⋯ x(t)iP )T

⋯ ⋯ ⋯

⋯ ⋯ ⋯

S(t) = (x(t)i y(t)

i )ni=1

t = 1⋯ T

S(t) = (x(t)i y(t)

i )ni=1

t = 1⋯ T

Predictions

Predictions

Figure 1 A schematic comparison between previous groupwise models and MAM (a) Data gener-ating process (b) Mean-based groupwise models with a priori knowledge of group structure eggroup lasso [33 43] and group additive models [42] (c) Mode-based MAM for robust estimationand automatic structure discovery

noise heavy-tailed noise and outliers The main motivation of this paper is described in Figure 1As shown in Figure 1(a) the intrinsic variable structure for generating data is encoded by severalvariable groups G1 G2 GL where some groups also contain inactive variables For each taskt isin 1 T the output is related to different dominant groups eg G1 G2 for the first task Witha prior knowledge of group structure single-task groupwise models shown in Figure 1(b) aim toestimate the conditional mean independently eg group lasso [13 22 33 43] and group additivemodels [4 16 42] All above models are formulated based on a prior knowledge of group structureand Gaussian noise assumption However these requirements are difficult to be satisfied in realapplications eg Coronal Mass Ejections (CMEs) analysis [20]

To relax the dependence on a prior structure and Gaussian noise this paper proposes a class ofMulti-task Additive Models (MAM) by integrating additive hypothesis space mode-induced metric[6 41 10] and structure-based regularizer [12] into a bilevel learning framework The bilevellearning framework is a special kind of mathematical program related closely with optimizationschemes in [7 12] A brief overview of MAM is shown in Figure 1(c) The proposed MAM canachieve robust estimation under complex noise and realize data-driven variable structure discoveryThe main contributions of this paper are summarized as below

bull Model A new class of multi-task additive models is formulated by bringing four distinctconcepts (eg multi-task learning [2 9] sparse additive models [3 4 18 42] mode-inducedmetric [10 38] and bilevel learning framework [12 32]) together in a coherent way torealize robust and interpretable learning As far as we know these issues have not beenunified in a similar fashion before

bull Optimization An optimization algorithm is presented for the non-convex and non-smoothMAM by integrating Half Quadratic (HQ) optimization [24] and dual Forward-Backwardalgorithm with Bregman distance (DFBB) [37] into proxSAGA [30] In theory we providethe convergence analysis of the proposed optimization algorithm

bull Effectiveness Empirical effectiveness of the proposed MAM is supported by experimentalevaluations on simulated data and CMEs data Experimental results demonstrate that MAMcan identify variable structure automatically and estimate the intrinsic function efficientlyeven if the datasets are contaminated by non-Gaussian noise

2

Table 1 Algorithmic properties (X-has the given information times-hasnrsquot the given information)

RMR [38] GroupSpAM [42] CGSI[26] BIGL[12] MAM (ours)Hypothesis Space Linear Additive Additive Linear Additive

Learning Task Single-task Single-task Single-task Multi-task Multi-taskEvaluation Criterion Mode-induced Mean-induced Mean-induced Mean-induced Mode-inducedObjective Function Nonconvex Convex Convex Convex NonconvexRobust Estimation X times times times X

Sparsity on Grouped Features times X X X XSparsity on Individual Features X times times times XVariable Structure Discovery times times X X X

Related works There are some works for automatic structure discovery in additive models [26 40]and partially linear models [19 45] Different from our MAM these approaches are formulated undersingle-task framework and the MSE criterion which are sensitive to non-Gaussian noise and difficultto tackle multi-task structure discovery directly While some mode-based approaches have beendesigned for robust estimation eg regularized modal regression (RMR) [38] none of them considerthe automatic structure discovery Recently an extension of group lasso is formulated for variablestructure discovery [12] Although this approach can induce the data-driven sparsity at the grouplevel it is limited to the linear mean regression and ignores the sparsity with respect to individualfeatures To better highlight the novelty of MAM its algorithmic properties are summarized in Table1 compared with RMR [38] Group Sparse Additive Models (GroupSpAM) [42] Capacity-basedgroup structure identification (CGSI)[26] and Bilevel learning of Group Lasso (BiGL) [12]

2 Multi-task Additive Models

21 Additive models

Now recall some backgrounds of additive models [14 42 44] For the sake of readability wesummarize some necessary notations in Supplementary Material A

Let X sub RP be the input space and Y sub R be the corresponding output set We consider thefollowing data-generating model

Y = flowast(X) + ε (1)whereX isin X Y isin Y ε is a random noise and flowast is the ground truth function For simplicity denoteρ(XY ) as the intrinsic distribution generated in (1) Under the Gaussian noise assumption ieE(ε|X = x) = 0 a large family of nonparametric regression aims to estimate the conditional meanfunction flowast(x) = E(Y |X = x) However the nonparametric regression may face low convergencerate due to the so-called curse of dimensionality [18 34] This motivates the research on additivemodels [14 29] to remedy this problem

Additive Models [14 29] Let the input space X = (X1 XP )T sub RP and let the hypothesis spacewith additive structure be defined as

H =f f(u) =

Psumj=1

fj(uj) fj isin Hj u = (u1 uP )T uj isin Xj

whereHj is the component function space on Xj Usually additive models aim to find the minimizerof E(Y minus f(X))2 inH Moreover groupwise additive models have been proposed with the help of aprior knowledge of variable group eg GroupSpAM [42] and GroupSAM [4]

Let G1 G2 GL be a partition over variable indices 1 P such that Gl capGj = emptyforalll 6= jand cupLl=1Gl = 1 P In essential the main purpose of GroupSpAM [42] is to search theminimizer of

E(Y minus f(X))2 +

Lsuml=1

τl

radicsumjisinGl

E[f2j (uj)] over all f =

Lsuml=1

sumjisinGl

fj isin H

where τl is the corresponding weight for group Gl 1 le l le L

22 Mode-induced metric

Beyond the Gaussian noise assumption in [16 29 42] we impose a weaker assumption on ε iearg maxtisinR pε|X(t) = 0 where pε|X denotes the conditional density function of ε given X In

3

theory this zero-mode assumption allows for more complex cases eg Gaussian noise heavy-tailednoise skewed noise or outliers

Denote p(Y |X = x) as the conditional density function of Y given X = x By taking mode on theboth sides of (1) we obtain the conditional mode function

flowast(x) = arg maxtisinR

p(t|X = x) (2)

where arg maxtisinR p(t|X = x) is assumed to be unique for any x isin X There are direct strategy andindirect strategy for estimating flowast [31] Generally the direct approaches are intractable since theconditional mode function cannot be elicited directly [15] while the indirect estimators based onkernel density estimation (KDE) have shown promising performance [6 10 38 41]

Now we introduce a mode-induced metric [10 38] associated with KDE For any measurable functionf X rarr R the mode-induced metric is

R(f) =

intXpY |X(f(x)|X = x)dρX (x) (3)

where ρX is the marginal distribution of ρwith respect toX As discussed in [10] flowast is the maximizerof the mode-induced metricR(f) According to Theorem 5 in [10] we haveR(f) = pEf

(0) wherepEf

(0) is the density function of error random variable Ef = Y minus f(X)

Define a modal kernel φ such that forallu isin R φ(u) = φ(minusu) φ(u) gt 0 andintR φ(u)du = 1 Typical

examples of modal kernel include Gaussian kernel Logistic kernel Epanechnikov kernel Given(xi yi)ni=1 sub X times Y an empirical version ofR(f) obtained via KDE [10 27] is defined as

Rσemp(f) =1

nsumi=1

φ(yi minus f(xi)

σ

) (4)

where σ is a positive bandwidth Then denote the data-free robust metric wrt Rσemp(f) as

Rσ(f) =1

σ

intXtimesY

φ(y minus f(x)

σ

)dρ(x y) (5)

Theorem 10 in [10] states thatRσ(f) tends toR(f) when σ rarr 0

23 Mode-induced group additive models

Here we form the additive hypothesis space based on smoothing splines [16 23 29 46] Let ψjk k = 1 infin be bounded and orthonormal basis functions on Xj Then the component function

space can be defined as Bj =fj fj =

suminfink=1 βjkψjk(middot)

with the coefficient βjk j = 1 P

After truncating these basis functions to finite dimension d we get

Bj =fj fj =

dsumk=1

βjkψjk(middot)

Denote f2 =radicint

f2(x)dx It has been illustrated that fj minus fj22 = O(1d4) for the second

order Sobolev ball Bj[46] The mode-induced Group Additive Models (mGAM) can be formulatedas

f = arg maxf=

sumPj=1 fj fjisinBj

Rσemp(f)minus λΩ(f) (6)

where λ is a positive regularization parameter and the structure-based regularizer

Ω(f) =

Lsuml=1

τl

radicsumjisinGl

fj22 =

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

with group weight τl Denote Ψi =(ψ11(xi1) ψ1d(xi1) ψP1(xiP ) ψPd(xiP )

)and

β = (β11 β1d βP1 βPd)T isin RPd Given observations (xi yi)ni=1 with xi =

4

(xi1 xiP )T isin RP the mGAM can be represented as

f =

Psumj=1

fj =

Psumj=1

dsumk=1

βjkψjk(middot)

with

β = arg maxβisinRPd

1

nsumi=1

φ(yi minusΨiβ

σ

)minus λ

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

(7)

Remark 1 The mGAM is a robust extension of GroupSpAM from mean regression to mode regressionWhen each group Gl l isin 1 L is a singleton our mGAM reduces to a robust version of SpAM[29] by replacing the MSE with the robust mode-induced metric (3) In particular our mGAM isconsistent with RMR [38] when each group is a singleton and all component functions are linear

24 Multi-task additive models

To reduce the dependency of mGAM on a priori structure information this section formulates MAMby learning an augmented mGAM within a multi-task bilevel framework [11 12 25]

Let T be the number of tasks Let X (t) = (X (t)1 X (t)

P )T sub RP and Y(t) sub R be the inputspace and the output space respectively associated with the t-th task Suppose that observationsS(t) = x(t)

i y(t)i 2ni=1 sub X (t) times Y(t) are drawn from an unknown distribution ρ(t)(x y) Without

loss of generality we split each S(t) into the training set S(t)train and the validation set S(t)

val with thesame sample size n for subsequent analysis

To quantify the groups G1 GL we introduce the following unit simplex

Θ =ϑ = (ϑ1 ϑL) isin RPtimesL

∣∣∣ Lsuml=1

ϑjl = 1 0 le ϑjl le 1 j = 1 P

where each element ϑjl can be viewed as a probability that identifies whether the j-th variable belongsto group Gl It is desirable to enjoy the property that ϑjl = 1 rArr j isin Gl and ϑjl = 0 rArr j isin GlHowever we cannot mine the sparsity within each group since

sumLl=1 ϑjl = 1 j = 1 P Inspired

from [35] we introduce ν = (ν1 νP )T isin [0 1]P to screen main effect variables across all taskswhere νj 6= 0 means the j-th variable is effective

Denote Ψ(t)i =

(ψ11(x

(t)i1 ) ψ1d(x

(t)i1 ) ψP1(x

(t)iP ) ψPd(x

(t)iP )) Given S(t)

valTt=1 andS(t)

trainTt=1 our MAM can be formulated as the following bilevel optimization scheme

Outer Problem (based on validation set S(t)val)

(ϑ ν) isin arg maxϑisinΘνisin[01]P

Tsumt=1

U(β(t)(ϑ) ν) with U(β(t)(ϑ) ν) =1

nsumi=1

φ(y(t)

i minusΨ(t)i Tν β(t)(ϑ)

σ

)

where Tν is a linear operator for screening main effect variables across all tasks suchthat Tν β(t)(ϑ) = (ν1β

(t)11 (ϑ) ν1β

(t)1d (ϑ) νP β

(t)P1(ϑ) νP β

(t)Pd(ϑ))T isin RPd and

β(ϑ) = (β(t)(ϑ))1letleT is the maximizer of the following augmented mGAM

Inner Problem (based on training set S(t)train)

β(ϑ)=argmaxβ

Tsumt=1

J(β(t)) with J(β(t))=1

nsumi=1

φ(y(t)i minusΨ

(t)i β(t)

σ

)minus micro

2β(t)22minusλ

Lsuml=1

τlTϑlβ(t)2

where Tϑlβ(t) = (ϑ1lβ

(t)11 ϑ1lβ

(t)1d ϑPlβ

(t)P1 ϑPlβ

(t)Pd)

T isin RPd is used for identifyingwhich variables belong to the l-th group and the penalty term micro

2 β(t)22 with a tending-to-zero

parameter micro assures the strong-convex property for optimization

5

Finally the multi-task additive models (MAM) can be represented as below

f (t) =

Psumj=1

dsumk=1

νj β(t)jk (ϑ)ψjk(middot) t = 1 T

Let ϑThr and νThr be two threshold counterparts of ϑ and ν respectively Similar with [12] ϑThr

is determined by assigning each feature to its most dominant group For any j = 1 P νThrj is

determined by a threshold u ie νThrj = 0 if νj le u and νThr

j = 1 otherwise Then the data-drivenvariable structure can be obtained via S = (ϑThr

l νThr)1lelleL where denotes Hadamard productRemark 2 If the hyper-parameter ν equiv IP the sparsity wrt individual features would not be takeninto account for MAM In this setting our MAM is essentially a robust and nonlinear extension ofBiGL [11] by incorporating mode-induced metric and additive hypothesis spaceRemark 3 Indeed mGAM with an oracle variable structure is the baseline of MAM In other wordsthe inner problem with the estimated variable structure S aims to approximate the mGAM

Algorithm 1 Prox-SAGA for MAM

Input Data S(t)train S

(t)valTt=1 Max-Iter Z isin R The number of groups L Step-size ηϑ

Step-size ην ϑ(0) ν(0) λ micro Modal kernel φ Bandwidth σ Weights τl l = 1 LInitialization at = ϑ(0) ct = ν(0) t = 1 T g(0)

ϑ = 0PtimesL g(0)ν = 0P

for z = 0 1 Z minus 1 do1 Randomly pick set

B(z) sub 1 T denote its cardinality as |B(z)|2 Compute β(k)(ϑ(z)) based on S(k)

train forallk isin B(z)β(k)(ϑ(z))=HQ-DFBB(ϑ(z) λ σ micro τ S(k)

train)3 Update ϑ based on S(k)

val

31) Gϑ = 1|B(z)|

sumkisinB(z)

(hϑ(β(k)(ϑ(z)) ν(z))minus hϑ(β(k)(ak) ν(z))

)

32) ϑ(z) = g(z)ϑ +Gϑ

33) ϑ(z+1) = Pϑ(ϑ(z) minus ηϑϑ(z))

34) g(z+1)ϑ = g

(z)ϑ + |B(z)|

T Gϑ35) ak = ϑ(z) for every k isin B(z)

4 Update ν based on S(k)val

41) Gν = 1|B(z)|

sumkisinB(z)

(hν(β(k)(ϑ(z)) ν(z))minus hν(β(k)(ϑ(z)) ck)

)

42) ν(z) = g(z)ν +Gν

43) ν(z+1) = Pν(ν(z) minus ην ν(z))

44) g(z+1)ν = g

(z)ν + |B(z)|

T Gν 45) ck = ν(z) for every k isin B(z)

Output ϑ = ϑ(Z) ν = ν(Z) β(t)(ϑ) t = 1 T Prediction function f (t) =

sumPj=1

sumdk=1 νj β

(t)jk (ϑ)ψjk(middot) t = 1 T

Variable structure S = (ϑThrl νThr)1lelleL

3 Optimization Algorithm

To implement the non-convex and nonsmooth MAM we employ Prox-SAGA algorithm [30] withsimplex projection and box projection [8] For simplicity we define two partial derivative calculators

minusTsumt=1

partU(β(t)(ϑ) ν)

partν=

Tsumt=1

hν(β(t)(ϑ) ν) minusTsumt=1

partU(β(t)(ϑ) ν)

partϑ=

Tsumt=1

hϑ(β(t)(ϑ) ν)

It is trivial to compute $\sum_{t=1}^{T} h_\nu(\beta^{(t)}(\vartheta), \nu)$, since the parameter $\nu$ appears explicitly only in the outer problem. The optimization parameter $\vartheta$, by contrast, enters only implicitly through the solution $\beta(\vartheta)$ of the inner problem. Hence, computing $\sum_{t=1}^{T} h_\vartheta(\beta^{(t)}(\vartheta), \nu)$ requires us to develop a smooth algorithm, HQ-DFBB (combining HQ [24] and DFBB [37]), for the solution $\beta(\vartheta)$. Due to the space limitation, the optimization details, including HQ-DFBB and the two partial-derivative calculators, are provided in Supplementary Material B. Let $P_\Theta$ be the projection onto the unit simplex $\Theta$ and $P_\nu$ be the box projection onto $[0,1]^P$. The general procedure of Prox-SAGA is summarized in Algorithm 1.
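For completeness, a small sketch of the two projections used above: a Euclidean projection of each row of $\vartheta$ onto the probability simplex (a standard sort-based variant, not the faster method of [8]) and the coordinatewise box projection onto $[0,1]^P$. Names are ours.

```python
import numpy as np

def project_simplex_rows(V):
    """Project each row of V onto the probability simplex {x >= 0, sum(x) = 1}.
    Sort-based projection, O(L log L) per row."""
    V = np.atleast_2d(V)
    U = -np.sort(-V, axis=1)                          # sort each row descending
    css = np.cumsum(U, axis=1) - 1.0                  # cumulative sums minus 1
    idx = np.arange(1, V.shape[1] + 1)
    cond = U - css / idx > 0                          # prefix condition
    rho = cond.sum(axis=1)                            # largest index satisfying it
    tau = css[np.arange(V.shape[0]), rho - 1] / rho   # per-row threshold
    return np.maximum(V - tau[:, None], 0.0)

def project_box(v, lo=0.0, hi=1.0):
    """Coordinatewise projection onto the box [lo, hi]."""
    return np.clip(v, lo, hi)

# Quick check: projected rows are nonnegative and sum to one.
X = np.random.randn(5, 4)
Xp = project_simplex_rows(X)
assert np.allclose(Xp.sum(axis=1), 1.0) and (Xp >= 0).all()
```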

Remark 4. From Theorem 2.1 in [12] and Theorem 4 in [30], we know that Algorithm 1 converges only if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problem. A detailed convergence analysis of HQ-DFBB is provided in Supplementary Material C.

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data. All experiments are implemented in MATLAB 2019b on an Intel Core i7 with 16 GB memory.

4.1 Simulated data analysis

Baselines. The proposed MAM is compared with BiGL [12] in terms of variable structure recovery and prediction ability. In addition, we also consider several baselines, including Lasso [36], RMR [38], mGAM, Group Lasso (GL) [43], and GroupSpAM [42]. Note that the oracle variable structure is used as a priori knowledge for implementing mGAM, GL, and GroupSpAM.

Oracle variable structure. Set the number of tasks $T = 500$, the dimension $P = 50$ for each task, and the actual number of groups $L^* = 5$. We denote the indices of the $l$-th group by $G_l = \{1 + (l-1)(P/L^*), \ldots, l(P/L^*)\}$, $\forall l \in \{1, \ldots, L^*\}$. In addition, we randomly pick $V \subset \{1, \ldots, P\}$ to generate sparse features across all tasks. For each $j \in \{1, \ldots, P\}$ and $l \in \{1, \ldots, L^*\}$, the oracle variable structure $S^*$ can be defined as $S^*_{jl} = 1$ if $j \in V^c \cap G_l$ and $0$ otherwise.
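The oracle structure can be written down directly; below is a short NumPy sketch under the setting above ($P = 50$, $L^* = 5$, here with $|V| = 5$), with all names ours.

```python
import numpy as np

rng = np.random.default_rng(0)
P, L_star = 50, 5
group_size = P // L_star                       # 10 consecutive indices per group

# Inactive variables V (here |V| = 5), shared across all tasks.
V = set(rng.choice(P, size=5, replace=False))

# S*[j, l] = 1 if variable j is active and lies in group G_l.
S_star = np.zeros((P, L_star))
for j in range(P):
    if j not in V:
        S_star[j, j // group_size] = 1.0
```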

Parameter selection. For the hyper-parameters shared by BiGL and MAM, we set $Z = 3000$, $\mu = 10^{-3}$, $M = 5$, $Q = 100$, and $\sigma = 2$. We search the regularization parameter $\lambda$ in the range $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. Here we assume the actual number of groups is known, i.e., $L = L^*$. The weight for each group is set to $\tau_l = 1$, $\forall l \in \{1, \ldots, L\}$. Following the same strategy as in [11], we choose the initialization $\vartheta^{(0)} = P_\Theta\big( \frac{1}{L} I_{P \times L} + 0.01\, \mathcal{N}(0_{P \times L}, I_{P \times L}) \big) \in \mathbb{R}^{P \times L}$ and $\nu^{(0)} = (0.5, \ldots, 0.5)^T \in \mathbb{R}^P$.

Evaluation criteria. Denote $\hat{f}^{(t)}$ and $f^{*(t)}$ as the estimator and the ground-truth function, respectively, $1 \le t \le T$. The evaluation criteria used here include: Average Square Error (ASE) $= \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \|\hat{f}^{(t)} - y^{(t)}\|_2^2$; True Deviation (TD) $= \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \|\hat{f}^{(t)} - f^{*(t)}\|_2^2$; Variable Structure Recovery $\hat{S} = (\nu^{\mathrm{Thr}} \odot \vartheta^{\mathrm{Thr}}_l)_{1 \le l \le L}$ with the hard threshold value $u = 0.5$; Width of Prediction Intervals (WPI) and Sample Coverage Probability (SCP) with the confidence level 10%. Specifically, WPI and SCP are designed in [41] for comparing the widths of prediction intervals at the same confidence level (see Section 3.2 in [41] for more details).
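For concreteness, the first two criteria can be computed as follows (a small sketch with our own function names):

```python
import numpy as np

def average_square_error(preds, ys):
    """ASE = (1/T) * sum_t (1/n) * ||f_hat^(t) - y^(t)||_2^2."""
    return np.mean([np.mean((f - y) ** 2) for f, y in zip(preds, ys)])

def true_deviation(preds, f_stars):
    """TD = (1/T) * sum_t (1/n) * ||f_hat^(t) - f*^(t)||_2^2."""
    return np.mean([np.mean((f - fs) ** 2) for f, fs in zip(preds, f_stars)])
```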

Data sets. The training set, validation set, and test set are all drawn from $y^{(t)} = f^{*(t)}(u^{(t)}) + \varepsilon$ with the same sample size $n = 50$ for each task, where $u^{(t)} = (u_1, \ldots, u_P)^T \in \mathbb{R}^P$ is randomly drawn from the Gaussian distribution $\mathcal{N}(0_P, \frac{1}{2} I_P)$. The noise $\varepsilon$ follows Gaussian noise $\mathcal{N}(0, 0.05)$, Student noise $t(2)$, Chi-square noise $\mathcal{X}^2(2)$, and Exponential noise $\mathrm{Exp}(2)$, respectively. We randomly pick $G^{(t)} \subset \{G_1, \ldots, G_L\}$ s.t. $|G^{(t)}| = 2$ and consider the following examples of ground-truth function $f^{*(t)}$, $1 \le t \le T$.

Example A [12]. Linear component functions: $f^{*(t)}(u^{(t)}) = \sum_{G_l \in G^{(t)}} \sum_{j \in G_l \cap V^c} u^{(t)}_j \beta^{(t)}_j$, where the true regression coefficient $\beta^{(t)}_j = 1$ if $j \in G_l \cap V^c$ and $\beta^{(t)}_j = 0$ otherwise.

Example B. Denote $f^*_1(u) = 2.5 \sin(u)$, $f^*_2(u) = 2u$, $f^*_3(u) = 2e^u - e^{-1} - 1$, $f^*_4(u) = 8u^2$, and $f^*_5(u) = 3\sin(2e^u)$. The nonlinear additive function is $f^{*(t)}(u^{(t)}) = \sum_{G_l \in G^{(t)}} \sum_{j \in G_l \cap V^c} f^*_l(u^{(t)}_j)$.

Here the spline basis matrices for MAM, mGAM, and GroupSpAM are constructed with $d = 3$. In the data-generating process, we consider two cases for the number of inactive variables, i.e., $|V| = 0$ and $|V| = 5$. Due to the space limitation, we only present the results under Gaussian noise and Student noise in Table 2 and Figure 2.
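As an illustration, one task under Example B can be generated as in the sketch below. The decimal placement of the constants (e.g., $2.5\sin(u)$) and the parameterizations of the noise distributions follow our reading of the setup above and should be treated as assumptions; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
P, L, n, group_size = 50, 5, 50, 10
groups = [np.arange(l * group_size, (l + 1) * group_size) for l in range(L)]
V = set(rng.choice(P, size=5, replace=False))        # inactive variables

# Component functions of Example B (one per group), as read above.
f_comp = [lambda u: 2.5 * np.sin(u),
          lambda u: 2.0 * u,
          lambda u: 2.0 * np.exp(u) - np.exp(-1.0) - 1.0,
          lambda u: 8.0 * u ** 2,
          lambda u: 3.0 * np.sin(2.0 * np.exp(u))]

def sample_task(noise="gaussian"):
    """Draw one task: inputs U, clean responses f_true, noisy responses y."""
    U = rng.normal(0.0, np.sqrt(0.5), size=(n, P))    # u ~ N(0, 0.5 * I_P) assumed
    active = rng.choice(L, size=2, replace=False)     # G^(t): two active groups
    f_true = np.zeros(n)
    for l in active:
        for j in groups[l]:
            if j not in V:
                f_true += f_comp[l](U[:, j])
    eps = {"gaussian": rng.normal(0.0, np.sqrt(0.05), n),   # variance 0.05 assumed
           "student":  rng.standard_t(2, n),
           "chisq":    rng.chisquare(2, n),
           "exp":      rng.exponential(0.5, n)}[noise]      # Exp(2): rate-2 assumed
    return U, f_true, f_true + eps

U, f_true, y = sample_task("student")
```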


Figure 2: Variable structure discovery for Example A under different noise and sparsity index $V$ (white pixels mark the grouped variables; red pixels mark the inactive variables). Top left: Gaussian noise with $|V| = 0$. Top right: Student noise with $|V| = 0$. Bottom left: Gaussian noise with $|V| = 5$. Bottom right: Student noise with $|V| = 5$.

The remaining results, as well as several evaluations of the impact of the hyper-parameters, are provided in Supplementary Material D.1. From the reported results, even without the structure information, the proposed MAM provides regression estimation competitive with mGAM (which is given the a priori structure) and usually achieves better performance than the other competitors when the noise is non-Gaussian. Note that the actual number of groups is assumed to be known in the current evaluations, i.e., $L = L^*$. In Supplementary Material D.1, we further verify the effectiveness of MAM in the general setting $L > L^*$.

Table 2: Performance comparisons on Example A (top) and Example B (bottom) w.r.t. different criteria. Each cell reports ASE / TD / WPI (SCP).

Example A
Methods | Gaussian noise, |V| = 0 | Gaussian noise, |V| = 5 | Student noise, |V| = 0 | Student noise, |V| = 5
MAM | 0.0920 / 0.0914 / 0.0679 (0.1015) | 0.0801 / 0.0794 / 0.0595 (0.1014) | 1.0204 / 0.2467 / 0.1817 (0.1014) | 1.4279 / 0.2269 / 0.1796 (0.1020)
BiGL | 0.0715 / 0.0701 / 0.0616 (0.1014) | 0.0799 / 0.0793 / 0.0553 (0.1027) | 1.1097 / 0.3550 / 0.2061 (0.1023) | 1.4961 / 0.3544 / 0.2093 (0.1017)
mGAM | 0.0894 / 0.0885 / 0.0651 (0.1011) | 0.0795 / 0.0788 / 0.0611 (0.1016) | 1.0132 / 0.2441 / 0.1803 (0.1015) | 1.3828 / 0.2242 / 0.1725 (0.1026)
GL | 0.0683 / 0.0661 / 0.0620 (0.1014) | 0.0708 / 0.0684 / 0.0535 (0.1011) | 1.3252 / 0.3163 / 0.1935 (0.1020) | 1.4811 / 0.3576 / 0.1976 (0.1020)
Lasso | 0.2145 / 0.2138 / 0.1011 (0.1013) | 0.2204 / 0.2131 / 0.1080 (0.1017) | 3.8012 / 3.2412 / 0.5269 (0.1021) | 4.2121 / 3.5645 / 0.4899 (0.1025)
RMR | 0.2201 / 0.2196 / 0.1081 (0.1020) | 0.2224 / 0.2165 / 0.1108 (0.1029) | 1.9723 / 1.3241 / 0.3324 (0.1022) | 2.1518 / 1.3504 / 0.3451 (0.1024)

Example B
Methods | Gaussian noise, |V| = 0 | Gaussian noise, |V| = 5 | Student noise, |V| = 0 | Student noise, |V| = 5
MAM | 0.7981 / 0.7972 / 0.2252 (0.1044) | 0.7827 / 0.7826 / 0.2155 (0.1039) | 0.9772 / 0.6943 / 0.2220 (0.1038) | 0.9145 / 0.6749 / 0.2218 (0.1040)
BiGL | 0.8307 / 0.8307 / 0.2318 (0.1029) | 0.7998 / 0.7934 / 0.2250 (0.1027) | 1.0727 / 0.7525 / 0.2414 (0.1030) | 1.0004 / 0.7405 / 0.2365 (0.1034)
mGAM | 0.7969 / 0.7967 / 0.2249 (0.1042) | 0.7787 / 0.7786 / 0.2121 (0.1037) | 0.9665 / 0.6842 / 0.2203 (0.1038) | 0.9135 / 0.6753 / 0.2226 (0.1037)
GroupSpAM | 0.7914 / 0.7916 / 0.2101 (0.1033) | 0.7604 / 0.7599 / 0.2085 (0.1022) | 0.9965 / 0.7142 / 0.2303 (0.1048) | 0.9521 / 0.7077 / 0.2294 (0.1025)
GL | 0.8081 / 0.8080 / 0.2241 (0.1028) | 0.7741 / 0.7708 / 0.2177 (0.1029) | 1.0501 / 0.7306 / 0.2335 (0.1030) | 0.9591 / 0.7202 / 0.2363 (0.1029)
Lasso | 2.5779 / 2.4777 / 0.3956 (0.1027) | 2.6801 / 2.6799 / 0.4259 (0.1020) | 3.8645 / 3.7612 / 0.5218 (0.1020) | 3.7443 / 3.5036 / 0.4963 (0.1024)
RMR | 1.5160 / 1.5061 / 0.3237 (0.1028) | 1.7654 / 1.6798 / 0.3184 (0.1020) | 1.9448 / 1.7294 / 0.3587 (0.1025) | 2.1085 / 2.0077 / 0.3329 (0.1030)

4.2 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System, and it is crucial to forecast the physical parameters related to CMEs. Although machine learning approaches have been applied to these tasks recently [20, 39], there is no existing work on interpretable prediction with data-driven structure discovery. Interplanetary CME (ICME) data are provided in the Richardson and Cane List (http://www.srl.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm), from which we collect 137 ICME observations from 1996 to 2016. The features of CMEs are provided in the SOHO LASCO CME Catalog (https://cdaw.gsfc.nasa.gov/CME_list), and the in-situ solar wind parameters can be downloaded from OMNIWeb Plus (https://omniweb.gsfc.nasa.gov). The in-situ solar wind parameters at Earth are used to represent the unknown solar wind plasma [20]. A total of 21 features are chosen as inputs by combining the features of CMEs and the in-situ solar wind parameters. Five physical-parameter prediction tasks are considered as outputs: CME arrival time, Mean ICME speed, Maximum solar wind speed, Increment in solar wind speed, and Mean magnetic field strength.

We split the data of each task into a training set, a validation set, and a test set (with ratio 2 : 2 : 1) and adopt the same settings as in the simulations. Table 3 demonstrates that MAM enjoys a smaller average absolute error than the competitors. In addition, the estimated structure (via MAM) is depicted in Figure 3. From Figure 3 and Table 3, we see that group G1 (including Mass, MPA, Solar wind speed, and Vy) and group G2 (including Acceleration and Linear Speed) are significant for most tasks. In particular, G2 and G7 (2nd-order Speed at final height) can be characterized as factors that reflect the CME speed. Table 3 shows that the groups G2 and G7 play an important role in CME arrival time prediction, which is consistent with the results in [20]. In addition, the impact of the hyper-parameters is displayed in Supplementary Material D.2 due to the space limitation. Overall, the proposed MAM achieves promising performance on both prediction and structure discovery.
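For reference, the 2 : 2 : 1 split and the average absolute error (AAE) reported in Table 3 correspond to the following sketch (function names are ours; the feature extraction from the catalogs is not reproduced here).

```python
import numpy as np

def split_221(X, y, seed=0):
    """Split one task's data into train/validation/test with ratio 2:2:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(0.4 * len(y))
    n_val = int(0.4 * len(y))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def average_absolute_error(y_pred, y_true):
    """AAE = (1/n_test) * sum_i |y_pred_i - y_true_i|."""
    return np.mean(np.abs(y_pred - y_true))
```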

Table 3: Average absolute error (AAE) and dominant groups for each task.

Methods | CME arrival time: AAE (h) / Groups | Mean ICME speed: AAE (km/s) / Groups | Maximum solar wind speed: AAE (km/s) / Groups | Increment in solar wind speed: AAE (km/s) / Groups | Mean magnetic field strength: AAE (nT) / Groups
MAM | 9.07 / G1, G2, G7 | 45.41 / G1, G2, G3, G6 | 59.32 / G1, G2 | 65.38 / G1, G2, G3 | 3.47 / G1
BiGL | 11.09 / - | 53.75 / - | 46.51 / - | 89.97 / - | 5.21 / -
Lasso | 12.16 / - | 62.56 / - | 59.81 / - | 85.34 / - | 4.38 / -
RMR | 12.02 / - | 62.23 / - | 51.90 / - | 86.13 / - | 3.98 / -

Figure 3: Variable structure $\hat{S}$ (white pixel = the grouped variables, red pixel = the inactive variables). The axes index the groups 1-10 and the input features: CPA, Angular Width, Acceleration, Linear Speed, 2nd-order Speed (20 Rs), Mass, Kinetic Energy, MPA, Field magnitude average, Bx, By, Bz, Solar wind speed, Vx, Vy, Vz, Proton density, Temperature, Flow pressure, and Plasma beta.

5 Conclusion

This paper proposes multi-task additive models to achieve robust estimation and automatic structure discovery. As far as we know, it is novel to explore robust interpretable machine learning by integrating modal regression, additive models, and multi-task learning together. A computing algorithm and empirical evaluations are provided to support its effectiveness. In the future, it would be interesting to investigate robust additive models for overlapping variable structure discovery [17].

Broader Impact

The positive impacts of this work are two-fold: 1) our algorithmic framework paves a new way for mining the intrinsic feature structure among high-dimensional variables and may be a stepping stone to further explore data-driven structure discovery with overlapping groups; 2) our MAM can be applied to other fields, e.g., gene expression analysis and drug discovery. However, there is also a risk of producing unstable estimation when facing ultra-high-dimensional data.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 11671161, 12071166, 61972188, and 41574181, the Fundamental Research Funds for the Central Universities (Program No. 2662019FW003), and NSERC Grant RGPIN-2016-05024. We are grateful to the anonymous NeurIPS reviewers for their constructive comments.

References

[1] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv:2004.13912v1, 2020.
[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[3] H. Chen, G. Liu, and H. Huang. Sparse shrunk additive models. In International Conference on Machine Learning (ICML), 2020.
[4] H. Chen, X. Wang, C. Deng, and H. Huang. Group sparse additive machine. In Advances in Neural Information Processing Systems (NIPS), pages 198–208, 2017.
[5] H. Chen, Y. Wang, F. Zheng, C. Deng, and H. Huang. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2020.3005144, 2020.
[6] Y. C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489–514, 2016.
[7] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256, 2007.
[8] L. Condat. Fast projection onto the simplex and the ℓ1-ball. Mathematical Programming, 158(1-2):575–585, 2016.
[9] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.
[10] Y. Feng, J. Fan, and J. Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.
[11] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pages 1563–1572, 2018.
[12] J. Frecon, S. Salzo, and M. Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems (NIPS), pages 8301–8311, 2018.
[13] J. H. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.
[14] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. London: Chapman and Hall, 1990.
[15] C. Heinrich. The mode functional is not elicitable. Biometrika, 101(1):245–251, 2014.
[16] J. Huang, J. L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.
[17] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), pages 433–440, 2009.
[18] K. Kandasamy and Y. Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning (ICML), pages 69–78, 2016.
[19] X. Li, L. Wang, and D. Nettleton. Sparse model identification and learning for ultra-high-dimensional additive partially models. Journal of Multivariate Analysis, 173:204–228, 2019.
[20] J. Liu, Y. Ye, C. Shen, Y. Wang, and R. Erdélyi. A new tool for CME arrival time prediction using machine learning algorithms: CAT-PUMA. The Astrophysical Journal, 855(2):109–118, 2018.
[21] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. The Annals of Statistics, 46(2):781–813, 2018.
[22] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
[23] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
[24] M. Nikolova and M. K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM Journal on Scientific Computing, 27(3):937–966, 2005.
[25] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient based bilevel optimization with nonsmooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.
[26] C. Pan and M. Zhu. Group additive structure identification for kernel nonparametric regression. In Advances in Neural Information Processing Systems (NIPS), pages 4907–4916, 2017.
[27] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[28] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2):389–427, 2012.
[29] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. Journal of the Royal Statistical Society: Series B, 71:1009–1030, 2009.
[30] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1145–1153, 2016.
[31] T. W. Sager. Estimation of a multivariate mode. Annals of Statistics, 6(4):802–812, 1978.
[32] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.
[33] N. Simon, J. H. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[34] C. J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.
[35] G. Swirszcz and A. C. Lozano. Multi-level lasso for sparse multi-task regression. In International Conference on Machine Learning (ICML), pages 595–602, 2012.
[36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 73(3):267–288, 1994.
[37] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
[38] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1448–1458, 2017.
[39] Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi. CME arrival time prediction using convolutional neural network. The Astrophysical Journal, 881(1):15, 2019.
[40] Y. Wu and L. A. Stefanski. Automatic structure recovery for additive models. Biometrika, 102(2):381–395, 2015.
[41] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013.
[42] J. Yin, X. Chen, and E. P. Xing. Group sparse additive models. In International Conference on Machine Learning (ICML), pages 871–878, 2012.
[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[44] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564–2593, 2016.
[45] H. H. Zhang, G. Cheng, and Y. Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099–1112, 2011.
[46] T. Zhao and H. Liu. Sparse additive machine. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1435–1443, 2012.

Page 3: Multi-task Additive Models for Robust Estimation and Automatic … · group structure identification (CGSI)[26], and Bilevel learning of Group Lasso (BiGL) [12]. 2 Multi-task Additive

Table 1 Algorithmic properties (X-has the given information times-hasnrsquot the given information)

RMR [38] GroupSpAM [42] CGSI[26] BIGL[12] MAM (ours)Hypothesis Space Linear Additive Additive Linear Additive

Learning Task Single-task Single-task Single-task Multi-task Multi-taskEvaluation Criterion Mode-induced Mean-induced Mean-induced Mean-induced Mode-inducedObjective Function Nonconvex Convex Convex Convex NonconvexRobust Estimation X times times times X

Sparsity on Grouped Features times X X X XSparsity on Individual Features X times times times XVariable Structure Discovery times times X X X

Related works There are some works for automatic structure discovery in additive models [26 40]and partially linear models [19 45] Different from our MAM these approaches are formulated undersingle-task framework and the MSE criterion which are sensitive to non-Gaussian noise and difficultto tackle multi-task structure discovery directly While some mode-based approaches have beendesigned for robust estimation eg regularized modal regression (RMR) [38] none of them considerthe automatic structure discovery Recently an extension of group lasso is formulated for variablestructure discovery [12] Although this approach can induce the data-driven sparsity at the grouplevel it is limited to the linear mean regression and ignores the sparsity with respect to individualfeatures To better highlight the novelty of MAM its algorithmic properties are summarized in Table1 compared with RMR [38] Group Sparse Additive Models (GroupSpAM) [42] Capacity-basedgroup structure identification (CGSI)[26] and Bilevel learning of Group Lasso (BiGL) [12]

2 Multi-task Additive Models

21 Additive models

Now recall some backgrounds of additive models [14 42 44] For the sake of readability wesummarize some necessary notations in Supplementary Material A

Let X sub RP be the input space and Y sub R be the corresponding output set We consider thefollowing data-generating model

Y = flowast(X) + ε (1)whereX isin X Y isin Y ε is a random noise and flowast is the ground truth function For simplicity denoteρ(XY ) as the intrinsic distribution generated in (1) Under the Gaussian noise assumption ieE(ε|X = x) = 0 a large family of nonparametric regression aims to estimate the conditional meanfunction flowast(x) = E(Y |X = x) However the nonparametric regression may face low convergencerate due to the so-called curse of dimensionality [18 34] This motivates the research on additivemodels [14 29] to remedy this problem

Additive Models [14 29] Let the input space X = (X1 XP )T sub RP and let the hypothesis spacewith additive structure be defined as

H =f f(u) =

Psumj=1

fj(uj) fj isin Hj u = (u1 uP )T uj isin Xj

whereHj is the component function space on Xj Usually additive models aim to find the minimizerof E(Y minus f(X))2 inH Moreover groupwise additive models have been proposed with the help of aprior knowledge of variable group eg GroupSpAM [42] and GroupSAM [4]

Let G1 G2 GL be a partition over variable indices 1 P such that Gl capGj = emptyforalll 6= jand cupLl=1Gl = 1 P In essential the main purpose of GroupSpAM [42] is to search theminimizer of

E(Y minus f(X))2 +

Lsuml=1

τl

radicsumjisinGl

E[f2j (uj)] over all f =

Lsuml=1

sumjisinGl

fj isin H

where τl is the corresponding weight for group Gl 1 le l le L

22 Mode-induced metric

Beyond the Gaussian noise assumption in [16 29 42] we impose a weaker assumption on ε iearg maxtisinR pε|X(t) = 0 where pε|X denotes the conditional density function of ε given X In

3

theory this zero-mode assumption allows for more complex cases eg Gaussian noise heavy-tailednoise skewed noise or outliers

Denote p(Y |X = x) as the conditional density function of Y given X = x By taking mode on theboth sides of (1) we obtain the conditional mode function

flowast(x) = arg maxtisinR

p(t|X = x) (2)

where arg maxtisinR p(t|X = x) is assumed to be unique for any x isin X There are direct strategy andindirect strategy for estimating flowast [31] Generally the direct approaches are intractable since theconditional mode function cannot be elicited directly [15] while the indirect estimators based onkernel density estimation (KDE) have shown promising performance [6 10 38 41]

Now we introduce a mode-induced metric [10 38] associated with KDE For any measurable functionf X rarr R the mode-induced metric is

R(f) =

intXpY |X(f(x)|X = x)dρX (x) (3)

where ρX is the marginal distribution of ρwith respect toX As discussed in [10] flowast is the maximizerof the mode-induced metricR(f) According to Theorem 5 in [10] we haveR(f) = pEf

(0) wherepEf

(0) is the density function of error random variable Ef = Y minus f(X)

Define a modal kernel φ such that forallu isin R φ(u) = φ(minusu) φ(u) gt 0 andintR φ(u)du = 1 Typical

examples of modal kernel include Gaussian kernel Logistic kernel Epanechnikov kernel Given(xi yi)ni=1 sub X times Y an empirical version ofR(f) obtained via KDE [10 27] is defined as

Rσemp(f) =1

nsumi=1

φ(yi minus f(xi)

σ

) (4)

where σ is a positive bandwidth Then denote the data-free robust metric wrt Rσemp(f) as

Rσ(f) =1

σ

intXtimesY

φ(y minus f(x)

σ

)dρ(x y) (5)

Theorem 10 in [10] states thatRσ(f) tends toR(f) when σ rarr 0

23 Mode-induced group additive models

Here we form the additive hypothesis space based on smoothing splines [16 23 29 46] Let ψjk k = 1 infin be bounded and orthonormal basis functions on Xj Then the component function

space can be defined as Bj =fj fj =

suminfink=1 βjkψjk(middot)

with the coefficient βjk j = 1 P

After truncating these basis functions to finite dimension d we get

Bj =fj fj =

dsumk=1

βjkψjk(middot)

Denote f2 =radicint

f2(x)dx It has been illustrated that fj minus fj22 = O(1d4) for the second

order Sobolev ball Bj[46] The mode-induced Group Additive Models (mGAM) can be formulatedas

f = arg maxf=

sumPj=1 fj fjisinBj

Rσemp(f)minus λΩ(f) (6)

where λ is a positive regularization parameter and the structure-based regularizer

Ω(f) =

Lsuml=1

τl

radicsumjisinGl

fj22 =

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

with group weight τl Denote Ψi =(ψ11(xi1) ψ1d(xi1) ψP1(xiP ) ψPd(xiP )

)and

β = (β11 β1d βP1 βPd)T isin RPd Given observations (xi yi)ni=1 with xi =

4

(xi1 xiP )T isin RP the mGAM can be represented as

f =

Psumj=1

fj =

Psumj=1

dsumk=1

βjkψjk(middot)

with

β = arg maxβisinRPd

1

nsumi=1

φ(yi minusΨiβ

σ

)minus λ

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

(7)

Remark 1 The mGAM is a robust extension of GroupSpAM from mean regression to mode regressionWhen each group Gl l isin 1 L is a singleton our mGAM reduces to a robust version of SpAM[29] by replacing the MSE with the robust mode-induced metric (3) In particular our mGAM isconsistent with RMR [38] when each group is a singleton and all component functions are linear

24 Multi-task additive models

To reduce the dependency of mGAM on a priori structure information this section formulates MAMby learning an augmented mGAM within a multi-task bilevel framework [11 12 25]

Let T be the number of tasks Let X (t) = (X (t)1 X (t)

P )T sub RP and Y(t) sub R be the inputspace and the output space respectively associated with the t-th task Suppose that observationsS(t) = x(t)

i y(t)i 2ni=1 sub X (t) times Y(t) are drawn from an unknown distribution ρ(t)(x y) Without

loss of generality we split each S(t) into the training set S(t)train and the validation set S(t)

val with thesame sample size n for subsequent analysis

To quantify the groups G1 GL we introduce the following unit simplex

Θ =ϑ = (ϑ1 ϑL) isin RPtimesL

∣∣∣ Lsuml=1

ϑjl = 1 0 le ϑjl le 1 j = 1 P

where each element ϑjl can be viewed as a probability that identifies whether the j-th variable belongsto group Gl It is desirable to enjoy the property that ϑjl = 1 rArr j isin Gl and ϑjl = 0 rArr j isin GlHowever we cannot mine the sparsity within each group since

sumLl=1 ϑjl = 1 j = 1 P Inspired

from [35] we introduce ν = (ν1 νP )T isin [0 1]P to screen main effect variables across all taskswhere νj 6= 0 means the j-th variable is effective

Denote Ψ(t)i =

(ψ11(x

(t)i1 ) ψ1d(x

(t)i1 ) ψP1(x

(t)iP ) ψPd(x

(t)iP )) Given S(t)

valTt=1 andS(t)

trainTt=1 our MAM can be formulated as the following bilevel optimization scheme

Outer Problem (based on validation set S(t)val)

(ϑ ν) isin arg maxϑisinΘνisin[01]P

Tsumt=1

U(β(t)(ϑ) ν) with U(β(t)(ϑ) ν) =1

nsumi=1

φ(y(t)

i minusΨ(t)i Tν β(t)(ϑ)

σ

)

where Tν is a linear operator for screening main effect variables across all tasks suchthat Tν β(t)(ϑ) = (ν1β

(t)11 (ϑ) ν1β

(t)1d (ϑ) νP β

(t)P1(ϑ) νP β

(t)Pd(ϑ))T isin RPd and

β(ϑ) = (β(t)(ϑ))1letleT is the maximizer of the following augmented mGAM

Inner Problem (based on training set S(t)train)

β(ϑ)=argmaxβ

Tsumt=1

J(β(t)) with J(β(t))=1

nsumi=1

φ(y(t)i minusΨ

(t)i β(t)

σ

)minus micro

2β(t)22minusλ

Lsuml=1

τlTϑlβ(t)2

where Tϑlβ(t) = (ϑ1lβ

(t)11 ϑ1lβ

(t)1d ϑPlβ

(t)P1 ϑPlβ

(t)Pd)

T isin RPd is used for identifyingwhich variables belong to the l-th group and the penalty term micro

2 β(t)22 with a tending-to-zero

parameter micro assures the strong-convex property for optimization

5

Finally the multi-task additive models (MAM) can be represented as below

f (t) =

Psumj=1

dsumk=1

νj β(t)jk (ϑ)ψjk(middot) t = 1 T

Let ϑThr and νThr be two threshold counterparts of ϑ and ν respectively Similar with [12] ϑThr

is determined by assigning each feature to its most dominant group For any j = 1 P νThrj is

determined by a threshold u ie νThrj = 0 if νj le u and νThr

j = 1 otherwise Then the data-drivenvariable structure can be obtained via S = (ϑThr

l νThr)1lelleL where denotes Hadamard productRemark 2 If the hyper-parameter ν equiv IP the sparsity wrt individual features would not be takeninto account for MAM In this setting our MAM is essentially a robust and nonlinear extension ofBiGL [11] by incorporating mode-induced metric and additive hypothesis spaceRemark 3 Indeed mGAM with an oracle variable structure is the baseline of MAM In other wordsthe inner problem with the estimated variable structure S aims to approximate the mGAM

Algorithm 1 Prox-SAGA for MAM

Input Data S(t)train S

(t)valTt=1 Max-Iter Z isin R The number of groups L Step-size ηϑ

Step-size ην ϑ(0) ν(0) λ micro Modal kernel φ Bandwidth σ Weights τl l = 1 LInitialization at = ϑ(0) ct = ν(0) t = 1 T g(0)

ϑ = 0PtimesL g(0)ν = 0P

for z = 0 1 Z minus 1 do1 Randomly pick set

B(z) sub 1 T denote its cardinality as |B(z)|2 Compute β(k)(ϑ(z)) based on S(k)

train forallk isin B(z)β(k)(ϑ(z))=HQ-DFBB(ϑ(z) λ σ micro τ S(k)

train)3 Update ϑ based on S(k)

val

31) Gϑ = 1|B(z)|

sumkisinB(z)

(hϑ(β(k)(ϑ(z)) ν(z))minus hϑ(β(k)(ak) ν(z))

)

32) ϑ(z) = g(z)ϑ +Gϑ

33) ϑ(z+1) = Pϑ(ϑ(z) minus ηϑϑ(z))

34) g(z+1)ϑ = g

(z)ϑ + |B(z)|

T Gϑ35) ak = ϑ(z) for every k isin B(z)

4 Update ν based on S(k)val

41) Gν = 1|B(z)|

sumkisinB(z)

(hν(β(k)(ϑ(z)) ν(z))minus hν(β(k)(ϑ(z)) ck)

)

42) ν(z) = g(z)ν +Gν

43) ν(z+1) = Pν(ν(z) minus ην ν(z))

44) g(z+1)ν = g

(z)ν + |B(z)|

T Gν 45) ck = ν(z) for every k isin B(z)

Output ϑ = ϑ(Z) ν = ν(Z) β(t)(ϑ) t = 1 T Prediction function f (t) =

sumPj=1

sumdk=1 νj β

(t)jk (ϑ)ψjk(middot) t = 1 T

Variable structure S = (ϑThrl νThr)1lelleL

3 Optimization Algorithm

To implement the non-convex and nonsmooth MAM we employ Prox-SAGA algorithm [30] withsimplex projection and box projection [8] For simplicity we define two partial derivative calculators

minusTsumt=1

partU(β(t)(ϑ) ν)

partν=

Tsumt=1

hν(β(t)(ϑ) ν) minusTsumt=1

partU(β(t)(ϑ) ν)

partϑ=

Tsumt=1

hϑ(β(t)(ϑ) ν)

It is trivial to computesumTt=1 hν(β(t)(ϑ) ν) since the parameter ν only appears explicitly in the upper

problem The optimization parameter ϑ is implicit via the solution β(ϑ) of the inner problem Hence

6

computingsumTt=1 hϑ(β(t)(ϑ) ν) requires us to develop a smooth algorithm HQ-DFBB (combining

HQ [24] and DFBB [37]) for the solution β(ϑ) For the space limitation the optimization detailsincluding HQ-DFBB and two partial derivative calculators are provided in Supplementary MaterialB Let PΘ be the projection onto unit simplex Θ and Pν be the box projection onto [0 1]P Thegeneral procedure of Prox-SAGA is summarized in Algorithm 1

Remark 4 From Theorem 21 in [12] and Theorem 4 in [30] we know that Algorithm 1 convergesonly if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problemDetailed convergence analysis of HQ-DFBB is provided in Supplementary Material C

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data All experimentsare implemented in MATLAB 2019b on an intel Core i7 with 16 GB memory

41 Simulated data analysis

Baselines The proposed MAM is compared with BiGL [11] in terms of variable structure recoveryand prediction ability In addition we also consider some baselines including Lasso [36] RMR [38]mGAM Group Lasso (GL) [43] and GroupSpAM [42] Note that the oracle variable structure is apriori knowledge for implementing mGAM GL and GroupSpAM

Oracle variable structure Set the number of tasks T = 500 the dimension P = 50 for eachtask and the actual number of groups Llowast = 5 We denote the indices of l-th group by Gl =

1 + (l minus 1)(PLlowast) l(PLlowast)

foralll isin 1 Llowast In addition we randomly pick V sub 1 Pto generate sparse features across all tasks For each j isin 1 P and l isin 1 Llowast the oraclevariable structure Slowast can be defined as Slowastjl = 1 if j isin Vc capGl and 0 otherwise

Parameter selection For the same hyper-parameters in BiGL and MAM we set Z = 3000micro = 10minus3 M = 5 Q = 100 and σ = 2 We search the regularization parameter λ in therange of 10minus4 10minus3 10minus2 10minus1 Here we assume the actual number of groups is known ieL = Llowast The weight for each group is set to be τl = 1foralll isin 1 L Following the samestrategy in [11] we choose the initialization ϑ(0) = Pϑ( 1

L IPtimesL + 001N (0PtimesL IPtimesL)) isin RPtimesL

and ν(0) = (05 05)T isin RP

Evaluation criterion Denote f (t) flowast(t) as the estimator and ground truth function respectively1 le t le T Evaluation criterions used here include Average Square Error(ASE)= 1

T

sumTt=1

1nf

(t) minusy(t)22 True Deviation (TD)= 1

T

sumTt=1

1nf

(t)minus flowast(t)22 Variable Structure Recovery S = (νThrϑThrl )1lelleL with the hard threshold value u = 05 Width of Prediction Intervals (WPI) and Sample

Coverage Probability (SCP) with the confidence level 10 Specially WPI and SCP are designed in[41] for comparing the widths of the prediction intervals with the same confidence level (see Section32 in [41] for more details)

Data sets The training set validation set and test set are all drawn from y(t) = flowast(t)(u(t)) + ε withthe same sample size n = 50 for each task where u(t) = (u1 uP )T isin RP is randomly drawnfrom Gaussian distribution N (0P

12 IP ) The noise ε follows Gaussian noise N (0 005) Student

noise t(2) Chi-square noise X 2(2) and Exponential noise Exp(2) respectively We randomly pickG(t) sub G1 GL st |G(t)| = 2 and consider the following examples of ground truth functionflowast(t) 1 le t le T

Example A [12] Linear component function flowast(t)(u(t)) =sumGlisinG(t)

sumjisinGlcapVc u

(t)j β

(t)j where the

true regression coefficient β(t)j = 1 if j isin Gl cap Vc otherwise β(t)

j = 0

Example B Denote flowast1 (u) = 25 sin(u) flowast2 (u) = 2u flowast3 (u) = 2eu minus eminus1 minus 1 flowast4 (u) = 8u2 andflowast5 (u) = 3 sin(2eu) The nonlinear additive function flowast(t)(u(t)) =

sumGlisinG(t)

sumjisinGlcapVc flowastl (u

(t)j )

Here spline basis matrix for MAM mGAM and GroupSpAM are constructed with d = 3 In thedata-generating process we consider two cases of the number of inactive variables ie |V| = 0and |V| = 5 Due to the space limitation we only present the results with Gaussian noise and

7

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

Figure 2 Variable structure discovery for Example A under different noise and sparsity index V(white pixel means the grouped variables and red pixel means the inactive variables) Top left panelGaussian noise with |V| = 0 Top right panel Student noise with |V| = 0 Bottom left panelGaussian noise with |V| = 5 Bottom right panel Student noise with |V| = 5

Student noise in Table 2 and Figure 2 The remaining results as well as several evaluations on theimpact of hyper-parameters are provided in Supplementary Material D1 From the reported resultseven without the structure information the proposed MAM can provide the competitive regressionestimation with mGAM (given priori structure) and usually achieve better performance than thesecompetitors when the noise is non-Gaussian distribution Specially the actual number of groupsis assumed to be known in current evaluations ie L = Llowast In Supplementary Material D1 wefurther verify the effectiveness of MAM for the general setting L gt Llowast

Table 2 Performance comparisons on Example A (top) and Example B (bottom) wrt differentcriterions

Methods|V| = 0 (Gaussian noise) |V| = 5 (Gaussian noise) |V| = 0 (Student noise) |V| = 5 (Student noise)

ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP)

MAM 00920 00914 00679(01015) 00801 00794 00595(01014) 10204 02467 01817(01014) 14279 02269 01796(01020)

BiGL 00715 00701 00616(01014) 00799 00793 00553(01027) 11097 03550 02061(01023) 14961 03544 02093(01017)

mGAM 00894 00885 00651(01011) 00795 00788 00611(01016) 10132 02441 01803(01015) 13828 02242 01725(01026)

GL 00683 00661 00620(01014) 00708 00684 00535(01011) 13252 03163 01935(01020) 14811 03576 01976(01020)

Lasso 02145 02138 01011(01013) 02204 02131 01080(01017) 38012 32412 05269(01021) 42121 35645 04899(01025)

RMR 02201 02196 01081(01020) 02224 02165 01108(01029) 19723 13241 03324(01022) 21518 13504 03451(01024)

MAM 07981 07972 02252(01044) 07827 07826 02155(01039) 09772 06943 02220(01038) 09145 06749 02218(01040)

BiGL 08307 08307 02318(01029) 07998 07934 02250(01027) 10727 07525 02414(01030) 10004 07405 02365(01034)

mGAM 07969 07967 02249(01042) 07787 07786 02121(01037) 09665 06842 02203(01038) 09135 06753 02226(01037)

GroupSpAM 07914 07916 02101(01033) 07604 07599 02085(01022) 09965 07142 02303(01048) 09521 07077 02294(01025)

GL 08081 08080 02241(01028) 07741 07708 02177(01029) 10501 07306 02335(01030) 09591 07202 02363(01029)

Lasso 25779 24777 03956(01027) 26801 26799 04259(01020) 38645 37612 05218(01020) 37443 35036 04963(01024)

RMR 15160 15061 03237(01028) 17654 16798 03184(01020) 19448 17294 03587(01025) 21085 20077 03329(01030)

42 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System It is cru-cial to forecast the physical parameters related to CMEs Despite machine learning approacheshave been applied to these tasks recently [20 39] there is no any work for interpretable predic-tion with data-driven structure discovery Interplanetary CMEs (ICMEs) data are provided in TheRichardson and Cane List (httpwwwsrlcaltecheduACEASCDATAlevel3icmetable2htm) From this link we collect 137 ICMEs observations from 1996 to 2016The features of CMEs are provided in SOHO LASCO CME Catalog (httpscdawgsfcnasagovCME_list) In-situ solar wind parameters can be downloaded from OMNIWeb Plus(httpsomniwebgsfcnasagov) The in-situ solar wind parameters at earth is used torepresent the unknown solar wind plasma [20] A total of 21 features are chosen as input by com-bining the features of CMEs and in-situ solar wind parameters Five physical parameters predictiontasks are considered as outputs including CMEs arrive time Mean ICME speed Maximum solar

8

wind speed Increment in solar wind speed and Mean magnetic field strength We split the dataof each task into training set validation set and test set (with ratio 2 2 1) and adopt the samesettings in simulations Table 3 demonstrates that MAM enjoy smaller average absolute error thanthe competitors In addition the estimated structure (via MAM) is described in Figure 3 FromFigure 3 and Table 3 we know group G1 (including Mass MPA Solar wind speed Vy) and groupG2 (including Acceleration and Linear Speed) are significant for most tasks Particularly G2 andG7 (2nd-order Speed at final height) can be characterized as the factors that reflect the CMEs speedTable 3 shows that the groups G2 and G7 play an important role in CMEs arrive time predictionwhich is consistent with the results in [20] In addition the impact of hyper-parameter are displayedin Supplementary Material D2 due to the space limitation Overall the proposed MAM can achievethe promising performance on prediction and structure discovery

Table 3 Average absolute error and dominant group for each taskTasks CMEs arrive time Mean ICME speed Maximum solar wind speed Increment in solar wind speed Mean magnetic field strength

Methods AAE (h) Groups AAE (kms) Groups AAE (kms) Groups AAE (kms) Groups AAE (nT ) Groups

MAM 907 G1 G2 G7 4541 G1 G2 G3 G6 5932 G1 G2 6538 G1 G2 G3 347 G1BiGL 1109 - 5375 - 4651 - 8997 - 521 -Lasso 1216 - 6256 - 5981 - 8534 - 438 -RMR 1202 - 6223 - 5190 - 8613 - 398 -

1 2 3 4 5 6 7 8 9 10

CPA

Angular Width

Acceleration

Linear Speed

2nd-order Speed(20Rs)

Mass

Kinetic Energy

MPA

Field magnitude average

Bx

By

Bz

Solar wind speed

Vx

Vy

Vz

Proton density

Temperature

Flow pressure

Plasma beta

Figure 3 Variable structure S (white pixel=the grouped variables red pixel=the inactive variables)

5 Conclusion

This paper proposes the multi-task additive models to achieve robust estimation and automaticstructure discovery As far as we know it is novel to explore robust interpretable machine learningby integrating modal regression additive models and multi-task learning together The computingalgorithm and empirical evaluations are provided to support its effectiveness In the future it isinteresting to investigate robust additive models for overlapping variable structure discovery [17]

Broader Impact

The positive impacts of this work are two-fold 1) Our algorithmic framework paves a new way formining the intrinsic feature structure among high-dimensional variables and may be the steppingstone to further explore data-driven structure discovery with overlapping groups 2) Our MAM canbe applied to other fields eg gene expression analysis and drug discovery However there is also arisk of resulting an unstable estimation when facing ultra high-dimensional data

9

Acknowledgments

This work was supported by National Natural Science Foundation of China under Grant Nos11671161 12071166 61972188 41574181 the Fundamental Research Funds for the Central Univer-sities (Program No 2662019FW003) and NSERC Grant RGPIN-2016-05024 We are grateful to theanonymous NeurIPS reviewers for their constructive comments

References[1] R Agarwal N Frosst X Zhang R Caruana and G E Hinton Neural additive models

Interpretable machine learning with neural nets arXiv200413912v1 2020

[2] R Caruana Multitask learning Machine Learning 28(1)41ndash75 1997

[3] H Chen G Liu and H Huang Sparse shrunk additive models In International Conference onMachine Learning (ICML) 2020

[4] H Chen X Wang C Deng and H Huang Group sparse additive machine In Advances inNeural Information Processing Systems (NIPS) pages 198ndash208 2017

[5] H Chen Y Wang F Zheng C Deng and H Huang Sparse modal additive model IEEETransactions on Neural Networks and Learning Systems Doi 101109TNNLS202030051442020

[6] Y C Chen C R Genovese R J Tibshirani and L Wasserman Nonparametric modalregression The Annals of Statistics 44(2)489ndash514 2016

[7] B Colson P Marcotte and G Savard An overview of bilevel optimization Annals ofOperations Research 153(1)235ndash256 2007

[8] L Condat Fast projection onto the simplex and the `1-ball Mathematical Programming158(1-2)575ndash585 2016

[9] T Evgeniou and M Pontil Regularized multindashtask learning In ACM Conference on KnowledgeDiscovery and Data Mining pages 109ndash117 2004

[10] Y Feng J Fan and J Suykens A statistical learning approach to modal regression Journal ofMachine Learning Research 21(2)1ndash35 2020

[11] L Franceschi P Frasconi S Salzo R Grazzi and M Pontil Bilevel programming forhyperparameter optimization and meta-learning In International Conference on Machinelearning (ICML) pages 1563ndash1572 2018

[12] J Frecon S Salzo and M Pontil Bilevel learning of the group lasso structure In Advances inNeural Information Processing Systems (NIPS) pages 8301ndash8311 2018

[13] J H Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse group lassoarXiv 10010736 2010

[14] T J Hastie and R J Tibshirani Generalized additive models London Chapman and Hall1990

[15] C Heinrich The mode functional is not elicitable Biometrika 101(1)245ndash251 2014

[16] J Huang J L Horowitz and F Wei Variable selection in nonparametric additive models TheAnnals of Statistics 38(4)2282ndash2313 2010

[17] L Jacob G Obozinski and J Vert Group lasso with overlap and graph lasso In InternationalConference on Machine Learning (ICML) pages 433ndash440 2009

[18] K Kandasamy and Y Yu Additive approximations in high dimensional nonparametric regres-sion via the SALSA In International Conference on Machine Learning (ICML) pages 69ndash782016

10

[19] X Li L Wang and D Nettleton Sparse model identification and learning for ultra-high-dimensional additive partially models Journal of Multivariate Analysis 173204ndash228 2019

[20] J Liu Y Ye C Shen Y Wang and R Erdelyi A new tool for cme arrival time predictionusing machine learning algorithms Cat-puma The Astrophysical Journal 855(2)109ndash1182018

[21] S Lv H Lin H Lian and J Huang Oracle inequalities for sparse additive quantile regressionin reproducing kernel hilbert space The Annals of Statistics 46(2)781ndash813 2018

[22] L Meier S V De Geer and P Buhlmann The group lasso for logistic regression Journal ofThe Royal Statistical Society Series B-statistical Methodology 70(1)53ndash71 2008

[23] L Meier S V D Geer and P Buhlmann High-dimensional additive modeling The Annals ofStatistics 37(6B)3779ndash3821 2009

[24] M Nikolova and M K Ng Analysis of half-quadratic minimization methods for signal andimage recovery SIAM Journal on Scientific Computing 27(3)937ndash966 2005

[25] P Ochs R Ranftl T Brox and T Pock Techniques for gradient based bilevel optimization withnonsmooth lower level problems Journal of Mathematical Imaging and Vision 56(2)175ndash1942016

[26] C Pan and M Zhu Group additive structure identification for kernel nonparametric regressionIn Advances in Neural Information Processing Systems (NIPS) pages 4907ndash4916 2017

[27] E Parzen On estimation of a probability density function and mode Annals of MathematicalStatistics 33(3)1065ndash1076 1962

[28] G Raskutti M J Wainwright and B Yu Minimax-optimal rates for sparse additive models overkernel classes via convex programming Journal of Machine Learning Research 13(2)389ndash4272012

[29] P Ravikumar H Liu J Lafferty and L Wasserman SpAM sparse additive models Journalof the Royal Statistical Society Series B 711009ndash1030 2009

[30] S J Reddi S Sra B Poczos and A J Smola Proximal stochastic methods for nonsmoothnonconvex finite-sum optimization In Advances in Neural Information Processing Systems(NIPS) pages 1145ndash1153 2016

[31] T W Sager Estimation of a multivariate mode Annals of Statistics 6(4)802ndash812 1978

[32] N Shervashidze and F Bach Learning the structure for structured sparsity IEEE Transactionson Signal Processing 63(18)4894ndash4902 2015

[33] N Simon J H Friedman T Hastie and R Tibshirani A sparse-group lasso Journal ofComputational and Graphical Statistics 22(2)231ndash245 2013

[34] C J Stone Additive regression and other nonparametric models The Annals of Statistics13(2)689ndash705 1985

[35] G Swirszcz and A C Lozano Multi-level lasso for sparse multi-task regression In Interna-tional Conference on Machine Learning (ICML) pages 595ndash602 2012

[36] R Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B 73(3)267ndash288 1994

[37] Q Van Nguyen Forward-backward splitting with bregman distances Vietnam Journal ofMathematics 45(3)519ndash539 2017

[38] X Wang H Chen W Cai D Shen and H Huang Regularized modal regression withapplications in cognitive impairment prediction In Advances in Neural Information ProcessingSystems (NIPS) pages 1448ndash1458 2017

11

[39] Y Wang J Liu Y Jiang and R Erdeacutelyi Cme arrival time prediction using convolutionalneural network The Astrophysical Journal 881(11)15 2019

[40] Y Wu and L A Stefanski Automatic structure recovery for additive models Biometrika102(2)381ndash395 2015

[41] W Yao and L Li A new regression model modal linear regression Scandinavian Journal ofStatistics 41(3)656ndash671 2013

[42] J Yin X Chen and E P Xing Group sparse additive models In International Conference onMachine Learning (ICML) pages 871ndash878 2012

[43] M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of The Royal Statistical Society Series B-statistical Methodology 68(1)49ndash67 2006

[44] M Yuan and D X Zhou Minimax optimal rates of estimation in high dimensional additivemodels The Annals of Statistics 44(6)2564ndash2593 2016

[45] H H Zhang G Cheng and Y Liu Linear or nonlinear automatic structure discovery forpartially linear models Journal of the American Statistical Association 106(495)1099ndash11122011

[46] T Zhao and H Liu Sparse additive machine In International Conference on ArtificialIntelligence and Statistics (AISTATS) pages 1435ndash1443 2012

12

  • Introduction
  • Multi-task Additive Models
    • Additive models
    • Mode-induced metric
    • Mode-induced group additive models
    • Multi-task additive models
      • Optimization Algorithm
      • Experiments
        • Simulated data analysis
        • Coronal mass ejection analysis
          • Conclusion
Page 4: Multi-task Additive Models for Robust Estimation and Automatic … · group structure identification (CGSI)[26], and Bilevel learning of Group Lasso (BiGL) [12]. 2 Multi-task Additive

theory this zero-mode assumption allows for more complex cases eg Gaussian noise heavy-tailednoise skewed noise or outliers

Denote p(Y |X = x) as the conditional density function of Y given X = x By taking mode on theboth sides of (1) we obtain the conditional mode function

flowast(x) = arg maxtisinR

p(t|X = x) (2)

where arg maxtisinR p(t|X = x) is assumed to be unique for any x isin X There are direct strategy andindirect strategy for estimating flowast [31] Generally the direct approaches are intractable since theconditional mode function cannot be elicited directly [15] while the indirect estimators based onkernel density estimation (KDE) have shown promising performance [6 10 38 41]

Now we introduce a mode-induced metric [10 38] associated with KDE For any measurable functionf X rarr R the mode-induced metric is

R(f) =

intXpY |X(f(x)|X = x)dρX (x) (3)

where ρX is the marginal distribution of ρwith respect toX As discussed in [10] flowast is the maximizerof the mode-induced metricR(f) According to Theorem 5 in [10] we haveR(f) = pEf

(0) wherepEf

(0) is the density function of error random variable Ef = Y minus f(X)

Define a modal kernel φ such that forallu isin R φ(u) = φ(minusu) φ(u) gt 0 andintR φ(u)du = 1 Typical

examples of modal kernel include Gaussian kernel Logistic kernel Epanechnikov kernel Given(xi yi)ni=1 sub X times Y an empirical version ofR(f) obtained via KDE [10 27] is defined as

Rσemp(f) =1

nsumi=1

φ(yi minus f(xi)

σ

) (4)

where σ is a positive bandwidth Then denote the data-free robust metric wrt Rσemp(f) as

Rσ(f) =1

σ

intXtimesY

φ(y minus f(x)

σ

)dρ(x y) (5)

Theorem 10 in [10] states thatRσ(f) tends toR(f) when σ rarr 0

23 Mode-induced group additive models

Here we form the additive hypothesis space based on smoothing splines [16 23 29 46] Let ψjk k = 1 infin be bounded and orthonormal basis functions on Xj Then the component function

space can be defined as Bj =fj fj =

suminfink=1 βjkψjk(middot)

with the coefficient βjk j = 1 P

After truncating these basis functions to finite dimension d we get

Bj =fj fj =

dsumk=1

βjkψjk(middot)

Denote f2 =radicint

f2(x)dx It has been illustrated that fj minus fj22 = O(1d4) for the second

order Sobolev ball Bj[46] The mode-induced Group Additive Models (mGAM) can be formulatedas

f = arg maxf=

sumPj=1 fj fjisinBj

Rσemp(f)minus λΩ(f) (6)

where λ is a positive regularization parameter and the structure-based regularizer

Ω(f) =

Lsuml=1

τl

radicsumjisinGl

fj22 =

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

with group weight τl Denote Ψi =(ψ11(xi1) ψ1d(xi1) ψP1(xiP ) ψPd(xiP )

)and

β = (β11 β1d βP1 βPd)T isin RPd Given observations (xi yi)ni=1 with xi =

4

(xi1 xiP )T isin RP the mGAM can be represented as

f =

Psumj=1

fj =

Psumj=1

dsumk=1

βjkψjk(middot)

with

β = arg maxβisinRPd

1

nsumi=1

φ(yi minusΨiβ

σ

)minus λ

Lsuml=1

τl

radicradicradicradicsumjisinGl

dsumk=1

β2jk

(7)

Remark 1 The mGAM is a robust extension of GroupSpAM from mean regression to mode regressionWhen each group Gl l isin 1 L is a singleton our mGAM reduces to a robust version of SpAM[29] by replacing the MSE with the robust mode-induced metric (3) In particular our mGAM isconsistent with RMR [38] when each group is a singleton and all component functions are linear

24 Multi-task additive models

To reduce the dependency of mGAM on a priori structure information this section formulates MAMby learning an augmented mGAM within a multi-task bilevel framework [11 12 25]

Let T be the number of tasks Let X (t) = (X (t)1 X (t)

P )T sub RP and Y(t) sub R be the inputspace and the output space respectively associated with the t-th task Suppose that observationsS(t) = x(t)

i y(t)i 2ni=1 sub X (t) times Y(t) are drawn from an unknown distribution ρ(t)(x y) Without

loss of generality we split each S(t) into the training set S(t)train and the validation set S(t)

val with thesame sample size n for subsequent analysis

To quantify the groups G1 GL we introduce the following unit simplex

Θ =ϑ = (ϑ1 ϑL) isin RPtimesL

∣∣∣ Lsuml=1

ϑjl = 1 0 le ϑjl le 1 j = 1 P

where each element ϑjl can be viewed as a probability that identifies whether the j-th variable belongsto group Gl It is desirable to enjoy the property that ϑjl = 1 rArr j isin Gl and ϑjl = 0 rArr j isin GlHowever we cannot mine the sparsity within each group since

sumLl=1 ϑjl = 1 j = 1 P Inspired

from [35] we introduce ν = (ν1 νP )T isin [0 1]P to screen main effect variables across all taskswhere νj 6= 0 means the j-th variable is effective

Denote Ψ(t)i =

(ψ11(x

(t)i1 ) ψ1d(x

(t)i1 ) ψP1(x

(t)iP ) ψPd(x

(t)iP )) Given S(t)

valTt=1 andS(t)

trainTt=1 our MAM can be formulated as the following bilevel optimization scheme

Outer Problem (based on validation set S(t)val)

(ϑ ν) isin arg maxϑisinΘνisin[01]P

Tsumt=1

U(β(t)(ϑ) ν) with U(β(t)(ϑ) ν) =1

nsumi=1

φ(y(t)

i minusΨ(t)i Tν β(t)(ϑ)

σ

)

where Tν is a linear operator for screening main effect variables across all tasks suchthat Tν β(t)(ϑ) = (ν1β

(t)11 (ϑ) ν1β

(t)1d (ϑ) νP β

(t)P1(ϑ) νP β

(t)Pd(ϑ))T isin RPd and

β(ϑ) = (β(t)(ϑ))1letleT is the maximizer of the following augmented mGAM

Inner Problem (based on training set S(t)train)

β(ϑ)=argmaxβ

Tsumt=1

J(β(t)) with J(β(t))=1

nsumi=1

φ(y(t)i minusΨ

(t)i β(t)

σ

)minus micro

2β(t)22minusλ

Lsuml=1

τlTϑlβ(t)2

where Tϑlβ(t) = (ϑ1lβ

(t)11 ϑ1lβ

(t)1d ϑPlβ

(t)P1 ϑPlβ

(t)Pd)

T isin RPd is used for identifyingwhich variables belong to the l-th group and the penalty term micro

2 β(t)22 with a tending-to-zero

parameter micro assures the strong-convex property for optimization

5

Finally the multi-task additive models (MAM) can be represented as below

f (t) =

Psumj=1

dsumk=1

νj β(t)jk (ϑ)ψjk(middot) t = 1 T

Let ϑThr and νThr be two threshold counterparts of ϑ and ν respectively Similar with [12] ϑThr

is determined by assigning each feature to its most dominant group For any j = 1 P νThrj is

determined by a threshold u ie νThrj = 0 if νj le u and νThr

j = 1 otherwise Then the data-drivenvariable structure can be obtained via S = (ϑThr

l νThr)1lelleL where denotes Hadamard productRemark 2 If the hyper-parameter ν equiv IP the sparsity wrt individual features would not be takeninto account for MAM In this setting our MAM is essentially a robust and nonlinear extension ofBiGL [11] by incorporating mode-induced metric and additive hypothesis spaceRemark 3 Indeed mGAM with an oracle variable structure is the baseline of MAM In other wordsthe inner problem with the estimated variable structure S aims to approximate the mGAM

Algorithm 1 Prox-SAGA for MAM

Input Data S(t)train S

(t)valTt=1 Max-Iter Z isin R The number of groups L Step-size ηϑ

Step-size ην ϑ(0) ν(0) λ micro Modal kernel φ Bandwidth σ Weights τl l = 1 LInitialization at = ϑ(0) ct = ν(0) t = 1 T g(0)

ϑ = 0PtimesL g(0)ν = 0P

for z = 0 1 Z minus 1 do1 Randomly pick set

B(z) sub 1 T denote its cardinality as |B(z)|2 Compute β(k)(ϑ(z)) based on S(k)

train forallk isin B(z)β(k)(ϑ(z))=HQ-DFBB(ϑ(z) λ σ micro τ S(k)

train)3 Update ϑ based on S(k)

val

31) Gϑ = 1|B(z)|

sumkisinB(z)

(hϑ(β(k)(ϑ(z)) ν(z))minus hϑ(β(k)(ak) ν(z))

)

32) ϑ(z) = g(z)ϑ +Gϑ

33) ϑ(z+1) = Pϑ(ϑ(z) minus ηϑϑ(z))

34) g(z+1)ϑ = g

(z)ϑ + |B(z)|

T Gϑ35) ak = ϑ(z) for every k isin B(z)

4 Update ν based on S(k)val

41) Gν = 1|B(z)|

sumkisinB(z)

(hν(β(k)(ϑ(z)) ν(z))minus hν(β(k)(ϑ(z)) ck)

)

42) ν(z) = g(z)ν +Gν

43) ν(z+1) = Pν(ν(z) minus ην ν(z))

44) g(z+1)ν = g

(z)ν + |B(z)|

T Gν 45) ck = ν(z) for every k isin B(z)

Output ϑ = ϑ(Z) ν = ν(Z) β(t)(ϑ) t = 1 T Prediction function f (t) =

sumPj=1

sumdk=1 νj β

(t)jk (ϑ)ψjk(middot) t = 1 T

Variable structure S = (ϑThrl νThr)1lelleL

3 Optimization Algorithm

To implement the non-convex and non-smooth MAM, we employ the Prox-SAGA algorithm [30] with a simplex projection and a box projection [8]. For simplicity, we define two partial-derivative calculators:

$$-\sum_{t=1}^{T}\frac{\partial U(\beta^{(t)}(\vartheta),\nu)}{\partial\nu}=\sum_{t=1}^{T}h_{\nu}(\beta^{(t)}(\vartheta),\nu),\qquad -\sum_{t=1}^{T}\frac{\partial U(\beta^{(t)}(\vartheta),\nu)}{\partial\vartheta}=\sum_{t=1}^{T}h_{\vartheta}(\beta^{(t)}(\vartheta),\nu).$$

It is trivial to compute $\sum_{t=1}^{T}h_{\nu}(\beta^{(t)}(\vartheta),\nu)$, since the parameter $\nu$ only appears explicitly in the outer problem. The optimization parameter $\vartheta$, however, enters only implicitly through the solution $\beta(\vartheta)$ of the inner problem. Hence, computing $\sum_{t=1}^{T}h_{\vartheta}(\beta^{(t)}(\vartheta),\nu)$ requires us to develop a smooth algorithm, HQ-DFBB (combining HQ [24] and DFBB [37]), for the solution $\beta(\vartheta)$. Due to the space limitation, the optimization details, including HQ-DFBB and the two partial-derivative calculators, are provided in Supplementary Material B. Let $\mathcal{P}_{\Theta}$ be the projection onto the unit simplex $\Theta$ and $\mathcal{P}_{\nu}$ be the box projection onto $[0,1]^P$. The general procedure of Prox-SAGA is summarized in Algorithm 1.
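
The two projections can be sketched as follows. The sort-based simplex projection is the standard $O(P\log P)$ routine applied to each row of $\vartheta$ (Condat [8] provides faster variants), and the box projection is a simple clipping; the function names are ours.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def proj_theta(theta):
    """Row-wise projection of the P x L matrix theta onto the unit simplex Theta."""
    return np.apply_along_axis(proj_simplex, 1, theta)

def proj_box(nu):
    """Box projection of nu onto [0, 1]^P."""
    return np.clip(nu, 0.0, 1.0)
```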

Remark 4. From Theorem 2.1 in [12] and Theorem 4 in [30], we know that Algorithm 1 converges only if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problem. A detailed convergence analysis of HQ-DFBB is provided in Supplementary Material C.
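
For intuition, one outer iteration of Algorithm 1 can be sketched as below. Here `inner_solver` stands in for HQ-DFBB and `h_theta`/`h_nu` for the partial-derivative calculators of Supplementary Material B (all assumed to be available as callables), while the projection helpers are those sketched above; this illustrates the SAGA-style update rule under our assumptions and is not the authors' implementation.

```python
import numpy as np

def prox_saga_step(theta, nu, g_theta, g_nu, a, c, tasks,
                   inner_solver, h_theta, h_nu,
                   eta_theta, eta_nu, batch_size, rng):
    """One outer iteration (step z) of the Prox-SAGA scheme in Algorithm 1.

    theta, nu     : current outer variables (P x L matrix, length-P vector)
    g_theta, g_nu : running SAGA aggregates of the stored partial derivatives
    a, c          : per-task memories of theta and nu (lists of length T)
    tasks         : per-task data, tasks[k] = (S_train_k, S_val_k)
    inner_solver  : callable standing in for HQ-DFBB, returns beta^(k)(theta)
    h_theta, h_nu : callables standing in for the partial-derivative calculators
    """
    T = len(tasks)
    batch = rng.choice(T, size=batch_size, replace=False)

    G_theta = np.zeros_like(theta)
    G_nu = np.zeros_like(nu)
    for k in batch:
        S_train, S_val = tasks[k]
        beta_new = inner_solver(theta, S_train)      # beta^(k)(theta^(z))
        beta_old = inner_solver(a[k], S_train)       # beta^(k)(a_k)
        G_theta += h_theta(beta_new, nu, S_val) - h_theta(beta_old, nu, S_val)
        G_nu += h_nu(beta_new, nu, S_val) - h_nu(beta_new, c[k], S_val)
    G_theta /= batch_size
    G_nu /= batch_size

    # variance-reduced estimates followed by projected gradient steps
    theta_next = proj_theta(theta - eta_theta * (g_theta + G_theta))
    nu_next = proj_box(nu - eta_nu * (g_nu + G_nu))

    # update the SAGA aggregates and the per-task memories
    g_theta = g_theta + (batch_size / T) * G_theta
    g_nu = g_nu + (batch_size / T) * G_nu
    for k in batch:
        a[k] = theta.copy()
        c[k] = nu.copy()

    return theta_next, nu_next, g_theta, g_nu
```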

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data. All experiments are implemented in MATLAB 2019b on an Intel Core i7 with 16 GB memory.

4.1 Simulated data analysis

Baselines. The proposed MAM is compared with BiGL [12] in terms of variable structure recovery and prediction ability. In addition, we also consider several baselines, including Lasso [36], RMR [38], mGAM, Group Lasso (GL) [43], and GroupSpAM [42]. Note that the oracle variable structure is a priori knowledge for implementing mGAM, GL, and GroupSpAM.

Oracle variable structure. Set the number of tasks $T=500$, the dimension $P=50$ for each task, and the actual number of groups $L^{*}=5$. We denote the indices of the $l$-th group by $G_l=\{1+(l-1)(P/L^{*}),\dots,l(P/L^{*})\}$, $\forall l\in\{1,\dots,L^{*}\}$. In addition, we randomly pick $\mathcal{V}\subset\{1,\dots,P\}$ to generate sparse features across all tasks. For each $j\in\{1,\dots,P\}$ and $l\in\{1,\dots,L^{*}\}$, the oracle variable structure $S^{*}$ is defined as $S^{*}_{jl}=1$ if $j\in\mathcal{V}^{c}\cap G_l$ and $0$ otherwise.
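
The oracle structure is easy to materialize; the sketch below builds $S^{*}$ for equal-sized contiguous groups and a random inactive set $\mathcal{V}$ (the variable names and fixed seed are ours).

```python
import numpy as np

def oracle_structure(P=50, L_star=5, n_inactive=5, seed=0):
    """Build the oracle variable structure S* used in the simulations.

    Group l contains variables {1+(l-1)P/L*, ..., l*P/L*}; variables in the
    random set V are inactive across all tasks, so their rows are zeroed out.
    """
    rng = np.random.default_rng(seed)
    size = P // L_star
    S_star = np.zeros((P, L_star))
    for l in range(L_star):
        S_star[l * size:(l + 1) * size, l] = 1.0

    V = rng.choice(P, size=n_inactive, replace=False)  # inactive variables
    S_star[V, :] = 0.0
    return S_star, V
```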

Parameter selection. For the same hyper-parameters in BiGL and MAM, we set $Z=3000$, $\mu=10^{-3}$, $M=5$, $Q=100$, and $\sigma=2$. We search the regularization parameter $\lambda$ in the range $\{10^{-4},10^{-3},10^{-2},10^{-1}\}$. Here we assume the actual number of groups is known, i.e., $L=L^{*}$. The weight for each group is set to $\tau_l=1$, $\forall l\in\{1,\dots,L\}$. Following the same strategy in [11], we choose the initialization $\vartheta^{(0)}=\mathcal{P}_{\Theta}\big(\tfrac{1}{L}I_{P\times L}+0.01\,\mathcal{N}(0_{P\times L},I_{P\times L})\big)\in\mathbb{R}^{P\times L}$ and $\nu^{(0)}=(0.5,\dots,0.5)^T\in\mathbb{R}^{P}$.

Evaluation criteria. Denote $\hat{f}^{(t)}$ and $f^{*(t)}$ as the estimator and the ground-truth function, respectively, $1\le t\le T$. The evaluation criteria used here include: Average Square Error (ASE) $=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\|\hat{f}^{(t)}-y^{(t)}\|_2^2$; True Deviation (TD) $=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\|\hat{f}^{(t)}-f^{*(t)}\|_2^2$; Variable Structure Recovery $\hat{S}=(\hat{\nu}^{\mathrm{Thr}}\odot\hat{\vartheta}_l^{\mathrm{Thr}})_{1\le l\le L}$ with the hard threshold value $u=0.5$; and Width of Prediction Intervals (WPI) and Sample Coverage Probability (SCP) with the confidence level $10\%$. Specially, WPI and SCP are designed in [41] for comparing the widths of prediction intervals at the same confidence level (see Section 3.2 in [41] for more details).
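
The two error criteria translate directly into code; in the sketch below, `F_hat`, `Y`, and `F_star` are hypothetical $(T\times n)$ arrays holding the fitted values, observed responses, and ground-truth function values.

```python
import numpy as np

def ase_td(F_hat, Y, F_star):
    """Average Square Error and True Deviation averaged over the T tasks.

    F_hat, Y, F_star : (T, n) fitted values, responses, ground-truth values
    """
    n = Y.shape[1]
    ase = np.mean(np.sum((F_hat - Y) ** 2, axis=1) / n)
    td = np.mean(np.sum((F_hat - F_star) ** 2, axis=1) / n)
    return ase, td
```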

Data sets. The training set, validation set, and test set are all drawn from $y^{(t)}=f^{*(t)}(u^{(t)})+\varepsilon$ with the same sample size $n=50$ for each task, where $u^{(t)}=(u_1,\dots,u_P)^T\in\mathbb{R}^P$ is randomly drawn from the Gaussian distribution $\mathcal{N}(0_P,\tfrac{1}{2}I_P)$. The noise $\varepsilon$ follows Gaussian noise $\mathcal{N}(0,0.05)$, Student noise $t(2)$, Chi-square noise $\mathcal{X}^2(2)$, and Exponential noise $\mathrm{Exp}(2)$, respectively. We randomly pick $\mathcal{G}^{(t)}\subset\{G_1,\dots,G_L\}$ s.t. $|\mathcal{G}^{(t)}|=2$ and consider the following examples of the ground-truth function $f^{*(t)}$, $1\le t\le T$.

Example A [12] (linear component functions): $f^{*(t)}(u^{(t)})=\sum_{G_l\in\mathcal{G}^{(t)}}\sum_{j\in G_l\cap\mathcal{V}^c}u_j^{(t)}\beta_j^{(t)}$, where the true regression coefficient $\beta_j^{(t)}=1$ if $j\in G_l\cap\mathcal{V}^c$ and $\beta_j^{(t)}=0$ otherwise.

Example B (nonlinear component functions): Denote $f_1^{*}(u)=2.5\sin(u)$, $f_2^{*}(u)=2u$, $f_3^{*}(u)=2e^{u}-e^{-1}-1$, $f_4^{*}(u)=8u^{2}$, and $f_5^{*}(u)=3\sin(2e^{u})$. The nonlinear additive function is $f^{*(t)}(u^{(t)})=\sum_{G_l\in\mathcal{G}^{(t)}}\sum_{j\in G_l\cap\mathcal{V}^c}f_l^{*}(u_j^{(t)})$.
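
A sketch of the data-generating process for Example B is given below. How the dispersion parameters are read (e.g., $\mathcal{N}(0,0.05)$ as variance $0.05$, $\mathrm{Exp}(2)$ as scale $2$) is our assumption, and all names are illustrative.

```python
import numpy as np

# component functions f*_1,...,f*_5 of Example B
f_comp = [lambda u: 2.5 * np.sin(u),
          lambda u: 2.0 * u,
          lambda u: 2.0 * np.exp(u) - np.exp(-1.0) - 1.0,
          lambda u: 8.0 * u ** 2,
          lambda u: 3.0 * np.sin(2.0 * np.exp(u))]

def make_task(groups, V, P=50, n=50, noise="gaussian", seed=0):
    """Draw (u, y) for one task of Example B.

    groups : list of index arrays G_1,...,G_L (here L = 5)
    V      : indices of inactive variables (excluded from every task)
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, np.sqrt(0.5), size=(n, P))   # u ~ N(0_P, (1/2) I_P), read as variance 1/2

    # two active groups per task
    active = rng.choice(len(groups), size=2, replace=False)
    f_star = np.zeros(n)
    for l in active:
        for j in np.setdiff1d(groups[l], V):
            f_star += f_comp[l](u[:, j])

    # noise options; scale conventions are our reading of the paper's notation
    noise_draw = {
        "gaussian": lambda: rng.normal(0.0, np.sqrt(0.05), n),
        "student": lambda: rng.standard_t(2, n),
        "chi2": lambda: rng.chisquare(2, n),
        "exp": lambda: rng.exponential(2.0, n),
    }[noise]()
    return u, f_star + noise_draw
```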

Here the spline basis matrices for MAM, mGAM, and GroupSpAM are constructed with $d=3$. In the data-generating process, we consider two cases for the number of inactive variables, i.e., $|\mathcal{V}|=0$ and $|\mathcal{V}|=5$. Due to the space limitation, we only present the results with Gaussian noise and Student noise in Table 2 and Figure 2.


Figure 2: Variable structure discovery for Example A under different noise and sparsity index $\mathcal{V}$ (white pixels denote the grouped variables and red pixels denote the inactive variables). Top left panel: Gaussian noise with $|\mathcal{V}|=0$. Top right panel: Student noise with $|\mathcal{V}|=0$. Bottom left panel: Gaussian noise with $|\mathcal{V}|=5$. Bottom right panel: Student noise with $|\mathcal{V}|=5$.

The remaining results, as well as several evaluations of the impact of the hyper-parameters, are provided in Supplementary Material D.1. From the reported results, even without the structure information, the proposed MAM provides regression estimation competitive with mGAM (which is given the a priori structure) and usually achieves better performance than these competitors when the noise is non-Gaussian. Specially, the actual number of groups is assumed to be known in the current evaluations, i.e., $L=L^{*}$. In Supplementary Material D.1, we further verify the effectiveness of MAM for the general setting $L>L^{*}$.

Table 2: Performance comparisons on Example A (top) and Example B (bottom) w.r.t. different criteria. Each cell reports ASE / TD / WPI (SCP).

Example A
Methods    | |V| = 0 (Gaussian noise)          | |V| = 5 (Gaussian noise)          | |V| = 0 (Student noise)           | |V| = 5 (Student noise)
MAM        | 0.0920 / 0.0914 / 0.0679 (0.1015) | 0.0801 / 0.0794 / 0.0595 (0.1014) | 1.0204 / 0.2467 / 0.1817 (0.1014) | 1.4279 / 0.2269 / 0.1796 (0.1020)
BiGL       | 0.0715 / 0.0701 / 0.0616 (0.1014) | 0.0799 / 0.0793 / 0.0553 (0.1027) | 1.1097 / 0.3550 / 0.2061 (0.1023) | 1.4961 / 0.3544 / 0.2093 (0.1017)
mGAM       | 0.0894 / 0.0885 / 0.0651 (0.1011) | 0.0795 / 0.0788 / 0.0611 (0.1016) | 1.0132 / 0.2441 / 0.1803 (0.1015) | 1.3828 / 0.2242 / 0.1725 (0.1026)
GL         | 0.0683 / 0.0661 / 0.0620 (0.1014) | 0.0708 / 0.0684 / 0.0535 (0.1011) | 1.3252 / 0.3163 / 0.1935 (0.1020) | 1.4811 / 0.3576 / 0.1976 (0.1020)
Lasso      | 0.2145 / 0.2138 / 0.1011 (0.1013) | 0.2204 / 0.2131 / 0.1080 (0.1017) | 3.8012 / 3.2412 / 0.5269 (0.1021) | 4.2121 / 3.5645 / 0.4899 (0.1025)
RMR        | 0.2201 / 0.2196 / 0.1081 (0.1020) | 0.2224 / 0.2165 / 0.1108 (0.1029) | 1.9723 / 1.3241 / 0.3324 (0.1022) | 2.1518 / 1.3504 / 0.3451 (0.1024)

Example B
Methods    | |V| = 0 (Gaussian noise)          | |V| = 5 (Gaussian noise)          | |V| = 0 (Student noise)           | |V| = 5 (Student noise)
MAM        | 0.7981 / 0.7972 / 0.2252 (0.1044) | 0.7827 / 0.7826 / 0.2155 (0.1039) | 0.9772 / 0.6943 / 0.2220 (0.1038) | 0.9145 / 0.6749 / 0.2218 (0.1040)
BiGL       | 0.8307 / 0.8307 / 0.2318 (0.1029) | 0.7998 / 0.7934 / 0.2250 (0.1027) | 1.0727 / 0.7525 / 0.2414 (0.1030) | 1.0004 / 0.7405 / 0.2365 (0.1034)
mGAM       | 0.7969 / 0.7967 / 0.2249 (0.1042) | 0.7787 / 0.7786 / 0.2121 (0.1037) | 0.9665 / 0.6842 / 0.2203 (0.1038) | 0.9135 / 0.6753 / 0.2226 (0.1037)
GroupSpAM  | 0.7914 / 0.7916 / 0.2101 (0.1033) | 0.7604 / 0.7599 / 0.2085 (0.1022) | 0.9965 / 0.7142 / 0.2303 (0.1048) | 0.9521 / 0.7077 / 0.2294 (0.1025)
GL         | 0.8081 / 0.8080 / 0.2241 (0.1028) | 0.7741 / 0.7708 / 0.2177 (0.1029) | 1.0501 / 0.7306 / 0.2335 (0.1030) | 0.9591 / 0.7202 / 0.2363 (0.1029)
Lasso      | 2.5779 / 2.4777 / 0.3956 (0.1027) | 2.6801 / 2.6799 / 0.4259 (0.1020) | 3.8645 / 3.7612 / 0.5218 (0.1020) | 3.7443 / 3.5036 / 0.4963 (0.1024)
RMR        | 1.5160 / 1.5061 / 0.3237 (0.1028) | 1.7654 / 1.6798 / 0.3184 (0.1020) | 1.9448 / 1.7294 / 0.3587 (0.1025) | 2.1085 / 2.0077 / 0.3329 (0.1030)

4.2 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System, and it is crucial to forecast the physical parameters related to CMEs. Although machine learning approaches have been applied to these tasks recently [20, 39], there is no work on interpretable prediction with data-driven structure discovery. Interplanetary CME (ICME) data are provided in the Richardson and Cane List (http://www.srl.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm), from which we collect 137 ICME observations from 1996 to 2016. The features of CMEs are provided in the SOHO LASCO CME Catalog (https://cdaw.gsfc.nasa.gov/CME_list). In-situ solar wind parameters can be downloaded from OMNIWeb Plus (https://omniweb.gsfc.nasa.gov). The in-situ solar wind parameters at Earth are used to represent the unknown solar wind plasma [20]. A total of 21 features are chosen as input by combining the features of CMEs and the in-situ solar wind parameters. Five physical-parameter prediction tasks are considered as outputs, including CME arrival time, Mean ICME speed, Maximum solar wind speed, Increment in solar wind speed, and Mean magnetic field strength. We split the data of each task into training set, validation set, and test set (with ratio 2:2:1) and adopt the same settings as in the simulations. Table 3 demonstrates that MAM enjoys a smaller average absolute error than the competitors. In addition, the estimated structure (via MAM) is depicted in Figure 3. From Figure 3 and Table 3, we know that group G1 (including Mass, MPA, Solar wind speed, and Vy) and group G2 (including Acceleration and Linear Speed) are significant for most tasks. Particularly, G2 and G7 (2nd-order Speed at final height) can be characterized as the factors that reflect the CME speed. Table 3 shows that the groups G2 and G7 play an important role in CME arrival time prediction, which is consistent with the results in [20]. In addition, the impact of the hyper-parameters is displayed in Supplementary Material D.2 due to the space limitation. Overall, the proposed MAM achieves promising performance on prediction and structure discovery.

Table 3: Average absolute error (AAE) and dominant groups for each task. Each cell reports AAE / dominant groups.

Methods | CME arrival time, AAE (h) | Mean ICME speed, AAE (km/s) | Maximum solar wind speed, AAE (km/s) | Increment in solar wind speed, AAE (km/s) | Mean magnetic field strength, AAE (nT)
MAM     | 9.07 / G1, G2, G7         | 45.41 / G1, G2, G3, G6      | 59.32 / G1, G2                       | 65.38 / G1, G2, G3                        | 3.47 / G1
BiGL    | 11.09 / --                | 53.75 / --                  | 46.51 / --                           | 89.97 / --                                | 5.21 / --
Lasso   | 12.16 / --                | 62.56 / --                  | 59.81 / --                           | 85.34 / --                                | 4.38 / --
RMR     | 12.02 / --                | 62.23 / --                  | 51.90 / --                           | 86.13 / --                                | 3.98 / --

Figure 3: Variable structure $\hat{S}$ (white pixel = the grouped variables, red pixel = the inactive variables). Columns index the groups $G_1,\dots,G_{10}$; rows index the input features (CPA, Angular Width, Acceleration, Linear Speed, 2nd-order Speed (20 Rs), Mass, Kinetic Energy, MPA, Field magnitude average, Bx, By, Bz, Solar wind speed, Vx, Vy, Vz, Proton density, Temperature, Flow pressure, Plasma beta).

5 Conclusion

This paper proposes multi-task additive models to achieve robust estimation and automatic structure discovery. As far as we know, it is novel to explore robust interpretable machine learning by integrating modal regression, additive models, and multi-task learning together. The computing algorithm and empirical evaluations are provided to support its effectiveness. In the future, it would be interesting to investigate robust additive models for overlapping variable structure discovery [17].

Broader Impact

The positive impacts of this work are two-fold: 1) our algorithmic framework paves a new way for mining the intrinsic feature structure among high-dimensional variables and may be the stepping stone to further explore data-driven structure discovery with overlapping groups; 2) our MAM can be applied to other fields, e.g., gene expression analysis and drug discovery. However, there is also a risk of resulting in an unstable estimation when facing ultra high-dimensional data.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 11671161, 12071166, 61972188, and 41574181, the Fundamental Research Funds for the Central Universities (Program No. 2662019FW003), and NSERC Grant RGPIN-2016-05024. We are grateful to the anonymous NeurIPS reviewers for their constructive comments.

References

[1] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv:2004.13912v1, 2020.
[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[3] H. Chen, G. Liu, and H. Huang. Sparse shrunk additive models. In International Conference on Machine Learning (ICML), 2020.
[4] H. Chen, X. Wang, C. Deng, and H. Huang. Group sparse additive machine. In Advances in Neural Information Processing Systems (NIPS), pages 198–208, 2017.
[5] H. Chen, Y. Wang, F. Zheng, C. Deng, and H. Huang. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2020.3005144, 2020.
[6] Y. C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489–514, 2016.
[7] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256, 2007.
[8] L. Condat. Fast projection onto the simplex and the ℓ1-ball. Mathematical Programming, 158(1-2):575–585, 2016.
[9] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.
[10] Y. Feng, J. Fan, and J. Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.
[11] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pages 1563–1572, 2018.
[12] J. Frecon, S. Salzo, and M. Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems (NIPS), pages 8301–8311, 2018.
[13] J. H. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.
[14] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
[15] C. Heinrich. The mode functional is not elicitable. Biometrika, 101(1):245–251, 2014.
[16] J. Huang, J. L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.
[17] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), pages 433–440, 2009.
[18] K. Kandasamy and Y. Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning (ICML), pages 69–78, 2016.
[19] X. Li, L. Wang, and D. Nettleton. Sparse model identification and learning for ultra-high-dimensional additive partially linear models. Journal of Multivariate Analysis, 173:204–228, 2019.
[20] J. Liu, Y. Ye, C. Shen, Y. Wang, and R. Erdelyi. A new tool for CME arrival time prediction using machine learning algorithms: CAT-PUMA. The Astrophysical Journal, 855(2):109–118, 2018.
[21] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. The Annals of Statistics, 46(2):781–813, 2018.
[22] L. Meier, S. V. De Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.
[23] L. Meier, S. V. D. Geer, and P. Buhlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
[24] M. Nikolova and M. K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM Journal on Scientific Computing, 27(3):937–966, 2005.
[25] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient-based bilevel optimization with nonsmooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.
[26] C. Pan and M. Zhu. Group additive structure identification for kernel nonparametric regression. In Advances in Neural Information Processing Systems (NIPS), pages 4907–4916, 2017.
[27] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[28] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2):389–427, 2012.
[29] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. Journal of the Royal Statistical Society, Series B, 71:1009–1030, 2009.
[30] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1145–1153, 2016.
[31] T. W. Sager. Estimation of a multivariate mode. Annals of Statistics, 6(4):802–812, 1978.
[32] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.
[33] N. Simon, J. H. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[34] C. J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.
[35] G. Swirszcz and A. C. Lozano. Multi-level lasso for sparse multi-task regression. In International Conference on Machine Learning (ICML), pages 595–602, 2012.
[36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[37] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
[38] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1448–1458, 2017.
[39] Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi. CME arrival time prediction using convolutional neural network. The Astrophysical Journal, 881(1):15, 2019.
[40] Y. Wu and L. A. Stefanski. Automatic structure recovery for additive models. Biometrika, 102(2):381–395, 2015.
[41] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013.
[42] J. Yin, X. Chen, and E. P. Xing. Group sparse additive models. In International Conference on Machine Learning (ICML), pages 871–878, 2012.
[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
[44] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564–2593, 2016.
[45] H. H. Zhang, G. Cheng, and Y. Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099–1112, 2011.
[46] T. Zhao and H. Liu. Sparse additive machine. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1435–1443, 2012.

where each element ϑjl can be viewed as a probability that identifies whether the j-th variable belongsto group Gl It is desirable to enjoy the property that ϑjl = 1 rArr j isin Gl and ϑjl = 0 rArr j isin GlHowever we cannot mine the sparsity within each group since

sumLl=1 ϑjl = 1 j = 1 P Inspired

from [35] we introduce ν = (ν1 νP )T isin [0 1]P to screen main effect variables across all taskswhere νj 6= 0 means the j-th variable is effective

Denote Ψ(t)i =

(ψ11(x

(t)i1 ) ψ1d(x

(t)i1 ) ψP1(x

(t)iP ) ψPd(x

(t)iP )) Given S(t)

valTt=1 andS(t)

trainTt=1 our MAM can be formulated as the following bilevel optimization scheme

Outer Problem (based on validation set S(t)val)

(ϑ ν) isin arg maxϑisinΘνisin[01]P

Tsumt=1

U(β(t)(ϑ) ν) with U(β(t)(ϑ) ν) =1

nsumi=1

φ(y(t)

i minusΨ(t)i Tν β(t)(ϑ)

σ

)

where Tν is a linear operator for screening main effect variables across all tasks suchthat Tν β(t)(ϑ) = (ν1β

(t)11 (ϑ) ν1β

(t)1d (ϑ) νP β

(t)P1(ϑ) νP β

(t)Pd(ϑ))T isin RPd and

β(ϑ) = (β(t)(ϑ))1letleT is the maximizer of the following augmented mGAM

Inner Problem (based on training set S(t)train)

β(ϑ)=argmaxβ

Tsumt=1

J(β(t)) with J(β(t))=1

nsumi=1

φ(y(t)i minusΨ

(t)i β(t)

σ

)minus micro

2β(t)22minusλ

Lsuml=1

τlTϑlβ(t)2

where Tϑlβ(t) = (ϑ1lβ

(t)11 ϑ1lβ

(t)1d ϑPlβ

(t)P1 ϑPlβ

(t)Pd)

T isin RPd is used for identifyingwhich variables belong to the l-th group and the penalty term micro

2 β(t)22 with a tending-to-zero

parameter micro assures the strong-convex property for optimization

5

Finally the multi-task additive models (MAM) can be represented as below

f (t) =

Psumj=1

dsumk=1

νj β(t)jk (ϑ)ψjk(middot) t = 1 T

Let ϑThr and νThr be two threshold counterparts of ϑ and ν respectively Similar with [12] ϑThr

is determined by assigning each feature to its most dominant group For any j = 1 P νThrj is

determined by a threshold u ie νThrj = 0 if νj le u and νThr

j = 1 otherwise Then the data-drivenvariable structure can be obtained via S = (ϑThr

l νThr)1lelleL where denotes Hadamard productRemark 2 If the hyper-parameter ν equiv IP the sparsity wrt individual features would not be takeninto account for MAM In this setting our MAM is essentially a robust and nonlinear extension ofBiGL [11] by incorporating mode-induced metric and additive hypothesis spaceRemark 3 Indeed mGAM with an oracle variable structure is the baseline of MAM In other wordsthe inner problem with the estimated variable structure S aims to approximate the mGAM

Algorithm 1 Prox-SAGA for MAM

Input Data S(t)train S

(t)valTt=1 Max-Iter Z isin R The number of groups L Step-size ηϑ

Step-size ην ϑ(0) ν(0) λ micro Modal kernel φ Bandwidth σ Weights τl l = 1 LInitialization at = ϑ(0) ct = ν(0) t = 1 T g(0)

ϑ = 0PtimesL g(0)ν = 0P

for z = 0 1 Z minus 1 do1 Randomly pick set

B(z) sub 1 T denote its cardinality as |B(z)|2 Compute β(k)(ϑ(z)) based on S(k)

train forallk isin B(z)β(k)(ϑ(z))=HQ-DFBB(ϑ(z) λ σ micro τ S(k)

train)3 Update ϑ based on S(k)

val

31) Gϑ = 1|B(z)|

sumkisinB(z)

(hϑ(β(k)(ϑ(z)) ν(z))minus hϑ(β(k)(ak) ν(z))

)

32) ϑ(z) = g(z)ϑ +Gϑ

33) ϑ(z+1) = Pϑ(ϑ(z) minus ηϑϑ(z))

34) g(z+1)ϑ = g

(z)ϑ + |B(z)|

T Gϑ35) ak = ϑ(z) for every k isin B(z)

4 Update ν based on S(k)val

41) Gν = 1|B(z)|

sumkisinB(z)

(hν(β(k)(ϑ(z)) ν(z))minus hν(β(k)(ϑ(z)) ck)

)

42) ν(z) = g(z)ν +Gν

43) ν(z+1) = Pν(ν(z) minus ην ν(z))

44) g(z+1)ν = g

(z)ν + |B(z)|

T Gν 45) ck = ν(z) for every k isin B(z)

Output ϑ = ϑ(Z) ν = ν(Z) β(t)(ϑ) t = 1 T Prediction function f (t) =

sumPj=1

sumdk=1 νj β

(t)jk (ϑ)ψjk(middot) t = 1 T

Variable structure S = (ϑThrl νThr)1lelleL

3 Optimization Algorithm

To implement the non-convex and nonsmooth MAM we employ Prox-SAGA algorithm [30] withsimplex projection and box projection [8] For simplicity we define two partial derivative calculators

minusTsumt=1

partU(β(t)(ϑ) ν)

partν=

Tsumt=1

hν(β(t)(ϑ) ν) minusTsumt=1

partU(β(t)(ϑ) ν)

partϑ=

Tsumt=1

hϑ(β(t)(ϑ) ν)

It is trivial to computesumTt=1 hν(β(t)(ϑ) ν) since the parameter ν only appears explicitly in the upper

problem The optimization parameter ϑ is implicit via the solution β(ϑ) of the inner problem Hence

6

computingsumTt=1 hϑ(β(t)(ϑ) ν) requires us to develop a smooth algorithm HQ-DFBB (combining

HQ [24] and DFBB [37]) for the solution β(ϑ) For the space limitation the optimization detailsincluding HQ-DFBB and two partial derivative calculators are provided in Supplementary MaterialB Let PΘ be the projection onto unit simplex Θ and Pν be the box projection onto [0 1]P Thegeneral procedure of Prox-SAGA is summarized in Algorithm 1

Remark 4 From Theorem 21 in [12] and Theorem 4 in [30] we know that Algorithm 1 convergesonly if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problemDetailed convergence analysis of HQ-DFBB is provided in Supplementary Material C

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data All experimentsare implemented in MATLAB 2019b on an intel Core i7 with 16 GB memory

41 Simulated data analysis

Baselines The proposed MAM is compared with BiGL [11] in terms of variable structure recoveryand prediction ability In addition we also consider some baselines including Lasso [36] RMR [38]mGAM Group Lasso (GL) [43] and GroupSpAM [42] Note that the oracle variable structure is apriori knowledge for implementing mGAM GL and GroupSpAM

Oracle variable structure Set the number of tasks T = 500 the dimension P = 50 for eachtask and the actual number of groups Llowast = 5 We denote the indices of l-th group by Gl =

1 + (l minus 1)(PLlowast) l(PLlowast)

foralll isin 1 Llowast In addition we randomly pick V sub 1 Pto generate sparse features across all tasks For each j isin 1 P and l isin 1 Llowast the oraclevariable structure Slowast can be defined as Slowastjl = 1 if j isin Vc capGl and 0 otherwise

Parameter selection For the same hyper-parameters in BiGL and MAM we set Z = 3000micro = 10minus3 M = 5 Q = 100 and σ = 2 We search the regularization parameter λ in therange of 10minus4 10minus3 10minus2 10minus1 Here we assume the actual number of groups is known ieL = Llowast The weight for each group is set to be τl = 1foralll isin 1 L Following the samestrategy in [11] we choose the initialization ϑ(0) = Pϑ( 1

L IPtimesL + 001N (0PtimesL IPtimesL)) isin RPtimesL

and ν(0) = (05 05)T isin RP

Evaluation criterion Denote f (t) flowast(t) as the estimator and ground truth function respectively1 le t le T Evaluation criterions used here include Average Square Error(ASE)= 1

T

sumTt=1

1nf

(t) minusy(t)22 True Deviation (TD)= 1

T

sumTt=1

1nf

(t)minus flowast(t)22 Variable Structure Recovery S = (νThrϑThrl )1lelleL with the hard threshold value u = 05 Width of Prediction Intervals (WPI) and Sample

Coverage Probability (SCP) with the confidence level 10 Specially WPI and SCP are designed in[41] for comparing the widths of the prediction intervals with the same confidence level (see Section32 in [41] for more details)

Data sets The training set validation set and test set are all drawn from y(t) = flowast(t)(u(t)) + ε withthe same sample size n = 50 for each task where u(t) = (u1 uP )T isin RP is randomly drawnfrom Gaussian distribution N (0P

12 IP ) The noise ε follows Gaussian noise N (0 005) Student

noise t(2) Chi-square noise X 2(2) and Exponential noise Exp(2) respectively We randomly pickG(t) sub G1 GL st |G(t)| = 2 and consider the following examples of ground truth functionflowast(t) 1 le t le T

Example A [12] Linear component function flowast(t)(u(t)) =sumGlisinG(t)

sumjisinGlcapVc u

(t)j β

(t)j where the

true regression coefficient β(t)j = 1 if j isin Gl cap Vc otherwise β(t)

j = 0

Example B Denote flowast1 (u) = 25 sin(u) flowast2 (u) = 2u flowast3 (u) = 2eu minus eminus1 minus 1 flowast4 (u) = 8u2 andflowast5 (u) = 3 sin(2eu) The nonlinear additive function flowast(t)(u(t)) =

sumGlisinG(t)

sumjisinGlcapVc flowastl (u

(t)j )

Here spline basis matrix for MAM mGAM and GroupSpAM are constructed with d = 3 In thedata-generating process we consider two cases of the number of inactive variables ie |V| = 0and |V| = 5 Due to the space limitation we only present the results with Gaussian noise and

7

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

Figure 2 Variable structure discovery for Example A under different noise and sparsity index V(white pixel means the grouped variables and red pixel means the inactive variables) Top left panelGaussian noise with |V| = 0 Top right panel Student noise with |V| = 0 Bottom left panelGaussian noise with |V| = 5 Bottom right panel Student noise with |V| = 5

Student noise in Table 2 and Figure 2 The remaining results as well as several evaluations on theimpact of hyper-parameters are provided in Supplementary Material D1 From the reported resultseven without the structure information the proposed MAM can provide the competitive regressionestimation with mGAM (given priori structure) and usually achieve better performance than thesecompetitors when the noise is non-Gaussian distribution Specially the actual number of groupsis assumed to be known in current evaluations ie L = Llowast In Supplementary Material D1 wefurther verify the effectiveness of MAM for the general setting L gt Llowast

Table 2 Performance comparisons on Example A (top) and Example B (bottom) wrt differentcriterions

Methods|V| = 0 (Gaussian noise) |V| = 5 (Gaussian noise) |V| = 0 (Student noise) |V| = 5 (Student noise)

ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP)

MAM 00920 00914 00679(01015) 00801 00794 00595(01014) 10204 02467 01817(01014) 14279 02269 01796(01020)

BiGL 00715 00701 00616(01014) 00799 00793 00553(01027) 11097 03550 02061(01023) 14961 03544 02093(01017)

mGAM 00894 00885 00651(01011) 00795 00788 00611(01016) 10132 02441 01803(01015) 13828 02242 01725(01026)

GL 00683 00661 00620(01014) 00708 00684 00535(01011) 13252 03163 01935(01020) 14811 03576 01976(01020)

Lasso 02145 02138 01011(01013) 02204 02131 01080(01017) 38012 32412 05269(01021) 42121 35645 04899(01025)

RMR 02201 02196 01081(01020) 02224 02165 01108(01029) 19723 13241 03324(01022) 21518 13504 03451(01024)

MAM 07981 07972 02252(01044) 07827 07826 02155(01039) 09772 06943 02220(01038) 09145 06749 02218(01040)

BiGL 08307 08307 02318(01029) 07998 07934 02250(01027) 10727 07525 02414(01030) 10004 07405 02365(01034)

mGAM 07969 07967 02249(01042) 07787 07786 02121(01037) 09665 06842 02203(01038) 09135 06753 02226(01037)

GroupSpAM 07914 07916 02101(01033) 07604 07599 02085(01022) 09965 07142 02303(01048) 09521 07077 02294(01025)

GL 08081 08080 02241(01028) 07741 07708 02177(01029) 10501 07306 02335(01030) 09591 07202 02363(01029)

Lasso 25779 24777 03956(01027) 26801 26799 04259(01020) 38645 37612 05218(01020) 37443 35036 04963(01024)

RMR 15160 15061 03237(01028) 17654 16798 03184(01020) 19448 17294 03587(01025) 21085 20077 03329(01030)

42 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System It is cru-cial to forecast the physical parameters related to CMEs Despite machine learning approacheshave been applied to these tasks recently [20 39] there is no any work for interpretable predic-tion with data-driven structure discovery Interplanetary CMEs (ICMEs) data are provided in TheRichardson and Cane List (httpwwwsrlcaltecheduACEASCDATAlevel3icmetable2htm) From this link we collect 137 ICMEs observations from 1996 to 2016The features of CMEs are provided in SOHO LASCO CME Catalog (httpscdawgsfcnasagovCME_list) In-situ solar wind parameters can be downloaded from OMNIWeb Plus(httpsomniwebgsfcnasagov) The in-situ solar wind parameters at earth is used torepresent the unknown solar wind plasma [20] A total of 21 features are chosen as input by com-bining the features of CMEs and in-situ solar wind parameters Five physical parameters predictiontasks are considered as outputs including CMEs arrive time Mean ICME speed Maximum solar

8

wind speed Increment in solar wind speed and Mean magnetic field strength We split the dataof each task into training set validation set and test set (with ratio 2 2 1) and adopt the samesettings in simulations Table 3 demonstrates that MAM enjoy smaller average absolute error thanthe competitors In addition the estimated structure (via MAM) is described in Figure 3 FromFigure 3 and Table 3 we know group G1 (including Mass MPA Solar wind speed Vy) and groupG2 (including Acceleration and Linear Speed) are significant for most tasks Particularly G2 andG7 (2nd-order Speed at final height) can be characterized as the factors that reflect the CMEs speedTable 3 shows that the groups G2 and G7 play an important role in CMEs arrive time predictionwhich is consistent with the results in [20] In addition the impact of hyper-parameter are displayedin Supplementary Material D2 due to the space limitation Overall the proposed MAM can achievethe promising performance on prediction and structure discovery

Table 3 Average absolute error and dominant group for each taskTasks CMEs arrive time Mean ICME speed Maximum solar wind speed Increment in solar wind speed Mean magnetic field strength

Methods AAE (h) Groups AAE (kms) Groups AAE (kms) Groups AAE (kms) Groups AAE (nT ) Groups

MAM 907 G1 G2 G7 4541 G1 G2 G3 G6 5932 G1 G2 6538 G1 G2 G3 347 G1BiGL 1109 - 5375 - 4651 - 8997 - 521 -Lasso 1216 - 6256 - 5981 - 8534 - 438 -RMR 1202 - 6223 - 5190 - 8613 - 398 -

1 2 3 4 5 6 7 8 9 10

CPA

Angular Width

Acceleration

Linear Speed

2nd-order Speed(20Rs)

Mass

Kinetic Energy

MPA

Field magnitude average

Bx

By

Bz

Solar wind speed

Vx

Vy

Vz

Proton density

Temperature

Flow pressure

Plasma beta

Figure 3 Variable structure S (white pixel=the grouped variables red pixel=the inactive variables)

5 Conclusion

This paper proposes the multi-task additive models to achieve robust estimation and automaticstructure discovery As far as we know it is novel to explore robust interpretable machine learningby integrating modal regression additive models and multi-task learning together The computingalgorithm and empirical evaluations are provided to support its effectiveness In the future it isinteresting to investigate robust additive models for overlapping variable structure discovery [17]

Broader Impact

The positive impacts of this work are two-fold 1) Our algorithmic framework paves a new way formining the intrinsic feature structure among high-dimensional variables and may be the steppingstone to further explore data-driven structure discovery with overlapping groups 2) Our MAM canbe applied to other fields eg gene expression analysis and drug discovery However there is also arisk of resulting an unstable estimation when facing ultra high-dimensional data

9

Acknowledgments

This work was supported by National Natural Science Foundation of China under Grant Nos11671161 12071166 61972188 41574181 the Fundamental Research Funds for the Central Univer-sities (Program No 2662019FW003) and NSERC Grant RGPIN-2016-05024 We are grateful to theanonymous NeurIPS reviewers for their constructive comments

References[1] R Agarwal N Frosst X Zhang R Caruana and G E Hinton Neural additive models

Interpretable machine learning with neural nets arXiv200413912v1 2020

[2] R Caruana Multitask learning Machine Learning 28(1)41ndash75 1997

[3] H Chen G Liu and H Huang Sparse shrunk additive models In International Conference onMachine Learning (ICML) 2020

[4] H Chen X Wang C Deng and H Huang Group sparse additive machine In Advances inNeural Information Processing Systems (NIPS) pages 198ndash208 2017

[5] H Chen Y Wang F Zheng C Deng and H Huang Sparse modal additive model IEEETransactions on Neural Networks and Learning Systems Doi 101109TNNLS202030051442020

[6] Y C Chen C R Genovese R J Tibshirani and L Wasserman Nonparametric modalregression The Annals of Statistics 44(2)489ndash514 2016

[7] B Colson P Marcotte and G Savard An overview of bilevel optimization Annals ofOperations Research 153(1)235ndash256 2007

[8] L Condat Fast projection onto the simplex and the `1-ball Mathematical Programming158(1-2)575ndash585 2016

[9] T Evgeniou and M Pontil Regularized multindashtask learning In ACM Conference on KnowledgeDiscovery and Data Mining pages 109ndash117 2004

[10] Y Feng J Fan and J Suykens A statistical learning approach to modal regression Journal ofMachine Learning Research 21(2)1ndash35 2020

[11] L Franceschi P Frasconi S Salzo R Grazzi and M Pontil Bilevel programming forhyperparameter optimization and meta-learning In International Conference on Machinelearning (ICML) pages 1563ndash1572 2018

[12] J Frecon S Salzo and M Pontil Bilevel learning of the group lasso structure In Advances inNeural Information Processing Systems (NIPS) pages 8301ndash8311 2018

[13] J H Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse group lassoarXiv 10010736 2010

[14] T J Hastie and R J Tibshirani Generalized additive models London Chapman and Hall1990

[15] C Heinrich The mode functional is not elicitable Biometrika 101(1)245ndash251 2014

[16] J Huang J L Horowitz and F Wei Variable selection in nonparametric additive models TheAnnals of Statistics 38(4)2282ndash2313 2010

[17] L Jacob G Obozinski and J Vert Group lasso with overlap and graph lasso In InternationalConference on Machine Learning (ICML) pages 433ndash440 2009

[18] K Kandasamy and Y Yu Additive approximations in high dimensional nonparametric regres-sion via the SALSA In International Conference on Machine Learning (ICML) pages 69ndash782016

10

[19] X Li L Wang and D Nettleton Sparse model identification and learning for ultra-high-dimensional additive partially models Journal of Multivariate Analysis 173204ndash228 2019

[20] J Liu Y Ye C Shen Y Wang and R Erdelyi A new tool for cme arrival time predictionusing machine learning algorithms Cat-puma The Astrophysical Journal 855(2)109ndash1182018

[21] S Lv H Lin H Lian and J Huang Oracle inequalities for sparse additive quantile regressionin reproducing kernel hilbert space The Annals of Statistics 46(2)781ndash813 2018

[22] L Meier S V De Geer and P Buhlmann The group lasso for logistic regression Journal ofThe Royal Statistical Society Series B-statistical Methodology 70(1)53ndash71 2008

[23] L Meier S V D Geer and P Buhlmann High-dimensional additive modeling The Annals ofStatistics 37(6B)3779ndash3821 2009

[24] M Nikolova and M K Ng Analysis of half-quadratic minimization methods for signal andimage recovery SIAM Journal on Scientific Computing 27(3)937ndash966 2005

[25] P Ochs R Ranftl T Brox and T Pock Techniques for gradient based bilevel optimization withnonsmooth lower level problems Journal of Mathematical Imaging and Vision 56(2)175ndash1942016

[26] C Pan and M Zhu Group additive structure identification for kernel nonparametric regressionIn Advances in Neural Information Processing Systems (NIPS) pages 4907ndash4916 2017

[27] E Parzen On estimation of a probability density function and mode Annals of MathematicalStatistics 33(3)1065ndash1076 1962

[28] G Raskutti M J Wainwright and B Yu Minimax-optimal rates for sparse additive models overkernel classes via convex programming Journal of Machine Learning Research 13(2)389ndash4272012

[29] P Ravikumar H Liu J Lafferty and L Wasserman SpAM sparse additive models Journalof the Royal Statistical Society Series B 711009ndash1030 2009

[30] S J Reddi S Sra B Poczos and A J Smola Proximal stochastic methods for nonsmoothnonconvex finite-sum optimization In Advances in Neural Information Processing Systems(NIPS) pages 1145ndash1153 2016

[31] T W Sager Estimation of a multivariate mode Annals of Statistics 6(4)802ndash812 1978

[32] N Shervashidze and F Bach Learning the structure for structured sparsity IEEE Transactionson Signal Processing 63(18)4894ndash4902 2015

[33] N Simon J H Friedman T Hastie and R Tibshirani A sparse-group lasso Journal ofComputational and Graphical Statistics 22(2)231ndash245 2013

[34] C J Stone Additive regression and other nonparametric models The Annals of Statistics13(2)689ndash705 1985

[35] G Swirszcz and A C Lozano Multi-level lasso for sparse multi-task regression In Interna-tional Conference on Machine Learning (ICML) pages 595ndash602 2012

[36] R Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B 73(3)267ndash288 1994

[37] Q Van Nguyen Forward-backward splitting with bregman distances Vietnam Journal ofMathematics 45(3)519ndash539 2017

[38] X Wang H Chen W Cai D Shen and H Huang Regularized modal regression withapplications in cognitive impairment prediction In Advances in Neural Information ProcessingSystems (NIPS) pages 1448ndash1458 2017

11

[39] Y Wang J Liu Y Jiang and R Erdeacutelyi Cme arrival time prediction using convolutionalneural network The Astrophysical Journal 881(11)15 2019

[40] Y Wu and L A Stefanski Automatic structure recovery for additive models Biometrika102(2)381ndash395 2015

[41] W Yao and L Li A new regression model modal linear regression Scandinavian Journal ofStatistics 41(3)656ndash671 2013

[42] J Yin X Chen and E P Xing Group sparse additive models In International Conference onMachine Learning (ICML) pages 871ndash878 2012

[43] M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of The Royal Statistical Society Series B-statistical Methodology 68(1)49ndash67 2006

[44] M Yuan and D X Zhou Minimax optimal rates of estimation in high dimensional additivemodels The Annals of Statistics 44(6)2564ndash2593 2016

[45] H H Zhang G Cheng and Y Liu Linear or nonlinear automatic structure discovery forpartially linear models Journal of the American Statistical Association 106(495)1099ndash11122011

[46] T Zhao and H Liu Sparse additive machine In International Conference on ArtificialIntelligence and Statistics (AISTATS) pages 1435ndash1443 2012

12

  • Introduction
  • Multi-task Additive Models
    • Additive models
    • Mode-induced metric
    • Mode-induced group additive models
    • Multi-task additive models
      • Optimization Algorithm
      • Experiments
        • Simulated data analysis
        • Coronal mass ejection analysis
          • Conclusion
Page 6: Multi-task Additive Models for Robust Estimation and Automatic … · group structure identification (CGSI)[26], and Bilevel learning of Group Lasso (BiGL) [12]. 2 Multi-task Additive

Finally the multi-task additive models (MAM) can be represented as below

f (t) =

Psumj=1

dsumk=1

νj β(t)jk (ϑ)ψjk(middot) t = 1 T

Let ϑThr and νThr be two threshold counterparts of ϑ and ν respectively Similar with [12] ϑThr

is determined by assigning each feature to its most dominant group For any j = 1 P νThrj is

determined by a threshold u ie νThrj = 0 if νj le u and νThr

j = 1 otherwise Then the data-drivenvariable structure can be obtained via S = (ϑThr

l νThr)1lelleL where denotes Hadamard productRemark 2 If the hyper-parameter ν equiv IP the sparsity wrt individual features would not be takeninto account for MAM In this setting our MAM is essentially a robust and nonlinear extension ofBiGL [11] by incorporating mode-induced metric and additive hypothesis spaceRemark 3 Indeed mGAM with an oracle variable structure is the baseline of MAM In other wordsthe inner problem with the estimated variable structure S aims to approximate the mGAM

Algorithm 1 Prox-SAGA for MAM

Input Data S(t)train S

(t)valTt=1 Max-Iter Z isin R The number of groups L Step-size ηϑ

Step-size ην ϑ(0) ν(0) λ micro Modal kernel φ Bandwidth σ Weights τl l = 1 LInitialization at = ϑ(0) ct = ν(0) t = 1 T g(0)

ϑ = 0PtimesL g(0)ν = 0P

for z = 0 1 Z minus 1 do1 Randomly pick set

B(z) sub 1 T denote its cardinality as |B(z)|2 Compute β(k)(ϑ(z)) based on S(k)

train forallk isin B(z)β(k)(ϑ(z))=HQ-DFBB(ϑ(z) λ σ micro τ S(k)

train)3 Update ϑ based on S(k)

val

31) Gϑ = 1|B(z)|

sumkisinB(z)

(hϑ(β(k)(ϑ(z)) ν(z))minus hϑ(β(k)(ak) ν(z))

)

32) ϑ(z) = g(z)ϑ +Gϑ

33) ϑ(z+1) = Pϑ(ϑ(z) minus ηϑϑ(z))

34) g(z+1)ϑ = g

(z)ϑ + |B(z)|

T Gϑ35) ak = ϑ(z) for every k isin B(z)

4 Update ν based on S(k)val

41) Gν = 1|B(z)|

sumkisinB(z)

(hν(β(k)(ϑ(z)) ν(z))minus hν(β(k)(ϑ(z)) ck)

)

42) ν(z) = g(z)ν +Gν

43) ν(z+1) = Pν(ν(z) minus ην ν(z))

44) g(z+1)ν = g

(z)ν + |B(z)|

T Gν 45) ck = ν(z) for every k isin B(z)

Output ϑ = ϑ(Z) ν = ν(Z) β(t)(ϑ) t = 1 T Prediction function f (t) =

sumPj=1

sumdk=1 νj β

(t)jk (ϑ)ψjk(middot) t = 1 T

Variable structure S = (ϑThrl νThr)1lelleL

3 Optimization Algorithm

To implement the non-convex and nonsmooth MAM we employ Prox-SAGA algorithm [30] withsimplex projection and box projection [8] For simplicity we define two partial derivative calculators

minusTsumt=1

partU(β(t)(ϑ) ν)

partν=

Tsumt=1

hν(β(t)(ϑ) ν) minusTsumt=1

partU(β(t)(ϑ) ν)

partϑ=

Tsumt=1

hϑ(β(t)(ϑ) ν)

It is trivial to computesumTt=1 hν(β(t)(ϑ) ν) since the parameter ν only appears explicitly in the upper

problem The optimization parameter ϑ is implicit via the solution β(ϑ) of the inner problem Hence

6

computingsumTt=1 hϑ(β(t)(ϑ) ν) requires us to develop a smooth algorithm HQ-DFBB (combining

HQ [24] and DFBB [37]) for the solution β(ϑ) For the space limitation the optimization detailsincluding HQ-DFBB and two partial derivative calculators are provided in Supplementary MaterialB Let PΘ be the projection onto unit simplex Θ and Pν be the box projection onto [0 1]P Thegeneral procedure of Prox-SAGA is summarized in Algorithm 1

Remark 4 From Theorem 21 in [12] and Theorem 4 in [30] we know that Algorithm 1 convergesonly if the iteration sequence generated by HQ-DFBB converges to the solution of the inner problemDetailed convergence analysis of HQ-DFBB is provided in Supplementary Material C

4 Experiments

This section validates the effectiveness of MAM on simulated data and CMEs data All experimentsare implemented in MATLAB 2019b on an intel Core i7 with 16 GB memory

41 Simulated data analysis

Baselines The proposed MAM is compared with BiGL [11] in terms of variable structure recoveryand prediction ability In addition we also consider some baselines including Lasso [36] RMR [38]mGAM Group Lasso (GL) [43] and GroupSpAM [42] Note that the oracle variable structure is apriori knowledge for implementing mGAM GL and GroupSpAM

Oracle variable structure Set the number of tasks T = 500 the dimension P = 50 for eachtask and the actual number of groups Llowast = 5 We denote the indices of l-th group by Gl =

1 + (l minus 1)(PLlowast) l(PLlowast)

foralll isin 1 Llowast In addition we randomly pick V sub 1 Pto generate sparse features across all tasks For each j isin 1 P and l isin 1 Llowast the oraclevariable structure Slowast can be defined as Slowastjl = 1 if j isin Vc capGl and 0 otherwise

Parameter selection For the same hyper-parameters in BiGL and MAM we set Z = 3000micro = 10minus3 M = 5 Q = 100 and σ = 2 We search the regularization parameter λ in therange of 10minus4 10minus3 10minus2 10minus1 Here we assume the actual number of groups is known ieL = Llowast The weight for each group is set to be τl = 1foralll isin 1 L Following the samestrategy in [11] we choose the initialization ϑ(0) = Pϑ( 1

L IPtimesL + 001N (0PtimesL IPtimesL)) isin RPtimesL

and ν(0) = (05 05)T isin RP

Evaluation criterion Denote f (t) flowast(t) as the estimator and ground truth function respectively1 le t le T Evaluation criterions used here include Average Square Error(ASE)= 1

T

sumTt=1

1nf

(t) minusy(t)22 True Deviation (TD)= 1

T

sumTt=1

1nf

(t)minus flowast(t)22 Variable Structure Recovery S = (νThrϑThrl )1lelleL with the hard threshold value u = 05 Width of Prediction Intervals (WPI) and Sample

Coverage Probability (SCP) with the confidence level 10 Specially WPI and SCP are designed in[41] for comparing the widths of the prediction intervals with the same confidence level (see Section32 in [41] for more details)

Data sets The training set validation set and test set are all drawn from y(t) = flowast(t)(u(t)) + ε withthe same sample size n = 50 for each task where u(t) = (u1 uP )T isin RP is randomly drawnfrom Gaussian distribution N (0P

12 IP ) The noise ε follows Gaussian noise N (0 005) Student

noise t(2) Chi-square noise X 2(2) and Exponential noise Exp(2) respectively We randomly pickG(t) sub G1 GL st |G(t)| = 2 and consider the following examples of ground truth functionflowast(t) 1 le t le T

Example A [12] Linear component function flowast(t)(u(t)) =sumGlisinG(t)

sumjisinGlcapVc u

(t)j β

(t)j where the

true regression coefficient β(t)j = 1 if j isin Gl cap Vc otherwise β(t)

j = 0

Example B Denote flowast1 (u) = 25 sin(u) flowast2 (u) = 2u flowast3 (u) = 2eu minus eminus1 minus 1 flowast4 (u) = 8u2 andflowast5 (u) = 3 sin(2eu) The nonlinear additive function flowast(t)(u(t)) =

sumGlisinG(t)

sumjisinGlcapVc flowastl (u

(t)j )

Here spline basis matrix for MAM mGAM and GroupSpAM are constructed with d = 3 In thedata-generating process we consider two cases of the number of inactive variables ie |V| = 0and |V| = 5 Due to the space limitation we only present the results with Gaussian noise and

7

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

1 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

501 2 3 4 5

10

20

30

40

50

Figure 2 Variable structure discovery for Example A under different noise and sparsity index V(white pixel means the grouped variables and red pixel means the inactive variables) Top left panelGaussian noise with |V| = 0 Top right panel Student noise with |V| = 0 Bottom left panelGaussian noise with |V| = 5 Bottom right panel Student noise with |V| = 5

Student noise in Table 2 and Figure 2 The remaining results as well as several evaluations on theimpact of hyper-parameters are provided in Supplementary Material D1 From the reported resultseven without the structure information the proposed MAM can provide the competitive regressionestimation with mGAM (given priori structure) and usually achieve better performance than thesecompetitors when the noise is non-Gaussian distribution Specially the actual number of groupsis assumed to be known in current evaluations ie L = Llowast In Supplementary Material D1 wefurther verify the effectiveness of MAM for the general setting L gt Llowast

Table 2 Performance comparisons on Example A (top) and Example B (bottom) wrt differentcriterions

Methods|V| = 0 (Gaussian noise) |V| = 5 (Gaussian noise) |V| = 0 (Student noise) |V| = 5 (Student noise)

ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP) ASE TD WPI (SCP)

MAM 00920 00914 00679(01015) 00801 00794 00595(01014) 10204 02467 01817(01014) 14279 02269 01796(01020)

BiGL 00715 00701 00616(01014) 00799 00793 00553(01027) 11097 03550 02061(01023) 14961 03544 02093(01017)

mGAM 00894 00885 00651(01011) 00795 00788 00611(01016) 10132 02441 01803(01015) 13828 02242 01725(01026)

GL 00683 00661 00620(01014) 00708 00684 00535(01011) 13252 03163 01935(01020) 14811 03576 01976(01020)

Lasso 02145 02138 01011(01013) 02204 02131 01080(01017) 38012 32412 05269(01021) 42121 35645 04899(01025)

RMR 02201 02196 01081(01020) 02224 02165 01108(01029) 19723 13241 03324(01022) 21518 13504 03451(01024)

MAM 07981 07972 02252(01044) 07827 07826 02155(01039) 09772 06943 02220(01038) 09145 06749 02218(01040)

BiGL 08307 08307 02318(01029) 07998 07934 02250(01027) 10727 07525 02414(01030) 10004 07405 02365(01034)

mGAM 07969 07967 02249(01042) 07787 07786 02121(01037) 09665 06842 02203(01038) 09135 06753 02226(01037)

GroupSpAM 07914 07916 02101(01033) 07604 07599 02085(01022) 09965 07142 02303(01048) 09521 07077 02294(01025)

GL 08081 08080 02241(01028) 07741 07708 02177(01029) 10501 07306 02335(01030) 09591 07202 02363(01029)

Lasso 25779 24777 03956(01027) 26801 26799 04259(01020) 38645 37612 05218(01020) 37443 35036 04963(01024)

RMR 15160 15061 03237(01028) 17654 16798 03184(01020) 19448 17294 03587(01025) 21085 20077 03329(01030)

42 Coronal mass ejection analysis

Coronal Mass Ejections (CMEs) are the most violent eruptions in the Solar System It is cru-cial to forecast the physical parameters related to CMEs Despite machine learning approacheshave been applied to these tasks recently [20 39] there is no any work for interpretable predic-tion with data-driven structure discovery Interplanetary CMEs (ICMEs) data are provided in TheRichardson and Cane List (httpwwwsrlcaltecheduACEASCDATAlevel3icmetable2htm) From this link we collect 137 ICMEs observations from 1996 to 2016The features of CMEs are provided in SOHO LASCO CME Catalog (httpscdawgsfcnasagovCME_list) In-situ solar wind parameters can be downloaded from OMNIWeb Plus(httpsomniwebgsfcnasagov) The in-situ solar wind parameters at earth is used torepresent the unknown solar wind plasma [20] A total of 21 features are chosen as input by com-bining the features of CMEs and in-situ solar wind parameters Five physical parameters predictiontasks are considered as outputs including CMEs arrive time Mean ICME speed Maximum solar

8

wind speed Increment in solar wind speed and Mean magnetic field strength We split the dataof each task into training set validation set and test set (with ratio 2 2 1) and adopt the samesettings in simulations Table 3 demonstrates that MAM enjoy smaller average absolute error thanthe competitors In addition the estimated structure (via MAM) is described in Figure 3 FromFigure 3 and Table 3 we know group G1 (including Mass MPA Solar wind speed Vy) and groupG2 (including Acceleration and Linear Speed) are significant for most tasks Particularly G2 andG7 (2nd-order Speed at final height) can be characterized as the factors that reflect the CMEs speedTable 3 shows that the groups G2 and G7 play an important role in CMEs arrive time predictionwhich is consistent with the results in [20] In addition the impact of hyper-parameter are displayedin Supplementary Material D2 due to the space limitation Overall the proposed MAM can achievethe promising performance on prediction and structure discovery

Table 3 Average absolute error and dominant group for each taskTasks CMEs arrive time Mean ICME speed Maximum solar wind speed Increment in solar wind speed Mean magnetic field strength

Methods AAE (h) Groups AAE (kms) Groups AAE (kms) Groups AAE (kms) Groups AAE (nT ) Groups

MAM 907 G1 G2 G7 4541 G1 G2 G3 G6 5932 G1 G2 6538 G1 G2 G3 347 G1BiGL 1109 - 5375 - 4651 - 8997 - 521 -Lasso 1216 - 6256 - 5981 - 8534 - 438 -RMR 1202 - 6223 - 5190 - 8613 - 398 -

1 2 3 4 5 6 7 8 9 10

CPA

Angular Width

Acceleration

Linear Speed

2nd-order Speed(20Rs)

Mass

Kinetic Energy

MPA

Field magnitude average

Bx

By

Bz

Solar wind speed

Vx

Vy

Vz

Proton density

Temperature

Flow pressure

Plasma beta

Figure 3 Variable structure S (white pixel=the grouped variables red pixel=the inactive variables)

5 Conclusion

This paper proposes the multi-task additive models to achieve robust estimation and automaticstructure discovery As far as we know it is novel to explore robust interpretable machine learningby integrating modal regression additive models and multi-task learning together The computingalgorithm and empirical evaluations are provided to support its effectiveness In the future it isinteresting to investigate robust additive models for overlapping variable structure discovery [17]

Broader Impact

The positive impacts of this work are two-fold 1) Our algorithmic framework paves a new way formining the intrinsic feature structure among high-dimensional variables and may be the steppingstone to further explore data-driven structure discovery with overlapping groups 2) Our MAM canbe applied to other fields eg gene expression analysis and drug discovery However there is also arisk of resulting an unstable estimation when facing ultra high-dimensional data


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 11671161, 12071166, 61972188, and 41574181, the Fundamental Research Funds for the Central Universities (Program No. 2662019FW003), and NSERC Grant RGPIN-2016-05024. We are grateful to the anonymous NeurIPS reviewers for their constructive comments.

References

[1] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv:2004.13912v1, 2020.
[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[3] H. Chen, G. Liu, and H. Huang. Sparse shrunk additive models. In International Conference on Machine Learning (ICML), 2020.
[4] H. Chen, X. Wang, C. Deng, and H. Huang. Group sparse additive machine. In Advances in Neural Information Processing Systems (NIPS), pages 198–208, 2017.
[5] H. Chen, Y. Wang, F. Zheng, C. Deng, and H. Huang. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2020.3005144, 2020.
[6] Y. C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. The Annals of Statistics, 44(2):489–514, 2016.
[7] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256, 2007.
[8] L. Condat. Fast projection onto the simplex and the ℓ1-ball. Mathematical Programming, 158(1-2):575–585, 2016.
[9] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.
[10] Y. Feng, J. Fan, and J. Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.
[11] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML), pages 1563–1572, 2018.
[12] J. Frecon, S. Salzo, and M. Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems (NIPS), pages 8301–8311, 2018.
[13] J. H. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.
[14] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
[15] C. Heinrich. The mode functional is not elicitable. Biometrika, 101(1):245–251, 2014.
[16] J. Huang, J. L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282–2313, 2010.
[17] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning (ICML), pages 433–440, 2009.
[18] K. Kandasamy and Y. Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning (ICML), pages 69–78, 2016.
[19] X. Li, L. Wang, and D. Nettleton. Sparse model identification and learning for ultra-high-dimensional additive partially linear models. Journal of Multivariate Analysis, 173:204–228, 2019.
[20] J. Liu, Y. Ye, C. Shen, Y. Wang, and R. Erdélyi. A new tool for CME arrival time prediction using machine learning algorithms: CAT-PUMA. The Astrophysical Journal, 855(2):109–118, 2018.
[21] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. The Annals of Statistics, 46(2):781–813, 2018.
[22] L. Meier, S. van de Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70(1):53–71, 2008.
[23] L. Meier, S. van de Geer, and P. Buhlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
[24] M. Nikolova and M. K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM Journal on Scientific Computing, 27(3):937–966, 2005.
[25] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient based bilevel optimization with nonsmooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.
[26] C. Pan and M. Zhu. Group additive structure identification for kernel nonparametric regression. In Advances in Neural Information Processing Systems (NIPS), pages 4907–4916, 2017.
[27] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[28] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2):389–427, 2012.
[29] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. Journal of the Royal Statistical Society, Series B, 71:1009–1030, 2009.
[30] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pages 1145–1153, 2016.
[31] T. W. Sager. Estimation of a multivariate mode. Annals of Statistics, 6(4):802–812, 1978.
[32] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.
[33] N. Simon, J. H. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[34] C. J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.
[35] G. Swirszcz and A. C. Lozano. Multi-level lasso for sparse multi-task regression. In International Conference on Machine Learning (ICML), pages 595–602, 2012.
[36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 73(3):267–288, 1994.
[37] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
[38] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1448–1458, 2017.
[39] Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi. CME arrival time prediction using convolutional neural network. The Astrophysical Journal, 881(1):15, 2019.
[40] Y. Wu and L. A. Stefanski. Automatic structure recovery for additive models. Biometrika, 102(2):381–395, 2015.
[41] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013.
[42] J. Yin, X. Chen, and E. P. Xing. Group sparse additive models. In International Conference on Machine Learning (ICML), pages 871–878, 2012.
[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[44] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564–2593, 2016.
[45] H. H. Zhang, G. Cheng, and Y. Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099–1112, 2011.
[46] T. Zhao and H. Liu. Sparse additive machine. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1435–1443, 2012.

Page 10: Multi-task Additive Models for Robust Estimation and Automatic … · group structure identification (CGSI)[26], and Bilevel learning of Group Lasso (BiGL) [12]. 2 Multi-task Additive

Acknowledgments

This work was supported by National Natural Science Foundation of China under Grant Nos11671161 12071166 61972188 41574181 the Fundamental Research Funds for the Central Univer-sities (Program No 2662019FW003) and NSERC Grant RGPIN-2016-05024 We are grateful to theanonymous NeurIPS reviewers for their constructive comments

References[1] R Agarwal N Frosst X Zhang R Caruana and G E Hinton Neural additive models

Interpretable machine learning with neural nets arXiv200413912v1 2020

[2] R Caruana Multitask learning Machine Learning 28(1)41ndash75 1997

[3] H Chen G Liu and H Huang Sparse shrunk additive models In International Conference onMachine Learning (ICML) 2020

[4] H Chen X Wang C Deng and H Huang Group sparse additive machine In Advances inNeural Information Processing Systems (NIPS) pages 198ndash208 2017

[5] H Chen Y Wang F Zheng C Deng and H Huang Sparse modal additive model IEEETransactions on Neural Networks and Learning Systems Doi 101109TNNLS202030051442020

[6] Y C Chen C R Genovese R J Tibshirani and L Wasserman Nonparametric modalregression The Annals of Statistics 44(2)489ndash514 2016

[7] B Colson P Marcotte and G Savard An overview of bilevel optimization Annals ofOperations Research 153(1)235ndash256 2007

[8] L Condat Fast projection onto the simplex and the `1-ball Mathematical Programming158(1-2)575ndash585 2016

[9] T Evgeniou and M Pontil Regularized multindashtask learning In ACM Conference on KnowledgeDiscovery and Data Mining pages 109ndash117 2004

[10] Y Feng J Fan and J Suykens A statistical learning approach to modal regression Journal ofMachine Learning Research 21(2)1ndash35 2020

[11] L Franceschi P Frasconi S Salzo R Grazzi and M Pontil Bilevel programming forhyperparameter optimization and meta-learning In International Conference on Machinelearning (ICML) pages 1563ndash1572 2018

[12] J Frecon S Salzo and M Pontil Bilevel learning of the group lasso structure In Advances inNeural Information Processing Systems (NIPS) pages 8301ndash8311 2018

[13] J H Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse group lassoarXiv 10010736 2010

[14] T J Hastie and R J Tibshirani Generalized additive models London Chapman and Hall1990

[15] C Heinrich The mode functional is not elicitable Biometrika 101(1)245ndash251 2014

[16] J Huang J L Horowitz and F Wei Variable selection in nonparametric additive models TheAnnals of Statistics 38(4)2282ndash2313 2010

[17] L Jacob G Obozinski and J Vert Group lasso with overlap and graph lasso In InternationalConference on Machine Learning (ICML) pages 433ndash440 2009

[18] K Kandasamy and Y Yu Additive approximations in high dimensional nonparametric regres-sion via the SALSA In International Conference on Machine Learning (ICML) pages 69ndash782016

10

[19] X Li L Wang and D Nettleton Sparse model identification and learning for ultra-high-dimensional additive partially models Journal of Multivariate Analysis 173204ndash228 2019

[20] J Liu Y Ye C Shen Y Wang and R Erdelyi A new tool for cme arrival time predictionusing machine learning algorithms Cat-puma The Astrophysical Journal 855(2)109ndash1182018

[21] S Lv H Lin H Lian and J Huang Oracle inequalities for sparse additive quantile regressionin reproducing kernel hilbert space The Annals of Statistics 46(2)781ndash813 2018

[22] L Meier S V De Geer and P Buhlmann The group lasso for logistic regression Journal ofThe Royal Statistical Society Series B-statistical Methodology 70(1)53ndash71 2008

[23] L Meier S V D Geer and P Buhlmann High-dimensional additive modeling The Annals ofStatistics 37(6B)3779ndash3821 2009

[24] M Nikolova and M K Ng Analysis of half-quadratic minimization methods for signal andimage recovery SIAM Journal on Scientific Computing 27(3)937ndash966 2005

[25] P Ochs R Ranftl T Brox and T Pock Techniques for gradient based bilevel optimization withnonsmooth lower level problems Journal of Mathematical Imaging and Vision 56(2)175ndash1942016

[26] C Pan and M Zhu Group additive structure identification for kernel nonparametric regressionIn Advances in Neural Information Processing Systems (NIPS) pages 4907ndash4916 2017

[27] E Parzen On estimation of a probability density function and mode Annals of MathematicalStatistics 33(3)1065ndash1076 1962

[28] G Raskutti M J Wainwright and B Yu Minimax-optimal rates for sparse additive models overkernel classes via convex programming Journal of Machine Learning Research 13(2)389ndash4272012

[29] P Ravikumar H Liu J Lafferty and L Wasserman SpAM sparse additive models Journalof the Royal Statistical Society Series B 711009ndash1030 2009

[30] S J Reddi S Sra B Poczos and A J Smola Proximal stochastic methods for nonsmoothnonconvex finite-sum optimization In Advances in Neural Information Processing Systems(NIPS) pages 1145ndash1153 2016

[31] T W Sager Estimation of a multivariate mode Annals of Statistics 6(4)802ndash812 1978

[32] N Shervashidze and F Bach Learning the structure for structured sparsity IEEE Transactionson Signal Processing 63(18)4894ndash4902 2015

[33] N Simon J H Friedman T Hastie and R Tibshirani A sparse-group lasso Journal ofComputational and Graphical Statistics 22(2)231ndash245 2013

[34] C J Stone Additive regression and other nonparametric models The Annals of Statistics13(2)689ndash705 1985

[35] G Swirszcz and A C Lozano Multi-level lasso for sparse multi-task regression In Interna-tional Conference on Machine Learning (ICML) pages 595ndash602 2012

[36] R Tibshirani Regression shrinkage and selection via the lasso Journal of the Royal StatisticalSociety Series B 73(3)267ndash288 1994

[37] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.

[38] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pages 1448–1458, 2017.

[39] Y. Wang, J. Liu, Y. Jiang, and R. Erdélyi. CME arrival time prediction using convolutional neural network. The Astrophysical Journal, 881(1):15, 2019.

[40] Y. Wu and L. A. Stefanski. Automatic structure recovery for additive models. Biometrika, 102(2):381–395, 2015.

[41] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013.

[42] J. Yin, X. Chen, and E. P. Xing. Group sparse additive models. In International Conference on Machine Learning (ICML), pages 871–878, 2012.

[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

[44] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564–2593, 2016.

[45] H. H. Zhang, G. Cheng, and Y. Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099–1112, 2011.

[46] T. Zhao and H. Liu. Sparse additive machine. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1435–1443, 2012.
