
Page 1:

Research Mathematical Sciences

Machine Learning Approach to Phenotyping and Genotype-Phenotype Association

Report from the ARPA-E TERRA Project, Digital Agriculture Workshop at UIUC

Peder Olsen1, Aurelie Lozano1, Karthikeyan Natesan Ramamurthy1, Min-hwan Oh3, Ming Yu4, Javier Ribera2, Yuhao Chen2, Mitch Tuinstra2, Addie M. Thompson5, Ronny Luss1, Kimberly C. Lang1, Naoki Abe1

1IBM Research, Yorktown Heights, NY; 2Purdue University, West Lafayette, IN; 3Columbia University, New York, NY; 4University of Chicago, Chicago, IL; 5Michigan State University, East Lansing, MI

Page 2:

Talk Agenda

§ Overview of ARPA-E TERRA Project with Purdue U. and U. Queensland
§ Category 3: Machine Learning for Automated Phenotyping:
– Counting Panicles in Sorghum UAV imagery
§ Category 4: Machine Learning for Genotype-Phenotype Association:
– Simultaneous Parameter Learning and Bi-Clustering for Multi-Response Models
– Applications to phenotypic trait prediction from remote-sensed data, and to multi-response GWAS

Page 3:

Transportation Energy Resources from Renewable Agriculture (TERRA)

§ Project goal: “Automated Sorghum Phenotyping and Trait Development Platform”
– An automated high-throughput system for determining how variations in the sorghum genome impact field performance, agricultural productivity, and energy potential as biofuel

§ Challenges
– Automating phenotyping using sensor data from ground-based mobile and airborne platforms has been a bottleneck for advancing plant breeding
– Integrated analytics on high-dimensional genomic data and multi-modal field image/sensor data is unprecedented and poses a major technical challenge

§ Partnership: Purdue University (Agronomy/Sensors), IBM Research (Analytics), CSIRO/U. Queensland (Crop Science)

§ Funded by the Department of Energy for 3 years. URL: http://www.purdue.edu/newsroom/releases/2015/Q2/purdue-leading-research-using-advanced-technologies-to-better-grow-sorghum-as-biofuel.html

[Figure: genomic data (gene expressions, SNPs) and field performance (phenotype) data collected by drones and phenomobiles feed a genotype-to-phenotype mapping, supporting automatic phenotyping, trait development, and plant breeding recommendations to maximize fuel energy potential.]

Page 4:

TERRA Project Scope

§ Overall Project Category 1: Complete Integrated Phenotyping Systems Solutions

– Category 2: “High Throughput Automated Hardware & Sensing Technologies” (Purdue U.)

• Optimize high-throughput remote-sensing technologies to acquire relevant data on sorghum plant phenotypes

– Category 3: “Computational Solutions for Selection & Prediction” (Purdue U. and IBM)

• Develop data analytics algorithms for image data segmentation and feature extraction for automated phenotyping

– Category 4: “Genetics, Genomics and Bioinformatics” (IBM and Purdue U.)

• Develop sophisticated genetic analysis pipelines to identify genes controlling sorghum performance, by integrated genotype-phenotype analysis


Overall deliverable will be “Automated Sorghum Phenotyping and Trait Development Platform”

Page 5:

Talk Agenda

§ Overview of ARPA-E TERRA Project with Purdue U. and U. Queensland
§ Category 3: Machine Learning for Automated Phenotyping:
– Counting Panicles in Sorghum UAV imagery
§ Category 4: Machine Learning for Genotype-Phenotype Association:
– Simultaneous Parameter Learning and Bi-Clustering for Multi-Response Models
– Applications to phenotypic trait prediction from remote-sensed data, and to multi-response GWAS

Page 6:

Counting Sorghum Panicles

§ Sorghum is a cereal crop and the panicle is the head of the plant that holds the grain.

§ Aerial, downward-looking images taken by a UAV will be used to count the number of panicles in a small row segment of sorghum.

Page 7:

Automated, High-Throughput Phenotyping

§ Phenotyping is the process of measuring plant traits and plays a central role in plant breeding.

§ Manual phenotyping methods are inaccurate, expensive, and labor-intensive.
§ The panicle count is itself a phenotype, but it can also be used to estimate productive tiller number and 50% flowering time, which are key phenotypes for the sorghum crop.

§ Our research is part of the larger TERRA project (sponsored by ARPA-E) that aims to accelerate plant breeding through genotyping and high-throughput automated phenotyping.

Page 8:

High-Throughput Phenotyping Pipeline for Panicle Counting and Other Traits

[Figure: raw images → extract individual rows → human annotation → labeled examples → panicle module; new images → panicle module → panicle counts, which are combined with plant location/pedigree to extract the relevant phenotypes.]

Page 9:

2017 Hybrid Calibration Panel

§ 18 varieties were planted on 4 different plots divided into 12 row-segments and imaged on 6 different dates (1 week apart).

– 2 of the 12 row-segments were annotated
– 864 images were annotated

§ Used this data for training and testing both panicle detection and panicle counting models.

§ Evaluation by cross-validation: train on 3 plots and evaluate on the remaining plot, then rotate so each plot appears in a test set exactly once.
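The rotation scheme above is leave-one-plot-out cross-validation. A minimal sketch (the plot labels are hypothetical):

```python
def plot_folds(plots):
    """Leave-one-plot-out cross-validation: train on three plots and
    test on the fourth, rotating so each plot is held out exactly once."""
    return [([p for p in plots if p != held_out], held_out)
            for held_out in plots]

# Four hypothetical plot labels.
folds = plot_folds(["plot1", "plot2", "plot3", "plot4"])
```

Each fold's test plot never appears in its training set, so a model is always evaluated on a plot it has not seen.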

Page 10:

2017 Hybrid Calibration Panel: Panicle Annotation

• Manually annotated data: 8 rows (field rows 2 and 3) on 6 dates: 7/11, 7/17, 7/25, 8/2, 8/8, 8/16.
• The annotation data contains:
  • 35,318 panicle super-pixels
  • 12,099 annotated panicles
  • 4,521 panicles seen on the last date
  • 864 images
  • 144 image sequences

Page 11:

Superpixels

A superpixel is a group of pixels with similar characteristics, such as color or intensity. Superpixels carry more perceptual and semantic meaning than individual pixels, but they are not at the level of a semantic segmentation, where each segment corresponds to an object.

§ We used the SLIC (Simple Linear Iterative Clustering) superpixel algorithm. It is computationally cheap and readily available.
– SLIC clusters pixels in a 5-dimensional color+position space (e.g. RGB+xy or Lab+xy).
– Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2010). SLIC superpixels. École Polytechnique Fédérale de Lausanne (EPFL), Tech. Rep. 149300, 155-162.
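To make the 5-dimensional clustering concrete, here is a toy k-means sketch in (r, g, b, x, y) space. This is an illustration of the feature space only, not SLIC itself: real SLIC additionally restricts each pixel's search to nearby cluster centers, which is what makes it fast.

```python
import numpy as np

def slic_like(image, n_segments=16, n_iters=5, compactness=10.0):
    """Toy SLIC-style clustering in 5-D (r, g, b, x, y) space.
    Plain k-means on the 5-D features; a sketch, not the real algorithm."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Scale spatial coordinates so 'compactness' trades color vs. position.
    s = np.sqrt(h * w / n_segments)
    feats = np.concatenate(
        [image.reshape(-1, 3),
         (compactness / s) * np.stack([xs, ys], -1).reshape(-1, 2).astype(float)],
        axis=1)
    # Initialize centers on an evenly spaced subset of pixels.
    centers = feats[np.linspace(0, h * w - 1, n_segments).astype(int)].copy()
    for _ in range(n_iters):
        # Assign each pixel to its nearest 5-D center, then recenter.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))   # synthetic stand-in for a row-segment crop
segments = slic_like(img)
```

The `compactness` knob plays the same role as in SLIC: larger values weight position more heavily, giving more regular, grid-like superpixels.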

Page 12:

Superpixel Segmentation for Annotation

• Clicking on a pixel picks a segment from a superpixel set.
• Three sets of super-pixels are provided, letting the user control the accuracy of the annotation.

Page 13:

The Annotation Tool in Action (Video)

*We also developed a crowd-sourcing (web-based) annotation tool on the IBM cloud, with data security and password-protected access.

Page 14:

The panicle count is monotonically increasing (under normal conditions)

• We used isotonic regression to force the estimated panicle count progression to be monotone.
• It always helped in our experiments.
• It also helped in detecting annotation errors.

[Figure: estimated count time series without isotonic regression vs. with isotonic regression.]
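The monotone fit can be computed with the classic pool-adjacent-violators algorithm; a minimal sketch (the raw counts below are hypothetical):

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: the closest non-decreasing sequence to y
    in least squares, sketching how raw per-date count estimates can be
    forced to be monotone over time."""
    blocks = []  # stack of [mean, size]
    for v in map(float, y):
        blocks.append([v, 1])
        # Merge blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(int(w), m) for m, w in blocks])

# Hypothetical raw CNN counts over six imaging dates, with two small dips.
smoothed = pava([3.0, 10.0, 9.0, 20.0, 34.0, 33.0])
```

Each dip is replaced by the average of the offending neighbors (10 and 9 become 9.5; 34 and 33 become 33.5), which is exactly the least-squares monotone projection.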

Page 15:

Counting by Non-linear Regression

§ The idea in non-linear regression is to minimize

  sum_i || f(x_i; θ) − y_i ||^2

§ The regressor x is the image, which can have millions of pixels.
§ If the regression target y were the count itself, we would likely need millions of images to train a model.
§ Instead we use an image density map, where the density mass corresponding to each panicle integrates to 1.
§ θ denotes the regression parameters, which in our case are the weights of a convolutional neural network.

Page 16:

The Image Density Map

• The concept of the image density map was introduced in 2010 by Victor Lempitsky and Andrew Zisserman.
• In 2015 several research groups realized that the image density map was a perfect regression target for a CNN.
• It is now widely used for crowd counting in computer vision, but not yet for phenotyping in agriculture, where classical color and texture methods are still widely used.

[Figure: an input image and its corresponding image density map.]

Page 17:

Image Density Map with Dot Annotation (20170816_range_018_row_082; dot-annotation count: 34, density-map count: 34.502)

• Place a dot in the center of each panicle, based on the super-pixel annotation.
• Place a Gaussian kernel with a width of 5 pixels on each dot.
• The density on top of each panicle sums to 1.
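The dot-annotation construction can be sketched directly: one normalized Gaussian per dot, so the whole map integrates to the panicle count (the dot positions below are hypothetical):

```python
import numpy as np

def density_map(shape, dots, sigma=5.0):
    """Build an image density map from dot annotations: one Gaussian per
    dot, normalized over the image so each panicle contributes exactly 1.
    sigma=5 pixels matches the kernel width quoted above."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape)
    for (y, x) in dots:
        g = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()   # normalize so this dot integrates to 1
    return dmap

dmap = density_map((64, 64), [(20, 20), (40, 45), (10, 50)])
count = dmap.sum()   # recovers the number of dots, here 3
```

Because each kernel is normalized over the image, summing the predicted density map is all a counting model needs to do at inference time.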

Page 18:

Image Density Map with Area Annotation (20170816_range_018_row_082; area-annotation count: 34, density-map count: 35.031)

• Mark all pixels belonging to each panicle.
• The area-annotation density is the sum of equally weighted Gaussians on top of each pixel in the annotation.
• The density on top of each panicle sums to 1.
• The area annotation is typically more spread out than the dot annotation.

Page 19:

Architecture of a Deep Head Counting System

• Deep convolutional neural network① (Counting CNN – CCNN, Onoro-Rubio 2016).
• Thermal time (e.g. 1237.8 Celsius-days) is included as an input② alongside the RGB (red, green, blue) channels, and the image density map is the regression target③.
• Orientation and flip invariance④ and isotonic regression⑤.

CNN architecture: input of size h × w × 3 → 9x9x3x20 convolution → 2x2 max pool → 5x5x20x40 convolution → 2x2 max pool → 5x5x40x20 convolution → 5x5x20x10 convolution → image density map of size h/4 × w/4.
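The layer sizes above can be sketched in PyTorch (assuming it is available). The final 1×1 convolution down to a single density channel is our assumption, following the CCNN design; how thermal time is wired in (e.g. as a fourth, constant input plane via `in_channels=4`) is also an assumption.

```python
import torch
import torch.nn as nn

class CCNNLike(nn.Module):
    """Sketch of the counting CNN sized as on the slide
    (9x9x3x20 -> pool -> 5x5x20x40 -> pool -> 5x5x40x20 -> 5x5x20x10).
    The trailing 1x1 conv to one density channel is our assumption."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 20, 9, padding=4), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(20, 40, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(40, 20, 5, padding=2), nn.ReLU(),
            nn.Conv2d(20, 10, 5, padding=2), nn.ReLU(),
            nn.Conv2d(10, 1, 1),  # density map at h/4 x w/4 resolution
        )

    def forward(self, x):
        return self.features(x)

model = CCNNLike()
density = model(torch.zeros(1, 3, 64, 64))  # shape (1, 1, 16, 16)
count = density.sum().item()  # the count is the integral of the density map
```

The two 2×2 max pools account for the h/4 × w/4 output resolution; all convolutions use "same" padding so only pooling changes the spatial size.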

Page 20:

Architecture of a Deep Head Counting System (continued)

• Orientation and flip invariance④: rotate and flip the input, predict a count for each of the 8 variants (e.g. 62.57, 61.58, 62.42, 61.08), and take the median (62.00).
• Isotonic regression⑤: make the resulting count time series monotonic.
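Step ④ can be sketched as test-time augmentation: since nadir UAV imagery has no preferred orientation, predict on all eight rotations/flips and take the median. The toy "model" below (counting bright pixels) is hypothetical and only stands in for the CNN.

```python
import numpy as np

def tta_count(predict, image):
    """Test-time augmentation sketch: run `predict` (any image -> count
    function) on the 8 rotations/flips of the image and take the median."""
    variants = []
    img = image
    for _ in range(4):
        img = np.rot90(img)
        variants.extend([img, np.fliplr(img)])
    return float(np.median([predict(v) for v in variants]))

# Toy 'model': count bright pixels. It is rotation-invariant, so here the
# median simply equals any single prediction.
toy = (np.arange(16).reshape(4, 4) > 11).astype(float)
est = tta_count(lambda im: im.sum(), toy)
```

A real density-map CNN is only approximately invariant, which is exactly why the median over the eight variants reduces the error.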

Page 21:

Mean Absolute Error (MAE) Comparison with the Multi-Column CNN (MCNN)

[MCNN] Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. Zhang, Yingying; Zhou, Desen; Chen, Siqin; Gao, Shenghua; Ma, Yi. CVPR 2016.

• MCNN has 3 CCNN-like branches.
• Each branch is meant to specialize on a different head size.
• It needs batch normalization to work in our setting.
• Since we have no perspective issues in UAV imagery, the extra branches do not help:

  system | Base | +rotations | +isotonic
  CCNN   | 1.39 | 1.38       | 1.28
  MCNN   | 1.38 | 1.38       | 1.28

Page 22:

MAE Comparison w/ CSRnet

[CSRnet] CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. Li, Yuhong; Zhang, Xiaofan; Chen, Deming. CVPR 2018.

• CSRnet is a deep architecture based on VGG16.
• It uses dilated kernels to deal with perspective.
• It needs batch normalization to work in our setting.
• CSRnet was best on 4 out of 5 crowd-counting benchmark tests when published.
• CSRnet suffers from overtraining and is difficult to train (it needs good initialization (VGG16), batch normalization, ...).

  system | Base | +rotations | +isotonic
  CCNN   | 1.39 | 1.38       | 1.28
  CSRnet | 1.17 | 1.12       | 1.09

Page 23:

Thermal Time

§ Growing Degree Days (GDD) = the number of degree-days above 10 °C

§ The cumulative GDD since planting is an indicator for the development stage

– A short season hybrid needs 1026 degree days to reach flowering

Image from agropedia
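Daily GDD accumulation can be sketched as below. Conventions vary; this mean-of-extremes variant floored at zero is one common choice, and the project may compute GDD differently. The temperature readings are hypothetical.

```python
def growing_degree_days(t_max, t_min, t_base=10.0):
    """Daily Growing Degree Days with a 10 C base temperature:
    the mean of the daily max and min, minus the base, floored at zero.
    One common convention among several; a sketch only."""
    return max((t_max + t_min) / 2.0 - t_base, 0.0)

# Cumulative thermal time over a few hypothetical days (max, min in C).
days = [(30, 18), (28, 16), (12, 4)]
thermal_time = sum(growing_degree_days(hi, lo) for hi, lo in days)
```

Summing daily GDD from planting gives the cumulative thermal time (e.g. the 1026 degree-days a short-season hybrid needs to reach flowering) that the counting model receives as an input.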

Page 24:

Adding Thermal Time

§ Thermal time lets the model ignore images taken before any panicles have developed.
§ No gain is seen for CSRnet, which is already overtrained.

  system   | Base | +rotations | +isotonic
  CCNN     | 1.39 | 1.38       | 1.28
  +thermal | 1.31 | 1.29       | 1.18
  MCNN     | 1.39 | 1.38       | 1.28
  +thermal | 1.35 | 1.32       | 1.25
  CSRnet   | 1.17 | 1.12       | 1.09
  +thermal | 1.15 | 1.13       | 1.10

Page 25:

Detecting Panicle Pixels

§ We can use the same architecture, but with an image detection map as the regression target.
§ This is simple foreground-background segmentation and is not state-of-the-art, but it reuses code.

[Figure: RGB (red, green, blue) input data plus thermal time (e.g. 1237.8 or 1421.2 Celsius-days) fed to the CCNN, which outputs an image detection map.]

Page 26:

Adding a Region-of-Interest Channel

§ Uses the predicted image detection map as a region-of-interest masking channel.

[Figure: RGB input plus thermal time and the predicted detection map fed to the CCNN, which outputs the image density map.]

  system | Base | +rotations | +isotonic | +thermal | +detection
  CCNN   | 1.39 | 1.38       | 1.28      | 1.18     | 1.06
  MCNN   | 1.38 | 1.38       | 1.28      | 1.25     | 1.24
  CSRnet | 1.17 | 1.12       | 1.09      | 1.10     | 1.11

Page 27:

Future work

§ Architecture experiments
– Current best crowd-estimation system: Scale Aggregation Network (SANet)
– ResNet versions of the architectures

§ Using area annotation to induce panicle segmentation
– Works well and is currently embedded in the annotation tool
– Need to compare to the state of the art for homogeneous instance segmentation in the presence of occlusion

§ Flowering time estimation
– Can use the panicle count time series to track plant development and estimate flowering time
– For producing hybrid varieties, planting of the male and female varieties has to be staggered so that the flowering times match
– Can estimate flowering time to within 1 day from 60 observations, and to within 3 days from 2 observations

§ Apply these ideas to other crops and problems in agronomy

Page 28:

Talk Agenda

§ Overview of ARPA-E TERRA Project with Purdue U. and U. Queensland

§ Category 3: Machine Learning for Automated Phenotyping:

– Counting Panicles in Sorghum UAV imagery

§ Category 4: Machine Learning for Genotype-Phenotype Association:
– Simultaneous Parameter Learning and Bi-Clustering for Multi-Response Models
– Applications to phenotypic trait prediction from remote-sensed data, and to multi-response GWAS

Page 29:

Context of this work in the Overall Pipeline

[Pipeline: pre-processing and feature extraction → trait prediction with extracted features → GWAS with predicted/reference traits or features.]

Page 30:

GWAS with each trait in isolation

Y = XΘ + noise

§ The coefficient vector for each trait (task) is estimated independently.
§ Relationships between the traits can be inferred post hoc, by clustering the columns of Θ.

[Figure: output Y (n varieties × k phenotypic traits) = input X (n varieties × p features/SNPs) × coefficient matrix Θ (p features × k traits) + noise.]
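With squared loss and no cross-trait coupling, fitting all traits at once is exactly the same as fitting each trait's coefficient vector on its own, which is the point of this slide. A sketch with hypothetical dimensions (a real GWAS would add sparsity regularization):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 8, 3                       # varieties, SNP features, traits
X = rng.standard_normal((n, p))          # genotype design matrix
theta_true = rng.standard_normal((p, k)) # per-trait coefficient vectors
Y = X @ theta_true + 0.01 * rng.standard_normal((n, k))

# Joint least-squares fit over all traits ...
theta_joint = np.linalg.lstsq(X, Y, rcond=None)[0]

# ... is identical to fitting one trait (one column of Y) at a time.
theta_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, s], rcond=None)[0] for s in range(k)])
```

This independence is what the bi-clustering formulations on the following slides deliberately break, by adding penalties that couple columns (traits) and rows (SNPs) of Θ.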

Page 31:

Multi-trait GWAS with Task and SNP Clustering

§ Certain SNPs can jointly influence certain traits (tasks).
§ This input-output relationship induces a bi-clustering, or checkerboard, structure in Θ.
§ Our goal is to automatically learn Θ and this checkerboard structure.

[Figure: the same decomposition Y = XΘ + noise, now with a checkerboard (bi-clustered) structure in the coefficient matrix Θ.]

Page 32:

Estimating Coefficients Only …

§ This is the same as a single-response GWAS problem.

min_Θ ‖Y − XΘ‖_F² + λ1‖Θ‖1   (estimation term + coefficient-sparsity term)

Page 33:

Performing Bi-Clustering Only …

§ Chi EC, Allen GI, Baraniuk RG. Convex biclustering. Biometrics. 2017 Mar 1;73(1):10-9.

Input variable: Θ. Convex bi-clustering fits a second matrix Γ to Θ, with fusion penalties on the columns of Γ (column clustering) and on the rows of Γ (row clustering).

Page 34:

Putting It Together – Simultaneous Estimation & Bi-Clustering

§ Simply the sum of the two previous objectives (estimation and bi-clustering), in the variables Θ (estimation variable) and Γ (clustering variable).

Page 35:

Special (and simpler) case: estimation variables = clustering variables

§ Estimation variables within the same cluster are set to be equal.

Manuscript under review by AISTATS 2018

... the assumption here is that an appropriate sharing of information can benefit all the tasks [4, 19]. The implicit assumption that all tasks are closely related can be excessive, as it ignores the underlying specificity of the mappings. There have been several extensions to multi-task learning that address this problem. The authors in [12] propose a dirty model for feature sharing among tasks, wherein a linear superposition of two sets of parameters is used - one that is common to all tasks, and one that is task-specific. [14] leverages a predefined tree structure among the output tasks (e.g. using hierarchical agglomerative clustering) and imposes group regularizations on the task parameters based on this tree. The approach proposed in [15] learns to share by defining a set of basis task parameters and posing the task-specific parameters as a sparse linear combination of these. The approaches of [11] and [13] assume that the tasks are clustered into groups and proceed to learn the group structure along with the task parameters, using a convex and an integer quadratic program respectively. However, these approaches do not consider joint clustering of the features. In addition, the mixed integer program of [13] is computationally intensive and greatly limits the maximum number of tasks that can be considered. Another pertinent approach is the Network Lasso formulation presented in [8]. This formulation, however, is limited to settings where only clustering among the tasks is needed.

Our formulations can be seen as generalizations of (unsupervised) convex bi-clustering [5]. Indeed, convex bi-clustering aims at grouping observations and features in a data matrix; our approaches aim at discovering groupings in the parameter matrix of multi-response regression models while jointly estimating such a matrix, and the discovered groupings reflect groupings in features and responses.

Roadmap. In Section 2, we will discuss the two proposed formulations, and in Section 3, we will present the optimization schemes used to estimate the parameters. The choice of hyperparameters used and their significance is discussed in Section 4. We illustrate the solution path for one of the formulations in Section 5. We will provide results for estimation with synthetic data, and a case study using multi-response GWAS with real data, in Sections 6 and 7 respectively. We conclude in Section 8.

2 PROPOSED FORMULATIONS

We will motivate and propose two distinct formulations for simultaneous parameter learning and clustering with general supervised models involving matrix-valued parameters. Our formulations will be developed around multi-task regression in this paper. We are interested in accurate parameter estimation as well as understanding the bi-cluster or checkerboard structure of the parameter matrix. More formally, denote by Z the observed data, Θ the model parameters to be estimated, L(Z; Θ) a general loss function, and R(Θ) the regularization.

In multi-task regression, Z = {X, Y} where X_s ∈ R^{n×p} are the design matrices and Y_s ∈ R^n are the response vectors for each task s = 1, ..., k. Θ is a matrix in R^{p×k} containing the regression coefficients for each task. A popular choice for L is the squared loss, L(Z; Θ) = Σ_{s=1}^{k} ‖Y_s − X_s Θ_s‖2², and for R(Θ), the ℓ1 regularization, ‖Θ‖1. Here we wish to discover the bi-cluster structure among features and responses, respectively the rows and columns of Θ.

2.1 Formulation 1:

We begin with the simplest formulation, which, as we shall see, is a special case of the latter one.

  min_Θ L(Z; Θ) + λ1 R(Θ) + λ2 [Ω_W(Θ) + Ω_W̃(Θᵀ)]   (1)

where Ω_W(Θ) = Σ_{i<j} w_ij ‖Θ·i − Θ·j‖2 and Θ·i is the i-th column of Θ. Note that this term is inspired by the convex bi-clustering objective [5, eqn. 2.1]. Ω_W(Θ) encourages sparsity in differences between model parameters (columns of Θ), whereas Ω_W̃(Θᵀ) encourages sparsity in the differences between the various dimensions (rows of Θ) of the model parameter matrix. When the overall objective is optimized, we can expect to see a checkerboard pattern in the model parameter matrix.

The degree of sharing of parameters, and hence the degree of bi-clustering, is controlled using the tuning parameter λ2. As λ2 increases and more parameters fuse together, the number of rectangles in the checkerboard pattern will reduce. W and W̃ are nonnegative weights that can be imposed to reflect prior belief on the closeness of the rows and columns of Θ.

In the remainder of the paper, we shall assume that the design matrix X is constant across tasks for notational simplicity. Our approaches can be seamlessly extended to the case of task-specific design matrices.

For sparse multi-task linear regression, formulation 1 can be instantiated as

  min_Θ ‖Y − XΘ‖_F² + λ1 Σ_{i=1}^{k} ‖Θ_i‖1 + λ2 [Ω_W(Θ) + Ω_W̃(Θᵀ)].   (2)

In this linear regression setting, the rows of Θ corre-

Manuscript under review by AISTATS 2018

the assumption here is that an appropriate sharingof information can benefit all the tasks [4, 19]. Theimplicit assumption that all tasks are closely relatedcan be excessive as it ignores the underlying speci-ficity of the mappings. There have been several ex-tensions to multi-task learning that address this prob-lem. The authors in [12] propose a dirty model forfeature sharing among tasks, wherein a linear super-position of two sets of parameters - one that is commonto all tasks, and one that is task-specific is used. [14]leverages a predefined tree structure among the outputtasks (e.g. using hierarchical agglomerative clustering)and imposes group regularizations on the task param-eters based on this tree. The approach proposed in[15] learns to share by defining a set of basis task pa-rameters and posing the task-specific parameters as asparse linear combination of these. The approaches of[11] and [13] assume that the tasks are clustered intogroups and proceed to learn the group structure alongwith the task parameters using a convex and an in-teger quadratic program respectively. However, theseapproaches do not consider joint clustering of the fea-tures. In addition the mixed integer program of [13] iscomputationally intensive and greatly limit the maxi-mum number of tasks that can be considered. Anotherpertinent approach is the Network Lasso formulationpresented in [8]. This formulation, however, is limitedto settings where only clustering among the tasks isneeded.

Our formulations can be seen as generalizations of (un-supervised) convex bi-clustering [5]. Indeed convex bi-clustering aims at grouping observations and featuresin a data matrix; our approaches aim at discoveringgroupings in the parameter matrix of multi-responseregression models while jointly estimating such a ma-trix, and the discovered groupings reflect groupings infeatures and responses.

Roadmap. In Section 2, we will discuss the two pro-posed formulations, and in Section 3, we will presentthe optimization schemes used to estimate the param-eters. The choice of hyperparameters used and theirsignificance is discussed in Section 4. We illustrate thesolution path for one of the formulations in Section 5.We will provide results for estimation with syntheticdata, and a case study using multi-response GWASwith real data in Sections 6 and 7 respectively. Weconclude in Section 8.

2 PROPOSED FORMULATIONS

We will motivate and propose two distinct formula-tions for simultaneous parameter learning and cluster-ing with general supervised models involving matrixvalued parameters. Our formulations will be devel-

oped around multi-task regression in this paper. Weare interested in accurate parameter estimation as wellas understanding the bi-cluster or checkerboard struc-ture of the parameter matrix. More formally, denoteby Z the observed data, ⇥ the model parameters tobe estimated, and L(Z;⇥) a general loss function, andR(⇥) to be the regularization.

In multi-task regression, Z = {X,Y } where X

s

2Rn⇥p are the design matrices and Y

s

2 Rn are theresponse vectors for each task s = 1, . . . , k. ⇥ is amatrix in Rp⇥k containing the regression coe�cientsfor each task. A popular choice for L is the squaredloss: L(Z;⇥) =

Pk

s=1 kYs

� X

s

⇥s

k22 and for R(⇥),the `1 regularization, k⇥k1. Here we wish to discoverthe bi-cluster structure among features and responses,respectively the rows and columns of ⇥.

2.1 Formulation 1:

We begin with the simplest formulation, which, as weshall see, is a special case of the latter one.

min⇥

L(Z;⇥) + �1R(⇥) + �2

h⌦

W

(⇥) + ⌦fW

(⇥T )i

(1)where ⌦

W

(⇥) =P

i<j

w

ij

k⇥·i � ⇥·jk2 and ⇥·i is the

i

th column of ⇥. Note that this term is inspired by theconvex bi-clustering objective [5, eqn. 2.1]. ⌦

W

(⇥)encourages sparsity in di↵erences between model pa-rameters (columns of ⇥), whereas ⌦

W

(⇥T ) encour-ages sparsity in the di↵erences between the various di-mensions (rows of ⇥) of the model parameter matrix.When the overall objective is optimized, we can expectto see a checkerboard pattern in the model parametermatrix.

The degree of sharing of parameters, and hence the degree of bi-clustering, is controlled using the tuning parameter λ2. As λ2 increases and more parameters fuse together, the number of rectangles in the checkerboard pattern will reduce. W and W̃ are nonnegative weights that can be imposed to reflect prior belief on the closeness of the rows and columns of Θ.

In the remainder of the paper, we shall assume that the design matrix X is constant across tasks for notational simplicity. Our approaches can be seamlessly extended to the case of task-specific design matrices.

For sparse multi-task linear regression, Formulation 1 can be instantiated as

    min_Θ  ‖Y − XΘ‖_F² + λ1 Σ_{i=1}^k ‖Θ_i‖₁ + λ2 [Ω_W(Θ) + Ω_W̃(Θᵀ)].    (2)

In this linear regression setting, the rows of Θ correspond to the features and its columns to the tasks.


Optimization

§ For the Simplified Formulation ("Formulation 1") we employ proximal decomposition (Combettes & Pesquet, 2008), an efficient algorithm for minimizing the sum of several convex functions.
§ For the Full Formulation ("Formulation 2") we use an alternating minimization method on Θ and Γ:
  – The first alternating step estimates Θ while fixing Γ. This minimization problem is separable for each column, and each sub-problem can easily be written as a standard Lasso problem.
  – In the second step, we fix Θ and optimize for Γ using the COnvex BiclusteRing Algorithm (COBRA) introduced in Chi et al. (2014), which also employs proximal decomposition.
§ The resulting optimization algorithms are shown to converge to the global optimum.
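The Θ-step above reduces to k independent Lasso problems. The sketch below assumes, hypothetically, that Θ and Γ are coupled through a quadratic term λ2·‖Θ − Γ‖_F²; under that assumption each column's sub-problem becomes a plain Lasso on stacked data. The exact coupling used in Formulation 2 is not reproduced on this slide, so treat this as a structural illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def theta_step(X, Y, Gamma, lam1, lam2):
    """One alternating step: solve for Theta column-by-column, Gamma fixed.

    Assumed per-column problem (our hypothetical coupling):
        min_t ||Y[:, s] - X t||^2 + lam2 ||t - Gamma[:, s]||^2 + lam1 ||t||_1
    Stacking [X; sqrt(lam2) I] and [Y_s; sqrt(lam2) Gamma_s] turns this
    into a standard Lasso problem.
    """
    n, p = X.shape
    k = Y.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    m = X_aug.shape[0]
    Theta = np.zeros((p, k))
    for s in range(k):
        y_aug = np.concatenate([Y[:, s], np.sqrt(lam2) * Gamma[:, s]])
        # sklearn's Lasso minimizes (1/(2m))||y - Xw||^2 + alpha ||w||_1,
        # so alpha = lam1 / (2m) matches the unnormalized objective above.
        model = Lasso(alpha=lam1 / (2 * m), fit_intercept=False)
        model.fit(X_aug, y_aug)
        Theta[:, s] = model.coef_
    return Theta
```

The Γ-step would then apply a convex bi-clustering solver (COBRA) to Θ; that step is omitted here.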


Solution Path as we increase the bi-clustering regularization

Figure 1 (Yu et al.), multi-response GWAS: the simultaneous grouping relationship between phenotypic traits and SNPs manifests as a block structure (row + column groups) in the parameter matrix. The row and column groups are special cases of the more general block structure. Our proposed approach infers the parameter matrix as well as the group structures.

Evolution of the bi-clustering structure of model coefficient matrix Θ as regularization parameter λ2 increases.


Simulated Data Experiments: Setup

§ We focus on multi-task regression: Y_s = XΘ*_s + ε_s for tasks s = 1, ..., k.
§ The true regression parameter Θ* has a bi-cluster (checkerboard) structure.
§ To simulate sparsity, we set the coefficients within many of the blocks in the checkerboard to 0.
§ For a non-zero block, the coefficients are drawn as θ_ij = μ_rc + δ_ij with δ_ij ∼ N(0, σ_ε²).
  – This makes them close but not identical.
  – μ_rc is the cluster mean defined by the r-th row partition and c-th column partition.
§ We set n = 200, p = 500, and k = 250 in our experiments.
§ For the non-zero blocks, we set μ_rc ∼ Uniform{−2, −1, 1, 2} and σ_ε = 0.25.
§ We consider a low noise setting (σ = 1.5) and a high noise setting (σ = 3) for the observation noise.
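The setup above can be sketched as follows. Dimensions are shrunk from (n, p, k) = (200, 500, 250) for illustration, and the block-partition mechanics (how rows and columns are assigned to groups) are our assumption about how the checkerboard was drawn, not a verbatim reproduction of the experiment.

```python
import numpy as np

def make_checkerboard(p=30, k=20, row_groups=3, col_groups=4,
                      zero_frac=0.5, sigma_eps=0.25, seed=0):
    """Generate a sparse bi-clustered (checkerboard) coefficient matrix.

    Each (row-group, column-group) block is either all-zero (with
    probability zero_frac) or filled with mu_rc + N(0, sigma_eps^2),
    where mu_rc is drawn from {-2, -1, 1, 2}.
    Returns Theta and the row/column partition labels.
    """
    rng = np.random.default_rng(seed)
    row_part = np.sort(rng.integers(row_groups, size=p))
    col_part = np.sort(rng.integers(col_groups, size=k))
    Theta = np.zeros((p, k))
    for r in range(row_groups):
        for c in range(col_groups):
            if rng.random() < zero_frac:
                continue  # this block stays zero (sparsity)
            mu = rng.choice([-2, -1, 1, 2])
            rows = np.where(row_part == r)[0]
            cols = np.where(col_part == c)[0]
            Theta[np.ix_(rows, cols)] = (
                mu + sigma_eps * rng.normal(size=(rows.size, cols.size)))
    return Theta, row_part, col_part
```

Responses would then be generated as Y[:, s] = X @ Theta[:, s] + σ·noise, with σ = 1.5 or 3 as on the slide.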


Simulated Data Experiments: Results

Low noise results High noise results

§ Results are reported for row clustering, column clustering, and row & column clustering.
§ "F1" (Formulation 1) = simpler formulation; "F2" (Formulation 2) = full (general) formulation.
§ ARI = Adjusted Rand Index; F-1 = F1 score; JI = Jaccard Index.
§ Baseline: best of (a) letting each coefficient be its own group, and (b) imposing a single group for all coefficients.
§ 2-step: estimate-then-cluster approach: (a) estimate Θ first using Lasso, and (b) perform unsupervised convex bi-clustering on Θ.
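For reference, the three metrics above can be computed as follows. ARI comes from scikit-learn; the pairwise F1 and Jaccard scores follow one common convention (scoring pairs of items co-assigned to the same cluster), which may differ in detail from the exact evaluation used in the study.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def pair_set(labels):
    """Set of index pairs assigned to the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def cluster_scores(true_labels, pred_labels):
    """ARI plus pairwise F1 and Jaccard index over co-clustered pairs."""
    t, p = pair_set(true_labels), pair_set(pred_labels)
    tp = len(t & p)                       # pairs both clusterings co-assign
    prec = tp / len(p) if p else 0.0
    rec = tp / len(t) if t else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    ji = tp / len(t | p) if (t | p) else 1.0
    return {"ARI": adjusted_rand_score(true_labels, pred_labels),
            "F1": f1, "JI": ji}
```

All three scores reach 1.0 when the predicted partition matches the true one up to relabeling, matching the tables' interpretation that higher is better.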

Figure 5 (Yu et al.): solution paths for Formulation 2, fixing λ1 and varying λ2, λ3. Figure 6: solution paths for Formulation 2, fixing λ2 and varying λ1, λ3.

Table 1. Performance in the low noise setting. The first block is for row clustering; the second block is for column clustering; and the third block is for row-column biclustering.

Row clustering:
      Baseline   2-step        F1            F2
ARI   0          0.679±0.157   0.869±0.069   0.900±0.046
F-1   0.446      0.757±0.128   0.907±0.052   0.931±0.022
JI    0.287      0.625±0.161   0.834±0.081   0.871±0.042

Column clustering:
      Baseline   2-step        F1            F2
ARI   0          0.877±0.043   0.914±0.020   0.915±0.013
F-1   0.446      0.908±0.037   0.933±0.023   0.934±0.012
JI    0.287      0.847±0.048   0.876±0.031   0.887±0.025

Row-column biclustering:
      Baseline   2-step        F1            F2
ARI   0          0.708±0.118   0.841±0.059   0.863±0.035
F-1   0.172      0.734±0.110   0.857±0.052   0.877±0.026
JI    0.094      0.591±0.134   0.753±0.077   0.781±0.035

Table 2. Performance in the high noise setting. The first block is for row clustering; the second block is for column clustering; and the third block is for row-column biclustering.

Row clustering:
      Baseline   2-step        F1            F2
ARI   0          0.577±0.163   0.803±0.104   0.804±0.096
F-1   0.446      0.674±0.138   0.874±0.093   0.874±0.075
JI    0.287      0.525±0.159   0.793±0.097   0.792±0.098

Column clustering:
      Baseline   2-step        F1            F2
ARI   0          0.734±0.132   0.905±0.077   0.905±0.046
F-1   0.446      0.799±0.107   0.924±0.054   0.933±0.039
JI    0.287      0.689±0.120   0.872±0.078   0.867±0.065

Row-column biclustering:
      Baseline   2-step        F1            F2
ARI   0          0.555±0.187   0.801±0.125   0.812±0.105
F-1   0.172      0.586±0.152   0.824±0.104   0.821±0.086
JI    0.094      0.437±0.179   0.714±0.118   0.713±0.104



Real Data Analysis: Multi-response GWAS

§ The design matrix X was created from the SNP data from Sorghum varieties in the phenotyping study in the Midwestern United States.
§ We consider n = 911 varieties and over 80,000 SNPs. We remove duplicate SNPs and those that do not have significantly high correlation to at least one response variable. Finally, we end up considering p = 2,937 SNPs.
§ The output data Y contains the following 6 response variables (columns) for all n varieties, collected by hand measurements:
  – Height to panicle (h1)
  – Height to top collar (h2)
  – Diameter top collar (d1)
  – Diameter at 5 cm from base (d2)
  – Leaf collar count (l1)
  – Green leaf count (l2)
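The SNP screening described above might look like the following sketch. The correlation threshold and the exact notion of "significantly high correlation" are not specified on the slide, so both are placeholders.

```python
import numpy as np

def filter_snps(X, Y, corr_threshold=0.1):
    """Screen SNP columns: drop duplicates, then keep SNPs correlated
    with at least one response (sketch; threshold is a placeholder).

    X : (n, num_snps) SNP matrix, Y : (n, num_traits) responses.
    Returns the filtered matrix and the kept original column indices.
    """
    # 1. Remove duplicate SNP columns, keeping each first occurrence.
    _, keep = np.unique(X, axis=1, return_index=True)
    keep = np.sort(keep)
    X = X[:, keep]
    # 2. Pearson correlation of every SNP against every response.
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-12)
    Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-12)
    corr = np.abs(Xc.T @ Yc) / X.shape[0]   # (num_snps, num_traits)
    mask = corr.max(axis=1) > corr_threshold
    return X[:, mask], keep[mask]
```

Applied to the 80,000+ raw SNPs, a screen of this shape is what reduces the design matrix to the p = 2,937 columns used in the study.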


Real Data Analysis: Multi-response GWAS

Smoothed coefficient matrix obtained by simpler formulation

Smoothed coefficient matrix obtained by general formulation

Figure 8 (Yu et al.): smoothed coefficient matrix obtained from formulations 1 (left) and 2 (right), revealing the bi-clustering structure.



Real Data Analysis: Multi-response GWAS

§ Distribution of coefficients for height traits for all SNPs across chromosomes 1-10. The red lines are loci of known height genes (Dw1, Dw2, Dw3), and the black and gray dots correspond to coefficients of formulations 1 and 2 respectively.



Real data analysis: phenotypic trait prediction from remote sensed data

§ The experimental data was obtained from 18 Sorghum varieties planted in 6 replicate plot locations. We considered the trait of plant height.
§ From the RGB and hyperspectral images of each plot, we extract features of length 206.
§ Hence n = 6, p = 206, and the number of tasks k = 18.
§ Having multiple varieties with far fewer replicates than predictors poses a major challenge: building separate models for each variety is unrealistic, while a single model does not fit all.
§ This is where our proposed simultaneous estimation and clustering approach provides the flexibility to share information among tasks, leading to learning at the requisite level of robustness.


Real data analysis: phenotypic trait prediction from remote sensed data

Table 3 (Yu et al.). RMSE and parameter recovery accuracy of the estimation schemes in the low noise (σ = 1.5) setting.
           Lasso        2-step       Form1        Form2
RMSE       1.627±0.02   1.622±0.02   1.613±0.02   1.612±0.02
Rec. acc.  0.234±0.03   0.231±0.03   0.223±0.03   0.222±0.03

Table 4. RMSE and parameter recovery accuracy of the estimation schemes in the high noise (σ = 3) setting.
           Lasso        2-step       Form1        Form2
RMSE       3.34±0.02    3.30±0.02    3.23±0.02    3.16±0.02
Rec. acc.  0.364±0.06   0.362±0.06   0.327±0.05   0.325±0.06

Figure 7. Tree structure of tasks (varieties) inferred using our approach for plant height.

Table 5. RMSE for plant height prediction.
Method                         RMSE
Single model                   44.39±6.55
No group multitask learning    36.94±6.10
Kang et al.                    37.55±7.60
Proposed                       33.31±5.10

Table 6. Comparison of test RMSE on the multi-response GWAS dataset.
        Lasso   2-step   Form1   Form2
RMSE    2.181   2.206    2.105   2.119

Tree structure of tasks (varieties) inferred for plant height by sweeping the penalty parameter λ2


• Single model: a single predictive model using Lasso, treating all the varieties as i.i.d.
• No group multitask learning: learns a traditional multitask model using Group Lasso, where each variety forms a separate group.
• Kang et al. (2011): uses a mixed integer program to learn shared feature representations among tasks, while simultaneously determining "with whom" each task should share.


Concluding Remarks

§ Introduced and studied formulations for joint estimation and bi-clustering of the parameter matrix in multi-response models.
§ Convex formulations with efficient optimization schemes imply global optimality and adaptability to large datasets.
§ Results with synthetic datasets show improvements compared to baselines.
§ Real data analysis shows promising results for phenotypic prediction and GWAS.
§ Extensions:
  – Can be generalized to other problems by changing the loss function (e.g. logistic loss), the regularization (e.g. beyond the lasso penalty), or the distance functions between the parameter matrix rows and columns.
  – Tensor clustering for GWAS with multiple varieties and multiple phenotypes.
  – Application to GWAS with time-varying phenotypes (a.k.a. fGWAS) to discover potential regime changes in associations.


Acknowledgments

This work was supported by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Grant DE-AR0000593.

We also thank Andrew Linvill for help with annotating panicles in particularly difficult images.