
Sequential and Spatial Supervised Learning

Guohua Hao, Rongkun Shen, Dan Vega, Yaroslav Bulatov and Thomas G. Dietterich

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97331

Abstract

Traditional supervised learning assumes independence between the training examples. However, many statistical learning problems involve sequential or spatial data that are not independent. Furthermore, the sequential or spatial relationships can be exploited to improve the prediction accuracy of a classifier. We are developing and testing new practical methods for machine learning with sequential and spatial data. This poster gives a snapshot of our current methods and results.

Introduction

In classical supervised learning, we assume that the training examples are drawn independently and identically from some joint distribution P(x, y). However, many applications of machine learning involve predicting a sequence of labels for a sequence of observations. New learning methods are needed that can capture the interdependencies between labels. We can formulate this Sequential Supervised Learning problem as follows:

Given: a set of training examples of the form (X, Y), where
• X = (x_1, x_2, …, x_n) is a sequence of feature vectors
• Y = (y_1, y_2, …, y_n) is the corresponding sequence of labels
Goal: find a classifier h that predicts a new X as Y = h(X)

[Figure: chain-structured model with label nodes y_{t-1}, y_t, y_{t+1} connected to observation nodes x_{t-1}, x_t, x_{t+1}]

• Vertical relationships: as in normal supervised learning
• Horizontal relationships: interdependencies between the label variables, which can improve accuracy

Examples include part-of-speech tagging, protein secondary structure prediction, etc.

Extending 1-D observation and label sequences to 2-D arrays, we obtain a similar formulation for the Spatial Supervised Learning problem, where both X and Y have 2-D structure and interdependencies between labels.

[Figure: 2-D grid model with label nodes y_{i,j}, y_{i,j+1}, y_{i+1,j}, y_{i+1,j+1} and observation nodes x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}]

Structural Supervised Learning:

Given: a graph G = (V, E), where each vertex is an (x_v, y_v) pair and some vertices are missing the y label

Goal: Predict the missing y labels

Methods

Sliding window / Recurrent sliding window

[Figure: sliding window — each label y_t is predicted from a window of observations x_{t-1}, x_t, x_{t+1}]

[Figure: recurrent sliding window — each label y_t is predicted from the observation window together with the previously predicted labels]
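To make the sliding-window formulation concrete, here is a minimal sketch (not taken from the poster) of how a labeled sequence can be turned into per-position training examples; the function name and the padding value are assumptions.

```python
# Hypothetical helper (not the authors' code): build training examples for a
# sliding window or recurrent sliding window over a labeled sequence.
def sliding_window_examples(xs, ys, half_width=1, recurrent=False, pad=None):
    """Each example's features are the observations in a window around position t;
    in the recurrent variant, earlier labels are appended as extra inputs."""
    examples = []
    n = len(xs)
    for t in range(n):
        feats = [xs[t + d] if 0 <= t + d < n else pad
                 for d in range(-half_width, half_width + 1)]
        if recurrent:
            # during training, the true previous labels stand in for predicted ones
            feats += [ys[t + d] if t + d >= 0 else pad for d in range(-half_width, 0)]
        examples.append((feats, ys[t]))
    return examples
```

Any standard classifier can then be trained on these (features, label) pairs; at prediction time the recurrent variant feeds its own previous predictions back in as inputs.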

Hidden Markov Model – Joint Distribution P(X,Y)

[Figure: HMM graphical structure — hidden label chain y_{t-1} → y_t → y_{t+1}, with each observation x_t emitted from its label y_t]

• Generalization of naïve Bayes networks
• Transition probability P(y_t | y_{t-1})
• Observation probability P(x_t | y_t)
• Because of its conditional independence assumptions, it is impractical to represent overlapping features of the observations

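As a concrete illustration of the joint distribution above, the following sketch (an assumption, not the poster's code) evaluates log P(X, Y) for a discrete-observation HMM given initial, transition, and observation probability tables.

```python
# Hypothetical example: HMM joint probability with parameter arrays
# pi[y] = P(y_1), A[y_prev, y] = P(y_t | y_{t-1}), B[y, x] = P(x_t | y_t).
import numpy as np

def hmm_log_joint(X, Y, pi, A, B):
    """log P(X, Y) = log pi[y_1] + sum_{t>1} log A[y_{t-1}, y_t] + sum_t log B[y_t, x_t]."""
    logp = np.log(pi[Y[0]]) + np.log(B[Y[0], X[0]])
    for t in range(1, len(X)):
        logp += np.log(A[Y[t - 1], Y[t]]) + np.log(B[Y[t], X[t]])
    return logp
```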

Conditional Random Field – Conditional Distribution P(Y|X)

[Figure: linear-chain CRF structure — connected label nodes y_{t-1}, y_t, y_{t+1}, globally conditioned on the observations x_{t-1}, x_t, x_{t+1}]

• Extension of logistic regression to sequential data
• The label sequence Y forms a Markov random field globally conditioned on the observation X
• Removes the HMM independence assumption


$$\Psi_t(y_{t-1}, y_t \mid X, w) = f(y_{t-1}, y_t, w) + g(x_t, y_t, w)$$

Potential function of the random field

Conditional Probability

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \Psi_t(y_{t-1}, y_t \mid X, w) \right), \qquad Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \Psi_t(y'_{t-1}, y'_t \mid X, w) \right)$$

$$J(w) = \sum_{i=1}^{N} \log P(Y_i \mid X_i)$$

Maximize the log likelihood
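The following sketch shows one way to evaluate these quantities for a linear-chain CRF; it is a brute-force illustration under assumed interfaces (a user-supplied potential function psi), not the training code used in this work, which relies on forward-backward recursions and gradient tree boosting.

```python
# Hypothetical illustration: log P(Y|X) and the log-likelihood J for a linear-chain
# CRF, with Z(X) computed by enumerating all label sequences (only viable for short
# sequences; practical implementations use the forward-backward algorithm).
import itertools
import numpy as np

def log_p_y_given_x(Y, X, psi, n_labels, start_label=0):
    """psi(y_prev, y_t, x_t) plays the role of the potential Psi_t above."""
    def total_score(labels):
        score, prev = 0.0, start_label
        for y_t, x_t in zip(labels, X):
            score += psi(prev, y_t, x_t)
            prev = y_t
        return score
    all_scores = [total_score(labels)
                  for labels in itertools.product(range(n_labels), repeat=len(X))]
    return total_score(Y) - np.logaddexp.reduce(all_scores)

def log_likelihood(data, psi, n_labels):
    """J = sum_i log P(Y_i | X_i) over training pairs (X_i, Y_i)."""
    return sum(log_p_y_given_x(Y, X, psi, n_labels) for X, Y in data)
```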

Parameter Estimation

Iterative scaling and gradient descent – exponential number of parameters

Gradient tree boosting – only necessary interactions among features

Discriminative methods – Score function f(X, Y)

• Averaged perceptron (Collins 2002; see the sketch below)
• Hidden Markov support vector machine (Altun et al. 2003)
• Maximum Margin Markov Network (Taskar et al. 2003)
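For reference, a structured perceptron update along the lines of Collins (2002) can be sketched as follows; this is an assumed simplification (brute-force argmax, no weight averaging), not the cited algorithm verbatim.

```python
# Hypothetical sketch of one structured-perceptron training epoch. phi(X, Y) is a
# joint feature map returning a NumPy vector; practical implementations replace the
# brute-force argmax with Viterbi and average the weights over all updates.
import itertools
import numpy as np

def perceptron_epoch(data, phi, w, n_labels):
    for X, Y in data:
        candidates = itertools.product(range(n_labels), repeat=len(X))
        Y_hat = max(candidates, key=lambda Yc: w @ phi(X, Yc))
        if tuple(Y_hat) != tuple(Y):
            w = w + phi(X, Y) - phi(X, Y_hat)  # move the score toward the true labeling
    return w
```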

Applications

Protein secondary structure prediction

• Assign the secondary structure classes (α-helix, β-sheet, and coil) to a protein's amino acid (AA) sequence; secondary structure leads to the tertiary and/or quaternary structure corresponding to protein function
• Use Position-Specific Scoring Matrix (PSSM) profiles to improve the prediction accuracy
• Use the CB513 dataset, with sequences shorter than 30 AA residues excluded, in our experiment

[Figure: prediction pipeline. Protein AA sequence (e.g., >1avhb-4-AS: IPAYL AETLY YAMKG AGTDD HTLIR VMVSR SEIDL FNIRK EFRKN FATSL YSMIK GDTSG DYKKA LLLLC GEDD) → generate raw profile with PSI-BLAST → feed into CRF → CRF training and testing → output prediction results]
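As an illustration of how the PSI-BLAST profile can be fed to the learner, here is a minimal sketch (assumed, not the pipeline actually used) that turns a PSSM with one row of 20 scores per residue into windowed per-residue feature vectors, zero-padded at the sequence ends.

```python
# Hypothetical helper: windowed PSSM features for per-residue classification.
import numpy as np

def windowed_pssm_features(pssm, window=11):
    """pssm: (n_residues, 20) array of profile scores.
    Returns an (n_residues, window * 20) feature matrix."""
    n, d = pssm.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), pssm, np.zeros((half, d))])
    return np.stack([padded[t:t + window].ravel() for t in range(n)])
```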

Applications (cont’d)

Experiment result

Divide the training data into a sub-training set and a development (validation) set. Try window sizes from 1 to 21 and tree sizes of 10, 20, 30, 50, and 70.

The best window size was 11, and the best tree size was 20. With this configuration, the best number of iterations to train was 110, which gave 66.3% correct predictions on the development set.

Train on the entire training set with this configuration and evaluate on the test set. The result was 67.1% correct.

Neural network sliding windows give better performance than this, so we are currently designing experiments to understand why!

Semantic Role Labeling

Classification of remotely sensed images

References

(1) Dietterich, T. G. (2002). Machine learning for sequential data: a review. Structural, Syntactic, and Statistical Pattern Recognition (pp. 15-30). New York: Springer Verlag.

(2) Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (pp. 282-289). San Francisco, CA: Morgan Kaufmann.

(3) Dietterich, T. G., Ashenfelter, A., & Bulatov, Y. (2004). Training conditional random fields via gradient tree boosting. Proceedings of the 21st International Conference on Machine Learning (pp. 217-224). Banff, Canada.

(4) Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202.

(5) Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins: Structure, Function and Genetics 40:502-511.

(6) Carreras, X. & Marquez, L. (2004). Introduction to the CoNLL-2004 shared task: Semantic role labeling. Proceedings of CoNLL-2004.

(7) Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380-393.

Applications (cont’d)

Acknowledgement

We thank the National Science Foundation for supporting this research under grant number IIS-0307592.

Assign the crop identification classes (unknown, sugar beets, stubble, bare soil, potatoes, carrots) to pixels in the remotely sensed image

Semantic role labeling was the official shared task of the 2004 CoNLL conference. For each verb in a sentence, the task is to find all of its arguments and label their semantic roles.

Example: "He would n’t accept anything of value from those he was writing about"

Role labels: A0 = acceptor, AM-MOD = modal, A1 = thing accepted, AM-NEG = negation, V = verb, A2 = accepted-from

Difficulties for machine learning:
• Humans use background knowledge to figure out semantic roles
• There are about 70 different semantic role tags, which makes learning computationally intensive

Experiment result

• Two forms of feature induction in Conditional Random Fields were compared: a regression tree approach and incremental field growing
• 70 different semantic tags to learn; the training set size is 9,000 and the test set size is 2,000
• Evaluated using F-measure, the harmonic mean of precision and recall over the requested argument types (see the sketch after this list)
• Both methods achieved similar performance, with F-measure around 65; the best published performance was 71.72, using a simple greedy left-to-right sequence labeler
• Again, simpler non-relational approaches outperform the CRF on this task. Why?
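For clarity, the F-measure used above can be computed as follows; the counts and names here are illustrative assumptions, not the CoNLL evaluation script.

```python
# Hypothetical F-measure from span counts: harmonic mean of precision and recall.
def f_measure(n_correct, n_predicted, n_gold):
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```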

Experiment result

Different window sizes affect not only the computation time, but also the accuracy of the classifier

J48 (C4.5) and Naïve Bayes classifiers are the most extensively studied. The results show that Naïve Bayes achieves a higher accuracy with smaller sliding windows, while J48 does better with larger window sizes.

Training set and test set are created by dividing the image in half with a horizontal line. The top half is used as the training set, and the bottom half as the test set.

Training Set Expansion – rotations and reflections of the training images increase the training set 8-fold.

Figure 5: Image with the true class labels. The upper part is the training example and the lower part is the test example.


Compared to individual pixel classification (59%), the recurrent sliding window yields a significant improvement in the accuracy of the classifier.

The effect of bagging and boosting on accuracy is currently under investigation.

                            IC=1, OC=1   IC=1, OC=3   IC=1, OC=5
Naïve Bayesian Classifier      71%          77%          43%
J48: best accuracy is 74% with IC=7 and OC=3
(IC = input context; OC = output context)

Position-Specific Scoring Matrix (example profile; columns are the 20 amino acids, the last column is the secondary-structure class)

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  Class
 1 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3    H
 2 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2    H
 3 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    H
 4 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    H
 5 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1    H
 6 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    H
 7 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2    E
 8 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0    E
 9 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1    E
10 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    E
11 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    E
12 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    C
13 M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1    C
14 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2    C
15 G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3    C
16 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    C
17 G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3    C
18 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0    C

Figure 6: Training set expansion

A classifier is trained, and run on the 8 rotations and reflections of the test set.

A majority vote decides the final class.
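A minimal sketch of this test-time procedure, under assumed interfaces (a classify function that maps an image to per-pixel labels), might look like the following; it is an illustration, not the code used in the experiments.

```python
# Hypothetical illustration: classify the 8 rotations/reflections of a test image,
# map the predicted label maps back to the original orientation, and take a
# per-pixel majority vote.
import numpy as np

def dihedral_transforms(img):
    """The 8 symmetries of the square: 4 rotations, each optionally mirrored.
    Returns (transformed image, function that maps a label map back)."""
    out = []
    for k in range(4):
        out.append((np.rot90(img, k),
                    lambda lab, k=k: np.rot90(lab, -k)))
        out.append((np.fliplr(np.rot90(img, k)),
                    lambda lab, k=k: np.rot90(np.fliplr(lab), -k)))
    return out

def majority_vote_classify(img, classify, n_classes):
    votes = np.zeros(img.shape[:2] + (n_classes,), dtype=int)
    for transformed, undo in dihedral_transforms(img):
        labels = undo(classify(transformed))  # back to the original orientation
        for c in range(n_classes):
            votes[..., c] += (labels == c)
    return votes.argmax(axis=-1)
```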

A sliding window is used to group the input pixels, with a varying square window size; the same is done for the output window. Thus, the label assigned to a pixel depends not only on the pixel intensity values in its neighborhood, but also on the labels placed on the pixels in that neighborhood.
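The per-pixel feature construction can be sketched as follows, treating IC and OC as the side lengths of the input and output windows; the function and padding value are assumptions, and in a fully recurrent setting only pixels that have already been labeled would contribute real labels.

```python
# Hypothetical helper: features for pixel (i, j) = an IC x IC window of intensities
# plus an OC x OC window of (already assigned) neighborhood labels, padded at borders.
import numpy as np

def pixel_features(intensity, labels, i, j, ic=1, oc=3, pad=0):
    def window(arr, half):
        return [arr[r, c] if 0 <= r < arr.shape[0] and 0 <= c < arr.shape[1] else pad
                for r in range(i - half, i + half + 1)
                for c in range(j - half, j + half + 1)]
    return np.array(window(intensity, ic // 2) + window(labels, oc // 2))
```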

[Figure: pixel labels produced by Naïve Bayes (IC=1, OC=3) on each of the 8 rotations and reflections of the test image, followed by majority voting to produce the final classification result]

Conclusions and Future Work

In recent years, substantial progress has been made on sequential and spatial supervised learning problems. This poster has reviewed some of the existing methods and presented our current methods and experimental results in several applications. Future work will include:

• Developing methods that can handle a large number of classes
• Discriminative methods using large-margin principles
• Understanding why structural learning methods, such as CRFs, do not outperform classical methods on some structural learning problems