
Sequential and Spatial Supervised Learning

Guohua Hao, Rongkun Shen, Dan Vega, Yaroslav Bulatov and Thomas G. Dietterich

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97331

Abstract

Traditional supervised learning assumes independence between the training examples. However, many statistical learning problems involve sequential or spatial data that are not independent. Furthermore, the sequential or spatial relationships can be exploited to improve the prediction accuracy of a classifier. We are developing and testing new practical methods for machine learning with sequential and spatial data. This poster gives a snapshot of our current methods and results.

Introduction

In classical supervised learning, we assume that the training examples are drawn independently and identically from some joint distribution P(x, y). However, many applications of machine learning involve predicting a sequence of labels for a sequence of observations. New learning methods are needed that can capture the interdependencies between labels. We can formulate this Sequential Supervised Learning problem as follows:

Given: a set of training examples of the form (X, Y), where
• X = (x_1, x_2, …, x_n) is a sequence of feature vectors
• Y = (y_1, y_2, …, y_n) is the corresponding sequence of labels
Goal: find a classifier h that predicts a new X as Y = h(X)

[Figure: chain-structured model with label nodes y_{t-1}, y_t, y_{t+1} connected to observation nodes x_{t-1}, x_t, x_{t+1}]

• Vertical relationships: as in normal supervised learning
• Horizontal relationships: interdependencies between the label variables, which can improve accuracy

Examples include part-of-speech tagging, protein secondary structure prediction, etc.

Extending 1-D observation and label sequences to 2-D arrays, we obtain a similar formulation for the Spatial Supervised Learning problem, where both X and Y have 2-D structure and interdependencies between labels.

[Figure: 2-D grid model with label nodes y_{i,j}, y_{i,j+1}, y_{i+1,j}, y_{i+1,j+1} and observation nodes x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}]

Structural Supervised Learning:

Given: a graph G = (V, E), where each vertex is an (x_v, y_v) pair and some vertices are missing the y label

Goal: Predict the missing y labels

Methods

Sliding window / Recurrent sliding window

[Figure: sliding window — each label y_t is predicted from a window of observations x_{t-1}, x_t, x_{t+1}]

[Figure: recurrent sliding window — each label y_t is predicted from the observation window together with the previously predicted labels]
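To make the sliding-window formulation concrete, here is a minimal sketch (not taken from the poster) of how a labeled sequence can be turned into per-position training examples; the function name and the padding value are assumptions.

```python
# Hypothetical helper (not the authors' code): build training examples for a
# sliding window or recurrent sliding window over a labeled sequence.
def sliding_window_examples(xs, ys, half_width=1, recurrent=False, pad=None):
    """Each example's features are the observations in a window around position t;
    in the recurrent variant, earlier labels are appended as extra inputs."""
    examples = []
    n = len(xs)
    for t in range(n):
        feats = [xs[t + d] if 0 <= t + d < n else pad
                 for d in range(-half_width, half_width + 1)]
        if recurrent:
            # during training, the true previous labels stand in for predicted ones
            feats += [ys[t + d] if t + d >= 0 else pad for d in range(-half_width, 0)]
        examples.append((feats, ys[t]))
    return examples
```

Any standard classifier can then be trained on these (features, label) pairs; at prediction time the recurrent variant feeds its own previous predictions back in as inputs.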

Hidden Markov Model – Joint Distribution P(X,Y)

[Figure: HMM graphical structure — hidden label chain y_{t-1} → y_t → y_{t+1}, with each observation x_t emitted from its label y_t]

• Generalization of naïve Bayes networks
• Transition probability P(y_t | y_{t-1})
• Observation probability P(x_t | y_t)
• Because of its conditional independence assumptions, it is impractical to represent overlapping features of the observations

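As a concrete illustration of the joint distribution above, the following sketch (an assumption, not the poster's code) evaluates log P(X, Y) for a discrete-observation HMM given initial, transition, and observation probability tables.

```python
# Hypothetical example: HMM joint probability with parameter arrays
# pi[y] = P(y_1), A[y_prev, y] = P(y_t | y_{t-1}), B[y, x] = P(x_t | y_t).
import numpy as np

def hmm_log_joint(X, Y, pi, A, B):
    """log P(X, Y) = log pi[y_1] + sum_{t>1} log A[y_{t-1}, y_t] + sum_t log B[y_t, x_t]."""
    logp = np.log(pi[Y[0]]) + np.log(B[Y[0], X[0]])
    for t in range(1, len(X)):
        logp += np.log(A[Y[t - 1], Y[t]]) + np.log(B[Y[t], X[t]])
    return logp
```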

Conditional Random Field – Conditional Distribution P(Y|X)

[Figure: linear-chain CRF structure — connected label nodes y_{t-1}, y_t, y_{t+1}, globally conditioned on the observations x_{t-1}, x_t, x_{t+1}]

• Extension of logistic regression to sequential data
• The label sequence Y forms a Markov random field globally conditioned on the observation X
• Removes the HMM independence assumption


$$\Psi_t(y_{t-1}, y_t \mid X, w) = f(y_{t-1}, y_t, w) + g(x_t, y_t, w)$$

Potential function of the random field

Conditional Probability

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \Psi_t(y_{t-1}, y_t \mid X, w) \right), \qquad Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \Psi_t(y'_{t-1}, y'_t \mid X, w) \right)$$

$$J(w) = \sum_{i=1}^{N} \log P(Y_i \mid X_i)$$

Maximize the log likelihood
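The following sketch shows one way to evaluate these quantities for a linear-chain CRF; it is a brute-force illustration under assumed interfaces (a user-supplied potential function psi), not the training code used in this work, which relies on forward-backward recursions and gradient tree boosting.

```python
# Hypothetical illustration: log P(Y|X) and the log-likelihood J for a linear-chain
# CRF, with Z(X) computed by enumerating all label sequences (only viable for short
# sequences; practical implementations use the forward-backward algorithm).
import itertools
import numpy as np

def log_p_y_given_x(Y, X, psi, n_labels, start_label=0):
    """psi(y_prev, y_t, x_t) plays the role of the potential Psi_t above."""
    def total_score(labels):
        score, prev = 0.0, start_label
        for y_t, x_t in zip(labels, X):
            score += psi(prev, y_t, x_t)
            prev = y_t
        return score
    all_scores = [total_score(labels)
                  for labels in itertools.product(range(n_labels), repeat=len(X))]
    return total_score(Y) - np.logaddexp.reduce(all_scores)

def log_likelihood(data, psi, n_labels):
    """J = sum_i log P(Y_i | X_i) over training pairs (X_i, Y_i)."""
    return sum(log_p_y_given_x(Y, X, psi, n_labels) for X, Y in data)
```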

Parameter Estimation

Iterative scaling and gradient descent – exponential number of parameters

Gradient tree boosting – only necessary interactions among features

Discriminative methods – Score function f(X, Y)

• Averaged perceptron (Collins 2002; see the sketch below)
• Hidden Markov support vector machine (Altun et al. 2003)
• Maximum Margin Markov Network (Taskar et al. 2003)
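For reference, a structured perceptron update along the lines of Collins (2002) can be sketched as follows; this is an assumed simplification (brute-force argmax, no weight averaging), not the cited algorithm verbatim.

```python
# Hypothetical sketch of one structured-perceptron training epoch. phi(X, Y) is a
# joint feature map returning a NumPy vector; practical implementations replace the
# brute-force argmax with Viterbi and average the weights over all updates.
import itertools
import numpy as np

def perceptron_epoch(data, phi, w, n_labels):
    for X, Y in data:
        candidates = itertools.product(range(n_labels), repeat=len(X))
        Y_hat = max(candidates, key=lambda Yc: w @ phi(X, Yc))
        if tuple(Y_hat) != tuple(Y):
            w = w + phi(X, Y) - phi(X, Y_hat)  # move the score toward the true labeling
    return w
```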

Applications

Protein secondary structure prediction

• Assign the secondary structure classes (α-helix, β-sheet, and coil) to a protein's amino acid (AA) sequence; secondary structure leads to the tertiary and/or quaternary structure corresponding to protein function
• Use Position-Specific Scoring Matrix (PSSM) profiles to improve the prediction accuracy
• Use the CB513 dataset, with sequences shorter than 30 AA residues excluded, in our experiment

[Figure: prediction pipeline. Protein AA sequence (e.g., >1avhb-4-AS: IPAYL AETLY YAMKG AGTDD HTLIR VMVSR SEIDL FNIRK EFRKN FATSL YSMIK GDTSG DYKKA LLLLC GEDD) → generate raw profile with PSI-BLAST → feed into CRF → CRF training and testing → output prediction results]
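As an illustration of how the PSI-BLAST profile can be fed to the learner, here is a minimal sketch (assumed, not the pipeline actually used) that turns a PSSM with one row of 20 scores per residue into windowed per-residue feature vectors, zero-padded at the sequence ends.

```python
# Hypothetical helper: windowed PSSM features for per-residue classification.
import numpy as np

def windowed_pssm_features(pssm, window=11):
    """pssm: (n_residues, 20) array of profile scores.
    Returns an (n_residues, window * 20) feature matrix."""
    n, d = pssm.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), pssm, np.zeros((half, d))])
    return np.stack([padded[t:t + window].ravel() for t in range(n)])
```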

Applications (cont’d)

Experiment result

Divide the training data into a sub-training set and a development (validation) set. Try window sizes from 1 to 21 and tree sizes of 10, 20, 30, 50, and 70.

The best window size was 11, and the best tree size was 20. With this configuration, the best number of iterations to train was 110, which gave 66.3% correct predictions on the development set.

Train on the entire training set with this configuration and evaluate on the test set. The result was 67.1% correct.

Neural network sliding windows give better performance than this, so we are currently designing experiments to understand why!

Semantic Role Labeling

Classification of remotely sensed images

References

(1) Dietterich, T. G. (2002). Machine learning for sequential data: a review. Structural, Syntactic, and Statistical Pattern Recognition (pp. 15-30). New York: Springer Verlag.

(2) Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (pp. 282-289). San Francisco, CA: Morgan Kaufmann.

(3) Dietterich, T. G., Ashenfelter, A., & Bulatov, Y. (2004). Training conditional random fields via gradient tree boosting. Proceedings of the 21st International Conference on Machine Learning (pp. 217-224). Banff, Canada.

(4) Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202.

(5) Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins: Structure, Function and Genetics 40:502-511.

(6) Carreras, X. & Marquez, L. (2004). Introduction to the CoNLL-2004 shared task: Semantic role labeling. Proceedings of CoNLL-2004.

(7) Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380-393.

Applications (cont’d)

Acknowledgement

We thank the National Science Foundation for supporting this research under grant number IIS-0307592.

Assign the crop identification classes (unknown, sugar beets, stubble, bare soil, potatoes, carrots) to pixels in the remotely sensed image

Semantic role labeling was the official shared task of the 2004 CoNLL conference. For each verb in a sentence, the task is to find all of its arguments and label their semantic roles.

Example: "He would n’t accept anything of value from those he was writing about"

Role labels: A0 = acceptor, AM-MOD = modal, A1 = thing accepted, AM-NEG = negation, V = verb, A2 = accepted-from

Difficulties for machine learning:
• Humans use background knowledge to figure out semantic roles
• There are about 70 different semantic role tags, which makes learning computationally intensive

Experiment result

• Two forms of feature induction in Conditional Random Fields were compared: a regression tree approach and incremental field growing
• 70 different semantic tags to learn; the training set size is 9,000 and the test set size is 2,000
• Evaluated using F-measure, the harmonic mean of precision and recall over the requested argument types (see the sketch after this list)
• Both methods achieved similar performance, with F-measure around 65; the best published performance was 71.72, using a simple greedy left-to-right sequence labeler
• Again, simpler non-relational approaches outperform the CRF on this task. Why?
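For clarity, the F-measure used above can be computed as follows; the counts and names here are illustrative assumptions, not the CoNLL evaluation script.

```python
# Hypothetical F-measure from span counts: harmonic mean of precision and recall.
def f_measure(n_correct, n_predicted, n_gold):
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```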

Experiment result

Different window sizes affect not only the computation time, but also the accuracy of the classifier

J48 (C4.5) and Naïve Bayes classifiers are the most extensively studied. The results show that Naïve Bayes achieves a higher accuracy with smaller sliding windows, while J48 does better with larger window sizes.

Training set and test set are created by dividing the image in half with a horizontal line. The top half is used as the training set, and the bottom half as the test set.

Training Set Expansion – rotations and reflections of the training images increase the training set 8-fold.

Figure 5: Image with the true class labels. The upper part is the training example and the lower part is the test example.


Compared to individual pixel classification (59%), the recurrent sliding window yields a significant improvement in the accuracy of the classifier.

The effect of bagging and boosting on accuracy is currently under investigation.

                            IC=1, OC=1   IC=1, OC=3   IC=1, OC=5
Naïve Bayesian Classifier      71%          77%          43%
J48: best accuracy is 74% with IC=7 and OC=3
(IC = input context; OC = output context)

Position-Specific Scoring Matrix (example profile; columns are the 20 amino acids, the last column is the secondary-structure class)

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  Class
 1 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3    H
 2 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2    H
 3 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    H
 4 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    H
 5 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1    H
 6 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    H
 7 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2    E
 8 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0    E
 9 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1    E
10 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    E
11 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1    E
12 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    C
13 M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1    C
14 K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2    C
15 G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3    C
16 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0    C
17 G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3    C
18 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0    C

Figure 6: Training set expansion

A classifier is trained, and run on the 8 rotations and reflections of the test set.

A majority vote decides the final class.
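A minimal sketch of this test-time procedure, under assumed interfaces (a classify function that maps an image to per-pixel labels), might look like the following; it is an illustration, not the code used in the experiments.

```python
# Hypothetical illustration: classify the 8 rotations/reflections of a test image,
# map the predicted label maps back to the original orientation, and take a
# per-pixel majority vote.
import numpy as np

def dihedral_transforms(img):
    """The 8 symmetries of the square: 4 rotations, each optionally mirrored.
    Returns (transformed image, function that maps a label map back)."""
    out = []
    for k in range(4):
        out.append((np.rot90(img, k),
                    lambda lab, k=k: np.rot90(lab, -k)))
        out.append((np.fliplr(np.rot90(img, k)),
                    lambda lab, k=k: np.rot90(np.fliplr(lab), -k)))
    return out

def majority_vote_classify(img, classify, n_classes):
    votes = np.zeros(img.shape[:2] + (n_classes,), dtype=int)
    for transformed, undo in dihedral_transforms(img):
        labels = undo(classify(transformed))  # back to the original orientation
        for c in range(n_classes):
            votes[..., c] += (labels == c)
    return votes.argmax(axis=-1)
```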

A sliding window is used to group the input pixels, with a varying square window size; the same is done for the output window. Thus, the label assigned to a pixel depends not only on the pixel intensity values in its neighborhood, but also on the labels placed on the pixels in that neighborhood.
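The per-pixel feature construction can be sketched as follows, treating IC and OC as the side lengths of the input and output windows; the function and padding value are assumptions, and in a fully recurrent setting only pixels that have already been labeled would contribute real labels.

```python
# Hypothetical helper: features for pixel (i, j) = an IC x IC window of intensities
# plus an OC x OC window of (already assigned) neighborhood labels, padded at borders.
import numpy as np

def pixel_features(intensity, labels, i, j, ic=1, oc=3, pad=0):
    def window(arr, half):
        return [arr[r, c] if 0 <= r < arr.shape[0] and 0 <= c < arr.shape[1] else pad
                for r in range(i - half, i + half + 1)
                for c in range(j - half, j + half + 1)]
    return np.array(window(intensity, ic // 2) + window(labels, oc // 2))
```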

[Figure: pixel labels produced by Naïve Bayes (IC=1, OC=3) on each of the 8 rotations and reflections of the test image, followed by majority voting to produce the final classification result]

Conclusions and Future Work

In recent years, substantial progress has been made on sequential and spatial supervised learning problems. This poster has reviewed some of the existing methods and presented our current methods and experimental results in several applications. Future work will include:

• Developing methods that can handle a large number of classes
• Discriminative methods using large-margin principles
• Understanding why structural learning methods, such as CRFs, do not outperform classical methods on some structural learning problems