

Selective Prediction with Hidden

Markov Models

Research Thesis

In Partial Fulfillment of The

Requirements for the Degree of

Master of Science in Computer Science

Dmitry Pidan

Submitted to the Senate of

the Technion - Israel Institute of Technology

Sivan 5772, Haifa

May 2012



The research thesis was done under the supervision of Assoc. Prof. Ran El-Yaniv in the Department of Computer Science.

The generous financial help of the Technion is gratefully acknowledged.

Publications

1. R. El-Yaniv and D. Pidan, "Selective Prediction of Financial Trends with Hidden Markov Models". In NIPS 2011.


Contents

List of Figures
List of Tables
Abstract
Abbreviations and Notations
1 Introduction
2 Preliminaries
  2.1 Hidden Markov Models
    2.1.1 Brief Description
    2.1.2 The Three Basic HMM Problems
      2.1.2.1 Inference
      2.1.2.2 Decoding
      2.1.2.3 Training
    2.1.3 HMM for Continuous Data
    2.1.4 Using HMM with Labeled Data
  2.2 Selective Classification / Prediction
    2.2.1 Preliminary Definitions
    2.2.2 Risk Coverage (RC) Trade-Off
  2.3 Related Work
    2.3.1 Financial Prediction with HMMs
    2.3.2 Models for Selective Classification/Prediction
3 Selective Prediction with HMMs
  3.1 Ambiguity Model
  3.2 State-Based Selectivity
    3.2.1 Selective HMM
    3.2.2 Naive-sHMM
    3.2.3 Overcoming Coarseness
    3.2.4 Why Not To Enlarge a State Space?
    3.2.5 Randomized Linear Interpolation (RLI)
    3.2.6 Recursive Refinement (RR)
      3.2.6.1 Flattening a Refined HMM
      3.2.6.2 The Most Likely Aggregate State
      3.2.6.3 Parameters Estimation of the Refining HMM
      3.2.6.4 Comparison to Other Compositional Hidden Markov Models
4 Experimental Results
  4.1 Experimental Setting
  4.2 Experiments with Discrete Data
    4.2.1 Discretization and Quantization
    4.2.2 Ambiguity-Based Model
    4.2.3 Selective HMM
  4.3 Experiments with Continuous Data
    4.3.1 Filtered Data
    4.3.2 RC Trade-off vs. Data Complexity
    4.3.3 Raw Price Data
5 Discussion
References
A Derivation of Baum-Welch re-estimation formulas for RR-HMM


List of Figures

2.1 Coin toss example HMM
2.2 The RC plane and RC trade-off
3.1 HMM-based classifier
3.2 5-state Naive-sHMM coverage vs. allotted bound
3.3 Recursively Refined HMM
3.4 Embedding of 2-state refining HMM
4.1 A walk-forward evaluation procedure
4.2 Quantization of the discrete data sequence with W = 3
4.3 Comparison of Naive-, RLI- and RR-sHMM
4.4 Comparison of the Naive-sHMM and Ambiguity-Based Classifier
4.5 RR-sHMM performance on high rejection rates
4.6 Pure price sequence vs. filtered price sequence (ζ = 1/3), 10.8.2010-31.12.2010
4.7 Comparison of Naive-, RLI- and RR-sHMM for filtered continuous data
4.8 sHMM error improvement for different EMA W parameters
4.9 RC-curves of RR-sHMM for S&P500 and GLD returns
5.1 Distributions of visit and risk train/test differences


List of Tables

4.1 Comparison of ambiguity-based classifiers for different W's
4.2 Comparison of ambiguity-based classifiers with different number of states
4.3 Comparison of quantization window lengths, 17.2.1987-31.12.1998
4.4 Coverage Rates of sHMM
4.5 Coverage Rates of sHMMs for filtered continuous data


Abstract

Focusing on short-term trend prediction in a financial context, we consider the problem of selective prediction, whereby the predictor can abstain from prediction in order to improve its performance. The main characteristic of selective predictors is the trade-off they exhibit between error and coverage rates. In the context of classification, selective prediction is termed 'classification with a reject option,' and there the main idea for implementing rejection is Chow's ambiguity principle [8]. In this work we examine two types of selective HMM predictors. The first is an ambiguity-based rejection in the spirit of Chow. The second is a specialized mechanism for HMMs that identifies low-quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare the performance of the ambiguity-based HMM rejection technique to that of the sHMM approach, demonstrating the effectiveness of both methods and the superiority of the sHMM model.


Abbreviations and Notations

P[X | Y], E[X | Y]   Conditional probability and expectation of random variable X given random variable Y
P[X], E[X]           Probability and expectation of random variable X
EM                   Expectation Maximization
EMA                  Exponential Moving Average
HMM                  Hidden Markov Model
RC                   Risk-Coverage
RLI                  Randomized Linear Interpolation
RR                   Recursive Refinement
sHMM                 Selective Hidden Markov Model


Chapter 1

Introduction

A famous phrase attributed to the English mathematician and philosopher Whitehead is: "Not ignorance, but ignorance of ignorance is the death of knowledge." Indeed, one of the key elements of intelligence is the ability to differentiate between things that we know and things that we don't know. In the machine learning realm, this differentiation can be implemented by "selective predictors." Not only are these models able to output a prediction, but they are also capable of abstaining from decision at certain instances.

Selective prediction is the study of predictive models that can automatically qualify their own predictions and output 'don't know' when they are not sufficiently confident. Currently, manifestations of selective prediction within machine learning mainly exist in the territory of inductive classification, where this notion is often termed 'classification with a reject option.' In the study of a reject option, which was initiated more than 40 years ago by Chow [8], the goal is to enhance accuracy (or reduce 'risk') by compromising the coverage. For a classifier or predictor equipped with a rejection mechanism we can quantify its performance profile by evaluating its risk-coverage (RC) curve, giving the functional relation between error and coverage. The RC curve represents a trade-off: the more coverage we compromise, the more accurate we can expect to be, up to the point where we reject everything and (trivially) never err. The essence of selective classification is to construct classifiers achieving useful (and optimal) RC trade-offs, thus providing the user with control over the choice of the desired risk (with its associated coverage compromise).

In the present work we are concerned with selective prediction in the context of sequential learning. While selective classification in a static context has been studied extensively during the past few decades, selective prediction models for sequential tasks have only been sparsely considered in the literature. This work is focused on the restricted objective of predicting next-day trends in financial sequences. While limited in scope, this problem serves as a good representative of difficult sequential data [27], in addition to being a very interesting and potentially rewarding real-world problem.

A convenient and quite versatile modeling technique for analyzing sequences is the Hidden Markov Model (HMM). HMMs have earned a reputation as useful models in sequential problems due to their high expressivity in modeling sequences and the existence of efficient algorithms for training and inference. Since their introduction by Baum in the 60's, HMMs have been used extensively in numerous applications such as speech recognition, information extraction, and natural language processing, to name a few. The literature pertaining to HMM-based models and techniques for predicting financial data sequences encompasses several dozen papers. We provide an overview of the most related results in Section 2.3.1.

The goals we set for our research were to develop useful selective prediction models based on HMMs. To this end we examined two approaches. The first is a straightforward application of Chow's ambiguity principle implemented with HMMs. The second is a novel and specialized technique utilizing the HMM modularity and state structure. In this approach we identify latent states whose prediction quality is consistently bad, and abstain from predictions while the underlying source is likely to be in those states. We call this model selective HMM (sHMM). While this natural approach can work in principle, if the HMM does not contain sufficiently many "fine-grained" states, whose probabilistic volume (or "visit rate") is small, the resulting risk-coverage trade-off curve will be a coarse step function that prevents fine control and usability. One of our contributions is a solution to this coarseness problem by introducing algorithms for refining sHMMs. The resulting refined sHMMs give rise to smooth RC trade-off curves.

We present the results of a quite extensive empirical study showing the effectiveness of our methods, which can increase the edge in predicting next-day trends. As a benchmark application, we used prediction of the next-day direction of the S&P500 index, and compared the two approaches, namely, the classical ambiguity-based approach and the sHMM approach. Our results demonstrate the superiority of the sHMM, even when the classical approach is given an advantage by adjusting its hyper-parameters favorably in hindsight.


Chapter 2

Preliminaries

2.1 Hidden Markov Models

2.1.1 Brief Description

A discrete-time random process is a collection of random variables {X_t | t ≥ 0}, where t ∈ ℕ denotes a time index. A Markov chain is a discrete-time random process that obeys the following Markov property:

P[X_t \mid X_1, \ldots, X_{t-1}] = P[X_t \mid X_{t-k}, \ldots, X_{t-1}],

for some 1 ≤ k < t, i.e., X_t is independent of X_1, . . . , X_{t-k-1} given X_{t-k}, . . . , X_{t-1}.

The special case k = 1 is called a first-order Markov chain:

P[X_t \mid X_1, \ldots, X_{t-1}] = P[X_t \mid X_{t-1}].    (2.1)

When the process is stationary, i.e., the right-hand side of Equation 2.1 is independent of time, the label t can be omitted, resulting in a set of probabilities,

a_{ij} = P[X_t = x_j \mid X_{t-1} = x_i],

for each pair of values x_i, x_j in the domain of the X_t variables.

Consider coin toss games, where the opponent has two biased coins, c_1 and c_2. At every round, one of the coins is tossed, and then the opponent changes coins from c_1 to c_2 with probability p, and from c_2 to c_1 with probability q. For the first game, coins are chosen with equal probability. The tosses of c_1 and c_2 result in head with probabilities h_1 and h_2, respectively. The player can only see the head/tail output and never knows which coin was actually used in any round (this example is taken from Rabiner [26], Section II.A).

Let S_t be a random variable describing which coin is used at round t, and O_t be a random variable describing the outcome of this round. Then the above scenario of T rounds can be represented by two random processes: S_1 . . . S_T, the hidden coin choices, and O_1 . . . O_T, the observable outcomes. We note that the choice of a coin to toss at round t is independent of the history of the game, given the coin used in round t − 1. Thus, S_1, . . . , S_T is a stationary first-order Markov chain. We also note that the outcome O_t depends only on S_t.

The above model is called a Hidden Markov Model, abbreviated HMM. Generally, an HMM is a doubly embedded stochastic process, where the underlying process is not observable (it is hidden), and can only be guessed through another process that produces observations. It is common practice to describe an HMM as a probabilistic state machine, in which every state can emit observations from an alphabet. Figure 2.1 depicts such a state machine for the above coin toss example. In this example, states correspond to coins, and observations correspond to outcomes. In a general HMM, the sequence of state transitions represents the hidden process, while the sequence of observations represents the observable process.

Figure 2.1: Coin toss example HMM (states c_1 and c_2 switch with probabilities p and q and have self-loop probabilities 1−p and 1−q; c_1 emits H/T with probabilities h_1/1−h_1, and c_2 with probabilities h_2/1−h_2)

Definition 2.1.1. An HMM is a tuple λ ≜ ⟨Q, V, π, A, B⟩ (this definition follows the one given by Rabiner in [26]), where

• Q is the set of states, of size N. We denote individual states as elements of Q ≜ {q_1, . . . , q_N}.

• V is the set of observations, of size M. The elements of V ≜ {v_1, . . . , v_M} are regarded as the observation alphabet.


• π = (π_i)_{i=1}^{N} is the initial state distribution, π_i ≜ P[S_1 = q_i].

• A = (a_{ij})_{i,j=1}^{N} is the state transition probability matrix, a_{ij} ≜ P[S_{t+1} = q_j | S_t = q_i].

• B = (b_i(k))_{i=1,k=1}^{i=N,k=M} is the observation emission probability matrix, b_i(k) ≜ P[O_t = v_k | S_t = q_i].

The HMM parameters π, A and B must obey the following stochastic conditions:

\sum_{i=1}^{N} \pi_i = 1, \qquad \forall i \;\; \sum_{j=1}^{N} a_{ij} = 1, \qquad \forall i \;\; \sum_{k=1}^{M} b_i(k) = 1.    (2.2)

An HMM is usually viewed as a generative model, i.e., as a stochastic source generating an observation sequence O = O_1 . . . O_T. The generation process proceeds as follows:

1. Set t = 1.

2. Choose an initial state, S_1, according to the initial state distribution π.

3. Choose an observation, O_t, according to the emission distribution of S_t.

4. Choose a next state, S_{t+1}, according to the state transition distribution of S_t.

5. Increment t.

6. Return to step 3 if t < T, and otherwise stop.
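As an illustration of this generative view, the following minimal Python sketch samples a state path and an observation sequence from a discrete HMM given π, A and B. The numpy-based parameterization, array names and the numeric values for the coin-toss example are assumptions of this sketch, not the thesis implementation.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng=None):
    """Sample a state path and observation sequence of length T from a discrete HMM.

    pi : (N,)   initial state distribution
    A  : (N, N) transition matrix, A[i, j] = P[S_{t+1}=j | S_t=i]
    B  : (N, M) emission matrix,  B[i, k] = P[O_t=k | S_t=i]
    """
    rng = np.random.default_rng() if rng is None else rng
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)                    # step 2: initial state ~ pi
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))   # step 3: emit an observation from state s
        s = rng.choice(A.shape[1], p=A[s])           # step 4: next state ~ A[s]
    return np.array(states), np.array(obs)

# Coin-toss example of Figure 2.1 with illustrative values p = 0.3, q = 0.4, h1 = 0.9, h2 = 0.2
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # columns: H, T
print(sample_hmm(pi, A, B, T=10))
```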

2.1.2 The Three Basic HMM Problems

In order to use HMMs in real-world applications, the following three basic problems should be solved.

1. Inference - given the HMM λ, coupled with the observation sequence O, how can we calculate P[O | λ] efficiently?

2. Decoding - given the HMM λ, coupled with the observation sequence O, how can we find a state sequence S that corresponds well to O in some meaningful sense?

3. Training - given the observation sequence O, how can we learn the parameters of λ?


The first problem is an evaluation problem, which aims at giving a score that measures how well a given HMM "suits" the observation sequence, or how likely the sequence is to have been generated by the given HMM. This score is useful when trying to choose the "best" model for a given sequence from a group of models, e.g., for classification of observation sequences, where each candidate model represents a class.

The second problem emerges when one has a model with meaningful states (as in many practical applications, e.g., states that represent market conditions in financial applications or states that represent parts of speech in NLP tasks). In this case, the user of the model is often interested in finding, or decoding, the sequence of states that generated a given observation sequence.

Finally, the third problem is essential in every HMM application. In the vast majority of applications, only a set of observation sequences is given, and the actual parameters of the underlying HMM are unknown (or nonexistent). Thus, there is a need to learn parameters that best describe the given observation sequences (and thus are believed to approximate the unknown source).

2.1.2.1 Inference

Let us first return to the coin toss example described in Section 2.1.1. Given the HMM in Figure 2.1, what is the probability P of generating the observation sequence HHTTH with the corresponding state sequence c_1 c_2 c_2 c_1 c_2? Clearly,

P = P[c_1] P[H \mid c_1] P[c_2 \mid c_1] P[H \mid c_2] P[c_2 \mid c_2] P[T \mid c_2] P[c_1 \mid c_2] P[T \mid c_1] P[c_2 \mid c_1] P[H \mid c_2] = 0.5\, h_1\, p\, h_2 (1-q)(1-h_2)\, q\, (1-h_1)\, p\, h_2.

Note that in this and the following equations we omit the conditioning on the model λ when it is clear from the context.

For a general observation sequence, O = O_1 . . . O_T, and a corresponding state sequence, S = S_1 . . . S_T, we obtain

P[O, S \mid \lambda] = \pi_{S_1} b_{S_1}(O_1)\, a_{S_1 S_2} b_{S_2}(O_2) \cdots a_{S_{T-1} S_T} b_{S_T}(O_T),    (2.3)

and the probability P[O | λ] is a summation over all possible realizations of the state sequence S,

P[O \mid \lambda] = \sum_{S} P[O, S \mid \lambda] = \sum_{S} \pi_{S_1} b_{S_1}(O_1)\, a_{S_1 S_2} b_{S_2}(O_2) \cdots a_{S_{T-1} S_T} b_{S_T}(O_T).    (2.4)

The time complexity required for the computation of Equation 2.4 is O(T N^T), since on the order of T arithmetic operations is required for each of the N^T possible state sequences. Of course, this computation is feasible only for very small T's. Thus, a more efficient procedure for the computation of P[O | λ] is required.

The forward-backward procedure [3], [5] provides a recursive solution for computing P[O | λ]. We define a forward variable, α_t(i) = P[O_1 . . . O_t, S_t = q_i | λ], which is computed recursively as follows:

• Initialization:

\alpha_1(i) = \pi_i\, b_i(O_1)

• Recursion:

\alpha_t(i) = \Big[ \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji} \Big] b_i(O_t)    (2.5)

Then, obviously,

P[O = O_1 \ldots O_T \mid \lambda] = \sum_{i=1}^{N} P[O = O_1 \ldots O_T, S_T = q_i \mid \lambda] = \sum_{i=1}^{N} \alpha_T(i).    (2.6)

The computation of Equation 2.6 incurs O(N^2 T) time complexity, which is far more efficient than the computation of Equation 2.4, and can scale to very large T's.
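A compact Python sketch of this forward pass is given below, a minimal illustration assuming the numpy arrays pi, A, B of the earlier sampling sketch; it omits the scaling that is needed in practice to avoid numerical underflow for long sequences.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward variables alpha[t, i] = P[O_1..O_t, S_t = q_i | lambda] and the likelihood P[O | lambda]."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization: alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion (Equation 2.5)
    return alpha, alpha[-1].sum()                     # P[O | lambda] = sum_i alpha_T(i)  (Equation 2.6)
```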

2.1.2.2 Decoding

The first question to address when discussing an "optimal" state sequence is: what is a meaningful optimality criterion? Or, in other words, how do we compare two state sequences S^1 and S^2 and decide which of them "better" matches the observation sequence O given the model λ? There are several possible optimality criteria, and in this section we discuss two of them, both likelihood-based measures.

The first one is the sequence of "individually" most likely states. We form a state sequence, S* = S*_1 . . . S*_T, by finding for each observation O_t in O the state S*_t such that

S^*_t = \arg\max_{q_i \in Q} P[S_t = q_i \mid O, \lambda].

The quantity P[S_t = q_i | O, λ] is denoted by γ_t(i). This quantity can be calculated using the forward-backward procedure from Section 2.1.2.1. We first define the backward variable β_t(i) = P[O_{t+1} . . . O_T | S_t = q_i, λ], which is computed as follows:

• Initialization:

\beta_T(i) = 1


• Recursion:

\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)    (2.7)

Then

\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}    (2.8)

Clearly, \sum_{i=1}^{N} \gamma_t(i) = 1, hence γ_t(·) is a probability measure.

The criterion of individually most likely states has one major drawback: the resulting state sequence can be illegal, i.e., it may be impossible to produce it with the given model (for example, when the model contains impossible transitions, i.e., states q_i and q_j for which a_{ij} = 0, and those transitions appear in the sequence). To overcome this drawback, another optimality criterion can be used: the likelihood of the whole state sequence given the model and observations,

S^* = \arg\max_{S} P[S \mid O, \lambda].

Maximizing P[S | O, λ] is equivalent to maximizing P[S, O | λ], since P[O | λ] is independent of S. There exists a dynamic programming algorithm, called the Viterbi algorithm [32], that maximizes P[S, O | λ] over the set of all possible realizations of S. The Viterbi algorithm can be expressed as follows.

1. Initialization:

\delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0.

2. Recursion:

\delta_t(i) = \max_{1 \le j \le N} [\delta_{t-1}(j)\, a_{ji}]\, b_i(O_t), \qquad \psi_t(i) = \arg\max_{1 \le j \le N} [\delta_{t-1}(j)\, a_{ji}].

3. Backtracking:

S^*_T = \arg\max_{1 \le i \le N} \delta_T(i); \qquad S^*_t = \psi_{t+1}(S^*_{t+1}), \quad 1 \le t \le T - 1.


2.1.2.3 Training

The solutions (Sections 2.1.2.1, 2.1.2.2) to the inference and decoding problems assume the existence of a hypothetical HMM that is supposed to be the underlying source of the given observation sequence. However, this modeling assumption does not necessarily hold in a great many practical applications.

Given a certain alphabet (determined by the given observation sequence) and a known number of states, the training process aims at estimating the remaining parameters of the hypothetical HMM λ: π, A and B. Depending on the criterion for the effectiveness of a given parameter choice, parameter estimation is usually a nonlinear optimization problem. In this section we describe the solution for the most popular criterion, maximum likelihood (ML), i.e., a choice of the parameters π, A and B such that P[O | λ] is maximized. Alternative optimization criteria, like MMI (maximum mutual information) [1] or MDI (minimum discrimination information) [12], have also been proposed in the literature. Generally, the choice of the most suitable optimization criterion depends on the application, and ML is usually considered the default one (in particular in financial applications).

There exists an efficient procedure that finds parameters π, A and B such that P[O | λ] is at one of its local maxima. This procedure is called the Baum-Welch algorithm [4]. The algorithm is an instance of the general family of EM (expectation-maximization) algorithms [9]. The essence of the procedure is that instead of directly maximizing P[O | λ], we iteratively maximize Baum's auxiliary function,

Q(\lambda, \bar{\lambda}) = \sum_{S} P[S \mid O, \lambda] \log P[O, S \mid \bar{\lambda}],    (2.9)

starting with some initial setting for π, A and B. The maximization of this auxiliary function guarantees that P[O | \bar{\lambda}] ≥ P[O | λ]. Given this guarantee, we can now define a re-estimation procedure that iteratively increases P[O | λ] until convergence:

1. Start with a λ initialized to some random π, A, B.

2. Find \bar{\lambda} that maximizes Q(\lambda, \bar{\lambda}).

3. If |P[O | \bar{\lambda}] − P[O | λ]| > ε, return to Step 2 with λ = \bar{\lambda}. Otherwise, output λ.


Before we show how to find the \bar{\lambda} that maximizes Q(\lambda, \bar{\lambda}), we need to define the following quantity:

\xi_t(i, j) = P[S_t = q_i, S_{t+1} = q_j \mid O, \lambda] = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}.    (2.10)

Define A_{ij} to be a random variable counting the number of transitions from state q_i to state q_j. Define B_{ik} to be a random variable counting the number of visits to state q_i while the observation is v_k. Then,

\bar{\pi}_i = P[S_1 = q_i \mid O, \lambda] = \gamma_1(i),    (2.11a)

\bar{a}_{ij} = \frac{E[A_{ij} \mid O, \lambda]}{\sum_{j=1}^{N} E[A_{ij} \mid O, \lambda]} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{j=1}^{N} \sum_{t=1}^{T-1} \xi_t(i, j)},    (2.11b)

\bar{b}_i(k) = \frac{E[B_{ik} \mid O, \lambda]}{\sum_{k=1}^{M} E[B_{ik} \mid O, \lambda]} = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}.    (2.11c)

The interpretation of the Baum-Welch procedure in terms of the EM algorithm is as follows. The E (expectation) step is the calculation of Baum's auxiliary function (Equation 2.9), and the M (maximization) step is the evaluation of Equations 2.11a-2.11c.
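As a concrete illustration, here is a minimal Python sketch of one Baum-Welch re-estimation pass (Equations 2.10-2.11c) for a single discrete observation sequence. It is an unscaled toy version under the same assumed numpy parameterization, not the thesis implementation.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Unscaled forward and backward variables for a discrete HMM (Equations 2.5 and 2.7)."""
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(pi, A, B, obs):
    """One re-estimation pass of (pi, A, B) on a single observation sequence (Equations 2.11a-2.11c)."""
    obs = np.asarray(obs)
    M = B.shape[1]
    alpha, beta = forward_backward(pi, A, B, obs)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                        # gamma_t(i), Equation 2.8
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)                         # xi_t(i, j), Equation 2.10
    new_pi = gamma[0]                                                # Equation 2.11a
    new_A = xi.sum(axis=0)
    new_A /= new_A.sum(axis=1, keepdims=True)                        # Equation 2.11b
    new_B = np.vstack([gamma[obs == k].sum(axis=0) for k in range(M)]).T
    new_B /= gamma.sum(axis=0)[:, None]                              # Equation 2.11c
    return new_pi, new_A, new_B
```

Iterating baum_welch_step until the likelihood change falls below ε reproduces the re-estimation loop described above.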

The state space size (the parameter N) is often considered a hyper-parameter to be found empirically, by experimenting with different sizes and selecting the one that best describes the data. Another possible methodology is to use a fixed number of states based on prior belief/knowledge about the underlying source (e.g., the number of parts of speech in NLP processing, or the number of basic market conditions in financial applications). Finally, a variety of heuristics that aim to find the optimal state space size have been proposed in the literature (see, e.g., [30]). Since we did not apply those techniques in our research, their overview is beyond the scope of this manuscript.

2.1.3 HMM for Continuous Data

In Sections 2.1.1 and 2.1.2, we considered HMMs that model sequences consisting of observations from some discrete domain. HMMs, however, can be used for modeling continuous observation densities as well. To this end, the discrete emission distribution in each state (modeled by the matrix B) is replaced by a probability density function (pdf). The pdf in each state must, of course, be properly normalized, i.e., given b_i(x), the pdf of state q_i, we require that

\int_{-\infty}^{\infty} b_i(x)\, dx = 1.    (2.12)

The most popular choice for the state probability density function is a finite mixture of Gaussian components,

b_i(O_t) = \sum_{m=1}^{M} c_{im}\, \mathcal{N}(O_t; \mu_{im}, \Sigma_{im}),    (2.13)

where M is the number of mixture components in state q_i, c_{im} is the weight of mixture component m in state q_i, and μ_{im} and Σ_{im} are the mean and covariance of this mixture component. The component weights have to obey \sum_{m=1}^{M} c_{im} = 1 in order to fulfill the condition in Equation 2.12. We note that multi-dimensional observations can be modeled in the same way; in this case, μ_{im} is a mean vector and Σ_{im} is a covariance matrix of the mixture component.

For HMMs with observation probability density functions modeled by mixtures of Gaussians, the solutions of the inference and decoding problems (Sections 2.1.2.1 and 2.1.2.2) remain the same, with the only difference being the use of b_i(O_t) from Equation 2.13 instead of b_i(k) from the discrete case. The solution of the training problem (Section 2.1.2.3) should now contain re-estimation formulas for c_{im}, μ_{im}, and Σ_{im}, for each state q_i and mixture component m. To provide those formulas, we first need to define the variable γ_t(i, m), which describes the probability of being in state q_i with the m'th mixture component interpreting O_t:

\gamma_t(i, m) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} \cdot \frac{c_{im}\,\mathcal{N}(O_t; \mu_{im}, \Sigma_{im})}{\sum_{m=1}^{M} c_{im}\,\mathcal{N}(O_t; \mu_{im}, \Sigma_{im})}.    (2.14)

The re-estimation formulas for the mixture parameters are

\bar{c}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(i, m)},    (2.15a)

\bar{\mu}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m)\, O_t}{\sum_{t=1}^{T} \gamma_t(i, m)},    (2.15b)

\bar{\Sigma}_{im} = \frac{\sum_{t=1}^{T} \gamma_t(i, m)\, (O_t - \mu_{im})(O_t - \mu_{im})'}{\sum_{t=1}^{T} \gamma_t(i, m)},    (2.15c)

where the prime denotes the vector transpose. The re-estimation formulas for π_i and a_{ij} remain unchanged (see Equations 2.11a, 2.11b).
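For illustration, a minimal Python sketch of the mixture emission density of Equation 2.13 is given below; scipy is an assumed dependency and the parameter layout is hypothetical, not the thesis code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_emission(o, c_i, mu_i, sigma_i):
    """b_i(o) = sum_m c_im N(o; mu_im, Sigma_im) for one state i (Equation 2.13).

    c_i     : (M,)       component weights, summing to 1
    mu_i    : (M, d)     component means
    sigma_i : (M, d, d)  component covariance matrices
    """
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=cov)
               for c, mu, cov in zip(c_i, mu_i, sigma_i))

# One-dimensional example with two components
print(mixture_emission(np.array([0.1]),
                       c_i=np.array([0.6, 0.4]),
                       mu_i=np.array([[0.0], [1.0]]),
                       sigma_i=np.array([[[0.5]], [[2.0]]])))
```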

2.1.4 Using HMM with Labeled Data

A labeled observation sequence is a sequence, O = O_1 . . . O_T, accompanied by a corresponding sequence of labels, L = L_1 . . . L_T. For each t, the label L_t can be interpreted as the class to which O_t belongs. This modeling can be very useful in many applications. For example, in NLP problems such as part-of-speech tagging, each O_t denotes a word in the sentence and the corresponding L_t denotes the part of speech of this word. For such labeled data, an important problem, given the sequence O, is to discover the corresponding sequence L. This problem is known as the tagging problem.

One of the most popular HMM solutions for the tagging problem is to represent the data source as a generative model, where each observation O_t is generated by a state labeled with the label L_t. In this case, the tagging problem reduces to the known problem of discovering the best state path given the observation sequence (see Section 2.1.2.2). This approach is used in many applications such as NLP [23] and information extraction [14].

One approach to creating HMMs with labeled states is to assign labels from a known set to the states before training. This approach was first proposed by Krogh in [22], and is known as Class HMM. In this approach, a supervised EM method is used for the model training. The supervised EM method is derived from the classical Baum-Welch algorithm by posing additional restrictions on the forward-backward variables, forcing specific observations to belong to a particular class. Suppose that state q_i is assigned the label l_i. Then the supervised variable α^s_t(i) is computed as follows:

\alpha^s_t(i) = \begin{cases} \alpha_t(i) & l_i = L_t \\ 0 & \text{otherwise} \end{cases}.    (2.16)


All other forward-backward variables are restricted similarly. Then, those variables are used in the regular Baum-Welch re-estimation formulas (Equations 2.11a-2.11c).
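A minimal sketch of this supervision, under the same assumed numpy setup (state_labels and obs_labels are hypothetical names), simply zeroes out the forward variables of states whose label disagrees with the observed label at each step:

```python
import numpy as np

def supervised_forward(pi, A, B, obs, obs_labels, state_labels):
    """Forward pass with the Class-HMM restriction of Equation 2.16:
    alpha_t(i) is kept only when the label of state i equals the observed label L_t."""
    T, N = len(obs), len(pi)
    mask = np.array([[state_labels[i] == obs_labels[t] for i in range(N)] for t in range(T)])
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]] * mask[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]] * mask[t]
    return alpha
```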

An alternative approach is to calculate the state labels a posteriori, on the already trained model. In this approach, the HMM is trained using the unsupervised Baum-Welch algorithm, with observation sequences only, as described in Section 2.1.2.3. Then, the label of each state is calculated from the parameters of the trained model. For example, Zhang [33] uses this approach for the prediction of the next-day trend of the S&P500 index, labeling every state by {±1} using the weighted mean of the Gaussian mixture of this state,

l_i = \mathrm{sign}\Big( \sum_{m=1}^{M} c_{im}\, \mu_{im} \Big).    (2.17)

In our work, we also follow the latter approach. For each state q_i, the label l_i is chosen to be the one that maximizes the expected number of visits to state q_i given that the label at the visit is l_i,

l_i = \arg\max_{l} E[S_t = q_i \mid L_t = l, O, \lambda] = \arg\max_{l} \sum_{t=1,\, L_t = l}^{T} \gamma_t(i).    (2.18)
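A small sketch of this a-posteriori labeling rule (Equation 2.18), assuming a gamma matrix computed as in the Baum-Welch sketch above and a hypothetical label sequence obs_labels over {−1, +1}:

```python
import numpy as np

def state_labels_from_gamma(gamma, obs_labels):
    """Assign to each state the label that maximizes its expected visits under that label (Equation 2.18)."""
    obs_labels = np.asarray(obs_labels)
    labels = np.unique(obs_labels)
    # visits[l, i] = sum over t with L_t = l of gamma_t(i)
    visits = np.stack([gamma[obs_labels == l].sum(axis=0) for l in labels])
    return labels[visits.argmax(axis=0)]          # one label per state
```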

2.2 Selective Classification / Prediction

2.2.1 Preliminary Definitions

To define the performance parameters in selective prediction we utilize the following definitions for selective classifiers from [10]. Let 𝒳 be a feature space, and let (X, Y) ⊆ 𝒳 × {±1} be a set of labeled data instances, assumed to be sampled i.i.d. from some unknown distribution P(X, Y). In standard binary classification the output of the learning algorithm is a function f : 𝒳 → {±1}, constructed so as to minimize the probability of misclassification, P[f(X) ≠ Y].

In its most general form, a selective (binary) classifier is represented as a pair of functions ⟨f, g⟩, where f is a binary classifier and g : 𝒳 → [0, 1] is a selection function for f:

\langle f, g \rangle(x) \triangleq \begin{cases} \text{reject} & \text{w.p. } 1 - g(x) \\ f(x) & \text{w.p. } g(x) \end{cases}.

Whenever g(x) is binary, i.e., g : 𝒳 → {0, 1}, the selective classifier is called deterministic: when g(x) = 1, the prediction f(x) is accepted, and otherwise it is ignored. We note two extreme cases of selective classification: g(x) ≡ 1, which is equivalent to "standard learning" (no "rejects" are allowed), and g(x) ≡ 0, which is the trivial case of rejecting every data instance.


The performance of a selective classifier is measured by its coverage and risk:

Definition 2.2.1. The coverage of the selective classifier ⟨f, g⟩ is the expected volume of non-rejected data instances, C ≜ E[g(X)], where the expectation is taken w.r.t. the unknown underlying distribution.

Definition 2.2.2. The risk of the selective classifier ⟨f, g⟩ is its misclassification rate measured over non-rejected data instances:

R \triangleq \begin{cases} \dfrac{E[I(f(X) \ne Y)\, g(X)]}{C} & C > 0 \\ 0 & \text{otherwise} \end{cases}

It is worth mentioning that for g(X) ≡ 1, Definition 2.2.2 reduces to the standard misclassification rate. We also note that both coverage and risk are unknown quantities that can only be estimated empirically, since they depend on the unknown underlying distribution P(X, Y).
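For a deterministic selective classifier these two quantities are estimated on a sample simply as the accepted fraction and the error rate among accepted instances; a minimal sketch with hypothetical array names follows.

```python
import numpy as np

def empirical_coverage_risk(predictions, accepts, labels):
    """Empirical estimates of coverage (Definition 2.2.1) and risk (Definition 2.2.2).

    predictions : (n,) predicted labels f(x)
    accepts     : (n,) boolean selection decisions g(x) in {0, 1}
    labels      : (n,) true labels
    """
    coverage = accepts.mean()
    if coverage == 0:
        return 0.0, 0.0
    risk = (predictions[accepts] != labels[accepts]).mean()
    return coverage, risk
```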

2.2.2 Risk Coverage (RC) Trade-Off

Risk and coverage trade off each other, and the purpose of a selective prediction model is to provide "sufficiently low" risk with "sufficiently high" coverage. Generally, the user of a selective model would like to bound one measure (either risk or coverage) and then obtain the best model that optimizes the other measure. Pietraszek formulated two optimization models for selective classifiers [25]:

1. Bounded-Abstention Model: given a constraint (lower bound) on the coverage, the learner should output a selective classifier with the lowest risk.

2. Bounded-Improvement Model: given a constraint (upper bound) on the risk, the learner should output a selective classifier with the highest coverage.

The functional relation between risk and coverage is called the risk-coverage (RC) trade-off. Figure 2.2 depicts elements of this RC trade-off. The full plane spanned by the risk and coverage axes is called the RC-plane, consisting of all points that can characterize a selective classifier. The area between the two solid curves is the region where we expect to find an optimal classifier. The area above the top solid line is called the "achievable" area, which means that selective classifiers characterized by points in this area can in principle be found; however, a better classifier may exist. The area below the bottom solid line is a "non-achievable" area, representing risk-coverage profiles that can never be achieved. For the case of binary classification, El-Yaniv and Wiener [10] provide a comprehensive study of the RC-plane and a characterization of its elements.

Figure 2.2: The RC plane and RC trade-off

The dashed curve in Figure 2.2 represents the RC curve (of a certain model). The extreme point on this curve at c = 1 represents standard (non-rejective) learning. The other extreme point at r = 0 represents perfect prediction, which means outputting a classifier that never errs. A selective predictor is useful if its RC curve is "non-trivial," in the sense that progressively smaller risk can be obtained with progressively smaller coverage. Thus, when constructing a selective classification or prediction model it is imperative to examine its RC curve. One can consider theoretical bounds on the RC curve (as in [10]) or empirical ones, as we do here. An interpolated RC curve can be obtained by picking a number of coverage bounds at certain grid points of choice, and learning (and testing) a selective model aiming at achieving the best possible risk for each coverage level. Obviously, each such model should respect the corresponding coverage bound.


2.3 Related Work

2.3.1 Financial Prediction with HMMs

Financial modeling with HMMs has been considered since their introduction by Baum et al. HMMs became a very popular tool for financial prediction (as for many other applications) due to their implementation simplicity on the one hand, and their powerful expressibility on the other. While a complete survey of financial prediction with HMMs is clearly beyond our scope, we mention notably related results.

Hamilton [17] introduced a regime-switching model, in which the sequence is hypothesized to be generated by a number of hidden sources, or regimes, whose switching process is modeled by a first-order Markov chain. Later, in [29] a Hidden Markov Model of neural network "experts" was used for prediction of half-hour and daily price changes of the S&P500 index. Zhang [33] applied this model to the prediction of the next-day S&P500 trend. His model employed a mixture of Gaussians in the states, instead of neural networks. The latter two works reported prominent results in terms of cumulative profit. Later, Idvall and Jonson [20] made an attempt to use Zhang's model for foreign exchange rate forecasting, and found the results to be too unstable for this task. Recent experimental work by Rao and Hong [27] evaluated HMMs for next-day trend prediction in terms of accuracy, and reported a slight but consistent positive edge over random guessing.

In [6], an HMM-based classifier was proposed for "reliable trends," defined to be specialized 15-day return sequences that end with either five consecutive positive or five consecutive negative returns. A three-class classifier was constructed using two HMMs, one trained to identify upward (reliable) trends and the other to identify downward trends. To discriminate the third class of non-reliable trends, an ambiguity-based rule in the style of Chow's policy was utilized. Despite the use of this rejection policy in the classifier, this technique does not provide a general selective classification solution, because the class to be rejected has a predefined structure (non-reliable trend), so that the notion of RC trade-off is not well defined in this context.

2.3.2 Models for Selective Classification/Prediction

Selective classification (mainly known as classification with a reject option) was introduced by Chow [8], who took a Bayesian route to infer the optimal rejection rule and analyze the risk-coverage trade-off under complete knowledge of the underlying probabilistic source. Chow's Bayes-optimal policy is to reject instances whenever none of the posterior probabilities is sufficiently predominant. While this policy cannot be explicitly applied in agnostic settings, it marked a general ambiguity-based approach to rejection strategies. There is a substantial volume of research contributions on selective classification where the main theme is the implementation of reject mechanisms for particular classifier learning algorithms like support vector machines; see, e.g., [31]. Most of these mechanisms can be viewed as variations of Chow's ambiguity-based policy.

The general consensus is that selective classification can often provide substantial error reductions, and therefore rejection techniques have found good use in numerous applications; see, e.g., [18]. In the context of combining selective classification with HMMs, rejection mechanisms were utilized in [21] as a post-processing output verifier for an HMM-based handwritten word recognition system.

There have also been a few theoretical studies providing worst-case probability bounds on the risk-coverage trade-off. Pietraszek [25] formulated the problem of building an abstaining binary classifier as bounded-risk and bounded-coverage optimization problems, and provided a method for building such a classifier under each criterion using ROC analysis. Bartlett and Wegkamp provided an excess risk bound for ERM learning of classifiers that can reject [2]. Freund et al. developed certain coverage/risk bounds for selective ensemble methods [15]. El-Yaniv and Wiener [10] provided characterizations of (optimal) RC trade-offs in various settings, for noise-free models, including bounds for perfect classification. These results were extended to agnostic settings in [11].


Chapter 3

Selective Prediction with HMMs

In this chapter we present two approaches to building HMM-based selective predictors. The first approach, presented in Section 3.1, implements the classical ambiguity rejection principle. The idea is to build an HMM classifier for short observation sequences and to decide on rejection using the classification reliability. The second approach, presented in Section 3.2.1, is a contribution of the present work. Here we utilize the modular structure of HMMs, identifying and quantifying non-reliable states, and rejecting from those states.

3.1 Ambiguity Model.

In this approach we construct an HMM-based classifier, similar to the one used in [6], and endow it with a rejection mechanism in the spirit of Chow [8]. The classifier consists of two HMMs, λ+ and λ−, that model positively and negatively labeled observation sequences, respectively. The classifier is trained on a pool of binary-labeled observation sequences, and learns to identify every new observation sequence as either a positive or a negative one. This training scheme is depicted schematically in Figure 3.1.

Figure 3.1: HMM-based classifier

The training set, consisting of binary-labeled sequences, {(O^1, l^1), . . . , (O^N, l^N)}, l^I ∈ {±1}, is partitioned into two sets, P and N, where P consists of the positive instances and N consists of the negative instances. That is, P ≜ {O^I | l^I = 1} and N ≜ {O^I | l^I = −1} (we use uppercase notation for the indexes of observation sequences in order to differentiate the indexing of sequences from the indexing of observations within a single sequence). We thus train two HMMs, λ+ and λ−, using P and N, respectively, where λ+ is trained to identify positively labeled sequences and λ− negatively labeled ones. Each new observation sequence O is classified as sign(P[O | λ+] − P[O | λ−]).

To apply Chow's ambiguity idea to the model (λ+, λ−), we need to define a measure C(O) of prediction confidence for any observation sequence O. A natural choice in this context is to measure the log-likelihood difference between the positive and negative models, normalized by the length of the sequence. Thus, we define

C(O) \triangleq \Big| \frac{1}{T} \big( \log P[O \mid \lambda^+] - \log P[O \mid \lambda^-] \big) \Big|,

where T is the length of O. The greater C(O) is, the more confident we are in the classification of O. Now, given the classification confidences of all sequences in the training data set, and given a required bound on the rejection rate, 0 ≤ B ≤ 1, an empirical threshold can be found such that a designated number of instances with the smallest confidence measures will be rejected:

\text{threshold} = \max \, \nu \quad \text{s.t.} \quad \frac{\big| \{ O^I \mid C(O^I) < \nu \} \big|}{N} < B.    (3.1)


If the data is highly non-stationary (e.g., financial sequences), this threshold can be re-estimated at the arrival of every new data instance, by adding this data instance to the training pool and recalculating Equation (3.1).
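A minimal Python sketch of this confidence-based thresholding is given below; the confidence values are assumed to have been computed as C(O^I) = |log P[O^I | λ+] − log P[O^I | λ−]| / T_I, e.g., via the forward likelihood sketched earlier, and the random stand-ins in the usage line are purely illustrative.

```python
import numpy as np

def rejection_threshold(confidences, B):
    """Empirical threshold of Equation 3.1: the largest observed confidence value nu such that
    the fraction of instances with confidence strictly below nu stays under the rejection bound B."""
    conf = np.sort(np.asarray(confidences))
    n = len(conf)
    best = 0.0
    for nu in conf:                       # candidate thresholds: the observed confidence values
        if (conf < nu).sum() / n < B:
            best = max(best, nu)
    return best

# illustrative usage with random stand-ins for the training confidences
print(rejection_threshold(np.random.rand(100), B=0.2))
```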

To apply the classifier approach to a financial (autoregressive) prediction problem, where the training data is usually a single sequence of historical prices, there is a need to convert this data into a pool of labeled observation sequences on which the HMM components of the classifier can be trained. This can be achieved by partitioning the sequence of prices into intervals, where each interval is labeled with the direction of the price movement at the next time point after the interval. The partitioning process is shown in more detail in Section 4.2.1, and depicted in Figure 4.2.

3.2 State-Based Selectivity.

In this section we present a novel state-based approach to implementing selective prediction with HMMs. The idea of the approach is to designate an appropriate subset of the states as "rejective," and then consider outputting a "don't know" answer based on how likely it is that we are currently visiting a rejective state. We refer to this model as selective HMM (sHMM).

3.2.1 Selective HMM

We develop our model using a labeled observation sequence O. That is, the observation sequence, O = O_1 . . . O_T, has an associated sequence, L = L_1 . . . L_T, of labels, where each label, L_t ∈ 𝓛, denotes the class to which the corresponding observation O_t belongs. Each state q_i in our HMM is associated with a class, specified by a label l_i ∈ 𝓛, such that each observation generated by this state is supposed to belong to that class (see Section 2.1.4 for a more detailed explanation of using HMMs with labeled data).

For our prediction task, namely, guessing the next-day direction of a financial price sequence, we employ the solution of the specific case of the tagging problem mentioned in Section 2.1.4. First, we associate each observation O_t in the training data sequence with a label L_t ∈ {±1}, which reflects the correct prediction of the sequence direction at time t (actually, the direction of the sequence at time t + 1). Consequently, after the training process, each state q_i of the HMM is also assigned a label l_i ∈ {±1}. Thus, the prediction of the direction at time T reduces to discovering the label L_T given the observation sequence, O_1 . . . O_T, and the corresponding label sequence L_1 . . . L_{T−1}. The label L_T is discovered by finding the state in which the model is likely to be at time T, and L_T is chosen to be the label of this state. Since we are interested only in finding the most likely state at a single time instance, and not in discovering the complete state sequence (as in a general tagging problem), we use the individually most likely state for this task (see Section 2.1.2.2 for details).

Despite this very specific setting, the selective prediction solution we present is independent of the set of labels, the way of assigning labels to states, and the way of choosing the state associated with the output label. Thus, this solution can be applied, as is, to a general tagging problem (and it may also be relevant to other HMM problems with labeled data).

We convert an HMM to a selective HMM (sHMM) by designating a subset of states, Q′ ⊆ Q, as rejective; that is, predictions made when being (in terms of most likely states) in one of these states are ignored. We call Q′ a "rejection subset". For developing and analyzing sHMMs, we utilize notations and definitions from selective classification, in particular coverage and risk, as described in Section 2.2.1. Given the HMM λ = ⟨Q, V, π, A, B⟩, a mapping h : Q → 𝓛, h(q_i) = l_i, from states to labels, an observation sequence, O = O_1 . . . O_T, with its corresponding label sequence, L = L_1 . . . L_T, and hidden state sequence, S = S_1 . . . S_T, and a subset of states, Q′ ⊆ Q, designated as rejective, we define the functions f : O → 𝓛 (predictor) and g : O → {0, 1} (qualifier) as follows.

Definition 3.2.1 (Predictor). f(O_t) ≜ l_i, subject to S_t = q_i.

Definition 3.2.2 (Qualifier). g(O_t) \triangleq \begin{cases} 1, & q_i \in Q \setminus Q' \\ 0, & q_i \in Q' \end{cases} subject to S_t = q_i.

We denote by λ^{Q′} an sHMM λ in which Q′ is the subset of rejective states.

3.2.2 Naive-sHMM

As described in Section 2.2, a selective predictor is characterized in terms of its coverage and risk. Moreover, to exploit the risk-coverage trade-off, which is the essence of a selective model, we need to endow the model with a mechanism that allows this trade-off to be controlled. In the ideal case, the resulting RC-curve should be monotonically decreasing. Since in our selective predictor the decision whether to reject a data instance is made based on a state (rather than on the data instance itself, as in classical approaches like ambiguity-based selective models), we need to relate to the risk and coverage of states, and to develop a mechanism that allows a reduction of the total risk via an appropriate selection of rejective states.


Definitions 3.2.1 and 3.2.2 facilitate the calculation of the coverage and risk of an sHMM λ^{Q′}. For this calculation we introduce a random variable T that represents the current time instance (taking each value in {1, . . . , T} with equal probability).

C(\lambda^{Q'}) = E[g(O)] = \sum_{t=1}^{T} P[T = t]\, P[S_t \notin Q' \mid O, \lambda]
  = \sum_{t=1}^{T} P[T = t] \sum_{q_i \notin Q'} P[S_t = q_i \mid O, \lambda]
  = \frac{1}{T} \sum_{t=1}^{T} \sum_{q_i \notin Q'} \gamma_t(i).    (3.2)

R(\lambda^{Q'}) = \frac{1}{C(\lambda^{Q'})} E[I(f(O) \ne L)\, g(O)]
  = \frac{1}{C(\lambda^{Q'})} \sum_{t=1}^{T} P[T = t]\, P[S_t = q_i \notin Q' \mid O, \lambda, L_t \ne l_i]
  = \frac{1}{C(\lambda^{Q'})\, T} \sum_{q_i \notin Q'} \sum_{\substack{t=1 \\ L_t \ne l_i}}^{T} \gamma_t(i)
  = \left. \sum_{q_i \notin Q'} \sum_{\substack{t=1 \\ L_t \ne l_i}}^{T} \gamma_t(i) \,\right/ \sum_{q_i \notin Q'} \sum_{t=1}^{T} \gamma_t(i).    (3.3)

Let Ci be a random variable counting the number of times the HMM visits state qi given the observation sequence O. For each state qi ∈ Q, the visit rate, v(i), quantifies the coverage contribution of qi, and is defined to be the fraction of time the HMM spends in qi. The risk, r(i), quantifies the reliability of qi, and is defined to be the fraction of erroneous predictions generated while at this state. Formally, given the HMM λ and the observation sequence O, we define:

Definition 3.2.3. The visit rate, v(i), of a state qi ∈ Q is

v(i) \triangleq \frac{1}{T} E[C_i \mid O, \lambda].

Note that v(i) can be calculated using the forward-backward variables γt(i):

v(i) = \frac{1}{T} E[C_i \mid O, \lambda] = \frac{1}{T} \sum_{t=1}^{T} \gamma_t(i).    (3.4)

Definition 3.2.4. The risk, r(i), of a state qi ∈ Q is

r(i) \triangleq \frac{E[C_i \mid O, \lambda, l_i \text{ is wrong}]}{E[C_i \mid O, \lambda]}.

The risk r(i) can also be calculated using the γt(i) variables:

r(i) = \frac{E[C_i \mid O, \lambda, l_i \text{ is wrong}]}{E[C_i \mid O, \lambda]} = \frac{1}{v(i)\, T} \sum_{\substack{t=1 \\ L_t \ne l_i}}^{T} \gamma_t(i).    (3.5)
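To make these per-state quantities concrete, the following minimal Python sketch (our own illustration; the experiments in this thesis were run on Murphy's Matlab HMM Toolbox, so the function name and interface here are hypothetical) computes v(i) and r(i) from a matrix of forward-backward posteriors.

    import numpy as np

    def visit_rates_and_risks(gamma, state_labels, true_labels):
        """gamma[t, i] = P[S_t = q_i | O, lambda] (forward-backward posteriors),
        state_labels[i] = label l_i of state q_i, true_labels[t] = correct label L_t."""
        T, N = gamma.shape
        v = gamma.sum(axis=0) / T                         # Equation 3.4
        # wrong[t, i] is True whenever state q_i's label disagrees with L_t
        wrong = np.asarray(true_labels)[:, None] != np.asarray(state_labels)[None, :]
        r = np.where(v > 0, (gamma * wrong).sum(axis=0) / (v * T), 0.0)   # Equation 3.5
        return v, r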


Using Equations 3.4 and 3.5, we can now express the coverage (Equation 3.2) and risk (Equation 3.3) of the entire sHMM in terms of the visit rates and risks of individual states:

C(\lambda^{Q'}) = \sum_{q_i \notin Q'} v(i)    (3.6a)

R(\lambda^{Q'}) = \frac{\sum_{q_i \notin Q'} r(i)\, v(i)}{\sum_{q_i \notin Q'} v(i)}    (3.6b)

Suppose we are required to meet a user-specified rejection bound 0 ≤ B ≤ 1. This means that we are required to emit predictions (rather than 'don't know's) in at least a 1 − B fraction of the time. To achieve this we apply the following greedy selection procedure of rejective states, whereby the highest-risk states are sequentially selected as long as their overall visit rate does not exceed B. We call the resulting model Naive-sHMM. Formally, let q_{i_1}, q_{i_2}, . . . , q_{i_N} be an ordering of the states such that for each j < k, r(i_j) ≥ r(i_k). Then, the rejection subset ℬ is defined as

\mathcal{B} \triangleq \left\{ q_{i_1}, \ldots, q_{i_K} \;\middle|\; \sum_{j=1}^{K} v(i_j) \le B, \ \sum_{j=1}^{K+1} v(i_j) > B \right\}.    (3.7)

Given a rejection subset ℬ, the first non-rejective state q_ℬ is defined to be a state with the highest risk that is not in ℬ (in case there is more than one such state, we select one arbitrarily), namely,

q_{\mathcal{B}} \triangleq \arg\max_{q_i} \{ r(i) \mid q_i \notin \mathcal{B} \}.    (3.8)
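For illustration only (this is our sketch, not the thesis' actual code), the greedy construction of Equation 3.7 and the choice of the first non-rejective state of Equation 3.8 can be written as follows, given the visit rates and risks computed above.

    import numpy as np

    def naive_rejection_subset(v, r, bound):
        """Greedily reject the highest-risk states while their total visit rate
        stays within the rejection bound B (Equation 3.7); also return the
        first non-rejective state (Equation 3.8)."""
        order = np.argsort(-np.asarray(r, dtype=float))   # states by decreasing risk
        rejected, total = [], 0.0
        for i in order:
            if total + v[i] > bound:
                break                                     # first state that would overflow B
            rejected.append(int(i))
            total += v[i]
        q_B = int(order[len(rejected)]) if len(rejected) < len(order) else None
        return set(rejected), q_B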

Proposition 1 (Monotonicity). Let B1 < B2 be rejection bounds, and let ℬ1, ℬ2 be the corresponding rejection subsets defined by Equation 3.7. Then the following holds.

1. R(λ^{ℬ2}) ≤ R(λ^{ℬ1}).

2. If q_{ℬ1} = qi, v(i) < B2 − B1, and there exists qj s.t. r(i) > r(j), then R(λ^{ℬ2}) < R(λ^{ℬ1}).

Proof.

1. We first note that if B2 = 1, then ℬ2 = Q. Clearly, in this case C(λ^{ℬ2}) = 0 and, using Definition 2.2.2, we know that R(λ^{ℬ2}) = 0, so the statement trivially holds. We therefore assume B2 < 1 and, consequently, ℬ2 ⊊ Q.

R(\lambda^{\mathcal{B}_1}) - R(\lambda^{\mathcal{B}_2})
  = \frac{\sum_{q_i \notin \mathcal{B}_1} r(i)v(i)}{\sum_{q_i \notin \mathcal{B}_1} v(i)}
  - \frac{\sum_{q_i \notin \mathcal{B}_2} r(i)v(i)}{\sum_{q_i \notin \mathcal{B}_2} v(i)}
  = \frac{\left(\sum_{q_i \notin \mathcal{B}_1} r(i)v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_2} v(i)\right)
         - \left(\sum_{q_i \notin \mathcal{B}_2} r(i)v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_1} v(i)\right)}
        {\left(\sum_{q_i \notin \mathcal{B}_1} v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_2} v(i)\right)}.

It is sufficient to prove that the numerator is non-negative:

\left(\sum_{q_i \notin \mathcal{B}_1} r(i)v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_2} v(i)\right)
 - \left(\sum_{q_i \notin \mathcal{B}_2} r(i)v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_1} v(i)\right)    (3.9)
 = \left(\sum_{q_i \in \mathcal{B}_2 \setminus \mathcal{B}_1} r(i)v(i)\right)\left(\sum_{q_i \notin \mathcal{B}_2} v(i)\right)
 - \left(\sum_{q_i \notin \mathcal{B}_2} r(i)v(i)\right)\left(\sum_{q_i \in \mathcal{B}_2 \setminus \mathcal{B}_1} v(i)\right)
 = \sum_{q_i \in \mathcal{B}_2 \setminus \mathcal{B}_1} \sum_{q_j \notin \mathcal{B}_2} r(i)v(i)v(j)
 - \sum_{q_i \in \mathcal{B}_2 \setminus \mathcal{B}_1} \sum_{q_j \notin \mathcal{B}_2} r(j)v(i)v(j)
 = \sum_{q_i \in \mathcal{B}_2 \setminus \mathcal{B}_1} \sum_{q_j \notin \mathcal{B}_2} v(i)v(j)\,\bigl(r(i) - r(j)\bigr),    (3.10)

where the step following Equation 3.9 uses ℬ1 ⊆ ℬ2, which holds by construction. If ℬ2 \ ℬ1 = ∅, then Equation 3.10 is zero. Otherwise, for each pair of states qi ∈ ℬ2 \ ℬ1 and qj ∉ ℬ2 we have r(i) ≥ r(j), and this completes the proof.

2. The proof is separated into two cases. If B2 = 1, then R(λ^{ℬ2}) = 0. Since B1 < B2, ℬ1 ⊊ Q, so C(λ^{ℬ1}) > 0; and since there exists qj such that r(i) > r(j), we have r(i) > 0. Then r(i) > 0 and qi ∉ ℬ1 imply R(λ^{ℬ1}) > 0.

Otherwise, B2 < 1. Since v(i) < B2 − B1, we know that B1 + v(i) < B2, and from Equation 3.7 we deduce that qi ∈ ℬ2. On the other hand, qi = q_{ℬ1}, so qi ∉ ℬ1, entailing that qi ∈ ℬ2 \ ℬ1. Let qk be a state in Q \ ℬ2. From the existence of qj such that r(i) > r(j), either qj ∉ ℬ2, or r(k) ≤ r(j) < r(i). Either way, we have found a pair of states, qi ∈ ℬ2 \ ℬ1 and a state outside ℬ2 (qj in the first case, qk in the second), whose risks differ by a strictly positive amount. Thus Equation 3.10 becomes strictly positive, and R(λ^{ℬ2}) < R(λ^{ℬ1}).


Proposition 1 justifies our construction of the rejection subset, in view of the requirement that the resulting RC-curve should decrease monotonically. The first part of the proposition ensures that if we increase the allotted rejection bound, the risk of the resulting Naive-sHMM will at least not increase. The second part tells us that a sufficient increase of the bound, allowing for a larger rejection subset while there are still reliable states left (more reliable than the rejected ones), will result in an sHMM with strictly lower risk.

3.2.3 Overcoming Coarseness.

The Naive-sHMM approach presented in the previous section suffers from the following coarseness problem. If the model does not comprise a large number of states, or has states with very high visit rates (as is often the case in applications), the total visit rate of the rejective states might be far from the requested bound B, and the selectivity cannot be fully exploited. For example, consider a model that has three states such that r(q1) > r(q2) > r(q3), v(q1) = ε, and v(q2) = B + ε. In this case, only the negligibly visited state q1 will be included in the rejection subset.

Figure 3.2 depicts this problem for a 5-state sHMM example, showing that the actual obtained coverage is a step function (where the number of steps is the number of states) whose levels (the coverage obtained) are sometimes far above the coverage bound. This issue prevents the creation of high-quality selective models and harms the controllability of the model, making certain bounds very hard or even impossible to achieve.

One could argue that a possible solution to this coarseness issue could be to enlarge the state space. While this solution may work in principle, it suffers from a number of drawbacks. We discuss those drawbacks in Section 3.2.4.

In Sections 3.2.5 and 3.2.6 we propose two different methods to deal with the coarseness problem. Both methods are based on developing a more refined rejection procedure. The first method, called Randomized Linear Interpolation (RLI), allows rejecting only part of the predictions emitted from a particular state, thus providing a fine-grained rejection scheme as needed. The second method, called Recursive Refinement (RR), does increase the number of model states, but in a very restricted and regularized manner; that is, it refines states of the original model and constructs a kind of hierarchical model in which the original states are used for making the predictions, and the refining states are used for making decisions about rejections. In our experimental results (Chapter 4) we provide an empirical comparison of both methods.


[Figure 3.2: 5-state Naive-sHMM coverage vs. allotted bound. Actual coverage (y-axis) plotted against the coverage bound (x-axis); the actual coverage is a step function that sometimes lies far above the bound.]

3.2.4 Why Not To Enlarge a State Space?

The most straightforward solution to the coarseness problem presented in the previous section is to enlarge the state space of the predicting HMM. This way, one can hope to obtain a better distribution of visits among states, with a finer visit rate in each state. Then, when the Naive-sHMM algorithm constructs a rejection subset using Equation 3.7, it can get closer to the bound than the original sHMM with a small number of states could. While this solution is very simple and intuitive, it has a number of pitfalls that make it inappropriate for our problem.

First, the choice of the number of HMM states is an architectural issue, which has crucial model-selection implications. For example, in financial applications it is common practice to select the number of states so as to allow for sufficient expressiveness to reflect basic market conditions such as strong up/down trends, trend correction dynamics, etc. Taking more states may cause overfitting. Thus, the number of HMM states is usually dictated by the context, and there is no freedom to pick an arbitrary number of states.

Even if we could select any number of states, there would still be no guarantee that the resulting model would not suffer from the coarseness problem. The reason is that the training process (e.g., via the Baum-Welch algorithm) can converge to a local minimum in which a few states receive large visit rates while the visit rates of the other states are negligible. In our experimental work, this was almost always the case.

Another, and perhaps the most important, reason not to adopt this solution is that even if it works, it does not precisely achieve the target. We recall that our target is, given an HMM predictor, to improve its quality by endowing it with a rejection mechanism. That is, in a real application, the predicting HMM can be very rigorously designed and tuned, and our rejection mechanism should guarantee that, while compromising the coverage, the prediction quality will be at least no worse than that of the predictor with full coverage. But since the number of states is usually a key feature in HMM design, by enlarging the number of states we obtain a completely different HMM, whose predictions may not align with those of the original one, and for which we can provide no guarantee about its quality. Thus, in the very common case where the predictor is manually designed by a user and is given as an input to the rejection algorithm, the solution of enlarging the state space is simply not applicable.

3.2.5 Randomized Linear Interpolation (RLI).

The randomized linear interpolation (RLI) method refines the Naive-sHMM by partly rejecting predictions emitted from the first non-rejective state (defined in Equation 3.8). Partial rejection is implemented by a randomized choice of the instances to reject (with some probability P). Given a Naive-sHMM with a rejection subset ℬ and the corresponding first non-rejective state q_ℬ, the qualifier function g(Ot) from Definition 3.2.2 receives the following probabilistic definition:

P[g(O_t) = 0] \triangleq
\begin{cases}
0, & q_i \in Q \setminus (\mathcal{B} \cup \{q_{\mathcal{B}}\}); \\
P, & q_i = q_{\mathcal{B}}; \\
1, & q_i \in \mathcal{B},
\end{cases}
\qquad \text{subject to } S_t = q_i.    (3.11)

Suppose q_ℬ = qj for some state qj ∈ Q. Denote by λ^{RLI(ℬ)} the RLI-sHMM λ with rejection subset ℬ. Equations 3.12a and 3.12b provide formulas for the coverage and risk of the resulting RLI-sHMM:

C(\lambda^{RLI(\mathcal{B})}) = \sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + (1 - P)\, v(j)    (3.12a)

R(\lambda^{RLI(\mathcal{B})}) = \frac{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} r(i)\, v(i) + (1 - P)\, r(j)\, v(j)}{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + (1 - P)\, v(j)}    (3.12b)


To implement the RLI model it remains, of course, to specify the rejection probability P. Clearly, the optimal choice of P should provide the lowest possible risk subject to the coverage constraint B.

Lemma 1. Let ℬ be a rejection subset, as defined by Equation 3.7, with q_ℬ = qj for some qj ∈ Q. Let λ^{RLI(ℬ)}_1 and λ^{RLI(ℬ)}_2 be two RLI-sHMMs, with corresponding rejection probabilities P1 and P2 on state q_ℬ. If P1 < P2 then R(λ^{RLI(ℬ)}_1) ≥ R(λ^{RLI(ℬ)}_2).

Proof.

R(\lambda^{RLI(\mathcal{B})}_1) - R(\lambda^{RLI(\mathcal{B})}_2)
 = \frac{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} r(i)v(i) + (1 - P_1)\, r(j)v(j)}{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + (1 - P_1)\, v(j)}
 - \frac{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} r(i)v(i) + (1 - P_2)\, r(j)v(j)}{\sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + (1 - P_2)\, v(j)}
 = \frac{\sum_{q_i \notin \mathcal{B}} r(i)v(i) - P_1 r(j)v(j)}{\sum_{q_i \notin \mathcal{B}} v(i) - P_1 v(j)}
 - \frac{\sum_{q_i \notin \mathcal{B}} r(i)v(i) - P_2 r(j)v(j)}{\sum_{q_i \notin \mathcal{B}} v(i) - P_2 v(j)}
 = \frac{(P_2 - P_1)\, v(j) \sum_{q_i \notin \mathcal{B}} v(i)\bigl(r(j) - r(i)\bigr)}
        {\left(\sum_{q_i \notin \mathcal{B}} v(i) - P_1 v(j)\right)\left(\sum_{q_i \notin \mathcal{B}} v(i) - P_2 v(j)\right)},    (3.13)

and since the denominator of Equation 3.13 is positive, P2 > P1, and r(i) ≤ r(j) for each qi ∉ ℬ, Equation 3.13 is always non-negative (it equals zero if and only if r(i) = r(j) for all qi ∉ ℬ).

From Lemma 1 it follows that the optimal choice of P is the maximum possible value satisfying the rejection bound constraint; that is, a choice leading to C(λ^{RLI(ℬ)}) = 1 − B.

Proposition 2 (Optimal RLI). Let λ^{RLI(ℬ)} be an RLI-sHMM with q_ℬ = qj, and let

P = \frac{1}{v(j)} \left( B - \sum_{q_i \in \mathcal{B}} v(i) \right).    (3.14)

Then C(λ^{RLI(ℬ)}) = 1 − B.


Proof.

C(\lambda^{RLI(\mathcal{B})}) = \sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + (1 - P)\, v(j)
 = \sum_{q_i \notin \mathcal{B} \cup \{q_{\mathcal{B}}\}} v(i) + v(j) - P\, v(j)
 = \sum_{q_i \notin \mathcal{B}} v(i) - \frac{1}{v(j)}\left( B - \sum_{q_i \in \mathcal{B}} v(i) \right) v(j)
 = \sum_{q_i \notin \mathcal{B}} v(i) + \sum_{q_i \in \mathcal{B}} v(i) - B = 1 - B.
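As a small illustration (ours, with hypothetical helper names), Proposition 2 translates into the following sketch: the rejection probability P of Equation 3.14 spends the coverage budget left over by the Naive-sHMM, and the qualifier of Equation 3.11 is then randomized only on the first non-rejective state.

    import numpy as np

    def rli_rejection_probability(v, rejected, q_B, bound):
        """Optimal P of Equation 3.14."""
        spent = sum(v[i] for i in rejected)
        return (bound - spent) / v[q_B]

    def rli_qualifier(state, rejected, q_B, P, rng=np.random.default_rng()):
        """Randomized qualifier g of Equation 3.11: reject states in the rejection
        subset, reject q_B with probability P, accept all other states."""
        if state in rejected:
            return 0
        if state == q_B:
            return 0 if rng.random() < P else 1
        return 1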

We complete the discussion of the RLI-sHMM by proving a monotonicity property for this model, as we did for the Naive-sHMM. Note that here the conditions for RC-curve monotonicity are weaker than in the Naive-sHMM case. The reason is that the RLI refinement method allows the coverage bound to be met precisely, in expectation.

Proposition 3 (RLI Monotonicity). Let B1 < B2 be rejection bounds, and let ℬ1, ℬ2 be the corresponding rejection subsets, as defined by Equation 3.7. Let q_{ℬ1} = qj. If there exists qk ∈ Q s.t. r(j) > r(k), then R(λ^{RLI(ℬ2)}) < R(λ^{RLI(ℬ1)}).

Proof. We note that Lemma 1 proves a weak version of the proposition for the case ℬ1 = ℬ2; the additional condition, the existence of qk such that r(j) > r(k), guarantees that Equation 3.13 is strictly positive. Otherwise, ℬ1 ⊂ ℬ2, in which case

R(\lambda^{RLI(\mathcal{B}_1)}) >_{(1)} R(\lambda^{\mathcal{B}_1 \cup \{q_{\mathcal{B}_1}\}}) \ge_{(2)} R(\lambda^{\mathcal{B}_2}) \ge_{(3)} R(\lambda^{RLI(\mathcal{B}_2)}),

where inequality (1) follows from the strict version of Lemma 1, applied with P2 = 1, inequality (2) follows from Proposition 1, and inequality (3) follows from Lemma 1, applied with P1 = 0.

3.2.6 Recursive Refinement (RR).

Given an HMM, the goal in our recursive refinement approach is to construct an approximate HMM whose states have a finer granularity of visit rates. This finer granularity enables the construction of a rejection subset (Equation 3.7) whose total visit rate is closer to the required bound.

In order to minimize the risk of overfitting, we construct a hierarchical model of HMMs, in which each state qi whose visit rate is greater than some predefined bound (called a heavy state) gives rise to another HMM (called a refining HMM), whose purpose is to redistribute among its states the visits within the heavy state. The original heavy state governs the visits to the states of its refining HMM, such that only transitions to this heavy state can result in transitions to its refining states. In addition, refining states inherit their labels from their parent heavy state. Constructed in this way, the resulting hierarchical model is likely to preserve the dynamics of the original, non-refined model, while providing finer visit rates of its states.

Let us demonstrate this construction by an example. Consider a two-state HMM, with Q = {q1, q2}, that generates an observation sequence O = O1 . . . O10 with the corresponding state sequence q2 q1 q2 q2 q2 q2 q1 q2 q2 q2. In order to refine the highly visited state q2, we construct a new HMM with states q3, q4, such that it is likely to generate the observation subsequences defined by the time indexes 1, 3-6, 8-10 (the subsequences corresponding to state q2). We interpret every transition from state q1 to q2 as a transition into an (initial) state of the refining {q3, q4} model, and every self transition within q2 as a transition between states q3 and q4 (including self transitions of those states). If either q3 or q4 is itself a heavy state, the refinement process can be applied to it as well (recursively), resulting in a tree of HMMs spanned by the state q2.

The recursive refinement process starts with a root HMM, λ0, trained in a standard way using the Baum-Welch algorithm. In λ0, heavy states are identified. For each such state qi, a refining HMM λi is trained (see Section 3.2.6.3) and combined with λ0 as follows: every transition from other states into qi in λ0 entails a transition into an (initial) state of λi in accordance with the initial state distribution of λi; every self transition of qi in λ0 results in a state transition in λi according to its state transition matrix; finally, every transition from qi to another state entails a transition from a state in λi with the probability of the original transition from qi. The states of λi are assigned the label of qi. This refinement continues in a recursive manner and terminates when all the heavy states have refinements. The non-refined states are called leaf states.

Figure 3.3 depicts a recursively refined HMM having two refinement levels. In this model, states 1, 2, and 4 are heavy (and refined) states, and states 3, 5, 6, 7, and 8 are leaf (emitting) states. The model consisting of states 3 and 4 refines state 1, the model consisting of states 5 and 6 refines state 2, etc.

An aggregate state of the complete hierarchical model corresponds to a set of inner HMM states, each of which is a state on a path from the root, through refining HMMs, to a leaf state. Only leaf states actually emit observation symbols. Refined states are non-emitting, and their role in this construction is to preserve the structure and transitions of the HMMs they belong to.


[Figure 3.3: Recursively Refined HMM. A two-level hierarchy in which heavy states 1, 2, and 4 are refined, and leaf states 3, 5, 6, 7, and 8 emit the observations.]

At every time instance t, the model is at some aggregate state. The transition to the next aggregate state always starts at λ0 and recursively progresses to the leaf states, as shown in the following example. Suppose that the model in Figure 3.3 is at aggregate state {1,4,7} at time t. The aggregate state at time t+1 is calculated as follows. λ0 is in state 1, so its next state (say 1 again) is chosen according to the distribution {a11, a12}. We then consider the HMM that refines state 1, which was in state 4 at time t. Here again the next state (say 3) is chosen according to the distribution {a43, a44}. State 3 is a leaf state that emits an observation, and the aggregate state at time t+1 is {1,3}. On the other hand, if state 2 is chosen at the root, a new state (say 6) in its refining HMM is chosen according to the initial distribution {π5, π6} (since the transition into the heavy state was made from another state). The chosen state 6 is a leaf state, so the new aggregate state becomes {2,6}.

3.2.6.1 Flattening a Refined HMM

Given a recursively refined HMM, it is possible to construct an equivalent flattened HMM. The flattened model is easier to deal with when considering a variety of "local" operations applied directly to leaf states, such as forward-backward calculations, or calculations of visit rates and risks. However, when considering "global" operations related to the hierarchy of the RR-HMM, such as finding the most likely aggregate state, the flattened representation is extremely hard or even impossible to work with, because it discards the structural information of the original model.

The construction of an equivalent flattened model can be conducted bottom-up on the hierarchical structure of the RR-HMM. First, every refining HMM at the lowest level (a refining HMM in which all states are leaves) is embedded in the HMM at the upper level, replacing the heavy state it refines. We repeat this process, level by level, until the states in the root HMM are replaced with their refinements; here the process ends, resulting in a standard (flat) HMM in which each state corresponds to a single aggregate state of the hierarchical RR-HMM. For example, consider the RR-HMM shown in Figure 3.3. The flattening starts by embedding the HMM consisting of states {7, 8} so as to substitute state 4, and the HMM consisting of states {5, 6}, which substitutes state 2. Then, a new refining HMM consisting of states {3, 7, 8} substitutes state 1, resulting in a 5-state HMM consisting of states {3, 5, 6, 7, 8}.

Considering the process described in the previous paragraph, we observe that for flattening an RR-HMM, it is sufficient to know how to embed a single refining HMM, in which each state is a leaf, into the upper-level model. To do so, we need to define the transitions between the states of the refining HMM and the states of the HMM in which it is embedded. We also need to calculate the probabilities corresponding to those transitions and to recalculate the probabilities of internal transitions within the refining HMM, in order to obtain a valid stochastic model. Finally, it is also required to define the revised emission distribution of every state to be embedded.

[Figure 3.4: Embedding of a 2-state refining HMM. Refining states 4 and 5 substitute the heavy state 2 (drawn with dashed lines); type I, II, and III transitions replace the transitions into, within, and out of the heavy state, respectively.]

Figure 3.4 depicts the embedding of the refining HMM (states 4, 5) so as to substitute the heavy state 2 (for the sake of simplicity, some edges, e.g., the self transitions of states 4 and 5, are omitted from this diagram). The component drawn with dashed lines (state 2 and its transitions) is removed and replaced by new transitions to states 4 and 5. Each transition to the heavy state is replaced with transitions into every refining state (type I transitions in Figure 3.4). The self transition of the heavy state is replaced by transitions between refining states (type II transitions), and each transition from the heavy state is replaced by transitions from refining states (type III transitions).

Consider an observation sequence O. Let {q1, . . . , qN} be the states of the HMM in which the refining HMM is embedded (the upper HMM, denoted by λ), and let qi be a heavy state. Let {qN+1, . . . , qN+N′} be the states of the refining HMM (denoted by λr). Denote by St the state of λ at time t, by S^r_t the state of λr at time t, and by S^e_t the state at time t of the flat HMM resulting from the embedding of λr in λ (denoted by λe). The transition probabilities of the resulting flat HMM are defined as follows.

Type I transitions: 1 ≤ j ≤ N, N+1 ≤ k ≤ N+N′, j ≠ i,

P[S^e_{t+1} = q_k \mid S^e_t = q_j] =
\begin{cases}
P[S^r_1 = q_k,\, S_{t+1} = q_i], & S_t = q_j; \\
0, & \text{otherwise.}
\end{cases}

Consequently,

a^e_{jk} = P[S^r_1 = q_k, S_{t+1} = q_i \mid S_t = q_j]
 = P[S^r_1 = q_k \mid S_t = q_j, S_{t+1} = q_i]\, P[S_{t+1} = q_i \mid S_t = q_j]
 = \pi^r_k\, a_{ji}.    (3.15)

Type II transitions: N+1 ≤ j, k ≤ N+N′,

P[S^e_{t+1} = q_k \mid S^e_t = q_j] =
\begin{cases}
P[S^r_{t+1} = q_k,\, S_{t+1} = q_i], & S^r_t = q_j,\ S_t = q_i; \\
0, & \text{otherwise.}
\end{cases}

a^e_{jk} = P[S^r_{t+1} = q_k, S_{t+1} = q_i \mid S^r_t = q_j, S_t = q_i]
 = P[S^r_{t+1} = q_k \mid S^r_t = q_j, S_t = q_i, S_{t+1} = q_i]\, P[S_{t+1} = q_i \mid S_t = q_i]
 = a^r_{jk}\, a_{ii}.    (3.16)

Type III transitions: N+1 ≤ j ≤ N+N′, 1 ≤ k ≤ N, k ≠ i,

a^e_{jk} = P[S_{t+1} = q_k \mid S_t = q_i] = a_{ik}.    (3.17)

To define the observation emission distribution of an embedded state, we first recall that in the flattened HMM, each state corresponds to an aggregate state in the original RR-HMM. Since the aggregate state is uniquely defined by its leaf state, the emission distribution of the aggregate state is the emission distribution of its leaf state. We thus get

b^e_j(k) = b^r_j(k).    (3.18)

The initial probability of being at state qk, N+1 ≤ k ≤ N+N′, in the flattened HMM is obtained similarly to the type I transition probability, with the only change being the requirement that S1 = qi instead of St = qj, St+1 = qi. Consequently,

\pi^e_k = \pi^r_k\, \pi_i.    (3.19)
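The following sketch (our own, under the assumption that all parameters are dense numpy arrays; it is not the thesis' Matlab implementation) embeds one all-leaf refining HMM into its parent according to Equations 3.15-3.19, producing the transition matrix, initial distribution, and emission matrix of the flattened model.

    import numpy as np

    def embed_refining_hmm(A, pi, B_obs, i, A_r, pi_r, B_r):
        """Replace heavy state i of the parent HMM (A, pi, B_obs) by the states
        of its refining HMM (A_r, pi_r, B_r).  Kept parent states come first,
        refining states last."""
        N, Np = A.shape[0], A_r.shape[0]
        keep = [j for j in range(N) if j != i]            # surviving parent states
        n = len(keep)
        A_e = np.zeros((n + Np, n + Np))
        A_e[:n, :n] = A[np.ix_(keep, keep)]               # parent-to-parent, unchanged
        for a, j in enumerate(keep):
            A_e[a, n:] = pi_r * A[j, i]                   # type I   (Eq. 3.15)
        A_e[n:, n:] = A_r * A[i, i]                       # type II  (Eq. 3.16)
        for b, k in enumerate(keep):
            A_e[n:, b] = A[i, k]                          # type III (Eq. 3.17)
        pi_e = np.concatenate([pi[keep], pi_r * pi[i]])   # Eq. 3.19
        B_e = np.vstack([B_obs[keep], B_r])               # Eq. 3.18
        return A_e, pi_e, B_e

One can verify directly from the comments above that every row of A_e and the vector pi_e sum to one, which is the stochasticity requirement discussed in the text.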


Noting that a flattened representation of the RR-HMM induces an equivalent observation distribution, and that every state in it corresponds to a single aggregate state in the original RR-HMM, we observe that the flattened representation can be used for the calculation of the forward-backward variables corresponding to aggregate states (see Equations 2.5, 2.7, 2.8, and 2.10). Consequently, the flattened representation can be used for conducting inference (Equation 2.6), calculating visit rates and risks of aggregate states (Equations 3.4 and 3.5), and training (see Section 3.2.6.3).

To convert the RR-HMM to a selective model (RR-sHMM), the procedure described in Section 3.2.2 is applied to its flattened representation. For making predictions with an RR-sHMM, we apply a routine for finding the most likely aggregate state, which is described in Section 3.2.6.2.

3.2.6.2 The Most Likely Aggregate State

In order to apply the RR-HMM for predictions in a manner similar to the standard HMM, we need to identify at time t the aggregate state at which the model is most likely to be. However, the use of the individually most likely state of the flattened representation is not appropriate for this task, as demonstrated in the following example.

Consider an HMM consisting of two states, q1, q2, such that γt(1) = 0.4 and γt(2) = 0.6 for each t. Clearly, at every time t, q2 is the most likely state. Now, consider the two-state refinement of q2, consisting of states q3, q4, such that in the flattened model, at every time t, γt(3) = 0.25 and γt(4) = 0.35. In the flattened HMM (consisting of states q1, q3, q4), q1 is now the most likely state at every time t. Thus, instead of obtaining a refined prediction with respect to the original model, a completely different prediction is obtained.

To overcome this problem, a recursive procedure on the RR-HMM hierarchy is applied for finding the most likely aggregate state. In a preprocessing step, γt(i) is calculated for each state qi in the model. For leaf states, γt(i) is calculated using a standard forward-backward algorithm applied to the flattened model. Then, for each heavy state qi that has a corresponding refining HMM λi,

\gamma_t(i) = \sum_{q_j \in \lambda_i} \gamma_t(j).    (3.20)

The recursive procedure starts at the root model λ0. The most likely individual state in it, say qi, is identified. If this state has no refinement (i.e., it is a leaf state), then we are done. Otherwise, the most likely individual state in λi (the HMM that refines qi), say qj, is identified, and the aggregate state is updated to be {qi, qj}. The process then continues to lower levels of the hierarchy, and terminates when reaching a leaf state. The resulting sequence of states is the most likely aggregate state of the RR-HMM.
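A compact sketch of this recursion, under a representation we assume purely for illustration: each HMM of the hierarchy is a dictionary mapping its state names to either ("leaf", flat_index) or ("heavy", refining_dict), and gamma_t holds the posteriors of the flattened model's leaf states at time t.

    def state_gamma(node, gamma_t):
        """Posterior mass of one state: a leaf reads gamma_t directly,
        a heavy state sums over its refining HMM (Equation 3.20)."""
        kind, payload = node
        if kind == "leaf":
            return gamma_t[payload]
        return sum(state_gamma(child, gamma_t) for child in payload.values())

    def most_likely_aggregate_state(model, gamma_t):
        """Top-down recursion of Section 3.2.6.2."""
        path, current = [], model
        while True:
            best = max(current, key=lambda s: state_gamma(current[s], gamma_t))
            path.append(best)
            kind, payload = current[best]
            if kind == "leaf":
                return path            # aggregate state, from the root to a leaf
            current = payload

On the two-state example above (gamma_t = [0.4, 0.25, 0.35] for q1, q3, q4), this recursion returns the aggregate state {q2, q4} rather than the flat answer q1.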

3.2.6.3 Parameters Estimation of the Refining HMM

A crucial task in the construction of the RR-HMM is estimating the parameters of the refining HMM. During the construction process, whenever a heavy state is refined, the estimation procedure should be applied to the transition and emission probabilities of its refining HMM. The estimation procedure is based on maximizing the likelihood of the entire model to generate the given observation sequence, under the constraint of keeping unchanged the parameters that are not directly related to the refining HMM; that is, the estimation procedure is applied only to the parameters of the refining HMM.

This estimation procedure is iterative and makes use of the embedding process described in Section 3.2.6.1, combining it with Baum-Welch re-estimation steps until a convergence criterion is met. The procedure starts with some initial random choice of values for the parameters of the refining HMM. Then, this HMM is embedded so as to substitute the heavy state it refines, and its parameters are recalculated, resulting in a new (non-embedded) refining HMM. The process is then continued with this new refining HMM.

Algorithm 1 is a pseudo-code of the training algorithm for the refining HMM λi of a heavy state qi in an HMM λ. In steps 1-3, a random λi is generated and connected to the HMM λ instead of qi. Steps 5-8 iteratively update the parameters of λi until the Baum-Welch convergence criterion is met; steps 5-7 represent the embedding process (corresponding to the three types of transitions described in Section 3.2.6.1), and step 8 is the actual update of the refining HMM parameters. In step 10, λ is updated with the final λi parameters. Finally, in step 3, qi is stored as a state refined by λi, to preserve the hierarchical structure of the resulting model.

\pi_j = \frac{1}{Z} \left( \gamma_1(j) + \sum_{t=1}^{T-1} \sum_{\substack{k=1 \\ k \ne i}}^{N} \xi_t(k, j) \right)    (3.21a)

a_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j, k)}{\sum_{l=N+1}^{N+N'} \sum_{t=1}^{T-1} \xi_t(j, l)}    (3.21b)


Algorithm 1 TrainRefiningHMM

Require: HMM λ = 〈{qj}_{j=1..N}, {vm}_{m=1..M}, π, A, B〉, heavy state qi, observation sequence O

1: Draw a random HMM λi = 〈{qj}_{j=N+1..N+N′}, {vm}_{m=1..M}, {πj}_{j=N+1..N+N′}, {ajk}_{j,k=N+1..N+N′}, {bjm}_{j=N+1..N+N′, m=1..M}〉
2: For each 1 ≤ j ≤ N, j ≠ i, replace the transition qj→qi with qj→qN+1, . . . , qj→qN+N′, and the transition qi→qj with qN+1→qj, . . . , qN+N′→qj
3: Remove state qi, with its corresponding {bim}_{m=1..M}, from λ and record it as the state refined by λi. Set Lqj = Lqi for each N+1 ≤ j ≤ N+N′
4: while not converged do
5:   For each 1 ≤ j ≤ N, j ≠ i, and 1 ≤ k ≤ N′, update a_{j(N+k)} = a_{ji} π_{N+k} and a_{(N+k)j} = a_{ij}
6:   For each N+1 ≤ j ≤ N+N′, update πj = πi πj
7:   For each N+1 ≤ j, k ≤ N+N′, update ajk = aii ajk
8:   Re-estimate {πj}_{j=N+1..N+N′}, {ajk}_{j,k=N+1..N+N′}, {bjm}_{j=N+1..N+N′, m=1..M} using Equations 3.21a-3.21c
9: end while
10: Perform steps 5-7

Ensure: HMM λ


b_{jm} = \frac{\sum_{\substack{t=1 \\ O_t = v_m}}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}    (3.21c)

Equations 3.21a-3.21c present the re-estimation formulas, used in step 8, for the parameters of the states qN+1, . . . , qN+N′ (the states of the refining HMM). The formulas are derived by setting to zero the appropriate partial derivatives of the Baum auxiliary function (Equation 2.9). It is not hard to see that the constraints requiring the parameters to be valid distributions are preserved (Z is a normalization factor in the πj equation). The main difference from the original Baum-Welch formulas is in the re-estimation of πj; specifically, in the refinement process, transitions from other states into the heavy state qi also affect the initial distribution of its refining states. This derivation is discussed in detail in Appendix A. The re-estimation of ajk and bj(k) is similar to the standard Baum-Welch routine, the difference being that the expectation calculation only considers states originating from the refining HMM.
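A sketch of the re-estimation step (step 8 of Algorithm 1), assuming the forward-backward quantities ξt(j, k) and γt(j) have already been computed on the embedded model; the array shapes, index lists, and function name are our own illustrative choices, not the thesis code.

    import numpy as np

    def reestimate_refining(xi, gamma, obs, refining, parent_keep, n_symbols):
        """xi[t, j, k] = P[S_t = q_j, S_{t+1} = q_k | O], gamma[t, j] = P[S_t = q_j | O];
        `refining` / `parent_keep` index the refining states and the remaining parent states."""
        ref, par = np.asarray(refining), np.asarray(parent_keep)
        # Eq. 3.21a: mass entering the refinement at t = 1 or from other parent states
        pi_new = gamma[0, ref] + xi[:, par][:, :, ref].sum(axis=(0, 1))
        pi_new /= pi_new.sum()                                  # normalization factor Z
        # Eq. 3.21b: transitions internal to the refining HMM
        a_new = xi[:, ref][:, :, ref].sum(axis=0)
        a_new /= a_new.sum(axis=1, keepdims=True)
        # Eq. 3.21c: emissions, as in standard Baum-Welch but restricted to refining states
        b_new = np.zeros((len(ref), n_symbols))
        for m in range(n_symbols):
            b_new[:, m] = gamma[np.asarray(obs) == m][:, ref].sum(axis=0)
        b_new /= b_new.sum(axis=1, keepdims=True)
        return pi_new, a_new, b_new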

3.2.6.4 Comparison to Other Compositional Hidden Markov Models

Various schemes that compose several HMMs into a single model have been presented in the literature. One class of such schemes consists of models developed for learning observation sequences originating from a number of stochastic sources, each of which can be modeled via a single HMM, while the observations depend on the combination of the internal states of the underlying models. In fact, those models provide a more efficient alternative to an HMM whose state space is the cross product of the states of the internal models. The Factorial HMM, developed by Ghahramani and Jordan [16], presents such a combination scheme in which each underlying HMM operates independently. Mixed Memory Markov Models, presented by Saul and Jordan [28], and Coupled Hidden Markov Models, presented by Brand [7], introduce, in addition, a coupling between the states of the underlying HMM sources. In general, those models serve the purpose of decomposing an output signal into multiple components; thus all the underlying models that represent sources for those components operate simultaneously, without a top-down hierarchical structure.

The Hierarchical Hidden Markov Model (HHMM), presented by Fine, Singer, and Tishby [13], was developed for capturing recursive patterns in modeled observation sequences. This model is very similar in structure to the RR-HMM we introduce. It consists of a number of HMMs, organized in a hierarchical tree form, where each state in the upper HMM gives rise to a sub-tree, which in turn is itself a hierarchical HMM. Transitions are performed both horizontally and vertically, and the observations are emitted by so-called "production states" -- states at the lowest level of the hierarchy (leaf states in our terminology). Despite the structural similarity, there are a number of key characteristics differentiating our RR-HMM from the HHMM. First, in the HHMM, each state in the upper level is designed to generate a sub-sequence rather than a single observation. This leads to the following generation principle: the generation of a new observation starts at the level from which the last observation was generated (the level above the production state that generated the last observation), and "climbing up" the tree of HMMs is performed only from a so-called terminal state, which exists at each level and signals the end of the sub-sequence generation. In contrast, in our RR-HMM, each new observation is generated using a top-down traversal through the model, such that the generation of each new observation is governed by the root HMM. An additional fundamental difference is that the HHMM is built at once, using a predefined architecture (number of levels), and not by adding levels iteratively on demand, as in our RR-HMM.

In a paper by Siddiqi, Gordon, and Moore [30], a top-down principle is used for discovering the most suitable model structure. In this process, each state is in turn replaced by an HMM, until the best model in terms of some evaluation criterion is achieved. The replaced states are abandoned; thus the outcome of the algorithm is a flat (rather than hierarchical) HMM. Since hierarchy preservation is essential for balancing the visit rates of states (for example, when choosing the most likely aggregate state, see Section 3.2.6.2), the flat HMM resulting from this construction cannot be used instead of the RR-HMM.

The outcome of the RR training method is a tree of HMMs whose main purpose is to redistribute visit rates among states. This redistribution is the key element that allows for achieving smooth RC curves. In fact, all the models listed above were developed to serve other purposes, such as better modeling of sequences that have a special structure (e.g., sequences hypothesized to emerge from a hierarchical or combinatorial source), and thus they do not address the redistribution objective we require.


Chapter 4

Experimental Results

We evaluated the techniques described in Chapter 3 on a financial time series prediction task. Specifically, we focus on predicting the next-day trend of the S&P500 index. The term "trend prediction" means that we are only interested in determining the next-day direction (up or down). Thus, we are dealing with a binary sequential prediction problem.

A selective prediction subroutine has a striking appeal in the context of financial applications, and trading in particular. Risk-averse speculators and traders can greatly benefit if provided with reliable mechanisms to control and bound risk. Moreover, the presence of a wide diversity of weakly correlated financial instruments should allow one, at least in principle, to apply selective prediction with a low coverage bound on each individual instrument and yet receive a dense stream of trading signals.

Next-day trend prediction of price sequences using HMMs, and in particular of the S&P500 index, is reported to be a difficult task. A recent experimental work by Rao and Hong [27] extensively tested various applications of HMMs to this task. This work concluded that, when considering autoregressive financial trend prediction, HMMs can provide some positive edge over a coin toss. However, the reported advantage is very small (51.72%). This empirical evidence provides an incentive for applying selective prediction algorithms. Can these techniques increase this negligible advantage?

4.1 Experimental Setting

In all experiments, the input time series consists of relative returns of S&P500 close prices, observed from 1/27/1999 to 12/31/2010. Given that the close price at time instance t is pt, the relative return at time t is

r_t = \frac{p_t - p_{t-1}}{p_{t-1}}.    (4.1)

Each observation rt is labeled with its correct next-day trend prediction, lt = sign(rt+1). Being at time T+1, and given the historical price sequence p1 . . . pT observed up to that point, we construct the relative return sequence r2 . . . rT with its corresponding label sequence l2 . . . lT−1. We note that obtaining a training sequence of labeled returns of length T requires a sequence of T+2 close prices. The model is trained using the relative return and corresponding label sequences. Given the next price, pT+1, we calculate rT+1 and lT, and the trained model is used to predict the next-day direction, lT+1.
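As a small sketch of this preprocessing (an assumed helper of ours, not the thesis code), the returns of Equation 4.1 and their next-day trend labels can be computed as follows.

    import numpy as np

    def returns_and_labels(prices):
        """r_t = (p_t - p_{t-1}) / p_{t-1} (Equation 4.1); the label of r_t is
        sign(r_{t+1}), so the most recent return is left unlabeled."""
        p = np.asarray(prices, dtype=float)
        r = (p[1:] - p[:-1]) / p[:-1]
        labels = np.sign(r[1:])
        return r, labels          # len(labels) == len(r) - 1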

It is well known that price sequences are highly non-stationary. Therefore, following Zhang [33], we employed a walk-forward procedure whereby the model is trained over the window of past Wp returns and then tested on the subsequent window of Wf "future" returns. Then, we "walk forward" Wf steps (days) in the return sequence (so that the subsequent training segment ends where the last test segment ended), and the process repeats until we consume the entire data sequence. This process is depicted in Figure 4.1.
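A minimal sketch of this walk-forward splitting (our own illustration, not the evaluation code used in the thesis):

    def walk_forward_splits(n, w_past, w_future):
        """Yield (train_indices, test_indices): train on the last w_past returns,
        test on the next w_future, then slide forward by w_future steps."""
        start = 0
        while start + w_past + w_future <= n:
            yield (range(start, start + w_past),
                   range(start + w_past, start + w_past + w_future))
            start += w_future

For example, walk_forward_splits(len(returns), 2000, 50) reproduces the window sizes used in most of the experiments below.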

[Figure 4.1: A walk-forward evaluation procedure. Successive training windows of length Wp, each followed by a test window of length Wf, slide forward along the time axis.]

In all experiments, unless explicitly specified, Wp was set to 2000 and Wf was set to 50. In each experiment, we estimated and generated the RC-curve of each model under consideration. The RC-curve was constructed by taking the linear grid of rejection rate bounds from 0 to 0.9 in steps of 0.1, and calculating the empirical error rate obtained from the non-rejected part of the test sequence. A prediction dt ∈ {+1, −1} is considered erroneous if dt ≠ lt. To avoid dependency on the random initialization of the HMMs, cross-validation over N = 30 folds was performed. Each fold consisted of 10 runs, each obtained with its own random initialization of the HMM parameters. The mean empirical error rate, µe, and the corresponding standard error of the mean, σe, were calculated as

\mu_e = \frac{1}{N} \sum_{i=1}^{N} \frac{E_i}{C_i}    (4.2a)

\sigma_e = \frac{\mathrm{std}\{E_1/C_1, \ldots, E_N/C_N\}}{\sqrt{N}},    (4.2b)

where Ei is the total number of errors accrued, and Ci is the total number of non-rejected instances, obtained from all runs of fold i.
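A direct transcription of Equations 4.2a-4.2b (ours; the use of the sample standard deviation is an assumption):

    import numpy as np

    def fold_statistics(errors, covered):
        """errors[i] = E_i, covered[i] = C_i for fold i; returns (mu_e, sigma_e)."""
        rates = np.asarray(errors, dtype=float) / np.asarray(covered, dtype=float)
        return rates.mean(), rates.std(ddof=1) / np.sqrt(len(rates))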

Selective approaches were examined using two variations of the same dataset: continuous and discrete. The continuous set is the sequence of relative returns, rt, and the discrete set is a discretization of the same sequence. The discretization is described in detail in Section 4.2.1. Selective models were implemented on top of the HMM Toolbox for Matlab, developed by Murphy [24].

4.2 Experiments with Discrete Data

4.2.1 Discretization and Quantization

In order to produce a discrete data sequence from the original return sequence r2 . . . rT, two transformations were applied. First, every data point rt is discretized to r′t ∈ {+1,−1}, using r′t = sign(rt). Next, we applied quantization, where the sequence r′2 . . . r′T (discretized returns) is divided into short sequences Ot consisting of W consecutive discrete returns, Ot = r′t . . . r′t+W−1. Each sequence Ot is labeled with lt+W−1 (the label of the last data instance in Ot, which is the correct prediction for this instance).

For the ambiguity-based classifier (Section 3.1), the sequences Ot originating from the training data form positive and negative training pools, and the sequences Ot originating from the test data are used for classifier performance evaluation. For the sHMM, each sequence Ot is encoded as a single observation, forming a new, quantized sequence.


[Figure 4.2: Quantization of the discrete data sequence with W = 3. The discretized sequence 1, −1, 1, 1, −1, −1, −1, 1, 1, −1, 1, 1, 1 is covered by overlapping windows of length 3, each encoded as a single observation symbol: O1 = 5, O2 = 3, O3 = 6, O4 = 4, O5 = 0, O6 = 1, O7 = 3, O8 = 6, O9 = 5, O10 = 3, O11 = 7.]

Figure 4.2 depicts a quantization example with W = 3 using a sample discretized return sequence. For the ambiguity-based classifier, this sequence induces two training pools: a positive pool, consisting of the sequences O1 = {1,−1, 1}, O5 = {−1,−1,−1}, O6 = {−1,−1, 1}, O8 = {1, 1,−1}, O9 = {1,−1, 1}, and O10 = {−1, 1, 1}, and a negative pool, consisting of O2 = {−1, 1, 1}, O3 = {1, 1,−1}, O4 = {1,−1,−1}, O7 = {−1, 1, 1}, and O11 = {1, 1, 1}. For the sHMM, the observation sequence is O = 5, 3, 6, 4, 0, 1, 3, 6, 5, 3, 7.
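The discretization and quantization step can be sketched as follows (our illustration; the binary window encoding, with +1 mapped to 1, −1 mapped to 0, and the earliest return as the most significant bit, is inferred from the example of Figure 4.2).

    import numpy as np

    def quantize_returns(returns, W=3):
        """Discretize returns to +/-1 and encode each window of W consecutive
        discrete returns as one symbol in {0, ..., 2**W - 1}."""
        signs = np.where(np.asarray(returns) > 0, 1, -1)     # r'_t = sign(r_t)
        bits = (signs + 1) // 2                              # -1 -> 0, +1 -> 1
        return [int("".join(map(str, bits[t:t + W])), 2)
                for t in range(len(bits) - W + 1)]

Applied to the discretized sequence of Figure 4.2, this reproduces the symbol sequence 5, 3, 6, 4, 0, 1, 3, 6, 5, 3, 7.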

4.2.2 Ambiguity-Based Model

In our first experiment, we evaluated the performance of the ambiguity-based selective classifier described in Section 3.1. The classifier has two hyper-parameters: N, the number of states in each of the two HMMs that model the positive and negative sequences, and the quantization parameter W (see Section 4.2.1). Selective ambiguity-based classifiers were tested with different settings of these hyper-parameters.

Table 4.1: Comparison of ambiguity-based classifiers for different W's (mean empirical error µe per coverage bound)

       Coverage bound
W     1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1
3    0.487  0.486  0.483  0.480  0.479  0.482  0.479  0.472  0.458  0.458
4    0.505  0.505  0.503  0.501  0.494  0.488  0.479  0.465  0.453  0.455
5    0.486  0.486  0.483  0.480  0.475  0.467  0.463  0.452  0.438  0.432
6    0.495  0.494  0.491  0.487  0.483  0.477  0.469  0.461  0.453  0.453

Table 4.1 demonstrates the performance of ambiguity-based classifiers that vary in the length of a single training sequence (W), while the number of states in each HMM is fixed to 5. For each W, the table summarizes the mean empirical error (µe) achieved for coverage bounds in the range between 1 and 0.1, with the bound decreasing in steps of 0.1. It can be observed that for each W the resulting RC-curve is non-trivial, so the selective classifier is potentially useful. Comparing these results, we see that the model with W = 5 achieves the best performance, both in terms of the number of coverage bounds for which it performed best, and in terms of the lowest error rate achieved over all tested coverage bounds.

Table 4.2: Comparison of ambiguity-based classifiers with different numbers of states (W = 5)

       Coverage bound
N     1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1
2    0.488  0.486  0.484  0.484  0.480  0.469  0.464  0.455  0.452  0.454
3    0.486  0.485  0.481  0.480  0.476  0.465  0.460  0.451  0.440  0.437
4    0.486  0.486  0.483  0.481  0.478  0.469  0.461  0.454  0.446  0.440
5    0.486  0.486  0.483  0.480  0.475  0.467  0.463  0.452  0.438  0.432
6    0.485  0.486  0.483  0.481  0.475  0.467  0.459  0.449  0.436  0.434
7    0.484  0.484  0.481  0.478  0.474  0.468  0.461  0.451  0.440  0.434
8    0.485  0.485  0.481  0.478  0.474  0.467  0.460  0.449  0.438  0.431

In Table 4.2, the ambiguity-based classifier with W = 5 is further investigated for different values of the hyper-parameter N. Here again we observe that a useful RC-curve is obtained for all the N values we tested. However, here it is not so clear which value of N yields the best performance. For example, both N = 7 and N = 8 exhibit the best performance for 5 individual coverage bounds. Nevertheless, N = 8 achieved the overall best error rate (0.431), so we consider it to be the best choice (in hindsight) for the hyper-parameter N.

Overall, these results indicate in hindsight that the pair W = 5 and N = 8 leads to good performance of the ambiguity-based classifier. In subsequent experiments we used this choice of hyper-parameters for this classifier when comparing it with our novel sHMM technique, in order to give an (unfair) advantage to this well-known selective classification approach.

4.2.3 Selective HMM

For the state-based selective HMMs described in Section 3.2, we used a 5-state model. Such HMMs are hypothesized to be sufficiently expressive to model a small number of basic market conditions such as strong/weak trends (up and down) and sideways markets [27], [33]. The other hyper-parameter of the model, the length of the quantization window W (see Section 4.2.1), was chosen based on a preliminary experiment using an earlier slice of S&P500 data. This slice also consisted of 3000 points, from 17.2.1987 to 31.12.1998. The results of this preliminary evaluation are summarized in Table 4.3. Based on these results, we selected W = 3 as the size of the quantization window in the main experiment.

Table 4.3: Comparison of quantization window lengths, 17.2.1987-31.12.1998

       Coverage bound
W     1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1
3    0.456  0.456  0.448  0.441  0.435  0.434  0.433  0.430  0.425  0.411
4    0.457  0.457  0.452  0.448  0.445  0.440  0.436  0.434  0.430  0.418
5    0.454  0.454  0.451  0.447  0.444  0.440  0.435  0.432  0.432  0.429
6    0.451  0.451  0.449  0.446  0.442  0.439  0.437  0.434  0.432  0.429

Figure 4.3 depicts RC-curves for all the sHMM models described in Section 3.2, namely the Naive-, RLI-, and RR-sHMM. In this and the forthcoming figures, RC-curves represent the mean empirical error rate as a function of the coverage bound, and the error bars on each curve are standard errors of the mean. In the experiment with the RR-sHMM, we applied one level of refinement to every state with a visit rate greater than 0.1.

From the figure, it can be seen that all three models exhibit meaningful RC-curves; namely, the error rates decrease monotonically with decreasing coverage bounds. The RLI and RR models (curves 2 and 3, respectively) outperform the Naive one (curve 1) by better exploiting the allotted coverage bound. This can be seen more evidently in Table 4.4, which summarizes the actual coverage of the models versus the allotted bound. In addition, Figure 4.3 and Table 4.4 show that the RR model outperforms the RLI model and, moreover, that its effective coverage rate is higher for each required bound. This validates the effectiveness of the RR approach, which implements a smarter selection procedure than the RLI approach. Specifically, when RR refines a state and the resulting sub-states have different risk rates, the selection procedure will tend to reject the less reliable of them first.

A comparison of the Naive-sHMM to the ambiguity-based classifier is shown in Figure 4.4. The hyper-parameters of the ambiguity-based classifier were set in hindsight, as described in Section 4.2.2. Despite this classifier's advantage, Figure 4.4 shows that the Naive-sHMM, and, consequently, the RLI- and RR-sHMM, clearly outperform the ambiguity-based model over the entire coverage range.

We also compared our models to two alternative HMM learning methods that were recently proposed: the spectral algorithm of Hsu et al. [19], and the V-STACKS algorithm of Siddiqi et al. [30].


[Figure 4.3: Comparison of Naive-, RLI- and RR-sHMM. Error rate (y-axis) vs. coverage bound (x-axis) for the three models.]

As can be seen in Figure 4.4, when a coverage compromise is allowed, our selective techniques can also improve the accuracy obtained by these more advanced learning methods.

The last experiment in this section tested the performance of the RR-sHMM, which is the best model in terms of empirical error, at low coverage bounds. The results of this experiment are depicted in Figure 4.5. From these results we learn that even close to the extreme case of zero coverage, where the model is very likely to diverge due to the very small number of non-rejected instances (entailing an increased impact of each erroneous prediction), the RR-sHMM still succeeds in producing (on average) a meaningful RC-curve.

4.3 Experiments with Continuous Data

4.3.1 Filtered Data

In this set of experiments, the sequence of returns, rt, was used as input data. Following Zhang [33], in all experiments we used 5-state HMMs with a mixture of 4 Gaussian components for modeling the observation densities in each state.

Since the sequence of pure returns is very noisy, we first evaluated performance on a smoothed return sequence. Smoothing was obtained using a standard first-order infinite impulse


Table 4.4: Coverage rates of the sHMMs

Bound   Naive   RLI    RR
0.9     0.999  0.899  0.942
0.8     0.939  0.798  0.842
0.7     0.778  0.696  0.735
0.6     0.719  0.593  0.628
0.5     0.633  0.491  0.526
0.4     0.507  0.391  0.423
0.3     0.385  0.291  0.324
0.2     0.305  0.192  0.224
0.1     0.199  0.094  0.131

response (IIR) low-pass filter, defined recursively as

x_t \triangleq p_t + (1 - \zeta)\, x_{t-1},    (4.3)

using ζ = 1/3.¹ This filter was applied to the sequence of S&P500 close prices, and the resulting sequence of xt's was then converted to a sequence of returns using Equation 4.1. As is evident from Figure 4.6, the application of this filter results in a smoother observation sequence, giving rise to an easier prediction task.
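A sketch of this smoothing step (ours). It implements the recursion of Equation 4.3 literally; note that this recursion is a scaled exponential moving average, and the constant scale factor cancels when the smoothed prices are converted to relative returns via Equation 4.1.

    import numpy as np

    def iir_lowpass(prices, zeta=1/3):
        """First-order IIR low-pass filter of Equation 4.3: x_t = p_t + (1 - zeta) * x_{t-1}."""
        p = np.asarray(prices, dtype=float)
        x = np.empty_like(p)
        x[0] = p[0]                       # the initialization is our assumption
        for t in range(1, len(p)):
            x[t] = p[t] + (1 - zeta) * x[t - 1]
        return x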

Table 4.5: Coverage rates of the sHMMs for filtered continuous data

Bound   Naive   RLI    RR
0.9     0.955  0.879  0.902
0.8     0.900  0.811  0.823
0.7     0.838  0.739  0.742
0.6     0.754  0.639  0.650
0.5     0.658  0.535  0.562
0.4     0.556  0.438  0.473
0.3     0.480  0.341  0.367
0.2     0.398  0.230  0.258
0.1     0.207  0.114  0.140

¹ This standard filter is often referred to as an 'exponential moving average', and ζ = 1/3 roughly corresponds to smoothing out fluctuations of window length five.


[Figure 4.4: Comparison of the Naive-sHMM and Ambiguity-Based Classifier. Error rate vs. coverage bound for the Naive-sHMM, the ambiguity-based classifier, the spectral algorithm, and V-STACKS.]

Figure 4.7 and Table 4.5 compare the performance of the three sHMMs over the smoothed observation sequence. It is evident that these results reinforce our conclusions from Section 4.2.3. Specifically, as in the discrete case, Table 4.5 shows improved coverage-bound exploitation by the RLI and RR models, and Figure 4.7 indeed shows smaller errors for those models, resulting from the better coverage-bound exploitation. We also observe that the RR-sHMM outperforms the RLI model, presumably as a result of smarter rejection decisions (see Section 4.2.3 for a more detailed explanation).

4.3.2 RC Trade-off vs. Data Complexity

Trend prediction of a smoothed price sequence is, quite obviously, easier than trend prediction

of the original, noisy sequence of raw prices. It is intuitively clear that progressively higher

degrees of smoothing will give rise to progressively easier prediction problems. In this section

we present a preliminary study where we examined the RC trade-offs resulting from such

progressively easier problems.

The smoothness of a lowpass-filtered time series depends on the cutoff window length W: the greater W is, the smoother the resulting sequence. In Figure 4.8 we depict the relative improvements of various RC trade-offs obtained by our sHMM for progressively easier problems corresponding to an increasing sequence of cutoff window length thresholds W.


Figure 4.5: RR-sHMM performance on high rejection rates (error rate vs. coverage bound)

The figure clearly shows that for larger window lengths the improvement achieved by the sHMM is more pronounced, especially for low coverage rates. This relationship between the complexity of the data and the potential of selective mechanisms to improve the quality of predicting this data is an interesting subject for further investigation.

4.3.3 Raw Price Data

Our final experiment considered pure (non-smoothed) price data. The prediction task in this setting is perhaps the most difficult we have considered. As all previously described experiments demonstrated the superiority of the RR-sHMM, in this experiment we focused on the performance of this model.

Results for next-day trend prediction of the S&P500 pure return sequence with the RR-sHMM are shown in Figure 4.9 (dotted line). The figure demonstrates that the RR-sHMM still succeeds in achieving a meaningful RC curve, with approximately 10% mean empirical error improvement when rejecting 90% of the instances. These results strongly indicate that the RR model is potentially useful even for this difficult task.

To validate the effectiveness of our selective prediction approach beyond predictions over the S&P500 data, we also tested the performance of the RR-sHMM over the sequence of returns recorded for Gold, represented by its exchange traded fund (ETF) replica, whose symbol is GLD.


Figure 4.6: Pure price sequence vs. filtered price sequence (ζ = 1/3), 10.8.2010-31.12.2010 (price vs. time; curves: pure sequence, IIR output)

The RR-sHMM was applied with exactly the same parameter settings as in the S&P500 case, with the following changes: we took W_p = 1000 and W_f = 25. The reason for these changes was the availability of only 1500 data points, from 2/7/2005 to 12/31/2010 (the GLD ETF did not exist prior to this starting date). The RC curve of GLD trend prediction is shown in Figure 4.9 (solid line). The qualitative characteristics of this RC-curve are similar to those of the RC-curve for S&P500 trend prediction.


Figure 4.7: Comparison of Naive-, RLI- and RR-sHMM for filtered continuous data (error rate vs. coverage bound)

Figure 4.8: sHMM error improvement for different EMA W parameters (improvement (%) vs. coverage bound; curves for W = 3, 4, 5, 6, 7, 8)


Figure 4.9: RC-curves of RR-sHMM for S&P500 and GLD returns (error rate vs. coverage bound)


Chapter 5

Discussion

In this thesis we presented novel techniques for implementing selective prediction using Hid-

den Markov Models. We focused on the selective prediction of next-day directions in financial sequences. For this difficult prediction task our models are able to provide a substantial prediction improvement, as is evident from the empirical results presented in Chapter 4. The structure and modularity of HMMs make them particularly convenient for incorporating controllable selective prediction mechanisms. Indeed, our models give rise to smooth and monotonically decreasing risk-coverage trade-off curves, thus enabling control of the desired level of selectivity. Refinement methods make this control more fine-grained, increasing the potential usability

of the tool in real-world applications.

In the context of financial modeling, when considering the preliminary results presented in Section 4.3.2, we expect that the relative advantage of selective prediction techniques will be higher when applied to easier tasks. We note that better results may be achieved by utilizing more elaborate HMM modeling, perhaps including other sources of specialized information such as prices of other correlated indices. We have not explored these possibilities here and focused instead on a vanilla auto-regressive model.

A vital component in the construction of our state-based selective prediction model is the

estimation of risk and visit rates at every individual HMM state. Clearly, robust estimates

of these two statistics are key when aiming at constructing selective models that generalize

well. In our implementation we only utilized naive estimators based on empirical counts. We

visually evaluated the effectiveness of these naive estimators as follows. For each HMM and

each state, we calculated risk and visit rates using both training and test data, and then plotted

the differences. The resulting plots are shown in Figures 5.1a and 5.1b. Figure 5.1a depicts the


Figure 5.1: Distributions of visit and risk train/test differences. (a) Visit; (b) Risk (histograms of the difference vs. number of instances)

distribution of deviations of empirical visit rates. It is evident that this distribution is symmetric

and quite concentrated around zero, which means that our empirical visit estimates are quite

effective. Figure 5.1b depicts a similar distribution, but now for state risks. It is apparent that

this distribution is much less concentrated, which means that our empirical risk estimates are

quite inaccurate on average. On the other hand, the distribution is quite symmetric about zero,

so underestimates are often compensated by overestimates.
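For concreteness, the following is a minimal sketch of the naive count-based estimators mentioned above (not our exact implementation, which relies on forward-backward variables): given, for each time step, the decoded state and whether its label agreed with the true label, the visit rate of a state is the fraction of steps spent in it and its risk is the fraction of those steps on which the prediction was wrong; the train/test differences plotted in Figure 5.1 are then simply the differences between the two resulting estimates.

    import numpy as np

    def empirical_visit_and_risk(states, errors, n_states):
        """Naive count-based estimates: visit[i] = fraction of time steps spent in state i,
        risk[i] = fraction of those steps on which the emitted prediction was wrong."""
        states = np.asarray(states)
        errors = np.asarray(errors, dtype=float)
        visit = np.zeros(n_states)
        risk = np.zeros(n_states)
        for i in range(n_states):
            mask = (states == i)
            visit[i] = mask.mean()
            risk[i] = errors[mask].mean() if mask.any() else 0.0
        return visit, risk

    # Hypothetical usage for the train/test comparison of Figure 5.1:
    # visit_tr, risk_tr = empirical_visit_and_risk(train_states, train_errors, N)
    # visit_te, risk_te = empirical_visit_and_risk(test_states, test_errors, N)
    # visit_diff, risk_diff = visit_tr - visit_te, risk_tr - risk_te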

We believe that a major bottleneck in attaining smaller test errors is these noisy risk estimates. This noise is partly due to the noisy nature of our prediction problem, but may also be attributed to the simplistic approach we took in estimating empirical risk. A challenging problem would be to incorporate more robust estimates into our mechanism, which may lead to

better risk-coverage trade-offs.

This work focused on the construction of selective mechanisms on top of HMMs, and

their empirical evaluation. While we believe our results provide a kind of proof of concept, it

would be very interesting to investigate theoretical properties of selective prediction in the more

general context of sequential learning, and in HMM modeling in particular. Potentially relevant

results, when considering such directions, are the recent theoretical studies in [10, 11] in the

context of selective classification. Finally, it would be very interesting to examine selective

prediction mechanisms in the more general context of Bayesian networks and other types of

graphical models.


References

[1] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation for hidden Markov model parameters for speech recognition. In Proceedings of ICASSP, pages 49–52, 1986.

[2] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.

[3] L. E. Baum and A. Egon. An inequality with applications to statistical estimation of probabilistic functions of a Markov process and to a model of ecology. Bull. Amer. Meteorol. Soc., 73:360–363, 1966.

[4] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

[5] L. E. Baum and G. R. Sell. Growth functions for transformations on manifolds. Pac. J. Math., 27(2):211–227, 1968.

[6] M. Bicego, E. Grosso, and E. Otranto. A Hidden Markov Model approach to classify and predict the sign of financial local trends. SSPR, 5342:852–861, 2008.

[7] M. Brand. Coupled hidden Markov models for modeling interacting processes. Technical Report 405, MIT Media Lab, 1997.

[8] C. Chow. On optimum recognition error and reject tradeoff. IEEE-IT, 16:41–46, 1970.

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.


[10] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 11:1605–1641, May 2010.

[11] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.

[12] Y. Ephraim, A. Dembo, and L. R. Rabiner. A minimum discrimination information approach for hidden Markov modeling. In Proceedings of ICASSP, 1987.

[13] S. Fine, Y. Singer, and N. Tishby. The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning, 32(1):41–62, 1998.

[14] D. Freitag and A. K. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, 1999.

[15] Y. Freund, Y. Mansour, and R. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004.

[16] Z. Ghahramani and M. I. Jordan. Factorial Hidden Markov Models. Machine Learning, 29(2–3):245–273, 1997.

[17] J. Hamilton. Analysis of time series subject to changes in regime. Journal of Econometrics, 45(1–2):39–70, 1990.

[18] B. Hanczar and E. Dougherty. Classification with reject option in gene expression data. Bioinformatics, 24:1889–1895, 2008.

[19] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning Hidden Markov Models. In COLT, 2009.

[20] P. Idvall and C. Jonsson. Algorithmic trading: Hidden Markov models on foreign exchange data. Master's thesis, Linkopings Universitet, Sweden, 2008.

[21] A. Koerich. Rejection strategies for handwritten word recognition. In IWFHR, 2004.

[22] A. Krogh. Hidden Markov Models for labeled sequences. In Proceedings of the 12th IAPR ICPR'94, pages 140–144, 1994.

[23] C. D. Manning and H. Schutze. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA, 1999.


[24] K. Murphy. Hidden Markov Model (HMM) Toolbox for Matlab. http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, 1998.

[25] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In Proceedings of the 22nd International Conference on Machine Learning, pages 665–672. ACM Press, 2005.

[26] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.

[27] S. Rao and J. Hong. Analysis of Hidden Markov Models and Support Vector Machines in financial applications. Technical Report UCB/EECS-2010-63, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2010.

[28] L. K. Saul and M. I. Jordan. Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37:75–87, 1999.

[29] S. Shi and A. S. Weigend. Taking time seriously: Hidden Markov Experts applied to financial engineering. In IEEE/IAFE, pages 244–252. IEEE, 1997.

[30] S. Siddiqi, G. Gordon, and A. Moore. Fast State Discovery for HMM Model Selection and Learning. In AI-STATS, 2007.

[31] F. Tortorella. Reducing the classification cost of support vector classifiers through an ROC-based reject rule. Pattern Anal. Appl., 7:128–143, 2004.

[32] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE-IT, 13(2):260–269, 1967.

[33] Y. Zhang. Prediction of financial time series with hidden Markov models. Master's thesis, The School of Computing Science, Simon Fraser University, Canada, 2004.


Appendix A

Derivation of Baum-Welch

re-estimation formulas for RR-HMM

In this appendix we derive the re-estimation formula for the parameter $\pi$ of the refining HMM in the RR-sHMM model, as given in Equation 3.21a. Since the re-estimation formulas for the other parameters ($A$ and $B$) are quite similar to those in the standard Baum-Welch method, and their derivations closely follow the original derivations, we omit them and focus in this appendix on the re-estimation of $\pi$.

Let $\lambda = \langle Q, V, \pi, A, B \rangle$ be an HMM. Let $\lambda^i = \langle Q^i, V^i, \pi^i, A^i, B^i \rangle$ be a refining HMM of the heavy state $q_i \in Q$. Let $\lambda^e = \langle Q^e, V^e, \pi^e, A^e, B^e \rangle$ be the flattened HMM resulting from the embedding of $\lambda^i$ within $\lambda$. Suppose w.l.o.g. that $|Q| = |Q^i| = N$. Denote by $\bar{x}$ the re-estimation result of a variable $x$.

As described in Section 3.2.6.3, we estimate the parameters of $\lambda^i$ using a variation of the Baum-Welch algorithm, applied to the flattened HMM $\lambda^e$. According to this algorithm, the re-estimation formulas are the result of the maximization of the auxiliary function (Equation 2.9),

    Q(\lambda^e, \bar{\lambda}^e) = \sum_{S^e} P[S^e \mid O, \lambda^e] \, \log P[O, S^e \mid \bar{\lambda}^e],

where $S^e$ is a state sequence originating from $\lambda^e$. Since $\bar{\lambda}^i$, which is a re-estimated component of $\bar{\lambda}^e$, is itself an HMM, it should obey the HMM stochastic constraints.

For the re-estimation of $\pi^i$, the relevant stochastic constraint is

    \sum_{j=1}^{N} \bar{\pi}^i_j = 1.    (A.1)


Using Lagrange multipliers, we combine Equation 2.9 and Equation A.1 into the following Lagrangian,

    Q(\lambda^e, \mu) = \sum_{S^e} P[S^e \mid O, \lambda^e] \, \log P[O, S^e \mid \bar{\lambda}^e] + \mu \Big( \sum_{j=1}^{N} \bar{\pi}^i_j - 1 \Big),    (A.2)

where $\mu$ is a Lagrange multiplier. To maximize this Lagrangian, we require its partial derivatives,

    \frac{\partial Q}{\partial \bar{\pi}^i_j} = \sum_{S^e} P[S^e \mid O, \lambda^e] \, P[O, S^e \mid \bar{\lambda}^e]^{-1} \, \frac{\partial P[O, S^e \mid \bar{\lambda}^e]}{\partial \bar{\pi}^i_j} + \mu, \qquad j = 1, \ldots, N,    (A.3a)

    \frac{\partial Q}{\partial \mu} = \sum_{j=1}^{N} \bar{\pi}^i_j - 1    (A.3b)

(we omit the other partial derivatives of the Lagrangian since they do not affect the re-estimation formula for $\pi$ that we derive).

From Equation 2.3, we know that the probability of the observation sequence together with the corresponding state sequence given the model is

    P[O, S^e \mid \bar{\lambda}^e] = \bar{\pi}_{S^e_1} \bar{b}_{S^e_1}(O_1) \, \bar{a}_{S^e_1 S^e_2} \bar{b}_{S^e_2}(O_2) \cdots \bar{a}_{S^e_{T-1} S^e_T} \bar{b}_{S^e_T}(O_T).

Suppose $S^e_1$ is a state in $Q^e$ that corresponds to an aggregate state $\{q_k\}$, and $S^e_2$ is a state in $Q^e$ corresponding to an aggregate state $\{q_i, q^i_j\}$, where $q_i, q_k \in Q$, $q^i_j \in Q^i$, and $q_i$ is a heavy state. Then, using the embedding of a type I transition (see Section 3.2.6.1, Equation 3.15), we get

    P[O, S^e \mid \bar{\lambda}^e] = \bar{\pi}_{S^e_1} \bar{b}_{S^e_1}(O_1) \, \bar{a}_{ki} \bar{\pi}^i_j \, \bar{b}_{S^e_2}(O_2) \cdots \bar{a}_{S^e_{T-1} S^e_T} \bar{b}_{S^e_T}(O_T).

Assuming that the number of occurrences of $\bar{\pi}^i_j$ in $P[O, S^e \mid \bar{\lambda}^e]$ is $n$, we obtain

    P[O, S^e \mid \bar{\lambda}^e]^{-1} \, \frac{\partial P[O, S^e \mid \bar{\lambda}^e]}{\partial \bar{\pi}^i_j} = \frac{n}{\bar{\pi}^i_j},    (A.4)

and substituting A.4 in A.3a, we conclude that

    \frac{\partial Q}{\partial \bar{\pi}^i_j} = \frac{1}{\bar{\pi}^i_j} \sum_{n=1}^{T} n \Big( \sum_{S^e \,\text{s.t.}\, |\bar{\pi}^i_j| = n} P[S^e \mid O, \lambda^e] \Big) + \mu, \qquad j = 1, \ldots, N.    (A.5)

We observe that $\sum_{S^e \,\text{s.t.}\, |\bar{\pi}^i_j| = n} P[S^e \mid O, \lambda^e]$ is the probability of either having exactly $n$ type I transitions into the aggregate state $\{q_i, q^i_j\}$, or having $\{q_i, q^i_j\}$ as the initial state of the state sequence with subsequent (not necessarily contiguous) $n-1$ type I transitions into it. The reason


is that only one of those two conditions can result in a choice of a state in $\lambda^i$ according to the initial state distribution $\pi^i$. Consequently, the expression $\sum_{n=1}^{T} n \big( \sum_{S^e \,\text{s.t.}\, |\bar{\pi}^i_j| = n} P[S^e \mid O, \lambda^e] \big)$ equals the expected number of choices of $q^i_j$ with the distribution $\pi^i$, i.e., of either type I transitions into the aggregate state $\{q_i, q^i_j\}$ or a choice of this state as the initial state of the state sequence.

Denoting this expectation by $E^i_j$, and solving Equation A.5 for its root, we obtain

    \bar{\pi}^i_j = -\frac{E^i_j}{\mu}.    (A.6)

Assigning $\bar{\pi}^i_j$ from Equation A.6 into the constraint A.1, and solving for $\mu$, we get

    \mu = -\sum_{n=1}^{N} E^i_n,    (A.7)

and by substituting $\mu$ into Equation A.6, we obtain a valid probability measure for $\bar{\pi}^i_j$,

    \bar{\pi}^i_j = \frac{E^i_j}{\sum_{n=1}^{N} E^i_n}.    (A.8)

Let $I_1$ be an indicator variable for the aggregate state $\{q_i, q^i_j\}$ being the initial state. Let $I_t$, $2 \le t \le T$, be indicator variables for a type I transition at time $t$ into this aggregate state. Then¹,

    E^i_j = \sum_{t=1}^{T} E[I_t \mid O, \lambda^e]
          = P[S^e_1 = \{q_i, q^i_j\} \mid O, \lambda^e] + \sum_{t=2}^{T} \sum_{\substack{k=1 \\ k \ne i}}^{N} P[S^e_{t-1} = \{q_k\}, \, S^e_t = \{q_i, q^i_j\} \mid O, \lambda^e]
          = \gamma^e_1(j) + \sum_{t=1}^{T-1} \sum_{\substack{k=1 \\ k \ne i}}^{N} \xi^e_t(k, j).    (A.9)

Considering that the second derivative of $Q$ w.r.t. $\bar{\pi}^i_j$,

    \frac{\partial^2 Q}{\partial (\bar{\pi}^i_j)^2} = -\frac{E^i_j}{(\bar{\pi}^i_j)^2},    (A.10)

is always negative, we conclude that the value for $\bar{\pi}^i_j$ that we found is a local maximum of $Q$.

¹ Note that in Equation A.9, the states of the flattened model $\lambda^e$ are specified in their aggregate form.
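For illustration only, the following sketch (with hypothetical array layouts, not code from the thesis) computes $E^i_j$ according to Equation A.9 from the quantities $\gamma^e_1$ and $\xi^e_t$ obtained in a forward-backward pass over the flattened model, and normalizes according to Equation A.8.

    import numpy as np

    def reestimate_refining_pi(gamma1, xi, inner_states, outer_states_not_i):
        """Re-estimate the initial distribution of the refining HMM (Equations A.8-A.9).

        gamma1[s]           -- gamma^e_1(s): posterior of being in flattened state s at time 1.
        xi[t, k, s]         -- xi^e_t(k, s): posterior of a transition from state k at time t
                               to state s at time t+1 (t = 1 ... T-1).
        inner_states[j]     -- flattened index of the aggregate state {q_i, q^i_j}.
        outer_states_not_i  -- flattened indices of the aggregate states {q_k}, k != i.
        """
        E = np.array([gamma1[s] + xi[:, outer_states_not_i, s].sum()   # Equation A.9
                      for s in inner_states])
        return E / E.sum()                                             # Equation A.8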






Abstract

In this research work we discuss selective prediction (or "prediction with a reject option") in the context of sequential learning. We believe that this application of selective prediction has the potential to enable improved performance in sequential prediction problems. In contrast to the field of selective classification, which has been studied in depth over the last forty years, selective models for sequential prediction problems have been discussed only sparsely in the literature. We chose to focus on the prediction of future trends in financial sequences, where the prediction horizon is one day. This problem, besides being interesting from a practical point of view, represents a broad family of noisy sequential prediction problems.

At the core of the selective prediction model we develop lies a Hidden Markov Model (HMM). An HMM is a useful and widespread tool for analyzing data sequences. In particular, a considerable number of works investigating applications of HMMs to the prediction of financial data sequences have been published in the past. Moreover, being a modular structure, an HMM provides a convenient infrastructure for defining specialized selective prediction mechanisms. We therefore chose an HMM as the predictor at the core of the proposed model, and our goal was to develop specialized rejection mechanisms capable of achieving a useful risk-coverage trade-off in the prediction of trends in financial sequences.

An HMM is a system composed of two interleaved stochastic processes: a hidden process and an observable process. Users of the model observe only the observable process and try to guess the hidden process occurring simultaneously. It is customary to describe an HMM as a probabilistic state machine, in which each state emits letters, called observations, from a predefined alphabet. In this description, the observable process is a sequence of observations, and the hidden process is the sequence of states that generated the output sequence. Both sequences form first-order Markov chains.

Formally, an HMM λ is described by a quintuple ⟨Q, V, π, A, B⟩, where Q is a set of states of size N, V is the set of possible observations (the machine's alphabet) of size M, π is a vector of size N describing the initial state distribution, i.e., entry i of the vector (π_i) is the probability of state i being the initial state, A is an N × N transition matrix of the state machine, i.e., the entry at position i, j (a_ij) is the probability of moving from state q_i to state q_j, and B is an N × M emission matrix, i.e., the entry at position i, k is the probability of observation v_k in state q_i. It is customary to regard an HMM as a generative model that produces the observation sequence.

The generation of an observation sequence of length T proceeds as follows (a code sketch illustrating this procedure follows the list):

1. Set t = 1 and choose an initial state q_i according to the initial distribution π.

2. Choose an observation according to the distribution defined by row i of the matrix B.

3. Increment t, and choose the next state according to the distribution defined by row i of the matrix A.

4. If t ≤ T, return to step 2; otherwise terminate.

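The following is a minimal sketch of this generative procedure, assuming the parameters pi, A and B are given as NumPy arrays (an illustration only, not code from the thesis):

    import numpy as np

    def generate(pi, A, B, T, seed=0):
        """Sample a state and observation sequence of length T from an HMM <pi, A, B>."""
        rng = np.random.default_rng(seed)
        N, M = B.shape
        states, observations = [], []
        s = rng.choice(N, p=pi)                        # step 1: initial state drawn from pi
        for _ in range(T):
            observations.append(rng.choice(M, p=B[s]))  # step 2: emit a symbol from row s of B
            states.append(s)
            s = rng.choice(N, p=A[s])                   # step 3: move to the next state via row s of A
        return states, observations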
For HMMs there exist efficient algorithms for solving the basic problems associated with their use, which makes them especially attractive for practical applications. Thus, given an observation sequence O and a model λ, the probability that O was generated by λ, namely P[O|λ], can be computed efficiently (the evaluation problem) using the forward-backward algorithm, whose complexity is linear in the sequence length. Likewise, given an observation sequence O and a model λ, a state sequence S that matches the sequence O can be found (the decoding problem) using a dynamic-programming algorithm called the Viterbi algorithm. Finally, there exists a training algorithm, i.e., an algorithm that finds (locally) optimal parameters π, A, B for a machine λ given an observation sequence O. This algorithm is called Baum-Welch, and it belongs to the broad family of Expectation-Maximization algorithms.

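As an illustration of the evaluation problem, here is a minimal sketch of the forward recursion computing P[O | λ] for a discrete observation sequence (unscaled, so suitable only for short sequences; a scaled or log-space version is needed in practice):

    import numpy as np

    def likelihood(pi, A, B, obs):
        """Forward recursion: returns P[O | lambda] for a discrete observation sequence obs."""
        alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(O_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]         # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(O_{t+1})
        return float(alpha.sum())                 # P[O | lambda] = sum_i alpha_T(i)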
Technion - Computer Science Department - M.Sc. Thesis MSC-2013-07 - 2013

Page 79: Selective Prediction with Hidden Markov Models · Selective Prediction with Hidden Markov Models Research Thesis In Partial Fulfillment of The ... A convenient and quite versatile

Sometimes, in various applications, the observation sequence is accompanied by a label sequence, such that every observation in the sequence has a label, and the goal is to learn a model that will label new sequences. One possible way to solve problems of this kind with HMMs is to find a correspondence between the model's states and the possible labels. This means that each state is associated with a label (or several labels) from the set of possible labels, and the labeling of new sequences is performed by finding a corresponding state sequence. The labeling of states can be done before the model is trained, in which case the training procedure is a variation of Baum-Welch that takes the state labels into account (i.e., supervised learning), or after the model is trained.

The HMM we use for predicting trends of financial sequences is based on labeled sequences. The observation sequence is the sequence of daily returns of a financial instrument such as a stock index, and the label for each point (day) is the trend of the sequence on the following day (up/down). Given a return sequence we learn the parameters of an HMM, and from these parameters and the label sequence we assign a label (up/down) to each state. Now, in order to predict the trend one day ahead, a label must be found for the last observation. This can be done by computing the state with the highest probability of generating the last observation (via the forward-backward algorithm) and emitting that state's label as the expected behavior of the sequence on the following day. This prediction system serves as the basis for the selective prediction system we develop.

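A minimal sketch of this prediction step, assuming a trained model (pi, A, B), per-state labels (+1 for an up-trend, -1 for a down-trend) and a discrete observation sequence (an illustration only; some of the experiments use continuous observation densities instead):

    import numpy as np

    def predict_next_trend(pi, A, B, state_labels, obs):
        """Return the label of the state most likely to have emitted the last observation.
        Since beta_T(i) = 1, gamma_T is proportional to alpha_T, so normalized forward
        variables suffice; per-step normalization also prevents numerical underflow."""
        alpha = pi * B[:, obs[0]]
        alpha = alpha / alpha.sum()
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            alpha = alpha / alpha.sum()
        return state_labels[int(np.argmax(alpha))]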
Selective prediction models are measured by two principal quantities: coverage and risk. To define these quantities we use the following definitions of selective classifiers from the paper of El-Yaniv and Wiener, "On the foundations of noise-free selective classification". A classifier is defined as a function f: X → {−1, 1}, where X is the input domain. To turn the classifier into a selective classifier, a selection function g: X → [0, 1] is defined; that is, for every input example this function defines the probability of that example being accepted or rejected by the classifier. The selective classifier is thus obtained by applying the two functions:

    (f, g)(x) = f(x) with probability g(x), and reject with probability 1 − g(x).

The coverage of the model is defined as the expected fraction of the example collection that is accepted by the classifier, namely C ≜ E[g(X)], where the expectation is taken with respect to the distribution over the examples. The risk of the model is defined as the ratio between the expected error of the model (where an error is an incorrect classification) and its coverage, R ≜ E[I{f(X) ≠ Y} g(X)] / C; at the point C = 0 the model risk is also defined to be 0.

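In code, these definitions read roughly as follows (a sketch only; f, g and the data are arbitrary placeholders):

    import numpy as np

    def selective_predict(f, g, x, rng):
        """Apply (f, g): return f(x) with probability g(x), otherwise None (reject)."""
        return f(x) if rng.random() < g(x) else None

    def empirical_coverage_and_risk(predictions, labels):
        """Empirical analogues of C = E[g(X)] and R = E[1{f(X) != Y} g(X)] / C (risk is 0 when C = 0)."""
        accepted = np.array([p is not None for p in predictions])
        coverage = accepted.mean()
        if coverage == 0:
            return 0.0, 0.0
        errors = np.array([p is not None and p != y for p, y in zip(predictions, labels)])
        return coverage, errors.mean() / coverage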
The coverage and risk of these models admit a trade-off. The meaning of the trade-off is that one can compromise on one of the measures in order to obtain a model that is better in terms of the other (for example, a lower coverage can be allowed in order to obtain a more accurate model). The problem of learning selective classifiers can be formulated as either of the following two optimization problems:

1. Given a lower bound on the coverage, learn the selective classifier with the lowest risk.

2. Given an upper bound on the risk, learn the selective classifier with the largest coverage.

It is convenient (and useful) to describe the trade-off between coverage and risk by a curve, where each point on the curve describes a classifier with a particular coverage-risk profile. A good algorithm for learning selective classifiers is one that enables efficient exploitation of the coverage-risk trade-off. In addition, the algorithm should allow the user to choose the desired risk-coverage profile.

In this work we examine two approaches to constructing selective prediction models based on HMMs. The first approach is based on a classical selectivity principle relying on prediction ambiguity. In this approach, we divide the given collection of sequences into sequences for which the prediction should be positive and sequences for which the prediction should be negative. Based on these two groups we train two HMMs, the first on the positive sequences and the second on the negative sequences.


The classifier makes the prediction for a new sequence using the maximum-likelihood principle, i.e., by computing the likelihood of the sequence being generated by each of the models (P[O|λ]) and choosing the more likely model. Prediction ambiguity is measured by the closeness of the two probabilities. When the results are too close (the difference between the logarithms of the probabilities is below a certain threshold), the system rejects the sequence; otherwise, it outputs the prediction. The threshold is computed adaptively, taking into account a lower bound on the desired coverage, which is received as a parameter from the user.

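A minimal sketch of this ambiguity-based decision rule, assuming loglik_pos and loglik_neg return log P[O | λ] for the two trained models (e.g., via a forward pass) and theta is the adaptively chosen threshold mentioned above:

    def ambiguity_predict(loglik_pos, loglik_neg, O, theta):
        """Maximum-likelihood prediction with rejection when the two log-likelihoods are too close."""
        lp, ln = loglik_pos(O), loglik_neg(O)
        if abs(lp - ln) < theta:      # ambiguous: both models explain O almost equally well
            return None               # reject
        return +1 if lp > ln else -1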
In the second approach, which is the main contribution of this work, we exploit the modular structure of the model, rank the model's states according to a weighted reliability measure, and define a subset of less reliable states as a "rejecting subset". This means that the model rejects observations that were generated by the unreliable states. We call this model a "selective HMM".

To rank the states, we rely on two measures: the visit rate of a state and the risk of a state. The visit rate measures the state's contribution to the overall coverage of the selective HMM, and it is computed via the expected fraction of time the model spends in that state. The state risk measures the reliability of the predictions produced by the state. This measure is computed as the ratio between the expected time the model spends in the state while producing an erroneous prediction and the overall expected time the model spends in the state. Both measures can be computed using the variables computed by the forward-backward algorithm. Once these measures have been computed for each state, the overall coverage and risk of the selective model can be expressed in terms of them. A rejecting subset of states is constructed by choosing the states with the highest risk, such that the total visit rate of all states in the subset does not exceed an upper bound on the allowed rejection rate (the complement of a lower bound on the desired coverage of the model), which is received as a parameter from the user.

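A minimal sketch of this construction, given per-state risk and visit-rate estimates and a bound on the allowed rejection rate; the rule of stopping at the first state that no longer fits the budget is our reading of the construction described above:

    import numpy as np

    def rejecting_subset(risk, visit, max_rejection):
        """Greedily collect the riskiest states while their total visit rate stays within the budget."""
        order = np.argsort(-np.asarray(risk))     # states from highest to lowest risk
        rejected, used = [], 0.0
        for s in order:
            if used + visit[s] > max_rejection:   # adding this state would exceed the allowed rejection rate
                break
            rejected.append(int(s))
            used += visit[s]
        return rejected, used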
This scheme for constructing a selective HMM satisfies the functional requirement of producing more reliable models when more coverage can be sacrificed. However, this simple scheme is problematic in its ability to provide the user with good control over the coverage and risk bounds. The reason lies in the fact that the selective model can only reject whole states; therefore, when the model consists of a small number of states (as happens in many applications), or when high-risk states also have high visit rates, the user's control is limited, since any bound lying between the coverages of two models, one of which is obtained from the other by adding a state to the rejecting subset, is unattainable. To address this problem, we propose two methods whose purpose is to produce "refined" selective models. The first method (called "probabilistic linear interpolation") achieves this goal by refining the choice of the rejecting subset of states. The refinement amounts to the ability to reject a state partially, rather than only as a whole. In practice, in this method a rejecting subset is composed exactly as in the simple selective HMM, but in addition a rejection probability is defined for the highest-risk state that did not enter the subset. With this method, any lower bound on the overall coverage of the model can be achieved (in expectation).

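A sketch of the probabilistic linear interpolation step under the same assumptions: the highest-risk state left out of the rejecting subset receives a rejection probability chosen so that the expected rejected visit mass exactly matches the budget.

    import numpy as np

    def rli_rejection(risk, visit, max_rejection):
        """Rejecting subset plus a partial rejection probability for the boundary state."""
        order = np.argsort(-np.asarray(risk))             # states from highest to lowest risk
        rejected, used = [], 0.0
        for s in order:
            if used + visit[s] > max_rejection:
                boundary = int(s)                         # highest-risk state that did not enter the subset
                p = (max_rejection - used) / visit[boundary]
                return rejected, boundary, p              # reject `boundary` with probability p
            rejected.append(int(s))
            used += visit[s]
        return rejected, None, 0.0                        # budget large enough to reject all states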
The second method, called "recursive refinement", is based on refining the HMM itself, rather than the rejecting subset as in the first method. The goal of this method is to produce a model in which the visit rate of each state does not exceed a bound given in advance. To this end, for every state whose visit rate exceeds the bound, this method constructs an HMM that serves as a kind of "inner model", or "refinement", of that state. The states of this inner model are used for choosing the rejecting subset, so that effectively in this method too states can be rejected partially; however, the rejected parts are chosen to be the less reliable parts, rather than arbitrary parts as in the previous method. The states can be refined recursively until the visit-rate bound is attained.

In the experimental part of this work we tried to answer several main questions:

1. How effective is selective prediction when applied to prediction problems over financial sequences?


2. Which of the proposed selective models is preferable: the ambiguity-based classifier or the selective HMM?

3. Which refinement method achieves better models?

4. Can the presented selective prediction method improve models learned by sophisticated, state-of-the-art algorithms that have recently appeared in the literature?

The results of the variety of experiments we conducted give a strong indication that the proposed selective models are effective and indeed improve the accuracy of the standard HMM. In particular, the results showed that coverage compromises, as they grow, monotonically improve the model's accuracy. These conclusions hold for all the variants of the selective models (including refinements) discussed in this work. In addition, our results show a clear and unequivocal advantage of the selective HMM over the classifier based on rejection due to ambiguity. In fact, this conclusion is especially strong since we granted the ambiguity model an unfair advantage by choosing its hyper-parameters in hindsight, that is, we chose for it optimal parameters based on the test set. Among the refinement methods, the best one is the recursive refinement method, which not only succeeds in producing a more reliable selective model, but also does so in such a way that the actual coverage of the constructed model is higher than the coverage of the model produced by the probabilistic linear interpolation method.
