emerging patterns based classifier
DESCRIPTION
It gives several insights for designing robust, fast, and accurate classifiers based on emerging patterns.
TRANSCRIPT
Contrast Data Mining: Methods and Applications
James Bailey, NICTA Victoria Laboratory and The University of Melbourne
Guozhu Dong, Wright State University
Presented at the IEEE International Conference on Data Mining (ICDM), October 28-31 2007
An up to date version of this tutorial is available at http://www.csse.unimelb.edu.au/~jbailey/contrast
IEEE ICDM 28-31 Oct. 07, Contrast Data Mining: Methods and Applications, James Bailey and Guozhu Dong

Contrast data mining - What is it?
• Contrast - "To compare or appraise in respect to differences" (Merriam Webster Dictionary)
• Contrast data mining - The mining of patterns and models contrasting two or more classes/conditions.

Contrast Data Mining - Why?
"Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more" (Darby Conley, Get Fuzzy, 2001)

What can be contrasted?
• Objects at different time periods
  • "Compare ICDM papers published in 2006-2007 versus those in 2004-2005"
• Objects for different spatial locations
  • "Find the distinguishing features of location x for human DNA, versus location x for mouse DNA"
• Objects across different classes
  • "Find the differences between people with brown hair, versus those with blonde hair"

What can be contrasted? Cont.
• Objects within a class
  • "Within the academic profession, there are few people older than 80" (rarity)
  • "Within the academic profession, there are no rich people" (holes)
  • "Within computer science, most of the papers come from USA or Europe" (abundance)
• Object positions in a ranking
  • "Find the differences between high and low income earners"
• Combinations of the above

Alternative names for contrast data mining
• Contrast = {change, difference, discriminator, classification rule, …}
• Contrast data mining is related to topics such as: change detection, class based association rules, contrast sets, concept drift, difference detection, discriminative patterns, (dis)similarity index, emerging patterns, gradient mining, high confidence patterns, (in)frequent patterns, top k patterns, …

Characteristics of contrast data mining
• Applied to multivariate data
• Objects may be relational, sequential, graphs, models, classifiers, combinations of these
• Users may want either
  • To find multiple contrasts (all, or top k)
  • A single measure for comparison
    • "The degree of difference between the groups (or models) is 0.7"

Contrast characteristics Cont.
• Representation of contrasts is important. Needs to be
  • Interpretable, non redundant, potentially actionable, expressive
  • Tractable to compute
• Quality of contrasts is also important. Need
  • Statistical significance, which can be measured in multiple ways
  • Ability to rank contrasts is desirable, especially for classification

How is contrast data mining used?
• Domain understanding
  • "Young children with diabetes have a greater risk of hospital admission, compared to the rest of the population"
• Used for building classifiers
  • Many different techniques - to be covered later
  • Also used for weighting and ranking instances
• Used in construction of synthetic instances
  • Good for rare classes
• Used for alerting, notification and monitoring
  • "Tell me when the dissimilarity index falls below 0.3"

Goals of this tutorial
• Provide an overview of contrast data mining
• Bring together results from a number of disparate areas.
  • Mining for different types of data
    • Relational, sequence, graph, models, …
  • Classification using discriminating patterns

By the end of this tutorial you will be able to …
• Understand some principal techniques for representing contrasts and evaluating their quality
• Appreciate some mining techniques for contrast discovery
• Understand techniques for using contrasts in classification

Don't have time to cover ..
• String algorithms
• Connections to work in inductive logic programming
• Tree-based contrasts
• Changes in data streams
• Frequent pattern algorithms
• Connections to granular computing
• …

Outline of the tutorial
• Basic notions and univariate contrasts
• Pattern and rule based contrasts
• Contrast pattern based classification
• Contrasts for rare class datasets
• Data cube contrasts
• Sequence based contrasts
• Graph based contrasts
• Model based contrasts
• Common themes + open problems + summary

Basic notions and univariate case
• Feature selection and feature significance tests can be thought of as a basic contrast data mining activity.
  • "Tell me the discriminating features"
    • Would like a single quality measure
    • Useful for feature ranking
• Emphasis is less on finding the contrast and more on evaluating its power

Sample Feature-Class Dataset
Table columns: ID, Height (cm), Class
IDs: 9004, 3325, 4327, 9006, 1005
(Class, Height) rows: (Happy ☺, 150), (…, …), (Happy ☺, 120), (Happy ☺, 137), (Sad, 200)

Discriminative power
• Can assess discriminative power of Height feature by
  • Information measures (signal to noise, information gain ratio, …)
  • Statistical tests (t-test, Kolmogorov-Smirnov, Chi squared, Wilcoxon rank sum, …). Assessing whether
    • The mean of each class is the same
    • The samples for each class come from the same distribution
    • How well a dataset fits a hypothesis
• No single test is best in all situations!

Example Discriminative Power Test - Wilcoxon Rank Sum
• Suppose n1 happy, and n2 sad instances
• Sort the instances according to height value: h1 <= h2 <= h3 <= … <= h(n1+n2)
• Assign a rank to each instance, indicating how many instances in the other class are less. For x in class A:
  Rank(x) = |{y : class(y) <> A and height(y) < height(x)}|
• For each class, compute the Ranksum = Sum(ranks of all its instances)
• Null Hypothesis: The instances are from the same distribution
• Consult statistical significance table to determine whether value of Ranksum is significant

Rank Sum Calculation Example

ID    Height(cm)   Class      Rank
816   120          Happy ☺    0
415   150          Sad        1
321   177          Happy ☺    1
660   190          Sad        2
481   210          Sad        2
324   220          Happy ☺    3

Happy: RankSum = 3+1+0 = 4
Sad: RankSum = 2+2+1 = 5

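The rank definition from the previous slide can be checked against this table with a short sketch. The function names `rank` and `rank_sums` are illustrative choices, not from the tutorial, and ties are not handled.

```python
# Each record is (id, height_cm, class_label), taken from the example table.
records = [
    (816, 120, "Happy"),
    (415, 150, "Sad"),
    (321, 177, "Happy"),
    (660, 190, "Sad"),
    (481, 210, "Sad"),
    (324, 220, "Happy"),
]

def rank(record, data):
    """Number of instances of the other class with a strictly smaller height."""
    _, height, label = record
    return sum(1 for _, h, c in data if c != label and h < height)

def rank_sums(data):
    """Sum the ranks of all instances, per class."""
    sums = {}
    for record in data:
        label = record[2]
        sums[label] = sums.get(label, 0) + rank(record, data)
    return sums

sums = rank_sums(records)
print(sums)  # {'Happy': 4, 'Sad': 5}
```

Deciding whether 4 and 5 are significant still requires a significance table or a statistics library, which this sketch leaves out.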
Wilcoxon Rank Sum Test Cont.
• Non parametric (no normal distribution assumption)
• Requires an ordering on the attribute values
• Scaled value of Ranksum is equivalent to area under ROC curve for using the selected feature as a classifier: Ranksum / (n1*n2)
[Figure: ROC curve; x-axis False Positive Rate (0% to 100%), y-axis True Positive Rate (0% to 100%)]

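The equivalence between the scaled rank sum and the area under the ROC curve can be checked numerically. The sketch below is an illustration under stated assumptions: the heights come from the rank sum example, the classifier is assumed to be "predict positive if height >= t", the helper names are made up here, and ties between classes are not handled.

```python
def scaled_ranksum(pos, neg):
    """Rank sum of the positive class divided by n1*n2."""
    ranksum = sum(sum(1 for n in neg if n < p) for p in pos)
    return ranksum / (len(pos) * len(neg))

def roc_auc(pos, neg):
    """Area under the ROC curve of the rule 'positive if value >= t'."""
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(p >= t for p in pos) / len(pos)
        fpr = sum(n >= t for n in neg) / len(neg)
        points.append((fpr, tpr))
    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

happy = [120, 177, 220]
sad = [150, 190, 210]

auc_sad = roc_auc(sad, happy)           # "Sad" as the positive class
scaled_sad = scaled_ranksum(sad, happy)  # 5 / (3 * 3)
print(round(auc_sad, 3), round(scaled_sad, 3))
```

Both values are 5/9 for the "Sad" class, and 4/9 when "Happy" is treated as the positive class.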
Discriminating with attribute values
• Can alternatively focus on significance of attribute values, with either
  1) Frequency/infrequency (high/low counts)
     Frequent in one class and infrequent in the other.
     • There are 50 happy people of height 200cm and only 2 sad people of height 200cm
  2) Ratio (high ratio of support)
     Appears X times more in one class than the other
     • There are 25 times more happy people of height 200cm than sad people of height 200cm

Attribute/Feature Conversion
• Possible to form a new binary feature based on attribute value and then apply feature significance tests
  • Blur distinction between attribute and attribute value

150cm   200cm   …   Class
Yes     No      …   Happy ☺
No      Yes     …   Sad

Discriminating Attribute Values in a Data Stream
• Detecting changes in attribute values is an important focus in data streams
  • Often focus on univariate contrasts for efficiency reasons
  • Finding when change occurs (non stationary stream).
  • Finding the magnitude of the change. E.g. How big is the distance between two samples of the stream?
  • Useful for signaling necessity for model update or an impending fault or critical event

Odds ratio and Risk ratio
• Can be used for comparing or measuring effect size
• Useful for binary data
• Well known in clinical contexts
• Can also be used for quality evaluation of multivariate contrasts (will see later)
• A simple example given next

Odds and risk ratio Cont.

ID   Gender (feature)   Exposed (event)
1    Male               Yes
2    Female             No
3    Male               No
4    …                  …

Odds Ratio Example
• Suppose we have 100 men and 100 women, and 70 men and 10 women have been exposed
  • Odds of exposure (male) = 0.7/0.3 = 2.33
  • Odds of exposure (female) = 0.1/0.9 = 0.11
  • Odds ratio = 2.33/0.11 = 21.2 (exactly 21 if the two odds are not rounded first)
• Males have about 21 times the odds of exposure of females
• Indicates exposure is much more likely for males than for females

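The arithmetic of this example can be reproduced directly from the counts. This is a minimal sketch with an illustrative helper name `odds`. Working from counts avoids the rounding of the intermediate odds, which is why it gives 21 rather than 21.2.

```python
def odds(exposed, total):
    """Odds of exposure: exposed count divided by non-exposed count."""
    return exposed / (total - exposed)

odds_male = odds(70, 100)      # 70 / 30
odds_female = odds(10, 100)    # 10 / 90
odds_ratio = odds_male / odds_female

print(round(odds_male, 2), round(odds_female, 2), round(odds_ratio, 1))
```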
Relative Risk Example
• Suppose we have 100 men and 100 women, and 70 men and 10 women have been exposed
  • Risk of exposure (male) = 70/100 = 0.7
  • Risk of exposure (female) = 10/100 = 0.1
  • The relative risk = 0.7/0.1 = 7
• Men are 7 times more likely to be exposed than women

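For comparison with the odds ratio, the relative risk of the same example can be computed from the counts. The helper name `risk` is an illustrative choice.

```python
def risk(exposed, total):
    """Risk of exposure: fraction of the group that was exposed."""
    return exposed / total

risk_male = risk(70, 100)
risk_female = risk(10, 100)
relative_risk = risk_male / risk_female

print(risk_male, risk_female, round(relative_risk, 2))
```

The same data give a relative risk of 7 but an odds ratio of 21, so the two measures should not be read interchangeably when the event is common.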
Pattern/Rule Based Contrasts
• Overview of "relational" contrast pattern mining
• Emerging patterns and mining
  • Jumping emerging patterns
  • Computational complexity
  • Border differential algorithm
    • Gene club + border differential
    • Incremental mining
  • Tree based algorithm
  • Projection based algorithm
  • ZBDD based algorithm
• Bioinformatic application: cancer study on microarray gene expression data

Overview
• Class based association rules (Cai et al 90, Liu et al 98, ...)
• Version spaces (Mitchell 77)
• Emerging patterns (Dong+Li 99) - many algorithms (later)
• Contrast set mining (Bay+Pazzani 99, Webb et al 03)
• Odds ratio rules & delta discriminative EP (Li et al 05, Li et al 07)
• MDL based contrast (Siebes, KDD07)
• Using statistical measures to evaluate group differences (Hilderman+Peckman 05, Webb 07)
• Spatial contrast patterns (Arunasalam et al 05)
• …… see references

Classification/Association Rules
• Classification rules -- special association rules (with just one item - class -- on RHS):
  X → C (s, c)
  • X is a pattern,
  • C is a class,
  • s is support,
  • c is confidence

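The support and confidence of a rule X → C can be computed as follows. This is a sketch that assumes the usual association-rule definitions, which the slide does not spell out: support is the fraction of all tuples that contain X and belong to C, and confidence is the fraction of tuples containing X that belong to C. The toy data and the name `rule_stats` are hypothetical.

```python
def rule_stats(pattern, target, data):
    """Support and confidence of the rule pattern -> target.

    data is a list of (itemset, class_label) pairs.
    """
    pattern = frozenset(pattern)
    covered = [label for items, label in data if pattern <= items]
    hits = sum(1 for label in covered if label == target)
    support = hits / len(data)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical toy dataset, for illustration only.
data = [
    (frozenset({"a", "b"}), "C1"),
    (frozenset({"a", "b", "c"}), "C1"),
    (frozenset({"a", "c"}), "C2"),
    (frozenset({"b", "c"}), "C2"),
]

s, c = rule_stats({"a"}, "C1", data)
print(s, round(c, 3))  # 0.5 0.667
```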
Version Space (Mitchell)
• Version space: the set of all patterns consistent with given (D+, D-) - patterns separating D+, D-.
  • The space is delimited by a specific & a general boundary.
  • Useful for searching the true hypothesis, which lies somewhere b/w the two boundaries.
  • Adding +ve examples to D+ makes the specific boundary more general; adding -ve examples to D- makes the general boundary more specific.
• Common pattern/hypothesis language operators: conjunction, disjunction
• Patterns/hypotheses are crisp; need to be generalized to deal with percentages; hard to deal with noise in data

STUCCO, MAGNUM OPUS for contrast pattern mining
• STUCCO (Bay+Pazzani 99)
  • Mining contrast patterns X (called contrast sets) between k >= 2 groups: |supp_i(X) - supp_j(X)| >= minDiff
  • Use Chi2 to measure statistical significance of contrast patterns
    • significance cut-off thresholds change, based on the level of the node and the local number of contrast patterns
  • Max-Miner like search strategy, plus some pruning techniques
• MAGNUM OPUS (Webb 01)
  • An association rule mining method, using Max-Miner like approach (proposed before, and independently of, Max-Miner)
  • Can mine contrast patterns (by limiting RHS to a class)

Contrast patterns vs decision tree based rules
• It has been recognized by several authors (e.g. Bay+Pazzani 99) that
  • rules generated from decision trees can be good contrast patterns,
  • but may miss many good contrast patterns.
• Different contrast set mining algorithms have different thresholds
  • Some have min support threshold
  • Some have no min support threshold; low support patterns may be useful for classification etc

Emerging Patterns
• Emerging Patterns (EPs) are contrast patterns between two classes of data whose support changes significantly between the two classes. Change significance can be defined by:
  • big support ratio: supp2(X) / supp1(X) >= minRatio
    (similar to RiskRatio; +: allowing patterns with small overall support)
  • big support difference: |supp2(X) - supp1(X)| >= minDiff (as defined by Bay+Pazzani 99)
• If supp2(X) / supp1(X) = infinity, then X is a jumping EP.
  • A jumping EP occurs in some members of one class but never occurs in the other class.
• Conjunctive language; extension to disjunctive EP later

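The two change-significance tests and the jumping-EP condition can be sketched as below. The toy transactions and all function names are hypothetical. Treating a pattern with zero support in both classes as having growth rate 0 is a convention assumed here, since the slide does not cover that case.

```python
import math

def support(pattern, dataset):
    """Fraction of transactions in dataset that contain every item of pattern."""
    pattern = frozenset(pattern)
    return sum(1 for t in dataset if pattern <= t) / len(dataset)

def growth_rate(pattern, d1, d2):
    """supp2(X) / supp1(X), with infinity when X never occurs in d1."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0  # 0/0 treated as 0 (assumption)
    return s2 / s1

def is_emerging(pattern, d1, d2, min_ratio):
    return growth_rate(pattern, d1, d2) >= min_ratio

def is_jumping(pattern, d1, d2):
    return growth_rate(pattern, d1, d2) == math.inf

def has_big_difference(pattern, d1, d2, min_diff):
    return abs(support(pattern, d2) - support(pattern, d1)) >= min_diff

# Hypothetical toy classes, for illustration only.
d1 = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("c")]
d2 = [frozenset("ab"), frozenset("abc"), frozenset("ab"), frozenset("bd")]

print(growth_rate("ab", d1, d2))   # 3.0
print(is_jumping("d", d1, d2))     # True
```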
A typical EP in the Mushroom dataset
• The Mushroom dataset contains two classes: edible and poisonous.
• Each data tuple has several features such as: odor, ring-number, stalk-surface-below-ring, etc.
• Consider the pattern
  {odor = none, stalk-surface-below-ring = smooth, ring-number = one}
  Its support increases from 0.2% in the poisonous class to 57.6% in the edible class (a growth rate of 288).

Example EP in microarray data for cancer
• Jumping EP: Patterns w/ high support ratio b/w data classes
• E.G. {g1=L, g2=H, g3=L}; suppN=50%, suppC=0

Normal Tissues (binned data)
g1  g2  g3  g4
L   H   H   L
H   L   L   H
L   H   L   L
L   H   L   H

Cancer Tissues (binned data)
g1  g2  g3  g4
H   H   H   L
L   L   L   H
L   H   H   H
H   H   L   H

Top support minimal jumping EPs for colon cancer

Colon Cancer EPs
{1+ 4- 112+ 113+} 100%
{1+ 4- 113+ 116+} 100%
{1+ 4- 113+ 221+} 100%
{1+ 4- 113+ 696+} 100%
{1+ 108- 112+ 113+} 100%
{1+ 108- 113+ 116+} 100%
{4- 108- 112+ 113+} 100%
{4- 109+ 113+ 700+} 100%
{4- 110+ 112+ 113+} 100%
{4- 112+ 113+ 700+} 100%
{4- 113+ 117+ 700+} 100%
{1+ 6+ 8- 700+} 97.5%

Colon Normal EPs
{12- 21- 35+ 40+ 137+ 254+} 100%
{12- 35+ 40+ 71- 137+ 254+} 100%
{20- 21- 35+ 137+ 254+} 100%
{20- 35+ 71- 137+ 254+} 100%
{5- 35+ 137+ 177+} 95.5%
{5- 35+ 137+ 254+} 95.5%
{5- 35+ 137+ 419-} 95.5%
{5- 137+ 177+ 309+} 95.5%
{5- 137+ 254+ 309+} 95.5%
{7- 21- 33+ 35+ 69+} 95.5%
{7- 21- 33+ 69+ 309+} 95.5%
{7- 21- 33+ 69+ 1261+} 95.5%

• EPs from Mao+Dong 2005 (gene club + border-diff).
• Colon cancer dataset (Alon et al, 1999 (PNAS)): 40 cancer tissues, 22 normal tissues. 2000 genes
• These EPs have 95%--100% support in one class but 0% support in the other class.
• Minimal: Each proper subset occurs in both classes.
• Very few 100% support EPs.
• There are ~1000 items with supp >= 80%.

A potential use of minimal jumping EPs
• Minimal jumping EPs for normal tissues
  → Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
  → Restore these → cure colon cancer?
• Minimal jumping EPs for cancer tissues
  → Bad gene groups that occur in some cancer tissues but never occur in normal tissues
  → Disrupt these → cure colon cancer?
• → Possible targets for drug design?
• Li+Wong 2002 proposed "gene therapy using EP" idea: therapy aims to destroy bad JEP & restore good JEP

Usefulness of Emerging Patterns
• EPs are useful
  • for building highly accurate and robust classifiers, and for improving other types of classifiers
  • for discovering powerful distinguishing features between datasets.
• Like other patterns composed of conjunctive combination of elements, EPs are easy for people to understand and use directly.
• EPs can also capture patterns about change over time.
• Papers using EP techniques in Cancer Cell (cover, 3/02).
• Emerging Patterns have been applied in medical applications for diagnosing acute Lymphoblastic Leukemia.

The landscape of EPs on the support plane, and challenges for mining

Landscape of EPs
[Figure: support plane with axes Sup D1(X) and Sup D2(X), origin O, both axes running to 1, regions labelled A, B, C]

Challenges for EP mining
• EP minRatio constraint is neither monotonic nor anti-monotonic (but exceptions exist for special cases)
• Requires smaller support thresholds than those used for frequent pattern mining

Odds Ratio and Relative Risk Patterns [Li and Wong PODS06]
• May use odds ratio/relative risk to evaluate compound factors as well
  • May be no single factor with high relative risk or odds ratio, but a combination of factors
    • Relative risk patterns - Similar to emerging patterns
    • Risk difference patterns - Similar to contrast sets
    • Odds ratio patterns

Mining Patterns with High Odds Ratio or Relative Risk
• The spaces of odds ratio patterns and relative risk patterns are not convex in general
• They can become convex if stratified into plateaus, based on support levels

EP Mining Algorithms
• Complexity result (Wang et al 05)
• Border-differential algorithm (Dong+Li 99)
• Gene club + border differential (Mao+Dong 05)
• Constraint-based approach (Zhang et al 00)
• Tree-based approach (Bailey et al 02, Fan+Kotagiri 02)
• Projection based algorithm (Bailey et al 03)
• ZBDD based method (Loekito+Bailey 06).

Complexity result
• The complexity of finding emerging patterns (even those with the highest frequency) is MAX SNP-hard.
  • This implies that polynomial time approximation schemes do not exist for the problem unless P=NP.

Borders are concise representations of convex collections of itemsets
• < minB={12,13}, maxB={12345,12456} > represents
  12, 13
  123, 124, 125, 126, 134, 135
  1234, 1235, 1245, 1246, 1256, 1345
  12345, 12456
• A collection S is convex: if for all X, Y, Z (X in S, Y in S, X subset Z subset Y), then Z in S.

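The collection represented by the border above can be enumerated, and its convexity checked, with a brute-force sketch. The function names are illustrative; items are single characters so that "126" stands for the itemset {1, 2, 6}.

```python
from itertools import combinations

def border_members(min_b, max_b):
    """All itemsets Z with X <= Z <= Y for some X in min_b and some Y in max_b."""
    members = set()
    for y in max_b:
        items = sorted(y)
        for size in range(len(items) + 1):
            for combo in combinations(items, size):
                z = frozenset(combo)
                if any(x <= z for x in min_b):
                    members.add(z)
    return members

def is_convex(collection):
    """Check that every Z between two members X <= Y is also a member."""
    for x in collection:
        for y in collection:
            if x <= y:
                extra = sorted(y - x)
                for size in range(len(extra) + 1):
                    for combo in combinations(extra, size):
                        if x | frozenset(combo) not in collection:
                            return False
    return True

min_b = [frozenset("12"), frozenset("13")]
max_b = [frozenset("12345"), frozenset("12456")]

s = border_members(min_b, max_b)
print(len(s))  # 16
print(sorted("".join(sorted(z)) for z in s))
```

Four itemsets in the border stand for sixteen itemsets in the collection, which is the point of the representation.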
Border-Differential Algorithm
• <{{}},{1234}> - <{{}},{23,24,34}> = <{1,234},{1234}>
[Figure: lattices of the subsets of 1234 ({}; 1, 2, 3, 4; 12, 13, 14, 23, 24, 34; 123, 124, 134, 234; 1234) and of the subsets of 23, 24, 34]
• Good for: Jumping EPs; EPs in "rectangle regions," …
• Algorithm:
  • Use iterations of expansion & minimization of "products" of differences
  • Use tree to speed up minimization
• Find minimal subsets of 1234 that are not subsets of 23, 24, 34.
  • {1,234} = min({1,4} X {1,3} X {1,2})
• Iterative expansion & minimization can be viewed as optimized Berge hypergraph transversal algorithm

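The expansion and minimization step on the slide's example can be sketched as follows. This is a plain illustration of the "product of differences" idea, without the tree-based speed-up; the names `minimize` and `minimal_non_subsets` are chosen here and are not from the tutorial.

```python
def minimize(sets):
    """Keep only the sets that have no proper subset in the collection."""
    sets = set(sets)
    return {s for s in sets if not any(t < s for t in sets)}

def minimal_non_subsets(top, negatives):
    """Minimal subsets of `top` that are not subsets of any set in `negatives`."""
    top = frozenset(top)
    result = {frozenset()}
    for neg in negatives:
        diff = top - frozenset(neg)          # e.g. 1234 - 23 = {1, 4}
        # Expansion: extend every partial set by one item of the difference.
        expanded = {partial | {item} for partial in result for item in diff}
        # Minimization: drop anything that contains a smaller candidate.
        result = minimize(expanded)
    return result

left = minimal_non_subsets("1234", ["23", "24", "34"])
print(sorted("".join(sorted(s)) for s in left))  # ['1', '234']
```

The result {1, 234} is the left-hand bound of the border <{1,234},{1234}> given on the slide.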
Gene club + Border Differential
• Border-differential can handle up to 75 attributes (using 2003 PC)
• For microarray gene expression data, there are thousands of genes.
• (Mao+Dong 05) used border-differential after finding many gene clubs -- one gene club per gene.
• A gene club is a set of k genes strongly correlated with a given gene and the classes.
• Some EPs discovered using this method were shown earlier. Discovered more EPs with near 100% support in cancer or normal, involving many different genes. Much better than earlier results.

Tree-based algorithm for JEP mining
• Use tree to compress data and patterns.
• Tree is similar to FP tree, but it stores two counts per node (one per class) and uses different item ordering
• Nodes with non-zero support for positive class and zero support for negative class are called base nodes.
• For every base node, the path's itemset is a potential JEP. Gather negative data containing root item and item for base nodes on the path. Call border differential.
• Item ordering is important. Hybrid (support ratio ordering first for a percentage of items, frequency ordering for other items) is best.

Projection based algorithm
• Form dataset H containing the differences {p - ni | i = 1…k}.
  • p is a positive transaction, n1, …, nk are negative transactions.
• Find minimal transversals of hypergraph H, i.e. the smallest sets intersecting every edge (equivalent to the smallest subsets of p not contained in any ni).
• Let x1 < … < xm be increasing item frequency (in H) ordering.
• For i = 1 to m
  • let Hxi be H with all items y > xi projected out & all transactions containing xi removed (data projection).
  • remove non minimal transactions in Hxi.
  • if Hxi is small, apply border differential
  • Otherwise, apply the algorithm on Hxi.

Let H be:
a b c d (edge 1)
b e d (edge 2)
b c e (edge 3)
c d e (edge 4)
Item ordering: a < b < c < d < e
Ha is H with all items > a (red items) projected out and also edge with a removed, so Ha = {}.

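To make the target of the algorithm concrete, the minimal transversals of the example hypergraph H can be found by exhaustive search. This sketch is not the projection-based algorithm itself, only a brute-force reference that is practical for a handful of items; the function name is illustrative.

```python
from itertools import combinations

def minimal_transversals(edges):
    """Brute force: all minimal item sets that intersect every edge."""
    edges = [frozenset(e) for e in edges]
    items = sorted(frozenset().union(*edges))
    found = []
    # Candidates are visited by increasing size, so any transversal that
    # contains an already found one cannot be minimal.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            cand = frozenset(combo)
            if any(t <= cand for t in found):
                continue
            if all(cand & e for e in edges):
                found.append(cand)
    return found

H = ["abcd", "bed", "bce", "cde"]
result = minimal_transversals(H)
print(sorted("".join(sorted(t)) for t in result))
# ['ae', 'bc', 'bd', 'be', 'cd', 'ce', 'de']
```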
ZBDD based algorithm to mine disjunctive emerging patterns
• Disjunctive Emerging Patterns: allowing disjunction as well as conjunction of simple attribute conditions.
  • e.g. Precipitation = (gt-norm OR lt-norm) AND Internal discoloration = (brown OR black)
• Generalization of EPs
• ZBDD based algorithm uses Zero Suppressed Binary Decision Diagram for efficiently mining disjunctive EPs.

Binary Decision Diagrams (BDDs)
• Popular in boolean SAT solvers and reliability eng.
• Canonical DAG representations of boolean formulae
• Node sharing: identical nodes are shared
• Caching principle: past computation results are automatically stored and can be retrieved
• Efficient BDD implementations available, e.g. CUDD (U of Colorado)
[Figure: BDD for f = (c ∧ a) ∨ (d ∧ a), with root c, nodes d and a, and terminals 1 and 0; dotted (or 0) edge: don't link the nodes (in formulae)]

ZBDD Representation of Itemsets
• Zero-suppressed BDD, ZBDD: A BDD variant for manipulation of item combinations
• E.g. Building a ZBDD for {{a,b,c,e},{a,b,d,e},{b,c,d}}
• Ordering: c < d < a < e < b
• Uz = ZBDD set-union
[Figure: the ZBDDs for {{a,b,c,e}} and {{a,b,d,e}} are combined by Uz into {{a,b,c,e},{a,b,d,e}}, which is combined by Uz with {{b,c,d}} to give {{a,b,c,e},{a,b,d,e},{b,c,d}}]

ZBDD based mining example
• Use solid paths in ZBDD(Dn) to generate candidates, and use Bitmap of Dp to check frequency support in Dp.
• Ordering: a<c<d<e<b<f<g<h

Bitmap
      a b c d e f g h i
P1:   1 0 0 0 1 0 1 0 0
P2:   1 0 0 1 0 0 0 0 1
P3:   0 1 0 0 0 1 0 1 0
P4:   0 0 1 0 1 0 0 1 0
N1:   1 0 0 0 0 1 1 0 0
N2:   0 1 0 1 0 0 0 1 0
N3:   0 1 0 0 0 1 0 1 0
N4:   0 0 1 0 1 0 1 0 0

Dp =            Dn =
A1  A2  A3      A1  A2  A3
a   e   g       a   f   g
a   d   i       b   d   h
b   f   h       b   f   h
c   e   h       c   e   g

[Figure: ZBDD(Dn)]

Contrast pattern based classification -- history
• Contrast pattern based classification: Methods to build or improve classifiers, using contrast patterns
  • CBA (Liu et al 98)
  • CAEP (Dong et al 99)
  • Instance based method: DeEPs (Li et al 00, 04)
  • Jumping EP based (Li et al 00), Information based (Zhang et al 00), Bayesian based (Fan+Kotagiri 03), improving scoring for >= 3 classes (Bailey et al 03)
  • CMAR (Li et al 01)
  • Top-ranked EP based PCL (Li+Wong 02)
  • CPAR (Yin+Han 03)
  • Weighted decision tree (Alhammady+Kotagiri 06)
  • Rare class classification (Alhammady+Kotagiri 04)
  • Constructing supplementary training instances (Alhammady+Kotagiri 05)
  • Noise tolerant classification (Fan+Kotagiri 04)
  • EP length based 1-class classification of rare cases (Chen+Dong 06)
  • …
• Most follow the aggregating approach of CAEP.

EP-based classifiers: rationale
• Consider a typical EP in the Mushroom dataset, {odor = none, stalk-surface-below-ring = smooth, ring-number = one}; its support increases from 0.2% in "poisonous" to 57.6% in "edible" (growth rate = 288).
• Strong differentiating power: if a test T contains this EP, we can predict T as edible with high confidence 99.6% = 57.6/(57.6+0.2)
• A single EP is usually sharp in telling the class of a small fraction (e.g. 3%) of all instances. Need to aggregate the power of many EPs to make the classification.
• EP based classification methods often outperform state of the art classifiers, including C4.5 and SVM. They are also noise tolerant.

CAEP (Classification by Aggregating Emerging Patterns)
• Given a test case T, obtain T's scores for each class, by aggregating the discriminating power of EPs contained by T; assign the class with the maximal score as T's class.
• The discriminating power of EPs is expressed in terms of supports and growth rates. Prefer large supRatio, large support
• The contribution of one EP X (support weighted confidence):
  strength(X) = sup(X) * supRatio(X) / (supRatio(X)+1)
• Given a test T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is
  score(T, Ci) = Σ strength(X) (over X of Ci matching T)
• For each class, using median (or 85%) aggregated value to normalize to avoid bias towards class with more EPs
• Compare CMAR: Chi2 weighted Chi2


How CAEP works? An example
Given a test T = {a,d,e}, how to classify T?

Class 1 (D1)    Class 2 (D2)
a c d e         a b
a e             a b c d
b c d e         c e
b               a b d e

• T contains EPs of class 1: {a,e} (50%:25%) and {d,e} (50%:25%), so Score(T, class 1) = 0.5*[2/(2+1)] + 0.5*[2/(2+1)] = 0.67;
• T contains EPs of class 2: {a,d} (25%:50%), so Score(T, class 2) = 0.33;
• T will be classified as class 1 since Score1 > Score2.
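The arithmetic of this example can be checked with a short Python sketch. It is only an illustration, not the authors' implementation: the function names are made up here, the EP lists are just the ones named on the slide (no EP mining is done), and the per-class normalization by a median base score is left out.

```python
# Transactions of the two classes, as on the slide.
D1 = [{"a", "c", "d", "e"}, {"a", "e"}, {"b", "c", "d", "e"}, {"b"}]
D2 = [{"a", "b"}, {"a", "b", "c", "d"}, {"c", "e"}, {"a", "b", "d", "e"}]


def support(pattern, data):
    """Fraction of transactions in data that contain every item of pattern."""
    return sum(pattern <= t for t in data) / len(data)


def strength(pattern, home, other):
    """sup(X) * supRatio(X) / (supRatio(X) + 1) for an EP X of class `home`."""
    s_home, s_other = support(pattern, home), support(pattern, other)
    if s_home == 0:
        return 0.0
    # With supRatio = s_home / s_other the formula simplifies to
    # s_home^2 / (s_home + s_other), which also covers s_other = 0.
    return s_home * s_home / (s_home + s_other)


def caep_score(test, eps, home, other):
    """Sum the strengths of the EPs of one class that are contained in test."""
    return sum(strength(x, home, other) for x in eps if x <= test)


T = {"a", "d", "e"}
EPS_CLASS1 = [{"a", "e"}, {"d", "e"}]   # EPs listed on the slide
EPS_CLASS2 = [{"a", "d"}]

score1 = caep_score(T, EPS_CLASS1, D1, D2)
score2 = caep_score(T, EPS_CLASS2, D2, D1)
predicted = "class 1" if score1 > score2 else "class 2"
```

Here `score1` is about 0.67 and `score2` about 0.33, so `predicted` is class 1, in line with the slide.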

DeEPs (Decision-making by Emerging Patterns)
• An instance based (lazy) learning method, like k-NN; but does not use a normal distance measure.
• For a test instance T, DeEPs
  • First projects each training instance to contain only items in T
  • Discovers EPs from the projected data
  • Then uses these EPs to select training data that match some discovered EPs
  • Finally, uses the proportional size of matching data in a class C as T’s score for C
• Advantage: disallow similar EPs to give duplicate votes!

DeEPs: Play-Golf example (data projection)
Test = {sunny, mild, high, true}

Original data
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
rain      cool         normal    true   N
sunny     mild         high      false  N
rain      mild         high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
overcast  cool         normal    true   P
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P

Projected data (values not shared with the test instance are removed, shown as -)
Outlook   Temperature  Humidity  Windy  Class
sunny     -            high      -      N
sunny     -            high      true   N
-         -            -         true   N
sunny     mild         high      -      N
-         mild         high      true   N
-         -            high      -      P
-         mild         high      -      P
-         -            -         true   P
sunny     -            -         -      P
-         mild         -         -      P
sunny     mild         -         true   P
-         mild         high      true   P

Discover EPs and derive scores using the projected data
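The projection step of this example is easy to reproduce. The sketch below covers only that first DeEPs step (EP discovery and scoring are not shown); the names `TRAIN`, `TEST` and `project` are illustrative, and `None` stands for a removed value.

```python
# Play-Golf training data from the slide:
# (outlook, temperature, humidity, windy), class
TRAIN = [
    (("sunny", "hot", "high", "false"), "N"),
    (("sunny", "hot", "high", "true"), "N"),
    (("rain", "cool", "normal", "true"), "N"),
    (("sunny", "mild", "high", "false"), "N"),
    (("rain", "mild", "high", "true"), "N"),
    (("overcast", "hot", "high", "false"), "P"),
    (("rain", "mild", "high", "false"), "P"),
    (("rain", "cool", "normal", "false"), "P"),
    (("overcast", "cool", "normal", "true"), "P"),
    (("sunny", "cool", "normal", "false"), "P"),
    (("rain", "mild", "normal", "false"), "P"),
    (("sunny", "mild", "normal", "true"), "P"),
    (("overcast", "mild", "high", "true"), "P"),
    (("overcast", "hot", "normal", "false"), "P"),
]
TEST = ("sunny", "mild", "high", "true")


def project(instance, test):
    """Keep an attribute value only if the test instance has the same value."""
    return tuple(v if v == t else None for v, t in zip(instance, test))


PROJECTED = [(project(x, TEST), label) for x, label in TRAIN]
# Instances that share no value with the test instance project to nothing.
EMPTY = sum(all(v is None for v in row) for row, _ in PROJECTED)
```

Two of the fourteen training instances share no value with the test instance, which is consistent with the projected table listing only twelve rows.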

PCL (Prediction by Collective Likelihood)
• Let X1, …, Xm be the m (e.g. 1000) most general EPs in descending support order.
• Given a test case T, consider the list of all EPs that match T. Divide this list by EP’s class, and list them in descending support order:
  P class: Xi1, …, Xip
  N class: Xj1, …, Xjn
• Use the k (e.g. 15) top ranked matching EPs to get the score of T for the P class (similarly for N):
  Score(T, P) = Σ_{t=1..k} suppP(X_it) / suppP(X_t)
  where suppP(X_t) is the normalizing factor.
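A minimal sketch of this score, assuming the normalizing factor is the support of the t-th ranked EP of the class overall. The EPs and supports below are invented purely for illustration, and the handling of fewer than k matching EPs (use as many as there are) is also an assumption.

```python
def pcl_score(class_eps, test, k):
    """Score(T, C) = sum over t = 1..k of supp(X_it) / supp(X_t).

    class_eps: list of (pattern, support in class C) pairs, in any order.
    X_t is the t-th EP of the class by support; X_it is the t-th EP by
    support among those contained in the test instance.
    """
    ranked = sorted(class_eps, key=lambda e: e[1], reverse=True)
    matching = [e for e in ranked if e[0] <= test]
    k = min(k, len(matching))  # assumption: use fewer terms if few EPs match
    return sum(m[1] / r[1] for m, r in zip(matching[:k], ranked[:k]))


# Hypothetical EPs of class P with their supports in P.
EPS_P = [
    (frozenset({"a"}), 0.9),
    (frozenset({"b"}), 0.8),
    (frozenset({"c"}), 0.6),
    (frozenset({"a", "d"}), 0.5),
]
T = frozenset({"a", "c", "d"})
score_p = pcl_score(EPS_P, T, k=2)   # 0.9/0.9 + 0.6/0.8
```

A score close to k means the test case matches nearly all of the strongest EPs of that class; the class with the larger score is predicted.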

Emerging pattern selection factors
• There are many EPs, can’t use them all. Should select and use a good subset.
• EP selection considerations include
  • Keep minimal (shortest, most general) ones
  • Remove syntactically similar ones
  • Use support/growth rate improvement (between superset/subset pairs) to prune
  • Use instance coverage/overlap to prune
  • Use only infinite growth rate ones (JEPs)
  • …

Why EP-based classifiers are good
• Use the discriminating power of low support EPs, together with high support ones
• Use multi-feature conditions, not just single-feature conditions
• Select from larger pools of discriminative conditions
  • Compare: Search space of patterns for decision trees is limited by early greedy choices.
• Aggregate/combine discriminating power of a diversified committee of “experts” (EPs)
• Decision is highly explainable

Some other works
• CBA (Liu et al 98) uses one rule to make a classification prediction for a test
• CMAR (Li et al 01) uses aggregated (Chi2 weighted) Chi2 of matching rules
• CPAR (Yin+Han 03) uses aggregation by averaging: it uses the average accuracy of the top k rules for each class matching a test case
• …

Aggregating EPs/rules vs bagging (classifier ensembles)
• Bagging/ensembles: a committee of classifiers vote
  • Each classifier is fairly accurate for a large population (e.g. >51% accurate for 2 classes)
• Aggregating EPs/rules: matching patterns/rules vote
  • Each pattern/rule is accurate on a very small population, but inaccurate if used as a classifier on all data; e.g. 99% accurate on 2% of data, but <2% accurate on all data

Using contrasts for rare class data [Al Hammady and Ramamohanarao 04,05,06]
• Rare class data is important in many applications
  • Intrusion detection (1% of samples are attacks)
  • Fraud detection (1% of samples are fraud)
  • Customer click thrus (1% of customers make a purchase)
  • …

Rare Class Datasets
• Due to the class imbalance, can encounter some problems
  • Few instances in the rare class, difficult to train a classifier
  • Few contrasts for the rare class
  • Poor quality contrasts for the majority class
• Need to either increase the instances in the rare class or generate extra contrasts for it

Synthesising new contrasts (new emerging patterns)
• Synthesising new emerging patterns by superposition of high growth rate items
• Suppose that attribute A2=‘a’ has high growth rate and that {A1=‘x’, A2=‘y’} is an emerging pattern. Then create a new emerging pattern {A1=‘x’, A2=‘a’} and test its quality.
• A simple heuristic, but can give surprisingly good classification performance

Synthesising new data instances
• Can also use previously found contrasts as the basis for constructing new rare class instances
• Combine overlapping contrasts and high growth rate items
• Main idea: intersect & ‘cross product’ the emerging patterns & high growth rate (support ratio) items
  • Find emerging patterns
  • Cluster emerging patterns into groups that cover all the attributes
  • Combine patterns within each group to form instances

Synthesising new instances
• E1{A1=1, A2=X1}, E2{A5=Y1, A6=2, A7=3}, E3{A2=X2, A3=4, A5=Y2}: this is a group
• V4 is a high growth item for A4
• Combine E1+E2+E3+{A4=V4} to get four synthetic instances:

A1  A2  A3  A4  A5  A6  A7
1   X1  4   V4  Y1  2   3
1   X1  4   V4  Y2  2   3
1   X2  4   V4  Y1  2   3
1   X2  4   V4  Y2  2   3
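One way to read the combination step is as a cross product over the values each attribute receives from the group. The sketch below follows that reading; the function name `synthesize` and the dictionary encoding of patterns are assumptions made for illustration, not the authors' code.

```python
from itertools import product


def synthesize(patterns, extra_items):
    """Cross product of the attribute values offered by a group of EPs.

    patterns: list of dicts attribute -> value (the EPs of one group)
    extra_items: dict attribute -> value (high growth rate items)
    """
    candidates = {}
    for p in patterns + [extra_items]:
        for attr, val in p.items():
            values = candidates.setdefault(attr, [])
            if val not in values:
                values.append(val)
    attrs = sorted(candidates)
    return [dict(zip(attrs, combo))
            for combo in product(*(candidates[a] for a in attrs))]


E1 = {"A1": 1, "A2": "X1"}
E2 = {"A5": "Y1", "A6": 2, "A7": 3}
E3 = {"A2": "X2", "A3": 4, "A5": "Y2"}
instances = synthesize([E1, E2, E3], {"A4": "V4"})
```

Attributes A2 and A5 each receive two values from the group, so the product yields 2 x 2 = 4 instances, matching the table.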

Measuring instance quality using emerging patterns [Al Hammady and Ramamohanarao 07]
• Classifiers usually assume that data instances are related to only a single class (crisp assignments).
• However, real life datasets suffer from noise.
• Also, when experts assign an instance to a class, they first assign scores to each class and then assign the class with the highest score.
• Thus, an instance may in fact be related to several classes

Measuring instance quality Cont.
• For each instance i, assign a weight for its strength of membership in each class.
• Can use emerging patterns to determine appropriate weights for instances
  • Weight(i) = aggregation of EPs divided by mean value for instances in that class
• Use these weights in a modified version of a classifier, e.g. a decision tree
  • Modify information gain calculation to take weights into account

Using EPs to build Weighted Decision Trees
• Instead of crisp class membership, let instances have weighted class membership, then build weighted decision trees, where probabilities are computed from the weighted membership.
• DeEPs and other EP based classifiers can be used to assign weights.
• An instance Xi’s membership in k classes: (Wi1, …, Wik)

$$\hat{P}(T) = \left(\hat{p}_1(T), \ldots, \hat{p}_k(T)\right) = \left(\frac{\sum_{i \in T} W_{i1}}{|T|}, \ldots, \frac{\sum_{i \in T} W_{ik}}{|T|}\right)$$

$$\mathrm{Info}_{WDT}(\hat{P}(T)) = -\sum_{j=1}^{k} \hat{p}_j(T) \cdot \log_2(\hat{p}_j(T))$$

$$\mathrm{Info}_{WDT}(A, T) = \sum_{l=1}^{m} \frac{|T_l|}{|T|} \cdot \mathrm{Info}_{WDT}(\hat{P}(T_l))$$
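A small sketch of the weighted information measure, under the assumption that the formulas on this slide are the usual entropy and split information with class probabilities replaced by averaged membership weights. The function names and the example weights are invented for illustration.

```python
from math import log2


def class_probs(weights):
    """p_j(T): mean of the membership weights W_ij over the instances in T."""
    n, k = len(weights), len(weights[0])
    return [sum(w[j] for w in weights) / n for j in range(k)]


def info(weights):
    """Entropy of the weighted class distribution of a node T."""
    return -sum(p * log2(p) for p in class_probs(weights) if p > 0)


def info_split(partitions):
    """Size-weighted entropy after splitting T into T_1..T_m on an attribute."""
    total = sum(len(part) for part in partitions)
    return sum(len(part) / total * info(part) for part in partitions if part)


# Two instances, two classes: one crisp member of class 1, one split 50/50.
T = [[1.0, 0.0], [0.5, 0.5]]
node_info = info(T)                 # entropy of (0.75, 0.25)
split_info = info_split([[T[0]], [T[1]]])
```

With crisp 0/1 weights this reduces to the ordinary information measure of a decision tree; fractional weights make noisy or borderline instances contribute to several classes.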

Measuring instance quality by emerging patterns Cont.
• More effective than k-NN techniques for assigning weights
  • Less sensitive to noise
  • Not dependent on distance metric
  • Takes into account all instances, not just close neighbors

Data cube based contrasts (Conditional Contrasts)
• Gradient (Dong et al 01), cubegrade (Imielinski et al 02; TR published in 2000):
  • Mining syntactically similar cube cells, having significantly different measure values
  • Syntactically similar: ancestor-descendant or sibling-sibling pair
  • Can be viewed as “conditional contrasts”: two neighboring patterns with big difference in performance/measure
• Data cubes useful for analyzing multi-dimensional, multi-level, time-dependent data.
• Gradient mining useful for MDML analysis in marketing, business decisioning, medical/scientific studies

Decision support in data cubes
• Used for discovering patterns captured in consolidated historical data for a company/organization:
  • rules, anomalies, unusual factor combinations
• Focus on modeling & analysis of data for decision makers, not daily operations.
• Data organized around major subjects or factors, such as customer, product, time, sales.
• Cube “contains” huge number of MDML “segment” or “sector” summaries at different levels of details
• Basic OLAP operations: Drill down, roll up, slice and dice, pivot

Data Cubes: Base Table & Hierarchies
• Base table stores sales volume (measure), a function of product, time, & location (dimensions)
• Hierarchical summarization paths:
  Product:   Industry, Category, Product
  Location:  Region, Country, City, Office
  Time:      Year, Quarter, Month / Week, Day
• *: all (as top of each dimension)
[Figure: cube with Product, Location and Time axes, marking a base cell]

Data Cubes: Derived Cells
[Figure: sales cube with axes Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Location (U.S.A, Canada, Mexico, sum)]
• Measures: sum, count, avg, max, min, std, …
• Derived cells, different levels of details, e.g. (TV,*,Mexico)

Data Cubes: Cell Lattice
(*,*,*)
(a1,*,*)  (*,b1,*)  (a2,*,*)  …
(a1,b2,*)  (a1,b1,*)  (a2,b1,*)  …
…
(a1,b2,c1)  (a1,b1,c1)  (a1,b1,c2)
Compare: cuboid lattice

Gradient mining in data cubes
• Users want: more powerful (OLAM) support: find potentially interesting cells from the billions!
  • OLAP operations used to help users search in huge space of cells
  • Users do: mousing, eye-balling, memoing, decisioning, …
• Gradient mining: find syntactically similar cells with significantly different measure values
  • (teen clothing, California, 2006), total profit = 100K vs (teen clothing, Pennsylvania, 2006), total profit = 10K
• A specific OLAM task

LiveSet-Driven Algorithm for constrained gradient mining
• Set-oriented processing; traverse the cube while carrying the live set of cells having potential to match descendants of the current cell as gradient cells
  • A gradient compares two cells; one is the probe cell, & the other is a gradient cell. Probe cells are ancestor or sibling cells
• Traverse the cell space in a coarse-to-fine manner, looking for matchable gradient cells with potential to satisfy the gradient constraint
• Dynamically prune the live set during traversal
• Compare: Naïve method checks each possible cell pair

Pruning probe cells using dimension matching analysis
• Defn: Probe cell p=(a1,…,an) is matchable with gradient cell g=(b1,…,bn) iff
  • No solid-mismatch, or
  • Only one solid-mismatch but no *-mismatch
• A solid-mismatch: if aj≠bj and neither aj nor bj is *
• A *-mismatch: if aj=* and bj≠*
• Example: p=(00, Tor, *, *), g=(00, Chi, *, PC): 1 solid-mismatch, 1 *-mismatch
• Thm: cell p is matchable with cell g iff p may make a probe-gradient pair with some descendant of g (using only dimension value info)
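The matchability test is a direct count over dimensions. A minimal sketch, with cells encoded as tuples of strings and "*" as the "all" value (the function names are illustrative):

```python
def mismatches(p, g):
    """Count (solid-mismatches, *-mismatches) between probe p and gradient g."""
    solid = sum(a != b and a != "*" and b != "*" for a, b in zip(p, g))
    star = sum(a == "*" and b != "*" for a, b in zip(p, g))
    return solid, star


def matchable(p, g):
    """No solid-mismatch, or exactly one solid-mismatch and no *-mismatch."""
    solid, star = mismatches(p, g)
    return solid == 0 or (solid == 1 and star == 0)


p = ("00", "Tor", "*", "*")
g = ("00", "Chi", "*", "PC")
counts = mismatches(p, g)      # one solid-mismatch, one *-mismatch
is_match = matchable(p, g)
```

For the pair on the slide the counts are (1, 1), so by the definition p is not matchable with g and g can be dropped from the live set for this probe cell.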

Sequence based contrasts
• We want to compare sequence datasets:
  • bioinformatics (DNA, protein), web log, job/workflow history, books/documents
  • e.g. compare protein families; compare bible books/versions
• Sequence data are very different from relational data
  • order/position matters
  • unbounded number of “flexible dimensions”
• Sequence contrasts in terms of 2 types of comparison:
  • Dataset based: Positive vs Negative
    • Distinguishing sequence patterns with gap constraints (Ji et al 05, 07)
    • Emerging substrings (Chan et al 03)
  • Site based: Near marker vs away from marker
    • Motifs
    • May also involve data classes
• Roughly: A site is a position in a sequence where a special marker/pattern occurs

Example sequence contrasts
• When comparing the two protein families zf-C2H2 and zf-CCHC, we discovered a protein MDS CLHH appearing as a subsequence in 141 of 196 protein sequences of zf-C2H2 but never appearing in the 208 sequences in zf-CCHC.
• When comparing the first and last books from the Bible, we found the subsequences (with gaps) “having horns”, “face worship”, “stones price” and “ornaments price” appear multiple times in sentences in the Book of Revelation, but never in the Book of Genesis.

Sequence and sequence pattern occurrence
• A sequence S = e1 e2 e3 … en is an ordered list of items over a given alphabet.
  • E.G. “AGCA” is a DNA sequence over the alphabet {A, C, G, T}. “AC” is a subsequence of “AGCA” but not a substring; “GCA” is a substring
• Given a sequence S and a subsequence pattern S’, an occurrence of S’ in S consists of the positions of the items from S’ in S.
  • EG: consider S = “ACACBCB”
  • <1,5>, <1,7>, <3,5>, <3,7> are occurrences of “AB”
  • <1,2,5>, <1,2,7>, <1,4,5>, … are occurrences of “ACB”
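The occurrence lists above can be enumerated by brute force. This sketch uses 1-based positions as on the slide; it tries every increasing position tuple, so it is only suitable for short examples, and the function name is illustrative.

```python
from itertools import combinations


def occurrences(pattern, seq):
    """All 1-based position tuples at which pattern occurs as a subsequence."""
    positions = range(1, len(seq) + 1)
    return [pos for pos in combinations(positions, len(pattern))
            if all(seq[i - 1] == c for i, c in zip(pos, pattern))]


S = "ACACBCB"
occ_ab = occurrences("AB", S)
occ_acb = occurrences("ACB", S)
```

For S = “ACACBCB” this gives the four occurrences of “AB” listed on the slide and eight occurrences of “ACB”, of which the slide lists the first three.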

Maximum-gap constraint satisfaction
• A (maximum) gap constraint: specified by a positive integer g.
• Given S & an occurrence os = <i_1, …, i_m>, if i_{k+1} - i_k <= g+1 for all 1 <= k < m, then os fulfills the g-gap constraint.
• If a subsequence S’ has one occurrence fulfilling a gap constraint, then S’ satisfies the gap constraint.
  • The <3,5> occurrence of “AB” in S = “ACACBCB” satisfies the maximum gap constraint g=1.
  • The <3,4,5> occurrence of “ACB” in S = “ACACBCB” satisfies the maximum gap constraint g=1.
  • The <1,2,5>, <1,4,5>, <3,4,5> occurrences of “ACB” in S = “ACACBCB” satisfy the maximum gap constraint g=2.
• One sequence contributes at most one to the count.
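The gap test and the resulting support count can be sketched as follows. Positions are 1-based as on the slide; `matches_with_gap` tracks the set of possible end positions instead of enumerating every occurrence. The function names and the small dataset in the usage lines are illustrative assumptions.

```python
def satisfies_gap(occurrence, g):
    """True if consecutive positions are at most g + 1 apart."""
    return all(b - a <= g + 1 for a, b in zip(occurrence, occurrence[1:]))


def matches_with_gap(pattern, seq, g):
    """True if some occurrence of pattern in seq fulfils the g-gap constraint."""
    # End positions of gap-respecting occurrences of the prefix seen so far.
    ends = {i for i, c in enumerate(seq, 1) if c == pattern[0]}
    for item in pattern[1:]:
        ends = {j for j, c in enumerate(seq, 1)
                if c == item and any(0 < j - i <= g + 1 for i in ends)}
    return bool(ends)


def support(pattern, dataset, g):
    """Fraction of sequences matching; each sequence counts at most once."""
    return sum(matches_with_gap(pattern, s, g) for s in dataset) / len(dataset)


S = "ACACBCB"
ok_35 = satisfies_gap((3, 5), 1)        # gap of 1 item between positions
ok_15 = satisfies_gap((1, 5), 1)        # gap of 3 items, too wide for g=1
acb_g1 = matches_with_gap("ACB", S, 1)  # via the <3,4,5> occurrence
```

For example, `support("BB", ["CBAB", "AACCB", "BBAAC"], 1)` counts two of the three sequences.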

g-MDS Mining Problem
• Given two sets pos & neg of sequences, two support thresholds minp & minn, & a maximum gap g, a pattern p is a Minimal Distinguishing Subsequence with g-gap constraint (g-MDS), if these conditions are met:
  1. Frequency condition: supp_pos(p,g) >= minp;
  2. Infrequency condition: supp_neg(p,g) <= minn;
  3. Minimality condition: There is no subsequence of p satisfying 1 & 2.
• Given pos, neg, minp, minn and g, the g-MDS mining problem is to find all the g-MDSs.

Example g-MDS
• Given minp=1/3, minn=0, g=1,
  pos = {CBAB, AACCB, BBAAC}
  neg = {BCAB, ABACB}
• 1-MDS are: BB, CC, BAA, CBA
• “ACC” is frequent in pos & non-occurring in neg, but it is not minimal (its subsequence “CC” meets the first two conditions).
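The example is small enough to check by exhaustive search. The sketch below is a brute-force enumeration, not the mining algorithm of the tutorial: it tries every pattern over the alphabet up to the longest positive sequence, keeps those meeting the frequency and infrequency conditions, and then drops any pattern that has a kept proper subsequence. All names are illustrative.

```python
from itertools import product


def occurs(pattern, seq, g):
    """True if pattern occurs in seq with at most g items between neighbours."""
    ends = {i for i, c in enumerate(seq) if c == pattern[0]}
    for item in pattern[1:]:
        ends = {j for j, c in enumerate(seq)
                if c == item and any(0 < j - i <= g + 1 for i in ends)}
    return bool(ends)


def supp(pattern, data, g):
    return sum(occurs(pattern, s, g) for s in data) / len(data)


def is_subseq(small, big):
    """Plain subsequence test (no gap constraint)."""
    it = iter(big)
    return all(c in it for c in small)


def mine_gmds(pos, neg, minp, minn, g):
    alphabet = sorted(set("".join(pos)))
    distinguishing = []
    for n in range(1, max(map(len, pos)) + 1):
        for items in product(alphabet, repeat=n):
            p = "".join(items)
            if supp(p, pos, g) >= minp and supp(p, neg, g) <= minn:
                distinguishing.append(p)
    # Minimality: no other distinguishing pattern is a subsequence of p.
    return sorted(p for p in distinguishing
                  if not any(q != p and is_subseq(q, p) for q in distinguishing))


POS = ["CBAB", "AACCB", "BBAAC"]
NEG = ["BCAB", "ABACB"]
result = mine_gmds(POS, NEG, minp=1 / 3, minn=0, g=1)
```

On this data the search returns BAA, BB, CBA and CC, agreeing with the slide; “ACC” passes the first two conditions but is removed because “CC” does too.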

g-MDS mining: Challenges
• The min support thresholds in mining distinguishing patterns need to be lower than those used for mining frequent patterns.
• Min supports offer very weak pruning power on the large search space.
• Maximum gap constraint is neither monotone nor anti-monotone.
• Gap checking requires clever handling.

ConSGapMiner
The ConSGapMiner algorithm works in three steps:
1. Candidate Generation: Candidates are generated without duplication. Efficient pruning strategies are employed.
2. Support Calculation and Gap Checking: For each generated candidate c, supp_pos(c,g) and supp_neg(c,g) are calculated using bitset operations.
3. Minimization: Remove all the non-minimal patterns (using pattern trees).

ConSGapMiner: Candidate Generation

ID  Sequence  Class
1   CBAB      pos
2   AACCB     pos
3   BBAAC     pos
4   BCAB      neg
5   ABACB     neg

DFS tree, each node labelled with (pos count, neg count):
{ }
  A (3, 2)
    AA (2, 1)
      AAA (0, 0)
      AAB (0, 1)
      AAC (2, 1)
        AACA (0, 0)
        AACB (1, 1)
          AACBA (0, 0)
          AACBB (0, 0)
          AACBC (0, 0)
        AACC (1, 0)
    …
  B (3, 2)
  C (3, 2)
  …

• DFS tree
• Two counts per node/pattern
• Don’t extend pos-infrequent patterns
• Avoid duplicates & certain non-minimal g-MDS (e.g. don’t extend g-MDS)

Use Bitset Operation for Gap Checking
• Storing projected suffixes and performing scans is expensive.
  e.g. Given a sequence ACTGTATTACCAGTATCG, to check whether AG is a subsequence for g=1:
  Projections with prefix A: ACTGTATTACCAGTATCG, ATTACCAGTATCG, ACCAGTATCG, AGTATCG, ATCG
  Projections with AG obtained from the above: AGTATCG
• We encode the occurrences’ ending positions into a bitset and use a series of bitwise operations to generate a new candidate sequence’s bitset.

ConSGapMiner: Support & Gap Checking (1)
Initial Bitset Array Construction: For each item x, construct an array of bitsets to describe where x occurs in each sequence from pos and neg.

Dataset and initial bitset array for the single item A:
ID  Sequence  Class  Bitset for A
1   CBAB      pos    0010
2   AACCB     pos    11000
3   BBAAC     pos    00110
4   BCAB      neg    0010
5   ABACB     neg    10100

ConSGapMiner: Support & Gap Checking (2)
EG: generate mask bitset for X=“A” in sequence 5 (with max gap g=1):

Dataset: 1 CBAB (pos), 2 AACCB (pos), 3 BBAAC (pos), 4 BCAB (neg), 5 ABACB (neg)

  bitset of X in sequence 5:  1 0 1 0 0
  >> (first right shift):     0 1 0 1 0
  >> (second right shift):    0 0 1 0 1
  OR of the shifts:           0 1 1 1 1   (mask bitset for X)

• Mask bitset: all the legal positions in the sequence at most (g+1) positions away from the tail of an occurrence of the (maximum prefix of the) pattern.
• Two steps: (1) g+1 right shifts; (2) OR them
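The two mask steps can be sketched with plain Python lists standing in for bitsets, leftmost entry being position 1. A real implementation would presumably use machine words; the list encoding and the function names here are assumptions chosen to mirror the layout of the slide.

```python
def item_bits(seq, item):
    """Bitset marking the positions where item occurs in seq."""
    return [int(c == item) for c in seq]


def shift_right(bits, k):
    """Move every bit k positions to the right, dropping overflow."""
    return ([0] * k + bits)[:len(bits)]


def mask(bits, g):
    """OR of the 1..g+1 right shifts: positions reachable within the gap."""
    out = [0] * len(bits)
    for k in range(1, g + 2):
        out = [a | b for a, b in zip(out, shift_right(bits, k))]
    return out


ba_A = item_bits("ABACB", "A")   # sequence 5 of the example
mask_A = mask(ba_A, g=1)
```

For sequence 5 the bitset of A is 10100, its two shifts are 01010 and 00101, and the mask is 01111.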

ConSGapMiner: Support & Gap Checking (3)
EG: Generate bitset array (ba) for X’=“BA” from X=‘B’ (g = 1)

ID  Sequence  Class  ba(X)   mask(X’)  ba(‘A’)  ba(X’)
1   CBAB      pos    0101    0011      0010     0010
2   AACCB     pos    00001   00000     11000    00000
3   BBAAC     pos    11000   01110     00110    00110
4   BCAB      neg    1001    0110      0010     0010
5   ABACB     neg    01001   00110     10100    00100

1. Get ba for X=‘B’
2. Shift ba(X) to get the mask for X’ = ‘BA’ (2 shifts plus OR)
3. AND ba(‘A’) and mask(X’) to get ba(X’)
Number of arrays with some 1 = count
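Chaining the mask and AND steps gives the support counts directly. As a hedged sketch with list-based bitsets (leftmost entry is position 1; all function names are illustrative, and a real miner would reuse the parent node's bitsets rather than recompute from the first item):

```python
def item_bits(seq, item):
    return [int(c == item) for c in seq]


def mask(bits, g):
    """OR of the 1..g+1 right shifts of bits."""
    out = [0] * len(bits)
    for k in range(1, g + 2):
        shifted = ([0] * k + bits)[:len(bits)]
        out = [a | b for a, b in zip(out, shifted)]
    return out


def extend(prefix_bits, item, seq, g):
    """Bitset of pattern + item: mask of the prefix ANDed with the item bitset."""
    return [m & b for m, b in zip(mask(prefix_bits, g), item_bits(seq, item))]


def count(pattern, data, g):
    """Number of sequences whose final bitset has at least one 1."""
    n = 0
    for seq in data:
        bits = item_bits(seq, pattern[0])
        for item in pattern[1:]:
            bits = extend(bits, item, seq, g)
        n += any(bits)
    return n


POS = ["CBAB", "AACCB", "BBAAC"]
NEG = ["BCAB", "ABACB"]
ba_BA_seq5 = extend(item_bits("ABACB", "B"), "A", "ABACB", 1)
counts_BA = (count("BA", POS, 1), count("BA", NEG, 1))
counts_AAC = (count("AAC", POS, 1), count("AAC", NEG, 1))
```

Here `ba_BA_seq5` is 00100 as in the table, “BA” has counts (2, 2), and “AAC” has counts (2, 1), the same pair shown for that node in the candidate generation tree.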

Execution time performance on protein families
[Figures: running time (sec) vs minimal support, and running time (sec) vs maximal gap, for two pairs of protein families]

Pos (#)        Neg (#)            Avg. Len. (Pos, Neg)  Plots
DUF1694 (16)   DUF1695 (5)        (123, 186)            runtime vs support, for g = 5; runtime vs g, for α = 0.3125 (5)
TatC (74)      TatD_DNase (119)   (205, 262)            runtime vs support, for g = 5; runtime vs g, for α = 0.27 (20)

Pattern Length Distribution -- Protein Families
The length and frequency distribution of patterns: TatC vs TatD_DNase, g = 5, α = 13.5%.
[Figure, Length distribution: number of 5-MDS vs length of patterns (3 to 11)]
[Figure, Frequency distribution: number of 5-MDS vs frequency count (1~10, 11~20, 21~30, 31~40, 41~50, >50)]

Bible Books Experiment
New Testament (Matthew, Mark, Luke and John) vs Old Testament (Genesis, Exodus, Leviticus and Numbers):

#Pos   #Neg   Alphabet  Avg. Len.  Max. Len.
3768   4893   3344      7          25

[Figures: running time (sec) vs minimal support, for g = 6; running time (sec) vs maximal gap, for α = 0.0013]

Some interesting terms found from the Bible books (New Testament vs Old Testament):

Substrings (count)     Subsequences (count)
eternal life (24)      seated hand (10)
good news (23)         answer truly (10)
Forgiveness in (22)    Question saying (13)
Chief priests (53)     Truly kingdom (12)

Extensions
• Allowing min gap constraint
• Allowing max window length constraint
• Considering different minimization strategies:
  • Subsequence-based minimization (described on previous slides)
  • Coverage (matching tidset containment) + subsequence based minimization
  • Prefix based minimization

Motif mining
• Find sequence patterns frequent around a site marker, but infrequent elsewhere
• Can also consider two classes:
  • Find patterns frequent around site marker in +ve class, but infrequent at other positions, and infrequent around site marker in -ve class
• Often, biological studies use background probabilities instead of a real -ve dataset
• Popular concept/tool in biological studies

Contrasts for Graph Data
• Can capture structural differences
  • Subgraphs appearing in one class but not in the other class
    • Chemical compound analysis
    • Social network comparison

Contrasts for graph data Cont.
• Standard frequent subgraph mining
  • Given a graph database, find connected subgraphs appearing frequently
• Contrast subgraphs particularly focus on discrimination and minimality

Minimal contrast subgraphs [Ting and Bailey 06]
• A contrast graph is a subgraph appearing in one class of graphs and never in another class of graphs
  • Minimal if none of its subgraphs are contrasts
  • May be disconnected
    • Allows succinct description of differences
    • But requires larger search space
• Will focus on the one versus one case

Contrast subgraph example
[Figure: five labelled graphs]
• Graph A (Positive): vertices v0(a), v1(a), v2(a), v3(c); edges e0(a), e1(a), e2(a), e3(a), e4(a)
• Graph B (Negative): vertices v0(a), v1(a), v2(a), v3(a), v4(a); edges e0(a), e1(a), e2(a), e3(a), e4(a)
• Graph C (Contrast): vertices v0(a), v1(a), v2(a); edges e0(a), e1(a), e2(a)
• Graph D (Contrast): vertices v0(a), v1(a), v3(c); edge e0(a)
• Graph E (Contrast): vertex v3(c)

Minimal contrast subgraphs
• Minimal contrast graphs are of two types
  • Those with only vertices (a vertex set)
  • Those without isolated vertices (edge sets)
• Can prove that for the 1-1 case, the minimal contrast subgraphs are the union of
  Min. Con. Vertex Sets + Min. Con. Edge Sets

Mining contrast subgraphs
• Main idea
  • Find the maximal common edge sets
    • These may be disconnected
  • Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets
  • Must compute minimal contrast vertex sets separately and then minimal union with the minimal contrast edge sets

Contrast graph mining workflow
[Diagram, read as a sequence of steps]
1. Inputs: Positive Graph Gp and Negative Graphs Gn1, Gn2, Gn3, …
2. For each negative graph, compute its Maximal Common Edge Sets (Maximal Common Vertex Sets) with Gp
3. Collect these into the Maximal Common Edge Sets (Maximal Common Vertex Sets)
4. Complement: Complements of Maximal Common Edge Sets (Complements of Maximal Common Vertex Sets)
5. Minimal Transversals: Minimal Contrast Edge Sets (Minimal Vertex Sets)

Using discriminative graphs for containment search and indexing [Chen et al 07]
• Given a graph database and a query q, find all graphs in the database contained in q.
• Applications
  • Querying image databases represented as attributed relational graphs. Efficiently find all objects from the database contained in a given scene (query).
[Figure: model graph database D, query graph q, and the models contained by q]

Discriminative graphs for indexing Cont.
• Main idea:
  • Given a query graph q and a database graph g
    • If a feature f is not contained in q and f is contained in g, then g is not contained in q
• Also exploit similarity between graphs.
  • If f is a common substructure between g1 and g2, then if f is not contained in the query, both g1 and g2 are not contained in the query

Graph Containment Example [From Chen et al 07]
[Figure: a sample database of graphs (ga), (gb), (gc) and features (f1), (f2), (f3), (f4)]

Feature-graph matrix (1 = the feature is contained in the graph):
      ga  gb  gc
f1    1   1   1
f2    1   1   0
f3    1   1   0
f4    1   0   0
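The pruning rule (a feature missing from the query rules out every database graph that contains it) can be applied directly to this matrix. In the sketch below the index is the matrix from the slide, while the two query feature sets are hypothetical inputs; in practice they would come from testing each feature against the query graph.

```python
# Feature -> database graphs containing it (the matrix on the slide).
INDEX = {
    "f1": {"ga", "gb", "gc"},
    "f2": {"ga", "gb"},
    "f3": {"ga", "gb"},
    "f4": {"ga"},
}
DATABASE = {"ga", "gb", "gc"}


def candidates(database, index, features_in_query):
    """Graphs that survive pruning and still need a full containment test."""
    pruned = set()
    for feature, graphs in index.items():
        if feature not in features_in_query:
            # Every graph containing a feature the query lacks is ruled out.
            pruned |= graphs
    return database - pruned


only_f1 = candidates(DATABASE, INDEX, {"f1"})            # hypothetical query
no_f4 = candidates(DATABASE, INDEX, {"f1", "f2", "f3"})  # hypothetical query
```

A query containing only f1 leaves gc as the single candidate, because f2 alone prunes both ga and gb; this is why features shared by many database graphs but rarely found in queries are the most useful.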

Discriminative graphs for indexing
• Aim to select the “contrast features” that have the most pruning power (save most isomorphism tests)
• These are features that are contained by many graphs in the database, but are unlikely to be contained by a query graph.
• Generate lots of candidates using frequent subgraph mining and then filter the output graphs for discriminative power
Generating the Index
After the contrast subgraphs have been found, select a subset of them
• Use a set cover heuristic to select a set that "covers" all the graphs in the database, in the context of a given query q
• For multiple queries, use a maximum coverage with cost approach

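The slide does not say which set cover heuristic is used. As one plausible reading, the sketch below applies the classic greedy rule (repeatedly take the feature that covers the most graphs not yet covered). The names `greedy_cover`, `universe` and `covers`, and the toy feature sets, are hypothetical.

```python
def greedy_cover(universe, covers):
    """Greedy set cover.

    universe: items that should be covered (here, database graphs).
    covers:   dict mapping a set name (here, a feature) to the items it covers.
    Returns (chosen set names in pick order, items left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Ties are broken by name order because max() keeps the first maximum.
        best = max(sorted(covers), key=lambda name: len(covers[name] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # nothing left can be covered by any remaining set
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Toy data, made up for the illustration.
graphs = {"g1", "g2", "g3", "g4", "g5"}
features = {
    "fa": {"g1", "g2", "g3"},
    "fb": {"g3", "g4"},
    "fc": {"g4", "g5"},
    "fd": {"g1"},
}
print(greedy_cover(graphs, features))
```

The greedy rule is a heuristic: it gives a small cover quickly but not necessarily the smallest one.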
Contrasts for trees
Special case of graphs
• Lower complexity
• Lots of activity in the document/XML area, for change detection.
• Notions such as edit distance more typical for this context

Contrasts of models
Models can be clusterings, decision trees, …
Why is contrasting useful here ?
• Contrast/compare a user generated model against a known reference model, to evaluate accuracy/degree of difference.
• May wish to compare degree of difference between one algorithm using varying parameters
• Eliminate redundancy among models by choosing dissimilar representatives

Contrasts of models Cont.
Isn't this just a dissimilarity measure ? Like Euclidean distance ?
• Similar, but operating on more complex objects, not just vectors
Difficulties are
• For rule based classifiers, can't just report on number of different rules

Clustering comparison
Popular clustering comparison measures
Rand index and Jaccard index
• Measure the proportion of point pairs on which the two clusterings agree
Mutual information
• How much information one clustering gives about the other
Clustering error
• Classification error metric

Clustering Comparison Measures
Nearly all techniques use a 'Confusion Matrix' of two clusterings.
Example : Let C = {c1, c2, c3} and C' = {c'1, c'2, c'3}
mij = |ci ∩ c'j|

m     c1   c2   c3
c'1   5    14   1
c'2   10   2    8
c'3   8    7    5

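A confusion matrix of this kind can be built from two label lists, one label per point for each clustering. The sketch below is illustrative only: the function name `confusion_matrix`, the label values and the layout (rows for the clusters of C', columns for the clusters of C, matching the table in the example) are assumptions.

```python
from collections import Counter


def confusion_matrix(labels_c, labels_c2):
    """Return (clusters of C, clusters of C', m) where m[j][i] = |c_i ∩ c'_j|.

    labels_c[k] and labels_c2[k] are the clusters assigned to point k
    by clustering C and clustering C' respectively.
    """
    cols = sorted(set(labels_c))    # clusters of C
    rows = sorted(set(labels_c2))   # clusters of C'
    counts = Counter(zip(labels_c2, labels_c))
    m = [[counts[(r, c)] for c in cols] for r in rows]
    return cols, rows, m


# Five points, made-up labels.
cols, rows, m = confusion_matrix(
    ["c1", "c1", "c2", "c2", "c2"],
    ["x", "y", "y", "y", "x"],
)
print(cols, rows, m)
```

Every point is counted in exactly one cell, so the entries always sum to the number of points.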
Pair counting
Considers the number of pairs of points on which two clusterings agree or disagree. Each pair falls into one of four categories
• N11: number of pairs of points which are in the same cluster in both C and C'
• N00: number of pairs of points which are in different clusters in both C and C'
• N10: number of pairs of points which are in the same cluster in C but not in C'
• N01: number of pairs of points which are in the same cluster in C' but not in C
N: total number of pairs of points

Pair Counting
Two popular indexes - Rand and Jaccard
Rand(C,C') = (N11 + N00) / N
Jaccard(C,C') = N11 / (N11 + N01 + N10)

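Both indexes follow directly from the four pair counts. The sketch below counts pairs by brute force over all point pairs, which is quadratic in the number of points and only meant to make the definitions concrete; the function names and the example labels are made up for the illustration.

```python
from itertools import combinations


def pair_counts(labels_c, labels_c2):
    """Return (N11, N00, N10, N01) for two clusterings given as label lists."""
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_c2 = labels_c2[i] == labels_c2[j]
        if same_c and same_c2:
            n11 += 1          # together in both clusterings
        elif not same_c and not same_c2:
            n00 += 1          # separated in both clusterings
        elif same_c:
            n10 += 1          # together in C only
        else:
            n01 += 1          # together in C' only
    return n11, n00, n10, n01


def rand_index(labels_c, labels_c2):
    n11, n00, n10, n01 = pair_counts(labels_c, labels_c2)
    return (n11 + n00) / (n11 + n00 + n10 + n01)


def jaccard_index(labels_c, labels_c2):
    n11, n00, n10, n01 = pair_counts(labels_c, labels_c2)
    return n11 / (n11 + n01 + n10)


a = [0, 0, 1, 1]
b = [0, 0, 0, 1]
print(pair_counts(a, b), rand_index(a, b), jaccard_index(a, b))
```

Note that the Jaccard index is undefined when no pair is placed together by either clustering (the denominator is zero); this sketch does not handle that case.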
Clustering Error Metric (Classification Error Metric)
An injective mapping of C = {1,…,K} into C' = {1,…,K'}. Need to find maximum intersection for all possible mappings.
Best match is {{c2, c'1}, {c1, c'2}, {c3, c'3}}
Clustering error = (14+10+5)/60 = 0.483

m     c1   c2   c3
c'1   5    14   1
c'2   10   2    8
c'3   8    7    5

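The search over mappings can be written out for the three-cluster example. This is a brute-force sketch that tries every injective mapping, which is only practical for a handful of clusters; the function name `best_matching` and the row/column layout (rows c'1..c'3, columns c1..c3) are choices made for this illustration, and the matrix entries are those read from the example, so treat them as an assumption.

```python
from itertools import permutations


def best_matching(m):
    """Find the injective mapping of columns (clusters of C) to rows
    (clusters of C') that maximises the summed intersection sizes.

    m[j][i] = |c_i ∩ c'_j|. Requires at least as many rows as columns.
    Returns (best total, tuple giving the row matched to each column).
    """
    n_rows, n_cols = len(m), len(m[0])
    best_total, best_rows = -1, None
    for rows in permutations(range(n_rows), n_cols):
        total = sum(m[rows[i]][i] for i in range(n_cols))
        if total > best_total:
            best_total, best_rows = total, rows
    return best_total, best_rows


# Rows c'1, c'2, c'3; columns c1, c2, c3 (as read from the example).
M = [
    [5, 14, 1],
    [10, 2, 8],
    [8, 7, 5],
]
n_points = sum(sum(row) for row in M)

# Matching quoted in the example: c1 with c'2, c2 with c'1, c3 with c'3.
quoted_total = M[1][0] + M[0][1] + M[2][2]

print(n_points, quoted_total, best_matching(M))
```

On the matrix as transcribed here, the matching quoted in the example scores 14 + 10 + 5 = 29, while the exhaustive search returns 30 for c1 with c'3, c2 with c'1 and c3 with c'2. The gap may simply reflect a misread cell in the transcription, so the comparison should be checked against the original slide.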
Clustering Comparison Difficulties
Which is most similar to clustering (a)?
• Rand(a,b) = Rand(a,c)
• Jaccard(a,b) = Jaccard(a,c) !
[Figure: three clusterings (a), (b) and (c); (a) is the reference]

Comparing datasets via induced models
Given two datasets, we may compare their difference, by considering the difference or deviation between the models that can be induced from them
Models here can refer to decision trees, frequent itemsets, emerging patterns, etc
May also compare an old model to a new dataset
• How much does it misrepresent ?

The FOCUS Framework [Ganti et al 02]
Develops a single measure for quantifying the difference between the interesting characteristics in each dataset.
Key Idea: "A model has a structural component that identifies interesting regions of the attribute space … each such region is summarized by one (or several) measure(s)"
Difference between two classifiers is measured by amount of work needed to change them into some common specialization

FOCUS Framework Cont.
For comparing two models, divide the models each into regions and then compare the regions individually
• For a decision tree, compare leaf nodes of each model
Aggregate the pairwise differences between each of the regions

Decision tree example [Taken from Ganti et al 02]
[Figure: three decision trees over Age and Salary]
• T1: D1, split points 30 and 100K, leaves labelled (class1, class2): (0.1, 0.0), (0.0, 0.3), (0.05, 0.55)
• T2: D2, split points 50 and 80K, leaves labelled (class1', class2'): (0.18, 0.1), (0.0, 0.1), (0.1, 0.52)
• T3: GCR of T1 and T2 (just for class1), split points 30, 50, 80K and 100K, regions labelled (class1 - class1'): [0.05-0.1], [0.0-0.04], [0.1-0.14], [0.0-0.0], [0.0-0.0], [0.0-0.0]
Difference(D1,D2) = |0.0-0.0| + |0.0-0.04| + |0.1-0.14| + |0.0-0.0| + |0.0-0.0| + |0.05-0.1| = 0.13

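The worked sum can be reproduced in a few lines. This is only a sketch of the aggregation shown in the example (absolute difference per region, summed over the regions of T3); the variable and function names are made up for the illustration, and the pairs are copied from the Difference(D1,D2) expression.

```python
# One (class1 measure in D1, class1' measure in D2) pair per region of T3,
# in the order in which they appear in the Difference(D1,D2) expression.
REGION_MEASURES = [
    (0.0, 0.0),
    (0.0, 0.04),
    (0.1, 0.14),
    (0.0, 0.0),
    (0.0, 0.0),
    (0.05, 0.1),
]


def summed_absolute_difference(region_measures):
    """Aggregate the per-region differences by summing their absolute values."""
    return sum(abs(a - b) for a, b in region_measures)


difference = summed_absolute_difference(REGION_MEASURES)
print(round(difference, 2))
```

Only three of the six regions contribute (0.04, 0.04 and 0.05); the other three carry the same class1 measure in both datasets.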
Correspondence Tracing of Changes [Wang et al 03]
Correspondence tracing aims to make change between the two models understandable by explicitly describing changes and then ranking them

Correspondence Tracing Example [Taken from Wang et al 03]
Consider old and new rule based classifiers

Old                                    ID's of instances classified
O1: If A4=1 then C3                    [0,2,7,9,13,15,17]
O2: If A3=1 and A4=2 then C2           [1,4,6,10,12,16]
O3: If A3=2 and A4=2 then C1           [3,5,8,11,14]

New
N1: If A3=1 and A4=1 then C3           [0,9,15]
N2: If A3=1 and A4=2 then C2           [1,4,6,10,12,16]
N3: If A3=2 and A4=1 then C2           [2,7,13,17]
N4: If A3=2 and A4=2 then C1           [3,5,8,11,14]

Correspondence Example cont.
Rules N1 and N3 classify the examples that were classified by rule O1. So the changes for the subpopulation covered by O1 can be described as <O1,N1> and <O1,N3>
Changes <O2,N2> and <O3,N4> are trivial because the old and new rules are identical.

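The correspondences in this example can be recovered by intersecting the instance IDs covered by each old rule with those covered by each new rule. The sketch below does only that; it is not the algorithm of Wang et al., and the function name `trace` and the tuple layout are assumptions. The rule data is copied from the example.

```python
# name -> (condition, predicted class, IDs of instances classified)
OLD = {
    "O1": ("A4=1", "C3", {0, 2, 7, 9, 13, 15, 17}),
    "O2": ("A3=1 and A4=2", "C2", {1, 4, 6, 10, 12, 16}),
    "O3": ("A3=2 and A4=2", "C1", {3, 5, 8, 11, 14}),
}
NEW = {
    "N1": ("A3=1 and A4=1", "C3", {0, 9, 15}),
    "N2": ("A3=1 and A4=2", "C2", {1, 4, 6, 10, 12, 16}),
    "N3": ("A3=2 and A4=1", "C2", {2, 7, 13, 17}),
    "N4": ("A3=2 and A4=2", "C1", {3, 5, 8, 11, 14}),
}


def trace(old, new):
    """List (old rule, new rule, shared instance count, is_trivial) for every
    pair of rules that classify at least one common instance."""
    changes = []
    for o_name, (o_cond, o_cls, o_ids) in sorted(old.items()):
        for n_name, (n_cond, n_cls, n_ids) in sorted(new.items()):
            shared = o_ids & n_ids
            if shared:
                # A change is trivial when the old and new rules are identical.
                trivial = (o_cond, o_cls) == (n_cond, n_cls)
                changes.append((o_name, n_name, len(shared), trivial))
    return changes


for change in trace(OLD, NEW):
    print(change)
```

The non-trivial pairs are <O1,N1> and <O1,N3>; ranking them would additionally need the quantitative change Q, which this sketch does not compute.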
Rule Accuracy Increase.
The quantitative change Q of <O,N> is the estimated accuracy increase (+ or -) due to the change from O to N.
Changes are ranked according to quantitative change Q and then presented to the user

Common themes for contrast mining
Different representations
• Minimality is the most common
• Support/ratio constraints quite popular, though not necessarily the best
• Conjunctions most popular for relational case
Large number of contrast patterns are output

Recommendations to Practitioners
Some important points are
• Contrast patterns can capture distinguishing patterns between classes
• Contrast patterns can be used to build high quality classifiers
• Contrast patterns can capture useful patterns for detecting/treating diseases, or other events/conditions

Open Problems in Contrast Data Mining
How to meaningfully assess quality of contrasts, especially for non-relational data.
How to explain the semantics of contrasts
Mining of contrasts using user specified domain knowledge
Highly expressive contrasts (first order..)
Develop new ways to build contrast based classifiers and finding the highest impact contrasts
Rare class classification and contrasts still an unsettled issue
Discovery of contrasts in massive datasets.
• Efficiently mine contrasts when there are thousands of attributes, such as in medical domains
• Efficient mining of top-k contrast patterns
• Are there meaningful approximations (e.g. sampling) ?

Summary
We have given a wide survey of contrast mining. It should now be clearer
• Why contrast data mining is important and when it can be used
• How it can be used for very powerful classifiers
• What algorithms can be used for contrast data mining

Acknowledgements
We are grateful to the following people for their helpful comments or materials for this tutorial
• Eric Bae
• Jiawei Han
• Xiaonan Ji
• Rao Kotagiri
• Jinyan Li
• Elsa Loekito
• Katherine Ramsay
• Limsoon Wong

Bibliography
This bibliography contains three sections:
• Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns
• Emerging/Contrast Pattern Based Classification
• Other Applications of Emerging Patterns
An up to date version of this bibliography is available at http://www.cs.wright.edu/~gdong/EPC.html
Please let us know of any extra references to include !

Bibliography (Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns)
Arunasalam, Bavani and Chawla, Sanjay and Sun, Pei. Striking Two Birds with One Stone: Simultaneous Mining of Positive and Negative Spatial Patterns. In Proceedings of the Fifth SIAM International Conference on Data Mining, April 21-23, pp, Newport Beach, CA, USA, SIAM 2005
Bavani Arunasalam, Sanjay Chawla: CCCS: a top-down associative classifier for imbalanced class distribution. KDD 2006: 517-522
Eric Bae, James Bailey, Guozhu Dong: Clustering Similarity Comparison Using Density Profiles. Australian Conference on Artificial Intelligence 2006: 342-351
James Bailey, Thomas Manoukian, Kotagiri Ramamohanarao: Fast Algorithms for Mining Emerging Patterns. PKDD 2002: 39-50.
J. Bailey and T. Manoukian and K. Ramamohanarao: A Fast Algorithm for Computing Hypergraph Transversals and its Application in Mining Emerging Patterns. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM). Pages 485-488. Florida, USA, November 2003.
Stephen D. Bay, Michael J. Pazzani: Detecting Change in Categorical Data: Mining Contrast Sets. KDD 1999: 302-306.
Stephen D. Bay, Michael J. Pazzani: Detecting Group Differences: Mining Contrast Sets. Data Min. Knowl. Discov. 5(3): 213-246 (2001)
Cristian Bucila, Johannes Gehrke, Daniel Kifer, Walker M. White: DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. Data Min. Knowl. Discov. 7(3): 241-272 (2003)
Yandong Cai, Nick Cercone, Jiawei Han: An Attribute-Oriented Approach for Learning Classification Rules from Relational Databases. ICDE 1990: 281-288
Sarah Chan, Ben Kao, Chi Lap Yip, Michael Tang: Mining Emerging Substrings. DASFAA 2003.
Yixin Chen, Guozhu Dong, Jiawei Han, Jian Pei, Benjamin W. Wah, Jianyong Wang: Online Analytical Processing Stream Data: Is It Feasible? DMKD 2002
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu: Towards Graph Containment Search and Indexing. VLDB 2007: 926-937

Bibliography (Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns)
Graham Cormode, S. Muthukrishnan: What's new: finding significant differences in network data streams. IEEE/ACM Trans. Netw. 13(6): 1219-1232 (2005)
Luc De Raedt, Albrecht Zimmermann: Constraint-Based Pattern Set Mining. SDM 2007
Luc De Raedt: Towards Query Evaluation in Inductive Databases Using Version Spaces. Database Support for Data Mining Applications 2004: 117-134
Luc De Raedt, Stefan Kramer: The Levelwise Version Space Algorithm and its Application to Molecular Fragment Finding. IJCAI 2001: 853-862
Guozhu Dong, Jinyan Li: Efficient Mining of Emerging Patterns: Discovering Trends and Differences. KDD 1999: 43-52.
Guozhu Dong, Jinyan Li: Mining border descriptions of emerging patterns from dataset pairs. Knowl. Inf. Syst. 8(2): 178-202 (2005).
Dong, G. and Han, J. and Lakshmanan, L.V.S. and Pei, J. and Wang, H. and Yu, P.S. Online Mining of Changes from Data Streams: Research Problems and Preliminary Results, Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams, 2003
Guozhu Dong, Jiawei Han, Joyce M. W. Lam, Jian Pei, Ke Wang, Wei Zou: Mining Constrained Gradients in Large Databases. IEEE Trans. Knowl. Data Eng. 16(8): 922-938 (2004).
Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining Under Frequency Constraints. PKDD 2006: 139-150
Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan: A Framework for Measuring Changes in Data Characteristics. PODS 1999: 126-137
Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, Wei-Yin Loh: A Framework for Measuring Differences in Data Characteristics. J. Comput. Syst. Sci. 64(3): 542-578 (2002)
Garriga, G.C. and Kralj, P. and Lavrac, N. Closed Sets for Labeled Data?, PKDD, 2006
Hilderman, R.J. and Peckham, T. A Statistically Sound Alternative Approach to Mining Contrast Sets, Proceedings of the 4th Australasian Data Mining Conference, 2005 (pp. 157-172)

Bibliography (Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns)
Hui-jing Huang, Yongsong Qin, Xiaofeng Zhu, Jilian Zhang, and Shichao Zhang. Difference Detection Between Two Contrast Sets. Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWak), 2006.
Imberman, S.P. and Tansel, A.U. and Pacuit, E. An Efficient Method For Finding Emerging Frequent Itemsets, 3rd International Workshop on Mining Temporal and Sequential Data, pp. 112-121, 2004
Tomasz Imielinski, Leonid Khachiyan, Amin Abdulghani: Cubegrades: Generalizing Association Rules. Data Min. Knowl. Discov. 6(3): 219-257 (2002)
Inakoshi, H. and Ando, T. and Sato, A. and Okamoto, S. Discovery of emerging patterns from nearest neighbors, International Conference on Machine Learning and Cybernetics, 2002.
Xiaonan Ji, James Bailey, Guozhu Dong: Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints. ICDM 2005: 194-201.
Xiaonan Ji, James Bailey, Guozhu Dong: Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints. Knowl. Inf. Syst. 11(3): 259-286 (2007).
Daniel Kifer, Shai Ben-David, Johannes Gehrke: Detecting Change in Data Streams. VLDB 2004: 180-191
P Kralj, N Lavrac, D Gamberger, A Krstacic. Contrast Set Mining for Distinguishing Between Similar Diseases. LNCS Volume 4594, 2007.
Sau Dan Lee, Luc De Raedt: An Efficient Algorithm for Mining String Databases Under Constraints. KDID 2004: 108-129
Haiquan Li, Jinyan Li, Limsoon Wong, Mengling Feng, Yap-Peng Tan: Relative risk and odds ratio: a data mining perspective. PODS 2005: 368-377
Jinyan Li, Guimei Liu and Limsoon Wong. Mining Statistically Important Equivalence Classes and delta-Discriminative Emerging Patterns. KDD 2007.
Jinyan Li, Thomas Manoukian, Guozhu Dong, Kotagiri Ramamohanarao: Incremental Maintenance on the Border of the Space of Emerging Patterns. Data Min. Knowl. Discov. 9(1): 89-116 (2004).

Bibliography (Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns)
Jinyan Li and Qiang Yang. Strong Compound-Risk Factors: Efficient Discovery through Emerging Patterns and Contrast Sets. IEEE Transactions on Information Technology in Biomedicine. To appear.
Lin, J. and Keogh, E. Group SAX: Extending the Notion of Contrast Sets to Time Series and Multimedia Data. Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases. Berlin, Germany, September, 2006.
Bing Liu, Ke Wang, Lai-Fun Mun, Xin-Zhi Qi: Using Decision Tree Induction for Discovering Holes in Data. PRICAI 1998: 182-193
Bing Liu, Liang-Ping Ku, Wynne Hsu: Discovering Interesting Holes in Data. IJCAI (2) 1997: 930-935
Bing Liu, Wynne Hsu, Yiming Ma: Discovering the set of fundamental rule changes. KDD 2001: 335-340.
Elsa Loekito, James Bailey: Fast Mining of High Dimensional Expressive Contrast Patterns Using Zero-suppressed Binary Decision Diagrams. KDD 2006: 307-316.
Yu Meng, Margaret H. Dunham: Efficient Mining of Emerging Events in a Dynamic Spatiotemporal Environment. PAKDD 2006: 750-754
Tom M. Mitchell: Version Spaces: A Candidate Elimination Approach to Rule Learning. IJCAI 1977: 305-310
Amit Satsangi, Osmar R. Zaiane, Contrasting the Contrast Sets: An Alternative Approach, Eleventh International Database Engineering and Applications Symposium (IDEAS 2007), Banff, Canada, September 6-8, 2007
Michele Sebag: Delaying the Choice of Bias: A Disjunctive Version Space Approach. ICML 1996: 444-452
Michele Sebag: Using Constraints to Building Version Spaces. ECML 1994: 257-271
Arnaud Soulet, Bruno Crémilleux, François Rioult: Condensed Representation of EPs and Patterns Quantified by Frequency-Based Measures. KDID 2004: 173-190
Pawel Terlecki, Krzysztof Walczak: On the relation between rough set reducts and jumping emerging patterns. Inf. Sci. 177(1): 74-83 (2007).

Bibliography (Mining of Emerging Patterns, Change Patterns, Contrast/Difference Patterns)
Roger Ming Hieng Ting, James Bailey: Mining Minimal Contrast Subgraph Patterns. SDM 2006.
V. S. Tseng, C. J. Chu, and Tyne Liang, An Efficient Method for Mining Temporal Emerging Itemsets From Data Streams, International Computer Symposium, Workshop on Software Engineering, Databases and Knowledge Discovery, 2006
J. Vreeken, M. van Leeuwen, A. Siebes: Characterising the Difference. KDD 2007.
Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han: Mining concept-drifting data streams using ensemble classifiers. KDD 2003: 226-235
Peng Wang, Haixun Wang, Xiaochen Wu, Wei Wang, Baile Shi: On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams. ICDM 2005: 474-481
Lusheng Wang, Hao Zhao, Guozhu Dong, Jianping Li: On the complexity of finding emerging patterns. Theor. Comput. Sci. 335(1): 15-27 (2005).
Ke Wang, Senqiang Zhou, Ada Wai-Chee Fu, Jeffrey Xu Yu: Mining Changes of Classification by Correspondence Tracing. SDM 2003.
Geoffrey I. Webb: Discovering Significant Patterns. Machine Learning 68(1): 1-33 (2007)
Geoffrey I. Webb, Songmao Zhang: K-Optimal Rule Discovery. Data Min. Knowl. Discov. 10(1): 39-79 (2005)
Geoffrey I. Webb, Shane M. Butler, Douglas A. Newlands: On detecting differences between groups. KDD 2003: 256-265.
Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao: Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets. KDD 2000: 310-314.
Lizhuang Zhao, Mohammed J. Zaki, Naren Ramakrishnan: BLOSOM: a framework for mining arbitrary boolean expressions. KDD 2006: 827-832

Bibliography (Emerging/Contrast Pattern Based Classification)
Hamad Alhammady, Kotagiri Ramamohanarao: The Application of Emerging Patterns for Improving the Quality of Rare-Class Classification. PAKDD 2004: 207-211
Hamad Alhammady, Kotagiri Ramamohanarao: Using Emerging Patterns and Decision Trees in Rare-Class Classification. ICDM 2004: 315-318
Hamad Alhammady, Kotagiri Ramamohanarao: Expanding the Training Data Space Using Emerging Patterns and Genetic Methods. SDM 2005
Hamad Alhammady, Kotagiri Ramamohanarao: Using Emerging Patterns to Construct Weighted Decision Trees. IEEE Trans. Knowl. Data Eng. 18(7): 865-876 (2006).
Hamad Alhammady, Kotagiri Ramamohanarao: Mining Emerging Patterns and Classification in Data Streams. Web Intelligence 2005: 272-275
James Bailey, Thomas Manoukian, Kotagiri Ramamohanarao: Classification Using Constrained Emerging Patterns. WAIM 2003: 226-237
Guozhu Dong, Xiuzhen Zhang, Limsoon Wong, Jinyan Li: CAEP: Classification by Aggregating Emerging Patterns. Discovery Science 1999: 30-42.
Hongjian Fan, Kotagiri Ramamohanarao: An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification. PAKDD 2002: 456-462
Hongjian Fan, Kotagiri Ramamohanarao: Efficiently Mining Interesting Emerging Patterns. WAIM 2003: 189-201
Hongjian Fan, Kotagiri Ramamohanarao: Noise Tolerant Classification by Chi Emerging Patterns. PAKDD 2004: 201-206
Hongjian Fan, Ming Fan, Kotagiri Ramamohanarao, Mengxu Liu: Further Improving Emerging Pattern Based Classifiers Via Bagging. PAKDD 2006: 91-96
Hongjian Fan, Kotagiri Ramamohanarao: A weighting scheme based on emerging patterns for weighted support vector machines. GrC 2005: 435-440

Bibliography (Emerging/Contrast Pattern Based Classification)
Hongjian Fan, Kotagiri Ramamohanarao: Fast Discovery and the Generalization of Strong Jumping Emerging Patterns for Building Compact and Accurate Classifiers. IEEE Trans. Knowl. Data Eng. 18(6): 721-737 (2006)
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao: Instance-Based Classification by Emerging Patterns. PKDD 2000: 191-200
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao: Making Use of the Most Expressive Jumping Emerging Patterns for Classification. PAKDD 2000: 220-232
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao: Making Use of the Most Expressive Jumping Emerging Patterns for Classification. Knowl. Inf. Syst. 3(2): 131-145 (2001)
Jinyan Li, Kotagiri Ramamohanarao, Guozhu Dong: Emerging Patterns and Classification. ASIAN 2000: 15-32
Jinyan Li, Guozhu Dong, Kotagiri Ramamohanarao, Limsoon Wong: DeEPs: A New Instance-Based Lazy Discovery and Classification System. Machine Learning 54(2): 99-124 (2004).
Wenmin Li, Jiawei Han, Jian Pei: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM 2001: 369-376
Jinyan Li, Kotagiri Ramamohanarao, Guozhu Dong: Combining the Strength of Pattern Frequency and Distance for Classification. PAKDD 2001: 455-466
Bing Liu, Wynne Hsu, Yiming Ma: Integrating Classification and Association Rule Mining. KDD 1998: 80-86
Kotagiri Ramamohanarao, James Bailey: Discovery of Emerging Patterns and Their Use in Classification. Australian Conference on Artificial Intelligence 2003: 1-12
Ramamohanarao, K. and Bailey, J. and Fan, H. Efficient Mining of Contrast Patterns and Their Applications to Classification, Third International Conference on Intelligent Sensing and Information Processing, 2005 (39-47).
Ramamohanarao, K. and Fan, H. Patterns Based Classifiers, World Wide Web 2007: 10(71-83).
Qun Sun, Xiuzhen Zhang, Kotagiri Ramamohanarao: Noise Tolerance of EP-Based Classifiers. Australian Conference on Artificial Intelligence 2003: 796-806

Bibliography (Emerging/Contrast Pattern Based Classification)
Xiaoxin Yin, Jiawei Han: CPAR: Classification based on Predictive Association Rules. SDM 2003
Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao: Information-Based Classification by Aggregating Emerging Patterns. IDEAL 2000: 48-53
Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao: Building Behaviour Knowledge Space to Make Classification Decision. PAKDD 2001: 488-494
Zhou Wang, Hongjian Fan, Kotagiri Ramamohanarao: Exploiting Maximal Emerging Patterns for Classification. Australian Conference on Artificial Intelligence 2004: 1062-1068

Bibliography (Other Applications of Emerging Patterns)
Anne-Laure Boulesteix, Gerhard Tutz, Korbinian Strimmer: A CART-based approach to discover emerging patterns in microarray data. Bioinformatics 19(18): 2465-2472 (2003).
Lijun Chen, Guozhu Dong: Masquerader Detection Using OCLEP: One Class Classification Using Length Statistics of Emerging Patterns. Proceedings of International Workshop on INformation Processing over Evolving Networks (WINPEN), 2006.
Guozhu Dong, Kaustubh Deshpande: Efficient Mining of Niches and Set Routines. PAKDD 2001: 234-246
Grandinetti, W.M. and Chesnevar, C.I. and Falappa, M.A. Enhanced Approximation of the Emerging Pattern Space using an Incremental Approach, Proceedings of VII Workshop of Researchers in Computer Sciences, Argentine, pp. 263-267, 2005
Jinyan Li, Huiqing Liu, See-Kiong Ng, Limsoon Wong. Discovery of Significant Rules for Classifying Cancer Diagnosis Data. Bioinformatics. 19 (suppl. 2): ii93-ii102. (This paper was also presented in the 2003 European Conference on Computational Biology, Paris, France, September 26-30.)
Jinyan Li, Huiqing Liu, James R. Downing, Allen Eng-Juh Yeoh, Limsoon Wong. Simple Rules Underlying Gene Expression Profiles of More than Six Subtypes of Acute Lymphoblastic Leukemia (ALL) Patients. Bioinformatics. 19:71-78, 2003.
Jinyan Li, Limsoon Wong: Emerging patterns and gene expression data. Genome Informatics, 2001: 12(3-13).
Jinyan Li, Limsoon Wong: Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics 18(5): 725-734 (2002)
Jinyan Li, Limsoon Wong. Geography of Differences Between Two Classes of Data. Proceedings 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 325-337, Helsinki, Finland, August 2002.
Jinyan Li and Limsoon Wong. Structural Geography of the space of emerging patterns. Intelligent Data Analysis (IDA): An International Journal, Volume 9, pages 567-588, November 2005.
Jinyan Li, Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao, Qun Sun: Efficient Mining of High Confidence Association Rules without Support Thresholds. PKDD 1999: 406-411

Bibliography (Other Applications of Emerging Patterns)
Shihong Mao, Guozhu Dong: Discovery of Highly Differentiative Gene Groups from Microarray Gene Expression Data Using the Gene Club Approach. J. Bioinformatics and Computational Biology 3(6): 1263-1280 (2005).
Podraza, R. and Tomaszewski, K. KTDA: Emerging Patterns Based Data Analysis System, Proceedings of XXI Fall Meeting of Polish Information Processing Society, pp. 213-221, 2005
Rioult, F. Mining strong emerging patterns in wide SAGE data, Proceedings of the ECML/PKDD Discovery Challenge Workshop, Pisa, Italy, pp. 127-138, 2004
Eng-Juh Yeoh, Mary E. Ross, Sheila A. Shurtleff, W. Kent William, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Reilling, Anami Patel, Cheng Cheng, Dario Campana, Dawn Wilkins, Xiaodong Zhou, Jinyan Li, Huiqing Liu, Chin-Hon Pui, William E. Evans, Clayton Naeve, Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133-143, March 2002.
Yoon, H.S. and Lee, S.H. and Kim, J.H. Application of Emerging Patterns for Multi-source Bio-Data Classification and Analysis, Lecture Notes in Computer Science Vol 3610, 2005.
Yu, L.T.H. and Chung, F. and Chan, S.C.F. and Yuen, S.M.C. Using emerging pattern based projected clustering and gene expression data for cancer detection, Proceedings of the second conference on Asia-Pacific bioinformatics, pp. 75-84, 2004.
Zhang, X. and Dong, G. and Wong, L. Using CAEP to predict translation initiation sites from genomic DNA sequences, TR2001/22, CSSE, Univ. of Melbourne, 2001.