Technical University of Crete

Department of Electronic and Computer

Engineering

DESIGN AND EVALUATION OF TOPIC

DRIVEN FOCUSED CRAWLERS

FOR THE WORLD WIDE WEB

By

BATSAKIS SOTIRIOS

A Thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Computer Engineering

Chania, November 2007

Design and evaluation of topic driven

focused crawlers for the World Wide Web

Batsakis Sotirios

Abstract

Focused crawlers are programs designed to browse the Web and download pages on a specific topic. They are used for answering user queries or for building digital libraries on a topic specified by the user. They are distinguished into classic, semantic and learning focused crawlers. Classic focused crawlers estimate the relevance of Web pages to the topic by computing the similarity of Web pages with a user provided list of keywords that describe the topic of interest. Semantic crawlers are a variation of classic focused crawlers that use conceptual relations between terms (e.g. retrieved from an ontology) for estimating the relevance of a Web page to the topic. Learning crawlers employ a training process that guides the crawler towards pages related to the topic.

This work addresses issues related to the design and implementation of classic, semantic and learning focused crawlers. Several variants of classic focused crawlers relying upon Web page content and link anchor text for estimating the relevance of Web pages to a given topic are examined and implemented. A novelty of this work is the introduction of a new category of semantic crawlers making use of WordNet as the underlying ontology for obtaining terms conceptually related (but not necessarily lexicographically similar) to the topic. Learning crawlers based on the Hidden Markov Model (HMM), which learn not only the content of relevant pages but also paths leading to relevant pages following a certain number of routing hops, are examined as well. An additional contribution of this work is the introduction of a new category of hybrid crawlers combining the strengths of both classic and learning focused crawlers.

The crawlers referred to above are all implemented and a comparative analysis of their performance is presented. All crawlers achieve their maximum performance when a combination of Web page content and anchor text is used for assigning download priorities to Web pages. Semantic similarity methods combined with a general purpose ontology source such as WordNet do not actually improve performance, except for the implementation that restricts semantic similarity to synonym terms. Hybrid crawlers improved the performance of state of the art HMM crawlers, yielding very promising results.

Contents

Chapter 1. Introduction
1.1 Background
1.2 Present work
1.3 Contribution of the current thesis
1.4 Thesis outline

Chapter 2. Related Work
2.1 Introduction
2.2 Non Focused Crawlers
2.3 Classic Focused Crawlers
2.4 Semantic Crawlers
2.5 Learning Crawlers
2.6 Summary

Chapter 3. Crawler Design
3.1 Introduction
3.2 Classic Crawlers
3.2.2 Best First Crawler with anchor text similarity
3.2.3 Best First Crawler with page content and anchor text
3.3 Semantic Crawlers
3.3.1 Ehrig Crawler
3.3.2 SSRM Crawler
3.3.3 Semantic Crawler with synonym set expansion
3.4 Learning Crawlers
3.4.1 Hidden Markov Model Crawler
3.4.2 Hybrid Crawlers
3.5 Summary

Chapter 4. Experimental Results
4.1 Introduction
4.2 Performance measures
4.3 Experiment setup
4.4 Classic Focused Crawlers
4.5 Semantic Crawlers
4.6 Learning Crawlers
4.7 Discussion

Chapter 5. Conclusions and future work

References

Chapter 1. Introduction

The World Wide Web is a huge information source with billions of Web pages on every conceivable subject. General purpose search engines such as Google [5], Yahoo [7], MSN [8] and Ask [9] have appeared in order to assist users in finding information on the Web. These search engines are very complicated and sizable systems [1, 2], but they do not achieve full coverage of the Web. Google achieves up to 76% and Yahoo up to 69% coverage, while other search engines index an even smaller percentage of the entire Web [3]. Information searches issued through Web search engines are not propagated over the Web in real time. Instead, the search engines index, analyze and categorize Web information accumulated locally in data repositories, and this information is then used for answering user queries. The general purpose search engine approach effectively addresses the need of the end user to find specific information in real time.

Crawlers (also known as Robots or Spiders [20]) are tools for assembling information from the Web locally. Focused crawlers in particular have been introduced for satisfying the need of individuals (e.g. domain experts) or organizations to create and maintain local digital libraries on a subject, or for answering complicated queries (for which a Web search engine would yield limited or no satisfactory results). Users of such applications typically require high quality, up-to-date results while minimizing the amount of resources dedicated to the search task. Focused crawlers download as many pages relevant to the subject as they can, while keeping the amount of irrelevant data downloaded to a minimum [30]. Besides the creation of specialized digital libraries, applications of focused crawlers also include guiding intelligent agents on the Web for locating specialized information (e.g. flight schedules and ticket prices for a voyage planning agent). As the importance and the size of the Web grow, so does the importance of focused crawlers.

1.1 Background

Crawlers are given a starting set of Web pages (seed pages) as input, extract the outgoing links appearing in the seed pages and determine which links to visit next based on certain criteria. The Web pages pointed to by these links are then downloaded, and those satisfying certain selection criteria are stored in a local repository. Crawlers continue visiting Web pages until a given number of pages have been downloaded or until local resources (such as storage) are exhausted.

The crawlers used by general purpose search engines retrieve Web pages massively, regardless of their topic. Methods for implementing such crawlers include:

a) Breadth First Crawlers: The outgoing links from the given set of pages are extracted and inserted in a First In First Out (FIFO) queue, and their corresponding Web pages are downloaded. The process continues similarly with the new pages.

b) Page importance Crawlers: They assign higher visit priority to Web pages (i.e. to their corresponding URLs) linked to from more important pages. Page importance estimation criteria for assigning priorities to extracted URLs include Backlink count (i.e. the number of Web pages containing links to a given page) [22] and PageRank (the importance estimation method used in the Google search engine) [6].
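The two strategies listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation evaluated in this thesis: the dictionary `TOY_WEB` is a hypothetical in-memory link graph standing in for real downloads, and the function names are chosen here for readability.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "b"],
    "b": ["c"],
    "c": ["seed"],
}

def breadth_first_crawl(seeds, max_pages, web=TOY_WEB):
    """Visit pages in FIFO order, starting from the seed pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)              # stands in for "download and store"
        for link in web.get(url, []):
            if link not in seen:         # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited

def backlink_counts(web=TOY_WEB):
    """Count, for each URL, how many distinct pages link to it."""
    counts = {url: 0 for url in web}
    for source, links in web.items():
        for target in set(links):
            counts[target] = counts.get(target, 0) + 1
    return counts
```

A page importance crawler could use values such as those returned by `backlink_counts` as visit priorities in place of the FIFO order; PageRank would need an iterative computation that is not shown here.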

Although simple, Breadth First Crawlers achieve good performance (measured as the average quality of downloaded pages using the PageRank criterion) [19], and are effective for implementing non-focused crawlers. The major disadvantage of Breadth First Crawlers (and of the other non topic driven crawlers) is that they use only the link structure of the Web, and not Web page content, in assigning visit priorities to URLs; consequently they fail to focus on pages on a topic. Because pages on a specific topic are a minor fraction of the overall Web, crawling on that topic using non focused crawlers will result in downloading a large number of irrelevant pages, thus quickly exhausting the available resources. Therefore, building a specialized digital library calls for focused crawlers.

Focused crawlers work by combining both the content of the retrieved Web pages and the link structure of the Web for assigning higher visiting priority to pages relevant to the topic. They are distinguished into the following categories:

a) Classic Focused Crawlers [26] take as input a user query that describes the topic and a set of starting URLs (seeds). The crawling starts from the user provided seed URLs. The crawlers assign a priority value to visited pages according to their relevance to the topic. The Web pages are ordered by relevance and the crawlers proceed by visiting the most relevant Web pages first. The most common criterion for estimating the relevance of a retrieved page to a user query is the similarity between the text of the visited page and the query (topic). Typically this is computed using a text similarity model such as the Boolean or the Vector Space Model [12]. Focused crawlers using the Vector Space Model for relevance estimation (Best First Crawlers [25]) are the most effective classic focused crawling method so far [26]. Existing work on classic focused crawlers is presented in section 2.3. Our proposed variants and implementations of classic focused crawlers are discussed in section 3.2.

b) Semantic Crawlers are a variation of classic focused crawlers. Page visit priority is assigned to pages using their content and by applying semantic criteria for computing page-to-topic relevance. A page and the query can be relevant if they share conceptually similar (but not necessarily lexically similar) terms. Conceptual relations between terms are defined using an underlying topic specific or general purpose ontology. Thus, semantic crawlers differ from classic focused crawlers in the way content relevance is computed. To the best of our knowledge, semantic crawlers have not been compared with state-of-the-art classic focused crawlers such as those referred to above, nor have they been combined with modern semantic similarity methods (such as those presented in [11]) so as to achieve their full potential. The present work addresses all these issues (section 3.3).

c) Learning Crawlers [33] apply a training process for assigning visit priorities to Web pages and for guiding the crawling process. They are characterized by the way relevant Web pages, or paths through Web links for reaching relevant pages, are learned by the crawler (typically by machine learning or other processes) so that the crawler can distinguish between relevant and non relevant pages. Building upon this idea, a number of approaches for learning Web pages relevant to the topic have appeared in the literature and include:

1. Approaches based on machine learning: The crawler is supplied with a training set consisting of relevant and non relevant Web pages, which is used to train the learning crawler [33, 34]. During crawling, higher visit priority is assigned to Web pages classified as relevant to the topic.

2. Approaches that take into account not only the page content and the corresponding classification of Web pages as relevant or non relevant to the topic, but also the link structure of the Web and the probability that a given page (which can be non relevant to the topic) will lead to a relevant page within the minimum number of steps (hops). Methods based on Context Graphs [31] and Hidden Markov Models (HMM) [16] are examples of this category of methods. Section 2.5 contains a detailed description of these methods and Section 3.4 the enhancements proposed in this work.

3. Hybrid methods that combine learning crawlers with ideas of classic focused crawlers [35]. Our work focuses on hybrid crawlers and proposes an approach that combines the strengths of classic focused crawlers (variations of Best First Crawlers) with Hidden Markov Models for learning not only how to distinguish between relevant and non relevant Web pages based on content, but also how to guide the search for such relevant Web pages through a sequence of routing hops between Web pages (sometimes through non relevant pages). This method is described in section 3.4 and the experimental results obtained (Section 4.6) indicate that it is a very effective crawling method.

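Fig. 1: Crawler Classification. Crawlers divide into non topic oriented crawlers and focused crawlers; focused crawlers divide into classic focused crawlers, semantic crawlers and learning crawlers.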
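The relevance estimation used by the classic focused crawlers of category (a) can be illustrated with a small Python sketch of Vector Space Model similarity. The sketch assumes raw term-frequency weights and whitespace tokenization, which are simplifications chosen here; the term weighting scheme actually used in this work is not stated in this chapter.

```python
import math
from collections import Counter

def term_vector(text):
    """Raw term-frequency vector of a text (no stemming or stop-word removal)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                       # an empty text is similar to nothing
    return dot / (norm_u * norm_v)

# Usage: score a page against the topic query.
query = term_vector("web crawler")
page = term_vector("web crawler crawler")
score = cosine_similarity(query, page)   # 3 / sqrt(10), roughly 0.95
```

A page sharing no terms with the query scores 0, which is exactly the lexical limitation that the semantic crawlers of category (b) aim to relax.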

1.2 Present work

This work deals with the design and evaluation of focused crawlers. State of the art approaches for building topic driven focused crawlers are considered, including classic, semantic and learning crawlers. Several variants of these approaches are also proposed and evaluated. The emphasis of this work is on hybrid crawlers combining text and link information for reaching the most promising pages on the topic of interest faster.

The first crawler implemented is the Breadth First Crawler. This is a classic non topic-oriented crawler which is used as a reference in all comparisons with focused crawlers. Several variants of the Best First Crawler [25] are also implemented and evaluated. The Best First Crawler works by estimating the relevance of the retrieved page to the user query (both represented using term vectors) using the Vector Space Model (VSM) [12]; it then visits the links extracted from the most relevant page. A URL can be represented either by the term vector of the Web page it was extracted from, or by the term vector of its corresponding anchor text (the text that appears on the link pointing to that URL). All solutions (using page content, anchor text or their combination) are implemented and evaluated. These methods are described in section 3.2.
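The three Best First variants described above can be sketched as a priority queue whose scores come from page content, anchor text, or both. This is a hedged illustration: the combination rule (a plain average of the two similarities) is an assumption made here, since the actual rule is given in section 3.2, and the names `link_priority` and `Frontier` are hypothetical.

```python
import heapq
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity of the raw term-frequency vectors of two texts."""
    u, v = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def link_priority(query, page_text, anchor_text, mode="both"):
    """Score a URL by its parent page, its anchor text, or their average."""
    page_score = cosine(query, page_text)
    anchor_score = cosine(query, anchor_text)
    if mode == "page":
        return page_score
    if mode == "anchor":
        return anchor_score
    return (page_score + anchor_score) / 2   # assumed combination rule

class Frontier:
    """Priority queue of URLs; the highest score is popped first."""
    def __init__(self):
        self._heap = []
        self._count = 0                      # tie-breaker keeps insertion order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawl loop, each link extracted from a downloaded page would be pushed with its `link_priority`, and the next page to download would be taken with `pop`.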

The second category of methods includes Semantic Crawlers, which estimate the conceptual (semantic) relevance of a Web page to the query. The method by Ehrig et al. [13] combines focused crawlers and semantic relations from an ontology (in [13] topic specific ontologies were used) for assigning visit priorities to pages. In our implementation of semantic crawlers, term vectors are enhanced with synonyms and semantically similar terms from WordNet [4] (thus making our implementation the first general purpose semantic crawler implementation). Topic relevance can then be computed by VSM [12], by the Semantic Similarity Retrieval Model (SSRM) [14] or by the method of Mihalcea et al. [15].
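The enhancement of term vectors with synonyms can be illustrated as follows. The sketch is a toy under stated assumptions: `TOY_SYNSETS` is a hypothetical hand-made table standing in for WordNet lookups, and the `weight` parameter given to added synonyms is an assumption of this example, not a value taken from this work.

```python
from collections import Counter

# Hypothetical hand-made synonym sets; a real system would query WordNet.
TOY_SYNSETS = {
    "car": {"automobile", "auto"},
    "film": {"movie"},
}

def expand_with_synonyms(terms, synsets=TOY_SYNSETS, weight=1.0):
    """Return a term vector holding the given terms plus their synonyms."""
    vector = Counter()
    for term in terms:
        vector[term] += 1.0
        for synonym in synsets.get(term, ()):
            vector[synonym] += weight    # assumed weight for added synonyms
    return vector

# Usage: the expanded query now also matches pages that say "automobile".
expanded = expand_with_synonyms(["car", "rental"])
```

The expanded vector can then be compared with a page's term vector using VSM, so that a page mentioning only "automobile" still receives a non-zero relevance score for the query term "car".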

Our proposed approach to Learning Crawlers is influenced by work on HMM Crawlers [16, 18] for learning paths leading to relevant pages in addition to the content of the desired Web pages. The user of an HMM Crawler provides a training set of pages (both relevant and non relevant to the topic of interest). These pages are clustered according to their content. Transition probabilities between the resulting clusters, which represent relevant pages or non relevant pages leading to relevant ones, are computed and are used to estimate, given the cluster a Web page is assigned to, the probability that the page will lead to relevant pages. The higher this probability is, the higher the visit priority given to the page's extracted links will be. K-Means [47] and X-Means [17] can be applied for the clustering of Web pages. K-Means clustering is an algorithm for grouping objects into K groups based on their attributes or features (K is a predefined positive integer). The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids. X-Means is an extension of K-Means with dynamic estimation of the number of clusters depending on the data. In this work, focused crawlers based on both clustering approaches are implemented and evaluated.
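The K-Means step described above can be sketched in a few lines of Python. This is a minimal version of the standard iterative (Lloyd) procedure, written under simplifying assumptions chosen here: points are numeric tuples rather than page term vectors, the initial centroids are simply the first K points, and a fixed number of iterations replaces a convergence test.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, assignment)."""
    centroids = [tuple(p) for p in points[:k]]   # simplistic initialization
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Usage: two well separated groups of 2-D points.
centroids, assignment = k_means([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
```

Each iteration can only lower or keep the sum of squared distances between points and their centroids, which is the quantity K-Means minimizes. X-Means would additionally choose the number of clusters from the data and is not shown.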

Based on the HMM Crawler, we propose Hybrid Crawlers that combine classic focused crawlers, for assigning priorities to URLs based on topic relevance, with learning crawlers, for learning access paths for reaching relevant pages (possibly through non relevant ones). Two hybrid crawlers, combining HMM with page content or with both page content and anchor text, are implemented and evaluated. Our proposed approach to hybrid crawlers is presented in section 3.4.

The crawlers referred to above (and their variations) are all implemented and their performance is compared based on results obtained from the Web using several different topics and seed (starting) pages. Section 4 presents a comparative study of the performance of all crawler variants by category, along with a critical analysis of their performance.

1.3 Contribution of the current thesis

The contributions of this work are summarized below:

a ) T hi s t h es i s p r es e n t s a c r i t i c a l e v a l ua t io n o f s t a t e o f t h e

a r t a pp ro a c h es t o W eb C r aw l in g , i n c lu d in g C l as s i c ,

S em a n t i c a nd Le a r n i n g Fo c us ed Cr a w l e r s . T o o u r

k n ow le d ge a s im i l a r e v a l u a t i on h as n ’ t a pp e a r ed in t h e

l i t e ra tu r e b e fo r e .

b) It proposes several variants of existing crawling methodologies based on recent semantic relevance estimation methods and compares their performance with classic focused crawling methods.

c) It proposes a novel hybrid approach to learning crawling, combining classic focused crawlers for assigning priorities to URLs with ideas from learning crawlers for learning paths that reach Web pages relevant to the topic.

1.4 Thesis outline

The work in this thesis is organized as follows. Related work on focused crawling is presented in Section 2, which is organized in six subsections: the first is the introduction; the second (2.2) presents non-topic-driven crawlers; the third (2.3) presents classic implementations of focused crawlers; the fourth (2.4) presents preliminary related work on semantic crawlers; the fifth (2.5) presents previous work on Learning Crawlers; and the sixth summarizes the above.

Issues related to the design and implementation of Web crawlers are presented in Section 3. Subsection 3.1 is an introduction to the topic, subsection 3.2 provides a detailed description of the classic crawlers implemented in this work, and subsection 3.3 deals with issues related to the design of semantic crawlers. In subsection 3.4, particular emphasis is given to learning crawlers and to the subsequent design of hybrid crawlers.

Section 4 provides a description of the experimental results. Subsection 4.1 presents the purpose of the experiments, and subsection 4.2 describes the performance measures used to evaluate the crawlers. The experimental setup is discussed in subsection 4.3. Experimental results on Classic Crawlers are presented in subsection 4.4, followed by results obtained by semantic and learning crawlers in subsections 4.5 and 4.6 respectively. Subsection 4.7 presents a critical analysis of the performance of the various crawler methods considered in this work. Finally, conclusions and issues for further research are discussed in Section 5.


Chapter 2. Related Work

2.1 Introduction

Related work on crawlers includes contributions regarding both classic (non-topic-oriented) and focused (topic-oriented) crawlers. Existing work on the design and implementation of non-focused crawlers and of focused (classic, semantic and learning) crawlers proposed in the literature is presented in this chapter.

Classic non-focused crawlers (e.g. crawlers used by Web search engines for assembling Web pages into local repositories) download Web pages massively, regardless of content, in order to create vast page repositories. Focused crawlers, on the other hand, are more selective, downloading only pages related to a known (user-provided) topic. Issues related to the design and implementation of classic as well as focused crawlers are discussed in the following and include:

a) Search strategy: The crawler can browse the Web in breadth-first order or select links to follow using importance estimation criteria. Focused crawlers assign visiting priorities to pages according to the relevance of the page to a topic specified by a user.

b) Refreshing policy: Due to the dynamic nature of the Web, pages must be revisited in order to keep page repositories up to date. Finding the optimal page refresh policy, one that keeps page repositories up to date without unnecessarily downloading pages that are not outdated, is a very important issue in crawler design [21]. Also, satisfying the conflicting demands for a high downloading rate without putting excessive load on the visited Web sites is a major concern when designing a crawler for a search engine.

c) Synchronization: Crawlers used by commercial search engines use multiple parallel processes that massively retrieve Web pages, regardless of their topic. These processes must be synchronized in order to avoid duplicate page downloads [20].

2.2 Non Focused Crawlers

Typically, non-focused crawlers are used by general-purpose search engines for assembling Web information locally. Methods for implementing such crawlers include:

a) Breadth First Crawler: After downloading the initial pages (called seed pages), the outgoing links extracted from these pages are put in a FIFO queue. The links extracted first point to pages that are given the highest priority for downloading and further crawling. Breadth-first crawling is one of the most commonly used crawling approaches for assembling Web content locally for use by Web search engines. GoogleBot [5], Slurp [7], MSNBot [8] and Teoma [9] are examples of crawler implementations used by commercial search engines. Page refresh policy, synchronization, and optimal downloading rate are important issues here [20, 21]. Technical issues such as the supported file formats, file size limitations and the visiting policy are also of great importance. Breadth-first crawlers are capable of crawling a large part of the Web [3]. The downloaded pages are then analyzed (e.g. by content, type), indexed and subsequently stored in data repositories composed of thousands of computers and Terabytes of data [1, 2]. This approach requires huge resources, which are available only to large companies or organizations such as Google or Yahoo.

Crawlers such as Mercator [45] and Larbin [10] are examples of Breadth First Crawlers which are freely available to programmers for testing and system development. When limited resources are available, they can crawl a small part of the indexed Web and retrieve Web content for further processing. Breadth-first crawlers yield high quality pages [19] but are not topic oriented.

b) Page importance Crawlers: They assign higher visit priority to URLs retrieved from more important pages. Typically, page importance for assigning priorities to extracted URLs is computed by Backlink count (where higher priority is given to pages pointed to by many other Web pages) and PageRank [6]. Other criteria, such as the position of the page within the Web site hierarchy (e.g. low depth, as indicated by fewer, or no, slashes in the page URL, leads to higher priority) or the number of outgoing links of that page (Outlink count), can be used as well. Cho et al. [22] provide a survey of this type of crawler. Page importance criteria have been shown to improve the quality of downloaded pages [22].
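The two strategies listed above can be illustrated with a minimal sketch. It assumes a small in-memory link graph (the hypothetical `TOY_WEB` dictionary) in place of real HTTP downloads; the function names are illustrative and do not come from any of the cited crawlers.

```python
from collections import Counter, deque


def breadth_first_crawl(link_graph, seeds, max_pages):
    """Visit pages in FIFO order, starting from the seed pages."""
    frontier = deque(seeds)          # FIFO queue of URLs waiting to be visited
    seen = set(seeds)                # URLs already queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)          # stands in for downloading the page
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


def backlink_counts(link_graph):
    """Count, for every page, how many distinct pages link to it."""
    counts = Counter()
    for targets in link_graph.values():
        counts.update(set(targets))
    return counts


# Hypothetical link graph: page -> pages it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
    "c": [],
    "d": ["f"],
    "e": [],
    "f": [],
}

if __name__ == "__main__":
    print(breadth_first_crawl(TOY_WEB, ["seed"], max_pages=5))
    print(backlink_counts(TOY_WEB).most_common(1))
```

A page importance crawler would replace the FIFO queue with a priority queue keyed on a score such as the backlink count computed here.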

2.3 Classic Focused Crawlers

Crawlers used by search engines (such as those referred to in Section 2.2) are designed to maximize the total number, and possibly the quality, of downloaded Web pages. Topic-oriented or Focused Crawlers take as input a user query (Classic Focused Crawlers), or example pages provided by the user as a training set (Learning Crawlers), and focus the crawling process on pages relevant to the topic. Focused crawlers keep the overall number of downloaded Web pages to a minimum while maximizing the percentage of relevant pages.

The performance of a focused crawler depends on the selection of good starting pages (seed pages). Good seed pages can be either Web pages relevant to the query topic or pages from which relevant pages can be accessed through a small number of routing hops. For example, if the topic is scientific publications, a good seed page can be the publications page of an author, lab or department, or alternatively the home page of that author, lab or department (although the latter may contain no publications at all, it is known to lead to pages containing publications). Seed pages should also be important (where importance is defined using link analysis methods such as HITS [46] and PageRank [6]). The rationale behind this requirement is that important Web pages, when used as starting pages (seeds) for crawling, may guide the crawling process to other important Web pages faster, thus improving the quality of the results. The seed pages are often selected by submitting the query that describes the topic of interest to a search engine and using the top search engine results.

Early approaches to Focused Crawling include, among others, the Fish-Search algorithm [23]. The basic idea of the algorithm is that when several pages are candidates for link browsing and downloading, priority is given to pages relevant to the topic (a page is labeled as relevant if it contains the query text). Every candidate page is assigned a Boolean value derived by a simple lexicographic rule (and it is downloaded by a separate application thread). Threads corresponding to relevant pages create new threads for their outgoing links, while threads corresponding to irrelevant pages are stopped. This work examined the separate use of the anchor text in assigning priorities to URLs.

The main disadvantage of the Fish-Search algorithm is that priorities take Boolean values; therefore all relevant pages are assigned the same priority. The Shark-Search algorithm [24] is a direct successor of Fish-Search, where the Vector Space Model (VSM) [12] is used for assigning non-Boolean priority values to candidate pages. This improved the results of crawling [24]. The Vector Space Model has been the basis of classic focused crawlers ever since.

According to VSM, documents are represented as term vectors and the weight $w_{ij}$ of a term j in document i is computed as:

$$w_{ij} = tf_{ij} \cdot idf_j, \qquad tf_{ij} = \frac{f_{ij}}{\max f_i}, \qquad idf_j = \log\frac{N}{n_j} \qquad (1)$$

where $tf_{ij}$ is the term frequency of term j in document i, $idf_j$ is the inverse document frequency of term j, $f_{ij}$ is the frequency of appearance of term j in document i, $\max f_i$ is the maximum frequency of all terms in document i, $N$ is the total number of documents and $n_j$ is the number of documents containing term j.
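As an illustration of Eq. (1), the following minimal sketch computes tf-idf weights for a small collection of tokenized documents. The function name `vsm_weights`, the toy documents and the choice of the natural logarithm are assumptions made for the example; the text does not fix the base of the logarithm.

```python
import math
from collections import Counter


def vsm_weights(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]

    # n_j: number of documents that contain term j.
    doc_freq = Counter()
    for term_counts in counts:
        doc_freq.update(term_counts.keys())

    weights = []
    for term_counts in counts:
        # Maximum raw frequency of any term in this document.
        max_f = max(term_counts.values(), default=1)
        weights.append({
            term: (f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in term_counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [["web", "crawler", "web"], ["web", "page"]]
    for doc_weights in vsm_weights(docs):
        print(doc_weights)
```

In this example the term "web" occurs in every document, so its idf, and therefore its weight, is zero.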

Recent approaches to focused crawling include InfoSpiders and the Best-First Crawler [25]. InfoSpiders use Neural Networks, while Best-First Crawlers use text similarity by VSM for assigning priority values to candidate pages.

Given a query and a Web page, the priority of the Web page is computed by Best-First Crawlers as the cosine similarity between their document vectors, where $w_{qj}$ and $w_{pj}$ are the term weights of the query and the Web page respectively:

$$sim(query, page) = \frac{\sum_{j=1}^{n} w_{qj} \cdot w_{pj}}{\sqrt{\sum_{j=1}^{n} w_{qj}^{2}} \; \sqrt{\sum_{j=1}^{n} w_{pj}^{2}}} \qquad (2)$$

where $n$ is the total number of terms in the query and page content.
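The cosine similarity of Eq. (2) can be sketched as follows. The representation of term vectors as dictionaries mapping terms to weights is an assumption made for the example, as is the convention of returning 0 when either vector is empty.

```python
import math


def cosine_similarity(query_weights, page_weights):
    """Cosine similarity between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(query_weights) & set(page_weights)
    dot = sum(query_weights[t] * page_weights[t] for t in shared)

    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    p_norm = math.sqrt(sum(w * w for w in page_weights.values()))
    if q_norm == 0 or p_norm == 0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (q_norm * p_norm)


if __name__ == "__main__":
    query = {"focused": 1.0, "crawler": 1.0}
    page = {"crawler": 2.0, "web": 1.0}
    print(cosine_similarity(query, page))
```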

In fact, the Best-First Crawler is a simplified version of the Shark-Search crawler: it does not combine link anchor text and the scores of previously visited pages into the page priority function, as Shark-Search does. Also, Best-First Crawlers use only term frequency (tf) vectors for computing topic relevance. The use of inverse document frequency (idf) values (as suggested by VSM) is problematic in the case of focused crawling, since it might require recalculation of all term vectors at every crawling step. In addition, idf values are highly inaccurate at the early stages of crawling because of the small number of retrieved documents. Best-First Crawlers have been shown to outperform InfoSpiders and Shark-Search, as well as non-focused crawling approaches such as Breadth-First and PageRank [26]. Best-first crawling is considered to be the most established approach to focused crawling due to its simplicity and efficiency. The N-Best-First Crawler is a generalized version of the Best-First Crawler: at each step, instead of choosing one Web page for link extraction and downloading of the pages pointed to by these links, the N pages with the highest priority are chosen [27].
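A minimal sketch of the (N-)Best-First strategy is given below. It makes several assumptions that are not taken from the cited work: pages and links are held in hypothetical in-memory dictionaries (`PAGES`, `LINKS`) instead of being downloaded, a candidate URL inherits the tf-based cosine score of the page on which it was found, seeds receive the top priority, and ties are broken alphabetically by URL.

```python
import heapq
import math
from collections import Counter


def tf_cosine(query_terms, page_terms):
    """Cosine similarity between raw term-frequency vectors."""
    q, p = Counter(query_terms), Counter(page_terms)
    dot = sum(q[t] * p[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0


def n_best_first_crawl(pages, links, seeds, query_terms, n=1, max_pages=10):
    """Repeatedly visit the n highest-priority URLs in the frontier."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)  # stands in for downloading the page
            score = tf_cosine(query_terms, pages.get(url, []))
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    heapq.heappush(frontier, (-score, target))
    return visited


# Hypothetical toy data: page -> tokens, page -> outgoing links.
PAGES = {
    "seed": ["news", "portal"],
    "sports": ["football", "scores"],
    "crawl": ["web", "crawler", "crawler"],
    "match": ["football", "match"],
    "focused": ["focused", "crawler"],
}
LINKS = {"seed": ["sports", "crawl"], "sports": ["match"], "crawl": ["focused"]}

if __name__ == "__main__":
    print(n_best_first_crawl(PAGES, LINKS, ["seed"], ["crawler"], n=1))
    print(n_best_first_crawl(PAGES, LINKS, ["seed"], ["crawler"], n=2))
```

With n=1 the link found on the relevant page "crawl" is followed before the remaining links, whereas with n=2 two frontier URLs are expanded per step regardless of the gap between their scores.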

Along the same lines, an approach referred to as “intelligent crawling” [28] suggests combining page content, URL string and statistics about relevant/irrelevant pages and sibling pages for assigning priorities to candidate URLs. These statistics are updated and combined during crawling to guide the selection of the next links to follow, yielding a highly effective crawling algorithm that learns to crawl without direct user training.

2.4 Semantic Crawlers

Semantic Crawlers are implemented by combining an ontology with semantic similarity measures [14] for detecting topic relevance between retrieved Web pages and user queries. Semantic similarity plays an important role here: it can be used to detect topic relevance by associating terms in a query and in the Web page using the ontology, and by assigning a degree of relevance to each such term association.

Ehrig et al. [13] propose the use of topic-oriented ontologies for finding pages relevant to the topic of interest. Every term in a Web page is examined and contributes positively to the assigned priority score if it is a query term or if it is semantically related to the user query terms. The following variations for evaluating semantic relations of page terms with query terms were used:

a) If a term is directly connected (distance 1) to a query term, then it is considered relevant (distance is defined as the length of the shortest path connecting two terms represented as vertices in the ontology graph, where edges represent relations between adjacent terms).

b) If a term is close to a query term (distance 2 or less) using only IS-A relations, then it is relevant to the query term.

c) Every page term appearing in the ontology graph is assigned a relevance value depending on its distance from the query terms. The greater the distance, the lower the relevance value. Specifically, using a topic-specific underlying ontology, the semantic similarity between terms is computed as:

$$Sim(t_1, t_2) = \delta^{d(t_1, t_2)} \qquad (3)$$

where $\delta$ is a decreasing factor (0.5 in this work) and $d(t_1, t_2)$ is the length of the shortest path connecting terms $t_1$ and $t_2$ in the ontology graph (0 if the terms belong to the same synonym set). The longer the distance between the terms in the graph, the smaller their similarity. This method is a variation of the shortest path semantic similarity method.
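The distance-based similarity of Eq. (3) can be sketched with a breadth-first search over the ontology graph. The adjacency-list `ONTOLOGY` below is a hypothetical toy graph, not the topic ontology used in [13]; treating only identical terms as distance 0 (instead of whole synonym sets) and returning 0 for unconnected terms are simplifying assumptions.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, t1, t2, delta=0.5):
    """Similarity delta ** d, where d is the shortest path length."""
    d = shortest_path_length(graph, t1, t2)
    return 0.0 if d is None else delta ** d


# Hypothetical undirected toy ontology stored as adjacency lists.
ONTOLOGY = {
    "vehicle": ["craft"],
    "craft": ["vehicle", "aircraft"],
    "aircraft": ["craft", "airplane"],
    "airplane": ["aircraft", "airliner"],
    "airliner": ["airplane"],
}

if __name__ == "__main__":
    for term in ("airplane", "aircraft", "craft", "vehicle"):
        print(term, path_similarity(ONTOLOGY, "airplane", term))
```

Variation a) above corresponds to accepting only terms with distance 1, while this sketch assigns a graded value to every reachable term.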

The last approach has the best performance for computing the conceptual similarity between terms and was also used in our work for comparison with other semantic relation methods and state-of-the-art classic focused crawling approaches. Another state-of-the-art term similarity method used in the present work is the Li et al. method [42]:

The semantic similarity between two terms $t_1$ and $t_2$ is computed as a function of the length of the path connecting the terms in the underlying ontology graph and the depth of the terms in the taxonomy:

$$Sim(t_1, t_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}} \qquad (4)$$

where $L$ is the shortest path length between $t_1$ and $t_2$, $H$ is the depth of the most specific common concept of $t_1$ and $t_2$ in the taxonomy, and $\alpha$, $\beta$ are constants ($\alpha = 0.2$ and $\beta = 0.6$ in our implementation).
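Eq. (4) is straightforward to evaluate once the path length and the depth are known, because the fraction is the hyperbolic tangent of $\beta H$. The sketch below assumes that both quantities have already been extracted from the taxonomy; the function name `li_similarity` is illustrative.

```python
import math


def li_similarity(path_length, depth, alpha=0.2, beta=0.6):
    """Li et al. similarity: exp(-alpha * L) * tanh(beta * H)."""
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x), i.e. the fraction in Eq. (4).
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)


if __name__ == "__main__":
    # Similarity decreases with path length and increases with depth.
    for path_length, depth in [(0, 5), (2, 5), (2, 1), (6, 5)]:
        print(path_length, depth, round(li_similarity(path_length, depth), 4))
```

The first factor penalizes terms that are far apart, while the second rewards term pairs whose common ancestor lies deep in the taxonomy, that is, pairs that share a specific concept.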

According to results reported in [14], this method has been proven to be fast and accurate (achieving accuracy of up to 82% compared to results obtained by humans).

General-purpose taxonomies such as WordNet can also be applied to focused crawling. WordNet is an online lexical reference system developed at Princeton University. WordNet attempts to model the lexical knowledge of a native speaker of English. WordNet can also be seen as an ontology for natural language terms. It contains around 100,000 terms, organized into taxonomic hierarchies. Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets). The synsets are also organized into senses (i.e. corresponding to different meanings of the same term or concept). The synsets (or concepts) are related to other synsets higher or lower in the hierarchy defined by different types of relationships. The most common relationships are the Hyponym/Hypernym (i.e., Is-A relationships) and the Meronym/Holonym (i.e., Part-of relationships). There are nine noun and several verb Is-A hierarchies (adjectives and adverbs are not organized into Is-A hierarchies). Figure 2 illustrates a fragment of the WordNet Is-A hierarchy.

Fig. 2 WordNet hypernym/hyponym synset relations example. [Figure not reproduced; node labels: Vehicle; Craft; Aircraft; Vessel, watercraft; Rocket, projectile; Sled, sledge, ...; Spacecraft, ...; Hovercraft; Airplane, aeroplane, plane; Airship, ...; Drone; Glider, ...; Airliner; Amphibian; Jet; Fighter; Bomber; Biplane; Monoplane.]


To the best of our knowledge, a comparative study between semantic and other focused crawling approaches has not been reported in the literature before. The implementations in [13] are compared only with a basic focused crawler (assigning each page a simple binary priority value depending on the presence of query terms) rather than with the widely used Best-First Crawlers, which make use of VSM for estimating topic relevance [29]. The proposed work deals with exactly this issue and presents a comparative study between classic and several variants of semantic crawling approaches (including Ehrig et al. [13]).

2.5 Learning Crawlers

Early approaches to developing learning crawlers applied a learning classifier (that relied on Web taxonomies such as Yahoo [7]) to distinguish between relevant and non-relevant pages [30]. Every page containing links that are candidates for downloading is classified as relevant or not relevant and assigned a priority value according to that classification (higher priority is assigned to relevant pages). This work is considered to be one of the first contributions in the field of Learning Crawlers. Recent approaches involving machine learning methods for focused crawling include decision trees [34], Neural Networks and Support Vector Machines [33].

Building upon similar ideas, the crawler in [31] introduced the concept of Context Graphs: for every relevant page, a search engine's backlink service is applied to retrieve its predecessor pages. Then, a classifier is built according to the distance (Level) of pages from the set of relevant pages. Download priorities are estimated using this classifier: the closer a candidate page is to a relevant one, the greater the priority of that page.
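The level assignment that underlies a context graph can be sketched as a backward breadth-first search from the target pages. The `BACKLINKS` dictionary below is a hypothetical stand-in for a search engine's backlink service, and the sketch only produces the level labels; training the per-level classifier described in [31] is outside its scope.

```python
from collections import deque


def context_levels(backlinks, targets, max_level):
    """Map each page to its link distance (level) from the nearest target."""
    levels = {page: 0 for page in targets}   # targets are level 0
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        if levels[page] == max_level:
            continue                          # do not expand beyond max_level
        for predecessor in backlinks.get(page, ()):
            if predecessor not in levels:
                levels[predecessor] = levels[page] + 1
                queue.append(predecessor)
    return levels


# Hypothetical backlink data: page -> pages that link to it.
BACKLINKS = {
    "paper": ["pubs", "lab"],
    "pubs": ["home"],
    "lab": ["dept"],
    "home": ["dept"],
}

if __name__ == "__main__":
    print(context_levels(BACKLINKS, ["paper"], max_level=2))
```

Pages labeled with a lower level would then receive a higher download priority, in line with the description above.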


Fig. 3 Context graph: Pages are classified according to their distance (Level) from target pages. [Figure not reproduced; legend: Target page, Level 1 page, Level 2 page.]

An extension to the Context Graph method was the Hidden Markov Model (HMM) crawler [16]. The user browses the Web and indicates whether a downloaded page is relevant to the topic or not. The visiting sequence is also recorded and is used for training the algorithm to identify paths leading to relevant pages. The downloaded pages are clustered and a Hidden Markov Model [44] is created. Every page is characterized by two states: (a) the visible state, corresponding to the cluster that the page belongs to according to its content, and (b) the hidden state, corresponding to the distance of the page from a relevant page (0 if the page is a target/relevant page). During crawling, every page is assigned a value equal to the probability that, given the cluster the page belongs to, crawling will lead to a target page; this probability is computed using the Hidden Markov Model.


Specifically, all pages are represented by their term vectors according to VSM and they are clustered. Thus, every page in the training set is characterized by the cluster it belongs to and by its distance (level) from a target page (Fig. 4).

Fig. 4 Representation of the HMM training set using distance from target pages (Level) and clusters of pages in the training set. [Figure not reproduced; levels shown: L0 page, L1 page, L2 page, L3 page.]

In Figure 4, green pages indicate target or level 0 pages, yellow pages are level 1 pages (1 link distance from target pages), orange pages are level 2 (2 links away from target pages), and red pages are 3 or more links away from target pages. Labels on pages represent the cluster the page belongs to (e.g. the C0, C1 and C2 labels correspond to Cluster 0, Cluster 1 and Cluster 2 respectively). Notice that pages within the same cluster can belong to different levels, and that pages in the same level can belong to different clusters. Every page is characterized by its level or hidden state $L_i$, where i is the level, and by the cluster $C_j$ it belongs to (its visible state). This set of pages with hidden and visible states forms a Hidden Markov Model [44]. The following summarizes the parameters and notation used by the HMM crawler:

I. Initial probability matrix:

$$\pi = \{P(L_0), \ldots, P(L_{\mathit{states}-1})\}$$

where $\mathit{states}$ denotes the number of hidden states and $P(L_i)$ represents the probability of being at hidden state $i$ at time 1. This probability is computed by assigning to each page a value equal to the percentage of pages with the same hidden state in the training set.

II. Transition Probabilities Matrix A:

$$A = [a_{ij}], \quad 0 \le i < \mathit{states}, \; 0 \le j < \mathit{states}$$

where $a_{ij}$ represents the probability of being at state $L_j$ at time $t+1$ if at state $L_i$ at time $t$. This probability is estimated by counting the corresponding transitions from state $L_i$ to $L_j$ in the user training set, and by normalizing by the overall number of transitions from state $L_i$.

III. Emission Probabilities Matrix B:

$$B = [b_{ij}], \quad 0 \le i < \mathit{states}, \; 0 \le j < \mathit{clusters}$$

where $b_{ij}$ represents the probability of being at cluster $C_j$ given state $L_i$, and $\mathit{clusters}$ is the number of clusters of pages. Probabilities are computed by counting the number of pages in cluster $C_j$ with hidden state $L_i$ and normalizing by the overall number of pages with hidden state $L_i$.
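The counting estimates for the three parameter sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation (which is in Java): the function name `estimate_hmm_parameters` and the input layout (a list of `(level, cluster)` pairs for the training pages and a list of `(from_level, to_level)` pairs for the observed transitions) are hypothetical.

```python
from collections import Counter


def estimate_hmm_parameters(pages, transitions, n_states, n_clusters):
    """Estimate (pi, A, B) by counting over a labeled training set.

    pages:       list of (level, cluster) pairs, one per training page
    transitions: list of (from_level, to_level) pairs seen in the training set
    """
    # pi[i]: fraction of training pages whose hidden state is L_i
    level_counts = Counter(level for level, _ in pages)
    pi = [level_counts[i] / len(pages) for i in range(n_states)]

    # A[i][j]: transitions L_i -> L_j, normalized by all transitions leaving L_i
    A = [[0.0] * n_states for _ in range(n_states)]
    for i, j in transitions:
        A[i][j] += 1.0
    for row in A:
        total = sum(row)
        if total > 0:  # rows with no observed transitions stay all-zero
            row[:] = [c / total for c in row]

    # B[i][j]: pages in cluster C_j with hidden state L_i, normalized by pages in L_i
    B = [[0.0] * n_clusters for _ in range(n_states)]
    for level, cluster in pages:
        B[level][cluster] += 1.0
    for row in B:
        total = sum(row)
        if total > 0:
            row[:] = [c / total for c in row]

    return pi, A, B
```

How to treat a hidden state with no outgoing transitions in the training set (left as an all-zero row here) is a design choice the text does not address; smoothing would be one alternative.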

During crawling, page content is processed and the HMM crawler assigns the page to a cluster using the K-Nearest Neighbors algorithm [43]. Given the page cluster and the Hidden Markov Model parameters (the $\pi$, $A$ and $B$ matrices), the probability that the next page visited will be a target page is computed using the Viterbi algorithm [40]. That probability also serves as the visit priority of the link. The Viterbi algorithm computes a prediction of the state in the next time step given the sequence of web pages observed thus far. In order to calculate the prediction value, each visited page is associated with values $a(L_j, t)$, $j = 0, 1, \ldots, \mathit{states}-1$. Value $a(L_j, t)$ is the probability that the system is in hidden state $L_j$ at time $t$, based on the observations made thus far. Given the values $a(L_j, t-1)$ of parent pages, the values $a(L_j, t)$ are computed using the following recursion:

$$a(L_j, t) = b_{j c_t} \sum_{i=0}^{\mathit{states}-1} a(L_i, t-1) \cdot a_{ij} \qquad (5)$$

where $a_{ij}$ is the transition probability from state $L_i$ to $L_j$ taken from matrix $A$, and $b_{j c_t}$ is the emission probability of cluster $c_t$ from hidden state $L_j$ taken from matrix $B$. At the base of the recursion, the values $a(L_j, 0)$ are taken from the initial probability matrix $\pi$. Given the values $a(L_j, t)$, the probability that the system will be in state $L_j$ at the next time step is computed as follows:

$$a(L_j, t+1) = \sum_{i=0}^{\mathit{states}-1} a(L_i, t) \cdot a_{ij} \qquad (6)$$

The probability of being at state $L_0$ (relevant page) in the next step is the priority assigned to pages.
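The recursion and the one-step prediction can be traced with a short Python sketch. It follows Eqs. 5 and 6 literally (a sum over predecessor states) and applies no normalization, since the text does not mention one. The function names and the list-of-lists matrix layout are assumptions made for the example.

```python
def update_state_values(parent_values, cluster, A, B):
    """Eq. 5: a(L_j, t) = b[j][c_t] * sum_i a(L_i, t-1) * a[i][j]."""
    n = len(A)
    return [
        B[j][cluster] * sum(parent_values[i] * A[i][j] for i in range(n))
        for j in range(n)
    ]


def predict_next_state(values, A):
    """Eq. 6: a(L_j, t+1) = sum_i a(L_i, t) * a[i][j]."""
    n = len(A)
    return [sum(values[i] * A[i][j] for i in range(n)) for j in range(n)]


def link_priority(parent_values, cluster, A, B):
    """Priority of a link: predicted value of state L_0 at the next step."""
    values = update_state_values(parent_values, cluster, A, B)
    return predict_next_state(values, A)[0]
```

For a seed page, `parent_values` would be the initial probabilities $\pi$; for any other page it would be the values stored with the page it was reached from.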

Chakrabarti et al. [32] proposed a two-classifier approach. The Open Directory (DMOZ) [39] Web taxonomy is used to classify downloaded pages as relevant or not, and to feed a second classifier which is trained using these pages. The second classifier is used to evaluate the probability that a given page will lead to a target page. An extensive study of Learning Crawlers and an evaluation of several classifiers used to assign visit priority values to pages are presented in [33]. Classifiers based on Support Vector Machines (SVM) [38] seem to outperform Bayes classifiers and classifiers based on Neural Networks on that task.

Recent contributions to the field of learning crawling include Hybrid crawlers [35], which combine ideas from learning and classic focused crawlers. In [35] a Hybrid Crawler is proposed: the crawler works by acting alternately either as a learning crawler guided by genetic algorithms (for learning the link sequence leading to target pages) or as a breadth first crawler. In our work, we apply a hybrid method for improving the performance of learning crawlers. However, instead of alternating the crawler between two modes of operation (Learning or Breadth First crawler), we combine the page priority functions computed by a Hidden Markov Model Crawler and by the Best First Crawler in order to evaluate the overall priority value of a Web page.

2.6 Summary

Related work on focused crawlers includes classic, semantic and learning approaches. The Best First Crawler and variations of this method (e.g. the N-Best First Crawler) form a common and effective approach for focused crawling [26]. The semantic crawlers presented in [13] are not well studied, and a comparison with state-of-the-art classic focused crawlers such as Best-First hasn't appeared in the literature before. Learning crawlers form a distinctive category of focused crawlers based on a training set provided by the user for topic description. Learning crawlers based on SVM classifiers for assigning page visiting priorities achieve good performance [33], while methods that learn paths leading to pages relevant to the topic, such as the Context Graph method [31] and Hidden Markov Model Crawlers [16, 18], are of great importance. Also, the newly proposed hybrid methods [35] are a very promising approach to focused crawling.


Chapter 3. Crawler Design

3.1 Introduction

Issues related to the design and implementation of focused crawlers are discussed next. Given an application (general purpose web search engine or topic specific digital library), the appropriate type of web crawler has to be determined first. For the first application type, a breadth first crawler is a reasonable solution. Focused crawlers (classic, semantic or learning crawlers) are best suited for the latter application type.

[Figure 5: two panels, "Breadth First Crawler" and "Focused Crawlers". Green circles denote pages relevant to the topic and arcs denote links between Web pages. Arrows denote the visit sequence using different crawlers. Focused Crawlers assign higher visit priorities to links contained in pages relevant to the topic.]

Fig. 5 Crawler Operation


Fig. 5 demonstrates the search stages of a crawler. Web pages are denoted by circles (green circles correspond to pages relevant to the topic at hand) while arcs denote outgoing links from a page. The crawler retrieves pages from the web starting with a seed page shown at the root of the tree. As discussed in the introduction, the outgoing links (URLs) of each visited page are placed in a queue from which the web page to visit next is selected in some order. The crawler gets the URL, downloads the page and places the URLs extracted from the downloaded page in the queue. This process is repeated until the crawler decides to stop (e.g. disk space is exhausted, the time limit has elapsed or the user is satisfied with the results). Focused crawlers introduce a number of criteria (e.g. page importance, relevance to topic) for assigning priorities to web pages in the queue and for selecting which page to visit next. Fig. 6 illustrates the operation stages of a crawler:

[Figure 6: flowchart with the stages "User input", "Page downloading", "Content processing" and "Priority assignment", followed by the decision "Crawling termination criteria satisfied?" (branches labeled Yes and No) and the final box "Output: Web pages satisfying user needs".]

Fig. 6 Overview of Crawler operation


a) Input: Crawlers take as input a number of starting (seed) URLs and (in the case of focused crawlers) the topic description. This description can be a list of keywords for classic and semantic focused crawlers or a training set for learning crawlers.

b) Page downloading: Pages from the queue are downloaded in some order. Focused crawlers may decide to exclude pages not satisfying the topic criteria from further investigation. Pages are stored locally in a page repository for further processing.

c) Content processing: The page content is lexically analyzed and reduced to term vectors (all terms are reduced to their morphological roots by applying Porter's stemming algorithm [48] and stop words are removed). Each term in a vector is represented by its term frequency-inverse document frequency (tf-idf) weight according to the VSM. The outgoing links of the page are also extracted and placed in the priority queue.

d) Priority assignment: URLs extracted from downloaded pages are placed in a priority queue where priorities are determined based on the type of crawler and user preferences. The criteria range from simple ones such as page importance or relevance to the query topic (computed by matching the query with page or anchor text) to more involved ones (e.g. criteria determined by a learning process).

e) Expansion: URLs are selected for further expansion and steps (b)-(e) are repeated until some criteria are satisfied (e.g. the desired number of pages has been downloaded) or system resources are exhausted.
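Steps (a)-(e) can be summarized as a single loop around a priority queue. The Python sketch below is a minimal illustration, not the thesis code (which is written in Java): `fetch`, `extract_links` and `priority` are hypothetical caller-supplied callbacks, and the repository is a plain in-memory dictionary.

```python
import heapq
from itertools import count


def crawl(seeds, fetch, extract_links, priority, max_pages):
    """Generic crawl loop following steps (a)-(e).

    fetch(url) -> page or None
    extract_links(page) -> iterable of (url, anchor_text)
    priority(page, anchor_text) -> float
    Returns the URLs of the downloaded pages in visit order.
    """
    order = count()  # tie-breaker: equal priorities are served first-in, first-out
    # heapq is a min-heap, so priorities are stored negated; seeds come first
    frontier = [(float("-inf"), next(order), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    repository = {}  # url -> page; stands in for the local page repository

    while frontier and len(repository) < max_pages:
        _, _, url = heapq.heappop(frontier)
        page = fetch(url)                         # (b) page downloading
        if page is None:
            continue
        repository[url] = page
        for link, anchor in extract_links(page):  # (c) content processing
            if link not in seen:
                seen.add(link)
                score = priority(page, anchor)    # (d) priority assignment
                heapq.heappush(frontier, (-score, next(order), link))
    return list(repository)                       # (e) loop ends when criteria are met
```

With a constant `priority` the loop degenerates to breadth first order, because ties are broken by insertion order; a focused crawler differs only in the `priority` callback it supplies.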

All Crawlers follow the above design. The Breadth First Crawler requires only seed pages as input. Best-First and Semantic Crawlers take the seed pages and a user query as input, while Learning Crawlers accept a training set of URLs of pages instead of a query. Crawlers also differ in the way priorities are assigned to extracted URLs. This is the most crucial part in the implementation of focused crawlers.

All Crawlers in this work are implemented in Java [36] using Eclipse [37]. The downloaded pages must be of text/html format and their content size must not exceed 100KB. Restrictions are also imposed on connection timeout and downloading times for performance reasons. Those restrictions apply to all implemented crawlers. The crawling process is repeated until the predefined number of pages is retrieved (Fig. 6). In our experiments this number is set equal to 1000 web pages.
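The download restrictions amount to a simple acceptance test applied to each response. The following Python sketch shows one way to express it; the constant and function names are hypothetical, and reading 100KB as 100 * 1024 bytes is an assumption, since the text does not say which convention is used.

```python
MAX_CONTENT_BYTES = 100 * 1024  # assumption: "100KB" read as 100 * 1024 bytes
MAX_PAGES = 1000                # number of pages retrieved per crawl in the experiments


def is_acceptable(content_type, content_length):
    """Accept only text/html responses whose size does not exceed the limit."""
    # a Content-Type header may carry parameters, e.g. "text/html; charset=UTF-8"
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "text/html" and content_length <= MAX_CONTENT_BYTES
```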

3.2 Classic Crawlers

The Breadth First Crawler forms the baseline for implementing the Best First, Semantic and Learning Crawlers. It is a simple program that gets one or more seed pages as input and follows the links in a breadth first way until the desired number of Web pages is downloaded. Fig. 7 illustrates the interface of the implemented Breadth First crawler. It accepts one or more seed pages as input. Downloaded pages are shown below.


Fig. 7 Screenshot of Breadth First Crawler

3.2.1 Best First Crawler with page content criteria

The second classic Crawler (and the first focused one) implemented is the Best First Crawler using page content for prioritizing candidate URLs. When a Web page is downloaded, its content is lexically analyzed and represented by a term vector. Each term in such a vector is represented by its tf-idf weight according to the VSM [12]. The priority assigned to a link equals the cosine similarity (Eq. 2) of the page containing the link and the user query.


Notice that using inverse document frequency (idf) weights can be problematic because idf weights need to be updated at every crawling step. For this reason idf weights can be inaccurate at the initial steps of crawling, when the number of retrieved pages is small [25]. Most Best First Crawler implementations use only term frequency (tf) weights. In this work idf weights are provided by the IntelliSearch web search engine [41], which holds idf statistics for English terms. At the next step the link with the highest priority is selected for downloading.
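The page content criterion can be illustrated with a small Python sketch that builds tf-idf vectors from a precomputed idf table and compares them with the standard cosine measure, which is assumed here to match Eq. 2. Taking tf as the raw term count, defaulting unknown terms to an idf of 1.0, and all function names are assumptions made for the example; stemming and stop word removal are omitted.

```python
import math
from collections import Counter


def tfidf_vector(terms, idf):
    """Map each term to tf * idf, with tf taken as the raw count (an assumption)."""
    return {term: freq * idf.get(term, 1.0) for term, freq in Counter(terms).items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def page_content_priority(page_terms, query_terms, idf):
    """Priority shared by every link found on the page."""
    return cosine_similarity(tfidf_vector(page_terms, idf),
                             tfidf_vector(query_terms, idf))
```

Because the idf table is fixed in advance, the weights do not have to be recomputed as new pages arrive, which is the property the paragraph above relies on.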

3.2.2 Best First Crawler with anchor text similarity

The second variation of the Best First Crawler is the Best First Crawler using anchor text similarity. The anchor text of a URL is the clickable text that appears on the link inside a Web page pointing to that URL. In this work we implemented a variant of the above Best First Crawler which, instead of page content, uses the anchor text of a URL as the representation of page content and for assigning download priorities. Notice that links from the same page may be assigned different priority values, as opposed to the first implementation (which uses page text content for assigning priorities), where all links in the same page are given the same priority. As will be shown in the results, selecting anchor text for assigning priority values improved the general performance of the crawler under both the harvest ratio and the average similarity criteria (section 4.3).

3.2.3 Best First Crawler with page content and anchor text

The third variation of the Best First Crawler combines the previous two implementations, which use page content and link anchor text respectively. Each URL is assigned a priority value defined as:

$$\mathit{priority}_i = \frac{\mathit{similarity}(p_i, q) + \mathit{similarity}(a_i, q)}{2} \qquad (7)$$

where $\mathit{priority}_i$ is the priority value assigned to link $i$, $\mathit{similarity}(p_i, q)$ is the similarity of query $q$ and $p_i$ (the content of the page where link $i$ is located), and $\mathit{similarity}(a_i, q)$ is the similarity of the anchor text $a_i$ of link $i$ and query $q$.
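A short Python sketch of Eq. 7 shows how two links on the same page receive different priorities once anchor text is taken into account. The `similarity` helper below is a simplified stand-in (cosine over raw term counts of whitespace-separated tokens) rather than the tf-idf measure used in the thesis, and both function names are hypothetical.

```python
import math
from collections import Counter


def similarity(text, query):
    """Cosine similarity over raw term counts (a simplified stand-in for Eq. 2)."""
    u, v = Counter(text.lower().split()), Counter(query.lower().split())
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def combined_priority(page_text, anchor_text, query):
    """Eq. 7: average of page-content and anchor-text similarity to the query."""
    return (similarity(page_text, query) + similarity(anchor_text, query)) / 2
```

For a page whose text is "focused web crawler design" and the query "web crawler", a link with anchor text "web crawler" scores higher than a link with anchor text "contact us", although both share the same page-content term.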

The idea behind the Best First Crawler with page content only is that a page relevant to the topic is more likely to point to a relevant page than to a non-relevant one. Thus, the higher the relevance of the page containing the link, the higher the probability that the link will point to a relevant page.

The second implementation (Best First Crawler using anchor text similarity) tries to overcome a disadvantage of the Best First Crawler with page content only: all links within a page have the same priority regardless of anchor text. Anchor text may be regarded as a summary of the content of the page that the link points to. Therefore it is reasonable to use this descriptor for assigning priorities to pages. However, anchor text isn't always descriptive of page contents, and by ignoring the page content useful information may be left unused. So the third Best First Crawler implementation uses both anchor text and page content.

3.3 Semantic Crawlers

Best First crawlers estimate the relevance between the page content or anchor text and a user query. There may exist conceptually related terms in both the query and the page (or anchor text), indicating a relevance to the topic. However, if these terms aren't lexicographically similar, their relevance will be ignored, because the VSM computes text similarity as a function of similarities between identical terms found in the vectors which are compared. This can be resolved using ontologies or term taxonomies. In ontologies, conceptually similar terms are related by virtue of IS-A links. All terms conceptually similar to user query terms are retrieved from the ontology and used for enhancing the description of the topic (e.g. by adding synonym terms to the topic keywords) and for computing the similarity between the query and candidate pages. For this, various methods have been proposed, including among others the Semantic Similarity Retrieval Model (SSRM) [14] and Mihalcea et al. [15]. The most important representatives of this category of methods are implemented within Best First crawlers, forming what are hereafter called Semantic Crawlers.

In this work, the WordNet [4] term taxonomy is used as an ontology for retrieving conceptually similar terms. WordNet was selected because it provides a vast coverage of the English vocabulary, so it can be used for focused crawling on almost every topic, making our implementation the first general purpose Semantic Crawler. The general design remains similar to that of Classic Focused Crawlers (Fig. 6), but the priorities assigned to links are evaluated using methods such as SSRM [14] and Ehrig et al. [13]. Other parts of the system, such as downloading, link and anchor text extraction, preprocessing and representing texts using Vector Space Model term vectors, remain the same.

In the following, candidate links for downloading are represented by their anchor texts. Each candidate link is assigned a priority value which is computed as the semantic similarity between its anchor text and the topic [14, 15]. In turn, semantic text similarity is computed as a function of the semantic (conceptual) similarities between the terms they contain. This can be defined in many different ways [11], leading to the implementation of three semantic crawlers.

3.3.1 Ehrig Crawler

In this implementation Web pages are represented by the anchor text of the links pointing to them (instead of page content as in [13]). Anchor texts and the user query are represented by term vectors using tf weights [13]. Page priorities are computed as:

$$\mathit{priority}_{Ehrig}(i) = \sum_{k=1}^{n} \sum_{l=1}^{n} \mathit{sim}(t_k, t_l) \cdot w_{ik} \cdot w_{ql} \qquad (8)$$

where $n$ is the total number of terms in the anchor text and the query, $w_{ik}$ and $w_{ql}$ are the tf weights of terms $t_k$ and $t_l$ in the anchor text of link $i$ and in the query respectively, and $\mathit{sim}$ is the term semantic similarity computed using Equation 3. Note that only tf weights are used, without normalizing by vector length (as is recommended for short topic and page descriptions), and that WordNet is used instead of topic specific ontologies as in [13].

3.3.2 SSRM Crawler

SSRM [14] is used for assigning visit priorities to web pages. Specifically, the priority of a URL (represented by its anchor text) is defined as follows:

$$\mathit{priority}_{SSRM}(i) = \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} \mathit{sim}(t_k, t_l) \cdot w_{ik} \cdot w_{ql}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2} \, \sqrt{\sum_{l=1}^{n} w_{ql}^2}} \qquad (9)$$

where $n$ is the total number of terms in the anchor text and the query. The method of Li et al. [42] is the term similarity method used in our implementation. The URL with the highest priority value is downloaded first.
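The two semantic priorities differ only in the normalization by vector length, which the following Python sketch makes explicit. It is an illustration under stated assumptions: `TOY_TERM_SIM` and `toy_term_sim` are hypothetical stand-ins for a WordNet-based term similarity (such as the measure of Li et al.), tf weights are raw counts, and the normalized form follows Eq. 9 as read here, with the product of the two vector lengths in the denominator.

```python
import math
from collections import Counter

# Hypothetical similarity table standing in for a WordNet-based measure.
TOY_TERM_SIM = {frozenset(("car", "automobile")): 0.8}


def toy_term_sim(t1, t2):
    """Identical terms score 1.0; other pairs are looked up in the toy table."""
    if t1 == t2:
        return 1.0
    return TOY_TERM_SIM.get(frozenset((t1, t2)), 0.0)


def ehrig_priority(anchor_terms, query_terms, term_sim=toy_term_sim):
    """Eq. 8: double sum of term similarities weighted by tf, not normalized."""
    a, q = Counter(anchor_terms), Counter(query_terms)
    # terms with zero weight contribute nothing, so only present terms are visited
    return sum(term_sim(tk, tl) * wa * wq
               for tk, wa in a.items() for tl, wq in q.items())


def ssrm_priority(anchor_terms, query_terms, term_sim=toy_term_sim):
    """Eq. 9: the same double sum divided by the product of the vector lengths."""
    a, q = Counter(anchor_terms), Counter(query_terms)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    if norm == 0.0:
        return 0.0
    return ehrig_priority(anchor_terms, query_terms, term_sim) / norm
```

With the anchor text terms `["car", "rental"]` and the query `["automobile"]`, a plain VSM cosine would be zero because no term is shared, while both semantic priorities are positive thanks to the car/automobile pair.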

3.3.3 Semantic Crawler with synonym set expansion

An obvious improvement is to expand text vectors with synonym sets from WordNet and to use both anchor text and page content for computing text similarity and assigning priorities:

$$\mathit{priority}_{synset\ expansion}(i) = \frac{\mathit{similarity}(p_i, \hat{q}) + \mathit{similarity}(\hat{a}_i, \hat{q})}{2} \qquad (10)$$

where $\mathit{priority}_{synset\ expansion}(i)$ is the priority value assigned to link $i$, $\mathit{similarity}(p_i, \hat{q})$ is the cosine similarity of the expanded query $\hat{q}$ (using WordNet synonym sets) and $p_i$ (the content of the page where link $i$ is located), and $\mathit{similarity}(\hat{a}_i, \hat{q})$ is the cosine similarity of the expanded anchor text $\hat{a}_i$ of link $i$ and the expanded query $\hat{q}$.
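The effect of synonym set expansion can be seen in a small Python sketch of Eq. 10. The synonym table `TOY_SYNSETS` is a hypothetical stand-in for WordNet synsets, and adding each synonym with the same count as the original term is an assumption, since the text does not specify how expanded terms are weighted.

```python
import math
from collections import Counter

# Hypothetical synonym sets standing in for WordNet synsets.
TOY_SYNSETS = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
}


def expand(terms, synsets=TOY_SYNSETS):
    """Append the synonyms of every term; each synonym counts once per occurrence."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(sorted(synsets.get(term, ())))
    return expanded


def cosine(u_terms, v_terms):
    """Cosine similarity over raw term counts."""
    u, v = Counter(u_terms), Counter(v_terms)
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def synset_expansion_priority(page_terms, anchor_terms, query_terms):
    """Eq. 10: page content vs. expanded query, expanded anchor vs. expanded query."""
    q = expand(query_terms)
    return (cosine(page_terms, q) + cosine(expand(anchor_terms), q)) / 2
```

For the query `["car"]` and the anchor text `["automobile", "rental"]`, the unexpanded cosine is zero, whereas after expansion the anchor and the query share three terms and the link receives a positive priority.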

3.4 Learning Crawlers

The main idea behind Learning Crawlers is that the crawler learns user preferences on the topic from a set of example pages (the training set). Training may involve learning the path leading to the desired content. In most cases the training set consists of relevant and irrelevant pages. Every downloaded page is classified (based on the results of learning) as relevant or irrelevant and is assigned a priority.

The Context Graph method [31] works not only by classifying the crawled pages as relevant or not relevant, but also by learning the distance (in number of routing hops) that may lead from an irrelevant page to a relevant one (Fig. 3). The non-relevant pages in the training set were downloaded by recursively using Google’s backlink service, starting from relevant pages, in order to compute their distance (level) from the relevant or target pages. During crawling, pages similar to those closer to target pages are given higher priority.
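The level computation described above can be sketched as a breadth-first traversal over backlinks. This is only an illustration under stated assumptions: the `backlinks` dictionary stands in for the backlink service, and the `max_level` cut-off is a hypothetical parameter that the text does not specify.

```python
from collections import deque

def backlink_levels(targets, backlinks, max_level=3):
    """Assign each page its distance (level) from the target pages.

    targets:   iterable of target (relevant) page URLs, placed at level 0
    backlinks: dict mapping a URL to the URLs that link to it
               (hypothetical stand-in for a backlink service)
    max_level: hypothetical cut-off on the number of levels explored
    """
    levels = {url: 0 for url in targets}
    queue = deque(levels)
    while queue:
        url = queue.popleft()
        if levels[url] >= max_level:
            continue  # do not expand pages on the outermost level
        for parent in backlinks.get(url, ()):
            if parent not in levels:  # first visit gives the shortest distance
                levels[parent] = levels[url] + 1
                queue.append(parent)
    return levels
```

Pages on a lower level are closer to the targets, so pages resembling them would receive higher priority during crawling.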

The Hidden Markov Model Crawler [16, 18] extends the previous idea by categorizing pages not only by their distance from a target page but also by using their content, thus estimating a relation between page content and the path leading to relevant pages. Initially, a user browses a sequence of pages, labeling them as relevant or not. As pages are downloaded, the visiting sequence is recorded and a context graph is created without the need of a backlink service as in [31].

Fig. 8 Outline of learning focused crawling [Diagram components: User training module, Hidden Markov Model Initialization, Crawling Component]

Figure 8 illustrates the functional components of the HMM crawler implemented:

I. Training component: The first component records the URLs visited by the user and the page view sequence. Then it downloads the pages and computes the tf-idf vectors representing their content. Finally, the pages are clustered using a clustering algorithm. In our implementation K-Means and X-Means [17] were used for clustering.

II. HMM initialization: The second component takes the HMM representation of the user training set (as in Fig. 4) as input and computes the Hidden Markov Model parameters (i.e. the π, A and B matrices). This component is applied during the initialization phase, before crawling.

III. Crawling component: It downloads selected pages, extracts content and links, processes the page content and assigns the page to a cluster using the K-Nearest Neighbors algorithm [43]. Given the page cluster and the Hidden Markov Model parameters (the π, A and B matrices), the probability that the next page visited will be a target page is computed using the Viterbi algorithm [40]. That probability also represents the visit priority of the link. If two clusters yield almost identical probabilities (i.e. the difference of the probabilities is below a predefined threshold ε), then higher priority is assigned to the cluster leading with higher probability to target pages in two steps (also computed by applying the Viterbi algorithm).
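The tie-breaking rule of the crawling component can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, the dictionary inputs and the default value of `eps` are assumptions, since the text only states that the threshold ε is predefined.

```python
from functools import cmp_to_key

def rank_clusters(one_step, two_step, eps=0.01):
    """Order cluster ids from highest to lowest visit priority.

    one_step[c]: probability of reaching a target page in one step from cluster c
    two_step[c]: probability of reaching a target page in two steps from cluster c
    eps:         hypothetical value for the predefined threshold
    """
    def compare(c1, c2):
        diff = one_step[c1] - one_step[c2]
        if abs(diff) >= eps:
            # one-step probabilities differ clearly: the larger one wins
            return -1 if diff > 0 else 1
        # almost identical one-step probabilities: fall back to two steps
        diff2 = two_step[c1] - two_step[c2]
        return -1 if diff2 > 0 else (1 if diff2 < 0 else 0)

    return sorted(one_step, key=cmp_to_key(compare))
```

Note that a threshold-based comparison is not strictly transitive, so with many near-equal clusters the resulting order can depend on the input order.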

Three Learning crawlers have been implemented: the first is the Hidden Markov Crawler (variants proposed in [16] and [18]). The next two variants (Hybrid Crawlers) are proposed in this thesis. They combine the page priority functions computed by the Hidden Markov Model Crawler and that of the Best First Crawler in order to evaluate the overall priority value of a Web page.

3.4.1 Hidden Markov Model Crawler

Two variants of this crawler have been implemented:

a) The first Hidden Markov Model implementation uses the K-Means algorithm for clustering as described in [16]. In this work the dimensionality reduction step is omitted. K was set to 5, and the last (fifth) cluster holds the relevant pages. Page priorities (priority_hmm) are computed using the Viterbi [40] algorithm (Fig. 9).

b) The second variant is almost identical to the previous one, but X-Means [17] is used instead of K-Means. Other minor modifications are that (a) idf weights are not precomputed (as in the previous variant) but are computed using the training set, and (b) the relevant pages don’t form a separate cluster but may belong to the same cluster as non-relevant pages.

As will be shown in the experiments, the two variants demonstrated identical performance. The first variant was used for comparisons with the Hybrid Crawlers proposed in this work.

Figure 9 summarizes the HMM Crawler's priority assignment procedure:

Fig. 9 HMM Crawler priority estimation algorithm

Input: Training set, candidate page (p).

Output: priority value priority_hmm(p) assigned to candidate page p.

1. Cluster the training set using the K-Means or X-Means algorithm.

2. Compute the π, A, B matrices.

3. Classify candidate page p to a cluster c_t using the K-Nearest Neighbor algorithm.

4. Compute the hidden state probabilities for the current step using the Viterbi formula:

   $a(L_j, t) = b_{j c_t} \sum_{i=0}^{states} a(L_i, t-1) \cdot a_{ij}$

5. Compute the estimate of the hidden state probabilities for the next step using the formula:

   $a(L_j, t+1) = \sum_{i=0}^{states} a(L_i, t) \cdot a_{ij}$

6. Assign priority priority_hmm(p) = $a(L_0, t+1)$ to page p.
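Steps 4 to 6 of Fig. 9 can be sketched in a few lines. The sketch assumes that the recurrence sums over the previous hidden states and that state 0 corresponds to the target pages; the function name and the list-of-lists layout of the A and B matrices are illustrative choices, not taken from the thesis.

```python
def hmm_priority(prev, A, B, observed_cluster):
    """Estimate priority_hmm for a candidate page.

    prev[i]:          a(L_i, t-1), state probabilities at the previous step
    A[i][j]:          transition probability a_ij from state i to state j
    B[j][c]:          probability b_jc of observing cluster c in state j
    observed_cluster: index c_t of the cluster the page was classified to
    """
    n = len(prev)
    # Step 4: a(L_j, t) = b_{j,c_t} * sum_i a(L_i, t-1) * a_ij
    current = [
        B[j][observed_cluster] * sum(prev[i] * A[i][j] for i in range(n))
        for j in range(n)
    ]
    # Step 5: a(L_j, t+1) = sum_i a(L_i, t) * a_ij
    nxt = [sum(current[i] * A[i][j] for i in range(n)) for j in range(n)]
    # Step 6: the priority is the estimate for state 0 (assumed target state)
    return nxt[0], current
```

The second return value holds the state probabilities of the current step, so it can be passed as `prev` when scoring pages reached from this one.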


3.4.2 Hybrid Crawlers

Two variants of hybrid crawlers are implemented:

a) Hybrid Markov Model Crawler: The Hidden Markov Model Crawler suffers from at least two drawbacks: (a) it doesn’t assign different priorities to pages belonging to the same cluster and (b) it is very difficult to represent the set of Web pages not relevant to the topic by clusters (it is a very heterogeneous set).

A hybrid approach combining the text similarity of a page with the centroid of the cluster containing the positive example pages (using VSM) is proposed here for dealing with these two problems. The centroid is computed as the average vector of the pages belonging to the cluster. The text similarity of candidate pages with the centroid may differ even if the pages belong to the same cluster, thus dealing with the first problem mentioned. Similarity with the centroid of relevant pages is not affected by the way non-relevant pages are represented, thus dealing with the second problem as well.

The Hybrid Markov Model Crawler differs from the HMM Crawler in the way priorities are assigned to candidate pages. It computes a priority score for a page using the Hidden Markov Model (priority_hmm) and also computes the text similarity of the page content with the centroid of the cluster containing the relevant pages from the user training set using equation 2. Finally, the priority of page p_i is computed as follows:

$priority_{hybrid}(p_i) = \frac{similarity(p_i, c_r) + priority_{hmm}(p_i)}{2}$   (11)

where c_r is the centroid of the relevant pages in the training set, similarity(p_i, c_r) is the cosine similarity of page content p_i with the centroid c_r of relevant pages, priority_hmm(p_i) is the priority assigned to link i of page p_i using the Hidden Markov Model and priority_hybrid(p_i) is the priority assigned to links in page p_i by the Hybrid Crawler.

b) Hybrid HMM Crawler with page content and anchor text: An obvious extension to method (a) is to use both anchor and page text in the computation of page priorities. This leads to the following equation:

$priority_{hybrid,anchor}(anchor_i) = \frac{priority_{hmm}(p_i) + \frac{similarity(p_i, c_r) + similarity(anchor_i, c_r)}{2}}{2}$   (12)

where c_r is the centroid of the relevant pages in the training set, similarity(p_i, c_r) is the cosine similarity of page content p_i with the centroid c_r of relevant pages, similarity(anchor_i, c_r) is the cosine similarity of link anchor text anchor_i with the centroid c_r of relevant pages, priority_hmm(p_i) is the priority value assigned to links in page p_i using the Hidden Markov Model and priority_hybrid,anchor(anchor_i) is the priority assigned to the link with anchor text anchor_i by the Hybrid HMM Crawler with page content and anchor text.
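The two hybrid priority functions (equations 11 and 12) can be illustrated with a short sketch. It assumes that pages, anchor texts and the centroid are sparse term-weight dictionaries and that priority_hmm has already been computed; the function names are hypothetical and the vector representation is a simplification of the VSM used in the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Average vector of the relevant training pages (c_r)."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}

def priority_hybrid(page_vec, c_r, priority_hmm):
    """Equation 11: mean of page-centroid similarity and the HMM priority."""
    return (cosine(page_vec, c_r) + priority_hmm) / 2.0

def priority_hybrid_anchor(page_vec, anchor_vec, c_r, priority_hmm):
    """Equation 12: the text score averages page and anchor similarity."""
    text_score = (cosine(page_vec, c_r) + cosine(anchor_vec, c_r)) / 2.0
    return (priority_hmm + text_score) / 2.0
```

Because the similarity term depends on the individual page (and link), two candidates in the same cluster generally receive different priorities, which is the first drawback the hybrid approach addresses.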

The priority function of equation 12 improves the performance of the hybrid Crawler. As will be shown in the experiments, when anchor text is used the crawler is even more focused on the topic. Figure 10 illustrates the operation of hybrid crawlers:

Fig. 10 Hybrid crawlers operation. [Diagram: clusters 0, 1 and 2 with the centroid of cluster 0; pages labelled L0 to L3; candidate pages.]

In Figure 10 two pages (blue circles) are candidates for downloading. The HMM Crawler will assign higher priority to candidate page p1 belonging to cluster 1, since this cluster leads with higher probability to target pages (cluster 0) in two link steps (the probability of leading to cluster 0 in one step is identical for clusters 1 and 2). Instead, a Hybrid crawler will select for expansion the page p2 belonging to cluster 2 because of its proximity (similarity) with the centroid of cluster 0 (the cluster containing the relevant pages from the training set).

3.5 Summary

Classic crawlers, including the well known Breadth-First crawler and variations of the Best-First Crawler presented in this chapter, have been implemented in the current thesis. Semantic crawlers, including a variation of the Ehrig crawler using WordNet and the novel SSRM and Synonym set expansion crawlers, have been implemented and compared with state of the art Best First Crawlers. Finally, a set of state of the art HMM crawlers, including [16, 18], and the hybrid crawlers proposed here are also implemented and their performance is evaluated.


Chapter 4. Experimental Results

4.1 Introduction

The following set of experiments is designed to:

a) Provide a critical evaluation of the various types of crawlers examined in this work, including classic (Breadth-First), topic driven (Best-First and its variants, including Semantic crawlers), Learning and Hybrid crawlers.

b) Demonstrate the superiority of the new Hybrid crawler proposed in this work over state of the art HMM learning crawlers such as [16, 18].

Six different topics were used (“linux”, “asthma”, “robotics”, “dengue fever”, “java programming” and “first aid”) and the ability of the crawlers to download pages on the above topics was measured. Their performance was computed using two well established measures referred to as harvest ratio and average similarity. Each crawler downloaded 1000 pages and its average performance (over all topics) was computed using both criteria. The pages judged relevant were provided by the user, who manually inspected the results obtained by the Google search engine on each topic. These results were used as ground truth and compared with the results obtained by the crawlers. The more similar (to ground truth) the results of a crawler are, the more successful the crawler is (the higher the probability that the crawler retrieves results similar to the topic). Page to topic relevance is computed by VSM in all cases.


4.2 Performance measures

Two different evaluation criteria were used:

a) Harvest ratio: For every page, its cosine similarity with all pages judged as relevant by the user is computed and the maximum of these cosine similarities is taken. If the maximum similarity is greater than a predefined threshold (0.75 in this work), then the page is marked as relevant (otherwise the page is marked as irrelevant). The harvest ratio is defined as the percentage of downloaded pages with similarity greater than the threshold (in this thesis the number of relevant pages was used instead of their fraction among the total number of downloaded pages).

b) Average similarity: The maximum similarity of each downloaded page with all pages marked as relevant is computed. The average similarity is defined as the average value of these similarities for all downloaded pages.

The first criterion is more selective than the second. Harvest ratio can be adjusted (by using a higher threshold) to measure the ability of the crawler to download pages highly relevant to the topic. An application called “evaluator” was developed for automating the evaluation process. It receives as input the positive pages set (50 relevant pages on every topic in our experiment) and the 1000 evaluated pages downloaded by the crawler, and computes the performance of the crawler at hand with both criteria.
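The two criteria can be sketched as a small evaluation routine. This is not the “evaluator” application itself, only an illustration of the definitions above; it assumes pages are given as sparse term-weight dictionaries, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def evaluate(downloaded, relevant, threshold=0.75):
    """Return (number of relevant pages, harvest ratio, average similarity).

    downloaded: vectors of the pages fetched by the crawler
    relevant:   vectors of the pages judged relevant (the positive set)
    threshold:  0.75 is the value stated in the text
    """
    # maximum similarity of each downloaded page with the positive set
    max_sims = [max(cosine(page, r) for r in relevant) for page in downloaded]
    n_relevant = sum(1 for s in max_sims if s > threshold)
    harvest_ratio = n_relevant / len(max_sims)
    average_similarity = sum(max_sims) / len(max_sims)
    return n_relevant, harvest_ratio, average_similarity
```

Raising `threshold` makes the harvest ratio stricter, while the average similarity is unaffected by it.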


4.3 Experiment setup

The following crawlers are compared:

1) Non Focused Crawlers:
   a) Breadth First Crawler

2) Classic Focused Crawlers:
   b) Best First Crawler with page content
   c) Best First Crawler with anchor text
   d) Best First Crawler with page content & anchor text

3) Semantic Crawlers:
   e) Semantic Crawler using the Ehrig et al. [13] method for text similarity estimation.
   f) Semantic Crawler using the SSRM [14] method for text similarity estimation.
   g) Semantic Crawler with Synset Expansion.

4) Learning Crawlers:
   h) Hidden Markov Model Crawler
   i) Hybrid Hidden Markov Model Crawler
   j) Hybrid Hidden Markov Model Crawler with page content & anchor text.

All Crawlers were evaluated using the following topics and seed pages:

query               seed
Linux               http://dir.yahoo.com/Computers_and_Internet/Software/Operating_Systems/UNIX/Linux
Asthma              http://dir.yahoo.com/Health/Diseases_and_Conditions/Asthma/
Robotics            http://dir.yahoo.com/Science/Computer_Science/
Dengue Fever        http://health.yahoo.com/
Java programming    http://dir.yahoo.com/Computers_and_Internet/
First Aid           http://dir.yahoo.com/Health/

Fig. 11 Experiment setup

1000 pages were downloaded for each crawler and for each topic. Notice that in four out of the six topics the seed page doesn’t directly link to target pages.


The experiments in this section are organized by crawler type, showing a comparison between various implementations of crawlers of the same type. Specifically, the experiments are organized as follows:

a) Classic Focused Crawlers Experiment
Crawlers (a)-(d) were evaluated using the six topics of Fig. 11.

b) Semantic Crawlers Experiments
Crawlers (e)-(g), and (c)-(d) for comparison, were evaluated using the six topics of Fig. 11.

c) Learning Crawlers Experiment
Crawlers (h)-(j) were evaluated using four topics (“Robotics”, “Dengue Fever”, “Java Programming” and “First Aid”).

In the experiments below each method is represented by a plot showing the measured value on the Y axis (the number of relevant pages for harvest ratio, or the average similarity) as a function of the total number of pages retrieved.

Notice that Learning Crawlers have a different input (the training set) than the Classic and Semantic focused Crawlers (which have the user query as input), so direct comparisons between the performance of learning and other categories of crawlers are not really plausible.
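The points of such a harvest ratio plot can be derived from the per-page relevance judgments in crawl order. The sketch below is illustrative only: the function name is hypothetical and the checkpoint interval of 50 pages is an assumption read from the tick marks of the plots.

```python
def harvest_curve(relevance_flags, step=50):
    """Return (pages crawled, relevant pages so far) at regular checkpoints.

    relevance_flags: booleans in crawl order, True if the page was marked
                     relevant (maximum similarity above the threshold)
    step:            assumed checkpoint interval
    """
    points = []
    relevant_so_far = 0
    for n, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
        if n % step == 0:
            points.append((n, relevant_so_far))
    return points
```

A curve that rises steeply corresponds to a crawler that stays focused on the topic as the crawl proceeds.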


4.4 Classic Focused Crawlers

Fig. 12 Harvest ratio for classic crawlers

[Plot: relevant pages (y axis) versus crawled pages (x axis, 50 to 1000); series: Breadth First, Best First-page content, Best First-anchor text, Best First-content & anchor text.]

The comparison in Fig. 12 indicates the poor performance of the Breadth First Crawler, as expected for a non focused crawler. The fact that the Best First Crawler using anchor text only outperforms the crawler using only page content indicates the value of anchor text for computing page to topic relevance.

The crawler combining page and anchor text demonstrated superior performance. This result indicates that Web content relevance is not computed by page or anchor text alone. Instead, the combination of page content and anchor text forms a more reliable page description.


Fig. 13 Average similarity for classic focused crawlers

[Plot: average similarity (y axis, 0% to 70%) versus crawled pages (x axis, 50 to 1000); series: Breadth First, Best First-page content, Best First-anchor text, Best First-content & anchor text.]

Fig. 13 confirms the results of the previous comparison. Overall, a best first crawler combining page and anchor text achieves superior performance over all its competitors with both criteria.

4.5 Semantic Crawlers

The second experiment measures the performance of semantic crawlers using the six topics of Fig. 11 (as in the previous experiment).


Fig. 14: Harvest Ratio for Semantic Crawlers.

[Plot: relevant pages (y axis) versus crawled pages (x axis, 50 to 1000); series: Semantic Crawler Ehrig method, Semantic Crawler SSRM method, Best First-anchor text, Best First Content & anchor text, Semantic Crawler with synset expansion.]

Fig. 14 illustrates only marginal performance improvements of semantic crawlers over best first crawlers. It is conjectured that the poor performance of semantic crawlers should not be regarded as a failure of semantic crawlers but rather as a failure of WordNet to provide terms conceptually similar to the topic. WordNet is a general taxonomy for English terms and not all linked terms are actually very similar, implying that the results can be improved by using topic specific ontologies. Such topic specific ontologies on several diverse topics were not available to us for these experiments.


Fig. 15: Average Similarity for Semantic Crawlers

[Plot: average similarity (y axis, 0% to 70%) versus crawled pages (x axis, 50 to 1000); series: Semantic Crawler Ehrig method, Semantic Crawler SSRM method, Best First-anchor text, Best First content & anchor text, Semantic Crawler with synset expansion.]

Results with average similarity actually confirmed the results of Fig. 14. Here semantic crawlers again improved the results of best first crawlers but only marginally, indicating that average similarity (a less strict criterion) is more tolerant to relaxed interpretations of conceptual similarity as provided by WordNet and term similarity measures (such as Li et al. [42]).

4.6 Learning Crawlers

The results below are taken on four topics (“robotics”, “dengue fever”, “java programming” and “first aid”) and measured on the first 1000 web pages returned by each crawler on each topic. Only Learning crawlers were evaluated in this experiment. Two variants of HMM Crawlers were tested, corresponding to different implementations of the clustering component (with K-Means and X-Means respectively). The results indicate that the K-Means (using K=5 as suggested in [16]) and X-Means Hidden Markov Model Crawlers have identical performance. Both crawlers demonstrated poor performance (Figs. 16-17) and this can be attributed to several reasons: both variants don’t assign different priorities to pages in the same cluster, or to links in the same page. Both variants must be provided with a training set very similar in content and link structure to the part of the Web that will be crawled (something not always achievable). Because the two HMM Crawlers (using X-Means and K-Means) have identical performance, the first variant (HMM Crawler using K-Means) was chosen for comparison with the other Learning Crawlers.

In Fig. 16 the performance of the HMM crawler is compared with the performance of the new Hybrid crawlers (using a combination of page content and anchor text) proposed in this work. The first (Hybrid HMM using page content) prioritizes links using equation 11 (similarity of the page containing the links with the centroid of the relevant pages in the training set). In addition to that, the second implementation (Hybrid HMM Crawler with anchor text) also combines the similarity of the centroid with the anchor text of links pointing to candidate pages for priority assignment, as suggested by equation 12.


Fig. 16: Harvest Ratio for HMM & Hybrid Crawlers
[Plot: relative pages (y-axis, 0 to 50) against crawled pages (x-axis, up to 1000) for the HMM Crawler, the Hybrid HMM Crawler with page content, and the Hybrid HMM Crawler with page content & anchor text.]

Fig. 17: Average Cosine Similarity for HMM & Hybrid Crawlers
[Plot: average similarity (y-axis, 0% to 50%) against crawled pages (x-axis, up to 1000) for the HMM Crawler, the Hybrid HMM Crawler with page content, and the Hybrid HMM Crawler with page content & anchor text.]

The Hybrid crawlers outperform the Hidden Markov Model crawler on both criteria. Using the centroid of the positive examples as a query clearly increases performance because it overcomes the problems of HMM crawlers. As Fig. 16 and Fig. 17 indicate, the results obtained by the Hybrid crawlers are promising and may lead to further research in this direction.
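The two criteria plotted in Fig. 16 and Fig. 17 can be computed along a crawl roughly as sketched below. The definitions used here are assumptions: harvest ratio is taken to be the fraction of crawled pages judged relevant (the common definition; the thesis states its own criteria in an earlier chapter and may plot page counts instead), and average similarity is the mean similarity score of the crawled pages to the topic. The function names and the 50-page step are hypothetical.

```python
def harvest_ratio(relevance_flags):
    """Fraction of the crawled pages that were judged relevant."""
    if not relevance_flags:
        return 0.0
    return sum(1 for flag in relevance_flags if flag) / len(relevance_flags)


def average_similarity(similarities):
    """Mean similarity score of the crawled pages to the topic."""
    if not similarities:
        return 0.0
    return sum(similarities) / len(similarities)


def metric_curve(values, metric, step=50):
    """Evaluate `metric` on growing prefixes of the crawl, every `step` pages."""
    return [(n, metric(values[:n])) for n in range(step, len(values) + 1, step)]


# Illustrative crawl log: the first 50 pages relevant, the next 50 not.
flags = [True] * 50 + [False] * 50
print(metric_curve(flags, harvest_ratio))
```

Evaluating the metric on prefixes of the crawl gives one point per 50 crawled pages, which is how a curve over "crawled pages" can be drawn for each crawler.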

4.7 Discussion

The Classic Focused Crawler results show that combining page content and anchor text (Best First Crawler - page content and anchor text) yields the best results. Together, page content and anchor text form a representative content descriptor for web pages. Semantic Crawlers, when combined with a general purpose ontology, performed poorly compared to Best First crawlers. By restricting semantic relations to synonym sets (Semantic Crawler - Synset expand method), performance improved marginally. Synonyms, although not lexically similar, succeed in identifying pages with content similar to the topic, which suggests that further performance improvements can be expected from topic specific ontologies rich in terms very similar to the terms of the topic. At this point, ontologies of this type are not available to us. Both Hybrid Crawlers achieve better performance than the Hidden Markov Model Crawler. The results obtained indicate that positive examples are more important than negative ones during training in an environment such as the World Wide Web. Using only positive examples, the performance of learning crawlers is expected to improve.
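The restriction of semantic relations to synonym sets discussed above can be illustrated with a small sketch. It is a toy under stated assumptions: the synonym table is hypothetical and stands in for WordNet synsets, and relevance is scored by simple term overlap rather than by the semantic similarity measure used in the thesis.

```python
def expand_with_synonyms(topic_terms, synsets):
    """Add to the topic only the terms that share a synonym set with a topic term."""
    expanded = set(topic_terms)
    for term in topic_terms:
        for synset in synsets:
            if term in synset:
                expanded.update(synset)
    return expanded


def overlap_score(text, terms):
    """Fraction of the words in `text` that appear in `terms`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in terms) / len(words)


# Hypothetical synonym sets (illustrative stand-in for WordNet synsets).
synsets = [{"car", "automobile", "auto"}, {"doctor", "physician"}]
topic = ["car"]
expanded = expand_with_synonyms(topic, synsets)

# A page that never uses the word "car" scores zero against the raw topic
# but is matched once the topic is expanded with its synonyms.
print(overlap_score("automobile dealers", set(topic)))
print(overlap_score("automobile dealers", expanded))
```

Broader relations such as hypernyms are deliberately left out here, mirroring the observation that only the synonym restriction gave a marginal improvement.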


Chapter 5. Conclusions and future work

In the present thesis, several variants of focused crawlers were implemented and evaluated using common evaluation criteria. First, the Breadth First Crawler and variants of the Best First Crawler using page content, anchor text or both were compared. Then semantic relations were used in the implementation of three Semantic Crawlers that were compared with classic focused crawlers (variations of the Best First Crawler). Finally, based on the Hidden Markov Model learning crawler, two novel hybrid crawlers combining elements from learning and classic focused crawlers were implemented and evaluated.

The experimental results indicate that the implementation of focused crawlers is a process where minor changes in the crawler design have a great effect on performance. The combination of anchor text and page content yields a great performance improvement in the case of classic, semantic and learning focused crawlers. The addition of semantic relations did not improve performance, with the exception of expansion with synonyms, where semantic relations are restricted to synonym terms. Performance is expected to improve by using application specific ontologies (related to the topic) instead of general purpose ontologies such as WordNet.

Learning Crawlers take as input user selected pages not described by a simple query. Not only do Learning Crawlers receive different input than other focused crawlers, they are also intended to perform a very difficult task: they attempt to learn web crawling patterns leading to relevant pages, possibly through other non-relevant pages, thus increasing the probability of failure (since web structures cannot always be modeled by such link patterns). However, the idea looks promising overall and may lead to even more successful implementations of learning crawlers in the future. The present work can be regarded as a contribution in that direction.

Another direction for future work would be to do more elaborate tests with semantic crawlers, making use of topic specific ontologies (e.g. medical ontologies for applications related to health care). The positive results obtained by hybrid crawlers indicate that the relevance of a candidate page to the set of positive examples alone is an effective way of assigning priorities to candidate pages. Using only positive examples (instead of positive and negative) might improve the performance of learning crawlers in terms of speed and accuracy.

