
Page 1: Design and Implementation of a Web Log Preprocessing System Supporting Path Completion Batchimeg AI lab. 2005.04.19


Page 2

Outline

• Introduction
• Background
• Related work
• Proposed System
• Experiment and Result
• Conclusion and Future work

Page 3

Introduction

Web Log Mining Process

[Figure: a visitor uses the Web site (viewing news, e-mail, download, shopping, auction). The Web log data saved in the Web server goes through Preprocessing, Pattern Discovery and Pattern Analysis into a DB, which supports data analysis (visualization tools, knowledge query, intelligent agents).]

Logged data: IP, OS, Agent, Time, URL, Refer page, Date, Cookie, Method, Status, UserID, bytes, …

My research area: Web log preprocessing

Page 4

Background (1/4)

Log format:

– Client IP - 210.126.19.93
– Date - 23/Jan/2005
– Accessed time - 13:37:12
– Method - GET (request a page), POST (send data to the server), HEAD (request headers only)
– Protocol - HTTP/1.1
– Status code - 200 (success); 301 (redirect); 401, 500 (errors)
– Size of file - 2705
– Agent type - Mozilla/4.0
– Operating system - Windows NT

http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225
→ http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225

After viewing the news article, the visitor (210.126.19.93) sent it to a friend.

210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 "http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" …

285014 lines of records
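As a rough illustration of how the fields listed above can be pulled out of one log line, here is a minimal Python sketch. It assumes the log follows the common "combined" layout shown in the example; the names `LOG_PATTERN` and `parse_line` are hypothetical and are not part of the presented system.

```python
import re
from datetime import datetime

# Assumed layout: ip, two unused fields, [time], "request", status, size,
# "referrer", "agent" (the layout of the example line on this slide).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
record = parse_line(sample)
```

The referrer field is what later makes it possible to see which page led to which, so it is kept as its own named group.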

Page 5

Background (2/4) - User identification, Session Identification

Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting

User identification identifies each user accessing the Web site: User IP + Browser (UserID + IP + OS, or cookie) => identify the users.
Session identification finds each user's access patterns and frequently accessed paths.

User Identification (IP, Browser)

IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F,A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0(Windows NT) | N,O

Session Identification

IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F
202.131.3.100 | Mozilla/5.0(Windows NT) | A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/5.0(Windows NT) | N,O

Page 6

Background (3/4) Server Log and Caching

If the client had to request every web page from the server, browsing would be slower.

The solution to this problem is caching.

Clients and proxy servers save local copies of pages, which are then used for "back" and "forward" navigation.

[Figure: Missed Page Views at Server. A Client, Cache and Server diagram with request/send arrows for pages P3 to P6; a later request for P3 is marked "Never logged by server".]
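The effect of caching on the server log can be illustrated with a small simulation. This is only a sketch under a simplifying assumption (every page is cached after its first request and never expires); the function name `browse` and the page letters are hypothetical.

```python
def browse(clicks):
    """Return the pages the server would log for a sequence of page views.

    Assumption: a page is fetched from the server only the first time;
    later views (for example via the "back" button) come from the cache.
    """
    cache = set()
    server_log = []
    for page in clicks:
        if page not in cache:
            server_log.append(page)  # cache miss: the server sees the request
            cache.add(page)
        # cache hit: nothing reaches the server, so nothing is logged
    return server_log

viewed = ["A", "B", "C", "D", "C", "B", "F"]  # what the visitor actually saw
logged = browse(viewed)                        # what the server recorded
```

Here the two "back" steps to C and B never reach the server, which is exactly the kind of gap that path completion tries to fill.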

Page 7

Background (4/4) - Path completion

Preprocessing steps: Cleaning Log → User Identification → Session Identification → Path Completion → Formatting

Not all requested pages are recorded in the Web log, because of caching.

[Figure: Topological Structure. A tree of the pages A.html to Q.html.]

Path completion

Before | After
A,B,C,D,F | A,B,C,D,C,B,F
A,L | A,L
A,B,G,I | A,B,A,G,I
N,O | N,O

Page 8

Related work

Related works | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X

O = used, X = not used

Page 9

Proposed System (1/7) (preprocessing)

[Figure: preprocessing pipeline of the proposed system]

• Data cleaning (eliminate irrelevant info)
• Web site's topological structure (find the hyperlink relations between web pages): construct the site topological structure from the web log data in the server
• User identification, session identification (identify each user, find each user's access pattern)
• Path completion
• User grouping
• Result: after session identification and path completion; user grouping, user identification

Why preprocessing?

Preprocessing can take up to 60-80% of the time spent analyzing the data. An incomplete preprocessing task can easily result in invalid patterns and wrong conclusions.
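The data cleaning step named on this slide could look roughly like the following Python sketch. It assumes that "irrelevant info" means requests for image files and for robots.txt, as the related-work comparison suggests; the extension list and the name `is_relevant` are assumptions, not the system's actual rules.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of image extensions to drop; the real system may use another list.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}

def is_relevant(url):
    """Keep page requests; drop image files and robots.txt requests."""
    path = urlsplit(url).path.lower()
    if path.endswith("/robots.txt"):
        return False
    extension = posixpath.splitext(path)[1]
    return extension not in IMAGE_EXTENSIONS

requested = [
    "/modules.php?name=News&file=article&catid=25&sid=8225",
    "/images/logo.gif",
    "/robots.txt",
    "/Sport/Team/football.html",
]
cleaned = [url for url in requested if is_relevant(url)]
```

Filtering on the URL path rather than the full URL keeps query strings such as `?name=News` from hiding the file extension.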

Page 10

Proposed System (2/7)

Make the site topological structure
Helps solve data preprocessing and analysis tasks:
- user identification
- path completion

Goal of the proposed system
Discover similar user groups, relevant page groups and frequently accessed paths

Page 11

Proposed System (3/7)

Algorithm of Topological Structure (Make Topological Structure)

[Flow chart, nodes in order of appearance:]
• Begin
• Not end of log file? (Yes / No)
• Enter URL to URL_Queue
• URL queue not empty? (Yes / No)
• Get head, define depth
• Find "http" data
• Add link to the Topo_Str_DB
• Is there another record? (Yes / No)
• End
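One way to read the flow chart is as a queue-driven walk over referrer-to-URL pairs taken from the log. The sketch below follows that reading in Python; it is not the author's implementation. The function name `build_topology`, the in-memory dictionaries standing in for Topo_Str_DB, and the sample records (page names borrowed from the olloo.mn example in this deck) are all assumptions.

```python
from collections import defaultdict, deque

def build_topology(records, root):
    """Build (links, depth) from (referrer, url) pairs found in the log.

    links maps each page to the set of pages reached from it;
    depth maps each reachable page to its distance from the root page.
    """
    links = defaultdict(set)
    for referrer, url in records:
        if referrer and referrer != url:
            links[referrer].add(url)

    depth = {root: 0}
    queue = deque([root])           # plays the role of the URL queue
    while queue:
        page = queue.popleft()      # "get head, define depth"
        for child in sorted(links.get(page, ())):
            if child not in depth:  # first visit fixes the depth
                depth[child] = depth[page] + 1
                queue.append(child)
    return dict(links), depth

records = [
    ("", "Index.html"),
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport/Team/", "Sport/Team/football.html"),
]
links, depth = build_topology(records, "Index.html")
```

A breadth-first walk is used so that each page gets the smallest depth at which it can be reached, which is one plausible meaning of "define depth" in the chart.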

Page 12

Proposed System (4/7) - Make the topological structure

Topological Structure
- input: URL path and link
- output: complete sitemap (tree)

Queue (link, path, depth and referrers), listed as depth. page:
0. Index.html (A)
1. L.html (referrer)
2. Sport/Team/football.html
2. Sport/News/Mongolia.html
1. Sport.html
2. Sport/Team/
3. Sport/Team/football.html
2. Sport/Advice/
..

[Figure: site tree with depth levels 0 to 3 for Index.html (A), L.html, Sport.html, Sport/News/Mongolia.html, Sport/Team/, Sport/Team/football.html and Sport/Advice]

Referrer and link pairs:
olloo.mn/L.html
olloo.mn/L.html Sport/Team/football.html
olloo.mn/L.html Sport/News/Mongolia.html
olloo.mn/Sport.html
olloo.mn/Sport.html /Team/football.html
olloo.mn/Sport.html /Advice/

Page 13

Proposed System (5/7) - User Identification

Flow chart of User Identification algorithm (.. for similar user group)

[Flow chart, nodes in order of appearance:]
• Begin
• Not end of log DB? (Yes / No)
• IP not in IPSet? (Yes / No)
• Save the IP, Agent and OS
• Are the current IP's Agent and OS the same? (Yes / No)
• Assign to the User Set, increase User counter
• Are there other records? (Yes / No)
• End
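The idea behind this flow chart, treating a request as a new user when its IP is new or when a known IP arrives with a different agent or operating system, can be sketched in Python as follows. The function name `identify_users` and the tuple layout are assumptions made for illustration only.

```python
def identify_users(records):
    """Assign a user number to each (ip, agent, os) record.

    A new user is counted when the IP has not been seen before, or when a
    known IP shows up with a different agent or operating system.
    """
    user_ids = {}      # (ip, agent, os) -> user number
    assignment = []
    for ip, agent, os_name in records:
        key = (ip, agent, os_name)
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1  # increase the user counter
        assignment.append(user_ids[key])
    return assignment, len(user_ids)

records = [
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/4.0", "Win2000"),
    ("210.126.19.93", "Mozilla/4.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
]
assignment, user_count = identify_users(records)
```

With this input the shared IP 202.131.3.100 is split into two users because the browser and operating system differ, which is the case a plain IP count would miss.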

Page 14

Proposed System (6/7) - Session identification

Flow chart of Session Identification algorithm

[Flow chart, nodes in order of appearance:]
• Begin
• Not end of log DB? (Yes / No)
• Refer page empty? (Yes / No)
• IP not in User Set? (Yes / No)
• Start new session
• Are there other records? (Yes / No)
• Append the page to the session
• Time taken > 25.5? (Yes / No)
• Go to path completion
• End
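A minimal Python sketch of the session rules in this chart is given below. It assumes that the 25.5 threshold is in minutes (the slide does not state the unit) and that requests arrive sorted by time; `split_sessions` and the record layout are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumption: the "25.5" on the slide is a timeout in minutes.
TIMEOUT = timedelta(minutes=25.5)

def split_sessions(requests):
    """Group (user, time, url, referrer) requests into sessions per user.

    A new session starts when the user is new, the referrer is empty,
    or the gap since the user's previous request exceeds TIMEOUT.
    """
    sessions = {}    # user -> list of sessions (each a list of urls)
    last_seen = {}   # user -> time of the previous request
    for user, time, url, referrer in requests:
        user_sessions = sessions.setdefault(user, [])
        starts_new = (
            not user_sessions
            or not referrer
            or time - last_seen[user] > TIMEOUT
        )
        if starts_new:
            user_sessions.append([])
        user_sessions[-1].append(url)
        last_seen[user] = time
    return sessions

t0 = datetime(2005, 1, 23, 13, 37, 12)
requests = [
    ("u1", t0, "A", ""),
    ("u1", t0 + timedelta(minutes=1), "B", "A"),
    ("u1", t0 + timedelta(minutes=2), "A", ""),
    ("u1", t0 + timedelta(minutes=3), "L", "A"),
    ("u1", t0 + timedelta(minutes=60), "G", "B"),
]
sessions = split_sessions(requests)
```

The sessions produced here would then be handed to path completion, as the last node of the chart indicates.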

Page 15

Proposed System (7/7) - Path completion

Flow chart of Path completion algorithm

[Flow chart, nodes in order of appearance:]
• Begin
• Not end of session set? (Yes / No)
• Does a page in the session link to the next page in that session? (Yes / No)
• Check the next page
• Search for that page in the site map
• Complete the path
• End
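The following Python sketch shows one way path completion could work with a site map: when the current page does not link to the next logged page, step back through the visitor's history until a page that does link to it is found, and insert those "back" steps. The link table below is inferred from the Before/After example in this deck and is an assumption, as are the names `complete_path` and `SITE_LINKS`.

```python
# Assumed site map (page -> pages it links to), inferred from the
# Before/After example; the real topology has more pages.
SITE_LINKS = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}

def complete_path(session, links):
    """Insert the cached "back" steps missing from a logged session."""
    if not session:
        return []
    completed = [session[0]]
    history = [session[0]]  # pages on the current navigation path
    for page in session[1:]:
        # Find the most recent page in the history that links to `page`.
        source = None
        for index in range(len(history) - 1, -1, -1):
            if page in links.get(history[index], ()):
                source = index
                break
        if source is not None:
            # Re-add the pages passed while going back to the source page.
            for index in range(len(history) - 2, source - 1, -1):
                completed.append(history[index])
            del history[source + 1:]
        completed.append(page)
        history.append(page)
    return completed

before = [["A", "B", "C", "D", "F"], ["A", "L"], ["A", "B", "G", "I"], ["N", "O"]]
after = [complete_path(session, SITE_LINKS) for session in before]
```

If no earlier page links to the next page, the sketch leaves the session unchanged at that point rather than guessing a route.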

Page 16

Experiment (1/4)

[Screenshot: URLs in the Web server log of www.olloo.mn (raw log data)]

Page 17

Experiment (2/4)

[Screenshot: Topological Structure]

Page 18

Experiment (3/4)

[Figure: Cleaning result (data cleaning). Chart of log size (K), axis from 0 to 60000, before clean vs. after clean, for the logs dated 2005.01.03, 2005.01.10, 2005.01.17, 2005.01.31, 2005.02.19, 2005.02.26, 2005.03.14, 2003.03.31 and 2003.04.05.]

Page 19

Experiment (4/4)

Page 20

Result

This result can be more helpful for discovering similar user groups, relevant page groups and frequently accessed paths in WUM.

[Screenshots: User group; Path completion]

Page 21

Interface of Path Completion Preprocessing System (PCPS)

Start the new project.

Page 22

Interface of Path Completion Preprocessing System (PCPS)

Giving the project name and folder

Page 23

Interface of Path Completion Preprocessing System (PCPS)

Add the log file to the project

Page 24

Interface of Path Completion Preprocessing System (PCPS)

Choose the log file to add

Page 25

Interface of Path Completion Preprocessing System (PCPS)

Asking to remove the image files

(files) to analyze…

(files) to clean …

Page 26

Interface of Path Completion Preprocessing System (PCPS)

Cleaned log and information

The pages and files to be analyzed

Page 27

Interface of Path Completion Preprocessing System (PCPS)

Topological Structure

Page 28

Interface of Path Completion Preprocessing System (PCPS)

Browser

Page 29

Interface of Path Completion Preprocessing System (PCPS)

System

Page 30

Comparing other preprocessing approaches to the Proposed System

Related works | Creation of Topological Structure | Using Topological Structure | Removing images | Removing robot text | User/Session Identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O

O = used, X = not used

Page 31

Conclusion

Approach | Identified number of accesses | Identified number of users | Identified number of sessions
Not used path completion | 18019 | 2812 | 10407
Proposed System | 18019 | 3061 | 11019

• My work focuses on the preprocessing stage of Web log mining and improves pattern discovery. Without it, 3061 - 2812 = 249 users are neglected.
• This paper presented a new approach and practicable algorithms.
• This approach can achieve better precision than some existing approaches.

Page 32

Reference

[1] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and Pattern Discovery on the World Wide Web. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA, 1998.
[2] C. Shahabi and F.B. Kashani. A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking. 2001.
[3] M.S. Chen, J.S. Park, P.S. Yu. Data mining for path traversal patterns in a Web environment. 1996.
[4] H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrence. 1996.
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. 1996.
[6] J. Pitkow. In search of reliable usage data on the WWW. 1997.
[7] J. Pitkow, P. Pirolli and R. Rao. Silk. Extracting usable structures from the Web. 1996.
[8] S. Elo-Dean and M. Viveros. Data mining the IBM official 1996 Olympics Web site.
[9] Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
[10] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
[11] Doru Tanasa, Brigitte Trousse. Advanced data preprocessing for intersites Web Usage Mining. 2004.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.