Processing and Analyzing Large Logs from a Search Engine
Meng Dou, 13/9/2012



Big data

Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.

Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis

Opportunities
Behavior data (such as web logs) can be used to improve and support business processes, through data mining, process mining, and so on.

[Figure: big data stack. Cloud storage for unstructured data: distributed file system (HDFS) and key-value databases (HBase, Cassandra, MongoDB). Big data processing: cloud computing (Map/Reduce framework). Big data access: Hive, NoSQL. Analytic applications: BI/reporting, data mining, machine learning.]

[Figure: the stack used in this project. Labels: web data; raw data in a distributed file system (HDFS); cloud computing (Map/Reduce framework); instance data in NoSQL (Cassandra); process mining.]

Case study: Search Engine Company

• News, Page, Image, Maps, Music, navigation

Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes

User behavior:
• Visiting path (Referer)
• Search result effectiveness
• Abs clicking behavior
• Source and destination of user visits
• Robot behavior reorganization and analysis
• Visiting page layout
• Behavior comparison and product improvement
• User grouping and recommendation
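The visiting path and the source and destination of a visit can both be read off the Referer field. The following is a minimal Python sketch, not the project's implementation. It assumes each click record carries a `referer` and a `url` field (hypothetical field names), treats an empty referer as an external entry, and uses made-up page names as sample data.

```python
from collections import Counter

def transition_counts(clicks):
    """Count (source page -> destination page) pairs from click records.

    Each record is a dict with 'referer' (page the visitor came from)
    and 'url' (page the visitor landed on). An empty referer is treated
    as an external entry. Field names are assumptions for illustration.
    """
    counts = Counter()
    for click in clicks:
        source = click.get("referer") or "external"
        counts[(source, click["url"])] += 1
    return counts

# Made-up sample data, not taken from the real log.
sample = [
    {"referer": "", "url": "home"},
    {"referer": "home", "url": "web_results"},
    {"referer": "home", "url": "web_results"},
    {"referer": "web_results", "url": "news_results"},
]

for (source, destination), n in transition_counts(sample).most_common():
    print(f"{source} -> {destination}: {n}")
```

Counting pairs rather than storing whole paths keeps the memory footprint proportional to the number of distinct page-to-page transitions, which is small compared with the number of clicks.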

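[Figure: page navigation map of the search engine. Pages: home page, image home page, news home page, commentary home page, web results page, commentary results page, image results page, image transition page, news transition page, news special-topic page, news results page, external page. Actions: web search, image search, news search, commentary search, page switch, web result click, image click, news click, click for full text.]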

Data features

• Massive information in a well-recorded format

• Large scale with high growth potential

• Real-time analysis

Existing tools

Data extraction: XESame, ProM Import

Process mining: ProM
1) With a large data set, analysis is slow and in most situations ProM crashes.
2) Offline analysis -> real-time analysis

[Figure: extracting data from the cloud. Cloud storage / non-relational DB -> instance data (XES).]

System Structure

Log processing

Understandable model

Extracting useful information that reflects user behavior from massive logs

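Convert raw log to instance data (event log) with Map/Reduce

Raw log, activities in arrival order: A E B D C F D A E F G

Map: each event becomes a record CaseID+Timestamp+Activity
CaseID1+T1+A, CaseID1+T3+E, CaseID2+T3+B, CaseID3+T2+D, CaseID2+T1+C, CaseID3+T3+F, CaseID1+T2+D, CaseID2+T4+A, CaseID3+T1+E, CaseID4+T1+F, CaseID2+T2+G

Sort and Partition: records are grouped by CaseID and ordered by timestamp
CaseID1+T1+A, CaseID1+T2+D, CaseID1+T3+E
CaseID2+T1+C, CaseID2+T2+G, CaseID2+T3+B, CaseID2+T4+A
CaseID3+T1+E, CaseID3+T2+D, CaseID3+T3+F
CaseID4+T1+F

Reduce: one trace per case
A D E
C G B A
E D F
F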
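The Map, Sort and Partition, and Reduce steps on the eleven example records can be mimicked in a single Python process. This is only a sketch of the data flow under stated assumptions: it does not use Hadoop, the function names are invented for illustration, and timestamps T1..T4 are represented as the integers 1..4.

```python
from collections import defaultdict

# The eleven records from the slide, in arrival order: (case_id, time, activity).
RAW_EVENTS = [
    ("CaseID1", 1, "A"), ("CaseID1", 3, "E"), ("CaseID2", 3, "B"),
    ("CaseID3", 2, "D"), ("CaseID2", 1, "C"), ("CaseID3", 3, "F"),
    ("CaseID1", 2, "D"), ("CaseID2", 4, "A"), ("CaseID3", 1, "E"),
    ("CaseID4", 1, "F"), ("CaseID2", 2, "G"),
]

def map_phase(events):
    """Map: emit (key, value) pairs keyed by (case_id, time)."""
    for case_id, time, activity in events:
        yield (case_id, time), activity

def shuffle_phase(pairs):
    """Sort and partition: order by key, then group by case_id."""
    partitions = defaultdict(list)
    for (case_id, _time), activity in sorted(pairs):
        partitions[case_id].append(activity)
    return partitions

def reduce_phase(partitions):
    """Reduce: turn each partition into one trace (ordered activities)."""
    return {case_id: "".join(acts) for case_id, acts in partitions.items()}

traces = reduce_phase(shuffle_phase(map_phase(RAW_EVENTS)))
for case_id, trace in traces.items():
    print(case_id, trace)
```

Putting the timestamp into the key lets the sort step do the ordering, so the reducer only has to concatenate the activities it receives.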

If the number of events in an XLog exceeds 5000, the XLog is written out as one file, to avoid exceeding the machine's heap size.

Map -> Sort and Partition -> Reduce

Output files: XESName_0.xes, XESName_1.xes, XESName_2.xes
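The 5000-event rule and the numbered output files can be sketched in Python as follows. This is one reading of the rule, not the project's code: a chunk is closed as soon as it holds more than the threshold, the helper names are hypothetical, and the XES produced is deliberately minimal (only `concept:name` per event, without the extension and classifier declarations a complete XES file would normally carry).

```python
import xml.etree.ElementTree as ET

MAX_EVENTS_PER_LOG = 5000  # threshold named on the slide

def split_traces(traces, max_events=MAX_EVENTS_PER_LOG):
    """Group traces into chunks, closing a chunk once it holds more
    than max_events events (one reading of the slide's rule)."""
    chunks, current, count = [], [], 0
    for trace in traces:
        current.append(trace)
        count += len(trace)
        if count > max_events:
            chunks.append(current)
            current, count = [], 0
    if current:
        chunks.append(current)
    return chunks

def to_xes(traces):
    """Serialize traces (lists of activity names) as a minimal XES string."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for trace in traces:
        trace_el = ET.SubElement(log, "trace")
        for activity in trace:
            event_el = ET.SubElement(trace_el, "event")
            ET.SubElement(event_el, "string",
                          {"key": "concept:name", "value": activity})
    return ET.tostring(log, encoding="unicode")

# Small threshold so the example traces spill into two files.
traces = [list("ADE"), list("CGBA"), list("EDF"), list("F")]
for i, chunk in enumerate(split_traces(traces, max_events=5)):
    print(f"XESName_{i}.xes", to_xes(chunk))
```

Splitting on whole traces keeps every case inside a single file, at the cost of a chunk occasionally exceeding the threshold by up to one trace.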

fileSize         | logNum     | OnePCTime             | MapReduceTime | MapNum | ReduceNum
8.84 MB          | 36,422     | 5 s 921 ms            | 7 s           | 3      | 15
65.8 MB          | 218,177    | 30 s 846 ms           | 25 s          | 3      | 15
112 MB           | 772,241    | 48 s 559 ms           | 30 s          | 3      | 15
One day (371 MB) | 2,200,000  | 2.5 minutes           | 1.3 minutes   | 40     | 15
One week         | 15,000,000 | 20 minutes (expected) | 2.5 minutes   | 280    | 15
One month        | 66,000,000 | 2 hours (expected)    | 6 minutes     | 1200   | 15

CPU: Intel Xeon 2.40 GHz, RAM: 2 GB, 14 nodes
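The timing figures above can be turned into speedup ratios (single-PC time divided by Map/Reduce time). The short Python sketch below does only that arithmetic; note that the one-week and one-month single-PC times are the slide's expected values, not measurements, so the last two ratios are estimates.

```python
# (label, single-PC seconds, Map/Reduce seconds), copied from the timing figures.
# The one-week and one-month single-PC times are the slide's expected values.
RUNS = [
    ("8.84 MB", 5.921, 7.0),
    ("65.8 MB", 30.846, 25.0),
    ("112 MB", 48.559, 30.0),
    ("One day", 2.5 * 60, 1.3 * 60),
    ("One week", 20 * 60, 2.5 * 60),
    ("One month", 2 * 3600, 6 * 60),
]

def speedup(single_pc_seconds, map_reduce_seconds):
    """Single-PC time over Map/Reduce time; > 1 means Map/Reduce is faster."""
    return single_pc_seconds / map_reduce_seconds

for label, one_pc, mr in RUNS:
    print(f"{label:>9}: {speedup(one_pc, mr):.2f}x")
```

On the smallest file the ratio is below 1, meaning Map/Reduce was slower than a single PC there, while the ratio grows with input size up to about 20x for one month of data.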

Process Discovery

• Alpha miner
• Heuristic miner
• Fuzzy miner
• Sequence model

One instance/case is defined as a single visit by one visitor, identified by:
• IP + UA
• CookieID

The activity varies based on different requirements.
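A case definition of this kind can be sketched in Python. Several choices here are assumptions rather than details from the slides: the field names (`cookie_id`, `ip`, `ua`, `time`), preferring CookieID over IP + UA when both are available, and the 30-minute inactivity gap used to decide where one visit ends and the next begins.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed cut-off, not from the slides

def visitor_key(record):
    """Identify the visitor by CookieID when present, else by IP + UA."""
    if record.get("cookie_id"):
        return record["cookie_id"]
    return f'{record["ip"]}|{record["ua"]}'

def assign_case_ids(records, gap=SESSION_GAP):
    """Return (case_id, record) pairs; a new case starts when the same
    visitor has been inactive for longer than `gap`."""
    last_seen, visit_no, result = {}, {}, []
    for record in sorted(records, key=lambda r: r["time"]):
        key = visitor_key(record)
        if key not in last_seen or record["time"] - last_seen[key] > gap:
            visit_no[key] = visit_no.get(key, 0) + 1
        last_seen[key] = record["time"]
        result.append((f"{key}#{visit_no[key]}", record))
    return result

# Made-up sample records.
t0 = datetime(2012, 9, 13, 10, 0)
sample = [
    {"cookie_id": "c1", "ip": "10.0.0.1", "ua": "UA-1", "time": t0},
    {"cookie_id": "c1", "ip": "10.0.0.1", "ua": "UA-1",
     "time": t0 + timedelta(minutes=5)},
    {"cookie_id": "c1", "ip": "10.0.0.1", "ua": "UA-1",
     "time": t0 + timedelta(hours=2)},
    {"cookie_id": "", "ip": "10.0.0.2", "ua": "UA-2", "time": t0},
]
for case_id, record in assign_case_ids(sample):
    print(case_id, record["time"])
```

The resulting case ID plays the role of CaseID in the event log, so changing the visitor key or the gap changes which events end up in the same trace.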

Behavior analysis

User behavior pattern        | range | activity                   | Data selection
Interaction between channels | all   | ContentType                | -
Web map visiting path        | all   | Referer/URL                | -
Webpage layout               | news  | ContentType+PageType+Block | (Channel = news) AND (PageType = 195)
Webpage layout               | image | ContentType+PageType+Block | (Channel = image) AND (PageType = 435)
Searching result             | all   | -                          | -
Behavior grouping            | all   | -                          | -
Registration                 | -     | -                          | -
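For the webpage layout pattern, the data selection and the activity definition can be sketched in Python as a filter followed by a label builder. The field names Channel, PageType, ContentType, and Block and the two PageType codes come from the table; the record layout (one dict per click), the helper names, and the sample values are assumptions for illustration.

```python
def select(records, channel, page_type):
    """Keep records matching (Channel = channel) AND (PageType = page_type)."""
    return [r for r in records
            if r["Channel"] == channel and r["PageType"] == page_type]

def activity_name(record):
    """Build the activity label ContentType+PageType+Block."""
    return f'{record["ContentType"]}+{record["PageType"]}+{record["Block"]}'

# Made-up records; only the field names and the two PageType codes
# (195 for news, 435 for image) come from the table.
sample = [
    {"Channel": "news", "PageType": 195, "ContentType": "news", "Block": "top"},
    {"Channel": "news", "PageType": 200, "ContentType": "news", "Block": "side"},
    {"Channel": "image", "PageType": 435, "ContentType": "image", "Block": "grid"},
]

news_layout = [activity_name(r) for r in select(sample, "news", 195)]
print(news_layout)
```

Keeping selection and activity naming separate means the same log can be re-mined for a different pattern by swapping only the activity function.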


Active visitor’s visiting path


Main page

Sequence model

XES statistics

Conclusion

This is a good project for getting into the data analysis field, combining web data analysis, process mining, and cloud computing technology.

Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not follow a certain pattern?

1 Log sampling
2 Detect incorrect entries in logs before applying analysis technologies to them.
3 Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.

Feedback

1 What are the real questions?
2 Why process mining?

Thank you!