web mining issues
Web Mining Issues
Size
– >350 million pages
– Grows at about 1 million pages a day
Diverse types of data

Web Mining Taxonomy

Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject

Focused Crawler
Classifier also determines how useful outgoing links are
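
The focused-crawler idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), the classifier is a stand-in that scores a page by its fraction of topic words (`TOPIC_WORDS`), and the usefulness of an outgoing link is approximated by the score of the page that contains it. None of these names or choices come from the slides.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("web mining and data mining overview", ["b", "c"]),
    "b": ("data mining of web logs", ["d"]),
    "c": ("cooking recipes and kitchen tips", ["e"]),
    "d": ("mining frequent patterns in web sessions", []),
    "e": ("more recipes", []),
}
TOPIC_WORDS = {"mining", "web", "data"}  # assumed description of the subject


def relevance(text):
    """Stand-in classifier: fraction of words that are topic words."""
    words = text.split()
    return sum(w in TOPIC_WORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, threshold=0.3):
    """Visit pages best-first and expand links only from relevant pages."""
    frontier = [(-1.0, seed)]  # max-heap simulated with negated priorities
    seen = {seed}
    relevant = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant page: its outgoing links are not followed
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link is assumed to be as useful as the page citing it.
                heapq.heappush(frontier, (-score, link))
    return relevant


print(focused_crawl("a"))  # ['a', 'b', 'd']
```

Page "e" is never fetched because the only page linking to it is judged off-topic, which is the saving a focused crawler aims for.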

Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
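
As a rough illustration of content-based filtering, the sketch below ranks pages by cosine similarity between word-count vectors of the pages and of a user profile. The slides do not say which similarity measure is meant, so cosine similarity is an assumption here, and the profile and page texts are made up.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical user profile and pages, each reduced to a bag of words.
profile = Counter("web mining crawler index".split())
pages = {
    "p1": Counter("incremental crawler updates the index".split()),
    "p2": Counter("recipes for dinner".split()),
    "p3": Counter("web mining of web logs".split()),
}

# Pages most similar to the profile come first.
ranked = sorted(pages, key=lambda p: cosine(profile, pages[p]), reverse=True)
print(ranked)  # ['p3', 'p1', 'p2']
```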

PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.

PageRank (cont’d)
PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
– PR(i): PageRank for a page i which points to target page p.
– N_i: number of links coming out of page i
Rank source E: R = cAR + cE
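
A minimal sketch of iterating R = cAR + cE on a toy graph is given below. It makes several assumptions that the slides do not state: the rank source E is uniform (`e` per page), c is recomputed on every pass so that the ranks sum to 1, and every page has at least one outgoing link. The graph and the function name are hypothetical.

```python
def pagerank(links, e=0.15, iterations=50):
    """Iterate R = c(AR + E), where A[p][i] = 1/N_i if page i links to p."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: e for p in pages}  # rank source term E (assumed uniform)
        for i in pages:
            for p in links[i]:
                new[p] += rank[i] / len(links[i])  # contribution PR(i)/N_i
        total = sum(new.values())
        rank = {p: v / total for p, v in new.items()}  # c acts as a normaliser
    return rank


# Hypothetical link graph: page -> pages it points to.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" has the most backlinks and ends up with the highest rank, while "d", which nothing points to, keeps only its share of the rank source.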

CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
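
The hub/authority idea can be illustrated with a HITS-style iteration, the mutual-reinforcement scheme that CLEVER builds on. This is a sketch of that general scheme rather than CLEVER's own algorithm; the graph, the function name, and the fixed iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) scores via mutual reinforcement."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # A good hub points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm  # keep the scores bounded
    return hub, auth


# Hypothetical graph: h1 and h2 are link collections, x, y, z are content pages.
toy_graph = {"h1": ["x", "y"], "h2": ["x", "y", "z"], "x": [], "y": [], "z": []}
hub_scores, auth_scores = hits(toy_graph)
print(max(hub_scores, key=hub_scores.get))  # h2
```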

Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis

Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a user cannot be determined due to caching.
Session not well defined
Security, privacy, and legal issues

Web Log Cleansing
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
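
The three cleansing steps could look roughly like this. The record layout (source IP, URL, HTTP status), the list of non-page suffixes, and the `U1`/`P1` ID scheme are all assumptions made for the example.

```python
# Hypothetical log records: (source IP, URL, HTTP status).
LOG = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.2", "/index.html", 200),
    ("10.0.0.1", "/products.html", 200),
    ("10.0.0.2", "/style.css", 200),
]
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed list


def cleanse(records):
    """Drop error and non-page records, then replace IPs and URLs with IDs."""
    user_ids, page_ids, cleaned = {}, {}, []
    for ip, url, status in records:
        if status >= 400 or url.lower().endswith(NON_PAGE_SUFFIXES):
            continue  # error record or non-page data such as figures and code
        # IDs are assigned in order of first appearance and reveal nothing else.
        user = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        page = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((user, page))
    return cleaned


print(cleanse(LOG))  # [('U1', 'P1'), ('U2', 'P1'), ('U1', 'P2')]
```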

Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
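
The two techniques can be compared on a small example. The sketch assumes clicks are given as (IP, time in seconds, page) tuples. The 25-minute window is the value from the slide, while the 10-minute interclick threshold is an arbitrary choice for illustration.

```python
from collections import defaultdict


def _by_ip(clicks):
    """Group clicks by source IP, ordered by time."""
    grouped = defaultdict(list)
    for ip, t, page in sorted(clicks, key=lambda c: c[1]):
        grouped[ip].append((t, page))
    return grouped


def sessions_by_window(clicks, window=25 * 60):
    """Technique 1: clicks within `window` seconds of the session's first click."""
    sessions = []
    for ip, visits in _by_ip(clicks).items():
        start = None
        for t, page in visits:
            if start is None or t - start > window:
                sessions.append((ip, [page]))  # open a new session
                start = t
            else:
                sessions[-1][1].append(page)
    return sessions


def sessions_by_gap(clicks, max_gap=10 * 60):
    """Technique 2: consecutive clicks whose interclick time is below `max_gap`."""
    sessions = []
    for ip, visits in _by_ip(clicks).items():
        last = None
        for t, page in visits:
            if last is None or t - last >= max_gap:
                sessions.append((ip, [page]))  # open a new session
            else:
                sessions[-1][1].append(page)
            last = t
    return sessions


# Hypothetical clicks: one user clicking every 8 minutes, another after a long pause.
CLICKS = [("1", 0, "A"), ("1", 480, "B"), ("1", 960, "C"), ("1", 1440, "D"),
          ("1", 1920, "E"), ("2", 100, "A"), ("2", 2000, "B")]
print(sessions_by_window(CLICKS))
print(sessions_by_gap(CLICKS))
```

For the first user, the fixed window cuts the visit after 25 minutes even though the clicks are evenly spaced, while the interclick rule keeps all five pages in one session.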

Episodes
Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partially ordered with time constraint
General episode – partially ordered with no time constraint

DAG for Episode

Longest Common Subseries
Find the longest subseries two series have in common.
Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
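
The example can be checked with a short dynamic-programming sketch. It assumes that "subseries" means a contiguous run and that n is the length of the longer series, which is what the 4/9 in the example suggests. The function names are illustrative.

```python
from fractions import Fraction


def longest_common_subseries(x, y):
    """Longest contiguous run of values that appears in both series."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ended at x[i-2], y[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]


def sim(x, y):
    """Sim(X,Y) = l/n, with n assumed to be the length of the longer series."""
    common = longest_common_subseries(x, y)
    return Fraction(len(common), max(len(x), len(y)))


X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
print(longest_common_subseries(X, Y))  # [22, 15, 4, 2]
print(sim(X, Y))                       # 4/9
```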

Similarity based on Linear Transformation
Linear transformation function f
– Convert a value from one series to a value in the second
– Tolerated difference in results
– Time value difference allowed

Distance between Strings
Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into input string
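
A minimal sketch of this distance, using only the three listed transformations, is shown below. The slides do not give the costs, so the sketch assumes a match is free and each delete or insert costs 1. There is no substitution step, so changing one character costs a delete plus an insert.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Minimum cost to turn `source` into `target` with match/delete/insert."""
    m, n = len(source), len(target)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * delete_cost  # delete every remaining input character
    for j in range(1, n + 1):
        cost[0][j] = j * insert_cost  # insert every remaining target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [
                cost[i - 1][j] + delete_cost,  # delete source[i-1]
                cost[i][j - 1] + insert_cost,  # insert target[j-1]
            ]
            if source[i - 1] == target[j - 1]:
                options.append(cost[i - 1][j - 1])  # match costs nothing
            cost[i][j] = min(options)
    return cost[m][n]


print(string_distance("abc", "ac"))   # 1
print(string_distance("abc", "abd"))  # 2
```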

Frequent Sequence

Frequent Sequence Example
Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
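
The support of a sequence is the fraction of customers whose purchase history contains it in order. The purchase table itself is not part of this transcript, so the sketch below uses a hypothetical three-customer dataset, chosen only so that the resulting supports agree with the values listed above. It should not be read as the original data.

```python
from fractions import Fraction

# Hypothetical purchase histories: customer -> ordered list of itemsets.
CUSTOMERS = {
    "C1": [{"A"}, {"B", "C"}, {"D"}],
    "C2": [{"A", "B", "C"}, {"D"}],
    "C3": [{"A"}, {"B"}],
}


def contains(sequence, pattern):
    """True if each pattern itemset fits inside a strictly later transaction."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest transaction that includes this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must come from a later transaction
    return True


def support(pattern):
    """Fraction of customers whose history contains the pattern."""
    hits = sum(contains(seq, pattern) for seq in CUSTOMERS.values())
    return Fraction(hits, len(CUSTOMERS))


print(support([{"A"}, {"C"}]))       # 1/3
print(support([{"A"}, {"D"}]))       # 2/3
print(support([{"B", "C"}, {"D"}]))  # 2/3
```

Customer C2 does not support <{A},{C}> because A and C are bought together rather than one after the other, which is where order matters.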

Frequent Sequence Lattice

SPADE
Sequential Pattern Discovery using Equivalence classes
Divides lattice into equivalence classes and searches each separately.

SPADE Example
ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
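
The ID-list idea can be sketched as follows: each item keeps a list of (sequence id, event id) pairs, and the ID-list of a longer sequence is obtained by a temporal join of shorter ones. The ID-list table from the slide is not in this transcript, so the events below are hypothetical, picked so that the two counts agree with those stated above. Function names are illustrative and this covers only the join step, not the full SPADE algorithm.

```python
from collections import defaultdict

# Hypothetical events: (sequence id, event id, items bought in that event).
EVENTS = [
    (1, 1, {"A"}), (1, 2, {"B", "C"}), (1, 3, {"D"}),
    (2, 1, {"A", "B", "C"}), (2, 2, {"D"}),
    (3, 1, {"A"}), (3, 2, {"B"}),
]


def build_id_lists(events):
    """Vertical layout: item -> list of (sid, eid) pairs where it occurs."""
    id_lists = defaultdict(list)
    for sid, eid, items in events:
        for item in items:
            id_lists[item].append((sid, eid))
    return id_lists


def temporal_join(first, second):
    """ID-list of <{x},{y}>: occurrences of y after some x in the same sequence."""
    return [(sid2, eid2) for sid2, eid2 in second
            if any(sid1 == sid2 and eid1 < eid2 for sid1, eid1 in first)]


def count(id_list):
    """Support count: number of distinct sequences in the ID-list."""
    return len({sid for sid, _ in id_list})


id_lists = build_id_lists(EVENTS)
print(count(id_lists["A"]))                                  # 3
print(count(temporal_join(id_lists["A"], id_lists["D"])))    # 2
```

Because the join only needs the two ID-lists, the count for <{A},{D}> is obtained without rescanning the original sequences.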

Equivalence Classes

SPADE Algorithm