![Page 1: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/1.jpg)
Web Mining Issues
Size
– >350 million pages
– Grows at about 1 million pages a day
Diverse types of data
![Page 2: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/2.jpg)
Web Mining Taxonomy
![Page 3: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/3.jpg)
Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject
![Page 4: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/4.jpg)
Focused Crawler
Classifier also determines how useful outgoing links are
![Page 5: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/5.jpg)
Focused Crawler
![Page 6: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/6.jpg)
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
![Page 7: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/7.jpg)
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.
![Page 8: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/8.jpg)
PageRank (cont’d)
$PR(p) = c \left( \frac{PR(1)}{N_1} + \dots + \frac{PR(n)}{N_n} \right)$
– $PR(i)$: PageRank for a page $i$ which points to target page $p$.
– $N_i$: number of links coming out of page $i$
Rank source $E$: $R = cAR + cE$
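The formula above can be sketched as a simple iteration. This is only an illustrative sketch under stated assumptions: the rank source is taken to be a uniform value `e` per page, `c` is recomputed each round as the constant that keeps the ranks summing to 1, every page has at least one outgoing link, and the four-page graph is hypothetical.

```python
def pagerank(links, e=0.15, iterations=50):
    """links maps each page to the list of pages it links to.

    Assumption: every page appears as a key and has at least one outgoing link.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Start every page with the rank-source contribution E(p) = e (assumed uniform).
        new_rank = {p: e for p in pages}
        for i, targets in links.items():
            for p in targets:
                # Page i shares its rank equally among its N_i outgoing links.
                new_rank[p] += rank[i] / len(targets)
        # c is the normalization constant that keeps the ranks summing to 1.
        c = 1.0 / sum(new_rank.values())
        rank = {p: c * r for p, r in new_rank.items()}
    return rank


# Hypothetical four-page Web: "c" has the most backlinks, "d" has none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this toy graph the page with the most backlinks ("c") ends up with the highest rank, and page "a" ranks above "b" because its single backlink comes from the important page "c".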
![Page 9: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/9.jpg)
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
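One common way to score hubs and authorities is the mutually reinforcing iteration of the HITS algorithm. The slide does not give the computation, so the following is a minimal sketch under that assumption; the page names and link structure are hypothetical.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical Web: h1..h3 are link collections, a1 and a2 are content pages.
web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hits(web)
```

Here "a1" receives the highest authority score because every hub links to it, and "h1" is a better hub than "h3" because it points to more authoritative pages.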
![Page 10: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/10.jpg)
Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
![Page 11: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/11.jpg)
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
![Page 12: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/12.jpg)
Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues
![Page 13: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/13.jpg)
Web Log Cleansing
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
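The three cleansing steps above can be sketched as follows. The record layout `(ip, url, status)`, the file suffixes treated as non-page data, the `U`/`P` ID scheme, and the sample log are all assumptions made for illustration.

```python
# Assumed suffixes for figures and code; a real log would need its own list.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def cleanse(records):
    """records: iterable of (ip, url, status). Returns a list of (user_id, page_id)."""
    user_ids, page_ids = {}, {}
    cleaned = []
    for ip, url, status in records:
        # Delete error records (assumed here to be HTTP status 400 and above).
        if status >= 400:
            continue
        # Delete records that reference non-page data such as figures and code.
        if url.lower().endswith(NON_PAGE_SUFFIXES):
            continue
        # Replace IP address and URL with unique but non-identifying IDs.
        user = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        page = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((user, page))
    return cleaned


# Hypothetical log entries using documentation-range IP addresses.
log = [
    ("192.0.2.1", "/index.html", 200),
    ("192.0.2.1", "/logo.gif", 200),
    ("192.0.2.2", "/index.html", 200),
    ("192.0.2.1", "/missing.html", 404),
    ("192.0.2.2", "/products.html", 200),
]
cleaned = cleanse(log)
```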
![Page 14: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/14.jpg)
Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
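Both techniques can be sketched in a few lines. This is a minimal illustration, assuming the log is already sorted by time and each entry is a `(source, timestamp_in_seconds, page)` tuple; the function names, the 15-minute gap, and the sample clicks are hypothetical.

```python
def sessionize_by_window(clicks, window):
    """Technique 1: a session holds references within `window` seconds of its first click."""
    sessions, current = [], {}  # current: source -> (session start time, pages)
    for source, t, page in clicks:
        state = current.get(source)
        if state is not None and t - state[0] <= window:
            state[1].append(page)
        else:
            pages = [page]
            sessions.append((source, pages))
            current[source] = (t, pages)
    return sessions


def sessionize_by_gap(clicks, max_gap):
    """Technique 2: a session continues while the interclick time stays below `max_gap`."""
    sessions, current = [], {}  # current: source -> (time of last click, pages)
    for source, t, page in clicks:
        state = current.get(source)
        if state is not None and t - state[0] < max_gap:
            state[1].append(page)
            current[source] = (t, state[1])
        else:
            pages = [page]
            sessions.append((source, pages))
            current[source] = (t, pages)
    return sessions


# Hypothetical clicks: (source, seconds since start of log, page).
clicks = [("s1", 0, "A"), ("s1", 600, "B"), ("s2", 700, "A"),
          ("s1", 1400, "C"), ("s1", 2000, "D")]
by_window = sessionize_by_window(clicks, window=25 * 60)
by_gap = sessionize_by_gap(clicks, max_gap=15 * 60)
```

On this sample the two techniques disagree: the 25-minute window closes the first session of "s1" before page D, while the gap rule keeps all four references together because no single pause reaches 15 minutes.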
![Page 15: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/15.jpg)
Episodes
Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partially ordered with time constraint
General episode – partially ordered with no time constraint
![Page 16: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/16.jpg)
DAG for Episode
![Page 17: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/17.jpg)
Longest Common Subseries
Find longest subseries they have in common.
Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
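The example can be reproduced with a short dynamic program. This sketch assumes that a subseries is a run of consecutive values (which is what the slide's output suggests) and that n is the length of the longer series; neither is stated explicitly on the slide.

```python
def longest_common_subseries(x, y):
    """Return the longest run of consecutive values that appears in both x and y."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ended at x[i-2] and y[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]


X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
common = longest_common_subseries(X, Y)
# Assumption: n is the length of the longer series (9 in this example).
sim = len(common) / max(len(X), len(Y))
```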
![Page 18: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/18.jpg)
Similarity based on Linear Transformation
Linear transformation function f
– Convert a value from one series to a value in the second
$\varepsilon_f$ – tolerated difference in results
$\delta$ – time value difference allowed
![Page 19: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/19.jpg)
Distance between Strings
Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into input string
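A minimal sketch of this distance follows, assuming a match is free and each delete or insert costs 1; the slide does not give the costs, so they are parameters here. No substitution operation is used, since the slide lists only match, delete, and insert.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Cheapest way to turn source into target using match, delete and insert."""
    m, n = len(source), len(target)
    # dist[i][j]: cost to convert the first i characters of source
    # into the first j characters of target.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(dist[i - 1][j] + delete_cost,   # delete source[i-1]
                       dist[i][j - 1] + insert_cost)   # insert target[j-1]
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])   # match costs nothing
            dist[i][j] = best
    return dist[m][n]


d_insert = string_distance("cat", "cart")   # one insert
d_replace = string_distance("cat", "cut")   # one delete plus one insert
```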
![Page 20: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/20.jpg)
Distance between Strings
![Page 21: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/21.jpg)
Frequent Sequence
![Page 22: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/22.jpg)
Frequent Sequence Example
Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
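Support of a sequence is the fraction of customers whose purchase history contains it in order. The purchase table from the slide did not survive in this transcript, so the three customer histories below are hypothetical, chosen only so that the computed supports agree with the values listed above.

```python
from fractions import Fraction


def contains(sequence, pattern):
    """True if each itemset of pattern is a subset of a later and later itemset of sequence."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest purchase that includes this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database, pattern):
    """Fraction of customer sequences that contain the pattern."""
    hits = sum(contains(seq, pattern) for seq in database)
    return Fraction(hits, len(database))


# Hypothetical purchase histories (NOT the slide's table), one list per customer.
customers = [
    [{"A", "B"}, {"B", "C"}, {"D"}],
    [{"B", "C"}, {"A"}, {"D"}],
    [{"A"}, {"B"}],
]
s_ac = support(customers, [{"A"}, {"C"}])
s_ad = support(customers, [{"A"}, {"D"}])
s_bcd = support(customers, [{"B", "C"}, {"D"}])
```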
![Page 23: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/23.jpg)
Frequent Sequence Lattice
![Page 24: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/24.jpg)
SPADE
Sequential Pattern Discovery using Equivalence classes
Divides lattice into equivalence classes and searches each separately.
![Page 25: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/25.jpg)
SPADE Example
ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
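The counts above come from ID-lists, which record for each item the (sequence ID, event ID) pairs where it occurs; longer sequences are counted by joining the lists. The sketch below shows only that join step, not the full algorithm, and the ID-list table from the slide is missing here, so the database is hypothetical and merely chosen to give the same two counts.

```python
from collections import defaultdict


def build_id_lists(database):
    """Map each item to its list of (sequence id, event id) occurrences."""
    id_lists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, itemset in enumerate(sequence, start=1):
            for item in sorted(itemset):
                id_lists[item].append((sid, eid))
    return id_lists


def temporal_join(first, second):
    """ID-list for 'first, then second later': occurrences of second that follow first."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and eid > earliest[sid]]


def count(id_list):
    """Number of distinct sequences that appear in an ID-list."""
    return len({sid for sid, _ in id_list})


# Hypothetical database (NOT the slide's table): sequence id -> ordered itemsets.
database = {
    1: [{"A", "B"}, {"B", "C"}, {"D"}],
    2: [{"B", "C"}, {"A"}, {"D"}],
    3: [{"A"}, {"B"}],
}
id_lists = build_id_lists(database)
count_a = count(id_lists["A"])
count_a_then_d = count(temporal_join(id_lists["A"], id_lists["D"]))
```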
![Page 26: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/26.jpg)
Equivalence Classes
![Page 27: Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data](https://reader036.vdocuments.us/reader036/viewer/2022062519/5697bfbc1a28abf838ca1a1f/html5/thumbnails/27.jpg)
SPADE Algorithm