Download - Copy of Web Mining Overview
-
8/7/2019 Copy of Web Mining Overview
1/30
CS 345AData MiningLecture 1
Introduction to Web Mining
-
8/7/2019 Copy of Web Mining Overview
2/30
What is Web Mining?
Discovering useful information fromthe World-Wide Web and its usagepatterns
-
8/7/2019 Copy of Web Mining Overview
3/30
Web Mining v. Data Mining
Structure (or lack of it)
Textual information and linkage structure
Scale
Data generated per day is comparable tolargest conventional data warehouses
Speed
Often need to react to evolving usagepatterns in real-time (e.g.,merchandising)
-
8/7/2019 Copy of Web Mining Overview
4/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
5/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
6/30
Size of the Web
Number of pages Technically, infinite
Much duplication (30-40%)
Best estimate of unique static HTMLpages comes from search engine claims Until last year, Google claimed 8 billion(?),
Yahoo claimed 20 billion
Google recently announced that their index
contains 1 trillion pages How to explain the discrepancy?
-
8/7/2019 Copy of Web Mining Overview
7/30
The web as a graph
Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
10-20 links/page on average
Power-law degree distribution
-
8/7/2019 Copy of Web Mining Overview
8/30
Structure of Web graph
Lets take a closer look at structure
Broder et al (2000) studied a crawl of200M pages and other smaller crawls
Bow-tie structure Not a small world
-
8/7/2019 Copy of Web Mining Overview
9/30
Bow-tie Structure
Source: Broder et al, 2000
-
8/7/2019 Copy of Web Mining Overview
10/30
What can the graph tell us?
Distinguish important pages fromunimportant ones
Page rank
Discover communities of relatedpages
Hubs and Authorities
Detect web spam Trust rank
-
8/7/2019 Copy of Web Mining Overview
11/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
12/30
Power-law degree distribution
Source: Broder et al, 2000
-
8/7/2019 Copy of Web Mining Overview
13/30
Power-laws galore
Structure
In-degrees
Out-degrees
Number of pages per site
Usage patterns
Number of visitors
Popularity e.g., products, movies, music
-
8/7/2019 Copy of Web Mining Overview
14/30
The Long Tail
Source: Chris Anderson (2004)
-
8/7/2019 Copy of Web Mining Overview
15/30
The Long Tail
Shelf space is a scarce commodity fortraditional retailers Also: TV networks, movie theaters,
The web enables near-zero-cost
dissemination of information aboutproducts
More choice necessitates better filters Recommendation engines (e.g., Amazon)
How Into Thin Air made Touching the Void abestseller
-
8/7/2019 Copy of Web Mining Overview
16/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
17/30
Extracting Structured Data
http://www.simplyhired.com
-
8/7/2019 Copy of Web Mining Overview
18/30
Extracting structured data
http://www.fatlens.com
-
8/7/2019 Copy of Web Mining Overview
19/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
20/30
Searching the Web
Content aggregatorsThe Web Content consumers
-
8/7/2019 Copy of Web Mining Overview
21/30
Ads vs. search results
-
8/7/2019 Copy of Web Mining Overview
22/30
Ads vs. search results
Search advertising is the revenuemodel
Multi-billion-dollar industry
Advertisers pay for clicks on their ads
Interesting problems
What ads to show for a search?
If Im an advertiser, which search termsshould I bid on and how much to bid?
-
8/7/2019 Copy of Web Mining Overview
23/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
24/30
Two Approaches to AnalyzingData
Machine Learning approach
Emphasizes sophisticated algorithmse.g., Support Vector Machines
Data sets tend to be small, fit in memory Data Mining approach
Emphasizes big data sets (e.g., in theterabytes)
Data cannot even fit on a single disk!
Necessarily leads to simpler algorithms
-
8/7/2019 Copy of Web Mining Overview
25/30
Philosophy
In many cases, adding more dataleads to better results that improvingalgorithms
Netflix Google search
Google ads
More on my blog:Datawocky (datawocky.com)
-
8/7/2019 Copy of Web Mining Overview
26/30
Systems architecture
Memory
Disk
CPUMachine Learning, Statistics
Classical Data Mining
-
8/7/2019 Copy of Web Mining Overview
27/30
Very Large-Scale Data Mining
Mem
Disk
CPU
Mem
Disk
CPU
Mem
Disk
CPU
Cluster of commodity nodes
-
8/7/2019 Copy of Web Mining Overview
28/30
Systems Issues
Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server!
Need large farms of servers
How to organize hardware/softwareto mine multi-terabye data sets
Without breaking the bank!
-
8/7/2019 Copy of Web Mining Overview
29/30
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
-
8/7/2019 Copy of Web Mining Overview
30/30
Project
Lots of interesting project ideas If you cant think of one please come discuss
with us
Infrastructure
Aster Data cluster on Amazon EC2 Supports both MapReduce and SQL
Data Netflix
ShareThis
Google
WebBase
TREC