web mining
What is web mining?
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent.
Web Mining vs. Data Mining
• Structure (or lack of it)
  • Textual information and linkage structure
• Scale
  • Data generated per day is comparable to the largest conventional data warehouses
• Speed
  • Often need to react to evolving usage patterns in real time (e.g., merchandising)
Web Mining topics
• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues
Size of the Web
• Number of pages
  • Technically, infinite
  • Much duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims
    • Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    • Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?
The web as a graph
• Pages = nodes, hyperlinks = edges
  • Ignore content
  • Directed graph
• High linkage
  • 10-20 links/page on average
  • Power-law degree distribution
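As a minimal sketch of the graph view above, the following Python snippet treats pages as nodes and hyperlinks as directed edges, then counts in-degrees and out-degrees and tabulates the in-degree distribution. The page names and the `links` list are hypothetical and only for illustration; a real crawl would supply them.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page); page content is ignored.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("d.html", "c.html"),
]

# Out-degree: links leaving a page. In-degree: links pointing at a page.
out_degree = Counter(src for src, _ in links)
in_degree = Counter(dst for _, dst in links)

# Degree distribution: how many pages have each in-degree value.
pages = {page for edge in links for page in edge}
in_degree_histogram = Counter(in_degree[page] for page in pages)

print(dict(out_degree))
print(dict(in_degree))
print(dict(in_degree_histogram))
```

On a real web graph, plotting `in_degree_histogram` on log-log axes is one common way to check for the power-law shape mentioned above.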
Measures
• Structure
  • In-degrees
  • Out-degrees
  • Number of pages per site
• Usage patterns
  • Number of visitors
  • Popularity, e.g., products, movies, music
Measures
• Shelf space is a scarce commodity for traditional retailers
  • Also: TV networks, movie theaters, …
• The web enables near-zero-cost dissemination of information about products
• More choice necessitates better filters
  • Recommendation engines (e.g., Amazon)
  • How Into Thin Air made Touching the Void a bestseller
Two approaches for analyzing data
• Machine Learning approach
  • Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  • Data sets tend to be small and fit in memory
• Data Mining approach
  • Emphasizes big data sets (e.g., in the terabytes)
  • Data cannot even fit on a single disk!
  • Necessarily leads to simpler algorithms
Issues
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!
What should it do?
• Finding relevant information
  • Low precision and unindexed information
• Creating new knowledge out of available information on the web
  • A data-triggered process
• Personalizing the information
  • Personal preference in content and presentation of the information
• Learning about the consumers
  • What does the customer want to do?
Direct vs Indirect web mining
• Web mining techniques can be used to solve information overload problems:
  • Directly: address the problem with web mining techniques
    E.g., a newsgroup agent classifies whether news is relevant
  • Indirectly: used as part of a bigger application that addresses the problem
    E.g., used to create index terms for a web search service
Web Mining Categories
• Web Content Mining
  Discovering useful information from web page contents/data/documents.
• Web Structure Mining
  Discovering the model underlying link structures (topology) on the Web. E.g., discovering authorities and hubs.
• Web Usage Mining
  Extraction of interesting knowledge from logging information produced by web servers.
  Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
[Figure: IR System, with labels Query, Documents source, and Ranked Documents]
[Figure: Clustering System, with labels Similarity measure, Documents source, and groups of Docs]
Web Content Data Structure
• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks.
• Unstructured: free text
• Semi-structured: HTML
• More structured: data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.
Web Content Mining
• Unstructured Documents
  • Bag of words to represent unstructured documents
    • Takes a single word as a feature
    • Ignores the sequence in which words occur
  • Features could be
    • Boolean: the word either occurs or does not occur in a document
    • Frequency based: the frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words, and stop words
  • Features can be reduced using different feature selection techniques:
    • Information gain, mutual information, cross entropy
    • Stemming, which reduces words to their morphological roots
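The bag-of-words representation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the stop-word list, the sample documents, and the helper names `tokenize` and `bag_of_words` are assumptions made for the example, and stemming and infrequent-word removal are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(text, vocabulary, boolean=False):
    """Return one feature per vocabulary word; word order is ignored."""
    counts = Counter(tokenize(text))
    if boolean:
        # Boolean feature: does the word occur at all?
        return [1 if counts[w] > 0 else 0 for w in vocabulary]
    # Frequency feature: how often does the word occur?
    return [counts[w] for w in vocabulary]

docs = [
    "Web mining is the mining of the Web.",
    "Data mining and web usage.",
]
vocabulary = sorted({w for d in docs for w in tokenize(d)})

for d in docs:
    print(bag_of_words(d, vocabulary), bag_of_words(d, vocabulary, boolean=True))
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the form most classifiers and clustering methods expect.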
Web Content Mining
• Semi-Structured Documents
  • Uses richer representations for features, due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured might use more text mining methods)
• Applications: hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.
Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing and querying the information on the Web.
• The DB view tries to infer the structure of a Web site or transform a Web site to become a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database
Web Content Mining: DB View
• The DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  • Each object is identified by an object identifier (oid), and its value is either atomic or complex
• The process typically starts with manual selection of Web sites for doing Web content mining
• Main applications:
  • The task of finding frequent substructures in semi-structured data
  • The task of creating a multi-layered database
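To make the OEM description more concrete, here is a rough Python sketch of a labeled graph in which each object has an oid and either an atomic value or labeled edges to other objects. The class name `OEMObject`, the `&n` oid style, and the sample data are assumptions chosen for illustration and are not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

Atomic = Union[str, int, float]

@dataclass
class OEMObject:
    oid: str
    # Atomic objects carry a value; complex objects carry labeled edges.
    value: Optional[Atomic] = None
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (label, target oid)

    @property
    def is_atomic(self) -> bool:
        return self.value is not None

# Hypothetical data: a site object with a title and one page.
objects: Dict[str, OEMObject] = {
    "&1": OEMObject("&1", edges=[("title", "&2"), ("page", "&3")]),
    "&2": OEMObject("&2", value="Web Mining"),
    "&3": OEMObject("&3", edges=[("url", "&4")]),
    "&4": OEMObject("&4", value="http://example.org/index.html"),
}

def children(oid: str, label: str) -> List[OEMObject]:
    """Follow all edges with the given label out of one object."""
    return [objects[target] for (lbl, target) in objects[oid].edges if lbl == label]

print(children("&1", "title")[0].value)
```

Because the labels live on the edges rather than in a fixed schema, two objects of the same kind may carry different sets of labels, which is what makes the model a fit for semi-structured data.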
Taxonomies
• Ranking
• Graph Search
• Communities
• Hyperlink Induced Topic Search
• SEO
• Hubs & Authorities
Web Structure Mining
• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links.
• Applications:
  • Discovering micro-communities in the Web
  • Measuring the “completeness” of a Web site
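One way to discover hubs and authorities from incoming and outgoing links is the iterative scheme used by Hyperlink Induced Topic Search (HITS). The snippet below is a simplified sketch of that iteration on a tiny hypothetical link list; the function name `hits`, the fixed iteration count, and the page names are assumptions for illustration.

```python
import math

def hits(links, iterations=20):
    """Score pages as hubs and authorities from (source, target) links."""
    pages = sorted({p for edge in links for p in edge})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[s] for s, d in links if d == p) for p in pages}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[d] for s, d in links if s == p) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical links: h1..h3 point at candidate authorities a1 and a2.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

A good hub points to many good authorities and a good authority is pointed to by many good hubs, so the two scores reinforce each other over the iterations.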
Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the Web server's usage data into relational tables before applying an adapted data mining technique
  • Use the log data directly by utilizing special pre-processing techniques
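The first approach above, mapping server usage data into relational tables, might look roughly like the following sketch. It assumes log lines in the Common Log Format and uses an in-memory SQLite table; the sample lines, the table name `hits`, and the column choice are hypothetical.

```python
import re
import sqlite3

# Hypothetical web server log lines in the Common Log Format.
LOG_LINES = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 1500',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /products.html HTTP/1.0" 200 1500',
]

PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
for line in LOG_LINES:
    m = PATTERN.match(line)
    if m:  # skip lines that do not parse
        conn.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?)",
            (m["host"], m["time"], m["path"], int(m["status"])),
        )

# Once the data is relational, ordinary queries apply, e.g. requests per page.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY path"
).fetchall()
print(page_counts)
```

The second approach would skip the table and run pre-processing steps directly over the raw log lines.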
Web Usage Mining
[Figure: three-stage process]
• Pre-Processing: produces the user session file
• Pattern Discovery: produces rules and patterns
• Pattern Analysis: produces interesting knowledge
Use of Multi-Layer Meta-Web
• Benefits of a multi-layer meta-Web:
  • Multi-dimensional Web info summary analysis
  • Approximate and intelligent query answering
  • Web high-level query answering (WebSQL, WebML)
  • Web content and structure mining
  • Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
  • Benefits even if it is partially constructed
  • Benefits may justify the cost of tool development, standardization, and partial restructuring
Web Search Products and Services
• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!
Web Usage Mining
• Typical problems:
  • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Usage mining often uses some background or domain knowledge
  • E.g., site topology, Web content, etc.
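A rough sketch of how users and sessions might be told apart in a server log follows. It relies on two common heuristics that are assumptions here rather than something stated in the slides: requests from the same IP address with a different user agent are treated as a different user, and a gap of more than 30 minutes starts a new session. The sample requests are hypothetical, and pages served from a cache never reach the log, so this sketch cannot recover them.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides

# Hypothetical requests: (host, user agent, timestamp, path).
requests = [
    ("10.0.0.1", "Mozilla", datetime(2000, 10, 10, 13, 0), "/index.html"),
    ("10.0.0.1", "Mozilla", datetime(2000, 10, 10, 13, 5), "/products.html"),
    ("10.0.0.1", "Opera", datetime(2000, 10, 10, 13, 6), "/index.html"),
    ("10.0.0.1", "Mozilla", datetime(2000, 10, 10, 14, 30), "/index.html"),
]

def sessionize(requests):
    """Split each (host, agent) user's requests into sessions by idle time."""
    sessions = []
    last_seen = {}  # user key -> (time of last request, that session's path list)
    for host, agent, when, path in sorted(requests, key=lambda r: r[2]):
        user = (host, agent)  # same IP with a different agent counts as a new user
        previous = last_seen.get(user)
        if previous is None or when - previous[0] > SESSION_TIMEOUT:
            current = []  # start a new session for this user
            sessions.append((user, current))
        else:
            current = previous[1]  # continue the user's open session
        current.append(path)
        last_seen[user] = (when, current)
    return sessions

sessions = sessionize(requests)
for user, paths in sessions:
    print(user, paths)
```

Site topology is one kind of background knowledge that can refine this: a request for a page not reachable from anything the user has already visited hints at a second user behind the same proxy.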
Web Usage Mining
• Applications:
  • Two main categories:
    • Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically
    • Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web site