Web Mining & Text Mining
Web Mining
Web Mining is the application of data mining techniques to extract knowledge
from web data such as Web content, Web structure and Web usage data.
It is the process of discovering useful and previously unknown information
from web data.
Web data includes:
• Web content : text, images, records, etc.
• Web structure : hyperlinks, tags, etc.
• Web usage : HTTP logs, app server logs, etc.
Web Content Mining
Web content mining is performed by extracting useful information from the content
of a web page or site.
It includes the extraction of structured data from web pages and the
identification, matching, and integration of semantically similar data.
Web content may consist of text, images, audio, video, etc. When the content is
text, it is also known as text mining.
It uses Natural Language Processing and Information Retrieval techniques to
mine the data.
Web Structure Mining
The structure of a typical Web graph consists of Web pages as nodes and
hyperlinks as edges connecting two related pages.
Web structure mining is the process of discovering structure information from the
web.
• This type of mining can be performed either at the (intra-page) document level or the
(inter-page) hyperlink level.
• The research at the hyperlink level is also called Hyperlink Analysis.
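As a rough illustration of hyperlink-level mining, the sketch below pulls the link targets out of one page and turns them into directed edges. It uses only Python's standard `html.parser` and `urllib.parse` modules; the class name `LinkExtractor`, the helper `extract_links`, and the sample page are hypothetical and not part of the original material.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs that the given HTML links to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page, used only for illustration.
page_url = "http://example.com/index.html"
page_html = (
    '<p><a href="about.html">About</a> '
    '<a href="http://example.org/">Other site</a></p>'
)

# Each (source, target) pair is one directed edge of the Web graph.
edges = [(page_url, target) for target in extract_links(page_url, page_html)]
```

Repeating this over many pages would give the edge list of a Web graph, which is the input that hyperlink analysis works on.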
Web Structure Terminology
Web-graph : A directed graph that represents the Web.
Node : Each Web page represents a node of the Web-graph.
Link : Each hyperlink on the Web is a directed edge of the Web-graph.
In-degree : The number of distinct links that point to a node.
Out-degree : The number of distinct links originating at a node that point to other
nodes.
Directed Path : A sequence of links, starting from a node r, that can be
followed to reach another node t.
Shortest Path : The path with the shortest length out of all the paths between
nodes p and q.
Diameter : The maximum of the shortest-path lengths between p and q, taken over
all pairs of nodes p and q in the Web-graph.
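The terms above can be made concrete on a small example. The following is a minimal sketch, assuming the Web-graph is stored as a dictionary that maps each page to the pages it links to. The graph itself and all function names are made up for illustration. One assumption goes beyond the definitions: pairs with no directed path between them are skipped when computing the diameter.

```python
from collections import deque

# Hypothetical Web-graph: each key is a page, each value the pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def out_degree(graph, node):
    """Number of distinct pages that `node` links to."""
    return len(set(graph.get(node, [])))


def in_degree(graph, node):
    """Number of distinct pages that link to `node`."""
    return sum(1 for targets in graph.values() if node in targets)


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns None if `goal` cannot be reached."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return distances[current]
        for neighbour in graph.get(current, []):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return None


def diameter(graph):
    """Largest shortest-path length over all ordered pairs that are connected.

    Assumption: unreachable pairs are ignored rather than counted as infinite.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    lengths = [
        shortest_path_length(graph, p, q)
        for p in nodes
        for q in nodes
        if p != q
    ]
    reachable = [length for length in lengths if length is not None]
    return max(reachable) if reachable else 0
```

In this graph, page C has in-degree 3 (linked from A, B and D), and the longest shortest path is D to C to A to B, so `diameter(web_graph)` is 3.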
Web Usage Mining
A Web is a collection of inter-related files on one or more Web servers.
Web usage mining is the discovery of meaningful patterns from data generated by
client-server transactions on one or more Web localities.
Typical Sources of Data :
• Automatically generated data stored in server access logs, referrer logs, agent logs, and
client-side cookies.
• User profiles.
• Metadata : page attributes, content attributes, usage data.
Web servers, Web proxies, and client applications can quite easily capture Web
usage data.
Web Server Log : A file created by the server to record the activities it
performs.
For example, a request is logged when a user enters a URL into the browser's
address bar or clicks on a link.
For each page request, the web server records information in its log such as
the requested URL, whether the request was successful, the user's IP address,
and the time and date.
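To show how these fields might be recovered from a log, here is a minimal sketch that parses one entry. It assumes the server writes the widely used Common Log Format, which the text above does not specify. The sample line is invented, and treating 2xx and 3xx status codes as "successful" is likewise an assumption made for the example.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format. The source does not name a log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": match.group("url"),
        "status": status,
        # Assumption: 2xx and 3xx responses count as successful requests.
        "successful": 200 <= status < 400,
    }


# Made-up log entry, for illustration only.
sample = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
entry = parse_log_line(sample)
```

Parsed entries like `entry` can then be grouped by IP address or URL, which is the usual starting point for finding usage patterns.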
Text Mining
The objective of Text Mining is to exploit information contained in textual
documents in various ways, including discovery of patterns and trends in data,
associations among entities, predictive rules, etc.
The results can be important both for :
• The analysis of the collection, and
• Providing intelligent navigation and browsing methods.
Text Mining Workflow
Data Mining vs Text Mining
Both seek novel and useful patterns.
Both are semi-automated processes.
The difference is the nature of the data:
• Structured versus Unstructured data
• Structured data: databases
• Unstructured data: word docs, pdf files, xml files, and so on
Text mining: first, impose structure on the data, then mine the structured data.
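One common way to impose structure on text is a bag-of-words term-document matrix. The sketch below is one possible illustration under that assumption, not a method named in the original material. The two documents, the `tokenize` helper, and the final "shared terms" step are all invented for the example.

```python
import re
from collections import Counter

# Invented documents, for illustration only.
documents = [
    "Web mining extracts knowledge from web data.",
    "Text mining extracts patterns from text.",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Step 1: impose structure. Each document becomes a row of term counts.
counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*counts))
matrix = [[count[term] for term in vocabulary] for count in counts]

# Step 2: mine the structured data. Here, find the terms every document shares.
shared_terms = [
    term
    for column, term in enumerate(vocabulary)
    if all(row[column] > 0 for row in matrix)
]
```

Once the text is in matrix form, ordinary data mining techniques such as clustering or classification can be applied to the rows.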
Technology premise of Text Mining
Summarization : The process of producing a summary of a document containing a
large amount of information while preserving its theme or main idea.
Information Extraction : Extracts information by exploiting relations within the
text, using pattern matching to find them.
Categorization : A supervised learning technique that assigns a document to a
category according to its content. Document categorization is widely used in libraries.
Visualization : The use of computer graphics to represent information and
reveal relationships.
Clustering : An unsupervised technique, based on the textual similarity of
documents, that divides the texts into mutually exclusive groups.
Question Answering : Takes a natural language query or question and determines
how to find the most suitable answer to it.
Sentiment Analysis : Also known as opinion mining, it classifies a user's
emotion, usually into classes such as positive, negative, neutral and mixed.
It is mainly used to gauge people's views or attitudes towards something, such
as services and products.
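The four sentiment classes mentioned above can be illustrated with a very small word-list classifier. This is only a sketch under strong assumptions: the `POSITIVE` and `NEGATIVE` lexicons are tiny and hypothetical, and practical systems generally rely on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}


def classify_sentiment(text):
    """Label text as positive, negative, mixed or neutral by word lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    has_positive = any(word in POSITIVE for word in words)
    has_negative = any(word in NEGATIVE for word in words)
    if has_positive and has_negative:
        return "mixed"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"
```

For example, `classify_sentiment("Great screen but poor battery")` returns `"mixed"`, because the review contains words from both lists.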