Web Mining & Text Mining

Prepared by : Sharma Hemant

Web Mining

Web Mining is the application of data mining techniques to extract knowledge

from web data such as Web content, Web structure and Web usage data.

It is the process of discovering useful and previously unknown information from web data.

Web data is :-

• Web content :- text, images, records, etc.

• Web structure :- hyperlinks, tags, etc.

• Web usage :- HTTP logs, app server logs, etc.

Web Content Mining

Web content mining is performed by extracting useful information from the content of a web page/site.

It includes extraction of structured data/information from web pages, and identification, matching, and integration of semantically similar data.

Web content may consist of text, images, audio, video, etc. When the content is text, web content mining is also known as text mining.

It uses Natural Language Processing (NLP) and Information Retrieval (IR) techniques to mine the data.

Web Structure Mining

The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges, each connecting two related pages.

Web structure mining is the process of discovering structure information from the

web.

• This type of mining can be performed either at the (intra-page) document level or the

(inter-page) hyperlink level.

• The research at the hyperlink level is also called Hyperlink Analysis.

Web Structure Terminology

Web-graph : A directed graph that represents the Web.

Node : Each Web page represents a node of the Web-graph.

Link : Each hyperlink on the Web is a directed edge of the Web-graph.

In-degree : The number of distinct links that point to a node.

Out-degree : The number of distinct links originating at a node that point to other

nodes.

Directed Path : A sequence of links, starting from a node, say r, that can be followed to reach another node, say t.

Shortest Path : The path with the shortest length out of all the paths between

nodes p and q.

Diameter : The maximum of the shortest-path lengths, taken over all pairs of nodes p and q in the Web-graph.
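The terms above can be illustrated on a small example. The following is a minimal sketch in Python, not taken from the original slides: the four-page graph and all function names are hypothetical, and the diameter is computed only over pairs that are actually connected by a directed path (an assumption, since unreachable pairs have no shortest path).

```python
from collections import deque

# Hypothetical Web-graph: each key is a page (node), each value is the
# set of pages it links to (directed edges).
web_graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}

def out_degree(graph, node):
    """Number of distinct links originating at `node`."""
    return len(graph[node])

def in_degree(graph, node):
    """Number of distinct links that point to `node`."""
    return sum(1 for targets in graph.values() if node in targets)

def shortest_path_length(graph, source, target):
    """Length in links of the shortest directed path, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search visits nodes in order of distance
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def diameter(graph):
    """Largest shortest-path length over all connected ordered pairs."""
    lengths = [
        shortest_path_length(graph, p, q)
        for p in graph
        for q in graph
        if p != q
    ]
    return max(d for d in lengths if d is not None)

print(in_degree(web_graph, "C"), out_degree(web_graph, "A"))
print(shortest_path_length(web_graph, "D", "B"), diameter(web_graph))
```

In this toy graph, page C has the highest in-degree because three other pages link to it, and no page links to D, so D cannot be reached from anywhere.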

Web Usage Mining

The Web is a collection of interrelated files on one or more Web servers.

Web usage mining is the discovery of meaningful patterns from data generated by client-server transactions on one or more Web localities.

Typical Sources of Data :

• Automatically generated data stored in server access logs, referrer logs, agent logs, and

client-side cookies.

• User profiles.

• Metadata : page attributes, content attributes, usage data.

Web servers, Web proxies, and client applications can quite easily capture Web usage data.

Web Server Log : It is a file that is created by the server to record all the

activities it performs.

For example, a request is logged when a user enters a URL into the browser's address bar or clicks on a link.

For each page request it receives, the web server records information in its log such as the requested URL, whether the request was successful, the user's IP address, and the date and time.
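As an illustration of the fields listed above, here is a minimal Python sketch that parses one log entry. It assumes the server writes the widely used Common Log Format; the sample line, the `parse_log_line` name, and the rule that a 2xx status code counts as "successful" are assumptions made for this example, not part of the original slides.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [timestamp] "METHOD url PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Assumption for this sketch: only 2xx status codes count as success.
    entry["successful"] = 200 <= entry["status"] < 300
    return entry

# Hypothetical log entry (the IP is from a documentation-only address range).
sample = '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["successful"], entry["time"].date())
```

Once log lines are turned into records like this, usage patterns such as the most requested pages or the pages visited per IP address can be counted directly.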

Text Mining

The objective of Text Mining is to exploit information contained in textual

documents in various ways, including discovery of patterns and trends in data,

associations among entities, predictive rules, etc.

The results can be important both for :

• The analysis of the collection, and

• Providing intelligent navigation and browsing methods.

Text Mining Workflow

Data Mining vs Text Mining

Both seek novel and useful patterns.

Both are semi-automated processes.

Difference is the nature of the data:

• Structured versus Unstructured data

• Structured data: databases

• Unstructured data: Word documents, PDF files, XML files, and so on

Text mining: first impose structure on the data, then mine the structured data.
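One common way to impose structure on text is a bag-of-words term-document matrix. The sketch below is only an illustration of the two-step idea: the two sample documents, the simple lowercase tokenizer, and the "terms shared by every document" query are assumptions chosen for brevity, not a method prescribed by the slides.

```python
import re
from collections import Counter

# Hypothetical unstructured input.
documents = [
    "Web mining extracts knowledge from web data.",
    "Text mining extracts patterns from text documents.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: impose structure. Each document becomes a row of term counts.
term_counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*term_counts))
matrix = [[counts[term] for term in vocabulary] for counts in term_counts]

# Step 2: mine the structured data, here with a very simple query:
# which terms occur in every document?
shared_terms = [t for t in vocabulary if all(c[t] > 0 for c in term_counts)]

print(vocabulary)
for row in matrix:
    print(row)
print(shared_terms)
```

After step 1 the text is an ordinary table of numbers, so standard data mining techniques can be applied to it.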

Technology premise of Text Mining

Summarization : The process of producing a summary of a document that contains a large amount of information, while preserving the theme or main idea of the document.

Information Extraction : It exploits relations within the text, using pattern matching to find them.

Categorization : A supervised learning technique that assigns a document to a category according to its content. Document categorization is widely used in libraries.

Visualization : The use of computer graphics to represent information and reveal relationships.

Clustering : An unsupervised technique, based on the textual similarity of documents, that is used in data analysis to divide texts into mutually exclusive groups.

Question Answering : It deals with finding the most suitable answer to a particular question posed in natural language.

Sentiment Analysis : Also known as opinion mining, it classifies a user's emotion, mostly into several classes: positive, negative, neutral and mixed. It is mainly used to capture people's views or attitudes towards something, such as services and products.
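To make the four sentiment classes concrete, here is a deliberately tiny, lexicon-based sketch in Python. The word lists and the `classify_sentiment` function are hypothetical and far too small for real use; practical systems typically rely on much larger lexicons or trained classifiers.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "excellent", "love", "fast"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "slow"}

def classify_sentiment(text):
    """Label text as positive, negative, mixed, or neutral by word lookup."""
    tokens = re.findall(r"[a-z]+", text.lower())
    has_positive = any(t in POSITIVE_WORDS for t in tokens)
    has_negative = any(t in NEGATIVE_WORDS for t in tokens)
    if has_positive and has_negative:
        return "mixed"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"

reviews = [
    "The service was great",
    "Terrible product",
    "Great screen but slow delivery",
    "The parcel arrived on Monday",
]
for review in reviews:
    print(review, "->", classify_sentiment(review))
```

A review containing words from both lists is labelled mixed, and one containing words from neither is labelled neutral.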
