Chapter 7 DATA, TEXT, AND WEB MINING Pages 304-309, 311, Sections 7.3, 7.5, 7.6

Upload: cory-gallagher

Post on 28-Dec-2015

Page 1

Chapter 7: DATA, TEXT, AND WEB MINING

Pages 304-309, 311, Sections 7.3, 7.5, 7.6

Page 2

Data mining

A process that uses statistical, mathematical, artificial intelligence and machine-learning techniques to extract and identify new knowledge from large databases

Recognizes the untapped value of data in large databases

You may unexpectedly strike it rich by discovering relationships among data

Page 3

Example

Task: Find the best route to cover the territory

Page 4

Challenge of finding relationships in large databases

Page 5

Connect equal elevation points to make a contour map

The dark vertical line shows the best route to cross the territory without falling off a cliff.

Page 6

Once relationships are discovered, they can be used for prediction

Page 7

Uses of Data Mining-1

• Classification: Identify the attribute of interest (e.g., you want to classify who is likely to pay late). Examine all other attribute values of the customer from the data warehouse and locate the one that is most related to the attribute of interest (e.g., monthly income level)

• Mining Algorithm: The most common algorithm used for classification is the decision tree

Gini Index: helps to determine where to place the split between two classes (e.g., at what income level); used in developing decision trees (see example on page 316)
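
A minimal sketch of how a Gini-based split point might be located. It assumes the common form of the Gini index, 1 minus the sum of squared class proportions, which the slide does not spell out. The incomes and late-payment labels are made up for illustration, and gini and best_split are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weight each side's impurity by its share of the cases.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical customers: monthly income and payment behaviour.
incomes = [1500, 1800, 2200, 2500, 3200, 3600, 4000, 4500, 5200]
paid = ["late", "late", "late", "late", "on time",
        "late", "on time", "on time", "on time"]

threshold, score = best_split(incomes, paid)
print(f"split at income <= {threshold}, weighted Gini = {score:.3f}")
```

A lower weighted Gini means purer groups on each side of the split; a decision tree would repeat this search within each resulting group.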

Page 8

Which product class is the best seller?

Conclusion: Clay products with a price below $25!

Page 9

Uses of Data Mining-2

• Segmentation: Partitioning a database into groups in which the members of each group share similar characteristics

• Mining Algorithm: Clustering. The objective is to sort cases into groups so that similarities are strong among members of the same cluster and weak between members of different clusters

E.g., companies with over 100 employees may have more in common with one another (e.g., revenue size) than with companies that have fewer than 100 employees.

This knowledge can help in developing different policies for dealing with different types of companies
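
To make the clustering idea concrete, here is a small sketch that uses k-means on a single attribute. K-means is one common clustering algorithm; the slide does not name a specific one, so treat the choice as an assumption. The revenue figures are invented, and kmeans_1d is a hypothetical helper that expects k of at least 2.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny k-means for one numeric attribute; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the k centroids evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each case joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical annual revenue (in millions) for eight companies.
revenue = [1.2, 0.8, 1.5, 2.0, 14.0, 16.5, 15.2, 18.0]
centroids, labels = kmeans_1d(revenue, k=2)
print(centroids, labels)
```

Each company ends up with a cluster label, and the centroids summarize the "typical" member of each segment.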

Page 10

Uses of Data Mining-3

• Association: A category of data mining algorithm that establishes relationships among items that occur together in a given record

E.g., you may discover from the data that senior students take certain elective courses together in the final semester. This can be helpful when scheduling courses

People who buy a suit may also buy a dress shirt

People who buy swimwear may also buy fins, goggles, a cap, etc.
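
One way to surface such item pairs is to count how often they occur together in the same record. The sketch below uses support and confidence, two measures commonly used for association rules but not mentioned on the slide, so they are an assumption here. The baskets are invented, and pair_rules is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of records containing both items
        if support < min_support:
            continue
        # Confidence: of the records with the antecedent, how many also
        # contain the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical sales records.
baskets = [
    ["suit", "dress shirt", "tie"],
    ["suit", "dress shirt"],
    ["swimwear", "goggles", "cap"],
    ["swimwear", "goggles", "fins"],
    ["suit", "tie"],
]

rules = pair_rules(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```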

Page 11

Uses of Data Mining-4

• Sequence discovery: The identification of associations over time, i.e., discovering the order in which events occur.

• The algorithm can examine data and predict what event is most likely to occur next.

Widely used in studying how visitors navigate a Web site. Helps to improve the chances of making a sale.
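
As a rough illustration of predicting the next event, the sketch below counts which page most often follows each page in a set of visitor sessions. This first-order counting scheme is an assumed, simplified stand-in for a real sequence-discovery algorithm. The sessions and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Most frequent follower of `page`, or None if the page was never left."""
    if page not in transitions:
        return None
    return transitions[page].most_common(1)[0][0]


# Hypothetical visitor sessions on a shopping site.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart"],
    ["home", "search", "product", "reviews"],
    ["search", "product", "cart", "checkout"],
]

transitions = build_transitions(sessions)
print(predict_next(transitions, "product"))
```

A site could use such a prediction to, for example, make the likely next step easier to reach.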

Page 12

Uses of Data Mining-5

• Forecasting: Estimates future values based on patterns within large sets of data

Regression is a statistical technique that is used to map data to a predicted value

E.g., gasoline prices this month may help predict next month’s sales of SUVs
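
The gasoline example can be sketched with simple linear regression, fitted by ordinary least squares. A straight-line model is an assumption made for illustration, and the price and sales figures below are invented. fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: gasoline price in one month (dollars per gallon)
# and SUV unit sales in the following month.
gas_price = [2.0, 2.5, 3.0, 3.5, 4.0]
suv_sales = [910, 790, 705, 590, 505]

slope, intercept = fit_line(gas_price, suv_sales)

# Forecast next month's SUV sales from this month's price.
this_month_price = 3.2
forecast = slope * this_month_price + intercept
print(f"forecast SUV sales: {forecast:.0f}")
```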

Page 13

Data Mining Concepts and Applications

Data mining applications

– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security

Page 14

Text Mining: Application of data mining to text files, typically free-form (unstructured) text material

Discovers new knowledge that is not obvious

Examples:

Examine all news services, cluster similar topics, create a new summary for each topic

Find the “hidden” content of documents, including additional useful relationships, e.g., lies, deceptions, scams

Not the same as a search engine on the Web.

Page 15

Text Mining – how is it done?

It entails generating meaningful numerical indices/factors from the unstructured text and then processing these indices using various data mining algorithms

Example:

• Extract each word from the document being text mined
• Eliminate commonly used words (the, and, other, etc.)
• Combine synonyms and phrases
• Calculate weights for each term:

– tf factor (term frequency): the actual number of times a word appears in a document
– idf factor (inverse document frequency): reflects how rare the word is across multiple documents

A high tf value for a given term indicates that the document topic is probably related to the meaning of that term!
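
The weighting steps can be sketched as follows. This version takes idf as log(N / df), one common variant that the slide does not specify, and it skips the synonym-combining step. The stop-word list, sample documents, and function names are all illustrative assumptions.

```python
import math
from collections import Counter

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "other", "a", "of", "is", "to", "in"}


def tokenize(text):
    """Lowercase, strip punctuation, and drop commonly used words."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return weights


# Hypothetical help-desk style documents.
documents = [
    "The warranty claim is about a broken screen and a broken hinge.",
    "The help desk call is about a password reset.",
    "The warranty claim covers the battery.",
]

weights = tf_idf(documents)
top_term = max(weights[0], key=weights[0].get)
print(top_term, round(weights[0][top_term], 3))
```

Terms that are frequent in one document but rare elsewhere get the highest weights, which is what makes them useful as numerical indices for later mining.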

Page 16

Text Mining - applications

– Automatic detection of e-mail spam or phishing through analysis of the document content

– Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message

– Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses

Page 17

Web Mining: The discovery and analysis of interesting and useful information from the Web

Page 18

Web content mining

The extraction of useful information from Web pages

E.g., search with the help of keywords in the meta tags of the Web page

You can analyze the document content of the first 10 links of Google in a search response

You can generate a summary of the contents automatically in a new document!
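
As a small example of content mining, the sketch below pulls the keywords out of a page's meta tags using Python's standard html.parser module. The sample page is invented, and MetaKeywordParser and extract_keywords are hypothetical names. Fetching real pages and summarizing them are left out.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical Web page.
page = """
<html><head>
  <title>Clay Studio</title>
  <meta name="description" content="Handmade pottery shop">
  <meta name="keywords" content="pottery, clay, Handmade mugs">
</head><body><p>Welcome!</p></body></html>
"""

print(extract_keywords(page))
```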

Page 19

Web structure mining

The development of useful information from the links included in the Web documents

If a Web site’s pages predominantly link to each other, you may consider the site to be ‘independent’

If a collection of Web sites link to each other heavily, this points to a Web community or clan that shares common interests

Example application: Web structure mining can lead to better understanding of extremist groups
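
Both observations can be approximated with simple link counts. The sketch below assumes the links have already been collected into a mapping from each site to the sites it links to. The site names, the mapping, and the two helper functions are hypothetical.

```python
def internal_ratio(site, outgoing):
    """Share of a site's outgoing links that point back to the same site."""
    targets = outgoing.get(site, [])
    if not targets:
        return 0.0
    return sum(1 for t in targets if t == site) / len(targets)


def mutual_links(outgoing):
    """Pairs of distinct sites that link to each other in both directions."""
    pairs = set()
    for site, targets in outgoing.items():
        for target in targets:
            if target != site and site in outgoing.get(target, []):
                pairs.add(tuple(sorted((site, target))))
    return sorted(pairs)


# Hypothetical link data: site -> list of sites its pages link to.
outgoing = {
    "shop.example": ["shop.example"] * 4 + ["news.example"],
    "club-a.example": ["club-b.example", "club-c.example", "club-a.example"],
    "club-b.example": ["club-a.example", "club-c.example"],
    "club-c.example": ["club-a.example", "club-b.example"],
    "news.example": ["club-a.example"],
}

print(internal_ratio("shop.example", outgoing))
print(mutual_links(outgoing))
```

A high internal ratio suggests a self-contained site, while a dense set of mutual links suggests a community of related sites.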

Page 20

Web usage mining

The extraction of useful information from the data generated through Web page visits, transactions, etc.

Clickstream analysis

Uses cookies, number of logs, time of log, etc.

Can help profile users
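
A simple user profile can be built by grouping log entries by cookie. The sketch below assumes each log entry is a (cookie ID, ISO timestamp, page) tuple. That layout, the sample entries, and profile_users are illustrative assumptions rather than a real server log format.

```python
from collections import defaultdict
from datetime import datetime


def profile_users(log_entries):
    """Summarize hits, pages visited, and time on site for each cookie ID."""
    visits = defaultdict(list)
    for cookie_id, timestamp, page in log_entries:
        visits[cookie_id].append((datetime.fromisoformat(timestamp), page))
    profiles = {}
    for cookie_id, hits in visits.items():
        hits.sort()  # order each user's hits by time
        first_seen, last_seen = hits[0][0], hits[-1][0]
        profiles[cookie_id] = {
            "hits": len(hits),
            "pages": [page for _, page in hits],
            "seconds_on_site": (last_seen - first_seen).total_seconds(),
        }
    return profiles


# Hypothetical clickstream log.
log = [
    ("u1", "2015-12-01T09:00:00", "/home"),
    ("u2", "2015-12-01T09:01:00", "/home"),
    ("u1", "2015-12-01T09:02:30", "/products/clay"),
    ("u1", "2015-12-01T09:05:00", "/cart"),
    ("u2", "2015-12-01T09:03:00", "/contact"),
]

profiles = profile_users(log)
print(profiles["u1"])
```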

Page 21

Uses for Web mining

– Determine the lifetime value of clients
– Design cross-marketing strategies across products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user groups
– Predict user behavior
– Present dynamic information to users

Page 22

Data Mining Project Processes

Page 23

Steps for Data Mining

• Problem definition: Decide the measure to study and the suitable mining algorithm (see Exercise 11)

• Data preparation: Design the cube and populate it with relevant data from the data warehouse

• Training: Run the mining algorithm on a subset of the data warehouse data so that the system learns to find segments, associations, etc. among the data

• Validation: Run the ‘learnt’ model from the previous step on the remaining subset of data and try to ‘predict’. Since you have historical data, you can verify whether the ‘learnt’ model is any good.

• Deploy: Use the model to predict in the real environment, where you do not know the actual results.
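
The training, validation, and deployment steps can be sketched with a deliberately simple model: a single income threshold for predicting late payment, echoing the classification example earlier in the deck. The historical records are invented, the split into the first eight and last four rows is arbitrary, and both function names are hypothetical.

```python
def train_threshold(rows):
    """Training: pick the income threshold that best separates late payers."""
    best_threshold, best_correct = None, -1
    for candidate in sorted({income for income, _ in rows}):
        # Rule being tested: predict 'late' when income <= candidate.
        correct = sum((income <= candidate) == paid_late
                      for income, paid_late in rows)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def accuracy(threshold, rows):
    """Validation: share of rows where the learnt rule matches the known outcome."""
    hits = sum((income <= threshold) == paid_late for income, paid_late in rows)
    return hits / len(rows)


# Hypothetical historical data: (monthly income, paid late?).
history = [
    (1500, True), (4200, False), (1800, True), (3900, False),
    (2200, True), (5200, False), (2600, True), (3100, False),
    (2000, True), (4800, False), (2900, True), (1700, True),
]

training, validation = history[:8], history[8:]

threshold = train_threshold(training)           # Training
val_accuracy = accuracy(threshold, validation)  # Validation
print(f"learnt rule: late if income <= {threshold}")
print(f"validation accuracy: {val_accuracy:.2f}")

# Deploy: apply the rule to a new customer whose outcome is unknown.
new_customer_income = 2400
print("predicted late payer:", new_customer_income <= threshold)
```

Because the validation rows were not used for training, their accuracy gives a fairer picture of how the rule might perform once deployed.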