
By Dr. Borne 2005 UMUC Data Mining Lecture 1 1

Data Mining UMUC CSMN 667

Lecture #1


So what is it?

Data Mining is “an information extraction activity whose goal is to discover hidden facts contained in large databases.”


Class Textbooks

• Margaret Dunham’s book: “Data Mining: Introductory and Advanced Topics”
  – from Prentice Hall
  – numerous publication dates listed (2002/2003)
  – there is only one edition (just buy it)

• APA Style Guide: Publication manual of the American Psychological Association (2001, 5th ed.) - required by UMUC


Additional Assignment for first month

• Set up database account on our class database server: dbcourse3.umuc.edu

• Refer to WebTycho for instructions:
  – Change your passwords immediately in 2 places: your Unix server account and your Oracle database account (both passwords are initially the same, but they are completely independent).


Reminders

• The word “DATA” is plural. The singular form of the word is “datum” -- one datum is okay, but many data are better.

• Time is what prevents everything from happening at once. So, please use good time management skills to keep from falling behind in your reading and other class assignments.


“Data Mining 101”

An Introduction to Data Mining

Data mining is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases.”


Evolution of Data Mining <http://www.thearling.com/text/dmwhite/dmwhite.htm>

| Evolutionary Step | Business Question | Enabling Technologies | Characteristics |
|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery |


Data Mining is Ready for Prime Time

• Data mining is ready for general application because it engages three technologies that are now sufficiently mature:

  – Massive data collection & delivery
  – Powerful multiprocessor computers
  – Sophisticated data mining algorithms


6 Business Reasons to use Data Mining

– Most organizations already collect and refine massive quantities of data.

– Their most important information is in their data warehouses.

– Data mining moves beyond the analysis of past events … to predicting future trends and behaviors that may be missed because they lie outside the experts’ expectations.

– Data mining tools can answer complex business questions that traditionally were too time-consuming to resolve.

– Data mining tools can explore the intricate interdependencies within databases in order to discover hidden patterns and relationships.

– Data mining allows decision-makers to make proactive, knowledge-driven decisions.


Another Business Reason to use Data Mining


A Key Concept for Data Mining

• Data Mining delivers actionable data:
  – data that support decision-making
  – data that lead to knowledge and understanding
  – data with a purpose

• i.e., Data do not exist for their own sake.

• The Data Warehouse is a corporate asset (whether in business, marketing, banking, science, telecommunications, entertainment, computer security, or Homeland Security).


Data Mining - the up side

• Data mining is everywhere:
  – Huge scientific databases (NASA, Human Genome,…)
  – Corporate databases (OLAP)
  – Credit card usage histories (Capital One)
  – Loan applications (Credit Scoring)
  – Customer purchase records (CRM)
  – Web traffic analysis (Doubleclick)
  – Network security intrusion detection (Silent Runner)
  – The hunt for terrorists (DARPA TIA)
  – The NBA! … the NBA??


Data Mining - the down side

• Data mining is a pejorative in the business database community (“data dredging”)
  – They prefer to call it Knowledge Discovery, or Business Intelligence, or CRM (Customer Relationship Management), or Marketing, or OLAP (On-Line Analytical Processing)

• The Data Mining Moratorium Act of 2003
  – see first page of the bill on next slide
  – debated within the U.S. Congress
  – privacy concerns
  – directed primarily against the DARPA TIA Program (Total Information Awareness)


108TH CONGRESS, 1ST SESSION

S. __________

IN THE SENATE OF THE UNITED STATES

Mr. FEINGOLD introduced the following bill; which was read twice and referred to the Committee on _________________

A BILL

To impose a moratorium on the implementation of data-mining under the Total Information Awareness program of the Department of Defense and any similar program of the Department of Homeland Security, and for other purposes.

Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,

SECTION 1. SHORT TITLE.

This Act may be cited as the “Data-Mining Moratorium Act of 2003”.

SEC. 2. FINDINGS.

Congress makes the following findings:

http://www.cdt.org/legislation/108th/privacy/030122feingold.pdf


The Information Age is Here!

• "Data doubles about every year, but useful information seems to be decreasing."
  – Margaret Dunham, "Data Mining Techniques & Algorithms", 2002

• "There is a growing gap between the generation of data and our understanding of it."
  – Witten & Frank, "Data Mining: Practical Machine Learning Tools", 1999

• "The trouble with facts is that there are so many of them."
  – Samuel McChord Crothers, "The Gentle Reader", 1973

• "Get your facts first, and then you can distort them as much as you please."
  – Mark Twain


Characteristics of The Information Age:

• Data “Avalanche”
  – the flood of Terabytes of data is already happening, whether we like it or not
  – our present techniques of handling these data do not scale well with data volume

• Distributed Digital Archives
  – will be the main access to data
  – will need to handle hundreds to thousands of queries per day

• Systematic Data Exploration and Data Mining
  – will have a central role
    • statistical analysis of “typical” events
    • automated search for “rare” events


The Data Flood is Everywhere

• Huge quantities of data are being generated in all business, government, and research domains:
  – Banking, retail, marketing, telecommunications, other business transactions ...
  – Scientific data: genomics, astronomy, biology, etc.
  – Web, text, and e-commerce


5 million terabytes created in 2002

UC Berkeley 2003 estimate:

5 exabytes (5 million terabytes) of new data were created in 2002.

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

What is a gigabyte, terabyte, petabyte, exabyte, …?

Look at the definitions and examples in the following article:

http://www.jamesshuggins.com/h/tek1/how_big.htm


Data Growth Rate

• Twice as much information was created in 2002 as in 1999 (~30% annual growth rate).

• Other growth rate estimates are even higher.

• Very little of these data will ever be looked at by a human.

• Data Mining is NEEDED to make sense of and to make use of these data.
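
The growth-rate figure in the first bullet can be checked with a little arithmetic. The sketch below assumes steady compound growth over the three years from 1999 to 2002; the function name is illustrative and not part of the lecture.

```python
def annual_growth_rate(growth_factor: float, years: float) -> float:
    """Compound annual growth rate implied by an overall growth factor."""
    return growth_factor ** (1.0 / years) - 1.0


# "Twice as much information in 2002 as in 1999": a factor of 2 over 3 years.
rate = annual_growth_rate(2.0, 3)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the result is roughly 26% per year, which is consistent with the slide's rounded "~30%".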


What is Data Mining?

• Data mining is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases.”

• Data mining is used to find patterns and relationships in data. (EDA = Exploratory Data Analysis)

• Patterns can be analyzed via 2 types of models:
  – Descriptive: Describe patterns and create meaningful subgroups or clusters.
  – Predictive: Forecast explicit values, based upon patterns in known results.

• How does this become useful (not just bits of data)? ...
  – … through KNOWLEDGE DISCOVERY: Data → Information → Knowledge → Understanding / Wisdom!


Historical Note: Many Names of Data Mining

• Data Fishing, Data Dredging (1960-)
  – used by Statisticians (as a bad name)

• Data Mining (1990-)
  – used by DB & business communities
  – in 2003: bad image because of DARPA TIA

• Knowledge Discovery in Databases (1989-)
  – used by AI & Machine Learning communities

• also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery are used interchangeably.


Data Mining Examples

• Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.

• Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers.

• A financial institution discovered that credit applicants who used pencil on the form were much more likely to default on their debts than those who filled out the application using ink.

• Credit card companies recommend products to cardholders based on analysis of their monthly expenditures.

• Airline purchase transaction logs revealed that 9-11 hijackers bought one-way airline tickets with the same credit card.

• Astronomers examined objects with extreme colors in a huge database to discover the most distant Quasars ever seen.


Data Mining Application: Marketing

Sales Analysis

• associations between product sales:
  – beer and diapers
  – strawberry pop tarts and beer (and hurricanes)

Customer Profiling

• data mining can tell you what types of customers buy what products

Identifying Customer Requirements

• identify the best products for different customers

• use prediction to find what factors will attract new customers


Data Mining Application: Fraud Detection

Auto Insurance Fraud

• Association Rule Mining can detect a group of people who stage accidents to collect on insurance

Money Laundering

• Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions

Banking: Loan Fraud

• Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud


The Necessity of Data Mining

• Enormous interest in these data collections.

• The environment to exploit these data does not exist!
  – 1 Terabyte at 100 Mbits/sec takes 1 day to transfer.
  – Hundreds to thousands of queries per day.
  – Data will reside at multiple locations, in many different formats.
  – Existing analysis tools do not scale to Terabyte data collections.

• The need is acute! A solution will not just happen.
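
The transfer-time estimate above is easy to reproduce. This is a minimal sketch that assumes a decimal terabyte (10^12 bytes), 8 bits per byte, and a link that sustains its full nominal rate with no protocol overhead; the function name is illustrative.

```python
def transfer_time_seconds(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Ideal transfer time for a payload over a link of the given bit rate."""
    return size_bytes * 8 / rate_bits_per_sec


TERABYTE = 10**12  # assumption: decimal terabyte

seconds = transfer_time_seconds(TERABYTE, 100e6)  # 100 Mbits/sec
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")
```

Under these assumptions the transfer takes 80,000 seconds, a little over 22 hours, which matches the slide's "1 day".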


What is Knowledge Discovery?

• Knowledge discovery refers to “finding out new knowledge about an application domain using data on the domain usually stored in a database.”
  – Application domains: scientific, customer purchase records, computer network logs, web traffic logs, financial transactions, census data, basketball play-by-play histories, ...

• Why are Data Mining & Knowledge Discovery such hot topics? Because of the enormous interest in these huge databases and their potential for new discoveries.

• In large databases, Data Mining and Knowledge Discovery come in two flavors:
  – Event-based mining
  – Relationship-based mining


Event-Based Mining
(Event-based mining is based upon events or trends in data.)

Four distinct orthogonal categorizations:

• Known events / known models - use existing models (descriptive models) to locate known phenomena of interest either spatially or temporally within a large database.

• Known events / unknown models - use clustering properties of data to discover new relationships and patterns among known phenomena.

• Unknown events / known models - use known associations and relationships (predictive models) among parameters that describe a phenomenon to predict the presence of previously unseen examples of the same phenomenon within a large complex database.

• Unknown events / unknown models - use thresholds or trends to identify transient or otherwise unique ("one-of-a-kind") events and therefore to discover new phenomena. Serendipity!


Relationship-Based Data Mining (Based upon associations & relationships among data items)

• Spatial associations -- identify events or objects at the same physical spatial location, or at related locations (e.g., urban versus rural data).

• Temporal associations -- identify events or transactions occurring during the same or related periods of time (e.g., periodically, or N days after event X).

• Coincidence associations -- use clustering techniques to identify events that are co-located (that coincide) within a multi-dimensional parameter space.
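
As an illustration of a temporal association of the form "N days after event X", here is a minimal sketch. The event log, the labels "X" and "Y", and the function name are all hypothetical; a real system would run such a search as a query against a database rather than over an in-memory list.

```python
from datetime import date, timedelta


def followed_within(events, first, second, max_days):
    """Return (first_date, second_date) pairs in which an event labeled
    `second` occurs after, and no more than `max_days` days after,
    an event labeled `first`."""
    window = timedelta(days=max_days)
    firsts = [d for d, label in events if label == first]
    seconds = [d for d, label in events if label == second]
    return [(a, b) for a in firsts for b in seconds
            if timedelta(0) < b - a <= window]


# Hypothetical event log: (date, event label)
events = [
    (date(2005, 1, 3), "X"),
    (date(2005, 1, 5), "Y"),
    (date(2005, 1, 20), "Y"),
    (date(2005, 2, 1), "X"),
]

pairs = followed_within(events, "X", "Y", max_days=7)
print(pairs)
```

Only the January 3 / January 5 pair falls inside the 7-day window here; the January 20 event is too late and the February 1 event has no follower.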


Event-Based Mining (EBM) - Homeland Security Example
(EBM is based upon events or trends in data.)

• Known events / known models - use existing models (descriptive models) to locate known phenomena of interest within a large database.

– e.g., Terrorist activities have been financed through certain organizations. Search for similar transactions in large financial databases.

• Known events / unknown models - use clustering properties of data to discover new relationships and patterns among known phenomena.

– e.g., Search through credit card, travel, and phone histories of 9-11 hijackers to discover previously unknown characteristics and behavior patterns of terrorists.

• Unknown events / known models - use known associations and relationships (predictive models) among parameters that describe a phenomenon to predict the presence of previously unseen examples within a large complex database.

– e.g., Use knowledge of terrorist behavior patterns (e.g., heightened phone activity) to identify new terrorists and/or to raise new terrorist alerts.

• Unknown events / unknown models - use thresholds or trends or outlier detection to identify transient or otherwise unique ("one-of-a-kind") events, and therefore to discover new phenomena.

– e.g., Explore all known data (including intelligence, news reports, e-mail, credit card histories, phone records, organizational memberships) to identify new threats.


Relationship-Based Mining (RBM) - Homeland Security Example

(RBM is based upon associations and relationships among data items.)

• Spatial associations -- identify events (e.g., airline ticket purchases) occurring at the same location in some geospatial parameter space (e.g., travel on the same flights).

• Temporal associations -- identify events occurring during the same or related periods of time (e.g., airline ticket purchases for travel on the same flights purchased at the same time).

• Coincidence associations -- use clustering techniques to identify events that are co-located within a multi-dimensional parameter space (e.g., airline tickets for the same flights purchased at the same time as one-way tickets on the same credit card, by travelers of Mid-East origin who had recently entered the U.S., were students in flight schools, had records of numerous phone calls to Afghanistan, and had visited Hamburg, Germany at some time in the past few years).


User Requirements for a Data Mining System
(What features must a D.M. system have for your users?)

• Cross-Identification - refers to the classical problem of associating the objects listed in one database to the objects listed in another.

• Cross-Correlation - refers to the search for correlations, tendencies, and trends between parameters in multi-dimensional data, usually across databases.

• Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a single database.

• Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to one or more databases in the hope of making a serendipitous discovery of new events/objects or a new class of events/objects.


Representative Data Mining Architecture <http://www.thearling.com/text/dmwhite/dmwhite.htm>


Data leads to Knowledge leads to Understanding

Remember what we said earlier :

EXAMPLE :

• Data = 00100100111010100111100 (stored in database)

• Information = ages and heights of children (metadata)

• Knowledge = the older children tend to be taller

• Understanding = children’s bones grow as they get older

Data → Information → Knowledge → Understanding / Wisdom!


Astronomy Example

Data:
  (a) Imaging data (ones & zeroes)
  (b) Spectral data (ones & zeroes)

Information (catalogs / databases):
  – Measure brightness of galaxies from image (e.g., 14.2 or 21.7)
  – Measure redshift of galaxies from spectrum (e.g., 0.0167 or 0.346)

Knowledge:
  – Hubble Diagram
  – Redshift-Brightness Correlation
  – Redshift = Distance

Understanding: the Universe is expanding!!


Goal of Data Mining

• The end goal of data mining is not the data themselves, but the new knowledge and understanding that are revealed in the process = Business Intelligence (BI). (Remember what we said about the business community’s opinion of D.M.)

• This is why the research field is usually referred to as KDD = Knowledge Discovery in Databases.


Some words of wisdom

• "We have confused information (of which there is too much) with ideas (of which there are too few)."
  – Paul Theroux

• "The great Information Age is really an explosion of non-information; it is an explosion of data ... it is imperative to distinguish between the two; information is that which leads to understanding."
  – R.S. Wurman in his book: Information Anxiety 2


The Data Mining Process (more about this later)

The most important and time-consuming step is Cleaning the Data.


Data Mining Methods and Some Examples

Methods:

• Clustering
• Classification
• Associations
• Neural Nets
• Decision Trees
• Pattern Recognition
• Correlation/Trend Analysis
• Principal Component Analysis
• Regression Analysis
• Outlier/Glitch Identification
• Visualization
• Autonomous Agents
• Self-Organizing Maps (SOM)
• Link (Affinity) Analysis

Examples of what these methods do:

• Find all groups and classes of objects represented in the data
• Classify new data items using the known classes & groups
• Find associations and patterns among different data items
• Organize information in the database based on relationships among key data descriptors
• Identify linkages between data items based on features shared in common


Some Data Mining Techniques Graphically Represented

[Figure panels: Self-Organizing Map (SOM); Outlier (Anomaly) Detection; Clustering; Link Analysis; Decision Tree; Neural Network]


Remember what it is …

Data Mining is “an information extraction activity whose goal is to discover hidden facts contained in large databases.”


Data Mining Technique: Clustering

In this case, three different groups (classes) of items were found among all of the items in the data set.
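
The slide's figure is not reproduced in this transcript, so the following is only a sketch of how three groups might be found. It uses plain k-means, which is one common clustering algorithm and not necessarily the one behind the figure. The sample points and the choice of starting centroids are made up for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid-update steps
    until the centroids stop moving (or max_iter is reached)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D data with three visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 1.0), (9.2, 0.9), (8.9, 1.2)]

# Assumption: one starting centroid is taken from each region of the data.
labels, centroids = kmeans(points, centroids=[points[0], points[3], points[6]])
print(labels)
```

K-means needs the number of groups up front and is sensitive to its starting centroids, which is why the starting points are passed in explicitly here.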


Data Mining Technique: Decision Tree Classification

Question:

Should I play tennis today?

(I must really love tennis!)

Similar to game “20 questions”

Same technique used by bank loan officers to identify good potential customers versus poor customers.
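
The tree drawn on the slide is not part of this transcript. As a stand-in, the sketch below encodes the widely used textbook "play tennis" tree (outlook, then humidity or wind). Treat the specific attributes and splits as an assumption, not as the slide's content.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Walk a small, hand-written decision tree, one question per level."""
    if outlook == "overcast":
        return True                   # overcast: always play
    if outlook == "sunny":
        return humidity == "normal"   # sunny: play only if humidity is normal
    if outlook == "rain":
        return wind == "weak"         # rain: play only if the wind is weak
    raise ValueError(f"unknown outlook: {outlook!r}")


print(play_tennis("sunny", "high", "weak"))     # sunny and humid
print(play_tennis("rain", "normal", "weak"))    # light rain, weak wind
```

Each call answers at most two questions before reaching a leaf, which is the "20 questions" flavor the slide refers to. In practice the tree is learned from labeled records rather than written by hand.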


Data Mining Technique: Association Rule Mining (Market Basket Analysis)

Sales records:

| transaction id | customer id | products bought |
|---|---|---|
| tran1 | cust33 | p2, p5, p8 |
| tran2 | cust45 | p5, p8, p11 |
| tran3 | cust12 | p1, p9 |
| tran4 | cust40 | p5, p8, p11 |
| tran5 | cust12 | p2, p9 |
| tran6 | cust12 | p9 |

• Trend (Rule): Products p5, p8 often bought together
• Trend (Rule): Customer 12 likes product p9
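
The two trends can be recovered from the six sales records with simple counting. The sketch below computes pair support (the fraction of transactions containing both items) and rule confidence (how often the consequent appears when the antecedent does). These are the standard measures in association rule mining, though the slide itself does not name them. Function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# The six sales records from the slide: id -> (customer, products bought).
transactions = {
    "tran1": ("cust33", {"p2", "p5", "p8"}),
    "tran2": ("cust45", {"p5", "p8", "p11"}),
    "tran3": ("cust12", {"p1", "p9"}),
    "tran4": ("cust40", {"p5", "p8", "p11"}),
    "tran5": ("cust12", {"p2", "p9"}),
    "tran6": ("cust12", {"p9"}),
}


def pair_support(baskets):
    """Fraction of baskets that contain each pair of products."""
    counts = Counter()
    for items in baskets:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


baskets = [items for _, items in transactions.values()]
support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair], confidence(baskets, "p5", "p8"))

# Second trend: every purchase by cust12 includes p9.
cust12_baskets = [items for cust, items in transactions.values() if cust == "cust12"]
print(all("p9" in b for b in cust12_baskets))
```

On this data the pair (p5, p8) appears in 3 of 6 transactions and every basket containing p5 also contains p8, so the rule "p5 implies p8" has support 0.5 and confidence 1.0.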


Data Mining Algorithm: The SOM

Figure: The SOM (Self-Organizing Map) is one technique for organizing information in a database based upon links between concepts.

It can be used to find hidden relationships and patterns in more complex data collections, usually based on links between keywords or metadata.


Data Mining Application: Outlier Detection

Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#).

Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-a-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).
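
As a much-simplified illustration of threshold-based outlier detection, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. It works on a single parameter, not on the multidimensional parameter space shown in the figure, and the threshold of 2.0 and the sample readings are arbitrary assumptions.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one obvious glitch.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(zscore_outliers(readings))
```

A single extreme value inflates both the mean and the standard deviation, so on real data a more robust statistic (for example, one based on the median) is often preferred.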


Link Analysis for Homeland Security: Find all connections and relationships among known terrorists.
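
At its simplest, link analysis treats records as nodes in a graph and known relationships as edges, then asks which nodes are reachable from a starting node. The sketch below does this with a breadth-first search; the node names and links are placeholders, and real link-analysis tools add edge types, weights, and visualization on top.

```python
from collections import deque


def connected_to(links, start):
    """Return every node reachable from `start` through any chain of
    undirected links (breadth-first search)."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


# Hypothetical links, e.g. a shared address, phone number, or account.
links = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(connected_to(links, "A")))
```

Here A is linked to C only indirectly, through B, which is exactly the kind of connection that is hard to spot by reading records one at a time.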


Data Mining Technology: Parallel Mining

Figure: Parallel Data Mining

The application of parallel computing resources and parallel data access (e.g., RAID) enables concurrent drill-downs into large data collections.


Data Mining Methods Explained

• Clustering: Group data items according to tight relationships.

• Classification: Assign data items to predetermined groups.

• Associations: Associate data with similar relationships. The beer-diaper example is an example of associative mining.

• Artificial Neural Networks (ANN): Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Decision Trees: Hierarchical sets of decisions, based upon rules, for rapid classification of a data collection.

• Sequential Patterns: Identify or predict behavior patterns or trends.

• Genetic Algorithms: Rapid optimization techniques that are based on the concepts of natural evolution.

• Nearest Neighbor Method: Classify a data item according to its nearest neighbors (records that are most similar).

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

• Data visualization: The illustration and visual interpretation of complex relationships in multidimensional data using graphics tools.

• Self-Organizing Map (SOM): Graphically organizes (in a 2-dimensional map) the information stored within a database based upon similarities and links between concepts. It can be used to find hidden relationships and patterns in more complex data collections.
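
To make one of the methods above concrete, here is a minimal sketch of the Nearest Neighbor Method: a new record is assigned the majority class of its k most similar known records. Similarity is taken to be Euclidean distance, and the training records, class labels, and k=3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training records.
    `train` is a list of (feature_tuple, label) pairs."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled records in a 2-D feature space.
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.0), "B"), ((6.0, 7.0), "B"),
]

print(knn_classify(train, (1.5, 1.5)))
print(knn_classify(train, (6.5, 6.5)))
```

No model is built in advance; all the work happens at query time, which is simple but can be slow on very large databases.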


Data Mining Techniques: techniques are based on Algorithms; techniques are used in Applications


http://www.kdnuggets.com/polls/2002/data_mining_techniques.htm

Poll of Users: Data Mining Techniques (October 2002)

“Which data mining techniques do you use regularly? (Choose several)” [825 votes total]

Decision Trees/Rules (128) …… 16%
Clustering (103) …… 12%
Statistics (101) …… 12%
Logistic Regression (75) …… 9%
Neural Networks (75) …… 9%
Association Rules (63) …… 8%
Visualization (52) …… 6%
Nearest Neighbor (42) …… 5%
Text Mining (30) …… 4%
Sequence Analysis (27) …… 3%
Genetic Algorithms (26) …… 3%
Bayesian Nets (24) …… 3%
Hybrid methods (21) …… 3%
Web mining (19) …… 2%
Naïve Bayes (19) …… 2%
Other (20) …… 2%


http://www.kdnuggets.com/polls/2003/data_mining_techniques.htm

Poll of Users: Data Mining Techniques (November 2003)

“Which data mining techniques do you use regularly? (Choose several)” [768 votes total]

Decision Trees/Rules (120) …… 16%
Clustering (93) …… 12%
Statistics (92) …… 12%
Neural Networks (71) …… 9%
Logistic Regression (69) …… 9%
Visualization (55) …… 7%
Association Rules (42) …… 5%
Nearest Neighbor (38) …… 5%
Text Mining (30) …… 4%
Web Mining (29) …… 4%
Sequence Analysis (24) …… 3%
Bayesian Nets (24) …… 3%
Support Vector Machines (24) …… 3%
Hybrid methods (23) …… 3%
Genetic Algorithms (12) …… 2%
Other (22) …… 3%


http://www.kdnuggets.com/polls/data_mining_tools_2002_june2.htm

Poll of Users: Data Mining Tools (June 2002) [967 votes total]

SPSS Clementine (128) ……….…….. 13%

Weka (101) …………………….……... 10%

SAS (100) …………….………………. 10%

CART/MARS (89) ….…………………. 9%

SPSS/AnswerTree (76) …………….... 8%

SAS Enterprise Miner (67) ……….…. 7%

Other commercial tools (65) …….….. 7%

Other free/open-source tools (57) ….. 6%

MATLAB (52) …………………………. 5%

Microsoft SQLServer/Excel (40) ……. 4%

Insightful Miner (36) …………………. 4%

IBM Intelligent Miner (35) …………... 4%

KXEN (35) ……………………………. 4%

C4.5 / C4.8 (29) …………………….... 3%

Angoss (26) ……………………….….. 3%

Megaputer Polyanalyst (10) ……….... 1%

Neuralware (8) ………………….……. 1%

Oracle Suite (Darwin) (8) ……………. 1%

Quadstone (3) ………..…………….. 0.3%

ThinkAnalytics (2) …..……………... 0.2%

“Which tools do you use?”


http://www.kdnuggets.com/polls/2003/data_mining_tools.htm

Poll of Users: Data Mining Tools (May 2003) [1252 votes total]

SPSS Clementine (176) ……….…….. 14%

SPSS/AnswerTree (110) ……………… 9%

SAS (102) …………….………………… 8%

Excel (92) ………………………………. 7%

Your own code (87) …………………… 7%

CART/MARS (76) ….………………….. 6%

SAS Enterprise Miner (76) ……….…… 6%

Other commercial tools (51) …….……. 4%

Microsoft SQLServer (50) …………….. 4%

Other free/open-source tools (49) ……. 4%

Prudsys Xelopes (46) ……………......... 4%

Weka (44) ……………………………….. 4%

Insightful Miner (38) …………………. 3%

R (37) ……………………................... 3%

C4.5 / C5 (36) ………………………… 3%

MATLAB (32) …………………………. 3%

IBM Intelligent Miner (22) …………... 2%

Oracle Suite (Darwin) (19) ………….. 2%

Angoss (17) ……………………….….. 1%

Megaputer Polyanalyst (12) ……….... 1%

Statsoft Statistica (10) ……………….. 1%

Unica(7), KXEN(4), Neuralware(4), …1%

“Which tools do you use?”


http://www.kdnuggets.com/polls/2002/current_application_fields.htm

Poll of Users: Where do you currently apply data mining? (June 2002)

“Industries/fields where you currently apply data mining?” [608 votes total]

Banking (77) …… 13%
Telecommunications (56) …… 9%
eCommerce/Web (53) …… 9%
Scientific data (51) …… 8%
Fraud Detection (51) …… 8%
Direct Marketing/Fundraising (42) …… 7%
Insurance (36) …… 6%
Retail (36) …… 6%
Biology/Genetics/Proteomics (32) …… 5%
Pharmaceuticals (31) …… 5%
Manufacturing (28) …… 5%
Supply Chain Analysis (21) …… 3%
Investment/Stocks (17) …… 3%
Security (14) …… 2%
Entertainment (10) …… 2%
Other (44) …… 7%
None (9) …… 1%


http://www.kdnuggets.com/polls/2004/data_mining_applications_industries.htm

Poll of Users: Where do you currently apply data mining? (August 2004)

“Industries/fields where you currently apply data mining?” [216 votes total]

Banking (29) …… 13%
Scientific data (20) …… 9%
Direct Marketing/Fundraising (19) …… 9%
Fraud Detection (19) …… 9%
Bioinformatics/Biotech (18) …… 8%
Insurance (15) …… 7%
Medical/Pharma (15) …… 7%
Telecommunications (12) …… 6%
eCommerce/Web (12) …… 6%
Investment/Stocks (9) …… 4%
Manufacturing (9) …… 4%
Retail (9) …… 4%
Security (8) …… 4%
Travel (2) …… 1%
Entertainment/News (1) …… 0.5%
Other (19) …… 9%


Data Mining 101 - Summary

• What? -- Data Mining is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases.”

• Why? -- To explore systematically and to make discoveries in huge databases.

• How? -- Apply one of many techniques to find patterns, relationships, groupings, classes, trends, anomalies, rare events, unusual connections, and causal connections among items in a database.

• Example -- The standard textbook example of data mining is the legendary trend found in grocery store logs: that men who buy diapers also tend to buy beer at the same time.

• Outcome -- “Actionable information” = make decisions based upon information discovered.

• What is needed -- “SIFTWARE” = software that aids in isolating interesting useful information by sifting through large databases.

• Real world application -- Data → Information → Knowledge → Understanding / Wisdom!


"It will work in practice, yes. But will it work in theory?"

- Jonathan Fenby, France on the Brink