
Page 1:

Basic Concepts in Data Mining

Kirk Borne, George Mason University

THE US NATIONAL VIRTUAL OBSERVATORY

Page 2:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 3:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 4:

The Scientific Data Flood

[Diagram with labels: Large Science Project, Pipeline, Scientific Data Flood]

Page 5:

The New Face of Science – 1

• Big Data (usually geographically distributed)
  – High-Energy Particle Physics
  – Astronomy and Space Physics
  – Earth Observing System (Remote Sensing)
  – Human Genome and Bioinformatics
  – Numerical Simulations of any kind
  – Digital Libraries (electronic publication repositories)

• e-Science
  – Built on Web Services (e-Gov, e-Biz) paradigm
  – Distributed heterogeneous data are the norm
  – Data integration across projects & institutions
  – One-stop shopping: “The right data, right now.”

Page 6:

The New Face of Science – 2

• Databases enable scientific discovery
  – Data Handling and Archiving (management of massive data resources)
  – Data Discovery (finding data wherever they exist)
  – Data Access (WWW-Database interfaces)
  – Data/Metadata Browsing (serendipity)
  – Data Sharing and Reuse (within project teams; and by other scientists – scientific validation)
  – Data Integration (from multiple sources)
  – Data Fusion (across multiple modalities & domains)
  – Data Mining (KDD = Knowledge Discovery in Databases)

Page 7:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 8:

So what is Data Mining?

• Data Mining is Knowledge Discovery in Databases (KDD)

• Data mining is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases.”

• Note: Machine Learning is the field of Computer Science research that focuses on algorithms that learn from data.

• Data Mining is the application of Machine Learning algorithms to large databases.

Page 9:

Scientific Data Mining

Data Mining is the Killer App for Scientific Databases.

• Scientific Data Mining References:
  – http://voneural.na.infn.it/
  – http://astroweka.sourceforge.net/
  – http://www.itsc.uah.edu/f-mass/
    • Framework for Mining and Analysis of Space Science data (F-MASS)

• Data mining is used to find patterns and relationships in data. (EDA = Exploratory Data Analysis)

• Patterns can be analyzed via 2 types of models:
  – Descriptive: describe patterns and create meaningful subgroups or clusters. (Unsupervised Learning, Clustering)
  – Predictive: forecast explicit values, based upon patterns in known results. (Supervised Learning, Classification)

• How does this apply to Scientific Research? … through KNOWLEDGE DISCOVERY

Data → Information → Knowledge → Understanding / Wisdom!

Page 10:

Astronomy Example

Data:
  (a) Imaging data (ones & zeroes)
  (b) Spectral data (ones & zeroes)

Information (catalogs / databases):
  – Measure brightness of galaxies from image (e.g., 14.2 or 21.7)
  – Measure redshift of galaxies from spectrum (e.g., 0.0167 or 0.346)

Knowledge: Hubble Diagram = Redshift-Brightness Correlation → Redshift = Distance

Understanding: the Universe is expanding!!

Page 11:

Astronomers have been doing Data Mining for centuries

“The data are mine, and you can’t have them!”

• Seriously ...
• Astronomers love to classify things ... (Supervised Learning, e.g., classification)
• Astronomers love to characterize things ... (Unsupervised Learning, e.g., clustering)
• And we love to discover new things ... (Semi-supervised Learning, e.g., outlier detection)

Page 12:

This sums it up ...

• Characterize the new (clustering)

• Assign the known (classification)

• Discover the unknown (outlier detection)

• 2 benefits of very large data sets within a scientific domain:
  – best statistical analysis of “typical” events
  – automated search for “rare” events

Graphic from S. G. Djorgovski

Page 13:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 14:

Database Systems and Data Mining

• Data mining brings novel non-traditional (Machine Learning) concepts to large DBMS (e.g., association mining; neural networks; decision trees; link analysis; pattern recognition; classification; regression; self-organizing maps). For example:

– Clustering Analysis = group together similar items, and separate the dissimilar items

– Classification = predict the class label

– Regression = predict a numeric attribute value

– Association Analysis = detect attribute-value conditions that occur frequently together
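As a hedged illustration of the "Regression = predict a numeric attribute value" item above, the short NumPy sketch below fits a least-squares straight line to two invented numeric attributes and predicts a new value; nothing in it comes from the lecture itself.

```python
# Hedged sketch (not from the slides): ordinary least-squares regression with NumPy.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # invented predictor attribute
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9])   # invented numeric attribute to predict

slope, intercept = np.polyfit(x, y, deg=1)      # fit y ~ slope*x + intercept
prediction = slope * 4.0 + intercept            # predict the attribute value at x = 4.0
print(f"fit: y = {slope:.2f} x + {intercept:.2f}; prediction at x = 4: {prediction:.2f}")
```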

Page 15:

Data Mining Methods and Some Examples

Methods: Clustering, Classification, Associations, Neural Nets, Decision Trees, Pattern Recognition, Correlation/Trend Analysis, Principal Component Analysis, Independent Component Analysis, Regression Analysis, Outlier/Glitch Identification, Visualization, Autonomous Agents, Self-Organizing Maps (SOM), Link (Affinity) Analysis

Some examples:
• Clustering = group together similar items and separate dissimilar items in the DB
• Classification = classify new data items using the known classes & groups
• Associations = find unusual co-occurring associations of attribute values among DB items
• Regression = predict a numeric attribute value
• Link (Affinity) Analysis = identify linkages between data items based on features shared in common
• Self-Organizing Maps = organize information in the database based on relationships among key data descriptors

Page 16:

Some Data Mining Techniques Graphically Represented

[Graphics: Self-Organizing Map (SOM), Outlier (Anomaly) Detection, Clustering, Link Analysis, Decision Tree, Neural Network]

Page 17:

Categories of Machine Learning and some Examples

• Supervised Learning
  – Classification

• Unsupervised Learning
  – Clustering
  – Link Analysis
  – Association Analysis

• Semi-supervised Learning
  – Outlier Detection
  – Class Discovery

Page 18:

Some Classification Algorithms

Classification = the process of learning and then applying a function that classifies the data into a set of predefined classes.

• Bayes Theorem
• Support Vector Machines (SVM)
• Decision Trees
• Regression
• Neural Networks
• Markov Modeling
• K-Nearest Neighbors
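One of the algorithms listed above, K-Nearest Neighbors, is compact enough to sketch from scratch. The following is a minimal, hedged NumPy version; the feature vectors and the "star"/"galaxy" labels are invented placeholders, not data from the lecture.

```python
# Minimal K-Nearest Neighbors sketch: assign a new feature vector the majority
# class label of its k closest training vectors (Euclidean distance).
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))   # distance to every training vector
    nearest = np.argsort(dists)[:k]                         # indices of the k closest
    return Counter(train_y[nearest]).most_common(1)[0][0]   # majority vote

train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
train_y = np.array(["star", "star", "galaxy", "galaxy"])
print(knn_classify(train_X, train_y, np.array([0.95, 0.9])))   # -> galaxy
```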

Page 19:

Classification - a 2-Step Process

1. Model Construction (Description): describing a set of predetermined classes = Build the Model.
   – Each data element/tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   – The set of tuples used for model construction = the training set
   – The model is represented by classification rules, decision trees, or mathematical formulae

2. Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values = Apply the Model.
   – It is important to estimate the accuracy of the model:
     • The known labels of the test sample are compared with the classification results from the model
     • Accuracy rate is the percentage of test set samples that are correctly classified by the model
     • Test set is chosen completely independent of the training set, otherwise overfitting will occur – overfitting is a bad thing!
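A minimal sketch of the 2-step process, assuming scikit-learn as the toolkit (the lecture does not prescribe one): Step 1 builds a decision tree on a training set, Step 2 applies it to an independently chosen test set and reports the accuracy rate. The data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # invented feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # invented class labels

# Step 1: Model Construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 2: Model Usage on the independent test set, with an accuracy estimate
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy rate on the test set: {accuracy:.1%}")
```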

Page 20:

Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)

There are 2 Classes!

How do you ...
– Separate them?
– Distinguish them?
– Learn the rules?
– Classify them?

[Figure: Apply Kernel (SVM)]

Page 21:

Some Clustering Algorithms

Clustering = the process of partitioning a set of data into subsets or clusters such that a data element belonging to a cluster is more similar to data elements belonging to that same cluster than to the data elements belonging to other clusters.

– Squared Error
– Nearest Neighbor
– K-Means (most popular; see the sketch below)
– Mixture Models (statistical)
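Since K-Means is flagged above as the most popular choice, here is a minimal from-scratch sketch of it (the points and k = 2 are invented); a production version would also handle empty clusters and use multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [2.0, 2.1], [2.1, 1.9], [1.9, 2.0]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```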

Page 22:

Clustering is used to discover the different unique groupings (classes) of attribute values. The case shown below is not obvious: one or two groups?

Page 23:

This case is easier: there are two groups. (In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)

Page 24:

Semi-supervised Learning: Outlier Detection and Class Discovery

Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#).

Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).

• statistical analysis of “typical” events

• automated search for “rare” events
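One hedged way to automate the "search for rare events" is to flag data elements that lie unusually far from the centroid of their data cloud; in the sketch below both the points and the 3-times-median-distance threshold are invented for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [8.0, 7.5]])  # last point is the "rare" one
centroid = X.mean(axis=0)
dist = np.sqrt(((X - centroid) ** 2).sum(axis=1))   # distance of each point from the centroid
threshold = 3.0 * np.median(dist)                   # invented rule of thumb for this tiny example
print("outlier indices:", np.where(dist > threshold)[0])   # -> [4]
```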

Page 25:

Outlier Detection: Serendipitous Discovery of Rare or New Objects & Events

Page 26:

Principal Components Analysis & Independent Components Analysis

Cepheid Variables: Cosmic Yardsticks
– One Correlation
– Two Classes!

... Class Discovery!

Page 27:

Why use Data Mining? Here are 6 reasons...

1. Most projects now collect massive quantities of data.
2. Because of the enormous potential for new discoveries in existing huge databases.
3. Data mining moves beyond the analysis of past events to predicting future trends and behaviors that may be missed because they lie outside experts’ expectations.
4. Data mining tools can answer complex questions that traditionally were too time-consuming to resolve.
5. Data mining tools can explore the intricate interdependencies within databases in order to discover hidden patterns and relationships.
6. Data mining allows decision-makers to make proactive, knowledge-driven decisions.

Page 28:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 29:

Basic Concepts = Key Steps

• The key steps in a data mining project usually invoke and/or follow these basic concepts:
  – Data browse, preview, and selection
  – Data cleaning and preparation
  – Feature selection
  – Data normalization and transformation
  – Similarity/Distance metric selection
  – ... Select the data mining method
  – ... Apply the data mining method
  – ... Gather and analyze data mining results
  – Accuracy estimation
  – Avoiding overfitting

Page 30:

Key Concept for Data Mining: Data Previewing

• Data Previewing allows you to get a sense of the good, bad, and ugly parts of the database

• This includes:
  – Histograms of attribute distributions
  – Scatter plots of attribute combinations
  – Max-Min value checks (versus expectations)
  – Summarizations, aggregations (GROUP BY)
  – SELECT UNIQUE values (versus expectations)
  – Checking physical units (and scale factors)
  – External checks (cross-DB comparisons)
  – Verify with input DB
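A hedged sketch of this previewing checklist, assuming pandas (and matplotlib for the two plot calls); the column names and values are invented stand-ins for a real catalog table.

```python
import pandas as pd

df = pd.DataFrame({                                   # tiny invented catalog extract
    "mag":      [14.2, 16.8, 18.1, 21.7, 19.5, 17.3],
    "redshift": [0.0167, 0.12, 0.35, 0.346, 0.21, 0.09],
    "filter":   ["r", "g", "r", "r", "g", "r"],
})

print(df.describe())                        # max-min value checks and summaries
print(df["filter"].unique())                # SELECT UNIQUE values (versus expectations)
print(df.groupby("filter")["mag"].mean())   # summarizations / aggregations (GROUP BY)
df.hist(column=["mag", "redshift"])         # histograms of attribute distributions
df.plot.scatter(x="mag", y="redshift")      # scatter plot of an attribute combination
```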

Page 31:

Key Concept for Data Mining: Data Preparation = Cleaning the Data

• Data Preparation can take 40-80% (or more) of the effort in a data mining project

• This includes:
  – Dealing with NULL (missing) values
  – Dealing with errors
  – Dealing with noise
  – Dealing with outliers (unless that is your science!)
  – Transformations: units, scale, projections
  – Data normalization
  – Relevance analysis: Feature Selection
  – Remove redundant attributes
  – Dimensionality Reduction
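A hedged sketch of a few of these cleaning steps with pandas; the column names, the bad-value sentinel (99.0), and the imputation choice are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({                                   # tiny invented catalog extract
    "mag":      [14.2, 16.8, np.nan, 21.7, 99.0, 18.4],
    "redshift": [0.0167, np.nan, 0.35, 0.346, 0.21, 0.09],
})

df = df.dropna(subset=["mag"])                                    # drop rows with NULL magnitudes
df["redshift"] = df["redshift"].fillna(df["redshift"].median())   # ...or impute missing values

df = df[df["mag"] < 90.0].copy()   # remove sentinel/bad magnitudes (unless outliers are your science!)

# simple 0-to-1 normalization of one attribute
df["mag_norm"] = (df["mag"] - df["mag"].min()) / (df["mag"].max() - df["mag"].min())
print(df)
```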

Page 32:

Key Concept for Data Mining: Feature Selection – the Feature Vector

• A feature vector is the attribute vector for a database record (tuple).

• The feature vector’s components are database attributes: v = {w, x, y, z}

• It contains the set of database attributes that you have chosen to represent (describe) each data element (tuple) uniquely.
  – This is only a subset of all possible attributes in the DB.

• Example: Sky Survey database object feature vector:
  – Generic: {RA, Dec, mag, redshift, color, size}
  – Specific: {ra2000, dec2000, r, z, g-r, R_eff}
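A small sketch of assembling such a feature vector from one catalog row, using the specific attribute names listed above; the numerical values are invented (except 14.2 and 0.0167, which echo the earlier astronomy example).

```python
import numpy as np

row = {"ra2000": 187.70, "dec2000": 12.39, "r": 14.2, "z": 0.0167, "g-r": 0.74, "R_eff": 5.1}

feature_names = ["ra2000", "dec2000", "r", "z", "g-r", "R_eff"]   # the chosen subset of DB attributes
v = np.array([row[name] for name in feature_names])               # the feature vector for this tuple
print(v)
```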

Page 33:

Key Concept for Data Mining: Data Types

• Different data types:
  – Continuous:
    • Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
  – Discrete:
    • Binary (0 or 1; Yes/No; Male/Female)
    • Boolean (True/False)
    • Specific list of allowed values (e.g., zip codes; country names; chemical elements; amino acids; planets)
  – Categorical:
    • Non-numeric (character/text data) (e.g., people’s names)
    • Can be Ordinal (ordered) or Nominal (not ordered)
    • Reference: http://www.twocrows.com/glossary.htm#anchor311516

• Examples of Data Mining Classification Techniques:
  – Regression for continuous numeric data
  – Logistic Regression for discrete data
  – Bayesian Classification for categorical data

Page 34:

Key Concept for Data Mining: Data Normalization & Data Transformation

• Data Normalization transforms data values for different database attributes into a uniform set of units or into a uniform scale (i.e., to a common min-max range).

• Data Normalization assigns the correct numerical weighting to the values of different attributes.

• For example:
  – Transform all numerical values from min to max on a 0 to 1 scale (or 0 to Weight; or -1 to 1; or 0 to 100; …).
  – Convert discrete or character (categorical) data into numeric values.
  – Transform ordinal data to a ranked list (numeric).
  – Discretize continuous data into bins.
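A hedged NumPy sketch of two of these transformations: min-max scaling onto a 0-to-1 range, and discretizing a continuous attribute into equal-width bins. The magnitude values are invented.

```python
import numpy as np

mag = np.array([14.2, 16.8, 18.1, 21.7, 19.5])

# min-max normalization onto a 0 to 1 scale
mag_norm = (mag - mag.min()) / (mag.max() - mag.min())

# discretize the continuous values into 3 equal-width bins (labels 0, 1, 2)
edges = np.linspace(mag.min(), mag.max(), 4)
mag_bin = np.digitize(mag, edges[1:-1])

print(mag_norm, mag_bin)
```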

Page 35:

Key Concept for Data Mining: Similarity and Distance Metrics

• Similarity between complex data objects is one of the central notions in data mining.

• The fundamental problem is to determine whether any selected pair of data objects exhibit similar characteristics.

• The problem is both interesting and difficult because the similarity measures should allow for imprecise matches.

• Similarity and its inverse – Distance – provide the basis for all of the fundamental data mining clustering techniques and for many data mining classification techniques.

Page 36:

Similarity and Distance Measures (metrics)

Page 37:

Similarity and Distance Measures

• Most clustering algorithms depend on a distance or similarity measure, to determine (a) the closeness or “alikeness” of cluster members, and (b) the distance or “unlikeness” of members from different clusters.

• General requirements for any similarity or distance metric:
  – Non-negative: dist(A,B) ≥ 0 and sim(A,B) ≥ 0
  – Symmetric: dist(A,B) = dist(B,A) and sim(A,B) = sim(B,A)

• In order to calculate the “distance” between different attribute values, those attributes must be transformed or normalized (either to the same units, or else normalized to a similar scale).

• The normalization of both categorical (non-numeric) data and numerical data with units generally requires domain expertise. This is part of the pre-processing (data preparation) step in any data mining activity.

Page 38:

Popular Similarity and Distance Measures

• General Lp distance = ||x-y||p = [sum{|x-y|p}]1/p

• Euclidean distance: p=2

– DE = sqrt[(x1-y1)2 + (x2-y2)2 + (x3-y3)2 + … ]

• Manhattan distance: p=1 (# of city blocks walked)

– DM = |x1-y1| + |x2-y2| + |x3-y3| + …

• Cosine distance = angle between two feature vectors:

– d(X,Y) = arccos [ X ٠ Y / ||X|| . ||Y|| ]

– d(X,Y) = arccos [ (x1y1+x2y2+x3y3) / ||X|| . ||Y|| ]

• Similarity function: s(x,y) = 1 / [1+d(x,y)]

– s varies from 1 to 0, as distance d varies from 0 to .

8
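These measures transcribe directly into NumPy; the sketch below implements them as defined above, with two invented feature vectors as a usage example.

```python
import numpy as np

def lp_distance(x, y, p):
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

def euclidean(x, y):              # p = 2
    return lp_distance(x, y, 2)

def manhattan(x, y):              # p = 1
    return lp_distance(x, y, 1)

def cosine_distance(x, y):        # angle between the two feature vectors
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def similarity(d):                # s = 1 / (1 + d): 1 at d = 0, falling toward 0 as d grows
    return 1.0 / (1.0 + d)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y), similarity(euclidean(x, y)))
```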

Page 39:

Data Mining Clustering and Nearest Neighbor Algorithms – Issues

• Clustering algorithms and nearest neighbor algorithms (for classification) require a distance or similarity metric.

• You must be especially careful with categorical data, which can be a problem. For example:
  – What is the distance between blue and green? Is it larger than the distance from green to red?
  – How do you “metrify” different attributes (color, shape, text labels)? This is essential in order to calculate distance in multiple dimensions. Is the distance from blue to green larger or smaller than the distance from round to square? Which of these are most similar?

Page 40:

Key Concept for Data Mining: Classification Accuracy

Typical Error Matrix (rows: NEURAL NETWORK CLASSIFICATION output; columns: TRAINING DATA actual classes):

                      Actual Class-A    Actual Class-B    Totals
Predicted Class-A     2834 (TP)         173 (FP)          3007
Predicted Class-B     318 (FN)          3103 (TN)         3421
Totals                3152              3276              6428

(TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative)

Page 41:

Typical Measures of Accuracy

• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)

• Producer’s Accuracy (Class A) = TP/(TP+FN)
• Producer’s Accuracy (Class B) = TN/(FP+TN)
• User’s Accuracy (Class A) = TP/(TP+FP)
• User’s Accuracy (Class B) = TN/(TN+FN)

Accuracy of our Classification on the preceding slide:
• Overall Accuracy = 92.4%
• Producer’s Accuracy (Class A) = 89.9%
• Producer’s Accuracy (Class B) = 94.7%
• User’s Accuracy (Class A) = 94.2%
• User’s Accuracy (Class B) = 90.7%
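The sketch below recomputes these measures from the error matrix on the preceding slide (TP = 2834, FP = 173, FN = 318, TN = 3103) and reproduces the quoted percentages.

```python
TP, FP, FN, TN = 2834, 173, 318, 3103

measures = {
    "Overall Accuracy":              (TP + TN) / (TP + TN + FP + FN),   # 92.4%
    "Producer's Accuracy (Class A)": TP / (TP + FN),                    # 89.9%
    "Producer's Accuracy (Class B)": TN / (FP + TN),                    # 94.7%
    "User's Accuracy (Class A)":     TP / (TP + FP),                    # 94.2%
    "User's Accuracy (Class B)":     TN / (TN + FN),                    # 90.7%
}
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```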

Page 42:

Key Concept for Data Mining: Overfitting

[Figure: data points fit by three curves g(x), h(x), and d(x)]

• g(x) is a poor fit (fitting a straight line through the points)
• h(x) is a good fit
• d(x) is a very poor fit (fitting every point) = Overfitting

Page 43:

How to Avoid Overfitting in Data Mining Models

• In Data Mining, the problem arises because you are training the model on a set of training data (i.e., a subset of the total database).

That training data set is simply intended to be representative of the entire database, not an exact copy of it.

• So, if you try to fit every nuance in the training data, then you will probably over-constrain the problem and produce a bad fit.

• This is where a TEST DATA SET comes in very handy. You can train the data mining model (Decision Tree or Neural Network) on the TRAINING DATA, and then measure its accuracy with the TEST DATA, prior to unleashing the model (e.g., Classifier) on some real new data.

• Different ways of subsetting the TRAINING and TEST data sets:
  – 50-50 (50% of data used to TRAIN, 50% used to TEST)
  – 10 different sets of 90-10 (90% for TRAINING, 10% for TESTING)
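A hedged sketch of both subsetting schemes using shuffled NumPy index arrays; the number of records and the random seed are invented, and the actual model training/testing calls are left as a comment.

```python
import numpy as np

n = 1000                                   # invented number of database records
rng = np.random.default_rng(0)

# 50-50: half of the records to TRAIN, half to TEST
idx = rng.permutation(n)
train_idx, test_idx = idx[: n // 2], idx[n // 2:]
print("50-50 split:", len(train_idx), len(test_idx))

# 10 different 90-10 splits (90% TRAIN, 10% TEST each time)
for trial in range(10):
    idx = rng.permutation(n)
    train_idx, test_idx = idx[: int(0.9 * n)], idx[int(0.9 * n):]
    # ... train the model on train_idx, then measure its accuracy on test_idx ...
    print(f"trial {trial}: {len(train_idx)} train / {len(test_idx)} test")
```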

Page 44:

Schematic Approach to Avoiding Overfitting

[Figure: Error vs. Training Epoch, showing the Training Set error and Test Set error curves, annotated "STOP Training HERE!" where the Test Set error turns upward]

To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to Test Data (not part of the Training Set). If the Test Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop!
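A hedged sketch of this stopping rule, with an invented test-error curve standing in for a real model: the error falls to a minimum (near epoch 18 here) and then rises, at which point training stops.

```python
import numpy as np

epochs = np.arange(1, 51)
test_error = 1.0 / epochs + 0.0005 * (epochs - 15) ** 2   # invented curve: minimum near epoch 18
# (an invented Training Set error such as 1/epoch would keep decreasing throughout)

best = float("inf")
for epoch, err in zip(epochs, test_error):
    if err < best:
        best = err                        # Test Set error still improving: keep training
    else:                                 # Test Set error started to increase: overfitting
        print(f"STOP training at epoch {epoch}")
        break
```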

Page 45:

OUTLINE

• The New Face of Science
• Scientific Knowledge Discovery
• Data Mining Examples and Techniques
• Basic Concepts in Data Mining
• What’s next?

Page 46:

Scientific Data Mining in Astronomy