Post on 21-Dec-2015
Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 2: Measurement and Data
Today’s lecture
• Questions on homework?
• Office hours tomorrow: 9:30 to 11
• Outline of today’s lecture:
  – From lecture 1: various tasks in data mining
  – Chapter 2: Measurement and Data
    • Types of measurement
    • Distance measures
    • Multidimensional scaling
• Discussion of class projects
Slides from Lecture 1……
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Exploratory Data Analysis
• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness, ...
• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods
• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
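A minimal sketch of the summary statistics listed above, using only the Python standard library. The function name `summarize` and the income values are hypothetical, made up for illustration; skewness is computed here as the average cubed standardized deviation, which is one common definition among several.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # standard deviation with a 1/n divisor
    # Skewness as the mean cubed standardized deviation (0 for symmetric data).
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n if sd > 0 else 0.0
    return {
        "n_distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": sd ** 2,
        "skewness": skew,
    }


# Hypothetical incomes (in $1000s) with one extreme value.
incomes = [30, 32, 35, 38, 40, 41, 45, 900]
stats = summarize(incomes)
for name, value in stats.items():
    print(f"{name}: {value}")
```

On data like this the mean sits far above the median and the skewness is strongly positive, which is the kind of thing a quick look at the data is meant to catch.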
Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)
Descriptive Modeling
• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process
• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1, …, xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data
Example of Descriptive Modeling
[Figure: scatter plot “Anemia Patients and Controls”; x-axis: Red Blood Cell Volume; y-axis: Red Blood Cell Hemoglobin Concentration; legend: Anemia Group, Control Group]
Example of Descriptive Modeling
[Figure: two panels with the same axes (x: Red Blood Cell Volume; y: Red Blood Cell Hemoglobin Concentration). First panel: “Anemia Patients and Controls” (legend: Anemia Group, Control Group). Second panel: “EM Iteration 25”]
Learning User Navigation Patterns from Web Logs
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: the log records above are shown alongside per-user sequences of numeric codes, in rows labeled User 1 to User 5; the sequences shown are 5115, 11111151511151, 77777777, 111333, and 3333131113332232]
Clusters of Probabilistic State Machines (Cadez, Heckerman, et al., 2003)
[Figure: three state-transition diagrams over states A, B, C, E, one each for Cluster 1, Cluster 2, and Cluster 3]
Motivation: capture heterogeneity of Web surfing behavior
WebCanvas algorithm and software - currently in new SQLServer
Another Example of Descriptive Modeling
• Learning Directed Graphical Models (aka Bayes Nets)
  – goal: learn directed relationships among p variables
  – techniques: directed (causal) graphs
  – challenge: distinguishing between correlation and causation
  – example: Do yellow fingers cause lung cancer? hidden cause: smoking
[Diagram: nodes “smoking”, “yellow fingers”, and “cancer”; the link between yellow fingers and cancer is marked “?”]
Predictive Modeling
• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued
• In effect this is function approximation, learning the relationship between Y and X
• Many, many algorithms for predictive modeling in statistics and machine learning
• Often the emphasis is on predictive accuracy, less emphasis on understanding the model
Predictive Modeling: Fraud Detection
• Credit card fraud detection
  – Credit card losses in the US are over 1 billion $ per year
  – Roughly 1 in 50k transactions are fraudulent
• Approach
  – For each transaction estimate p(fraudulent | transaction)
  – Model is built on historical data of known fraud/non-fraud
  – High probability transactions investigated by fraud police
• Example:
  – Fair-Isaac/HNC’s fraud detection software based on neural networks, led to reported fraud decreases of 30 to 50%
  – http://www.fairisaac.com/fairisaac
• Issues
  – Significant feature engineering/preprocessing
  – false alarm rate vs missed detection – what is the tradeoff?
Predictive Modeling: Customer Scoring
• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a function of p(mortgage | customer data)
• Customer data
  – History of transactions with the bank
  – Other credit data (obtained from Experian, etc)
  – Demographic data on the customer or where they live
• Techniques
  – Binary classification: logistic regression, decision trees, etc
  – Many, many applications of this nature
Predictive Modeling: Telephone Call Modeling
• Background
  – AT&T has about 100 million customers
  – It logs 200 million calls per day, 40 attributes each
  – 250 million unique telephone numbers
  – Which are business and which are residential?
• Approach (Pregibon and Cortes, AT&T, 1997)
  – Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)
  – Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm)
• Status:
  – running daily at AT&T
  – HTML interface used by AT&T marketing
[Figure] From C. Cortes and D. Pregibon, Giga-mining, in Proceedings of the ACM SIGKDD Conference, 1997
Structure: Models and Patterns
• Model = abstract representation of a process
  e.g., very simple linear model structure Y = aX + b
  – a and b are parameters determined from the data
  – Y = aX + b is the model structure
  – Y = 0.9X + 0.3 is a particular model
  – “All models are wrong, some are useful” (G.E. Box)
• Pattern represents “local structure” in a data set
  – E.g., if X > x then Y > y with probability p
  – or a pattern might be a small cluster of outliers in multi-dimensional space
Pattern Discovery
• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally
• Given market basket data we might discover that
  – If customers buy wine and bread then they buy cheese with probability 0.9
  – These are known as “association rules”
• Given multivariate data on astronomical objects
  – We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
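To make the “probability” attached to an association rule concrete, here is a small sketch that estimates it as the fraction of baskets containing the antecedent items that also contain the consequent. The function name `rule_confidence` and the five baskets are hypothetical, invented for illustration; a real rule-mining algorithm would also search over candidate rules rather than score a single one.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) from a list of baskets (sets of items)."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every antecedent item.
    matching = [b for b in baskets if antecedent <= set(b)]
    if not matching:
        return None  # rule cannot be evaluated: antecedent never occurs
    hits = sum(1 for b in matching if consequent <= set(b))
    return hits / len(matching)


# Hypothetical market basket data.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "milk"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese"},
]

conf = rule_confidence(baskets, {"wine", "bread"}, {"cheese"})
print("P(cheese | wine, bread) =", conf)
```

Four baskets contain both wine and bread, and three of those also contain cheese, so the estimate for this toy data is 0.75.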
Example of Pattern Discovery
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
Example of Pattern Discovery
• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”
  – IBM claimed around 1998 that all NBA teams except 1 were using this software…… the “other team” was Chicago.
Components of Data Mining Algorithms
• Representation:
  – Determining the nature and structure of the representation to be used
• Score function
  – quantifying and comparing how well different representations fit the data
• Search/Optimization method
  – Choosing an algorithmic process to optimize the score function; and
• Data Management
  – Deciding what principles of data management are required to implement the algorithms efficiently.
What’s in a Data Mining Algorithm?
[Diagram: the components of a data mining algorithm: Task; Representation; Score Function; Search/Optimization; Data Management; Models, Parameters]
An Example: Linear Regression
Task:                 Regression
Representation:       Y = Weighted linear sum of X’s
Score Function:       Least-squares
Search/Optimization:  Gaussian elimination
Data Management:      None specified
Models, Parameters:   Regression coefficients
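The pieces of the linear regression example can be illustrated with the one-input model structure Y = aX + b. This is a sketch under simplifying assumptions: with a single X the least-squares normal equations have a closed-form solution, so no Gaussian elimination routine is needed; the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of (a, b) for the model structure Y = a*X + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Closed-form solution of the two normal equations for one input variable.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def squared_error(xs, ys, a, b):
    """Score function: sum of squared residuals for a particular model."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data generated exactly from Y = 0.9 X + 0.3.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.3, 1.2, 2.1, 3.0]
a, b = fit_line(xs, ys)
print("fitted model: Y = %.3f X + %.3f" % (a, b))
print("score (sum of squared errors):", squared_error(xs, ys, a, b))
```

Here the representation is the line, the score function is `squared_error`, the “search” is the closed-form solve, and the resulting parameters (a, b) are the regression coefficients.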
An Example: Decision Trees (C4.5 or CART)
Task:                 Classification
Representation:       Hierarchy of axis-parallel linear class boundaries
Score Function:       Cross-validated accuracy
Search/Optimization:  Greedy Search
Data Management:      None specified
Models, Parameters:   Decision tree classifier
An Example: Hierarchical Clustering
Task:                 Clustering
Representation:       Tree of clusters
Score Function:       Various
Search/Optimization:  Greedy search
Data Management:      None specified
Models, Parameters:   Dendrogram
An Example: Association Rules
Task:                 Pattern Discovery
Representation:       Rules: if A and B then C with prob p
Score Function:       No explicit score
Search/Optimization:  Systematic search
Data Management:      Multiple linear scans
Models, Parameters:   Set of Rules
Data Measurement
Measurement
Mapping domain entities to symbolic representations
[Diagram: “Real world” mapped to “Data”; “Relationship in real world” corresponds to “Relationship in data”]
Nominal or Categorical Variable
Here, numerical values just "name" the attribute uniquely. No ordering is implied. E.g., jersey numbers in basketball: a player with number 30 is not more of anything than a player with number 15, and certainly not twice whatever number 15 is.
Measurements, cont.
Ordinal measurement: attribute values can be rank-ordered, but distances between values do not have any meaning. E.g., on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as from 3 to 4? No. The interval between values is not interpretable in an ordinal measure.
Interval measurement: distance between values does have meaning. E.g., when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Averages make sense, but ratios do not: 80 degrees is not twice as hot as 40 degrees.
Measurements, cont.
Ratio measurement: there is an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."
Hierarchy of Measurements
Scales
Scale                 Legal transforms                       Example
Nominal/Categorical   Any one-one mapping                    Hair color, employment
Ordinal               Any order preserving transform         Severity, preference
Interval              Multiply by constant, add a constant   Temperature, calendar time
Ratio                 Multiply by constant                   Weight, income
Why is this important?
• As we will see….
  – Many models require data to be represented in a specific form
  – e.g., real-valued vectors
    • Linear regression, neural networks, support vector machines, etc
    • These models implicitly assume interval-scale data (at least)
  – What do we do with non-real valued inputs?
    • Nominal with M values:
      – Not appropriate to “map” to 1 to M (maps to an interval scale)
      – Why? w_1 × employment_type + w_2 × city_name
      – Could use M binary “indicator” variables
        » But what if M is very large? (e.g., cluster into groups of values)
    • Ordinal?
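A short sketch of the M binary “indicator” variables idea for a nominal variable. The function name `indicator_encode` and the employment values are hypothetical; the point is only that each category gets its own 0/1 column, so no artificial ordering or spacing between categories is introduced.

```python
def indicator_encode(values):
    """Map a nominal variable to M binary indicator variables (one per category)."""
    categories = sorted(set(values))  # sorted only to make column order reproducible
    column = {c: k for k, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[column[v]] = 1  # exactly one indicator is switched on per object
        rows.append(row)
    return categories, rows


# Hypothetical nominal variable with M = 3 distinct values.
employment = ["clerk", "engineer", "teacher", "engineer"]
categories, encoded = indicator_encode(employment)
print(categories)
for value, row in zip(employment, encoded):
    print(value, row)
```

Coding the same values as 1, 2, 3 instead would tell a linear model that “teacher” is three times “clerk”, which is meaningless for a nominal scale.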
Mixed data
• Many real-world data sets have multiple types of variables,
  – e.g., demographic data sets for marketing
  – Nominal: employment type, ethnic group
  – Ordinal: education level
  – Interval: income, age
• Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)
• Exception: decision trees
  – Trees operate by subgrouping variable values at internal nodes
  – Can operate effectively on binary, nominal, ordinal, interval
  – We will see more details later…..
Distance Measures
• Many data mining techniques are based on similarity or distance measures between objects.
• Two methods for computing similarity or distance:
  1. Explicit similarity measurement for each pair of objects
  2. Similarity obtained indirectly based on vector of object attributes.
• Metric: d(i,j) is a metric iff
  1. d(i,j) ≥ 0 for all i, j and d(i,j) = 0 iff i = j
  2. d(i,j) = d(j,i) for all i and j
  3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k
Vector data and distance matrices
• Data may be available as n “vectors” each p-dimensional
• Or “data” itself may be an n x n matrix of similarities or distances
Distance
• Notation: n objects with p measurements
$$ x(i) = (x_1(i), x_2(i), \ldots, x_p(i)) $$
• Most common distance metric is Euclidean distance:
$$ d_E(i,j) = \left( \sum_{k=1}^{p} (x_k(i) - x_k(j))^2 \right)^{1/2} $$
• Makes sense in the case where the different measurements are commensurate; each variable measured in the same units.
• If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
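The warning about non-commensurate measurements can be seen in a small sketch. The three objects below are hypothetical (height, weight) pairs invented for illustration; the only thing that changes between the two runs is the unit used for height.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical objects: (height in meters, weight in kg).
a_m, b_m, c_m = (1.80, 70.0), (1.60, 70.0), (1.80, 72.0)
# The same objects with height expressed in centimeters.
a_cm, b_cm, c_cm = (180.0, 70.0), (160.0, 70.0), (180.0, 72.0)

print("meters:      d(a,b) =", euclidean(a_m, b_m), " d(a,c) =", euclidean(a_m, c_m))
print("centimeters: d(a,b) =", euclidean(a_cm, b_cm), " d(a,c) =", euclidean(a_cm, c_cm))
```

With height in meters, b is the nearer neighbor of a; with height in centimeters, c is. The ranking of neighbors depends on an arbitrary choice of units, which is why Euclidean distance is only meaningful for commensurate variables.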
Standardization
When variables are not commensurate, we can standardize them by dividing by the sample standard deviation. This makes them all equally important.
The estimate for the standard deviation of x_k:
$$ \hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \bar{x}_k)^2 \right)^{1/2} $$
where \bar{x}_k is the sample mean:
$$ \bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i) $$
(When might standardization *not* be such a good idea? Hint: think of extremely skewed data and outliers, e.g., Gates income)
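A sketch of standardization by the sample standard deviation (with the 1/n divisor), plus a look at the outlier question raised in the hint. The function names and the income figures are hypothetical; the last value plays the role of a single extremely large income.

```python
import math


def sample_sd(column):
    """Standard deviation estimate using a 1/n divisor."""
    n = len(column)
    mean = sum(column) / n
    return math.sqrt(sum((v - mean) ** 2 for v in column) / n)


def standardize(column):
    """Divide every value of one variable by that variable's standard deviation."""
    sd = sample_sd(column)
    return [v / sd for v in column]


# Hypothetical incomes (in $1000s); the last one is an extreme outlier.
incomes = [30, 32, 35, 38, 40, 41, 45, 900]
scaled = standardize(incomes)

print("sd after standardization:", sample_sd(scaled))
print("spread of the seven typical values:", max(scaled[:7]) - min(scaled[:7]))
```

The standardized variable has unit standard deviation, but because the outlier inflates the estimate, the seven typical incomes end up squeezed into a very narrow range and contribute almost nothing to distances between them.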
Weighted Euclidean distance
If we have some idea of the relative importance of each variable, we can weight them:
$$ d_{WE}(i,j) = \left( \sum_{k=1}^{p} w_k (x_k(i) - x_k(j))^2 \right)^{1/2} $$
Other Distance Metrics
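• Minkowski or L_λ metric:
$$ d(i,j) = \left( \sum_{k=1}^{p} |x_k(i) - x_k(j)|^{\lambda} \right)^{1/\lambda} $$
• Manhattan, city block or L_1 metric:
$$ d(i,j) = \sum_{k=1}^{p} |x_k(i) - x_k(j)| $$
• L_∞ metric:
$$ d(i,j) = \max_{k} |x_k(i) - x_k(j)| $$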
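The three metrics on this slide are members of one family, which a short sketch makes visible: λ = 1 gives the Manhattan distance, λ = 2 the Euclidean distance, and large λ approaches the L∞ (maximum-coordinate) distance. Function names are hypothetical and absolute differences are assumed in each term.

```python
def minkowski(x, y, lam):
    """Minkowski (L_lambda) distance, lam >= 1."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)


def manhattan(x, y):
    """City block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l_infinity(x, y):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
print("L1        :", manhattan(x, y))
print("L2        :", minkowski(x, y, 2))
print("L50       :", minkowski(x, y, 50))
print("L-infinity:", l_infinity(x, y))
```

As λ grows, the largest coordinate difference dominates the sum, so the Minkowski distance shrinks toward the L∞ value.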
Additive Distances
• Each variable contributes independently to the measure of distance.
• May not always be appropriate…
object i        object j
height(i)       height(j)
diameter(i)     diameter(j)
height2(i)      height2(j)
…               …
height100(i)    height100(j)
Dependence among Variables
• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:
$$ \mathrm{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y}) $$
• The covariance is a measure of how X and Y vary together.
  – it will be large and positive if large values of X are associated with large values of Y, and small X small Y
Sample correlation coefficient
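• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Sample correlation coefficient:
$$ \rho(X,Y) = \frac{\sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\left( \sum_{i=1}^{n} (x(i) - \bar{x})^2 \, \sum_{i=1}^{n} (y(i) - \bar{y})^2 \right)^{1/2}} $$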
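A sketch of the sample covariance (1/n divisor) and the sample correlation coefficient, showing why the correlation is the scale-free quantity. The function names and data are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance of X and Y with a 1/n divisor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Sample correlation coefficient: covariance scaled by both standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
ys_rescaled = [100.0 * y for y in ys]  # same variable in different units

print("Cov(X, Y)      =", covariance(xs, ys))
print("Cov(X, 100 Y)  =", covariance(xs, ys_rescaled))
print("rho(X, Y)      =", correlation(xs, ys))
print("rho(X, 100 Y)  =", correlation(xs, ys_rescaled))
```

Changing the units of Y multiplies the covariance by 100 but leaves the correlation unchanged.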
Sample Correlation Matrix
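[Figure: sample correlation matrix for data on characteristics of Boston suburbs, with a color scale from -1 through 0 to +1; labeled variables include business acreage, nitrous oxide, percentage of large residential lots, average # rooms, and median house value]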
Mahalanobis distance
$$ d_{MH}(i,j) = \left( (x(i) - x(j))^{T} \, \Sigma^{-1} \, (x(i) - x(j)) \right)^{1/2} $$
where Σ is the covariance matrix of the variables.
1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features
Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather than linearly with the number of features.
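A minimal two-variable sketch of the Mahalanobis distance, written without any matrix library by inverting the 2 x 2 covariance matrix directly. The function name `mahalanobis_2d` and the covariance values are hypothetical; a general implementation would need a proper matrix inverse for p variables.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-d points x and y for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix [[a, b], [c, d]].
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - y[0], x[1] - y[1])
    # Quadratic form dx^T * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)


identity = ((1.0, 0.0), (0.0, 1.0))
correlated = ((1.0, 0.9), (0.9, 1.0))  # hypothetical strongly correlated variables

origin = (0.0, 0.0)
print("identity covariance:", mahalanobis_2d(origin, (3.0, 4.0), identity))
print("along the correlation:  ", mahalanobis_2d(origin, (1.0, 1.0), correlated))
print("against the correlation:", mahalanobis_2d(origin, (1.0, -1.0), correlated))
```

With the identity covariance the result is the ordinary Euclidean distance. With correlated variables, a displacement that follows the correlation counts as short, while an equally long displacement against it counts as much longer.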
What about…
[Figure: scatter plot of Y against X, annotated “ρ(X,Y) = ?” and “linear → covariance, correlation”]
Are X and Y dependent?
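One way to see why this question matters, assuming the slide is pointing at nonlinear dependence: in the hypothetical data below Y is completely determined by X (Y = X squared), yet the sample correlation is zero because correlation only measures linear dependence.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: Y is a deterministic, but nonlinear, function of X.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

rho = correlation(xs, ys)
print("rho(X, Y) =", rho)
```

A correlation near zero therefore does not show that two variables are independent; it only shows the absence of a linear relationship.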
Binary Vectors
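Counts over the variables for two items i and j:
         j=1    j=0
i=1      n11    n10
i=0      n01    n00
(e.g., n01 = number of variables where item j = 1 and item i = 0)
• Matching coefficient:
$$ \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}} $$
• Jaccard coefficient:
$$ \frac{n_{11}}{n_{11} + n_{10} + n_{01}} $$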
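A sketch of both coefficients for two binary vectors. The function names and the two example items are hypothetical. The vectors are mostly zeros, which is the situation where the two coefficients disagree most: shared zeros raise the matching coefficient but are ignored by Jaccard.

```python
def binary_counts(item_i, item_j):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 vectors."""
    n11 = sum(1 for a, b in zip(item_i, item_j) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(item_i, item_j) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(item_i, item_j) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(item_i, item_j) if a == 0 and b == 0)
    return n11, n10, n01, n00


def matching_coefficient(item_i, item_j):
    """Fraction of variables on which the two items agree (1-1 or 0-0)."""
    n11, n10, n01, n00 = binary_counts(item_i, item_j)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard_coefficient(item_i, item_j):
    """Like matching, but 0-0 agreements are ignored (undefined if both are all zeros)."""
    n11, n10, n01, _ = binary_counts(item_i, item_j)
    return n11 / (n11 + n10 + n01)


# Hypothetical sparse binary vectors.
item_i = [1, 1, 0, 0, 0, 0, 0, 0]
item_j = [1, 0, 1, 0, 0, 0, 0, 0]

print("counts  :", binary_counts(item_i, item_j))
print("matching:", matching_coefficient(item_i, item_j))
print("Jaccard :", jaccard_coefficient(item_i, item_j))
```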
Other distance metrics
• Categorical variables
  – Number of matches divided by number of dimensions
• Distances between strings of different lengths
  – e.g., “Patrick J. Smyth” and “Padhraic Smyth”
  – Edit distance
• Distances between images and waveforms
  – Shift-invariant, scale invariant
  – i.e., d(x,y) = min_{a,b} ( (ax+b) – y )
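A sketch of edit distance between strings of different lengths, assuming the common Levenshtein variant in which insertions, deletions, and substitutions each cost 1 (the slide does not say which variant is meant). The function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```

The dynamic program keeps only two rows of the table, so memory grows with the length of one string rather than with the product of both lengths.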
Transforming Data
• Duality between form of the data and the model
  – Useful to bring data onto a “natural scale”
  – Some variables are very skewed, e.g., income
• Common transforms: square root, reciprocal, logarithm, raising to a power
  – Often very useful when dealing with skewed real-world data
• Logit: transforms from 0 to 1 to real-line
$$ \mathrm{logit}(p) = \log \frac{p}{1 - p} $$
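A small sketch of the logit transform and its inverse, using the natural logarithm. The function names are hypothetical; the inverse is included only to show that the mapping between (0, 1) and the real line is one-to-one.

```python
import math


def logit(p):
    """Map a value strictly between 0 and 1 onto the real line."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a real number back into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    z = logit(p)
    print(f"p = {p:<5} logit(p) = {z:+.4f}  back to p = {inverse_logit(z):.4f}")
```

Values near 0 and 1 are stretched out toward large negative and positive numbers, while p = 0.5 maps to 0.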
Multidimensional Scaling (MDS)
• Say we have data in the form of an N x N matrix of dissimilarities
  – 0’s on the diagonal
  – Symmetric
• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences
• MDS:
  – Find k-dimensional coordinates for each of the N objects such that Euclidean distances in the “embedded” space match the set of dissimilarities as closely as possible
Multidimensional Scaling (MDS)
• MDS criterion:
$$ S = \frac{\sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2}{\sum_{i,j} d(i,j)^2} $$
  where δ(i,j) are the original dissimilarities and d(i,j) is the Euclidean distance in the “embedded” k-dim space
• Optimization: find the set of N k-dimensional positions that minimize S
  – If original dissimilarities are Euclidean
    • -> linear algebra solution (equivalent to principal components)
  – Non-Euclidean (more typical)
    • Local iterative hill-climbing, e.g., move each point to decrease S, repeat
    • Complexity is O(n^2 k) per iteration (iteration = move all points)
  – See Faloutsos and Lin (1995) for FastMap: O(nk) approximation for large N
• Often used for visualization, e.g., k=2, 3
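A sketch of evaluating the MDS criterion S for a candidate embedding. It assumes S is the sum of squared differences between embedded Euclidean distances and the original dissimilarities, divided by the sum of squared embedded distances. The function name `stress` and the three-object example are hypothetical, and the optimization step (moving points to reduce S) is deliberately left out.

```python
import math


def stress(dissimilarities, coords):
    """MDS criterion S for given original dissimilarities and embedded coordinates."""
    n = len(coords)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            d = math.dist(coords[i], coords[j])  # Euclidean distance in embedded space
            numerator += (d - dissimilarities[i][j]) ** 2
            denominator += d ** 2
    return numerator / denominator


# Hypothetical dissimilarities that happen to be exactly Euclidean (a 3-4-5 triangle).
dissimilarities = [
    [0.0, 3.0, 5.0],
    [3.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]

perfect = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]   # reproduces every dissimilarity
distorted = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0)]  # third point moved

print("S for the perfect embedding:  ", stress(dissimilarities, perfect))
print("S for the distorted embedding:", stress(dissimilarities, distorted))
```

An iterative MDS procedure would repeatedly adjust the coordinates and keep changes that lower this value; each evaluation touches all pairs of points, which is where the quadratic cost per iteration comes from.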
MDS example: input distance data
          Chicago  Raleigh  Boston  Seattle  S.F.  Austin  Orlando
Chicago         0
Raleigh       641        0
Boston        851      608       0
Seattle      1733     2363    2488        0
S.F.         1855     2406    2696      684     0
Austin        972     1167    1691     1764  1495       0
Orlando       994      520    1105     2565  2458    1015        0
Result of MDS
MDS: Example data
MDS: 2d embedding of face images
Data Quality
• Individual measurements
  – Random noise in individual measurements
    • Variance (precision)
    • Bias
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don’t want to answer
      – Patients who are too ill for a certain test
Data Quality
• Collections of measurements
  – Ideal case = random sample from population of interest
  – Real case = often a biased sample of some sort
  – Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
• Examples of non-randomly sampled data
  – Medical study where subjects are all students
  – Geographic dependencies
  – Temporal dependencies
  – Stratified samples
    • E.g., 50% healthy, 50% ill
  – Hidden systematic effects
    • E.g., market basket data the weekend of a large sale in the store
    • E.g., Web log data during finals week
Next Lecture
• Chapter 3
– Exploratory data analysis and visualization