
Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine

ICS 278: Data Mining

Lecture 2: Measurement and Data


Today’s lecture

• Questions on homework?

• Office hours tomorrow: 9:30 to 11

• Outline of today’s lecture:
  – From lecture 1: various tasks in data mining
  – Chapter 2: Measurement and Data
    • Types of measurement
    • Distance measures
    • Multidimensional scaling

• Discussion of class projects


Slides from Lecture 1……


Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….


Exploratory Data Analysis

• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness, ...

• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods

• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed

• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
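As a rough illustration of the summary statistics and data checks listed above, here is a minimal sketch in Python. The function name `summarize` and the `incomes` values are made up for this example, and skewness is computed as the average cubed z-score, which is one common definition among several.

```python
import statistics

def summarize(values):
    """Return basic summary statistics and simple data checks for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population form: divides by n
    # Skewness as the average cubed z-score; 0 for perfectly symmetric data.
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n if sd > 0 else 0.0
    return {
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": sd ** 2,
        "skewness": skew,
        # Data checks of the kind mentioned on the slide.
        "all_integer": all(float(v).is_integer() for v in values),
        "all_positive": all(v > 0 for v in values),
    }

incomes = [20, 25, 30, 35, 40, 1000]  # hypothetical, heavily right-skewed
summary = summarize(incomes)
for name, value in summary.items():
    print(name, value)
```

The large gap between the mean and the median, together with the positive skewness, is the kind of signal that suggests looking at a histogram before modeling.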


Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)









Descriptive Modeling

• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process

• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1, …, xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data


Example of Descriptive Modeling

[Figure: scatter plot titled “Anemia Patients and Controls”; x-axis: Red Blood Cell Volume; y-axis: Red Blood Cell Hemoglobin Concentration; legend: Anemia Group, Control Group]


Example of Descriptive Modeling

[Figure: two panels. First panel: scatter plot titled “Anemia Patients and Controls”; y-axis: Red Blood Cell Hemoglobin Concentration; legend: Anemia Group, Control Group. Second panel: same axes, titled “EM Iteration 25”]


128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

…

5115

11111151511151

77777777

111333

3333131113332232

…

User 5

User 4

User 3

User 2

User 1

Learning User Navigation Patterns from Web Logs


Clusters of Probabilistic State Machines Cadez, Heckerman, et al, 2003

[Figure: three probabilistic state machines, each over the states A, B, C, E, labeled Cluster 1, Cluster 2, and Cluster 3]

Motivation: capture heterogeneity of Web surfing behavior


WebCanvas algorithm and software - currently in new SQLServer


Another Example of Descriptive Modeling

• Learning Directed Graphical Models (aka Bayes Nets)
  – goal: learn directed relationships among p variables
  – techniques: directed (causal) graphs
  – challenge: distinguishing between correlation and causation
  – example: Do yellow fingers cause lung cancer?

[Figure: graph with nodes “yellow fingers”, “cancer”, and “smoking”, with a “?” on the link between yellow fingers and cancer]

hidden cause: smoking









Predictive Modeling

• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued

• In effect this is function approximation, learning the relationship between Y and X

• Many, many algorithms for predictive modeling in statistics and machine learning

• Often the emphasis is on predictive accuracy, less emphasis on understanding the model


Predictive Modeling: Fraud Detection

• Credit card fraud detection
  – Credit card losses in the US are over $1 billion per year
  – Roughly 1 in 50k transactions is fraudulent

• Approach
  – For each transaction estimate p(fraudulent | transaction)
  – Model is built on historical data of known fraud/non-fraud
  – High-probability transactions are investigated by fraud police

• Example:
  – Fair-Isaac/HNC’s fraud detection software, based on neural networks, led to reported fraud decreases of 30 to 50%
  – http://www.fairisaac.com/fairisaac

• Issues
  – Significant feature engineering/preprocessing
  – False alarm rate vs missed detection: what is the tradeoff?


Predictive Modeling: Customer Scoring

• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages

• Use machine learning to rank new customers as a function of p(mortgage | customer data)

• Customer data
  – History of transactions with the bank
  – Other credit data (obtained from Experian, etc)
  – Demographic data on the customer or where they live

• Techniques
  – Binary classification: logistic regression, decision trees, etc
  – Many, many applications of this nature


Predictive Modeling: Telephone Call Modeling

• Background
  – AT&T has about 100 million customers
  – It logs 200 million calls per day, 40 attributes each
  – 250 million unique telephone numbers
  – Which are business and which are residential?

• Approach (Pregibon and Cortes, AT&T, 1997)
  – Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business | data)
  – Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm)

• Status:
  – running daily at AT&T
  – HTML interface used by AT&T marketing


From C. Cortes and D. Pregibon, Giga-mining, in Proceedings of the ACM SIGKDD Conference, 1997









Structure: Models and Patterns

• Model = abstract representation of a process, e.g., very simple linear model structure Y = aX + b
  – a and b are parameters determined from the data
  – Y = aX + b is the model structure
  – Y = 0.9X + 0.3 is a particular model
  – “All models are wrong, some are useful” (G.E. Box)

• Pattern represents “local structure” in a data set
  – E.g., if X > x then Y > y with probability p
  – or a pattern might be a small cluster of outliers in multi-dimensional space
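To make the distinction between a model structure and a particular model concrete, the sketch below estimates a and b from data by ordinary least squares. The function name `fit_line` and the noise-free sample data are assumptions for illustration; the data are generated from the particular model Y = 0.9X + 0.3 mentioned above, so the fit should recover those parameters.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b for the model structure Y = aX + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx          # slope
    b = y_bar - a * x_bar  # intercept
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.9 * x + 0.3 for x in xs]  # hypothetical noise-free data

a, b = fit_line(xs, ys)
print("particular model: Y = %.3f X + %.3f" % (a, b))
```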


Pattern Discovery

• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally

• Given market basket data we might discover that
  • If customers buy wine and bread then they buy cheese with probability 0.9
  • These are known as “association rules”

• Given multivariate data on astronomical objects
  • We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
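The probability attached to an association rule can be estimated directly from counts. The following is a minimal sketch, assuming baskets are represented as Python sets; the function name `rule_confidence` and the five baskets are invented for illustration and do not reproduce the 0.9 figure quoted above.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a fraction of matching baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in baskets if antecedent <= set(b)]
    if not with_antecedent:
        return None  # the rule never applies in this data
    hits = sum(1 for b in with_antecedent if consequent <= set(b))
    return hits / len(with_antecedent)

# Hypothetical market basket data.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

conf = rule_confidence(baskets, {"wine", "bread"}, {"cheese"})
print("P(cheese | wine, bread) =", conf)
```

Three baskets contain both wine and bread, and two of those also contain cheese, so the estimate here is 2/3.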


Example of Pattern Discovery

ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB


Example of Pattern Discovery

• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”
  – IBM claimed around 1998 that all NBA teams except 1 were using this software…… the “other team” was Chicago.


Components of Data Mining Algorithms

• Representation:
  – Determining the nature and structure of the representation to be used

• Score function
  – Quantifying and comparing how well different representations fit the data

• Search/Optimization method
  – Choosing an algorithmic process to optimize the score function; and

• Data Management
  – Deciding what principles of data management are required to implement the algorithms efficiently.


What’s in a Data Mining Algorithm?

Task
Representation
Score Function
Search/Optimization
Data Management
Models, Parameters


An Example: Linear Regression

Task: Regression
Representation: Y = weighted linear sum of X’s
Score Function: Least-squares
Search/Optimization: Gaussian elimination
Data Management: None specified
Models, Parameters: Regression coefficients


An Example: Decision Trees (C4.5 or CART)

Task: Classification
Representation: Hierarchy of axis-parallel linear class boundaries
Score Function: Cross-validated accuracy
Search/Optimization: Greedy search
Data Management: None specified
Models, Parameters: Decision tree classifier


An Example: Hierarchical Clustering

Task: Clustering
Representation: Tree of clusters
Score Function: Various
Search/Optimization: Greedy search
Data Management: None specified
Models, Parameters: Dendrogram


An Example: Association Rules

Task: Pattern Discovery
Representation: Rules: if A and B then C with prob p
Score Function: No explicit score
Search/Optimization: Systematic search
Data Management: Multiple linear scans
Models, Parameters: Set of Rules


Data Measurement


Measurement

[Diagram: the real world is mapped to data; a relationship in the real world corresponds to a relationship in the data]

Mapping domain entities to symbolic representations


Nominal or Categorical Variable

Here, numerical values just "name" the attribute uniquely. No ordering is implied, e.g., jersey numbers in basketball: a player with number 30 is not more of anything than a player with number 15, and certainly not twice whatever number 15 is.


Measurements, cont.

ordinal measurement - attributes can be rank-ordered, but distances between attributes do not have any meaning. E.g., on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as from 3 to 4? No. The interval between values is not interpretable in an ordinal measure.

interval measurement - distance between attributes does have meaning. E.g., when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Averages make sense, but ratios don't: 80 degrees is not twice as hot as 40 degrees.


Measurements, cont.

ratio measurement - there is an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."


Hierarchy of Measurements


Scales

Scale                  Legal transforms                        Example
Nominal/Categorical    Any one-one mapping                     Hair color, employment
Ordinal                Any order-preserving transform          Severity, preference
Interval               Multiply by constant, add a constant    Temperature, calendar time
Ratio                  Multiply by constant                    Weight, income


Why is this important?

• As we will see….
  – Many models require data to be represented in a specific form
  – e.g., real-valued vectors
    • Linear regression, neural networks, support vector machines, etc
    • These models implicitly assume interval-scale data (at least)
  – What do we do with non-real valued inputs?
    • Nominal with M values:
      – Not appropriate to “map” to 1 to M (maps to an interval scale)
      – Why? w_1 x employment_type + w_2 x city_name
      – Could use M binary “indicator” variables
        » But what if M is very large? (e.g., cluster into groups of values)
    • Ordinal?
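The M binary “indicator” variables mentioned above can be built in a few lines. This is a minimal sketch; the function name `one_hot` and the `employment_type` values are hypothetical.

```python
def one_hot(values):
    """Map a nominal variable with M distinct values to M binary indicator columns."""
    categories = sorted(set(values))  # fixed, arbitrary column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical nominal variable with M = 3 values.
employment_type = ["retired", "salaried", "self-employed", "salaried"]

categories, rows = one_hot(employment_type)
print(categories)
for value, row in zip(employment_type, rows):
    print(value, row)
```

Each row has exactly one 1, so no ordering or spacing between categories is implied, unlike a mapping to the integers 1 to M.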


Mixed data

• Many real-world data sets have multiple types of variables,
  – e.g., demographic data sets for marketing
  – Nominal: employment type, ethnic group
  – Ordinal: education level
  – Interval: income, age

• Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)

• Exception: decision trees
  – Trees operate by subgrouping variable values at internal nodes
  – Can operate effectively on binary, nominal, ordinal, interval
  – We will see more details later…..


Distance Measures

• Many data mining techniques are based on similarity or distance measures between objects.

• Two methods for computing similarity or distance:
  1. Explicit similarity measurement for each pair of objects
  2. Similarity obtained indirectly based on vector of object attributes.

• Metric: d(i,j) is a metric iff
  1. d(i,j) ≥ 0 for all i, j and d(i,j) = 0 iff i = j
  2. d(i,j) = d(j,i) for all i and j
  3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k
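The three metric properties can be checked mechanically when the distances are given as a full n x n matrix. The sketch below is one way to do so, assuming the rows correspond to distinct objects; the function name `is_metric` and both example matrices are made up.

```python
from itertools import product

def is_metric(d, tol=1e-12):
    """Check the three metric properties on a full n x n distance matrix d."""
    idx = range(len(d))
    for i, j in product(idx, idx):
        if d[i][j] < 0:
            return False                  # property 1: non-negativity
        if (d[i][j] == 0) != (i == j):
            return False                  # property 1: zero iff i = j
        if abs(d[i][j] - d[j][i]) > tol:
            return False                  # property 2: symmetry
    # property 3: triangle inequality for every triple (i, j, k)
    return all(d[i][j] <= d[i][k] + d[k][j] + tol
               for i, j, k in product(idx, idx, idx))

# Distances between the points 0, 1, 2 on a line: a valid metric.
line_points = [[0, 1, 2],
               [1, 0, 1],
               [2, 1, 0]]

# Here d(0,2) = 5 exceeds d(0,1) + d(1,2) = 2: triangle inequality fails.
not_a_metric = [[0, 1, 5],
                [1, 0, 1],
                [5, 1, 0]]

print(is_metric(line_points), is_metric(not_a_metric))
```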


Vector data and distance matrices

• Data may be available as n “vectors” each p-dimensional

• Or “data” itself may be an n x n matrix of similarities or distances


Distance

• Notation: n objects with p measurements

$$x(i) = \left( x_1(i), x_2(i), \ldots, x_p(i) \right)$$

• Most common distance metric is Euclidean distance:

$$d_E(i,j) = \left( \sum_{k=1}^{p} \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$

• Makes sense in the case where the different measurements are commensurate; each variable measured in the same units.

• If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
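The point about commensurate units can be seen numerically. In this minimal sketch the three objects and their (length, weight) values are invented; the only thing that changes between the two groups of variables is the unit used for length.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical objects described by (length, weight), length in metres, weight in kg.
a_m, b_m, c_m = (1.0, 70.0), (2.0, 70.0), (1.0, 71.0)

# The same objects with length expressed in millimetres.
a_mm, b_mm, c_mm = (1000.0, 70.0), (2000.0, 70.0), (1000.0, 71.0)

print(euclidean(a_m, b_m), euclidean(a_m, c_m))      # b and c equally far from a
print(euclidean(a_mm, b_mm), euclidean(a_mm, c_mm))  # now b is 1000 times farther
```

Nothing about the objects changed, yet the relative distances did, which is why Euclidean distance on mixed units needs care.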


Standardization

When variables are not commensurate, we can standardize them by dividing by the sample standard deviation. This makes them all equally important.

The estimate for the standard deviation of x_k:

$$\hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \bar{x}_k \right)^2 \right)^{1/2}$$

where $\bar{x}_k$ is the sample mean:

$$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

(When might standardization *not* be such a good idea? Hint: think of extremely skewed data and outliers, e.g., Gates income)
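A minimal sketch of standardization, using the 1/n form of the standard deviation estimate. The function name `standardize` and the length values are hypothetical; the example shows that the same variable recorded in metres or millimetres gives the same standardized values.

```python
import math

def standardize(column):
    """Divide each value by the sample standard deviation (1/n form)."""
    n = len(column)
    mean = sum(column) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [v / sigma for v in column], sigma

lengths_m = [1.0, 2.0, 1.0, 3.0]               # hypothetical lengths in metres
lengths_mm = [1000.0 * v for v in lengths_m]   # same lengths in millimetres

z_m, sigma_m = standardize(lengths_m)
z_mm, sigma_mm = standardize(lengths_mm)

print(z_m)
print(z_mm)
```

A single extreme value inflates the standard deviation and squeezes all other standardized values together, which relates to the hint about skewed data and outliers.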


Weighted Euclidean distance

If we have some idea of the relative importance of each variable, we can weight them:

$$d_{WE}(i,j) = \left( \sum_{k=1}^{p} w_k \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$


Other Distance Metrics

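• Minkowski or $L_\lambda$ metric:

$$d(i,j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$$

• Manhattan, city block or $L_1$ metric:

$$d(i,j) = \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|$$

• $L_\infty$ metric:

$$d(i,j) = \max_k \left| x_k(i) - x_k(j) \right|$$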
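The family of metrics above can be compared on a single pair of points. This is a small sketch with made-up function names and points; the exponent is written `lam` in the code.

```python
def minkowski(x, y, lam):
    """Minkowski distance with exponent lam (lam = 1: Manhattan, lam = 2: Euclidean)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

def manhattan(x, y):
    """City block (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def max_metric(x, y):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)  # hypothetical points

print(manhattan(x, y))        # sum of coordinate differences
print(minkowski(x, y, 2))     # Euclidean
print(minkowski(x, y, 50))    # a large exponent approaches the max metric
print(max_metric(x, y))
```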


Additive Distances

• Each variable contributes independently to the measure of distance.

• May not always be appropriate…

object i        object j
height(i)       height(j)
diameter(i)     diameter(j)
height2(i)      height2(j)
…               …
height100(i)    height100(j)


Dependence among Variables

• Covariance and correlation measure linear dependence

• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$$

• The covariance is a measure of how X and Y vary together.
  – it will be large and positive if large values of X are associated with large values of Y, and small X with small Y


Sample correlation coefficient

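• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Sample correlation coefficient

$$\rho(X,Y) = \frac{\sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)^2 \sum_{i=1}^{n} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$$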
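The sketch below computes the sample covariance and correlation and shows that rescaling one variable changes the covariance but not the correlation. The function names and the data values are assumptions for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the 1/n normalization."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]        # hypothetical, perfectly linear in xs
ys_rescaled = [1000.0 * y for y in ys]

print(covariance(xs, ys), correlation(xs, ys))
print(covariance(xs, ys_rescaled), correlation(xs, ys_rescaled))
```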


Sample Correlation Matrix

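[Figure: sample correlation matrix for data on characteristics of Boston suburbs; color scale from -1 through 0 to +1; labeled variables include business acreage, nitrous oxide, percentage of large residential lots, average # rooms, and median house value]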


Mahalanobis distance

$$d_{MH}(i,j) = \left( \left( x(i) - x(j) \right)^T \Sigma^{-1} \left( x(i) - x(j) \right) \right)^{1/2}$$

1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features

Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather than linearly with the number of features.
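As a small illustration of the first property, the following sketch computes the Mahalanobis distance for two-dimensional data with the 2 x 2 covariance matrix inverted by hand. The function names and the five points are made up, and the restriction to two dimensions is only to keep the example free of external libraries.

```python
import math

def covariance_2d(points):
    """Entries (sxx, sxy, syy) of the 2 x 2 sample covariance matrix (1/n form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between 2-d points u and v for covariance (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = u[0] - v[0], u[1] - v[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]  # hypothetical
rescaled = [(1000.0 * x, y) for x, y in points]  # first axis in different units

d_original = mahalanobis_2d(points[0], points[4], covariance_2d(points))
d_rescaled = mahalanobis_2d(rescaled[0], rescaled[4], covariance_2d(rescaled))
print(d_original, d_rescaled)  # unchanged by rescaling an axis
```

With an identity covariance matrix the expression reduces to ordinary Euclidean distance.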


What about…

[Figure: scatter plot of Y against X]

ρ(X,Y) = ?

Linear covariance, correlation

Are X and Y dependent?
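One standard illustration of this question (not necessarily the data in the figure) is Y = X² with X symmetric about zero: Y is completely determined by X, yet the sample correlation is zero because the dependence is not linear. The values below are made up.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the 1/n normalization."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Sample correlation coefficient."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # Y is a deterministic function of X

rho = correlation(xs, ys)
print("correlation:", rho)
```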


Binary Vectors

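Counts over the variables for two binary vectors i and j:

         j=1    j=0
i=1      n11    n10
i=0      n01    n00

(e.g., n01 = number of variables where item j = 1 and item i = 0)

• matching coefficient

$$\frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$$

• Jaccard coefficient

$$\frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$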
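The two coefficients differ in whether joint absences (n00) count as agreement. A minimal sketch, with invented function names and two hypothetical sparse binary vectors:

```python
def binary_counts(i, j):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 vectors."""
    n11 = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return n11, n10, n01, n00

def matching_coefficient(i, j):
    n11, n10, n01, n00 = binary_counts(i, j)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard_coefficient(i, j):
    # Undefined when both vectors are all zeros; not handled in this sketch.
    n11, n10, n01, n00 = binary_counts(i, j)
    return n11 / (n11 + n10 + n01)

item_i = [1, 1, 0, 0, 0, 0, 0, 0]
item_j = [1, 0, 1, 0, 0, 0, 0, 0]

print(matching_coefficient(item_i, item_j))  # dominated by shared zeros
print(jaccard_coefficient(item_i, item_j))   # ignores shared zeros
```

For sparse data such as market baskets, the shared zeros dominate the matching coefficient, which is why the Jaccard coefficient is often preferred there.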


Other distance metrics

• Categorical variables
  – Number of matches divided by number of dimensions

• Distances between strings of different lengths
  – e.g., “Patrick J. Smyth” and “Padhraic Smyth”
  – Edit distance

• Distances between images and waveforms
  – Shift-invariant, scale invariant
  – i.e., d(x,y) = min_{a,b} ( (ax+b) – y)
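Edit distance can be computed with a short dynamic program. The sketch below uses the common Levenshtein variant with unit costs for insertion, deletion, and substitution; that choice of costs is an assumption, since the slide does not specify one.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```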


Transforming Data

• Duality between form of the data and the model
  – Useful to bring data onto a “natural scale”
  – Some variables are very skewed, e.g., income

• Common transforms: square root, reciprocal, logarithm, raising to a power
  – Often very useful when dealing with skewed real-world data

• Logit: transforms from 0 to 1 to real-line

$$\mathrm{logit}(p) = \log \frac{p}{1 - p}$$
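A minimal sketch of the logit transform and its inverse, using the natural logarithm; the function name `inverse_logit` is chosen here for illustration.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(p, z, inverse_logit(z))
```

Probabilities near 0 or 1 are stretched far apart on the logit scale, while 0.5 maps to 0.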


Multidimensional Scaling (MDS)

• Say we have data in the form of an N x N matrix of dissimilarities
  – 0’s on the diagonal
  – Symmetric

• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences

• MDS:
  – Find k-dimensional coordinates for each of the N objects such that Euclidean distances in “embedded” space match the set of dissimilarities as closely as possible


Multidimensional Scaling (MDS)

• MDS criterion

$$S = \frac{\sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2}{\sum_{i,j} d(i,j)^2}$$

where δ(i,j) are the original dissimilarities and d(i,j) is the Euclidean distance in the “embedded” k-dim space

• Optimization: find the set of N k-dimensional positions that minimize S
  – If original dissimilarities are Euclidean
    • -> linear algebra solution (equivalent to principal components)
  – Non-Euclidean (more typical)
    • Local iterative hill-climbing, e.g., move each point to decrease S, repeat
    • Complexity is O(N^2 k) per iteration (iteration = move all points)
  – See Faloutsos and Lin (1995) for FastMap: O(Nk) approximation for large N

• Often used for visualization, e.g., k=2, 3


MDS example: input distance data

Chicago Raleigh Boston Seattle S.F. Austin Orlando

Chicago 0

Raleigh 641 0

Boston 851 608 0

Seattle 1733 2363 2488 0

S.F. 1855 2406 2696 684 0

Austin 972 1167 1691 1764 1495 0

Orlando 994 520 1105 2565 2458 1015 0
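The local hill-climbing idea from the MDS criterion slide can be sketched on the city distances above. This is a toy random-search version under stated assumptions, not the method used to produce the lecture's result: each point in turn is given a random perturbation, and the move is kept only if S decreases. Here S is taken as the sum of squared differences between embedded distances and given dissimilarities, divided by the sum of squared embedded distances; the seed, step size, and number of sweeps are arbitrary choices.

```python
import math
import random

cities = ["Chicago", "Raleigh", "Boston", "Seattle", "S.F.", "Austin", "Orlando"]
lower = [
    [0],
    [641, 0],
    [851, 608, 0],
    [1733, 2363, 2488, 0],
    [1855, 2406, 2696, 684, 0],
    [972, 1167, 1691, 1764, 1495, 0],
    [994, 520, 1105, 2565, 2458, 1015, 0],
]
n = len(cities)
# Expand the lower triangle into a full symmetric dissimilarity matrix.
delta = [[float(lower[max(i, j)][min(i, j)]) for j in range(n)] for i in range(n)]

def stress(pos):
    """S for 2-d positions pos, normalized by the squared embedded distances."""
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            num += (d - delta[i][j]) ** 2
            den += d ** 2
    return num / den

rng = random.Random(0)  # arbitrary seed
pos = [(rng.uniform(0, 2500), rng.uniform(0, 2500)) for _ in range(n)]
initial_S = stress(pos)
best_S = initial_S
step = 200.0  # arbitrary initial perturbation size

for sweep in range(300):          # one sweep = try to move every point once
    for i in range(n):
        old = pos[i]
        pos[i] = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        new_S = stress(pos)
        if new_S < best_S:
            best_S = new_S        # keep the move: S decreased
        else:
            pos[i] = old          # undo the move
    step *= 0.99                  # slowly shrink the perturbations

final_S = best_S
print("initial S:", initial_S, "final S:", final_S)
for name, (x, y) in zip(cities, pos):
    print(name, round(x), round(y))
```

The recovered coordinates are only determined up to rotation, reflection, and translation, so the layout may look like a rotated or mirrored map.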


Result of MDS


MDS: Example data


MDS: 2d embedding of face images


Data Quality

• Individual measurements
  – Random noise in individual measurements
    • Variance (precision)
    • Bias
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don’t want to answer
      – Patients who are too ill for a certain test


Data Quality

• Collections of measurements
  – Ideal case = random sample from population of interest
  – Real case = often a biased sample of some sort
  – Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution

• Examples of non-randomly sampled data
  – Medical study where subjects are all students
  – Geographic dependencies
  – Temporal dependencies
  – Stratified samples
    • E.g., 50% healthy, 50% ill
  – Hidden systematic effects
    • E.g., market basket data the weekend of a large sale in the store
    • E.g., Web log data during finals week


Next Lecture

• Chapter 3

– Exploratory data analysis and visualization
