lecture02 03 similaritymetrices datavisualization

Upload: bilo044

Post on 23-Feb-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    1/53

    Distance or Similarity Measures

    Many data mining and analytics tasks involve the comparison of objects and

    determining in terms of their similarities (or dissimilarities)

    Clustering

    Nearest-neighbor search classification and prediction

    Correlation analysis

    Many of todays real-!orld applications rely on the computation similarities or

    distances among objects

    "ersonali#ation

    $ecommender systems

    %ocument categori#ation &nformation retrieval

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    2/53

    Similarity and Dissimilarity

    'imilarity

    Numerical measure of ho! alike t!o data objects are alue is higher !hen objects are more alike

    ften falls in the range *+,

    %issimilarity (e.g. distance)

    Numerical measure of ho! different t!o data objects are

    /o!er !hen objects are more alike

    Minimum dissimilarity is often +

    0pper limit varies

    "ro1imity refers to a similarity or dissimilarity

    2

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    3/53

    3

    Distance or Similarity Measures

    Measuring %istance

    &n order to group similar items !e need a !ay to measure thedistance bet!een objects (e.g. records)

    ften re2uires the representation of objects as 3feature vectors4

    ID Gender Age Salary

    1 F 27 19,000

    2 M 51 64,000

    3 M 52 100,000

    4 F 33 55,000

    5 M 45 45,000

    T1 T2 T3 T4 T5 T6

    Doc1 0 4 0 0 0 2

    Doc2 3 1 4 3 1 2

    Doc3 3 0 0 0 3 0

    Doc4 0 1 0 3 0 0

    Doc5 2 2 2 3 1 4

    An Employee DB Term Frequencies for Documents

    Feature vector corresponding to

    Employee 2:

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    4/53

    Distance or Similarity Measures

    "roperties of %istance Measures5

    for all objects 6 and 7 dist(6 7) + and dist(6 7) 8dist(7 6)

    for any object 6 dist(6 6) 8 +

    $epresentation of objects as vectors5

    9ach data object (item) can be vie!ed as an n-dimensional vector !here

    the dimensions are the attributes (features) in the data

    :he vector representation allo!s us to compute distance or similarity

    bet!een pairs of items using standard vector operations e.g. Cosine of the angle bet!een vectors

    Manhattan distance

    9uclidean distance

    ;amming %istance

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    5/53

    Data Matrix and Distance Matrix

    %ata matri1

    Conceptual representation of a table Cols 8 features< ro!s 8 data objects

    ndata points !ithpdimensions

    9ach ro! in the matri1 is the vector

    representation of a data object

    %istance (or 'imilarity) Matri1

    n data points but indicates only the

    pair!ise distance (or similarity)

    6 triangular matri1

    'ymmetric

    npx...

    nfx...

    n1x

    ...............

    ipx...ifx...i1x

    ...............1px...1fx...11x

    %&&&(2)(")

    :::

    (23)(

    ...ndnd

    0dd(3,1

    0d(2,1)

    0

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    6/53

    Proximity Measure for Nominal Attributes

    If object attributes are all nominal (categorical), then

    proximity measure are used to compare objects

    Can take 2 or more states, e.g., red, yellow, blue, green(generaliation of a binary attribute)

    !ethod "# $imple matching

    m# % of matches,p# total % of &ariables

    !ethod 2# Con&ert to $tandard $preadsheet format 'or each attributeA create Mbinary attribute for theMnominal states ofA

    hen use standard &ectorbased similarity or distancemetrics

    pmpjid

    =)(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    7/53

    * + contingency table for binary data

    * $imple matching coecient

    dcbacbjid+++

    +=)(

    Proximity Measure for Binary Attributes

    pdbcasum

    dcdc

    baba

    sum

    ++

    +

    +

    +

    ,

    +,

    *+,ect i

    *+,ect j

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    8/53

    Proximity Measure for Binary Attributes

    * -xample

    /et the &alues 0 and 1 be set to ", and the &alue be set to3

    -!&%2""

    2"()

    #-&%"""

    ""()

    33&%"%2

    "%()

    maryjimd

    jimjackd

    maryjackd

    pdbcasum

    dcdc

    baba

    sum

    ++

    +

    +

    +

    ,

    +,

    cbacbjid++

    +=)(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    9/53.

    Common Distance Measures for Numeric Data

    Consider t!o vectors

    $o!s in the data matri1

    Common %istance Measures5

    Manhattan distance5

    9uclidean distance5

    %istance can be defined as a dual of a similarity measure

    ( ) , ( )dist X Y sim X Y = = =( )

    ( )i i

    i

    i ii i

    x ysim X Y

    x y

    =

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    10/53

    Data Normalization for Numeric Data

    * + database can contain nnumbers of numeric

    attributes* + larger range attribute or noise can shift thesamples distances

    * 'or example# 4Income5 attribute can dominate thedistance as compared to 46eight5 and 4+ge5

    attributes

    * he objecti&e of normaliation is con&ert allnumeric attributes, so that there &alues fall within asmall speci7ed range, such as 3 to ".3

    * ormaliation is particularly useful for clusteringand distance based algorithms such as knearestneighbor

    Attr1 Attr2 Attr3 Attr4

    "2 28 99 9:

    22 2: 92 9;

    2; 28 98 9"2"2 29 9; 9;

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    11/53

    * min-max normalization

    -xample# $uppose that the minimumandmaximum&alues for the attribute income are"2,333 and y minmax normaliation, a&alue of ?9,;33for income is transformed to

    @ ((?9,333"2,333)A(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    12/53

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    13/53

    Example: Data Matrix and Distance Matrix

    Distance Matrix (Euclidean)

    Data Matrix

    Distance Matrix (Manhattan)

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    14/53

    Vector-Based Similarity Measures

    &n some situations distance measures provide a ske!ed vie! of data

    9.g. !hen the data is very sparse and +@s in the vectors are notsignificant

    &n such cases typically vector-based similarity measures are

    used

    Most common measure5 Cosine similarity

    %ot product of t!o vectors5

    Cosine 'imilarity 8 normali#ed dot product

    the norm of a vector A is5

    the cosine similarity is5

    =i

    ixX =

    =

    =

    i

    i

    i

    i

    i

    ii

    yx

    yx

    yX

    YXYXsim

    ==

    )(

    )(

    , = nX x x x= / , = nY y y y= /

    ==i

    ii yxYXYXsim )(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    15/53

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    16/53

    Example (Vector Space Model)

    * G# gold sil&er truck

    *H"#

    $hipment of gold damaged in a 7re

    * H2# Heli&ery of sil&er arri&ed in a sil&er truck

    * H9# $hipment of gold arri&ed in a truck

    he idf for the terms in the three documents is gi&en below#

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    17/5317

    Example (Vector Space Model)

    * Weight the items of ectors using tf !eighting

    schemecell contain ra! term fre"uenc# of

    term !ithin document

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    18/53

    Documents # $uery in n-dimensional Space

    "/

    Documents are reresented as !ectors in t"e term sace Tyically !alues in eac" dimension corresond to t"e #re$uency o#

    t"e corresonding term in t"e document

    %ueries reresented as !ectors in t"e same !ector&sace

    'osine similarity (et)een t"e $uery and documents is

    o#ten used to ran* retrie!ed documents

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    19/53".

    Example: Similarities amon% Documents

    Consider the follo!ing document-term matri1

    T1 T2 T3 T4 T5 T6 T7 T+

    Doc1 0 4 0 0 0 2 1 3

    Doc2 3 1 4 3 1 2 0 1

    Doc3 3 0 0 0 3 0 3 0

    Doc4 0 1 0 3 0 0 2 0

    Doc5 2 2 2 3 1 4 0 2

    Dot01roduct)Doc2Doc$(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    20/53

    !an& Correlation as Similarity

    &n cases !here !e !ant to analy#e correlation among

    t!o features or !e have high mean variance across

    data objects (e.g. movies product ratings) $ank

    Correlation coefficient is the best option

    'pearman?s rank correlation coefficient Bendall tau rank correlation coefficient

    ften used in recommender systems based on

    Collaborative iltering

    2%

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    21/53

    !an& Correlation as Similarity

    * 1roducts ating

    Hell, +pple, $amsung, +cer, J1, $ony

    * Kser" atings#

    Hell(=A"3), +pple (8A"3), $amsung(

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    22/53

    !an& Correlation as Similarity

    * 6hich Kser is most to Kser"

    * Kser" and Kser2

    * Kser" and Kser9

    Rank Correlation as Similarity

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    23/53

    Rank Correlation as Similarity

    (Example)

    * 'ind Correlation between $tudentIG and Jouse of L per week

    * diis the distance between ranksof two attributes

    * n total samples

    Rank Correlation as Similarity

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    24/53

    Rank Correlation as Similarity

    (Example)

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    25/53

    D i Ti W i

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    26/53

    Dynamic Time Warping!erndt" Cli##ord" $%%&'

    * +llows accelerationdeceleration of signals along the

    time dimension

    * >asic idea

    Consider M x", x2, N, xn, and 0 y", y2, N, yn

    6e are allowed to extend each seOuence byrepeating elements

    -uclidean distance now calculated between theextended seOuences Mand 0

    !atrix !, where mij d(xi, yj)

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    27/53

    Example

    ,uclidean distance !s DT-

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    28/53

    o to Calc*late DTW

    * $teps

    Histance table calculation Calculating shortest path in the table

    * -xample# $eries" P2.3, 9.3, ?.3, ?.3, =.3, =.3, ;.3, 2.3,8.3, 2.3, :.3, 8.3, 8.3 Q

    $eries2 P:.3, ?.3, ?.3, =.3, 2.3, 2.3, ;.3, 8.3,

    9.3, :.3, 8.3, 8.3 Q

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    29/53

    o to Calc*late DTW

    * Histance able Calculation

    Distance Matrix comuting

    dist.i/0 2 dist.si/ $0 3 min 4dist.i&1/0&1/ dist.i/ 0&1/ dist.i&1/0

  • 7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization

    30/53

    Example $

    s" s2 s9 s: s8 s; s? s= s