datamining-lecture 2

Upload: tyaa

Post on 01-Jun-2018


  • 8/9/2019 datamining-lecture 2

    DATA MINING

    LECTURE 2

    Data Preprocessing

    Exploratory Analysis

    Post-processing

    What is Data Mining?

    • Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.

    • “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst” (Hand, Mannila, Smyth)

    • “Data mining is the discovery of models for data” (Rajaraman, Ullman)
        • We can have the following types of models:
            • Models that explain the data (e.g., a single function)
            • Models that predict future data instances
            • Models that summarize the data
            • Models that extract the most prominent features of the data

    Why do we need data mining?

    • Really huge amounts of complex data, generated from multiple sources and interconnected in different ways
        • Scientific data from different disciplines
            • Weather, astronomy, physics, biological microarrays, genomics
        • Huge text collections
            • The Web, scientific articles, news, tweets, Facebook postings
        • Transaction data
            • Retail store records, credit card records
        • Behavioral data
            • Mobile phone data, query logs, browsing behavior, ad clicks
        • Networked data
            • The Web, social networks, IM networks, email networks, biological networks
        • All these types of data can be combined in many ways
            • Facebook has a network, text, images, user behavior, ad transactions

    • We need to analyze this data to extract knowledge
        • Knowledge can be used for commercial or scientific purposes
        • Our solutions should scale to the size of the data

    The data analysis pipeline

    • Mining is not the only step in the analysis process

    • Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data
        • Techniques: sampling, dimensionality reduction, feature selection
        • It is dirty work, but it is often the most important step of the analysis

    • Post-processing: make the data actionable and useful to the user
        • Statistical analysis of importance
        • Visualization

    • Pre- and post-processing are often data mining tasks as well

    [Pipeline diagram: Data Preprocessing → Data Mining → Result Post-processing]

    Sampling

    • The key principle for effective sampling is the following:
        • Using a sample will work almost as well as using the entire data set, if the sample is representative
        • A sample is representative if it has approximately the same property (of interest) as the original set of data
        • Otherwise we say that the sample introduces some bias
        • What happens if we take a sample from the university campus to compute the average height of a person at Ioannina?

    Types of Sampling

    • Simple random sampling
        • There is an equal probability of selecting any particular item

    • Sampling without replacement
        • As each item is selected, it is removed from the population

    • Sampling with replacement
        • Objects are not removed from the population as they are selected for the sample
        • In sampling with replacement the same object can be picked more than once. This makes analytical computation of probabilities easier
        • E.g., we have 100 people: 51 are women, P(W) = 0.51, and 49 are men, P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
            • Sampling with replacement: P(W,W) = 0.51^2
            • Sampling without replacement: P(W,W) = 51/100 × 50/99

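The two probabilities in the example above can be checked with exact arithmetic. This is only a small sketch; the function names are illustrative and not part of the lecture.

```python
from fractions import Fraction

def p_both_with_replacement(women, total):
    """P(W,W) when the first person is put back before the second draw."""
    p = Fraction(women, total)
    return p * p

def p_both_without_replacement(women, total):
    """P(W,W) when the first person is removed before the second draw."""
    return Fraction(women, total) * Fraction(women - 1, total - 1)

with_repl = p_both_with_replacement(51, 100)
without_repl = p_both_without_replacement(51, 100)
print(float(with_repl))     # 0.51^2
print(float(without_repl))  # 51/100 * 50/99
```

Removing the first woman from the population lowers the chance that the second pick is also a woman, so the value without replacement is slightly smaller.
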
    Types of Sampling

    • Stratified sampling
        • Split the data into several groups, then draw random samples from each group
            • Ensures that all groups are represented
        • Example 1. I want to understand the differences between legitimate and fraudulent credit card transactions. 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random?
            • I get 1 fraudulent transaction (in expectation). Not enough to draw any conclusions. Solution: sample 1000 legitimate and 1000 fraudulent transactions
        • Example 2. I want to answer the question: do web pages that are linked have on average more words in common than those that are not? I have 1M pages and 1M links. What happens if I select 10K pairs of pages at random?
            • Most likely I will not get any links. Solution: sample 10K random pairs and 10K links

    Probability reminder: if an event has probability p of happening and I do N trials, the expected number of times the event occurs is pN
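
The contrast in Example 1 can be sketched in a few lines. The data set below is hypothetical (100,000 transactions with a 0.1% fraud rate), and `stratified_sample` is an illustrative helper, not a function from the lecture.

```python
import random

def stratified_sample(items, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` items uniformly at random from each stratum."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(stratum_of(item), []).append(item)
    return {
        label: rng.sample(members, min(per_stratum, len(members)))
        for label, members in groups.items()
    }

# Hypothetical data: 100,000 transactions, 0.1% of them fraudulent.
transactions = [("fraud" if i % 1000 == 0 else "legit", i) for i in range(100_000)]

# Uniform sample of 1000: about p * N = 1 fraudulent transaction in expectation.
uniform = random.Random(1).sample(transactions, 1000)
print(sum(1 for label, _ in uniform if label == "fraud"))

# Stratified sample: the same number of items from each group.
strata = stratified_sample(transactions, lambda t: t[0], per_stratum=100)
print({label: len(sample) for label, sample in strata.items()})
```
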

    Sample Size

    • What sample size is necessary to get at least one object from each of 10 groups?
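
One way to get a feeling for the answer is to compute the coverage probability directly. The sketch below assumes, beyond what the slide states, that the 10 groups are equally likely and that items are drawn with replacement; under those assumptions inclusion-exclusion gives the probability that n draws hit every group.

```python
from math import comb

def p_all_groups_covered(n, groups=10):
    """P(every group appears at least once in n draws), by inclusion-exclusion.

    Assumes each draw falls in any of the `groups` groups with equal probability.
    """
    return sum(
        (-1) ** k * comb(groups, k) * (1 - k / groups) ** n
        for k in range(groups + 1)
    )

def smallest_sample_size(target, groups=10):
    """Smallest n whose coverage probability reaches `target`."""
    n = groups
    while p_all_groups_covered(n, groups) < target:
        n += 1
    return n

for target in (0.5, 0.9, 0.99):
    print(target, smallest_sample_size(target))
```

With groups of unequal size the required sample grows, since the rarest group dominates.
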

    A data mining challenge

    • You have N integers and you want to sample one integer uniformly at random. How do you do that?

    • The integers are coming in a stream: you do not know the size of the stream in advance, and there is not enough memory to store the stream. You can only keep a constant number of integers in memory

    • How do you sample?
        • Hint: if the stream ends after reading n integers, the last integer in the stream should have probability 1/n to be selected

    • Reservoir sampling:
        • Standard interview question for many companies

    Reservoir sampling

    • Algorithm: with probability 1/n select the n-th item of the stream and replace the previous choice

    • Claim: every item has probability 1/N to be selected after N items have been read

    • Proof
        • What is the probability of the n-th item to be selected? 1/n
        • What is the probability of the n-th item to survive for N-n rounds? (1 - 1/(n+1)) (1 - 1/(n+2)) ... (1 - 1/N) = n/N
        • Multiplying the two gives 1/n × n/N = 1/N

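The algorithm above translates almost line by line into code. This is a minimal sketch for a reservoir of size one; the function name and the optional `rng` parameter are illustrative choices.

```python
import random

def reservoir_sample(stream, rng=random):
    """Return one item chosen uniformly at random from an iterable of unknown length."""
    choice = None
    for n, item in enumerate(stream, start=1):
        # Keep the n-th item with probability 1/n, replacing the previous choice.
        if rng.random() < 1.0 / n:
            choice = item
    return choice

# Empirical check on a short stream: each item should be picked about 20% of the time.
rng = random.Random(0)
trials = 20_000
counts = [0] * 5
for _ in range(trials):
    counts[reservoir_sample(iter(range(5)), rng)] += 1
print([c / trials for c in counts])
```

Only the current choice and the counter n are kept in memory, so the memory use is constant regardless of the stream length.
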
    Mining Task

    • Collect all reviews for the top-10 most reviewed restaurants in NY in Yelp
        • (thanks to Hady Law)

    • Find few terms that best describe the restaurants

    • Algorithm?

    Example data

    • I heard so many good things about this place so I was pretty juiced to try it. I'm from Cali and I heard Shake Shack is comparable to IN-N-OUT and I gotta say, Shake Shake wins hands down. Surprisingly, the line was short and we waited about 10 MIN. to order. I ordered a regular cheeseburger, fries and a black & white shake. So yummer. I love the location too! It's in the middle of the city and the view is breathtaking. Definitely one of my favorite places to eat in NYC.

    • I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day, err'day.

    • Would I pay $15+ for a burger here? No. But for the price point they are asking for, this is a definite bang for your buck (though for some, the opportunity cost of waiting in line might outweigh the cost savings) Thankfully, I came in before the lunch swarm descended and I ordered a shake shack (the special burger with the patty + fried cheese & portabella topping) and a coffee milk shake. The beef patty was very juicy and snugly packed within a soft potato roll. On the downside, I could do without the fried portabella-thingy, as the crispy taste conflicted with the juicy, tender burger. How does shake shack compare with in-and-out or 5-guys? I say a very close tie, and I think it comes down to personal affliations. On the shake side, true to its name, the shake was well churned and very thick and luscious. The coffee flavor added a tangy taste and complemented the vanilla shake well. Situated in an open space in NYC, the open air sitting allows you to munch on your burger while watching people zoom by around the city. It's an oddly calming experience, or perhaps it was the food coma I was slowly falling into. Great place with food at a great price.


    First cut

    • Do simple processing to “normalize” the data (remove punctuation, make into lower case, clear white spaces, other?)

    • Break into words, keep the most popular words

    [Table: most frequent words with their counts. Most counts did not survive extraction; those that did are shown in parentheses.]

    the, and, with, to, a, it, of, is, sauce (4020), in, this, was, for, you, that, but, food, on, my, cart (2236), chicken (2220), with, rice, so, so, have
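
The first cut can be sketched with the standard library. The exact normalization rules are an assumption here (the slide leaves "other?" open), and the two sample strings are taken from the example reviews.

```python
import re
from collections import Counter

def normalize(text):
    """Lower-case, drop punctuation, and collapse runs of white space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # remove punctuation ("it's" becomes "its")
    return re.sub(r"\s+", " ", text).strip()

def top_words(reviews, k=10):
    """Count word occurrences over all reviews and return the k most common."""
    counts = Counter()
    for review in reviews:
        counts.update(normalize(review).split())
    return counts.most_common(k)

reviews = [
    "I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day, err'day.",
    "Great place with food at a great price.",
]
print(top_words(reviews, k=5))
```

On a real review collection the top of this list is dominated by words such as "the" and "and", which motivates the next step.
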

    Second cut

    • Remove stop words
        • Stop-word lists can be found online

    a,about,above,after,again,against,all,am,an,and,any,are,aren't,as,at,be,
    because,been,before,being,below,between,both,but,by,can't,cannot,could,
    couldn't,did,didn't,do,does,doesn't,doing,don't,down,during,each,few,for,
    from,further,had,hadn't,has,hasn't,have,haven't,having,he,he'd,he'll,he's,
    her,here,here's,hers,herself,him,himself,his,how,how's,i,i'd,i'll,i'm,
    i've,if,in,into,is,isn't,it,it's,its,itself,let's,me,more,most,mustn't,my,
    myself,no,nor,not,of,off,on,once,only,or,other,ought,our,ours,ourselves,
    out,over,own,same,shan't,she,she'd,she'll,she's,should,shouldn't,so,some,
    such,than,that,that's,the,their,theirs,them,themselves,then,there,there's,
    these,they,they'd,they'll,they're,they've,this,those,through,to,too,under,
    until,up,very,was,wasn't,we,we'd,we'll,we're,we've,were,weren't,what,
    what's,when,when's,where,where's,which,while,who,who's,whom,why,why's,
    with,won't,would,wouldn't,you,you'd,you'll,you're,you've,your,yours,
    yourself,yourselves
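
Filtering against a stop-word list is a one-line change to the counting step. The set below is only a small excerpt chosen for illustration; a real run would load a full list such as the one above.

```python
from collections import Counter

# A small excerpt of a stop-word list; a real run would use a full list.
STOP_WORDS = {"a", "and", "at", "i", "is", "it", "of", "the", "to", "very", "was", "with"}

def top_content_words(words, k=10, stop_words=STOP_WORDS):
    """Count words after discarding stop words; return the k most common."""
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(k)

words = "the beef patty was very juicy and the shake was thick".split()
print(top_content_words(words, k=3))
```
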

    [Table: most frequent words per restaurant after stop-word removal. The counts did not survive extraction.]

    ramen, noodles, ippudo, buns, broth, like, just, get, time, one, really, go, food, bowl, can, great, best

    burger, shack, shake, line, fries, good, burgers, wait, place, one

    [Table from the following slides: terms with their scores. The scores did not survive extraction.]

    patty, ss, patties, 6th, 4am, yellow, deli's, carver, brown's

    Third cut

    • TF-IDF takes care of stop words as well

    • We do not need to remove the stop words since they will get IDF(w) = 0
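
The claim that stop words get IDF(w) = 0 can be seen in a tiny example. The lecture's own TF-IDF definition is not reproduced in this transcript, so the sketch assumes a common variant: TF is the raw count of the word in the document and IDF(w) = log(D / DF(w)), where D is the number of documents and DF(w) the number of documents containing w. The toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dict per document (each a list of words)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Assumed variant: idf(w) = log(D / df(w)); a word in every document gets 0.
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return [
        {word: count * idf[word] for word, count in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical toy documents, one per restaurant.
docs = [
    "the ramen and the broth".split(),
    "the burger and the shake".split(),
    "the chicken and the rice".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

Words such as "the" and "and" occur in every document, so their score is zero no matter how often they appear, while "ramen" keeps a positive score.
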

    Frequency and Mode

    • The frequency of an attribute value is the percentage of time the value occurs in the data set
        • For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time

    • The mode of an attribute is the most frequent attribute value

    • The notions of frequency and mode are typically used with categorical data
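
Both notions are a short computation over a categorical column. The attribute values below are hypothetical.

```python
from collections import Counter

def frequencies(values):
    """Fraction of the data set taken by each attribute value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Most frequent attribute value."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute.
marital_status = ["single", "married", "single", "divorced", "married", "single"]
print(frequencies(marital_status))
print(mode(marital_status))
```
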

    Percentiles

    • For continuous data, the notion of a percentile is more useful

    • Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value x_p of x such that p% of the observed values of x are less than x_p

    • For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%

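Several conventions exist for computing percentiles on a finite sample. The sketch below uses the nearest-rank convention, which is an assumption rather than the lecture's definition: it returns the smallest observed value with at least p% of the data at or below it.

```python
import math

def percentile(values, p):
    """Nearest-rank p-th percentile of a non-empty list, for 0 < p <= 100."""
    if not values or not 0 < p <= 100:
        raise ValueError("need non-empty data and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank into the sorted data
    return ordered[rank - 1]

data = [7, 1, 10, 4, 3, 9, 2, 8, 5, 6]   # hypothetical values
print(percentile(data, 50))
print(percentile(data, 90))
```
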
    Measures of Location: Mean and Median

    • The mean is the most common measure of the location of a set of points

    • However, the mean is very sensitive to outliers

    • Thus, the median or a trimmed mean is also commonly used

    Example

    Mean: 1090K

    Trimmed mean (remove min, max): 105K

    Median: (90+100)/2 = 95K
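
The effect of an outlier on the three measures is easy to reproduce. The data table behind the example above is not part of this transcript, so the incomes below are hypothetical and will not match the slide's numbers.

```python
from statistics import mean, median

def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    if len(ordered) <= 2 * trim:
        raise ValueError("not enough values to trim")
    return mean(ordered[trim:len(ordered) - trim])

# Hypothetical incomes in thousands, with one extreme outlier.
incomes = [60, 70, 80, 90, 100, 110, 120, 5000]
print(mean(incomes))          # pulled far up by the outlier
print(trimmed_mean(incomes))  # min and max removed
print(median(incomes))        # average of the two middle values
```
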

    Measures of Spread: Range and Variance

    • Range is the difference between the max and min

    • The variance or standard deviation is the most common measure of the spread of a set of points

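The slide's formula is missing from this transcript, so the sketch below shows both common normalizations from Python's `statistics` module rather than guessing which one the lecture used. The data values are hypothetical.

```python
from statistics import pstdev, pvariance, variance

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
print(value_range(data))  # max - min
print(pvariance(data))    # population variance: mean squared deviation from the mean
print(pstdev(data))       # population standard deviation: square root of the variance
print(variance(data))     # sample variance: divides by n - 1 instead of n
```
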
    Normal Distribution

    • An important distribution that characterizes many quantities and has a central role in probability and statistics
        • Appears also in the central limit theorem

    • Fully characterized by the mean and standard deviation

    [Plot: a value histogram]

    Not everything is normally distributed

    • Plot of number of words with x number of occurrences

    • If this was a normal distribution we would not have a frequency as large as 28K

    [Plot: number of words (y-axis, 0 to 8000) against number of occurrences (x-axis, 0 to 35000)]

    Power-law distribution

    • We can understand the distribution of words if we take the log-log plot

    • Linear relationship in the log-log space

    [Log-log plot: x-axis from 1 to 100000, y-axis from 1 to 10000]

    Zipf's law

    • Power laws can be detected by a linear relationship in the log-log space for the rank-frequency plot

    • Frequency of the r-th most frequent word

    [Log-log rank-frequency plot: both axes from 1 to 100000]

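The linear relationship can be tested numerically by fitting a line to log(frequency) against log(rank). The sketch assumes the usual power-law form f(r) = C · r^(-α), in which case the fitted slope estimates -α; the synthetic frequencies are made up for illustration and the helper names are not from the lecture.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Frequencies sorted in decreasing order; position i holds rank i + 1."""
    return sorted(Counter(words).values(), reverse=True)

def log_log_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic frequencies that follow f(r) = 1000 / r exactly (illustration only).
synthetic = [1000 / r for r in range(1, 201)]
print(log_log_slope(synthetic))  # close to -1
```

For real text, `rank_frequency` applied to the word list gives the input to the fit; a roughly constant slope across ranks is the signature of a power law.
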
    Power-laws are everywhere

    • Incoming and outgoing links of web pages, number of friends in social networks, number of occurrences of words, file sizes, city sizes, income distribution, popularity of products and movies
        • Signature of human activity?
        • A mechanism that explains everything?
        • Rich get richer process

    The Long Tail

    Source: Chris Anderson (2004)

    http://www.wired.com/wired/archive/12.10/tail.html

    Scatter Plot Array of Iris Attributes

    Contour Plot Example: SST Dec 1998

    [Contour plot; color scale in Celsius]

    Meaningfulness of Answers

    • A big data-mining risk is that you will “discover” patterns that are meaningless

    • Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

    • The Rhine Paradox: a great example of how not to conduct scientific research

    CS345A Data Mining on the Web: Anand Rajaraman, Jeff Ullman

    Rhine Paradox (1)

    • Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception

    • He devised (something like) an experiment where subjects were asked to guess 10 hidden cards: red or blue

    • He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

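It is worth comparing the "1 in 1000" figure with what pure guessing predicts. The sketch assumes each card is guessed independently with probability 1/2; the function names are illustrative.

```python
def p_all_correct(cards=10, colors=2):
    """Probability of guessing every card right by pure chance."""
    return (1 / colors) ** cards

def expected_lucky_subjects(subjects, cards=10, colors=2):
    """Expected number of subjects who get every card right by chance alone (p * N)."""
    return subjects * p_all_correct(cards, colors)

print(p_all_correct())                # 1/1024
print(expected_lucky_subjects(1000))  # just under one subject per thousand
```
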
    Rhine Paradox (2)

    • He told these people they had ESP and called them in for another test of the same type

    • Alas, he discovered that almost all of them had lost their ESP

    • What did he conclude?
        • Answer on next slide