introduction to data mining - university of minnesota to... · introduction to data mining 1....

IntroductiontoDataMining

Large-scaledataiseverywhere!

• Therehasbeenenormousdatagrowthinbothcommercialandscientificdatabasesduetoadvancesindatagenerationandcollectiontechnologies.

• Newmantra:– Gatherwhateverdatayoucanwheneverand

whereverpossible.

• Expectations:– Gathereddatawillhavevalueeitherforthe

purposecollectedorforapurposenotenvisioned.

Computational Simulations

Business Data

Sensor Networks

Geo-spatial data

Homeland Security

Whydatamining?

Commercialviewpoint:– Lotsofdataisbeingcollectedandwarehoused.• Webdata:– Yahoohaspetabytesofwebdata.– Facebookhas~2Bactiveusers.

• Purchasesatdepartment/grocerystores,e-commerce:– Amazonrecords1.1Bordersayear.– Bank/CreditCardtransactions.

– Computershavebecomecheaperandmorepowerful.

– Competitivepressureisstrong.• Providebetter,customizedservicesforanedge(e.g.inCustomerRelationshipManagement).

Whydatamining?

Scientificviewpoint:– Datacollectedandstoredat

enormousspeeds.

• Remotesensorsonasatellite.– NASAEOSDISarchivesover 1-petabytesof

earthsciencedata/year.

• Telescopesscanningtheskies.– Skysurveydata.

• High-throughputbiologicaldata.

• Scientificsimulations.– Terabytesofdatageneratedinafewhours.

– Datamininghelpsscientists.• Inautomatedanalysisofmassivedatasets.

• Inhypothesisformation.

Whatisdatamining?

Manydefinitions:– Non-trivialextractionofimplicit,previouslyunknownandpotentiallyusefulinformationfromdata.

– Exploration&analysis,byautomaticorsemi-automaticmeans,oflargequantitiesofdatainordertodiscovermeaningfulpatterns.

Originsofdatamining

• Drawsideasfrommachinelearning/AI,patternrecognition,statistics,anddatabasesystems.

• Traditionaltechniquesmaybeunsuitableduetodatathatis:– Large-scale– Highdimensional– Heterogeneous– Complex– Distributed

KeyDistinction:Datadrivenvs.Hypothesisdriven

Dataminingtasks

• Predictiontask:– Usesomevariablestopredictunknownorfuturevaluesofothervariables.

• Descriptiontask:– Findhuman-interpretablepatternsthatdescribethedata.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Tid Refund Marital Status

Taxable Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes

13 No Single 85K Yes

15 No Single 90K Yes 10

Dataminingmethods

Predictivemodeling:Classification

Findamodelforclassattributeasafunctionofthevaluesofotherattributes.

Tid Employed Level of Education

# years at present address

Credit Worthy

1 Yes Graduate 5 Yes 2 Yes High School 2 No 3 No Undergrad 1 No 4 Yes High School 10 Yes … … … … …

Model for predicting credit worthiness

ClassEmployed

No Education

Number ofyears

No Yes

Graduate High school, Undergrad

Yes No

> 7 yrs < 7 yrs

Number ofyears

> 3 yr < 3 yr

Examplesofclassification

• Predictingtumorcellsasbenignormalignant.

• Classifyingcreditcardtransactionsaslegitimateorfraudulent.

• Classifyingsecondarystructuresofproteinasalpha-helix,beta-sheet,orrandomcoil.

• Categorizingnewsstoriesasfinance,weather,entertainment,sports,etc.

• Identifyingintrudersinthecyberspace.

ClusteringFindinggroupsofobjectssuchthattheobjectsinagroupwillbesimilar(orrelated)tooneanotheranddifferentfrom(orunrelatedto)theobjectsinothergroups.

Inter-clusterdistancesaremaximized

Intra-clusterdistancesareminimized

Applicationsofclustering

• Understanding– Customprofilingfortargetedmarketing.– Grouprelateddocumentsforbrowsing.– Groupgenesandproteinsthathavesimilar

functionality.– Groupstockswithsimilarpricefluctuations.

• Summarization– Reducethesizeoflargedatasets.

Clusters for Raw SST and Raw NPP

longitude

latitu

-180 -150 -120 -90 -60 -30 0 30 60 90 120 150 180

Cluster

Sea Cluster 1

Sea Cluster 2

Ice or No NPP

Land Cluster 1

Land Cluster 2

UseofK-meanstopartitionSeaSurfaceTemperature(SST)andNetPrimaryProduction(NPP)intoclustersthatreflecttheNorthernandSouthernHemispheres.

Courtesy: Michael Eisen

Associationrulediscovery

Givenasetofrecordseachofwhichcontainsomenumberofitemsfromagivencollection.– Producedependencyruleswhichwillpredictoccurrenceofanitembasedonoccurrencesofotheritems.

TID Items

1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

Rules Discovered:Milk --> CokeDiaper, Milk --> Beer

Associationanalysis:Applications

• Market-basketanalysis.– Rulesareusedforsalespromotion,shelfmanagement,andinventory

management.

• Telecommunicationalarmdiagnosis.– Rulesareusedtofindcombinationofalarmsthatoccurtogether

frequentlyinthesametimeperiod.

• MedicalInformatics.– Rulesareusedtofindcombinationofpatientsymptomsandtest

resultsassociatedwithcertaindiseases.

Motivatingchallenges

• Scalability.

• Highdimensionality.

• Heterogeneousandcomplexdata.

• Dataownershipanddistribution.

• Non-traditionalanalysis.

The4V’sof“BigData”

PatternMining

ASSOCIATIONRULES

AssociationRuleMiningGivenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitembasedontheoccurrencesofotheritemsinthetransaction

Market-basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of association rules

Diaper ® Beer,Milk, Bread ® Eggs,Coke,Beer, Bread ® Milk,

Implication means co-occurrence, not causality!

Definition:FrequentItemsetItemset– Acollectionofoneormoreitems

• Example:Milk,Bread,Diaper– k-itemset

• Anitemset thatcontainskitemsSupportcount(𝜎)– Frequencyofoccurrenceofanitemset– E.g.𝜎(Milk,Bread,Diaper)=2

Support(𝑠)– Fractionoftransactionsthatcontainanitemset– E.g.𝑠(Milk,Bread,Diaper)=2/5

FrequentItemset– Anitemset whosesupportisgreaterthanor

equaltoaminsup threshold

TID Items

1 Bread, Milk

3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer

Definition:AssociationRule

Example:BeerDiaper,Milk Þ

|T|)BeerDiaper,,Milk(

67.032

)Diaper,Milk()BeerDiaper,Milk,(

Association Rule– An implication expression of the form

𝑋 → 𝑌, where 𝑋and 𝑌are itemsets.– Example:

Milk, Diaper →Beer

Rule Evaluation Metrics– Support (𝑠)

• Fraction of transactions that contain both 𝑋and 𝑌.

– Confidence (𝑐)• Measures how often items in 𝑌

appear in transactions thatcontain 𝑋.

• It is nothing more than 𝑃(𝑌|𝑋).

TID Items

1 Bread, Milk

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

AssociationRuleMiningTask

Givenasetoftransactions𝑇,thegoalofassociationruleminingistofindallruleshaving:

1) support≥minsup threshold,and2) confidence≥minconf threshold.

Example of Rules:Milk,Diaper ® Beer (s=0.4, c=0.67)Milk,Beer ® Diaper (s=0.4, c=1.0)Diaper,Beer ® Milk (s=0.4, c=0.67)Beer ® Milk,Diaper (s=0.4, c=0.67) Diaper ® Milk,Beer (s=0.4, c=0.5) Milk ® Diaper,Beer (s=0.4, c=0.5)

TID Items

1 Bread, Milk

Anapproach….

1. Listallpossibleassociationrules.2. Computethesupportandconfidenceforeachrule.3. Prunerulesthatfailtheminsup andminconf

thresholds.

ComputationalComplexityGivend uniqueitems:

– Totalnumberofitemsets =2𝑑– Totalnumberofpossibleassociationrules:

ùêë

é÷ø

öçè

æ -´÷ø

öçè

=å å

If d=6, R = 602 rules

MiningAssociationRules

Example of Rules:Milk,Diaper ® Beer (s=0.4, c=0.67)Milk,Beer ® Diaper (s=0.4, c=1.0)Diaper,Beer ® Milk (s=0.4, c=0.67)Beer ® Milk,Diaper (s=0.4, c=0.67) Diaper ® Milk,Beer (s=0.4, c=0.5) Milk ® Diaper,Beer (s=0.4, c=0.5)

TID Items

1 Bread, Milk

Observations:• All the above rules are binary partitions of the same itemset:

Milk, Diaper, Beer

• Rules originating from the same itemset have identical support butcan have different confidence.

• Thus, we may decouple the support and confidence requirements.

Miningassociationrules

Two-stepapproach:1. FrequentItemset Generation

– Generateallitemsets whosesupport³minsup.

2. RuleGeneration– Generatehighconfidencerulesfromeachfrequentitemset,

whereeachruleisabinarypartitioningofafrequentitemset.

Frequentitemset generationisstillexpensive.

Frequentitemset generationstrategies

• Reducethenumberofcandidates (𝑀)– Completesearch:𝑀 = 21.– Usepruningtechniquestoreduce𝑀.

• Reducethenumberoftransactions(𝑁)– ReducesizeofNasthesizeofitemset increases.– UsedbyDHPandvertical-basedminingalgorithms.

• Reducethenumberofcomparisons (𝑁𝑀)– Useefficientdatastructurestostorethecandidatesortransactions.

– Noneedtomatcheverycandidateagainsteverytransaction.

PatternLattice

AB AC AD AE BC BD BE CD CE DE

A B C D E

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Given 𝑑items, there are 2𝑑 possible candidate itemsets.

Reducingthenumberofcandidates

• Observation:– Ifanitemset isfrequent,thenallofitssubsetsmustalsobefrequent.

• Thisholdsduetothefollowingpropertyofthesupportmeasure:

– Supportofanitemset neverexceedsthesupportofitssubsets.

– Thisisknownastheanti-monotonepropertyofsupport.

)()()(:, YsXsYXYX ³ÞÍ"

Found to be Infrequent

A B C D E

ABCDEPruned supersets

Illustratingsupport’santi-monotonicity

Minimum Support = 3

TID Items

1 Bread, Milk

2 Beer, Bread, Diaper, Eggs

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Bread, Coke, Diaper, Milk

Minimum Support = 3

TID Items

1 Bread, Milk

Items (1-itemsets)

Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1

Minimum Support = 3

TID Items

1 Bread, Milk

Items (1-itemsets)

Itemset Bread,Milk Bread, Beer Bread,Diaper Beer, Milk Diaper, Milk Beer,Diaper

Items (1-itemsets)

Pairs (2-itemsets)

(No need to generatecandidates involving Cokeor Eggs)

Minimum Support = 3

Itemset Count Bread,Milk 3 Beer, Bread 2 Bread,Diaper 3 Beer,Milk 2 Diaper,Milk 3 Beer,Diaper 3

Items (1-itemsets)

Pairs (2-itemsets)

Minimum Support = 3

Itemset Count Bread,Milk 3 Bread,Beer 2 Bread,Diaper 3 Milk,Beer 2 Milk,Diaper 3 Beer,Diaper 3

Itemset Beer, Diaper, Milk Beer,Bread,Diaper Bread, Diaper, Milk Beer, Bread, Milk

Items (1-itemsets)

Pairs (2-itemsets)

Triplets (3-itemsets)Minimum Support = 3

Itemset Count Beer, Diaper, Milk Beer,Bread, Diaper Bread, Diaper, Milk Beer, Bread, Milk

2 2 2 1

Items (1-itemsets)

Pairs (2-itemsets)

If every subset is considered, 6C1 + 6C2 + 6C36 + 15 + 20 = 41

With support-based pruning,6 + 6 + 4 = 16

Itemset Count Beer, Diaper, Milk Beer,Bread, Diaper Bread, Diaper, Milk Beer, Bread, Milk

2 2 2 1

Items (1-itemsets)

Pairs (2-itemsets)

If every subset is considered, 6C1 + 6C2 + 6C36 + 15 + 20 = 41

With support-based pruning,6 + 6 + 4 = 166 + 6 + 1 = 13

APRIORI

Apriori algorithm– ℱ4:frequent𝑘-itemsets– ℒ4:candidate𝑘-itemsets

Algorithm– Let𝑘 = 1– Generateℱ8 =frequent1-itemsets– Repeatuntilℱ4 isempty

1. CandidateGeneration:Generateℒ498 fromℱ4.2. CandidatePruning:Prunecandidateitemsets inℒ498 containing

subsetsoflength𝑘thatareinfrequent.3. SupportCounting:Countthesupportofeachcandidateinℒ498 by

scanningtheDB.4. CandidateElimination:Eliminatecandidatesinℒ498 thatare

infrequent,leavingonlythosethatarefrequent,leadingtoℱ498.

A B C D E

Level-by-leveltraversalofthelattice.

CandidateGeneration:theℱ4:8×ℱ4:8method

• Mergetwofrequent(𝑘 − 1)-itemsets iftheirfirst(𝑘 − 2)itemsareidentical

• 𝐹> =ABC,ABD,ABE,ACD,BCD,BDE,CDE– Merge(ABC,ABD)=ABCD– Merge(ABC,ABE)=ABCE– Merge(ABD,ABE)=ABDE

– Donotmerge(ABD,ACD)becausetheyshareonlyprefixoflength1insteadoflength2.

Candidatepruning• Letℱ> =ABC,ABD,ABE,ACD,BCD,BDE,CDEbethesetoffrequent3-itemsets.

• ℒ? =ABCD,ABCE,ABDEisthesetofcandidate4-itemsetsgenerated(frompreviousslide).

• Candidatepruning:– PruneABCEbecauseACEandBCEareinfrequent.– PruneABDEbecauseADEisinfrequent.

• Aftercandidatepruning:ℒ? =ABCD.

Supportcountingofcandidateitemsets

Scanthedatabaseoftransactionstodeterminethesupportofeachcandidateitemset.

– Mustmatcheverycandidateitemset againsteverytransaction,whichisanexpensiveoperation.

TID Items

1 Bread, Milk

Itemset Beer, Diaper, Milk Beer,Bread,Diaper Bread, Diaper, Milk Beer, Bread, Milk

Q:Howshouldweperformthisoperation?

Toreducenumberofcomparisons,storethecandidateitemsets inahashstructure.

– Insteadofmatchingeachtransactionagainsteverycandidate,matchitagainstcandidatescontainedinthehashedbuckets.

TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Transactions Hash Structure

Buckets

Supportcountingofcandidateitemsets

Supportcounting:AnexampleSuppose you have 15 candidate itemsets of length 3:

1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8

How many of these itemsets are supported by transaction (1,2,3,5,6)?

1 2 3 5 6

Transaction, t

2 3 5 61 3 5 62

5 61 33 5 61 2 61 5 5 62 3 62 5

1 2 31 2 51 2 6

1 3 51 3 6 1 5 6 2 3 5

2 3 6 2 5 6 3 5 6

Subsets of 3 items

Level 1

Level 2

Level 3

This is a “full” n-arytree where n is the number of items.

Q: Can we reduce storage requirements?

Supportcountingusingahashtree

2 3 45 6 7

1 4 5 1 3 6

1 2 44 5 7 1 2 5

4 5 81 5 9

3 4 5 3 5 63 5 76 8 9

3 6 73 6 8

1,4,72,5,8

3,6,9Hash function

Suppose you have 15 candidate itemsets of length 3:

1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8

You need:

• Hash function.

• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node).

FactorsaffectingthecomplexityofApriori

MAXIMAL&CLOSEDITEMSETS

Maximalfrequentitemset

A B C D E

BorderInfrequent Itemsets

Maximal Itemsets

An itemset is maximal frequent if it is frequent and none of its immediate supersets are frequent.

Closeditemsets• Anitemset 𝑋isclosed ifallofitsimmediatesupersetshavealowersupportthan𝑋.

• Itemset 𝑋isnotclosedifatleastoneofitsimmediatesupersetshasthatsamesupportas𝑋.

TID Items1 A,B2 B,C,D3 A,B,C,D4 A,B,D5 A,B,C,D

Itemset SupportA 4B 5C 3D 4A,B 4A,C 2A,D 3B,C 3B,D 4C,D 3

Itemset SupportA,B,C 2A,B,D 3A,C,D 2B,C,D 2A,B,C,D 2

Maximalvs closedfrequentitemsets

A B C D E

124 123 1234 245 345

12 124 24 4 123 2 3 24 34 45

12 2 24 4 4 2 3 4

Minimum support = 2

# Closed = 9

# Maximal = 4

Closed and maximal

Closed but not maximal

TID Items

2 ABCD

4 ACDE

Frequent,maximal,andcloseditemsets

FrequentItemsets

ClosedFrequentItemsets

MaximalFrequentItemsets

Frequent,maximal,andcloseditemsets

FrequentItemsets

ClosedFrequentItemsets

MaximalFrequentItemsets

Q1:Whatifinsteadoffindingthefrequentitemsets,wefindthemaximalfrequentitemsetsortheclosedfrequentitemsets?

Q2:Doestheknowledgeofjustthemaximalfrequentitemsetswillallowmetogenerateallrequiredassociationrules?

Q3:Doestheknowledgeofjusttheclosedfrequentitemsetswillallowmetogenerateallrequiredassociationrules?

BEYONDLEVEL-BY-LEVELEXPLORATION

AB AC AD BC BD CD

A B C D

ABC ABD ACD BCD

AB AC ADBC BD CD

A B C D

ABC ABD ACD BCD

(a) Prefix tree (b) Suffix tree

Traversingthepatternlattice

PatternsstartingwithA.(patternsthatcontainAandanyotheritem)

PatternsstartingwithB.(patternsthatcontainBandanyotheritemexceptA)

PatternsendingwithD.(patternsthatcontainDandanyotheritem)

PatternsendingwithC.(patternsthatcontainCandanyotheritemexceptD)

Breadth-firstvs Depth-first

a" b" c" d"

ab" ac" ad" bc" bd" cd"

abc" abd" bcd"acd"

a" b" c" d"

abc" abd" bcd"acd"

a" b" c" d"

abc" abd" bcd"acd"

a" b" c" d"

abc" abd" bcd"acd"

Plussesandminuses?

PROJECTIONMETHODS

Projection-basedmethods

A B C D E

Projection-basedmethodsnull

A B C D E

TID Items1 A,B2 B,C,D3 A,C,D,E4 A,D,E5 A,B,C6 A,B,C,D7 B,C8 A,B,C9 A,B,D10 B,C,E

TID Items1 B2 3 C,D,E4 D,E5 B,C6 B,C,D7 8 B,C9 B,D10

Initialdatabase

DatabaseassociatedwithnodeA

“Projected

TID Items1 2 D3 D,E4 5 6 D7 8 9 10 E

DatabaseassociatedwithnodeC

AprojectedDBonprefixpattern𝑋isobtainedasfollows:• Eliminateanytransactionsthatdonotcontain𝑋.• Fromthetransactionsthatareleft,retainonlytheitemsthatarelexicographicallygreaterthantheitemsin𝑋.

Projection-basedmethod• Itemsarelistedinlexicographicorder.• Let𝑃and𝐷𝐵(𝑃) beanode’spatternanditsassociatedprojecteddatabase.

• Miningisperformedbyrecursivelycallingthisfunction:– 𝑇𝑃(𝑃, 𝐷𝐵(𝑃))

1. Determinethefrequentitemsin𝐷𝐵(𝑃),anddenotethemby𝐸(𝑃).2. Eliminatefrom𝐷𝐵(𝑃) anyitemsnotin𝐸(𝑃).3. Foreachitem𝑥in𝐸(𝑃),call𝑇𝑃(𝑃𝑥, 𝐷𝐵(𝑃𝑥)).

BEYONDTRANSACTIONS

Beyondtransactiondatasets

• Theconceptoffrequentpatternsandassociationruleshasbeengeneralizedtodifferenttypesofdatasets:– Sequentialdatasets:

• Sequenceofpurchasingtransactions,web-pagesvisited,articlesread,biologicalsequences,eventlogs,etc.

– Relational/Graphdatasets:• Socialnetworks,chemicalcompounds,web-graphs,informationnetworks,etc.

• Thereisanextensivesetofapproachesandalgorithmsforthem,manyofwhichfollowsimilarideastothosedevelopedfortransactiondatasets.

Clustering(Unsupervisedlearning)

Whatisclusteranalysis?

Findinggroupsofobjectssuchthattheobjectsinagroupwillbesimilar(orrelated)tooneanotheranddifferentfrom(orunrelatedto)theobjectsinothergroups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

Notionofaclustercanbeambiguous

How many clusters?

Notionofaclustercanbeambiguous

How many clusters?

Four ClustersTwo Clusters

Six Clusters

Clusteringformulations

Anumberofclusteringformulationshavebeendeveloped:

1. Weneedtofindafixednumberofclusters.– Well-suitedforcompression-likeapplications.

2. Weneedtofindclustersoffixedsize.– Well-suitedforneighborhood-discovery(recommendationengine).

3. Weneedtofindthesmallestnumberofclustersthatsatisfycertainqualitycriteria.– Well-suitedforapplicationsinwhichclusterqualityisimportant.

4. Weneedtofindthenatural numberofclusters.– Thisisclustering'sholly-grail!

• Extremelyhard,problemdependent,and“quitesupervised”.

Typesofclusterings

• Aclustering isasetofclusters.

• Importantdistinctionbetweenhierarchical andpartitional setsofclusters.

• Partitional clustering– Adivisionofdataobjectsintonon-overlappingsubsets(clusters)such

thateachdataobjectisinexactlyonesubset.

• Hierarchicalclustering– Asetofnestedclustersorganizedasahierarchicaltree.

Partitional clustering

Original Points A Partitional Clustering

Hierarchicalclustering

p2 p4p1 p2 p3

Hierarchical clustering Dendrogram

Otherdistinctionsbetweensetsofclusters

• Exclusiveversusnon-exclusive– Innon-exclusiveclusterings,pointsmaybelongtomultipleclusters.– Canrepresentmultipleclassesor“border”points.

• Fuzzyversusnon-fuzzy– Infuzzyclustering,apointbelongstoeveryclusterwithsomeweight

between0and1.– Weightsmustsumto1.– Probabilisticclusteringhassimilarcharacteristics.

• Partialversuscomplete– Insomecases,weonlywanttoclustersomeofthedata.

• Heterogeneousversushomogeneous– Clustersofwidelydifferentsizes,shapes,anddensities.

Typesofclusters

• Well-separatedclusters• Center-basedclusters• Contiguousclusters• Density-basedclusters• Propertyorconceptual• Describedbyanobjectivefunction

Typesofclusters:Well-separated

Well-separatedclusters:– Aclusterisasetofpointssuchthatanypointinaclusteriscloser(ormore

similar)toeveryotherpointintheclusterthantoanypointnotinthecluster.

Three well-separated clusters

Typesofclusters:Center-based

Center-based– Aclusterisasetofobjectssuchthatanobjectinaclusteriscloser(more

similar)tothe“center”ofacluster,thantothecenterofanyothercluster.– Thecenterofaclusterisoftenacentroid,theaverageofallthepointsin

thecluster,oramedoid,themost“representative”pointofacluster.

Four center-based clusters

Typesofclusters:Contiguity-based

Contiguouscluster(nearestneighborortransitive)– Aclusterisasetofpointssuchthatapointinaclusteriscloser(ormore

similar)tooneormoreotherpointsintheclusterthantoanypointnotinthecluster.

Eight contiguous clusters

Typesofclusters:Density-based

Density-based– Aclusterisadenseregionofpoints,whichisseparatedbylow-density

regions,fromotherregionsofhighdensity.– Usedwhentheclustersareirregularorintertwined,andwhennoiseand

outliersarepresent.

Six density-based clusters

Typesofclusters:Conceptualclusters

SharedPropertyorConceptualClusters– Findsclustersthatsharesomecommonpropertyorrepresentaparticular

concept.

Two overlapping circles

Typesofclusters:Objectivefunction

Clustersdefinedbyanobjectivefunction– Findsclustersthatminimizeormaximizeanobjectivefunction.– Enumerateallpossiblewaysofdividingthepointsintoclustersand

evaluatethe“goodness”ofeachpotentialsetofclustersbyusingthegivenobjectivefunction.(NPHard)

– Canhaveglobalorlocalobjectives.• Hierarchicalclusteringalgorithmstypicallyhavelocalobjectives.• Partitional algorithmstypicallyhaveglobalobjectives.

– Avariationoftheglobalobjectivefunctionapproachistofitthedatatoaparameterizedmodel.• Parametersforthemodelaredeterminedfromthedata.

• Mixturemodelsassumethatthedataisa‘mixture'ofanumberofstatisticaldistributions.

Clusteringrequirements

Thefundamentalrequirementforclusteringistheavailabilityofafunctiontodeterminethesimilarity ordistance betweenobjectsinthedatabase.

Theusermustbeabletoanswersomeofthefollowingquestions:

1. Whenshouldtwoobjectsbelongtothesamecluster?2. Howshouldtheclusterslooklike(i.e.,whattypeofobjects

shouldthecontain)?3. Whataretheobject-relatedcharacteristicsofgoodclusters?

Datacharacteristics&clustering

• Typeofproximityordensitymeasure– Centraltoclustering.– Dependsondataandapplication.

• Datacharacteristicsthataffectproximityand/ordensityare– Dimensionality

• Sparseness

– Attributetype– Specialrelationshipsinthedata

• Forexample,autocorrelation

– Distributionofthedata

• Noiseandoutliers– Ofteninterferewiththeoperationoftheclusteringalgorithm

BASICCLUSTERINGALGORITHMS

1. K-means2. Hierarchicalclustering3. Density-basedclustering

K-meansclustering

• Partitional clusteringapproach.• Numberofclusters,K,mustbespecified.• Eachclusterisassociatedwithacentroid(centerpoint/object).• Eachpointisassignedtotheclusterwiththeclosestcentroid.• Thebasicalgorithmisverysimple.

ExampleofK-meansclustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 6

K-meansclustering– Details

• Initialcentroidsareoftenchosenrandomly.– Clustersproducedvaryfromoneruntoanother.

• Thecentroidis(typically)themeanofthepointsinthecluster.• “Closeness”ismeasuredbyEuclideandistance,cosinesimilarity,correlation,etc.

• K-meanswillconvergeforcommonsimilaritymeasuresmentionedabove.

• Mostoftheconvergencehappensinthefirstfewiterations.– Oftenthestoppingconditionischangedto“Untilrelativelyfewpointschange

clusters”.

• ComplexityisO(n*K*I*d)– n=numberofpoints,K=numberofclusters,

I=numberofiterations,d=numberofattributes.

K-meansclustering– Objective

Let o1, . . . , on be the set of objects to be clustered, k be the number of desired

clusters, p be the clustering indicator vector such that pi is the cluster number

that the ith object belongs to, and ci be the centroid of the ith cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the

following optimization problem:

minimize

pf(p) =

||oi cpi ||22.

Function f() is the objective or clustering criterion function of K-means.

Let o1, . . . , on be the set of objects to be clustered, k be the number of desired

clusters, p be the clustering indicator vector such that pi is the cluster number

that the ith object belongs to, and ri be a vector associated with the ith cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the

following optimization problem:

minimize

p,r1,...,rkg(p, r1, . . . , rk) =

||oi rpi ||22.

Note that p and r1, . . . , rk are the variables of the optimization problem that

need to be estimated such that the value of g() is minimized.

The solution to

minimize

pf(p) =

||oi cpi ||22

is the same as the solution to

minimize

p,r1,...,rkg(p, r1, . . . , rk) =

||oi rpi ||22.

and 8i, ri = ci.

The solution to

minimize

pf(p) =

||oi cpi ||22

is the same as the solution to

minimize

p,r1,...,rkg(p, r1, . . . , rk) =

||oi rpi ||22.

and 8i, ri = ci.

Ther_i vectorscanbethoughtasbeingrepresentativesoftheobjectsthatareassignedtotheith cluster.Ther_i vectorsrepresentacompressedviewofthedata.

K-meansclustering– Objectiveminimize

Pni=1 ||oi cpi ||22

minimizep,r1,...,rk

Pni=1 ||oi rpi ||22

Thesearenon-convexoptimizationproblems.

• The𝐾-meansclusteringalgorithmisawayofsolvingtheoptimizationproblem.• Itusesaniterativealternateleastsquaresoptimizationstrategy.

a. Optimizeclusterassignments𝑝,given𝑟$ for𝑖 = 1,… , 𝑘.b. Optimize𝑟$ for𝑖 = 1,… , 𝑘,givenclusterassignments𝑝.

• Itguaranteesconvergencetoalocalminimasolution.However,duetothenon-convexityoftheproblem,itmaynotbetheglobalminimum.

• Run𝐾-meansmultipletimeswithdifferentinitialcentroidsandreturnthesolutionthathasthebestvalue.

TwodifferentK-meansclusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Sub-optimal Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Optimal Clustering

Original Points

LimitationsofK-means

• Def. problem:whentheclusteringsolutionthatyougetisnotthebest,natural,insightful,etc.

• K-meanshasproblems whenclustersareofdiffering– Sizes– Densities– Non-globularshapes

• K-meanshasproblems whenthedatacontainsoutliers.

LimitationsofK-means:Differingsizes

Original Points K-means (3 Clusters)

LimitationsofK-means:Differingdensity

LimitationsofK-means:Non-globularshapes

OvercomingK-meansLimitations

Original Points K-means Clusters

One solution is to use many clusters.Finds parts of clusters, and we may need to put them back together.

OvercomingK-meanslimitations

Original Points K-means Clusters

Importanceofchoosinginitialcentroids

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 6

Importanceofchoosinginitialcentroids…

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

yIteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 5

Solutionstoinitialcentroidsproblem• Multipleruns

– Helps,butprobabilityisnotonyourside.

• Sampleandusehierarchicalclusteringtodetermineinitialcentroids.

• Selectmorethan𝑘initialcentroidsandthenselectamongtheseinitialcentroids.– Selectmostwidelyseparated.

• Generatealargernumberofclustersandthenperformahierarchicalclustering.

• Bisecting𝐾-means– Notassusceptibletoinitializationissues.

Outliers

• A principled way of dealing with outliers is to do so directly during the

optimization process.

• Robust k-means algorithms as part of the optimization process in addition

to determining the clustering solution they also identify a set of outlier

objects that are not clustered by the algorithm.

• The non-clustered objects are treated as a penalty component of the objec-

tive function (in supervised learning, these penalty components are often

called regularizers) like

minimize

i : pi 6=1

||oi cpi ||22 +

i : pi=1

where is a user-specified parameter that controls the penalty associated

with not clustering an object, and q(i) is a cost function associated with

the ith object. A simple q() = 1 is such a cost function.

K-Meansandthe“Curseofdimensionality”

• Whendimensionalityincreases,databecomesincreasinglysparseinthespacethatitoccupies.

• Definitionsofdensityanddistancebetweenpoints,whichiscriticalforclusteringandoutlierdetection,becomelessmeaningful. • Randomly generate 500 points.

• Compute difference between max and min distance between any pair of points.

Asymmetricattributes

Ifwemetafriendinthegrocerystorewouldweeversaythefollowing?

“I see our purchases are very similar since we didn’t buy most of the same things.”

SphericalK-meansclustering

Let d1, . . . , dn be the unit length vectors of the set of objects to be clustered, kbe the number of desired clusters, p be the clustering indicator vector such that

pi is the cluster number that the ith object belongs to, and ci be the centroid

of the ith cluster.

The spherical K-means clustering algorithm solves the following optimization

problem:

maximize

cos(di, cpi).

SphericalK-means&Text

Inhigh-dimensionaldata,clustersexistinlower-dimensionalsub-spaces.

HIERARCHICALCLUSTERING

Hierarchicalclustering

• Producesasetofnestedclustersorganizedasahierarchicaltree.

• Canbevisualizedasadendrogram.– Atreelikediagramthatrecordsthesequencesofmergesorsplits.

1 3 2 5 4 60

Advantagesofhierarchicalclustering• Donothavetoassumeanyparticularnumberofclusters.– Anydesirednumberofclusterscanbe

obtainedby“cutting”thedendrogram attheproperlevel.

• Theymaycorrespondtomeaningfultaxonomies.– Exampleinbiologicalsciences(e.g.,animal

kingdom,phylogenyreconstruction,…).

1 3 2 5 4 60

Hierarchicalclustering• Twomainwaysofobtaininghierarchicalclusterings:

– Agglomerative:• Startwiththepointsasindividualclusters.• Ateachstep,mergetheclosestpairofclustersuntilonlyonecluster(orkclusters)left.

– Divisive:• Startwithone,all-inclusivecluster.• Ateachstep,splitaclusteruntileachclustercontainsapoint(ortherearekclusters).

• Traditionalhierarchicalalgorithmsuseasimilarityordistancematrix.– Mergeorsplitoneclusteratatime.

Agglomerativeclusteringalgorithm• Morepopularhierarchicalclusteringtechnique

• Basicalgorithmisstraightforward1. Computetheproximitymatrix.2. Leteachdatapointbeacluster.3. Repeat:4. Mergethetwoclosestclusters.5. Updatetheproximitymatrix.6. Until onlyasingleclusterremains(orkclustersremain).

• Keyoperationisthecomputationoftheproximityoftwoclusters.

– Differentapproachestodefiningthedistancebetweenclustersdistinguishthedifferentalgorithms

StartingsituationStartwithclustersofindividualpointsandaproximitymatrix

p1 p2 p3 p4 p5 . . .

. Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

IntermediatesituationAftersomemergingsteps,wehavesomeclusters

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

IntermediatesituationWewanttomergethetwoclosestclusters(C2andC5)andupdatetheproximitymatrix.

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

AftermergingHowdoweupdatetheproximitymatrix?

C2 U C5

C3 ? ? ? ?

C2 U C5C1

C2 U C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Defininginter-clusterproximity

p1 p2 p3 p4 p5 . . .

Proximity?

Minimum distance, maximum distance, average distance, distance between centroids, objective-driven selection, etc.

Proximity Matrix

p1 p2 p3 p4 p5 . . .

. Proximity Matrix

Usingminimumdistance.

p1 p2 p3 p4 p5 . . .

. Proximity Matrix

Usingmaximumdistance.

p1 p2 p3 p4 p5 . . .

. Proximity Matrix

Usingaveragedistance.

p1 p2 p3 p4 p5 . . .

. Proximity Matrix

Usingdistancebetweencentroids.

Strengthofminimumdistance

Can handle non-elliptical shapes.

Original Points Six Clusters

Limitationsofminimumdistance

Original Points

Two Clusters

Sensitive to noise and outliers.Three Clusters

Strengthofmaximumdistance

Less susceptible to noise and outliers.

Original Points Two Clusters

Limitationsofmaximumdistance

Tends to break large clusters.

Biased towards globular clusters.

Two ClustersOriginal Points

Groupaverage

• Compromisebetweensingleandcompletelink.

• Strengths:– Lesssusceptibletonoiseandoutliers.

• Limitations:– Biasedtowardsglobularclusters.

Hierarchicalclustering:Timeandspacerequirements• O(𝑁2) spacesinceitusestheproximitymatrix.

– 𝑁isthenumberofpoints.

• O(𝑁3) timeinmanycases– Thereare𝑁stepsandateachsteptheproximitymatrixmustbeupdatedandsearched(ontheaveragethereare𝑁2 onthatmatrix).

– ComplexitycanbereducedtoΟ(𝑁2log(𝑁))timewithsomecleverness.

Hierarchicalclustering:Problemsandlimitations• Onceadecisionismadetocombinetwoclusters,itcannotbeundone.

• Objectivefunctionisoptimizedonlylocally.

• Differentschemeshaveproblemswithoneormoreofthefollowing:– Sensitivitytonoiseandoutliers.– Difficultyhandlingdifferentsizedclustersandconvexshapes.

– Breakinglargeclusters.

DENSITY-BASEDCLUSTERING

DBSCAN

• DBSCANisadensity-basedalgorithm.– Thedensity isthenumberofpointswithinaspecifiedradius(Eps)

– Apointisacorepoint ifithasmorethanaspecifiednumberofpoints(MinPts)withinEps.

• Thesearepointsthatareattheinteriorofacluster.

– Aborderpoint hasfewerthanMinPts withinEps,butisintheneighborhoodofacorepoint.

– Anoisepoint isanypointthatisnotacorepointoraborderpoint.

DBSCAN:core,border,andnoisepoints

DBSCANalgorithm182 CHAPTER 6. CLUSTER ANALYSIS

Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ )begin

Determine core, border and noise points of D at level (Eps, τ);Create graph in which core points are connected

if they are within Eps of one another;Determine connected components in graph;Assign each border point to connected component

with which it is best connected;return points in each connected component as a cluster;

Figure 6.15: Basic DBSCAN algorithm

3. Noise point: A data point that is neither a core point nor a border point is defined asa noise point.

Examples of core points, border points, and noise points are illustrated in Fig. 6.16 forτ = 10. The data point A is a core point because it contains 10 data points within theillustrated radius Eps. On the other hand, data point B contains only 6 points within aradius of Eps, but it contains the core point A. Therefore, it is a border point. The datapoint C is a noise point because it contains only 4 points within a radius of Eps, and itdoes not contain any core point.

After the core, border, and noise points have been determined, the DBSCAN clusteringalgorithm proceeds as follows. First, a connectivity graph is constructed with respect to thecore points, in which each node corresponds to a core point, and an edge is added betweena pair of core points, if and only if they are within a distance of Eps from one another. Notethat the graph is constructed on the data points rather than on partitioned regions, as ingrid-based algorithms. All connected components of this graph are identified. These corre-spond to the clusters constructed on the core points. The border points are then assigned tothe cluster with which they have the highest level of connectivity. The resulting groups arereported as clusters and noise points are reported as outliers. The basic DBSCAN algorithmis illustrated in Fig. 6.15. It is noteworthy that the first step of graph-based clustering isidentical to a single-linkage agglomerative clustering algorithm with termination-criterionof Eps-distance, which is applied only to the core points. Therefore, the DBSCAN algorithmmay be viewed as an enhancement of single-linkage agglomerative clustering algorithms bytreating marginal (border) and noisy points specially. This special treatment can reduce theoutlier-sensitive chaining characteristics of single-linkage algorithms without losing the abil-ity to create clusters of arbitrary shape. For example, in the pathological case of Fig. 6.9(b),the bridge of noisy data points will not be used in the agglomerative process if Eps and τare selected appropriately. In such cases, DBSCAN will discover the correct clusters in spiteof the noise in the data.

Practical Issues

The DBSCAN approach is very similar to grid-based methods, except that it uses circularregions as building blocks. The use of circular regions generally provides a smoother contourto the discovered clusters. Nevertheless, at more detailed levels of granularity, the twomethods will tend to become similar. The strengths and weaknesses of DBSCAN are also

DBSCAN:core,borderandnoisepoints

Original Points Point types: core, border and noise

Eps = 10, MinPts = 4

DBSCANclustering

Clusters

DBSCANclustering

Clusters

Thesearealsoclusters.Theyareusuallyeliminatedbyputtingaminimumclustersizethreshold.

DBSCANclustering

Original Points Clusters

• Resistant to (some) noise.

• Can handle clusters of different shapes and sizes.

DBSCAN:Howmuchnoise?

WhenDBSCANdoesnotworkwell

Original Points(MinPts=4, Eps=9.75).

(MinPts=4, Eps=9.92)

• Varying densities

• High-dimensional data

DBSCAN:DeterminingEPSandMinPts

• Ideaisthatforpointsinacluster,theirkth nearestneighborsareroughlyatthesamedistance.

• Noisepointshavethekth nearestneighboratfartherdistance.• So,plotsorteddistanceofeverypointtoitskth nearestneighbor.

CLUSTERVALIDITY

Differentaspectsofclustervalidation

• Determiningthe clusteringtendency ofasetofdata:– Isthereanon-randomstructureinthedata?

• Comparingtheresultsofaclusteranalysistoexternallyknownresults.– Dotheclusterscontainobjectsofmostlyasingleclasslabel?

• Evaluatinghowwelltheresultsofaclusteranalysisfitthedatawithout referencetoexternalinformation.– Lookatvariousintra- andinter-clusterdata-derivedproperties.

• Comparingtheresultsoftwodifferentsetsofclusteranalysestodeterminewhichisbetter.

• Theevaluationcanbedonefortheentireclusteringsolutionorjustforselectedclusters.

Measuresofclustervalidity• Numericalmeasuresthatareappliedtojudgevariousaspectsofclustervalidity,areclassifiedintothefollowingthreetypes.– InternalIndex(II): Usedtomeasurethegoodnessofaclusteringstructurewithout respecttoexternalinformation.• SumofSquaredError(SSE)(oranyotheroftheobjectivefunctionsthatwediscussed).

– ExternalIndex(EI): Usedtomeasuretheextenttowhichclusterlabelsmatchexternallysuppliedclasslabels.• Entropy,purity,f-score,etc.

– RelativeIndex(RI): Usedtocomparetwodifferentclusteringsorclusters.• Oftenanexternalorinternalindexisusedforthisfunction,e.g.,SSEorentropy.

II:Measuringclustervalidityviacorrelation

• Twomatrices:– Proximity(distance)matrixofthedata(e.g.,pair-wisecosinesimilarity(Euclidean

distance)).– Idealproximitymatrixthatisimpliedbytheclusteringsolution.

• Onerowandonecolumnforeachdatapoint.• Anentryis1iftheassociatedpairofpointsbelongtothesamecluster.• Anentryis0iftheassociatedpairofpointsbelongstodifferentclusters.

• Computethecorrelationbetweenthetwomatrices.– i.e.,thecorrelationbetweenthevectorized matrices.– (makesurethattheorderingofthedatapointsisthesameinbothmatrices)

• High(low)correlationindicatesthatpointsthatbelongtothesameclusterareclosetoeachother.

• Notagoodmeasureforsomedensityorcontiguitybasedclusters.

II:Measuringclustervalidityviacorrelation

CorrelationofidealsimilarityandproximitymatricesfortheK-meansclusterings ofthefollowingtwodatasets.

0 0.2 0.4 0.6 0.8 10

Corr = -0.9235 Corr = -0.5810

II:Usingsimilaritymatrixforclustervalidation

Orderthesimilaritymatrixwithrespecttoclusterlabelsandinspectvisually.

0 0.2 0.4 0.6 0.8 10

Points

20 40 60 80 100

100Similarity

Clustersfoundinrandomdata

0 0.2 0.4 0.6 0.8 10

Random Points

0 0.2 0.4 0.6 0.8 10

K-means

0 0.2 0.4 0.6 0.8 10

DBSCAN

0 0.2 0.4 0.6 0.8 10

yComplete Link

Clustersinrandomdataarenotsocrisp.

Points

20 40 60 80 100

100Similarity

DBSCAN

0 0.2 0.4 0.6 0.8 10

Points

20 40 60 80 100

100Similarity

K-means

0 0.2 0.4 0.6 0.8 10

Points

20 40 60 80 100

100Similarity

Complete Link

DBSCAN

500 1000 1500 2000 2500 3000

II:Frameworkforclustervalidity

• Needaframeworktointerpretanymeasure.– Forexample,ifourmeasureofevaluationhasa valueof10,isthatgood,fair,or

• Statisticsprovideaframeworkforclustervalidity.– Themore“atypical”aclusteringresultis,themorelikelyitrepresentsvalid

structureinthedata.– Cancomparethevaluesofanindexthatresultfromrandomdataorclusterings to

thoseofaclusteringresult.• Ifthevalueoftheindexisunlikely,thentheclusterresultsarevalid.

– Theseapproachesaremorecomplicatedandhardertounderstand.

• Forcomparingtheresultsoftwodifferentsetsofclusteranalyses,aframeworkislessnecessary.– However,thereisthequestionofwhetherthedifferencebetweentwoindex

valuesissignificant.

II:StatisticalframeworkforSSE

Example– CompareSSEof0.005againstthreeclustersinrandomdata.– HistogramshowsSSEofthreeclustersin500setsofrandomdatapointsofsize100

distributedovertherange0.2– 0.8forxandyvalues.

0.016 0.018 0.02 0.022 0.024 0.026 0.028 0.03 0.032 0.0340

0 0.2 0.4 0.6 0.8 10

II:Statisticalframeworkforcorrelation

CorrelationofidealsimilarityandproximitymatricesfortheK-meansclusterings ofthefollowingtwodatasets.

0 0.2 0.4 0.6 0.8 10

Corr = -0.9235 Corr = -0.5810

“Thevalidationofclusteringstructuresisthemostdifficultandfrustratingpartofclusteranalysis.Withoutastrongeffortinthisdirection,clusteranalysiswillremainablackartaccessibleonlytothosetruebelieverswhohaveexperienceandgreatcourage.”

AlgorithmsforClusteringData,JainandDubes

Finalcommentonclustervalidity

Classification(Supervisedlearning)

BASICCONCEPTS

Classification:Definition

• Wearegivenacollectionofrecords(trainingset)– Eachrecordischaracterizedbyatuple(x,y),wherexisasetofattributesandyistheclasslabel• x:setofattributes,predictors,independentvariables,inputs.• y:class,response,dependentvariable,oroutput.

• Task:– Learnamodelthatmapseachsetofattributesxintooneofthepredefinedclasslabelsy.

Examplesofclassificationtasks

Task Attributeset,x Classlabel,y

Categorizingemailmessages

Featuresextractedfromemailmessageheaderandcontent

spamornon-spam

Identifyingtumorcells

FeaturesextractedfromMRIscans

malignantorbenigncells

Cataloginggalaxies

Featuresextractedfromtelescopeimages

Elliptical,spiral,orirregular-shapedgalaxies

Buildingandusingaclassificationmodel

Apply Model

Induction

Deduction

Learn Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes 10

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ? 10

Test Set

Learningalgorithm

Training Set

Classificationtechniques

• Baseclassifiers– Decisiontree-basedmethods.– Rule-basedmethods.– Nearest-neighbor.– Neuralnetworks.– NaïveBayesandBayesianbeliefnetworks.– Supportvectormachines.– …andothers.

• Ensembleclassifiers– Boosting,bagging,randomforests,etc.

DECISIONTREES

Wewillusethismethodtoillustratevariousconceptsandissuesassociatedwiththeclassificationtask.

Exampleofadecisiontree

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

HomeOwner

Income

Yes No

MarriedSingle,Divorced

<80K >80K

SplittingAttributes

TrainingData Model:Decisiontree

Exampleofdecisiontree

HomeOwner

Income

Yes No

MarriedSingle,

Divorced

<80K >80K

Therecouldbemorethanonetreethatfitsthesamedata!

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Decisiontreeclassificationtask

Apply Model

Induction

Deduction

Learn Model

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

Test Set

TreeInductionalgorithm

Training SetDecision Tree

Applymodeltotestdata

Income

Yes No

MarriedSingle,Divorced

<80K >80K

HomeOwner

MaritalStatus

AnnualIncome

DefaultedBorrower

No Married 80K ?10

TestData

AssignDefaultedto“No”

HomeOwner

Startfromtherootofthetree

Decisiontreeclassificationtask

Apply Model

Induction

Deduction

Learn Model

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

Test Set

TreeInductionalgorithm

Training Set

DecisionTree

Buildingthedecisiontree—Treeinduction

• Let𝐷" bethesetoftrainingrecordsthatreachanode𝑡.

• Generalprocedure:– If𝐷" containsrecordsthatbelong

thesameclass𝑦",then𝑡isaleafnodelabeledas𝑦".

– If𝐷" containsrecordsthatbelongtomorethanoneclass,useanattributetesttosplitthedataintosmallersubsets.• Recursivelyapplytheproceduretoeachsubset.

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Hunt’s'algorithm

(a) (b)

Defaulted = No

HomeOwner

Yes No

Defaulted = No Defaulted = No

Yes No

MaritalStatus

Single,Divorced Married

Yes No

MaritalStatus

AnnualIncome

<480K >=480K

HomeOwner

Defaulted = No

Defaulted = NoDefaulted = Yes

HomeOwner

Defaulted = No

Defaulted = Yes

(3,0) (4,3)

(1,3) (3,0)

(1,0) (0,3)

Buildingthedecisiontree:Example

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Hunt’s'algorithm

(a) (b)

Defaulted = No

HomeOwner

Yes No

MaritalStatus

Yes No

MaritalStatus

AnnualIncome

<480K >=480K

HomeOwner

Defaulted = No

HomeOwner

Defaulted = No

Defaulted = Yes

(3,0) (4,3)

(1,3) (3,0)

(1,0) (0,3)

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Hunt’s'algorithm

(a) (b)

Defaulted = No

HomeOwner

Yes No

MaritalStatus

Yes No

MaritalStatus

AnnualIncome

<480K >=480K

HomeOwner

Defaulted = No

HomeOwner

Defaulted = No

Defaulted = Yes

(3,0) (4,3)

(1,3) (3,0)

(1,0) (0,3)

Hunt’s'algorithm

(a) (b)

Defaulted = No

HomeOwner

Yes No

MaritalStatus

Yes No

MaritalStatus

AnnualIncome

<480K >=480K

HomeOwner

Defaulted = No

HomeOwner

Defaulted = No

Defaulted = Yes

(3,0) (4,3)

(1,3) (3,0)

(1,0) (0,3)

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Hunt’s'algorithm

(a) (b)

Defaulted = No

HomeOwner

Yes No

MaritalStatus

Yes No

MaritalStatus

AnnualIncome

<480K >=480K

HomeOwner

Defaulted = No

HomeOwner

Defaulted = No

Defaulted = Yes

(3,0) (4,3)

(1,3) (3,0)

(1,0) (0,3)

ID Home Owner

Marital Status

Annual Income

Defaulted Borrower

3 No Single 70K No

6 No Married 60K No

8 No Single 85K Yes

9 No Married 75K No

Designissuesofdecisiontreeinduction

• Howshouldthetrainingrecordsbesplit?– Methodforspecifyingtestcondition.

• Thisdependsontheattributetypes.– Methodforselectingwhichattributeandsplitconditiontochoose.• Needameasureforevaluatingthegoodnessofatestcondition.

• Whenshouldthesplittingprocedurestop?– Stopsplittingifalltherecordsbelongtothesameclassorhaveidenticalattributevalues.

– Earlytermination.

Methodsforexpressingtestconditions

• Dependsonattributetypes:– Binary– Nominal– Ordinal– Continuous

• Dependsonnumberofwaystosplit:– 2-waysplit– Multi-waysplit

Multi-waysplit:Useasmanypartitionsasdistinctvalues:

Binarysplit:Dividevaluesintotwosubsets:

Testconditionfornominalattributes

MaritalStatus

Single Divorced Married

Single Married,Divorced

MaritalStatus

Married Single,Divorced

MaritalStatus

Single,Married

MaritalStatus

Divorced

Testconditionforordinalattributes

ShirtSize

Medium Extra LargeSmall

• Multi-waysplit:– Useasmanypartitionsasdistinctvalues.

• Binarysplit:– Dividesvaluesintotwosubsets.

– Preserveorderpropertyamongattributevalues.

Medium, Large,Extra Large

ShirtSize

SmallLarge,Extra Large

ShirtSize

Small,Medium

Medium,Extra Large

ShirtSize

Small,Large

Thisgroupingviolatesorderproperty.

Testconditionforcontinuousattributes

AnnualIncome> 80K?

Yes No

AnnualIncome?

(i) Binary split (ii) Multi-way split

[10K,25K) [25K,50K) [50K,80K)

Howtodeterminethebestsplit?

Gender

C0: 6C1: 4

C0: 4C1: 6

C0: 1C1: 3

C0: 8C1: 0

C0: 1C1: 7

CarType

C0: 1C1: 0

C0: 0C1: 1

CustomerID

Yes No Family

Sports

Luxury c1 c10

C0: 0C1: 1

Beforesplitting:10recordsofclass0,and10recordsofclass1.

Whichtestconditionisthebest?

Howtodeterminethebestsplit?

• Greedyapproach:– Nodeswithpurer classdistributionarepreferred.

• Needameasureofnodepurity/impurity:

C0: 5C1: 5

C0: 9C1: 1

Highdegreeofimpurity Lowdegreeofimpurity

Measuresofnodeimpurity

• GiniIndex

• Entropy

• Misclassificationerror

tjptGINI 2)]|([1)(

tjptjptEntropy )|(log)|()(

)|(max1)( tiPtErrori

Two-classproblem

Findingthebestsplit

1. Computeimpuritymeasure(P)beforesplitting.2. Computeimpuritymeasure(M)aftersplitting.

• Computeimpuritymeasureofeachchildnode.• Misthesize-weightedimpurityofthechildren.

3. Choosetheattributetestconditionthatproducesthehighestgain:

orequivalently,lowestimpuritymeasureaftersplitting(M).

Gain = P – M,

Decisiontreebasedclassification

• Advantages:– Inexpensivetoconstruct.– Extremelyfastatclassifyingunknownrecords.– Easytointerpretforsmall-sizedtrees.– Robusttonoise(especiallywhenmethodstoavoidoverfittingareemployed).

– Caneasilyhandleredundantorirrelevantattributes(unlesstheattributesareinteracting).

• Disadvantages:– Spaceofpossibledecisiontreesisexponentiallylarge.Greedyapproachesareoftenunabletofindthebesttree.

– Doesnottakeintoaccountinteractionsbetweenattributes.– Eachdecisionboundaryinvolvesonlyasingleattribute.

OVERFITTING

Classificationerrors

• Trainingerrors(apparenterrors):– Errorscommittedonthetrainingset.

• Testerrors:– Errorscommittedonthetestset.

• Generalizationerrors:– Expectederrorofamodelinarandomlyselectedsubsetofrecordsfromthesamedistribution.

Exampledataset

Twoclassproblem:

+:5400instances

• 5000instancesgeneratedfromaGaussiancenteredat(10,10)

• 400noisyinstancesadded

o:5400instances• Generatedfromauniformdistribution

10%ofthedatausedfortrainingand90%ofthedatausedfortesting

Increasingnumberofnodesinthedecisiontree

Decisiontreewith4nodes

Decisiontree

Decisionboundariesontrainingdata

Decision Tree

Decision boundaries on training data

Decisiontreewith50nodes

Increasingnumberofnodesindecisiontrees

Decision Tree with 4 nodes

Decision Tree with 50 nodes

Which tree is better ?

Modeloverfitting

Underfitting:whenmodelistoosimple,bothtrainingandtesterrorsarelarge.

Overfitting:whenmodelistoocomplex,trainingerrorissmallbuttesterrorislarge.

Model overfitting

Usingtwicethenumberofdatainstances

• Iftrainingdataisunder-representative,testingerrorsincreaseandtrainingerrorsdecreaseonincreasingnumberofnodes.

• Increasingthesizeoftrainingdatareducesthedifferencebetweentrainingandtestingerrorsatagivennumberofnodes.

Reasonsformodeloverfitting

• Presenceofnoise.

• Lackofrepresentativesamples.

• Multiplecomparisonprocedure.

Effectofmultiplecomparisonprocedure

• Considerthetaskofpredictingwhetherstockmarketwillrise/fallinthenext10tradingdays.

• Randomguessing:P(correct)=0.5

• Make10randomguessesinarow:

0547.02

)8(# 10 =÷÷ø

öççè

æ+÷÷ø

öççè

æ+÷÷ø

öççè

=³correctP

Day1 UpDay2 DownDay3 DownDay4 UpDay5 DownDay6 DownDay7 UpDay8 UpDay9 UpDay10 Down

• Approach:– Get50analysts.– Eachanalystmakes10randomguesses.– Choosetheanalystthatmakesthemostnumberofcorrectpredictions.

• Probabilitythatatleastoneanalystmakesatleast8correctpredictions:

9399.0)0547.01(1)8(# 50 =--=³correctP

• Manyalgorithmsemploythefollowinggreedystrategy:– Initialmodel:𝑀.– Alternativemodel:𝑀' = 𝑀 ∪ 𝛾,where𝛾isacomponenttobeaddedtothemodel(e.g.,atestconditionofadecisiontree).

– Keep𝑀' ifimprovement,Δ 𝑀,𝑀' > 𝛼.

• Oftentimes,𝛾ischosenfromasetofalternativecomponents,Γ=best(𝛾1, 𝛾2, … , 𝛾4).

• Ifmanyalternativesareavailable,onemayinadvertentlyaddirrelevantcomponentstothemodel,resultinginmodeloverfitting.

Effectofmultiplecomparison:Example

Useadditional100noisyvariablesgeneratedfromauniformdistributionalongwith𝑋and𝑌asattributes.

Use30%ofthedatafortrainingand70%ofthedatafortesting.

Usingonly𝑋and𝑌asattributes

Notesonoverfitting

• Overfittingresultsindecisiontreesthataremorecomplex thannecessary.

• Trainingerrordoesnotprovideagoodestimateofhowwellthetreewillperformonpreviouslyunseenrecords.

• Needwaysforestimatinggeneralizationerrors.

Handlingoverfittingindecisiontrees

Pre-Pruning(earlystoppingrule):– Stopthealgorithmbeforeitbecomesafully-growntree.– Typicalstoppingconditionsforanode:

• Stopifallinstancesbelongtothesameclass.• Stopifalltheattributevaluesarethesame.

– Morerestrictiveconditions:• Stopifnumberofinstancesislessthansomeuser-specifiedthreshold.• Stopifclassdistributionofinstancesareindependentoftheavailablefeatures(e.g.,using𝜒2 test).

• Stopifexpandingthecurrentnodedoesnotimproveimpurity measures(e.g.,Giniorinformationgain).

• Stopifestimatedgeneralizationerrorfallsbelowcertainthreshold.

Handling overfitting in decision trees

Post-pruning:– Growdecisiontreetoitsentirety.– Subtree replacement:

• Trimthenodesofthedecisiontreeinabottom-upfashion.• Ifgeneralizationerrorimprovesaftertrimming,replacesub-treebyaleafnode.

• Classlabelofleafnodeisdeterminedfrommajorityclassofinstancesinthesub-tree.

– Subtree raising:• Replacesubtree withmostfrequentlyusedbranch.

Examplesofpost-pruning

Simplified Decision Tree:

SubtreeRaising

SubtreeReplacement

ENSEMBLEMETHODS

Ensemblemethods

• Constructasetofclassifiersfromthetrainingdata.

• Predictclasslabeloftestrecordsbycombiningthepredictionsmadebymultipleclassifiers.

Whyensemblemethodswork?

Supposethereare25baseclassifiers:– Eachclassifierhaserrorrate,e =0.35.– Assumeerrorsmadebyclassifiersare

uncorrelated.– Probabilitythattheensemble

classifiermakesawrongprediction:

P(X ≥13) = 25i

⎝⎜

⎠⎟ε i (1−ε)25−i = 0.06

Generalapproach

OriginalTraining data

....D1 D2 Dt-1 Dt

Step 1:Create Multiple

Data Sets

C1 C2 Ct -1 Ct

Step 2:Build Multiple

Classifiers

C*Step 3:

CombineClassifiers

Typesofensemblemethods

• Manipulatedatadistribution.– Resamplingmethod.

• Baggingandboosting.

• Manipulateinputfeatures.– Featuresubsetselection.

• Randomforest:Randomlyselectfeaturesubsetsandbuiltdecisiontrees.

• Manipulateclasslabels.– Randomlypartitiontheclassesintotwosubsets,treatthemas+veand–ve,andlearnabinaryclassifier.Dothatmanytimes.Atclassification,useallbinaryclassifiersandgivecreditstotheconstituentclasses.

• Byusingdifferentmodels.– E.g.,DifferentANNtopologies.

Bagging

• Samplingwithreplacement.

• Buildaclassifieroneachbootstrapsample.• Useamajorityvotingpredictionapproach:

– Predictanunlabeledinstanceusingallclassifiersandreturnthemostfrequentlypredictedclassastheprediction.

Original Data 1 2 3 4 5 6 7 8 9 10Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

Boosting

• Aniterativeproceduretoadaptivelychangethedistributionoftrainingdatabyfocusingmoreonpreviouslymisclassifiedrecords.– Initially,all𝑁recordsareassignedequalweights.– Unlikebagging,weightsmaychangeattheendofeachboostinground.

• Theweightscanbeusedtocreateaweighted-lossfunctionorbiastheselectionofthesample.

Boosting

• Recordsthatarewronglyclassifiedwillhavetheirweightsincreased.

• Recordsthatareclassifiedcorrectlywillhavetheirweightsdecreased.

Original Data 1 2 3 4 5 6 7 8 9 10Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

Example4ishardtoclassify.

Itsweightisincreased,thereforeitismorelikelytobechosenagaininsubsequentrounds.

ARTIFICIALNEURALNETWORKS

Considerthefollowing

X1 X2 X3 Y1 0 0 -11 0 1 11 1 0 11 1 1 10 0 1 -10 1 0 -10 1 1 10 0 0 -1

Black box

Output

Output Y is 1 if at least two of the three inputs are equal to 1.

Considerthefollowing

X1 X2 X3 Y1 0 0 -11 0 1 11 1 0 11 1 1 10 0 1 -10 1 0 -10 1 1 10 0 0 -1

Black box

0.3 t=0.4

Outputnode

Inputnodes

Y = sign(0.3X1 + 0.3X2 + 0.3X3 − 0.4)

where sign(x) = +1 if x ≥ 0−1 if x < 0

⎧⎨⎩

Perceptron

• Modelisanassemblyofinter-connectednodesandweightedlinks.

• Outputnodesumsupeachofitsinputvalueaccordingtotheweightsofitslinks.

• Compareoutputnodeagainstsomethresholdt.

Black box

Outputnode

Inputnodes

Perceptron Model

Y = sign( wiXii=1

∑ − t)

= sign( wiXi )i=0

Perceptron

• Singlelayernetwork– Containsonlyinputandoutputnodes.

• Activationfunction:

• Applyingmodelisstraightforward:

– X1 =1,X2 =0,X3 =1=>y=sign(0.2)=1

f (w, x) = sign( x,w )

Y = sign(0.3X1 + 0.3X2 + 0.3X3 − 0.4)

where sign(x) = +1 if x ≥ 0−1 if x < 0

⎧⎨⎩

Perceptronlearningrule

• Initializetheweights(w0,w1,…,wd)• Repeat

– Foreachtrainingexample(xi,yi)• Computef(w,xi)• Updatetheweights:

• Untilstoppingconditionismet.• Theaboveisanexampleofastochasticgradientdescentoptimizationmethod.

w(k+1) = w(k ) +λ yi − f (w(k ), xi )⎡⎣ ⎤⎦xi

• Weightupdateformula:

• Intuition:– Updateweightbasedonerror:– Ify=f(x,w),e=0:noupdateneeded.– Ify>f(x,w),e=2:weightmustbeincreasedsothatf(x,w)willincrease.

– Ify<f(x,w),e=-2:weightmustbedecreasedsothatf(x,w)willdecrease.

w(k+1) = w(k ) +λ yi − f (w(k ), xi )⎡⎣ ⎤⎦xi ; λ: learning rate

e = yi − f (w(k ), xi )⎡⎣ ⎤⎦

• Sincef(w,x)isalinearcombinationofinputvariables,decisionboundaryislinear.

• Fornonlinearlyseparableproblems,perceptronlearningalgorithmwillfailbecausenolinearhyperplane canseparatethedataperfectly.

Nonlinearlyseparabledata

x1 x2 y0 0 -11 0 10 1 11 1 -1

21 xxy Å=XOR Data

Multilayerartificialneuralnetworks(ANN)

Activationfunction

g(Si )Si Oi

Neuron iInput Output

threshold, t

InputLayer

HiddenLayer

OutputLayer

x1 x2 x3 x4 x5

Training ANN means learning the weights of the neurons

Artificialneuralnetworks

• Varioustypesofneuralnetworktopologies:– Single-layerednetwork(perceptron)versusmulti-layerednetwork.

– Feed-forwardversusrecurrentnetwork.

• Varioustypesofactivationfunctions(f):

Y = f ( wiXii∑ )

Artificialneuralnetworks

Multi-layerneuralnetworkcansolveanytypeofclassificationtaskinvolvingnonlineardecisionsurfaces.

InputLayer

HiddenLayer

OutputLayer

XOR Data

DesignissuesofANN

• Numberofnodesininputlayer:– Oneinputnodeperbinary/continuousattribute.– 𝑘orlog2 𝑘 nodesforeachcategoricalattributewith𝑘values.

• Numberofnodesinoutputlayer:– Oneoutputforbinaryclassproblem.– 𝑘or log2(𝑘) nodesfork-classproblem.

• Numberofnodesinhiddenlayer.• Initialweightsandbiases.

CharacteristicsofANN

• MultilayerANNareuniversalfunctionapproximatorsbutcouldsufferfromoverfittingifthenetworkistoolarge.

• Gradientdescentmayconvergetolocalminimum.• Modelbuildingcanbeverytimeconsuming,butapplyingthemodelcanbeveryfast.

• Canhandleredundantattributesbecauseweightsareautomaticallylearnt.

• Sensitivetonoiseintrainingdata.• Difficulttohandlemissingattributes.

RecentnoteworthydevelopmentsinANN

• Useindeeplearningandunsupervisedfeaturelearning.– Seektoautomaticallylearnagoodrepresentationoftheinputfromunlabeleddata.

• GoogleBrainproject:– Learnedtheconceptofa‘cat’bylookingatunlabeledpicturesfromYouTube.

– Onebillionconnectionnetwork.

Purpose-builtneuralnetworks

• Convolutionneuralnetworks– Deepnetworksthataredesigned

toextractsuccessivelymorecomplicatedfeaturesfrom1D,2D,and3Dsignals(i.e.,audio,images,video).

Purpose-builtneuralnetworks

• Networksthatarespecificallydesignedtomodelarbitrarylengthsequencesandnon-localdependencies:– Recurrentneuralnetworks– Bi-directionalrecurrentneuralnetworks– Longshort-termmemory

• Goodforlanguagemodelingandvariousbiologicalapplications.

SUPPORTVECTORMACHINES

Separating hyperplanes

Findalinearhyperplane (decisionboundary)thatseparatesthedata.

One possible solution.

Another possible solution.

introduction to data mining - university of minnesota to... · introduction to data mining 1....

Documents

statistical data mining€¦ · 3 data mining data...

data mining and privacy preserving in data mining

lecture 2: data mining 1. roadmap what is data mining? data...

unit - i data mining. unit - i introduction : fundamentals...

data and text mining -...

1 data mining chapter 26. 2 chapter 1. introduction...

massive data analytics data mining...

from data mining to knowledge mining: symbolic data

web mining – data mining im internet mining – data...

mining scientific data: past, present, and future · pdf...

data, responsibly fairness, neutrality and transparency in...

data mining: what is data mining?

data mining by jemini islam. data mining outline: what is...

data mining lecture 1: introduction to data mining

visual data mining: an overview what is visual data mining?...

data mining with - lagout mining/data mining with...

software tools for teaching … tools for teaching...

data warehousing mining & bi data streams mining dwmbi1

mining educational data using data mining … ·...

data mining: introduction. chapter 1. introduction...