s1 introduction to course
TRANSCRIPT
-
7/30/2019 S1 Introduction to Course
1/102
S1: Introduction to the Course
Shawndra Hill
Spring 2013
TR 1:30-3pm and 3-4:30
-
Data
The amount of data created by each person doubles every 1.5 to 2 years:
after five years: x 10
after ten years: x 100
after twenty years: x 10,000
A. Weigend
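The multipliers on this slide are compound doubling. A minimal sketch in Python 3 (the function name `growth_factor` is an illustrative choice, not from the lecture) suggests the round figures correspond to the 1.5-year end of the stated range:

```python
def growth_factor(years, doubling_time_years):
    """Multiplier after `years` if a quantity doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)


for years in (5, 10, 20):
    slow = growth_factor(years, 2.0)   # doubling every 2 years
    fast = growth_factor(years, 1.5)   # doubling every 1.5 years
    print(f"after {years:2d} years: x{slow:,.0f} to x{fast:,.0f}")
```

With a 2-year doubling time the growth is noticeably slower, so the x 10 / x 100 / x 10,000 rule of thumb is best read as an order-of-magnitude guide.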
-
1 billion connected flash
players
A. Weigend
-
40 billion RFID tags worldwide
A. Weigend
-
Time Scales
Biology: ~100k years
Technology: ~1 year
A. Weigend
-
A. Weigend
-
Social Data = Shared Data
10 billion pieces of content shared per month
-
Social Data = Shared Data
1 billion videos watched per day
-
Shopping?
Process of creating and refining product space awareness,
only occasionally punctuated by purchases
-
How do you know people's secret desires?
-
Instrument for feedback
-
Data Sources
- Situation
  - Location
  - Device
- Attention
  - Transactions
  - Clicks
- Intention
  - Search
A. Weigend
-
What is Data Mining?
Data Mining, Spring 2013
Shawndra Hill
-
What is Data Mining?
The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
-
What is Data Mining?
Definition (Fayyad et al.): The nontrivial discovery of novel, valid, comprehensible and potentially useful patterns from data.
What is a pattern? A relationship in the data. E.g.,
- On Thursday nights people who buy diapers also tend to buy beer
- People with good credit ratings are less likely to have accidents
- Male consumers, 37+, income bracket 50K-75K spend between $25-$50 per catalog order
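A pattern such as "people who buy diapers also tend to buy beer" is usually quantified with support and confidence. The sketch below is a minimal Python 3 illustration. The five baskets are invented for the example, and the helper names `support` and `confidence` are illustrative choices rather than anything from the lecture.

```python
# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support(diapers, beer) = {support({'diapers', 'beer'}, baskets):.2f}")
print(f"confidence(diapers -> beer) = {rule_confidence:.2f}")
```

In this toy data two of the three diaper baskets also contain beer, so the rule holds with confidence 2/3. Whether such a rule is "novel, valid and useful" still has to be judged against the base rate of beer purchases.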
-
Historical Differences Between Statistics and DM
Statistics                   Data Mining
Confirmative                 Explorative
Small data sets/File-based   Large data sets/Databases
Small number of variables    Large number of variables
Deductive                    Inductive
Numeric data                 Numeric and non-numeric (including text, networks)
Clean data                   Data cleaning
-
Data Mining vs. Statistics
Statistics is known for:
- well-defined hypotheses
- used to learn about a specifically chosen population
- studied using carefully collected data
- providing inferences with well-known properties.
Data mining isn't that careful. It is:
- data-driven discovery
- of models and patterns
- from massive and observational data sets
-
Data Mining v. Statistics
Traditional statistics:
- first hypothesize, then collect data, then analyze
- often model-oriented (strong parametric models)
Data mining:
- few if any a priori hypotheses
- data is usually already collected a priori
- analysis is typically data-driven, not hypothesis-driven
- often algorithm-oriented rather than model-oriented
Different? Yes, in terms of culture, motivation: however...
- statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
- increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960s)
-
Data Mining Enablers
- Explosion of data
- Fast and cheap computation and storage
  - Moore's Law: processing doubles every 18 months
  - Disk storage doubles every 9 months
  - Database technology
- Competitive pressure in business
  - Data has value!
- New, successful models
  - SVM, boosting
- Commercial products
  - SAS, SPSS, Insightful, IBM, Oracle
- Open Source products
  - Weka
  - R
[Chart: Disk TB Shipped per Year, 1988 to 2000, log scale from 1E+3 to 1E+7 TB with the ExaByte level marked. Disk TB growth: 112%/y. Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
-
Data-Driven Discovery
- Observational data cheap relative to experimental data
  - Examples:
    - Transaction data archives for retail stores, airlines, etc.
    - Web logs for Amazon, Google, etc.
    - The human/mouse/rat genome
    - Etc., etc.
- Makes sense to leverage available data
- Useful (?) information may be hidden in vast archives of data
What are the perils of observational data?
-
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Different fields have different views of what data mining is (also different terminology!)
-
Induction vs. Deduction
The Problem of Deduction: How to demonstrate that an abstract idea applies to nature?
The Problem of Induction: How to go beyond a collection of facts to new concepts?
-
Decision Support Systems (DSSs)
Assist managers in making decisions or choices
Types of DSSs:
- Model-driven: Spreadsheets and other optimization-based methods from Operations Management and Finance.
- Communication-driven: Groupware (e.g. voting/rating), Computer Supported Collaborative Work (CSCW), document sharing, teleconferencing
- Data-driven: Collect, store, and analyze large data volumes. A.k.a. Business Intelligence (BI) systems, Warehouses, OLAP
- Knowledge-driven: e.g. Expert systems that capture expertise by applying rules elicited from experts. Traditional uses: medical diagnosis (e.g. MYCIN), computer configuration (e.g. XCON), personalization. Knowledge elicitation and knowledge representation problems.
This course deals mainly with: data-driven DSSs (Part 1) and knowledge-driven DSSs (Part 2).
We will touch briefly on model-driven DSSs in Part 2 (but see OPIM 101 for more on that).
-
The Course
-
Course Objectives
- Approach business problems data-analytically. Think carefully & systematically about whether & how data can improve business performance.
- Be able to interact competently on the topic of data mining for business intelligence. Know the basics of data mining processes, algorithms, & systems well enough to interact with CTOs, expert data miners, and business analysts. Be able to envision data mining opportunities.
- Hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves, e.g., by performing pilot studies.
-
Our Goals
Understand the basics of the major Data Mining/Machine Learning techniques:
- What they do: problems they can solve
- Who uses them
- Where they are used
- When and how to use them
- How they work (at a high level only)
- Limitations
Apply techniques and evaluate the models built
-
Course Outline
Introduction to Modeling & Data Mining
- Fundamental concepts and terminology
Data Mining methods
- Classification, decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms etc.
- Inner workings
- Strengths and weaknesses
Evaluation
- How to evaluate the results of a data mining solution
Applications
- Real-world business problems DM can be applied to
-
Course Information
- Teaching style: Lecture/Lab/Guest Speakers (AT&T, IBM, Yahoo!)
- Student participation/attendance is important
- Lab sessions: Weka, Gephi, Python
- Textbook: Various publicly available readings
-
Course Tools
- SQL (Microsoft Access)
- Weka
- Gephi
- Python (Version 2.7)
Start installing now
-
Course Information
- Canvas
- WordPress class site: http://opim672.wordpress.com
- Facebook/Twitter
- Office hours: M 6-7pm, F 2-5pm, or by appointment
- Email: [email protected]
- TA: Krishna Choksi ([email protected]), Adrian Benton
-
Course Information
- Read material before and after class
- 8 homework assignments (35 points), groups of 2
- Data mining project (50 points), groups of 4-6, 10 groups per class
  - Final report
  - Mid-semester update
  - End-of-semester presentation
  - Project reviews
- Class participation (15 points)
- Data set competition (optional, for extra credit)
Warning:
1. This is a hands-on class
2. A significant portion of deliverables are at the end of the semester.
-
What is a DSS?
Decision Support Systems aim at allowing business users to make better decisions faster and take action more easily and more profitably based on this information.
This is achieved through:
- Prediction
- Description
- Data Dissemination
- Prescription
-
Prediction
Induction: From specific examples (instances) to general rules
Instances:
ID        Swims  Color  Type
Animal1   yes    gray   dolphin
Animal2   yes    black  dolphin
Animal3   no     gray   elephant
Rules:
IF swims = yes THEN class = dolphin
- IF part: Antecedent/Assumption (Rule Body)
- THEN part: Consequent/Conclusion (Rule Head)
Prediction = Determining the class or attribute-value for a new item with some known attributes.
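The step from instances to a rule, and from a rule to a prediction, can be sketched in a few lines of Python 3. This is a deliberately naive single-attribute rule learner written for this slide's three instances. It is not a specific algorithm from the course, and the names `induce_rules` and `predict` are illustrative.

```python
from collections import defaultdict

# The three instances from the slide.
instances = [
    {"id": "Animal1", "swims": "yes", "color": "gray", "type": "dolphin"},
    {"id": "Animal2", "swims": "yes", "color": "black", "type": "dolphin"},
    {"id": "Animal3", "swims": "no", "color": "gray", "type": "elephant"},
]


def induce_rules(instances, attribute, target="type"):
    """Return {attribute value: class} for values that always imply one class."""
    classes_by_value = defaultdict(set)
    for row in instances:
        classes_by_value[row[attribute]].add(row[target])
    return {
        value: next(iter(classes))
        for value, classes in classes_by_value.items()
        if len(classes) == 1  # keep only rules with no counterexample
    }


def predict(rules, attribute, item):
    """Apply 'IF attribute = value THEN class' to a new item; None if no rule fires."""
    return rules.get(item[attribute])


swim_rules = induce_rules(instances, "swims")
new_animal = {"swims": "yes", "color": "white"}
print(swim_rules)
print(predict(swim_rules, "swims", new_animal))
```

On this data `swims` yields the slide's rule (IF swims = yes THEN class = dolphin), while `color` only yields a rule for black, because gray animals appear in both classes.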
-
Text Mining
-
Prediction: Examples from Industry?
Classifying dolphins and flowers is dull (toy problems often cited in the data mining literature).
Questions:
- How do we use data mining/machine learning to generate revenues or reduce costs?
- How do we monetize DM?!
-
Examples
Mining Medical Discussion Board Data
Mining Motley Fool Caps
Social Network Based Marketing
Social Network Based Fraud Detection
Social TV Examples
Profit Maximizing Recommendation Engine
-
Prediction: Examples from Industry?
- Wachovia: Can I predict if someone will default on their loan?
- Visa: Can I identify fraudulent credit card transactions?
- Linens 'n Things
- Monster.com
- The World Bank
-
Prediction: Examples from Industry?
- Wachovia: Can I predict if someone will default on their loan?
- Visa: Can I identify fraudulent credit card transactions?
- Linens 'n Things: Predict response to recommendation online?
- Monster.com: Predict if stock value of company will go up based on employee attrition?
- The World Bank: Predict if country/organization will default?
-
Prediction: Examples from Industry?
- ACNielsen
- PepsiCo
-
Prediction: Examples from Industry?
- ACNielsen: Association rules for market baskets?
- PepsiCo: Identify business opportunities?
-
Data Mining as a Core Competency
-
Examples of Data Mining Successes
Google is a company built on data mining
- PageRank mined the web to build better search
- Google as spell checker
- Google as ad placer
- Google as news aggregator
- Google as face recognizer
-
Data Mining as a Core Competency
-
Data Mining as a Core Competency
-
Data Data Data
It's all about the data: where does it come from?
- www
- NASA
- Business processes/transactions
- Telecommunications and networking
- Medical imagery
- Government, census, demographics (data.gov!)
- Sensor networks, RFID tags
- Sports
-
Types of Data: Flat File or Vector Data
- Rows = objects
- Columns = measurements on objects
  - Represent each row as a p-dimensional vector, where p is the dimensionality
- In effect, embed our objects in a p-dimensional vector space
- Often useful, but not always appropriate
- Both n and p can be very large in data mining
- Matrix can be quite sparse
[Figure: n x p data matrix with example rows (2.3, -1.5, -1.3) and (1.1, 0.1, -0.1)]
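To make the n x p picture concrete, here is a minimal Python 3 sketch using only the standard library. The first two rows are the values visible on the slide. The third row and the dictionary-of-keys storage are assumptions added to illustrate why sparse matrices are stored differently from dense ones.

```python
# n objects (rows), each a p-dimensional vector of measurements (columns).
dense = [
    [2.3, -1.5, -1.3],   # from the slide
    [1.1, 0.1, -0.1],    # from the slide
    [0.0, 0.0, 4.0],     # hypothetical mostly-zero row
]
n, p = len(dense), len(dense[0])


def to_sparse(rows):
    """Keep only non-zero cells as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }


sparse = to_sparse(dense)
density = len(sparse) / (n * p)
print(f"n={n}, p={p}, stored cells={len(sparse)}, density={density:.2f}")
```

For real text or transaction data, where most cells are zero, the same idea is what dedicated sparse-matrix libraries implement far more efficiently.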
-
Text
Data Mining, Spring 2006
-
Types of Data: Sparse Matrix (Text) Data
[Figure: sparse matrix with Text Documents as rows and Word IDs as columns]
-
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
Sequence (Web) Data
Sometimes another representation is more useful, e.g. one symbol sequence per user (the figure shows User 1 through User 5):
5115
1111115151115177777777
1113333333131113332232
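Going from raw log records to per-user sequences can be sketched as follows (Python 3, standard library only). Several things here are assumptions made for illustration: the first field is taken as the client address, the fourteenth field as the requested page, the records are treated as already being in time order, and each client address is treated as one user. The page numbering is arbitrary and does not reproduce the codes in the figure.

```python
import csv
from collections import defaultdict
from io import StringIO

# Six records copied from the log above (two clients, three requests each).
LOG = """\
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
"""

CLIENT_IP, PATH = 0, 13  # assumed field positions


def page_sequences(log_text):
    """Return ({client: [page id, ...]}, {page: id}) in order of appearance."""
    page_ids = {}
    sequences = defaultdict(list)
    for fields in csv.reader(StringIO(log_text), skipinitialspace=True):
        if len(fields) <= PATH:
            continue  # skip blank or malformed lines
        page_id = page_ids.setdefault(fields[PATH], len(page_ids) + 1)
        sequences[fields[CLIENT_IP]].append(page_id)
    return dict(sequences), page_ids


sequences, page_ids = page_sequences(LOG)
for client, pages in sequences.items():
    print(client, "".join(str(page) for page in pages))
```

Each user is now a short string of page codes, which is the kind of input sequence-mining and clustering methods expect.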
-
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
Types of Data: Relational Data
128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
114.12.12.25,Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911
07911, Chester, NJ, 07954, 34000, , 40.65, -74.12
07932, Madison, NJ, 56000, 40.642, -74.132
Most large data sets are stored in relational databases
- Oracle, MSFT, IBM
- Good open source versions: MySQL, PostgreSQL
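The point of relational data is that the log, the customer table, and the zip-code table are linked by shared keys. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are hypothetical (the slide shows rows but no schema), only some of the slide's columns are loaded, and the latitude/longitude reading of the last two zip-table values is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits (client_ip TEXT, path TEXT);
    CREATE TABLE customers (client_ip TEXT, last_name TEXT, first_name TEXT,
                            city TEXT, state TEXT, zip TEXT);
    CREATE TABLE zips (zip TEXT, city TEXT, state TEXT, lat REAL, lon REAL);
""")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [
    ("128.195.36.195", "/top.html"),
    ("128.195.36.195", "/spt/main.html"),
    ("128.195.36.195", "/spt/images/bk1.jpg"),
])
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?)", [
    ("128.195.36.195", "Doe", "John", "Madison", "NJ", "07932"),
    ("114.12.12.25", "Trank", "Jill", "Chester", "NJ", "07911"),
])
conn.executemany("INSERT INTO zips VALUES (?, ?, ?, ?, ?)", [
    ("07932", "Madison", "NJ", 40.642, -74.132),
    ("07911", "Chester", "NJ", 40.65, -74.12),
])

# Join web hits to customers (by IP) and customers to zip codes (by zip).
rows = conn.execute("""
    SELECT c.first_name, c.last_name, z.city, COUNT(*) AS n_hits
    FROM hits AS h
    JOIN customers AS c ON c.client_ip = h.client_ip
    JOIN zips AS z ON z.zip = c.zip
    GROUP BY c.first_name, c.last_name, z.city
""").fetchall()
print(rows)
conn.close()
```

Only the customer whose address appears in the log comes back from the inner join. Flattening joins like this into one row per customer is a typical first step before applying the vector-based methods from the earlier slide.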
-
Types of Data: Time Series Data
[Figure: time series plot, x-axis from 0 to 30, y-axis from 40 to 160]
Often many time series, long time series, or multivariate time series
-
Types of Data: Image Data
-
Spatio-Temporal Data
http://senseable.mit.edu/nyte/movies/nyteglobeencounters.movencounters.mov
-
Network Data
Algorithms for estimating relative importance in networks, S. White and P. Smyth, ACM SIGKDD, 2003.
-
HP Labs email network
-
Data Mining - Columbia University
HP Labs email network
500 people, 20k relationships
Also, temporal networks
-
Major Application Areas
- Marketing
  - Customer loyalty/attrition
  - Market basket analysis: On Thursdays shoppers who buy diapers also buy beer
  - Direct marketing
  - Personalization
  - Market segmentation
- Fraud Detection (Telecommunication, Credit, Securities)
- Credit risk
- Health Care
- Insurance
  - People with good credit ratings have fewer accidents
- Text mining: email, documents, and Web analysis.
- Stock Selection
- Non-business applications: military, bioinformatics, etc.
-
Examples of Data Mining Successes
- Market Basket (WalMart)
- Recommender Systems (Amazon.com)
- Fraud Detection in Telecommunications (AT&T)
- Target Marketing/CRM
- Financial Markets
- DNA Microarray analysis
- Biometrics (fingerprinting, handwriting)
- Web Traffic/Blog analysis
-
Why Data Mining Now?
[Diagram: three enablers converging on DM]
- Better and cheaper computing power
- Mature data mining technology
- Improved data collection, access & storage
-
Accuracy is king
Only 15% of mergers and acquisitions succeed
Stephen Denning, The Leader's Guide to Storytelling, pg xiv
-
Profit is King
(or It pays to be wrong sometimes...)
Failure rate of new ventures invested in: 8 out of 10
Profit on Google investment: $4 billion (on $25 million)
Source: http://www.financialnews-us.com/?contentid=534017
-
Sometimes... it pays to be wrong almost all the time
Customer Lifetime Value: $2,700
Cost per flyer: 7 cents
Required hit rate = 7 / 270,000 = 1 in 38,571
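The break-even arithmetic is cost per contact divided by value per success, with both in the same unit. A short Python 3 check follows. The function name is illustrative, and the calculation ignores everything except the flyer cost and lifetime value shown on the slide.

```python
from fractions import Fraction


def break_even_hit_rate(cost_per_flyer_cents, lifetime_value_dollars):
    """Smallest response rate at which the mailing pays for itself."""
    return Fraction(cost_per_flyer_cents, lifetime_value_dollars * 100)


rate = break_even_hit_rate(7, 2700)   # 7 cents vs. $2,700 = 270,000 cents
flyers_per_customer = 1 / rate        # about 38,571 flyers per customer won
print(f"required hit rate = {rate} = 1 in {round(flyers_per_customer):,}")
```

A model that is wrong on more than 99.99% of its predictions can therefore still be profitable, which is the point of the slide title.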
Case: Verizon Wireless
-
Case: Verizon Wireless (Plain Vanilla DM)
About Verizon Wireless
- Largest wireless provider in US
- Customer base: 30.3 million
- Covering 90% of US population
Challenges
- High customer turnover rate (churn) of 2% per month (600,000 customers disconnect per month)
- Associated replacement cost in hundreds of millions per year
- Average cost of new customer acquisition: $320
-
Possible solutions
Offer incentives to every customer before contracts expire
- expensive
- no learning
Case: Verizon Wireless
-
Data Mining Solution: Prediction
Build a predictive model: Before contracts expire, use a predictive model to predict which customers are likely to leave (i.e., estimating the probability)
Then:
- Offer benefits such as a new phone only to customers most likely to disconnect
- Develop new plans to fit customer needs
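Once a model outputs a churn probability per customer, "offer benefits only to customers most likely to disconnect" becomes a thresholding decision. The Python 3 sketch below is one possible rule, not the one Verizon used. The probabilities and the $100 offer cost are invented. The $320 figure is the acquisition cost quoted earlier in the case. The rule also assumes, optimistically, that an offer always retains the customer.

```python
REPLACEMENT_COST = 320   # average cost of acquiring a new customer (from the case)
OFFER_COST = 100         # hypothetical cost of the retention offer

# Hypothetical model output: customer id -> estimated probability of churning.
churn_probability = {"A": 0.05, "B": 0.40, "C": 0.72, "D": 0.15}


def should_offer(p_churn):
    """Offer when the expected replacement cost avoided exceeds the offer cost."""
    return p_churn * REPLACEMENT_COST > OFFER_COST


targets = sorted(
    (customer for customer, p in churn_probability.items() if should_offer(p)),
    key=lambda customer: churn_probability[customer],
    reverse=True,  # highest-risk customers first
)
print(targets)
```

Here only customers B and C clear the break-even threshold of 100/320, roughly a 31% churn probability. Changing the assumed offer cost moves that threshold directly.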
-
Phases in the DM Process: CRISP-DM
[Diagram: CRISP-DM cycle]
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
www.crisp-dm.org
-
CRoss-Industry Standard Process for DM
- Business Understanding: Understanding project objectives and data mining problem identification
- Data Understanding: Capture, understand, explore your data for quality issues
- Data Preparation: Data cleaning, merge data, derive attributes etc.
- Modeling: Select the data mining techniques, build the model
- Evaluation: Evaluate the results and approve models
- Deployment: Put models into practice, monitoring and maintenance
Case: Verizon Wireless
-
Understanding The Business Problem and Data
- IT brought idea to Marketing team and presented it as partnership
- Marketing learned the modeling process as well as capabilities and weaknesses of modeling
- IT learned the business processes and direct marketing strategies
- Marketing recommended additions to attributes to use in building model
Case: Verizon Wireless
-
Modeling
- Data Selection/Preparation
  - Included hundreds of basic attributes
  - Derived and ratio fields added to enrich the model
- Use predictive modeling technique to refine relationship between predictors and output of interest
- Test model: how will it perform in real life
- Select the best models (accuracy, profitability, etc.)
Case: Verizon Wireless
-
Results: Marketing Campaigns using Predictive Modeling
- Began with one campaign
  - 40-60K pieces per month
  - Very personalized unique offer
  - Approximately 15% take rate
- Currently four main campaign types
  - 400,000 pieces/month
  - Up to 35% take rate of high churn risk customers
Case: Verizon Wireless
-
Deployment
- Direct Mail and Telemarketing
  - Customized one-to-one mailings
- Customer Care Application
  - Customer flagged by offer
  - Used by: Customer Service, Retail Channels
  - To catch customers that:
    - reps were unable to contact
    - call to disconnect
Case: Verizon Wireless
-
Benefits
- Cost Reduction
  - Customers saved up to 80% more takes
  - Direct Mail budget for same churner mailing reduced by 60%
  - Switched customers from analog to digital
  - Contract renewals increased
- Revenue Increase
  - Average monthly revenue increase per bill
  - Monthly usage increased
-
Descriptive vs. Predictive Data Mining
Descriptive DM is used to learn about and understand the data.
Example:
Identify and describe groups of customers with common buying behavior (Clustering)
-
Example for Descriptive (Visualization) DM
Using Customer Data
Find groups of customers with similar buying patterns
-
Descriptive vs. Predictive Data Mining
Predictive DM: Aims to build models in order to predict unknown values of interest.
Examples:
- A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order.
- A model that classifies credit applicants to determine whether or not an applicant will default on a loan.
Most predictive models are also descriptive.
Amount spent on catalog purchase = 0.001*(Annual_Income) + 0.3*(Num_Cards) + (1/Num_Orders)
Example customer:
35 years
Professional, 95K annual income
2 children
2 credit cards
3 orders last year
Last purchase: 8 months ago
Average spending $30
Last purchase: $40
Next Order: $40-$50
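The linear formula on this slide can be evaluated directly. A minimal Python 3 sketch follows. The function name is illustrative, and annual income is assumed to be in dollars, so 95K is entered as 95,000.

```python
def predicted_spend(annual_income, num_cards, num_orders):
    """Evaluate the slide's formula with its coefficients exactly as printed."""
    return 0.001 * annual_income + 0.3 * num_cards + 1 / num_orders


# Example customer from the slide: 95K income, 2 credit cards, 3 orders last year.
spend = predicted_spend(95_000, 2, 3)
print(f"predicted spend: ${spend:.2f}")
```

Under that reading the formula gives roughly $96 rather than the $40-$50 shown for the next order, so the coefficients are probably best taken as a schematic of what a fitted model looks like. The "also descriptive" point is that each coefficient can be read as the effect of one attribute on spending.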
-
What Data Mining Can and Cannot Do
Not a magic wand
- No automatic solutions - Data mining offers a set of tools and methodologies. Need to know how to utilize them.
- Like any other powerful tool, can be very dangerous if not used properly.
- Team work: Cannot (always) replace skilled business analysts - needs guidance and validation of output
-
What Can Go Wrong
- Problem formulation
  - Need to understand the business well, good formulation of problem
- Inappropriate use of methods
- (And/Or) Lack of sufficient/high quality data
- Computational issues
- Evaluation
  - Need domain experts throughout the process to provide indispensable input and validate results
-
What Can Go Wrong?
- Inability to act upon pattern because of political or ethical reasons
  - Securities trading models
  - Data mining in clinical evaluation
  - Privacy (Insurance & credit, DoubleClick Inc.)
  - Admission interviews
-
Data Mining v. Privacy
There is often tension between data mining and personal privacy:
http://www.aclu.org/pizza/images/screen.swf
-
Risk v. Reward in Data Mining
More data about more people in fewer places
-
The risks of research
My own personal story:
or
how a paper published in JCGS leads me to be connected to FBI wiretapping.
2006: (chris v) Published papers on "Communities of Interest" using social networks and "Guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses FBI techniques COI and GBA
23 October 2007: Blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"
-
The risks of research
Another story:
-
-
-
-
Wikileaks Visualizations
-
-
The Good, The Bad, and the Maybe
The question remains: how do we effectively leverage sensitive personal data for research purposes?
Three case studies can give insight:
- Netflix Prize
- AOL search dataset
- Barabasi mobile study
-
Case Study 1: AOL Search Data
- August 4, 2006: AOL releases 20M search terms by anonymized users for research purposes. Why?
- Within hours, uproar on the blogs
  - "The utter stupidity of this is staggering" - TechCrunch
- August 7: AOL removes data, issues apology
  - "this was a screw-up, and we are angry"
  - "an innocent enough attempt to reach out to the research community"
- August 9: NYT front page story
  - Identifies Thelma Arnold, 62 year old widow
-
Case Study 1: AOL Search Data
What's the big deal?
- Ego searches make it easy to figure out who you are; combined with porn or illegal queries, this can make for serious privacy violations.
What went wrong
- Not well thought out: risk >> reward
- Poor internal controls on public data release
- Lack of understanding of subject matter
- Lack of understanding of anonymizing data
Fallout
- CTO + at least two others fired
- Data still out in the public
  - Is it ethical to study?
- Inspiration for bad drama
  - "purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."
-
Case Study 2: Netflix Prize
October 2006: Netflix releases anonymized movie ratings from its customer base
100M ratings, 500K customers (
-
Case Study 2: Netflix Prize
Narayanan and Shmatikov (2008)
- "The adversary with a small amount of background knowledge about an individual can identify with high probability that individual's record in the data and learn sensitive attributes"
- Claim that Netflix data sanitization not relevant
- Accuse Netflix of violating Video Privacy Protection Act of 1988
- Details:
  - With aux info on 8 movies, where 2 can be wrong, and dates are known within 14 days: 99% de-anonymization
  - Aux info can be gotten via web sites, water coolers, etc.
  - People might be willing to give away some ratings, but not others
-
Case Study 2: Netflix Prize
Much ado about nothing
- Although paper is technically correct, dates are key
- Without dates, you must know 8 movies, all outside of the top 500, to get over 80% chance of de-anonymization
- Auxiliary data very hard to come by
- No known cases discovered
Netflix did it right
- Consulted with top machine learning experts
- 0 < risk
-
Case Study 3: Barabasi Mobile Study
Gonzalez, Hidalgo and Barabasi (2008)
- Article in Nature outlines study on human mobility patterns
- 100,000 individuals selected randomly from dataset of 6 million
- Unidentified country (unclear if the researchers knew)
- Cell tower location at start of call
- 206 individuals were pinged every two hours for a week
Findings: humans follow simple, reproducible patterns
- Sample finding: "Nearly three-quarters of those studied mainly stayed within a 20-mile-wide circle for half a year."
- Results could impact "all phenomena driven by human mobility, from epidemic prevention to emergency response and urban planning."
-
Case Study 3: Barabasi Mobile Study
Uproar ensued over "secret tracking" of cell phone users
- Blowback of negative feedback to Nature and scientists
- Study would be illegal in the US
- Approval from ONR review board and Northeastern review board.
- Barabasi did not check with an ethics panel
Response
- Hidalgo: "the data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
- Northeastern and Nature backed the research
- Continues to be referenced as an example of dangerous research
- Risk and reward both very high
-
Research Concepts: Privacy
How do we guarantee that data is private?
- Quasi-identifiers: combinations of attributes within the data that can be used to identify individuals.
  - E.g. 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code
- Data sets are k-anonymous when for any given quasi-identifier, a record is indistinguishable from k-1 others.
- But, one step further, maybe all k have a given sensitive attribute! The distribution of target values within a group is referred to as l-diversity.
- Ways to fuzz data to increase anonymity and diversity:
  - Generalize/summarize the data: bin size, aggregate counts
  - Suppress or delete data
  - Perturb data
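Both properties can be checked mechanically by grouping records on the quasi-identifier. The Python 3 sketch below uses four invented, already-generalized records. The function names are illustrative, and l-diversity is measured in its simplest form, as the number of distinct sensitive values per group.

```python
from collections import defaultdict

# Hypothetical records, already generalized (decade of birth, truncated zip).
records = [
    {"gender": "F", "born": "1970s", "zip": "191**", "diagnosis": "flu"},
    {"gender": "F", "born": "1970s", "zip": "191**", "diagnosis": "asthma"},
    {"gender": "M", "born": "1980s", "zip": "191**", "diagnosis": "flu"},
    {"gender": "M", "born": "1980s", "zip": "191**", "diagnosis": "flu"},
]
QUASI_IDENTIFIER = ("gender", "born", "zip")


def group_by_quasi_identifier(records, quasi_identifier):
    """Bucket records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[a] for a in quasi_identifier)].append(record)
    return groups


def k_anonymity(records, quasi_identifier):
    """Size of the smallest group: each record hides among k-1 others."""
    groups = group_by_quasi_identifier(records, quasi_identifier)
    return min(len(group) for group in groups.values())


def l_diversity(records, quasi_identifier, sensitive):
    """Fewest distinct sensitive values found in any group."""
    groups = group_by_quasi_identifier(records, quasi_identifier)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


k = k_anonymity(records, QUASI_IDENTIFIER)
l = l_diversity(records, QUASI_IDENTIFIER, "diagnosis")
print(f"k = {k}, l = {l}")
```

This toy table is 2-anonymous but only 1-diverse: anyone known to be in the male group is revealed to have the flu, which is exactly the "maybe all k have a given sensitive attribute" problem.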
-
Data Mining Software
- Can use any software you like:
  - Preferred: Weka
  - Also: R, SAS, SPSS, Systat, Enterprise Miner, Matlab, SQL Server
  - Maybe: Excel
- What is R?
  - Open source statistical software grown out of S/S-plus
  - www.r-project.org
  - Packages at CRAN
- R tutorials available online (see website and CRAN)
- Great graphics (with a bit of a learning curve)
-
Resources
- Data mining is a new field and as such, does not have authoritative texts (yet).
- This class draws from many sources, best are:
  - Data Mining Techniques: For Marketing, Sales, and Customer Support, by Michael J.A. Berry, Gordon Linoff, published by John Wiley & Sons, Inc.
  - Elements of Statistical Learning, Hastie, Tibshirani, and Friedman
  - Handbook of Data Mining, Hand, Mannila and Smyth
  - Interactive and Dynamic Graphics for Data Analysis, Cook and Swayne
- Also good class notes available from other classes:
  - David Madigan, Columbia
  - Di Cook, Iowa State
  - Padhraic Smyth, UC Irvine
  - Jiawei Han, Simon Fraser
  (see class website for pointers to these notes, or just Google them!)
-
Assignment 1
- By Monday (01/16/2013) midnight on Canvas
- Confirm access to Canvas!
- Required readings
- Profiles will be posted on Canvas to facilitate group selection ASAP
- Generate 3 potential classification (prediction) problems/ideas as part of Assignment 1 (Start exploring publicly available data sets; projects from last year are available)
-
Projects From Prior Years
-
Sources:
Andreas Weigend, Chris Volinsky
-
S1: Introduction to the Course
Shawndra Hill
Spring 2013
TR 1:30-3pm and 3-4:30