Uncovering Novel Opportunities, Arizona State University Data Mining and Machine Learning Lab, DFC 2016, Nov 13
The Good, the Bad and the Ugly - Uncovering Novel Opportunities of Data Science
Huan Liu
http://dmml.asu.edu/smm/
Big Data Challenges Traditional Thinking
• Data is ubiquitous and can only become bigger
• Big data is not just big
  – Transforming how we live, work, and think
• Big data makes many tasks easier and better
• An example of big mobile data
  – Using GPS to guide our travel today vs. not so long ago
• Opportunities are where challenges are
Traditional Media and Data
Broadcast Media: One-to-Many
Communication Media: One-to-One
Traditional Data
Some Challenges in Understanding Social Media
• Noise-Removal Fallacy
  – Can we remove noise without losing much information?
• Studying Distrust (the Implicit) in Social Media
  – Where to find the invisible distrust?
• Big-Data Paradox
  – Lack of data with big social media data
• Evaluation Dilemma
  – Where is ground truth? How to evaluate without it?
• Data Sampling Bias and Its Mitigation
  – Often we get a small sample of (still big) data. Would that data suffice to obtain credible findings?
The Good, the Bad, and the Ugly of Social Media Data
• The good
  – Social media data is big and linked
• The bad
  – Social media data is noisy and short of data where it is most needed
• The ugly
  – Social media data is heterogeneous, partial, and asymmetrical
Two Illustrative Cases for Novel Challenges: (1) Removing noise, and (2) Inferring the implicit
Removing Noise – a First Task in Data Mining
• We often hear that: “99% Twitter data is useless.”
  – “Had eggs, sunny-side-up, this morning”
  – Can we remove noise as we usually do in DM?
• What is left after noise removal?
  – Twitter data can be rendered useless after conventional noise removal
• As we are certain there is noise in data, should we remove it?
  – If yes, how?
• A new challenge: Feature selection with linked data
Social Data and Feature Selection
• High-dimensional social media data poses unique challenges to data mining tasks
• Feature selection has been widely used to prepare large-scale, high-dimensional data for effective data mining
• Traditional feature selection algorithms deal with only “flat” data (attribute-value data).
• We now can take advantage of linked data for feature selection
Representation for Social Media Data
Social Context
New Problem Statement of Feature Selection
• Given labeled data X and its label indicator matrix Y, the dataset F, its social context including user-user following relationships S and user-post relationships P,
• Select k most relevant features from m features for dataset F with its social context S and P
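One way to make the problem statement concrete is to score features with a supervised model whose objective also rewards agreement between linked posts. The sketch below is only an illustration under stated assumptions: it swaps in a ridge-regularized least-squares model with a graph-smoothness penalty (not necessarily the formulation used in the talk), it assumes the rows of X are posts, and the function name `rank_features_with_links`, the post-post affinity matrix `A` (derived in some way from S and P), and the weights `alpha` and `beta` are hypothetical.

```python
import numpy as np


def rank_features_with_links(X, Y, A, k, alpha=1.0, beta=1.0):
    """Return indices of the k features with the largest weight norms.

    X: (n_posts, m_features) data matrix (assumption: one row per post)
    Y: (n_posts, n_classes) label indicator matrix
    A: (n_posts, n_posts) symmetric, nonnegative post-post affinity
       built from link information (hypothetical input)
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    A = np.asarray(A, dtype=float)
    m = X.shape[1]

    # Graph Laplacian of the post graph: penalizes linked posts whose
    # predictions differ.
    L = np.diag(A.sum(axis=1)) - A

    # Closed-form minimizer of
    #   ||XW - Y||^2 + alpha * ||W||^2 + beta * tr((XW)^T L (XW))
    lhs = X.T @ X + alpha * np.eye(m) + beta * (X.T @ L @ X)
    W = np.linalg.solve(lhs, X.T @ Y)

    # A feature's relevance is the norm of its row of weights.
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)[:k]


# Toy data: feature 0 separates the two classes, feature 1 is noise,
# feature 2 is constant. Posts 0-1 and posts 2-3 are linked.
X = np.array([[1.0, 0.1, 0.0],
              [1.0, -0.1, 0.0],
              [-1.0, 0.1, 0.0],
              [-1.0, -0.1, 0.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
print(rank_features_with_links(X, Y, A, k=1))
```

Setting `beta=0` reduces the sketch to plain ridge regression on flat data, which makes it easy to compare rankings with and without link information.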
How to Use Link Information
• Would the additional (i.e., link) information be useful for feature selection?
• Some technical challenges
  – Relation extraction: What are distinct relations that can be extracted from linked data
  – Mathematical representation: How to use these relations in feature selection formulation
• Are there theories to guide us in generating hypotheses?
Social Theories Guided Research
• Social correlation theories suggest that the four relations may affect the relationships between posts
• Social correlation theories
  – Homophily: People with similar interests are more likely to be linked
  – Influence: People who are linked are more likely to have similar interests
• Guided by theories, we turn social relations into hypotheses for investigation
Relation Extraction
1. CoPost
2. CoFollowing
3. CoFollowed
4. Following
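The four relations can be sketched as matrix operations on the social context. The slide only names the relations, so the definitions here are an assumed reading of those names: CoPost links two posts by the same user, CoFollowing links posts of two users who follow a common user, CoFollowed links posts of two users followed by a common user, and Following links a follower's posts to the followee's posts. The function name `extract_relations` and the matrix orientations (S[i, j] = 1 if user i follows user j, P[u, p] = 1 if user u wrote post p) are hypothetical.

```python
import numpy as np


def extract_relations(S, P):
    """Build post-post relation matrices from S (user-user) and P (user-post)."""
    S = np.asarray(S, dtype=int)
    P = np.asarray(P, dtype=int)

    def binary_off_diagonal(M):
        # Keep only the presence of a relation and drop self-relations.
        M = (M > 0).astype(int)
        np.fill_diagonal(M, 0)
        return M

    user_relations = {
        "CoFollowing": binary_off_diagonal(S @ S.T),  # i and j follow a common user
        "CoFollowed": binary_off_diagonal(S.T @ S),   # i and j share a follower
        "Following": binary_off_diagonal(S),          # i follows j (directed)
    }

    # CoPost is defined directly on posts: same author.
    post_relations = {"CoPost": binary_off_diagonal(P.T @ P)}

    # Lift each user-level relation to the posts written by those users.
    for name, R in user_relations.items():
        post_relations[name] = binary_off_diagonal(P.T @ R @ P)
    return post_relations


# Toy context: users 0 and 1 both follow user 2.
S = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
# User 0 wrote posts 0 and 1, user 1 wrote post 2, user 2 wrote post 3.
P = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
relations = extract_relations(S, P)
for name, matrix in relations.items():
    print(name)
    print(matrix)
```

Each resulting matrix can serve as a post-post affinity when link information is folded into a feature selection objective.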
Evaluation Results on Digg
Summary
• We evaluate if link information can be used for feature selection and understand how it works
  – Link information can help feature selection for social media data, in particular, when we are short of data
• Unlabeled data is more common in social media; unsupervised learning is more sensible, but also more challenging
Inferring the Implicit – Second Case
• Both trust and distrust (positive and negative info) help decision makers reduce the uncertainty and risk associated with decisions
• Distrust may play an equally, if not more, critical role as trust does in decision making
• Distrust is new in Social Media Analysis
  – Asymmetry of information available (like vs dislike)
• Distrust is, however, not new in Social Sciences
  – Various definitions of distrust in Social Sciences
Two Theories of Distrust from Social Sciences
• Distrust is the negation of trust
  – Low trust is equivalent to high distrust
  – The absence of distrust means high trust
  – Lack of the studying of distrust matters little
• Distrust is a new dimension of trust
  – Trust and distrust are two separate concepts
  – Trust and distrust can co-exist
  – A study ignoring distrust would yield an incomplete estimate of the effect of trust
Challenges in Studying Distrust in Social Media
• Challenge 1: Lack of computational understanding of distrust with social media data
  – Social media data is based on passive observations
  – Lack of some information that social sciences conventionally use to conduct studies
• Challenge 2: Distrust information is usually not publicly available
  – Trust is desired while distrust is not for open online social platforms
Computational Understanding of Distrust
• Design computational tasks to help understand distrust with passively observed social media data
  § Q1: Is distrust the negation of trust?
    – Yes or No?
  § Q2: Is there any value of distrust after Q1 is answered?
    – If distrust is a new dimension of trust, what is added value of distrust
• How can we use social media data to computationally answer the two questions?
Task 1: Is distrust the negation of trust?
• If distrust is the negation of trust, or low trust is equivalent to distrust, distrust should be predictable using trust information
IF Distrust ≡ Low Trust
THEN Predicting Distrust ≡ Predicting Low Trust
Evaluation of Task 1
§ The performance of using low trust for distrust is consistently worse than randomly guessing
§ Task 1: Since it fails to predict distrust with only trust, distrust is not the negation of trust
dTP: It uses trust propagation to calculate trust scores for pairs of users
dMF: It uses the matrix factorization based predictor to compute trust scores for pairs of users
dTP-MF: It is the combination of dTP and dMF using OR
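The logic of Task 1 can be sketched as follows: compute a trust score for each pair of users, treat the lowest-scoring pairs as predicted distrust, and compare against known distrust relations. This is a minimal sketch in the spirit of dTP, not the evaluated predictor. The propagation rule (a decayed sum of powers of the trust matrix), the candidate set, and the names `propagated_trust_scores`, `predict_distrust_as_low_trust`, and `precision` are all assumptions for illustration.

```python
import numpy as np


def propagated_trust_scores(T, steps=3, decay=0.5):
    """Score pairs by trust chains of length 1..steps, discounting longer chains."""
    T = np.asarray(T, dtype=float)
    scores = np.zeros_like(T)
    power = np.eye(T.shape[0])
    for step in range(1, steps + 1):
        power = power @ T                        # trust chains of length `step`
        scores += (decay ** (step - 1)) * power
    return scores


def predict_distrust_as_low_trust(scores, candidate_pairs, n_predictions):
    """Pick the n candidate pairs with the lowest trust scores."""
    ranked = sorted(candidate_pairs, key=lambda pair: scores[pair])
    return set(ranked[:n_predictions])


def precision(predicted, actual):
    """Fraction of predicted pairs that are true distrust relations."""
    return len(predicted & actual) / len(predicted) if predicted else 0.0


# Toy trust network: 0 trusts 1, 1 trusts 2, 0 trusts 3.
T = np.zeros((4, 4))
T[0, 1] = T[1, 2] = T[0, 3] = 1
scores = propagated_trust_scores(T)

candidates = [(0, 2), (2, 0), (3, 1)]       # pairs without explicit trust
known_distrust = {(2, 0)}                   # hypothetical ground truth
predicted = predict_distrust_as_low_trust(scores, candidates, n_predictions=2)
print(predicted, precision(predicted, known_distrust))
```

Comparing this precision with that of a random draw from the same candidate set is what decides whether low trust is a usable proxy for distrust.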
Task 2: Is there any added value of distrust?
• If distrust has any added value, we should predict trust better with distrust
• To verify the above statement, we define the second computational task involving distrust
  – Incorporating distrust in trust prediction
Diagram labels: Old Trust, New Trust, Distrust, Trust Prediction
Evaluation of Distrust in Trust Propagation
• Incorporating distrust propagation can improve the performance of trust measurement
• One step distrust propagation usually outperforms multiple step distrust propagation
[Figure: PA performance vs. x%]
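The difference between one-step and multiple-step distrust propagation can be illustrated with matrix products. The sketch below uses one common way to formalize the two variants, which may differ from the operators evaluated in the talk: in the one-step variant trust is propagated along chains and distrust is applied only on the final hop, while in the multiple-step variant the signed matrix T − D is propagated as a whole. The function names and the parameter `k` are hypothetical.

```python
import numpy as np


def one_step_distrust(T, D, k=1):
    """Propagate trust k hops, then apply trust minus distrust on the last hop."""
    T = np.asarray(T, dtype=float)
    D = np.asarray(D, dtype=float)
    return np.linalg.matrix_power(T, k) @ (T - D)


def multi_step_distrust(T, D, k=1):
    """Propagate the signed matrix T - D over k + 1 hops."""
    T = np.asarray(T, dtype=float)
    D = np.asarray(D, dtype=float)
    return np.linalg.matrix_power(T - D, k + 1)


# Toy network: 0 trusts 1, 1 trusts 2; 1 distrusts 3, 3 distrusts 0.
T = np.zeros((4, 4))
T[0, 1] = T[1, 2] = 1
D = np.zeros((4, 4))
D[1, 3] = D[3, 0] = 1

one = one_step_distrust(T, D)
multi = multi_step_distrust(T, D)
print(one)
print(multi)
```

In the multiple-step variant a chain of two distrust links yields a positive score ("the enemy of my enemy is my friend"), an inference the one-step variant never makes.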
Experimental Settings for Task 2
• x% of pairs of users with trust relations are chosen as old trust relations and the remaining as new trust relations $A_n^T$
• Task 2 predicts $|A_n^T|$ pairs of users $P$ from $N_x^T$ as new trust relations
• The performance is computed as
$$PA = \frac{|A_n^T \cap P|}{|A_n^T|}$$
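The experimental protocol can be sketched in a few lines, reading the performance measure as the fraction of held-out new trust relations that appear among the predicted pairs, PA = |A_n ∩ P| / |A_n|. The helper names `split_trust_relations` and `prediction_accuracy`, the random split, and the `seed` parameter are assumptions for illustration.

```python
import random


def split_trust_relations(trust_pairs, x, seed=0):
    """Hold x% of trust pairs as old relations; the rest become new relations."""
    pairs = sorted(trust_pairs)
    random.Random(seed).shuffle(pairs)
    n_old = int(len(pairs) * x / 100)
    return set(pairs[:n_old]), set(pairs[n_old:])


def prediction_accuracy(new_trust_pairs, predicted_pairs):
    """PA: share of the new trust relations recovered by the prediction."""
    new_trust_pairs = set(new_trust_pairs)
    predicted_pairs = set(predicted_pairs)
    if not new_trust_pairs:
        raise ValueError("there are no new trust relations to evaluate against")
    return len(new_trust_pairs & predicted_pairs) / len(new_trust_pairs)


# Toy example: ten trust pairs, half kept as old relations.
all_trust = {(i, i + 1) for i in range(10)}
old_trust, new_trust = split_trust_relations(all_trust, x=50)
print(len(old_trust), len(new_trust))

# A hypothetical prediction that recovers one of two new relations.
print(prediction_accuracy({(0, 1), (1, 2)}, {(0, 1), (3, 4)}))
```

Because the predictor outputs as many pairs as there are new trust relations, this measure behaves as both precision and recall.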
Findings from Understanding Distrust
• Distrust presents distinct properties
  – Properties of trust cannot be extended to distrust
• Distrust is not the negation of trust
  – Low trust fails to predict distrust
• Distrust has added value over trust
  – Distrust helps improve trust prediction performance
• However, distrust information is usually not available on a social networking site
• Next task - discovering negative links like distrust
Some Challenges in Understanding Social Media
• Noise-Removal Fallacy
  – Can we remove noise without losing much information?
• Studying Distrust in Social Media
  – Where to find the invisible distrust?
• Big-Data Paradox
  – Lack of data with big social media data
• Evaluation Dilemma
  – Where is ground truth? How to evaluate without it?
• Sampling Bias and Its Mitigation
  – Often we get a small sample of (still big) data. Would that data suffice to obtain credible findings?
Repositories and Recent Books
• scikit-feature – an open source feature selection repository in Python
• Social Computing Repository
• Some books available as free download
THANK YOU and DFC 2016
• For this opportunity to share our research
• Acknowledgments
  – Grants from NSF, ONR, and ARO
  – DMML members and project leaders
  – Collaborators
Search “Huan Liu” for more information or at http://www.public.asu.edu/~huanliu
H. Liu, F. Morstatter, J. Tang, and R. Zafarani. “The good, the bad, and the ugly: uncovering novel research opportunities in social media mining”, in Trends of Data Science, International Journal on Data Science and Analytics, Springer International Publishing Switzerland. September, 2016. DOI 10.1007/s41060-016-0023-0