Social Computing in Big Data Era: Privacy Preservation and Fairness Awareness (Xintao Wu, Nov 19, 2015)


Spectral Methods for Detecting Random Link Attacks and Subtle Anomalies in Social Networks

Xintao Wu, Nov 19, 2015

Social Computing in Big Data Era: Privacy Preservation and Fairness Awareness

Drivers of Data Computing

6 As: Anytime, Anywhere Access to Anything by Anyone Authorized

4 Vs: Volume, Velocity, Variety, Veracity
Reliability, Security, Privacy, Usability

AVC Denial Log Analysis

Volume and Velocity: 1 million log files per day, each with thousands of entries; S3, Hive and EMR on AWS.

Social Media Customer Analytics
Network topology (friendship, followship, interaction)

Variety and Veracity

Structured profiles:

name  sex  age  disease  salary        id  sex  age  address  income
Ada   F    18   cancer   25k           5   F    Y    NC       25k
Bob   M    25   heart    110k          3   M    Y    SC       110k

Data types: structured profile, retweet sequence, product and review, unstructured text (e.g., blog, tweet), transaction database.
Challenges: entity resolution, patterns, temporal/spatial, scalability, visualization, sentiment, privacy.

10 GB of tweets per day; Belk and Lowe's

UNCC Chancellor's special fund

A Single View of the Customer

Customer

Social Media, Gaming, Entertainment, Banking, Finance; Our Known History

Purchase

Outline
- Introduction
- Privacy Preserving Social Network Analysis
  - Input perturbation
  - Output perturbation
- Anti-discrimination Learning

Privacy Breach Cases
- Nydia Velázquez (1994): her medical record on a suicide attempt was disclosed.
- AOL Search Log (2006): an anonymized release of 650K users' search histories lasted less than 24 hours.
- Netflix Contest (2009): the $1M contest was cancelled due to a privacy lawsuit.
- 23andMe (2013): its genetic testing was ordered discontinued by the FDA due to genetic privacy concerns.

23andMe was founded in 2007 and named for the 23 pairs of chromosomes in a human cell. The company provides genetic testing and interpretation to individual consumers. Its personal genome test kit was named invention of the year by Time Magazine in 2008.

Acxiom
Privacy: In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack." In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security: In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

According to the complaint, Acxiom's activities constituted unfair and deceptive trade practices, as "Acxiom has publicly represented its belief that individuals should have notice about how information about them is used and have choices about that dissemination, and has stated that it does not permit clients to make non-public information available to individuals," yet Acxiom proceeded to sell information to Torch Concepts without obtaining consent, an ability to opt-out, or furnishing notice to the affected consumers.

In 2003, the Electronic Privacy Information Center filed a complaint before the Federal Trade Commission against Acxiom and JetBlue Airways, alleging the companies provided consumer information to Torch Concepts, a company hired by the United States Army "to determine how information from public and private records might be analyzed to help defend military bases from attack by terrorists and other adversaries."



Privacy Regulation (Forrester): most restricted / restricted / some restrictions / minimal restrictions / effectively no restrictions / no legislation or no information.

Privacy Protection Laws
- USA: HIPAA for health care; Gramm-Leach-Bliley Act of 1999 for financial institutions; COPPA for children's online privacy; state regulations, e.g., California State Bill 1386.
- Canada: PIPEDA 2000 (Personal Information Protection and Electronic Documents Act).
- European Union: Directive 95/46/EC provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy.
- Contractual obligations: individuals should have notice about how their data is used and have opt-out choices.

Privacy Preserving Data Mining

ssn  name  zip    race   age  sex  income  disease
           28223  Asian  20   M    85k     Cancer
           28223  Asian  30   F    70k     Flu
           28262  Black  20   M    120k    Heart
           28261  White  26   M    23k     Cancer
           ...    ...    ...  ...  ...     ...
           28223  Asian  20   M    110k    Flu

69% of individuals are unique on zip and birth date; 87% with zip, birth date, and gender.

Generalization (k-anonymity, l-diversity, t-closeness); randomization.
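The k-anonymity condition itself is easy to state in code: every combination of quasi-identifier values must occur in at least k records. A minimal sketch with hypothetical records (not from the slides):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every quasi-identifier combination appears in at
    least k records (the k-anonymity condition)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Toy records in the spirit of the slide's table (hypothetical values).
records = [
    {"zip": "28223", "race": "Asian", "age": 20, "disease": "Cancer"},
    {"zip": "28223", "race": "Asian", "age": 20, "disease": "Flu"},
    {"zip": "28262", "race": "Black", "age": 20, "disease": "Heart"},
]

print(is_k_anonymous(records, ["zip", "race", "age"], 2))  # False: the 28262 group has only 1 record
```

Generalization (e.g., coarsening age to ranges) is exactly the operation used to merge small groups until this check passes.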

The goal of data mining is summary results (e.g., classification, clusters, association rules) derived from the data distribution. Individual values in the database must not be disclosed, or at least no close estimate of them can be obtained by attackers.

Privacy Preserving Data Mining
How do we transform data so that we can build a good data mining model (data utility) while preserving privacy at the record level (privacy)?

Social Network Data

Data owner

Data miner

Release:

name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k

Anonymized release (names replaced by ids, ages generalized to Y/M/O bands):

id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

Threat of Re-identification

id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

Attacker (attack)
Privacy breaches: identity disclosure, link disclosure, attribute disclosure.

Privacy Preservation in Social Network Analysis
Input perturbation: k-anonymity, generalization, randomization.

Our Work
Feature-preserving randomization:
- Spectrum preserving randomization (SDM08)
- Markov chain based feature preserving randomization (SDM09)
- Reconstruction from randomized graph (SDM10)
Link privacy (from the attacker's perspective):
- Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
- Exploiting graph space via Markov chain (SDM09)

Spectrum Preserving Randomization [SDM08]
Spectral switch:
To increase the eigenvalue: (figure in the original slides)
To decrease the eigenvalue: (figure in the original slides)
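The switch direction can be sketched via first-order eigenvalue perturbation (a reconstruction from standard spectral theory; the notation is ours and the SDM08 paper's exact criterion may differ). With x the unit leading eigenvector of adjacency matrix A, replacing edges (t,w) and (u,v) by (t,v) and (u,w) perturbs A by ΔA, and

```latex
\Delta\lambda_1 \;\approx\; x^{\top} \Delta A\, x
  \;=\; 2\left(x_t x_v + x_u x_w - x_t x_w - x_u x_v\right)
```

so a candidate switch is accepted only when this quantity has the desired sign.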

Reconstruction from Randomized Graph [SDM10]
We can reconstruct a graph from the randomized release that is close to the original, without incurring much privacy loss.


Exploiting graph space [SDM09] (figure comparing sampled graphs against the original)

PSNet (NSF-0831204)

Output Perturbation

Data owner

Data miner

The data owner holds the patient table (name, sex, age, disease, salary; shown earlier). Query f → query result + noise. The noisy result cannot be used to derive whether any individual is included in the database.

Differential Guarantee [Dwork, TCC06]
Consider two neighboring databases of (name, disease) records: Ada/cancer, Bob/heart, Cathy/cancer, Dell/flu, Ed/cancer, Fred/flu, and the same database with one record changed so that it contains one fewer cancer case. For f = count(#cancer), the mechanism K returns f(x) + noise: 3 + noise on the first database and 2 + noise on the second. Because the two noisy answers are nearly indistinguishable, the mechanism achieves opt-out: the output reveals little about any single individual's presence.

Differential Privacy
ε is a privacy parameter: smaller ε means stronger privacy.
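For reference, the guarantee can be written out (the standard formulation from the differential privacy literature):

```latex
% A randomized mechanism K is \varepsilon-differentially private if, for all
% databases D, D' differing in at most one record and all output sets S,
\Pr[K(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[K(D') \in S]
```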

Calibrating Noise

Laplace distribution; sensitivity of function: global sensitivity vs. local sensitivity.
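These pieces fit together as follows (a standard reconstruction matching the slide's terms, not copied from the slides):

```latex
% Global sensitivity of f over neighboring databases D \sim D'
\Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1
% Laplace mechanism: add noise with scale b = \Delta f / \varepsilon
K(D) = f(D) + \eta, \qquad \eta \sim \mathrm{Lap}\!\left(\tfrac{\Delta f}{\varepsilon}\right),
\qquad p(z) = \frac{\varepsilon}{2\,\Delta f}\,
              \exp\!\left(-\frac{|z|\,\varepsilon}{\Delta f}\right)
```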

Sensitivity

For the patient table shown earlier:

Function f       Sensitivity
Count(#cancer)   1
Sum(salary)      u (domain upper bound)
Avg(salary)      u/n

Data mining tasks can be decomposed into a sequence of simple functions.
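A minimal sketch of the calibrated-noise release for these functions (the salary values come from the slide's table; `u` is an assumed domain bound, and epsilon is chosen arbitrarily):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity/epsilon,
    the standard calibration for epsilon-differential privacy."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
salaries = [25, 110, 70, 65, 300, 20, 45, 95, 70]  # $k, from the slide's table
u = 500.0       # hypothetical domain upper bound on salary ($k)
epsilon = 0.5   # example privacy budget

noisy_count = laplace_mechanism(len(salaries), 1.0, epsilon, rng)  # Count: sensitivity 1
noisy_sum = laplace_mechanism(sum(salaries), u, epsilon, rng)      # Sum: sensitivity u
noisy_avg = laplace_mechanism(sum(salaries) / len(salaries),
                              u / len(salaries), epsilon, rng)     # Avg: sensitivity u/n
```

Note how the noise scale tracks the sensitivity column above: the count needs far less noise than the sum.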

L1 distance for vector outputs.

Challenge in OSN

Degree sequences: [1,1,3,3,3,3,2] → [1,1,3,3,2,2,2]

For a degree sequence, Δf = 2, so noise from Lap(2/ε) is needed.

Number of triangles: a single edge can lie in up to n-2 triangles, so Δf = n-2 and huge noise is needed. High sensitivity!

Advanced Mechanisms
Possible theoretical approaches: smooth sensitivity, exponential mechanism, functional mechanism, sampling.

Our Work
- DP-preserving cluster coefficient (ASONAM12)
- DP-preserving spectral graph analysis (PAKDD13)
- Linear refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
- DP-preserving graph generation based on degree correlation (TDP13)
- Regression model fitting under differential privacy and model inversion attack (IJCAI15)
- DP-preservation for deep auto-encoders (AAAI16)
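Both graph sensitivities above (Δf = 2 for the degree sequence, Δf = n-2 for the triangle count) can be checked on a toy complete graph; this is our own sanity-check code, not from the slides:

```python
from itertools import combinations

def degrees(nodes, edges):
    """Degree of each node, with edges given as frozenset pairs."""
    return {v: sum(v in e for e in edges) for v in nodes}

def triangle_count(nodes, edges):
    """Count triangles by brute force over all node triples."""
    return sum(1 for a, b, c in combinations(sorted(nodes), 3)
               if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges)

n = 6
nodes = range(n)
complete = {frozenset(e) for e in combinations(nodes, 2)}   # K_n
pruned = complete - {frozenset((0, 1))}                     # remove one edge

# Removing one edge changes two degree entries by 1 each: L1 distance 2.
d1, d2 = degrees(nodes, complete), degrees(nodes, pruned)
l1 = sum(abs(d1[v] - d2[v]) for v in nodes)

# ...and destroys one triangle per remaining node: n - 2 triangles.
dt = triangle_count(nodes, complete) - triangle_count(nodes, pruned)
print(l1, dt)  # 2 4
```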

SMASH (NIH R01GM103309)

Genetic Privacy (NSF 1502273 and 1523115)

BIBM13 Best Paper Award

- Individual identification from GWAS statistics (2008): allele frequencies of case and control groups, odds ratios/p-values.
- Surname leakage from personal genomes (2013): genotype data from the HapMap project and the 1000 Genomes project; combined with published year of birth and state of location, individuals were accurately identified.

Outline
- Introduction
- Privacy Preserving Social Network Analysis
  - Input perturbation
  - Output perturbation
- Anti-discrimination Learning

What is discrimination?
Discrimination refers to unjustified distinctions of individuals based on their membership in a certain group.
Federal laws and regulations disallow discrimination on several grounds: gender, age, marital status, sexual orientation, race, religion or belief, disability or illness. These attributes are referred to as the protected attributes.

Protected groups

Predictive Learning

Finding evidence of discrimination; building non-discriminatory classifiers.
The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

Motivating Example

name   sex  age  program  acceptance
Ada    F    18   cancer   +
Bob    M    25   heart    -
Cathy  F    20   cancer   +
Ed     M    60   cancer   -
Fred   M    24   flu      -

Suppose 2000 applicants, 1000 M and 1000 F, with acceptance ratios of 36% for M vs. 24% for F. Do we have discrimination here?

Discrimination Discovery
Assume a causal Bayesian network that faithfully represents the data.
Protected attribute: c+, c-; decision attribute: e+, e-.
ΔP = P(e+ | c+) - P(e+ | c-)
A discriminatory effect exists if ΔP > τ, where τ is a threshold for discrimination depending on law (e.g., 5%).
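A toy computation of ΔP for the earlier 36% vs. 24% motivating example (our own illustration; taking c+ to be the male group is an assumed sign convention, not stated on the slides):

```python
def delta_p(p_pos_given_c_plus, p_pos_given_c_minus):
    """Risk difference: P(e+ | c+) - P(e+ | c-), per the slide's formula."""
    return p_pos_given_c_plus - p_pos_given_c_minus

# Motivating example: acceptance 36% for males (c+), 24% for females (c-).
gap = delta_p(0.36, 0.24)
tau = 0.05  # example legal threshold from the slide
print(gap > tau)  # True: the 0.12 gap exceeds the 5% threshold
```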

Motivating Examples
Case I: ΔP = 0.1
Case II: ΔP = -0.01

Case II: ΔP = -0.01
Case III: ΔP = 0.104

Discrimination Analysis
Discrimination "is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong, rather than on individual merit." (Wikipedia)
Tweet discrimination analysis aims to detect whether a tweet contains discrimination against gender, race, age, etc.

We focus on identifying discrimination from text; specifically, we want to determine whether or not a tweet contains discrimination.

A Typical Deep Learning Pipeline for Text Classification

Text → Word Representation → Deep Learning Model → Text Representation → Softmax Classifier
(words are composed semantically into text)
Deep learning models: multilayer perceptron, recursive neural network, recurrent neural network, convolutional neural network.
Word embeddings: use low-dimensional real-valued vectors to represent words.

Tweet → Word Embeddings → LSTM-RNN → Mean Pooling → Tweet Representation → Logistic Regression
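A minimal NumPy sketch of this pipeline (all vocabulary, weights, and names here are hypothetical stand-ins; the real model would use trained embeddings and an LSTM encoder, which we replace with mean pooling over word vectors for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a tiny random embedding table and classifier weights.
vocab = {"this": 0, "tweet": 1, "is": 2, "hate": 3, "nice": 4}
dim = 8
embeddings = rng.normal(size=(len(vocab), dim))  # word embedding lookup table
w = rng.normal(size=dim)                         # logistic-regression weights
b = 0.0                                          # logistic-regression bias

def classify(tokens):
    """Pipeline sketch: embed tokens, mean-pool into a tweet vector
    (standing in for the LSTM-RNN encoder), then apply logistic regression."""
    vecs = embeddings[[vocab[t] for t in tokens]]
    tweet_repr = vecs.mean(axis=0)               # mean pooling over time steps
    score = tweet_repr @ w + b
    return 1.0 / (1.0 + np.exp(-score))          # P(discriminatory | tweet)

p = classify(["this", "tweet", "is", "hate"])
```

In the trained system, the embedding table, LSTM parameters, and (w, b) would all be learned jointly from labeled tweets.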

Summary
- Preserving Privacy Values
- Educating Robustly and Responsibly
- Big Data and Discrimination
- Law Enforcement & Security
- Data as a Public Resource


Acknowledgement

Collaborators:
- UNCC: Aidong Lu, Xinghua Shi, Yong Ge
- Oregon: Jun Li, Dejing Dou
- PeaceHealth: Brigitte Piniewski
- UIUC: Tao Xie

DPL members:
- UNCC PhD graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying
- UNCC PhD students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
- UofA: Lu Zhang (postdoc), Yongkai Wu, Cheng Si, Miao Xie, Shuhan Yuan

Funding support: (shown in the original slides)

Genome Wide Association Study

(Figure residue: spectral switch diagram with nodes t, u, v, w and eigenvector entries x_t, x_u, x_v, x_w.)