Towards Privacy-Preserving Speech Data Publishing

Jianwei Qian (Illinois Tech), Feng Han (USTC), Jiahui Hou (Illinois Tech), Chunhong Zhang (BUPT), Yu Wang (UNC Charlotte), Xiang-Yang Li (USTC)

Post on 04-Jul-2020

Slides: mypages.iit.edu/~jqian15/slides/PP-speech-publishing-INFOCOM.pdf

• Introduction
• Problem Formulation
• Sanitization Methods
• Evaluation

Voice-based human-computer interaction

• Applications
  – Input keyboards, web search, voice assistants, and voice authentication

Speech data is being collected

• Speech data
  – Voice input, voice commands, call records

Speech data is being shared/published

• For business
  – Outsourcing or selling
• For research
  – Spoken language analysis
    • Speech/speaker recognition
    • Gender/age/accent recognition
    • Emotion/personality/health analysis
  – E.g., TIMIT, NIST SRE

Scenario

[Figure: speakers (victims) use apps run by a voice-based service provider (publisher). The publisher removes PII and then publishes, shares, or leaks the speech data to data consumers (attackers).]

Anonymous speech records can be de-anonymized!

Private info contained in speech data

• Speech content
  – Searches, commands
  – Messages and emails written by voice input
• Voice attributes
  – Gender, age, ethnicity, accent, height, health conditions
• Membership
  – Dataset descriptions leak info too, e.g., “collected from heart disease patients”
• Voiceprints
  – Biometrics. Once lost, always lost

Our goals

• Study the risk of privacy leak in speech data publishing
• Quantify privacy level and data utility
• Design sanitization methods
• Balance the tradeoff between privacy and utility

Challenges

• Existing privacy definitions may not fit
  – Speech = voice + text (speech content)
  – Privacy in text is already hard to define
• Utility of speech data is unknown
  – Depends on multiple factors
    • Audio quality, speaker diversity, speech content relevance, etc.
  – Depends on the application/usage

Tradeoff (privacy vs. utility)

• Membership vs. data use clarity
• Voiceprints vs. speech quality
• Voice attributes vs. voice diversity
• Speech content vs. semantic soundness

Notations

• $D = (des, \{s\})$: dataset
  – $des$: dataset description
  – $s$: an utterance (speech record) of a speaker
• $P_s$: speaker $s$'s privacy leak
• $U$: total utility loss
• Privacy notion: $p$-leak limit
  – $D$ satisfies the $p$-leak limit iff $P_s \le p$ for every $s$ in the dataset, where $p$ is the budget
• Optimization
  – Minimize $U$, subject to the $p$-leak limit

Privacy leak quantification

• Privacy leak $P_s$: the amount of private info contained in $s$
  – Text leak (speech content): $P_t^s$, the sum of the TF-IDF of all terms in $s$
  – Voice attribute leak: $P_{va}^s$, a sum of per-attribute leak terms
  – Voiceprint leak: $P_{vp}^s = w_{vp} \cdot r$, where $w_{vp}$ is a weight and $r$ is the fraction of voiceprint leak
  – Membership leak: $P_m$, a sum of per-item leak terms
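The text leak (sum of the TF-IDF of all terms) can be illustrated with a minimal sketch. The whitespace tokenizer, the tf = count/length and idf = log(N/df) variants, and the toy transcripts are assumptions made for illustration; the slides do not specify which TF-IDF variant is used.

```python
import math
from collections import Counter


def tf_idf_scores(transcripts):
    """Return one {term: tf-idf} dict per transcript (one transcript per speaker)."""
    tokenized = [text.lower().split() for text in transcripts]
    n_docs = len(tokenized)
    # Document frequency: in how many transcripts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def text_leak(term_scores):
    """Text leak of one speaker: the sum of the TF-IDF of all of their terms."""
    return sum(term_scores.values())


# Toy transcripts, invented for this example.
transcripts = [
    "please book a flight to benghazi benghazi",
    "please book a table for two",
    "please play a song",
]
scores = tf_idf_scores(transcripts)
for speaker, term_scores in enumerate(scores):
    print(speaker, round(text_leak(term_scores), 3))
```

Terms shared by every speaker (here "please" and "a") contribute nothing, while a term that only one speaker uses repeatedly dominates that speaker's leak.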

Utility loss quantification

• 4 aspects too
  – Voice diversity loss: $U_v = \frac{1}{2}\lVert A - A' \rVert$, the distance between the attribute distributions $A$ and $A'$
  – Text authenticity loss: $U_t^s$ is computed from the edit distance; $U_t = \overline{U_t^s}$, averaged over all utterances
  – Speech quality loss: $U_q^s = 1 - \mathrm{PESQ}$; $U_q = \overline{U_q^s}$, averaged over all utterances
  – Data use clarity loss: $U_c$, a sum of per-item loss terms
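Two of these losses are easy to sketch. The example below assumes the distribution distance is half the L1 distance (total variation) and that the edit distance is taken over words and normalized by the original length; both choices, and all function names, are assumptions rather than the definitions used in the paper.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a list of attribute values."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def voice_diversity_loss(before, after):
    """Half the L1 distance between two attribute distributions (assumed norm)."""
    p, q = distribution(before), distribution(after)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_authenticity_loss(original, sanitized):
    """Edit distance normalized by the original length (assumed normalization)."""
    original_tokens = original.split()
    sanitized_tokens = sanitized.split()
    return edit_distance(original_tokens, sanitized_tokens) / max(len(original_tokens), 1)


# Toy data: perceived gender of four speakers before and after sanitization.
diversity = voice_diversity_loss(["f", "f", "m", "m"], ["f", "f", "f", "m"])
authenticity = text_authenticity_loss("call doctor smith at noon", "call doctor at noon")
print(diversity, authenticity)
```

Averaging the per-utterance value over all utterances would give the dataset-level text authenticity loss.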

Put them all together

• Privacy leak of speaker $s$
  – $P_s = f_P(P_t^s, P_{va}^s, P_{vp}^s, P_m)$
  – $f_P$ is decided by the publisher
    • E.g., linear combination, or supermodular function
• Total utility loss
  – $U = f_U(U_v, U_t, U_q, U_c)$
  – $f_U$ is decided by the publisher & consumer
    • E.g., linear combination, or supermodular function

Sanitization actions

[Figure: raw speech is turned into sanitized speech by the following actions]

• Sanitize dataset description
• Sanitize text
  – Key term identification: N-gram, NER
  – Key term perturbation: truncation, substitution
• Sanitize voice
  – Voice conversion

Impact of sanitization actions

Each action lowers some privacy leaks (↓) and raises some utility losses (↑). Every affected quantity is a function of the action's parameters.

• Data description sanitization
  – Privacy leak ↓: membership
  – Utility loss ↑: data use clarity
• Key term perturbation (per-speaker parameter)
  – Privacy leak ↓: text
  – Utility loss ↑: text authenticity
• Voice conversion (per-speaker parameter)
  – Privacy leak ↓: voice attribute, voiceprint
  – Utility loss ↑: voice diversity, speech quality
• Speech synthesis (per-speaker parameter)
  – Privacy leak ↓: voice attribute and voiceprint leaks become 0
  – Utility loss ↑: voice diversity, which depends jointly on the voice conversion and speech synthesis parameters of all speakers

*Example formulas of these functions can be found in our paper

Optimization

Privacy-preserving speech data publishing [formulation shown as an annotated figure]

• Decision variables: parameters of sanitization
• Constraints
  – $p$-leak limit
  – Single utility loss constraints
  – Unnecessary to do voice conversion and speech synthesis together
• Not a convex problem

Speech content sanitization

• Observation
  – A term frequently used by a person but infrequently used by others is highly related to this person
  – Higher TF-IDF → more private
• Example
  – 8K Hillary Clinton emails
• Text privacy leak is defined as $P_t^s = \sum_{t_i \in t^s} \mathrm{tfidf}(t_i)$, the sum of the TF-IDF of all terms $t_i$ in the text $t^s$ of $s$

Speech content sanitization (cont’d)

• Key terms
  – Terms whose TF-IDF exceeds a threshold $\theta$
  – Perturbing key terms can effectively reduce text leak while causing minimal utility loss
• Key term identification
  – In text (transcript)
    • N-gram based
    • Named-entity recognition based
  – In audio
    • DTW-based keyword spotting
• Key term perturbation
  – Substitution/truncation
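DTW-based keyword spotting can be sketched as follows. This is a simplified illustration: it assumes one scalar feature per frame (real systems typically compare vectors such as MFCCs) and an exhaustive sliding-window search; `spot_keyword` and `max_stretch` are hypothetical names, not taken from the slides.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of numbers."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the DTW table
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            step = min(prev[j], curr[j - 1], prev[j - 1])
            curr.append(abs(x - y) + step)
        prev = curr
    return prev[-1]


def spot_keyword(utterance, keyword, max_stretch=2):
    """Return (cost, start, end) of the window that best matches the keyword."""
    best = (float("inf"), 0, 0)
    min_len = max(1, len(keyword) // max_stretch)
    max_len = len(keyword) * max_stretch
    for start in range(len(utterance)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(utterance):
                break
            # Normalize by window length so long windows are not penalized.
            cost = dtw_distance(utterance[start:end], keyword) / length
            best = min(best, (cost, start, end))
    return best


# Toy feature sequences: the keyword is spoken twice as slowly in the utterance.
keyword = [1, 3, 2]
utterance = [0, 0, 1, 1, 3, 3, 2, 2, 0, 0]
cost, start, end = spot_keyword(utterance, keyword)
print(cost, utterance[start:end])
```

Once a key term is located in the audio this way, the matching window is the segment that substitution or truncation would act on.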

Speech content sanitization (cont’d)

• Impact of the threshold $\theta$
  – Smaller $\theta$, less text leak
  – Smaller $\theta$, more text authenticity loss
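The effect of the key term threshold can be shown with a small sketch. The TF-IDF scores, the `<term>` placeholder, and the function names below are made up for illustration; only the idea of perturbing terms whose TF-IDF exceeds a threshold comes from the slides.

```python
def perturb_key_terms(tokens, scores, threshold, mode="substitute", placeholder="<term>"):
    """Substitute or truncate every token whose TF-IDF exceeds the threshold."""
    if mode not in ("substitute", "truncate"):
        raise ValueError("mode must be 'substitute' or 'truncate'")
    sanitized = []
    for token in tokens:
        if scores.get(token, 0.0) > threshold:
            if mode == "substitute":
                sanitized.append(placeholder)
            # In "truncate" mode the key term is simply dropped.
        else:
            sanitized.append(token)
    return sanitized


def remaining_leak(tokens, scores):
    """Sum of the TF-IDF of the distinct terms still present."""
    return sum(scores.get(token, 0.0) for token in set(tokens))


# Hypothetical per-term TF-IDF scores for one speaker.
scores = {"benghazi": 0.44, "flight": 0.18, "book": 0.07, "please": 0.0, "to": 0.0}
tokens = "please book flight to benghazi".split()

for threshold in (0.3, 0.1):
    sanitized = perturb_key_terms(tokens, scores, threshold)
    changed = sum(1 for old, new in zip(tokens, sanitized) if old != new)
    print(threshold, sanitized, round(remaining_leak(sanitized, scores), 2), changed)
```

Lowering the threshold from 0.3 to 0.1 perturbs two terms instead of one, so less leak remains but more of the original text is altered.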

Voice sanitization - voice conversion

• VTLN (vocal tract length normalization)
  – Deforms the frequency axis of the speech signal according to a warping function
  – E.g., bilinear function $f(\omega, \alpha)$
  – $\alpha$ tunes the distortion level

[Figure: input voice → pitch marking → frame segmentation → FFT → VTLN → IFFT → PSOLA → output voice]
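A common form of the bilinear warping function is the phase response of a first-order all-pass filter. The sketch below uses that form as an assumption; the slides only name a bilinear function with a distortion parameter, and sign conventions differ between papers.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega in [0, pi] with alpha in (-1, 1).

    Assumed form: omega + 2 * atan(alpha * sin(omega) / (1 - alpha * cos(omega))).
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    return omega + 2.0 * math.atan2(alpha * math.sin(omega), 1.0 - alpha * math.cos(omega))


# Warp a few evenly spaced frequencies with a mild and a strong distortion level.
grid = [k * math.pi / 8 for k in range(9)]
for alpha in (0.0, 0.1, 0.4):
    print(alpha, [round(bilinear_warp(omega, alpha), 3) for omega in grid])
```

With alpha = 0 the frequency axis is unchanged, and larger magnitudes of alpha bend it further while keeping the endpoints 0 and pi fixed.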

Voice sanitization - voice conversion (cont’d)

• Impact of $\alpha$
  – Bigger $\alpha$, less voiceprint leak
  – Bigger $\alpha$, worse speech quality

Voice sanitization - Speech synthesis

• Steps
  – Tokenization → text-to-phoneme → waveform generation
• Status quo
  – ☺ A few companies can produce pretty natural voices
  – ☹ They provide only a couple of voice options
• Impact on privacy & utility
  – Completely protects voice attributes and voiceprints
  – Damages voice diversity (many to several)

Optimization problem

Minimize $U$, subject to $P_s \le p$, $\forall s$

We compute $U$ and $P_s$ by linear combination:
• $U = w_v U_v + w_t U_t + w_q U_q + w_c U_c$
• $P_s = P_t^s + P_{va}^s + P_{vp}^s + P_m$ (weights are assigned inside)
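The linear combinations can be written down directly. In this sketch the dictionary keys (v, t, q, c for voice diversity, text authenticity, speech quality, and data use clarity) and all numbers are illustrative assumptions; the slides do not give concrete weights or values.

```python
def total_utility_loss(losses, weights):
    """Weighted sum of the four utility losses."""
    return sum(weights[key] * losses[key] for key in weights)


def privacy_leak(leaks):
    """Sum of the four leak components (their weights are assigned inside each term)."""
    return sum(leaks.values())


def satisfies_leak_limit(leaks, budget):
    """True if the speaker's total privacy leak stays within the budget."""
    return privacy_leak(leaks) <= budget


# Hypothetical weights, losses, and leak components for one speaker.
weights = {"v": 0.4, "t": 0.3, "q": 0.2, "c": 0.1}
losses = {"v": 0.25, "t": 0.2, "q": 0.1, "c": 0.5}
leaks = {"text": 0.3, "voice_attribute": 0.2, "voiceprint": 0.1, "membership": 0.1}

print(total_utility_loss(losses, weights))
print(privacy_leak(leaks), satisfies_leak_limit(leaks, budget=1.0))
```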

Heuristic algorithm

• Intuition
  – If a certain utility is more important (larger $w$), then we should allow more corresponding privacy leak (more budget)
• Heuristic
  – We allocate the budget $p$ to $P_t^s$, $P_{va}^s + P_{vp}^s$, $P_m$ in the ratio $w_t : (w_v + w_q) : w_c$
• Divide & conquer: 3 smaller problems
  – P1: For all $s$, minimize $U_t^s$, subject to $P_t^s \le p\,w_t$
  – P2: Minimize $w_v U_v + w_q U_q$, subject to $P_{va}^s + P_{vp}^s \le p\,(w_v + w_q)$ and $\alpha_s \beta_s = 0$, $\forall s$, where $\alpha_s$ and $\beta_s$ are the voice conversion and speech synthesis parameters of $s$
  – P3: Minimize $U_c$, subject to $P_m \le p\,w_c$

$\Rightarrow P_t^s + P_{va}^s + P_{vp}^s + P_m \le p$
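The budget split behind the three sub-problems can be sketched as below. The weights are normalized to sum to 1 here so that the three sub-budgets add up to the overall budget; that normalization, the key names, and the numbers are assumptions for illustration.

```python
def allocate_budget(budget, weights):
    """Split the leak budget across text, voice, and membership leaks.

    weights maps "v", "t", "q", "c" (voice diversity, text authenticity,
    speech quality, data use clarity) to non-negative importance values.
    """
    keys = ("v", "t", "q", "c")
    total = sum(weights[key] for key in keys)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    share = {key: weights[key] / total for key in keys}
    return {
        "text": budget * share["t"],                     # budget for P1
        "voice": budget * (share["v"] + share["q"]),     # budget for P2
        "membership": budget * share["c"],               # budget for P3
    }


# Hypothetical weights and budget.
allocation = allocate_budget(2.0, {"v": 0.4, "t": 0.3, "q": 0.2, "c": 0.1})
print(allocation)
```

Because each sub-problem keeps its own leak within its share, the sum of the leaks stays within the overall budget, which is the implication stated at the end of the slide.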

Datasets

• TED talks
  – 562 audios with subtitles from ted.com
• LibriSpeech
  – Audios of 251 native speakers reading an English book
• US census data
  – 2.46M people’s demographics
  – Used to simulate linkage attacks
• Hillary Clinton emails
  – 8K emails
  – Used to study text privacy

Simulating linkage attacks

• De-anonymize with demographics
  – Average case: with 6 attributes, the search range drops from 2.46M to 1K
  – Best case: with 1 attribute, the search range drops to 40
• De-anonymize with speaker identification
  – 100% accurate when identifying a person out of 813 candidates with a 1 min voice sample

De-anonymizing speech data is possible!
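How known attributes shrink the search range can be shown on a toy population. The records below are synthetic and generated only for this example (they are not the census data used in the study), and the attribute names are hypothetical.

```python
from itertools import product


def search_range(population, known):
    """Return the people whose records match every known attribute."""
    return [
        person for person in population
        if all(person.get(key) == value for key, value in known.items())
    ]


# Synthetic population: 5 people for each of the 2 x 3 x 4 attribute combinations.
genders = ["f", "m"]
age_groups = ["18-30", "31-50", "51+"]
states = ["IL", "NC", "CA", "NY"]
population = [
    {"gender": g, "age_group": a, "state": s, "id": f"{g}-{a}-{s}-{i}"}
    for g, a, s in product(genders, age_groups, states)
    for i in range(5)
]

# Attributes an attacker might infer from a voice recording.
for known in (
    {"gender": "f"},
    {"gender": "f", "age_group": "31-50"},
    {"gender": "f", "age_group": "31-50", "state": "NC"},
):
    print(len(known), "attributes ->", len(search_range(population, known)), "candidates")
```

Each additional attribute inferred from the voice narrows the candidate set, which is the effect the linkage simulation measures at census scale.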

Simulating data sanitization

• Metrics
  – Total utility loss
  – Qualification rate
    • Ratio of utterances that satisfy the $p$-leak limit
• Performance depends on the dataset and parameter settings
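The qualification rate is a simple ratio. A minimal sketch, assuming each utterance's total privacy leak after sanitization is already available as a number (the values below are invented):

```python
def qualification_rate(leaks, budget):
    """Fraction of utterances whose privacy leak is within the budget."""
    if not leaks:
        raise ValueError("at least one utterance is required")
    qualified = sum(1 for leak in leaks if leak <= budget)
    return qualified / len(leaks)


# Hypothetical post-sanitization leaks of four utterances.
leaks_after = [0.2, 0.5, 0.9, 1.1]
rate = qualification_rate(leaks_after, budget=1.0)
print(rate)
```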

Summary of contributions

• Quantified privacy leak and utility loss from four aspects each
• Formulated privacy-preserving speech data publishing
  – Minimize utility loss, subject to the $p$-leak limit
• Explored existing speech processing technologies for data sanitization
• Proposed original approaches
  – TF-IDF based speech content sanitization
  – A heuristic algorithm for the optimization problem

Jianwei Qian

https://sites.google.com/view/jqian

Security risks

• Once voiceprint is leaked…
  – Spoofing attacks
    • Pass voice authentication to access the victim’s device
  – Reputation attacks
    • Fabricate voice recordings with indecent or illegal content
  – Fraud
    • Authorize bogus charges on credit cards

Naïve solutions

• Voice shuffling
  – ☹ Targeted voice conversion is immature
    • Needs parallel speech corpora for training
    • Can’t convert fine details of one’s voice
    • Output audio has bad quality
• Replace all speech with machine-generated voice
  – ☹ Speech synthesis is immature, has poor diversity
  – ☹ Ruins the point of publishing speech data

☹ They do not protect speech content

Related works

• Speaker recognition and feature learning
  – Gaussian mixture models (GMMs), SVM
  – Joint factor analysis, i-vector
  – Deep neural network, d-vector
  – Fancier neural networks
    • Convolutional time-delay DNN, ResCNN, GRU, etc.
• Privacy learning from speech
  – Use machine learning to predict demographics
    • Gender, age, ethnicity, birthplaces, personalities, health conditions, and social status
• Privacy-preserving speaker/speech recognition
  – Protect speech data from the service provider
  – Implemented with secure multi-party computation or cryptography
  – ☹ Different problem, focused on GMM-based models (outdated)

Privacy-preserving speech data publishing/sharing was untouched