
Page 1: Privacy Protection for Life-log Video

Privacy Protection for Life-log Video

Jayashri Chaudhari

November 27, 2007

Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40507

Page 2: Privacy Protection for Life-log Video

Outline

Motivation and Background

Proposed Life-Log System

Privacy Protection Methodology
• Face Detection and Blocking
• Voice Segmentation and Distortion

Experimental Results
• Segmentation Algorithm Analysis
• Audio Distortion Analysis

Conclusions

Page 3: Privacy Protection for Life-log Video

What is a Life-Log System?

“A system that records everything, at every moment and everywhere you go”

Applications include
• Law enforcement
• Police questioning
• Tourism
• Medical questioning
• Journalism

Existing systems and work

1) “MyLifeBits” project, Microsoft Research

2) “WearCam” project, Steve Mann, University of Toronto

3) “Cylon Systems” (http://cylonsystems.com), UK: a portable body-worn surveillance system

Page 4: Privacy Protection for Life-log Video

Technical Challenges

Security and Privacy
Information Management and Storage
Information Retrieval
Knowledge Discovery
Human Computer Interface


Page 6: Privacy Protection for Life-log Video

Why Privacy Protection?

Privacy is a fundamental right of every citizen
Emerging technologies threaten that right
There are no clear and uniform rules and regulations regarding video recording
People are resistant toward technologies like life-logging
Without tackling these issues, the deployment of such emerging technologies is impossible

Page 7: Privacy Protection for Life-log Video

Research Contributions

Practical audio-visual privacy protection scheme for life-log systems

Performance measurement (audio) on
• Privacy protection
• Usability

Page 8: Privacy Protection for Life-log Video

Proposed Life-log System

“A system that protects the audiovisual privacy of the persons captured by a portable video recording device”

Page 9: Privacy Protection for Life-log Video

Privacy Protection Scheme

Design Objectives

• Privacy: hide the identity of the subjects being captured

• Privacy versus usefulness: the recording should still convey sufficient information to be useful

The goal is a recording that offers both: not usefulness without privacy, and not privacy without usefulness, but usefulness and privacy together.

Page 10: Privacy Protection for Life-log Video

Design Objectives

Anonymity or Ambiguity
• The scheme should make the identity of the recorded subjects ambiguous
• Every individual will look and sound identical, reducing correlation attacks

Speed
• The protection scheme should work in real time

Interview Scenario
• The producer is speaking with a single subject in a relatively quiet room

Page 11: Privacy Protection for Life-log Video

Privacy Protection Scheme Overview

Audio path: audio → Audio Segmentation → Audio Distortion

Video path: video → Face Detection and Blocking

Both streams → Synchronization & Multiplexing → storage

S: Subject (the person who is being recorded)

P: Producer (the person who is the user of the system)

Page 12: Privacy Protection for Life-log Video

Voice Segmentation and Distortion

For each audio window k, the windowed power Pk is computed and compared against two thresholds, TS and TP. Depending on the outcome, the state is set to Statek = Subject, Statek = Producer, or left unchanged (Statek = Statek-1). Segments attributed to the subject are passed through pitch shifting before storage.
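A minimal sketch of one plausible reading of this state machine, assuming the producer wears the microphone and therefore produces the highest windowed power; the threshold values and the tie-breaking rule are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def segment_audio(x, win_len=1024, t_s=0.01, t_p=0.05):
    """Energy-based two-speaker segmentation sketch.

    Assumed reading of the flow chart: producer speech has the highest
    windowed power, subject speech is quieter, and anything below both
    thresholds keeps the previous state. t_s and t_p are illustrative.
    """
    states, prev = [], "silence"
    for start in range(0, len(x) - win_len + 1, win_len):
        frame = x[start:start + win_len].astype(float)
        p_k = np.mean(frame ** 2)          # windowed power P_k
        if p_k >= t_p:
            state = "producer"
        elif p_k >= t_s:
            state = "subject"
        else:
            state = prev                   # State_k = State_{k-1}
        states.append(state)
        prev = state
    return states
```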

We use the PitchSOLA time-domain pitch shifting method.

* “DAFX: Digital Audio Effects” by Udo Zölzer et al.

Page 13: Privacy Protection for Life-log Video

Pitch Shifting Algorithm

Pitch Shifting (Synchronous Overlap and Add):

Step 1) Time stretching by a factor of α: windows of size N are taken from the input with analysis step size Sa and overlap-added at synthesis step size α·Sa. A maximum-correlation search aligns each window before mixing, reducing discontinuities in phase and pitch.

Step 2) Re-sampling by a factor of 1/α restores the original duration and scales the pitch by a factor of α.
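A rough, self-contained sketch of the two steps in Python/NumPy; the Hann window and the `seek` correlation search radius are my assumptions, and real PitchSOLA implementations (such as the DAFX one cited above) differ in detail.

```python
import numpy as np

def pitch_shift_sola(x, alpha, N=2048, Sa=256, seek=128):
    """Rough PitchSOLA sketch: time-stretch by alpha with synchronized
    overlap-add, then resample by 1/alpha to scale the pitch by alpha."""
    x = np.asarray(x, dtype=float)
    Ss = int(round(alpha * Sa))                 # synthesis step size alpha * Sa
    win = np.hanning(N)
    out = np.zeros(int(len(x) * alpha) + N)
    norm = np.zeros_like(out)                   # window-sum for normalization
    t_in = t_out = 0
    while t_in + N + seek < len(x):
        # Step 1: pick the input offset (within +/- seek) whose frame best
        # correlates with what is already written, to reduce phase/pitch
        # discontinuities, then overlap-add (mixing).
        best_k, best_c = 0, -np.inf
        target = out[t_out:t_out + N]
        if np.any(target):
            for k in range(-min(seek, t_in), seek + 1):
                c = np.dot(x[t_in + k:t_in + k + N], target)
                if c > best_c:
                    best_c, best_k = c, k
        out[t_out:t_out + N] += x[t_in + best_k:t_in + best_k + N] * win
        norm[t_out:t_out + N] += win
        t_in += Sa
        t_out += Ss
    stretched = out[:t_out] / np.maximum(norm[:t_out], 1e-8)
    # Step 2: resample by 1/alpha; duration returns to roughly the original
    # length and the pitch is scaled by alpha.
    idx = np.arange(0, len(stretched) - 1, alpha)
    return np.interp(idx, np.arange(len(stretched)), stretched)
```

With the parameters reported later for Distortion 1, this would be called as pitch_shift_sola(x, alpha=1.5, N=2048, Sa=256).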

Page 14: Privacy Protection for Life-log Video

Face Detection and Blocking

Pipeline: camera → Face Detection → Face Tracking → Subject Selection → Selective Blocking

Face detection is based on Viola & Jones 2001. The audio segmentation results (subject talking vs. producer talking) drive the subject selection step, so that blocking is applied selectively to the subject's face.
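A minimal sketch of Viola-Jones detection plus blocking using OpenCV's bundled Haar cascade; the tracking and audio-driven subject selection stages are omitted, and the camera index is a stand-in for the wearable camera.

```python
import cv2

# Viola-Jones (Haar cascade) face detection with simple blocking.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)                       # stand-in for the wearable camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        frame[y:y + h, x:x + w] = 0             # block the detected face region
    cv2.imshow("protected", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):       # press 'q' to stop
        break
cap.release()
cv2.destroyAllWindows()
```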

Page 15: Privacy Protection for Life-log Video

Initial Experiments1

• Analysis of Segmentation algorithm

• Analysis of Audio distortion algorithm

1) Accuracy in hiding identity

2) Usability after distortion

1: Chaudhari J., S.-C. Cheung, and M. V. Venkatesh. Privacy protection for life-log video. In IEEE Signal Processing Society SAFE 2007: Workshop on Signal Processing Applications for Public Security and Forensics, 2007.

Page 16: Privacy Protection for Life-log Video

Segmentation Experiment

Experimental Data:

• Interview scenario in a quiet meeting room

• Three interview recordings, each about 1 minute 30 seconds long

Each recording contains a sequence of transitions between segments in which the producer is speaking, the subject is speaking, or there is silence.

S: Subject speaking

P: Producer speaking

Page 17: Privacy Protection for Life-log Video

Segmentation Results

Meeting #   Transitions (ground truth)   Correctly identified transitions   Falsely detected transitions   Precision   Recall
1           7                            6                                  10                             0.375       0.857
2           7                            7                                  5                              0.583       1
3           6                            6                                  10                             0.353       1

Recall = (# correctly identified transitions) / (# transitions in ground truth)

Precision = (# correctly identified transitions) / (# identified transitions)
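A small sketch of these two metrics, checked against the first row of the table (6 correct, 10 false detections, 7 ground-truth transitions).

```python
def precision_recall(correct, false_detections, ground_truth):
    """Precision and recall as defined on the slide."""
    identified = correct + false_detections        # all detected transitions
    return correct / identified, correct / ground_truth

# Meeting 1 from the table: precision = 6/16 = 0.375, recall = 6/7 ~ 0.857
print(precision_recall(6, 10, 7))
```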

Page 18: Privacy Protection for Life-log Video

Comparison With CMU Segmentation Algorithm

Meeting #   Our Algorithm (Precision / Recall)   CMU Algorithm (Precision / Recall)
1           0.375 / 0.857                        0.667 / 0.57
2           0.583 / 1                            1 / 0.57
3           0.353 / 1                            0.4 / 0.5

CMU audio segmentation algorithm1 used as benchmark

1:Matthew A. Seigler, Uday Jain, Bhiksha Raj, and Richard M. Stern. Automatic segmentation, classification and clustering of broadcast news audio. In Proceedings of the Ninth Spoken Language Systems Technology Workshop, Harriman, New York, 1997.

Page 19: Privacy Protection for Life-log Video

Speaker Identification Experiment

Experimental Data

• 11 Test subjects, 2 voice samples from each subject

• One voice sample is used as training and the other is used for testing

• Public domain speaker recognition software

Script 1 (used for training): “This script is used for training the speaker recognition software.”

Script 2 (used for testing): “This script is used to test the performance of audio distortion in hiding the identity.”

Page 20: Privacy Protection for Life-log Video

Speaker Identification Results

Person ID    Without Distortion    Distortion 1    Distortion 2    Distortion 3
             (ID identified)       (ID identified) (ID identified) (ID identified)
1            1                     5               8               5
2            2                     6               8               6
3            3                     5               3               5
4            4                     6               6               5
5            5                     3               10              6
6            6                     8               6               5
7            7                     5               2               5
8            8                     10              11              5
9            9                     5               8               5
10           10                    5               2               5
11           11                    4               8               5
Error Rate   0%                    100%            90.9%           100%

Distortion 1: (N=2048, Sa=256, α=1.5)

Distortion 2: (N=2048, Sa=300, α=1.1)

Distortion 3: (N=1024, Sa=128, α=1.5)

Page 21: Privacy Protection for Life-log Video

Usability Experiments

Experimental Data

• 8 subjects, 2 voice samples from each subject

• One voice sample is used without distortion and the other is distorted

• Manual transcription (5 human testers); “---” marks words the testers could not recognize

1.wav (transcription 1): “This transcription is of undistorted voice --- stored in one dot wav file.”

2.wav (transcription 2): “This transcription is of distorted voice sample --- in two dot wav ---.”

Page 22: Privacy Protection for Life-log Video

Usability after distortion

Word Error Rate (WER): the standard measure of word recognition error for a speech recognition system

WER = (S + D + I) / N

S = # substitutions

D = # deletions

I = # insertions

N = # words in the reference sample

Tool used: NIST tool SCLITE
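A minimal word-level edit-distance sketch of the same measure (SCLITE additionally reports the aligned substitution/deletion/insertion breakdown); the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER sketch: word-level edit distance divided by # reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of four reference words -> WER = 0.25
print(word_error_rate("the subject is speaking", "the subject was speaking"))
```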

Page 23: Privacy Protection for Life-log Video

Extended Experiments

Data set
• TIMIT (Texas Instruments / Massachusetts Institute of Technology) Speech Corpus

Experimental Setup
• Allowable range of alpha (α): 0.2-2.0
• Five alpha values (α = 0.5, 0.75, 1, 1.25, 1.40)
• Increase the scope of the experiments
• “Subjective Experiments”: use testers to assess privacy and usability

Privacy Experiments (Speaker Identification)

Page 24: Privacy Protection for Life-log Video

Experimental Setup

TIMIT Corpus: 630 speakers, 10 audio clips per speaker

Our experiments: 30 speakers, 5 audio clips per speaker

Five sets, one per alpha value: Set A (α=1), Set B (α=0.5), Set C (α=0.75), Set D (α=1.25), Set E (α=1.40)

• Total of 30 audio clips in each set
• The audio clips from each set are re-divided into five groups (1-5)
• Each group consists of 6 audio clips randomly selected from each set
• Each group was assigned to three testers, who were asked to perform 3 tasks (a sketch of this grouping appears below)
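A small sketch of how such a grouping could be produced; the clip file names are placeholders and the exact randomization used in the experiments is not described on the slides.

```python
import random

# Hypothetical re-grouping sketch: 30 clips per set (A-E) are shuffled and
# split into five groups of six clips per set; each group is then assigned
# to three testers.
random.seed(0)
groups = {g: {} for g in range(1, 6)}
for set_name in "ABCDE":
    clips = [f"set{set_name}_speaker{i:02d}.wav" for i in range(1, 31)]
    random.shuffle(clips)
    for g in range(1, 6):
        groups[g][set_name] = clips[(g - 1) * 6: g * 6]   # 6 clips from this set

# e.g. the six Set B clips heard by the three testers assigned to group 3
print(groups[3]["B"])
```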

Page 25: Privacy Protection for Life-log Video

Subjective Experiments

Task 1: Transcribe the audio clips in the assigned group.

Purpose: Determine the usability of the recording after distortion.

Results metric:
• WER for each transcription by each tester
• Average WER for each clip from the 3 testers
• WER per speaker for the given alpha (α) value

Page 26: Privacy Protection for Life-log Video

[Chart: effect of distortion on WER — average WER percentage per person (ID 1-30) for Sets A, B, C, D, and E.]

Page 27: Privacy Protection for Life-log Video

[Charts: average WER per speaker for each alpha value — Sets A, C, D, and E plotted per person (ID 1-30); per-panel y-axis ranges 0-30, 0-60, 0-35, and 0-35.]

Page 28: Privacy Protection for Life-log Video

Average WER per Set

Set        A      B      C      D      E
Avg WER    14.2   100    22.4   15.3   14.4

Page 29: Privacy Protection for Life-log Video

Statistical Analysis: Z-test Calculations

Null hypothesis: for a given value of the pitch scaling parameter (alpha), the average WER does not change from Set A (before distortion) after the distortion.

H0: p1 = p2 (null hypothesis), Ha: p1 != p2

Z-Test parameters

Parameter                  Value
Population size            12 × 30 = 360
α (significance level)     0.05
Confidence level           95%
Z critical (|Zα/2|)        1.96
Rule for rejecting H0      Z >= Zα/2 or Z <= -Zα/2

Z-Test results

Comparison            Statistic
Set A and B (0.50)    46.71 >= 1.96
Set A and C (0.75)    2.873 >= 1.96
Set A and D (1.25)    0.419 <= 1.96
Set A and E (1.40)    0.0695 <= 1.96
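An illustrative sketch of a two-proportion z-test consistent with these parameters; treating each set's average WER as an error proportion over 360 words (12 × 30) is my assumption, since the slide does not show the exact counts behind each statistic.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z-test statistic for H0: p1 == p2."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Set A (14.2%) vs. Set C (22.4%) over 360 words gives |z| ~ 2.85, close to
# the 2.873 reported above; reject H0 at the 95% level when |z| >= 1.96.
z = two_proportion_z(0.142, 360, 0.224, 360)
print(abs(z), abs(z) >= 1.96)
```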

Page 30: Privacy Protection for Life-log Video

Subjective Experiments

Task 2: Identify the number of distinct voices in each subset in the assigned group.

Purpose: Estimate the ambiguity created by pitch shifting.

Results: average number of distinct voices per subset (each subset consists of 6 audio clips)

Group     Subset of A   Subset of B   Subset of C   Subset of D   Subset of E
1         6.0           3.33          4.33          4.0           3.33
2         6.0           3.0           3.33          4.0           4.0
3         6.0           2.0           4.0           3.0           4.0
4         6.0           2.67          4.0           3.67          2.67
5         6.0           3.0           3.0           3.67          4.0
Average   6.0           2.75          3.92          3.67          3.50

Page 31: Privacy Protection for Life-log Video

Subjective Experiments

Task 3: For each clip from the subset of Set A (the original, undistorted speech set), identify a clip in the other subsets in which the same speaker may be speaking.

Purpose: Qualitatively measure the assurance of Privacy Protection achieved by distortion

Results: None of the speakers from set A was identified from other distorted sets. (100% Recognition Error Rate)

Page 32: Privacy Protection for Life-log Video

Privacy Experiments

Speaker Identification Experiments

• Speaker recognition tools (LIA_SpkDet and ALIZE)1 by the LIA lab at the University of Avignon

• Speaker verification with GMM-UBM (Gaussian Mixture Model-Universal Background Model)

• Single speaker-independent background model

• Decision: likelihood ratio

1: Bonastre, J.-F., Wild, F., Alize: a free, open tool for speaker recognition, http://www.lia.univ-avignon.fr/heberges/ALIZE/

Likelihood ratio: p(Y | H0) / p(Y | H1)

where H0 is the hypothesis that the test utterance Y comes from the claimed speaker, and H1 that it does not.

Page 33: Privacy Protection for Life-log Video

LIA_RAL Speaker-Det

Front processing:
• Feature extraction (SPRO tool, SPRO4): 32 coefficients = 16 LFCC + 16 derivative coefficients
• Silence frame removal (EnergyDetector)
• Parameter normalization / warping (NormFeat)

World modeling (TrainWorld): trains a world model of 2 GMMs (2048 components each), 1: male, 2: female

Target speaker modeling (TrainTarget): Bayesian adaptation (MAP) of the world model to each target speaker

Speaker detection (ComputeTest): scores the test feature vectors s against the target model T and the world model W using the log-likelihood ratio

LLR(s, T) = log [ l(s | T) / l(s | W) ]
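A toy GMM-UBM scoring sketch with scikit-learn and random stand-in features; the real system uses LIA_RAL with 2048-component models and MAP-adapts the target model from the world model, which this sketch only approximates by training the target model independently.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_feats = rng.normal(size=(5000, 32))            # 32-dim LFCC-like features
target_feats = rng.normal(loc=0.3, size=(500, 32))
test_feats = rng.normal(loc=0.3, size=(200, 32))

ubm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
ubm.fit(ubm_feats)                                  # world (background) model
target = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
target.fit(target_feats)                            # stand-in for the MAP-adapted target model

# LLR(s, T) = log l(s | T) - log l(s | W), averaged over the test frames
llr = target.score(test_feats) - ubm.score(test_feats)
print("accept" if llr > 0.0 else "reject")
```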

Page 34: Privacy Protection for Life-log Video

Experimental Setup

World Model
• Number of male speakers: 325
• Number of female speakers: 135

Target Speaker Model
• Number of male test clips: 20
• Number of female test clips: 10

Two sets of experiments
• Same Model: world model and individual speaker models trained on distorted speech with the corresponding alpha
• Cross Model: world model and individual speaker models trained on undistorted speech

Page 35: Privacy Protection for Life-log Video

Privacy Results

(Numbers in the table are the average rank of the true speaker of each test clip for the corresponding alpha value.)

Set     Sex   Same Model   Cross Model
Set A   M     1.0          1.0
Set A   F     4.4          4.4
Set B   M     2.5          150.75
Set B   F     1.7          57.80
Set C   M     8.65         170.90
Set C   F     5.4          46.40
Set D   M     -            185.75
Set D   F     20.30        67.80
Set E   M     52.05        157.45
Set E   F     29.20        79.80

Conclusions

• Cross Model: distorted speech, no matter what alpha value is used, is very different from the original speech.

• Same Model: Set B and Set C do not provide adequate protection, as the rank is still very near the top.

Page 36: Privacy Protection for Life-log Video

Example Video

Page 37: Privacy Protection for Life-log Video

Conclusions

Proposed a real-time implementation of voice distortion and face blocking for privacy protection in life-log video

• Analysis of audio segmentation
• Analysis of audio distortion for usability
• Analysis of audio distortion for privacy protection

Page 38: Privacy Protection for Life-log Video

Acknowledgment

• Prof. Samson Cheung
• People at the Center for Visualization and Virtual Environments
• Prof. Donohue and Prof. Zhang

Thank you!

Page 40: Privacy Protection for Life-log Video

Voice Distortion

Voice identity
• Vocal tract (formants): filters
• Vocal cords (pitch): excitation source

Different ways to distort audio:
• Random mixture: makes the recording useless
• Voice transformation: more complex, not suitable for real-time applications
• Pitch shifting: changes the pitch of the voice and keeps the recording useful

We use the PitchSOLA time-domain pitch shifting method (“DAFX: Digital Audio Effects” by Udo Zölzer et al.): simple, with low complexity.

Page 41: Privacy Protection for Life-log Video

• Cross Model: world model and individual speaker models trained on undistorted speech

• Same Model: world model and individual speaker models trained on distorted speech