
YOUTRACE: A SMARTPHONE SYSTEM FOR TRACKING VIDEO MODIFICATIONS

By

ELIJAH JS HOULE

A thesis submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

WASHINGTON STATE UNIVERSITY
School of Engineering and Computer Science, Vancouver

MAY 2015


To the Faculty of Washington State University:

The members of the Committee appointed to examine the thesis of

ELIJAH JS HOULE find it satisfactory and recommend that

it be accepted.

Scott Wallace, Ph.D., Chair

Xinghui Zhao, Ph.D.

Sarah Mocas, Ph.D.


ACKNOWLEDGMENTS

I would like to thank both former and present faculty for their support and guidance throughout my time in the program, including Dr. Thanh Dang, Dr. Scott Wallace, Dr. Xinghui Zhao, Dr. Sarah Mocas, and Dr. David Chiu.

I would also like to thank the friends and family who have encouraged me, especially my girlfriend Mahal and our feline companion Tessa, for helping me to manage my time and to balance work with fun.

Lastly, I would like to thank you, the reader, for even looking at this thesis. I truly hope that you gain something from it in return and that your curiosity never ends.


YOUTRACE: A SMARTPHONE SYSTEM FOR TRACKING VIDEO MODIFICATIONS

Abstract

by Elijah JS Houle, M.S.
Washington State University

MAY 2015

Chair: Scott Wallace

As smartphone cameras and processors get better and faster, content creators increasingly use them for both recording and editing videos. On hosting sites, the lack of information about how an upload relates to the original recording complicates the question of whether to trust the content, especially for citizen journalism programs like CNN's iReport. This thesis introduces YouTrace, a system consisting of a trusted Android client (IntegriDroid) that tracks modifications made to videos recorded on the smartphone, and a hosting server that maintains lineage trees for near-duplicate videos from both trusted and untrusted sources.

YouTrace analyzes videos in a non-blind fashion: its core algorithm, video-diff, compares a parent and child video and reports the transformations used to produce the child through a structure called a delta-report. The comparison algorithm and the report structure can detect and record the type and degree of temporal modifications to clips, such as scaling (stretching or shrinking duration) and trimming, as well as spatial modifications to frames, such as scaling, cropping, bordering, color adjustment, and content tampering. We implement the IntegriDroid client prototype on a Galaxy Nexus by building on TaintDroid's file tracking and porting an emulated Trusted Platform Module onto Android 4.3. An evaluation of detection accuracy, speed, and power consumption demonstrates the feasibility of services using a future system built on YouTrace to determine content integrity.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

1 Introduction
  1.1 Motivation
    1.1.1 Organizing a growing source of traffic
    1.1.2 Building integrity to trust content
  1.2 Use Case
  1.3 Problem Statement
  1.4 Thesis Statement
  1.5 Challenges
  1.6 Contributions
  1.7 Thesis Outline

2 Related Works
  2.1 Video Integrity Mechanisms
    2.1.1 Taxonomy
    2.1.2 Watermarking
    2.1.3 Fingerprinting
    2.1.4 Classification and evaluation
  2.2 Trusted Sensing
  2.3 Tracing Lineage
    2.3.1 Data provenance

3 System Architecture
  3.1 Overview
  3.2 Certificate Design
    3.2.1 Trusted configuration
    3.2.2 Delta-report
  3.3 Server Design
    3.3.1 Matching
  3.4 Client Design
    3.4.1 IntegriDroid
  3.5 video-diff — Comparing Original and Derived Videos

4 Evaluation
  4.1 Accuracy
    4.1.1 Spatial scaling
    4.1.2 Spatial cropping
    4.1.3 Block modification (general content tampering)
    4.1.4 Color adjustment
    4.1.5 Border change
    4.1.6 Temporal scaling
    4.1.7 Temporal cropping
  4.2 Speed
    4.2.1 Concurrency
  4.3 IntegriDroid
    4.3.1 Latency
    4.3.2 Power consumption
    4.3.3 Tracking

5 Conclusion
  5.1 Summary
  5.2 Future Work

Appendices
A Classification and Evaluation of Video Authentication Schemes

Bibliography

List of Tables

2.1 Taxonomy for video watermarking techniques.
2.2 Taxonomy of authentication schemes for online video sharing.
2.3 Ratings for selected video authentication schemes.
4.1 T-tests for F-ratio samples of different color components among "lighter", "darker", and "normal" (no filter, border change only) color curve presets.
A.1 Classification of video authentication schemes.
A.2 Evaluation of video authentication schemes.

List of Figures

1.1 Global IP traffic by application category. The Cisco Visual Networking Index (VNI) forecasts Internet video to become the majority of traffic by 2018. "The percentages within parentheses next to the legend denote the relative traffic shares in 2013 and 2018, respectively." [1]
3.1 Design of the overall architecture with both trusted and untrusted clients.
3.2 Delta-report structure for recording modifications.
3.3 Media player with TraceBack button added, which links to the lineage tree for the given video.
3.4 Examples of TraceBack lineage trees generated and stored on the server, tracing lineage for the bottom videos going upward through the parents (denoted by IDs).
3.5 Video upload to server from an untrusted (non-IntegriDroid) client. The server matches the upload to a verified video if possible and describes the modifications made.
3.6 Design of the IntegriDroid client.
3.7 Video alignment of a temporally scaled sequence.
4.1 video-diff accuracy in detecting spatial scaling by 2/3 and 3/2.
4.2 video-diff accuracy in detecting degree of spatial cropping with 50 and 100 pixels.
4.3 Logo inserted into videos to evaluate block modification detection. Copyrighted by Larry Ewing, Simon Budig, Anja Gerwinski. [2]
4.4 video-diff accuracy in detecting temporal scaling by 1/2, 2/3, 1, 10/7, and 2.
4.5 Running time versus number of threads for video-diff's average luminance loop.
4.6 Running time versus number of threads for video-diff's distance matrix computation on longer videos (large matrices).

Dedication

This thesis is dedicated to Mahal, for her constant motivation, inspiration, and patience while I carried out my research.

Chapter 1

Introduction

1.1 Motivation

As video becomes an increasingly important form of media, with major decisions being made based on video content, two issues become clear:

• Many videos on hosting sites are near-duplicates, complicating the ability to quickly distinguish among different versions of a video.

• When presented with a video, users and services cannot attest to how it has been edited since it was recorded, challenging the decision of whether to trust its content.

These issues motivate the work presented in the following subsections.

1.1.1 Organizing a growing source of traffic

With the present popularity of online sharing platforms such as YouTube, video has become a predominant source of information and entertainment on the web. Even excluding videos shared on peer-to-peer networks, video made up 66% of global IP traffic in 2013, a percentage that continues to grow rapidly [3] (Figure 1.1). YouTube alone receives 100 hours of video every minute [4]. However, 27% of the results for popular queries are near-duplicates [5], videos that share most of their content with the most popular result, implying that some of these uploads are redundant and would benefit from an organization scheme.

Major hosting sites already integrate near-duplicate detection into their systems. For example, YouTube's Content ID matches videos and live streams against a database of copyrighted files to automate copyright infringement claims [6]. Recommendation systems likely also use near-duplicate detection to filter out close matches, e.g., to avoid recommending a video as a "related video" to itself. However, these sites would benefit from technology that builds lineage trees, showing how a given video differs from its near-duplicates and, for unoriginal content, what transformations the creator applied to the original to produce it. Users would then have an interface for finding the original videos behind popular content and distinguishing them from derived videos. For example, a user may stumble upon a clip and want to find the original recording to see its context or reference it. Proposed lineage detection systems from the literature, along with their weaknesses, are discussed in Chapter 2.

1.1.2 Building integrity to trust content

In addition, user-generated content has become powerfully influential as smartphones have become ubiquitous. Users can easily record and upload from anywhere, facilitating citizen journalism with videos shared over social media. CNN takes advantage of this with its iReport initiative, crowdsourcing stories that may be difficult for reporters to cover comprehensively, especially unexpected events [7]. However, only some stories are approved for CNN, which requires manual verification. Most videos are never verified, which weakens the usefulness of the service, and those that are verified are checked by humans, who may be tricked by clever, malicious modifications. This and other citizen journalism programs would benefit from an automatic video integrity mechanism to strengthen the utility and trustworthiness of the service.

Figure 1.1: Global IP traffic by application category. The Cisco Visual Networking Index (VNI) forecasts Internet video to become the majority of traffic by 2018. "The percentages within parentheses next to the legend denote the relative traffic shares in 2013 and 2018, respectively." [1]

On any hosting service, users tend to make benign changes to recordings of events before uploading them. For example, users may want to share an event they witnessed after stripping irrelevant content, piecing together clips, or anonymizing faces and objects by covering them through blurring, blocking, or pixelation. Traditional integrity mechanisms do not support benign processing because it changes the video file at a low level. Even mechanisms designed to be robust to some forms of processing, such as compression (discussed in detail in Chapter 2), tend not to support intentional, high-level changes where the meaning of the content is preserved but the video structure changes, and they also cannot report the specific modifications made. People would benefit from a system that lists the modifications made to a video since it was recorded, in order to inform their decision on how much to trust the content.

1.2 Use Case

The increasing ubiquity, camera quality, and video editing power of smartphones lead to a common pattern in content creation:

• On the smartphone:

  – User U1 records and saves original video V.

  – User U1 edits video V in some app and exports it as a new video V′.

  – User U1 uploads the resulting video V′ to a hosting server.

• From the server, some other user U2 downloads video V′.

• User U2 makes changes to video V′ and exports it as a new video V′′.

• User U2 uploads the resulting video V′′ to the hosting server.

We have two problems here. First, how can we track the transformations mapping V → V′ on the smartphone, so that we can decide whether to trust the content of V′ as genuine? (This assumes that the original recording V is trusted, because "analog" attacks, such as non-digital special effects or event fabrication, are outside the scope of this work.) Second, how can we both trace V′′ back to V′ and track the transformations mapping V′ → V′′? This thesis proposes solutions to both questions: first by building a tracking framework for smartphones on Android software and trusted hardware, and then by extending the framework to the server by combining the core modification tracking component with near-duplicate detection.
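The server-side bookkeeping implied by this use case can be pictured as a lineage tree in which each upload points at its parent. The sketch below is illustrative: the node fields and helper names are assumptions for this example, not YouTrace's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VideoNode:
    """One uploaded video in a server-side lineage tree (illustrative fields)."""
    video_id: str
    parent_id: Optional[str]        # None for an original recording
    trusted: bool                   # whether the upload came from a trusted client
    delta: list = field(default_factory=list)  # transformations from the parent

def trace_lineage(nodes, video_id):
    """Walk parent links from a derived video back to the original recording."""
    chain = []
    current = nodes.get(video_id)
    while current is not None:
        chain.append(current.video_id)
        current = nodes.get(current.parent_id) if current.parent_id else None
    return chain

# The use case above: V recorded on-device, V' edited by U1, V'' edited by U2.
nodes = {
    "V":   VideoNode("V", None, trusted=True),
    "V'":  VideoNode("V'", "V", trusted=True, delta=["temporal trim"]),
    "V''": VideoNode("V''", "V'", trusted=False, delta=["spatial crop"]),
}
```

Tracing V′′ then yields the chain V′′ → V′ → V, addressing the second question, while the per-edge delta lists carry the transformations relevant to the first.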

1.3 Problem Statement

This thesis addresses the following problem:


Although users and services would benefit from access to the modification history of a video, the components necessary for tracking modifications currently suffer from restrictive assumptions and a lack of application to video.

1.4 Thesis Statement

Existing systems make restrictive assumptions, requiring all clients to be trusted or considering only an unrealistic set of subtle transformations. As a result, they cannot be applied directly to real domains where content creators apply a series of spatial and temporal transformations to produce a video, on a variety of different devices. Therefore, the following statement guides this thesis:

Combining trusted sensing with computer vision approaches, and applying them to realistic video modifications, could address the shortcomings of each while enabling a complete video tracking framework.

1.5 Challenges

Our goal is to design and develop a video tracking framework to verify our thesis. To achieve this, we need to address the following challenges.

• Android native tracking — To lower the barrier to deployment and let the smartphone user create and edit videos with any app, traditional data tracking will not work. Android taint tracking typically uses the Dalvik Virtual Machine, which can only track Java methods; we need to track how information propagates through native code as well. Section 3.4.1 discusses our proposed solution, vidtracker, an Android file system monitor that approximates video dependencies.

• Modification report structure — Because video carries both spatial and temporal information, a report of modifications needs to capture both types of changes. For our purpose, it needs to describe the degree of modification adequately, concisely, and in a human-readable manner, without explicitly holding each version of the data. Other parties can then use the report to decide whether they trust that the video's content has been preserved. We propose the delta-report, discussed in Section 3.2.2.

• Accuracy and efficiency — The main barrier to deployment is whether a system that tracks modifications can do so accurately while running with low latency on the target device. In Chapter 4, we evaluate our core video-diff code with these parameters in mind.
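To make the report-structure challenge concrete, here is a rough sketch of the shape such a report might take. The field names are invented for illustration; the actual delta-report schema is defined in Section 3.2.2.

```python
# Illustrative delta-report: one list of temporal modifications and one of
# spatial modifications, each entry recording a transformation type and degree.
delta_report = {
    "temporal": [
        {"type": "scale", "factor": 1.5},             # duration stretched by 3/2
        {"type": "trim", "start_s": 4.0, "end_s": 30.0},
    ],
    "spatial": [
        {"type": "scale", "factor": 0.667},           # frames shrunk to 2/3
        {"type": "crop", "pixels": 50},
        {"type": "color", "adjustment": "lighter"},
    ],
}

def summarize(report):
    """Render the report in a concise, human-readable form."""
    lines = []
    for domain in ("temporal", "spatial"):
        for mod in report[domain]:
            detail = ", ".join(f"{k}={v}" for k, v in mod.items() if k != "type")
            lines.append(f"{domain} {mod['type']}: {detail}")
    return lines
```

A consumer of the report sees only the transformations and their degrees, never the intermediate versions of the video data, which is exactly the conciseness requirement stated above.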

1.6 Contributions

This thesis proposes YouTrace, a full video tracking framework that takes advantage of both trusted sensing and computer vision techniques. YouTrace makes it possible for users to trace the history of a video found online back to the device that recorded it. This has applications in journalism, social media and trend analysis, and forensics and law enforcement, among others. Specifically, the YouTrace system consists of:

• A fully functioning tracking framework, including a server and a smartphone client.

• A data structure (the delta-report) and an algorithm (video-diff) for tracking modifications.

• An evaluation of the framework with real video data from YouTube.

We evaluate our system quantitatively by considering accuracy, speed, and power consumption, and qualitatively by analyzing the attack architecture.

1.7 Thesis Outline

The remainder of the thesis is organized as follows. Chapter 2 summarizes related efforts in tracking and organizing data to provide integrity and lineage. Chapter 3 presents an overview of YouTrace's architecture. Chapter 4 evaluates the system. Chapter 5 summarizes and concludes this work. Appendix A contains a survey of video authentication schemes.

Chapter 2

Related Works

2.1 Video Integrity Mechanisms

Authentication schemes for video have been explored considerably in the literature, generally as an extension of image integrity. Unlike traditional file integrity mechanisms (such as checksums), video authentication must compensate for transcoding transformations that preserve the high-level meaning but alter the low-level bits, such as scaling and compression. Depending on the scheme, frame dropping may be treated as benign (due to packet loss) or malicious (due to temporal tampering).

2.1.1 Taxonomy

Generation    Content-Based, Content-Independent
Embedding     Imperceptible (Invisible), Perceptible (Visible)
Robustness    Semi-Fragile, Fragile, Robust
Detection     Blind (Oblivious), Semi-Blind, Non-Blind

Table 2.1: Taxonomy for video watermarking techniques.

Two main paradigms exist for authentication techniques: watermarking and fingerprinting (or perceptual hashing). Both extract some set of features from the video. They differ in that watermarking embeds information based on the features back into the video content by modifying it imperceptibly, while fingerprinting keeps this information as a signature alongside the video (often storing it in a database, like a hash, to preserve the content) [8].

Watermarking can be classified along four dimensions, shown in Table 2.1. Watermarking schemes differ mainly in whether generating the watermark depends on the video content, whether the watermark is embedded visibly, how well it withstands content transformations, and whether detecting it requires the original content.

Table 2.2 shows a general taxonomy that we propose for video authentication schemes. In general, content integrity involves the extraction, encoding, and transmission of features that describe the content.

Transmission Paradigm           Watermarking (Embedded), Fingerprinting (Streamed)
Feature Extraction              Pixel Color Values, Pixel Luminance Values, Transform Coefficients, Key Frames
Feature Compression/Encoding    Quantization, Error Correction Coding, Cryptographic Hashing (MD5, SHA), Cloud Drops, Variable Length Code

Table 2.2: Taxonomy of authentication schemes for online video sharing.

2.1.2 Watermarking

Fragile watermarks verify the integrity of video content by breaking under modification. However, because any modification can break them, they are not robust enough for online video sharing. Robust watermarks, on the other hand, are made to survive attacks, and so they generally do not offer tampering detection except in extreme cases of tampering. In contrast, semi-fragile watermarks break under malicious transformations while remaining robust to acceptable modifications. Likewise, content-independent watermarks are too insecure for this application, being easily extracted and forged [9].

Watermarking techniques also vary in whether the original video data is necessary to detect the watermark. A watermark that can be detected without the original data is considered blind (or oblivious). If the values of the parameters used to generate the watermark are necessary, the detection is semi-blind. If the source data itself is necessary, the detection is non-blind.

One way to track modifications to video content would be to embed a watermark into the original recording and then check whether it has changed upon upload. Some watermarking schemes can localize tampering based on how the watermark is broken in the tampered version. However, many of these schemes only authenticate I-frames or keyframes (usually DCT/DFT coefficients or luminance values), and so they can only detect spatial tampering [10; 11; 12]. Those that can detect temporal tampering use motion vectors or shots as features, which limits robustness [13].

Because our system generates a certificate for each video, and because we are mostly concerned with the series of transformations that produce a video, we do not use watermarking. By treating the video features themselves as the watermark, without embedding one, we do not have to worry about losing track of features when a watermark is broken.

2.1.3 Fingerprinting

Cryptographic hashing algorithms such as SHA generate a unique string for a given data input; small changes in the data result in drastically different hashes. In contrast, video hashing algorithms allow small changes to the content to result in similar feature vectors. Because of inconsistent nomenclature, they are sometimes called "robust hashes", "soft hashes", or "passive fingerprints" by different researchers. A video's digital signature refers to a perceptual hash encrypted with one's private key [14].

We make use of fingerprinting in our system by extracting features from both original and derived videos and comparing the features to discover the series of transformations between them (discussed in detail in Chapter 3).
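The distinction can be illustrated with a toy example: a deliberately simplified block-average "fingerprint" (not any published scheme) stays nearly unchanged when a single luminance value is nudged, while a cryptographic hash changes completely.

```python
import hashlib

def crypto_hash(frames):
    """Cryptographic hash of raw values: any change flips the digest entirely."""
    return hashlib.sha256(bytes(frames)).hexdigest()

def perceptual_features(frames, block=4):
    """Toy perceptual fingerprint: mean luminance over fixed-size blocks."""
    return [sum(frames[i:i + block]) / block for i in range(0, len(frames), block)]

def feature_distance(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

video = [10, 12, 11, 13, 50, 52, 51, 53]     # stand-in per-frame luminance values
edited = [10, 12, 11, 14, 50, 52, 51, 53]    # one value nudged by a benign edit

# Cryptographic hashes differ completely; perceptual features stay close.
assert crypto_hash(video) != crypto_hash(edited)
assert feature_distance(perceptual_features(video), perceptual_features(edited)) < 1
```

This closeness-under-small-change property is what lets a server match a benignly edited upload back to its original, where a checksum comparison would simply fail.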

Scheme — Score

Watermarking based on DCT DC coefficients of 4x4 subblocks of I-frame macroblocks [10] — 12
Watermarking based on compressive sensing, red and blue values of I-frames [11] — 9 (robustness not measured)
Watermarking based on applying error correction coding (ECC) to encode angular radial transformation (ART) coefficients [12] — 12 (but ECC can be exploited)
Watermarking based on cloud model, generated by DCT energy distribution of I-frames in shots [13] — 9
Watermarking based on combining robust and fragile watermarks [15] — 8
Fingerprinting based on radial projection of key frame pixels, denoted Radial hASHing (RASH): variance of pixel luminance values along evenly spaced lines articulated around the center of each keyframe between 0 and 180 degrees [16] — 10
Fingerprinting based on MD5 hash of features obtained from DCT coefficients and secret key for each block of each frame [17] — 12
Fingerprinting based on cryptographic secret sharing: aggregation of keyframes from each shot used as master secret [8] — 12
Hybrid scheme based on "content-fragile watermarking", edge characteristics for each frame [18] — 8
Hybrid scheme based on combining fragile watermarking [19] with digital signature [20]: hash of transform coefficients used as watermark [21] — 6

Table 2.3: Ratings for selected video authentication schemes.

2.1.4 Classification and evaluation

Appendix A contains classification and evaluation of selected watermarking and fingerprint-

ing mechanisms from the literature. In Table 2.3, we summarize the results by assigning a

rating to each based on how well it claims to perform on each of four criteria (Tampering Detection, Tampering Localization, Geometric/Transcoding Robustness, and Loss Tolerance).

For each criterion, the scheme can have a rating of 1, 2, or 3 (for Low, Medium, or High),

giving a total score out of 12.
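The scoring is simple addition over the four criteria; a quick sketch with hypothetical ratings for one scheme:

```python
# hypothetical ratings for one scheme: 1 = Low, 2 = Medium, 3 = High
ratings = {
    "tampering_detection": 3,
    "tampering_localization": 3,
    "geometric_transcoding_robustness": 2,
    "loss_tolerance": 1,
}
score = sum(ratings.values())
assert score == 9
assert 4 <= score <= 12  # each criterion contributes between 1 and 3
```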

2.2 Trusted Sensing

The previous authentication schemes assume that the sender (uploader) is already trusted.

They benefit the sender by ensuring that the receiver obtains the video as sent, whether or not the sender modified it. In order to prevent the sender from manipulating the video before

sending it, systems must leverage trusted hardware.

Gilbert et al. [22] demonstrated a system called “YouProve” for preserving the meaning

of photo and audio content recorded and uploaded from Android smartphones. Using the

TaintDroid framework, YouProve tracks information as it flows through applications and

records modifications. A Trusted Platform Module is used to attest to a hosting service that

the phone is running a trusted configuration with a report of the changes made to content.

However, this system has not been extended to video. In addition, although the system is

built on top of Android (the most popular mobile operating system) to lower the barrier

to deployment, the hosting service relies on the clients running a trusted configuration.

Our system leverages information from both trusted and untrusted clients to build a more

complete picture, further lowering the barrier to deployment.

2.3 Tracing Lineage

Video lineage is related to the problem of near-duplicate detection [5], which extends the corresponding problem for images [23]. However, tracing lineage also requires identifying

parent-child relationships among an item’s near-duplicates, building a directed graph to trace


the item’s history. Presently, the problem of lineage, also called phylogeny or archaeology,

has three main solutions (applied first to images but with possible extension to video):

Given a set of near-duplicates:

1. Process item pairs using a specialized detector for each transformation (e.g., scaling,

cropping), each outputting the direction of derivation. A consensus of results implies

a parent-child relationship [24].

2. For each pair of items, detect a dependency by computing the mutual information

between both their content and their noise [25].

3. Calculate a dissimilarity/distance matrix and use it to build the tree (phylogeny) by

graph-theoretic algorithms [26; 27; 28; 29].
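Solution 3's matrix-to-tree step can be sketched with a greedy Prim-style pass over the dissimilarity matrix; the published algorithms [26; 27; 28; 29] use oriented dissimilarities and more careful reconstruction, so this only illustrates the idea:

```python
def build_phylogeny(dist, root=0):
    """Grow a tree over a dissimilarity matrix by repeatedly attaching the
    unplaced item closest to any already-placed item (Prim-style sketch)."""
    n = len(dist)
    parent = {root: None}
    while len(parent) < n:
        _, p, c = min((dist[p][c], p, c)
                      for p in parent for c in range(n) if c not in parent)
        parent[c] = p
    return parent

# toy matrix: item 1 derived from item 0, item 2 derived from item 1
d = [[0, 2, 5],
     [2, 0, 1],
     [5, 1, 0]]
assert build_phylogeny(d) == {0: None, 1: 0, 2: 1}
```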

Solution 1 relies on the agreement of several detectors, which may not work for subtle

manipulations. Solution 2 and Solution 3, the latter of which has been previously extended

to video [30] as well as large-scale analysis [31], can detect subtle transformations but do

not attempt to describe the transformations involved in the parent-child relationship. They

also do not consider a realistic set of transformations: they include only resampling, cropping, and color adjustment, neglecting both other common spatial transformations and the temporal domain entirely. Thus, our system specifically detects and records the degree of

modification for a set of common spatial and temporal transformations.

Another issue in simply extending image algorithms to video is the common assumption

that the frames in the child (derived video) still line up perfectly with the frames in the parent

(original). Because videos can be modified temporally by changing frame rate and splicing

or trimming clips to remove frames, video solutions need to align clips before applying the

spatial techniques from the image solution. Our system incorporates an alignment step to

match frames, extended from a recent work [32].


Many of these works come from a funded European project called REWIND, short for

“REVerse engineering of audio-VIsual coNtent Data”. The project mainly assumes that

multimedia transformations leave footprints that can be analyzed to trace back the content

modification history without access to the original content [33]. This represents a blind

approach to lineage detection, while we use a non-blind approach, assuming that changes

have been recorded starting with the original content on the device. Despite starting with

different assumptions, both approaches work toward the same goal of modification detection

and can complement each other.

2.3.1 Data provenance

The notion of provenance refers to a history of the entities and processes that produced

a given data object. Recently, semantic web applications have emerged, with the W3C

(World Wide Web Consortium) publishing a family of documents specifying web data prove-

nance [34]. In this model, our system acts as a step toward automating the generation of a

process description in a trustworthy manner.


Chapter 3

System Architecture


Figure 3.1: Design of the overall architecture with both trusted and untrusted clients.


3.1 Overview

Shown in Figure 3.1, the YouTrace architecture consists of a trusted smartphone client (Sec-

tion 3.4) and a hosting server (Section 3.3), with support for untrusted clients. The client

runs a custom system called IntegriDroid, built on top of Android with trusted hardware in-

tegration. The trusted hardware attests to the original integrity of video recordings, and the

system tracks modifications made to these recordings. When the client uploads a video, it

also sends a certificate (Section 3.2) containing a report of modifications along with informa-

tion about the trusted hardware. The server stores the video and the certificate, associating

them together in a database, and then creates a new lineage tree with the video as the root

node. When the server receives a video without a certificate (from an untrusted client), it

looks for a matching video and inserts it at the appropriate place in the match’s lineage tree.

If there is no match, then the server can store the video as an unverified root node. However,

because the video cannot be traced back to its original recording, users may not trust it.

3.2 Certificate Design

A core component of this work is the certificate, a data structure that includes information

about the device that signed it and information about changes made to the associated video

content. The former is accomplished by using a device’s Trusted Platform Module (TPM) to

measure its configuration, which services can then verify as matching a trusted configuration

and a known TPM. We propose the delta-report to record changes made to a video.

Our certificate is adapted from YouProve’s “fidelity certificate” [22] to accommodate a

delta-report rather than their type-specific content analysis results. Thus, the certificate

contains the following fields: digest of the content being uploaded or stored, timestamp of

original recording, delta-report describing how content differs from the original, digest of


the report, boot and system partition digests as hashed by a PCR, TPM’s public key, and

a quote from the TPM that signs the report digest with the value of the PCR using the

private key. (The TPM’s public key is backed by a Certificate Authority, and the private

key is theoretically never exposed.)
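A minimal sketch of assembling these fields follows; the function and field names are illustrative, not the exact on-device format, and the PCR/TPM values are assumed to be supplied by the trusted hardware:

```python
import hashlib
import json
import time

def make_certificate(video: bytes, delta_report: list,
                     pcr_digest: str, tpm_public_key: str,
                     tpm_quote: str) -> dict:
    """Assemble the certificate fields described above (names hypothetical)."""
    report_bytes = json.dumps(delta_report, sort_keys=True).encode()
    return {
        "content_digest": hashlib.sha256(video).hexdigest(),
        "recorded_at": time.time(),
        "delta_report": delta_report,
        "report_digest": hashlib.sha256(report_bytes).hexdigest(),
        "pcr_digest": pcr_digest,          # boot + system partition measurements
        "tpm_public_key": tpm_public_key,  # backed by a Certificate Authority
        "tpm_quote": tpm_quote,            # signs report digest with PCR value
    }

cert = make_certificate(b"...video bytes...", [], "pcr0", "pubkey", "quote")
assert cert["content_digest"] == hashlib.sha256(b"...video bytes...").hexdigest()
```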

3.2.1 Trusted configuration

Because it is becoming increasingly easy to falsify videos on smartphones, and to modify the operating system to undermine the trustworthiness of a system that tracks changes, trusted

hardware attests that both the sensor readings from the camera and the content-tracking

system are relatively unchanged. In our system, this takes the form of a Trusted Platform

Module, a cryptographic processor with the following capabilities:

• Remote attestation: attesting to the software running on the platform to a remote

entity.

• Sealed storage: allowing data access only when the system is in a trusted configuration.

Both issues involve measuring the system configuration. The TPM uses Platform Con-

figuration Registers (PCRs) to accomplish this. PCRs are a set of stored hash chains to

which values can be concatenated and hashed using the extend command, but not directly

overwritten. Then, a chain of system operations is easy to verify, via the quote command

that reports PCR values signed with the TPM’s key, but infeasible to forge.
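The extend-and-replay mechanism can be sketched as follows; this simplifies the actual TPM semantics (real PCR extend operates on fixed-size digests inside the chip) but shows why the chain is verifiable yet infeasible to forge:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """PCR extend sketch: new = H(old || H(measurement)).
    The register can only be extended, never directly overwritten."""
    return hashlib.sha1(pcr + hashlib.sha1(measurement).digest()).digest()

boot_chain = [b"bootloader", b"kernel", b"system-partition"]

pcr = bytes(20)  # PCRs start zeroed at boot
for stage in boot_chain:
    pcr = extend(pcr, stage)

# a verifier replaying the same measurements reproduces the value...
replay = bytes(20)
for stage in boot_chain:
    replay = extend(replay, stage)
assert replay == pcr

# ...while any altered stage yields a different final value
bad = extend(extend(bytes(20), b"bootloader"), b"modified-kernel")
assert extend(bad, b"system-partition") != pcr
```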

Although no commercial smartphones presently have a TPM, they will likely incorporate

these capabilities in the future. The Trusted Computing Group has been working on inte-

grating TPM features into the currently used Trusted Execution Environment [35], which

has isolated execution (for critical applications) and secure boot functionality. For this work,

we emulate the TPM on an Android device, though future work may leverage real hardware

with equivalent capabilities.


3.2.2 Delta-report

Changes made to a video can be either spatial or temporal in nature, and so an exhaustive

report needs to capture both types. Because transformations are usually performed at the

“clip” level (on sequences of frames rather than on individual frames), the report is a list of

entries describing clips. We propose the delta-report, illustrated in Figure 3.2. Each entry

consists of the clip’s endpoints (indices of the first and last frame) in the child video, the

matching clip’s endpoints in the parent video, and a list of transformations. The trans-

formation structure describes the type of transformation as a string, with the position (for

spatial transformations, given as the 2-dimensional coordinates of the top-left corner) and

the degree of transformation (length or factor of transformation along each axis). Although

some applications may want greater distinction for a given transformation, this model is

generic enough to describe all of the transformations with which we concern ourselves, with

some fields being unused by certain transformations.

The following transformations are considered in this work, with the delta-report being

generated by video-diff (Section 3.5):

• Spatial - scaling, cropping, block modification (general content tampering), color ad-

justment, grayscale, bordering

• Temporal - scaling (frame rate change or stretching), cropping (trimming)
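The structure in Figure 3.2 can be sketched directly; the class and field names mirror the description above but are our own illustration, not the serialized format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transformation:
    type: str                    # e.g. "scale", "crop", "grayscale"
    position: Tuple[int, int]    # top-left corner for spatial ops (else unused)
    degree: Tuple[float, float]  # length or factor along each axis

@dataclass
class Clip:
    endpoints: Tuple[int, int]         # first and last frame in the child video
    parent_endpoints: Tuple[int, int]  # matching clip's endpoints in the parent
    transformations: List[Transformation]

# a delta-report is a list of clips, e.g. one trimmed, cropped clip:
report = [Clip((0, 299), (30, 329),
               [Transformation("crop", (0, 0), (50.0, 0.0))])]
assert report[0].transformations[0].type == "crop"
```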

3.3 Server Design

The YouTrace server hosts videos with their certificates, maintaining lineage trees for all

videos. To develop this, we start with the MediaDrop project, an open source video hosting

platform, and add lineage functionality. The only change to the frontend consists of a



Figure 3.2: Delta-report structure for recording modifications.

“TraceBack” button on the video player (Figure 3.3), which takes the user to the lineage

tree for the current video (Figure 3.4). Building the lineage tree requires first finding matches

for a video, as described in the following section.

3.3.1 Matching

When a video is uploaded, feature extraction1 occurs as follows, derived from a process

described in a previous work [36]:

1. Obtain keyframes.

(a) Get one keyframe per shot. Shot segmentation is done using color correlation: for

each I-frame, a histogram is computed for each of the three channels (the colors

blue, green, and red). The histograms of consecutive I-frames are then compared

channel by channel, and the minimum correlation is taken. If this value is less

than some lower limit S for shot correlation, then the I-frame is returned as the

next keyframe, because it belongs to a different shot.

1 Here we use color auto-correlograms, but implementations can use different features for near-duplicate detection as desired.


Figure 3.3: Media player with TraceBack button added, which links to the lineage tree for the given video.


Figure 3.4: Examples of TraceBack lineage trees generated and stored on the server, tracing lineage for the bottom videos going upward through the parents (denoted by IDs).


(b) Discard blanks. A keyframe is discarded if its number of detected keypoints is

less than some threshold B.

2. Preprocess keyframes.

(a) Denoise frames using a median filter.

(b) Remove borders. The border color is inferred from the top-left pixel and used to

trim solid strips of that color from top and bottom, left and right.

(c) Normalize aspect ratio. Frames are resized to some standard width and height

(W,H).

(d) Equalize histogram for the value/luminance channel (V in HSV, Y in YCbCr).

3. Extract the feature for each frame.

(a) Convert to the HSV color space, and quantize to 166 dimensions (18 hues × 3

saturations × 3 values + 4 grays).2

(b) Mask a square region of C2 pixels off each corner to focus on the central portion

of the frame for robustness.

(c) Extract the color auto-correlogram from both the central horizontal and vertical

strips to obtain a 332-dimensional feature. The color auto-correlogram is a 166-

dimensional vector v; for each quantized color c ∈ [1, 166], vc is the probability of

a pixel of color c having a neighbor (within some distance D) of color c.

In our implementation, the following values are used: S = 0.7, B = 50, (W,H) =

(320, 180), C = 20, D = 7.
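Step 1(a)'s color correlation can be sketched with a plain Pearson correlation over per-channel histograms (mirroring OpenCV's HISTCMP_CORREL measure; the helper names are ours):

```python
def correlation(h1, h2):
    # Pearson correlation between two equal-length histograms
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = (sum((a - m1) ** 2 for a in h1) *
           sum((b - m2) ** 2 for b in h2)) ** 0.5
    return num / den if den else 1.0

def is_shot_boundary(prev_hists, next_hists, S=0.7):
    """prev/next: per-channel (B, G, R) histograms of consecutive I-frames.
    A minimum channel correlation below S marks a new shot."""
    return min(correlation(a, b) for a, b in zip(prev_hists, next_hists)) < S

same = [[8, 4, 2, 1]] * 3
reversed_dist = [[1, 2, 4, 8]] * 3
assert not is_shot_boundary(same, same)       # same shot
assert is_shot_boundary(same, reversed_dist)  # abrupt color change
```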

The features for all keyframes are saved to an HDF5 dataset. A searcher daemon then

adds the features to its index in memory and returns similar features from other videos, using

2 In our implementation, grays are defined as saturation or value lower than 10%, but if value is higher than 80%, then saturation must be lower than 5%.



Figure 3.5: Video upload to server from an untrusted (non-IntegriDroid) client. The server matches the upload to a verified video if possible and describes the modifications made.

FLANN for indexing and searching.3 For uploads from untrusted clients, these matches are

used as potential parents in our model. Each matching parent is then associated in a database

with the upload, and the match is annotated with a delta-report generated by a server-side

video-diff, described in Section 3.5. The lineage tree can be constructed by following relations

in the database. Figure 3.5 illustrates this process.

As long as the original recording has come from a trusted device, untrusted clients can

modify the video to produce different versions and even remix the different versions, because

this chain of trust will be retained. When a trusted client uploads a video, it sends its

own certificate describing changes that have been made to the video since recording. If

the certificate can be verified (by matching the public key of the TPM to a known one and

verifying the signature), then the video is added as a root node, because the client has proven

that the video originated from a recording on its device.

3 Each feature in our implementation consists of 332 32-bit floats, taking up about 1.328 KB. Assuming that a video has one keyframe per second, 209.17 hours of video could be represented by 1 GB of features.
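The footnote's storage estimate can be checked directly:

```python
feature_bytes = 332 * 4                # 332 32-bit floats -> 1328 bytes
bytes_per_hour = 3600 * feature_bytes  # one keyframe per second
hours_per_gb = 10**9 / bytes_per_hour
assert feature_bytes == 1328           # about 1.328 KB per keyframe
assert round(hours_per_gb, 2) == 209.17
```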


3.4 Client Design

Extending the idea of YouProve [22] to video, the trusted client is a smartphone that relies on

software built on top of Android and hardware integration with a Trusted Platform Module.

As stated above (Section 3.2.1), the TPM is emulated for this prototype.

We use TaintDroid on Android 4.3 r1 for data tracking, build our system on top of it,

and run it on a Galaxy Nexus. We describe our system, IntegriDroid, in the next section.

Figure 3.6 shows the high-level design.

3.4.1 IntegriDroid

To track how a video changes from its source, we add IntegriDroid into the Android frame-

work as a new class that logs video recordings. The media recorder prepares by generating

a random 32-bit integer as the taint tag, used by TaintDroid for tracking the content both

on the filesystem and as it propagates through apps. When the recorder stops, it calls In-

tegriDroid’s log method on the video’s filename and the taint tag. IntegriDroid then inserts

these as well as the following into a centralized database on the device: current timestamp,

a digest of the video data, and a signature of all the fields by the TPM sealed with the first

PCR’s value (having been extended with measurements of the platform’s boot and system

partitions). This allows the device to attest that the recording is original and not digitally

falsified (assuming that the camera hardware is tamper-resistant).

vidtracker — File system monitor

Because TaintDroid is built into the Dalvik Virtual Machine, it can only interpose on I/O

by Java apps. Native apps (written in C++ for example) are not run by the VM, and so

TaintDroid fails to track data flowing through such apps. While developing IntegriDroid,

we discovered that video editors are often implemented natively for performance. To work



Figure 3.6: Design of the IntegriDroid client.

around this issue, we developed a service, vidtracker, that watches the filesystem for processes

reading tainted video files. If a process with a given UID reads a tainted video file and a

process with the same UID writes a video file, the written file is checked to see if its video’s

keyframes match the tainted video’s keyframes. (The best match can be used if the process

has read multiple matching files.) If it does, then vidtracker calls the video-diff code on the

tainted file (assumed to be the parent) and the new one (the child) to report on modifications.

The derived video is then tainted and associated with the parent in the database along with

a report. Later, when a certificate is requested for the derived video, it inherits its parent’s

report in addition to its own, which, if the parent is also a derived video, inherits its parent’s

and so on until getting back to the original recording. This way, the final report constructs

the entire chain of modifications, tracing the final video to the original recording.
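vidtracker's matching rule can be sketched as a pair of filesystem event handlers; the names are ours, and the keyframe-similarity helper is assumed to exist, so this only illustrates the UID-based read/write association:

```python
# uid -> set of tainted video files read by processes with that UID
tainted_reads = {}

def on_read(uid, path, is_tainted):
    """Record that a process with this UID has read a tainted video."""
    if is_tainted:
        tainted_reads.setdefault(uid, set()).add(path)

def on_write(uid, path, keyframe_similarity):
    """keyframe_similarity(parent, child) -> score; an assumed helper.
    Returns the likely parent, or None if no tainted input was read."""
    candidates = tainted_reads.get(uid, set())
    if not candidates:
        return None
    # the best match becomes the parent; the caller then runs
    # video-diff(parent, path), taints the child, and stores the report
    return max(candidates, key=lambda p: keyframe_similarity(p, path))

on_read(1000, "a.mp4", True)
on_read(1000, "b.mp4", True)
sim = lambda parent, child: {"a.mp4": 0.9, "b.mp4": 0.2}[parent]
assert on_write(1000, "out.mp4", sim) == "a.mp4"
assert on_write(2000, "other.mp4", sim) is None
```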

3.5 video-diff — Comparing Original and Derived Videos

On the client side, when an app generates tainted video output, as detected by TaintDroid

or vidtracker, IntegriDroid compares the new (derived) video to the parent using video-diff.


(a) Distance matrix. The vertical axis corresponds to successive blocks of frames in the parent video, while the horizontal axis corresponds to those in the child video, which is the parent video temporally scaled to 70%. A point's value falls in the range 0-64, being the Hamming distance between the 3D-DCT hashes of the two blocks of 64 frames each. Darker points represent lower values, where blocks match more.

(b) Distance matrix after binarizing. Points with values greater than τ = 16 are discarded. (16 is 25% of 64, so resulting blocks have hashes that match at least 75%.)

(c) Distance matrix after binarizing and applying morphological opening. This erodes spurious values and dilates larger matches, though outliers still occur where the video has similar segments repeating. (Outliers become insignificant during clustering, but implementations could also use RANSAC [37] to filter them out.) Unfortunately, thin components, in which surrounding frames do not match, erode into gaps. The kernel size must be chosen carefully for the application in order to minimize this case. After clustering the resulting points and fitting lines over distance minima in the clusters, the sequences are aligned.

Figure 3.7: Video alignment of a temporally scaled sequence.


On the server side, uploads from untrusted clients are matched to known, trusted contents

and compared likewise, using the matches as parents. Shown in Figure 3.7, the comparison

algorithm starts by aligning corresponding frames between the two videos to tolerate and

detect temporal transformations, leveraging a published technique [32]. Then, corresponding

keyframes are compared to discover spatial transformations. The algorithm goes as follows:

1. Align videos by frames.

(a) Split each video into overlapping blocks of 64 frames (i.e., the first block is frames

0-63, the second block is frames 1-64).

(b) Get the 3D-DCT hash of each block. Given the 64 3D-DCT coefficients for the

block indexed between (1, 1, 1) and (4, 4, 4) inclusively (where (0, 0, 0) is the DC

coefficient), set the corresponding bit in the hash to 1 if the coefficient is greater

than the median value, and 0 otherwise.

(c) Construct a distance matrix, where each (i, j) is the Hamming distance between

the hashes of block i of the parent video and block j of the child.

(d) Binarize the distance matrix using some threshold τ (values less than or equal to

τ are set to 1, otherwise 0).

(e) Erode and dilate the binarized matrix to discard outliers, using square kernels

(structuring elements) of sizes E2 and D2, respectively.

(f) Find distance minima for rows and columns of the binarized matrix.

(g) Cluster the minima using k-means, starting with some predefined number of clus-

ters K and going down to 1 cluster iteratively. For each iteration, fit a line

through each cluster, and take the mean squared error between the line and the

minima to determine individual cluster quality. Take the mean of all clusters’


MSEs (MMSE) for total cluster quality. Choose the k-clustering with the lowest

MMSE and its fitted lines as the clips.

(h) Fine-tune the alignment for each clip by computing the average luminance value

for each frame in the clip in each video. Take the differences in this value between

successive frames to derive a signal for each video’s version of the clip. Cross-

correlate the signals from the two videos and shift the clip accordingly.

2. Detect temporal transformations.

(a) For each line (matching clip), use its slope to estimate temporal scaling.

(b) Use the number of blocks in the parent video without matches in the child to

estimate temporal cropping.

3. Detect spatial transformations.4 For each matching clip, select keyframes in the parent

(using the same color correlation as the server side, as described in Section 3.3.1), and

compare to the corresponding frame in the child:

(a) Check for borders in the child frame that do not exist in the parent frame; if

they are present, trim the second until it matches the first. Use this difference to

estimate degree of bordering modification.

(b) Detect keypoints using a standard algorithm like ORB or SURF5; crop both

frames to minimum bounding box around keypoints. Estimate cropping factor by

how much the parent needs to be cropped compared to the child (relative size of

keypoint structure to frame size).

4 Although we did not consider block artifacting due to lossy compression, it can be estimated by dividing each matching frame of the second video into a grid of 8 × 8 blocks (assuming standard encoding) and comparing histograms of pixel differences within and outside blocks [38].

5 Keypoint structures are used because we are concerned with transformations on the matching content itself. The frames may also contain non-matching content that we want to ignore.


(c) Estimate scaling by the relative size of the child’s keypoint structure compared

to the parent’s.

(d) Estimate color adjustment by comparing the standard deviations of both frames’

color channels. This can be done using a simple variant of the F-ratio: dividing the

larger standard deviation by the smaller one, and comparing it to some critical

value. If they differ greatly, check whether the second video’s frame has near-

equivalent mean and deviation for all color channels, which indicates a likely

grayscale operation.

(e) Estimate general content tampering in each block. Because the frame and key-

point alignment both have some error, a pixelwise comparison between frames

would be unsuitable. Divide each frame's matching region (from cropping to keypoints above) into a grid of square blocks of size N × N and compute the

structural similarity (SSIM) between corresponding blocks. Similar to YouProve’s

block comparison [22], but with a different metric, we consider the center region

12% smaller in each direction from the first frame’s block and compare it to each

equally sized subblock in the second frame’s block.

In our implementation, the following values are used: τ = 16, E = 8, D = 50, K = 5,

N = 128.
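Steps 1(b) through 1(d) of the alignment can be sketched as follows; the median convention for the even-length coefficient list and the helper names are ours:

```python
def block_hash(coeffs):
    """64-bit hash of a 64-frame block from its 64 low-frequency 3D-DCT
    coefficients (step 1b): bit i is 1 iff coefficient i exceeds the median."""
    med = sorted(coeffs)[len(coeffs) // 2]  # one convention for the median
    return sum(1 << i for i, c in enumerate(coeffs) if c > med)

def hamming(a, b):
    return bin(a ^ b).count("1")

def binarized_distances(parent_hashes, child_hashes, tau=16):
    """Steps 1c-1d: Hamming distance matrix, keeping only close matches."""
    return [[1 if hamming(p, c) <= tau else 0 for c in child_hashes]
            for p in parent_hashes]

# identical blocks hash identically, so their match survives binarization
h = block_hash(list(range(64)))
assert binarized_distances([h], [h]) == [[1]]
# a heavily perturbed block (coefficients reversed) does not
g = block_hash(list(range(63, -1, -1)))
assert hamming(h, g) > 16
assert binarized_distances([h], [g]) == [[0]]
```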


Chapter 4

Evaluation

4.1 Accuracy

First we evaluate the accuracy of the core video-diff code, in its ability to detect the type

and degree of specific transformations. In general, detection performs well but returns false

positives (or falsely higher degrees) for spatial modifications due to the imprecision in the

alignment step. To evaluate video-diff ’s performance on a given transformation, we download

a set of popular videos from YouTube (with content varying from trailers to vlogs), apply

the transformation using ffmpeg (version 2.5.git) for each video, and run video-diff on the

original and transformed video.

4.1.1 Spatial scaling

The spatial scaling detector compares the sizes of the bounding rectangles around matching

keypoints. In this evaluation, each frame in the parent video is scaled by some factor for

both width and height to produce the child video. Figure 4.1 shows the results for scaling

by 2/3 and 3/2. The estimated values have low error.
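The detector's core comparison can be sketched from matched keypoint bounding boxes (helper names are ours; real keypoints come from a detector such as ORB or SURF):

```python
def bbox_size(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)

def scale_factor(parent_keypoints, child_keypoints):
    """Per-axis scaling estimated from matched keypoint bounding boxes."""
    pw, ph = bbox_size(parent_keypoints)
    cw, ch = bbox_size(child_keypoints)
    return cw / pw, ch / ph

parent = [(0, 0), (300, 0), (300, 180)]
child = [(0, 0), (200, 0), (200, 120)]  # the same scene scaled by 2/3
sx, sy = scale_factor(parent, child)
assert abs(sx - 2 / 3) < 1e-9 and abs(sy - 2 / 3) < 1e-9
```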


(a) Plot of detected versus actual factor of spatial scaling, with the ideal line for reference.

(b)
Actual factor | Detected average | Std. dev. | Keyframes | Mean absolute error
0.67 | 0.66 | 0.02 | 8 | 0.01
1.50 | 1.52 | 0.10 | 40 | 0.04

Figure 4.1: video-diff accuracy in detecting spatial scaling by 2/3 and 3/2.


4.1.2 Spatial cropping

The spatial cropping detector works by matching keypoints in corresponding frames, and

then cropping the frames to bounding rectangles around the keypoints. To evaluate for

each video, a number of pixels is cropped from the bottom and right sides of each frame to

produce the child video. The detector returns a number for each axis, making the number

of samples twice the number of keyframes. Figure 4.2 shows the results for cropping by

50 and 100 pixels. The detector tends to underestimate, with the range of detected values

increasing for the higher actual value. It also performs better on some videos than others,

as demonstrated by the low differences between the standard deviations and mean absolute

errors. As with other detectors that work on matching regions, the observed errors here are

due to imprecision of frame alignment.

4.1.3 Block modification (general content tampering)

The block modification detector detects content tampering by computing the structural

similarity (SSIM) of blocks in matching keypoint regions between aligned frames. However,

this makes the detector sensitive to misalignment, for both false positives and false negatives.

Therefore, rather than evaluating on subtle manipulations, we take a logo of size 207x240

(Figure 4.3) and add it to the first 30 seconds of each video at position (256, 256). video-diff detects the modification in only 11 out of 21 videos. For comparison, when running

video-diff on videos with borders added (but no other transformations) and their parents,

false positives for the detector occur in 10 out of 83 videos. This gives a true positive rate

of 52.38% and a true negative rate of 87.95%, meaning that the detector is biased toward

returning negative. This is likely because tampered frames are not recognized as matches to

the pre-tampered parents. In order for this detector to be more effective, frame alignment

accuracy must improve.
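To make the block comparison concrete, here is a minimal sketch, not the OpenCV-based video-diff code: it uses a single-window SSIM over whole blocks (rather than the usual sliding Gaussian window), and the function names, the 16-pixel block size, and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np

def ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM between two equally sized grayscale blocks."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def tampered_blocks(parent, child, block=16, threshold=0.9):
    """Return (row, col) indices of blocks whose SSIM falls below threshold."""
    h, w = parent.shape
    hits = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            if ssim(parent[r:r + block, c:c + block],
                    child[r:r + block, c:c + block]) < threshold:
                hits.append((r // block, c // block))
    return hits
```

Identical blocks score exactly 1.0, while a block overwritten by a logo or solid patch scores far below the threshold, so only the modified region is flagged.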


(a) [Plot of detected versus actual number of cropped pixels, showing the detected points against the ideal line.]

(b)
Actual pixels   Detected (avg.)   Std. dev.   Keyframes   Mean absolute error
50              24.79             33.70       163         28.56
100             47.44             67.59       174         61.73

Figure 4.2: video-diff accuracy in detecting degree of spatial cropping with 50 and 100 pixels.

Figure 4.3: Logo inserted into videos to evaluate block modification detection. Copyrighted by Larry Ewing, Simon Budig, Anja Gerwinski. [2]


4.1.4 Color adjustment

To study whether a simple F-ratio can be used to determine color adjustment, we take sample values from 23 keyframes of videos with the "lighter" curve preset filter, 19 keyframes with "darker", and 48 with no color adjustment (only borders changed). Each sample contains a red and a blue component, where the standard deviations for that color are taken from the parent and child frames, and the F-ratio is the greater standard deviation divided by the lesser.

Lighter red (i.e., the red component of the "lighter" color transformation) and lighter blue both give an average F-ratio of 1.08 for their respective color channels, as does darker red, while darker blue gives an average F-ratio of 1.12. The "normal" averages (no color adjustment) are 1.03 for red and 1.04 for blue.

To test whether these last two means differ significantly from the others (and thus whether there is a basis for our claim), we conduct an independent two-sample t-test between each pair; the results are shown in Table 4.1. The p-values computed between the normal components and the color-adjusted components are less than 1%, indicating that their means differ significantly, while the p-values between the two color-adjusted components are high, so we fail to reject the null hypothesis that they are statistically equivalent. These results demonstrate that applications can use this simple F-test to determine color adjustment by setting the critical value strictly between 1.04 and 1.08 (based on the average F-ratios above).
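The resulting decision rule can be sketched as follows. This is an illustrative implementation, not the thesis code: the function names are assumptions, and the critical value of 1.06 is one arbitrary choice from the open interval (1.04, 1.08) identified above.

```python
import numpy as np

def f_ratio(parent_channel, child_channel):
    """F-ratio of per-channel standard deviations (larger over smaller)."""
    s1, s2 = np.std(parent_channel), np.std(child_channel)
    return max(s1, s2) / min(s1, s2)

def color_adjusted(parent_frame, child_frame, critical=1.06):
    """Flag color adjustment when the red or blue channel's F-ratio
    exceeds the critical value (chosen between 1.04 and 1.08)."""
    # Frames are assumed to be HxWx3 RGB arrays: channel 0 = red, 2 = blue.
    return any(f_ratio(parent_frame[..., c], child_frame[..., c]) > critical
               for c in (0, 2))
```

A curve filter that stretches or compresses pixel intensities changes the channel standard deviation, pushing the ratio above the critical value, while an unmodified child keeps it near 1.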

4.1.5 Border change

The border change detector is simple enough to achieve a 100% true positive rate on 37 videos with borders added. False positives on other videos generally have a low degree and are likely due to misalignments.


                Lighter                    Darker
                t-statistic   p-value      t-statistic   p-value
Normal red      3.2519        0.0018       3.4486        0.0010
Normal blue     2.7873        0.0069       3.2394        0.0019
Lighter red                                -0.3287       0.7441
Lighter blue                               -1.0909       0.2818

Table 4.1: t-tests for F-ratio samples of different color components among "lighter", "darker", and "normal" (no filter, border change only) color curve presets.

4.1.6 Temporal scaling

The temporal scaling detector simply measures the slope of the fitted line through each

matching clip of the videos, obtained from clustering their distance matrix. Figure 4.4

shows the results for scaling the video by 1/2, 2/3, 1 (from other transformations), 10/7,

and 2. Detected factors tend to stay close to the actual factors with low standard deviations.
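The slope measurement reduces to a least-squares line fit over the matched frame indices of one clip. The sketch below is an illustration under that assumption (the function name is hypothetical): if the child clip is temporally scaled by a factor f, a child frame index grows as roughly f times the corresponding parent index, so the fitted slope recovers f.

```python
import numpy as np

def temporal_scale(parent_idx, child_idx):
    """Detected temporal scaling factor of one matched clip: the slope of
    the least-squares line through (parent frame index, child frame index)
    pairs obtained from clustering the distance matrix."""
    slope, _intercept = np.polyfit(parent_idx, child_idx, 1)
    return slope
```

For a clip slowed to half speed (factor 2) the child indices advance twice as fast as the parent's, and the fitted slope is 2; speeding up by 2 gives a slope of 0.5.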

4.1.7 Temporal cropping

The temporal cropping detector measures the lengths of clips in the parent that have no matching clips in the child (found through clustering and line fitting). For 37 videos that are spatially scaled with borders added^1, on average 15.56% of total parent time is falsely detected as cropped out of the child. This amount indicates a small lack of accuracy in the video alignment, likely where either DCT hashing fails or true matches in the distance matrix are eroded.

^1 Videos are spatially scaled to 720x432 and then have black borders added to the top and bottom for a final size of 720x576. Almost all videos start at 1280x720, with three exceptions out of 37: 1280x534, 640x358, and 512x288.
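The cropped-time measurement amounts to interval bookkeeping over the parent timeline. The sketch below is a simplified illustration (the function name and the (start, end)-pair representation of matched clips are assumptions): merge the parent-side spans of all matched clips and report the uncovered fraction of the parent's duration.

```python
def cropped_fraction(matched_spans, parent_duration):
    """Fraction of parent time with no matching clip in the child.

    `matched_spans` holds (start, end) pairs, in seconds, of the
    parent-side spans that clustering matched to clips in the child.
    Overlapping spans are merged so covered time is not double-counted.
    """
    covered = 0.0
    cursor = 0.0  # end of the last counted span
    for start, end in sorted(matched_spans):
        start = max(start, cursor)
        if end > start:
            covered += end - start
            cursor = end
    return 1.0 - covered / parent_duration
```

A perfectly aligned pair of identical videos covers the whole parent timeline and reports 0% cropped; the 15.56% figure above corresponds to matched spans covering only about 84% of the parent.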


(a) [Plot of detected versus actual factor of temporal scaling, showing the detected points against the ideal line.]

(b)
Actual factor   Detected (avg.)   Std. dev.   Clips
0.500           0.523             0.069       9
0.667           0.666             0.001       7
1.000           1.008             0.055       162
1.429           1.426             0.007       11
2.000           2.050             0.055       5

Figure 4.4: video-diff accuracy in detecting temporal scaling by 1/2, 2/3, 1, 10/7, and 2.


4.2 Speed

In this section, we evaluate the speed of the server-side video-diff on a quad-core Intel Core i5-2400S. (IntegriDroid client latency is evaluated in the next section.)

3D-DCT hashing operates on blocks of 64 frames each. Before the DCT is computed, each incoming frame is read, converted to grayscale, and resized to 32x32, so the number of blocks that can be hashed per second depends on the original frame size. Videos of size 1280x720 hash at 228.51 blocks per second on average, while videos of size 720x576 hash at 333.41 blocks per second. Note that these blocks overlap, so if the video plays at 30 frames per second, hashing takes the video's duration divided by 7.617 in the first case and by 11.11 in the second.

Applying a morphological opening (a dilation of an erosion), finding distance minima, clustering, and cross-correlation (after the costly average luminance computation) together take practically negligible time, at most 4 seconds for all operations on a 9-minute video. On the other hand, computing histograms for keyframe selection runs at only 23.71 frames per second on average, which is slower than real time.

Two functions can easily be made to run in parallel because they lack interdependencies: the mean luminance computation loop and the distance matrix computation, explored in the next section.

4.2.1 Concurrency

A main bottleneck in video-diff is the loop in which the mean luminance is computed for a series of frames. For each video sequence, this produces a one-dimensional signal used in cross-correlation to fine-tune the alignment. Although each iteration takes under a second, the number of frames adds up to dominate the running time of the algorithm. Because the frames can be processed independently, we examine the effect of splitting the work among threads.

[Plot: normalized function time versus number of threads, falling from 1.0 at one thread to roughly 0.825 at seven.]

Figure 4.5: Running time versus number of threads for video-diff's average luminance loop.

Figure 4.5 shows how adding more threads decreases

the running time (normalized with respect to the time of one thread), averaged over 37 different videos. Ideally, doubling the number of threads would cut the running time in half. The threading does not work as well as anticipated, however, cutting the average running time only to 82.5% of the single-thread time at 7 threads. This is likely due to the cost of file I/O, as each frame is read from the video file on disk. A better model in the future may be to keep the video file in RAM, across separate nodes of a distributed filesystem, or on solid-state drives.
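The parallel luminance loop can be sketched as follows (the function name is an assumption, and frames are taken as already-decoded arrays; with real video the per-frame decode and disk read dominate, which is the I/O cost blamed above for the sub-linear speedup).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mean_luminance(frames, workers=4):
    """Mean luminance per frame, computed across a thread pool.

    Each frame's mean is independent of the others, so the frames can be
    handed to worker threads freely; `pool.map` preserves input order,
    so the output signal lines up with the frame sequence.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return np.array(list(pool.map(lambda f: float(f.mean()), frames)))
```

The resulting one-dimensional signal is what gets cross-correlated between parent and child to fine-tune the alignment.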

Although it does not present a bottleneck (its cost is negligible for short videos), the distance matrix computation can easily run in parallel, with each thread taking a different set of rows to process. Figure 4.6 shows the results of running with 1, 2, 4, and 7 threads on three different videos with 12666, 12686, and 16081 blocks, matched against videos of the same length (so each distance matrix has the square of the number of blocks in entries). The measured speedup matches the expected results.
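The row-partitioned scheme can be sketched as follows (an illustration, not the C++ video-diff code; the function name and the Hamming metric over binary block hashes are assumptions consistent with the hashing described earlier). Rows are independent, so each worker fills a contiguous band with no locking.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def distance_matrix(hashes_a, hashes_b, workers=4):
    """Hamming-distance matrix between two sequences of binary hashes,
    with each worker thread filling a contiguous band of rows."""
    a = np.asarray(hashes_a, dtype=np.uint8)
    b = np.asarray(hashes_b, dtype=np.uint8)
    out = np.empty((len(a), len(b)), dtype=np.int32)

    def fill(rows):
        for r in rows:
            # Row r: distance from hash a[r] to every hash in b.
            out[r] = np.count_nonzero(a[r] != b, axis=1)

    bands = np.array_split(np.arange(len(a)), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fill, bands))
    return out
```

Because no two workers ever touch the same row, the threads need no synchronization beyond joining the pool, which is why this computation scales close to linearly in Figure 4.6.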


[Plot: function time in seconds versus number of threads, one curve each for 16081, 12686, and 12666 blocks.]

Figure 4.6: Running time versus number of threads for video-diff's distance matrix computation on longer videos (large matrices).

4.3 IntegriDroid

Now we evaluate IntegriDroid's performance on a Galaxy Nexus. We cross-compile both OpenCV and ffmpeg for Android and link our video-diff code, because the prebuilt OpenCV4Android lacks the ffmpeg backend for reading video. Due to the relatively weak CPU and limited multithreading capacity of the Galaxy Nexus, which has only 2 cores at 1.2 GHz, our client prototype does not perform the mean luminance computation for fine-tuning alignment or the color correlation for keyframe selection; instead it simply selects the first and last frame of each clip as the keyframes. Additionally, it scales each frame down to fit inside a 1024x768 bounding box, similarly to YouProve [22]. Future implementations on better hardware may lift these restrictions.


4.3.1 Latency

After recording a video, logging it takes about 5 seconds: the original recording is copied into secure storage, a digest is computed, the log is signed by the TPM, and the log is inserted into the database.

After editing a video, video-diff does not start running until the file has been closed, which occurs either when the exported video stops playing or when the app stops. (See Section 4.3.3 for a discussion of the vidtracker model.)

Once video-diff starts, a 1280x720 video hashes on average at 8.4 blocks per second: 3.7% of the server's speed, or 28% of playback speed if the video plays at 30 frames per second. However, the distance matrix operations and clustering still take negligible time. Future hardware may increase hashing speed as both smartphone processors and file I/O improve.

4.3.2 Power consumption

We use the University of Michigan's PowerTutor app to report the system's power consumption as a weighted average over 5 minutes. All measurements are performed with WiFi off and the screen dimmed unless otherwise noted.

When not using the IntegriDroid framework, the Galaxy Nexus uses 5 mW at idle with the screen off and 330 mW with the screen on. Starting the TPM emulation processes and vidtracker brings this up to only 337 mW.

While video-diff is running, consumption is a surprisingly low 600 mW. For comparison, playing music (via the built-in Music app, through the speaker at half volume) uses 735 mW. These results show that IntegriDroid has low power overhead and is feasible for smartphone deployment.


4.3.3 Tracking

Although vidtracker works for prototyping, future work will require a more robust solution for tracking video dependencies through native code. Because our model considers all tainted videos that an app reads as potential parents of its output videos, the number of videos read before a new one is written and closed affects tracking complexity and performance, since each potential parent must be tested against the output video for a match. Realistically, the child video may have multiple parents. If the vidtracker model is retained, implementations may consider caching video features to speed up near-duplicate detection. However, other solutions for native taint tracking, such as real-time disassembly of ARM instructions [39], may offer higher utility.


Chapter 5

Conclusion

5.1 Summary

This thesis presents YouTrace, a complete video tracking framework. First, we outline related work in media integrity, lineage, and trusted sensing, showing that existing mechanisms suffer from restrictive or unrealistic assumptions when applied to videos. Second, we propose an architecture based on a client running a custom Android OS with trusted hardware (IntegriDroid) together with a hosting server, which together trace video modifications back to the original recording, provided the original recording happens on a trusted IntegriDroid device. Third, we evaluate the accuracy of our system in tracking various types of transformations, as well as its speed and power consumption on a real smartphone. The results demonstrate feasibility for future deployment. Although the implementation still lacks efficiency on smartphones, future hardware will likely increase the speed significantly through faster and more parallel architectures while lowering power consumption. Still, the results suggest improvements to YouTrace's performance, discussed in the following section.


5.2 Future Work

Several avenues for future research exist, within components of the current architecture or by transforming the architecture itself.

The video-diff algorithm still has a long way to go in both accuracy and speed, though one often comes at the expense of the other. The alignment algorithm suffers from imprecision that sometimes causes frames to be mismatched, resulting in false positives, especially for block modifications and temporal cropping. Implementations may choose to optimize for a target subset of transformations, based on how users typically edit videos. Accordingly, the delta-report structure may change to expose different dimensions of transformations.

For the proposed system to be practical, future work should target actual hardware on commercial smartphones. Although smartphones do not have TPMs, equivalent functionality may be possible in current trusted hardware, as the Trusted Computing Group is working on integrating TPM features into the Trusted Execution Environment [35].

Moreover, file system monitoring with vidtracker may be an inefficient and imprecise mechanism for taint tracking through native code. Future systems could use more robust techniques, such as real-time disassembly of ARM instructions as proposed in recent work [39].

Currently, the architecture uses a simple client/server model. However, the server can encompass several services, all of them using a central database. Alternatively, video-diff computation and certificate storage can be distributed over a peer-to-peer network. Analogous to bitcoin mining, a set of nodes in the network could "mine" by computing the delta-report for a parent and child video, possibly rewarded with some resource. In this way, trust can be distributed rather than concentrated in a central server. (Note that the central server model still allows users to verify reports and certificates by recomputing them, though it may take longer.)


Appendix A

Classification and Evaluation of Video Authentication Schemes

Some watermarking and fingerprinting mechanisms are classified in Table A.1, with columns

corresponding to the proposed taxonomy (Table 2.2). Additionally, the “Parameters” column

lists and describes parameters that the scheme accepts to configure application-dependent

settings such as security and robustness.

[10] Transmission paradigm: Watermarking
     Feature extraction: Transform coefficients (DCT DC coefficients in I-frame sub-blocks)
     Feature encoding: Quantization (diagonal quantized DCT coefficients)
     Parameters: Threshold T for tampering detection

[11] Transmission paradigm: Watermarking
     Feature extraction: Pixel color values (red and blue layers of I-frames)
     Feature encoding: Quantization (DCT coefficients of corresponding luminance blocks)
     Parameters: Compressive sensing matrices, luminance threshold value T

[12] Transmission paradigm: Watermarking
     Feature extraction: Pixel luminance values (ART coefficients)
     Feature encoding: Error correction coding (low and middle frequency DFT coefficients)
     Parameters: Private key (optional)

[13] Transmission paradigm: Watermarking
     Feature extraction: Transform coefficients (average DCT energy of all I-frames in shot)
     Feature encoding: Cloud drops
     Parameters: Expected value Ex, entropy En, hyper entropy He, number of cloud drops n, secret key

[15] Transmission paradigm: Watermarking
     Feature extraction: Parameterized
     Feature encoding: Quantization
     Parameters: Fragile watermark feature MF, robust watermark feature MR

[16] Transmission paradigm: Digital signature (perceptual hashing)
     Feature extraction: Pixel luminance values (radial projection of key frame pixels)
     Feature encoding: Quantization (first 40 DCT coefficients of the RASH vector, quantized)
     Parameters: Key frame selection algorithm, threshold τ for visual equivalence decision

[17] Transmission paradigm: Digital signature (perceptual hashing)
     Feature extraction: Pixel luminance values
     Feature encoding: Cryptographic hashing (MD5)
     Parameters: Maximum allowable quantization ε, scalar constant for intensity transformation α, length/width of block P

[8]  Transmission paradigm: Digital signature
     Feature extraction: Key frames
     Feature encoding: Aggregation of frames
     Parameters: Differential energy factor D, weight factor W

[18] Transmission paradigm: Hybrid (content-fragile watermarking)
     Feature extraction: Pixel luminance values (edge characteristics)
     Feature encoding: Variable length code
     Parameters: Private key

[21] Transmission paradigm: Hybrid
     Feature extraction: Transform coefficients
     Feature encoding: Cryptographic hashing (embedded by changing the last LSB of x and y in motion vectors)
     Parameters: Thresholds T1 and T2 for selecting motion vectors

Table A.1: Classification of video authentication schemes.

Table A.2 evaluates the same schemes according to the following criteria:

• Tampering Detection: The scheme detects malicious modifications to the video content or source information.

• Tampering Localization: The scheme locates regions of a frame or video sequence that have been tampered with.

• Geometric/Transcoding Robustness: The scheme accepts content-preserving transformations such as scaling and re-compression.

• Loss Tolerance: The scheme tolerates packet/frame loss or compensates for it with redundancy.

Speed of authentication is also critical. The user uploading a video expects the upload to finish as quickly as possible; likewise, the server needs to complete the upload task with minimal authentication overhead so it can move on to other tasks. The tradeoff is to minimize overhead while still satisfying the security requirements. To prevent delay, especially for the viewing client, authentication schemes should be capable of working in real time. However, although these characteristics may factor into the usefulness of a system, the literature rarely explores them in enough detail for analysis. This presents a need for future work to discuss space, time, and energy requirements.

[10] Detection: High — configurable according to threshold T
     Localization: 4x4 sub-block
     Robustness: High against Gaussian noise (89%), salt and pepper noise (94%), Gaussian low pass filter (74%), contrast enhancement (94%)
     Loss tolerance: High — only I-frames are significant

[11] Detection: High — detection rate tested on four typical sequences (Carphone 92.5%, Container 93.3%, Mobile 91.4%, Paris 92.0%)
     Localization: 8x8 block
     Robustness: Not measured
     Loss tolerance: High — only I-frames are significant

[12] Detection: High for non-overlapping, large objects
     Localization: Object level
     Robustness: High — greater than 85% in all video processing tests
     Loss tolerance: High — each frame authenticated independently

[13] Detection: Medium (85-91%) — low for short shots, high for long ones
     Localization: Sub-region (depends on size), also temporal tampering degree (based on distance between W and W′)
     Robustness: High for recompression that preserves GOP structure
     Loss tolerance: Low — frame dropping is considered a temporal attack

[15] Detection: Configurable — based on selected features
     Localization: GOP level
     Robustness: Configurable
     Loss tolerance: Low — requires motion vectors from inter-frames

[16] Detection: Configurable — correlates with threshold τ by reducing the risk of hash collision
     Localization: Frame level
     Robustness: Configurable — inversely correlated with threshold τ
     Loss tolerance: Medium — only key frames are significant, so tolerance depends on key frame selection

[17] Detection: Configurable — correlates with parameter α
     Localization: Block level (size P × P)
     Robustness: High for recompression, inversely correlated with parameter α
     Loss tolerance: Configurable — can be considered temporal tampering or ignored

[8]  Detection: High — inversely correlates with parameter D overall, correlates with W at the block level
     Localization: Specified region
     Robustness: High — correlated with parameter D
     Loss tolerance: High — configurable, exploits temporal redundancy

[18] Detection: Low — malicious modifications that preserve edge characteristics are accepted
     Localization: Frame level
     Robustness: Medium — uses robust watermarks, but edges are sensitive to high compression and scaling transformations
     Loss tolerance: High — frames can be authenticated independently

[21] Detection: High — any trivial attack will alter motion vectors
     Localization: GOP level
     Robustness: Low — recompression alters GOP structures and motion vectors
     Loss tolerance: Low — requires motion vectors from inter-frames

Table A.2: Evaluation of video authentication schemes.


Bibliography

[1] Cisco, "The zettabyte era: Trends and analysis," http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/VNI_Hyperconnectivity_WP.pdf, June 2014, retrieved 01 March 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6WieRDCib

[2] S. Budig, "The linux-penguin again..." http://www.home.unix-ag.org/simon/penguin/, retrieved 03 April 2015. [Online]. Available: http://www.home.unix-ag.org/simon/penguin/

[3] Cisco, "Cisco visual networking index: Forecast and methodology, 2013-2018," http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.pdf, June 2014, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8ZCShaO

[4] YouTube, "Statistics," https://www.youtube.com/yt/press/statistics.html, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8e2x2HX

[5] X. Wu, A. G. Hauptmann, and C.-W. Ngo, "Practical elimination of near-duplicates from web video search," in Proceedings of the 15th International Conference on Multimedia. ACM, 2007, pp. 218–227.

[6] YouTube, "Content verification program," https://support.google.com/youtube/answer/6005923, retrieved 01 March 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6WijIDfno

[7] CNN, "About ireport," http://ireport.cnn.com/about.jspa, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8eJ9VAy

[8] P. K. Atrey, W.-Q. Yan, and M. S. Kankanhalli, "A scalable signature scheme for video authentication," Multimedia Tools and Applications, vol. 34, no. 1, pp. 107–135, July 2007.


[9] J. Wang, J. Lu, S. Lian, and G. Liu, "On the design of secure multimedia authentication," Journal of Universal Computer Science, vol. 15, no. 2, pp. 426–443, January 2009.

[10] W. Zhang, R. Zhang, X. Liu, C. Wu, and X. Niu, "A video watermarking algorithm of H.264/AVC for content authentication," Journal of Networks, vol. 7, no. 8, pp. 1150–1154, 2012.

[11] C. Xiaoling and Z. Huimin, "A novel video content authentication algorithm combined semi-fragile watermarking with compressive sensing," in 2012 International Conference on Intelligent Systems Design and Engineering Application. IEEE, 2012.

[12] D. He, Q. Sun, and Q. Tian, "A secure and robust object-based video authentication system," EURASIP Journal on Advances in Signal Processing, vol. 2004, pp. 2185–2200, 2004.

[13] C.-Y. Liang, A. Li, and X.-M. Niu, "Video authentication and tamper detection based on cloud model," in IIHMSP 2007 - Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing, November 2007.

[14] C. Zauner, "Implementation and benchmarking of perceptual image hash functions," Master's thesis, Upper Austria University of Applied Sciences, Hagenberg, 2010.

[15] P. Yin and H. H. Yu, "A semi-fragile watermarking system for MPEG video authentication," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2002, pp. IV-3461–IV-3464.

[16] C. D. Roover, C. D. Vleeschouwer, F. Lefebvre, and B. Macq, "Robust video hashing based on radial projections of key frames," IEEE Transactions on Signal Processing, vol. 53, no. 10, pp. 4020–4037, October 2005.

[17] F. Ahmed and M. Y. Siyal, "A robust and secure signature scheme for video authentication," in 2007 IEEE International Conference on Multimedia and Expo. IEEE, July 2007, pp. 2126–2129.

[18] J. Dittmann, A. Steinmetz, and R. Steinmetz, "Content-based digital signature for motion pictures authentication and content-fragile watermarking," in IEEE International Conference on Multimedia Computing and Systems, vol. 2. IEEE, 1999, pp. 209–213.

[19] J. Zhang and A. T. Ho, "Efficient video authentication for H.264," in IEEE Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC'06), 2006.

[20] N. Ramaswamy and K. R. Rao, "Video authentication for H.264/AVC using digital signature standard and secure hash algorithm," in NOSSDAV'06, May 2006.


[21] K. A. Saadi, A. Bouridane, and A. Guessoum, "Combined fragile watermark and digital signature for H.264/AVC video authentication," in 17th European Signal Processing Conference (EUSIPCO 2009), August 2009, pp. 1799–1803.

[22] P. Gilbert, J. Jung, K. Lee, H. Qin, D. Sharkey, A. Sheth, and L. P. Cox, "Youprove: authenticity and fidelity in mobile sensing," in Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems. ACM, 2011, pp. 176–189.

[23] Y. Ke, R. Sukthankar, and L. Huston, "Efficient near-duplicate detection and sub-image retrieval," in ACM Multimedia, vol. 4, no. 1, 2004, p. 5.

[24] L. Kennedy and S.-F. Chang, "Internet image archaeology: automatically tracing the manipulation history of photographs on the web," in Proceedings of the 16th ACM International Conference on Multimedia. ACM, 2008, pp. 349–358.

[25] A. De Rosa, F. Uccheddu, A. Costanzo, A. Piva, and M. Barni, "Exploring image dependencies: a new challenge in image forensics," Media Forensics and Security, p. 75410, 2010.

[26] Z. Dias, A. Rocha, and S. Goldenstein, "First steps toward image phylogeny," in Information Forensics and Security (WIFS), 2010 IEEE International Workshop on. IEEE, 2010, pp. 1–6.

[27] ——, "Image phylogeny by minimal spanning trees," Information Forensics and Security, IEEE Transactions on, vol. 7, no. 2, pp. 774–788, 2012.

[28] Z. Dias, S. Goldenstein, and A. Rocha, "Toward image phylogeny forests: Automatically recovering semantically similar image relationships," Forensic Science International, vol. 231, no. 1, pp. 178–189, 2013.

[29] A. M. Bronstein, M. M. Bronstein, and R. Kimmel, "The video genome," arXiv preprint arXiv:1003.5320, 2010.

[30] Z. Dias, A. Rocha, and S. Goldenstein, "Video phylogeny: Recovering near-duplicate video relationships," in Information Forensics and Security (WIFS), 2011 IEEE International Workshop on. IEEE, 2011, pp. 1–6.

[31] Z. Dias, S. Goldenstein, and A. Rocha, "Large-scale image phylogeny: Tracing image ancestral relationships," MultiMedia, IEEE, vol. 20, no. 3, pp. 58–70, 2013.

[32] S. Lameri, P. Bestagini, A. Melloni, S. Milani, A. Rocha, M. Tagliasacchi, and S. Tubaro, "Who is my parent? reconstructing video sequences from partially matching shots," in 2014 IEEE International Conference on Image Processing (ICIP), 2014.

[33] REWIND, "Reverse engineering of audio-visual content data," http://www.rewindproject.eu/, retrieved 30 March 2015. [Online]. Available: http://www.rewindproject.eu/


[34] W3C, "Prov-overview: An overview of the prov family of documents," http://www.w3.org/TR/prov-overview/, April 2013, retrieved 30 March 2015. [Online]. Available: http://www.w3.org/TR/prov-overview/

[35] Trusted Computing Group, "TPM mobile with trusted execution environment for comprehensive mobile device security," https://www.trustedcomputinggroup.org/files/static_page_files/5999C3C1-1A4B-B294-D0BC20183757815E/TPM%20MOBILE%20with%20Trusted%20Execution%20Environment%20for%20Comprehensive%20Mobile%20Device%20Security.pdf, retrieved 05 January 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6VMSMgBmh

[36] L. Xie, A. Natsev, X. He, J. Kender, M. Hill, and J. R. Smith, "Tracking large-scale video remix in real-world events," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1244–1254, 2013.

[37] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.

[38] Z. Fan and R. L. de Queiroz, "Identification of bitmap compression history: JPEG detection and quantizer estimation," Image Processing, IEEE Transactions on, vol. 12, no. 2, pp. 230–235, 2003.

[39] V. Pistol, "Practical dynamic information-flow tracking on mobile devices," Ph.D. dissertation, Duke University, 2014.
