Security of Big Data: Focus on Data Leakage Prevention (DLP)

Richard Nyarko

Information Security, master's level (120 credits)

2018

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering

Supervised by Prof. Ahmed Elragal

THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN INFORMATION SECURITY

June, 2018

ABSTRACT

Data has become an indispensable part of daily life in the information age. The amount of data generated is growing exponentially due to technological advances. This volume of data, generated daily, has given rise to the term big data. Security is therefore of great concern in big data processes. The survival of many organizations depends on preventing these data from falling into the wrong hands, because the consequences can be serious. For instance, the credibility of a business or organization is compromised when sensitive data such as trade secrets, project documents, and customer profiles are leaked to its competitors (Alneyadi et al, 2016).

In addition, traditional security mechanisms such as firewalls, virtual private networks (VPNs), and intrusion detection systems/intrusion prevention systems (IDSs/IPSs) are not enough to prevent the leakage of such sensitive data. To overcome this deficiency in protecting sensitive data, a new paradigm called data leakage prevention systems (DLPSs) has been introduced. Over the past years, many research contributions have addressed data leakage. However, most past research focused on detecting data leakage rather than preventing it. This thesis contributes to research by using the preventive approach of DLPSs to propose hybrid symmetric-asymmetric encryption to prevent data leakage.

This thesis followed the Design Science Research Methodology (DSRM), with CRISP-DM (CRoss Industry Standard Process for Data Mining) as the kernel theory or framework for designing the IT artifact (method). The proposed encryption method ensures that all confidential or sensitive documents of an organization are encrypted so that only users with access to the decryption keys can read them. Encryption is applied after the documents have been classified as confidential or non-confidential with a Naïve Bayes Classifier (NBC). Any organization that needs to prevent data leakage before it occurs can therefore make use of the proposed hybrid encryption method.

Keywords: Big data, big data security, data leakage prevention, data leakage prevention system.

ACKNOWLEDGEMENT

First and foremost, I am most grateful to the Almighty God, who gave me the knowledge, strength, good health, and happiness to complete this thesis.

I would also like to express my profound appreciation and gratitude to my supervisor, Prof. Ahmed Elragal, for his endless support, valuable comments, and guidance throughout this thesis work. This thesis would not have been successful without his valuable contributions.

I would also like to express my deepest gratitude to my wife, Louisa Ababio Nsiah, who has supported me throughout all these years. My master’s programme would not have been completed without her valuable support.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
TABLE OF FIGURES
TABLE OF TABLES
ABBREVIATIONS
1. INTRODUCTION
1.1 Research objective
1.2 Delimitations
2. RESEARCH PROBLEM
2.1 Motivation
3. LITERATURE REVIEW
3.1 Definition of review scope
3.2 Conceptualisation of the research topic
3.3 Literature search
3.4 Literature analysis and synthesis
3.4.1 Big Data (BD)
3.4.2 Big Data Security (BDS)
3.4.3 Data Leakage Prevention (DLP)
3.5 Literature review discussion
3.6 Research gap
3.7 Research question
4. RESEARCH METHODOLOGY
4.1 Activity 1: Problem identification and motivation
4.2 Activity 2: Define the objectives for a solution
4.3 Activity 3: Design and development
4.3.1 Kernel theory
4.4 Activity 4: Demonstration
4.5 Activity 5: Evaluation
4.6 Activity 6: Communication
5. DESIGN AND DEVELOPMENT
5.1 Data Understanding
5.2 Data Preparation
5.3 Modelling
5.4 Cryptography (Encryption and Decryption)
5.5 Proposed DLP Method
5.5.1 Phase 1: Classification of organizational documents
5.5.2 Phase 2: Encryption and decryption of confidential documents
6. DEMONSTRATION
6.1 Experimental setup
6.2 Data Sets (Documents)
6.3 Experiment 1
6.4 Experiment 2
6.5 Experiment 3
6.6 Experiment 4
7. EVALUATION
7.1 Impact of the IT Artifact
8. DISCUSSION
8.1 Contribution
9. CONCLUSION
9.2 Future Research
REFERENCE
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5

TABLE OF FIGURES

Figure 1: Framework for literature reviewing (Brocke et al, 2009)
Figure 2: The 5 V’s of BD (Moura and Serrão, 2015)
Figure 3: BD Interpretation Insights (Shukla et al, 2015)
Figure 4: Challenges in BDS
Figure 5: Different Data States (Alneyadi et al, 2013a)
Figure 6: A taxonomy of DLP solutions (Shabtai et al, 2012)
Figure 7: Data leakage prevention categorisation by method (Alneyadi et al, 2016)
Figure 8: Example of SVM hyper plane pattern (Patel and Mistry, 2015)
Figure 9: NN Block Diagram (Patel and Mistry, 2015)
Figure 10: DSRM Process Model (Peffers et al, 2007)
Figure 11: Phases of the CRISP-DM Process Model (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000)
Figure 12: Overview of the CRISP-DM tasks and their outputs (Wirth and Hipp, 2000; Shearer, 2000)
Figure 13: Sample Data
Figure 14: Flowchart of proposed DLP method
Figure 15: Evaluation activities within a DSR process
Figure 16: Text Pre-processing Activities
Figure 17: Process Map of the Model for NBC
Figure 18: Process Map of Cross Validation for NBC
Figure 19: Prediction label after applying the model on the unknown data
Figure 20: Screenshot showing the implementation of the second phase of the proposed DLP method in experiment 1
Figure 21: Screenshot showing the encryption bash script
Figure 22: Screenshot showing all the steps from step 2 to 11
Figure 23: Screenshot showing the decryption bash script
Figure 24: Screenshot showing the executable command (step 13)
Figure 25: Screenshot showing the decryption of the confidential file
Figure 26: Screenshot showing the encryption of all the confidential files in a directory
Figure 27: Screenshot showing the decryption of all the confidential files in a directory
Figure 28: Screenshot showing all the files in a directory
Figure 29: Screenshot showing the removal of the files in a directory
Figure 30: Screenshot showing wrong password supplied for decryption
Figure 31: Screenshot showing archiving of all files before encryption
Figure 32: Screenshot showing removal of all confidential text files before encryption
Figure 33: Screenshot showing encryption of the archived file
Figure 34: Screenshot showing changing of directory
Figure 35: Screenshot showing decryption of the archived file
Figure 36: Screenshot showing extraction of files from the archived file

TABLE OF TABLES

Table 1: Conceptualisation of the research topic
Table 2: Knowledge Database Search
Table 3: Summary of knowledge database search results
Table 4: Selected research papers for review
Table 5: Summary of Previous DLPS Analysis Techniques / Methods
Table 6: Advantages and Disadvantages of Classifiers

ABBREVIATIONS

BD: Big Data

IoT: Internet of Things

DLP: Data Leakage Prevention

VPN: Virtual Private Network

IDSs: Intrusion Detection Systems

IPSs: Intrusion Prevention Systems

IM: Instant Messaging

XML: eXtensible Markup Language

JSON: JavaScript Object Notation

RDBMS: Relational Database Management System

BDS: Big Data Security

SIEM: Security Information and Event Management

BYOD: Bring Your Own Device

ABE: Attribute-Based Encryption

DAR: Data-at-Rest

DIU: Data-in-Use

DIM: Data-in-Motion

TF-IDF: Term Frequency-Inverse Document Frequency

SVD: Singular Value Decomposition

SVM: Support Vector Machine

K-NN: K-Nearest Neighbor

ANN: Artificial Neural Network

NBC: Naïve Bayes Classifier

DT: Decision Trees

DSR: Design Science Research

DSRM: Design Science Research Methodology

CRISP-DM: CRoss Industry Standard Process for Data Mining

NIST: National Institute of Standards and Technology

PII: Personally Identifiable Information

GDPR: General Data Protection Regulation

EU: European Union

HR: Human Resources

DES: Data Encryption Standard

AES: Advanced Encryption Standard

1. INTRODUCTION

In this information-driven world, data has become an indispensable part of our daily lives. With cloud computing, the internet, and mobile devices occupying ever larger portions of our lives and businesses, huge amounts of data are generated every day (Hima Bindu et al, 2016). For example, large amounts of data are generated daily through social networking applications such as YouTube, Twitter, Facebook, LinkedIn, and WhatsApp, to mention a few. The amount of data generated is growing exponentially, and estimates suggest that at least 2.5 quintillion bytes (2.5 × 10^18 bytes) of data are produced every day (Harish Kumar and Menakadevi, 2017). More data now cross the Internet every second than were stored on the entire Internet 20 years ago (McAfee and Brynjolfsson, 2012). These collections of datasets, which are so large and complex that they become difficult to handle with traditional relational database management systems, have brought about the term “Big Data” (Shirudkar and Motwani, 2015). This term is now used everywhere in our daily lives.

Big Data (BD) is becoming increasingly popular as the number of devices connected to the so-called “Internet of Things” (IoT) continues to rise to unforeseen levels, producing large volumes of data which need to be transformed into valuable information (Moura and Serrão, 2015). In addition, the advent of BD has brought about new challenges in terms of data security (Toshniwal et al, 2015). According to Toshniwal et al (2015), there is an increasing need for research into technologies that can handle these large data sets and secure them efficiently. They further state that the “Current Technologies for securing data are slow when applied to huge amounts of data” (Toshniwal et al, 2015, p. 17). This means that security is of much concern when it comes to BD collection, processing, and analysis, and that the systems employed should be fast as well as secure. Ultimately, the purpose of BD security is no different from the fundamental CIA triad: the Confidentiality, Integrity, and Availability of the data generated need to be preserved.

According to Tahboub and Saleh (2014), the need to protect information, which is a valuable asset of the organization, cannot be overemphasized. Data Leakage Prevention (DLP) has been found to be one of the effective ways of preventing data leaks. DLP solutions detect and prevent unauthorized attempts to copy or send sensitive data, whether intentional or unintentional, by people who are authorized to access the sensitive information. DLP is designed to detect potential data breach incidents in a timely manner by monitoring data while in use (endpoint actions), in motion (network traffic), or at rest (data storage) (Tahboub and Saleh, 2014).

Securing the BD process encompasses securing the sources, the pre-processing, and the knowledge outcomes. According to ISACA (2010), DLP aims at halting the loss of sensitive information that occurs in enterprises globally. By focusing on the location, classification, and monitoring of information at rest, in use, and in motion, DLP helps enterprises get a handle on what information they have and stop the numerous leaks of information that occur every day (ISACA, 2010). This research sets out to design a method to help organizations prevent data leakage in big data. DLP is also referred to as Data Loss Prevention in much of the literature; in this research, however, DLP means Data Leakage Prevention.

1.1 Research objective

The objective of this research is to design a method to help organizations prevent data leakage in BD using a preventive approach, namely encryption, with emphasis on semi-structured (textual) data.

1.2 Delimitations

The scope of this thesis is limited to the use of encryption as the preventive approach to data leakage in BD, with emphasis on semi-structured (textual) data. This means that other preventive methods such as access control, disabling functions, and awareness will not be addressed. Moreover, the detective approach to handling data leakage in a DLPS is not considered. The encryption of other types of BD is also not considered, although the method is capable of handling certain documents in formats other than TXT, such as DOCX, PDF, and PPT. The encryption algorithms are limited to RSA and AES. The proposed method is not automated, since data are fed manually into a data mining tool for classification. The volume of test data used in the experiments is too small to be representative of BD, even though the whole idea is to prevent leakage in BD. This situation has arisen because the organization in question has not implemented BD technologies such as Hadoop to accommodate several data sources.
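The classification step referred to above, in which documents are separated into confidential and non-confidential ones before encryption, can be illustrated with a minimal sketch. It assumes a multinomial Naïve Bayes model with add-one smoothing written in plain Python. The thesis feeds documents manually into a data mining tool, so the class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the toy training sentences below are hypothetical and are not the author's implementation or data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.doc_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of each class."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:        # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training set; the thesis data are not reproduced here.
TRAIN = [
    ("salary details and contract terms for new employee", "confidential"),
    ("customer credit card records and account numbers", "confidential"),
    ("trade secret formula for the new product", "confidential"),
    ("lunch menu for the staff canteen this week", "non-confidential"),
    ("public press release about the office opening", "non-confidential"),
    ("reminder about the staff football match this week", "non-confidential"),
]

model = NaiveBayesTextClassifier().fit(*zip(*TRAIN))
print(model.predict("customer account numbers and salary"))
print(model.predict("staff lunch this week"))
```

In a pipeline of the kind the thesis describes, only the documents predicted as confidential would be passed on to the encryption phase. With a training set this small the predictions are purely illustrative.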

2. RESEARCH PROBLEM

Data is one of the most important assets of many companies, and its protection must therefore take first priority (Tahboub and Saleh, 2014). Even though many organizations have put in place security mechanisms and technical systems such as firewalls, virtual private networks (VPNs), and intrusion detection systems/intrusion prevention systems (IDSs/IPSs), data leakage still occurs (Tahboub and Saleh, 2014). Tahboub and Saleh (2014) state that data leakage occurs when sensitive data is revealed to unauthorized users or parties, whether intentionally or not. Data leakage can have serious implications for many organizations. For example, the loss of confidential or sensitive data can have a severe impact on a company’s reputation and credibility, customers, employee confidence, and competitive advantage, and in some cases can lead to the closure of the organization (Tahboub and Saleh, 2014). Data leakage is an important concern for business organizations in this increasingly networked world, and any unauthorized disclosure of sensitive or confidential data may have serious consequences for an organization in both the long and the short term (Soumya and Smitha, 2014).

In addition, according to Alneyadi et al (2016), the issue of data leakage is a growing concern among organizations and individuals. Alneyadi et al (2016) indicated that more leakages occurred in the business sector than in the government sector. According to a 2014 report, the figures stood at 50% for the business sector and 20% for the government sector. They further stated that although in some cases the data leaks were not detrimental to organizations, others caused damage worth several million dollars. Moreover, the credibility of businesses or organizations is compromised when sensitive data such as trade secrets, project documents, and customer profiles are leaked to their competitors (Alneyadi et al, 2016). Alneyadi et al (2016) add that sensitive government information concerning political decisions, law enforcement, and national security can also be leaked. A typical example is the leak of United States diplomatic cables by WikiLeaks. “The leak consisted of about 250,000 United States diplomatic cables and 400,000 military reports referred to as ‘war logs’. This revelation was carried out by an internal entity using an external hard drive and about 100,000 diplomatic cables were labelled confidential and 15,000 cables were classified as secret” (Alneyadi et al, 2016, p. 137). According to Alneyadi et al (2016), this incident drew strong public criticism from civil rights organizations all over the world. In another incident, hackers stole 160 million credit and debit card numbers and targeted 800,000 bank accounts in the US, in what was considered one of the largest hacking incidents to have occurred (Vadsola et al, 2014).

The need to address such serious issues led several organizations to implement security control mechanisms such as firewalls, VPNs, IDSs, and IPSs (Kale et al, 2015). According to Alneyadi et al (2016), these systems work satisfactorily when the data is well defined, structured, and constant. Alneyadi et al (2016) further stated that when data is modified, tagged differently, or compressed, these systems become naïve and confidential data can still be leaked. For example, a firewall can have rules to block access to confidential data, yet the same data can leave through other channels such as an email attachment or instant messaging (IM). This means that the traditional security mechanisms (firewalls, VPNs, IDSs/IPSs) are handicapped by their lack of understanding of data semantics (Alneyadi et al, 2016). To overcome this deficiency in protecting sensitive data, a new paradigm called data leakage prevention systems (DLPSs) has been introduced.

Security and privacy issues are magnified by the velocity, volume, and variety of BD, for example through large-scale cloud infrastructures, diversity of data sources and formats, the streaming nature of data acquisition, and high-volume inter-cloud migration (Shirudkar and Motwani, 2015). BD can be sensitive or non-sensitive, and in either case leakage of data can be costly for businesses and users. For instance, a leaked customer credit card record can be costly to both the bank and the customer. Data leakage often occurs through information sharing with users inside or outside the organization, the exchange of emails that contain sensitive information, the public release of information on the internet or cloud, and the theft of information, whether with illegal motives or unknowingly (Tidke et al, 2015). Sensitive data comes in many kinds, such as banking information, credit card information, criminal and justice data, financial data, and health records. Moreover, the advent of BD has brought about numerous data security challenges that require different mechanisms. Given the volume of data generated and used by organizations today, sophisticated technologies and methodologies are needed that can handle such volumes securely and efficiently and prevent data leakage.

Finally, several DLP methods have been designed; however, little has been done to prevent data leakage in BD using the preventive approach, which can help organizations prevent leaks before they happen.

2.1 Motivation

The motivation of this research comes from the need to find a method to help organizations prevent data leakage in BD using a preventive approach such as encryption, so that leaks can be prevented before they happen. It is my belief that when this method is finally implemented, it will give organizations access to a less costly solution for preventing data leakage. In the detective approach, the system detects possible leakage incidents and applies a corrective action capable of handling the identified incident (Shabtai et al, 2012). However, sensitive data could still be leaked easily to unauthorized users if the data owner is careless, which could reduce the competitive advantage of an organization (Margathavalli et al, 2016). The argument is that when relying on detective approaches, data owners can still carelessly leak sensitive data to unauthorized users. It is therefore necessary to prevent the data leakage in the first place through encryption.

3. LITERATURE REVIEW

According to Brocke et al (2009), a literature review should be rigorous and exhaustive, and the search processes involved should be documented. To achieve this, the framework suggested by Brocke et al (2009) is adopted. According to this framework, a rigorous literature search should consist of five phases: (1) definition of review scope, (2) conceptualisation of the topic, (3) literature search, (4) literature analysis and synthesis, and (5) research agenda (which is discussed under Literature review discussion). The framework is displayed in Figure 1.

Figure 1: Framework for literature reviewing (Brocke et al, 2009)

3.1 Definition of review scope

This stage of the framework is based on the relevant concepts associated with the study and the time frame for selecting past related research papers for the review. The relevant concepts identified in this research are big data, big data security, and data leakage prevention. The time frame covers the past six years, spanning from 2012 to 2018. This time frame was chosen because it is a requirement for the master’s thesis and because, big data being a new concept, most relevant papers are likely to be recent.

3.2 Conceptualisation of the research topic

Considering the research topic, the key terms or concepts relevant to this research study are big data, big data security, and data leakage prevention.

Table 1: Conceptualisation of the research topic

Research topic | Research question | Sub-topics
Security of Big Data: Focus on Data Leakage Prevention (DLP). | How to design a method to help organizations prevent data leakage in BD? | Big data; Big data security; Data leakage prevention

3.3 Literature search

The main method for the literature search was keyword search, using the key terms or concepts identified during the conceptualisation phase, in a number of relevant knowledge databases: Elsevier ScienceDirect, Google Scholar, Scopus, and the Luleå University database. The time frame spanned 2012 to 2018, with English as the search language. More specifically, the keywords or phrases were typed directly into these knowledge databases and matched against article titles. In the advanced search, the results were limited to academic journals, conference papers, and peer-reviewed publications. The keywords initially returned a large number of articles, as presented in Table 2. The results were sorted by relevance and date. To select the articles relevant to the sub-topics, their titles and abstracts were reviewed. In addition, the sub-topics were combined at one point so that only the articles relevant to the research topic were selected. After these steps, the number of papers was reduced as shown in Table 3.

Table 2: Knowledge Database Search

Number of hits per keyword

Knowledge database           Big data    Big data security    Data leakage prevention
ScienceDirect                229,606     29,160               8,540
Lulea University database    184,913     34,368               5,106
Google Scholar               25,400      437                  83
Scopus                       32,401      26                   34

Table 3: Summary of knowledge database search results

Keywords / Sub-topics      Number of papers found    Number of papers used
Big data                   Over 229,000              19
Big data security          Over 34,000               9
Data leakage prevention    Over 8,000                20

The final list of the relevant papers for review is presented in Table 4 and considered in the

Literature analysis and synthesis section (Section 3.4).

Table 4: Selected research papers for review

No.  Big data                         Big data security              Data leakage prevention
1    Inukollu et al (2014)            Kumar et al (2016)             Tahboub and Saleh (2014)
2    Tene and Polonetsky (2013)       Shirudkar and Motwani (2015)   Kale et al (2015)
3    McAfee and Brynjolfsson (2012)   Tabassum and Tyagi (2016)      Ram (2015)
4    Tabassum and Tyagi (2016)        Singh and Sinha (2016)         Peneti and Rani (2015a)
5    Moura and Serrão (2015)          Mahajan et al (2016)           Alneyadi et al (2013a)
6    Hima Bindu et al (2016)          Bhogal and Jain (2017)         Alneyadi et al (2016)
7    Shirudkar and Motwani (2015)     Kaushik and Jain (2014)        Jain and Lenka (2016)
8    Toshniwal et al (2015)           Yosepu et al (2015)            Peneti and Rani (2015b)
9    Yosepu et al (2015)              Hima Bindu et al (2016)        Tahboub and Saleh (2015)
10   Ularu et al (2012)               -                              Tidke et al (2015)
11   Jamiy et al (2014)               -                              Margathavalli et al (2016)
12   Rodríguez-Mazahua et al (2016)   -                              Katz et al (2014)
13   Moorthy et al (2015)             -                              Alneyadi et al (2015)
14   Khan et al (2014)                -                              Shabtai et al (2012)
15   Shukla et al (2015)              -                              Soumya and Smitha (2014)
16   Tole (2013)                      -                              Alneyadi et al (2013b)
17   Ammu and Irfanuddin (2013)       -                              Alneyadi et al (2014)
18   Bertino (2013)                   -                              Ko et al (2014)
19   Sin and Muthu (2015)             -                              Peneti and Rani (2016)
20   -                                -                              Ahmad and Bamnote (2013)

3.4 Literature analysis and synthesis

The literature analysis is based on the relevant papers outlined in Table 4 and on the concepts big data, big data security and data leakage prevention.

3.4.1 Big Data (BD)

BD is the term used to describe volumes of structured, semi-structured and unstructured data that are so large and complex that they are very difficult to process with traditional databases and software technologies (Inukollu et al, 2014). The increasing number of people, devices, and sensors that are now connected by digital networks (i.e., IoT) has generated a vast amount of data. Data is generated from online transactions, social networking interactions, emails, videos, images, clickstream, logs, search queries, sensors, global positioning satellites, roads and bridges, and mobile phones (Tene and Polonetsky, 2013). The amount of data produced each day already exceeds 2.5 exabytes (McAfee and Brynjolfsson, 2012). The types of data that make up BD are explained further below (Tabassum and Tyagi, 2016; Yosepu et al, 2015; Moorthy et al, 2015; Toshniwal et al, 2015; Khan et al, 2014; Shirudkar and Motwani, 2015; Ularu et al, 2012; Moura and Serrão, 2015; Hima Bindu et al, 2016):

• Structured Data: These are data sets made up of fixed fields within a record or file, that is, relational (tabular) data. This is the type of data found in relational databases and created by business applications. Examples are database tables, objects, tags, reports, indexes, etc. They are highly structured and organized, and are mostly managed with SQL.

• Semi-Structured Data: These are text data containing tags or mark-up elements that separate elements and generate hierarchies of records and fields in the data, e.g. XML (eXtensible Markup Language). In other words, it is a type of structured data that lacks a strict data model and does not conform to a formal structure: the schema definition is optional, and tags and other markers separate semantic elements and enforce hierarchies of records and fields within the data. This type of data is expressed in formats such as XML and JavaScript Object Notation (JSON).

• Unstructured Data: This type of data is either machine generated or human generated. Examples are text, emails, photos, videos, audio, movies, graph data, scientific simulations, financial transactions, phone records, geospatial maps, tweets, Facebook data, sensor data, etc. This type of data grows most rapidly, and about 80% of all stored organizational data is unstructured (Khan et al, 2014).
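As a minimal illustration of the three data types above, the following Python sketch represents the same purchase as a relational row (structured), a JSON document (semi-structured) and free text (unstructured). The table name, field names and values are hypothetical and are not taken from the reviewed papers; the sketch only shows how much more effort is needed to recover a value as structure decreases.

```python
import json
import re
import sqlite3

# Structured: fixed fields in a relational table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", ("Alice", "laptop", 950.0))
structured_amount = conn.execute(
    "SELECT amount FROM purchases WHERE customer = ?", ("Alice",)
).fetchone()[0]

# Semi-structured: keys (tags) give a hierarchy, but no fixed schema is enforced.
document = json.loads(
    '{"customer": "Alice", "order": {"item": "laptop", "amount": 950.0},'
    ' "notes": ["gift wrap"]}'
)
semi_structured_amount = document["order"]["amount"]

# Unstructured: free text; the amount has to be extracted heuristically.
email_text = "Hi, this is Alice. I paid 950.0 for the laptop yesterday."
match = re.search(r"paid (\d+(?:\.\d+)?)", email_text)
unstructured_amount = float(match.group(1)) if match else None

print(structured_amount, semi_structured_amount, unstructured_amount)
```

In the structured and semi-structured cases the value is addressed by name, whereas the unstructured case depends on a pattern that is an assumption about how the text is phrased.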

The characteristics of BD comprise the following. The initial properties are referred to as the 3V’s (Volume, Velocity, and Variety); two more (Veracity and Value) were later added to make the 5V’s (Moura and Serrão, 2015; Hima Bindu et al, 2016; Shirudkar and Motwani, 2015; Toshniwal et al, 2015; Tabassum and Tyagi, 2016; Yosepu et al, 2015):

• Volume: This feature describes the huge volumes of data, to which many factors contribute. The data are generated from online transactions, live streaming from social media and customer feedback, produced by employees, contractors, partners, and suppliers using social networking sites, and collected from sensors.

• Velocity: This refers to how fast the data is produced and how fast it needs to be processed to meet demand.

• Variety: Today data comes in all types of formats (structured and unstructured data), from traditional databases, text documents, emails, video, audio, online transactions, etc.

• Veracity: This feature concerns the quality and source of the data, that is, whether the data is conflicting or trustworthy. In other words, it covers the credibility and correctness of the data sources as well as the suitability of the data for its intended use.

• Value: This is the usefulness of the data in making decisions.

These characteristics have been expanded with two more features to describe BD as follows (Inukollu et al, 2014; Toshniwal et al, 2015; Tabassum and Tyagi, 2016; Yosepu et al, 2015):

• Variability: This feature of BD refers to the inconsistency of data. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks.

• Complexity: Data comes from multiple sources and must be cleaned, merged, matched and transformed into the required format before actual processing.

Figure 2 shows the diagram of the 5 V’s of BD.

Figure 2: The 5 V’s of BD (Moura and Serrão, 2015)

3.4.1.1 Benefits of BD

The main importance of BD lies in its potential to improve efficiency through the use of large data sets of different types. If BD is used properly within a business or organization, it offers a better view of the business, leading to efficiency in areas such as sales, manufacturing and marketing (Ularu et al, 2012). Businesses can increase productivity, and it is even said that “Companies that use data most effectively stand out from the rest” (Tene and Polonetsky, 2013, p.244). BD can be used effectively in the following areas (Ularu et al, 2012):

• In information technology, security and troubleshooting can be improved by analysing the patterns in existing logs.

• In customer services, information from call centres can be used to identify customer patterns and thus enhance customer satisfaction by customizing services.

• In improving services and products, content from social media can be used. By knowing potential customers’ preferences, a company can modify its products in order to address a larger group of people.

• In the detection of fraud in online transactions for any industry.

• In risk assessment, by analysing information from transactions on the financial market.

Also, BD helps in the decision making and competitiveness of organizations and public administrations (Jamiy et al, 2014). Jamiy et al (2014) add that this will go a long way to grow the entire world economy significantly, in that organizations can take informed and timely decisions. With this potential, organizations can better listen to and understand customers and their ways of using services, and hence offer better and improved products (Jamiy et al, 2014).

Other benefits can be grouped into the following categories (Tene and Polonetsky, 2013):

• Healthcare: When the voluminous healthcare data generated are used effectively by employing BD techniques and technologies, this can give the right outcomes to patients and reduce the cost of care in the public health and medicine fields. For example, the computing power of BD allows entire DNA strings to be mined in minutes, which makes it possible to discover, monitor and improve health aspects of everyone and to predict disease patterns.

• Mobile: The number of mobile devices is increasing, and with multiple sensors including cameras, microphones, movement sensors, GPS, and Wi-Fi capabilities, they collect more data than ever in the public sphere. BD technologies enable innovative and effective use of this data.

• Smart Grid: The use of BD in the smart grid (the modernization of the current electrical grid to include a bidirectional flow of information and electricity) brings the benefits of advanced data analysis. The smart grid is designed, for instance, to allow utility service providers such as electricity companies to monitor and control electricity distribution and usage. This helps these companies to respond to power outages or other problems in a timely manner and precisely at the affected location. Also, environmental policymakers view the smart grid as key to a better and more efficient delivery of electricity and to decisions on moving towards renewable energy.

• Traffic Management: Another area of data-driven innovation is traffic management and control. Governments around the world are implementing electronic toll pricing systems that offer differentiated payments based on mobility and congestion charges. These systems apply varying prices to drivers based on their differing use of vehicles and roads. Also, town, urban and city planners benefit from the analysis of location data, which supports decisions on how to construct roads to ease traffic congestion and on high-density urban development.

• Retail: BD is also transforming the retail market. There are now technologies that allow, for instance, suppliers to determine how much of their products are available in shops. Others use the feature “those who bought this, also bought this” to determine which items consumers buy together. This makes it possible to present relevant adverts to prospective buyers.

3.4.1.4 BD Challenges

While the potential benefits of BD are real and significant, many challenges must be addressed to fully realize this potential. This section presents the challenges that need to be addressed when handling BD. These challenges include (Ammu and Irfanuddin, 2013; Bertino, 2013; Jamiy et al, 2014; Shukla et al, 2015; Sin and Muthu, 2015; Ularu et al, 2012):

• Data Acquisition and Recording: It is critical to capture the context in which data has been generated, so that non-relevant data can be filtered out. This also enables data to be compressed and metadata (that is, data about data) to be generated automatically, which supports rich data description and the tracking and recording of provenance (that is, information that documents the history, origin or source of content, the changes that have taken place, and who has had custody of it since it originated). Achieving this is complex and time consuming, which is the real challenge of handling big volumes of unstructured and structured data continuously arriving from many sources.

• Information Extraction and Cleaning: Data often has to be transformed in order to extract information from it, and this information should be presented in a form that is suitable for analysis. Extracting meaningful information from huge sets of data is a major challenge. Data may also be of poor quality and/or uncertain. Data cleaning and data quality verification are thus critical.

• Data Integration, Aggregation and Representation: Data can be heterogeneous and may have different metadata, so data integration requires huge human effort. Manual approaches fail to scale to what is required for BD, hence the need for newer and better approaches that automate data integration. Also, different data aggregation and representation strategies may be needed for different data analysis tasks.

• Query Processing and Analysis: Methods suitable for BD need to be discovered and evaluated for efficiency so that they can deal with noisy, dynamic, heterogeneous and untrustworthy data. Despite these difficulties, BD, even if noisy and uncertain, can be more valuable for identifying reliable hidden patterns and knowledge than tiny samples of good data. Also, the often-redundant relationships existing among data can represent an opportunity for cross-checking data and thus improve data trustworthiness. Supporting query processing and data analysis requires scalable mining algorithms and powerful computing infrastructures. As more data are generated daily, analysis of the data may consume a lot of time and resources, yet there are many situations in which the result of the analysis is required immediately.

• Interpretation: Analysis results extracted from BD need to be interpreted by decision makers, and this may require the users to be able to analyse the assumptions at each stage of data processing and possibly to retrace the analysis. Rich provenance is critical in this respect. BD interpretation insights are shown in Figure 3.

• Privacy and Security: These are also important challenges for BD. Because BD consists of a large amount of complex data, it is very difficult for a company to sort this data by privacy level and apply the corresponding security. In addition, many companies now do business across countries and continents, and the considerable differences in privacy laws should be taken into consideration when starting a BD initiative. Also, security and privacy issues are magnified by the velocity, volume and variety of BD, such as large-scale cloud infrastructures, diversity of data sources and formats, streaming nature of data acquisition, and high volume inter-cloud migration.

• Storage: While the common capacity of hard disks is in the range of terabytes, the amount of data generated daily through the internet is in the range of exabytes, and this is expected to grow further. Traditional Relational Database Management Systems (RDBMS) cannot handle the storage and processing of such voluminous data. Therefore, technologies that do not use traditional SQL-based queries are used to overcome this challenge. Also, compression technology needs to be employed to compress the data at rest and in memory.

Figure 3: BD Interpretation Insights (Shukla et al, 2015)

3.4.2 Big Data Security (BDS)

To start with, security and privacy issues are magnified by the velocity, volume and variety of BD, such as large-scale cloud infrastructures, diversity of data sources and formats, streaming nature of data acquisition, and high volume inter-cloud migration (Shirudkar and Motwani, 2015). They further state that, with the use of these large-scale cloud infrastructures, which are spread across the world with diverse software platforms, attacks on systems have increased, and traditional security mechanisms are therefore not adequate. Also, sophisticated technologies are needed that can offer fast response times to the growing demand of streaming data across several data centres (Shirudkar and Motwani, 2015). Singh and Sinha (2016, p. 33) support this argument: “the currently used security mechanisms such as firewalls and DMZs cannot be used in the BD infrastructure because the security mechanisms should be stretched out of the perimeter of the organization's network”.

Kumar et al (2016) also note that large-scale cloud infrastructures, the multiplicity of data sources and formats, the streaming nature of data acquisition and high volume inter-cloud migration have brought about security concerns. They argue that conventional security mechanisms, which are tailored to securing small-scale, static data, are not enough, and that proper security technologies are needed to secure BD. They further proposed an agent-based security solution model to deal with the security issues of cloud BD. This model is capable of authenticating users and checking the permissions that are assigned by the administrator during the registration of a new cloud user.

The advent of BD has brought several challenges in terms of the security of data (Tabassum and Tyagi, 2016). They further highlight the need for research into technologies and methodologies that can handle voluminous data securely and efficiently. They acknowledge that, although many new technologies and methods have been developed, these slow down to some extent when large amounts of data are involved.

3.4.2.1 BDS Challenges

BDS challenges can be grouped into four main categories, which are further subdivided into ten challenges as shown in Figure 4 and briefly described below (Bhogal and Jain, 2017; Shirudkar and Motwani, 2015; Yosepu et al, 2015; Kaushik and Jain, 2014; Hima Bindu et al, 2016; Mahajan et al, 2016):

Figure 4: Challenges in BDS

1. Secure Computations in Distributed Programming Framework:

A distributed programming framework uses parallelism in computation and storage to process large amounts of data. A popular example is the MapReduce framework, which splits an input file into multiple chunks. In the first phase of MapReduce, a mapper for each chunk reads the data, performs some computation, and generates a list of key/value pairs. Then a reducer combines the values belonging to each distinct key and outputs the result. There are two major attack prevention measures: securing the mappers and securing the data in the presence of an untrusted mapper.

2. Security Best Practices for Non-Relational Data Stores:

Non-relational data stores, popularized by NoSQL databases, are still developing with respect to security infrastructure. For instance, robust solutions to NoSQL injection are still not mature. Each NoSQL database was built to tackle different challenges posed by the analytics world, and security was never part of the model at any point of its design stage. The security of NoSQL databases in general remains to be improved. Developers using NoSQL databases therefore usually embed security in the middleware, since NoSQL databases do not provide any support for enforcing it explicitly in the database. However, the clustering aspect of NoSQL databases poses additional challenges to the robustness of such security practices, and enhanced security is expected to come at the expense of performance.

[Figure 4 content: Infrastructure Security (Secure Computations in Distributed Programming Frameworks; Security Best Practices for Non-Relational Data Stores). Data Privacy (Scalable and Composable Privacy-Preserving Data Mining and Analytics; Cryptographically Enforced Data-Centric Security; Granular Access Control). Data Management (Secure Data Storage and Transaction Logs; Granular Audits; Data Provenance). Integrity and Reactive Security (End Point Input Validation / Filtering; Real-time Security Monitoring).]

3. Secure Data Storage and Transaction Logs:

Data and transaction logs are stored in multi-tiered storage media. Manually moving data between tiers gives the IT manager direct control over exactly what data is moved and when. However, as the size of data sets continues to grow exponentially, scalability and availability have necessitated auto-tiering for BD storage management. Auto-tiering solutions do not keep track of where the data is stored, which creates new challenges for secure data storage. Therefore, new mechanisms are imperative to prevent unauthorised access and maintain 24/7 availability.

4. End Point Input Validation/Filtering:

Many BD use cases in organizational settings require data collection from many sources, such as end point devices. For instance, a security information and event management (SIEM) system may collect event logs from millions of hardware devices and software applications in an enterprise network. A key challenge in the data collection process is input validation: how can we trust the data? How can we confirm that a source of input data is not malicious? And how can we filter harmful input from our collection? Validation and filtering of input is a daunting challenge posed by untrusted input sources, especially with the bring your own device (BYOD) model.

5. Real-time Security Monitoring:

Real-time security monitoring has always been a challenge, given the number of alerts generated by (security) devices. These alerts (correlated or not) lead to many false positives, which are mostly ignored or simply “clicked away”, as humans cannot cope with the sheer amount. This problem might even increase with BD, given the volume and velocity of data streams. However, BD technologies might also provide an opportunity, since they allow fast processing and analysis of different types of data. These technologies can be used to provide, for example, real-time inconsistency detection based on scalable security analytics.

6. Scalable and Composable Privacy-Preserving Data Mining and Analytics:

BD can be seen as potentially enabling invasions of privacy, invasive marketing, decreased civil freedoms, and increased state and corporate control. A recent analysis of how enterprises are leveraging data analytics for marketing purposes identified an example of how a vendor was able to recognize that a teenager was pregnant before her father knew. Similarly, anonymizing data for analytics is not adequate to maintain user privacy. For example, AOL released anonymized search logs for academic purposes, but users were easily identified by their searches. Netflix faced a similar problem when users in its anonymized data set were identified by correlating their Netflix movie scores with IMDB scores. Therefore, it is important to establish guidelines and recommendations for preventing inadvertent privacy disclosures.

7. Cryptographically Enforced Data-Centric Security:

In order to ensure that the most sensitive private data is secure end to end and only accessible to authorized entities, data has to be encrypted based on access control policies. Specific research in this area, such as attribute-based encryption (ABE), has to be made richer, more efficient, and more scalable. To ensure authentication, agreement and fairness among the distributed entities, a cryptographically secure communication framework has to be implemented.

8. Granular Access Control:

The security property that matters from the perspective of access control is secrecy, that is, preventing access to data by unauthorized people. The problem with coarse-grained access mechanisms is that data that could otherwise be shared is often swept into a more restrictive category to guarantee sound security. Granular access control therefore gives data managers more precision when sharing data without compromising privacy.

9. Granular Audits:

With real-time security monitoring, we try to be notified the moment an attack takes place. In reality, this will not always be the case (e.g., new attacks, missed true positives). In order to discover a missed attack, audit information is required. This is relevant not only because we want to understand what happened and what went wrong, but also for compliance, regulation and forensic reasons. In that regard, auditing is not something new, but the scope and granularity might be different. For example, we have to deal with more data objects, which are probably (but not necessarily) distributed.

10. Data Provenance:

Provenance metadata will increase in complexity due to large provenance graphs

generated from provenance-enabled programming environments in BD applications.

Analysis of such large provenance graphs to identify metadata dependencies for security

or confidentiality applications is computationally intensive.
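The MapReduce flow described under challenge 1 (split the input into chunks, map each chunk to key/value pairs, then reduce the values of each distinct key) can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not of Hadoop or any other particular framework; the function names `split_into_chunks`, `mapper`, `shuffle` and `reducer` and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """Split the input into chunks, one per (hypothetical) mapper."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def mapper(chunk):
    """Read one chunk and emit a list of (key, value) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all mappers by their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reducer(key, values):
    """Combine the values belonging to one distinct key."""
    return key, sum(values)


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # In a real framework each mapper would run in parallel on its own node.
    mapped = [mapper(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    data = ["big data security", "data leakage prevention", "big data"]
    print(word_count(data))
```

Because the reducer combines whatever the mappers emit without checking it, a single faulty or compromised mapper can skew the final result, which is consistent with securing the mappers being listed as a prevention measure.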

3.4.3 Data Leakage Prevention (DLP)

According to Kale et al (2015), a Data Leakage Prevention (DLP) solution is one of the new technical solutions and methodologies that protect the sensitive data of an organization from being viewed by the wrong users or individuals, either from outside or inside the organization. This means that specific data should be viewed only by authorized individuals or groups (Kale et al, 2015). “DLP solutions detect and prevent unauthorized attempts to copy or send sensitive data, both intentionally or/and unintentionally, without authorization, by people who are authorized to access the sensitive information” (Kale et al, 2015, p. 55; Tidke et al, 2015, p. 28). In other words, “DLP is a technique used to hide the confidentiality of data being accessed by unauthorized user” (Jain and Lenka, 2016, p. 57). In addition, DLP solutions are products designed to detect potential data breach incidents in a timely manner and prevent them by monitoring data while in use (endpoint actions), in motion (network traffic) or at rest (data storage) (Tahboub and Saleh, 2014). DLP solutions address data leaks in the following three states of data throughout their lifecycle by applying a specific set of technologies (Tahboub and Saleh, 2014; Ahmad and Bamnote, 2013; Peneti and Rani, 2015a):

• Data-at-Rest (DAR): Data that resides in file systems, databases and other storage media, e.g. a company’s financial data stored on the financial application server.

• Data-in-Use (DIU): Data at the endpoints of the network (e.g. data on USB devices, external drives, MP3 players, laptops, and other highly mobile devices). In other words, all data with which the user is interacting.

• Data-in-Motion (DIM): Any data that is moving through (or being sent through) the network to the outside via the Internet. This applies to all data transmitted over wired or wireless links, e.g. customer purchasing details sent over the Internet. In addition, these data may be sent either inside the internal network of an organization or may cross over into an external network.

Figure 5 shows these three data states.

Figure 5: Different Data States (Alneyadi et al, 2013a)

Moreover, Ram (2015) explains that DLP is very useful in that it helps organizations to protect not only structured data but also unstructured data against leakage. Ram (2015) further states that DLP serves as a data control mechanism that fits naturally with the organizational business structure. According to Peneti and Rani (2015b), data leakage prevention systems (DLPSs) make use of confidential terms and data identification methods to control data leakage in the organization. First, a DLPS identifies which documents are confidential and which are non-confidential. According to Alneyadi et al (2016), a DLPS can be defined as a system that is designed to detect and prevent the unauthorised access, use, disclosure, or transmission of confidential information. DLP can also be used to reduce risk, to improve data management practices and to lower compliance cost (Ram, 2015). Several DLP technologies are available on the market. Ram (2015) refers to one DLP technology called MyDLP, which is an open-source all-in-one data loss / leak prevention software that runs with multi-site configurations on network servers and endpoint computers. In addition, DLP solutions for various operating systems are offered by many vendors, such as Symantec, Websense, McAfee, MyDLP and many others (Tahboub and Saleh, 2015).

Tahboub and Saleh (2014) also explain that there are differences between a DLPS and existing data protection systems such as firewalls, Intrusion Detection Systems / Intrusion Prevention Systems (IDSs / IPSs), antivirus, antispam, antimalware, encryption and digital rights management tools. They further explain that “the main difference between a DLPS and existing technologies is that DLPSs are content-aware; they are designed to give visibility into where the company's most sensitive data is stored, who has access to it, and where and by whom it is sent outside the company's network. Existing security applications cannot perform this level of monitoring” (Tahboub and Saleh, 2014, p. 17). This assertion is also supported by Alneyadi et al (2016): “DLPSs differ from conventional security controls such as firewalls, VPNs and IDSs in terms of dedication and proactivity. Conventional security controls have less dedication towards the actual content of the data” (p. 138). Tahboub and Saleh (2014) add that a DLPS should also be able to prevent sensitive data from being sent outside the organization’s network, either through an endpoint computer or the network.

3.4.3.1 DLP Solutions

DLP solutions can be grouped according to a taxonomy that incorporates the following features: data state, deployment scheme, leakage handling approach / method, and action taken upon leakage, as indicated in Figure 6 (Shabtai et al, 2012; Peneti and Rani, 2015a; Alneyadi et al, 2016):

Figure 6: A taxonomy of DLP solutions (Shabtai et al, 2012)

• What to protect? (data state): DLP solutions offer protection by differentiating between the three states of data throughout its lifecycle: DAR, DIU, and DIM.

• Where to protect? (deployment scheme): The two main deployment schemes considered during the installation of DLP solutions are endpoint and network. A solution deployed at the endpoint directly controls devices or users: it monitors and controls access to data, while a separate supervisory server handles the administrative procedures and the distribution of policies. In this way, all DIU and DAR are protected. A network DLP solution, on the other hand, is deployed at the network level so that all network traffic is analysed. Transmissions that violate the predefined policies are thereby identified and blocked.

• How to protect? (leakage handling approach): Leakage incidents are handled by two main approaches: detective and preventive. Under the detective approach, the DLPS detects possible leakage incidents through context-based inspection, content-based inspection, or content tagging, and then applies a corrective action capable of handling the identified incident (Shabtai et al, 2012). Under the preventive approach, possible leakage incidents are prevented before they happen by applying measures such as access control, disabling functions, encryption, and awareness (Shabtai et al, 2012). Alneyadi et al (2016) also support the distinction between preventive and detective mechanisms, categorizing DLPSs by the technique or method used, as indicated in Figure 9.


Figure 7: Data leakage prevention categorisation by method. (Alneyadi et al, 2016)

• Preventive method:

▪ Policy and Access Rights: Data leaks are prevented based on strict security

policies and access rights. Some policies in organizations can restrict the use of

USB drives and CDs.

▪ Virtualisation and Isolation: The advantages of virtualisation are applied to protect sensitive data. A virtual environment is created whenever sensitive data is accessed, and only certain allowed processes are permitted within it.

▪ Cryptographic Approaches: Cryptography is a way to hide sensitive data from

unauthorized users by making use of cryptographic tools and algorithms. Some

cryptographic approaches protect against data leaks in DIU and DAR states within

the confines of organizations.

Margathavalli et al (2016) use the Attribute Based Encryption (ABE) algorithm as a preventive data leakage prevention method to preserve sensitive data. The idea behind their proposed method is to keep the sensitive data locked so that only authorized users can access it. Their argument is that, when an organization relies on detective approaches alone, data owners can still carelessly leak sensitive data to unauthorized users; it is therefore necessary to prevent the leakage in the first place through encryption. According to Margathavalli et al (2016), ABE is a type of public-key encryption in which the secret key of a user and the ciphertext both depend on attributes, for instance the country in which someone lives or the kind of subscription he or she has. Decryption of the ciphertext is possible only if the attributes of the user’s key match the attributes of the ciphertext. The ciphertext is generated based on attributes and the private key is associated with them, so the private key can be used to download the data only when its attributes match.

▪ Quantifying and Limiting: In this approach, security administrators act as attackers, attacking their own systems in order to find and block the loopholes that lead to sensitive data. This approach can be used in both detective and preventive methods.

• Detective Method:

▪ Data Identification: Detection of sensitive data depends on prior knowledge of the targeted content and uses techniques such as data fingerprints, regular expressions, and exact or partial data matching.

▪ Social and Behavioural: Social network analysis and behavioural patterns can help

detect any irregularity and raise alarm so that security administrators can react to

them.

▪ Data Mining / Text Clustering: Data mining can perform advanced tasks such as anomaly detection, clustering and classification by extracting patterns from large datasets. It is related to machine learning, whose algorithms recognise complex patterns and support better decisions. Text clustering, which is related to information retrieval, also plays a significant role in DLPSs.

Several information security solution providers have incorporated elements of the above-mentioned taxonomy in the development of DLP solutions. The top DLP solution vendors include the following (Alneyadi et al, 2016; Shabtai et al, 2012):

• Websense (provides Triton)

• McAfee (provides McAfee Data Loss Prevention)

• RSA

• Symantec

• Trend Micro

• MyDLP (open source)

• VMware (provides AirWatch)

• Check Point Software Technologies (provides Check Point DLP)

• General Dynamics Fidelis Cybersecurity Solutions (provides Fidelis XPS)

• Varonis Systems (provides Varonis IDU Classification Framework).

3.4.3.2 DLPS Analysis Techniques

The techniques that DLPSs use to classify data as sensitive (confidential) or non-sensitive (non-confidential), and subsequently to detect leaks, fall into two main groups: context-based and content-based analysis, as discussed below (Alneyadi et al, 2016; Alneyadi et al, 2015; Katz et al, 2014; Shabtai et al, 2012):

• Context analysis technique: The context-based technique considers the metadata associated with the confidential data (such as size, timing, source, format and destination) rather than the sensitivity of the content itself. The DLPS studies the context surrounding the confidential data in order to detect potential data leaks. For instance, if a user wants to send data to another entity, contextual attributes such as source, file size and format, destination, and timing are examined and compared against known transaction patterns or predefined policies. The context analysis technique is sometimes combined with the content-based analysis technique in order to be effective.

Katz et al (2014) proposed a context-based model comprising two phases: training and detection. In the training phase, clusters of documents are produced and a graph representation of the confidential content of each cluster is generated. This representation consists of key terms and the context in which they must appear in order to be considered confidential. In the detection phase, each tested document is assigned to several clusters and its contents are matched against each cluster’s graph to determine the confidentiality of the document. Soumya and Smitha (2014) developed a DLPS based on context keyword matching and encrypted data detection. Their aim is to enhance the security of DLPSs by determining the confidentiality of documents from the context of keywords and by detecting encrypted information in word or text documents. Their approach also has two phases, learning and detection, and resembles that of Katz et al (2014) in its use of clustering and graph representations.

• Content analysis technique: This technique works by focusing on the actual content of the

confidential data rather than the context. Since the main aim of any DLPS is to detect and

prevent confidential data from being leaked, it is more important and effective to consider

the content. The three main content analysis techniques are: data fingerprinting (including

exact or partial match), regular expression (including dictionary-based match) and

statistical analysis. N-gram analysis and term weighting analysis are the main statistical

analysis techniques.

▪ Data fingerprinting: This is the most common technique used to detect data leakage. In most DLPSs, a whole file is hashed using a conventional hash function such as MD5 or SHA1, and the hash values of all sensitive documents are stored in databases or on the local machines under inspection. Such DLPSs can achieve 100% detection accuracy as long as the file is not modified in any way. However, confidential documents are subject to change, and a change to the file alters its hash value, so conventional fingerprinting becomes ineffective once the data is modified. Advanced fingerprinting methods such as Rabin’s randomised fingerprinting and fuzzy fingerprinting can solve some of the modification issues.

▪ Regular expression: This is another popular method used in DLPSs. Regular expressions consist of sets of terms or characters that form detection patterns, against which data strings are matched. The technique is widely used in search engines and text processing to validate, extract and replace data. In information security, regular expressions are mostly used to inspect data for malicious code or confidential data.


▪ Statistical analysis: This technique draws on tools such as machine learning classification and information retrieval term weighting, which mostly depend on the frequency of terms and n-grams within a set of documents. N-gram statistical analysis addresses the drawbacks of regular expressions and data fingerprinting. A term simply means a word, while an n-gram can be a word or a piece of a word, such as a unigram (one character), bigram (two characters) or trigram (three characters). The main statistical analysis techniques are N-gram and term weighting analyses. N-gram analysis breaks each word in a document into small character N-grams and ranks them by frequency to create N-gram profiles. Term weighting, on the other hand, is a statistical method that indicates the significance of a word within a document. It is normally used in text classification with vector space models, where documents are represented as vectors.
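As a rough illustration of the first two content analysis techniques, the sketch below combines whole-file fingerprinting with a regular expression check. It is a minimal sketch under stated assumptions: the function names, the stored sample document and the card-number pattern are all hypothetical, and SHA-1 is used only because it is one of the hash functions named above.

```python
import hashlib
import re


def fingerprint(data: bytes) -> str:
    """Whole-file fingerprint using SHA-1, one of the hashes named in the text."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical store of fingerprints for known sensitive documents.
SENSITIVE = {fingerprint(b"Project Falcon budget: 1.2M EUR")}

# Hypothetical pattern: 16 digits in groups of four, optionally separated.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def inspect(payload: bytes) -> list[str]:
    """Return the reasons, if any, for flagging an outbound payload."""
    reasons = []
    if fingerprint(payload) in SENSITIVE:
        reasons.append("exact fingerprint match")
    if CARD_PATTERN.search(payload.decode("utf-8", errors="ignore")):
        reasons.append("card-like number")
    return reasons
```

Changing a single character of the stored document yields a different hash, so the exact-match check no longer fires. This is the weakness of conventional fingerprinting described above.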

Alneyadi et al (2013a) proposed a word N-gram based classification method that uses N-gram frequencies to classify documents in order to detect and prevent leakage of sensitive data. Alneyadi et al (2014) studied the effectiveness of N-gram based statistical analysis, combined with word stemming, for classifying documents according to their topics; their stemmed N-gram classification methodology gave a classification accuracy of 92%. In addition, Alneyadi et al (2013b) investigated N-gram statistical analysis for data classification, using N-gram frequencies to classify documents under distinct categories. They used “taxicab” geometry to compute the similarity between documents and existing categories, and the method correctly classified 90.5% of the tested documents.
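The idea of comparing N-gram frequency profiles with taxicab geometry can be sketched as follows. This is an illustrative sketch only, not the implementation of Alneyadi et al: it assumes character trigrams taken within each word, raw counts as the profile, and hypothetical function names and category texts.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count the character n-grams found inside each word of the text."""
    profile = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            profile[word[i:i + n]] += 1
    return profile


def taxicab_distance(p: Counter, q: Counter) -> int:
    """Sum of absolute differences between n-gram counts (L1 distance)."""
    return sum(abs(p[g] - q[g]) for g in set(p) | set(q))


def classify_by_ngrams(text: str, categories: dict) -> str:
    """Assign the text to the category whose profile is nearest."""
    doc = ngram_profile(text)
    return min(categories, key=lambda name: taxicab_distance(doc, categories[name]))


# Hypothetical category profiles built from tiny sample texts.
categories = {
    "finance": ngram_profile("quarterly revenue forecast budget"),
    "sports": ngram_profile("football match league score"),
}
```

With these sample profiles, `classify_by_ngrams("revenue budget", categories)` selects the finance category, because every trigram of the query also occurs in that profile.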

Several other techniques have been proposed to deal with DLP. Jain and Lenka (2016) proposed a DLP technique based on image steganography. The technique prevents data from being outsourced by giving sensitive data a special inscription so that it cannot be reproduced. According to Jain and Lenka (2016), in practice it works by embedding a file, message or image within another image file, which is then transmitted. An unauthorized user does not know that data has been embedded in the image. In short, steganography allows data to be hidden in an image such that it cannot be perceived.
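One common way of hiding data in an image is the least significant bit (LSB) technique. The sketch below is a simplified illustration rather than the procedure of Jain and Lenka (2016): it assumes the cover image is already available as a flat sequence of 8-bit pixel values, and the function names are hypothetical.

```python
def embed(pixels: bytes, message: bytes) -> bytes:
    """Hide the message in the lowest bit of successive pixel values."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(out)


def extract(pixels: bytes, length: int) -> bytes:
    """Recover `length` bytes from the lowest bits of the pixel values."""
    bits = [p & 1 for p in pixels[:length * 8]]
    return bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )
```

Each pixel value changes by at most one, which is why the embedded data is hard to perceive in the transmitted image.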

In addition, Ko et al (2014) proposed a novel user-centric, mantrap-inspired DLP approach that detects every attempt to send data, authorized or unauthorized, reports it to the end user, and gives the user the opportunity to stop the sending process. They implemented their own Linux kernel module, which works together with a user-space program to obtain the user’s approval for every sending process, giving users full control over all outbound data sent from their devices.

Peneti and Rani (2015b) also proposed a confidential data identification method that uses a data mining approach to classify documents as confidential or non-confidential. They employed clustering and language modelling techniques during the training phase. During the detection phase, the confidentiality score of each inspected document is checked against predefined confidentiality scores, and documents that exceed a certain threshold are marked and blocked.


Moreover, Peneti and Rani (2016) developed an algorithm for DLP with time stamps. Time stamps play an important role in identifying confidential data. For example, in an educational system a question paper is considered confidential up to the examination date; once the examination is over and the paper is in the public domain, it is treated as non-confidential. Their method also made use of a clustering technique. It was 100% accurate for documents whose content is entirely confidential or entirely non-confidential. However, it was not able to detect small portions of confidential content within non-confidential documents.
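The time-stamp rule in the question paper example can be expressed in a few lines. This sketch covers only the expiry rule, not the clustering part of the method of Peneti and Rani (2016), and the function and parameter names are hypothetical.

```python
from datetime import date


def is_confidential(exam_date: date, today: date) -> bool:
    """A question paper stays confidential up to and including the exam date."""
    return today <= exam_date
```

For an examination held on 1 June 2018, the paper is confidential on 31 May and on 1 June, and non-confidential from 2 June onwards.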

Furthermore, Alneyadi et al (2015) presented a statistical DLP model that classifies data based on semantics. Their contribution to the DLP field is a statistical analysis that is able to detect evolved confidential data. Their model uses the well-known information retrieval concept of Term Frequency-Inverse Document Frequency (TF-IDF) to classify documents under certain topics. The classification results were presented with a Singular Value Decomposition (SVD) matrix. The results indicated that the proposed statistical DLP approach could correctly classify documents even in extreme cases of modification, with high recall and precision scores.
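To make the TF-IDF weighting concrete, the sketch below computes it for a small tokenised corpus. It is a minimal sketch assuming one common formulation (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the exact variant used by Alneyadi et al (2015) may differ, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by term frequency times inverse document frequency."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document receives a weight of zero, while a term concentrated in a few documents receives a higher weight, which is what makes the weighting useful for separating topics.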

The approaches, techniques and algorithms discussed above are summarised in Table 7.

Table 5: Summary of Previous DLPS Analysis Techniques / Methods

| List of Papers | Techniques / Algorithms | Method / Analysis | Contributions | Limitations |
|---|---|---|---|---|
| Katz et al (2014), CoBAn: A context based model for data leakage prevention. | Context-based approach, Clustering | Detective / Context | A novel approach regarding the context of key terms for classification purposes. A new approach for the graph representation of text. | The method is not capable of using external data sources to enhance the representation of confidential content. |
| Soumya and Smitha (2014), Data Leakage Prevention System by Context based Keyword Matching and Encrypted Data Detection. | Context-based keyword matching, Entropy method, Clustering | Detective / Context | Detection of small portion of confidential information in a non-confidential document. Effective check for information going out of the organization is either confidential or encrypted. | Cannot perform any cryptanalysis process for retrieving the information represented by the encrypted data. |
| Alneyadi et al (2013a), Word N-gram Based Classification for Data Leakage Prevention. | N-grams frequency | Detective / Content | Covered most aspects in using N-grams for data Classification. | Encrypted document imposes a great challenge to the method. Modifying or replacing a word could lead to wrong classification or misclassification. |
| Alneyadi et al (2014), A Semantics-Aware Classification Approach for Data Leakage Prevention. | N-grams based statistical analysis | Detective / Content | Effects of data modification showed acceptable accuracy. | Lacks term weighting approaches which are more flexible than raw frequency. |
| Alneyadi et al (2013b), Adaptable N-gram classification model for data leakage prevention | N-grams frequency, Employ simple taxicab geometry | Detective / Content | Achievement of high levels of recall and precision as compared to existing methods. | The method is not effective when word synonyms and special characters are used. |
| Jain and Lenka (2016), A Review on Data Leakage Prevention using Image Steganography. | Image steganography, LSB (Least Significant Bit) technique | Preventive / Content | Being able to prevent data from being outsourced by giving a special inscription to sensitive data from being reproduced. | The method couldn’t cover all image file formats. |
| Ko et al (2014), A Mantrap-Inspired, User-Centric Data Leakage Prevention (DLP) Approach. | Kernel-space mantrap approach | Detective / Preventive / Content | Users have full control over all outbound data sending process in their devices. | Lacks graphical user interface (GUI) for the users. |
| Peneti and Rani (2015b), Confidential Data Identification Using Data Mining Techniques in Data Leakage Prevention System. | Clustering and language modelling techniques | Detective / Content | Detection of entire confidential document and detection of small portions of confidential content embedded in larger non-confidential documents. | Cannot detect and prevent encrypted data. |
| Peneti and Rani (2016), Data Leakage Prevention System with Time Stamp. | Clustering | Detective / Content | Time stamp method is best suited for both large and small dataset. | Cannot detect small portions of confidential content in non-confidential documents. |
| Alneyadi et al (2015), Detecting Data Semantic: A Data Leakage Prevention Approach. | Statistical data analysis (TF-IDF) | Detective / Content | Contributed to the DLP field by using data statistical analysis to detect evolved confidential data. The proposed statistical DLP approach could correctly classify documents even in cases of extreme modification. | Cannot detect and prevent encrypted data. |
| Margathavalli et al (2016), Preserving Sensitive Data by Data Leakage Prevention Using Attribute Based Encryption Algorithm | Attribute Based Encryption | Preventive / Content | Prevent data leakage by encryption (ABE). | Cannot deal with detection. |

3.4.3.3 Data / Text Classification Methods

The DLP method will focus on semi-structured data (textual data). Although text documents are usually regarded as unstructured data, they can also be treated as semi-structured because they contain certain structured features. The main idea behind DLP here is to categorize data as either confidential or non-confidential, with the confidential data then being encrypted. Therefore, most of the data classification methods to be considered are text classification or categorization methods.

Text classification has been defined as the act of dividing input documents into two or more classes such that each document belongs to one or more classes (Vala and Gandhi, 2015). In other words, it is “the task of assigning predefined categories to documents” (Bali and Gore, 2015, p. 4888). Text classification methods include Support Vector Machine (SVM), K Nearest Neighbor (KNN), Artificial Neural Network (ANN), Naive Bayes Classifier (NBC), and Decision Trees (DT) (Vala and Gandhi, 2015; Patil et al, 2016; Topaloğlu, 2013; Patra and Singh, 2013).

SVM is a classification method used to classify both linear and non-linear data (Patra and Singh, 2013). Patra and Singh (2013) further explain that the method uses a non-linear mapping to transform the training data into a higher dimension and then searches for the optimal linear separating hyperplane. Patel and Mistry (2015) add that SVM works with both positive and negative training sets, which is uncommon among classification methods. According to Patel and Mistry (2015), SVM requires these positive and negative training sets in order to find the decision surface that best separates the data in the n-dimensional space; this surface is referred to as the hyperplane. The document representatives closest to the decision surface are known as the support vectors, as shown in Figure 10 (Patel and Mistry, 2015).

Figure 8: Example of SVM hyper plane pattern (Patel and Mistry, 2015)

According to Figure 4, the equation of the hyperplane for a linearly separable space is WX+B=0, where X is an arbitrary object to be classified, W is the weight vector and B is a constant, both learned from the linearly separable objects in the training data sets or documents (Patel and Mistry, 2015). Patel and Mistry (2015) further explain that hyperplanes are mostly used to separate two different classes of data. However, SVM can also work on pre-classified documents (Patel and Mistry, 2015; Chavan et al, 2014). SVM has been extensively and successfully used for text classification tasks (Thaoroijam, 2014). According to Ba-Alwi and Albared (2016), SVM is a very popular text categorization method in the machine learning community. “SVM is considered as one of the most effective classification method according to its performance on text classification as proven by many researches” (Ba-Alwi and Albared, 2016, p. 5).
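Once W and B have been learned, classifying a new object amounts to checking on which side of the hyperplane WX+B=0 it lies. The sketch below shows only this decision step, not the training procedure; the function name and the sample values of W and B are hypothetical.

```python
def svm_decide(w: list[float], x: list[float], b: float) -> int:
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical parameters, standing in for values learned from training data.
W = [1.0, -1.0]
B = -0.5
```

With these values, the object [2.0, 0.0] gives a score of 1.5 and falls in the positive class, while [0.0, 2.0] gives -2.5 and falls in the negative class.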

NBC is a well-known and practical probabilistic classifier which has been employed in many applications (Chavan et al, 2014). It assumes that all attributes (i.e., features) of the examples are independent of each other given the class, i.e., the classifier makes an independence assumption (Chavan et al, 2014). The method is fast and easy to implement (Patil et al, 2016). It is also a well-known statistical method that performs relatively well on large datasets, which makes it useful in text classification problems (Patil et al, 2016). It is based on Bayes’ theorem combined with the assumption that features are independent (Patel and Mistry, 2015). “The NB classifiers solves the text classification problem as follows:

given a document d which represented as a set of feature terms {ti | i=1, 2, ..., |d|} and c is a

category in the category set C, where |C| >2. NB can be defined as the conditional probability of

c given d constructed as follows:

” (Ba-Alwi and Albared, 2016, p. 5).
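For orientation, the standard textbook form of the naive Bayes rule, written with the symbols d, t_i, c and C defined in the quotation, is given below. It is supplied here as an assumption and is not necessarily the exact expression printed by Ba-Alwi and Albared (2016).

```latex
P(c \mid d) = \frac{P(c)\,\prod_{i=1}^{|d|} P(t_i \mid c)}{P(d)},
\qquad
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{|d|} P(t_i \mid c)
```

The product over the feature terms is where the independence assumption enters: each term contributes its own factor P(t_i | c), regardless of the other terms in the document.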

The KNN classifier works on the principle that documents which are close to each other in the feature space belong to the same class (Patel and Mistry, 2015). According to Patel and Mistry (2015), the algorithm calculates the similarity between a text document and its neighbours. In other words, KNN classifies a document or object by a vote among the k labelled training examples nearest to it (Bali and Gore, 2015; Patra and Singh, 2013). It is a case-based learning algorithm that relies on a distance or similarity function for pairs of observations, such as the Euclidean distance or cosine similarity (Patil et al, 2016; Korde and Mahender, 2012). The advantages of KNN are its simplicity, effectiveness and ease of implementation, which make it suitable for many applications (Korde and Mahender, 2012; Bali and Gore, 2015). One drawback is the time and difficulty involved in finding the optimal value of k, especially when a large number of training examples is given (Korde and Mahender, 2012; Bali and Gore, 2015).
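The voting procedure can be sketched with cosine similarity over term-count vectors. This is a minimal sketch with hypothetical names and toy training data; documents are assumed to be represented as dictionaries mapping terms to counts.

```python
import math
from collections import Counter


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(query: dict, training: list, k: int = 3) -> str:
    """Label the query by majority vote among its k most similar examples."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled examples: (term counts, class label).
training = [
    ({"secret": 1, "merger": 1}, "confidential"),
    ({"secret": 1, "salary": 1}, "confidential"),
    ({"lunch": 1, "menu": 1}, "public"),
    ({"menu": 1, "party": 1}, "public"),
]
```

Every query is compared with every training example, which is why classification time grows with the size of the training set.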

DT is a classification method that builds its classification model in the form of a tree (Nisha and Karthik, 2016). The topmost decision node, which contains all the documents, is referred to as the root node (Patel and Mistry, 2015). According to Nisha and Karthik (2016), each internal node holds a subset of the documents, separated on the basis of one attribute or feature. In other words, a DT is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label (Nalini and Sheela, 2014). DT is capable of handling both categorical and numerical data (Korde and Mahender, 2012; Nisha and Karthik, 2016). A DT classifier poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached (Chavan et al, 2014; Nisha and Karthik, 2016). One notable advantage of DT is that it is easy to understand and interpret, even for non-expert users who are not familiar with the details of the model (Nalini and Sheela, 2014; Bali and Gore, 2015). One disadvantage is that irrelevant attributes may badly affect the construction of a DT (Patra and Singh, 2013; Vala and Gandhi, 2015).
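The question-and-answer traversal can be illustrated with a small hand-built tree. The tree, its attribute names and the function name are hypothetical; in practice the tree would be learned from training data rather than written by hand.

```python
# Hypothetical hand-built tree: internal nodes test one attribute,
# branches are the outcomes of the test, and leaves are class labels.
TREE = {
    "attribute": "mentions_salary",
    "branches": {
        True: "confidential",
        False: {
            "attribute": "marked_internal",
            "branches": {True: "confidential", False: "non-confidential"},
        },
    },
}


def tree_classify(record: dict, node=TREE) -> str:
    """Walk from the root, testing one attribute per node, until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node
```

Because each step is a single readable test, the path taken for any record can be explained to a non-expert user.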

Neural Networks (NNs) consist of many individual processing units or elements, called neurons, connected by weighted links that allow neurons to activate other neurons (Bali and Gore, 2015; Thaoroijam, 2014). These neurons work together to solve a specific problem (Patel and Mistry, 2015). Figure 11 is a block diagram of an NN.

Figure 9: NN Block Diagram (Patel and Mistry, 2015)

NNs are able to extract meaningful information from very large data sets, which is why they have been configured for specific application areas such as pattern recognition, feature extraction, and noise reduction (Patel and Mistry, 2015). NNs have the advantage of being flexible but the disadvantage of very high computing costs (Thaoroijam, 2014). A further disadvantage is that NNs are difficult for an average user to understand (Thaoroijam, 2014).
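The way weighted links let neurons activate other neurons can be sketched with a tiny feed-forward network. This is an illustrative sketch only: the sigmoid activation, the layer layout and all weights are assumptions, and no training is shown.

```python
import math


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs passed through a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def forward(inputs: list[float]) -> float:
    """Two hidden neurons feed one output neuron (hypothetical, untrained weights)."""
    h1 = neuron(inputs, [0.8, -0.4], 0.1)
    h2 = neuron(inputs, [-0.3, 0.9], 0.0)
    return neuron([h1, h2], [1.2, -0.7], 0.0)
```

The output depends on every weight in the chain, which helps explain why learned results are difficult for users to interpret.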

The advantages and disadvantages of the classifiers discussed above are presented in Table 8 (Patel and Mistry, 2015; Patra and Singh, 2013; Vala and Gandhi, 2015; Vanjari and Thombre, 2015).

Table 6: Advantages and Disadvantages of Classifiers

| Classifiers | Advantages | Disadvantages |
|---|---|---|
| SVM | Capture the inherent characteristics of the data better; Global minima vs. local minima; Compact description of the learned model, more capable to solve multi label classification. | Parameter tuning; kernel selection; Training speed is slow. |
| NBC | Work well on numeric and textual data; Easy to implement; Easy computation; Requires small amount of training data to estimate parameters; Good results are obtained in most cases. | Conditional independence assumption is violated (or Assumption of class conditional independence leads to loss of accuracy.); Performs very poorly when features are co related to each other; Practically dependencies exist among variables and sometimes these dependencies cannot be modelled by NB. |
| KNN | Effective; Non-parametric; More local characteristics of Document are considered comparing with Rocchio; Simple, effective and easy to implement. | Classification time is long; Difficult to find optimal value of k. |
| DT | Easy to understand; Easy to generate rules; Reduce problem complexity; Simple even non-expert user can understand | Training time is relatively expensive; One branch; Once a mistake is made at a higher level, any sub tree is wrong; Does not handle continuous variable well; May suffer from over fitting; Irrelevant attributes may affect badly the construction of a decision tree. |
| NN | Produce good results in complex domains; Suitable for both discrete and continuous data; Testing is very fast | Training is relatively slow; Learned results are difficult for users to interpret; It may lead to over fitting. |

3.5 Literature review discussion

This section discusses the literature review. Security and privacy are of great concern for BD. Traditional security mechanisms such as firewalls and IDSs are not adequate, and more sophisticated technologies are therefore needed to secure BD and prevent data leaks.

For a DLPS to be effective, both context-based and content-based analysis techniques should be considered, since the context surrounding the confidential data and the content itself are both important in DLP. The three main content analysis techniques are data fingerprinting (including exact or partial match), regular expression (including dictionary-based match) and statistical analysis, of which N-gram analysis and term weighting analysis are the main forms. N-gram statistical analysis addresses the drawbacks of regular expressions and data fingerprinting. However, most of the DLP techniques and methods discussed in section 3.4.3.2 fall under the detective approach. Much work has been done on the detective method, in which the system detects possible leakage incidents and applies a corrective action capable of handling the identified incident (Shabtai et al, 2012). Even so, sensitive data can still be leaked easily to unauthorized users if the data owner is careless, which could reduce the competitive advantage of an organization (Margathavalli et al, 2016).

Furthermore, the preventive approach ensures that possible leakage incidents are prevented before they happen by applying measures such as access control, disabling functions, encryption, and awareness (Shabtai et al, 2012; Peneti and Rani, 2015a; Alneyadi et al, 2016). According to Margathavalli et al (2016), it is essential to remedy this weakness of the detective approach by preventing the data from leaking in the first place, so that unauthorized users cannot obtain it and confidentiality is achieved. Several encryption algorithms are available that can be used to prevent data leakage in organizations. Encrypted documents can still bypass the detection mechanisms of DLPSs, which could result in leakage (Alneyadi et al, 2016). However, once the documents are encrypted, it is difficult to read these confidential or sensitive documents without the correct decryption keys.

3.6 Research gap

From the literature review, it is clear that although a number of DLP techniques and methods are available for the detective approach, little work has been done using the preventive approach. There is therefore a need to use a preventive approach such as encryption to develop a DLP method that helps prevent BD leakage before it happens, with emphasis on semi-structured data (textual data).

3.7 Research question

The main question for this research study, which is intended to address the research gap, is “How to design a method to help organizations prevent data leakage in BD?”. The emphasis is on semi-structured data (textual data).


4. RESEARCH METHODOLOGY

A Design Science Research Methodology (DSRM) is appropriate for answering the research question and achieving the objective of providing a method that helps organizations prevent data leakage in BD. Hevner et al (2004, p.77) explain that Design Science Research (DSR) “creates and evaluates IT artifacts intended to solve identified organizational problems”. IT artifacts are made up of constructs, models, methods, and instantiations (Hevner et al, 2004). According to Hevner et al (2004), methods define processes and guidance on how to solve problems. The proposed solution takes the form of a method. Currently, little work has been done on preventing data leakage in BD with a preventive approach that helps organizations stop leaks before they happen. To create an IT artifact (a method) that gives organizations specific guidelines for preventing data leakage, DSR is the suitable methodology. The main objective of this research is to design a method to help organizations prevent data leakage in BD, and this requires a comprehensive methodology such as DSR.

Peffers et al (2007) have provided six activities that should be followed when one needs to

conduct DSR. The details of these activities that would be followed in this research are

enumerated below. Figure 12 depicts the DSRM process model which summarises these six

steps or activities:

Activity 1: Problem identification and motivation.

Activity 2: Define the objectives for a solution.

Activity 3: Design and development.

Activity 4: Demonstration.

Activity 5: Evaluation.

Activity 6: Communication.

[Figure: DSRM process model. Nominal process sequence: Identify problem and motivate (define problem, show relevance); Define objectives of a solution (what would a better artifact accomplish?); Design and development (artifact); Demonstration (find suitable context, use artifact to solve problem); Evaluation (observe how effective and efficient, iterate back to design); Communication (scholarly publications, professional publications, master thesis report). Possible research entry points: problem-centered initiation; objective-centered solution; design and development centered initiation; client / context initiated. Additional label: process iteration.]

Figure 10: DSRM Process Model (Peffers et al, 2007)

4.1 Activity 1: Problem identification and motivation

This is the first activity of the DSRM. The problem identification and motivation are presented in chapter two. In summary, little has been done to prevent data leakage in BD using a preventive approach that can help organizations stop leakage before it happens.

4.2 Activity 2: Define the objectives for a solution

The outcome of the previous activity is used to determine the objectives for the IT artifact (method), which is to prevent data leakage in BD using a preventive approach such as encryption. This will help organizations that are willing to adopt DLP technologies or solutions to follow guidelines that prevent leakage before it happens. The solution focuses on text categorization or classification of semi-structured big data sets (which are mostly textual data) into confidential and non-confidential data so that the confidential data can be encrypted to prevent leakage. Text data can also be grouped under unstructured data, and it forms a huge percentage of today’s data: about 80% of today’s data is stored as text (Patil et al, 2016), and studies have shown that about 80% of all stored organizational data is unstructured (Khan et al, 2014; Kanimozhi and Venkatesan, 2015).

4.3 Activity 3: Design and development

The IT artifact (method) is designed and developed based on the objectives above. Text data can be generated from numerous sources such as emails, comments, and tweets. These data sets are therefore categorized or classified into confidential and non-confidential data. The aim is to prevent leakage through the preventive approach of encryption, so that even if leakage occurs, the original documents (plaintext) cannot be accessed without the proper decryption keys.

4.3.1 Kernel theory

During the design of the IT artifact, a design theory, commonly referred to as a kernel theory, is followed. The kernel theory is applied both in defining the design artifact and during the design process. Kernel theories are derived from the natural sciences, social sciences and mathematics, and they govern both the design requirements and the design process of the artifact itself (Walls et al, 1992; Markus et al, 2002; Iivari, 2007). According to Markus et al (2002), a practitioner theory-in-use could also serve as a kernel theory. This implies that a design theory is not necessarily based on scientifically or empirically validated knowledge; a kernel theory could be either an academic theory (e.g., organizational psychology) or a practitioner theory-in-use (Markus et al, 2002). For this reason, a comprehensive industry process model for data mining projects called CRISP-DM (CRoss Industry Standard Process for Data Mining) is adopted as the kernel theory or framework for designing the IT artifact (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000; Moro et al, 2011; Al-Radaideh and Al-Nagi, 2012). According to these authors, this process model provides a framework for carrying out data mining projects that is independent of both the technology used and the industry sector involved. They further note that the process model covers the whole life cycle of a data mining project. In addition, the CRISP-DM process model is made up of a non-rigid sequence of six phases, as shown in Figure 11 (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000; Moro et al, 2011; Al-Radaideh and Al-Nagi, 2012).

Figure 11: Phases of the CRISP-DM Process Model (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000)

• Phase 1: Business Understanding

The first phase ensures that the project objectives and requirements are understood from a business perspective, so that this knowledge can be converted into a data mining problem definition and a preliminary project plan designed to achieve the objectives (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000). For this thesis, the main objective is to design a method to help organizations prevent data leakage in BD with emphasis on semi-structured data (textual data). It is prudent to understand the business for which a solution is being sought. The business understanding phase is made up of several key steps, which include determining business objectives, assessing the situation, determining the data mining goals, and producing the project plan (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000). The tasks and outputs involved in the phases of the CRISP-DM reference model are presented in Figure 12.

• Phase 2: Data Understanding

The data understanding phase begins with an initial data collection and continues with the data scientist becoming familiar with the data, in order to identify data quality problems and gain first insights into the data, as shown in Figure 11 (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000). There is a direct link between the Business Understanding and Data Understanding phases. The initial data collection targets semi-structured data (textual data), and all other steps that lead to the required data are followed.

• Phase 3: Data Preparation

This stage covers all activities to construct the final data set which will be fed into the

modelling tool or software (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010;

Shearer, 2000). In this case data will be prepared for use in classification algorithms.

• Phase 4: Modelling

In this phase, there will be selection and application of various modelling techniques to

ensure optimal results (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer,

2000). There can be several modelling techniques that can be applied to the same data

mining problem type (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer,

2000). Again, there is a close link between the Data Preparation and Modelling phases.

• Phase 5: Evaluation

This phase assesses the data mining results to determine whether they achieve the business objectives. If more processes are to be modelled, the process returns to the Business Understanding phase (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000).

• Phase 6: Deployment

Implementation takes place in the deployment stage. In this case, once everything is successful, the result is presented as the thesis report.


Figure 12: Overview of the CRISP-DM tasks and their outputs (Wirth and Hipp, 2000; Shearer, 2000)

4.4 Activity 4: Demonstration

The IT artifact (method) is demonstrated to show how effectively it can be implemented to assist organizations that need to prevent data leakage in BD, through experiments on publicly available semi-structured BD sets (textual data). For instance, “electronic textual documents are highly obtained from the social websites” (Patel and Mistry, 2015, p. 84).

4.5 Activity 5: Evaluation

The IT artifact (method) is analysed and observed to determine whether the objectives are achieved. The feedback and the successful implementation of the method in practice provide an initial “proof-of-concept” level validation of the new method (Peffers et al, 2007).

4.6 Activity 6: Communication

The outcome of the thesis is presented and the results shared through the master thesis report, which will also be made available to the public and other interested parties through an approved publication.


5. DESIGN AND DEVELOPMENT

This chapter presents the actual IT artifact (method) that serves as the solution to the research question. The objective of the solution is to provide a method to help organizations prevent data leakage in BD, with emphasis on semi-structured data (textual data), using a preventive approach such as encryption. In designing the IT artifact, the CRISP-DM process model, which serves as the kernel theory (section 4.3.1), is followed.

5.1 Data Understanding

The data that an organization needs to protect against leakage can be classified as either confidential or non-confidential. It may belong to the organization itself or to clients who share their private information with the organization.

5.1.1 What is Confidential or Sensitive Data?

According to the National Institute of Standards and Technology (NIST) Special Publication 800-122 (2010), confidential or sensitive data is any data which contains personally identifiable information (PII) such as name, social security number, date and place of birth, mother’s maiden name, or biometric records. Further examples are listed below (NIST Special Publication 800-122, 2010; PEER Mississippi, 2017; The University of Texas – Austin Information Security Office, 2017; Carnegie Mellon University Information Security Office, 2017):

• Social Security number (SSN).

• Credit / debit / payment card numbers – with information such as Cardholder name,

Expiration date, Card verification code.

• Driver's license number.

• Personal information for patients (medical records).

• Financial data for an organization.

• Personal information for students.

• Student records (study plans, marks, transcripts).

• Personal identifiable information for employees – salary, birth date, biometric

information, mother’s maiden name, electronic or digitized signatures, etc.

• Private key (digital certificate)

• Passwords or credentials

• PINs (Personal Identification Numbers)

• Research data within a university.

• Legal data special for a university.

• Trade secrets or intellectual property such as research activities

The General Data Protection Regulation (GDPR) [Regulation (EU) 2016/679], which was adopted on 27 April 2016 and became applicable throughout the European Union (EU) on 25 May 2018, also makes a significant contribution to the protection that should be accorded to personal data and sensitive personal data, and it supports the definition given in NIST SP 800-122 (2010) (Bhaimia, 2018). According to Article 4 (1) of the GDPR [Regulation (EU) 2016/679], “personal data means any information relating to an

identified or identifiable natural person (‘data subject’); an identifiable natural person is one

who can be identified, directly or indirectly, in particular by reference to an identifier such as

a name, an identification number, location data, an online identifier or to one or more factors

specific to the physical, physiological, genetic, mental, economic, cultural or social identity

of that natural person”. In addition, what is normally referred to as sensitive personal data, such as health data, genetic data and biometric data, should receive even stronger protection.

The information that is considered confidential or sensitive differs depending on the type of business an organization operates. However, certain information is considered confidential or sensitive in all organizations. Examples are employees’ personal information, payroll information, appointment / offer letters, payslips, and employees’ phone numbers and home addresses.

5.1.1.1 Data description

The data used are payroll information such as payslips, salary details and appointment / offer letters of a company (name withheld due to the sensitivity of the data), which have been collected and exported into Excel, Word, and Text formats. The employees of this company fall into three categories, namely Junior, Senior, and Management Staff. The basic salaries of these employees should only be known by some human resource (HR) and finance staff. Even the salaries of the management staff are known only by a few managers, namely the HR and Finance Managers. However, these documents could still be leaked. There is therefore a need to prevent such confidential or sensitive documents or information from falling into the wrong hands. The documents should be encrypted so that, even if leakage occurs, nobody can make sense of the data. For illustration, sample data is shown in Figure 13.

Figure 13: Sample Data

5.2 Data Preparation

The data preparation or data pre-processing stage converts the raw data into an appropriate format for use in the modelling stage. This phase includes several subtasks such as data selection, data cleansing, data construction and data formatting. All the data have been exported into Text (TXT) format, which can be read by most data mining software. These are mostly appointment / offer letters, salaries, and payslips. These data are considered very confidential or sensitive.

5.3 Modelling

The three main types of machine learning algorithms are supervised, unsupervised, and reinforcement learning algorithms (Abdallh et al, 2016; Kaur, 2016; Patil et al, 2016). The goal of supervised learning algorithms is to learn classifiers from known examples or data sets (i.e. labelled documents) in order to apply the classification automatically to unknown examples or data sets (unlabelled documents) (Bali and Gore, 2015; Chavan et al, 2014; Vala and Gandhi, 2015). In other words, “supervised learning means learning from examples” (Patil et al, 2016, p. 517). Examples of supervised learning algorithms are Support Vector Machine (SVM), K Nearest Neighbor (K-NN), Naive Bayes Classifier (NBC), Random Forest, Regression, Logistic Regression, and Decision Trees (DT) (Abdallh et al, 2016; Bali and Gore, 2015; Chavan et al, 2014; Vala and Gandhi, 2015). In unsupervised learning, the documents or data sets are not labelled at any point in the whole process. Examples of unsupervised learning algorithms are Clustering, the Apriori algorithm, Affinity Analysis, and Self-Organizing Maps (SOM) (Abdallh et al, 2016; Kaur, 2016).

Reinforcement learning occurs when the algorithm learns from external feedback given by the environment (Abdallh et al, 2016; Portugal et al, 2018; Bonaccorso, 2017). The algorithm chooses an action for each data point and later learns how good the decision was (Abdallh et al, 2016). Over time the algorithm changes its strategy to learn better and achieve a better reward. The machine is thus trained to make specific decisions: it is exposed to an environment in which it trains itself continually by trial and error, learning from past experience in order to provide the best possible knowledge for making accurate business decisions (Portugal et al, 2018). For example, consider an algorithm that plays a game against an opponent. The moves that lead to victories (positive feedback) should be learned and repeated, while those that lead to losses (negative feedback) should be avoided (Portugal et al, 2018; Bonaccorso, 2017). Reinforcement learning problems are commonly modelled as Markov Decision Processes, and Artificial Neural Networks (ANN) are often used within reinforcement learning methods.

There is also a fourth type, semi-supervised learning, which is applied to a mix of labelled and unlabelled data (Portugal et al, 2018; Bonaccorso, 2017). Such algorithms can learn from incomplete information or from a training set with missing values (Portugal et al, 2018; Bonaccorso, 2017). An example is movie ratings, where not every user has rated every movie, so some information is missing (Portugal et al, 2018). According to Bonaccorso (2017), semi-supervised learning can also be applied when it is necessary to categorize a large amount of data of which only a few items are labelled (complete).

The main idea behind every DLP solution is to “detect and prevent unauthorized attempts to

copy or send sensitive data, both intentionally or/and unintentionally, without authorization,

by people who are authorized to access the sensitive information” (Kale et al, 2015, p. 55;

Tidke et al, 2015, p. 28). In other words, “DLP is a technique used to hide the confidentiality

of data being accessed by unauthorized user” (Jain and Lenka, 2016, p. 57). To achieve this, one should be able to classify documents into confidential and non-confidential based on previously known documents or data sets (predefined categories). Organizations know which documents are confidential or sensitive, in the sense that unauthorized access or disclosure could harm their business or the personnel involved, and which are not. Since organizations can therefore label their documents as confidential or non-confidential, a supervised machine learning technique, namely classification, is appropriate in this situation.


5.4 Cryptography (Encryption and Decryption)

After documents or files have been classified into confidential and non-confidential data, the confidential documents need to be encrypted so that only users with the decryption key can access them. Leakage is thus prevented at the level of whole documents. Cryptography provides a way to store sensitive or confidential information, or to transmit it across insecure networks (i.e. the Internet), so that only the intended recipients can read it (Al-Hazaimeh, 2013; Bhanot and Hans, 2015). Cryptography can be divided into three main areas: symmetric-key cryptography, asymmetric-key cryptography and hashing.

• Symmetric-key cryptography

In symmetric-key cryptography, a single secret key is shared by both parties involved in the communication and is used for both encryption and decryption. Examples of symmetric-key encryption are Data Encryption Standard (DES), Triple DES, Advanced Encryption Standard (AES), RC5, BLOWFISH, TWOFISH, and THREEFISH (Daimary and Saikia, 2015; Bhanot and Hans, 2015).

• Asymmetric-key cryptography

In asymmetric-key cryptography, two keys are involved in the communication: a private key and a public key. Data encrypted with the public key must be decrypted with the corresponding private key. This type is also referred to as public-key cryptography. Examples are RSA and Elliptic Curve cryptography (Daimary and Saikia, 2015; Bhanot and Hans, 2015).

• Hashing

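Hashing generates a fixed-length message digest from a variable-length message. Unlike encryption, it is a one-way operation: the message cannot be recovered from the digest. The intended recipient must have the message as well as the digest in order to verify that the message has not been altered.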
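As a small illustration of the fixed-length property, the following Python sketch uses SHA-256 from the standard library. The choice of SHA-256 and the sample messages are assumptions made for this example; the thesis does not prescribe a particular hash function.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return the SHA-256 digest of a message as a hexadecimal string."""
    return hashlib.sha256(message).hexdigest()


# Two hypothetical messages of very different length.
short_digest = digest(b"payslip")
long_digest = digest(b"payslip" * 10_000)

# Both digests are 64 hexadecimal characters (256 bits) long.
print(len(short_digest), len(long_digest))

# A recipient who holds the message and the digest can recompute and compare.
received_message = b"payslip"
message_is_intact = digest(received_message) == short_digest
```

Changing a single byte of the message yields a different digest, which is what lets the recipient detect alteration.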

5.4.1 RSA Cryptosystem

The RSA cryptosystem is named after R. Rivest, A. Shamir, and L. Adleman (Jamgekar and Joshi, 2013; Bhanot and Hans, 2015). It is the most widely used public key cryptosystem (Jamgekar and Joshi, 2013; Bhanot and Hans, 2015). RSA uses two keys, a public key and a private key: the public key is used to encrypt the data and the corresponding private key is used to decrypt it. The RSA algorithm involves the following (Bhanot and Hans, 2015; Mahajan and Sachdeva, 2013):

Key Generation (public/private key pair):

1. Select two large distinct primes $p$ and $q$ ($p \neq q$).

2. Compute $n = p \times q$.

3. Compute $\phi(n) = (p-1) \times (q-1)$.

4. Select $e$ such that $1 < e < \phi(n)$ and $e$ is coprime to $\phi(n)$.

5. Compute the unique integer $d = e^{-1} \bmod \phi(n)$.

6. The public key is $(e, n)$.

7. The private key is $d$.

Encryption:

While encrypting, the following is done:

$C = P^{e} \bmod n$

where $C$ is the cipher text and $P$ is the plain text. Both $e$ and $n$ are public.

Decryption:

While decrypting, the following is done:

$P = C^{d} \bmod n$
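The key generation, encryption and decryption steps above can be traced with deliberately tiny numbers. In the sketch below, the primes 61 and 53, the exponent 17 and the plaintext 65 are illustrative assumptions; real RSA keys use primes of a thousand bits or more together with a padding scheme.

```python
from math import gcd

# Toy parameters chosen for readability only (assumed values, far too small to be secure).
p, q = 61, 53
n = p * q                      # step 2: modulus
phi = (p - 1) * (q - 1)        # step 3: phi(n)
e = 17                         # step 4: public exponent
assert 1 < e < phi and gcd(e, phi) == 1
d = pow(e, -1, phi)            # step 5: modular inverse of e (needs Python 3.8+)


def encrypt(plain: int) -> int:
    """C = P^e mod n, using the public key (e, n)."""
    return pow(plain, e, n)


def decrypt(cipher: int) -> int:
    """P = C^d mod n, using the private exponent d."""
    return pow(cipher, d, n)


P = 65
C = encrypt(P)
recovered = decrypt(C)
print(n, phi, d, C, recovered)  # prints: 3233 3120 2753 2790 65
```

The plaintext must be an integer smaller than n, which is why RSA on its own can only handle short inputs.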

According to Bhanot and Hans (2015), asymmetric encryption provides more security than symmetric encryption, whereas symmetric encryption is faster in terms of encryption speed. AES, for instance, provides not only security but also great speed (Mahajan and Sachdeva, 2013). The main disadvantage of RSA is its encryption speed (Bhanot and Hans, 2015), which is in fact the main disadvantage of asymmetric-key algorithms in general: they provide good security but are slow at encrypting files. Furthermore, RSA can only encrypt data that is smaller than the key length (Elst, 2015; Brumbaugh, 2015), so the amount of plaintext it can encrypt is limited (Brumbaugh, 2015). For instance, if the key size is 2048 bits, at most 256 bytes of plaintext can be encrypted (Brumbaugh, 2015). Asymmetric encryption is therefore slow and cannot be applied directly to large files (Bikulov, 2013). To work around this limitation, the solution is to use hybrid symmetric-asymmetric encryption for the big data situation.
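The size limit can be made concrete with a little arithmetic. The sketch below is an illustration added here, not part of the cited sources: the function name and the padding labels are hypothetical, while the overheads are those defined for PKCS #1 v1.5 (11 bytes) and OAEP (twice the hash length plus 2 bytes).

```python
def max_rsa_plaintext_bytes(key_bits: int, padding: str = "none") -> int:
    """Upper bound on the plaintext size for a single RSA encryption.

    The padding labels are hypothetical names used only in this sketch.
    """
    modulus_bytes = key_bits // 8
    overhead = {
        "none": 0,                  # textbook RSA: plaintext must still be numerically below n
        "pkcs1v15": 11,             # PKCS #1 v1.5 encryption padding
        "oaep-sha1": 2 * 20 + 2,    # OAEP with a 20-byte hash
        "oaep-sha256": 2 * 32 + 2,  # OAEP with a 32-byte hash
    }
    return modulus_bytes - overhead[padding]


for scheme in ("none", "pkcs1v15", "oaep-sha1", "oaep-sha256"):
    print(scheme, max_rsa_plaintext_bytes(2048, scheme))
```

With padding, a 2048-bit key therefore leaves room for fewer than 256 bytes, which is ample for a 32-byte AES key but not for a document.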

5.5 Proposed DLP Method

The proposed IT artifact (method) which will help organizations to prevent data leakage in

BD with emphasis on semi-structured data (textual data) using the preventive approach such

as encryption consists of two main phases:

• Phase 1: Classification of organizational documents into confidential and non-

confidential with the help of a classification technique.

• Phase 2: Applying a hybrid cryptographic technique (made up of AES and RSA) to

encrypt all the confidential documents.

5.5.1 Phase 1: Classification of organizational documents

The objective of this phase is to determine which organizational documents are confidential

and non-confidential so that the confidential ones would be encrypted in the second phase.

The classification method of NBC will be performed on the documents. NBC has been

selected as the appropriate classification method due to the following advantages (see section

3.4.3.3):

• Works well on numeric and textual data.

• Easy to implement.

• Easy computation.

• Requires small amount of training data to estimate parameters.

• Good results are obtained in most cases.

The input to this phase is a collection of confidential and non-confidential documents. As a pre-processing stage, every document is tokenized, transformed to a single case, filtered for stop words, expanded with n-grams, and stemmed. Finally, the documents are transformed into vectors of terms weighted by TF-IDF. Phase 1 is subdivided into Training (Learning) and Detection phases.
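The pre-processing chain can be sketched in a few lines of Python using only the standard library. This is a rough stand-in for the operators of a data mining tool: the stop-word list, the suffix-stripping stemmer, the choice of bigrams, the order of stemming before n-gram generation, and the plain TF-IDF formula (term frequency times log of inverse document frequency) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "is", "for", "to", "a", "in"}  # illustrative subset


def naive_stem(token: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, n: int = 2) -> list:
    """Tokenize, lower-case, drop stop words, stem, then append word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams


def tfidf(documents: list) -> list:
    """Return one {term: weight} dictionary per document."""
    term_lists = [preprocess(doc) for doc in documents]
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(documents)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical two-document corpus.
vectors = tfidf(["Payslip for the month of May", "Canteen menu for the month"])
```

Terms that occur in every document (here "month") receive a weight of zero, while terms specific to one document (here "payslip") receive a positive weight, which is what makes TF-IDF useful for separating the two classes.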

5.5.1.1 Training (Learning) Phase

During the training phase, a set of organizational confidential and non-confidential documents

which will serve as a training set will be used to develop a model using the NBC. This can be

achieved by following the algorithm below:


INPUT: Confidential and Non-Confidential text documents

PROCESS / OPERATION: Apply NBC technique

OUTPUT: Training model

Steps:

1. Collect the confidential and non-confidential text documents of an organization.

2. Load both data sets into the appropriate data mining tool.

3. Perform text pre-processing

4. Perform supervised NBC on both data sets.

5. Store the training model

5.5.1.2 Detection Phase

During the detection phase, a set of unknown documents, which is a mix of confidential and non-confidential documents, serves as input so that the model generated in the training phase can be applied. The detection phase algorithm is as follows:

INPUT: Unknown text documents (Combination of Confidential and Non-Confidential

text documents)

PROCESS / OPERATION: Apply the training model generated in the training phase

OUTPUT: Prediction label of confidential and non-confidential text documents.

Steps:

1. Load the unknown text documents in the appropriate data mining tool.

2. Perform text pre-processing.

3. Apply the training model generated in the training phase.

4. Group the confidential text documents.
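The training and detection steps above can be sketched as a small multinomial Naive Bayes classifier. This is a minimal illustration rather than the model built in the data mining tool: it works on raw word counts with Laplace smoothing instead of TF-IDF vectors, and the sample documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs: list) -> dict:
    """Training phase: build class priors and per-class word counts."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    total = sum(class_docs.values())
    priors = {label: n / total for label, n in class_docs.items()}
    return {"priors": priors, "word_counts": word_counts, "vocab": vocab}


def predict(model: dict, text: str) -> str:
    """Detection phase: return the label with the highest log-probability."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        counts = model["word_counts"][label]
        denominator = sum(counts.values()) + vocab_size  # Laplace smoothing
        score = math.log(prior)
        for token in tokenize(text):
            if token in model["vocab"]:                  # ignore unseen words
                score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set standing in for the organization's documents.
model = train([
    ("payslip basic salary allowance net pay", "confidential"),
    ("appointment letter offer salary grade", "confidential"),
    ("canteen menu for the week", "non-confidential"),
    ("office closed on public holiday", "non-confidential"),
])
unknown = ["monthly payslip with net salary", "holiday menu in the canteen"]
confidential_docs = [doc for doc in unknown if predict(model, doc) == "confidential"]
```

The list `confidential_docs` corresponds to step 4 of the detection phase: the documents grouped as confidential are the ones handed to the encryption phase.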

5.5.2 Phase 2: Encryption and decryption of confidential documents.

Phase 2 of the proposed IT artifact (method) is a hybrid of symmetric and asymmetric encryption that is capable of encrypting a big file with a symmetric algorithm (i.e. AES) using a randomly generated key or password created on the fly. The key is then stored in a file and encrypted with an asymmetric algorithm (i.e. RSA). This can be achieved by following the steps (algorithm) below (Elst, 2015; Bikulov, 2013):

INPUT: Confidential text documents

PROCESS / OPERATION: Hybrid of AES and RSA encryption techniques

OUTPUT: Encrypted or decrypted confidential documents.

Steps:

1. Generate RSA Keypairs

2. Generate AES Key (the random password file)

3. Encryption:

a. Encrypt File with AES Key (i.e. Encrypt the file with the random key)

b. Encrypt AES Key with RSA Public Key (i.e. Encrypt the random key

with the public key file)

4. Decryption:

a. Decrypt AES Key with RSA Private Key (i.e. Decrypt the random key

with the private key file)


b. Decrypt File with AES Key (i.e. Decrypt the large file with the random

key).
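The flow of the steps above can be traced with a structural sketch that runs on the Python standard library alone. It is emphatically not the thesis implementation and not secure: a SHA-256 counter keystream stands in for AES, and unpadded textbook RSA over two known Mersenne primes stands in for the RSA key pair. Both stand-ins, and all names in the code, are assumptions made so that the example is self-contained; a real deployment would use AES and padded RSA from a vetted tool such as OpenSSL.

```python
import hashlib
import secrets
from typing import Tuple

# Step 1 (stand-in): textbook RSA key pair built from two known Mersenne primes.
# Assumed values for illustration; special-form primes and unpadded RSA are not secure.
P_PRIME = 2**521 - 1
Q_PRIME = 2**607 - 1
N = P_PRIME * Q_PRIME
E = 65537
D = pow(E, -1, (P_PRIME - 1) * (Q_PRIME - 1))   # needs Python 3.8+


def symmetric_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in for AES: XOR data with a SHA-256 counter keystream (same call decrypts)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def encrypt_document(plaintext: bytes) -> Tuple[bytes, int]:
    session_key = secrets.token_bytes(32)                          # step 2: random key
    ciphertext = symmetric_xor(session_key, plaintext)             # step 3a: encrypt file
    wrapped_key = pow(int.from_bytes(session_key, "big"), E, N)    # step 3b: wrap key
    return ciphertext, wrapped_key


def decrypt_document(ciphertext: bytes, wrapped_key: int) -> bytes:
    session_key = pow(wrapped_key, D, N).to_bytes(32, "big")       # step 4a: unwrap key
    return symmetric_xor(session_key, ciphertext)                  # step 4b: decrypt file


# Hypothetical confidential document.
document = b"Payslip: basic salary and allowances for the month. " * 4
ciphertext, wrapped_key = encrypt_document(document)
restored = decrypt_document(ciphertext, wrapped_key)
```

The point of the design is visible in the function signatures: only the 32-byte session key ever passes through the slow, size-limited asymmetric operation, while the document itself, whatever its size, is handled by the fast symmetric cipher.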

Figure 14 illustrates the flowchart of the proposed DLP method.

Figure 14: Flowchart of proposed DLP method

6. DEMONSTRATION

This chapter demonstrates how confidential documents (files) can be encrypted to prevent leakage after they have been classified. It also serves as the implementation or instantiation of the proposed DLP method to help prevent leakage of semi-structured BD sets (textual data).

6.1 Experimental setup

The proof of concept has been developed in a virtual environment based on Oracle VM VirtualBox running the Ubuntu 16.04 LTS 32-bit operating system with 2GB RAM and 50GB hard disk space. The encryption and decryption of files with public keys are done via the OpenSSL command line, and RapidMiner Studio 8.1 is used as the data mining tool to model the data.

6.2 Data Sets (Documents)

The data, comprising several files selected from an organization (name withheld due to the sensitivity of the data), are placed into three separate folders: confidential, non-confidential, and unknown. The confidential folder is made up of text files such as appointment / offer letters, payslips and other salary information. The non-confidential folder is made up of other files that have nothing to do with payroll information. The third folder, named unknown, comprises a combination of confidential and non-confidential text files which are used to test and apply the model.

Training / learning data sets:

• Confidential folder – eight (8) confidential text files

• Non-confidential folder – four (4) files which are not payroll or salary related.

Testing data sets:

• Unknown folder – combination of payroll / salary and other non-payroll / salary related files, made up of three (3) confidential and four (4) non-confidential text files.
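A folder layout like the one above maps naturally onto labelled training data. The following Python sketch shows one way to read it; the function names and the use of lower-case folder names `confidential`, `non-confidential` and `unknown` are assumptions, and the demo files are created in a temporary directory purely so that the example runs on its own.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

LABELS = ("confidential", "non-confidential")  # assumed folder names


def load_labelled(root: Path) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for every .txt file under root/<label>/."""
    pairs = []
    for label in LABELS:
        for path in sorted((root / label).glob("*.txt")):
            pairs.append((path.read_text(encoding="utf-8", errors="replace"), label))
    return pairs


def load_unknown(root: Path) -> List[Tuple[str, str]]:
    """Return (file name, text) pairs for the unlabelled test files."""
    return [(path.name, path.read_text(encoding="utf-8", errors="replace"))
            for path in sorted((root / "unknown").glob("*.txt"))]


# Demo with hypothetical files in a throw-away directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in LABELS + ("unknown",):
        (root / folder).mkdir()
    (root / "confidential" / "payslip_01.txt").write_text("basic salary", encoding="utf-8")
    (root / "non-confidential" / "menu.txt").write_text("canteen menu", encoding="utf-8")
    (root / "unknown" / "doc_a.txt").write_text("net pay", encoding="utf-8")
    training_pairs = load_labelled(root)
    test_pairs = load_unknown(root)
```

The label of each training document comes from the folder it sits in, which mirrors how the folders are used to train and then test the model.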

Appendix 1 contains the various process maps involving the pre-processing activities, the

classification of documents into confidential and non-confidential, and the application of the

model.

6.3 Experiment 1

A confidential file named largefile.txt is encrypted and decrypted to demonstrate the implementation of the second phase of the proposed DLP method (see section 5.5.2). The various commands and a screenshot of the final outcome showing all the steps are given in Appendix 2.

6.4 Experiment 2

The overall goal of this thesis is to prevent leakage of confidential documents (text files). However, users who are authorized to work with such files should be able to access them without compromising the private keys involved in the asymmetric encryption. To achieve this, authorized users are allowed to run decryption scripts against any confidential documents they are allowed to work with. They only run executable bash scripts, and the details of the private and public keys are of no concern to them. For this reason, I will create bash scripts to encrypt and decrypt confidential files.

To demonstrate this approach, I will create two folders – local and remote folders. The remote

folder will serve as a server machine where the encryption bash script, public key, AES

encrypted key and the confidential encrypted files will be stored. In this case even if the remote

machine (server) is hacked or leakage occurs, the confidential files will not be compromised

since they will all be encrypted and the private key and decryption bash script will also not be

available.

The second phase of the proposed DLP method which is a hybrid of symmetric-asymmetric

encryption (see section 5.5.2) will be implemented as bash scripts. The various steps and

screenshots involved are illustrated in Appendix 3 (Bikulov, 2013).

6.5 Experiment 3

In experiments 1 and 2, the focus was on encryption and decryption of single files. In experiment 3, the emphasis is on encrypting and decrypting multiple files within a folder or directory. To achieve this, all the confidential or sensitive files in a folder are archived in either tar or zip format before being encrypted. In experiment 3, the encryption password can be supplied directly through the terminal before the encryption. The various stages involved are shown in Appendix 4.

6.6 Experiment 4

Experiment 4 combines the ideas from experiments 2 and 3, such that multiple files are simply archived before being encrypted. Once this is done, the same bash scripts used in experiment 2 (see section 6.4 and Appendix 3) can be used. To achieve this, we create a gzip tarball and then encrypt the tarball. The steps involved in this experiment are presented in Appendix 5.
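The archiving step can be sketched with Python's standard `tarfile` module. This is an illustration of the idea only, with hypothetical function and file names; it produces the gzip tarball in memory so that the resulting bytes could be passed to whichever encryption routine is in use.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_tarball(folder: Path) -> bytes:
    """Pack every file under `folder` into an in-memory gzip tarball."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                archive.add(str(path), arcname=path.relative_to(folder).as_posix())
    return buffer.getvalue()


def list_tarball(data: bytes) -> list:
    """Return the sorted member names of a gzip tarball held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return sorted(archive.getnames())


# Demo with hypothetical confidential files in a throw-away directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "payslips").mkdir()
    (folder / "offer_letter.txt").write_text("appointment details", encoding="utf-8")
    (folder / "payslips" / "may.txt").write_text("net pay", encoding="utf-8")
    tarball = make_tarball(folder)

members = list_tarball(tarball)
```

Because the whole folder becomes a single byte string, a routine that encrypts one file can be reused unchanged for many files.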

7. EVALUATION

Evaluation is a crucial and essential part of conducting rigorous DSR (Venable et al, 2012; Sonnenberg and vom Brocke, 2012). According to Sonnenberg and vom Brocke (2012), the evaluation patterns most commonly distinguished in a DSR process are ex ante and ex post evaluations, with four evaluation activities (Eval) as shown in Figure 15. Ex ante evaluations are conducted before the construction of an artifact, while ex post evaluations occur after the construction of an artifact (Venable et al, 2012; Sonnenberg and vom Brocke, 2012).

Figure 15: Evaluation activities within a DSR process

The Eval1 activity exists to ensure that a proper research problem has been selected and

formulated. This was achieved through literature review processes, such as a literature search,

which confirmed that a research gap exists. The research problem itself is formulated in

chapter 2.

The Eval2 activity exists to ensure that the artifact design embodies the solution to the stated

problem. Since the artifact has not been constructed at this stage, this evaluation is artificial.

It was achieved through the assertion that the IT artifact will be constructed to solve a

business problem, which supports the feasibility of the design process.

The Eval3 activity serves as an initial demonstration of how well the artifact performs when

interacting with organizational elements. This was achieved in the demonstration (see

chapter 6), where several experiments were conducted to cater for different scenarios within

an organizational setting.

Finally, Eval4 exists to ensure that an artifact is both applicable and useful in practice. The

experiments conducted on documents from an organization indicate that the IT artifact

(method) is usable and effective and can be applied in other organizations. In other words,

Eval4 was achieved through a case study (organization name withheld) in which real

organizational data was used.

7.1 Impact of the IT Artifact

Before the construction of the IT artifact, when confidential documents of the organization

leaked, unauthorized people could make sense of the data because the documents were stored in

plaintext. Moreover, encrypted documents can bypass the detection mechanisms of DLPSs, so

detection alone may still result in leakage (Alneyadi et al, 2016). However, once the documents

are encrypted, it is difficult to see the details of these confidential or sensitive documents

without the correct decryption keys. The design and construction of the IT artifact therefore

addresses this problem, because attackers or unauthorized users cannot make sense of the

encrypted data unless the decryption keys are provided.

8. DISCUSSION

This thesis focuses on designing an IT artifact (method) that can prevent data leakage before

it happens. The preventive approach of DLPS was chosen because it is better to prevent leakage

than to wait for it to happen before detective measures are applied. A drawback of the detective

approach is that it only checks whether leakage has happened. In addition, the literature review

showed that a lot of work has already been done on the detective approaches of DLPS. For these

reasons encryption, which is one of the preventive approaches, was adopted. However, this

process would not have been successful without an appropriate method for addressing the

research question, and DSRM was adopted for this purpose. It was not the only method adopted:

CRISP-DM, the de facto standard for data mining projects, was added to serve as the kernel

theory, because before any DLPS can work, the system must be trained with the actual

confidential and non-confidential documents of an organization. What is considered a

confidential or sensitive document varies from one organization to another. However, certain

information, such as employees' personal information, payroll information, appointment or

offer letters and payslips, is considered confidential or sensitive across all organizations.

To achieve this, the data at hand must be understood and modelled with an appropriate

technique.

To identify the confidential documents before encrypting them, the NBC text classification

method was adopted as the modelling technique to classify the documents. The proposed

encryption method, a hybrid of symmetric and asymmetric encryption, was then applied to

encrypt all the confidential documents so that only authorized users can access them. The RSA

and AES encryption algorithms were implemented with OpenSSL. This approach proved effective

because, without the decryption keys, it is difficult for an unauthorized user to access the

confidential documents. In terms of security, asymmetric encryption is strong but slow, so

symmetric encryption, which is fast and can handle the encryption and decryption of large

files, was included as well. The demonstration of the encryption and decryption of large files

addresses the BD aspect of the research question. This will improve the security of BD and,

by extension, of the organization's data.
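Besides speed, there is a practical size limit that makes symmetric encryption necessary for large files, and it can be observed directly. The sketch below is an illustration under assumed parameters (a 2048-bit demonstration key, temporary file names, and openssl pkeyutl with its default PKCS#1 v1.5 padding): RSA on its own accepts only a payload smaller than the key modulus, so it can wrap a 32-byte AES secret but not a file of even 1 KiB.

```shell
#!/usr/bin/env bash
# Sketch only: why RSA alone cannot encrypt a large file.
set -euo pipefail

workdir="$(mktemp -d)"
openssl genrsa -out "$workdir/private.pem" 2048 2>/dev/null
openssl rsa -in "$workdir/private.pem" -pubout -out "$workdir/public.pem" 2>/dev/null

# A 1 KiB "document" and a 32-byte random secret.
head -c 1024 /dev/urandom > "$workdir/document.bin"
openssl rand -out "$workdir/key.bin" 32

# Direct RSA encryption of the document is rejected: with a 2048-bit key and
# PKCS#1 v1.5 padding the payload may be at most 245 bytes.
if openssl pkeyutl -encrypt -pubin -inkey "$workdir/public.pem" \
        -in "$workdir/document.bin" -out "$workdir/document.rsa" 2>/dev/null; then
    rsa_on_document=accepted
else
    rsa_on_document=rejected
fi

# The short secret fits, which is the role RSA plays in the hybrid scheme.
if openssl pkeyutl -encrypt -pubin -inkey "$workdir/public.pem" \
        -in "$workdir/key.bin" -out "$workdir/key.bin.enc" 2>/dev/null; then
    rsa_on_secret=accepted
else
    rsa_on_secret=rejected
fi

echo "RSA on 1 KiB document: $rsa_on_document"
echo "RSA on 32-byte secret: $rsa_on_secret"
```

In the hybrid method, AES therefore carries the bulk data while RSA protects only the short AES secret.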

In addition, the proposed hybrid encryption approach has complemented the work done by

Margathavalli et al (2016) which was identified in the literature review to prevent data

leakage. They made use of ABE which is another way of implementing public key encryption

to prevent leakage of sensitive data.

8.1 Contribution

This thesis contributes to the field of DLP, using the preventive approach of stopping data

leakage before it happens, by proposing a hybrid symmetric-asymmetric encryption technique

for encrypting confidential or sensitive documents within an organization so that only

authorized users can have access.

9. CONCLUSION

This thesis has contributed to the area of DLP by proposing a hybrid symmetric-asymmetric

encryption approach to prevent data leakage, which is one of the preventive approaches of

DLPS. This IT artifact (method) is capable of preventing data leakage before it happens, so

that only authorized users of an organization can access its confidential or sensitive

documents. It was clear from the literature review that a lot of work has already been done in

the area of leakage detection, which leaves a gap in the preventive approach of DLPS. This

research has demonstrated that encryption could also serve as a cornerstone of BDS. The

proposed hybrid encryption method, which combines asymmetric (RSA) and symmetric (AES)

encryption, can therefore be used by many organizations to prevent leakage of their

confidential or sensitive documents.

9.1 Future Research

Future research could automate the proposed method so that data is fed automatically into an

appropriate data mining tool and into BD technologies such as Hadoop. The method could also be

extended by encrypting all BD before it is stored in Hadoop, to further strengthen the security

of BD and to prevent data leakage. Hadoop is an open source framework that allows distributed

storage and processing of large data sets across clusters of networked computers using simple

programming models (Khan et al, 2014; Shukla et al, 2015; Tole, 2013; Ularu et al, 2012;

Rodríguez-Mazahua et al, 2016).

REFERENCES

Abdallh, M.M.A, Bilal, K. H. & Babiker, A. (2016), Machine Learning Algorithms, International Journal

of Engineering, Applied and Management Sciences Paradigms, vol. 36, issue 01, pp. 17-27.

Ahmad, S. W. & Bamnote, G. R. (2013), Data Leakage Detection and Data Prevention Using Algorithm,

International Journal of Computer Science and Applications, vol. 6, no. 2, pp. 394-399.

Al-Hazaimeh, O. M. (2013), A New Approach for Complex Encrypting and Decrypting Data, International

journal of Computer Networks & Communications, vol. 5, no. 2, pp. 95-103.

Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2016), A survey on data leakage prevention

systems, Journal of Network and Computer Applications, vol. 62, issue C, pp. 137-152.

Alneyadi, S., Sithirasenan, E. and Muthukkumarasamy, V. (2015), Detecting Data Semantic: A Data

Leakage Prevention Approach, In the Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, August

20 - 22, IEEE Computer Society Washington DC, USA, vol. 1, pp. 910-917.

Alneyadi S., Sithirasenan E. & Muthukkumarasamy V. (2014), A Semantics-Aware Classification Approach

for Data Leakage Prevention, In: Susilo W., Mu Y. (eds) Information Security and Privacy, ACISP 2014,

Lecture Notes in Computer Science, vol. 8544, pp.413-421, Springer, Cham.

Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2013a), Word N-gram Based Classification for

Data Leakage Prevention, In the proceedings of 2013 12th IEEE International Conference on Trust,

Security and Privacy in Computing and Communications.

Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2013b), Adaptable N-gram classification model

for data leakage prevention, In the proceedings of 7th International Conference on Signal Processing and

Communication Systems (ICSPCS), Carrara, VIC, 2013, pp. 1-8.

Al-Radaideh, Q. A. & Al-Nagi, E. (2012), Using Data Mining Techniques to Build a Classification Model

for Predicting Employees Performance, International Journal of Advanced Computer Science and

Applications, vol.3, no. 2, pp.144-151.

Ammu, N. & Irfanuddin, M. (2013), Big Data Challenges, International Journal of Advanced Trends in

Computer Science and Engineering, vol. 2, no. 1, pp. 613-615.

Ba-Alwi, F. M. & Albared, M. (2016), Experiments on the Use of Machine Learning Classification Methods

in Online Crime Text Filtering and Classification, British Journal of Applied Science & Technology, vol.

12, no. 5, pp. 1-12.

Bhanot, R. & Hans, R. (2015), A Review and Comparative Analysis of Various Encryption Algorithms,

International Journal of Security and Its Applications, vol. 9, no. 4, pp. 289-306.

Bali, M. & Gore, D. (2015), A Survey on Text Classification with Different Types of Classification

Methods, International Journal of Innovative Research in Computer and Communication Engineering, vol.

3, issue 5, pp. 4888-4894.

Bertino, E. (2013), Big Data - Opportunities and Challenges (Panel Position Paper), In the proceedings of the

2013 IEEE 37th Annual Computer Software and Applications Conference (COMPSAC), 22-26 July 2013,

Kyoto, Japan. [Online]. Available: https://www.cs.purdue.edu/homes/bertino/compsac13.pdf [Accessed: 10th

December, 2016].

Bhaimia, S. (2018), The General Data Protection Regulation: The Next Generation of EU Data Protection, Legal

Information Management, vol. 18, no. 2018, pp. 21-28.

Bhogal, N. & Jain, S. (2017), A Review on Big Data Security and Handling, International Research Based

Journal, vol. 6, issue 1, pp. 1-5.

Bikulov, D. (2013), Hybrid symmetric-asymmetric encryption for large files [Kenarius Octonotes].

[Online]. Available: http://bikulov.org/blog/2013/10/12/hybrid-symmetric-asymmetric-encryption-for-large-files/

[Accessed: 5th May, 2018].

Bonaccorso, G. (2017), Machine Learning Algorithms, Packt Publishing, Birmingham, UK.

Brocke, J. v., Simons, A., Niehaves, B., Riemer, K., Plattfaut, R. & Cleven, A. (2009),

Reconstructing The Giant: On The Importance of Rigour in Documenting The Literature Search Process,

ECIS 2009 Proceedings, Paper 161.

Brumbaugh, D. (2015), How to Encrypt Large Messages with Asymmetric Keys and phpseclib. [Online].

Available: https://www.sitepoint.com/encrypt-large-messages-asymmetric-keys-phpseclib/ [Accessed: 8th

May, 2018].

Carnegie Mellon University Information Security Office (2017), Guidelines for Data Classification.

[Online]. Available: https://www.cmu.edu/iso/governance/guidelines/data-classification.html [Accessed:

28th April, 2018].

Chavan, G.S., Manjare, S., Hegde, P. & Sankhe, A. (2014), A Survey of Various Machine Learning

Techniques for Text Classification, International Journal of Engineering Trends and Technology (IJETT),

vol. 15, no. 6, pp. 288-292.

Daimary, A. & Saikia, L. P. (2015), A Study of Different Data Encryption Algorithms at Security Level:

A Literature Review, (IJCSIT) International Journal of Computer Science and Information Technologies,

vol. 6, no. 4, pp. 3507-3509.

Elst, R. V. (2015), Encrypt and decrypt files to public keys via the OpenSSL Command Line. [Online].

Available: https://raymii.org/s/tutorials/Encrypt_and_decrypt_files_to_public_keys_via_the_OpenSSL_Command_Line.html

[Accessed: 5th May 2018].

Harish Kumar, M. & Menakadevi, T. (2017), A Review on Big Data Analytics in the field of Agriculture,

International Journal of Latest Transactions in Engineering and Science, vol. 1, issue 4, pp. 0001-0010.

Hevner, A. R., March, S. T., Park, J. & Ram, S. (2004), Design Science in Information Systems Research,

MIS Quarterly, vol. 28, no. 1, pp. 75-105.

Hima Bindu, S., Gireesha, O., Sahithi, A. N. & Mounicama, A. (2016), Security Aspects in Big Data,

International Journal of Innovative Research in Computer and Communication Engineering, vol. 4, issue

4, pp. 1111-1118.

Inukollu, V. N., Arsi, S. & Ravuri, S. R. (2014), Security Issues Associated with Big Data in Cloud

Computing, International Journal of Network Security & Its Applications (IJNSA), vol.6, no.3, pp. 45-56.

ISACA (2010), Data Leak Prevention [White Paper]. [Online]. Available:

http://www.isaca.org/Groups/Professional-English/security-trend/GroupDocuments/DLP-WP-14Sept2010-Research.pdf

[Accessed: 22nd November, 2016].

Jain, M & Lenka, S. K. (2016), A Review on Data Leakage Prevention using Image Steganography,

International Journal of Computer Science Engineering (IJCSE), vol. 5, no. 02, pp. 56-59.

Jamgekar, R. S. & Joshi, G. S. (2013), File Encryption and Decryption Using Secure RSA, International

Journal of Emerging Science and Engineering (IJESE), vol. 1, issue 4, pp. 11-14.

Jamiy, F. EL., Daif, A., Azouazi, M. & Marzak, A. (2014), The potential and challenges of Big data -

Recommendation systems next level application, International Journal of Computer Science Issues (IJCSI),

vol. 11, issue 5, no. 2, pp. 21-26.

Kale, A. V., Bajpayee, V. & Dubey, S. P. (2015), Analysis of Data Leakage Prevention Solutions,

International Journal For Engineering Applications And Technology (IJFEAT), vol. 1, issue, 12, pp. 54-

57.

Kanimozhi, K. V. & Venkatesan, M. (2015), Unstructured Data Analysis-A Survey, International Journal

of Advanced Research in Computer and Communication Engineering, vol. 4, issue 3, pp. 223-225.

Kaur, K. (2016), Machine Learning: Applications in Indian Agriculture, International Journal of Advanced

Research in Computer and Communication Engineering, vol. 5, issue 4, pp. 342-344.

Katz, G., Elovici, Y. & Shapira, B. (2014), CoBAn: A context based model for data leakage prevention,

Information Sciences, vol. 262, pp.137-158.

Kaushik, M. & Jain, A. (2014), Challenges to Big Data Security and Privacy, International Journal of

Computer Science and Information Technologies, vol. 5, no. 3, pp. 3042-3043.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M. & Gani, A. (2014), Big

Data: Survey, Technologies, Opportunities, and Challenges, The Scientific World Journal, vol. 2014, no. 2014:

712826, pp. 1-18.

Ko, R. K. L., Tan, A. Y. S. & Gao, T. (2014), A Mantrap-Inspired, User-Centric Data Leakage Prevention

(DLP) Approach, 2014 IEEE 6th International Conference on Cloud Computing Technology and Science,

Singapore, 2014, pp. 1033-1039.

Korde, V. & Mahender, C. N. (2012), Text Classification and Classifiers: A Survey, International Journal

of Artificial Intelligence & Applications (IJAIA), vol. 3, no. 2, pp. 85-99.

Kumar, S., Shekhar, J. & Gupta, H. (2016), Agent based Security Model for Cloud Big Data, In the

proceedings of the Second International Conference on Information and Communication Technology for

Competitive Strategies (ICTCS’16), Udaipur, India. [Online]. Available:

https://www.researchgate.net/profile/Sunil_Kumar468/publication/289738412_Agent_based_security_model_for_Cloud_Big_Data/links/56f4e40e08ae7c1fda2d7b23/Agent-based-security-model-for-Cloud-Big-Data.pdf

[Accessed: 30th January, 2017].

Iivari, J. (2007), A Paradigmatic Analysis of Information Systems As a Design Science, Scandinavian Journal

of Information Systems, vol. 19, issue 2, pp. 39-64.

Mahajan, P., Gaba, G. & Chauhan, N. S. (2016), Big Data Security, IITM Journal of Management and IT, vol.

7, issue 1, pp. 89-94.

Mahajan, P. & Sachdeva, A. (2013), A Study of Encryption Algorithms AES, DES and RSA for

Security, Global Journal of Computer Science and Technology, vol. 13, issue 15, version 1.0.

Margathavalli, P., Manjula, R., Pramila, V., Priya, R. & Abirami, P. (2016), Preserving Sensitive Data by Data

Leakage Prevention Using Attribute Based Encryption Algorithm, International Journal of Emerging

Technology in Computer Science & Electronics (IJETCSE), vol. 21, issue 3, pp. 705-711.

Markus, M.L., Majchrzak, A. & Gasser, L. (2002), A Design Theory For Systems That Support Emergent

Knowledge Processes, MIS Quarterly, vol. 26, no. 3, pp. 179-212.

McAfee, A. & Brynjolfsson, E. (2012), Big Data. The Management Revolution, Harvard

Business Review, vol. 90, no. 10, pp. 61-67.

Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K & Ghosh, P. (2015), Big Data: Prospects

and Challenges, The Journal for Decision Makers, vol. 40, issue 1, pp. 74-96.

Moro, S., Laureano, R. & Cortez, P. (2011), Using Data Mining for Bank Direct Marketing: An Application of

the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling

Conference - ESM'2011, Guimaraes, Portugal, October, pp. 117-121.

Moura, J. & Serrão, C. (2015), Security and Privacy Issues of Big Data. In book Handbook of Research on

Trends and Future Directions in Big Data and Web Intelligence, IGI Global. [Online], Available:

https://arxiv.org/ftp/arxiv/papers/1601/1601.06206.pdf [Accessed: 22nd November, 2016].

Nalini, K. & Sheela, L. J. (2014), Survey on Text Classification, International Journal of Innovative Research

in Advanced Engineering (IJIRAE), vol. 1, issue 6, pp. 412-417.

Nisha, M. D. & Karthik, K. (2016), Survey on Text Classification Methods, International Journal of Advanced

Research in Computer Science and Software Engineering, vol. 6, issue 2, pp. 585-588.

NIST Special Publication 800-122 (2010), Guide to Protecting the Confidentiality of Personally Identifiable

Information (PII). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-122.pdf

[Accessed: 28th April, 2018].

Patel, P. & Mistry, K. (2015), A Review: Text Classification on Social Media Data, IOSR Journal of Computer

Engineering, vol. 17, issue 1, pp. 80-84.

PEER Mississippi (2017), A Review of State Agencies’ Management of Confidential Data [Report to the

Mississippi Legislature, #612]. [Online]. Available: http://www.peer.ms.gov/Reports/reports/rpt612.pdf

[Accessed: 28th April, 2018].

Patil, R. P., Bhavsar, R. P. & Pawar, B. V. (2016), A Comparative Study of Text Classification Methods: An

Experimental Approach, International Journal on Recent and Innovation Trends in Computing and

Communication, vol. 4, issue 3, pp. 517-523.

Patra, A. & Singh, D. (2013), A Survey Report on Text Classification with Different Term Weighing Methods

and Comparison between Classification Algorithms, International Journal of Computer Applications, vol. 75, no.

7, pp. 14–18.

Peneti, S. & Rani, B. P. (2016), Data Leakage Prevention System with Time Stamp, 2016 International Conference

on Information Communication and Embedded Systems (ICICES), 25-26 Feb. 2016, Chennai, India, pp. 1-4.

Peneti, S. & Rani, B. P. (2015a), Data Leakage Detection and Prevention Methods: Survey. Discovery, vol. 43,

no. 198, pp. 95-100.

Peneti, S. & Rani, B. P. (2015b), Confidential Data Identification Using Data Mining Techniques in Data Leakage

Prevention System, International Journal of Data Mining & Knowledge Management Process (IJDKP), vol. 5,

no. 5, pp. 65-73.

Peffers, K., Tuunanen, T., Rothenberger. M. A. & Chatterjee, S. (2007), A Design Science Research

Methodology for Information Systems Research, Journal of Management Information Systems, vol. 24,

issue 3, pp. 45-77.

Portugal, I., Alencar, P. & Cowan, D. (2018), The use of machine learning algorithms in recommender

systems: A systematic review, Expert Systems with Applications, vol. 97, pp. 205-227.

Ram, K. (2015), Analysis of Data Leakage Prevention on cloud computing, International Journal of

Scientific & Engineering Research, vol. 6, issue 1, pp. 457-461.

General Data Protection Regulation (EU) (2016/679), Regulation (EU) 2016/679 of the European

Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the

processing of personal data and on the free movement of such data and repealing Directive 95/46/EC

(General Data Protection Regulation), Official Journal of the European Union L 119(1).

Rocha, B. C. & Sousa Júnior, R. T. (2010), Identifying bank frauds using CRISP-DM and decision trees,

International journal of computer science & information Technology (IJCSIT), vol. 2, no. 5, pp. 162 – 169.

Rodríguez-Mazahua, L., Rodríguez-Enríquez, CA., Sánchez-Cervantes, J. L., Cervantes, J., García-

Alcaraz, J. L. & Alor-Hernández, G. (2016), A general perspective of Big Data: applications, tools,

challenges and trends, The Journal of Supercomputing, vol. 72, issue 8, pp. 3073-3113.

Shabtai, A., Elovici, Y. and Rokach, L. (2012), A taxonomy of data leakage prevention solutions, In A

Survey of Data Leakage Detection and Prevention Solutions (pp. 11-15), Springer US.

Shearer, C. (2000), The CRISP-DM Model: The New Blueprint for Data Mining, Journal of Data

Warehousing, vol. 5, no. 4, pp. 13-22.

Shirudkar, K. & Motwani, D. (2015), Big-Data Security. International Journal of Advanced Research in

Computer Science and Software Engineering, vol. 5, issue 3, pp. 1100-1109.

Shukla, S., Kukade, V. & Mujawar, S. (2015), Big Data: Concept, Handling and Challenges: An Overview,

International Journal of Computer Applications, vol. 114, no. 1, pp. 6-9.

Sin, K. & Muthu, L. (2015), Application of Big Data in Education Data Mining and Learning Analytics - A

Literature Review, ICTACT Journal on Soft Computing, vol. 5, issue 4, pp. 1035-1049.

Singh, R. & Sinha, R. (2016), Big Data Security and Privacy Issues in SMES, International Journal of

Environment, Science and Technology, vol. 2, issue 1, pp. 31-35.

Sonnenberg, C., & vom Brocke, J. (2012), Evaluation Patterns for Design Science Research Artefacts, In M.

Helfert & B. Donnellan (Eds.), Proceedings of the European Design Science Symposium (EDSS) 2011 Dublin,

Ireland: Springer Berlin/Heidelberg, vol. 286, pp. 71-83.

Soumya, S. R. & Smitha, E. S. (2014), Data Leakage Prevention System by Context based Keyword Matching

and Encrypted Data Detection, International Journal of Advanced Research in Computer Science Engineering

and Information Technology, vol. 3, issue 1, pp. 375-384.

Tabassum, R. & Tyagi, N. (2016), Issues and Approaches for Big Data Security, International Journal of Latest

Technology in Engineering, Management & Applied Science (IJLTEMAS), vol. V, issue VII, pp. 72-74.

Tahboub, R & Saleh, Y. (2015), Precaution Model for Data Leakage Prevention/Loss (DLP) Systems, In the

proceedings of the 4th Palestinian International Conference on Computer and Information Technology (PICCIT

2015), Palestine, Hebron. [Online]. Available:

https://www.researchgate.net/profile/Radwan_Tahboub/publication/282942404_Precaution_Model_for_Data_Leakage_PreventionLoss_DLP_Systems/links/56234afc08aea35f2682c5c8/Precaution-Model-for-Data-Leakage-Prevention-Loss-DLP-Systems.pdf

[Accessed: 30th January, 2017].

Tahboub, R & Saleh, Y. (2014), Data Leakage / Loss Prevention Systems (DLP), NNGT Journal: International

Journal of Information Systems, vol. 1, pp. 13-18.

Tene, O. & Polonetsky, J. (2013), Big Data for All: Privacy and User Control in the Age of Analytics,

Northwestern Journal of Technology and Intellectual Property, vol. 11 issue 5, pp. 238-273.

Thaoroijam, K. (2014), A Study on Document Classification using Machine Learning Techniques, International Journal of Computer Science Issues, vol. 11, issue 2, pp. 217-222.

The University of Texas – Austin Information Security Office (2017), Extended List of Confidential Data.

[Online]. Available: https://security.utexas.edu/policies/extended-cat-1 [Accessed: 28th April, 2018].

Tidke, P., Wagh, A., Bharade, D. & Dongre, A. G. (2015), Data Leakage Prevention with E-Mail Filtering,

International Journal of Advance Foundation and Research in Computer (IJAFRC), vol. 2, issue 2, pp. 28-32.

Tole, A. A. (2013), Big Data Challenges, Database Systems Journal, vol. IV, no. 3, pp. 31-40.

Topaloğlu, M. (2013), The Comparison of the Text Classification Methods to be used for the Analysis of Motion

Data in DLP Architect, International Journal of Computer Science & Information Technology (IJCSIT), vol. 5,

no. 5, pp. 107-115.

Toshniwal, R., Dastidar, K. G., & Nath, A. (2015), Big Data Security Issues and Challenges, International

Journal of Innovative Research in Advanced Engineering (IJIRAE), vol. 2, issue 2, pp. 15-20.

Ularu, E.G., Puican, F.C., Apostu, A., Velicanu, M. (2012), Perspectives on Big Data and Big Data Analytics,

Database Systems Journal, vol. III, no. 4, pp. 3-13.

Vadsola, R., Desai, D., Brahmbhatt, M. & Patanwadia, A. (2014), Data Leakage Prevention by Using Word Gram Based Classification and Clustering, International Journal of Advanced Research in Computer and

Communication Engineering, vol. 3, issue 9, pp. 8040-8041.

Vala, M. & Gandhi, J. (2015), Survey of Text Classification Technique and Compare Classifier, International

Journal of Innovative Research in Computer and Communication Engineering, vol. 3, issue 11, pp. 10809-

10813.

Vanjari, S. P. & Thombre, V. D. (2015), An Experiential Study of SVM and Naïve Bayes for Gender

Recognization, International Journal on Recent and Innovation Trends in Computing and Communication, vol.

3, issue 9, pp. 5456-5460.

Venable, J., Pries-Heje, J. & Baskerville, R. (2012), A Comprehensive Framework for Evaluation in Design

Science Research, In K. Peffers, M. Rothenberger & B. Kuechler (Eds.), Design Science Research in Information

Systems. Advances in Theory and Practice, Berlin / Heidelberg: Springer, vol. 7286, pp. 423-438.

Walls, J. G., Widmeyer, G. R. & El Sawy O. A. (1992), Building an Information System Design Theory for Vigilant EIS, Information Systems Research, vol. 3, no. 1, pp. 36-59.

Wirth, R. & Hipp, J. (2000), CRISP-DM: Towards a standard process model for data mining. In the proceedings

of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining.

Yosepu, C., Srinivasulu, P. & Subbarayudu, B. (2015), A Study on Security and Privacy in Big Data Processing,

International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, issue 12,

pp. 12292-12296.

APPENDIX 1

Figure 16: Text Pre-processing Activities

Figure 17: Process Map of the Model for NBC

Figure 18: Process Map of Cross Validation for NBC

Figure 19 illustrates that the model could predict the correct files being three (3) confidential

and four (4) non-confidential documents.

Figure 19: Prediction label after applying the model on the unknown data.

APPENDIX 2

1. Generate RSA Keypairs

# Generate a private key of 8196 bits with the command below.

openssl genrsa -out private.pem 8196

# Extract the public key from the private key with the command below.

openssl rsa -in private.pem -out public.pem -outform PEM -pubout

2. Generate AES Key (the random password file)

# Generate a random 32-byte (256-bit) AES key and save it to the key.bin file with the command below.

openssl rand -base64 32 > key.bin

3. Encryption:

a. Encrypt File with AES Key (i.e. encrypt the file with the random key)

# Encrypt largefile.txt with the generated AES key, writing largefile.txt.enc, with the command below.

openssl enc -aes-256-cbc -salt -in largefile.txt -out largefile.txt.enc -pass file:./key.bin

b. Encrypt AES Key with RSA Public Key (i.e. encrypt the random key with the public key file)

# Encrypt the AES key with the RSA public key and save the result in key.bin.enc with the command below.

openssl rsautl -encrypt -inkey public.pem -pubin -in key.bin -out key.bin.enc

4. Decryption:

a. Decrypt AES Key with RSA Private Key (i.e. decrypt the random key with the private key file)

# Decrypt the AES key with the RSA private key and save the result in key.bin.dec with the command below.

openssl rsautl -decrypt -inkey private.pem -in key.bin.enc -out key.bin.dec

b. Decrypt File with AES Key (i.e. decrypt the large file with the random key)

# Decrypt the encrypted file with the decrypted AES key with the command below.

openssl enc -d -aes-256-cbc -in largefile.txt.enc -out largefile.txt.dec -pass file:./key.bin.dec

# largefile.txt.dec and largefile.txt should now be identical.

Figure 20: Screenshot showing the implementation of the second phase of the proposed DLP method in experiment 1
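The final check, that the decrypted file matches the original, can be scripted rather than inspected by eye. A minimal sketch follows; the helper name same_content and the stand-in files are hypothetical, and cmp is simply one convenient comparison tool.

```shell
#!/usr/bin/env bash
# Sketch only: confirm that a decrypted file is byte-identical to the original.
set -euo pipefail

# same_content <original> <decrypted>; succeeds only if the bytes match.
same_content() {
    cmp -s "$1" "$2"
}

# Stand-in files; in this appendix they would be largefile.txt and largefile.txt.dec.
workdir="$(mktemp -d)"
printf 'confidential\n' > "$workdir/largefile.txt"
cp "$workdir/largefile.txt" "$workdir/largefile.txt.dec"
printf 'tampered\n' > "$workdir/other.txt"

if same_content "$workdir/largefile.txt" "$workdir/largefile.txt.dec"; then
    echo "round trip OK"
fi
if ! same_content "$workdir/largefile.txt" "$workdir/other.txt"; then
    echo "mismatch detected"
fi
```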

APPENDIX 3

Step 1: Create local and remote folders

Step 2: Change to local directory and generate RSA keypairs. In this case, private key will be

named (keyfile.key) and public key will be (keyfile.pub).

Step 3: The public key is stripped out from the private key and stored as (keyfile.pub) on the

remote folder.

Step 4: Copy the public key (keyfile.pub) to the remote folder.

Step 5: Change to the remote folder

Step 7: Encryption bash script created with gedit program

Figure 21: Screenshot showing the encryption bash script

Step 8: Make the encryption bash script executable

Step 9: Encrypt the confidential file (largefile.txt) with the encryption script.

The encrypted files will also be copied to the local folder.

Step 10: Change directory to the local folder

Step 11: Create the decryption bash script (decrypt.sh)

Figure 22: Screenshot showing all the steps from step 2 to 11

Step 12: Decryption bash script (decrypt.sh) created with gedit program

The encrypted AES key and encrypted file (largefile.txt.enc) will be removed from the local

folder after the decryption.

Figure 23: Screenshot showing the decryption bash script

Step 13: Make the decryption bash script executable

Figure 24: Screenshot showing the executable command (step 13)

Step 14: Decrypt the encrypted confidential file (largefile.txt.enc) by running the decryption bash script. The decrypted file will be placed in the local folder. When authorized users need access to the confidential files, they can be allowed to run the decryption script against those files. Afterwards, the decrypted copies can be deleted.

Figure 25: Screenshot showing the decryption of the confidential file
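
Because the scripts themselves appear only as screenshots (Figures 21 and 23), the following sketch shows what such a workflow might look like. It is an illustration under assumptions rather than a transcription of the scripts: the sandbox directory, the function names `encrypt_file` and `decrypt_file`, the key file names `key.bin`, `key.bin.enc` and `key.bin.dec`, and the sample file contents are hypothetical. For simplicity the key pair is generated in the local folder and the public key is then copied to the remote folder.

```shell
#!/bin/sh
# Hypothetical sketch of the Appendix 3 workflow; the real scripts are shown only as screenshots.
set -eu

BASE=$(mktemp -d)                      # assumption: throwaway sandbox
mkdir "$BASE/local" "$BASE/remote"     # Step 1: local and remote folders

# Steps 2-4: key pair in the local folder, public key copied to the remote folder.
cd "$BASE/local"
openssl genrsa -out keyfile.key 2048
openssl rsa -in keyfile.key -pubout -out keyfile.pub
cp keyfile.pub "$BASE/remote/"

# Assumed body of the encryption script: encrypt a file with a fresh random key,
# wrap that key with the public key, and copy both results to the local folder.
encrypt_file() {
    openssl rand -base64 32 > key.bin
    openssl enc -aes-256-cbc -salt -in "$1" -out "$1.enc" -pass file:./key.bin
    openssl pkeyutl -encrypt -pubin -inkey keyfile.pub -in key.bin -out key.bin.enc
    rm -f key.bin                      # assumption: do not leave the plaintext key behind
    cp "$1.enc" key.bin.enc "$BASE/local/"
}

# Assumed body of the decryption script: unwrap the key, decrypt the file,
# then remove the encrypted key and the encrypted file from the local folder.
decrypt_file() {
    openssl pkeyutl -decrypt -inkey keyfile.key -in key.bin.enc -out key.bin.dec
    openssl enc -d -aes-256-cbc -in "$1.enc" -out "$1" -pass file:./key.bin.dec
    rm -f key.bin.dec key.bin.enc "$1.enc"
}

# Steps 5-9: encrypt the confidential file in the remote folder.
cd "$BASE/remote"
printf 'confidential sample data\n' > largefile.txt   # assumption: sample content
encrypt_file largefile.txt

# Steps 10-14: decrypt in the local folder and compare with the original.
cd "$BASE/local"
decrypt_file largefile.txt
cmp largefile.txt "$BASE/remote/largefile.txt" && echo "decrypted copy matches"
```

Keeping the private key only in the local folder means that the remote side can encrypt files but cannot decrypt them, which is the property the key separation in Steps 2 to 4 is meant to provide.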

APPENDIX 4

Encrypting Multiple Files

Now, we will create a gzip tarball and then encrypt it. This can be achieved in a single command with a pipe. With this approach, the encryption password must be supplied when prompted. To encrypt all the files in the current directory or folder, use the following command:

tar -czf - * | openssl enc -e -aes256 -out allconfidentialfiles.tar.gz

Figure 26: Screenshot showing the encryption of all the confidential files in a directory

Decrypting Multiple Files

The contents of the encrypted tar archive can be decrypted and extracted with the following command.

openssl enc -aes-256-cbc -d -in allconfidentialfiles.tar.gz | tar xz

When the correct encryption password is supplied, all the contents of the encrypted archived

files will be made available to the authorized user.

Figure 27: Screenshot showing the decryption of all the confidential files in a directory

The files within the current directory can be listed with the ls command as shown below:

Figure 28: Screenshot showing all the files in a directory

When the authorized user is done working with the confidential files they can be deleted. This

can be achieved for instance with the rm command as shown below.

Figure 29: Screenshot showing the removal of the files in a directory

When a wrong password is supplied, decryption fails with an error message as shown below:

Figure 30: Screenshot showing wrong password supplied for decryption
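
The encryption and decryption of multiple files described in this appendix can be sketched non-interactively as follows. This is an illustration under assumptions, not the session shown in the screenshots: the three temporary directories, the sample files, and the passphrase are hypothetical, and the passphrase is passed with `-pass pass:` only to avoid the interactive prompt (a password on the command line can be visible to other users of the machine). The sketch uses `-aes-256-cbc` in both directions, for which `-aes256` is an alias.

```shell
#!/bin/sh
# Sketch: encrypt a directory's files as one tarball, then decrypt and unpack it.
set -eu

SRC=$(mktemp -d)     # assumption: directory holding the confidential files
OUT=$(mktemp -d)     # assumption: separate directory for the encrypted archive
DEST=$(mktemp -d)    # assumption: directory where the files are restored
PASSWORD='example-passphrase'   # hypothetical; the original commands prompt for it

printf 'alpha\n' > "$SRC/a.txt"
printf 'beta\n'  > "$SRC/b.txt"

# Encrypt: tar writes the gzip archive to stdout and openssl encrypts the stream.
# Writing the output outside the source directory keeps a rerun from archiving
# the previously encrypted archive together with the files.
cd "$SRC"
tar -czf - * | openssl enc -e -aes-256-cbc -salt -pass "pass:$PASSWORD" \
    -out "$OUT/allconfidentialfiles.tar.gz"

# Decrypt with the correct password and unpack into the destination directory.
cd "$DEST"
openssl enc -d -aes-256-cbc -pass "pass:$PASSWORD" \
    -in "$OUT/allconfidentialfiles.tar.gz" | tar -xzf -

# A wrong password is normally rejected with a "bad decrypt" error.
if openssl enc -d -aes-256-cbc -pass pass:wrong-passphrase \
    -in "$OUT/allconfidentialfiles.tar.gz" -out "$OUT/wrong.out" 2>/dev/null; then
    WRONG_RESULT=accepted
else
    WRONG_RESULT=rejected
fi
echo "wrong password: $WRONG_RESULT"
```

Note that `*` does not match hidden files (names starting with a dot), so such files would have to be added to the tar command explicitly.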

APPENDIX 5

To archive all the confidential text files in the current directory or folder, use the following command:

tar -czf newconfidentialfiles.tar.gz *.txt

Figure 31: Screenshot showing archiving of all files before encryption

Now remove the individual confidential text files with the remove (rm) command so that only the tar archive remains.

Figure 32: Screenshot showing removal of all confidential text files before encryption

Afterwards, the encryption and decryption bash scripts (see section 6.3) can be used as shown below. Run the encryption bash script with the (./encrypt.sh) command.

Figure 33: Screenshot showing encryption of the archived file

Afterwards, change to the local folder.

Figure 34: Screenshot showing changing of directory

Run the decryption bash script with the (./decrypt.sh) command.

Figure 35: Screenshot showing decryption of the archived file

Now extract the confidential files from the archive with the command below:

tar -xzf newconfidentialfiles.tar.gz

Figure 36: Screenshot showing extraction of files from the archived file
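
The archiving, removal, and extraction steps of this appendix can be checked with a sketch such as the one below. It is a minimal illustration under assumptions: the temporary directory, the sample files `report1.txt` and `report2.txt`, and the checksum file are hypothetical, and the encryption and decryption of the archive with the bash scripts is left out so that the example depends only on `tar`.

```shell
#!/bin/sh
# Sketch: bundle the text files into one archive, remove the originals, then restore them.
set -eu

DIR=$(mktemp -d)                 # assumption: stand-in for the folder of confidential files
cd "$DIR"
printf 'first\n'  > report1.txt  # hypothetical sample files
printf 'second\n' > report2.txt

# Record checksums outside the folder so the restored files can be verified later.
cksum report1.txt report2.txt > "$DIR.sums"

# Archive all text files, then remove the individual files.
tar -czf newconfidentialfiles.tar.gz *.txt
rm -- *.txt

# (At this point the archive would be encrypted and later decrypted with the bash scripts.)

# Extract the files again and compare them against the recorded checksums.
tar -xzf newconfidentialfiles.tar.gz
cksum report1.txt report2.txt | cmp - "$DIR.sums" && echo "files restored intact"
```

Recording checksums before the originals are removed gives a simple way to confirm that nothing was lost or altered between archiving and extraction.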