Security of Big Data: Focus on Data
Leakage Prevention (DLP)
Richard Nyarko
Information Security, master's level (120 credits)
2018
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Supervised by Prof. Ahmed Elragal
THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF SCIENCE IN INFORMATION SECURITY
June, 2018
ABSTRACT
Data has become an indispensable part of our daily lives in this information age. The
amount of data generated is growing exponentially due to technological advances.
The sheer volume of data generated daily has given rise to a new term, big data.
Security is therefore of great concern when it comes to big data processes. The survival
of many organizations depends on preventing these data from falling into the wrong hands,
because the consequences could be serious if they do. For instance, the credibility of
several businesses or organizations will be compromised when sensitive data such as trade
secrets, project documents, and customer profiles are leaked to their competitors
(Alneyadi et al, 2016).
In addition, traditional security mechanisms such as firewalls, virtual private networks
(VPNs), and intrusion detection systems/intrusion prevention systems (IDSs/IPSs) are not
enough to prevent the leakage of such sensitive data. To overcome this deficiency in
protecting sensitive data, a new paradigm, data leakage prevention systems (DLPSs), has
been introduced. Over the past years, many research contributions have been made to address
data leakage. However, most of the past research focused on detecting data leakage rather
than preventing it. This thesis contributes to research by using the preventive approach of
DLPSs to propose hybrid symmetric-asymmetric encryption to prevent data leakage.
This thesis followed the Design Science Research Methodology (DSRM), with CRISP-DM
(CRoss Industry Standard Process for Data Mining) as the kernel theory or framework for
designing the IT artifact (method). The proposed encryption method ensures that all
confidential or sensitive documents of an organization are encrypted so that only users with
access to the decryption keys can read them. The documents are encrypted after they have
been classified into confidential and non-confidential ones with a Naïve Bayes Classifier (NBC).
Any organization that needs to prevent data leakage before it occurs can therefore make use
of this proposed hybrid encryption method.
Keywords: Big data, big data security, data leakage prevention, data leakage
prevention system.
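The classification phase summarised in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, which feeds documents manually into a data mining tool; it is a small multinomial Naïve Bayes classifier with add-one smoothing written in plain Python. The training documents, the labels, and the function names `tokenize`, `train`, and `classify` are all hypothetical, and the encryption phase is left out entirely.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training documents; a real deployment would use the
# organization's own labelled documents.
TRAINING_DOCS = [
    ("salary payroll contract employee", "confidential"),
    ("trade secret project budget", "confidential"),
    ("cafeteria menu lunch today", "non-confidential"),
    ("office party invitation friday", "non-confidential"),
]


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DOCS)
print(classify(model, "employee salary budget"))
print(classify(model, "lunch menu friday"))
```

In the proposed method, only documents labelled confidential by such a classifier would then be passed to the encryption phase.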
ACKNOWLEDGEMENT
First and foremost, I am most grateful to the Almighty God who gave me the knowledge, strength,
good health, and the happiness to complete the thesis.
I would also like to express my profound appreciation and gratitude to my supervisor, Prof. Ahmed
Elragal, for his endless support, valuable comments, and guidance throughout this thesis work.
This thesis would not have been successful without his valuable contributions.
I would also like to express my deepest gratitude to my wife, Louisa Ababio Nsiah, who has
supported me throughout all these years. My master’s programme would not have been completed
without her valuable support.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF FIGURES
TABLE OF TABLES
ABBREVIATIONS
1. INTRODUCTION
1.1 Research objective
1.2 Delimitations
2. RESEARCH PROBLEM
2.1 Motivation
3. LITERATURE REVIEW
3.1 Definition of review scope
3.2 Conceptualisation of the research topic
3.3 Literature search
3.4 Literature analysis and synthesis
3.4.1 Big Data (BD)
3.4.2 Big Data Security (BDS)
3.4.3 Data Leakage Prevention (DLP)
3.5 Literature review discussion
3.6 Research gap
3.7 Research question
4. RESEARCH METHODOLOGY
4.1 Activity 1: Problem identification and motivation
4.2 Activity 2: Define the objectives for a solution
4.3 Activity 3: Design and development
4.3.1 Kernel theory
4.4 Activity 4: Demonstration
4.5 Activity 5: Evaluation
4.6 Activity 6: Communication
5. DESIGN AND DEVELOPMENT
5.1 Data Understanding
5.2 Data Preparation
5.3 Modelling
5.4 Cryptography (Encryption and Decryption)
5.5 Proposed DLP Method
5.5.1 Phase 1: Classification of organizational documents
5.5.2 Phase 2: Encryption and decryption of confidential documents
6. DEMONSTRATION
6.1 Experimental setup
6.2 Data Sets (Documents)
6.3 Experiment 1
6.4 Experiment 2
6.5 Experiment 3
6.6 Experiment 4
7. EVALUATION
7.1 Impact of the IT Artifact
8. DISCUSSION
8.1 Contribution
9. CONCLUSION
9.2 Future Research
REFERENCE
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
TABLE OF FIGURES
Figure 1: Framework for literature reviewing (Brocke et al, 2009)
Figure 2: The 5 V’s of BD (Moura and Serrão, 2015)
Figure 3: BD Interpretation Insights (Shukla et al, 2015)
Figure 4: Challenges in BDS
Figure 5: Different Data States (Alneyadi et al, 2013a)
Figure 6: A taxonomy of DLP solutions (Shabtai et al, 2012)
Figure 7: Data leakage prevention categorisation by method (Alneyadi et al, 2016)
Figure 8: Example of SVM hyper plane pattern (Patel and Mistry, 2015)
Figure 9: NN Block Diagram (Patel and Mistry, 2015)
Figure 10: DSRM Process Model (Peffers et al, 2007)
Figure 11: Phases of the CRISP-DM Process Model (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000)
Figure 12: Overview of the CRISP-DM tasks and their outputs (Wirth and Hipp, 2000; Shearer, 2000)
Figure 13: Sample Data
Figure 14: Flowchart of proposed DLP method
Figure 15: Evaluation activities within a DSR process
Figure 16: Text Pre-processing Activities
Figure 17: Process Map of the Model for NBC
Figure 18: Process Map of Cross Validation for NBC
Figure 19: Prediction label after applying the model on the unknown data
Figure 20: Screenshot showing the implementation of the second phase of the proposed DLP method in experiment 1
Figure 21: Screenshot showing the encryption bash script
Figure 22: Screenshot showing all the steps from step 2 to 11
Figure 23: Screenshot showing the decryption bash script
Figure 24: Screenshot showing the executable command (step 13)
Figure 25: Screenshot showing the decryption of the confidential file
Figure 26: Screenshot showing the encryption of all the confidential files in a directory
Figure 27: Screenshot showing the decryption of all the confidential files in a directory
Figure 28: Screenshot showing all the files in a directory
Figure 29: Screenshot showing the removal of the files in a directory
Figure 30: Screenshot showing wrong password supplied for decryption
Figure 31: Screenshot showing archiving of all files before encryption
Figure 32: Screenshot showing removal of all confidential text files before encryption
Figure 33: Screenshot showing encryption of the archived file
Figure 34: Screenshot showing changing of directory
Figure 35: Screenshot showing decryption of the archived file
Figure 36: Screenshot showing extraction of files from the archived file
TABLE OF TABLES
Table 1: Conceptualisation of the research topic
Table 2: Knowledge Database Search
Table 3: Summary of knowledge database search results
Table 4: Selected research papers for review
Table 5: Summary of Previous DLPS Analysis Techniques / Methods
Table 6: Advantages and Disadvantages of Classifiers
ABBREVIATIONS
BD: Big Data
IoT: Internet of Things
DLP: Data Leakage Prevention
VPN: Virtual Private Network
IDSs: Intrusion Detection Systems
IPSs: Intrusion Prevention Systems
IM: Instant Messaging
XML: eXtensible Markup Language
JSON: JavaScript Object Notation
RDBMS: Relational Database Management Systems
BDS: Big Data Security
SIEM: Security Information and Event Management
BYOD: Bring Your Own Device
ABE: Attribute-Based Encryption
DAR: Data-at-Rest
DIU: Data-in-Use
DIM: Data-in-Motion
TF-IDF: Term Frequency-Inverse Document Frequency
SVD: Singular Value Decomposition
SVM: Support Vector Machine
K-NN: K Nearest Neighbor
ANN: Artificial Neural Network
NBC: Naive Bayes Classifier
DT: Decision Trees
DSR: Design Science Research
DSRM: Design Science Research Methodology
CRISP-DM: CRoss Industry Standard Process for Data Mining
NIST: National Institute of Standards and Technology
PII: Personally Identifiable Information
GDPR: General Data Protection Regulation
EU: European Union
HR: Human Resource
DES: Data Encryption Standard
AES: Advanced Encryption Standard
1. INTRODUCTION
In this information-driven era, data has become an indispensable part of our daily
lives. With cloud computing, the internet, and mobile devices playing an ever greater
part in our lives and businesses, huge amounts of data are generated every day (Hima
Bindu et al, 2016). For example, huge amounts of data are generated daily through social
networking applications such as YouTube, Twitter, Facebook, LinkedIn, and WhatsApp, to
mention just a few. The amount of data generated is growing exponentially, and estimates
suggest that at least 2.5 quintillion bytes (2.5 × 10^18 bytes, or 2.5 exabytes) of data
are produced every day (Harish Kumar and Menakadevi, 2017). More data now cross the
Internet every second than were stored on the entire Internet 20 years ago (McAfee and
Brynjolfsson, 2012). Collections of datasets that are so large and complex that they become
difficult to handle with traditional relational database management systems have given rise
to the term “Big Data” (Shirudkar and Motwani, 2015). This term is now used everywhere in
our daily lives.
Big Data (BD) is becoming increasingly popular as the number of devices connected to
the so-called “Internet of Things” (IoT) continues to rise to unforeseen levels, producing
large volumes of data which need to be transformed into valuable information (Moura and
Serrão, 2015). In addition, the advent of BD has brought about new challenges in terms of
data security (Toshniwal et al, 2015). According to Toshniwal et al (2015), there is an
increasing need for research into technologies that can handle these large data sets and
secure them efficiently. They reiterate that the “Current Technologies for
securing data are slow when applied to huge amounts of data” (Toshniwal et al, 2015, p. 17).
This means that security is of great concern in BD collection, processing, and analysis,
and that the systems employed should be fast as well as secure. Ultimately, the purpose of
BD security is no different from the fundamental CIA triad: the Confidentiality,
Integrity, and Availability of the generated data need to be preserved.
According to Tahboub and Saleh (2014), the need to protect information, which is a valuable
asset of the organization, cannot be overemphasized. Data Leakage Prevention (DLP) has
been found to be one of the effective ways of preventing data leaks. DLP solutions detect
and prevent unauthorized attempts to copy or send sensitive data, whether intentional or
unintentional, by people who are authorized to access the sensitive information. DLP is
designed to detect potential data breach incidents in a timely manner by monitoring data
while in use (endpoint actions), in motion (network traffic), or at rest (data storage)
(Tahboub and Saleh, 2014).
Securing the BD process encompasses securing the sources, the pre-processing and the
knowledge outcomes. According to ISACA (2010), DLP aims at halting the loss of sensitive
information that occurs in enterprises globally. By focusing on the location, classification
and monitoring of information at rest, in use and in motion, DLP helps enterprises get a
handle on what information they have and stop the numerous leaks of information that occur
each and every day (ISACA, 2010). This research sets out to design a method to help
organizations prevent data leakage in big data. In the literature, DLP is sometimes expanded
as Data Loss Prevention; in this research, however, DLP means Data Leakage Prevention.
1.1 Research objective
The objective of this research is to design a method to help organizations prevent data leakage
in BD using a preventive approach such as encryption, with emphasis on semi-structured
(textual) data.
1.2 Delimitations
The scope of this thesis is limited to the use of encryption as the preventive approach to
data leakage in BD, with emphasis on semi-structured (textual) data. This means that
other preventive methods such as access control, disabling functions, and awareness
are not addressed. Moreover, the detective approach to handling data leakage in a DLPS
is not considered. The encryption of other types of BD is also not considered, although
the method is capable of handling certain documents in formats other than TXT, such as
DOCX, PDF, and PPT. The encryption algorithms are limited to RSA and AES. The proposed
method is not automated, since data are fed manually into a data mining tool for
classification. The volume of test data used in the experiments is small, even though the
whole idea is to prevent leakage in BD. This situation has arisen because the
organization in question has not implemented BD technologies such as Hadoop to
accommodate several data sources.
2. RESEARCH PROBLEM
Data is one of the most important assets of many companies, and its protection must
therefore take first priority (Tahboub and Saleh, 2014). Even though many
organizations have put in place security mechanisms and technical systems such as
firewalls, virtual private networks (VPNs), and intrusion detection systems/intrusion
prevention systems (IDSs/IPSs), data leakage still occurs (Tahboub and Saleh, 2014).
Tahboub and Saleh (2014) note that data leakage occurs when sensitive data is
revealed to unauthorized users or parties, whether intentionally or not. Data leakage can
have serious implications for many organizations. For example, the loss of
confidential or sensitive data can have a severe impact on a company’s reputation
and credibility, customers, employee confidence, and competitive advantage, and in some
cases can lead to the closure of the organization (Tahboub and Saleh, 2014). In addition,
data leakage is an important concern for business organizations in today’s increasingly
networked world, and any unauthorized disclosure of sensitive or confidential
data may have serious consequences for an organization in both the long and the short term
(Soumya and Smitha, 2014).
In addition, according to Alneyadi et al (2016), data leakage is a growing concern
among organizations and individuals. Alneyadi et al (2016) indicated that more leakages
occurred in the business sector than in the government sector: according to a 2014
report, the figures stood at 50% for the business sector and 20% for the government
sector. They further stated that although in some cases the data leaks were not detrimental
to organizations, others have caused several millions of dollars’ worth of damage.
Moreover, the credibility of businesses or organizations is compromised when sensitive
data such as trade secrets, project documents, and customer profiles are leaked to their
competitors (Alneyadi et al, 2016). Alneyadi et al (2016) add that sensitive government
information concerning political decisions, law enforcement, and national security can
also be leaked. A typical example of leaked sensitive government information is the
publication of United States diplomatic cables by WikiLeaks. “The leak consisted of about 250,000
United States diplomatic cables and 400,000 military reports referred to as ‘war logs’. This
revelation was carried out by an internal entity using an external hard drive and about
100,000 diplomatic cables were labelled confidential and 15,000 cables were classified as
secret” (Alneyadi et al, 2016, p. 137). According to Alneyadi et al (2016), this incident
drew strong public criticism from civil rights organizations all over the world. In
another incident, hackers stole 160 million credit and debit card numbers and targeted
800,000 bank accounts in the US, in what was considered one of the largest hacking incidents
to have occurred (Vadsola et al, 2014).
The need to address such serious issues culminated in the implementation of security
control mechanisms such as firewalls, VPNs, IDSs, and IPSs by several organizations (Kale
et al, 2015). According to Alneyadi et al (2016), these systems work satisfactorily when the
data is well defined, structured and constant. Alneyadi et al (2016) further stated that when
data is modified, tagged differently or compressed, these systems become naïve and
confidential data can still be leaked. For example, a firewall can have rules to block access
to confidential data, yet the same data can be accessed through other means such as
an email attachment or instant messaging (IM). This means that the traditional security
mechanisms (firewalls, VPNs, IDSs/IPSs) are handicapped because they lack an understanding
of data semantics (Alneyadi et al, 2016). To overcome this deficiency in protecting sensitive
data, a new paradigm, data leakage prevention systems (DLPSs), has been introduced.
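The point about modified or compressed data can be illustrated with a small sketch. It assumes a deliberately simplified, hypothetical filter that blocks a payload only when its SHA-256 fingerprint matches that of a known confidential document; it does not describe how any particular firewall or IDS works. The sample document and the names `fingerprint` and `exact_match_blocks` are illustrative only.

```python
import hashlib
import zlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the payload."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical confidential document known to the filter.
original = b"Q3 trade secret: supplier price list and customer profiles"
blocklist = {fingerprint(original)}


def exact_match_blocks(payload: bytes) -> bool:
    """Block a payload only if it is byte-for-byte a known document."""
    return fingerprint(payload) in blocklist


# The same content, lightly edited or compressed before sending.
modified = original.replace(b"Q3", b"Q 3")
compressed = zlib.compress(original)

print(exact_match_blocks(original))
print(exact_match_blocks(modified))
print(exact_match_blocks(compressed))
```

Only the unmodified copy is caught. The edited and compressed copies carry the same confidential content but produce different fingerprints, which is the gap that content-aware DLPSs are meant to close.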
Security and privacy issues are magnified by the velocity, volume and variety of BD, through
factors such as large-scale cloud infrastructures, diversity of data sources and formats, the
streaming nature of data acquisition, and high-volume inter-cloud migration (Shirudkar and
Motwani, 2015). BD can be sensitive or non-sensitive, and either way leakage of data can be
costly for any business or user. For instance, a leaked customer credit card record can be
costly to both the bank and the customer. Data leakage often occurs through information
sharing with users inside or outside the organization, the exchange of emails that contain
sensitive information, the public release of information on the internet or cloud, and
information that is stolen with illegal motives or disclosed unknowingly (Tidke et al, 2015).
Sensitive data takes many forms, such as banking information, credit card information,
criminal and justice data, financial data, and health records. Moreover, the advent of BD has
brought about numerous data security challenges that require different mechanisms. Given the
volume of data generated and used by organizations these days, sophisticated technologies
and methodologies are needed that can handle this volume securely and efficiently and
prevent data leakage.
Finally, although several DLP methods have been designed, little has been done to prevent
data leakage in BD using the preventive approach, which can help organizations prevent
leaks before they happen.
2.1 Motivation
The motivation for this research comes from the need to find a method to help organizations
prevent data leakage in BD using a preventive approach such as encryption, so that leaks
can be prevented before they happen. It is my belief that once this method is implemented,
it will give organizations access to a less costly solution for preventing data leakage.
With a detective approach, the system detects possible leakage incidents and applies a
corrective action capable of handling the identified incident (Shabtai et al, 2012).
However, sensitive data could still easily be leaked to unauthorized users if the data owner
is careless, which could reduce the competitive advantage of an organization (Margathavalli
et al, 2016). The argument is that, when relying on detective approaches, data owners can
still carelessly leak sensitive data to unauthorized users. It is therefore necessary to
prevent data leakage through encryption instead.
3. LITERATURE REVIEW
According to Brocke et al (2009), a literature review should be rigorous and exhaustive, and
the search process should be documented. To achieve this, the framework
suggested by Brocke et al (2009) is adopted. According to this framework, a rigorous
literature review consists of five phases: (1) definition of review scope, (2)
conceptualisation of the topic, (3) literature search, (4) literature analysis and synthesis, and
(5) research agenda (which is discussed under Literature review discussion). The
framework is displayed in Figure 1.
Figure 1: Framework for literature reviewing (Brocke et al, 2009)
3.1 Definition of review scope
This stage of the framework is based on the relevant concepts associated with
the study and the time frame for selecting past related research papers for the review.
The relevant concepts identified in this research are big data, big data security, and
data leakage prevention. The time frame covers the past six years, spanning from 2012 to
2018. This time frame was chosen because it is a requirement for the master’s thesis and
because, big data being a new concept, the most relevant papers are likely to be recent.
3.2 Conceptualisation of the research topic
Considering the research topic, the key terms or concepts relevant to this
research study are big data, big data security, and data leakage prevention.
Table 1: Conceptualisation of the research topic
Research topic | Research question | Sub-topics
Security of Big Data: Focus on Data Leakage Prevention (DLP). | How to design a method to help organizations prevent data leakage in BD? | Big data; Big data security; Data leakage prevention
3.3 Literature search
The main method for the literature search was keyword searching, using the key terms or
concepts identified during the conceptualisation phase, in a number of relevant knowledge
databases: Elsevier ScienceDirect, Google Scholar, Scopus and the Luleå University
database. The time frame spanned from 2012 to 2018, with English as the search
language. More specifically, the keywords or phrases were typed directly into these
knowledge databases and matched against article titles. Using the advanced search options,
the search was restricted to academic journals, conference papers, and peer-reviewed
publications. The keywords initially returned a large number of articles, as presented in
Table 2. These were sorted according to relevance and date. To select the articles relevant
to the other sub-topics, their titles and abstracts were reviewed. In
addition, the sub-topics were at one point combined so that only the articles
relevant to the research topic were selected. After these rigorous processes, the
number of papers was reduced as shown in Table 3.
Table 2: Knowledge Database Search (number of hits per keyword)
Knowledge database | Big data | Big data security | Data leakage prevention
ScienceDirect | 229,606 | 29,160 | 8,540
Luleå University database | 184,913 | 34,368 | 5,106
Google Scholar | 25,400 | 437 | 83
Scopus | 32,401 | 26 | 34
Table 3: Summary of knowledge database search results
Keywords / Sub-topics | Number of papers found | Number of papers used
Big data | Over 229,000 | 19
Big data security | Over 34,000 | 9
Data leakage prevention | Over 8,000 | 20
The final list of the relevant papers for review is presented in Table 4 and considered in the
Literature analysis and synthesis section (Section 3.4).
Table 4: Selected research papers for review
No. | Big data | Big data security | Data leakage prevention
1 | Inukollu et al (2014) | Kumar et al (2016) | Tahboub and Saleh (2014)
2 | Tene and Polonetsky (2013) | Shirudkar and Motwani (2015) | Kale et al (2015)
3 | McAfee and Brynjolfsson (2012) | Tabassum and Tyagi (2016) | Ram (2015)
4 | Tabassum and Tyagi (2016) | Singh and Sinha (2016) | Peneti and Rani (2015a)
5 | Moura and Serrão (2015) | Mahajan et al (2016) | Alneyadi et al (2013a)
6 | Hima Bindu et al (2016) | Bhogal and Jain (2017) | Alneyadi et al (2016)
7 | Shirudkar and Motwani (2015) | Kaushik and Jain (2014) | Jain and Lenka (2016)
8 | Toshniwal et al (2015) | Yosepu et al (2015) | Peneti and Rani (2015b)
9 | Yosepu et al (2015) | Hima Bindu et al (2016) | Tahboub and Saleh (2015)
10 | Ularu et al (2012) | | Tidke et al (2015)
11 | Jamiy et al (2014) | | Margathavalli et al (2016)
12 | Rodríguez-Mazahua et al (2016) | | Katz et al (2014)
13 | Moorthy et al (2015) | | Alneyadi et al (2015)
14 | Khan et al (2014) | | Shabtai et al (2012)
15 | Shukla et al (2015) | | Soumya and Smitha (2014)
16 | Tole (2013) | | Alneyadi et al (2013b)
17 | Ammu and Irfanuddin (2013) | | Alneyadi et al (2014)
18 | Bertino (2013) | | Ko et al (2014)
19 | Sin and Muthu (2015) | | Peneti and Rani (2016)
20 | | | Ahmad and Bamnote (2013)
3.4 Literature analysis and synthesis
The literature analysis is based on the relevant papers outlined in Table 4 and on
the concepts of big data, big data security and data leakage prevention.
3.4.1 Big Data (BD)
BD is the term used to describe volumes of structured, semi-structured and
unstructured data that are so large and complex that they are very difficult to process with
traditional databases and software technologies (Inukollu et al, 2014). In addition, the increasing
number of people, devices, and sensors that are now connected by digital networks (i.e., IoT)
has generated a vast amount of data. Data is generated from online transactions, social
networking interactions, emails, videos, images, clickstream, logs, search queries, sensors,
global positioning satellites, roads and bridges, and mobile phones (Tene and Polonetsky,
2013). The amount of data produced each day already exceeds 2.5 exabytes (McAfee
and Brynjolfsson, 2012). The types of data that make up BD are explained further below
(Tabassum and Tyagi, 2016; Yosepu et al, 2015; Moorthy et al, 2015; Toshniwal et al, 2015;
Khan et al, 2014; Shirudkar and Motwani, 2015; Ularu et al, 2012; Moura and Serrão, 2015;
Hima Bindu et al, 2016):
• Structured Data: These are data sets made up of fixed fields within a record
or file, that is, relational (tabular) data. This is the type of data found in
relational databases and created by business applications. Examples are database
tables, objects, tags, reports, indexes, etc. They are highly structured and
organized, and are mostly managed with SQL.
• Semi-Structured Data: These are text data that contain tags or mark-up elements to
separate elements and generate hierarchies of records and fields in the given
data, e.g. XML (eXtensible Markup Language) tags. In other words,
it is a type of structured data that does not conform to a formal data model: a schema
definition is optional, and tags and other markers separate semantic elements and
enforce hierarchies of records and fields within the data. This type of data is
represented in formats such as XML and JavaScript Object Notation (JSON).
• Unstructured Data: This type of data is either machine generated or human
generated. Examples are text, emails, photos, videos, audios, movies, graph data,
scientific simulations, financial transactions, phone records, geospatial maps, tweets,
Facebook data, sensor data, etc. This type of data grows most rapidly, and about 80% of
all stored organizational data is unstructured (Khan et al, 2014).
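To make the distinction between the three data types concrete, the following minimal Python sketch loads one small example of each. The sample values, the field names and the helper total_amount are invented for illustration and do not come from the cited papers.

```python
import csv
import io
import json

# Structured: every record has the same fixed fields, as in a relational table.
structured_source = io.StringIO("id,name,amount\n1,Alice,20.5\n2,Bob,13.0\n")
rows = list(csv.DictReader(structured_source))

# Semi-structured: keys/tags mark up the elements, but the schema is optional
# and records may be nested or carry different fields.
semi_structured = json.loads(
    '{"id": 3, "name": "Carol", "tags": ["vip"], "address": {"city": "Lulea"}}'
)

# Unstructured: free text with no fields; any structure has to be inferred.
unstructured = "Carol emailed support about a late delivery on Monday."


def total_amount(records):
    """Sum a numeric column; only possible because the fields are fixed."""
    return sum(float(record["amount"]) for record in records)
```

Only the structured rows can be aggregated directly by column name; the JSON record needs path-style access, and the free text would need text analysis before any field could be extracted.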
The characteristics of BD comprise the following (the initial properties are referred
to as the 3V’s: Volume, Velocity, and Variety; two more, Veracity and Value, were later
added to make the 5V’s) (Moura and Serrão, 2015; Hima Bindu et al, 2016; Shirudkar
and Motwani, 2015; Toshniwal et al, 2015; Tabassum and Tyagi, 2016; Yosepu et al, 2015):
• Volume: This feature describes the huge volumes of data, to which many factors
contribute. Data are generated from online transactions, live
streaming from social media, customer feedback, data produced by employees,
contractors, partners, and suppliers using social networking sites, and data collected
from sensors.
• Velocity: This means how fast the data is being produced and how fast the data needs
to be processed to meet the demand.
• Variety: Today data comes in all types of formats (structured and unstructured data) -
from traditional databases, text documents, emails, video, audio, online transactions
etc.
• Veracity: This feature has to do with the quality and source of the data, to ascertain
whether the data is conflicting or trustworthy. In other words, it concerns the credibility and
correctness of the data sources as well as the suitability of the data for the purpose of
use.
• Value: This is the usefulness of the data in making decisions.
These characteristics have been expanded with two more features to describe BD as follows
(Inukollu et al, 2014; Toshniwal et al, 2015; Tabassum and Tyagi, 2016; Yosepu et al, 2015):
• Variability: This feature of BD refers to inconsistency of data. In addition to the
increasing velocities and varieties of data, data flows can be highly inconsistent
with periodic peaks.
• Complexity: Also, data comes from multiple sources and this must be cleaned,
merged, matched and transformed into required format before actual processing.
Figure 2 shows the diagram of the 5 V’s of BD.
Figure 2: The 5 V’s of BD (Moura and Serrão, 2015)
3.4.1.1 Benefits of BD
The main importance of BD is its potential to improve efficiency in the use of
large sets of data of different types. If BD is used properly within any business or
organization, it will offer a better view of the business, leading to efficiency in different areas
like sales, manufacturing, marketing and so forth (Ularu et al, 2012). Businesses can increase
productivity and it is even said that “Companies that use data most effectively stand out from the
rest” (Tene and Polonetsky, 2013, p.244). BD can be used effectively in the following areas (Ularu
et al, 2012):
• In information technology, the security and troubleshooting can be improved when the
patterns in the existing logs are analysed.
• In customer services, information from call centres can be used to identify customer patterns
and thus enhance customer satisfaction by customizing services.
• In improving services and products, content from social media can be used. By knowing
the potential customers’ preferences, the company can modify its products in order to
address a larger group of people.
• In the detection of fraud in the online transactions for any industry.
• In risk assessment by analysing information from the transactions on the financial market.
Also, BD helps in decision making and competitiveness of organizations and public
administrations (Jamiy et al, 2014). Jamiy et al (2014) further argue that this will
significantly grow the entire world economy, in that organizations can take informed and timely
decisions. With this potential, it becomes easier to listen to and understand customers and their
ways of using services, and hence to offer better and improved products (Jamiy et al, 2014).
Other benefits can be grouped into the following categories (Tene and Polonetsky, 2013):
• Healthcare: When the voluminous healthcare data generated are used effectively by
employing BD techniques and technologies, this can give the right outcomes to patients
and reduce care costs in the public health and medicine fields. For example, the computing
power of BD allows us to mine entire DNA strings in minutes and makes it
possible to discover, monitor and improve health aspects of everyone and to predict disease
patterns.
• Mobile: Mobile devices are increasing in number and, with multiple sensors including cameras,
microphones, movement sensors, GPS, and Wi-Fi capabilities, collect more data than ever
in the public sphere. BD technologies also enable innovative and effective use of this
data.
• Smart Grid: The use of BD in the smart grid (the modernization of the current electrical grid
to include bidirectional flow of information and electricity) brings the benefits of
advanced data analysis. The smart grid is designed, for instance, to allow utility service
providers such as electricity companies to monitor and control electricity distribution and
usage. This helps these companies to respond to power outages or other problems in a
timely manner and precisely at the identified location. Also, environmental policymakers
view the smart grid as key to providing quality and more efficient delivery of electricity
when considering whether to move towards renewable energy.
• Traffic Management: Another area of data-driven innovation is traffic management and
control. Governments around the world are implementing electronic toll pricing systems
that offer differentiated payments based on mobility and congestion charges. These
systems apply varying prices to drivers based on their differing use of vehicles and roads.
Also, town, urban and city planners benefit from analysis of location data where decisions
can be made on how to construct roads to ease traffic congestion leading to high-density
urban development.
• Retail: BD is also transforming the retail market. Nowadays there are technologies that
suppliers can use, for instance, to determine how much of their products are
available in shops. Others can use the feature of “those who bought this, also bought this”
to determine the items consumers are buying together. This helps to determine which adverts
should be presented to prospective buyers.
3.4.1.4 BD Challenges
While the potential benefits of BD are real and significant, many challenges must be addressed to
fully realize this potential. This section presents the challenges that need to be addressed while
handling BD. Those challenges include (Ammu and Irfanuddin, 2013; Bertino, 2013; Jamiy et al,
2014; Shukla et al, 2015; Sin and Muthu, 2015; Ularu et al, 2012):
• Data Acquisition and Recording: It is critical to capture the context
in which data has been generated, so that non-relevant data can be filtered out.
This also enables data to be compressed and metadata (that
is, data about data) to be generated automatically, which supports rich data description and
the tracking and recording of provenance (that is, information that documents the history,
origin or source of content, the changes that have taken place, and who has had custody of it
since it originated). Achieving this is complex and time consuming, hence the real challenge of
handling large volumes of unstructured and structured data continuously arriving from many
sources.
• Information Extraction and Cleaning: Often data may have to be transformed in order
to extract information from it and this information should be presented in a form that is
suitable for analysis. Extracting meaningful information from huge sets of data is a major
challenge. Data may also be of poor quality and/or uncertain. Data cleaning and data
quality verification are thus critical.
• Data Integration, Aggregation and Representation: Data can be heterogeneous and may
have different metadata. Thus, data integration requires huge human effort. Manual
approaches fail to scale to what is required for BD, hence the need for newer and
better approaches that offer automated data integration. Also, different data
aggregation and representation strategies may be needed for different data analysis tasks.
• Query Processing, and Analysis: Methods suitable for BD need to be discovered and
evaluated for efficiency so that they can deal with noisy, dynamic, heterogeneous,
untrustworthy data. However, despite these difficulties, BD even if noisy and uncertain
can be more valuable for identifying more reliable hidden patterns and knowledge
compared to tiny samples of good data. Also, the often-redundant relationships existing
among data can represent an opportunity for cross-checking data and thus improve data
trustworthiness. Supporting query processing and data analysis requires scalable mining
algorithms and powerful computing infrastructures. Also, as ever more data is generated
daily, analysis of the data may consume a lot of time and resources. However, there are
many situations in which the result of the analysis is required immediately.
• Interpretation: Analysis results extracted from BD need to be interpreted by decision
makers, and this may require the users to be able to analyse the assumptions at each stage
of data processing and possibly retrace the analysis. Rich provenance is critical in this
respect. BD interpretation insights are shown in Figure 3.
• Privacy and Security: These are also important challenges for BD. Because BD consists
of a large amount of complex data, it is very difficult for a company to sort this data by
privacy level and apply the corresponding security. In addition, many companies
nowadays do business across countries and continents, and the considerable differences in
privacy laws should be taken into consideration when starting a BD
initiative. Also, security and privacy issues are magnified by the velocity, volume and
variety of BD, such as large-scale cloud infrastructures, diversity of data sources and
formats, streaming nature of data acquisition, and high volume inter-cloud migration.
• Storage: Nowadays, while the common capacity of hard disks is in the range of terabytes,
the amount of data generated daily through the internet is in the range of exabytes. This
will grow even larger in future. Traditional Relational Database Management
Systems (RDBMS) cannot handle the storage and processing of such voluminous
data. Therefore, technologies that do not use traditional SQL-based queries are
used to overcome this challenge. Compression technology also needs to be employed to
compress the data at rest and in memory.
Figure 3: BD Interpretation Insights (Shukla et al, 2015)
3.4.2 Big Data Security (BDS)
To start with, security and privacy issues are magnified by the velocity, volume and variety of
BD, such as large-scale cloud infrastructures, diversity of data sources and formats, streaming
nature of data acquisition, and high volume inter-cloud migration (Shirudkar and Motwani, 2015).
They further reiterated that with the use of these large-scale cloud infrastructures, which are
spread across the world with diverse software platforms, attacks on systems have increased, and
traditional security mechanisms are therefore not adequate. Also, there should be sophisticated
technologies that can offer fast response times to the growing demand of streaming data across
several data centres (Shirudkar and Motwani, 2015). Singh and Sinha (2016, p. 33) also support
the argument that, “the currently used security mechanisms such as firewalls and DMZs cannot
be used in the BD infrastructure because the security mechanisms should be stretched out of the
perimeter of the organization's network”.
Kumar et al (2016) also note that large-scale cloud infrastructures, multiplicity of
data sources and formats, the streaming nature of data acquisition and high volume inter-cloud
migration have brought about security concerns. They therefore reiterate that
conventional security mechanisms, which are tailored to securing small-scale, static data, are
not enough, and that proper security technologies are needed to secure BD.
They further proposed an agent-based security solution model to deal with the security issues for
cloud BD. This model is capable of authenticating a new cloud user and checking the permissions
assigned by the administrator during the user's registration.
The advent of BD has brought several challenges in terms of security of data (Tabassum and
Tyagi, 2016). They further highlighted the need for research into technologies and
methodologies that can handle voluminous data securely and efficiently. They acknowledged
that although many new technologies and methods have been developed,
these slow down to some extent when large amounts of data are involved.
3.4.2.1 BDS Challenges
BDS challenges can be grouped into four main categories, which are subdivided into ten
challenges as shown in Figure 4 with brief descriptions below (Bhogal and Jain, 2017; Shirudkar
and Motwani, 2015; Yosepu et al, 2015; Kaushik and Jain, 2014; Hima Bindu et al, 2016; Mahajan
et al, 2016):
Figure 4: Challenges in BDS
1. Secure Computations in Distributed Programming Framework:
Distributed programming frameworks use parallelism in computation and storage to
process large amounts of data. A popular example is the MapReduce framework,
which splits an input file into multiple chunks. In the first phase of MapReduce, a mapper
for each chunk reads the data, performs some computation, and generates a list of key/value
pairs. In the next phase, a reducer combines the values belonging to each distinct key and
outputs the result. There are two major attack prevention measures: securing the mappers and
securing the data in the presence of an untrusted mapper.
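The map and reduce phases described above can be sketched as a single-process word count in Python. This is only an illustration of the data flow: the function names (split_into_chunks, mapper, shuffle, reducer, word_count) are assumptions made for this sketch, and a real framework would run the mappers and reducers in parallel on separate nodes.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """Split the input into fixed-size chunks, one per mapper."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def mapper(chunk):
    """Read one chunk and emit a list of (key, value) pairs."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_outputs):
    """Group all emitted values by key before the reduce phase."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Combine the values that belong to one distinct key."""
    return key, sum(values)


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    mapped = [mapper(chunk) for chunk in chunks]  # one mapper per chunk
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())
```

The sketch also shows why an untrusted mapper matters: every pair that reaches the reducer is taken at face value, so a tampered mapper could alter the aggregate without the reducer noticing.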
2. Security Best Practices for Non-Relational Data Stores:
Non-relational data stores popularized by NoSQL databases are still developing with
respect to security infrastructure. For instance, robust solutions to NoSQL injection are
still not mature. Each NoSQL database was built to tackle different
challenges posed by the analytics world, and security was never part of the model at
any point of its design stage. Security issues of NoSQL in general remain to be improved.
Developers using NoSQL databases therefore usually embed security in the middleware,
because NoSQL databases do not provide any support for enforcing it explicitly in the database.
However, the clustering aspect of NoSQL databases poses additional challenges to the
robustness of such security practices, and enhanced security is expected to come at the
expense of performance.
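As a minimal illustration of embedding security in the middleware, the sketch below rejects query values that are not plain strings before they are placed into a MongoDB-style filter document. In document stores that accept operator objects such as {"$ne": ""} as query values, unchecked user input can turn an equality test into a condition that matches any record. The function name build_login_filter and the field names are hypothetical, no real database driver is used, and a production system would compare password hashes rather than plaintext values.

```python
def build_login_filter(username, password):
    """Build a MongoDB-style filter dict, refusing operator injection.

    Hypothetical middleware helper: values parsed from a JSON request body
    may be dicts such as {"$ne": ""}; only plain strings are accepted here.
    """
    for field, value in (("username", username), ("password", password)):
        if not isinstance(value, str):
            raise ValueError(f"{field} must be a plain string")
    return {"username": username, "password": password}


def is_safe_login_input(username, password):
    """Convenience wrapper that reports whether the input was accepted."""
    try:
        build_login_filter(username, password)
    except ValueError:
        return False
    return True
```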
[Figure 4 content: Infrastructure Security (Secure Computations in Distributed Programming Frameworks; Security Best Practices for Non-Relational Data Stores). Data Privacy (Scalable and Composable Privacy-Preserving Data Mining and Analytics; Cryptographically Enforced Data-Centric Security; Granular Access Control). Data Management (Secure Data Storage and Transaction Logs; Granular Audits; Data Provenance). Integrity and Reactive Security (End Point Input Validation / Filtering; Real-time Security Monitoring).]
3. Secure Data Storage and Transaction Logs:
Data and transaction logs are stored in multi-tiered storage media. Manually moving data
between tiers helps the IT manager to control exactly what data is moved and when.
However, as the size of data sets continues to grow exponentially, scalability
and availability have necessitated auto-tiering for BD storage management. Auto-tiering
solutions do not keep track of where the data is stored, which creates new challenges to
secure data storage. Therefore, new mechanisms are imperative to prevent unauthorised
access and maintain 24/7 availability.
4. End Point Input Validation/Filtering:
Many uses of BD in organizational settings require data collection from many sources, such as
end point devices. For instance, a security information and event management system
(SIEM) may collect event logs from millions of hardware devices and software
applications in an enterprise network. A key challenge in the data collection process is
input validation: how can we trust the data? How can we confirm that a source of input
data is not malicious? And how can we filter harmful input from our collection? Validation
and filtering of input is a daunting challenge posed by untrusted input sources, especially
with the bring your own device (BYOD) model.
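The validation and filtering questions raised above can be sketched as a simple check applied to each incoming event before it enters the collection. The rule set, the device identifiers in TRUSTED_SOURCES and the event fields are assumptions made for this sketch and are not taken from any particular SIEM product.

```python
from datetime import datetime

# Hypothetical allow-lists; a real deployment would load these from policy.
TRUSTED_SOURCES = {"fw-01", "ids-02"}
ALLOWED_SEVERITIES = {"info", "warning", "error"}
MAX_MESSAGE_LENGTH = 1024


def validate_event(event):
    """Return a list of problems found; an empty list means accepted."""
    problems = []
    if event.get("source") not in TRUSTED_SOURCES:
        problems.append("unknown source")
    if event.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("invalid severity")
    try:
        datetime.fromisoformat(str(event.get("timestamp")))
    except ValueError:
        problems.append("invalid timestamp")
    message = event.get("message")
    if not isinstance(message, str) or len(message) > MAX_MESSAGE_LENGTH:
        problems.append("invalid message")
    return problems


def filter_events(events):
    """Keep only the events that pass every validation rule."""
    return [event for event in events if not validate_event(event)]
```

Checks of this kind only establish that an event is well formed and comes from a known identifier; they do not by themselves prove that a known device has not been compromised.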
5. Real-time Security Monitoring:
Real-time security monitoring has always been a challenge, given the number of alerts
generated by (security) devices. These alerts (correlated or not) lead to many false
positives, which are mostly ignored or simply “clicked away”, as humans cannot cope with
the sheer amount. This problem might even increase with BD, given the volume and
velocity of data streams. However, BD technologies might also provide an opportunity to
rapidly process and analyse different types of data. These technologies can be used to provide,
for example, real-time inconsistency detection based on scalable security analytics.
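One simple way to picture such detection on a stream is a rolling z-score over recent event counts. This technique is chosen here purely for illustration and is not the method of the cited papers; the window size and threshold are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(counts, window=5, threshold=3.0):
    """Return the indices whose value deviates strongly from the recent past.

    Each new count is compared with the mean and standard deviation of the
    previous `window` counts, so the stream is processed in one pass.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, count in enumerate(counts):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(count - mu) / sigma > threshold:
                anomalies.append(index)
        history.append(count)
    return anomalies
```

Because only a fixed-size window is kept in memory, the same check can run per data source in parallel, which is the property that makes this style of analytics scale with the velocity of the stream.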
6. Scalable and Composable Privacy-Preserving Data Mining and Analytics:
BD can be seen as potentially enabling invasions of privacy, invasive marketing, decreased
civil freedoms, and increased state and corporate control. A recent analysis of how
enterprises are leveraging data analytics for marketing purposes identified an example of
how a vendor was able to recognize that a teenager was pregnant before her father knew.
Similarly, anonymizing data for analytics is not adequate to maintain user privacy. For
example, AOL released anonymized search logs for academic purposes, but users were
easily identified by their searches. Netflix faced a similar problem when users in its
anonymized data set were identified by correlating their Netflix movie scores with IMDB
scores. Therefore, it is important to establish guidelines and recommendations for
preventing inadvertent privacy disclosures.
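The re-identification risk described above can be illustrated with a small linkage sketch: records whose names were replaced by pseudonyms are joined with a public data set on shared attributes. All records, field names and the function link_records are invented for this example.

```python
def link_records(anonymized, public, keys):
    """Match pseudonymous records to named public records on shared keys.

    A pseudonym counts as re-identified only when exactly one public record
    has the same combination of key values.
    """
    index = {}
    for person in public:
        combination = tuple(person[key] for key in keys)
        index.setdefault(combination, []).append(person["name"])

    matches = {}
    for record in anonymized:
        candidates = index.get(tuple(record[key] for key in keys), [])
        if len(candidates) == 1:
            matches[record["pseudonym"]] = candidates[0]
    return matches


# Invented sample data: names removed, but movie and rating retained.
anonymized_ratings = [
    {"pseudonym": "u1", "movie": "A", "rating": 5},
    {"pseudonym": "u2", "movie": "B", "rating": 2},
]
public_reviews = [
    {"name": "Dana", "movie": "A", "rating": 5},
    {"name": "Eli", "movie": "B", "rating": 2},
    {"name": "Finn", "movie": "B", "rating": 2},
]
```

With these inputs, pseudonym u1 maps to a single public reviewer, while u2 stays ambiguous because two public records share the same attribute values.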
7. Cryptographically Enforced Data-Centric Security:
In order to ensure that the most sensitive private data is end to end secure and only
accessible to the authorized entities, data has to be encrypted based on access control
policies. Specific research in this area such as attribute-based encryption (ABE) has to
be made richer, more efficient, and scalable. To ensure authentication, agreement and
fairness among the distributed entities, a cryptographically secure communication
framework has to be implemented.
8. Granular Access Control:
The security property that matters from the perspective of access control is secrecy: preventing
access to data by unauthorized people. The problem with coarse-grained access
mechanisms is that data that could otherwise be shared is often swept into a more
restrictive category to guarantee sound security. Granular access control therefore gives
data managers more precision when sharing data without compromising privacy.
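The difference between coarse-grained and granular access control can be sketched by labelling each field of a record separately. The labels, clearance levels and function names below are assumptions made for this illustration only.

```python
# Hypothetical sensitivity labels per field and their ordering.
LEVELS = {"public": 0, "internal": 1, "confidential": 2}
FIELD_LABELS = {"name": "public", "department": "internal", "salary": "confidential"}


def coarse_view(record, clearance):
    """Coarse-grained: the whole record takes its most restrictive label."""
    record_level = max(LEVELS[FIELD_LABELS[field]] for field in record)
    return dict(record) if LEVELS[clearance] >= record_level else {}


def granular_view(record, clearance):
    """Granular: each field is released or withheld on its own label."""
    limit = LEVELS[clearance]
    return {
        field: value
        for field, value in record.items()
        if LEVELS[FIELD_LABELS[field]] <= limit
    }
```

For a user with internal clearance, the coarse-grained view returns nothing because one field is confidential, whereas the granular view still shares the name and department.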
9. Granular Audits:
With real-time security monitoring, we try to be notified the moment an attack takes
place. In reality, however, this will not always be the case (e.g., new attacks or missed
true positives). In order to detect a missed attack, audit information is required.
This is relevant not only because we want to understand what happened and what
went wrong, but also for compliance, regulation and forensic reasons. In that
regard, auditing is not something new, but the scope and granularity might be different.
For example, we have to deal with more data objects, which are probably (but not
necessarily) distributed.
10. Data Provenance:
Provenance metadata will increase in complexity due to large provenance graphs
generated from provenance-enabled programming environments in BD applications.
Analysis of such large provenance graphs to identify metadata dependencies for security
or confidentiality applications is computationally intensive.
3.4.3 Data Leakage Prevention (DLP)
According to Kale et al (2015), a Data Leakage Prevention (DLP) solution is one of the new
technical solutions and methodologies that protect an organization's sensitive data
from being viewed by the wrong users or individuals, whether outside or inside the organization.
This means that specific data should be viewed only by authorized individuals or groups (Kale
et al 2015). “DLP solutions detect and prevent unauthorized attempts to copy or send sensitive
data, both intentionally or/and unintentionally, without authorization, by people who are
authorized to access the sensitive information” (Kale et al, 2015, p. 55; Tidke et al, 2015, p. 28).
In other words, “DLP is a technique used to hide the confidentiality of data being accessed by
unauthorized user” (Jain and Lenka, 2016, p. 57). In addition, DLP solutions are products
designed to detect potential data breach incidents in a timely manner and prevent them by
monitoring data while in use (endpoint actions), in motion (network traffic) or at rest (data
storage) (Tahboub and Saleh, 2014). DLP solutions address data leaks in the following three states
of data throughout their lifecycle by applying a specific set of technologies (Tahboub and Saleh,
2014; Ahmad and Bamnote, 2013; Peneti and Rani, 2015a):
• Data-at-Rest (DAR): Data that resides in file systems, databases and other storage
methods, e.g. a company’s financial data stored on the financial application server.
• Data-in-Use (DIU): Data at the endpoints of the network (e.g. data on USB devices,
external drives, MP3 players, laptops, and other highly mobile devices). In other
words, all data with which the user is interacting.
• Data-in-Motion (DIM): Any data that is moving through the network, including data sent
to the outside via the Internet. This applies to all data transmitted over
wired or wireless connections, e.g. customer purchasing details sent over the Internet. These
data may be sent inside the internal network of an organization or may cross
over into an external network.
Figure 5 shows these three data states.
Figure 5: Different Data States (Alneyadi et al, 2013a)
Moreover, Ram (2015) explains that DLP is very useful in that it helps organizations to protect
not only structured data but also unstructured data from leakage.
Ram (2015) further reiterated that DLP serves as a data control mechanism that fits naturally
with the organizational business structure. According to Peneti and Rani (2015b),
data leakage prevention systems (DLPSs) make use of confidential terms and data identification
methods for controlling data leakages in the organization. First, a DLPS identifies which
documents are confidential and which are non-confidential. According to Alneyadi
et al (2016), a DLPS can be defined as a system that is designed to detect and prevent the
unauthorised access, use, disclosure, or transmission of confidential information. It is even
possible to use DLP to reduce risk, improve data management practices and
lower compliance cost (Ram, 2015). Several DLP technologies are available on the market.
Ram (2015) made reference to one DLP technology called MyDLP, which is an open-source
all-in-one data loss / leak prevention software that runs with multi-site configurations on
network servers and endpoint computers. In addition, many vendors offer DLP
solutions for various operating systems, including Symantec,
Websense, McAfee, MyDLP and many others (Tahboub and Saleh, 2015).
Tahboub and Saleh (2014) also explain that there are differences between existing data
protection systems such as firewalls, Intrusion Detection Systems / Intrusion Prevention
Systems (IDSs / IPSs), antivirus, antispam, antimalware, encrypting and digital rights
management tools and a DLPS. They further explained that “the main difference between a
DLPS and existing technologies is that DLPSs are content-aware; they are designed to give
visibility into where the company's most sensitive data is stored, who has access to it, and where
and by whom it is sent outside the company's network. Existing security applications cannot
perform this level of monitoring” (Tahboub and Saleh, 2014, p. 17). This assertion is also
supported by Alneyadi et al (2016), “DLPSs differ from conventional security controls such as
firewalls, VPNs and IDSs in terms of dedication and proactivity. Conventional security controls
have less dedication towards the actual content of the data” (p. 138). Tahboub and Saleh (2014)
also reiterated that a DLPS should also be able to provide additional functionality to prevent
sensitive data from being sent outside the organization’s network either through an endpoint
computer or the network.
3.4.3.1 DLP Solutions
DLP solutions can be grouped according to a taxonomy that incorporates the following
features: data state, deployment scheme, leakage handling approach / method, and action taken
upon leakage, as indicated in Figure 6 (Shabtai et al, 2012; Peneti and Rani, 2015a; Alneyadi et
al, 2016):
Figure 6: A taxonomy of DLP solutions (Shabtai et al, 2012)
• What to protect? (data state): DLP solutions offer protection by differentiating
between the three states of data throughout the lifecycle: DAR, DIU, and DIM.
• Where to protect? (deployment scheme): The two main deployment schemes
considered during the installation of DLP solutions are endpoint and network. Those
deployed at the endpoint directly control devices or users. A solution deployed at
the endpoint monitors and controls access to data, while a supervisory server takes
control of the administrative procedures and the distribution of policies. In this way, all
DIU and DAR are protected. The network DLP solution, on the other hand, is
deployed at the network level so that all network traffic is analysed. Transmissions
that violate the predefined policies are thus
identified and blocked.
• How to protect? (leakage handling approach): Leakage incidents are handled by
two main approaches: detective and preventive. With the detective approach, the system
detects any possible leakage incidents and applies a corrective action capable of
handling the identified incident (Shabtai et al, 2012). Detection takes the following forms:
context-based inspection, content-based inspection, and content tagging. DLP solutions
also support preventive approaches by applying the following mechanisms: access control,
disabling functions, encryption, and awareness. With the preventive approach,
possible leakage incidents are prevented before they happen by applying proper
measures (Shabtai et al, 2012). Alneyadi et al (2016) also supported the idea of preventive
and detective mechanisms by categorizing DLPSs based on the technique or method used, as
indicated in Figure 7.
Figure 7: Data leakage prevention categorisation by method. (Alneyadi et al, 2016)
• Preventive method:
▪ Policy and Access Rights: Data leaks are prevented based on strict security
policies and access rights. Some policies in organizations can restrict the use of
USB drives and CDs.
▪ Virtualisation and Isolation: The advantages of virtualisation are applied here to
protect sensitive data. Virtual environments are created whenever sensitive data is
accessed, and only certain allowed processes are permitted to run in them.
▪ Cryptographic Approaches: Cryptography is a way to hide sensitive data from
unauthorized users by making use of cryptographic tools and algorithms. Some
cryptographic approaches protect against data leaks in DIU and DAR states within
the confines of organizations.
Margathavalli et al (2016) use the Attribute Based Encryption (ABE) algorithm as a data
leakage prevention method to preserve sensitive data, which falls under the preventive method.
The idea behind their proposed method is to keep the sensitive data locked so that only
authorized users can access it. Their argument is that, when relying on detective approaches,
data owners can still carelessly leak sensitive data to unauthorized users. It is therefore
necessary to prevent the leakage in the first place through encryption. According to
Margathavalli et al (2016), ABE is a type of public-key encryption in which the secret key of a
user and the ciphertext both depend on attributes, for instance the country in which someone
lives or the kind of subscription he or she has. With this approach, decryption of the ciphertext
is possible only if the set of attributes of the user in question matches the attributes of the
ciphertext. In other words, the ciphertext is constructed from attributes, the private key is
associated with attributes, and the private key can be used to download the data only when the
two match.
▪ Quantifying and Limiting: In this approach, security administrators act as attackers
against their own systems in order to find and block all loopholes leading to
sensitive data. This approach can be used in both detective and preventive methods.
• Detective Method:
▪ Data Identification: How sensitive data is detected depends on prior knowledge of
the targeted content. Techniques such as data fingerprints, regular expressions, and
exact or partial data matching are involved.
▪ Social and Behavioural: Social network analysis and behavioural patterns can help
detect irregularities and raise an alarm so that security administrators can react to
them.
▪ Data Mining / Text Clustering: Data mining can perform advanced tasks such as
anomaly detection, clustering and classification by extracting data patterns from
large datasets. Data mining is related to machine learning, whose algorithms
recognize complex patterns and support better decisions. Text clustering, which is
related to information retrieval, also plays a significant role in DLPSs.
Several information security solution providers have incorporated elements of the above
taxonomies in the development of their DLP solutions. The top DLP solution vendors include the
following (Alneyadi et al, 2016; Shabtai et al, 2012):
• Websense (provides Triton)
• McAfee (provides McAfee Data Loss Prevention)
• RSA
• Symantec
• Trend Micro
• MyDLP (being Open-source)
• VMware (provides AirWatch)
• Check Point Software Technologies (provides Check Point DLP)
• General Dynamics Fidelis Cybersecurity Solutions (provides Fidelis XPS)
• Varonis Systems (provides Varonis IDU Classification Framework).
3.4.3.2 DLPS Analysis Techniques
DLPSs use various techniques to analyse whether data should be classified as sensitive
(confidential) or non-sensitive (non-confidential), and the result is subsequently used for
detection. These techniques fall into two main groups, context-based and content-based analysis,
as discussed below (Alneyadi et al, 2016; Alneyadi et al, 2015; Katz et al, 2014; Shabtai et al,
2012):
• Context analysis technique: The context-based technique considers the metadata (such as
size, timing, source, format and destination) associated with the actual confidential data,
without regard to the sensitivity of the content. The DLPS studies the context surrounding
the confidential data in order to detect potential data leaks. For instance, if a user wants to
send data to another entity, contextual attributes such as source, file size and format,
destination, and timing are examined. These features can then be compared against known
transaction patterns or predefined policies. The context analysis technique is sometimes
combined with the content-based analysis technique in order to be effective.
Katz et al (2014) proposed a context-based model comprising two phases: training and
detection. In the training phase, clusters of documents are produced and a graph representation
of the confidential content of each cluster is generated. This representation consists of key terms
and the context in which they must appear in order to be considered confidential. In the
detection phase, each tested document is assigned to several clusters and its contents are then
matched against each cluster’s graph in an attempt to determine the confidentiality of the
document. Soumya and Smitha (2014) also developed a DLPS based on context keyword
matching and encrypted data detection. Their main idea is to enhance the security of DLPSs by
determining the confidentiality of documents from the context of keywords and by detecting
encrypted information in word or text documents. Their approach also has two phases, learning
and detection, and resembles that of Katz et al (2014) in its use of clustering and graph
representations.
• Content analysis technique: This technique works by focusing on the actual content of the
confidential data rather than the context. Since the main aim of any DLPS is to detect and
prevent confidential data from being leaked, it is more important and effective to consider
the content. The three main content analysis techniques are: data fingerprinting (including
exact or partial match), regular expression (including dictionary-based match) and
statistical analysis. N-gram analysis and term weighting analysis are the main statistical
analysis techniques.
▪ Data fingerprinting: This is the most common technique used to detect data
leakage. In most DLPSs, a whole file is hashed using a conventional hash function
such as MD5 or SHA1, and the hash values of all sensitive documents are stored in
databases or on the local machines under inspection. Such DLPSs can achieve
100% detection accuracy as long as the file is not modified in any way. However,
confidential documents are subject to change, and any change to the data alters the
hash value, which makes conventional fingerprinting ineffective. Advanced
fingerprinting methods such as Rabin’s randomised fingerprinting and fuzzy
fingerprinting can address some of these modification issues.
▪ Regular expression: This is another popular method used in DLPSs. A regular
expression is a set of terms or characters that forms a detection pattern, and such
patterns are used to match and compare sets of data strings mathematically. The
technique is mostly used in search engines and text processing to validate, extract
and replace data. In information security, however, regular expressions are mostly
used to inspect data for malicious code or confidential data.
▪ Statistical analysis: This technique makes use of tools such as machine learning
classification and information retrieval term weighting. These mostly depend on the
frequency of terms and n-grams within a set of documents. A term simply means a
word, while an n-gram can be a word or a piece of a word, such as a unigram (one
character), bigram (two characters) or trigram (three characters). The main
statistical analysis techniques are N-gram and term weighting analyses, and N-gram
statistical analysis addresses the drawbacks of regular expressions and data
fingerprinting. The idea of N-gram analysis is to break each word in a document
into small character N-grams and rank them by frequency to create N-gram
profiles. Term weighting, on the other hand, is a statistical method that indicates the
significance of a word within a document. It is normally used in text classification
with vector space models, where documents are represented as vectors.
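The first two content analysis techniques above can be illustrated with a minimal sketch. It assumes SHA-256 in place of the MD5/SHA1 functions named in the text, and the sample document, the fingerprint store and the card-number pattern are all hypothetical. It shows why whole-file hashing only catches unmodified copies, and how a regular expression flags data by its shape rather than by an exact match.

```python
import hashlib
import re


def fingerprint(data: bytes) -> str:
    """Whole-file fingerprint (SHA-256 here, swapped in for MD5/SHA1)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical store holding the fingerprints of known sensitive documents.
sensitive_doc = b"Project Falcon: unit cost 41.70 EUR, launch in Q3."
known_fingerprints = {fingerprint(sensitive_doc)}


def is_exact_leak(outgoing: bytes) -> bool:
    """True only if the outgoing data is byte-for-byte a known document."""
    return fingerprint(outgoing) in known_fingerprints


# Illustrative pattern only: 16 digits in four groups, optionally separated
# by a space or a hyphen, as on a payment card.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def contains_card_number(text: str) -> bool:
    """True if the text contains something shaped like a card number."""
    return CARD_PATTERN.search(text) is not None


unchanged = is_exact_leak(sensitive_doc)
# Changing a single character produces a different hash, so the copy is missed.
modified = is_exact_leak(sensitive_doc.replace(b"Q3", b"Q4"))
card_hit = contains_card_number("Please charge 4111 1111 1111 1111 today.")
card_miss = contains_card_number("Invoice 2018-06 has 3 items.")
```

The missed `modified` case is the weakness that the advanced fingerprinting and statistical techniques described above are meant to address.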
Alneyadi et al (2013a) proposed a word N-gram based classification method that uses N-gram
frequencies to classify documents in order to detect and prevent leakage of sensitive data.
Alneyadi et al (2014) studied the effectiveness of N-gram based statistical analysis, combined
with stemmed words, for classifying documents according to their topics. In short, they used a
stemmed N-gram classification methodology, which gave a classification accuracy of 92%. In
addition, Alneyadi et al (2013b) investigated the use of N-gram statistical analysis for data
classification purposes. The method they presented uses N-gram frequencies to classify
documents under distinct categories, and a technique called “taxicab” geometry to compute the
similarity between documents and existing categories. This method correctly classified 90.5% of
the tested documents.
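A minimal sketch of N-gram frequency classification with taxicab distance is given below. It is not the implementation of Alneyadi et al: it assumes character trigrams, relative frequencies, and two tiny hypothetical category texts, purely to show how a document profile is compared with category profiles.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> dict:
    """Relative frequency of character n-grams, computed word by word."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def taxicab_distance(p: dict, q: dict) -> float:
    """Sum of absolute frequency differences over all n-grams in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def classify(text: str, category_profiles: dict) -> str:
    """Assign the text to the category whose profile is nearest."""
    profile = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda c: taxicab_distance(profile, category_profiles[c]),
    )


# Hypothetical category profiles built from a few representative words.
category_profiles = {
    "finance": ngram_profile("quarterly revenue budget forecast profit margin invoice"),
    "sports": ngram_profile("football match goal team league player score"),
}

label = classify("the revenue forecast and budget were revised", category_profiles)
```

Because the comparison works on fragments of words rather than whole files, a document that has been partly rewritten still shares most of its profile with its category.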
Several other techniques have been proposed to deal with DLP. Jain and Lenka (2016) proposed a
DLP technique based on image steganography. The technique prevents data from being
outsourced by giving sensitive data a special inscription so that it cannot be reproduced.
According to Jain and Lenka (2016), in practice a file, message or image is embedded within
another image file, and that image file is then transmitted. An unauthorized user therefore does
not know that data has been embedded in the image. In short, steganography allows data to be
hidden in an image in such a way that it cannot be perceived.
In addition, Ko et al (2014) proposed a novel user-centric, mantrap-inspired DLP approach which
can detect any sending of data, both authorized and unauthorized, report it to end-users, and give
them the opportunity to stop the sending process. They implemented their own kernel module for
the Linux operating system, which works together with a user-space program to obtain the user's
approval for every sending process, giving users full control over all outbound data sent from
their devices.
Peneti and Rani (2015b) also proposed a confidential data identification method that uses a data
mining approach to classify documents as confidential or non-confidential. They employed
clustering and language modelling techniques during the training phase. During the detection
phase, the confidentiality scores of all inspected documents are checked against predefined
confidentiality scores, and documents that exceed a certain threshold are marked and blocked.
Moreover, Peneti and Rani (2016) developed an algorithm for DLP with time stamps, arguing that
the time stamp plays an important role in identifying confidential data. For example, in an
educational system a question paper is considered confidential until the examination date; once
the examination is over and the paper is in the public domain, it is treated as non-confidential.
Their method also made use of a clustering technique. It achieved 100% detection for documents
whose content is entirely confidential or entirely non-confidential. However, it was not able to
detect small portions of confidential content within non-confidential documents.
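The time stamp idea in the question paper example can be sketched as a simple date rule. This illustrates only the time-dependent part, not the clustering used by Peneti and Rani; the function name and dates are hypothetical, and treating the examination date itself as still confidential is an assumption.

```python
from datetime import date


def is_confidential(confidential_until: date, today: date) -> bool:
    """A document stays confidential up to and including the given date."""
    return today <= confidential_until


exam_date = date(2018, 6, 1)  # hypothetical examination date
before_exam = is_confidential(exam_date, date(2018, 5, 20))
after_exam = is_confidential(exam_date, date(2018, 6, 2))
```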
Furthermore, Alneyadi et al (2015) presented a statistical DLP model to classify data based on
semantics. Their contribution to the DLP field is a statistical analysis that is able to detect
evolved confidential data. Their model uses the well-known information retrieval concept of
Term Frequency-Inverse Document Frequency (TF-IDF) to classify documents under certain
topics. The classification results were presented with a Singular Value Decomposition (SVD)
matrix. The results indicated that the proposed statistical DLP approach could correctly classify
documents even in extreme cases of modification. It also achieved high recall and precision
scores.
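The TF-IDF weighting mentioned above can be sketched as follows. This is a simplified illustration, not the model of Alneyadi et al: it omits the SVD step, uses a plain logarithmic IDF, and works on three tiny hypothetical documents. It shows how a modified document can still be matched to its confidential source through weighted term overlap.

```python
import math
from collections import Counter


def tf_idf_vectors(docs: list) -> list:
    """Return one {term: weight} dictionary per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical corpus: a confidential source, a public document, and a
# reworded version of the confidential one that is about to be sent out.
docs = [
    "merger acquisition price confidential board".split(),
    "newsletter holiday party office lunch".split(),
    "board approved merger price soon".split(),
]
confidential_vec, public_vec, outgoing_vec = tf_idf_vectors(docs)

sim_confidential = cosine(outgoing_vec, confidential_vec)
sim_public = cosine(outgoing_vec, public_vec)
```

Here the outgoing document is closer to the confidential source than to the public one even though it is not an exact copy.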
The above approaches, techniques and algorithms are summarised in Table 5.
Table 5: Summary of Previous DLPS Analysis Techniques / Methods
| List of Papers | Techniques / Algorithms | Method / Analysis | Contributions | Limitations |
| --- | --- | --- | --- | --- |
| Katz et al (2014), CoBAn: A context based model for data leakage prevention. | Context-based approach, Clustering | Detective / Context | A novel approach regarding the context of key terms for classification purposes. A new approach for the graph representation of text. | The method is not capable of using external data sources to enhance the representation of confidential content. |
| Soumya and Smitha (2014), Data Leakage Prevention System by Context based Keyword Matching and Encrypted Data Detection. | Context-based keyword matching, Entropy method, Clustering | Detective / Context | Detection of small portions of confidential information in a non-confidential document. Effective check of whether information going out of the organization is confidential or encrypted. | Cannot perform any cryptanalysis process for retrieving the information represented by the encrypted data. |
| Alneyadi et al (2013a), Word N-gram Based Classification for Data Leakage Prevention. | N-grams frequency | Detective / Content | Covered most aspects in using N-grams for data classification. | Encrypted documents pose a great challenge to the method. Modifying or replacing a word could lead to misclassification. |
| Alneyadi et al (2014), A Semantics-Aware Classification Approach for Data Leakage Prevention. | N-grams based statistical analysis | Detective / Content | Effects of data modification showed acceptable accuracy. | Lacks term weighting approaches, which are more flexible than raw frequency. |
| Alneyadi et al (2013b), Adaptable N-gram classification model for data leakage prevention | N-grams frequency, simple taxicab geometry | Detective / Content | Achievement of high levels of recall and precision as compared to existing methods. | The method is not effective when word synonyms and special characters are used. |
| Jain and Lenka (2016), A Review on Data Leakage Prevention using Image Steganography. | Image steganography, LSB (Least Significant Bit) technique | Preventive / Content | Being able to prevent data from being outsourced by giving a special inscription to sensitive data so that it cannot be reproduced. | The method couldn’t cover all image file formats. |
| Ko et al (2014), A Mantrap-Inspired, User-Centric Data Leakage Prevention (DLP) Approach. | Kernel-space mantrap approach | Detective, Preventive / Content | Users have full control over all outbound data sending processes in their devices. | Lacks a graphical user interface (GUI) for the users. |
| Peneti and Rani (2015b), Confidential Data Identification Using Data Mining Techniques in Data Leakage Prevention System. | Clustering and language modelling techniques | Detective / Content | Detection of entire confidential documents and detection of small portions of confidential content embedded in larger non-confidential documents. | Cannot detect and prevent encrypted data. |
| Peneti and Rani (2016), Data Leakage Prevention System with Time Stamp. | Clustering | Detective / Content | Time stamp method is best suited for both large and small datasets. | Cannot detect small portions of confidential content in non-confidential documents. |
| Alneyadi et al (2015), Detecting Data Semantic: A Data Leakage Prevention Approach. | Statistical data analysis (TF-IDF) | Detective / Content | Contributed to the DLP field by using statistical data analysis to detect evolved confidential data. The proposed statistical DLP approach could correctly classify documents even in cases of extreme modification. | Cannot detect and prevent encrypted data. |
| Margathavalli et al (2016), Preserving Sensitive Data by Data Leakage Prevention Using Attribute Based Encryption Algorithm | Attribute Based Encryption | Preventive / Content | Prevent data leakage by encryption (ABE). | Cannot deal with detection. |
3.4.3.3 Data / Text Classification Methods
The proposed DLP method will focus on semi-structured data (textual data). Although text
documents fall under unstructured data, they can also be grouped under semi-structured data
because they contain certain structured features. The main idea behind the DLP method is to
categorize data as either confidential or non-confidential, and to encrypt the confidential data.
Therefore, most of the data classification methods to be considered are text classification or
categorization methods. Text classification has been defined as the act of dividing input
documents into two or more classes such that each document can be said to belong to one or
multiple classes (Vala and Gandhi, 2015). In other words, it is “the task of assigning predefined
categories to documents” (Bali and Gore, 2015, p. 4888). Text classification methods include
Support Vector Machine (SVM), K Nearest Neighbor (KNN), Artificial Neural Network (ANN),
Naive Bayes Classifier (NBC), and Decision Trees (DT) (Vala and Gandhi, 2015; Patil et al,
2016; Topaloğlu, 2013; Patra and Singh, 2013).
SVM is a classification method used to classify both linear and non-linear data (Patra and
Singh, 2013). Patra and Singh (2013) further explained that the method uses a non-linear
mapping to transform the training data into a higher dimension and then searches for the optimal
linear separating hyper plane. Patel and Mistry (2015) also noted that SVM works with both
positive and negative training data sets, which is uncommon among other classification methods.
According to Patel and Mistry (2015), the SVM requires these positive and negative training sets
in order to find the decision surface that best separates the data in the n-dimensional space,
which is referred to as the hyper plane. The document representatives closest to the decision
surface are known as the support vectors, as shown in Figure 8 (Patel and Mistry, 2015).
Figure 8: Example of SVM hyper plane pattern (Patel and Mistry, 2015)
According to Figure 8, the equation of the hyper plane for the linearly separable space is
WX + B = 0, where X is an arbitrary object to be classified, and the vector W and the constant B
are learned from the set of linearly separable objects in the training data sets or documents
(Patel and Mistry, 2015). Patel and Mistry (2015) further explained that hyper planes are mostly
used to separate two different classes of data. However, the SVM can also work on pre-classified
documents (Patel and Mistry, 2015; Chavan et al, 2014). SVM has been extensively and
successfully used for text classification tasks (Thaoroijam, 2014). According to Ba-Alwi and
Albared (2016), SVM is a very popular text categorization method in the machine learning
community. “SVM is considered as one of the most effective classification method according to
its performance on text classification as proven by many researches” (Ba-Alwi and Albared,
2016, p. 5).
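The role of the hyper plane WX + B = 0 can be made concrete with a short sketch of the linear decision rule. The weight vector, constant and feature vectors below are hypothetical values chosen for illustration, not learned from data; a real SVM would obtain W and B from the training sets.

```python
def decision_value(w: list, x: list, b: float) -> float:
    """Signed score W.X + B; zero means the object lies on the hyper plane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: list, x: list, b: float) -> str:
    """Objects on the non-negative side are labelled confidential (assumption)."""
    return "confidential" if decision_value(w, x, b) >= 0 else "non-confidential"


# Hypothetical hyper plane parameters and two-feature document vectors.
w = [1.0, -1.0]
b = -0.5

on_plane = decision_value(w, [1.0, 0.5], b)   # 1.0 - 0.5 - 0.5
label_a = predict(w, [2.0, 0.5], b)           # score  1.0
label_b = predict(w, [0.5, 2.0], b)           # score -2.0
```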
NBC is a well-known and practical probabilistic classifier which has been employed in many
applications (Chavan et al, 2014). It assumes that all attributes (i.e., features) of the examples
are independent of each other given the class, i.e., the classifier makes an independence
assumption (Chavan et al, 2014). The method is fast and easy to implement (Patil et al, 2016). It
is also a well-known statistical method that performs relatively well on large datasets, which
makes it useful in text classification problems (Patil et al, 2016). It is based on Bayes' theorem
combined with the assumption of independence between features (Patel and Mistry, 2015). “The
NB classifiers solves the text classification problem as follows:
given a document d which represented as a set of feature terms {ti | i=1, 2, ..., |d|} and c is a
category in the category set C, where |C| >2. NB can be defined as the conditional probability of
c given d constructed as follows:
” (Ba-Alwi and Albared, 2016, p. 5).
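For reference, the standard textbook form of the Naive Bayes posterior is given below, using the symbols of the quoted passage (document d, feature terms t_i, category c in C). It is assumed, not confirmed, that this is the form Ba-Alwi and Albared (2016) present.

```latex
P(c \mid d) \;=\; \frac{P(c)\,\prod_{i=1}^{|d|} P(t_i \mid c)}{P(d)},
\qquad
\hat{c} \;=\; \arg\max_{c \in C} \; P(c)\,\prod_{i=1}^{|d|} P(t_i \mid c)
```

The product over terms follows from the independence assumption described above, and P(d) can be dropped when choosing the category because it is the same for every c.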
The KNN classifier works on the principle that documents which are close to each other in the
feature space belong to the same class (Patel and Mistry, 2015). According to Patel and Mistry
(2015), the algorithm calculates the similarity between a text document and its neighbours. In
other words, the KNN algorithm classifies a document or object by a vote among the k labelled
training examples at the smallest distance from it (Bali and Gore, 2015; Patra and Singh, 2013).
It is a case-based learning algorithm that relies on a distance or similarity function for pairs of
observations, such as the Euclidean distance or cosine similarity measures (Patil et al, 2016;
Korde and Mahender, 2012). The advantages of KNN are its simplicity, effectiveness and ease of
implementation, which make it suitable for many applications (Korde and Mahender, 2012; Bali
and Gore, 2015). However, one drawback of this method is the long time and the difficulty
involved in finding the optimal value of k, especially when a large number of training examples
is given (Korde and Mahender, 2012; Bali and Gore, 2015).
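The voting principle of KNN can be sketched as follows, assuming cosine similarity over raw word counts and a small hypothetical training set. The choice k = 3 is arbitrary here, which reflects the difficulty of choosing k noted above.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_classify(text: str, training: list, k: int = 3) -> str:
    """Label the text by majority vote among its k most similar examples."""
    query = Counter(text.lower().split())
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled training documents.
training = [
    ("salary list for all employees", "confidential"),
    ("employees salary review draft", "confidential"),
    ("customer contract and salary terms", "confidential"),
    ("lunch menu for friday", "non-confidential"),
    ("office party on friday", "non-confidential"),
]

label_salary = knn_classify("salary review for employees", training)
label_lunch = knn_classify("friday lunch party", training)
```

Every classification compares the query with all training examples, which is why classification time grows with the size of the training set.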
DT is a classification method that builds its classification model in the form of a tree (Nisha
and Karthik, 2016). The topmost decision node, which contains all the documents, is referred to
as the root node (Patel and Mistry, 2015). According to Nisha and Karthik (2016), each internal
node is made up of a subset of the documents that are separated based on one attribute or feature.
In other words, a DT is a flowchart-like tree structure in which each internal node represents a
test on an attribute, each branch represents an outcome of the test, and each leaf node holds a
class label (Nalini and Sheela, 2014). DT is capable of handling both categorical and numerical
data (Korde and Mahender, 2012; Nisha and Karthik, 2016). A DT classifier poses a series of
carefully crafted questions about the various attributes of the test record. Each time it receives
an answer, a follow-up question is asked until a conclusion about the class label of the record is
reached (Chavan et al, 2014; Nisha and Karthik, 2016). One remarkable advantage of DT is that
it is easy to understand and interpret, even for non-expert users who are not familiar with the
details of the model (Nalini and Sheela, 2014; Bali and Gore, 2015). One disadvantage is that
irrelevant attributes may badly affect the construction of a DT (Patra and Singh, 2013; Vala and
Gandhi, 2015).
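The question-and-follow-up behaviour of a DT can be sketched with a small hand-written tree. The tree below is hypothetical and is not learned from data; the attribute names are illustrative only. Internal nodes hold a test, branches hold the outcomes, and leaves hold the class label.

```python
# Hypothetical, hand-written tree: each internal node asks one yes/no question.
tree = {
    "question": "marked_internal",
    "yes": {"label": "confidential"},
    "no": {
        "question": "mentions_customer_data",
        "yes": {"label": "confidential"},
        "no": {"label": "non-confidential"},
    },
}


def classify(node: dict, record: dict) -> tuple:
    """Walk from the root to a leaf, returning the label and questions asked."""
    asked = []
    while "label" not in node:
        question = node["question"]
        asked.append(question)
        node = node["yes"] if record[question] else node["no"]
    return node["label"], asked


label_1, path_1 = classify(tree, {"marked_internal": True,
                                  "mentions_customer_data": False})
label_2, path_2 = classify(tree, {"marked_internal": False,
                                  "mentions_customer_data": False})
```

The list of questions asked on the way to a leaf is what makes the result easy for a non-expert to follow.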
Neural Networks (NNs) consist of many individual processing units or elements called neurons,
which are connected by weighted links that allow neurons to activate other neurons (Bali and
Gore, 2015; Thaoroijam, 2014). These neurons work together to solve a specific problem (Patel
and Mistry, 2015). Figure 9 is a block diagram of an NN.
Figure 9: NN Block Diagram (Patel and Mistry, 2015)
NNs have the ability to extract meaningful information from a huge set of data, because the
neurons are configured for specific application areas such as pattern recognition, feature
extraction, and noise reduction (Patel and Mistry, 2015). NNs have the advantage of being
flexible, with the disadvantage of very high computing costs (Thaoroijam, 2014). Another
disadvantage of NNs is that they are difficult for an average user to understand (Thaoroijam, 2014).
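How weighted links let neurons activate other neurons can be sketched with a forward pass through a tiny network. All weights, biases and input features below are hypothetical and untrained, and the sigmoid activation is an assumption; the sketch shows only how a signal flows from inputs through a hidden layer to one output.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a weighted sum into an activation between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs: list, weights: list, biases: list) -> list:
    """Each neuron sums its weighted inputs, adds a bias, and activates."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters: 3 input features, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.4, 0.1], [-0.3, 0.8, 0.2]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

features = [1.0, 0.0, 2.0]
hidden = layer(features, hidden_w, hidden_b)
score = layer(hidden, output_w, output_b)[0]
```

The output is just a number between 0 and 1 produced by stacked weighted sums, which illustrates why the learned result is hard for an average user to interpret.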
The advantages and disadvantages of the classifiers discussed above are presented in Table 6
(Patel and Mistry, 2015; Patra and Singh, 2013; Vala and Gandhi, 2015; Vanjari and Thombre,
2015).
Table 6: Advantages and Disadvantages of Classifiers
| Classifiers | Advantages | Disadvantages |
| --- | --- | --- |
| SVM | Capture the inherent characteristics of the data better; Global minima vs. local minima; Compact description of the learned model, more capable to solve multi label classification | Parameter tuning; Kernel selection; Training speed is slow |
| NBC | Work well on numeric and textual data; Easy to implement; Easy computation; Requires small amount of training data to estimate parameters; Good results are obtained in most cases | Conditional independence assumption is violated (the assumption of class conditional independence leads to loss of accuracy); Performs very poorly when features are correlated with each other; Practically dependencies exist among variables and sometimes these dependencies cannot be modelled by NB |
| KNN | Effective; Non-parametric; More local characteristics of document are considered compared with Rocchio; Simple, effective and easy to implement | Classification time is long; Difficult to find optimal value of k |
| DT | Easy to understand; Easy to generate rules; Reduce problem complexity; Simple, even non-expert users can understand | Training time is relatively expensive; One branch; Once a mistake is made at a higher level, any sub tree is wrong; Does not handle continuous variables well; May suffer from over fitting; Irrelevant attributes may badly affect the construction of a decision tree |
| NN | Produce good results in complex domains; Suitable for both discrete and continuous data; Testing is very fast | Training is relatively slow; Learned results are difficult for users to interpret; It may lead to over fitting |
3.5 Literature review discussion
This section discusses the findings of the literature review. Security and privacy are of great
concern when it comes to BD. Traditional security mechanisms such as firewalls and IDSs are
clearly not adequate, and more sophisticated technologies are therefore needed to handle the
security of BD and prevent data leaks.
For any effective DLPS to be implemented, both context-based and content-based analysis
techniques should be considered. This means that the context surrounding the confidential data
and the content itself are both very important in DLP. The three main content analysis
techniques are data fingerprinting (including exact or partial match), regular expression
(including dictionary-based match) and statistical analysis. N-gram analysis and term weighting
analysis are the main statistical analysis techniques, and N-gram statistical analysis addresses
the drawbacks of regular expressions and data fingerprinting. However, most of the DLP
techniques and methods discussed in section 3.4.3.2 fall under the detective approach. A lot of
work has been done under the detective method, in which the system detects possible leakage
incidents and applies a corrective action capable of handling the identified incident (Shabtai et
al, 2012). However, sensitive data could still be leaked easily to unauthorized users if the data
owner is careless, which could reduce the competitive advantage of an organization
(Margathavalli et al, 2016).
Furthermore, the preventive approach of a DLP solution ensures that possible leakage incidents
are prevented before they happen by applying proper measures such as access control, disabling
functions, encryption, and awareness (Shabtai et al, 2012; Peneti and Rani, 2015a; Alneyadi et
al, 2016). According to Margathavalli et al (2016), it is essential to correct this weakness of the
detective approach by preventing the data from leaking in the first place. This keeps the data
away from unauthorized users and so achieves confidentiality. Several encryption algorithms are
available which can be used to prevent data leakage in organizations. Admittedly, encrypted
documents can bypass the detection mechanisms of DLPSs, which could result in leakage
(Alneyadi et al, 2016). However, once the documents are encrypted, it is difficult to see the
details of these confidential or sensitive documents without the correct decryption keys.
3.6 Research gap
From the literature review, it is clear that a number of DLP techniques and methods are
available for the detective approach, whereas little work has been done using the preventive
approach for DLP solutions. This means that there is a need to use a preventive approach such
as encryption to develop a DLP method that helps prevent BD leakage before it happens, with
emphasis on semi-structured data (textual data).
3.7 Research question
The main question of this research study, which is intended to address the research gap, is “How
to design a method to help organizations prevent data leakage in BD?”. Emphasis will be placed
on semi-structured data (textual data).
4. RESEARCH METHODOLOGY
A Design Science Research Methodology (DSRM) is appropriate for answering the research
question and achieving the objective of providing a method that can help organizations prevent
data leakage in BD. Hevner et al (2004, p.77) explain that Design Science Research (DSR)
“creates and evaluates IT artifacts intended to solve identified organizational problems”. IT
artifacts are made up of constructs, models, methods, and instantiations (Hevner et al, 2004).
According to Hevner et al (2004), methods define processes and guidance on how to solve
problems. The proposed solution takes the form of a method. Currently, little work has been
done to prevent data leakage in BD using a preventive approach that helps organizations prevent
leakage before it happens. DSR is therefore a suitable and sufficiently comprehensive
methodology for creating an IT artifact (method) that gives organizations specified guidelines
for preventing data leakage in BD.
Peffers et al (2007) provide six activities that should be followed when conducting DSR. The
activities followed in this research are enumerated below, and Figure 10 depicts the DSRM
process model which summarises these six steps or activities:
Activity 1: Problem identification and motivation.
Activity 2: Define the objectives for a solution.
Activity 3: Design and development.
Activity 4: Demonstration.
Activity 5: Evaluation.
Activity 6: Communication.
[Diagram: nominal process sequence of six stages with process iteration: Identify Problem & Motivate (define problem, show relevance); Define Objectives of a Solution (what would a better artifact accomplish?); Design & Development (artifact); Demonstration (find suitable context, use artifact to solve problem); Evaluation (observe how effective and efficient, iterate back to design); Communication (scholarly publications, professional publications, master thesis report). Possible research entry points: problem-centered initiation, objective-centered solution, design & development centered initiation, client / context initiated.]
Figure 10: DSRM Process Model (Peffers et al, 2007)
4.1 Activity 1: Problem identification and motivation
This is the first activity in the DSRM. The problem identification and motivation are presented
in chapter two. In summary, little has been done to prevent data leakage in BD using a preventive
approach that helps organizations stop leakage before it happens.
4.2 Activity 2: Define the objectives for a solution
The outcome of the previous activity will be used to collect and determine the objectives for the
IT artifact (method) to prevent data leakage in BD using a preventive approach such as
encryption. This will help organizations that are willing to adopt DLP technologies or solutions
to follow guidelines that prevent leakage before it happens. The solution will focus on text
categorization or classification of semi-structured big data sets (which are mostly textual data)
into confidential and non-confidential data, so that the confidential data can be encrypted to
prevent leakage. Text data can also be grouped under unstructured data, and it forms a large
share of today’s data: about 80% of today’s data is stored as text (Patil et al, 2016), and studies
have shown that about 80% of all stored organizational data is unstructured (Khan et al, 2014;
Kanimozhi and Venkatesan, 2015).
4.3 Activity 3: Design and development
The IT artifact (method) will be designed and developed based on the previous objectives. Text
data can be generated from numerous sources such as emails, comments, tweets, etc. These data
sets will therefore be categorized or classified into confidential and non-confidential data. The
aim is to prevent leakage through the preventive approach of encryption, so that even if a leak
occurs, nobody without the proper decryption keys can get access to the original documents
(plaintext).
4.3.1 Kernel theory
During the design of the IT artifact, a design theory, commonly referred to as a kernel theory,
will be followed. The kernel theory will be applied both in defining the design artifact and
during the design process. Kernel theories are derived from the natural sciences, social sciences
and mathematics, and they govern both the design requirements and the design process of the
artifact itself (Walls et al, 1992; Markus et al, 2002; Iivari, 2007). According to Markus et al
(2002), a practitioner theory-in-use could also serve as a kernel theory. This implies that a design
theory is not necessarily based on scientifically or empirically validated knowledge; a kernel
theory could be either an academic theory (e.g., organizational psychology) or a practitioner
theory-in-use (Markus et al, 2002). For this reason, a comprehensive industry process model for
data mining projects called CRISP-DM (CRoss Industry Standard Process for Data Mining) will
be adopted as the kernel theory or framework for designing the IT artifact (Wirth and Hipp,
2000; Rocha and Sousa Júnior, 2010; Shearer, 2000; Moro et al, 2011; Al-Radaideh and Al-Nagi,
2012). According to these authors, this process model provides a framework for carrying out data
mining projects that is independent of both the technology used and the industry sector
involved. They further note that the process model describes the lifecycle of a data mining
project. In addition, the CRISP-DM process model is made up of a non-rigid sequence of six
phases, as shown in Figure 11 (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer,
2000; Moro et al, 2011; Al-Radaideh and Al-Nagi, 2012).
Figure 11: Phases of the CRISP-DM Process Model (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000)
• Phase 1: Business Understanding
The first phase ensures that the project objectives and requirements are understood from a
business perspective, so that this knowledge can be converted into a data mining problem
definition and a preliminary project plan designed to achieve the objectives (Wirth and Hipp,
2000; Rocha and Sousa Júnior, 2010; Shearer, 2000). For this thesis, the main objective is to
design a method to help organizations prevent data leakage in BD with emphasis on
semi-structured data (textual data). It is prudent to understand the business for which a
solution is sought. The business understanding phase is made up of several key steps, which
include determining business objectives, assessing the situation, determining the data mining
goals, and producing the project plan (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010;
Shearer, 2000). The tasks and outputs involved in the phases of the CRISP-DM reference model
are presented in Figure 12.
• Phase 2: Data Understanding
The data understanding phase begins with an initial data collection and continues with the
data scientist becoming familiar with the data, in order to identify data quality problems and
gain insights into the data, as shown in Figure 11 (Wirth and Hipp, 2000; Rocha and Sousa
Júnior, 2010; Shearer, 2000). There is a direct link between Business Understanding and Data
Understanding. The initial data collection will look for semi-structured data (textual data).
All other steps that lead to the required data will be followed.
• Phase 3: Data Preparation
This phase covers all activities needed to construct the final data set that will be fed into
the modelling tool or software (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer,
2000). In this case, data will be prepared for use in classification algorithms.
• Phase 4: Modelling
In this phase, various modelling techniques will be selected and applied to obtain optimal
results (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000). Several modelling
techniques can be applied to the same type of data mining problem (Wirth and Hipp, 2000;
Rocha and Sousa Júnior, 2010; Shearer, 2000). Again, there is a close link between the Data
Preparation and Modelling phases.
• Phase 5: Evaluation
This phase assesses the data mining results to determine whether they achieve the business
objectives. If more processes are to be modelled, the process returns to the Business
Understanding phase (Wirth and Hipp, 2000; Rocha and Sousa Júnior, 2010; Shearer, 2000).
• Phase 6: Deployment
Implementation is done in the deployment phase. In this case, when everything is successful,
the result will be presented as the thesis report.
Figure 12: Overview of the CRISP-DM tasks and their outputs (Wirth and Hipp, 2000; Shearer, 2000)
4.4 Activity 4: Demonstration
The IT artifact (method) will be demonstrated to show how effectively it can be implemented to
assist organizations that need to prevent data leakage in BD. The demonstration will take the
form of experiments on semi-structured BD sets (textual data) which are publicly available. For
instance, “electronic textual
documents are highly obtained from the social websites” (Patel and Mistry, 2015, p. 84).
4.5 Activity 5: Evaluation
The IT artifact (method) will be analysed and observed to determine whether the objectives are
achieved. The feedback and the successful implementation of the method in practice will allow
this research to present an initial “proof-of-concept” level validation of the new method
(Peffers et al, 2007).
4.6 Activity 6: Communication
The outcome of the thesis will be presented and the results shared through the master thesis
report. The report will also be made available to the public and other interested parties
through an approved publication.
5. DESIGN AND DEVELOPMENT
This section presents the actual IT artifact (method) that will serve as the solution to the
research question. The objective of the solution is to provide a method to help organizations
prevent data leakage in BD with emphasis on semi-structured data (textual data) using a
preventive approach such as encryption. In designing the IT artifact, the CRISP-DM process
model, which serves as the kernel theory (section 4.3.1), will be followed.
5.1 Data Understanding
The data that needs to be protected against leakage from an organization can be classified as
either confidential or non-confidential. This could be data belonging to the organization
itself or to clients who share their private information with the organization.
5.1.1 What is Confidential or Sensitive Data?
According to the National Institute of Standards and Technology (NIST) Special Publication
800-122 (2010), confidential or sensitive data is any data which contains personally
identifiable information (PII) such as name, social security number, date and place of birth,
mother’s maiden name, or biometric records. Further examples are listed below (NIST
Special Publication 800-122, 2010; PEER Mississippi, 2017; The University of Texas –
Austin Information Security Office, 2017; Carnegie Mellon University Information
Security Office, 2017):
• Social Security number (SSN).
• Credit / debit / payment card numbers – with information such as Cardholder name,
Expiration date, Card verification code.
• Driver's license number.
• Personal information for patients (medical records).
• Financial data for an organization.
• Personal information for students.
• Students records (study plans, marks, transcripts)
• Personal identifiable information for employees – salary, birth date, biometric
information, mother’s maiden name, electronic or digitized signatures, etc.
• Private key (digital certificate)
• Passwords or credentials
• PINs (Personal Identification Numbers)
• Research data within a university.
• Legal data special for a university.
• Trade secrets or intellectual property such as research activities
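As a rough illustration of how a few of the identifiers listed above might be flagged in text, the following minimal sketch uses regular expressions. It is a rule-based aid and not the classification method proposed in this thesis; the pattern names, the US-style SSN format, and the 16-digit card format are assumptions made for the example only.

```python
import re

# Hypothetical patterns for illustration; real identifiers vary by country and issuer.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),               # e.g. 123-45-6789
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),  # 16 digits in groups of four
    "pin_or_password": re.compile(r"\b(?:pin|password)\s*[:=]\s*\S+", re.IGNORECASE),
}


def find_pii(text):
    """Return the sorted names of the patterns that match anywhere in text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))


print(find_pii("Card: 4111 1111 1111 1111, password: hunter2"))
```

Such patterns only catch well-formatted identifiers, which is one reason a learned classifier is considered for whole documents.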
The General Data Protection Regulation (GDPR) [Regulation (EU) 2016/679], which was adopted
on 27 April 2016 and became applicable throughout the European Union (EU) on 25 May 2018,
also made a significant contribution to the protection that should be accorded to personal
data and sensitive personal data, and it supports the definition given by NIST SP 800-122
(2010) (Bhaimia, 2018). According to Article 4 (1) of the GDPR, “personal data means any
information relating to an
identified or identifiable natural person (‘data subject’); an identifiable natural person is one
who can be identified, directly or indirectly, in particular by reference to an identifier such as
a name, an identification number, location data, an online identifier or to one or more factors
specific to the physical, physiological, genetic, mental, economic, cultural or social identity
of that natural person”. In addition, what the GDPR refers to as sensitive personal data, such
as health data, genetic data and biometric data, should have even stronger protection.
The information that is considered confidential or sensitive differs depending on the type of
business an organization operates. However, certain information is considered confidential or
sensitive in all organizations. Examples are personal information for employees, payroll
information, appointment / offer letters, payslips, and the phone numbers and home addresses
of employees.
5.1.1.1 Data description
The data used are payroll information such as payslips, salary details and appointment / offer
letters of a company (name withheld due to the sensitivity of the data), which have been
collected and exported into Excel, Word, and Text formats. The employees of this company fall
into three categories, namely Junior, Senior, and Management staff. The basic salaries of these
employees should only be known by some human resource (HR) and finance staff. Even the
salaries of the management staff are known only to a few managers, that is, the HR and Finance
Managers. However, these documents could still be leaked. Therefore, there is a need to prevent
such confidential or sensitive documents or information from falling into the wrong hands. If
the documents are encrypted, then even when a leak occurs nobody should be able to make sense
of the data. For illustration, sample data is shown in Figure 13.
Figure 13: Sample Data
5.2 Data Preparation
The data preparation or data pre-processing stage converts the raw data into an appropriate
format for use in the modelling stage. This phase includes several subtasks such as data
selection, data cleansing, data construction and data formatting. All the data have been
exported into Text (TXT) format, which can be read by most data mining software. These are
mostly appointment / offer letters, salaries, and payslips. These data are considered very
confidential or sensitive.
5.3 Modelling
The three main types of machine learning algorithms are supervised, unsupervised, and
reinforcement learning algorithms (Abdallh et al, 2016; Kaur, 2016; Patil et al, 2016). The
goal of supervised learning algorithms is to learn classifiers from known examples or data
sets (i.e. labelled documents) in order to apply the classification automatically to unknown
examples or data sets (unlabelled documents) (Bali and Gore, 2015; Chavan et al, 2014; Vala and
Gandhi, 2015). In other words, “supervised learning
means learning from examples” (Patil et al, 2016, p. 517). Examples of supervised learning
algorithms are Support Vector Machine (SVM), K Nearest Neighbor (K-NN), Naive Bayes
Classifier (NBC), Random Forest, Regression, Logistic Regression, Decision Trees (DT), etc
(Abdallh et al, 2016; Bali and Gore, 2015; Chavan et al, 2014; Vala and Gandhi, 2015). For
unsupervised learning, the documents or data sets are not labelled at any point in the whole
process. Examples of unsupervised learning algorithms are Clustering, the Apriori algorithm,
Affinity Analysis, Self‐Organizing Maps (SOM), etc (Abdallh et al, 2016; Kaur, 2016).
Reinforcement learning occurs when the algorithms learn from external feedback given by the
environment (Abdallh et al, 2016; Portugal et al, 2018; Bonaccorso, 2017). The algorithm
chooses an action for each data point and later learns how good the decision was (Abdallh et
al, 2016). Over time, it changes its strategy in order to learn better and achieve a better
reward. The machine is exposed to an environment in which it trains itself continually by
trial and error, learning from past experience in order to make accurate business decisions
(Portugal et al, 2018). For example, consider an algorithm that plays games against an
opponent. The moves that lead to victories (positive feedback) should be learned and repeated,
while those that lead to losses (negative feedback) should be avoided (Portugal et al, 2018;
Bonaccorso, 2017). Examples of techniques used in reinforcement learning are Artificial Neural
Networks (ANN), Markov Chains (Markov Decision Processes), etc.
However, there is a fourth type, the semi-supervised learning algorithm, which is applied to a
mix of labelled and unlabelled data (Portugal et al, 2018; Bonaccorso, 2017). Such algorithms
can also learn from incomplete information, that is, from a training set with missing values
(Portugal et al, 2018; Bonaccorso, 2017). For instance, in movie ratings not every user has
rated every movie, so some information is missing (Portugal et al, 2018). According to
Bonaccorso (2017), semi-supervised learning can also be applied when it is necessary to
categorize a large amount of data of which only a few items are labelled (complete).
The main idea behind every DLP solution is to “detect and prevent unauthorized attempts to
copy or send sensitive data, both intentionally or/and unintentionally, without authorization,
by people who are authorized to access the sensitive information” (Kale et al, 2015, p. 55;
Tidke et al, 2015, p. 28). In other words, “DLP is a technique used to hide the confidentiality
of data being accessed by unauthorized user” (Jain and Lenka, 2016, p. 57). To achieve this,
one should be able to classify documents into confidential and non-confidential based on
previously known documents or data sets (predefined categories). Organizations know which
documents are confidential or sensitive, that is, which documents could harm their business or
the personnel involved if accessed or disclosed without authorization and should therefore be
protected from leakage. Since organizations can label their documents as confidential or
non-confidential in this way, a supervised machine learning approach, namely classification,
is appropriate in this situation.
5.4 Cryptography (Encryption and Decryption)
After documents or files have been classified into confidential and non-confidential data, the
confidential documents need to be encrypted so that only users with the decryption key can
access them. Encryption is applied to whole documents, so leakage is prevented at the document
level. Cryptography provides a way to store sensitive or confidential information, or to
transmit it across insecure networks (i.e. the Internet), so that only the intended recipients
can read the information (Al-Hazaimeh, 2013; Bhanot and Hans, 2015). Cryptography can be
divided into three main areas: symmetric-key, asymmetric-key and hashing.
• Symmetric-key cryptography
In symmetric-key cryptography, only a single secret key is shared by both the parties involved
in the communication for encryption and decryption purposes. Examples of symmetric key
encryption are Data Encryption Standard (DES), Triple DES, Advanced Encryption Standard
(AES), RC5, BLOWFISH, TWOFISH, THREEFISH etc (Daimary and Saikia, 2015; Bhanot
and Hans, 2015)
• Asymmetric-key cryptography
For asymmetric-key cryptography, two keys are involved in the communication: a private key
and a public key. Data which is encrypted with the public key must be decrypted with the
corresponding private key. This type is also referred to as public key cryptography. Examples
are RSA, Elliptic Curve, etc (Daimary and Saikia, 2015; Bhanot and Hans, 2015).
• Hashing
Hashing generates a fixed-length message digest from a variable-length message. Unlike
encryption, it is one-way: the message cannot be recovered from the digest. To verify the
message, the intended recipient must have the message as well as the digest.
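The fixed-length property of hashing can be illustrated with a minimal sketch. The choice of SHA-256 is an assumption made for the example; the text above does not name a specific hash function.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return the SHA-256 digest of message as a hexadecimal string."""
    return hashlib.sha256(message).hexdigest()


def verify(message: bytes, expected_digest: str) -> bool:
    """Recompute the digest of the received message and compare it with the one supplied."""
    return digest(message) == expected_digest


short_digest = digest(b"payslip")
long_digest = digest(b"payslip" * 10_000)

# Both digests are 64 hex characters (256 bits) although the inputs differ in size.
print(len(short_digest), len(long_digest))
print(verify(b"payslip", short_digest), verify(b"payslip (edited)", short_digest))
```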
5.4.1 RSA Cryptosystem
The RSA cryptosystem is named after R. Rivest, A. Shamir, and L. Adleman (Jamgekar and
Joshi, 2013; Bhanot and Hans, 2015). It is the most widely used public key cryptosystem
(Jamgekar and Joshi, 2013; Bhanot and Hans, 2015). RSA uses two keys, a public key and a
private key: the public key is used to encrypt the data and the corresponding private key is
used to decrypt it. The RSA algorithm involves the following (Bhanot and Hans, 2015; Mahajan
and Sachdeva, 2013):
Key Generation (public/private key pair):
1. Select two large distinct primes $p$ and $q$ ($p \neq q$).
2. Compute $n = p \times q$.
3. Compute $\phi(n) = (p-1) \times (q-1)$.
4. Select $e$ such that $1 < e < \phi(n)$ and $e$ is coprime to $\phi(n)$.
5. Compute the unique integer $d = e^{-1} \bmod \phi(n)$.
6. The public key is $(e, n)$.
7. The private key is $d$.
Encryption:
While encrypting, the following is done:
$C = P^{e} \bmod n$
where $C$ is the cipher text and $P$ is the plain text. Both $e$ and $n$ are public.
Decryption:
While decrypting, the following is done:
$P = C^{d} \bmod n$
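The key generation, encryption and decryption steps above can be traced with a small worked example. This is a minimal sketch of textbook RSA only: the values p = 61, q = 53, e = 17 and the plain text 65 are illustrative assumptions, the primes are far too small for real use, and no padding is applied.

```python
from math import gcd


def generate_keys(p, q, e):
    """Steps 1-7: derive the public key (e, n) and the private exponent d."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and be coprime to phi(n)")
    d = pow(e, -1, phi)  # modular inverse of e (requires Python 3.8+)
    return (e, n), d


def encrypt(plain, public_key):
    """C = P^e mod n"""
    e, n = public_key
    return pow(plain, e, n)


def decrypt(cipher, d, n):
    """P = C^d mod n"""
    return pow(cipher, d, n)


public_key, d = generate_keys(p=61, q=53, e=17)
C = encrypt(65, public_key)
P = decrypt(C, d, public_key[1])
print(public_key, d, C, P)  # (17, 3233) 2753 2790 65
```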
In terms of security, asymmetric encryption is reported to provide more security than
symmetric encryption; however, symmetric encryption is faster (Bhanot and Hans, 2015). For
instance, AES provides not only security but also great speed (Mahajan and Sachdeva, 2013).
The main disadvantage of RSA, and of asymmetric key algorithms in general, is encryption speed
(Bhanot and Hans, 2015): they provide good security but are slow at encrypting files. In
addition, RSA can only encrypt data that is smaller than the key length (Elst, 2015;
Brumbaugh, 2015). For instance, if the key size is 2048 bits, at most 256 bytes of plaintext
can be encrypted (Brumbaugh, 2015). Therefore, asymmetric encryption is slow and cannot be
applied directly to large files (Bikulov, 2013). To work around this limitation, the solution
is to use hybrid symmetric-asymmetric encryption for big data.
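The size limit follows from simple arithmetic, sketched below. The function names are hypothetical. The 11-byte overhead is the minimum for PKCS#1 v1.5 padding and is included as an assumption about how RSA is typically used in practice; the cited sources quote only the raw 256-byte figure.

```python
def max_plaintext_bytes(key_bits, padding_overhead=0):
    """Largest plaintext, in bytes, that one RSA operation can take for a given key size."""
    return key_bits // 8 - padding_overhead


def needs_hybrid(file_size_bytes, key_bits, padding_overhead=0):
    """True when the file is too large to be encrypted directly with RSA."""
    return file_size_bytes > max_plaintext_bytes(key_bits, padding_overhead)


print(max_plaintext_bytes(2048))         # 256 bytes: the raw limit for a 2048-bit key
print(max_plaintext_bytes(2048, 11))     # 245 bytes once PKCS#1 v1.5 padding is counted
print(needs_hybrid(5 * 1024 * 1024, 2048))  # a 5 MB payroll export needs the hybrid scheme
```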
5.5 Proposed DLP Method
The proposed IT artifact (method) which will help organizations to prevent data leakage in
BD with emphasis on semi-structured data (textual data) using the preventive approach such
as encryption consists of two main phases:
• Phase 1: Classification of organizational documents into confidential and non-
confidential with the help of a classification technique.
• Phase 2: Applying a hybrid cryptographic technique (made up of AES and RSA) to
encrypt all the confidential documents.
5.5.1 Phase 1: Classification of organizational documents
The objective of this phase is to determine which organizational documents are confidential
and which are non-confidential, so that the confidential ones can be encrypted in the second
phase. The documents will be classified with NBC, which has been selected as the appropriate
classification method due to the following advantages (see section 3.4.3.3):
• Works well on numeric and textual data.
• Easy to implement.
• Easy computation.
• Requires a small amount of training data to estimate parameters.
• Good results are obtained in most cases.
The input to this phase is a collection of confidential and non-confidential documents. As
pre-processing, every document will be tokenized and case-transformed, stop words will be
filtered, n-grams will be generated, and stemming will be performed. Finally, the documents
will be transformed into vectors of TF-IDF weighted terms. Phase 1 is subdivided into Training
(Learning) and Detection phases.
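The pre-processing chain described above can be sketched in plain Python as follows. This is not the RapidMiner process used in the thesis: the stop-word list is a tiny illustrative assumption, stemming is omitted for brevity, and the TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency) is one common formulation among several.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "is", "for", "in"}  # tiny illustrative list


def preprocess(text, n=2):
    """Tokenize, lower-case, filter stop words, and append word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    term_lists = [preprocess(doc) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            term: (count / len(terms)) * math.log(len(documents) / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tf_idf(["Basic salary of the employee", "Meeting agenda for the employee"])
print(vectors[0])
```

A term such as "employee" that occurs in every document receives weight zero, while "salary", which occurs in only one document, receives a positive weight.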
5.5.1.1 Training (Learning) Phase
During the training phase, a set of organizational confidential and non-confidential documents
which will serve as a training set will be used to develop a model using the NBC. This can be
achieved by following the algorithm below:
INPUT: Confidential and Non-Confidential text documents
PROCESS / OPERATION: Apply NBC technique
OUTPUT: Training model
Steps:
1. Collection of confidential and non-confidential text documents of an
organization,
2. Load both data sets into the appropriate data mining tool.
3. Perform text pre-processing
4. Perform supervised NBC on both data sets.
5. Store the training model
5.5.1.2 Detection Phase
During the detection phase, a set of unknown data, which is a combination of confidential and
non-confidential documents, serves as input so that the model generated in the training phase
can be applied. The detection phase algorithm is as follows:
INPUT: Unknown text documents (Combination of Confidential and Non-Confidential
text documents)
PROCESS / OPERATION: Apply the training model generated in the training phase
OUTPUT: Prediction label of confidential and non-confidential text documents.
Steps:
1. Load the unknown text documents in the appropriate data mining tool.
2. Perform text pre-processing.
3. Apply the training model generated in the training phase.
4. Group the confidential text documents.
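The training and detection phases can be sketched together as a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing. This is a hedged illustration in plain Python, not the RapidMiner model built in the thesis; the function names, the simplified tokenizer, and the four training sentences are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Simplified pre-processing: lower-case alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Training phase: build the model from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocabulary": vocabulary}


def predict(model, text):
    """Detection phase: return the most probable label for an unseen document."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior of the class
        for token in tokenize(text):
            if token in model["vocabulary"]:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training_set = [  # invented examples
    ("payslip basic salary net pay", "confidential"),
    ("offer letter salary and allowances", "confidential"),
    ("canteen menu for next week", "non-confidential"),
    ("office party invitation", "non-confidential"),
]
model = train(training_set)
unknown = ["monthly salary payslip", "canteen menu"]
confidential = [doc for doc in unknown if predict(model, doc) == "confidential"]
print(confidential)
```

The final list corresponds to step 4 of the detection phase: the documents predicted as confidential are grouped so that they can be passed on to the encryption phase.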
5.5.2 Phase 2: Encryption and decryption of confidential documents.
Phase 2 of the proposed IT artifact (method) is a hybrid of symmetric and asymmetric
encryption. A large file is encrypted with a symmetric algorithm (i.e. AES) using a random key
or password generated on the fly. The random key is then stored in a key file, which is
encrypted with an asymmetric algorithm (i.e. RSA). This can be achieved by following the steps
(algorithm) below (Elst, 2015; Bikulov, 2013):
INPUT: Confidential text documents
PROCESS / OPERATION: Hybrid of AES and RSA encryption techniques
OUTPUT: Encrypted or decrypted confidential documents.
Steps:
1. Generate RSA Keypairs
2. Generate AES Key (the random password file)
3. Encryption:
a. Encrypt File with AES Key (i.e. Encrypt the file with the random key)
b. Encrypt AES Key with RSA Public Key (i.e. Encrypt the random key
with the public key file)
4. Decryption:
a. Decrypt AES Key with RSA Private Key (i.e. Decrypt the random key
with the private key file)
b. Decrypt File with AES Key (i.e. Decrypt the large file with the random
key).
Figure 14 illustrates the flowchart of the proposed DLP method.
Figure 14: Flowchart of proposed DLP method
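The control flow of the hybrid steps can be sketched in plain Python. This is a structural illustration only and must not be used to protect real data: the Python standard library has no AES, so a SHA-256 keystream XOR stands in for it, and the RSA part is unpadded textbook RSA over two small, publicly known Mersenne primes. All names are hypothetical; the thesis itself performs these steps with the OpenSSL command line.

```python
import hashlib
import secrets

# Step 1: toy RSA key pair (assumption: illustrative primes, far too small for real use).
PRIME_P, PRIME_Q = 2**61 - 1, 2**89 - 1
N = PRIME_P * PRIME_Q
E = 65537
D = pow(E, -1, (PRIME_P - 1) * (PRIME_Q - 1))  # private exponent (requires Python 3.8+)


def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Stand-in for AES (NOT secure): XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def hybrid_encrypt(plaintext: bytes):
    session_key = secrets.token_bytes(16)                        # step 2: random symmetric key
    ciphertext = keystream_xor(plaintext, session_key)           # step 3a: encrypt the file
    wrapped_key = pow(int.from_bytes(session_key, "big"), E, N)  # step 3b: wrap key with public key
    return ciphertext, wrapped_key


def hybrid_decrypt(ciphertext: bytes, wrapped_key: int) -> bytes:
    session_key = pow(wrapped_key, D, N).to_bytes(16, "big")     # step 4a: unwrap key with private key
    return keystream_xor(ciphertext, session_key)                # step 4b: decrypt the file


document = b"Basic salary: 5000\n" * 1000
ciphertext, wrapped_key = hybrid_encrypt(document)
print(hybrid_decrypt(ciphertext, wrapped_key) == document)
```

The design point is that only the 16-byte session key passes through the slow asymmetric operation, while the large document is handled entirely by the fast symmetric cipher.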
6. DEMONSTRATION
This chapter demonstrates how confidential documents (files) can be encrypted to prevent
leakage after they have been classified. It also serves as the implementation or instantiation
of the proposed DLP method to help prevent leakage of semi-structured BD sets (textual data).
6.1 Experimental setup
The proof of concept has been developed on a virtual environment based on the Oracle VM
VirtualBox with Ubuntu version 16.04 LTS 32-bit operating system with 2GB RAM and
50GB hard disk space. The encryption and decryption of files with public keys will be done
via the OpenSSL command line. Also, RapidMiner Studio 8.1 will be used as the data mining
tool to model the data.
6.2 Data Sets (Documents)
The data comprises several files selected from an organization (name withheld due to the
sensitivity of the data) and placed into three separate folders: confidential,
non-confidential, and unknown. The confidential folder is made up of text files such as
appointment / offer letters, payslips and other salary information. The non-confidential
folder is made up of other files that have nothing to do with payroll information. The third
folder, named unknown, comprises a combination of confidential and non-confidential text files
which will be used to test and apply the model.
Training / learning data sets:
• Confidential folder – eight (8) confidential text files
• Non-confidential folder – four (4) files which are not payroll or salary related.
Testing data sets:
• Unknown folder – combination of payroll / salary and other non-payroll / salary
related files. They are made up of three (3) confidential data and four (4) non-
confidential text files.
Appendix 1 contains the process maps for the pre-processing activities, the classification of
documents into confidential and non-confidential, and the application of the model.
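As a hedged illustration of the folder layout described in this section, the sketch below reads text files from the three folders into labelled lists. The folder names and the `.txt` extension are assumptions based on the description above; the thesis itself loads the documents through RapidMiner.

```python
from pathlib import Path


def load_folder(folder, label=None):
    """Read every .txt file in folder and pair its text with label."""
    return [
        (path.read_text(encoding="utf-8", errors="replace"), label)
        for path in sorted(Path(folder).glob("*.txt"))
    ]


def load_data_sets(root):
    """Return (training, unknown) lists of (text, label) pairs below root."""
    root = Path(root)
    training = (
        load_folder(root / "confidential", "confidential")
        + load_folder(root / "non-confidential", "non-confidential")
    )
    unknown = load_folder(root / "unknown")  # labels are not known yet
    return training, unknown
```

Calling `load_data_sets("data")` on a hypothetical `data` directory would return the training pairs for the learning phase and the unlabelled documents for the detection phase.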
6.3 Experiment 1
A confidential file named (largefile.txt) will be encrypted and decrypted to demonstrate the
implementation of the second phase of the proposed DLP method (see section 5.5.2). The
various commands and the final outcome screenshot indicating all the steps are shown in
Appendix 2.
6.4 Experiment 2
The overall goal of this thesis is to prevent leakage of confidential documents (text files).
However, users who are authorized to work with such files should be able to get access without
compromising the private keys involved in the asymmetric encryption. To achieve this,
authorized users should be allowed to run decryption scripts against any confidential
documents they are allowed to work with. They will only be running executable bash scripts,
and the details of the private and public keys will not be their concern. For this reason,
I will create bash scripts to encrypt and decrypt confidential files.
To demonstrate this approach, I will create two folders: a local folder and a remote folder.
The remote folder will serve as a server machine where the encryption bash script, the public
key, the encrypted AES key and the encrypted confidential files are stored. Even if the remote
machine (server) is hacked or a leak occurs, the confidential files will not be compromised,
since they are all encrypted and neither the private key nor the decryption bash script is
available on that machine.
The second phase of the proposed DLP method which is a hybrid of symmetric-asymmetric
encryption (see section 5.5.2) will be implemented as bash scripts. The various steps and
screenshots involved are illustrated in Appendix 3 (Bikulov, 2013).
6.5 Experiment 3
In experiments 1 and 2, the focus was on encryption and decryption of single files. In
experiment 3, the emphasis is on encrypting and decrypting multiple files within a folder or
directory. To achieve this, all the confidential or sensitive files in a folder are archived
in either tar or zip format before they are encrypted. In experiment 3, the encryption
password can be supplied directly through the terminal before the encryption. The stages
involved are shown in Appendix 4.
6.6 Experiment 4
Experiment 4 combines the ideas from experiments 2 and 3: the multiple files are archived
first and the archive is then encrypted. When this is done, the same bash scripts used in
experiment 2 (see section 6.4 and Appendix 3) can be used. To achieve this, we will create a
gzip tarball and then encrypt the tarball. The steps involved in this experiment are presented
in Appendix 5.
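The archiving step that precedes encryption can be sketched with Python's standard `tarfile` module. This is an illustrative equivalent of creating a gzip tarball, not the command sequence recorded in the appendices; the function name and paths are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_folder(folder, archive_path):
    """Pack folder and its files into a single gzip-compressed tarball."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name inside the archive, not its full path.
        tar.add(folder, arcname=folder.name)
    return Path(archive_path)
```

The resulting single file, for example `archive_folder("confidential", "confidential.tar.gz")`, can then be passed to the same hybrid encryption step used for single files.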
7. EVALUATION
Evaluation is a crucial and essential part of conducting rigorous DSR (Venable et al, 2012;
Sonnenberg and vom Brocke, 2012). According to Sonnenberg and vom Brocke (2012), the
evaluation patterns most commonly distinguished in a DSR process are ex ante and ex post
evaluations, with four evaluation activities (Eval) as shown in Figure 15. Ex ante evaluations
are conducted before the construction of any artifact, while ex post evaluations occur after
the construction of an artifact (Venable et al, 2012; Sonnenberg and vom Brocke, 2012).
Figure 15: Evaluation activities within a DSR process
The Eval1 activity exists to ensure that a proper research problem has been selected and
formulated. This has been achieved through literature review processes such as literature
search, which confirmed that there is a research gap. The research problem itself has been
formulated in chapter 2.
The Eval2 activity exists to ensure that an artifact design ingrains the solution to the
stated problem. Since the artifact has not been constructed at this stage, this evaluation is
artificial. It has been achieved through the assertion that the IT artifact will be
constructed to solve a business problem, which supports the feasibility of the design process.
The Eval3 activity serves as an initial demonstration of how well the artifact performs when
interacting with organizational elements. This has been achieved in the demonstration (see
chapter 6), in which several experiments were conducted to cater for different scenarios
within an organizational setting.
Finally, Eval4 exists to ensure that an artifact is both applicable and useful in practice.
The experiments conducted on documents from an organization indicate that the IT artifact
(method) is usable and effective and can be applied to several organizations. In other words,
Eval4 has been achieved through a case study (organization name withheld) in which real
organizational data has been used.
7.1 Impact of the IT Artifact
Before the construction of the IT artifact, the organization's confidential documents were
stored in plaintext, so unauthorized people could read them whenever a leak happened.
Moreover, encrypted documents could still bypass the detection mechanisms of DLPSs, which
could result in leakage (Alneyadi et al, 2016). However, once the documents are encrypted, it
is difficult to see the details of these confidential or sensitive documents without the
correct decryption keys. The design and construction of the IT artifact has therefore
resolved this problem, because attackers or unauthorized users cannot make sense of the
encrypted data unless they have been given the decryption keys.
8. DISCUSSION
This thesis focuses on designing an IT artifact (method) that can prevent data leakage before
it happens. The preventive approach of DLPS was chosen because it is better to prevent
leakage than to wait for it to happen before detective measures are applied. In addition, a
drawback of the detective approach is that it only checks whether leakage has happened or
not. For this reason, and because the literature review showed that a lot of work has already
been done on the detective approaches of DLPS, encryption, which is one of the preventive
approaches, was adopted. However, this process would not have been successful without an
appropriate method for addressing the research question, and DSRM was adopted for this
purpose. It was not the only method adopted. CRISP-DM, which is the de facto standard for
data mining projects, was added to serve as the kernel theory, because before any DLPS can
work, the system needs to be trained with the actual confidential and non-confidential
documents of an organization. What is considered a confidential or sensitive document varies
from one organization to another. However, certain information, such as employees' personal
information, payroll information, appointment / offer letters and payslips, is considered
confidential or sensitive across all organizations. To achieve this, there should be an
understanding of the data at hand and of how to model them by applying the appropriate
technique.
To identify the confidential documents before encrypting them, the NBC text classification
method was adopted as the modelling technique to classify the documents. The proposed
encryption method, which is a hybrid of symmetric and asymmetric encryption, was then applied
to encrypt all the confidential documents so that only authorized users can access them. The
RSA and AES encryption algorithms were implemented with OpenSSL. This approach proved
effective because, without the decryption keys, it is difficult for an unauthorized user to
access the confidential documents. In terms of security, asymmetric encryption is strong but
slow, so symmetric encryption, which is very fast and can handle the encryption and
decryption of large files, was included. The demonstration of the encryption and decryption
of large files has addressed the BD aspect of the research question. This will improve the
security of BD and, consequently, of the organization’s data.
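One practical reason for combining the two schemes can be checked directly with OpenSSL. The following is a minimal sketch, not part of the experiments reported in this thesis. It assumes a 2048-bit RSA key with OpenSSL's default PKCS#1 v1.5 padding, under which a single RSA encryption accepts at most 245 bytes of input, and it uses `openssl pkeyutl` and made-up file names (small.bin, large.bin) purely for illustration.

```shell
#!/bin/sh
# Demonstration that RSA alone cannot encrypt a large file directly.
# Assumption: 2048-bit key with the default PKCS#1 v1.5 padding, which
# limits one RSA encryption to 245 bytes (256-byte modulus minus 11).
work=$(mktemp -d)
cd "$work"

openssl genrsa -out private.pem 2048 2>/dev/null
openssl rsa -in private.pem -pubout -out public.pem 2>/dev/null

# A 32-byte secret fits within the limit; a 1000-byte file does not.
openssl rand -out small.bin 32
openssl rand -out large.bin 1000

if openssl pkeyutl -encrypt -pubin -inkey public.pem \
        -in small.bin -out small.enc 2>/dev/null; then
    echo "small input: encrypted with RSA"
fi
if ! openssl pkeyutl -encrypt -pubin -inkey public.pem \
        -in large.bin -out large.enc 2>/dev/null; then
    echo "large input: rejected by RSA, use AES for the file itself"
fi
```

This is why the hybrid method encrypts only the short random key with RSA and leaves the bulk data to AES.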
In addition, the proposed hybrid encryption approach complements the work of
Margathavalli et al (2016), identified in the literature review, on preventing data
leakage. They used attribute-based encryption (ABE), which is another way of implementing
public key encryption, to prevent leakage of sensitive data.
8.1 Contribution
This thesis makes a significant contribution to the field of DLP, using the preventive
approach to prevent data leakage before it happens, by proposing a hybrid
symmetric-asymmetric encryption technique for encrypting confidential or sensitive documents
within an organization so that only authorized users can access them.
9. CONCLUSION
This thesis has contributed to the area of DLP by proposing a hybrid symmetric-asymmetric
encryption approach to prevent data leakage, which is one of the preventive approaches of
DLPS. This IT artifact (method) is capable of preventing data leakage before it happens, so
that only authorized users of an organization can access its confidential or sensitive
documents. It was clear from the literature review that a lot of work has already been done
in the area of leakage detection, which leaves a gap in the preventive approach of DLPS. This
research has demonstrated that encryption could also serve as the cornerstone of BDS. The
proposed hybrid encryption method, which combines asymmetric (RSA) and symmetric (AES)
encryption, can therefore be used by many organizations to prevent leakage of their
confidential or sensitive documents.
9.2 Future Research
Future research could automate the proposed method so that data are fed automatically into an
appropriate data mining tool and, in turn, into BD technologies such as Hadoop. The method
could also be extended by encrypting all BD before they are stored in Hadoop, to further
strengthen the security of BD and to prevent data leakage. Hadoop is an open source framework
that allows distributed storage and processing of large data sets across clusters of
networked computers using simple programming models (Khan et al, 2014; Shukla et al, 2015;
Tole, 2013; Ularu et al, 2012; Rodríguez-Mazahua et al, 2016).
REFERENCES
Abdallh, M.M.A, Bilal, K. H. & Babiker, A. (2016), Machine Learning Algorithms, International Journal
of Engineering, Applied and Management Sciences Paradigms, vol. 36, issue 01, pp. 17-27.
Ahmad, S. W. & Bamnote, G. R. (2013), Data Leakage Detection and Data Prevention Using Algorithm,
International Journal of Computer Science and Applications, vol. 6, no. 2, pp. 394-399.
Al-Hazaimeh, O. M. (2013), A New Approach for Complex Encrypting and Decrypting Data, International
journal of Computer Networks & Communications, vol. 5, no. 2, pp. 95-103.
Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2016), A survey on data leakage prevention
systems, Journal of Network and Computer Applications, vol. 62, issue C, pp. 137-152.
Alneyadi, S., Sithirasenan, E. and Muthukkumarasamy, V. (2015), Detecting Data Semantic: A Data
Leakage Prevention Approach, In the Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, August
20 - 22, IEEE Computer Society Washington DC, USA, vol. 1, pp. 910-917.
Alneyadi S., Sithirasenan E. & Muthukkumarasamy V. (2014), A Semantics-Aware Classification Approach
for Data Leakage Prevention, In: Susilo W., Mu Y. (eds) Information Security and Privacy, ACISP 2014,
Lecture Notes in Computer Science, vol. 8544, pp.413-421, Springer, Cham.
Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2013a), Word N-gram Based Classification for
Data Leakage Prevention, In the proceedings of 2013 12th IEEE International Conference on Trust,
Security and Privacy in Computing and Communications.
Alneyadi, S., Sithirasenan, E. & Muthukkumarasamy, V. (2013b), Adaptable N-gram classification model
for data leakage prevention, In the proceedings of 7th International Conference on Signal Processing and
Communication Systems (ICSPCS), Carrara, VIC, 2013, pp. 1-8.
Al-Radaideh, Q. A. & Al-Nagi, E. (2012), Using Data Mining Techniques to Build a Classification Model
for Predicting Employees Performance, International Journal of Advanced Computer Science and
Applications, vol.3, no. 2, pp.144-151.
Ammu, N. & Irfanuddin, M. (2013), Big Data Challenges, International Journal of Advanced Trends in
Computer Science and Engineering, vol. 2, no. 1, pp. 613-615.
Ba-Alwi, F. M. & Albared, M. (2016), Experiments on the Use of Machine Learning Classification Methods
in Online Crime Text Filtering and Classification, British Journal of Applied Science & Technology, vol.
12, no. 5, pp. 1-12.
Bhanot, R. & Hans, R. (2015), A Review and Comparative Analysis of Various Encryption Algorithms,
International Journal of Security and Its Applications, vol. 9, no. 4, pp. 289-306.
Bali, M. & Gore, D. (2015), A Survey on Text Classification with Different Types of Classification
Methods, International Journal of Innovative Research in Computer and Communication Engineering, vol.
3, issue 5, pp. 4888-4894.
Bertino, E. (2013), Big Data - Opportunities and Challenges (Panel Position Paper), In the proceedings of the
2013 IEEE 37th Annual Computer Software and Applications Conference (COMPSAC), 22-26 July 2013,
Kyoto, Japan. [Online]. Available: https://www.cs.purdue.edu/homes/bertino/compsac13.pdf [Accessed: 10th
December, 2016].
Bhaimia, S. (2018), The General Data Protection Regulation: The Next Generation of EU Data Protection, Legal
Information Management, vol. 18, no. 2018, pp. 21-28.
Bhogal, N. & Jain, S. (2017), A Review on Big Data Security and Handling, International Research Based
Journal, vol. 6, issue 1, pp. 1-5.
Bikulov, D. (2013). Hybrid symmetric-asymmetric encryption for large files [Kenarius Octonotes].
[Online]. Available: http://bikulov.org/blog/2013/10/12/hybrid-symmetric-asymmetric-encryption-for-
large-files/ [Accessed: 5th May, 2018].
Bonaccorso, G. (2017), Machine Learning Algorithms, Packt Publishing, Birmingham, UK.
Brocke, J. v., Simons, A., Niehaves, B., Reimer, K., Plattfaut, R., & Cleven, A. (2009),
Reconstructing The Giant: On The Importance of Rigour in Documenting The Literature Search Process.
ECIS 2009 Proceedings. Paper 161.
Brumbaugh, D. (2015), How to Encrypt Large Messages with Asymmetric Keys and phpseclib. [Online].
Available: https://www.sitepoint.com/encrypt-large-messages-asymmetric-keys-phpseclib/ [Accessed: 8th
May, 2018].
Carnegie Mellon University Information Security Office (2017), Guidelines for Data Classification.
[Online]. Available: https://www.cmu.edu/iso/governance/guidelines/data-classification.html [Accessed:
28th April, 2018].
Chavan, G.S., Manjare, S., Hegde, P. & Sankhe, A. (2014), A Survey of Various Machine Learning
Techniques for Text Classification, International Journal of Engineering Trends and Technology (IJETT),
vol. 15, no. 6, pp. 288-292.
Daimary, A. & Saikia, L. P. (2015), A Study of Different Data Encryption Algorithms at Security Level:
A Literature Review, (IJCSIT) International Journal of Computer Science and Information Technologies,
vol. 6, no. 4, pp. 3507-3509.
Elst, R. V. (2015), Encrypt and decrypt files to public keys via the OpenSSL Command Line. [Online].
Available:
https://raymii.org/s/tutorials/Encrypt_and_decrypt_files_to_public_keys_via_the_OpenSSL_Command_L
ine.html [Accessed: 5th May 2018].
Harish Kumar, M. & Menakadevi, T. (2017), A Review on Big Data Analytics in the field of Agriculture,
International Journal of Latest Transactions in Engineering and Science, vol. 1, issue 4, pp. 0001-0010.
Hevner, A. R., March, S. T., Park, J. & Ram, S. (2004), Design Science in Information Systems Research,
MIS Quarterly, vol. 28, no. 1, pp. 75-105.
Hima Bindu, S., Gireesha, O., Sahithi, A. N. & Mounicama, A. (2016), Security Aspects in Big Data,
International Journal of Innovative Research in Computer and Communication Engineering, vol. 4, issue
4, pp. 1111-1118.
Inukollu, V. N., Arsi, S. & Ravuri, S. R. (2014), Security Issues Associated with Big Data in Cloud
Computing, International Journal of Network Security & Its Applications (IJNSA), vol.6, no.3, pp. 45-56.
ISACA (2010), Data Leak Prevention [White Paper]. [Online]. Available:
http://www.isaca.org/Groups/Professional-English/security-trend/GroupDocuments/DLP-WP-
14Sept2010-Research.pdf [Accessed: 22nd November, 2016].
Jain, M & Lenka, S. K. (2016), A Review on Data Leakage Prevention using Image Steganography,
International Journal of Computer Science Engineering (IJCSE), vol. 5, no. 02, pp. 56-59.
Jamgekar, R. S. & Joshi, G. S. (2013), File Encryption and Decryption Using Secure RSA, International
Journal of Emerging Science and Engineering (IJESE), vol. 1, issue 4, pp. 11-14.
Jamiy, F. EL., Daif, A., Azouazi, M. & Marzak, A. (2014), The potential and challenges of Big data -
Recommendation systems next level application, International Journal of Computer Science Issues (IJCSI),
vol. 11, issue 5, no. 2, pp. 21-26.
Kale, A. V., Bajpayee, V. & Dubey, S. P. (2015), Analysis of Data Leakage Prevention Solutions,
International Journal For Engineering Applications And Technology (IJFEAT), vol. 1, issue 12, pp. 54-57.
Kanimozhi, K. V. & Venkatesan, M. (2015), Unstructured Data Analysis-A Survey, International Journal
of Advanced Research in Computer and Communication Engineering, vol. 4, issue 3, pp. 223-225.
Kaur, K. (2016), Machine Learning: Applications in Indian Agriculture, International Journal of Advanced
Research in Computer and Communication Engineering, vol. 5, issue 4, pp. 342-344.
Katz, G., Elovici, Y. & Shapira, B. (2014), CoBAn: A context based model for data leakage prevention,
Information Sciences, vol. 262, pp.137-158.
Kaushik, M. & Jain, A. (2014), Challenges to Big Data Security and Privacy, International Journal of
Computer Science and Information Technologies, vol. 5, no. 3, pp. 3042-3043.
Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M. & Gani, A. (2014), Big
Data: Survey, Technologies, Opportunities, and Challenges, The Scientific World Journal, vol. 2014, no. 2014:
712826, pp. 1-18.
Ko, R. K. L., Tan, A. Y. S. & Gao, T. (2014), A Mantrap-Inspired, User-Centric Data Leakage Prevention
(DLP) Approach, 2014 IEEE 6th International Conference on Cloud Computing Technology and Science,
Singapore, 2014, pp. 1033-1039.
Korde, V. & Mahender, C. N. (2012), Text Classification and Classifiers: A Survey, International Journal
of Artificial Intelligence & Applications (IJAIA), vol. 3, no. 2, pp. 85-99.
Kumar, S., Shekhar, J., & Gupta, H. (2016), Agent based Security Model for Cloud Big Data, In the
proceedings of the Second International Conference on Information and Communication Technology for
Competitive Strategies (ICTCS’16), Udaipur, India. [Online]. Available:
https://www.researchgate.net/profile/Sunil_Kumar468/publication/289738412_Agent_based_security_mo
del_for_Cloud_Big_Data/links/56f4e40e08ae7c1fda2d7b23/Agent-based-security-model-for-Cloud-Big-
Data.pdf [Accessed: 30th January, 2017].
Iivari, J. (2007), A Paradigmatic Analysis of Information Systems As a Design Science, Scandinavian Journal
of Information Systems, vol. 19, issue 2, pp. 39-64.
Mahajan, P., Gaba, G. & Chauhan, N. S. (2016), Big Data Security, IITM Journal of Management and IT, vol.
7, issue 1, pp. 89-94.
Mahajan, P. & Sachdeva, A. (2013), A Study of Encryption Algorithms AES, DES and RSA for
Security, Global Journal of Computer Science and Technology, vol. 13, issue 15, version 1.0.
Margathavalli, P., Manjula, R., Pramila, V., Priya, R. & Abirami, P. (2016), Preserving Sensitive Data by Data
Leakage Prevention Using Attribute Based Encryption Algorithm, International Journal of Emerging
Technology in Computer Science & Electronics (IJETCSE), vol. 21, issue 3, pp. 705-711.
Markus, M.L., Majchrzak, A. & Gasser, L. (2002), A Design Theory For Systems That Support Emergent
Knowledge Processes, MIS Quarterly, vol. 26, no. 3, pp. 179-212.
McAfee, A. & Brynjolfsson, E. (2012), Big Data. The Management Revolution, Harvard
Business Review, vol. 90, no. 10, pp. 61-67.
Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K & Ghosh, P. (2015), Big Data: Prospects
and Challenges, The Journal for Decision Makers, vol. 40, issue 1, pp. 74-96.
Moro, S., Laureano, R. & Cortez, P. (2011), Using Data Mining for Bank Direct Marketing: An Application of
the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling
Conference - ESM'2011, Guimaraes, Portugal, October, pp. 117-121.
Moura, J. & Serrão, C. (2015), Security and Privacy Issues of Big Data. In book Handbook of Research on
Trends and Future Directions in Big Data and Web Intelligence, IGI Global. [Online], Available:
https://arxiv.org/ftp/arxiv/papers/1601/1601.06206.pdf [Accessed: 22nd November, 2016].
Nalini, K. & Sheela, L. J. (2014), Survey on Text Classification, International Journal of Innovative Research
in Advanced Engineering (IJIRAE), vol. 1, issue 6, pp. 412-417.
Nisha, M. D. & Karthik, K. (2016), Survey on Text Classification Methods, International Journal of Advanced
Research in Computer Science and Software Engineering, vol. 6, issue 2, pp. 585-588.
NIST Special Publication 800-122 (2010), Guide to Protecting the Confidentiality of Personally Identifiable
Information (PII). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-
122.pdf [Accessed: 28th April, 2018].
Patel, P. & Mistry, K. (2015), A Review: Text Classification on Social Media Data, IOSR Journal of Computer
Engineering, vol. 17, issue 1, pp. 80-84.
PEER Mississippi (2017), A Review of State Agencies’ Management of Confidential Data [Report to the
Mississippi Legislature, #612]. [Online]. Available: http://www.peer.ms.gov/Reports/reports/rpt612.pdf
[Accessed: 28th April, 2018].
Patil, R. P., Bhavsar, R. P. & Pawar, B. V. (2016), A Comparative Study of Text Classification Methods: An
Experimental Approach, International Journal on Recent and Innovation Trends in Computing and
Communication, vol. 4, issue 3, pp. 517-523.
Patra, A. & Singh, D. (2013), A Survey Report on Text Classification with Different Term Weighing Methods
and Comparison between Classification Algorithms, International Journal of Computer Applications, vol. 75, no.
7, pp. 14–18.
Peneti, S. & Rani, B. P. (2016), Data Leakage Prevention System with Time Stamp, 2016 International Conference
on Information Communication and Embedded Systems (ICICES), 25-26 Feb. 2016, Chennai, India, pp. 1-4.
Peneti, S. & Rani, B. P. (2015a), Data Leakage Detection and Prevention Methods: Survey. Discovery, vol. 43,
no. 198, pp. 95-100.
Peneti, S. & Rani, B. P. (2015b), Confidential Data Identification Using Data Mining Techniques in Data Leakage
Prevention System, International Journal of Data Mining & Knowledge Management Process (IJDKP), vol. 5,
no. 5, pp. 65-73.
Peffers, K., Tuunanen, T., Rothenberger. M. A. & Chatterjee, S. (2007), A Design Science Research
Methodology for Information Systems Research, Journal of Management Information Systems, vol. 24,
issue 3, pp. 45-77.
Portugal, I., Alencar, P. & Cowan, D. (2018), The use of machine learning algorithms in recommender
systems: A systematic review, Expert Systems with Applications, vol. 97, pp. 205-227.
Ram, K. (2015), Analysis of Data Leakage Prevention on cloud computing, International Journal of
Scientific & Engineering Research, vol. 6, issue 1, pp. 457-461.
General Data Protection Regulation (EU) (2016/679), Regulation (EU) 2016/679 of the European
Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the
processing of personal data and on the free movement of such data and repealing Directive 95/46/EC
(General Data Protection Regulation), Official Journal of the European Union L 119(1).
Rocha, B. C. & Sousa Júnior, R. T. (2010), Identifying bank frauds using CRISP-DM and decision trees,
International journal of computer science & information Technology (IJCSIT), vol. 2, no. 5, pp. 162 – 169.
Rodríguez-Mazahua, L., Rodríguez-Enríquez, CA., Sánchez-Cervantes, J. L., Cervantes, J., García-
Alcaraz, J. L. & Alor-Hernández, G. (2016), A general perspective of Big Data: applications, tools,
challenges and trends, The Journal of Supercomputing, vol. 72, issue 8, pp. 3073-3113.
Shabtai, A., Elovici, Y. and Rokach, L. (2012), A taxonomy of data leakage prevention solutions, In A
Survey of Data Leakage Detection and Prevention Solutions (pp. 11-15), Springer US.
Shearer, C. (2000), The CRISP-DM Model: The New Blueprint for Data Mining, Journal of Data
Warehousing, vol. 5, no. 4, pp. 13-22.
Shirudkar, K. & Motwani, D. (2015), Big-Data Security. International Journal of Advanced Research in
Computer Science and Software Engineering, vol. 5, issue 3, pp. 1100-1109.
Shukla, S., Kukade, V. & Mujawar, S. (2015), Big Data: Concept, Handling and Challenges: An Overview,
International Journal of Computer Applications, vol. 114, no. 1, pp. 6-9.
Sin, K. & Muthu, L. (2015), Application of Big Data in Education Data Mining and Learning Analytics - A
Literature Review, ICTACT Journal on Soft Computing, vol. 5, issue 4, pp. 1035-1049.
Singh R. & Sinha, R. (2016), Big Data Security and Privacy Issues in SMES, International Journal of
Environment, Science and Technology, vol. 2, issue 1, pp. 31-35.
Sonnenberg, C., & vom Brocke, J. (2012), Evaluation Patterns for Design Science Research Artefacts, In M.
Helfert & B. Donnellan (Eds.), Proceedings of the European Design Science Symposium (EDSS) 2011 Dublin,
Ireland: Springer Berlin/Heidelberg, vol. 286, pp. 71-83.
Soumya, S. R. & Smitha, E. S. (2014), Data Leakage Prevention System by Context based Keyword Matching
and Encrypted Data Detection, International Journal of Advanced Research in Computer Science Engineering
and Information Technology, vol. 3, issue 1, pp. 375-384.
Tabassum, R. & Tyagi, N. (2016), Issues and Approaches for Big Data Security, International Journal of Latest
Technology in Engineering, Management & Applied Science (IJLTEMAS), vol. V, issue VII, pp. 72-74.
Tahboub, R & Saleh, Y. (2015), Precaution Model for Data Leakage Prevention/Loss (DLP) Systems, In the
proceedings of the 4th Palestinian International Conference on Computer and Information Technology (PICCIT
2015), Palestine, Hebron. [Online]. Available:
https://www.researchgate.net/profile/Radwan_Tahboub/publication/282942404_Precaution_Model_for_Data_L
eakage_PreventionLoss_DLP_Systems/links/56234afc08aea35f2682c5c8/Precaution-Model-for-Data-Leakage-
Prevention-Loss-DLP-Systems.pdf [Accessed: 30th January, 2017].
Tahboub, R & Saleh, Y. (2014), Data Leakage / Loss Prevention Systems (DLP), NNGT Journal: International
Journal of Information Systems, vol. 1, pp. 13-18.
Tene, O. & Polonetsky, J. (2013), Big Data for All: Privacy and User Control in the Age of Analytics,
Northwestern Journal of Technology and Intellectual Property, vol. 11 issue 5, pp. 238-273.
Thaoroijam, K. (2014), A Study on Document Classification using Machine Learning Techniques, International Journal of Computer Science Issues, vol. 11, issue 2, pp. 217-222.
The University of Texas – Austin Information Security Office (2017), Extended List of Confidential Data.
[Online]. Available: https://security.utexas.edu/policies/extended-cat-1 [Accessed: 28th April, 2018].
Tidke, P., Wagh, A., Bharade, D. & Dongre, A. G. (2015), Data Leakage Prevention with E-Mail Filtering,
International Journal of Advance Foundation and Research in Computer (IJAFRC), vol. 2, issue 2, pp. 28-32.
Tole, A. A. (2013), Big Data Challenges, Database Systems Journal, vol. IV, no. 3, pp. 31-40.
Topaloğlu, M. (2013), The Comparison of the Text Classification Methods to be used for the Analysis of Motion
Data in DLP Architect, International Journal of Computer Science & Information Technology (IJCSIT), vol. 5,
no. 5, pp. 107-115.
Toshniwal, R., Dastidar, K. G., & Nath, A. (2015), Big Data Security Issues and Challenges, International
Journal of Innovative Research in Advanced Engineering (IJIRAE), vol. 2, issue 2, pp. 15-20.
Ularu, E.G., Puican, F.C., Apostu, A., Velicanu, M. (2012), Perspectives on Big Data and Big Data Analytics,
Database Systems Journal, vol. III, no. 4, pp. 3-13.
Vadsola, R., Desai, D., Brahmbhatt, M. & Patanwadia, A. (2014), Data Leakage Prevention by Using Word Gram Based Classification and Clustering, International Journal of Advanced Research in Computer and
Communication Engineering, vol. 3, issue 9, pp. 8040-8041.
Vala, M. & Gandhi, J. (2015), Survey of Text Classification Technique and Compare Classifier, International
Journal of Innovative Research in Computer and Communication Engineering, vol. 3, issue 11, pp. 10809-
10813.
Vanjari, S. P. & Thombre, V. D. (2015), An Experiential Study of SVM and Naïve Bayes for Gender
Recognization, International Journal on Recent and Innovation Trends in Computing and Communication, vol.
3, issue 9, pp. 5456-5460.
Venable, J., Pries-Heje, J. & Baskerville, R. (2012), A Comprehensive Framework for Evaluation in Design
Science Research, In K. Peffers, M. Rothenberger & B. Kuechler (Eds.), Design Science Research in Information
Systems. Advances in Theory and Practice, Berlin / Heidelberg: Springer, vol. 7286, pp. 423-438.
Walls, J. G., Widmeyer, G. R. & El Sawy O. A. (1992), Building an Information System Design Theory for Vigilant EIS, Information Systems Research, vol. 3, no. 1, pp. 36-59.
Wirth, R. & Hipp, J. (2000), CRISP-DM: Towards a standard process model for data mining. In the proceedings
of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining.
Yosepu, C., Srinivasulu, P. & Subbarayudu, B. (2015), A Study on Security and Privacy in Big Data Processing,
International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, issue 12,
pp. 12292-12296.
APPENDIX 1
Figure 16: Text Pre-processing Activities
Figure 17: Process Map of the Model for NBC
Figure 18: Process Map of Cross Validation for NBC
Figure 19 shows that the model correctly predicted the labels of the unknown files: three (3)
confidential and four (4) non-confidential documents.
Figure 19: Prediction label after applying the model on the unknown data.
APPENDIX 2
1. Generate RSA Keypairs
# generate a private key of 8196 bits with the command below
openssl genrsa -out private.pem 8196
# extract the public key from the private key with the command below
openssl rsa -in private.pem -out public.pem -outform PEM -pubout
2. Generate AES Key (the random password file)
# generate 32 random bytes (256 bits), base64-encoded, and save them to the key.bin file
# with the command below; openssl enc derives the AES key from this password file
openssl rand -base64 32 > key.bin
3. Encryption:
a. Encrypt File with AES Key (i.e. Encrypt the file with the random key)
# encrypt largefile.txt with the generated AES key into largefile.txt.enc with the
# command below
openssl enc -aes-256-cbc -salt -in largefile.txt -out largefile.txt.enc -pass file:./key.bin
b. Encrypt AES Key with RSA Public Key (i.e. Encrypt the random key with the public key file)
# encrypt the AES key with the RSA public key and save the result in the key.bin.enc
# file with the command below
openssl rsautl -encrypt -inkey public.pem -pubin -in key.bin -out key.bin.enc
4. Decryption:
a. Decrypt AES Key with RSA Private Key (i.e. Decrypt the random key with the private key file)
# decrypt the AES key with the RSA private key and save the result in key.bin.dec with
# the command below
openssl rsautl -decrypt -inkey private.pem -in key.bin.enc -out key.bin.dec
b. Decrypt File with AES Key (i.e. Decrypt the large file with the random key).
# decrypt the encrypted file with the decrypted AES key with the command below
openssl enc -d -aes-256-cbc -in largefile.txt.enc -out largefile.txt.dec -pass file:./key.bin.dec
# largefile.txt.dec and largefile.txt should be the same
Figure 20: Screenshot showing the implementation of the second phase of the proposed DLP method in experiment 1
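The four steps above can be chained into one script to check the round trip end to end. The following is a minimal sketch, not the exact session shown in Figure 20. It assumes OpenSSL is installed, uses a 2048-bit RSA key instead of the 8196-bit key above so that the demonstration runs quickly, substitutes `openssl pkeyutl` for `rsautl` (which OpenSSL 3.0 marks as deprecated), and works in a throwaway temporary directory. The sample file content is illustrative only.

```shell
#!/bin/sh
# Hybrid RSA + AES round trip, following the steps of Appendix 2.
# Assumptions: 2048-bit key for speed; sample data is made up for the demo.
set -e

workdir=$(mktemp -d)
cd "$workdir"

# Sample "large" file (illustrative content only).
i=0
while [ "$i" -lt 1000 ]; do
    echo "confidential record $i"
    i=$((i + 1))
done > largefile.txt

# 1. Generate the RSA key pair.
openssl genrsa -out private.pem 2048
openssl rsa -in private.pem -out public.pem -outform PEM -pubout

# 2. Generate the random password file from which the AES key is derived.
openssl rand -base64 32 > key.bin

# 3a. Encrypt the file with AES-256-CBC using the random password.
openssl enc -aes-256-cbc -salt -in largefile.txt -out largefile.txt.enc \
    -pass file:./key.bin

# 3b. Encrypt the password file with the RSA public key.
openssl pkeyutl -encrypt -pubin -inkey public.pem -in key.bin -out key.bin.enc

# 4a. Recover the password file with the RSA private key.
openssl pkeyutl -decrypt -inkey private.pem -in key.bin.enc -out key.bin.dec

# 4b. Decrypt the file with the recovered password.
openssl enc -d -aes-256-cbc -in largefile.txt.enc -out largefile.txt.dec \
    -pass file:./key.bin.dec

# The decrypted file must be byte-for-byte identical to the original.
if cmp -s largefile.txt largefile.txt.dec; then
    echo "round trip OK"
else
    echo "round trip FAILED"
fi
```

Recent OpenSSL releases print a warning about deprecated key derivation for the `enc` commands. Adding `-pbkdf2` to both the encryption and the decryption command (available from OpenSSL 1.1.1) is one way to address it.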
APPENDIX 3
Step 1: Create local and remote folders
Step 2: Change to the local directory and generate the RSA keypair. In this case, the private key
will be named (keyfile.key) and the public key will be (keyfile.pub).
Step 3: The public key is stripped out from the private key and stored as (keyfile.pub) on the
remote folder.
Step 4: Copy the public key (keyfile.pub) to the remote folder.
Step 5: Change to the remote folder
Step 7: Encryption bash script created with gedit program
Figure 21: Screenshot showing the encryption bash script
Step 8: Make the encryption bash script executable
Step 9: Encrypt the confidential file (largefile.txt) with the encryption script.
The encrypted files will be copied also to the local folder.
Step 10: Change directory to the local folder
Step 11: Create the decryption bash script (decrypt.sh)
Figure 22: Screenshot showing all the steps from step 2 to 11
Step 12: Decryption bash script (decrypt.sh) created with gedit program
The encrypted AES key and encrypted file (largefile.txt.enc) will be removed from the local
folder after the decryption.
Figure 23: Screenshot showing the decryption bash script
Step 13: Make the decryption bash script executable
Figure 24: Screenshot showing the executable command (step 13)
Step 14: Decrypt the encrypted confidential file (largefile.txt.enc) by running the decryption
bash script. The decrypted file will be placed in the local folder. When authorized users need
access to the confidential files, they can be allowed to run the decryption script against
those files. Afterwards, the decrypted copies can be deleted.
Figure 25: Screenshot showing the decryption of the confidential file
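The encryption and decryption scripts used in the steps above appear only as screenshots (Figures 21 and 23). The following is a hypothetical reconstruction of what such scripts might look like, written as two shell functions so that it can be run in one piece. The function names `encrypt_file` and `decrypt_file`, the 2048-bit key size, the use of `openssl pkeyutl` and the sample file content are assumptions made for this sketch. The folder names, key file names and clean-up behaviour follow the description in the steps.

```shell
#!/bin/sh
# Hypothetical stand-ins for the encryption and decryption scripts
# (the real scripts appear only as screenshots in Figures 21 and 23).
set -e

# encrypt_file FILE PUBKEY DEST: hybrid-encrypt FILE, then copy the
# encrypted file and the encrypted key to the DEST folder.
encrypt_file() {
    file=$1; pubkey=$2; dest=$3
    # Fresh random password per file; the AES key is derived from it.
    openssl rand -base64 32 > "$file.key"
    openssl enc -aes-256-cbc -salt -in "$file" -out "$file.enc" \
        -pass "file:$file.key"
    # Wrap the password with the RSA public key, then drop the plaintext copy.
    openssl pkeyutl -encrypt -pubin -inkey "$pubkey" \
        -in "$file.key" -out "$file.key.enc"
    rm -f "$file.key"
    cp "$file.enc" "$file.key.enc" "$dest/"
}

# decrypt_file ENCFILE PRIVKEY: decrypt ENCFILE (NAME.enc) to NAME and
# remove the encrypted key and encrypted file afterwards.
decrypt_file() {
    encfile=$1; privkey=$2
    base=${encfile%.enc}
    openssl pkeyutl -decrypt -inkey "$privkey" \
        -in "$base.key.enc" -out "$base.key"
    openssl enc -d -aes-256-cbc -in "$encfile" -out "$base" \
        -pass "file:$base.key"
    rm -f "$base.key" "$base.key.enc" "$encfile"
}

# Demonstration in throwaway folders (names follow Appendix 3).
root=$(mktemp -d)
mkdir "$root/local" "$root/remote"
cd "$root/local"
openssl genrsa -out keyfile.key 2048
openssl rsa -in keyfile.key -out keyfile.pub -pubout
cp keyfile.pub ../remote/

cd ../remote
echo "sample confidential content" > largefile.txt
encrypt_file largefile.txt keyfile.pub ../local

cd ../local
decrypt_file largefile.txt.enc keyfile.key
cat largefile.txt
```

Keeping the private key only in the local folder means that the remote folder can encrypt new files but cannot decrypt them, which is the property the steps above rely on.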
APPENDIX 4
Encrypting Multiple Files
To encrypt multiple files, a gzip tarball is created and then encrypted. This can be achieved
in a single command with a pipe. With this approach, an encryption password must be supplied,
and the same password is needed for decryption. To encrypt all the files in the current
directory or folder, use the following command:
tar -czf - * | openssl enc -e -aes256 -out allconfidentialfiles.tar.gz
Figure 26: Screenshot showing the encryption of all the confidential files in a directory
Decrypting Multiple Files
The contents of the encrypted tar archive can be decrypted and extracted with the following
command.
openssl enc -aes-256-cbc -d -in allconfidentialfiles.tar.gz | tar xz
When the correct encryption password is supplied, all the contents of the encrypted archive
will be made available to the authorized user.
Figure 27: Screenshot showing the decryption of all the confidential files in a directory
The files within the current directory can be listed with the ls command as shown below:
Figure 28: Screenshot showing all the files in a directory
When the authorized user is done working with the confidential files they can be deleted. This
can be achieved for instance with the rm command as shown below.
Figure 29: Screenshot showing the removal of the files in a directory
If the wrong password is supplied, decryption fails with an error message, as shown below:
Figure 30: Screenshot showing wrong password supplied for decryption
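The behaviour described in this appendix can also be reproduced without interactive prompts. The following is a minimal sketch under stated assumptions: the password is passed with `-pass pass:...` so that the script runs unattended (this is for demonstration only, since passwords given this way are visible to other local users), the function names `encrypt_dir` and `decrypt_dir` are hypothetical, and the sample files and passwords are made up. Decryption writes to a temporary archive first so that a failure in either the decryption or the extraction step is reported.

```shell
#!/bin/sh
# Sketch of the Appendix 4 workflow with the password supplied on the
# command line. Demonstration only; names and passwords are illustrative.

# encrypt_dir DIR OUTFILE PASSWORD: tar+gzip DIR's contents and encrypt them.
encrypt_dir() {
    tar -czf - -C "$1" . | openssl enc -e -aes-256-cbc -salt \
        -out "$2" -pass "pass:$3"
}

# decrypt_dir INFILE DESTDIR PASSWORD: decrypt to a temporary archive, then
# extract it into DESTDIR. Returns non-zero if either step fails.
decrypt_dir() {
    tmp=$(mktemp)
    if openssl enc -d -aes-256-cbc -in "$1" -out "$tmp" \
            -pass "pass:$3" 2>/dev/null \
        && tar -xzf "$tmp" -C "$2" 2>/dev/null; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

work=$(mktemp -d)
mkdir "$work/docs" "$work/out_ok" "$work/out_bad"
echo "payroll" > "$work/docs/payroll.txt"
echo "offer letter" > "$work/docs/offer.txt"

encrypt_dir "$work/docs" "$work/allconfidentialfiles.tar.gz" "correct-password"

if decrypt_dir "$work/allconfidentialfiles.tar.gz" "$work/out_ok" "correct-password"; then
    echo "correct password: files restored"
fi
if ! decrypt_dir "$work/allconfidentialfiles.tar.gz" "$work/out_bad" "wrong-password"; then
    echo "wrong password: decryption rejected"
fi
```

Checking the exit status in this way lets a wrapper script distinguish a failed decryption from a successful one instead of relying on the error text.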
APPENDIX 5
To archive all the confidential text files in the current directory or folder, use the following
command:
tar -czf newconfidentialfiles.tar.gz *.txt
Figure 31: Screenshot showing archiving of all files before encryption
Now remove all the single confidential text files with the remove (rm) command to leave only
the archived tar file.
Figure 32: Screenshot showing removal of all confidential text files before encryption
Afterwards, the encryption and decryption bash scripts (see section 6.3) could be used as
shown below. Now run the encryption bash script with the (./encrypt.sh) command.
Figure 33: Screenshot showing encryption of the archived file
Afterwards, change to the local folder.
Figure 34: Screenshot showing changing of directory
Run the decryption bash script with the (./decrypt.sh) command.
Figure 35: Screenshot showing decryption of the archived file
Now extract the confidential files from the archived one with the command below:
tar -xzf newconfidentialfiles.tar.gz
Figure 36: Screenshot showing extraction of files from the archived file