

A Deep Learning Approach for Extracting Attributes of ABAC Policies

Manar Alohaly
Dept. of Computer Science and Engineering, University of North Texas, Denton, TX, USA
Princess Nourah bint Abdulrahman University
[email protected]

Hassan Takabi
Dept. of Computer Science and Engineering, University of North Texas, Denton, TX, USA
[email protected]

Eduardo Blanco
Dept. of Computer Science and Engineering, University of North Texas, Denton, TX, USA
[email protected]

ABSTRACT
The National Institute of Standards and Technology (NIST) has identified natural language policies as the preferred expression of policy and implicitly called for an automated translation of ABAC natural language access control policies (NLACPs) to a machine-readable form. An essential step towards this automation is to automate the extraction of ABAC attributes from NLACPs, which is the focus of this paper. We therefore raise the question: how can we automate the task of attribute extraction from natural language documents? Our proposed solution to this question is built upon recent advancements in natural language processing and machine learning techniques. For such a solution, the lack of appropriate data often poses a bottleneck. Therefore, we decouple the primary contributions of this work into: (1) developing a practical framework to extract ABAC attributes from natural language artifacts, and (2) generating a set of realistic synthetic natural language access control policies (NLACPs) to evaluate the proposed framework. The experimental results are promising with regard to the potential automation of the task of interest. Using a convolutional neural network (CNN), we achieved, on average, an F1-score of 0.96 when extracting the attributes of subjects, and 0.91 when extracting the objects' attributes from natural language access control policies.

CCS CONCEPTS
• Security and privacy → Access control;

KEYWORDS
access control policy, attribute based access control, policy authoring, natural language processing, relation extraction, deep learning

ACM Reference Format:
Manar Alohaly, Hassan Takabi, and Eduardo Blanco. 2018. A Deep Learning Approach for Extracting Attributes of ABAC Policies. In SACMAT '18: The 23rd ACM Symposium on Access Control Models and Technologies (SACMAT), June 13–15, 2018, Indianapolis, IN, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3205977.3205984

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SACMAT '18, June 13–15, 2018, Indianapolis, IN, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5666-4/18/06...$15.00
https://doi.org/10.1145/3205977.3205984

1 INTRODUCTION
The concept of access control mechanisms has been around since Lampson's access matrix was coined in the late 1960s [31]. Thereafter, dozens of models have been proposed to address the shortcomings of their predecessors. Most of these models, however, have shown forms of inadequacy in withstanding the ever-increasing complexity of today's business environments or in safeguarding the compliance demands of dynamic organizations. From within this dilemma, attribute based access control (ABAC) has emerged as a good fit. The National Institute of Standards and Technology (NIST) Special Publication 800-162, "Guide to Attribute Based Access Control (ABAC) Definition and Considerations", testifies to ABAC's potential to promote information sharing among enterprises with a heterogeneous business nature [23]. In fact, it is predicted that "by 2020, 70% of enterprises will use attribute based access control... as the dominant mechanism to protect critical assets" [17]. Despite this promising potential, the ABAC policy authoring task has stood out, among other factors, as a costly development effort. This is anything but a new challenge. In fact, the complex syntax of what has become the standard ABAC policy language, namely XACML [40], has been well understood since the earliest days of ABAC. Hence, it was presumed as early as 2003 that "XACML is intended primarily to be generated by tools" [1]. This is where policy authoring tools come into play.

XACML is a generic policy language framework ratified by OASIS to provide an abstraction layer for defining access control policies (ACPs) as machine-readable authorization rules [40]. It promotes standardization, expressiveness, granularity, interoperability, and efficiency [4]. This flexibility, on the other hand, has introduced implementation complexity. Solutions proposed to aid in the ABAC policy authoring task have ranged from GUI policy authoring tools to APIs and plug-ins [6, 46]. These tools generally provide a higher abstraction level that obscures much of XACML's syntax complexity. The former set of tools, i.e. the GUI aids, are designed as fill-in forms or spreadsheets that present policy authors with lists, or other graphical formats, populated with valid options of policy elements (i.e. subject and object attributes). The assumption is that a non-IT specialist would easily be able to construct a policy by selecting its constituent attributes from drop-down menus [46]. The latter set of tools is mainly designed for developers, as it has borrowed much of its structure from common programming languages [6]. Either way, the goal is to have XACML-equivalent policies be automatically generated with the help of the tools. While such aids can possibly provide a better authoring experience, they heavily rely on the presence of back-end attribute exchange providers. This strong assumption sheds some light on the question: where do the required attributes come from? Upon a closer inspection of the ABAC authorization policy lifecycle (Figure 1), policies are introduced to a system at the earliest stages of its development lifecycle as use cases. At this point, access control policies, as well as other security and functional requirements, are components of specification documents, which are almost always written in natural language. To identify the policy elements required to build the machine-readable authorization rules, the current practice assumes that a security architect, policy author or system analyst should manually analyze the existing documentation to extract the relevant information [10]. The fact that these requirements are often elicited from a pool of natural language statements, e.g. interview transcripts, blurs the boundaries between the policies and other requirements. Hence, manually deriving the attributes, along with the other policy elements needed to feed the previously mentioned policy authoring tools, is rather a repetitive, time-consuming and error-prone task.

Session: Research Track - ABAC SACMAT'18, June 13-15, 2018, Indianapolis, IN, USA 137

Figure 1: The authorization policy lifecycle (adopted from [10]). *The asterisk marks the beginning of the cycle.

The necessity to automate this manual process has motivated several studies in developing automated tools to derive elements of access control rules directly from natural language policies [35, 37, 43, 44, 48]. Previous research efforts have mostly been restricted to two areas: (1) automatic identification of access control policy sentences from natural language artifacts, and (2) automatic extraction of the subject, object and action triples from these sentences. To the best of our knowledge, no detailed investigations of this automation, in the context of extracting ABAC attributes, have been conducted. Therefore, we strive to answer the question of how the attribute extraction task can be automated.

The ABAC model groups the authorization attributes into categories which represent the grammatical function of the element being described by the underlying attributes. Based on this observation, we design our attribute identification framework as a relation extraction problem [19, 25]. The relation, in this context, is what chains an element (namely, subject or object) with the attributes that modify this element. We view this work as an essential step towards the automated transformation of ABAC natural language policies to a machine-readable form. The primary contributions of this paper are as follows:

(1) We design a practical framework to automate the attribute extraction task, which is, by itself, an entire phase of the ABAC policy authoring lifecycle (Phase 3 in Figure 1). To the best of our knowledge, this is the first evaluated effort to address this problem.

(2) We construct a realistic synthetic dataset, annotated with the attributes of the subject and object elements of access control policies. We use this corpus to evaluate our proposal, and it will be made available to the research community, aiming to advance the work in this field.

The remainder of this paper is organized as follows: Section 2 provides the background information needed to establish an understanding of the problem domain. In Section 3, we present the proposed framework and discuss our methodology to automate the attribute extraction task. Section 4 describes the procedure to construct the required corpus. Our experimental results are reported in Section 5. We review the related work in Section 6. Section 7 discusses the limitations of the proposed approach. In Section 8, we conclude our study with recommendations for future work.

2 BACKGROUND
In the following subsections, we define key terminology used in this study. Then, we provide background information regarding the ABAC policy authoring process and the underlying techniques of our proposed framework.

2.1 Overview of ABAC Definition and Terminology

NIST defines ABAC as "An access control method where subject requests to perform operations on objects are granted or denied based on assigned attributes of the subject, assigned attributes of the object..." [23]. The following list provides high-level definitions of terms relevant to our study. We also exemplify each element using the illustrative example shown in Figure 2, whenever possible.

• Subject: an entity that initiates access requests to perform actions on objects, e.g. patient.

• Subject attribute: a characteristic of the subject, e.g. registered.

• Object: anything upon which an action might be performed by a subject, e.g. health record.

• Object attribute: a characteristic of the object, e.g. full.

Herein, we use the terms "authorization requirement", "access policy" and "access control policy" interchangeably to refer to natural language access control policies (NLACPs). We also use "policy elements" as well as "elements" to refer to the subject and object elements of the authorization requirement, while we use "attributes" to refer to their corresponding characteristics.

Figure 2: Attribute identification scenario

2.2 ABAC Policy Authoring
Brossard et al. have designed a systematic approach for implementing ABAC [10]. The authors' proposed model is composed of an iterative sequence of phases, shown in Figure 1. Several studies have researched challenges and opportunities that face natural language processing (NLP) applications in the automation of the second phase of the cycle, named "gather authorization requirements" [35, 37, 43, 44, 48]. To complement this prior effort, our work aims to aid policy authors in the process of deriving the required attributes from natural language authorization requirements, which is mainly the third phase of the lifecycle. We therefore zoom into the details of this phase to establish an understanding of the problem domain and to identify key activities with the potential to be automated. This, in turn, provides insights on the essential building blocks needed to design a practical ABAC attribute identification framework.

The goal of the "identify required attributes" phase (see Figure 1) is to identify attributes that jointly build up an ABAC access control rule. Each attribute is defined by the following pieces of information, as suggested by [10]:

• Short Name: a key a policy author would use to refer to an attribute while writing policy. From Figure 2, subjectType, objectType, and status are all examples of short names.

• Category: the class to which an attribute belongs. The core categories are subject and object [40]. From the previously used example, the attribute subjectType, for instance, would belong to the subject category.

• Data type: this property tells what type of data an attribute's value would be. The most common data types are: string, boolean, numerical, and date and time. In our example, string would be a valid type of subjectType.

• Value constraints: this provides a range or discrete list of values that can possibly be assigned to an attribute. For instance, patient is a possible value of the subjectType attribute, and registered is a possible value of status (of the subject category).
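As a concrete illustration of these four properties, the attributes from the Figure 2 scenario could be captured in a structure such as the following (a minimal sketch of our own; the field names are illustrative and not part of any prescribed schema):

```python
# Sketch of the four attribute properties from Section 2.2.
# Field names are illustrative, not part of any standard schema.
def make_attribute(short_name, category, data_type, values):
    """Bundle the four properties that define an ABAC attribute."""
    return {
        "short_name": short_name,  # key used when writing policy
        "category": category,      # "subject" or "object"
        "data_type": data_type,    # e.g. string, boolean, numerical
        "values": values,          # range or discrete list of values
    }

# The Figure 2 example: a registered patient accessing a full health record.
subject_type = make_attribute("subjectType", "subject", "string", ["patient"])
status = make_attribute("status", "subject", "string", ["registered"])
object_type = make_attribute("objectType", "object", "string", ["health record"])

print(subject_type["category"], status["values"][0])  # subject registered
```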

There exist additional properties that are either optional or language dependent. Thus, deriving these properties is out of the scope of this paper; however, we refer the interested reader to [10]. In light of this description, an automated ABAC attribute identification tool is presumed to infer the properties needed to define each attribute, and to make them available to the back-end attribute exchange. Hence, we design the framework, shown in Figure 3, that leverages natural language processing (NLP), relation extraction (RE) and machine learning (ML) techniques in order to facilitate an automated analysis of authorization requirements as well as the inference of information needed to define ABAC attributes.

2.3 Natural Language Processing Techniques
This section provides an overview of the underlying natural language processing, relation extraction and machine learning techniques used in our framework.

2.3.1 Relation Extraction (RE). Relation extraction is a central task in natural language processing. The aim of this task is to answer whether it is possible to detect and characterize a semantic relation between pairs of co-occurring entities, also known as relation mentions, mostly within a sentence. Throughout the literature, the RE problem has often been addressed as a classification task where the goal is to assign a predefined relation to a pair of entities [25]. This approach is mainly organized into two stages: first, the identification of candidate relation instances, and then the classification phase. Techniques to identify candidate instances can be feature-based or kernel-based [11, 12, 15, 19, 26, 30, 33, 49, 51]. Either way, the presence of entities alone does not guarantee that a relation is validly expressed. Both strategies, therefore, require a classification phase. As in a standard classification task, classifiers are built upon a training corpus in which all relations and their participating entities have been manually annotated. The annotated relations are used as positive training examples, whilst the rest of the co-occurring pairs that are not labeled are deemed negative training examples. Classical learning algorithms such as support vector machines can then be used to train relation extraction classifiers.

Figure 3: High-level overview of the proposed attribute extraction approach
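This labeling scheme can be sketched in a few lines (a toy illustration of our own, not the paper's implementation): pairs found in the manual annotations become positive examples, and all other co-occurring pairs become negative ones.

```python
from itertools import product

def build_training_pairs(elements, modifiers, annotated):
    """Label every co-occurring (element, modifier) pair in a sentence:
    pairs present in the manual annotations are positive examples;
    all remaining co-occurring pairs are negative examples."""
    examples = []
    for pair in product(elements, modifiers):
        label = 1 if pair in annotated else 0
        examples.append((pair, label))
    return examples

# "A registered patient can access his/her full health record."
annotated = {("patient", "registered"), ("record", "full")}
pairs = build_training_pairs(["patient", "record"],
                             ["registered", "full"], annotated)
print(pairs)
# [(('patient', 'registered'), 1), (('patient', 'full'), 0),
#  (('record', 'registered'), 0), (('record', 'full'), 1)]
```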

2.3.2 Natural Language Processing (NLP) in RE. Four key natural language techniques have influenced relation extraction studies: tokenization, part-of-speech (POS) tagging, named entity recognition and syntactic parsing.

Tokenization is an essential starting point for almost any NLP task. It splits a text segment into tokens, which are words or punctuation marks. For English, rule-based text tokenization using spaces or punctuation is a straightforward task. However, abbreviation-like structures introduce a level of complexity to this process.
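A naive rule-based tokenizer, and the abbreviation problem it runs into, can be illustrated in a few lines (a toy sketch of our own):

```python
import re

def naive_tokenize(text):
    """Split into word runs vs. single punctuation marks --
    with no special handling of abbreviations."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Dr. Smith can access e.g. lab results."))
# ['Dr', '.', 'Smith', 'can', 'access', 'e', '.', 'g', '.', 'lab', 'results', '.']
```

Note how "Dr." and "e.g." are wrongly split into separate tokens, which is exactly the complexity that abbreviation-like structures introduce.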

POS taggers determine the corresponding POS tag for each word in a sentence [9]; this is a sequential labeling problem. In the context of RE, POS tags are used to detect relations between candidate pairs of entities in a sentence by matching patterns over the sentence's POS tags. POS tags are also useful in providing the lexical information needed to design an RE model. In attribute relation extraction, for example, attributes are usually adjectives, and policy elements are nouns or combinations of nouns.
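This lexical cue can be sketched with a toy tagger; a real system would use a trained POS tagger, and the tiny hand-built lexicon below is purely illustrative:

```python
# Toy POS lexicon -- a real system would use a trained tagger.
LEXICON = {"a": "DET", "registered": "ADJ", "patient": "NOUN",
           "can": "AUX", "access": "VERB", "full": "ADJ",
           "health": "NOUN", "record": "NOUN"}

def adj_noun_pairs(tokens):
    """Match the pattern 'ADJ followed by NOUN': a crude attribute cue."""
    tags = [LEXICON.get(t.lower(), "X") for t in tokens]
    return [(tokens[i], tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tags[i] == "ADJ" and tags[i + 1] == "NOUN"]

sent = "A registered patient can access full health record".split()
print(adj_noun_pairs(sent))  # [('registered', 'patient'), ('full', 'health')]
```

Note that the crude adjacency pattern pairs full with health rather than with record; limitations like this are one reason the dependency structure discussed under syntactic parsing is preferred over surface patterns.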

Named entity recognition (NER) aims to detect phrases that express real-world entities such as people, organizations, locations, times, events, etc., in a given segment of text [45]. Oftentimes, this task requires more than simple matching against predefined dictionaries or gazetteers. Real-world applications thus combine heuristics, probabilistic matching, and sequential learning techniques [32]. RE has adopted NER to identify pairs that can potentially form a candidate instance. It is also used to encode semantic knowledge of the participating entities.

Syntactic parsing techniques provide a mediation between the lexical representation of a sentence and its meaning. The two dominant syntactic parsing strategies are constituency and dependency parsing. The first of these strategies constructs the explicit syntactic constituents (e.g. noun phrases and verb phrases) that formally define a sentence [52]. The latter, on the other hand, reflects the grammatical structure of a given sentence [22]. Both representations have been used in the relation extraction literature to encode patterns that link pairs of entities of a particular relation [15, 25, 49, 51]. However, relation extraction studies have empirically shown that patterns drawn from a dependency tree are more resilient to lexical inconsistency across different domains [27].

2.3.3 Convolutional Neural Networks (CNNs) in RE. CNNs have recently been used to solve a wide range of applied machine learning problems. In RE, a CNN can be used to model syntactic and semantic relations between words within a sentence while reducing the dependence on manually designed features. Generally speaking, the use of CNN-based models in various NLP tasks has achieved impressive results [14, 29, 50].

The architecture of a typical CNN consists of a stack of layers: (1) an input layer that encodes the words in each relation instance as real-valued vectors using lookup tables; (2) a convolutional layer to capture the contextual features, e.g. n-grams, in the given input; (3) a pooling layer to determine the most relevant features; and (4) an output layer, which is the fully connected layer that performs the classification. Subsection 3.1.3 provides a detailed discussion of each layer as well as the overall structure of our CNN.

3 THE PROPOSED METHODOLOGY
As mentioned earlier in the background section, attributes in the ABAC model are defined via a short name, potential value(s), the data type and the category to which an attribute belongs. Out of these four dimensions, the property that is usually expressed in a natural language authorization requirement is the attribute value. Looking back to our illustrative example shown in Figure 2, registered, for instance, refers to the status of the authorized subjects. Full likewise describes the status of the health record that can be accessed by the registered patient. From the natural language standpoint, clauses/phrases that express such characteristics follow certain grammatical patterns to modify particular elements of the policy sentence. Therefore, our approach leverages the grammatical relations holding between policy elements and the constituents that modify these elements, to view this task as a relation extraction problem [19, 25]. Extracting the value of an attribute, say registered, and associating this value with the corresponding policy element, e.g. patient, equates to finding the value and category properties of two attributes, namely status and subjectType, as described later in this section. However, deriving the short name, e.g. status, as well as the data type, e.g. string, requires further analysis. Figure 3 shows an overview of our framework, where rectangles depict the four main modules and arrows point in the direction of data flow. The overall approach consists of four parts. It begins with analyzing the natural language authorization requirement to locate the modifiers of each policy element. Then, it proceeds towards three independent tasks: identifying the attributes' category, identifying their data type, and suggesting an appropriate attribute short name.

Figure 4: Dependency tree representation

3.1 Module 1: Attribute extraction
We define two element-specific relations, named subject-attribute and object-attribute. These relations exist between a policy element and a phrase or clause that expresses a characteristic of this particular element. To capture the target relations, we first identify the grammatical patterns that can possibly encode each relation. Using the most common patterns, we generate lists of candidate instances, which are then fed into a machine learning classifier to determine whether or not a candidate instance encodes the relation of interest. The following subsections provide the details of each step.

It is worth mentioning that capturing the relations of interest could be addressed as a multi-class learning task. However, the results obtained from this experimental design were less promising. This motivates our design decision of using two separate binary classifiers, one for each relation.

3.1.1 Identify patterns encoding the target relations. An authorization requirement is deemed to involve a subject-attribute relation if there exists at least one modifier that describes a characteristic of an authorized subject in this particular requirement.




Figure 5: The architecture of the CNN model. The marked entities (or mentions) of an object-attribute instance are obtained using one of the patterns identified in 3.1.1. The mentions and the corresponding heads of the instance are concatenated and fed to the network using the word embedding representation.

The same applies to the object-attribute relation, but with respect to the object element. Knowing whether this relation holds for the subject or object is essential to assign the appropriate category to the detected attributes, hence the use of the two element-specific relations.

Identifying the patterns encoding these relations requires a structured representation of the textual authorization requirement. The constituency parse tree and the dependency tree are both valid options. However, relation extraction studies have empirically shown that patterns drawn from the dependency tree are more resilient to lexical inconsistency across different domains [27]. We therefore represent policy sentences with the typed dependency tree, as shown in Figure 4 for the sentence "A registered patient can access his/her full health record." In this tree, each vertex represents a word from the sentence, whilst edges represent the grammatical relationships between words. patient, for instance, functions as the nominal subject (nsubj) of access, and record is the direct object (dobj) to be accessed. In this tree, the path that connects registered to patient is of special interest, as it encodes the subject-attribute relation. Similarly, the path between full and record encodes an object-attribute relation. In other words, the shortest dependency path that links subject or object elements with their attributes is deemed a relation pattern. To extract such patterns, we refer to our manually annotated data, in which we explicitly annotate the authorization attributes of both subjects and objects as described in the corpus creation section (Section 4). From the entire set of resulting patterns, we follow the recommendation of Berland et al. to focus on the most frequent ones [8].
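To make the shortest-path idea concrete, the following sketch (our own, with the Figure 4 dependency edges hand-encoded as head/dependent/label triples) finds the shortest dependency path between two words via breadth-first search:

```python
from collections import deque

# Hand-encoded dependency edges for the Figure 4 sentence:
# "A registered patient can access his/her full health record."
EDGES = [("access", "patient", "nsubj"), ("access", "record", "dobj"),
         ("access", "can", "aux"), ("patient", "registered", "amod"),
         ("patient", "A", "det"), ("record", "full", "amod"),
         ("record", "health", "compound"), ("record", "his/her", "poss")]

def shortest_dep_path(start, goal):
    """Breadth-first search over the dependency tree, treated as undirected."""
    adj = {}
    for head, dep, _ in EDGES:
        adj.setdefault(head, []).append(dep)
        adj.setdefault(dep, []).append(head)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_dep_path("registered", "patient"))  # ['registered', 'patient']
print(shortest_dep_path("full", "record"))         # ['full', 'record']
```

The one-edge paths above correspond to the amod patterns that encode the subject-attribute and object-attribute relations in this sentence.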

3.1.2 Generate candidate instances. After identifying the most frequent patterns for the target relations, we search through our dataset to find all matching instances. In this study, we target two relations that are encoded with two different sets of patterns. Matching against these patterns produces two lists of candidate instances, for the subject-attribute and object-attribute relations. For both relations, instances are represented with the tuple R(E, AV). R defines the overall relation. Depending on the pattern used to generate the instance, E, which stands for element, can be either the subject or the object. AV is the value of an attribute associated with E. The pair E and AV represent entities linked via the set of patterns that encode the relation R. Note that candidate instances do not always encode a valid relation. We therefore feed the instances to machine learning classifier(s) to decide whether or not a candidate instance encodes the relation of interest.

3.1.3 Classify candidate instances using a CNN-based model. The extracted candidate instances are now used to build the CNN-based classifiers for both relations. In particular, the combination of the entities of each candidate instance, also known as the relation mentions, and the head of each mention is what is fed to the model, as illustrated in Figure 5. Since CNNs can only accept fixed-length inputs, we compute the maximum length of the entity mentions and choose the input width to be more than twice as long as the maximum length (to account for both mentions and the heads). Inputs longer than the predefined length are trimmed, while shorter inputs are padded with special tokens. Figure 5 shows the overall architecture of our CNN. In the following, we provide the rationale for each layer of the model.
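The trim-or-pad step just described amounts to a few lines of code (a sketch of our own; the padding token name is an arbitrary choice):

```python
PAD = "<pad>"  # special padding token; the name is illustrative

def fit_to_length(tokens, length):
    """Trim inputs longer than `length`; pad shorter ones with PAD."""
    return tokens[:length] + [PAD] * max(0, length - len(tokens))

print(fit_to_length(["registered", "patient"], 4))
# ['registered', 'patient', '<pad>', '<pad>']
print(fit_to_length(["a", "b", "c", "d", "e"], 4))  # ['a', 'b', 'c', 'd']
```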

• Input layer: at the very first layer, the two relation mentions as well as their heads are concatenated to form the sequence x = [x1, x2, . . . , xn], where xi is the i-th word in this sequence, as seen in Figure 5. Before being fed into the CNN, each word xi in x must be encoded as a real-valued vector e of m dimensions using a word embedding table W. The embedding table W can either be learned


as a part of the model or loaded from pretrained word embeddings. Following the recommendation of Chen et al. [13], which calls for the latter approach, we build our model using the GloVe pretrained word embeddings [24]. The output of this layer is a matrix x = [x1, x2, . . . , xn]. The size of x is m × n, where m is the dimensionality of the word embedding vectors, and n is the length of the input sequence x.

• Convolution layer: to learn higher-level syntactic and semantic features, the matrix x, representing the given input, is passed through a convolution layer with filter f = [f1, f2, . . . , fw] of size m × w, where w is the width of the filter. The filter convolves over the input sequence to learn contextual features from the w adjacent terms in the input. The convolution of the two matrices, x and f, results in a "windowed" average vector h = [h1, h2, . . . , hn−w+1], as defined in Equation 1:

$h_i = g\left( \sum_{j=0}^{w-1} f_{j+1}^{T} \, x_{j+i} + b \right)$ (1)

Here g is an optional activation function and b is a bias term. The trained weights of f would then be used to detect the features of the relations of interest.
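Equation 1, and the max-pooling step described next, can be illustrated in plain Python; the toy embedding columns and filter values below are assumptions:

```python
def convolve(x, f, b=0.0, g=lambda v: max(0.0, v)):
    """Equation 1: h_i = g(sum_{j=0}^{w-1} f_{j+1}^T x_{j+i} + b).
    x and f are given as lists of m-dimensional columns; g defaults to ReLU."""
    m, n, w = len(x[0]), len(x), len(f)
    h = []
    for i in range(n - w + 1):
        s = b
        for j in range(w):
            # dot product of filter column f_{j+1} with input column x_{j+i}
            s += sum(f[j][k] * x[j + i][k] for k in range(m))
        h.append(g(s))
    return h

# Toy input: n = 4 positions, m = 2 embedding dimensions, filter width w = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
f = [[1.0, -1.0], [1.0, 1.0]]
print(convolve(x, f))       # [2.0, 1.0, 0.0]
print(max(convolve(x, f)))  # max pooling: 2.0
```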

• Pooling layer: the pooling layer is mainly used for dimensionality reduction purposes. Its two basic types are max and average pooling. Max pooling is known to be useful in the relation extraction task as it only passes the salient features of the vector h to the subsequent layers and filters out the less informative ones [42]. Hence, we apply a max operation to the result of each filter in our model. The application of max pooling over the vector h produces a scalar p = max(h) = max{h1, h2, . . . , hn−w+1}.

• Fully connected layer: in this step, the pooling scores obtained from the previous layer are concatenated into a single feature vector z. The vector z is then fed into a sigmoid layer to perform the classification task.

Upon the successful completion of this module, a pair (E, AV) classified as a valid member of relation R will then be used to derive values along with the category of ABAC attributes, as shown in Algorithm 1 and further explained in Modules 2-4. It is imperative to note that values that are explicitly expressed in the natural language sentence are the ones that are necessary to generate the executable form of the policy. Considering, for instance, a natural language policy sentence that begins with "A patient of 18 years of age or older can ..." and its equivalent machine-readable form, shown in Figure 6, one can assume that the actual values of the attribute (e.g. age) that go beyond what the sentence may include are obtained from the policy information point, to be used later for evaluation in the policy decision point, as suggested in [23].

3.2 Module 2: Identifying the category of an attribute

A straightforward method to determine the attribute category is to refer to the type of relation by which the attribute is detected. The categories subject and object correspond to the relations subject-attribute and object-attribute, respectively.

policies {
    attribute {
        category = subjectCat
        id = "age"
        type = numerical
    }
    attribute {
        category = subjectCat
        id = "subjectType"
        type = string
    }
    policy someAction {
        target clause subjectType == "patient"
        rule allow {
            condition on age >= 18
            permit
        }
    }
}

Figure 6: The machine-readable form of a policy sentence that begins with "A patient of 18 years of age or older can ..."

Input: The relation R along with E and AV, e.g. (subject-attribute, patient, registered)

Output: List of attributes defined by key, value, category and data type, e.g. [(subjectType, patient, subject attribute, string), (status, registered, subject attribute, boolean)]

Function addNewAttributes(R, E, AV)
    values = [E, AV];
    category = R.type();
    for v in values do
        key = getShortName(E, v);
        value = v;
        dataType = getDataType(v);
        attributes.add(key, value, category, dataType);
    end

Algorithm 1: Define the attributes out of detected relations
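Algorithm 1 can be sketched as runnable Python. Here, getShortName and getDataType stand for Modules 3 and 4; they are stubbed with simplified, illustrative behavior rather than the authors' actual implementations:

```python
# Runnable sketch of Algorithm 1. The two helpers below are placeholders
# for Module 3 (short-name suggestion) and Module 4 (data-type inference).
def get_short_name(element, value, relation):
    # E itself is named subjectType/objectType depending on the relation;
    # naming AV (here hard-coded to "status") is Module 3's harder task.
    if value == element:
        return "subjectType" if relation.startswith("subject") else "objectType"
    return "status"

def get_data_type(value):
    # Stand-in for Module 4's NE-based heuristics.
    return "numerical" if value.isdigit() else "string"

def add_new_attributes(relation, element, attr_value, attributes):
    for v in (element, attr_value):
        attributes.append({
            "key": get_short_name(element, v, relation),
            "value": v,
            "category": relation,
            "dataType": get_data_type(v),
        })
    return attributes

attrs = add_new_attributes("subject-attribute", "patient", "registered", [])
for a in attrs:
    print(a)
```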

3.3 Module 3: Suggesting an attribute's short name

This module takes as input the pair (E, AV) produced by Module 1. E and AV are both values of ABAC attributes. Since E always expresses a value of a core policy element, its short name is defined to be either subjectType or objectType, depending on the type of relation by which E is captured. However, naming the attribute valued by AV, e.g. status, requires a more complex reasoning process. There are several approaches in the literature that could be used to address this task [7, 20, 21], and they are beyond the scope of this paper.


Figure 7: Synthetic data generation framework. The dashed box indicates the manual task in this process.

3.4 Module 4: Identifying an attribute's data type

We currently employ a heuristics-based approach to infer the data type of an attribute. Our heuristics are directly derived from the named entity (NE) type assigned to either E or AV. That is, a NE tagger can recognize information units such as a person, organization, location, time, date, money and many more. An example of such a heuristic is "if the NE type of AV is money, then the data type is numeric." An obvious limitation of this approach is that it can infer the data type of an attribute only if the NE tagger successfully captures the NE type of its value, i.e. E or AV. We, therefore, plan to extend this module in order to move towards a more robust approach.
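Such heuristics can be sketched as a simple mapping from NE labels (in the style of common taggers such as spaCy's) to data types. Only the money-to-numeric rule is stated in the text; the remaining entries are our illustrative assumptions:

```python
# Sketch of Module 4's heuristics: map a tagger-produced NE type to an
# ABAC attribute data type. Only MONEY -> numeric comes from the text;
# the other entries are assumed for illustration.
NE_TO_DATATYPE = {
    "MONEY": "numeric",
    "DATE": "date",
    "TIME": "time",
    "PERSON": "string",
    "ORG": "string",
    "GPE": "string",  # locations
}

def infer_data_type(ne_type):
    # Returns None when no NE type was recognized, mirroring the
    # limitation discussed above.
    return NE_TO_DATATYPE.get(ne_type)

print(infer_data_type("MONEY"))  # numeric
print(infer_data_type(None))     # None
```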

4 CORPUS CREATION

While the advancements of ML and NLP techniques can provide ABAC policy authors and security architects with powerful tools to address requirements analysis and attribute extraction tasks, the effective adoption of these techniques is restricted by the lack of appropriate datasets. One reason for this dearth of data is that requirement specifications in general, and authorization requirements in particular, are meant to be proprietary, except for a few software projects for which requirements are publicly available. This results in a lack of authorization requirements that are rich with attributes. That is, the majority of the available natural language policy documents often express authorization rules by means of roles, actions and resources, e.g. "a patient chooses to view his or her access log." While some of these elements could be considered attributes, they mainly capture the core policy elements. Hence, these are not enough to train an automated attribute extraction model. To remedy the lack of appropriate data, we construct a synthetic natural language policy dataset such that the authorization rules are expressed by means of attributes. We then use this dataset to evaluate the proposed approach.

Two information sources have been used to fuel the synthetic data generation framework. The first is the datasets obtained from prior efforts in natural language access control policy (NLACP) collection. These datasets were constructed to study various aspects of the automated translation of NLACPs to machine-readable code. The most representative examples of these datasets are:

(1) iTrust: a patient-centric application for maintaining electronic health records [5].

(2) IBM Course Management App: a registration system for colleges [2].

(3) CyberChair: a conference management system [47].

(4) Collected ACP: a dataset of combined access control policy sentences collected by Xiao et al. [48].

Herein we refer to these documents collectively as the ACP dataset¹.

The authoring style of the authorization requirements in the ACP dataset is tuned more towards role-based access control than attribute-based access control. Hence, these datasets do not have enough attributes for the purpose of our study. To address this limitation, we augment either the subject or object elements of the policies with additional descriptive context derived from general-purpose textual data, namely Common Crawl [3], which composes our second information source. The purpose of injecting this additional context is to change the tone of the policy so it reads as if it was written with the ABAC model in mind. For example, before injecting the descriptive context, a policy sentence might be written as "faculty can access library materials", while after injection, it becomes "retired USD faculty can access the unrestricted library materials." The latter sentence ties together the core policy elements with their attributes, i.e. USD, retired and unrestricted, to mimic the type of authorization requirements the ABAC model is meant to capture.

Figure 7 shows the overall synthetic data generation framework. We start the process with the subject and object elements of access control sentences in the ACP dataset as they were identified by the original authors. Next, we search through the Common Crawl web content to retrieve segments of text that contain the elements of interest, i.e. the subject and the object. We limit the language of the

¹These documents can be downloaded from: https://sites.google.com/site/accesscontrolruleextraction/documents


Table 1: Examples of the synthesized natural language policies. Column A contains the original policies as obtained from the ACP datasets, and B shows the policies after being augmented with attributes.

A: A lab technician can update the status ...
B: A full-time first shift lab technician can update the status ...

A: Professor must be able to access the system to ...
B: Professor, of economics at University of NSW, must be able to access the system ...

A: Program Committee (PC) must be selected to review the papers.
B: Program Committee (PC) must be selected to review the "border line" papers.

retrieved segments to English. Intuitively, not all of the matching segments appropriately fit the semantics of policies in the ACP dataset. To filter irrelevant matches, we first build a language model using the ACP dataset. Then, we use the conditional probability of each segment under the computed language model to determine the semantic fitness of a segment to the context of the ACP dataset. Next, we manually augment each policy sentence with the relevant portion of text from the matching segment and make necessary changes. The injection process ends once we collect as many of the relevant segments as needed to inject the sentences of the ACP dataset with the synthetic attributes.
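The fitness filter can be illustrated with a smoothed unigram language model; the corpus and candidate segments below are toy examples, and the authors' actual model may well be more sophisticated:

```python
import math
from collections import Counter

# Score each candidate Common Crawl segment by its length-normalized
# log-probability under a unigram model built from the ACP dataset.
# The corpus and segments are illustrative placeholders.
acp_corpus = "a patient can view his health record . a nurse can update the record ."
counts = Counter(acp_corpus.split())
total, vocab = sum(counts.values()), len(counts)

def log_prob(segment):
    """Add-one smoothed unigram log-probability, normalized by length."""
    words = segment.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return lp / len(words)

segments = ["the registered patient record", "stock prices fell sharply today"]
best = max(segments, key=log_prob)
print(best)  # the registered patient record
```

Segments whose words overlap the policy vocabulary score higher, so off-topic matches are filtered out before the manual injection step.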

One might argue that the data generation process could be done entirely manually, by recruiting individuals to do so. An obvious motivation for the automation is to generate synthesized policies that are as natural as can be. That is, data generated manually by individuals in a controlled experiment setting might be biased by the individuals' writing styles, their background knowledge, as well as their comprehension of the data generation instructions. However, using content drawn from a general-purpose content repository, such as Common Crawl, captures the natural context that normally contains the elements of interest. Scalability, in terms of the volume of data and the variety of domains, is another motivation. The reason is that an automated means of data generation is designed to at least partially eliminate the need for domain experts. Table 1 shows examples of the synthesized policies which, if generated manually, would require a certain level of domain-specific knowledge.

Thus, we introduce our synthetic natural language policy dataset of four different domains. It contains a total of 851 sentences with manual annotations of 867 subject-attribute and 912 object-attribute instances. Table 2 summarizes the number of elements along with their attributes in our dataset. To capture the subject and object elements of an ABAC policy along with their attributes, we develop the following annotation scheme:

(1) e2SA: stands for element2 in Subject-Attribute Relation. A segment of the sentence that is tagged with this label corresponds to a value of a core policy element named subjectType.

(2) e1SA: stands for element1 in Subject-Attribute Relation. This label is used to tag the part of the sentence that expresses a value of an attribute that modifies e2SA.

We similarly define e2OA and e1OA to label objects and object attributes, respectively. Figure 8 presents the format of an annotated ACP sentence.

A [registered #e1SA] [patient #e2SA] can access his/her [full #e1OA] [health record #e2OA]

Figure 8: Annotated ACP sentence
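Annotations in this format can be read back with a short regular expression; the parser below is our illustration, not the authors' annotation tooling:

```python
import re

# Segments tagged "[text #label]" are extracted into (label, text) pairs,
# where label is one of the four scheme labels defined above.
ANNOT = r"\[\s*(.+?)\s*#(e[12](?:SA|OA))\s*\]"

def parse_annotations(sentence):
    return [(label, text) for text, label in re.findall(ANNOT, sentence)]

s = ("A [registered #e1SA] [patient #e2SA] can access his/her "
     "[full #e1OA] [health record #e2OA]")
print(parse_annotations(s))
# [('e1SA', 'registered'), ('e2SA', 'patient'),
#  ('e1OA', 'full'), ('e2OA', 'health record')]
```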

Table 2: The corpus built for the experiment

Dataset        | ACP sentences | # Subject-Attribute | # Object-Attribute
iTrust         | 466           | 511                 | 514
Collected ACP  | 112           | 144                 | 100
IBM App        | 163           | 140                 | 195
CyberChair     | 110           | 72                  | 103
Total          | 851           | 867                 | 912

5 EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION

We next present the evaluation we conducted to assess the effectiveness of the Attribute Extraction module, i.e. Module 1. In our evaluation, we address two main research questions:

• RQ1: What are the possible patterns that link policy elements and their attributes?

• RQ2: How effectively, in terms of precision, recall and F1-score, does our approach extract attributes of subjects and objects?

To answer RQ1, we first generate the dependency tree representation of each sentence. The resulting representation is augmented with the annotation information needed to explore patterns encoding either subject-attribute or object-attribute relations. Our analysis has shown that there exist 125 unique patterns encoding the subject-attribute relation, while the object-attribute relation is encoded with 240. Table 3 shows the five most frequently occurring patterns for each relation as per our corpus. The remaining patterns account for very few valid instances. Thus, we only consider the top five. Matching against these top five patterns generates the two sets of candidate instances, one for each relation. Table 4 shows, for each relation, the total number of candidate instances as well as the number of valid matches in each set. It is worth mentioning that identifying these patterns is not only valuable for the purpose of this study; the patterns can also be used as seeds to establish an inductive approach that semi-automates the process of learning patterns of our target relations [28].

In response to RQ2, we build our CNN model with a total of 4 layers, of which two are convolution layers with a dropout of 0.2, followed by one max pooling layer and the sigmoid layer. Throughout


Table 3: Five most occurring patterns in each relation (for more information regarding the components of each pattern, e.g. nsubj, see [16])

Subject-Attribute                  | Object-Attribute
Pattern          % of Occurrence   | Pattern          % of Occurrence
nsubj,amod       32%               | dobj,amod        24%
nsubj,prep       8%                | pobj,amod        9%
nsubjpass,amod   7%                | dobj,prep        6%
nsubj,compound   5%                | dobj,compound    4%
nsubj,ROOT,amod  4%                | nsubjpass,amod   3%

the experiment, we utilize the publicly available GloVe pretrained word embeddings of 300 dimensions. GloVe embeddings are trained on 42 billion words of Common Crawl data using global word-word co-occurrence statistics [24, 41]. For the convolutional layer, we use ReLU as an activation function and 128 filters with a window of size 2. All further parameters, e.g. batch size and kernel initializer, are left at their defaults. The models (i.e. subject-attribute and object-attribute) are trained for 10 epochs.
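Under these settings, the classifier can be sketched in Keras. The input width of 20 tokens, the use of GlobalMaxPooling1D, and feeding pre-embedded GloVe matrices directly are our assumptions; the paper's exact implementation may differ:

```python
# Sketch of the described CNN: two Conv1D layers with dropout 0.2, one
# max pooling layer, and a sigmoid output, using the stated settings
# (128 filters, window size 2, ReLU, 300-dim GloVe inputs). The input
# width of 20 is a placeholder for the computed maximum input length.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(128, 2, activation="relu", input_shape=(20, 300)),
    layers.Dropout(0.2),
    layers.Conv1D(128, 2, activation="relu"),
    layers.Dropout(0.2),
    layers.GlobalMaxPooling1D(),            # max over each filter's output
    layers.Dense(1, activation="sigmoid"),  # valid vs. invalid instance
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)  # trained for 10 epochs as in the text
print(model.output_shape)  # (None, 1)
```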

To evaluate the performance of our CNN, we employ 3-fold stratified cross-validation as well as document fold validation. In the document fold, the models are trained on three documents and tested on the fourth. The rationale of such an evaluation setting is to assess how the system will perform on out-of-domain data.

A general observation with regard to the performance behavior is that the subject-attribute relation extraction model surpasses the performance of the object-attribute model. A possible interpretation of the overall strength of the subject-attribute relation is rooted in the nature of the instances of this relation. More specifically, the subject-attribute relation is meant to chain a system's users, which are mostly human entities, with their corresponding attributes. This introduces a level of semantic consistency between instances belonging to this particular relation. However, this is not the case for the object-attribute relation, as it can apply to a wide variety of entity types. Hence, there will be more clusters of instances of this relation, which makes it less likely to be learned from a relatively small dataset. We, therefore, expect that the accumulation of more labeled instances can substantially improve the performance.

The first four rows of Table 5 show the means of the 3-fold cross-validation across 10 runs. Scores represent the performance of the model in detecting the valid instances of both relations, i.e. the performance over the positive class. The variance of these means ranges between 0.01 and 0.1. This experiment, the n-fold, represents the performance of the model on in-domain data. The high precision and recall scores indicate the model's ability to infer the semantic consistency among candidate instances. To evaluate how the model would perform when tested on out-of-domain data, we conduct a document fold experiment. Here, the model is trained on 3 documents and tested on the fourth, which belongs to a different domain. The results of this experiment show an expected drop in the values of precision and recall. However, the loss is more significant in the case of the object-attribute relation. A potential justification for this behavior is that, despite the domain differences, there are common attributes of subjects, e.g. location, age, etc., that are shared across domains. Such similarity is less likely to occur between objects across domains. For instance, a model trained

Table 4: Valid matches versus the total number of candidate instances using the top 5 patterns shown in Table 3.

Dataset        | Subject-Attribute   | Object-Attribute
               | valid    total      | valid    total
iTrust         | 265      570        | 288      1207
Collected ACP  | 84       350        | 70       162
IBM App        | 90       299        | 139      312
CyberChair     | 52       148        | 65       207

Table 5: Results of the proposed models (P, R, F1 represent precision, recall and F1-score, respectively). The first four rows show the performance of the model when tested on in-domain data, whilst the last row represents how the system would perform on out-of-domain data.

Dataset        | Subject-Attribute    | Object-Attribute
               | P     R     F1       | P     R     F1
iTrust         | 0.97  0.99  0.98     | 0.91  0.90  0.90
Collected ACP  | 0.93  0.93  0.93     | 0.93  0.95  0.94
IBM App        | 0.95  0.96  0.96     | 0.92  0.92  0.92
CyberChair     | 0.97  0.98  0.97     | 0.88  0.87  0.88
Document fold  | 0.91  0.80  0.85     | 0.75  0.69  0.71

on data from a medical domain would not be as certain about the attributes of, say, courses and classes as it would be about medical records. It is also worth noting that we have experimented with several other classical learning algorithms, e.g. SVM and Naive Bayes, trained on manually engineered features; however, the CNN outperforms these models. Those results are omitted here due to space limits.

6 RELATED WORK

Several studies have researched various aspects of the automated analysis of natural language ACPs. In order to review these prior efforts, we introduce six comparison dimensions along which each related work is characterized: (1) the building blocks of the proposed framework; (2) the underlying techniques used for each component; (3) indicators or features used in case a learning approach is employed; (4) the dataset used for validation; (5) the size of the dataset (measured by the number of sentences); and (6) the average of the available performance metrics. Table 6 presents a summary of the findings, while the rest of this section discusses each study separately.

Xiao et al. [48] first introduced an automated approach, named Text2Policy, to construct access control policy rules out of natural language artifacts. Their approach was built upon matching against 4 predefined patterns to discern ACP sentences from other types of sentences. Heuristics were defined based on the same set of patterns to extract instances that jointly compose ACP rules. The extraction task is followed by a transformation process in which a formal representation of the extracted rule is generated. The reported results achieved an average recall of 89.4% with a precision


Table 6: ACP extraction from natural language artifacts in the related work. In this table, N/A stands for not applicable, while N/R means not reported.

Study | Components of Framework | Underlying Tech. | Features | Dataset | Size | Performance

[48]
- ACP sentence identification | Semantic pattern matching | N/A | iTrust, IBM App | 927 | Prec: 88.7%, Rec: 89.4%
- ACP elements extraction | Heuristics over the patterns | N/A | Access control sentences in: iTrust, IBM App, and collected ACP | 241 | Accu: 86.3%
- Transformation to formal model | Heuristics | N/A | N/R | N/R | N/R

[43]
- ACP sentence identification | Majority vote of K-NN, Naive Bayes and SVM classifiers | Words, synonyms, POS, named entities, and Levenshtein distance in the case of K-NN | iTrust | 1159 | Prec: 87.3%, Rec: 90.8%
- ACP elements extraction | RE using bootstrapping (seeding patterns derived from the dependency tree) plus a Naive Bayes classifier of candidate instances | Features were not clearly reported | Access control sentences in: iTrust | 409 | Prec: 46.3%, Rec: 53.6%

[44]
- ACP sentence identification | K-NN | Levenshtein distance | iTrust, IBM App, Cyberchair, collected ACP | 2477 | Prec: 81%, Rec: 65%
- ACP elements extraction | RE using bootstrapping (seeding patterns derived from the dependency tree) plus a Naive Bayes classifier of candidate instances | Pattern itself, relationships to resource and subject, POS of subject and resource | Access control sentences in: iTrust, IBM App, Cyberchair, collected ACP | 1390 | N/R

[34]
- ACP sentence identification | Naive Bayes and SVM classifiers | A total of 821 features categorized into: pointwise mutual information, security, syntactic complexity, and dependency features | iTrust, IBM App, Cyberchair, collected ACP | 2477 | Prec: 90%, Rec: 90%

[35]
- ACP sentence identification | Deep recurrent neural network | Word embedding | ACPData, iTrust, IBM App, Cyberchair, collected ACP | 5137 | Prec: 81.28%, Rec: 74.21%

[37]
- ACP elements extraction | Semantic role labeler (SRL) | N/A | Access control sentences in: iTrust, IBM App, Cyberchair | 726 | Prec: 58.3%, Rec: 86.3%

[36]
- ACP elements extraction | Semantic role labeler (SRL) | N/A | Access control sentences in: iTrust, IBM App, Cyberchair, collected ACP | 841 | Prec: 63.5%, Rec: 86.25%

of 88.7% in the identification phase and an accuracy of 86.3% for the extraction. Evaluation results of the transformation phase were not reported. However, their approach cannot capture ACP sentences that do not follow the predefined patterns; and it has been shown that only 34.4% of ACPs can be captured with Text2Policy's patterns [44].

Slankas et al. [43] proposed Access Control Relation Extraction (ACRE), a machine learning based approach for ACP elements extraction from natural language documents. ACRE can be broken down into an ACP sentence identification phase followed by ACP elements extraction. In the identification phase, the authors investigated whether words, words' synonyms, POS and named entities could be used as indicators to identify ACP sentences. In the elements extraction phase, a bootstrapping technique built upon patterns drawn from the dependency tree representation of a sentence was adopted to extract ACP instances. Slankas et al. empirically validated the proposed approach on a version of iTrust that contains 1159 sentences. Their approach achieved a recall of 90.8% with a precision of 87.3% in the identification phase, whereas for ACP rule extraction the authors reported 46.3% precision with a recall of 53.6% as the best performance they could achieve.

Slankas et al. [44] extended ACRE, which was first introduced in [43]. The main components of the framework, as well as the underlying techniques, are similar to their original proposal with only subtle modifications. What clearly distinguishes this work from its predecessor is the evaluation. In [44], Slankas et al. validated the proposed approach against larger datasets collected from 5 sources of policy data that had been previously used in the literature. For the identification phase, a K-nearest neighbor (K-NN) learning approach was employed to discern ACP sentences from other types of sentences. The authors achieved an average classification precision and recall of 81% and 65%, respectively. For the elements extraction phase, on the other hand, performance was reported per dataset, meaning the overall average scores were not reported.

Narouei et al. [34] designed a new set of features to distinguish ACP from non-ACP sentences. Using the proposed feature set and following the classical machine learning pipeline, the authors reported an average precision and recall of 90%.

Narouei et al. [35] proposed a top-down policy engineering framework to particularly aid ABAC policy authoring. They adopted the following pipeline: ACP sentence identification followed by a policy elements extraction phase. For the identification task, the authors used a deep recurrent neural network with pre-trained word embeddings to identify sentences that contain access control policy content, and they achieved an average precision of 81.28% and a recall of 74.21%. For policy elements extraction, the authors suggested the use of a semantic role labeler (SRL), but no evaluation results were reported [35]. Using the language of the ABAC model, the authors regarded the arguments of the SRL tagger as the attribute values, while argument definitions were seen as the keys, or so-called short names. Consider, for instance, the two ACP sentences: (1) A registered patient can access his/her full health record, and (2) A registered patient can read his/her full health record. Following the approach of Narouei et al. [35], attributes would be represented as follows:


• accessor = patient
• reader = patient

where reader and accessor are the argument definitions obtained from the SRL tagger. Such a direct mapping from the SRL output (i.e. arguments and argument definitions) to attribute values and short names of the ABAC model is insufficient, mainly for two reasons. First, SRL labels arguments only with respect to the predicate, e.g. access and read. Therefore, attributes of subjects and objects such as registered and full will not be captured using this approach. Second, from the SRL perspective, an argument definition, e.g. accessor and reader, describes the semantic relation between an argument and a predicate in a sentence. This description, however, might not be an appropriate short name for an attribute, as can be seen in the above-mentioned example. That is, from the ABAC standpoint, patient is a value of one attribute called subjectType, or any semantically similar attribute name, rather than of the two attribute names, reader and accessor, suggested by the SRL tagger.

Narouei et al. [36, 37] proposed a new approach for extracting policy elements from ACP sentences using a semantic role labeler (SRL). The performance of the SRL approach was evaluated per dataset. The authors reported that they were able to achieve a higher extraction recall (88%) on their test data when compared to ACRE [44]. This boost in recall, however, comes at the cost of precision. While we are aware that the lack of a benchmark dataset makes the comparison task non-trivial, the reported results suggest that a combination of the policy extraction approaches by Slankas et al. [44] and Narouei et al. [37] can potentially balance both precision and recall values. Further, Narouei et al. [38] investigated the idea of improving SRL performance using domain adaptation techniques. The authors reported that they were able to identify ACP elements with an average F1-score of 75%, which bested the previous work by 15%.

Turner designed an ABAC authoring assistance tool that allows a security architect to configure an ABAC expression as a sequence of fields, namely subject, subject attributes, object, object attributes, etc. [46]. The natural language ABAC policy is then provided to the tool according to the predefined fields. The aim is to create an ABAC authoring environment with a business-level user experience while delegating the transformation of the provided input to machine-readable ABAC rules to the application. Unlike our proposal, following Turner's approach, the task of extracting the information relevant to the ABAC rule from the NLACP is still done manually. Although it is mentioned that this task could be automated, the presented approach is manual. Additionally, unlike our approach, which works with unrestricted natural language documents, Turner's approach constrains the NLACP with a context-free grammar.

7 LIMITATIONS

Evaluating the proposed approach on real-world requirements would be ideal. Hence, the use of synthetic data might impose a limitation on this work. However, acquiring real data from real organizations might not be feasible considering the privacy-sensitive nature of the data. We, therefore, developed our synthetic data that, while not real, is intended to be realistic. Particularly, to ensure that our synthetic data preserves the key characteristics of real data, we made the following design decisions: (1) the data was built upon real requirement documents; (2) the injected attributes were obtained from Web content in an attempt to capture the real-world context of a particular policy element; and (3) the manual part of the injection process was conducted by Ph.D. students with expertise in the access control domain. Hence, we believe that our synthetic data is representative of real-world data, and so is the performance. Generally speaking, the nature of privacy-sensitive content is a major challenge in devising data-driven security/privacy solutions, such as insider threat detection or clinically-relevant text analysis tools, to name a few. In such situations, synthetic content has been considered as an alternative [18, 39]. Another limitation of this work is that our evaluation mainly focuses on the attribute extraction module (see Subsection 3.1), while the other modules of the proposed framework are discussed only briefly. Note that this module constitutes the basis for our framework and this work is the first step towards developing the proposed framework; developing and evaluating Modules 2-4 is part of our ongoing research, and the results will be reported in future work.

8 CONCLUSIONS AND FUTURE WORK
A practical framework was proposed to analyze natural language ACPs and automatically identify the core properties needed to define the attributes contained in these policies. To the best of our knowledge, this is the first evaluated effort to solve this problem. When extracting the attributes of the subject and object elements of access control policies, our approach achieved average F1-scores of 0.96 and 0.91, respectively. These high F1 values indicate that our proposed method is promising. While the current results are encouraging, further data collection effort is needed to improve the performance on out-of-domain data. Our future research directions focus on devising automated means to support the ABAC authorization policy lifecycle.
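For reference, the F1-score reported above is the standard harmonic mean of precision and recall; the snippet below shows the computation with illustrative numbers, not values taken from our experiments.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# When precision equals recall, F1 equals their common value.
score = f1_score(0.96, 0.96)
```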

9 ACKNOWLEDGMENT
The work of the first author has been partially supported by Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

REFERENCES
[1] 2003. XACML – A No-Nonsense Developer's Guide. (2003). http://www.idevnews.com/stories/57
[2] IBM. 2004. Course Registration Requirements. (2004).
[3] 2017. Common Crawl. (September 2017). http://commoncrawl.org/
[4] Ryma Abassi, Michael Rusinowitch, Florent Jacquemard, and Sihem Guemara El Fatmi. 2010. XML access control: from XACML to annotated schemas. In Communications and Networking (ComNet), 2010 Second International Conference on. IEEE, 1–8.
[5] Andrew Meneely, Ben Smith, and Laurie Williams. 2011. iTrust Electronic Health Care System: A Case Study. (2011). http://bensmith.s3.amazonaws.com/website/papers/sst2011.pdf
[6] Axiomatics. 2017. Attribute Based Access Control (ABAC). (2017). https://www.axiomatics.com/attribute-based-access-control/
[7] Omid Bakhshandeh and James Allen. 2015. From Adjective Glosses to Attribute Concepts: Learning Different Aspects That an Adjective Can Describe. In IWCS. 23–33.
[8] Matthew Berland and Eugene Charniak. 1999. Finding Parts in Very Large Corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 57–64. https://doi.org/10.3115/1034678.1034697
[9] Eric Brill. 1995. Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-speech Tagging. Comput. Linguist. 21, 4 (Dec. 1995), 543–565. http://dl.acm.org/citation.cfm?id=218355.218367

Session: Research Track - ABAC SACMAT’18, June 13-15, 2018, Indianapolis, IN, USA


[10] David Brossard, Gerry Gebel, and Mark Berg. 2017. A Systematic Approach to Implementing ABAC. In Proceedings of the 2nd ACM Workshop on Attribute-Based Access Control (ABAC '17). ACM, New York, NY, USA, 53–59. https://doi.org/10.1145/3041048.3041051
[11] Yee Seng Chan and Dan Roth. 2010. Exploiting Background Knowledge for Relation Extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 152–160. http://dl.acm.org/citation.cfm?id=1873781.1873799
[12] Yee Seng Chan and Dan Roth. 2011. Exploiting Syntactico-semantic Structures for Relation Extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 551–560. http://dl.acm.org/citation.cfm?id=2002472.2002542
[13] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 740–750.
[14] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
[15] Aron Culotta and Jeffrey Sorensen. 2004. Dependency Tree Kernels for Relation Extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL '04). Association for Computational Linguistics, Stroudsburg, PA, USA, Article 423. https://doi.org/10.3115/1218955.1219009
[16] Marie-Catherine De Marneffe and Christopher D Manning. 2008. Stanford typed dependencies manual. Technical report, Stanford University.
[17] Gartner. 2013. Market Trends: Cloud-Based Security Services Market, Worldwide, 2014. (October 2013). https://www.gartner.com/doc/2607617
[18] J. Glasser and B. Lindauer. 2013. Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data. In 2013 IEEE Security and Privacy Workshops. 98–104. https://doi.org/10.1109/SPW.2013.37
[19] Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring Various Knowledge in Relation Extraction. (2005), 427–434. https://doi.org/10.3115/1219840.1219893
[20] Matthias Hartung and Anette Frank. 2010. A Structured Vector Space Model for Hidden Attribute Meaning in Adjective-noun Phrases. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 430–438. http://dl.acm.org/citation.cfm?id=1873781.1873830
[21] Matthias Hartung, Fabian Kaupmann, Soufian Jebbara, and Philipp Cimiano. 2017. Learning Compositionality Functions on Word Embeddings for Modelling Attribute Meaning in Adjective-Noun Phrases. In Proceedings of the 15th Meeting of the European Chapter of the Association for Computational Linguistics (EACL), Vol. 1.
[22] Mary Hearne, Sylwia Ozdowska, and John Tinsley. 2008. Comparing constituency and dependency representations for smt phrase-extraction. (2008).
[23] Vincent C Hu, David Ferraiolo, Rick Kuhn, Arthur R Friedman, Alan J Lang, Margaret M Cogdell, Adam Schnitzer, Kenneth Sandlin, Robert Miller, Karen Scarfone, et al. 2013. Guide to attribute based access control (ABAC) definition and considerations (draft). NIST special publication 800, 162 (2013).

[24] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2017. GloVe: Global Vectors for Word Representation. (2017). https://nlp.stanford.edu/projects/glove/
[25] Jing Jiang. 2012. Information Extraction from Text. In Mining Text Data, Charu C. Aggarwal and ChengXiang Zhai (Eds.). Springer US, Boston, MA, 11–41. https://doi.org/10.1007/978-1-4614-3223-4_2
[26] Jing Jiang and ChengXiang Zhai. 2007. A Systematic Exploration of the Feature Space for Relation Extraction. In HLT-NAACL. 113–120.
[27] Richard Johansson and Pierre Nugues. 2008. Dependency-based Semantic Role Labeling of PropBank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 69–78. http://dl.acm.org/citation.cfm?id=1613715.1613726
[28] Daniel Jurafsky and James H Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (2009).
[29] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Convolutional Neural Networks for Discourse Compositionality. CoRR abs/1306.3584 (2013). arXiv:1306.3584 http://arxiv.org/abs/1306.3584
[30] Nanda Kambhatla. 2004. Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions (ACLdemo '04). Association for Computational Linguistics, Stroudsburg, PA, USA, Article 22. https://doi.org/10.3115/1219044.1219066
[31] Butler W. Lampson. 1974. Protection. SIGOPS Oper. Syst. Rev. 8, 1 (Jan. 1974), 18–24. https://doi.org/10.1145/775265.775268
[32] James H Martin and Daniel Jurafsky. 2000. Speech and language processing. International Edition 710 (2000).
[33] Raymond J Mooney and Razvan C Bunescu. 2006. Subsequence kernels for relation extraction. In Advances in neural information processing systems. 171–178.
[34] Masoud Narouei, Hamed Khanpour, and Hassan Takabi. 2017. Identification of Access Control Policy Sentences from Natural Language Policy Documents. In Data and Applications Security and Privacy XXXI, Giovanni Livraga and Sencun Zhu (Eds.). Springer International Publishing, Cham, 82–100.
[35] Masoud Narouei, Hamed Khanpour, Hassan Takabi, Natalie Parde, and Rodney Nielsen. 2017. Towards a Top-down Policy Engineering Framework for Attribute-based Access Control. In Proceedings of the 22nd ACM on Symposium on Access Control Models and Technologies (SACMAT '17 Abstracts). ACM, New York, NY, USA, 103–114. https://doi.org/10.1145/3078861.3078874
[36] Masoud Narouei and Hassan Takabi. 2015. Automatic Top-Down Role Engineering Framework Using Natural Language Processing Techniques. In Information Security Theory and Practice, Raja Naeem Akram and Sushil Jajodia (Eds.). Springer International Publishing, Cham, 137–152.
[37] Masoud Narouei and Hassan Takabi. 2015. Towards an Automatic Top-down Role Engineering Approach Using Natural Language Processing Techniques. In Proceedings of the 20th ACM Symposium on Access Control Models and Technologies (SACMAT '15). ACM, New York, NY, USA, 157–160. https://doi.org/10.1145/2752952.2752958

[38] M. Narouei, H. Takabi, and R. Nielsen. 2018. Automatic Extraction of Access Control Policies from Natural Language Documents. IEEE Transactions on Dependable and Secure Computing (2018), 1–1. https://doi.org/10.1109/TDSC.2018.2818708
[39] Mayuresh Oak, Anil Behera, Titus Thomas, Cecilia Ovesdotter Alm, Emily Prud'hommeaux, Christopher Homan, and Raymond W Ptucha. 2016. Generating Clinically Relevant Texts: A Case Study on Life-Changing Events. In CLPsych@HLT-NAACL. 85–94.
[40] OASIS. 2013. eXtensible Access Control Markup Language (XACML) Version 3.0. (January 2013). http://docs.oasis-open.org/xacml/3.0/xacml-3.0-core-spec-os-en.html
[41] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[42] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion). ACM, New York, NY, USA, 373–374. https://doi.org/10.1145/2567948.2577348
[43] John Slankas and Laurie Williams. 2013. Access Control Policy Identification and Extraction from Project Documentation. Academy of Science and Engineering Science 2, 3 (2013), 145–159.
[44] John Slankas, Xusheng Xiao, Laurie Williams, and Tao Xie. 2014. Relation Extraction for Inferring Access Control Rules from Natural Language Artifacts. In Proceedings of the 30th Annual Computer Security Applications Conference (ACSAC '14). ACM, New York, NY, USA, 366–375. https://doi.org/10.1145/2664243.2664280
[45] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4 (CONLL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 142–147. https://doi.org/10.3115/1119176.1119195
[46] Ronald C. Turner. 2017. Proposed Model for Natural Language ABAC Authoring. In Proceedings of the 2nd ACM Workshop on Attribute-Based Access Control (ABAC '17). ACM, New York, NY, USA, 61–72. https://doi.org/10.1145/3041048.3041054
[47] Richard Van De Stadt. 2012. CyberChair: A Web-Based Groupware Application to Facilitate the Paper Reviewing Process. CoRR abs/1206.1833 (2012). arXiv:1206.1833 http://arxiv.org/abs/1206.1833 Withdrawn.
[48] Xusheng Xiao, Amit Paradkar, Suresh Thummalapenta, and Tao Xie. 2012. Automated Extraction of Security Policies from Natural-language Software Documents. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE '12). ACM, New York, NY, USA, Article 12, 11 pages. https://doi.org/10.1145/2393596.2393608
[49] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of machine learning research 3, Feb (2003), 1083–1106.
[50] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation Classification via Convolutional Deep Neural Network. In COLING. 2335–2344.
[51] Min Zhang, Jie Zhang, and Jian Su. 2006. Exploring Syntactic Features for Relation Extraction Using a Convolution Tree Kernel. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06). Association for Computational Linguistics, Stroudsburg, PA, USA, 288–295. https://doi.org/10.3115/1220835.1220872
[52] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and Accurate Shift-Reduce Constituent Parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, 434–443. http://www.aclweb.org/anthology/P13-1043
