
RECOVERING TRACEABILITY LINKS IN REQUIREMENTS DOCUMENTS

Zeheng Li, Mingrui Chen, LiGuo Huang

Department of Computer Science & Engineering

Southern Methodist University

Dallas, TX 75275-0122

Vincent Ng

Human Language Technology Institute

University of Texas at Dallas

Richardson, TX 75083-0688

Presented By

Narendra Narisetti


Introduction

• Software system development begins with the evaluation and refinement of requirements.

• Documents that capture those requirements in natural language are called “requirements documents”.

• The requirements are refined with additional design details and implementation information.

• Linking requirements in which one is a refinement of the other is called “requirements traceability”.


Types of Requirements

• Specifically, requirements can be divided into two types:

1. High-level requirements (coarse-grained)

2. Low-level requirements (fine-grained)

• Requirements traceability links each high-level requirement with all the low-level requirements that refine it.

• The traceability mapping is many-to-many.
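The many-to-many mapping can be pictured as a small data structure. A hypothetical sketch; the requirement IDs below are illustrative, echoing the HR/UC naming used in the Pine example:

```python
# Hypothetical sketch: a requirements traceability matrix (RTM) as a
# many-to-many mapping from high-level to low-level requirement IDs.
# All IDs are illustrative, not taken from the paper's datasets.
rtm = {
    "HR01": {"UC01", "UC03"},
    "HR02": {"UC03", "UC07"},
}

def low_levels_for(rtm, high_id):
    """Return the low-level requirements that refine a high-level one."""
    return rtm.get(high_id, set())

def high_levels_for(rtm, low_id):
    """Invert the mapping: high-level requirements refined by low_id."""
    return {h for h, lows in rtm.items() if low_id in lows}

print(low_levels_for(rtm, "HR01"))   # UC01 and UC03
print(high_levels_for(rtm, "UC03"))  # HR01 and HR02 -- many-to-many
```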


Example: Pine email system by Sultanov and Hayes

Figure 1: Sample of high- and low-level requirements

Drawbacks:

• Information irrelevant to the establishment of one link may be relevant to the establishment of another link involving the same requirement.

Example: The Description section in UC01 is irrelevant to HR02 but relevant when linking to HR01.

• A link can exist between a pair of requirements even if they share no similar or overlapping content words.


Requirements Traceability Approaches:

• They can be classified into two types:

Manual approaches: Requirements traceability links are recovered manually by developers.

Automated approaches: Depend on information retrieval (IR) techniques to generate links automatically.


Automated approaches

• Cast traceability link recovery as a binary classification task.

• Measure the similarity between high-level and low-level requirements.

• A positive classification means the high-level and low-level requirements are linked.

• Information retrieval (IR) techniques are used for traceability link prediction.
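The similarity measurement at the core of these approaches can be sketched with a plain bag-of-words cosine. A minimal illustration, not the paper's exact pipeline; the example sentences and threshold are hypothetical:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

high = "the system shall send email messages"
low = "send an email message to the configured server"
# Predict a link when similarity exceeds a tuned threshold.
print(cosine(high, low) > 0.3)  # True
```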


Supervised Learning Methods

• Supervised methods are employed with two types of human-supplied knowledge:

i) Annotator rationales: the information a human annotator found relevant to establishing a link.

We use these rationales to create additional training instances for the learner.

ii) Hand-built ontology: defined by a domain expert to create additional training features for the learner. (see next slide)


Hand-built ontology of pine


Why are ontology-based features useful for traceability links?

1. They are not limited to the verbs and nouns that appear in the training data.

2. The verbs and nouns in the ontology are deemed relevant to link identification by a domain expert.

3. They provide a robust generalization of the words/phrases.


Hand-built Ontology

Manual Vs Automated

Manual Approach

1. System analysts use requirements management tools to build the RTM.

2. Examples: Rational DOORS, Rational RequisitePro, CASE tools.

3. It is human-intensive and thus error-prone given a large set of requirements.

Automated Approach

1. Calculate textual similarity between requirements. Ex: cosine coefficients, Jaccard.

2. Tf-idf-based vector space model, Latent Dirichlet Allocation.

3. Depends on IR techniques.


For our evaluation we take a second dataset, “WorldVistA”, an electronic health information system developed by the U.S. Veterans Administration, along with the Pine email system.

Datasets

Table 1: Statistics on the Datasets

Manual ontology for WorldVistA


Baseline Systems

• The baselines employ different methods for traceability prediction:

Unsupervised baselines: tf-idf, LDA

Supervised baselines: word pairs, LDA-induced topic pairs

Unsupervised Baselines

a) The tf-idf baseline: if the cosine similarity between two documents is greater than a given threshold value, the pair is classified as positive.

b) The LDA baseline: each document is represented by a probability distribution over the topics, and cosine similarity is applied to the topic distributions as in the method above.

Note: Here LDA is trained to produce n topics.
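A minimal sketch of the tf-idf baseline under these assumptions. The tf-idf weighting (plain tf times smoothed idf), the toy corpus, and the threshold are all illustrative; a real system would tune the threshold on held-out data:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors for a small corpus (plain tf * smoothed idf)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(d.split()).items()}
            for d in docs]

def cos(u, v):
    """Cosine similarity between two sparse tf-idf vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["send email message", "send mail via smtp", "display folder list"]
vecs = tfidf_vectors(docs)
threshold = 0.1  # tuned on a development set in practice
print(cos(vecs[0], vecs[1]) > threshold,
      cos(vecs[0], vecs[2]) > threshold)  # True False
```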


Supervised Baseline

• An instance is a pair of a high-level and a low-level requirement.

• If an instance is positive, the two requirements are linked; otherwise it is negative.

• Instances can be represented using two types of features:

a) Word pairs: each feature is a pair of words taken from the two requirements in the instance.

b) LDA-induced topic pairs: each feature is a pair of topics, and it fires if both topics are the most probable topics in the high- and low-level requirements.

Note: Here LDA is trained with an additional parameter C to produce n topics.
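The word-pair representation in (a) can be sketched as follows. This is illustrative; the paper may normalize or filter tokens differently, and the `__` separator is an arbitrary choice:

```python
from itertools import product

def word_pair_features(high: str, low: str) -> set:
    """One binary feature per (high-word, low-word) pair in the instance."""
    return {f"{h}__{l}" for h, l in product(high.lower().split(),
                                            low.lower().split())}

feats = word_pair_features("send email", "deliver message")
print(sorted(feats))
# ['email__deliver', 'email__message', 'send__deliver', 'send__message']
```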


Exploiting Rationales

Extension:

• To generate extra training instances (pseudo instances), we adopt extensions to the baseline systems.

• We employ a binary SVM classifier with a linear kernel on the training data, setting all parameters to default values except the C parameter.

Evaluation:

• The dataset is split for five-fold cross-validation: three folds for training, one fold for the development set, and one fold for evaluation.

• The F-score on the dev set gives the performance of the classifier.

Rationale in Traceability Prediction

• According to Zaidan et al., a rationale is a human-annotated text fragment that motivated an annotator to assign a particular label to a training document.

• In traceability prediction, rationales are identified only for positive instances.

• Negative instances arise from the absence of evidence that the two requirements should be linked, rather than the presence of evidence that they should not be linked.


Creating Negative Pseudo Instances

• Steps for creating negative pseudo instances:

i) Select a pair of linked requirements.

ii) Remove the rationales from one or both requirements.

iii) The remaining text fragments form pseudo instances that are negative in nature.

iv) From each positive instance, three types of negative pseudo instances are possible:

a) Removing all rationales from the high-level requirement.

b) Removing all rationales from the low-level requirement.

c) Removing all rationales from both requirements.
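A sketch of the procedure with hypothetical requirement texts and rationale annotations; the `strip_rationales` helper and the example strings are illustrative, not from the paper:

```python
def strip_rationales(text: str, rationales: list) -> str:
    """Remove annotated rationale fragments from a requirement's text."""
    for r in rationales:
        text = text.replace(r, " ")
    return " ".join(text.split())

high = "the system shall send email over smtp"
low = "connect to the smtp server and send the email"
# Hypothetical rationale annotations for this linked pair:
h_rat = ["send email over smtp"]
l_rat = ["smtp server", "send the email"]

# Three negative pseudo instances per linked pair (cases a, b, c):
neg_a = (strip_rationales(high, h_rat), low)   # rationale removed from high
neg_b = (high, strip_rationales(low, l_rat))   # rationale removed from low
neg_c = (strip_rationales(high, h_rat), strip_rationales(low, l_rat))
print(neg_c)  # ('the system shall', 'connect to the and')
```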

Creating Positive Pseudo Instances

• Steps for creating positive pseudo instances:

i) Select a pair of linked requirements.

ii) Remove the text fragments that are not part of a rationale in the pair.

iii) The remaining pseudo instances are positive pseudo instances.

iv) Add a constraint to the SVM learner so that pseudo instances are classified with less confidence.
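Correspondingly, a positive pseudo instance keeps only the rationale text. A minimal sketch with hypothetical rationale annotations:

```python
def keep_rationales(rationales: list) -> str:
    """A positive pseudo instance keeps only the rationale fragments."""
    return " ".join(rationales)

# Hypothetical rationale annotations for a linked pair of requirements:
pos_pseudo = (keep_rationales(["send email over smtp"]),
              keep_rationales(["smtp server", "send the email"]))
print(pos_pseudo)
```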


Soft-margin SVM formulation

i) Positive instances:

ii) Positive pseudo instances:

iii) Negative pseudo instances:

• xi = training example

• C = error penalty

• vi, uij = positive/negative pseudo instances created from xi

• ci ∈ {−1, +1} = class label

• ξi = slack variable

• μ = margin size
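The constraint equations for i)–iii) were images on the original slide and did not survive extraction. A plausible reconstruction using the symbols listed on this slide, assuming the rationale-SVM style of Zaidan et al.; the paper's exact formulation may differ:

```latex
\begin{aligned}
\min_{w,\;\xi \ge 0}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2}
    + C\Big(\textstyle\sum_i \xi_i + \sum_{i,j}\xi_{ij}\Big) \\
\text{s.t.\ i)}\quad
  & c_i\,(w \cdot x_i) \;\ge\; 1 - \xi_i
    && \text{(training instances)} \\
\text{ii)}\quad
  & c_i\,(w \cdot v_i) \;\ge\; \mu - \xi_i
    && \text{(positive pseudo instances, margin } 0 < \mu \le 1\text{)} \\
\text{iii)}\quad
  & -\,(w \cdot u_{ij}) \;\ge\; 1 - \xi_{ij}
    && \text{(negative pseudo instances)}
\end{aligned}
```

Setting μ below 1 is what lets the learner classify positive pseudo instances with less confidence than real training instances.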


Exploiting an Ontology

• To generate additional features, we supply the SVM learner with a hand-built ontology containing verb and noun clusters.

• In this setting, each training instance draws features

i) from the high-level and low-level requirements, and

ii) from the ontology.



Ontology Based Features

• Verb pairs / Noun pairs: focus on the verbs/nouns that are relevant to traceability prediction.

• Verb group pairs / Noun group pairs: replace verbs/nouns with their cluster IDs and create binary features over cluster IDs; these give the best performance.

• Dependency pairs: a combination of a verb and a noun connected by a dependency relation, obtained with the Stanford dependency parser.

Learning the Ontology

Is it possible to learn an ontology rather than hand-building it?

Yes, via a three-step procedure:

Step 1: Verb/noun selection

Select verbs, nouns, and noun phrases from the training data such that each

a) appears more than once;

b) contains at least three characters (excluding short words such as “be” and “is”);

c) appears in the high-level requirements but not the low-level ones, or vice versa.
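The three selection criteria can be sketched as a filter over token counts. A simplified illustration: the paper presumably uses POS tagging to identify verbs, nouns, and NPs, whereas this sketch just uses whitespace tokens:

```python
from collections import Counter

def select_terms(high_docs, low_docs):
    """Filter candidate ontology terms per the three selection criteria."""
    high = Counter(t for d in high_docs for t in d.lower().split())
    low = Counter(t for d in low_docs for t in d.lower().split())
    all_counts = high + low
    keep = set()
    for term, count in all_counts.items():
        if count <= 1:                  # a) must appear more than once
            continue
        if len(term) < 3:               # b) drops short words like "be", "is"
            continue
        if (term in high) != (term in low):  # c) in one level only, not both
            keep.add(term)
    return keep

high_docs = ["the system shall transmit reports", "transmit status reports"]
low_docs = ["open a socket", "open the report file"]
print(select_terms(high_docs, low_docs))
```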


Learning the Ontology

• Step 2: Verb/noun representation

a) Represent each verb by the set of nouns/NPs it co-occurs with, using the Stanford dependency parser.

b) Similarly, represent each noun by the set of verbs collected in step 1.

• Step 3: Clustering

a) Apply clustering to the verbs and nouns separately using the single-link algorithm.

b) This algorithm repeatedly merges the two most similar clusters under a similarity measure and stops when it reaches the desired number of clusters.

This yields an induced set of clusters for the given datasets.
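A compact sketch of single-link agglomerative clustering as described in step 3. The toy similarity function (Jaccard overlap of context-word sets) is illustrative, standing in for the step-2 representations:

```python
def single_link_cluster(items, sim, k):
    """Agglomerative single-link clustering down to k clusters.

    `sim` scores a pair of items; single-link similarity between two
    clusters is the max similarity over cross-cluster item pairs.
    """
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)  # merge the most similar pair
    return clusters

# Toy context sets stand in for the noun/verb representations of step 2.
ctx = {"send": {"email", "message"}, "deliver": {"email", "message"},
       "open": {"folder", "file"}}
sim = lambda a, b: len(ctx[a] & ctx[b]) / len(ctx[a] | ctx[b])
print(single_link_cluster(list(ctx), sim, 2))
# [['send', 'deliver'], ['open']]
```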

Evaluation

• In the evaluation, we compare the F-scores of the different methods, which depend on the combination of noun clustering, verb clustering, and the C value.

• The F-score depends on two terms:

i) Recall (R): the percentage of links in the gold standard that are recovered by our system.

ii) Precision (P): the percentage of links recovered by our system that are correct.

• The F-score is the harmonic mean of recall and precision.
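These three metrics can be computed directly from the gold and predicted link sets. A small sketch; the link IDs are illustrative:

```python
def prf(gold: set, predicted: set):
    """Precision, recall, and F-score for recovered traceability links."""
    tp = len(gold & predicted)                    # correctly recovered links
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0     # harmonic mean of P and R
    return p, r, f

gold = {("HR01", "UC01"), ("HR01", "UC03"), ("HR02", "UC03")}
pred = {("HR01", "UC01"), ("HR02", "UC03"), ("HR02", "UC07")}
print(prf(gold, pred))  # precision = recall = F = 2/3 here
```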


Result of Supervised Systems


Conclusion

• Traceability prediction is a crucial task that benefits from annotator rationales and an ontology.

• The supervised techniques reduce relative error by 11.1–19.7% compared to the baseline techniques.

• The F-scores obtained with manual clusters and induced clusters are competitive with each other.

• The results might change depending on the datasets.

