research--trace ability with lsi
TRANSCRIPT
-
8/14/2019 Research--Trace Ability With LSI
TRACEABILITY WITH LSI
Team 2
-
Traceability
-
What's Traceability Good For?
Program comprehension
  Which code segment implements which specific requirement, and vice versa
Impact analysis
  Keeping non-code artifacts up to date
Requirement tracing
  Discover what code needs to change to handle a new requirement
  Aid in determining whether a specification is completely implemented and covered by tests
-
Challenges
Scalability: large number of artifacts
Heterogeneity: large number of different document formats and programming languages
Noisy
  Free-text information (natural language): conjunctions, prepositions, abbreviations, etc.
  Some information may be outdated, or just plain wrong
Prior work:
  Recovering Traceability Links in Software Artifact Management Systems Using Information Retrieval Methods [Lucia et al., 2007]
  Recovering Traceability Links between Code and Documentation [Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic, 2003]
-
Example
/** The File interface provides */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File */
    public FileImpl(String nativePath...) {}

    /** */
    private String f(..) {}
}
-
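Traceability Link Recovery Method
[Figure: pipeline diagram showing a code path and a document path]
Code path boxes: Source Code Component, Identifiers Extraction, Identifiers Separation, Text Normalization, Indexer
Document path boxes: Software Documents, Letter Transformation, Stopwords Removal, Morphological Analysis, Indexer
Other boxes: Query Extraction, Text Normalization, Document Classifier, Scored Document List
Annotations:
  Identifiers Extraction: list of identifiers
  Identifiers Separation: find the basic parts of identifiers (e.g., Auto_due into Auto and due)
  Letter Transformation: capital to small letters
  Stopwords Removal: removal of articles, punctuation, etc.
  Morphological Analysis: plural to singular, infinitive, etc.
  Letter Transformation, Stopwords Removal, and Morphological Analysis are the three steps of the document path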
-
Pre-processing
-
Text Preprocessing
Input: Copyright owners grant member companies of the OMG permission to make a limited
Output: copyright owner grant member compani omg permiss make limit
Steps: lower-casing, stop-word removal, number removal, etc.
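The preprocessing step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' implementation: the stop-word list is a small hypothetical sample, and stemming (which the slide's output evidently applies, e.g. "companies" becoming "compani") is left out so the sketch needs nothing beyond the standard library.

```python
import re

# Illustrative stop-word list (an assumption; real systems use much larger lists).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "is", "or"}

def preprocess(text):
    """Lower-case the text, keep alphabetic tokens only, and drop stop words.

    Tokenizing on letters removes digits and punctuation in the same pass.
    Stemming is intentionally not performed here.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

words = preprocess(
    "Copyright owners grant member companies of the OMG permission to make a limited"
)
print(words)
```

A stemmer (for example a Porter-style stemmer) would be applied to each surviving token as a final step to obtain forms such as the ones shown on the slide.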
-
Words Extraction
/** The File interface provides */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File */
    public FileImpl(String nativePath...) {}

    /** */
    private String f(..) {}
}
Words are extracted from:
  Class name
  Public function names
  Public function arguments and return type
  Comments
-
Words Expansion
Input: NativePath, fileName
Output: NativePath, Native, Path, fileName, File, Name, delete all elements,
Use well-known coding standards for sub-words
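A minimal sketch of word expansion, assuming the "well-known coding standards" are camel case and underscore separation. The function names `split_identifier` and `expand_identifier` are hypothetical, and the sketch keeps the original letter case of each sub-word rather than normalizing it.

```python
import re

# Matches, in order of preference: acronym runs ("HTTP"),
# capitalized or lower-case words ("Native", "file"), and digit runs.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_identifier(identifier):
    """Split an identifier into sub-words at underscores and camel-case boundaries."""
    return _PART.findall(identifier)

def expand_identifier(identifier):
    """Return the identifier itself followed by its sub-words, if it has more than one."""
    parts = split_identifier(identifier)
    return [identifier] + parts if len(parts) > 1 else [identifier]

for name in ["NativePath", "fileName", "Auto_due"]:
    print(name, "->", expand_identifier(name))
```

Keeping the unsplit identifier alongside its parts lets a query match either the full name or any of its sub-words.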
-
Vector Space Model
-
Vector Space Model
Vector Space Model (VSM) [Salton et al., 1975]
  Each document, d, is represented by a vector of ranks of the terms in the vocabulary:
    v_d = [r_d(w_1), r_d(w_2), ..., r_d(w_{|V|})]
  The query is similarly represented by a vector
  The similarity between the query and a document is the cosine of the angle between their respective vectors
Assumes terms are independent
  Some terms are likely to appear together
  Terms can have different meanings depending on context
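The ranking described above can be sketched as follows. This is a simplified illustration: raw term frequency stands in for the term weight r_d(w) (an assumption, since the slide does not fix a weighting scheme), and the document names and token lists are made up.

```python
import math
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector of a token list over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_tokens, documents):
    """documents maps a name to its token list; returns (name, score) by decreasing cosine."""
    vocabulary = sorted({w for toks in documents.values() for w in toks} | set(query_tokens))
    q = tf_vector(query_tokens, vocabulary)
    scores = {name: cosine(q, tf_vector(toks, vocabulary)) for name, toks in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical preprocessed artifacts.
docs = {
    "FileImpl": ["file", "impl", "native", "path", "create", "new", "file"],
    "Parser": ["parse", "token", "stream"],
}
print(rank(["file", "path"], docs))
```

Because each coordinate is treated independently, a document that shares no literal term with the query always scores zero, which is the limitation the slide points out.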
-
Vector Space Model
Classic IR might lead to poor retrieval because:
  unrelated documents might be included in the answer set
  relevant documents that do not contain at least one index term are not retrieved
  Reasoning: retrieval based on index terms is vague and noisy
Term-document matrix has a very high dimensionality
  Are there really that many important features for each document and term?
-
Example (from Lillian Lee)
Document 1: auto, engine, bonnet, tires, lorry, boot
Document 2: car, emissions, hood, make, model, trunk
Document 3: make, hidden, Markov, model, emissions, normalize
Synonymy (documents 1 and 2): will have a small cosine but are related
Polysemy (documents 2 and 3): will have a large cosine but are not truly related
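To make the synonymy and polysemy effects concrete, the cosines for the three term lists on this slide can be computed directly. The sketch assumes binary term vectors (each term present or absent), for which the cosine reduces to the number of shared terms divided by the geometric mean of the two list sizes.

```python
import math

def binary_cosine(a, b):
    """Cosine between binary term vectors, computed from the term sets directly."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

doc1 = ["auto", "engine", "bonnet", "tires", "lorry", "boot"]
doc2 = ["car", "emissions", "hood", "make", "model", "trunk"]
doc3 = ["make", "hidden", "markov", "model", "emissions", "normalize"]

synonymy = binary_cosine(doc1, doc2)  # same topic, no shared terms
polysemy = binary_cosine(doc2, doc3)  # different topics, three shared terms
print(synonymy, polysemy)  # 0.0 0.5
```

The two car-related lists get a cosine of 0.0, while the car list and the hidden Markov model list get 0.5 purely because "make", "model", and "emissions" occur in both.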
-
Latent Semantic Indexing
-
Latent Semantic Indexing
The user information need is more related to concepts and ideas than to index terms
A document that shares concepts with another document known to be relevant might be of interest
LSI [Deerwester et al., 1990]
  Enhances the semantics of long descriptions
  Dimension reduction can improve effectiveness
  Dimension reduction can find surprising relationships!
The key idea is to map documents and queries into a lower-dimensional space (i.e., one composed of higher-level concepts, which are fewer in number than the index terms)
Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
-
Latent Semantic Indexing
-
Singular Value Decomposition
Unique mathematical decomposition of a matrix into the product of three matrices:
  two with orthonormal columns
  one with singular values on the diagonal
Tool for dimension reduction
Finds the optimal (least-squares) projection into a low-dimensional space
-
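Singular Value Decomposition
Compute the singular value decomposition of a term-document matrix M
  D, a representation of M in r dimensions
  T, a matrix for transforming new documents
  S, the diagonal matrix of singular values, gives the relative importance of dimensions
M = [w_{td}]_{t \times d} = T_{t \times r} \, S_{r \times r} \, (D^T)_{r \times d}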
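A minimal NumPy sketch of the decomposition on this slide. The toy matrix and the helper name `truncated_svd` are illustrative assumptions; `numpy.linalg.svd` returns the singular values in decreasing order, so keeping the first r columns keeps the r most important dimensions.

```python
import numpy as np

def truncated_svd(M, r):
    """Return T (t x r), s (the r largest singular values), and D (d x r),
    so that M is approximated by T @ diag(s) @ D.T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :].T

# Toy term-document matrix (hypothetical counts): 4 terms x 3 documents.
M = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

T, s, D = truncated_svd(M, r=2)
M_r = T @ np.diag(s) @ D.T  # best rank-2 approximation of M in the least-squares sense

print("singular values kept:", s)
print("approximation error (Frobenius norm):", np.linalg.norm(M - M_r))
```

For a rank-r truncation, the Frobenius error equals the square root of the sum of the squared discarded singular values, which is why small trailing singular values can be dropped at little cost.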
-
LSI Term matrix T
T matrix
  gives a vector for each term in LSI space
  multiply by a new document vector to "fold in" new documents into LSI space
LSI is a rotation of the term space
  original matrix: terms are d-dimensional
  new space has lower dimensionality
  dimensions are groups of terms that tend to co-occur in the same documents
    synonyms, contextually related words, variant endings
-
Document matrix D
D matrix
  coordinates of documents in LSI space
  same dimensionality as T vectors
  can compute the similarity between a term and a document
http://lsi.research.telcordia.com/
-
Dimension Reduction
-
Improved Retrieval with LSI
New documents and queries are "folded in"
  multiply vector by T
Compute similarity for ranking as in VSM
  compare queries and documents by dot product
Improvements come from
  reduction of noise
  no need to stem terms (variants will co-occur)
  no need for stop list
    stop words are used uniformly throughout the collection, so they tend to appear in the first dimension
No speed or space gains, though
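Folding in can be sketched with NumPy as below. The slide only says "multiply vector by T"; the sketch additionally divides by the singular values, following the usual formulation in Deerwester et al. (1990), so that a folded-in copy of an indexed document lands exactly on that document's row of D. The toy matrix, the helper names, and the use of cosine for ranking are assumptions for illustration.

```python
import numpy as np

def lsi_fit(M, r):
    """Truncated SVD of a term-document matrix: returns T (t x r), s (r,), D (d x r)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :].T

def fold_in(vec, T, s):
    """Project a term-space vector (new document or query) into the r-dimensional LSI space."""
    return (vec @ T) / s

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy term-document matrix (hypothetical counts): 4 terms x 3 documents.
M = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

T, s, D = lsi_fit(M, r=2)
query = np.array([1.0, 0.0, 0.0, 0.0])  # a query containing only the first term
q_hat = fold_in(query, T, s)
scores = [cosine(q_hat, D[j]) for j in range(D.shape[0])]
print(scores)
```

Folding in avoids recomputing the SVD for each query, at the price that the new vector does not influence the learned dimensions.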
-
Computing an Example
Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
-
Computing an Example
[Figure: term-document matrix M]
r (human.user) = -.38
r (human.minors) = -.29
-
Computing an Example
[Figure: matrix K]
-
Computing an Example
[Figure: matrix S]
-
Computing an Example
[Figure: matrix D]
-
Computing an Example
[Figure: reconstructed matrix New M]
r (human.user) = .94
r (human.minors) = -.83
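The worked example can be approximately reproduced with NumPy. Several things here are assumptions rather than content of the slides: the term-document matrix is taken from the well-known example in Deerwester et al. (1990), it includes a ninth title (m4, "Graph minors: A survey") that does not appear in the title list above, the reconstruction keeps k = 2 dimensions, and the correlation is Pearson's r across the nine documents. The raw-matrix correlations round to the values quoted for M; the reconstructed ones move in the same direction as the slide's figures but need not match them digit for digit.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Columns: c1 c2 c3 c4 c5 m1 m2 m3 m4 (m4 is assumed from Deerwester et al., 1990).
M = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def term_correlation(matrix, a, b):
    """Pearson correlation between the rows of two terms."""
    i, j = terms.index(a), terms.index(b)
    return float(np.corrcoef(matrix[i], matrix[j])[0, 1])

# Rank-2 reconstruction ("New M") from the truncated SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_new = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

before_user = term_correlation(M, "human", "user")
before_minors = term_correlation(M, "human", "minors")
after_user = term_correlation(M_new, "human", "user")
after_minors = term_correlation(M_new, "human", "minors")

print(f"raw M:  r(human, user) = {before_user:.2f}, r(human, minors) = {before_minors:.2f}")
print(f"new M:  r(human, user) = {after_user:.2f}, r(human, minors) = {after_minors:.2f}")
```

In the raw matrix "human" and "user" never co-occur, so their correlation is negative; after reduction to two dimensions they become strongly positively correlated because both occur in the human-computer-interaction titles, while "human" and "minors" are pushed further apart.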
-
Incremental LSI
Allows fast, low-cost computation of traceability links by reusing the results of the previous LSI computation.
Avoids the full cost of LSI computation for traceability link recovery (TLR) by analyzing the changes to documentation and source code across versions of the system, and then deriving the changes to the set of documentation-to-source-code traceability links.
-
Evaluation
-
Experiment of LEDA
Trace links from the manual to the source code
Find out which parts of the source code are described by a given manual section.
-
IR Quality Measures
Precision @ n:
Recall @ n:
Average precision:
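The formulas for these measures did not survive in the transcript. The sketch below uses the common textbook definitions, which may differ in detail from what the original slide showed; the function names and the sample ranking are illustrative.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n retrieved items that are relevant."""
    if n <= 0:
        return 0.0
    return sum(1 for item in ranked[:n] if item in relevant) / n

def recall_at_n(ranked, relevant, n):
    """Fraction of all relevant items that appear among the top n."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:n] if item in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of precision@k over the ranks k holding a relevant item,
    divided by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

# Hypothetical ranked list of code components for one manual section.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(precision_at_n(ranked, relevant, 2))  # 0.5
print(recall_at_n(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

In this example one of the top two results is relevant, so precision at 2 is 0.5 and recall at 2 is 1/3; relevant item "e" is never retrieved, which lowers the average precision.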
-
Experiment of LEDA
-
Experiment of LEDA
Figure 2. Time cost with iLSI, threshold = 0.7
-
Conclusions
Latent semantic indexing provides an interesting conceptualization of the IR problem
It allows reducing the complexity of the underlying representational framework, which might be explored, for instance, with the purpose of interfacing with the user
-
Conclusions
Recovering traceability between code and documentation in real-world systems is effective via IR techniques.
For realistic datasets, the Vector Space Model, which does not perform dimensionality reduction, was shown to be the most effective.
-
References
Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang: Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution Management. ASE 2008: 59-68.
G. Antoniol, G. Canfora, G. Casazza, A. D. Lucia, and E. Merlo. Recovering Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng., 28(10):970-983, 2002.
Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.
Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407.
Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.