research--trace ability with lsi

Upload: andrew-denner

Post on 30-May-2018


TRANSCRIPT

  • 8/14/2019 Research--Trace Ability With LSI

    1/41

    TRACEABILITY WITH LSI

    Team 2


    Traceability


    A Comparison of Traceability Techniques for Specif

    What's Traceability Good For?

    Program comprehension: which code segment implements which specific requirement, and vice versa
    Impact analysis: keeping non-code artifacts up to date
    Requirement tracing: discovering what code needs to change to handle a new requirement; aiding in determining whether a specification is completely implemented and covered by tests


    Challenges

    Scalability: large number of artifacts
    Heterogeneity: large number of different document formats and programming languages
    Noise:
      Free-text information (natural language): conjunctions, prepositions, abbreviations, etc.
      Some information may be outdated, or just plain wrong

    Prior work:
      Recovering Traceability Links in Software Artifact Management Systems using information retrieval methods [Lucia et al., 2007]
      Recovering Traceability Links between Code and Documentation [Antoniol et al., 2002, Deerwester et al., 1990, Marcus and Maletic, 2003]


    Example

    /** The File interface provides */
    public class FileImpl extends FilePOA {
        private String nativefileName;

        /** Creates a new File */
        public FileImpl(String nativePath ...) {}

        /** */
        private String f(..) {}
    }


    Traceability Link Recovery Method

    [Figure: traceability link recovery process]

    Code path: Source Code Component, Identifiers Extraction, Identifiers Separation, Text Normalization, INDEXER
    Document path: Software Documents, Letter Transformation, Stopwords Removal, Morphological Analysis, INDEXER
    Other components: Query Extraction, Text Normalization, Document Classifier, Scored Document List

    Step annotations:
      Identifiers Extraction: list of identifiers
      Identifiers Separation: find basic parts of identifiers (e.g., Auto_due into Auto and due)
      Text Normalization: the three steps of the document path
      Letter Transformation: capital to small letter
      Stopwords Removal: removal of articles, punctuation, etc.
      Morphological Analysis: plural to singular, infinitive, etc.


    Pre-processing


    Text Preprocessing

    Input: "Copyright owners grant member companies of the OMG permission to make a limited"
    Output: "copyright owner grant member compani omg permiss make limit"

    Steps: lower-casing, stop-word removal, number removal, etc.
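The preprocessing steps named on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `preprocess` and the short `STOP_WORDS` set are hypothetical, and stemming (which the slide's output also shows, e.g. "companies" becoming "compani") is deliberately left out.

```python
import re

# Hypothetical stop-word list for illustration only; a real system
# would use a much fuller list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for"}


def preprocess(text):
    """Lower-case the text, keep alphabetic tokens only, drop stop words."""
    # The regex keeps runs of letters, so digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sentence = ("Copyright owners grant member companies of the OMG "
            "permission to make a limited")
print(preprocess(sentence))
```

Adding a stemmer after the stop-word filter would bring the tokens closer to the output shown on the slide.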


    Words Extraction

    /** The File interface provides */
    public class FileImpl extends FilePOA {
        private String nativefileName;

        /** Creates a new File */
        public FileImpl(String nativePath ...) {}

        /** */
        private String f(..) {}
    }

    Words extracted:
      Class name
      Public function names
      Public function arguments and return type
      Comments


    Words Expansion

    Input: NativePath, fileName,
    Output: NativePath, Native, Path, fileName, File, Name, delete all elements,

    Use well-known coding standards for sub-words
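One way to expand identifiers into sub-words is to split on camel-case boundaries and underscores. The sketch below assumes those two conventions only; the function name `expand_identifier` and the regular expression are illustrative choices, not the method used in the study. It also keeps each sub-word's original case rather than capitalising it.

```python
import re

# Matches, in order of preference: acronym runs such as "POA",
# capitalised or lower-case words such as "Native" or "file",
# and runs of digits. Underscores match nothing and act as separators.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def expand_identifier(identifier):
    """Return the identifier followed by its sub-words, if it has several."""
    parts = _PART.findall(identifier)
    if len(parts) <= 1:
        return [identifier]
    return [identifier] + parts


for name in ["NativePath", "fileName", "Auto_due", "FilePOA"]:
    print(name, "->", expand_identifier(name))
```

Keeping the full identifier alongside its parts lets a query match either the exact name or any of its component words.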


    Vector Space Model


    Vector Space Model

    Vector Space Model (VSM) [Salton et al., 1975]
    Each document, d, is represented by a vector of ranks of the terms in the vocabulary:

    $v_d = [r_d(w_1), r_d(w_2), \ldots, r_d(w_{|V|})]$

    The query is similarly represented by a vector
    The similarity between the query and document is the cosine of the angle between their respective vectors

    Assumes terms are independent
      Some terms are likely to appear together
      Terms can have different meanings depending on context
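The cosine ranking described on this slide can be sketched as follows. The sketch assumes raw term counts as the term weights r_d(w) (the slide does not fix a weighting scheme), and the document names and token lists are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, documents):
    """Score every document against the query, best match first."""
    q = Counter(query_tokens)
    scored = [(cosine(q, Counter(tokens)), name)
              for name, tokens in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical, already preprocessed artifacts.
documents = {
    "FileImpl": ["file", "native", "path", "create", "file"],
    "Manual3": ["tree", "graph", "path"],
}
for score, name in rank(["create", "file"], documents):
    print(f"{name}: {score:.3f}")
```

A document that shares no term with the query scores exactly zero, which is the limitation the following slides discuss.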


    Vector Space Model

    Classic IR might lead to poor retrieval because:
      unrelated documents might be included in the answer set
      relevant documents that do not contain at least one index term are not retrieved
      Reasoning: retrieval based on index terms is vague and noisy

    The term-document matrix has a very high dimensionality
      Are there really that many important features for each document and term?


    Example (from Lillian Lee)

    Document 1: auto, engine, bonnet, tires, lorry, boot
    Document 2: car, emissions, hood, make, model, trunk
    Document 3: make, hidden, Markov, model, emissions, normalize

    Synonymy (documents 1 and 2): will have small cosine but are related
    Polysemy (documents 2 and 3): will have large cosine but are not truly related
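As a worked check of the two claims, take the three word lists above as binary term vectors (an assumption; the slide does not state a weighting), writing d_1 for the list starting with "auto", d_2 for the list starting with "car", and d_3 for the list containing "hidden Markov". Each vector has six non-zero entries. The first two share no term, while the last two share "emissions", "make", and "model":

```latex
\[
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
               = \frac{0}{\sqrt{6}\,\sqrt{6}} = 0
\]
\[
\cos(d_2, d_3) = \frac{d_2 \cdot d_3}{\lVert d_2 \rVert \, \lVert d_3 \rVert}
               = \frac{3}{\sqrt{6}\,\sqrt{6}} = 0.5
\]
```

So the two car-related lists look entirely unrelated to the vector space model, while the car list and the hidden Markov model list look fairly similar.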


    Latent Semantic Indexing


    Lecture 12 Information Retrieval 17

    Latent Semantic Indexing

    The user's information need is more related to concepts and ideas than to index terms
    A document that shares concepts with another document known to be relevant might be of interest

    LSI [Deerwester et al., 1990]
      Enhances the semantics of long descriptions
      Reduction can improve effectiveness
      Reduction can find surprising relationships!

    The key idea is to map documents and queries into a lower-dimensional space (i.e., a space composed of higher-level concepts, which are fewer in number than the index terms)
    Retrieval in this reduced concept space might be superior to retrieval in the space of index terms


    Latent Semantic Indexing


    Singular Value Decomposition

    Unique mathematical decomposition of a matrix into the product of three matrices:
      two with orthonormal columns
      one with singular values on the diagonal
    Tool for dimension reduction
    Finds optimal projection into low-dimensional space


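    Singular Value Decomposition

    Compute the singular value decomposition of a term-document matrix M
      D, a representation of M in r dimensions
      T, a matrix for transforming new documents
      S, the diagonal matrix of singular values, gives the relative importance of dimensions

    $M_{t \times d} = T_{t \times r} \, S_{r \times r} \, D^{T}_{r \times d}$, where $M = [w_{td}]$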


    LSI Term matrix T

    T matrix
      gives a vector for each term in LSI space
      multiply by a new document vector to fold new documents into LSI space

    LSI is a rotation of the term space
      original matrix: terms are d-dimensional
      new space has lower dimensionality
      dimensions are groups of terms that tend to co-occur in the same documents
        synonyms, contextually related words, variant endings


    Document matrix D

    D matrix
      coordinates of documents in LSI space
      same dimensionality as T vectors
      can compute the similarity between a term and a document

    http://lsi.research.telcordia.com/


    Dimension Reduction
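The reduction step can be written out using the factor names T, S, and D from the earlier slides. The symbol k for the number of dimensions kept, and the notation t_i and d_i for the i-th columns of T and D, are introduced here for illustration. Keeping only the k largest singular values gives the best rank-k approximation of M in the Frobenius norm, which is the standard sense in which SVD "finds the optimal projection":

```latex
\[
M = T S D^{T}, \qquad
S = \operatorname{diag}(\sigma_1, \ldots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0
\]
\[
M_k = T_k S_k D_k^{T} = \sum_{i=1}^{k} \sigma_i \, t_i \, d_i^{T},
\qquad k \le r
\]
\[
\lVert M - M_k \rVert_F
  = \min_{\operatorname{rank}(X) \le k} \lVert M - X \rVert_F
  = \sqrt{\sum_{i = k+1}^{r} \sigma_i^{2}}
\]
```

Here T_k and D_k hold the first k columns of T and D, and S_k is the leading k-by-k block of S.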


    Improved Retrieval with LSI

    New documents and queries are "folded in"
      multiply vector by T
    Compute similarity for ranking as in VSM
      compare queries and documents by dot product
    Improvements come from
      reduction of noise
      no need to stem terms (variants will co-occur)
      no need for a stop list
        stop words are used uniformly throughout the collection, so they tend to appear in the first dimension
    No speed or space gains, though
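Folding in can be illustrated with a toy example. Everything numeric below is hypothetical: `T` and `S` are hand-made factors for a three-term vocabulary reduced to two dimensions, not the output of a real decomposition. The sketch also divides by the singular values after multiplying by T, which follows the convention in Deerwester et al. (1990) and is an assumption beyond what the slide states.

```python
import math

# Hypothetical LSI factors: 3 terms, 2 latent dimensions.
# The columns of T are orthonormal; S holds the singular values.
T = [
    [1 / math.sqrt(2), 0.0],  # term 0
    [1 / math.sqrt(2), 0.0],  # term 1 (co-occurs with term 0)
    [0.0, 1.0],               # term 2
]
S = [2.0, 1.0]


def fold_in(term_counts):
    """Project a term-count vector into LSI space: S^-1 * T^T * d."""
    return [
        sum(T[i][j] * term_counts[i] for i in range(len(T))) / S[j]
        for j in range(len(S))
    ]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


query = fold_in([1, 0, 0])     # contains only term 0
document = fold_in([0, 1, 0])  # contains only term 1
other = fold_in([0, 0, 1])     # contains only term 2
print(cosine(query, document), cosine(query, other))
```

In the original term space the query and the first document share no term, yet after folding in they point in the same direction, because terms 0 and 1 load on the same latent dimension.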


    Computing an Example

    Technical Memo Titles

    c1: Human machine interface for ABC computer applications
    c2: A survey of user opinion of computer system response time
    c3: The EPS user interface management system
    c4: System and human system engineering testing of EPS
    c5: Relation of user perceived response time to error measurement

    m1: The generation of random, binary, ordered trees
    m2: The intersection graph of paths in trees
    m3: Graph minors IV: Widths of trees and well-quasi-ordering
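A term-document matrix for these titles can be built by counting how often each index term occurs in each title. The sketch below is illustrative: the list `TERMS` is assumed to be the twelve index terms of the Deerwester et al. (1990) example that these titles come from, and the function name `term_document_matrix` is hypothetical. Only the eight titles listed on this slide are used.

```python
import re

TITLES = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
}

# Assumed index terms (lower-cased); not stated on the slide itself.
TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]


def term_document_matrix(titles, terms):
    """Return (doc_ids, matrix) with one row per term, one column per title."""
    doc_ids = list(titles)
    tokenised = {d: re.findall(r"[a-z]+", titles[d].lower()) for d in doc_ids}
    matrix = [[tokenised[d].count(term) for d in doc_ids] for term in terms]
    return doc_ids, matrix


doc_ids, M = term_document_matrix(TITLES, TERMS)
print("         " + "  ".join(doc_ids))
for term, row in zip(TERMS, M):
    print(f"{term:<9}" + "   ".join(str(c) for c in row))
```

Each row is the vector for one term and each column the vector for one title, which is the matrix that the singular value decomposition is applied to.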


    Computing an Example

    [Figure: term-document matrix M]

    r(human, user) = -.38    r(human, minors) = -.29

    Computing an Example

    [Figures: matrices K, S, and D]


    Computing an Example

    [Figure: reconstructed matrix, New M]

    r(human, user) = .94    r(human, minors) = -.83


    Incremental LSI

    Allows fast, low-cost computation of traceability links by reusing the results of the previous LSI computation.

    Avoids the full cost of LSI computation for traceability link recovery (TLR) by analyzing the changes to documentation and source code across versions of the system, and then deriving the changes to the set of documentation-to-source-code traceability links.


    Evaluation


    Experiment of LEDA

    Trace links from the manual to the source code

    Find out which parts of the source code are described by a given manual section.


    IR Quality Measures

    Precision @ n:

    Recall @ n:

    Average precision:
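The formulas for these three measures are not reproduced in this transcript. The sketch below uses the commonly used definitions, which is an assumption about what the slide showed; the function names and the sample ranking are hypothetical. Average precision is taken here as the mean of precision@k over the ranks k at which a relevant item appears, divided by the total number of relevant items.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n retrieved items that are relevant."""
    if n <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


def recall_at_n(ranked, relevant, n):
    """Fraction of all relevant items that appear in the top n."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Sum of precision@k at each relevant rank k, over all relevant items."""
    if not relevant:
        return 0.0
    total = sum(precision_at_n(ranked, relevant, k)
                for k, item in enumerate(ranked, start=1)
                if item in relevant)
    return total / len(relevant)


# Hypothetical ranked list of code artifacts for one manual section.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_n(ranked, relevant, 2))
print(recall_at_n(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

Relevant items that are never retrieved still count in the denominator of average precision, so missing links lower the score.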

    Experiment of LEDA

    [Figures: LEDA experiment results]

    Figure 2. Time cost with iLSI, threshold = 0.7


    Conclusions

    Latent semantic indexing provides an interesting conceptualization of the IR problem

    It allows reducing the complexity of the underlying representational framework, which might be explored, for instance, with the purpose of interfacing with the user


    Conclusions

    Traceability between code and documentation in real-world systems is effective via IR techniques.

    For realistic datasets, the Vector Space Model, which does not perform dimensionality reduction, was shown to be the most effective.


    References

    Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang: Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution Management. ASE 2008: 59-68.

    G. Antoniol, G. Canfora, G. Casazza, A.D. Lucia, and E. Merlo. Recovering Traceability Links Between Code and Documentation. IEEE Trans. Softw. Eng., 28(10):970-983, 2002.

    Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.

    Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the Society for Information Science, 41(6), 391-407.

    Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.