Clone Detection in Natural Language Software Artifacts: Techniques and Applications. Elmar Juergens, November 23rd, 2010


Page 1: CD in Natural Language Software Artifacts 1 Clone Detection in Natural Language Software Artifacts: Techniques and Applications Elmar Juergens November


Clone Detection in Natural Language Software Artifacts: Techniques and Applications

Elmar Juergens

November 23rd, 2010

Page 2:

Agenda

Why clone detection in natural language documents?

How to do (and how to improve) it?

Where else to apply it?

Page 3:

Motivation

Requirements specifications pivotal in many projects

• IEEE 830-1998: “Modifiability generally requires a requirements specification to […] not be redundant”

• Reviews or inspections employed to assure their quality

System tests often described in natural language

• Must adapt, if functionality changes

Our Experience

• Some reviews discovered substantial duplication

• Difficult to find manually; can we improve it?

Page 4:

Cloning in Requirements Specifications*

• 28 industrial specs

• English & German

• Domains: administration, automotive, convenience, finance, telecommunication, transport

• 11 companies

*Juergens et al., ICSE 2010, Can Clone Detection Support QA of Requirements Specifications?

Page 5:

Cloning in Use Cases

System A: 176 use cases in total, 150 contain cloning

Page 6:

Extent of Cloning

Page 7:

Extent of Cloning (2)

Typical Clones

• Entire use cases copied (create / edit XY)

• Similar combinations of pre and post conditions copied

• Descriptions of terms or roles copied

Example*: 42 instances (61 words, 13 instances with > 100 words)

*Translated from German

“The contracts with the clients describe the conditions regarding obligatory liabilities that the clients have agreed on with X. The liabilities are calculated from the exposures from Y and the contract conditions from X. The liability-relevant parts of the contracts thus need to be managed in system Z.”


Page 8:

Consequences

Reading and Inspections

• Reading speed: 220 words/minute [Gould et al., ’87]

• Average blow-up: 3578 words, i.e. ~16 minutes of reading overhead

• Slower recommended reading speed for inspections: 600 words/hour [Gilb & Graham, ’93]

• Time increase (3 persons): 2.5 person days (avg), 13 person days (max)

Further Observations

• Multiple inconsistent specification clones, probably unintentional

• Traced specification clone groups to the implementation. 3 cases observed: 1) shared abstraction; 2) cloned code; 3) reimplementation

Impact: Inspection effort, inconsistencies, code redundancy
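The reading-overhead figures on this slide follow from simple division, which the sketch below makes explicit. The function names are illustrative and not taken from the talk. Converting person-hours into the person days quoted on the slide depends on the assumed length of a working day and on how the average was taken, neither of which the slide states, so the sketch stops at person-hours.

```python
def reading_overhead_minutes(extra_words: int, words_per_minute: float) -> float:
    """Minutes spent reading text that only exists because of cloning."""
    return extra_words / words_per_minute


def inspection_overhead_person_hours(
    extra_words: int, words_per_hour: float, inspectors: int
) -> float:
    """Person-hours spent inspecting the cloned text at inspection reading speed."""
    return inspectors * extra_words / words_per_hour


avg_blow_up_words = 3578  # average blow-up from the slide

# Normal reading speed of 220 words/minute
print(round(reading_overhead_minutes(avg_blow_up_words, 220), 1))  # 16.3

# Inspection speed of 600 words/hour with 3 inspectors
print(round(inspection_overhead_person_hours(avg_blow_up_words, 600, 3), 1))  # 17.9
```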

Page 9:

Cloning in Test Cases

Study Object

• 167 system tests. Performed manually

• SUT: Business information system

• Test steps: action & expected result

Results

• Min length: 20 words. Only steps.

• > 1000 clones; Coverage: 54%

• Often interaction sequences cloned

• Some interactions >50 times

Can Clone Detection Support Test Automation Efforts?

Page 10:

Agenda

Why clone detection in natural language documents?

How to do (and how to improve) it?

Where else to apply it?

Page 11:

Current Approach

Clone Detector

• ConQAT (open source)

• Extensible through visual configuration language

• Word stemming, stop-word removal

Limitations

• Not robust against modifications

• Word replacements (user → actor) obscure clones

• Compare: identifier renaming in code clones
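To illustrate the approach and its limitation, here is a minimal sketch of word-based clone detection with stop-word removal and stemming. It is not ConQAT's implementation: the stop-word list, the suffix-stripping `crude_stem`, and the function names are all assumptions made for this example. The Terms slide uses a minimum clone length of 20 words; the example uses 6 to stay small.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; real detectors use larger ones.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "with", "in"}


def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list[str]:
    """Lower-case, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def find_clones(documents: dict[str, str], min_length: int) -> dict:
    """Map each normalized word sequence of min_length that occurs
    more than once to the (document, position) pairs where it starts."""
    seen = defaultdict(list)
    for name, text in documents.items():
        words = normalize(text)
        for i in range(len(words) - min_length + 1):
            seen[tuple(words[i : i + min_length])].append((name, i))
    return {seq: locs for seq, locs in seen.items() if len(locs) > 1}


docs = {
    "uc1": "The user creates a contract and the system stores the contract.",
    "uc2": "The user created contracts and the system stored the contract.",
    "uc3": "The actor creates a contract and the system stores the contract.",
}
clones = find_clones(docs, min_length=6)
print(list(clones.values()))  # [[('uc1', 0), ('uc2', 0)]]
```

Stemming lets `uc1` and `uc2` match despite different tenses and plurals, while the single replacement user → actor in `uc3` hides an otherwise identical sentence, which is the limitation named above.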

Page 12:

Detecting Type-2 NL Clones

Approach: POS Analysis

• Determine word categories for natural language texts

• Mature technique; open source tools exist, e.g. TreeTagger

Early Case Study

• Integrated TreeTagger into ConQAT

• Clone Detection on POS sequences instead of on raw text

Preliminary Results (on Requirements Specification)

• Compensated different actions (create/edit), tenses, targets

• Compensated systematic replacements (e.g. actor → user)

• Too aggressive normalization can produce false positives
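The idea of detecting clones on POS sequences can be sketched as follows. The toy lexicon below is a hypothetical stand-in for a real tagger such as TreeTagger, and the tag names are assumptions; it only shows why the normalization compensates for replacements and why it can go too far.

```python
# Hypothetical toy lexicon standing in for a real POS tagger.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "user": "NOUN", "actor": "NOUN", "contract": "NOUN",
    "account": "NOUN", "system": "NOUN",
    "creates": "VERB", "edits": "VERB", "deletes": "VERB",
    "new": "ADJ", "old": "ADJ",
}


def pos_sequence(sentence: str) -> list[str]:
    """Replace every word by its word category; unknown words get 'UNK'."""
    words = sentence.lower().replace(".", "").split()
    return [TOY_LEXICON.get(w, "UNK") for w in words]


a = pos_sequence("The user creates a new contract.")
b = pos_sequence("The actor edits a new contract.")
c = pos_sequence("The system deletes the old account.")

print(a)       # ['DET', 'NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']
print(a == b)  # True: actor/user and create/edit are compensated
print(a == c)  # True as well: an unrelated sentence now looks like a clone
```

The last comparison is the false-positive risk from the slide: once every word is reduced to its category, sentences that merely share a grammatical shape become indistinguishable.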

Page 13:

Agenda

Why clone detection in natural language documents?

How to do (and how to improve) it?

Where else to apply it?

Page 14:

Further Applications

Use Cases

• (Self-)plagiarism detection for papers

• Copyright infringement analysis

• Whitebox reuse studies across systems

Local Clone Queries

• Retrieve clones between a file and a base

• Not interested in all clones within the base

Index-Based Clone Detection

• Index stores the base for efficient clone queries

• Efficient clone queries between file and base

[Figure: Documents / Source Code; Create / update; Clone Index]

*Hummel, Juergens, Heinemann, ICSM 2010, Index-Based Clone Detection
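A local clone query against an index can be sketched as below. This is a toy in-memory illustration under stated assumptions, not the algorithm from the cited paper: the class name `CloneIndex`, chunking by raw words, and the use of SHA-1 are all choices made for this example, and a real detector would additionally merge neighbouring hits into longer clones.

```python
import hashlib
from collections import defaultdict


class CloneIndex:
    """Toy index: hash of a word chunk -> list of (document, position)."""

    def __init__(self, chunk_length: int):
        self.chunk_length = chunk_length
        self._entries = defaultdict(list)

    def _chunks(self, text: str):
        """Yield (position, hash) for every window of chunk_length words."""
        words = text.lower().split()
        for i in range(len(words) - self.chunk_length + 1):
            chunk = " ".join(words[i : i + self.chunk_length])
            yield i, hashlib.sha1(chunk.encode("utf-8")).hexdigest()

    def remove(self, name: str) -> None:
        """Drop all index entries that belong to one document."""
        for digest in list(self._entries):
            kept = [e for e in self._entries[digest] if e[0] != name]
            if kept:
                self._entries[digest] = kept
            else:
                del self._entries[digest]

    def add(self, name: str, text: str) -> None:
        """Create or update the index entries for one document."""
        self.remove(name)
        for position, digest in self._chunks(text):
            self._entries[digest].append((name, position))

    def query(self, text: str) -> list[tuple[int, str, int]]:
        """Return (query position, base document, base position) per shared chunk."""
        hits = []
        for position, digest in self._chunks(text):
            for name, base_position in self._entries.get(digest, []):
                hits.append((position, name, base_position))
        return hits


index = CloneIndex(chunk_length=4)
index.add("spec_a", "the system shall store the contract data in the archive")
index.add("spec_b", "the user shall review all open contracts every month")

hits = index.query("afterwards the system shall store the contract data again")
print(hits)
# [(1, 'spec_a', 0), (2, 'spec_a', 1), (3, 'spec_a', 2), (4, 'spec_a', 3)]
```

The query only looks up the chunks of the queried file, so its cost does not depend on how many clones exist inside the base itself, which matches the "local clone query" use case.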

Page 15:

Scalability

Study (@Google)

• MapReduce

• Bigtable for index

• 73.2 MLOC (C/C++/Java)

Ultra Large

• 120 million C++ files

• 2,915,947,163 LOC

• Index creation on 1000 machines: 7 hours => Real scalability

Page 16:

Conclusion

Lessons Learned

• Many specifications contain a lot of cloning

• Negative impact on reading and inspection effort

• Indication for corresponding redundancy in source code

• Cloning not necessary – many specs contain none

Future Work

• How can cloning be avoided or removed?

• What are the causes and the evolution of clones? Do they differ from code clones?

• Further empirical studies, e.g. on consequences for implementation

• How does POS normalization affect precision and recall?

• Can clone detection support test automation?

Page 18:

Terms

Requirements Specification [IEEE 830-1998]:

“specification for a particular software product, program, or set of programs that performs certain functions in a specific environment.”

Clone

• Duplicated specification text of at least 20 words.

• Small differences (e.g. declension) are tolerated.

• Must refer to specified system.

• False positives: e.g. page footers with copyright information

Page 19:

Research Questions

RQ1: How much cloning do requirements specifications contain?

RQ2: What kind of information is cloned in requirements specifications?

RQ3: What consequences does cloning in requirements specifications have?

RQ4: Can cloning in requirements specifications be detected accurately using existing clone detectors?

Out of Scope: Reasons for cloning or clone avoidance

Page 20:

Study Design

Page 21:

RQ2: Kind of Cloned Information

Cloning not limited to a specific kind of information

[Figure not preserved; axis label: # seen]

Page 22:

Threats to Validity

Internal

• Pairs of researchers to reduce errors during manual steps

• Reading speeds for cloned vs. non-cloned text? Assumed similar. Further research required

• Recall unclear. But: does not affect study results

External

• Substantial differences between requirements specifications (format, organization, language, …). But: large number of study objects from different companies and domains

Page 23:

90 clone groups with length > 90 words

Page 24:

49 clone groups with cardinality > 10

Page 25:

Clone Types for NL Clones

Code

• Type 1: different whitespace, comments

• Type 2: different identifiers, literals

• Type 3: different statements

• Type 4: different syntax but similar semantics

NL

• Type 1: different whitespace

• Type 2: POS-preserving word replacements (e.g. adj → adj)

• Type 3: different sentences

• Type 4: different syntax but similar semantics
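The first two NL clone types can be told apart mechanically, as the following sketch shows. It is only an illustration of the definitions above: `classify_nl_clone`, `toy_tag`, and the tag table are hypothetical names introduced here, and the tagger is passed in as a function so that a real POS tagger could be substituted. Types 3 and 4 need more than a word-by-word comparison and are lumped together as "other".

```python
from typing import Callable


def classify_nl_clone(a: str, b: str, tag: Callable[[str], str]) -> str:
    """Return 'type 1', 'type 2', or 'other' for two text fragments."""
    words_a, words_b = a.split(), b.split()
    if words_a == words_b:
        return "type 1"  # only whitespace differs
    if len(words_a) == len(words_b) and all(
        x == y or tag(x) == tag(y) for x, y in zip(words_a, words_b)
    ):
        return "type 2"  # every replaced word keeps its word category
    return "other"  # type 3 or 4, or not a clone at all


# Hypothetical toy tagger; a real setup would call an actual POS tagger.
TOY_TAGS = {"fast": "ADJ", "slow": "ADJ", "user": "NOUN", "actor": "NOUN", "saves": "VERB"}


def toy_tag(word: str) -> str:
    # Unknown words get a word-specific tag so that two different
    # unknown words are never treated as the same category.
    return TOY_TAGS.get(word.lower(), "UNK:" + word.lower())


print(classify_nl_clone("the  fast user\nsaves", "the fast user saves", toy_tag))  # type 1
print(classify_nl_clone("the fast user saves", "the slow actor saves", toy_tag))   # type 2
print(classify_nl_clone("the fast user saves", "the user saves quickly", toy_tag)) # other
```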