TRANSCRIPT
1
CD in Natural Language Software Artifacts
Clone Detection in Natural Language Software Artifacts: Techniques and Applications
Elmar Juergens
November 23rd, 2010
2
Agenda
Why clone detection in natural language documents?
How to do (and how to improve) it?
Where else to apply it?
3
Motivation
Requirements specifications pivotal in many projects
• IEEE 830-1998: “Modifiability generally requires a
requirements specification to […] not be redundant”
• Reviews or inspections employed to assure their quality
System tests often described in natural language
• Must adapt, if functionality changes
Our Experience
• Some reviews discovered substantial duplication
• Difficult to find manually; can we improve it?
5
Cloning in Requirements Specifications*
• 28 industrial specs
• English & German
• Domains: administration, automotive, convenience, finance, telecommunication, transport
• 11 companies
*Juergens et al., ICSE 2010, Can Clone Detection Support QA of Requirements Specifications?
Cloning in Use Cases
System A: 176 use cases in total, 150 contain cloning
7
Extent of Cloning
8
Extent of Cloning (2)
Typical Clones
• Entire use cases copied (create / edit XY)
• Similar combinations of pre and post conditions copied
• Descriptions of terms or roles copied
Example*: 42 instances (61 words, 13 instances with > 100 words)
*Translated from German
“The contracts with the clients describe the conditions
regarding obligatory liabilities that the clients have agreed on
with X. The liabilities are calculated from the exposures from
Y and the contract conditions from X. The liability-relevant
parts of the contracts thus need to be managed in system
Z.”
“The contracts with the clients describe the conditions
regarding obligatory liabilities that the clients have agreed on
with X. The liabilities are calculated from the exposures from
Y and the contract conditions from X. The liability-relevant
parts of the contracts thus need to be managed in system
Z.”
…
9
Consequences
Reading and Inspections
• Reading speed: 220 words/minute [Gould et al., ‘87]
• Average blow-up: 3578 words, i.e. ~16 minutes reading overhead
• Slower recommended reading speed: 600 words/hour [Gilb & Graham, ‘93]
• Time increase (3 persons): 2.5 person days (avg), 13 person days (max)
Further Observations
• Multiple, probably unintentional, inconsistent specification clones
• Traced specification clone groups to implementation. 3 cases observed: 1) shared abstraction; 2) cloned code; 3) reimplementation
Impact: Inspection effort, inconsistencies, code redundancy
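The reading-overhead figure above follows from dividing the redundant words by the reading speed. A minimal sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
def reading_overhead_minutes(redundant_words: int, words_per_minute: float) -> float:
    """Extra reading time caused by redundant (cloned) words."""
    return redundant_words / words_per_minute


blow_up_words = 3578   # average blow-up reported on the slide
reading_speed = 220    # words per minute [Gould et al., '87]

overhead = reading_overhead_minutes(blow_up_words, reading_speed)
print(f"{overhead:.1f} minutes")  # 16.3 minutes
```

The person-day figures on the slide are not reproduced here, since the talk does not state how they were aggregated across specifications.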
10
Cloning in Test Cases
Study Object
• 167 system tests. Performed manually
• SUT: Business information system
• Test steps: action & expected result
Results
• Min clone length: 20 words; only test steps analyzed
• > 1000 clones; Coverage: 54%
• Often interaction sequences cloned
• Some interactions >50 times
Can Clone Detection Support Test Automation Efforts?
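The slide does not define how coverage is measured. As a rough sketch, assuming coverage means the fraction of words that lie inside at least one clone, it could be computed like this (function name, span format, and example numbers are all hypothetical):

```python
def clone_coverage(total_words: int, clone_spans: list[tuple[int, int]]) -> float:
    """Fraction of word positions covered by at least one clone span.

    Spans are half-open (start, end) word offsets and may overlap.
    """
    covered = set()
    for start, end in clone_spans:
        covered.update(range(start, end))
    return len(covered) / total_words if total_words else 0.0


# Hypothetical 100-word document with two overlapping clone spans.
print(clone_coverage(100, [(0, 30), (20, 54)]))  # 0.54
```

Overlapping spans are counted once, so a word that belongs to several clones does not inflate the result.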
11
Agenda
Why clone detection in natural language documents?
How to do (and how to improve) it?
Where else to apply it?
12
Current Approach
Clone Detector
• ConQAT (open source)
• Extensible through visual configuration language
• Word stemming, stop-word removal
Limitations
• Not robust against modifications
• Word replacements (user → actor) obscure clones
• Compare: identifier renaming
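To illustrate the idea of normalization followed by duplicate search, here is a minimal sketch. It is not ConQAT's implementation: the stop-word list, the crude `naive_stem` helper, and the sliding-window matching are simplifying assumptions made for this example.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not ConQAT's list).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "and"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stop words, stem the rest."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


def find_clones(docs: dict[str, str], min_length: int) -> dict:
    """Map each window of min_length normalized words to its occurrences,
    keeping only windows that occur more than once."""
    occurrences = defaultdict(list)
    for name, text in docs.items():
        tokens = normalize(text)
        for i in range(len(tokens) - min_length + 1):
            occurrences[tuple(tokens[i:i + min_length])].append((name, i))
    return {k: v for k, v in occurrences.items() if len(v) > 1}


docs = {
    "uc1": "The user creates the contracts and manages the liabilities.",
    "uc2": "The actor creates the contracts and manages the liabilities.",
}
print(find_clones(docs, min_length=4))  # one shared window, starting after the first word
print(find_clones(docs, min_length=5))  # {} because "user" vs "actor" breaks the match
```

The second call shows the limitation named above: a single word replacement prevents the two sentences from being reported as a clone of full length.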
13
Detecting Type-2 NL Clones
Approach: POS Analysis
• Determine word categories for natural language texts
• Mature technique; open source tools exist, e.g. TreeTagger
Early Case Study
• Integrated TreeTagger into ConQAT
• Clone Detection on POS sequences instead of on raw text
Preliminary Results (on Requirements Specification)
• Compensated different actions (create/edit), tenses, targets
• Compensated systematic replacements (e.g. actor → user)
• Too aggressive normalization can produce false positives
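The effect of comparing POS sequences instead of raw words can be sketched as follows. The hand-written `POS_LEXICON` is a hypothetical stand-in for a real tagger such as TreeTagger, and the tag names are illustrative only.

```python
# Hypothetical mini lexicon; a real setup would query a POS tagger instead.
POS_LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "user": "NOUN", "actor": "NOUN", "account": "NOUN", "contract": "NOUN",
    "creates": "VERB", "edits": "VERB", "deletes": "VERB",
    "new": "ADJ", "existing": "ADJ",
}


def pos_sequence(sentence: str) -> tuple[str, ...]:
    """Replace every word by its word category (UNK if not in the lexicon)."""
    words = (w.strip(".,;:").lower() for w in sentence.split())
    return tuple(POS_LEXICON.get(w, "UNK") for w in words if w)


def is_type2_clone(a: str, b: str) -> bool:
    """Treat two sentences as type-2 clones if their POS sequences match."""
    return pos_sequence(a) == pos_sequence(b)


# Systematic replacement and a different action are compensated:
print(is_type2_clone("The user creates a new account.",
                     "The actor edits an existing account."))   # True

# Aggressive normalization also matches an unrelated sentence:
print(is_type2_clone("The user creates a new account.",
                     "The actor deletes a new contract."))      # True
```

The second pair shows why normalizing every word to its category can produce false positives: the sentences share a grammatical shape but not their content.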
14
Agenda
Why clone detection in natural language documents?
How to do (and how to improve) it?
Where else to apply it?
15
Further Applications
Use Cases
• (Self-)plagiarism detection for papers
• Copyright infringement analysis
• Whitebox reuse studies across systems
Local Clone Queries
• Retrieve clones between file and base
• Not interested in all clones in base
Index-Based Clone Detection
• Index stores base for efficient clone queries
• Efficient clone queries between file and base
[Diagram: Documents / Source Code → create / update → Clone Index]
*Hummel, Juergens, Heinemann, ICSM 2010, Index-Based Clone Detection
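A local clone query against an index can be sketched as below. This is a simplified in-memory illustration, not the algorithm from the cited paper: the `CloneIndex` class, the word-window chunking, and the use of SHA-256 digests are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


class CloneIndex:
    """Maps hashes of fixed-length word windows to their locations in the base."""

    def __init__(self, chunk_length: int = 5):
        self.chunk_length = chunk_length
        self._index = defaultdict(set)  # digest -> {(document name, position)}

    def _chunks(self, text: str):
        tokens = text.lower().split()
        for i in range(len(tokens) - self.chunk_length + 1):
            chunk = " ".join(tokens[i:i + self.chunk_length])
            yield i, hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def add(self, name: str, text: str) -> None:
        """Create or extend the index entries for one base document."""
        for pos, digest in self._chunks(text):
            self._index[digest].add((name, pos))

    def query(self, text: str) -> list[tuple[int, str, int]]:
        """Return (query position, base document, base position) for shared chunks."""
        hits = []
        for pos, digest in self._chunks(text):
            for doc, doc_pos in sorted(self._index.get(digest, ())):
                hits.append((pos, doc, doc_pos))
        return hits


index = CloneIndex(chunk_length=3)
index.add("spec_a", "the system shall store all contracts")
index.add("spec_b", "each report lists open liabilities")

print(index.query("the tool shall store all contracts securely"))
# [(2, 'spec_a', 2), (3, 'spec_a', 3)]
```

Only the queried file is scanned; clones that exist purely inside the base are never enumerated, which matches the "not interested in all clones in base" point above.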
16
Scalability
Study (@Google)
• MapReduce
• Bigtable for index
• 73.2 MLOC (C/C++/Java)
Ultra Large
• 120 million C++ files
• 2,915,947,163 LOC
• Index creation on 1000 machines: 7 hours => Real scalability
17
Conclusion
Lessons Learned
• Many specifications contain a lot of cloning
• Negative impact on reading and inspection effort
• Indication for corresponding redundancy in source code
• Cloning not necessary – many specs contain none
Future Work
• How can cloning be avoided or removed?
• What are the causes and evolution of clones? Do they differ from code clones?
• Further empirical studies, e.g. on consequences for implementation
• How does POS normalization affect precision and recall?
• Can clone detection support test automation?
18
19
Terms
Requirements Specification [IEEE 830-1998]:
“specification for a particular software product, program, or set of
programs that performs certain functions in a specific environment.”
Clone
• Duplicated specification text of at least 20 words.
• Small differences (e.g. declension) are tolerated
• Must refer to specified system.
• False positives: e.g. page footers with copyright information
20
Research Questions
RQ1: How much cloning do requirements specifications contain?
RQ2: What kind of information is cloned in requirements specifications?
RQ3: What consequences does cloning in requirements specifications have?
RQ4: Can cloning in requirements specifications be detected accurately using existing clone detectors?
Out of Scope: Reasons for cloning or clone avoidance
21
Study Design
22
RQ2: Kind of Cloned Information
Cloning not limited to a specific kind of information
[Chart: number of times each kind of cloned information was seen]
23
Threats to Validity
Internal
• Pairs of researchers to reduce errors during manual steps
• Reading speeds for cloned vs. non-cloned text? Assumed similar; further research required
• Recall unclear. But: does not affect study results
External
• Substantial differences between requirements specifications (format, organization, language, …)
But: large number of study objects from different companies and domains
24
90 clone groups with length > 90 words
25
49 clone groups with cardinality > 10
26
Clone Types for NL Clones
Code
• Type 1: different whitespace, comments
• Type 2: different identifiers, literals
• Type 3: different statements
• Type 4: different syntax but similar semantics
NL
• Type 1: different whitespace
• Type 2: POS-preserving word replacements (e.g. adj → adj)
• Type 3: different sentences
• Type 4: different syntax but similar semantics