Model Comparison for Delta-Compression

Markus Scheidgen (@mscheidgen)

BigMDE at STAF 2016, Vienna

Upload: markus-scheidgen

Post on 18-Jan-2017



Agenda

▶ Motivation for Delta-Compression

▶ Model Comparison: Approaches

▶ Experiments

▶ Conclusions


Motivation – Delta-Compression

▶ What it is: Only store the differences of similar models

▶ Where do we have a lot of similar models:

■ Model Versioning

■ Model-based Mining of Source Repositories with reverse engineering

▶ Why: Storage space and (indirectly) execution time for persistence operations (I/O, etc.)


Model Versioning – Approaches

(or versioning in general)

▶ state-based: revisions r0, r1, r2, r3

■ e.g. models stored in regular version control systems (VCS)

▶ change-based (or operation-based): changes (+/-)

■ e.g. EMF-store; requires recording or inferring operations from the editing environment

■ or compare?

▶ hybrid (persist changes, appear state-based)

■ e.g. Git: you only see whole files, but internally it uses pack-files with delta-compression

1. Altmanninger, K., Seidl, M., Wimmer, M.: A survey on model versioning approaches. International Journal of Web Information Systems 5(3), 271–304 (2009)

Model Versioning – Architecture

[Figure: architecture with the labels user environment, interface, state representation, compression, and persistence, plus +/- change markers. Two annotations: "show existing diff" and "create diff".]

Model Comparison – Tradeoffs

▶ Comparison Quality

▶ Comparison Time

▶ Difference Model Usability

▶ Extraction Time

▶ Storage Space

Delta-Compression – Tradeoffs


Comparison Quality

priority for showing diffs to users [model comparison]

priority for using diffs in persistence [compression]

Model Comparison

▶ We have known how to compare lists (e.g. lines of code) for a long time

■ Myers' algorithm: O(ND)

▶ Models aren't lists, but graphs (with spanning trees)

■ but each feature in each model element is a "list" of values

■ we can compare two model elements, feature by feature

■ but which pairs of elements should we compare?

■ we need a prior step to establish pairs of (supposedly) matching elements

Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(1-4), 251–266 (1986)
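To illustrate the feature-by-feature comparison of one matched pair, here is a minimal Python sketch. It assumes, purely for illustration, that a model element is a dict mapping feature names to lists of values; `myers_edit_distance` and `compare_elements` are hypothetical names and not part of any framework mentioned in the talk. The distance function follows the greedy O(ND) scheme from Myers' paper and returns only the number of insertions and deletions, not the edit script itself.

```python
def myers_edit_distance(a, b):
    """Number of insertions plus deletions needed to turn sequence a into b."""
    n, m = len(a), len(b)
    # furthest[k] is the largest x reached so far on diagonal k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # take an element from b (insertion)
            else:
                x = furthest[k - 1] + 1    # skip an element of a (deletion)
            y = x - k
            # Follow the diagonal while both sequences agree.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


def compare_elements(left, right):
    """Per feature, the edit distance between the value lists of a matched pair.

    Assumption: an element is a dict of feature name -> list of values.
    """
    features = sorted(set(left) | set(right))
    return {f: myers_edit_distance(left.get(f, []), right.get(f, []))
            for f in features}
```

For example, `compare_elements({"name": ["foo"], "params": ["int", "String"]}, {"name": ["foo"], "params": ["int"]})` reports no change for `name` and one edit for `params`. Which two elements get paired in the first place is the job of the matching step.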

Model Matching

[Figure: model 1 and model 2, connected by matches, which lead to differences]

▶ Matching determines the quality of the comparison

▶ Different strategies for matching model elements

■ signatures (cheap): just meta-class, [name, parameter types], parent

■ similarity (expensive): lots of different criteria and heuristics
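A signature-based matcher can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` class, its fields, and the `match` function are hypothetical and do not reproduce the API of EMF Compare or EMF-Compress. The signature is built from the ingredients named above (meta-class, name, parameter types, parent), and elements with equal signatures are paired through a hash lookup, which is why this strategy is cheap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical model element; real EMF objects look different."""
    meta_class: str
    name: Optional[str] = None
    parameter_types: tuple = ()
    parent: Optional["Element"] = None


def signature(element):
    """Meta-class, name, parameter types, and the parent's signature."""
    parent_sig = signature(element.parent) if element.parent is not None else None
    return (element.meta_class, element.name, element.parameter_types, parent_sig)


def match(old_elements, new_elements):
    """Pair elements with equal signatures; report the unmatched rest."""
    index = {signature(e): e for e in old_elements}
    pairs, added = [], []
    for e in new_elements:
        partner = index.pop(signature(e), None)
        if partner is None:
            added.append(e)            # no counterpart in the old model
        else:
            pairs.append((partner, e))
    removed = list(index.values())     # old elements nobody claimed
    return pairs, removed, added
```

A renamed method has a different signature, so this sketch reports it as one removed and one added element. A similarity-based matcher could pair the two, at the price of comparing many candidate pairs.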

Comparison Representation

[Figure: model 1 and model 2, linked by matches and differences]

Comparison Representation for Compression

model 1 + Δ(1,2) = model 2

model 2 + Δ(2,3) = model 3

...

model n + Δ(n,n+1) = model n+1
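The chain of deltas can be sketched as follows. This is a minimal Python illustration, not the EMF-Compress difference meta-model: it assumes, hypothetically, that a model is a flat dict from element identifier to value, and the names `diff`, `patch`, `compress`, and `restore` are made up for this example. Only the first model is stored in full; every later revision is recovered by patching.

```python
_MISSING = object()  # sentinel so that a stored value of None is handled correctly


def diff(old, new):
    """Delta that turns old into new: entries to set and keys to delete."""
    return {
        "set": {k: v for k, v in new.items() if old.get(k, _MISSING) != v},
        "delete": [k for k in old if k not in new],
    }


def patch(model, delta):
    """Apply a delta to a model and return the resulting model."""
    result = {k: v for k, v in model.items() if k not in delta["delete"]}
    result.update(delta["set"])
    return result


def compress(models):
    """Store the first model fully and each later one as a delta to its predecessor."""
    deltas = [diff(a, b) for a, b in zip(models, models[1:])]
    return models[0], deltas


def restore(base, deltas, index):
    """Rebuild the model at position index by applying the first index deltas."""
    model = base
    for delta in deltas[:index]:
        model = patch(model, delta)
    return model
```

Restoring revision n costs n patch operations in this sketch, which is one reason why the time needed to apply a delta matters as much as its size.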

EMF-Compress

▶ We built a comparison framework for compression

■ signature-based matching

■ difference meta-model that allows patching

▶ http://github.com/markus1978/emf-compress


Experiments

▶ Reverse-engineered Git repositories with Java code using MoDisco

▶ Eclipse Foundation sources, i.e. the Eclipse platform and plug-ins

▶ organized in different projects: JDT, CDT, EMF, ...

▶ available via GitHub

▶ Git repositories can be gathered automatically via GitHub’s REST API

▶ We used the 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB (binary) models with 4 billion objects.

Experiment 1: EMF-Compare vs EMF-Compress

▶ Only the first 1000 revisions of the 100 largest repositories (by Git size); only compilation units (CUs) with fewer than 20k elements: ~300k individual comparisons

[Figure: bar charts with the bar labels signature, similarity, lines, and parse. Panels: "Differences model size" (%), "Number of matches" (%), and "Avg. execution times (log)" as avg. time per compilation unit (ms).]

[Figure: execution time (s) over size (#objects, 0 to 20000). Panels: "Similarity-based" and "Signature-based"; the time axes range from 0 to 30 s and from 0.0 to 2.0 s.]

Experiment 2: Partial Comparison

▶ Problem: not all elements have a meaningful signature

▶ Two signature matching strategies:

■ Named-only: only match named elements, use equality for the contents of named elements

■ Meta-class: match named elements based on their signature, match the contents of named elements based on their parent and meta-class only
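The two strategies differ only in which elements receive a signature. The following Python sketch makes that difference concrete; it assumes, for illustration only, that an element is a dict with the hypothetical keys `meta_class`, `name`, and `parent` (a parent identifier), and it is not taken from the EMF-Compress sources.

```python
def named_only_signature(element):
    """Named-only: unnamed elements get no signature and are never matched.

    Their content would be compared by equality as part of the named element.
    """
    if element.get("name") is None:
        return None
    return (element["meta_class"], element["name"], element.get("parent"))


def meta_class_signature(element):
    """Meta-class: unnamed elements are matched by parent and meta-class only."""
    if element.get("name") is not None:
        return (element["meta_class"], element["name"], element.get("parent"))
    return (element["meta_class"], element.get("parent"))


def matchable_fraction(elements, signature_fn):
    """Share of elements that receive a signature under the given strategy."""
    if not elements:
        return 0.0
    with_signature = sum(1 for e in elements if signature_fn(e) is not None)
    return with_signature / len(elements)
```

Under the named-only strategy, statements and expressions inside a method body contribute nothing to the set of matched elements, whereas the meta-class strategy gives every element a signature, possibly an ambiguous one when several unnamed siblings share a meta-class.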


[Figure: bar charts per repository (cdt, cdo, ...ompare, emf, ...e.core) with the series Delta, UC, and Full. Panels: "Size - Named-only Matcher" (GB), "Size - Meta Class Matcher" (GB), "Lines" (MLines), "All vs Matched - Named-only Matcher" (MObjects), "All vs Matched - Meta Class Matcher" (MObjects), and "All vs Matched Lines" (MLines).]

Discussion

▶ Only reverse-engineered Java models were used; different results are possible for other meta-models

▶ Similarity-based matching != similarity-based matching: more evaluation with different qualities of similarity-based matching is necessary

▶ Signatures are not necessarily meta-model independent

▶ We need a better understanding of the relationship between comparison runtime and comparison quality

Conclusions

▶ EMF Compare is tailored for difference model usability, not for model compression

■ execution times are too long

■ the representation of differences does not suit compression

▶ EMF-Compress is an alternative: http://github.com/markus1978/emf-compress

▶ Better analysis of matching strategies necessary:

■ evaluation of more matching strategies

■ evaluation with models in different languages


Example Use-Case: Model-based MSR

[Figure: repository revisions R0 … RHEAD alongside the labels MS, M, MM, {C}, {C0}, {Cn-1}, and {Cn}.]

Model-based Mining of Software Repositories

▶ MSR tools are already “model-based”, but in a proprietary manner

▶ Idea: use an existing reverse engineering framework with corresponding standard meta-models and modeling frameworks instead of proprietary solutions

▶ Goals

■ deal with heterogeneity (different version control systems, different languages)

■ reuse of existing meta-models, transformations, and languages

■ interoperability with existing analysis tools

■ retaining meaningful scalability

Model-based Mining of Software Repositories

▶ Scope

■ depends on the concrete MSR application and its goals

■ number of software projects: single repositories, large repositories, ultra-large repositories

■ sources as text and text-based metrics, e.g. LOC

■ declarations only: packages, classes, methods, but no statements, expressions, etc.

■ full AST with or without cross-references


Reverse Engineering with MoDisco

▶ Model Discovery

▶ reverse engineering for Java, based on EMF

▶ discovery, i.e. finding sources (so-called compilation units) within projects, source folders, and packages

▶ uses Eclipse’s workspace and Java Development Tools (JDT)

▶ provides

■ discoverers for many languages: Java, Xtext, JSP, XML

■ instances of a Java EMF meta-model that corresponds to the handwritten JDT AST model

■ transformations to language-independent artifacts, e.g. KDM

From Source Code- to Model-Repository

[Figure: overview with the labels version control system (A1 B1, A1-A2, B1-B3), snapshots (A1 B1; A2 B1; A2 B3), models M1, M2, M3, and f / B.f. Operations shown: Checkout(r), d2CUs(r), Parse(d), Analysis(r), Merge(r), Load(r), Save(r). Expressions over R and CUs combine these steps: "Checkout + Parse+Analysis", "Checkout + (Parse+Merge) +Analysis", "(Load+Merge) +Analysis", and "(Load+Analysis)".]

Model-based MSR Strategies

[Figure, built up over several slides: a version control system (A1 B1, A1-A2, B1-B3) is checked out with Checkout(r) into snapshots (A1 B1; A2 B1; A2 B3). d2CUs(r) and Parse(d) lead to the model snapshots M1, M2, M3, on which Analysis(r) runs. Later builds add Merge(r) with the labels f and B.f, and finally Load(r) and Save(r).]

Importing and Traversing Source Code Repositories

[Figure, animated over several slides: revisions R1, R2, R3 are first imported (persistent/storage) and then traversed (transient/runtime). Labels include A1, A2, B1, B3, f, and B.f, with the markers ✓, ✗, ?, and !.]