
Page 1: Icse 2013-tutorial-data-science-for-software-engineering

ICSE’13 Tutorial: Data Science for Software Engineering

Tim Menzies, West Virginia University
Ekrem Kocaguneli, West Virginia University
Fayola Peters, West Virginia University
Burak Turhan, University of Oulu
Leandro L. Minku, The University of Birmingham

ICSE 2013, May 18th - 26th, 2013, San Francisco, CA
http://bit.ly/icse13tutorial

Page 2: Icse 2013-tutorial-data-science-for-software-engineering

Who we are…


Tim Menzies, West Virginia University, [email protected]

Ekrem Kocaguneli, West Virginia University, [email protected]

Fayola Peters, West Virginia University, [email protected]

Burak Turhan, University of Oulu, [email protected]

Leandro L. Minku, The University of Birmingham, [email protected]

Page 3: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 4: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 5: Icse 2013-tutorial-data-science-for-software-engineering

What can we share?

• Two software project managers meet
  – What can they learn from each other?
• They can share:
  1. Data
  2. Models
  3. Methods
     • techniques for turning data into models
  4. Insight into the domain
• The standard mistake
  – Generally assumed that models can be shared, without modification.
  – Yeah, right…

Page 6: Icse 2013-tutorial-data-science-for-software-engineering

SE research = sparse sample of a very diverse set of activities

Microsoft Research, Redmond, Building 99

Other studios, many other projects

And they are all different.

Page 7: Icse 2013-tutorial-data-science-for-software-engineering

Models may not move (effort estimation)

• 20 samples, each of 66% of the NASA effort data
• Linear regression on each sample to learn effort = a · LOC^b · Σ_i (β_i · x_i)
• Back-select to remove useless x_i
• Result?
  – Wide variance in the β_i

* T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and Effort Estimation," IEEE TSE pre-print, 2012. http://menzies.us/pdf/12gense.pdf

Page 8: Icse 2013-tutorial-data-science-for-software-engineering

Models may not move (defect prediction)

* T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and Effort Estimation," IEEE TSE pre-print, 2012. http://menzies.us/pdf/12gense.pdf

Page 9: Icse 2013-tutorial-data-science-for-software-engineering

Oh woe is me

• No generality in SE?
• Nothing we can learn from each other?
• Forever doomed to never make a conclusion?
  – Always, laboriously, tediously, slowly, learning specific lessons that hold only for specific projects?
• No: 3 things we might want to share
  – Models, methods, data
• If no general models, then
  – Share methods
    • general methods for quickly turning local data into local models
  – Share data
    • Find and transfer relevant data from other projects to us

Page 10: Icse 2013-tutorial-data-science-for-software-engineering

The rest of this tutorial

• Data science
  – How to share data
  – How to share methods
• Maybe one day, in the future,
  – after we’ve shared enough data and methods
  – We’ll be able to report general models
  – ICSE 2020?
• But first,
  – Some general notes on data mining

Page 11: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 12: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 13: Icse 2013-tutorial-data-science-for-software-engineering

The great myth

• Let’s face it:
  – Humans are a pest
  – And experts doubly so.
• “The notion of ‘user’ cannot be precisely defined and therefore has no place in CS and SE”
  – Edsger Dijkstra, ICSE 4, 1979
• http://en.wikipedia.org/wiki/List_of_cognitive_biases
  – 96 decision-making, belief and behavioral biases
    • Attentional bias: paying more attention to emotionally dominant stimuli in one’s environment, and neglecting relevant data
  – 23 social biases
    • Worse-than-average effect: believing we are worse than others at tasks which are difficult
  – 52 memory errors and biases
    • Illusory correlation: inaccurately remembering a relationship between two events

Page 14: Icse 2013-tutorial-data-science-for-software-engineering

The great myth

• Wouldn’t it be wonderful if we did not have to listen to them
  – The dream of olde worlde machine learning, circa 1980s
  – Dispense with live experts and resurrect dead ones.
• But any successful learner needs biases
  – Ways to know what’s important
    • What’s dull
    • What can be ignored
  – No bias? Can’t ignore anything
    • No summarization
    • No generalization
    • No way to predict the future

Page 15: Icse 2013-tutorial-data-science-for-software-engineering

Christian Bird, data miner, Microsoft Research, Redmond

• Microsoft Research, Redmond
  – Assesses learners by “engagement”

A successful “Bird” session:
• Knowledge engineers enter with sample data
• Users take over the spreadsheet
• Run many ad hoc queries
• In such meetings, users often…
  – demolish the model
  – offer more data
  – demand you come back next week with something better

Expert data scientists spend more time with users than algorithms

Page 16: Icse 2013-tutorial-data-science-for-software-engineering

Also: Users control budgets

• Why talk to users?
  – Cause they own the wallet
• As the Mercury astronauts used to say
  – No bucks, no Buck Rogers
• We need to give users a sense of comfort that we know what we are doing
  – That they are part of the process
  – That we understand their problem and processes
  – Else, budget = $0

Page 17: Icse 2013-tutorial-data-science-for-software-engineering

The Inductive Engineering Manifesto

• Users before algorithms:
  – Mining algorithms are only useful in industry if users fund their use in real-world applications.
• Data science
  – Understanding user goals to inductively generate the models that most matter to the user.

• T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaguneli. The inductive software engineering manifesto. MALETS '11.

Page 18: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 19: Icse 2013-tutorial-data-science-for-software-engineering

Algorithms are only part of the story

• Drew Conway, The Data Science Venn Diagram, 2009, http://www.dataists.com/2010/09/the-data-science-venn-diagram/

• Dumb data miners miss important domain semantics.
• An ounce of domain knowledge is worth a ton of algorithms.
• Math and statistics only get you machine learning.
• Science is about discovery and building knowledge, which requires some motivating questions about the world.
• The culture of academia does not reward researchers for understanding domains.

Page 20: Icse 2013-tutorial-data-science-for-software-engineering

Case Study #1: NASA

• NASA’s Software Engineering Lab, 1990s
  – Gave free access to all comers to their data
  – But you had to come to get it (to learn the domain)
  – Otherwise: mistakes
• E.g. one class of software module with far more errors than anything else.
  – Dumb data mining algorithms might learn that this kind of module is inherently more error prone
• Smart data scientists might question “what kind of programmer works on that module?”
  – A: we always give that stuff to our beginners as a learning exercise

* F. Shull, M. Mendonça, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.

Page 21: Icse 2013-tutorial-data-science-for-software-engineering

Case Study #2: Microsoft

• Distributed vs. centralized development
• Who owns the files?
  – Who owns the files with most bugs?
• Result #1 (which was wrong)
  – A very small number of people produce most of the core changes to a “certain Microsoft product”.
  – Kind of an uber-programmer result
  – I.e. given thousands of programmers working on a project
    • Most are just re-arranging deck chairs
    • To improve software process, ignore the drones and focus mostly on the queen bees
• WRONG:
  – Microsoft does much auto-generation of intermediary build files.
  – And only a small number of people are responsible for the builds
  – And that core build team “owns” those auto-generated files
  – Skewed the results. Sent us down the wrong direction
    • Needed to spend weeks/months understanding build practices BEFORE doing the defect studies

* E. Kocaguneli, T. Zimmermann, C. Bird, N. Nagappan, T. Menzies. Distributed Development Considered Harmful? ICSE 2013 SEIP Track, San Francisco, CA, USA, May 2013.

Page 22: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 23: Icse 2013-tutorial-data-science-for-software-engineering

You go mining with the data you have—not the data you might want

• In the usual case, you cannot control data collection.
  – For example, data mining at NASA, 1999–2008:
    • Information collected from layers of sub-contractors and sub-sub-contractors.
    • Any communication to data owners had to be mediated by up to a dozen account managers, all of whom had much higher priority tasks to perform.
• Hence, we caution that usually you must:
  – Live with the data you have, or dream of accessing at some later time.

Page 24: Icse 2013-tutorial-data-science-for-software-engineering

Rinse before use

• Data quality tests (*)
  – Linear-time checks for (e.g.) repeated rows
• Column and row pruning for tabular data
  – Bad columns contain noise, irrelevancies
  – Bad rows contain confusing outliers
  – Repeated results:
    • Signal is a small nugget within the whole data
    • R rows and C columns can be pruned back to R/5 and C^0.5 without losing signal

* M. Shepperd, Q. Song, Z. Sun, C. Mair, "Data Quality: Some Comments on the NASA Software Defect Data Sets," IEEE TSE, 2013, pre-prints

Page 25: Icse 2013-tutorial-data-science-for-software-engineering

e.g. NASA effort data

NASA data: most projects are highly complex, i.e. there is no information in saying “complex”.

The more features we remove for smaller projects, the better the predictions.

Page 26: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 27: Icse 2013-tutorial-data-science-for-software-engineering

Do it again, and again, and again, and …

In any industrial application, data science is repeated multiple times: to answer an extra user question, to make some enhancement and/or bug fix to the method, or to deploy it to a different set of users.

Page 28: Icse 2013-tutorial-data-science-for-software-engineering

Thou shall not click

• For serious data science studies, to ensure repeatability, the entire analysis should be automated using some high-level scripting language; e.g. R-script, Matlab, Bash, …

Page 29: Icse 2013-tutorial-data-science-for-software-engineering

The feedback process


Page 30: Icse 2013-tutorial-data-science-for-software-engineering

The feedback process


Page 31: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 32: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 33: Icse 2013-tutorial-data-science-for-software-engineering


How to Solve Lack or Scarcity of Local Data

Page 34: Icse 2013-tutorial-data-science-for-software-engineering


What are my options?

Isn’t local (within) data better?
• It may not be available
• It may be scarce
• Tedious data collection effort
• Too slow to collect

The verdict with global (cross) data?
• Effort estimation [1]: no clear winners, either way
• Defect prediction [2]: can use global data as a stopgap

LOCAL vs. GLOBAL

1 Barbara A. Kitchenham, Emilia Mendes, Guilherme Horta Travassos: Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Trans. Software Eng. 33(5): 316-329 (2007)

2 B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.

Page 35: Icse 2013-tutorial-data-science-for-software-engineering


Comparing options

• For NASA data
  – Seven test sets from 10% of each source
• Treatment CC (using global)
  – Train on the 6 other data sets
• Treatment WC (using local)
  – Train on the remaining 90% of the local data

Page 36: Icse 2013-tutorial-data-science-for-software-engineering

NN-filtering

Step 1: Calculate the pairwise Euclidean distances between the local (test) set and the candidate (global) training set.

Step 2: For each test datum, pick its k nearest neighbors from the global set.

Step 3: Pick the unique instances from the union of those selected across the whole local set to construct the final training set.

Now, train your favorite model on the filtered training set! (A sketch of these steps follows below.)

B. Turhan, A. Bener, and T. Menzies, “Nearest Neighbor Sampling for Cross Company Defect Predictors”, in Proceedings of the 1st International Workshop on Defects in Large Software Systems (DEFECTS 2008), pp. 26, 2008.
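As a concrete illustration, here is a minimal sketch of the three steps in Python (assuming NumPy arrays; the function and parameter names are ours, not from the paper):

import numpy as np

def nn_filter(local_X, global_X, k=10):
    # Steps 1 and 2: for each local (test) row, find its k nearest
    # global (candidate) rows by Euclidean distance.
    picked = set()
    for row in local_X:
        d = np.linalg.norm(global_X - row, axis=1)
        picked.update(np.argsort(d)[:k].tolist())
    # Step 3: the union, without repeats, indexes the filtered training set.
    return sorted(picked)  # row indices into global_X (and its labels)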

Page 37: Icse 2013-tutorial-data-science-for-software-engineering

More Comparisons: PD

• For NASA data
  – Seven test sets from 10% of each source
• Treatment CC (using global)
  – Train on the 6 other data sets
• Treatment WC (using local)
  – Train on the remaining 90% of the local data
• Treatment NN (using global + NN)
  – Initialize the train set with the 6 other data sets,
  – then prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats)

B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.

Page 38: Icse 2013-tutorial-data-science-for-software-engineering

More Comparisons: PF

• For NASA data
  – Seven test sets from 10% of each source
• Treatment CC (using global)
  – Train on the 6 other data sets
• Treatment WC (using local)
  – Train on the remaining 90% of the local data
• Treatment NN (using global + NN)
  – Initialize the train set with the 6 other data sets,
  – then prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats)

B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.

Page 39: Icse 2013-tutorial-data-science-for-software-engineering

External Validity

• For SOFTLAB data
  – Three test sets from embedded systems
• Treatment CC (using global)
  – Train on the seven NASA data sets
• Treatment WC (using local)
  – Train on the remaining two local data sets
• Treatment NN (using global + NN)
  – Initialize the train set with the 7 NASA data sets,
  – then prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats)

B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.

Page 40: Icse 2013-tutorial-data-science-for-software-engineering


Page 41: Icse 2013-tutorial-data-science-for-software-engineering

“Theories can be learned from a very small sample of the available data”

Microsampling

• Given N defective modules:
  – M = {25, 50, 75, ...} <= N
  – Select M defective and M defect-free modules.
  – Learn theories on the 2M instances
• Undersampling: M = N
• 8/12 datasets -> M = 25
• 1/12 datasets -> M = 75
• 3/12 datasets -> M = {200, 575, 1025}

T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, “Implications of Ceiling Effects in Defect Predictors”, in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE 2008), pp. 47-54, 2008.
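A minimal sketch of microsampling in Python (names are ours; assumes a NumPy label vector y with 1 = defective):

import numpy as np

def microsample(y, m, seed=0):
    # Pick m defective (y == 1) and m defect-free (y == 0) row indices;
    # a learner is then trained on just these 2m instances.
    rng = np.random.default_rng(seed)
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    m = min(m, len(defective), len(clean))  # keep M <= N
    return np.concatenate([rng.choice(defective, m, replace=False),
                           rng.choice(clean, m, replace=False)])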

Page 42: Icse 2013-tutorial-data-science-for-software-engineering

How about mixing local and global?

• Is it feasible to use additional data from other projects?
  – (Case 1) When there is limited local project history, i.e. no prior releases
  – (Case 2) When there is existing local project history, i.e. many releases over some period

B. Turhan, A. T. Mısırlı, A. Bener, “Empirical Evaluation of The Effects of Mixed Project Data on Learning Defect Predictors”, (in print) Journal of Information and Software Technology, 2013

• For 73 versions of 41 projects
  – Reserve test sets from 10% of each project
  – Additional test sets if the project has multiple releases
• Treatment WP (using local)
  – Train on 10%..90% of the local data
  – Train on the previous releases
• Treatment WP+CP (using global)
  – Enrich the training sets above with NN-filtered data from all other projects

Case 1: WP(10%) + CP is as good as WP(90%)
Case 2: WP+CP is significantly better than WP (with small effect size)

Page 43: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 44: Icse 2013-tutorial-data-science-for-software-engineering


How to Prune Data, Simpler and Smarter

Data is the new oil

Page 45: Icse 2013-tutorial-data-science-for-software-engineering


And it has a cost too

e.g. $1.5M spent by NASA in the period 1987 to 1990 to understand the historical records of all their software in support of the planning activities for the International Space Station [1]

Do we need to discuss all the projects and all the features in a client meeting or in a Delphi session?

Similarly, do we need all the labels for supervised methods?

[1] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy, “Active learning and effort estimation: Finding the essential content of software effort estimation data,” IEEE Trans. on Softw. Eng., vol. Preprints, 2013.

Page 46: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning: concepts of E(k) matrices and popularity…

Let’s see it in action: point to the person closest to you.

Page 47: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning: instance pruning and synonym pruning

Instance pruning (we want the instances that are similar to others; see the sketch below):
1. Calculate the “popularity” of instances
2. Sort by popularity
3. Label one instance at a time
4. Find the stopping point
5. Return the closest neighbor from the active pool as the estimate

Synonym pruning (we want the dissimilar features, those that are unlike the others):
1. Calculate the popularity of features
2. Select non-popular features
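A rough sketch of the popularity idea in Python (our names; popularity of a row is read here as how often it is another row’s nearest neighbor, i.e. the E(k) idea with k = 1):

import numpy as np

def popularity(X):
    # Popularity of a row = how often it is some other row's nearest
    # neighbor (Euclidean distance).
    votes = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf               # a row cannot vote for itself
        votes[np.argmin(d)] += 1    # one vote for row i's nearest neighbor
    return votes                    # label instances in decreasing vote order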

Page 48: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning: finding the stopping point

Stop asking for labels if one of these rules fires:
• All popular instances are exhausted.
• Or there is no MRE improvement for n consecutive times (MRE, the magnitude of relative error, = abs(actual − predicted)/actual).
• Or the ∆ between the best and the worst error of the last n times is very small (∆ = 0.1; n = 3).

Page 49: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning: QUICK, an active learning solution (i.e. unsupervised)

Instances are labeled, at a cost, by the expert.
• We want to stop before all the instances are labeled.

Page 50: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning

• Picking random training instances is not a good idea.
• More popular instances in the active pool decrease error.
• One of the stopping point conditions fires.

[Chart: X-axis: instances sorted in decreasing popularity; Y-axis: median MRE.]

Page 51: Icse 2013-tutorial-data-science-for-software-engineering

Data for Industry / Active Learning

At most 31% of all the cells are needed; on median, 10%.

Intrinsic dimensionality: There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data is not truly high-dimensional [1]

[1] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.

Page 52: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 53: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods

Case-based reasoning (CBR) methods make use of similar past projects for estimation.

They are very widely used because [1]:
• No model calibration to local data
• Can better handle outliers
• Can work with 1 or more attributes
• Easy to explain

Two promising research areas:
• weighting the selected analogies [2]
• improving design options [3]

[1] F. Walkerden and R. Jeffery, “An empirical study of analogy-based software effort estimation,” Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999.
[2] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[3] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
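For reference, the core of a simple CBR/ABE estimator fits in a few lines of Python (a sketch with our own names; k = 3 and Euclidean distance are common defaults, not prescriptions from the papers above):

import numpy as np

def abe_estimate(past_X, past_effort, query, k=3):
    # Find the k past projects most similar to the query project
    # (Euclidean distance) and return their mean effort.
    d = np.linalg.norm(past_X - query, axis=1)
    analogies = np.argsort(d)[:k]
    return past_effort[analogies].mean()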

Page 54: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods

a) Weighting analogies [3]

Building on the previous research [1], we adopted two different strategies [2]: we used kernel weighting to weigh the selected analogies, then compared the performance of each k-value with and without weighting. In none of the scenarios did we see a significant improvement.

[1] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[2] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
[3] E. Kocaguneli, T. Menzies, and J. W. Keung, "Kernel methods for software effort estimation," Empirical Software Engineering 18.1 (2013): 1–24.

Page 55: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods

b) Designing ABE methods

Easy-path: remove training instances that violate assumptions. TEAK will be discussed later; D-ABE is built on the theoretical maximum prediction accuracy (TMPA) [1].

D-ABE:
• Get the best estimates of all training instances.
• Remove all the training instances within half of the worst MRE (acc. to TMPA).
• Return the closest neighbor’s estimate to the test instance.

[Figure: a test instance among training instances a-f; those close to the worst MRE are removed, and the closest remaining neighbor’s estimate is returned.]

[1] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.

Page 56: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods

D-ABE comparison to static k w.r.t. MMRE
D-ABE comparison to static k w.r.t. win, tie, loss

Page 57: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods / Using CBR for cross-company learning

Finding enough local training data is a fundamental problem [1]. The merits of using cross-data from another company are questionable [2], and there are similar amounts of evidence for and against the performance of cross-data [3, 4]. We use a relevancy filtering method called TEAK on public and proprietary data sets.

[1] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[2] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011.
[3] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: A large scale experiment on data vs. domain vs. process,” ESEC/FSE, pp. 91–100, 2009.

Page 58: Icse 2013-tutorial-data-science-for-software-engineering

How to Advance Simple CBR Methods / Using CBR for cross-company learning

Cross data works as well as within data for 6 out of 8 proprietary data sets and 19 out of 21 public data sets after TEAK’s relevancy filtering [1].

How TEAK filters:
• Similar projects with dissimilar effort values: high variance.
• Similar projects with similar effort values: low variance.
• Build a second GAC tree with the low-variance instances.
• Return the closest neighbor’s value from the lowest-variance region.

In summary: the design options of CBR help, but not fiddling with single instances and weights!

[1] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011.

Page 59: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 60: Icse 2013-tutorial-data-science-for-software-engineering

Is Data Sharing Worth the Risk to Individual Privacy?

What would William Weld say?
• Former Governor of Massachusetts.
• Victim of a re-identification privacy breach.
• It led to sensitive attribute disclosure of his medical records.

Page 61: Icse 2013-tutorial-data-science-for-software-engineering

Is Data Sharing Worth the Risk to Individual Privacy?

What about NASA contractors?
• Subject to competitive bidding every 2 years.
• Unwilling to share data that would lead to sensitive attribute disclosure,
  e.g. actual software development times.

Page 62: Icse 2013-tutorial-data-science-for-software-engineering

When To Share – How To Share

So far we cannot guarantee 100% privacy. What we have is a directive as to whether data is private and useful enough to share...

We have a lot of privacy algorithms geared toward minimizing risk. Old school: k-anonymity, l-diversity, t-closeness.

But what about maximizing benefits (utility)? The degree of risk to the data sharing entity must not exceed the benefits of sharing.

Page 63: Icse 2013-tutorial-data-science-for-software-engineering
Page 64: Icse 2013-tutorial-data-science-for-software-engineering

Balancing Privacy and Utility, or...

Minimize the risk of privacy disclosure while maximizing utility:
instance selection with CLIFF + small random moves with MORPH
= CLIFF + MORPH

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 65: Icse 2013-tutorial-data-science-for-software-engineering

CLIFF: Don't share all the data.

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 66: Icse 2013-tutorial-data-science-for-software-engineering

CLIFF: Don't share all the data.

CLIFF step 1: for each class, find the ranks of all values.
E.g. "a=r1" is powerful for selecting class=yes: it is more common in "yes" than in "no".

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 67: Icse 2013-tutorial-data-science-for-software-engineering

CLIFF: Don't share all the data.

CLIFF step 2: multiply the ranks of each row.
E.g. "a=r1" is powerful for selecting class=yes: it is more common in "yes" than in "no".

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 68: Icse 2013-tutorial-data-science-for-software-engineering

CLIFF: Don't share all the data.

CLIFF step 3: select the most powerful rows of each class.

Scalability: note this runs in linear time. It can reduce N rows to 0.1N, so an O(N²) NUN algorithm now takes time O(0.01·N²).

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
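The three steps can be sketched in Python as follows (a simplification of the paper's ranking; the "power" score below is one plausible choice, and all names are ours):

import numpy as np

def cliff(rows, labels, keep=0.2):
    # Step 1: score each attribute value by how much more often it occurs
    # within its row's class than overall (the value's "power").
    # Step 2: score each row by the product of its values' scores.
    # Step 3: keep only the highest-scoring rows of each class.
    rows, labels = np.asarray(rows), np.asarray(labels)
    kept = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        score = np.ones(len(members))
        for col in range(rows.shape[1]):
            col_vals = rows[members, col]
            for j, v in enumerate(col_vals):
                p_in = np.mean(col_vals == v)
                p_all = np.mean(rows[:, col] == v)
                score[j] *= p_in / p_all
        top = np.argsort(-score)[:max(1, int(keep * len(members)))]
        kept.extend(members[top])
    return sorted(kept)  # indices of the most "powerful" rows to share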

Page 69: Icse 2013-tutorial-data-science-for-software-engineering

MORPH: Push the CLIFF data from their original position.

y = x ± (x − z) · r

where x ∈ D is the original instance, z ∈ D is the NUN (nearest unlike neighbor) of x, and y is the resulting MORPHed instance.

F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th International Conference on, june 2012, pp. 189 –199.F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
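A minimal sketch of the equation in Python (our names; the range of the random factor r is an assumption: the papers tune it so that instances do not cross the boundary to their NUN):

import numpy as np

def morph(X, labels, r_lo=0.15, r_hi=0.35, seed=0):
    # y = x ± (x − z) · r, where z is x's nearest unlike neighbor (NUN):
    # each row takes a small random step relative to the class boundary.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = X.copy()
    for i, x in enumerate(X):
        unlike = X[labels != labels[i]]
        z = unlike[np.argmin(np.linalg.norm(unlike - x, axis=1))]  # NUN of x
        r = rng.uniform(r_lo, r_hi)
        out[i] = x + rng.choice([-1.0, 1.0]) * (x - z) * r
    return out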

Page 70: Icse 2013-tutorial-data-science-for-software-engineering

Case Study: Cross-Company Defect Prediction (CCDP). Sharing required.

Zimmermann et al.: local data is not always available
• companies are too small
• the product is in its first release, so there is no past data

Kitchenham et al.:
• no time for collection
• new technology can make all data irrelevant

T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.” in ESEC/SIGSOFT FSE’09, 2009.
B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007.

– Company B has little or no data to build a defect model;
– Company B uses data from Company A to build defect models.

Page 71: Icse 2013-tutorial-data-science-for-software-engineering

CCDP: better with data filtering

Initial results with cross-company defect prediction were negative (Zimmermann FSE '09) or inconclusive (Kitchenham TSE '07). More recent work shows better results: Turhan et al. 2009 (the Burak filter).

B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, pp. 540–578, 2009.F. Peters, T. Menzies, and A. Marcus, “Better Cross Company Defect Prediction,” Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on, (to appear)

Page 72: Icse 2013-tutorial-data-science-for-software-engineering

Making Data Private for CCDP

Here is how we look at the data. Terms:
• Non-Sensitive Attribute (NSA)
• Sensitive Attribute
• Class Attribute

Page 73: Icse 2013-tutorial-data-science-for-software-engineering

Measuring the Risk: IPR = Increased Privacy Ratio

Query | Original | Privatized | Privacy Breach
Q1    | 0        | 0          | yes
Q2    | 0        | 1          | no
Q3    | 1        | 1          | yes

yes = 2/3; IPR = 1 − 2/3 = 0.33

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 74: Icse 2013-tutorial-data-science-for-software-engineering

Measuring the Utility: the g-measure

Probability of detection (pd), probability of false alarm (pf):

                  Actual
                  yes   no
Predicted  yes    TP    FP
           no     FN    TN

pd = TP/(TP+FN)
pf = FP/(FP+TN)
g-measure = 2·pd·(1−pf) / (pd + (1−pf))

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
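In code, the definitions above amount to (a minimal sketch; counts come from the confusion matrix):

def g_measure(tp, fp, fn, tn):
    pd = tp / (tp + fn)          # probability of detection
    pf = fp / (fp + tn)          # probability of false alarm
    return 2 * pd * (1 - pf) / (pd + (1 - pf))

# e.g. g_measure(40, 10, 20, 30): pd ~= 0.67, pf = 0.25, g ~= 0.71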

Page 75: Icse 2013-tutorial-data-science-for-software-engineering

Making Data Private for CCDP: comparing CLIFF+MORPH to data swapping and k-anonymity

Data swapping (s10, s20, s40): a standard perturbation technique used for privacy. To implement:
• For each NSA, a certain percentage of the values are swapped with any other value in that NSA.
• For our experiments, these percentages are 10, 20 and 40.

k-anonymity (k2, k4): the Datafly algorithm. To implement:
• Make a generalization hierarchy.
• Replace values in the NSA according to the hierarchy.
• Continue until there are k or fewer distinct instances, and suppress them.

K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211.

L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.

Page 76: Icse 2013-tutorial-data-science-for-software-engineering

Making Data Private for CCDP: comparing CLIFF+MORPH to data swapping and k-anonymity

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 77: Icse 2013-tutorial-data-science-for-software-engineering

Making Data Private for CCDP: comparing CLIFF+MORPH to data swapping and k-anonymity

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 78: Icse 2013-tutorial-data-science-for-software-engineering

Making Data Private for CCDP

F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society

Page 79: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 80: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 81: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models

• Instability is the problem of not being able to elicit the same/similar results under changing conditions
  – e.g. data set, performance measure, etc.
• We will look at instability in 2 areas
  – Instability in Effort Estimation
  – Instability in Process

Page 82: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models / Instability in Effort

There is no agreed-upon best estimation method [1]. Methods change ranking w.r.t. conditions such as data sets and error measures [2].

Experimenting with: 90 solo methods, 20 public data sets, 7 error measures.

[1] M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007. [2] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.

Page 83: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models / Instability in Effort

1. Rank methods acc. to win, loss and win−loss values
2. δr is the max. rank change
3. Sort methods acc. to loss and observe the δr values

Page 84: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models / Instability in Effort

The previous evidence on assembling multiple methods in SEE is discouraging: Baker et al. [1], Kocaguneli et al. [2], and Khoshgoftaar et al. [3] failed to outperform solo methods.

Yet we have a set of superior methods to recommend: the top 13 methods are CART & ABE methods (1NN, 5NN). Assembling these solo methods may be a good idea.

[1] D. Baker, “A hybrid approach to expert and model-based effort estimation,” Master’s thesis, Lane Department of Computer Science and Electrical Engineering, West Virginia University, 2007. Available from https://eidr.wvu.edu/etd/documentdata.eTD?documentid=5443.
[2] E. Kocaguneli, Y. Kultur, and A. Bener, “Combining multiple learners induced on multiple datasets for software effort prediction,” in International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper.
[3] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, “Software quality analysis by combining multiple projects and learners,” Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.

Page 85: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models / Instability in Effort

Combine the top 2, 4, 8, 13 solo methods via mean, median and IRWM, then re-rank the solo and multi-methods together (see the sketch below).
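A sketch of the combination step in Python (our names; IRWM is read here as an inverse-ranked weighted mean, where the best of M methods gets weight M and the worst gets weight 1 - an assumption about its exact form):

import numpy as np

def combine(estimates, how="median"):
    # estimates: one row of predictions per solo method, best-ranked first.
    est = np.asarray(estimates, dtype=float)
    if how == "mean":
        return est.mean(axis=0)
    if how == "median":
        return np.median(est, axis=0)
    w = np.arange(len(est), 0, -1)   # IRWM weights: M, M-1, ..., 1
    return (w[:, None] * est).sum(axis=0) / w.sum()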

Page 86: Icse 2013-tutorial-data-science-for-software-engineering

Problems of SE Models / Instability in Process: Dataset Shift / Concept Drift

Candela JQ, Sugiyama M, Schwaighofer A, Lawrence ND (eds) (2009) Dataset shift in machine learning. The MIT Press, Cambridge, MA

Page 87: Icse 2013-tutorial-data-science-for-software-engineering


Dataset Shift: Covariate Shift

• Consider a size-based effort estimation model
  – Effective for projects within the traditional operational boundaries of a company
• What if a change impacts the products’ size:
  – new business domains
  – change in technologies
  – change in development techniques

Covariate shift: p(X_train) ≠ p(X_test)
[Figure: effort vs. size, before and after the change]

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 88: Icse 2013-tutorial-data-science-for-software-engineering

Dataset Shift: Prior Probability Shift

• Now, consider a defect prediction model…
• … and again, what if defect characteristics change:
  – Process improvement
  – More QA resources
  – Increased experience over time
  – Basically, you improve over time!

Prior probability shift: p(Y_train) ≠ p(Y_test)
[Figure: % defects vs. kLOC, before and after the change]

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 89: Icse 2013-tutorial-data-science-for-software-engineering


Dataset Shift: Usual Suspects

Sample Selection Bias & Imbalanced Data

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 90: Icse 2013-tutorial-data-science-for-software-engineering


Dataset Shift: Usual Suspects

Sample Selection Bias & Imbalanced Data

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 91: Icse 2013-tutorial-data-science-for-software-engineering

Dataset Shift

Domain shift
• Be consistent in the way you measure concepts for model training and testing!
• *: “…the metrics based assessment of a software system and measures taken to improve its design differ considerably from tool to tool.”

Source component shift
• a.k.a. data heterogeneity
• Ex: ISBSG contains data from 6000+ projects from 30+ countries.
• Where do the training data come from? vs. Where do the test data come from?

* Rüdiger Lincke, Jonas Lundberg, and Welf Löwe. “Comparing software metrics tools”, ISSTA '08

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Page 92: Icse 2013-tutorial-data-science-for-software-engineering

B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.

Managing Dataset Shift

Remedies: outlier ‘detection’, relevancy filtering, instance weighting, stratification, cost curves, mixture models.

Shift types: covariate shift, prior probability shift, sampling / imbalanced data, domain shift, source component shift.

Page 93: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 94: Icse 2013-tutorial-data-science-for-software-engineering

Envy = The WisDOM Of the COWs

• Seek the fence where the grass is greener on the other side.
• Learn from there.
• Test on here.
• Cluster to find “here” and “there”.

Page 95: Icse 2013-tutorial-data-science-for-software-engineering

DATA = MULTI-DIMENSIONAL VECTORS

@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh} …
@attribute equivphyskloc real
@attribute act_effort real

@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
…

Page 96: Icse 2013-tutorial-data-science-for-software-engineering

CAUTION: data may not divide neatly on raw dimensions

• The best description for SE projects may be synthesized dimensions extracted from the raw dimensions.

Page 97: Icse 2013-tutorial-data-science-for-software-engineering

Fastmap

Fastmap (Faloutsos [1995]): O(2N) generation of an axis of large variability.
• Pick any point W;
• Find X, the point furthest from W;
• Find Y, the point furthest from X.

Let c = dist(X,Y). Each point has distances a and b to X and Y respectively; then
• x = (a² + c² − b²)/(2c)
• y = sqrt(a² − x²)

Find median(x), median(y), and recurse on the four quadrants.
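The slide’s recipe in Python (a minimal sketch with our own names; returns each row’s coordinates relative to the synthesized axis):

import numpy as np

def fastmap_xy(X, seed=0):
    # One FastMap pass: O(2N) pivot hunt, then project every row onto
    # the pivot axis with the cosine-rule formulas from the slide.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    dist = lambda p: np.linalg.norm(X - p, axis=1)
    w = X[rng.integers(len(X))]          # pick any point W
    px = X[np.argmax(dist(w))]           # X: the point furthest from W
    py = X[np.argmax(dist(px))]          # Y: the point furthest from X
    a, b = dist(px), dist(py)
    c = np.linalg.norm(px - py)
    x = (a**2 + c**2 - b**2) / (2 * c)   # position along the X-Y axis
    y = np.sqrt(np.maximum(a**2 - x**2, 0.0))
    return x, y                          # split at median(x), median(y); recurse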

Page 98: Icse 2013-tutorial-data-science-for-software-engineering

Hierarchical partitioning

Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants

Prune:
• Combine quadtree leaves with similar densities
• Score each cluster by the median score of the class variable

Page 99: Icse 2013-tutorial-data-science-for-software-engineering

Q: Why cluster via FASTMAP?

• A1: Circular methods (e.g. k-means) assume round clusters, but density-based clustering allows clusters to be any shape.
• A2: No need to pre-set the number of clusters.
• A3: Because other methods (e.g. PCA) are much slower; Fastmap is O(2N), even in unoptimized Python.

Page 100: Icse 2013-tutorial-data-science-for-software-engineering


Learning via “envy”

Page 101: Icse 2013-tutorial-data-science-for-software-engineering

Envy = The WisDOM Of the COWs

• Seek the fence where the grass is greener on the other side.
• Learn from there.
• Test on here.
• Cluster to find “here” and “there”.

Page 102: Icse 2013-tutorial-data-science-for-software-engineering

Hierarchical partitioning

Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants

Prune:
• Combine quadtree leaves with similar densities
• Score each cluster by the median score of the class variable

Page 103: Icse 2013-tutorial-data-science-for-software-engineering

Hierarchical partitioning

Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants

Prune:
• Combine quadtree leaves with similar densities
• Score each cluster by the median score of the class variable

Where is the grass greenest?
• This cluster envies the neighbor with the better score and max abs(score(this) − score(neighbor)).

Page 104: Icse 2013-tutorial-data-science-for-software-engineering

Q: How to learn rules from neighboring clusters?

• A: It doesn’t really matter
  – Many competent rule learners
• But to evaluate global vs. local rules:
  – Use the same rule learner for local and global rule learning
• This study uses WHICH (Menzies [2010])
  – Customizable scoring operator
  – Faster termination
  – Generates very small rules (good for explanation)

Page 105: Icse 2013-tutorial-data-science-for-software-engineering

Data from http://promisedata.googlecode.com

• Effort reduction = {NasaCoc, China}: COCOMO or function points
• Defect reduction = {lucene, xalan, jedit, synapse, etc.}: CK metrics (OO)
• Clusters have an untreated class distribution.
• Rules select a subset of the examples:
  – generate a treated class distribution

[Chart: distributions have percentiles (25th, 50th, 75th, 100th); three treatments are compared: untreated, global (treated with rules learned from all data), and local (treated with rules learned from the neighboring cluster).]

Page 106: Icse 2013-tutorial-data-science-for-software-engineering

By any measure, LOCAL is BETTER THAN GLOBAL:
• Lower median efforts/defects (50th percentile)
• Greater stability (75th − 25th percentile)
• Decreased worst case (100th percentile)

Page 107: Icse 2013-tutorial-data-science-for-software-engineering

Rules learned in each cluster

• What works best “here” does not work “there”
  – It is misguided to try and tame conclusion instability; it is inherent in the data
• You can’t tame conclusion instability; instead, you can exploit it
  – Learn local lessons that do better than overly generalized global theories

Page 108: Icse 2013-tutorial-data-science-for-software-engineering


OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
  – Rule #1: Talk to the users
  – Rule #2: Know your domain
  – Rule #3: Suspect your data
  – Rule #4: Data science is cyclic
• PART 2: Data Issues
  – How to solve lack or scarcity of data
  – How to prune data, simpler & smarter
  – How to advance simple CBR methods
  – How to keep your data private
• PART 3: Model Issues
  – Problems of SE models
  – Solutions
    • Envy-based learning
    • Ensembles

Page 109: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Ensembles of Learning Machines*

Sets of learning machines grouped together. Aim: to improve predictive performance.

Base learners B1, B2, …, BN produce estimation1, estimation2, …, estimationN.
E.g.: ensemble estimation = Σ_i w_i · estimation_i

* T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in Multiple Classifier Systems. 2000.
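The combination rule on the slide, in Python (a minimal sketch; uniform weights are assumed when none are given):

import numpy as np

def ensemble_estimate(estimations, weights=None):
    # ensemble estimation = Σ_i w_i · estimation_i
    est = np.asarray(estimations, dtype=float)
    w = np.full(len(est), 1.0 / len(est)) if weights is None else np.asarray(weights)
    return float(w @ est)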

Page 110: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Ensembles of Learning Machines

One of the keys: a diverse* ensemble, where the “base learners” make different errors on the same instances.

* G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of Information Fusion 6(1): 5-20, 2005.

Page 111: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Ensembles of Learning Machines

One of the keys: a diverse ensemble, where the “base learners” make different errors on the same instances. Different ensemble approaches can be seen as different ways to generate diversity among the base learners!

Three different types of ensembles that have been applied to software effort estimation will be presented in the next slides.

Page 112: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

An existing training set (completed projects) is used for creating/training the ensemble of base learners B1, B2, ..., BN.

Page 113: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Bagging ensembles of Regression Trees (Bag+RTs)*: a study with 13 data sets from the PROMISE and ISBSG repositories.

Bag+RTs obtained the highest rank across data sets in terms of Mean Absolute Error (MAE), and rarely performed considerably worse (>0.1 SA, where SA = 1 − MAE/MAE_rguess) than the best approach.

* L. Minku, X. Yao. Ensembles and Locality: Insight on Improving Software Effort Estimation. Information and Software Technology, Special Issue on Best Papers from PROMISE 2011, 2012 (in press), http://dx.doi.org/10.1016/j.infsof.2012.09.012.

Page 114: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Bagging* ensembles of regression trees: sample uniformly with replacement from the training data (completed projects), train one regression tree per sample (RT1, RT2, ..., RTN), and group the trees into an ensemble (see the sketch below).

* L. Breiman. Bagging Predictors. Machine Learning 24(2):123-140, 1996.
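A sketch of the scheme in Python (using scikit-learn's DecisionTreeRegressor as a stand-in for the REPTrees mentioned on the next slide; all names are ours):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_rts(X, y, n_trees=50, seed=0):
    # Train each tree on a bootstrap sample (uniform, with replacement);
    # the ensemble's estimate is the mean of the trees' estimates.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return lambda projects: np.mean([t.predict(projects) for t in trees], axis=0)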

Page 115: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Bagging ensembles of regression trees

Regression trees: estimation by analogy; divide projects according to attribute values. The most impactful attributes are at the higher levels, and attributes with insignificant impact are not used. E.g., REPTrees*.

Example tree:
  Functional Size < 253:
    Functional Size < 151: Effort = 1086
    Functional Size >= 151: Effort = 2798
  Functional Size >= 253: Effort = 5376

* M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1), 2009.

http://www.cs.waikato.ac.nz/ml/weka.

Page 116: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Bagging ensembles of regression trees in Weka: classifiers – meta – bagging; classifiers – trees – REPTree.

Page 117: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Multi-objective Pareto ensembles: there are different measures/metrics of performance for evaluating SEE models, and different measures capture different quality features of the models, e.g. MAE, standard deviation, PRED, etc. There is no agreed single measure, and a model doing well on a certain measure may not do so well on another.

[Figure: Multilayer Perceptron (MLP) models created using Cocomo81.]

Page 118: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Multi-objective Pareto ensembles*: we can view SEE as a multi-objective learning problem. A multi-objective approach (e.g. a Multi-Objective Evolutionary Algorithm (MOEA)) can be used to:
• Better understand the relationship among measures.
• Create ensembles that do well for a set of measures, in particular for larger data sets (>=60).

Sample result: Pareto ensemble of MLPs (ISBSG).

* L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM Transactions on Software Engineering and Methodology, 2012 (accepted). Author's final version: http://www.cs.bham.ac.uk/~minkull/publications/MinkuYaoTOSEM12.pdf.

Page 119: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Static Ensembles

Multi-objective Pareto ensembles: a multi-objective evolutionary algorithm creates nondominated models with several different trade-offs from the training data (completed projects). The model with the best performance in terms of each particular measure can be picked to form an ensemble (B1, B2, B3) with a good trade-off.

Page 120: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Dynamic Adaptive Ensembles

Companies are not static entities – they can change with time (data set shift / concept drift*). Models need to learn new information and adapt to changes. Companies can also start behaving more or less similarly to other companies.

* L. Minku, A. White, X. Yao. The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift. IEEE Transactions on Knowledge and Data Engineering, 22(5):730-742, 2010.

[Figure: predicting effort for a single company from ISBSG based on its projects and other companies' projects.]

Page 121: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Dynamic Adaptive Ensembles

Dynamic Cross-company Learning (DCL)*

• m cross-company (CC) training sets with different productivity (completed projects) train CC models 1..m, weighted w1..wm.
• Within-company (WC) training data (projects arriving with time) trains a WC model, weighted wm+1.
• Dynamic weights control how much a certain model contributes to predictions:
  – At each time step, “loser” models have their weight multiplied by Beta (see the sketch below).
  – Models trained with “very different” projects from the one to be predicted can be filtered out.

* L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012. http://dx.doi.org/10.1145/2365324.2365334.
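A sketch of the weight update in Python (our reading of the rule; Beta's value and the exact "loser" test are simplifying assumptions, not the paper's definition):

def dcl_update(weights, errors, beta=0.5):
    # At each time step, models beaten on the newest project ("losers")
    # have their weight multiplied by beta; weights are then renormalized.
    best = min(errors)
    new = [w * (beta if e > best else 1.0) for w, e in zip(weights, errors)]
    total = sum(new)
    return [w / total for w in new]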

Page 122: Icse 2013-tutorial-data-science-for-software-engineering

Solutions to SE Model Problems / Dynamic Adaptive Ensembles

Dynamic Cross-company Learning (DCL):
• DCL uses new completed projects that arrive with time.
• DCL determines when CC data is useful.
• DCL adapts to changes by using CC data.

[Figure: predicting effort for a single company from ISBSG based on its projects and other companies' projects.]

Page 123: Icse 2013-tutorial-data-science-for-software-engineering

What have we covered?
• Organizational Issues
• Data Issues
• Model Issues

Page 124: Icse 2013-tutorial-data-science-for-software-engineering
