
Predicting More from Less: Synergies of Learning

Ekrem Kocaguneli, ekrem@kocaguneli.com

Bojan Cukic, bojan.cukic@mail.wvu.edu

Huihua Lu, hlu3@mix.wvu.edu

RAISE'13: 2nd International NSF-sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 5/25/2013


2

Collecting data is important

SourceForge currently hosts 324K projects with a user base of 3.4M [1]

GoogleCode hosts 250K open source projects [2]

1. http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
2. https://developers.google.com/open-source/

3

Also, there is an abundance of SE data repositories

ISBSG [1], PROMISE [2]

Eclipse Bug Data [3], TukuTuku [4]

1. C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.

2. T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The promise repository of empirical software engineering data, June 2012.

3. T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In International Workshop on Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007.

4. http://www.metriq.biz/tukutuku/

4

We have mountains of data, but then what?

5

Abundance of data is promising for predictive modeling and supervised learning

Yet, dependent variable information is not always available!

Dependent variables (labels, effort values, etc.) may be missing, outdated, or available only for a limited number of instances

6

When an organization has no local data or the local data is outdated, transferring data helps

When only a limited amount of data is labeled, we can use the existing labels to label other training instances

When no labels exist, we can request labels from experts with a cost

Transfer learning

Semi-supervised learning

Active learning

7

How to transfer data between domains and projects?

How to accommodate prediction problems for which only a limited number of labeled instances are available?

How to handle prediction problems in which no instances have labels?

Transfer learning

Semi-supervised learning

Active learning

8

What is the current state-of-the-art?

9

Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]).

Transfer learning - 1

[1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.

SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).

10

Transfer learning studies in SE report instability and significant variability when cross data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2])

Transfer learning - 2

[1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.

[2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.

[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.

[4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.

Filtering-based approaches support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
• Transferring all cross data yields poor performance
• Filtering cross data significantly improves estimation
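The filtering idea above can be sketched in a few lines. This is an illustrative sketch in the spirit of the NN-filter of Turhan et al. [3], not their exact method; the function name, the Euclidean distance, and k=2 are assumptions:

```python
import numpy as np

def nn_filter(cross_X, within_X, k=2):
    """Keep only cross-company rows that rank among the k nearest
    neighbours of at least one within-company row."""
    keep = set()
    for w in within_X:
        dists = np.linalg.norm(cross_X - w, axis=1)  # distance to every cross row
        keep.update(int(i) for i in np.argsort(dists)[:k])
    return cross_X[sorted(keep)]

cross = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [9.0, 9.0]])
within = np.array([[0.05, 0.05]])
print(nn_filter(cross, within, k=2))  # only the two cross rows near the within row survive
```

A supervised learner is then trained only on the filtered cross rows, which is what makes cross data behave like within data.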

11

SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1].

Semi-supervised learning (SSL) -1

[1] O. Chapelle, B. Schlkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.

SSL helps relax the dependent-variable requirement of supervised methods

Hence, SSL can supplement supervised estimation methods.

12

Despite the promise, SSL appears to be less than thoroughly investigated in SE

Semi-supervised learning (SSL) - 2

[1] Huihua Lu, Bojan Cukic, and Mark Culp. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012), 2012.

[2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.

Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1]

Li et al. developed a framework that maps ensemble learning and random forests into an SSL setting [2].
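A minimal self-training loop illustrates the SSL idea: start from a small labeled pool and let the model pseudo-label the rest. This is one common SSL scheme, not the MDS-based method of Lu et al.; the 1-NN labeler and all names are assumptions:

```python
import numpy as np

def self_train(X, y, iterations=5):
    """Fill in missing labels (marked -1) by repeatedly assigning each
    unlabeled row the label of its nearest labeled neighbour."""
    y = y.copy()
    for _ in range(iterations):
        labeled = np.where(y != -1)[0]
        unlabeled = np.where(y == -1)[0]
        if len(unlabeled) == 0:
            break
        for i in unlabeled:
            d = np.linalg.norm(X[labeled] - X[i], axis=1)
            y[i] = y[labeled[np.argmin(d)]]  # pseudo-label from nearest labeled row
    return y

X = np.array([[0.0], [0.2], [1.0], [0.9]])
y = np.array([0, -1, 1, -1])   # only two of four instances carry labels
print(self_train(X, y))        # pseudo-labels fill the -1 slots
```

The augmented, fully labeled set can then feed any supervised estimator, which is how SSL "supplements" supervised methods.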

13

AL methods work on an initially unlabeled data set.

Active Learning (AL) - 1

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 65–72, 2006.

AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost. Hence, we need as few queries as possible.

e.g., Balcan et al. show that AL achieves the same performance as a supervised learner with substantially smaller sample sizes [1]
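The query step above can be sketched as follows. This is a simple diversity heuristic (query the instance farthest from everything already labeled); real AL methods also use uncertainty sampling or query-by-committee, and all names here are assumptions:

```python
import numpy as np

def next_query(X, labeled_idx):
    """Pick the next instance to send to the (costly) oracle:
    the unlabeled row farthest from every labeled row."""
    scores = []
    for i in range(len(X)):
        if i in labeled_idx:
            scores.append(-1.0)  # never re-query an already labeled row
        else:
            scores.append(min(float(np.linalg.norm(X[i] - X[j])) for j in labeled_idx))
    return int(np.argmax(scores))

X = np.array([[0.0], [0.1], [5.0]])
labeled = [0]                   # only row 0 is labeled so far
print(next_query(X, labeled))   # row 2 is farthest from the labeled pool
```

Because each oracle call has a cost, the loop stops as soon as estimates stabilize, which is how AL gets away with a small fraction of the labels.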

14

In SE, AL methods hold good potential to reduce labeling costs

Active Learning (AL) - 2

[1] Huihua Lu and Bojan Cukic. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering (PROMISE '12), 2012.

[2] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering, preprint.

Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques by using 20% or less of the data [1]

Kocaguneli et al. use AL in software effort estimation (SEE). The proposed method performs comparably to supervised methods with 31% of the original data [2]

15

So what do we do?

16

Strengths and Weaknesses

Supervised Learning (SL)
Strengths:
• Successfully used in SE for predictive purposes.
• Provides successful estimation performance.
Challenges:
• Requires retrospective local data.
• Requires dependent variable information.

Transfer Learning (TL)
Strengths:
• Enables data to be transferred between different organizations or time frames.
• Provides a solution to the lack of local data.
• After relevancy filtering, cross data can perform as well as within data.
Challenges:
• Using cross data as-is results in unstable performance.
• TL filters relevant cross data, which reduces the amount of transferred cross data.

Semi-supervised Learning (SSL)
Strengths:
• Enables learning from small sets of labeled instances.
• Supplements the learning with unlabeled instances.
• Relaxes the requirement of dependent variables.
Challenges:
• Although small, it still requires an initially labeled set of training instances.
• For datasets with a large number of independent features, it requires feature subset selection.

Active Learning (AL)
Strengths:
• Helps find the essential content of the data.
• Decreases the amount of dependent variable information needed, thereby reducing the associated data collection costs.
Challenges:
• Susceptible to unbalanced class distributions in classification problems.

17

Strengths and Weaknesses (recap)

[Same table as the previous slide, with matching strengths and challenges annotated 1, 2, 3 to mark the three synergies]

18

Synergy #1

Synergy #1 is already being pursued in SE

With successful applications of transferring data across:
• Domains
• Time frames

19

Filtering labeled cross data yields a very limited amount of locally relevant data

SSL can use filtered cross data to provide pseudo-labels for the unlabeled within data

Synergy #2
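Synergy #2 can be sketched end-to-end: labeled, filtered cross data assigns pseudo-labels to the unlabeled within data, which an SSL learner could then consume. A minimal sketch assuming 1-NN label transfer (the names and the 1-NN choice are illustrative, not the slides' concrete method):

```python
import numpy as np

def pseudo_label(within_X, cross_X, cross_y):
    """Give each unlabeled within-company row the label of its
    nearest (already filtered) labeled cross-company row."""
    labels = []
    for w in within_X:
        d = np.linalg.norm(cross_X - w, axis=1)
        labels.append(cross_y[int(np.argmin(d))])
    return np.array(labels)

cross_X = np.array([[0.0], [1.0]])   # filtered cross data with known labels
cross_y = np.array([0, 1])
within_X = np.array([[0.1], [0.8]])  # unlabeled within data
print(pseudo_label(within_X, cross_X, cross_y))  # [0 1]
```

The payoff is that even a very limited amount of locally relevant cross data can bootstrap labels for the much larger unlabeled within set.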

20

SE data (defect and effort) can be summarized with its essential content

Transfer learning may benefit from using essential content instead of all the data, which may contain noise and outliers

Synergy #3
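One way to picture Synergy #3's "essential content" is prototype selection: compress the dataset to a few representatives before transferring it. This sketch uses plain k-means centroids; the slides' actual summarization method is not specified here, and all names and parameters are assumptions:

```python
import numpy as np

def summarize(X, n_proto=2, iters=10):
    """Reduce a dataset to n_proto representative prototypes
    (k-means centroids) so transfer moves the essential content
    rather than every noisy row."""
    # deterministic init: spread the initial centers across the data
    centers = X[np.linspace(0, len(X) - 1, n_proto).astype(int)]
    for _ in range(iters):
        # assign every row to its nearest center, then recompute centers
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[assign == c].mean(axis=0) for c in range(n_proto)])
    return centers

X = np.array([[0.0], [0.1], [5.0], [5.1]])
print(summarize(X))  # two prototypes, one per natural cluster
```

Transferring only the prototypes sidesteps noise and outliers in the raw rows, which is the point of pairing summarization with transfer learning.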

21

Did you try any of the synergies?

22

Experiments with Synergy #3

23

Experiments with Synergy #3

Estimation from pseudo-labeled within data

Within data is summarized to at most 15%

Opportunity for within data to be locally interpreted

24

What have we covered?
