Predicting More from Less: Synergies of Learning
Ekrem Kocaguneli, ekrem@kocaguneli.com; Bojan Cukic, bojan.cukic@mail.wvu.edu; Huihua Lu, hlu3@mix.wvu.edu
RAISE'13: 2nd International NSF-sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 5/25/2013
Collecting data is important
SourceForge currently hosts 324K projects with a user base of 3.4M [1]
GoogleCode hosts 250K open source projects [2]
1. http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
2. https://developers.google.com/open-source/
There is also an abundance of SE repositories:
ISBSG [1], PROMISE [2], Eclipse Bug Data [3], TukuTuku [4]
1. C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.
2. T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The promise repository of empirical software engineering data, June 2012.
3. T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In International Workshop on Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007.
4. http://www.metriq.biz/tukutuku/
We have mountains of data, but then what?
An abundance of data is promising for predictive modeling and supervised learning.
Yet, dependent variable information is not always available!
Dependent variables (labels, effort values, etc.) may be missing, outdated, or available for only a limited number of instances.
Transfer learning: when an organization has no local data, or the local data is outdated, transferring data helps.
Semi-supervised learning: when only a limited amount of data is labeled, we can use the existing labels to label other training instances.
Active learning: when no labels exist, we can request labels from experts, at a cost.
Transfer learning: how to transfer data between domains and projects?
Semi-supervised learning: how to accommodate prediction problems for which only a limited number of labeled instances are available?
Active learning: how to handle prediction problems in which no instances have labels?
What is the current state-of-the-art?
Transfer learning - 1
Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]).
[1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.
SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).
Transfer learning - 2
Transfer learning results in SE report instability and significant variability if data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2]).
[1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
[2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
Filtering-based approaches support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
• Transferring all cross data yields poor performance
• Filtering cross data significantly improves estimation
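The filtering idea can be sketched as a simple nearest-neighbour relevancy filter in the spirit of Turhan et al.'s approach [3]: keep only the cross-company instances that lie close to the local (within) data. The function names, the k=2 setting, and the toy metric vectors below are illustrative assumptions, not the exact published procedure.

```python
# Sketch of a nearest-neighbour relevancy filter: for each within
# instance, keep its k nearest cross-company instances and discard
# the rest. Names and parameters are illustrative.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nn_filter(cross, within, k=2):
    """Keep the k nearest cross instances to each within instance."""
    kept = set()
    for w in within:
        ranked = sorted(range(len(cross)), key=lambda i: euclidean(cross[i], w))
        kept.update(ranked[:k])
    return [cross[i] for i in sorted(kept)]

# Toy feature vectors (e.g., size and complexity metrics).
within = [[1.0, 1.0], [1.2, 0.9]]
cross = [[1.1, 1.0], [0.9, 1.1], [10.0, 8.0], [12.0, 9.0]]

relevant = nn_filter(cross, within, k=2)
print(relevant)  # the two distant cross instances are filtered out
```

Only the nearby cross instances survive the filter, which mirrors the reported finding that filtered cross data works far better than cross data used wholesale.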
Semi-supervised learning (SSL) - 1
SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1].
[1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
SSL helps relax supervised methods' dependence on dependent variable information.
Hence, it can supplement supervised estimation methods.
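One common SSL scheme is self-training: repeatedly pseudo-label the unlabeled instance that the current labeled pool covers most confidently, then add it to the pool. This is only a minimal sketch of that generic idea (the cited SE studies use MDS-based and ensemble-based SSL, not this exact procedure); the 1-NN confidence proxy and all names are illustrative.

```python
# Minimal self-training sketch: pseudo-label the unlabeled point whose
# nearest labeled neighbour is closest, adopt that neighbour's label,
# and repeat until everything is labeled.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def self_train(labeled, unlabeled):
    """labeled: list of (features, label); unlabeled: list of features."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        # Most confidently covered point = smallest distance to the pool.
        best = min(pool, key=lambda u: min(dist(u, f) for f, _ in labeled))
        _, lbl = min(labeled, key=lambda p: dist(best, p[0]))
        labeled.append((best, lbl))  # adopt the pseudo-label
        pool.remove(best)
    return labeled

labeled = [([0.0, 0.0], "clean"), ([5.0, 5.0], "defective")]
unlabeled = [[0.5, 0.2], [4.8, 5.1], [2.0, 2.0]]
model = self_train(labeled, unlabeled)
print(model)  # all instances now carry (pseudo-)labels
```

Starting from two labels, the whole training set ends up labeled, which is exactly the relaxation of the dependent-variable requirement described above.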
Semi-supervised learning (SSL) - 2
Despite the promise, SSL appears to be less than thoroughly investigated in SE.
[1] Huihua Lu, Bojan Cukic, and Mark Culp. 2012. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012).
[2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.
Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1]
Li et al. developed a framework which maps ensemble learning and random forests into an SSL setting [2].
Active Learning (AL) - 1
AL methods start from an initially unlabeled data set.
[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning - ICML '06, pages 65–72, 2006.
AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost. Hence, we need as few queries as possible.
e.g. Balcan et al. show that AL provides the same performance as a supervised learner with substantially smaller sample sizes [1]
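The query loop can be sketched as follows. Under a fixed budget, the learner asks the oracle to label the instance least represented by what is already labeled (a simple diversity proxy standing in for the uncertainty criteria used in the literature); the oracle, the seeding rule, and all names are illustrative assumptions.

```python
# Active-learning sketch: spend a small label budget on the points
# farthest from everything already labeled, querying an oracle for
# each label. The oracle here is a plain lookup function.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def active_learn(instances, oracle, budget):
    labeled = {}
    labeled[0] = oracle(instances[0])  # seed with the first instance
    for _ in range(budget - 1):
        # Pick the unlabeled point least covered by the labeled set.
        cand = max((i for i in range(len(instances)) if i not in labeled),
                   key=lambda i: min(dist(instances[i], instances[j])
                                     for j in labeled))
        labeled[cand] = oracle(instances[cand])
    return labeled

points = [[0.0], [0.1], [9.0], [9.2], [5.0]]
truth = ["low", "low", "high", "high", "mid"]
queries = []

def oracle(x):
    queries.append(x)  # each call represents a costly expert label
    return truth[points.index(x)]

labels = active_learn(points, oracle, budget=3)
print(len(queries))  # only 3 of the 5 labels were requested
```

The loop labels a spread-out subset without ever querying the near-duplicate points, illustrating how AL trades a small number of expensive queries for coverage of the data.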
Active Learning (AL) - 2
In SE, AL methods hold good potential to reduce labeling costs.
[1] Huihua Lu and Bojan Cukic. 2012. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering (PROMISE '12).
[2] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering, preprint.
Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques by using 20% or less of the data [1]
Kocaguneli et al. use AL in software effort estimation (SEE). The proposed method performs comparably to supervised methods with 31% of the original data [2]
So what do we do?
Strengths and Weaknesses

Supervised Learning (SL)
Strengths:
• Successfully used in SE for predictive purposes.
• Provides successful estimation performance.
Challenges:
• Requires retrospective local data.
• Requires dependent variable information.

Transfer Learning (TL)
Strengths:
• Enables data to be transferred between different organizations or time frames.
• Provides a solution to the lack of local data.
• After relevancy filtering, cross data can perform as well as within data.
Challenges:
• Using cross data in an as-is manner results in unstable performance.
• TL filters relevant cross data, which reduces the transferred cross data amount.

Semi-supervised Learning (SSL)
Strengths:
• Enables learning from small sets of labeled instances.
• Supplements the learning with unlabeled instances.
• Relaxes the requirement of dependent variables.
Challenges:
• Although small, it still requires an initially labeled set of training instances.
• For datasets with a large number of independent features, it requires feature subset selection.

Active Learning (AL)
Strengths:
• Helps find the essential content of the data.
• Decreases the amount of dependent variable information needed, thereby reducing the associated data collection costs.
Challenges:
• Susceptible to unbalanced class distributions in classification problems.
Synergy #1
Synergy #1 is already being pursued in SE, with successful applications of transferring data among:
• Domains
• Time frames
Synergy #2
Filtering labeled cross data yields a very limited amount of locally relevant data.
SSL can use the filtered cross data to provide pseudo-labels for the unlabeled within data.
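Synergy #2 can be sketched in a few lines: take the (already filtered and labeled) cross-company data as the seed, then pseudo-label each unlabeled within instance from its nearest labeled cross neighbour. The 1-NN labeller and all names are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch of Synergy #2: filtered, labeled cross data supplies
# pseudo-labels for the unlabeled within data via nearest neighbours.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pseudo_label(within_unlabeled, cross_labeled):
    """Label each within instance from its nearest labeled cross instance."""
    out = []
    for w in within_unlabeled:
        _, lbl = min(cross_labeled, key=lambda p: dist(w, p[0]))
        out.append((w, lbl))
    return out

# Cross data assumed already relevancy-filtered and labeled;
# within data has features but no labels.
cross_labeled = [([1.0, 1.0], "clean"), ([8.0, 7.0], "defective")]
within = [[1.2, 0.8], [7.5, 7.2]]

labeled_within = pseudo_label(within, cross_labeled)
print(labeled_within)
```

The now pseudo-labeled within data could then feed any supervised or semi-supervised learner, which is the point of the synergy.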
Synergy #3
SE data (defect and effort) can be summarized with its essential content.
Transfer learning may benefit from using the essential content instead of all the data, which may contain noise and outliers.
Did you try any of the synergies?
Experiments with Synergy #3
Estimation from pseudo-labeled within data
Within data is summarized to at most 15%
Opportunity for within data to be locally interpreted
What have we covered?