Software Mining and Software Datasets

Most slides prepared in collaboration with Ahmed Hassan and Dongmei Zhang. More information available at https://sites.google.com/site/asergrp/dmse/ Software Mining and Software Datasets. Tao Xie, University of Illinois at Urbana-Champaign, http://taoxie.cs.illinois.edu/

Upload: tao-xie

Post on 13-Jan-2017


TRANSCRIPT

Page 1: Software Mining and Software Datasets

Most slides prepared in collaboration with Ahmed Hassan and Dongmei Zhang

More information available at

https://sites.google.com/site/asergrp/dmse/

Software Mining and Software Datasets

Tao Xie

University of Illinois at Urbana-Champaign

http://taoxie.cs.illinois.edu/

Page 2: Software Mining and Software Datasets

• Associate Professor at University of Illinois at Urbana-Champaign, USA

• Leads the ASE research group at Illinois

• PC Chair of ISSTA 2015, PC Co-Chair of MSR 2011/2012, ICSM 2009

• Co-organizer of 2007 Dagstuhl Seminar on Mining Programs and Processes, 2013 NII Shonan Meeting on Software Analytics: Principles and Practice


Page 3: Software Mining and Software Datasets

Software Services


Page 4: Software Mining and Software Datasets

Individual

Social

Isolated

Not much content generation

Collaborative

Huge amount of artifacts generated anywhere, anytime


Page 5: Software Mining and Software Datasets


Data pervasive

Long product cycle

Experience & gut-feeling

In-lab testing

Informed decision making

Centralized development

Code centric

Debugging in the large

Distributed development

Continuous release

… …

Page 6: Software Mining and Software Datasets
Page 7: Software Mining and Software Datasets
Page 8: Software Mining and Software Datasets
Page 9: Software Mining and Software Datasets
Page 10: Software Mining and Software Datasets
Page 11: Software Mining and Software Datasets

http://msrconf.org

An international effort to make software repositories actionable

http://openscience.us/repo/

Page 12: Software Mining and Software Datasets

• Transforms static record-keeping repositories to active repositories

• Makes repository data actionable by uncovering hidden patterns and trends

Mailing list, Bugzilla, Crashes, Field logs, CVS/SVN

Page 13: Software Mining and Software Datasets

Historical Repositories: Source Control (CVS/SVN), Bugzilla, Mailing lists

Runtime Repos: Field Logs, Crash Repos

Code Repos: Sourceforge, GoogleCode

Page 14: Software Mining and Software Datasets
Page 15: Software Mining and Software Datasets

A. Hassan and T. Xie. Software Intelligence: Future of Mining Software Engineering Data. In FoSER 2010.

Page 16: Software Mining and Software Datasets

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Page 17: Software Mining and Software Datasets


Page 18: Software Mining and Software Datasets


Research Topics

Technology Pillars

Target Audience

Connection to Practice

Output

Page 19: Software Mining and Software Datasets

Software Users

Software Development Process

Software System

• Covering different areas of software domain

• Throughout entire development cycle

• Enabling practitioners to obtain insights

Page 20: Software Mining and Software Datasets


Runtime traces

Program logs

System events

Perf counters

Usage log

User surveys

Online forum posts

Blog & Twitter

Source code

Bug history

Check-in history

Test cases

Page 21: Software Mining and Software Datasets


Developer

Tester

Program Manager

Usability engineer

Designer

Support engineer

Management personnel

Operation engineer

Page 22: Software Mining and Software Datasets

• Conveys meaningful and useful understanding or knowledge towards completing the target task

• Not easily attainable via directly investigating raw data without aid of analytics technologies

• Example

– It is easy to count the number of re-opened bugs, but how do we find the primary reasons for these re-opened bugs?

Page 23: Software Mining and Software Datasets

• Enables software practitioners to come up with concrete solutions towards completing the target task

• Examples

– Why were bugs re-opened?

  • A list of bug groups, each with the same reason for re-opening

– Which part of my code should be refactored?

  • A list of cloned code snippets easily explored from different perspectives

Page 24: Software Mining and Software Datasets

Software Users

Software Development Process

Software System

Information Visualization

Data Analysis Algorithms

Large-scale Computing

Vertical

Horizontal

Page 25: Software Mining and Software Datasets

New Bug Report

Repositories and the data mined from them:

• Bugzilla: fixed bugs

• Mailing list: discussions

• CVS/SVN: buggy change & fixing change

• Crashes: field crashes

Support for the new bug report:

• Estimate fix effort

• Mark duplicates

• Suggest experts and fix!

Page 26: Software Mining and Software Datasets

New Change

Repositories and the data mined from them:

• Bugzilla: fixed bugs

• Mailing list: discussions

• CVS/SVN: buggy change & fixing change

• Crashes: field crashes

Support for the new change:

• Suggest APIs

• Warn about risky code or bugs

• Suggest locations to co-change

Page 27: Software Mining and Software Datasets

Textual Software Artifacts

Page 28: Software Mining and Software Datasets

NL Software Artifacts are of Many Types

• requirements documents

• code comments

• identifier names

• commit logs

• release notes

• bug reports

• …

• emails discussing bugs, designs, etc.

• mailing list discussions

• test plans

• project websites & wikis

• …

Page 29: Software Mining and Software Datasets


NL Software Artifacts are of Large Quantity

• code comments:

– 2M in Eclipse, 1M in Mozilla, 1M in Linux

• identifier names:

– 1M in Chrome

• commit logs:

– 222K for Linux (05-10), 31K for PostgreSQL

• bug reports:

– 641K in Mozilla, 18K in Linux, 7K in Apache

• …

NL data contains useful information, much of which is not in structured data.

Page 30: Software Mining and Software Datasets

Code comments contain Specifications

linux/drivers/scsi/in2000.c:

/* Caller must hold instance lock! */
static int reset_hardware(…)
{…}

linux/drivers/scsi/in2000.c:

static int in2000_bus_reset(…)
{
  …
  reset_hardware(…);
}

No lock acquisition ⇒ A bug!

Tan et al. “/*iComment: Bugs or Bad Comments?*/”, SOSP’07

Page 31: Software Mining and Software Datasets

API documentation contains resource usages

• java.sql.ResultSet.deleteRow() : “Deletes the current row from this ResultSet object and from the underlying database”

• java.sql.ResultSet.close() : “Releases this ResultSet object’s database and JDBC resources immediately instead of waiting for this to happen when it is automatically closed”.

java.sql.ResultSet.deleteRow()

java.sql.ResultSet.close()

Zhong, Zhang, Xie, Mei. Inferring Resource Specifications from Natural Language API Documentation. ASE 2009.

Page 32: Software Mining and Software Datasets

NL Data Contains Useful Info – Example 3

Don’t ignore the semantics of identifiers

Sridhara, Pollock, Vijay-Shanker. Automatically Detecting and Describing High Level Actions within Methods. ICSE 2011

Page 33: Software Mining and Software Datasets

• Unstructured

– Hard to parse, sometimes wrong grammar

• Ambiguous: often has no defined or precise semantics (as opposed to source code)

– Hard to understand

• Many ways to represent similar concepts

– Hard to extract information from

/* We need to acquire the write IRQ lock before calling ep_unlink(). */

/* Lock must be acquired on entry to this function. */

/* Caller must hold instance lock! */
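
The three comments above state the same kind of rule in three different wordings. As a minimal sketch of why extraction is hard, the following assumes a few hand-written regular expressions (hypothetical patterns chosen for these three comments, not the approach of any cited tool) and maps each comment to one normalized "lock required" rule:

```python
import re

# Hypothetical hand-written patterns; a real tool would need many more
# (or would learn them) to cover the variety of phrasings in practice.
LOCK_PATTERNS = [
    re.compile(r"acquire (?:the )?(?P<lock>[\w ]*?lock) before calling "
               r"(?P<callee>\w+)\(\)", re.I),
    re.compile(r"(?P<lock>lock) must be acquired on entry", re.I),
    re.compile(r"caller must hold (?P<lock>[\w ]*?lock)", re.I),
]

def extract_lock_rule(comment):
    """Return a normalized rule dict, or None if no pattern matches."""
    for pattern in LOCK_PATTERNS:
        match = pattern.search(comment)
        if match:
            groups = match.groupdict()
            return {"lock": groups["lock"].strip().lower(),
                    "callee": groups.get("callee")}  # None: rule is about the commented function
    return None

comments = [
    "/* We need to acquire the write IRQ lock before calling ep_unlink(). */",
    "/* Lock must be acquired on entry to this function. */",
    "/* Caller must hold instance lock! */",
]
rules = [extract_lock_rule(c) for c in comments]
```

Each new phrasing needs another pattern, which is the "many ways to represent similar concepts" problem in miniature.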

Page 34: Software Mining and Software Datasets

• Redundant data

• Easy to get “good” results for simple tasks

– Simple algorithms without much tuning effort

• Evolution/version history readily available

• Many techniques to borrow from text analytics: NLP, Machine Learning (ML), Information Retrieval (IR), etc.

Page 35: Software Mining and Software Datasets

Stepping Back ….

Image from https://www.snowfactor.com/events/back-to-the-future-quiz/

Page 36: Software Mining and Software Datasets

https://www.kaggle.com/c/the-allen-ai-science-challenge

“consistently understand and correctly answer general questions about the world.”

“Using a dataset of multiple choice question and answers from a standardized 8th grade science exam, AI2 is challenging you to create a model that gets to the head of the class.”

Page 37: Software Mining and Software Datasets
Page 38: Software Mining and Software Datasets
Page 39: Software Mining and Software Datasets
Page 40: Software Mining and Software Datasets

A. E. Hassan and T. Xie: Mining Software Engineering Data

• Bug report image

• Overlay the triage questions

Duplicate?

Bugzilla: open source bug tracking tool

http://www.bugzilla.org/

http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html

Assigned To: ?

Anvik, Hiew, Murphy. Who should fix this bug? ICSE 2006.

Wang, Zhang, Xie, Anvik, Sun. An Approach to Detecting Duplicate Bug Reports using Natural Language and Execution Information. ICSE 2008.
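
The "Duplicate?" triage question can be illustrated with a small text-similarity sketch. It assumes plain bag-of-words cosine similarity over report text and a hypothetical threshold of 0.5; it covers only a natural-language signal and is not the method of the cited papers. Bug IDs and report texts are made up.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def likely_duplicates(new_report, existing, threshold=0.5):
    """Return (score, bug_id) pairs at or above the threshold, best first."""
    new_bag = bag_of_words(new_report)
    scored = [(cosine_similarity(new_bag, bag_of_words(text)), bug_id)
              for bug_id, text in existing.items()]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

# Made-up reports for illustration only.
existing_reports = {
    101: "crash when saving file with long name",
    102: "toolbar icons are blurry on high dpi screens",
}
candidates = likely_duplicates(
    "application crash when saving a file with a long name", existing_reports)
```

A triager would still review the candidates; the score only ranks which existing reports to look at first.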

Page 41: Software Mining and Software Datasets
Page 42: Software Mining and Software Datasets

Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation. ICSE 2012.

Page 43: Software Mining and Software Datasets

Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation. ICSE 2012.

Page 44: Software Mining and Software Datasets
Page 45: Software Mining and Software Datasets

Selected units

Selection threshold

Example of selection based approach from MS Word

Page 46: Software Mining and Software Datasets

conversational structure

Page 47: Software Mining and Software Datasets

Rastkar, Murphy, Murray. Summarizing software artifacts: A case study of bug reports. ICSE’10.

Page 48: Software Mining and Software Datasets
Page 49: Software Mining and Software Datasets

Security bug report

“An attacker can exploit a buffer overflow by sending excessive data into an input field.”

Mislabeled security bug report

“The system crashes when receiving excessive text in the input field”

Two bug reports describing a buffer overflow

M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10

Page 50: Software Mining and Software Datasets

Term-by-document frequency matrix quantifies a document

Term            | Bug Report 1 | Bug Report 2 | Bug Report 3
Attack          | 1            | 0            | 1
Buffer Overflow | 1            | 0            | 0
Vulnerability   | 3            | 0            | 0
Label           | Security     | Non-Security | ?

Start List

M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10
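
A term-by-document frequency matrix like the one on this slide can be built in a few lines. The sketch below assumes whole-word, case-insensitive counting with no stemming. The three report texts are invented so that the counts match the slide's numbers; they are not taken from the study.

```python
import re

def term_document_matrix(documents, terms):
    """Count whole-word occurrences of each term in each document."""
    matrix = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        matrix[term] = [len(pattern.findall(doc.lower())) for doc in documents]
    return matrix

# Invented report texts, chosen only to reproduce the counts on the slide.
reports = [
    "An attack exploits a buffer overflow vulnerability; the vulnerability "
    "is remote and the vulnerability is critical.",
    "The system crashes when the window is resized.",
    "Possible attack through the login form.",
]
start_list = ["Attack", "Buffer Overflow", "Vulnerability"]
matrix = term_document_matrix(reports, start_list)
```

Each column of the matrix is the feature vector for one bug report; a classifier trained on the labeled columns can then predict the label of the unlabeled one.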

Page 51: Software Mining and Software Datasets
Page 52: Software Mining and Software Datasets

An HCP should not change patient’s account.

An [subject: HCP] should not [action: change] [resource: patient’s account].

ACP Rule

Subject | Action          | Resource          | Effect
HCP     | UPDATE (change) | patient’s account | deny

Steps: Linguistic Analysis → Model-Instance Construction → Transformation

Xiao, Paradkar, Thummalapenta, Xie. Automated Extraction of Security Policies from Natural-Language Software Documents. FSE 2012.
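
The sentence-to-rule mapping on this slide can be sketched with a single regular expression. This is a deliberately naive stand-in for linguistic analysis: the sentence pattern, the verb-to-action table, and the function name are assumptions made for illustration, not the technique of the cited paper.

```python
import re

# Hypothetical verb-to-action mapping, for illustration only.
ACTION_CLASSES = {"change": "UPDATE", "edit": "UPDATE",
                  "view": "READ", "delete": "DELETE"}

# Matches sentences of the form "<A|An|The> <subject> <modal> <verb> <resource>."
RULE_PATTERN = re.compile(
    r"^(?:an?|the)\s+(?P<subject>[\w ]+?)\s+"
    r"(?P<modal>should not|cannot|can|may)\s+"
    r"(?P<verb>\w+)\s+(?P<resource>.+?)\.?$",
    re.I,
)

def extract_acp_rule(sentence):
    """Return an access-control rule dict, or None if the sentence does not fit."""
    match = RULE_PATTERN.match(sentence.strip())
    if not match:
        return None
    verb = match.group("verb").lower()
    negative = match.group("modal").lower() in {"should not", "cannot"}
    return {
        "subject": match.group("subject"),
        "action": ACTION_CLASSES.get(verb, "UNKNOWN"),
        "resource": match.group("resource"),
        "effect": "deny" if negative else "permit",
    }

rule = extract_acp_rule("An HCP should not change patient’s account.")
```

Real requirements sentences rarely fit one template, which is why the slide's pipeline starts from linguistic analysis rather than pattern matching.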

Page 53: Software Mining and Software Datasets

Example Repository: Project Communication – Mailing lists

Page 54: Software Mining and Software Datasets


• Most open source projects communicate through mailing lists or IRC/IM channels

• Rich source of information about the inner workings of large projects

• Discussions cover topics such as future plans, design decisions, project policies, code or patch reviews

• Social network analysis could be performed on discussion threads

Page 55: Software Mining and Software Datasets


• Study the content of messages before and after a release

• Use dimensions from a psychometric text analysis tool:

– After Apache 1.3 release there was a drop in optimism

– After Apache 2.0 release there was an increase in sociability

Rigby, Hassan. What can OSS mailing lists tell us? a preliminary psychometric text analysis of the apache developer mailing list. MSR 2007.

Page 56: Software Mining and Software Datasets


• When will a developer be invited to join a project?

– Expertise vs. interest

Bird, Gourley, Devanbu, Swaminathan, Hsu. Open borders? immigration in open source projects. MSR 2007.

Page 57: Software Mining and Software Datasets

Example Repositories: Source Control and Bug Repositories

Page 58: Software Mining and Software Datasets


Source Control Repositories

• A source control system tracks changes to ChangeUnits

• Example of ChangeUnits:

– File (most common)

– Function

– Dependency (e.g., Call)

• Each ChangeUnit:

– Records the developer, change time, change message, co-changing Units

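[Diagram: data model of a source control repository]

ChangeList: Developer, Time, Message, ChangeList Type

Change: ChangeUnit, Change Type (Modify, Add, Remove)

Multiplicity shown in the diagram: * .. *

ChangeList Type: FI (Feature Introduction), FR (Fault Repairing), GM (General Maintenance)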

Page 59: Software Mining and Software Datasets


“How does a change in one source code entity propagate to other entities?”

[Flowchart: New Req., Bug Fix → Determine Initial Entity To Change → Change Entity → Determine Other Entities To Change → No More Changes. For Each Entity: Change Entity. Consult Guru for Advice → Suggested Entity.]

Page 60: Software Mining and Software Datasets


• Mine association rules from change history

• Use rules to help propagate changes:

– Recall as high as 44%

– Precision around 30%

• High precision and recall reached in < 1 month

• Prediction accuracy improves prior to a release (i.e., during maintenance phase)

• Better predictor than static dependencies alone

Zimmermann, Zeller, Weissgerber, Diehl. Mining Version Histories to Guide Software Changes. TSE 2005.

Hassan, Holt. Predicting change propagation in software systems. ICSM 2004.
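
Mining association rules from change history can be sketched with plain counting. The example below assumes each commit is a set of changed entities and mines only pairwise rules "a was changed, so b probably needs changing too", filtered by hypothetical support and confidence thresholds. File names and thresholds are made up, and the cited papers use richer rule forms and heuristics.

```python
from collections import Counter
from itertools import combinations

def mine_cochange_rules(transactions, min_support=2, min_confidence=0.5):
    """Mine rules lhs -> rhs from commits, each given as a set of changed entities."""
    single = Counter()
    pair = Counter()
    for changed in transactions:
        entities = set(changed)
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    rules = {}
    for (a, b), support in pair.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / single[lhs]  # P(rhs changed | lhs changed)
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = (support, confidence)
    return rules

def suggest(rules, changed_entity):
    """Entities that the rules suggest changing together with changed_entity."""
    return sorted(rhs for (lhs, rhs) in rules if lhs == changed_entity)

# Made-up change history for illustration.
history = [{"a.c", "a.h"}, {"a.c", "a.h", "b.c"}, {"a.c"}, {"b.c", "b.h"}]
rules = mine_cochange_rules(history)
suggestions = suggest(rules, "a.h")
```

Precision and recall figures like those on the slide come from replaying history: for each past commit, hide part of it, ask for suggestions, and compare them with what was actually changed.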

Page 61: Software Mining and Software Datasets


import org.eclipse.jdt.internal.compiler.lookup.*;

import org.eclipse.jdt.internal.compiler.*;

import org.eclipse.jdt.internal.compiler.ast.*;

import org.eclipse.jdt.internal.compiler.util.*;

...

import org.eclipse.pde.core.*;

import org.eclipse.jface.wizard.*;

import org.eclipse.ui.*;

14% of all files that import ui packages had to be fixed later on.

71% of files that import compiler packages had to be fixed later on.

Schröter, Zimmermann, Zeller. Predicting Component Failures at Design Time. ISESE 2006.
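
Percentages like the two above come from a simple aggregation: for each imported package, the share of importing files that later needed a fix. A minimal sketch, assuming each file record carries its import list and a `fixed_later` flag (a hypothetical data layout with made-up records):

```python
def defect_rate_by_import(files):
    """Map each imported package to the fraction of importing files fixed later."""
    totals, fixed = {}, {}
    for record in files:
        for package in set(record["imports"]):
            totals[package] = totals.get(package, 0) + 1
            if record["fixed_later"]:
                fixed[package] = fixed.get(package, 0) + 1
    return {package: fixed.get(package, 0) / totals[package] for package in totals}

# Made-up file records for illustration.
files = [
    {"imports": ["org.eclipse.ui", "org.eclipse.jface.wizard"], "fixed_later": False},
    {"imports": ["org.eclipse.ui"], "fixed_later": True},
    {"imports": ["org.eclipse.jdt.internal.compiler"], "fixed_later": True},
    {"imports": ["org.eclipse.jdt.internal.compiler"], "fixed_later": True},
]
rates = defect_rate_by_import(files)
```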

Page 62: Software Mining and Software Datasets


• Given a change can we warn a developer that there is a bug in it?

– Recall/Precision in 50-60% range

Kim, Zimmermann, Pan, Whitehead. Automatic identification of bug-introducing changes. ASE 2006.

Page 63: Software Mining and Software Datasets


Percentage of bug-introducing changes for Eclipse

Don’t program on Fridays ;-)

Sliwerski, Zimmermann, Zeller. Don’t Program on Fridays! How to Locate Fix-Inducing Changes. WSR 2005.

Page 64: Software Mining and Software Datasets


Failure is a 4-letter Word

Zeller, Zimmermann, Bird. Failure is a four-letter word: a parody in empirical research. PROMISE 2011.

Page 65: Software Mining and Software Datasets


Page 66: Software Mining and Software Datasets

• “Cross-validation is inappropriate for estimating the performance of change classification because the data points, i.e., changes, follow a certain order in time.”

• “Randomly partitioning the data set may cause a model to use future knowledge which should not be known at the time of prediction to predict changes in the past.”

– E.g., use information on a change committed in 2014 to predict whether a change committed in 2012 is buggy or clean.

Tan, Tan, Dara, Mayeux. Online defect prediction for imbalanced data. ICSE 2015 SEIP.

Ye, Bunescu, Liu. Learning to rank relevant files for bug reports using domain knowledge. FSE 2014.
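
The alternative to random partitioning is a split that respects commit time. A minimal sketch, assuming each change is a dict with a `date` field and using a hypothetical 50/50 split on made-up data:

```python
from datetime import date

def time_ordered_split(changes, train_fraction=0.7):
    """Train on the oldest changes, test on the newest; never shuffle."""
    ordered = sorted(changes, key=lambda change: change["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up changes for illustration.
changes = [
    {"id": "c3", "date": date(2014, 3, 1), "buggy": False},
    {"id": "c1", "date": date(2012, 5, 9), "buggy": True},
    {"id": "c4", "date": date(2014, 8, 20), "buggy": True},
    {"id": "c2", "date": date(2013, 1, 15), "buggy": False},
]
train, test = time_ordered_split(changes, train_fraction=0.5)
```

With this split, the 2014 changes can only appear in the test set, so no 2014 information is used to predict the 2012 change.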

Page 67: Software Mining and Software Datasets

J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, A. Teterev. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows. ICST 2011.

Page 68: Software Mining and Software Datasets

Example Repositories: Program Source Code

Page 69: Software Mining and Software Datasets


Source data | Mined info
Variable names and function names | Software categories [Kawaguchi et al. 04]
Statement seq in a basic block | Copy-paste code [Li et al. 04]
Set of functions, variables, and data types within a C function | Programming rules [Li&Zhou 05]
Sequence of methods within a Java method | API usages [Xie&Pei 06]
API method signatures | API Jungloids [Mandelin et al. 05]

Page 70: Software Mining and Software Datasets

• Tons of papers published in the past decade

• Many years of International Workshop on Software Clones (IWSC) since 2006

• Dagstuhl Seminars

– Software Clone Management towards Industrial Application (2012)

– Duplication, Redundancy, and Similarity in Software (2006)

Source: http://www.dagstuhl.de/12071

Page 71: Software Mining and Software Datasets

• Motivation

– Copy-and-paste is a common developer behavior

– A real tool widely adopted internally and externally

• XIAO enables code clone analysis with

– High tunability

– High scalability

– High compatibility

– High explorability

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 72: Software Mining and Software Datasets

• Intuitive similarity metric

– Effective control of the degree of syntactical differences between two code snippets

• Tunable at fine granularity

– Statement similarity

– % of inserted/deleted/modified statements

– Balance between code structure and disordered statements

for (i = 0; i < n; i ++) {
  a ++;
  b ++;
  c = foo(a, b);
  d = bar(a, b, c);
  e = a + c;
}

for (i = 0; i < n; i ++) {
  c = foo(a, b);
  a ++;
  b ++;
  d = bar(a, b, c);
  e = a + d;
  e ++;
}

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
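
The trade-off between code structure and disordered statements can be made concrete with the two loop bodies above. This sketch is not XIAO's metric. It assumes two simple stand-ins: an order-sensitive score from the standard library's `difflib.SequenceMatcher`, and an order-insensitive score based on the multiset of shared statements.

```python
from collections import Counter
from difflib import SequenceMatcher

def ordered_similarity(a, b):
    """Order-sensitive: reordered statements lower the score."""
    return SequenceMatcher(None, a, b).ratio()

def unordered_similarity(a, b):
    """Order-insensitive: shared statements over the longer snippet's length."""
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b)) if (a or b) else 1.0

# Loop bodies from the slide, one statement per list element.
snippet_a = ["a ++;", "b ++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
snippet_b = ["c = foo(a, b);", "a ++;", "b ++;", "d = bar(a, b, c);",
             "e = a + d;", "e ++;"]

ordered = ordered_similarity(snippet_a, snippet_b)
unordered = unordered_similarity(snippet_a, snippet_b)
```

The order-insensitive score is higher here because the moved `c = foo(a, b);` statement still counts as shared. A tunable detector would let the engineer choose how much such reordering to tolerate.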

Page 73: Software Mining and Software Datasets


1. Clone navigation based on source tree hierarchy

2. Pivoting of folder level statistics

3. Folder level statistics

4. Clone function list in selected folder

5. Clone function filters

6. Sorting by bug or refactoring potential

7. Tagging


1. Block correspondence

2. Block types

3. Block navigation

4. Copying

5. Bug filing

6. Tagging


Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 74: Software Mining and Software Datasets


Quality gates at milestones

• Architecture refactoring

• Code clone clean up

• Bug fixing

Post-release maintenance

• Security bug investigation

• Bug investigation for sustained engineering

Development and testing

• Checking for similar issues before check-in

• Reference info for code review

• Supporting tool for bug triage

Online code clone search

Offline code clone analysis

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 75: Software Mining and Software Datasets

Available in Visual Studio 2012

Searching for similar snippets so that a bug is fixed once

Finding refactoring opportunity

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 76: Software Mining and Software Datasets

Code Clone Search service integrated into workflow of Microsoft Security Response Center

Hundreds of millions of lines of code indexed across multiple products

Real security issues proactively identified and addressed

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 77: Software Mining and Software Datasets

Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published: Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, Windows Journal viewer

Microsoft Technet Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.

Page 78: Software Mining and Software Datasets

Software Data Sets

Page 79: Software Mining and Software Datasets


• PROMISE repository: http://openscience.us/repo/

• Boa: http://boa.cs.iastate.edu/

• FLOSSmole: http://flossmole.org/

• Software-artifact infrastructure repository: http://sir.unl.edu/portal/index.html

• Socorro (Mozilla Crash Stats project): https://wiki.mozilla.org/Socorro

Page 80: Software Mining and Software Datasets


• FLOSSmole

– provides raw data about open source projects

– provides summary reports about open source projects

– integrates donated data from other research teams

– provides tools so you can gather your own data

• Data sources

– Sourceforge

– Freshmeat

– Rubyforge

– ObjectWeb

– Free Software Foundation (FSF)

– SourceKibitzer

http://flossmole.org/

Page 81: Software Mining and Software Datasets


Page 82: Software Mining and Software Datasets


Yearly MSR Challenge Since 2006

• http://2016.msrconf.org/#/challenge (Boa data for SourceForge, GitHub)

• http://2015.msrconf.org/challenge.php (Stackoverflow data)

• http://2014.msrconf.org/challenge.php (GitHub data)

• http://2013.msrconf.org/challenge.php (Stackoverflow data)

• http://2012.msrconf.org/challenge.php (Change/bug report data for Android)

• http://2011.msrconf.org/msr-challenge.html (Eclipse, Netbeans, Firefox, Chrome data)

• http://msr.uwaterloo.ca/msr2010/challenge/ (FreeBSD, GNOME, Debian/Ubuntu)

• http://msr.uwaterloo.ca/msr2009/challenge/ (GNOME)

• http://msr.uwaterloo.ca/msr2008/ (Eclipse)

• http://msr.uwaterloo.ca/msr2007/challenge/ (Eclipse, Firefox)

• http://msr.uwaterloo.ca/challenge/ (PostgreSQL, ArgoUML)

• http://msr.uwaterloo.ca/msr2005/

• http://msr.uwaterloo.ca/msr2004/

Page 83: Software Mining and Software Datasets


• PROMISE repository: http://openscience.us/repo/

• TraceLab: http://www.coest.org/index.php/tracelab/about-tracelab

• Apache SVN commits on Github: https://github.com/monperrus/apache-commits/

• Benchmarks for software maintenance tasks: http://www.cs.wm.edu/semeru/data/benchmarks/

• Non-functional requirements wordlists: http://softwareprocess.es/static/What%27s_in_a_Name.html

• Source code ECOsystem Linked Data (SeCold): http://www.secold.org/

• Text Analysis for Software Engineering Wiki: http://textse.wikispaces.com/Home

Page 84: Software Mining and Software Datasets
Page 85: Software Mining and Software Datasets
Page 86: Software Mining and Software Datasets
Page 87: Software Mining and Software Datasets
Page 88: Software Mining and Software Datasets
Page 89: Software Mining and Software Datasets
Page 90: Software Mining and Software Datasets

Must show value before data quality improves

Correlation vs. Causation

Page 91: Software Mining and Software Datasets


• Make sure you manually examine the repositories. Do not fully automate the process!

Image from http://www.quincyma.gov/Government/OCS/neighborhoodtips.cfm

Page 92: Software Mining and Software Datasets


• Drop all transactions above a large threshold
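
Filtering out very large transactions is a one-line step before mining. The sketch below assumes each transaction lists its changed files and uses a hypothetical cutoff of 30 files. The slide does not give a number, so the threshold should be chosen after inspecting the project's own commit-size distribution.

```python
def drop_large_transactions(transactions, max_files=30):
    """Keep only transactions that touch at most max_files files."""
    return [t for t in transactions if len(t["files"]) <= max_files]

# Made-up transactions for illustration.
transactions = [
    {"id": "r101", "files": ["parser.c", "parser.h"]},
    {"id": "r102", "files": [f"src/file{i}.c" for i in range(400)]},  # e.g. a license-header update
    {"id": "r103", "files": ["lexer.c"]},
]
kept = drop_large_transactions(transactions)
```

Very large commits, such as branch merges or mass header updates, connect files that have no real relationship, so they would add noise to co-change analysis.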

Page 93: Software Mining and Software Datasets


• Few developers are given commit privileges

• Actual developer is usually mentioned in the change message

• One must study project commit policies before reaching any conclusions

[German 2006]
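
When the actual developer is credited only in the change message, attribution needs a text-extraction step. The sketch below assumes credit lines such as "Submitted by:", "Contributed by:", or "Patch by:". These are illustrative conventions and vary by project, which is why the commit policy has to be studied first. The commit record layout and names are hypothetical.

```python
import re

# Assumed credit-line conventions; adjust after reading the project's commit policy.
CREDIT_PATTERN = re.compile(
    r"^\s*(?:submitted|contributed|patch(?: provided)?)[ -]by\s*:?\s*"
    r"(?P<name>[^<\n]+?)\s*(?:<[^>]*>)?\s*$",
    re.I | re.M,
)

def actual_developer(commit):
    """Return the credited author if the message names one, else the committer."""
    match = CREDIT_PATTERN.search(commit["message"])
    return match.group("name") if match else commit["committer"]

# Made-up commits for illustration.
credited = {
    "committer": "committer1",
    "message": "Fix null check in parser\n\n"
               "Submitted by: Jane Doe <jane@example.org>\n"
               "Reviewed by: committer1",
}
direct = {"committer": "committer1", "message": "Update build script"}

authors = [actual_developer(credited), actual_developer(direct)]
```

Manually checking a sample of extracted names against the messages is a cheap way to catch conventions the pattern misses.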

Page 94: Software Mining and Software Datasets

APP DEVELOPERS: App Functional Requirements, App Security Requirements

APP USERS: User Functional Requirements, User Security Requirements

Informal: app description, etc.; permission list, etc.

App Code

Pandita, Xiao, Yang, Enck, Xie. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX SEC 2013.

Tutorial Slides: http://www.slideshare.net/taoxiease/text-analytics-for-security

Page 95: Software Mining and Software Datasets
Page 96: Software Mining and Software Datasets

• External collaborators: Ahmed Hassan, Dongmei Zhang, …

• Students…

• Broad colleagues in this area, …

• Funding supports: