An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada

DESCRIPTION

Talk given at the 2009 International Conference on Software Maintenance in Edmonton, Alberta, Canada.

TRANSCRIPT

Development Repositories

SOURCE CODE

COMMUNICATION ARCHIVES

BUG DATABASES

The Importance of Mailing List Archives

• Email popular form of communication

• Mailing lists to distribute messages

• Messages contain valuable information

    • Discussions of source code

    • Development decisions

    • Error reports

    • User support requests

Mining the Mailing Lists of 23 Open-Source Projects

• Summarizing developer mailing lists

• Using off-the-shelf tools

• Data from around 500,000 emails

• Unexpected results from experiments

scatter !! things !! info !! mlw !! bool !! palloc(bufsize); !! symlinks !! configuration !! - !! looks !! -C !! } !! getopt(argc, !! specifies

!! Nov !! --- !! && !! NULL, !! reasonable. !! -r !! argv, !! Added !! reasons !! A1 !! http://www.gnupg.org !! postmaster !! * !! == !! live !! wrote

!! > !! break; !! get !! SIGNATURE !! -----END !! != !! symlinks. !! command !! + while !! wrote: !! different !! EOF) !! ___ !! allows

!! char !! 1F !! file !! postgres !! Dec !! 43 !! DataDir, !! pg_hba.conf !! 69 !! + SetDataDir(potential_DataDir); !! convenient !! + } !! put !! GnuPG

!! -- !! 093E !! stuff !! -D !! switch !! NULL !! +extern !! recursion !! admin !! setting !! 5B !! Version: !! @@ !! system !! issues !! -u !!

/path/default.conf !! see !! overides !! http://xyzzy.dhs.org/~drew/ !! + /* !! (char !! -p !! sizeof(char); !! 19 !! if !! /etc/postgresql !! directory !! note !!

"datadir" !! running !! PGP !! (GNU/Linux) !! "hbaconfig" !! file. !! ((opt !! 2001 !! 72 !! Apache !! way !! NULL; !! options !! CONF_FILE);

!! bits !! simple !! databases !! */ !! servers !! multiple !! /* !! share !! "A:a:B:b:c:D:d:Fh:ik:lm:MN:no:p:Ss-:")) !! blow !! + !! DataDir); !! "To !! method !!

#include !! *) !! vendors !! E3 !! people !! + { !! 08:27:06 !! 3B !! 16 !! +# !! explicit !! debian !! data !! = !! +char !!

malloc(strlen(DataDir) !! +++ !! !! v1.0.6 !! having !! 56 !! /etc. !! 613-389-5481 !! extern !! case !! error !! strlen(CONFIG_FILENAME) !! Comment: !! line !!

diff !! easier !! certs !! given !! { !!

Funny, !! fiat !! configuration !! PGDATA !! impose !! them. !! opinion !! keys !! long !! environment !! agrees !! resides. !! start !! variable. !! normal !!

organize !! single !! creating !! exactly !! postgresql !! original !! stuff !! described !! (My !! say !! \"pg\" !! BSD !! fruity !! me, !! real !! little !! want

!! $PGDATA/; !! sort !! specifies !! certs !! data !! Tux !! looks !! policy, !! '/etc/pgsql/pg_hba.conf' !! servers !! maintain !! (This

!! = !! week !! scattered !! patch !! layout !! linux, !! '/u01/postgres' !! give !! path !! all. !! file. !! live !! belongs, !! stuff, !! result !! way !! -p

!! sux. !! Apache !! specified, !! hey, !! reasonable. !! reasons !! it. !! damn !! options: !! utterly !! line, !! files !! consistency !! datadir !!

debian. !! method !! considering !! always. !! options !! symlinks. !! different !! 5434 !! /etc/pgsql/mydb.conf !! delivers !! me. !!

/etc/apache. !! /etc/postgresql !! overides !! things !! using, !! symlinking !! convenient !! able !! hbaconfig !! /path/default.conf !! command !! controllable !! modssl !! undesired !! /path/name3" !! ","I !! Similarly, !! ObFlame: !! And, !!

postmaster !! Config !! directory !! discussion !! packager !! ass. !! really !! machine !! subdirectory !! distros !! bet !!

package. !! devil !! sense !! hbaconfig !! /etc/nessusd. !! logical. !! behavior !! crypto !! Debian !! set, !! 5432 !! as: !! share !! line

!! Ross !! having !! kinda !! see !! forced !! people !! pg_hba.conf !! pgdatadir !! /path/name2 !! guess !! get !! own. !! nice !! /path/name1 !! simple !! setting !! rational !!

While Mining the Mailing Lists of 23 Open-Source Projects

• Don’t treat mail archives as textual data

• Changing technologies

• Up to 98% of messages contain noise

Additional processing and cleaning needed!

From [email protected] Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <[email protected]>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8WUgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/QxffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn

Resolving Multiple Sender Identities

• Participants send mail from different addresses

• Up to 21% of addresses are aliases

• Such aliases bias identity-based analyses

• Manual inspection and correction tedious

• No fully automated approach to resolve identities
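
Since the slide notes that no fully automated approach exists, the following is only a minimal sketch of one common heuristic: grouping addresses that share a normalized display name, so that a person can review the candidates. The function names and the sample addresses are hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name):
    """Lowercase a display name and replace punctuation with spaces."""
    kept = [c.lower() if c.isalnum() else " " for c in display_name]
    return " ".join("".join(kept).split())


def candidate_aliases(from_headers):
    """Group addresses that share a normalized display name.

    Returns {normalized name: set of addresses} for names seen with more
    than one address. These are candidates for manual review, not
    confirmed aliases.
    """
    by_name = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        if name and address:
            by_name[normalize_name(name)].add(address.lower())
    return {n: a for n, a in by_name.items() if len(a) > 1}


# Hypothetical From: headers for illustration only.
headers = [
    '"Jane Q. Doe" <jane@example.org>',
    "Jane Q Doe <jdoe@mail.example.com>",
    "John Roe <john@example.net>",
]
aliases = candidate_aliases(headers)
print(aliases)
```

Name matching of this kind misses people who change their display name and can merge distinct people who share one, which is why the output is treated as a review list rather than a result.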

Reconstructing Discussion Threads

[Figure: messages A, B, C and D arranged as a Linear Sequence and as a Thread Hierarchy]

• Mail stored sequentially in archives

• Logical grouping: discussion topics

• Required information erroneous or missing

• Essential for social network and topic analysis
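
As a rough illustration of turning a linear sequence into a thread hierarchy, the sketch below links messages through the standard Message-ID and In-Reply-To headers. It assumes those headers are present and correct; as the slide points out, they are often erroneous or missing, in which case a message is simply treated as a thread root here. The helper names and the reply structure of the sample messages are hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Map each parent Message-ID to the Message-IDs of its direct replies.

    Messages whose In-Reply-To header is absent or points at an unknown
    message are filed under None, i.e. treated as thread roots.
    """
    parsed = [message_from_string(raw) for raw in raw_messages]
    known_ids = {m["Message-ID"] for m in parsed if m["Message-ID"]}
    children = defaultdict(list)
    for m in parsed:
        msg_id = m["Message-ID"]
        if not msg_id:
            continue  # cannot place a message that has no identifier
        parent = (m["In-Reply-To"] or "").strip()
        if parent not in known_ids:
            parent = None
        children[parent].append(msg_id)
    return dict(children)


def raw(msg_id, parent=None):
    """Build a tiny raw message for the demonstration (hypothetical data)."""
    lines = [f"Message-ID: <{msg_id}@example.org>"]
    if parent:
        lines.append(f"In-Reply-To: <{parent}@example.org>")
    return "\n".join(lines) + "\n\nbody\n"


# Stored sequentially as A, B, C, D; B and D reply to A, C replies to B.
archive = [raw("A"), raw("B", "A"), raw("C", "B"), raw("D", "A")]
threads = build_threads(archive)
print(threads)
```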

Attachments

• MIME standard defines extensions to email

• Binary data encoded as text

• Around 10% of messages have attachments

• Extract attachments and store separately
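
One possible way to separate attachments from the message text is to parse the MIME structure with Python's standard email package rather than treating the message as plain text. This is a sketch under assumptions, not the tooling used in the study: the function name, the rule for what counts as an attachment, and the sample message are all hypothetical.

```python
import gzip
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_attachments(raw_bytes):
    """Return (body_text, [(filename, payload_bytes), ...]) for one message.

    Assumption: any leaf part that is not text, or that carries a filename,
    is an attachment; the remaining text/plain parts form the body.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no payload of their own
        if part.get_content_maintype() != "text" or part.get_filename():
            # decode=True reverses the base64 transfer encoding
            attachments.append((part.get_filename(), part.get_payload(decode=True)))
        elif part.get_content_type() == "text/plain":
            body_parts.append(part.get_content())
    return "".join(body_parts), attachments


# Build a small two-part message to exercise the function (made-up data).
demo = EmailMessage()
demo["Subject"] = "Re: configure"
demo.set_content("Here is a gzip'ed tar of the results.\n")
demo.add_attachment(
    gzip.compress(b"results"),
    maintype="application",
    subtype="x-gzip",
    filename="results.gz",
)

body, files = split_attachments(demo.as_bytes())
print(body.strip())
print([name for name, _ in files])
```

Storing the decoded payloads separately keeps the base64 text out of any later text-mining step while preserving the attachment for other analyses.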

Quotes and Signatures

• Duplicate information

• Unrelated to actual message

• Removing signatures is challenging

• Quoted text may or may not be desirable

• Signatures impact text mining approaches

• No perfect method for signature removal

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger gee[email protected].edu for my public key. |
=====================================================================
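
The slide stresses that no perfect method for signature removal exists, so the following is only a heuristic sketch. It assumes that quoted lines start with ">" and that a signature begins at the conventional "-- " separator or at a ruler line made of decoration characters, as in the example message. The regular expressions and the function name are hypothetical choices, and a ruler line inside the message body would wrongly cut the text.

```python
import re

# Conventional signature separator: a line holding only "--" or "-- ".
SIG_DELIMITER = re.compile(r"^-- ?$")
# Ruler lines such as "=====" that often open a decorated signature block.
DECORATION = re.compile(r"^\s*[=\-_*#|]{5,}\s*$")


def strip_quotes_and_signature(body):
    """Drop quoted lines and everything from the first signature marker on."""
    kept = []
    for line in body.splitlines():
        if SIG_DELIMITER.match(line) or DECORATION.match(line):
            break  # assume the rest of the message is signature
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


ruler = "=" * 69
body = "\n".join([
    "> If you can grab a copy and run it on your machine, and send me",
    "> the output, that would help alot.",
    "",
    "Here is a gzip'ed tar of the results.",
    "",
    ruler,
    "| Please do not shoot at the thermonuclear weapons! -- Deacon |",
    ruler,
])
cleaned = strip_quotes_and_signature(body)
print(cleaned)
```

Whether quoted text should be dropped depends on the analysis, as the slide notes; removing the `continue` branch keeps the quotes and strips only the signature.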

More Risks Presented in the Paper

(1) Mailing lists contain valuable information on a project.

(2) Data needs pre-processing before applying traditional tools.

(3) Manual data processing is often not feasible or requires much effort.

(4) Off-the-shelf tools were not designed to prepare data for mining.
