An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
DESCRIPTION
Talk given at the 2009 International Conference on Software Maintenance in Edmonton, Alberta, Canada.
TRANSCRIPT
An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada
1
Development Repositories
SOURCE CODE
COMMUNICATION ARCHIVES
BUG DATABASES
2
The Importance of Mailing List Archives
• Email popular form of communication
• Mailing lists to distribute messages
• Messages contain valuable information
• Discussions of source code
• Development decisions
• Error reports
• User support requests
4
Mining the Mailing Lists of 23 Open-Source Projects
• Summarizing developer mailing lists
• Using off-the-shelf tools
• Data from around 500,000 emails
• Unexpected results from experiments
5
[Word cloud: terms from a mailing list discussion, dominated by source code and patch fragments (palloc(bufsize);, getopt(argc,, +++, @@), PGP signature remnants (-----END, SIGNATURE, GnuPG), and date fragments (Nov, Dec, 2001, 08:27:06).]
[Second word cloud: terms from the PostgreSQL configuration discussion, dominated by natural-language words (configuration, PGDATA, postgresql, Debian, symlinks, pg_hba.conf, /etc/postgresql) rather than code fragments.]
6
While Mining Mailing Lists of 23 Open-Source Projects
• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise
Additional processing and cleaning needed!
8
From [email protected] Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <[email protected]>
Subject: Re: [HACKERS] configure
- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.
Here is a gzip'ed tar of the results.
=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================
- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz
H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8WUgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/QxffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
9
Resolving Multiple Sender Identities
• Participants send mail from different addresses
• Up to 21% of addresses are aliases
• Such aliases bias identity-based analyses
• Manual inspection and correction tedious
• No fully automated approach to resolve identities
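Since the slide notes that no fully automated approach exists, the following Python sketch only proposes alias candidates for manual review. It assumes that addresses sharing a normalized display name may belong to one person; the helper names and the example.org addresses are hypothetical and not taken from the study.

```python
import re
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name: str) -> str:
    """Lowercase a display name, drop punctuation and single-letter initials."""
    words = re.findall(r"[a-z]+", display_name.lower())
    return " ".join(w for w in words if len(w) > 1)


def candidate_aliases(from_headers):
    """Group distinct addresses that share a normalized display name."""
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        key = normalize_name(name)
        if key and address:
            groups[key].add(address.lower())
    # Only names seen with more than one address are alias candidates.
    return {name: sorted(addrs) for name, addrs in groups.items() if len(addrs) > 1}


# Hypothetical From headers, for illustration only.
headers = [
    '"Jane Q. Doe" <jane@example.org>',
    "Jane Doe <jdoe@example.com>",
    "John Smith <john@example.net>",
]
aliases = candidate_aliases(headers)
print(aliases)
```

A heuristic like this can both miss aliases (different spellings of a name) and merge distinct people who share a name, so its output is a starting point for inspection rather than a resolved identity list.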
11
Reconstructing Discussion Threads
[Figure: messages A, B, C, and D shown twice, once as a linear sequence and once as a thread hierarchy.]
• Mail stored sequentially in archives
• Logical grouping: discussion topics
• Required information erroneous or missing
• Essential for social network and topic analysis
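One common way to recover a thread hierarchy from a linear sequence is to link each message to its parent through the standard Message-ID and In-Reply-To headers. The Python sketch below assumes those headers are present; messages whose parent is missing from the archive are treated as thread roots, which is one possible fallback and not necessarily the one used in the study. The sample messages are hypothetical.

```python
from collections import defaultdict
from email import message_from_string

# Hypothetical messages; D replies to a message that is not in the archive.
RAW = [
    "Message-ID: <a@example.org>\nSubject: configure\n\nFirst message.\n",
    "Message-ID: <b@example.org>\nIn-Reply-To: <a@example.org>\n\nReply to A.\n",
    "Message-ID: <c@example.org>\nIn-Reply-To: <b@example.org>\n\nReply to B.\n",
    "Message-ID: <d@example.org>\nIn-Reply-To: <lost@example.org>\n\nOrphan.\n",
]


def build_threads(messages):
    """Return (roots, children); children maps a Message-ID to its reply IDs."""
    known = {m["Message-ID"] for m in messages if m["Message-ID"]}
    roots, children = [], defaultdict(list)
    for m in messages:
        msg_id = m["Message-ID"]
        if msg_id is None:
            continue  # cannot be placed without an identifier
        parent = m["In-Reply-To"]
        if parent is not None and parent in known:
            children[parent].append(msg_id)
        else:
            roots.append(msg_id)  # no parent header, or parent not archived
    return roots, dict(children)


messages = [message_from_string(raw) for raw in RAW]
roots, children = build_threads(messages)
print(roots)
print(children)
```

Because the required headers can be erroneous or missing, a real pipeline would likely need further fallbacks (for example the References header or subject matching), which this sketch leaves out.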
12
Attachments
• MIME standard defines extensions to email
• Binary data encoded as text
• Around 10% of messages have attachments
• Extract attachments and store separately
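Separating attachments from message text can be sketched with Python's standard email package. The example builds a small multipart message, then parses it back and splits it into body text and decoded attachments. The function name split_message, the file name results.tar.gz, and the payload bytes are illustrative assumptions.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def split_message(raw_bytes):
    """Return (body_text, attachments) for one raw MIME message."""
    msg = message_from_bytes(raw_bytes, policy=default)
    text_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() == "text" and part.get_filename() is None:
            text_parts.append(part.get_content())
        else:
            # decode=True reverses the base64 or quoted-printable encoding
            attachments.append((part.get_filename(), part.get_payload(decode=True)))
    return "".join(text_parts), attachments


# Build a hypothetical message with one binary attachment.
sample = EmailMessage()
sample["Subject"] = "Re: configure"
sample.set_content("Here is a tar of the results.\n")
sample.add_attachment(
    b"\x1f\x8b binary payload",
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",
)

body, attachments = split_message(sample.as_bytes())
print(body.strip())
print(attachments[0][0], len(attachments[0][1]))
```

Storing the decoded bytes separately keeps the base64 text out of any later text-mining step, while the file name still links the attachment back to its message.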
15
Quotes and Signatures
• Duplicate information
• Unrelated to actual message
• Removing signatures is challenging
• Quoted text may or may not be desirable
• Signatures impact text mining approaches
• No perfect method for signature removal
=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================
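As the slide states, there is no perfect method for signature removal. The Python sketch below shows one simple heuristic under stated assumptions: a signature starts at the conventional "-- " delimiter or at a rule line made of repeated punctuation, and quoted lines start with ">". The patterns and the function name are illustrative choices, not the method used in the study.

```python
import re

SIG_DELIMITER = re.compile(r"^-- ?$")         # conventional signature separator
RULE_LINE = re.compile(r"^[=\-_*]{10,}\s*$")  # ASCII rule, as in the example above


def strip_quotes_and_signature(body: str, keep_quotes: bool = False) -> str:
    """Cut the body at the first signature-like line and optionally drop quotes."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if SIG_DELIMITER.match(line) or RULE_LINE.match(line):
            lines = lines[:i]
            break
    if not keep_quotes:
        lines = [l for l in lines if not l.lstrip().startswith(">")]
    return "\n".join(lines).strip()


body = """> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.
Here is a gzip'ed tar of the results.
=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
"""
cleaned = strip_quotes_and_signature(body)
print(cleaned)
```

The keep_quotes flag reflects the point that quoted text may or may not be desirable. Note that the rule-line pattern will also cut messages that use such lines inside the body, for instance around pasted output, so the heuristic trades recall for occasional lost content.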
18
More Risks presented in the Paper
19
(1) Mailing Lists contain valuable information on a project.
(2) Data Needs Pre-Processing before applying traditional tools.
(3) Manual Data Processing is often not feasible or requires much effort.
(4) Off-the-Shelf tools were not designed to prepare data for mining.
20