An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada
Development Repositories
SOURCE CODE
COMMUNICATION ARCHIVES
BUG DATABASES
The Importance of Mailing List Archives
• Email is a popular form of communication
• Mailing lists distribute messages to subscribers
• Messages contain valuable information:
  • Discussions of source code
  • Development decisions
  • Error reports
  • User support requests
Mining the Mailing Lists of 23 Open-Source Projects
• Summarizing developer mailing lists
• Using off-the-shelf tools
• Data from around 500,000 emails
• Unexpected results from experiments
scatter !! things !! info !! mlw !! bool !! palloc(bufsize); !! symlinks !! configuration !! - !! looks !! -C !! } !! getopt(argc, !! specifies
!! Nov !! --- !! && !! NULL, !! reasonable. !! -r !! argv, !! Added !! reasons !! A1 !! http://www.gnupg.org !! postmaster !! * !! == !! live !! wrote
!! > !! break; !! get !! SIGNATURE !! -----END !! != !! symlinks. !! command !! + while !! wrote: !! different !! EOF) !! ___ !! allows
!! char !! 1F !! file !! postgres !! Dec !! 43 !! DataDir, !! pg_hba.conf !! 69 !! + SetDataDir(potential_DataDir); !! convenient !! + } !! put !! GnuPG
!! -- !! 093E !! stuff !! -D !! switch !! NULL !! +extern !! recursion !! admin !! setting !! 5B !! Version: !! @@ !! system !! issues !! -u !! /
path/default.conf !! see !! overides !! http://xyzzy.dhs.org/~drew/ !! + /* !! (char !! -p !! sizeof(char); !! 19 !! if !! /etc/postgresql !! directory !! note !!
\"datadir\" !! running !! PGP !! (GNU/Linux) !! \"hbaconfig\" !! file. !! ((opt !! 2001 !! 72 !! Apache !! way !! NULL; !! options !! CONF_FILE);
!! bits !! simple !! databases !! */ !! servers !! multiple !! /* !! share !! \"A:a:B:b:c:D:d:Fh:ik:lm:MN:no:p:Ss-:\")) !! blow !! + !! DataDir); !! \"To !! method !!
#include !! *) !! vendors !! E3 !! people !! + { !! 08:27:06 !! 3B !! 16 !! +# !! explicit !! debian !! data !! = !! +char !!
malloc(strlen(DataDir) !! +++ !! !! v1.0.6 !! having !! 56 !! /etc. !! 613-389-5481 !! extern !! case !! error !! strlen(CONFIG_FILENAME) !! Comment: !! line !!
diff !! easier !! certs !! given !! { !!
Funny, !! fiat !! configuration !! PGDATA !! impose !! them. !! opinion !! keys !! long !! environment !! agrees ! resides. !! start!! variable. !! normal !!
organize !! single !! creating !! exactly !! postgresql !! original !! stuff !! described !! (My !! say !! \"pg\" !! BSD !! fruity !! me, !! real !! little !! want
!! $PGDATA/; !! sort !! specifies !! certs !! data !! Tux !! looks !! policy, !! '/etc/pgsql/pg_hba.conf' !! servers !! maintain !! (This
!! = !! week !! scattered !! patch !! layout !! linux, !! '/u01/postgres' !! give !! path !! all. !! file. !! live !! belongs, !! stuff, !! result !! way !! -p
!! sux. !! Apache !! specified, !! hey, !! reasonable. !! reasons !! it. !! damn !! options: !! utterly !! line, !! files !! consistency !! datadir !!
debian. !! method !! considering !! always. !! options !! symlinks. !! different !! 5434 !! /etc/pgsql/mydb.conf !! delivers !! me. !! /etc/
apache. !! /etc/postgresql !! overides !! things !! using, !! symlinking !! convenient !! able !! hbaconfig !! /path/default.conf !! command !! controllable !! modssl !! undesired !! /path/name3" !! ","I !! Similarly, !! ObFlame: !! And, !!
postmaster !! Config !! directory !! discussion !! packager !! ass. !! really !! machine !! subdirectory !! distros !! bet !!
package. !! devil !! sense !! hbaconfig !! /etc/nessusd. !! logical. !! behavior !! crypto !! Debian !! set, !! 5432 !! as: !! share !! line
!! Ross !! having !! kinda !! see !! forced !! people !! pg_hba.conf ! !! pgdatadir !! /path/name2 !! guess !! get !! own. !! nice !! /path/name1 !! simple !! setting !! rational !!
While Mining the Mailing Lists of 23 Open-Source Projects
• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise
Additional processing and cleaning needed!
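One way to read the first point is that an archive is a sequence of structured messages, not a single text file. The sketch below is an illustration only, not the tooling used in the study: it writes a tiny made-up mbox archive (the addresses and message contents are invented) and reads it back with Python's standard `mailbox` module, so that headers and bodies are available per message.

```python
import mailbox
import os
import tempfile

# A made-up two-message archive in mbox format (hypothetical content).
RAW_ARCHIVE = (
    "From alice@example.org Mon Jan 27 12:50:44 1997\n"
    "From: Alice <alice@example.org>\n"
    "Subject: configure\n"
    "\n"
    "Does configure run on your machine?\n"
    "\n"
    "From bob@example.org Mon Jan 27 13:05:10 1997\n"
    "From: Bob <bob@example.org>\n"
    "Subject: Re: configure\n"
    "\n"
    "Yes, output attached.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(RAW_ARCHIVE)

    # mailbox.mbox splits the file on "From " separator lines and parses
    # each message, so header fields can be read by name.
    box = mailbox.mbox(path)
    try:
        subjects = [message["Subject"] for message in box]
    finally:
        box.close()
```

Once messages are parsed this way, later cleaning steps can work on individual header fields and body parts rather than on raw text.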
From [email protected] Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <[email protected]>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8WUgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/QxffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
Resolving Multiple Sender Identities
• Participants send mail from different addresses
• Up to 21% of addresses are aliases
• Such aliases bias identity-based analyses
• Manual inspection and correction tedious
• No fully automated approach to resolve identities
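Since no fully automated approach exists, a script can at best propose candidate aliases for manual review. The following is a minimal sketch of one such heuristic, assuming that addresses sharing the same normalized display name may belong to the same person. The function names and sample headers are hypothetical, and the heuristic will both miss aliases and produce false matches.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(name):
    """Lowercase a display name and drop punctuation so spelling variants compare equal."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def candidate_aliases(from_headers):
    """Group addresses that share a normalized display name.

    Returns a dict mapping each name to the set of addresses seen with it,
    keeping only names that occur with more than one address.
    """
    by_name = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        if name and address:
            by_name[normalize_name(name)].add(address.lower())
    return {name: addrs for name, addrs in by_name.items() if len(addrs) > 1}


# Invented sample headers for illustration.
headers = [
    '"Jane Q. Doe" <jane@example.org>',
    "Jane Q Doe <jdoe@example.com>",
    "John Smith <john@example.net>",
]
aliases = candidate_aliases(headers)
```

Here `aliases` holds one group for "jane q doe" with both addresses. The output is meant as a shortlist for a human to confirm, not as a resolved identity map.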
Reconstructing Discussion Threads
[Figure: messages A, B, C, D shown as a Linear Sequence and as a Thread Hierarchy]
• Mail stored sequentially in archives
• Logical grouping: discussion topics
• Required information erroneous or missing
• Essential for social network and topic analysis
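A common starting point for rebuilding the hierarchy is the standard `In-Reply-To` and `References` headers. The sketch below assumes those headers are present and illustrates what happens when they are not: a reply whose parent cannot be found is treated as the start of a new thread. The function name and the sample messages are hypothetical, and this is not the reconstruction method used in the study.

```python
from email import message_from_string


def build_threads(raw_messages):
    """Return (parents, roots) for a list of raw message strings.

    parents maps a Message-ID to the Message-ID it replies to; roots lists
    the Message-IDs with no resolvable parent, in archive order.
    """
    ids, claimed = [], {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # cannot be placed in a thread without an identifier
        ids.append(msg_id)
        # Prefer In-Reply-To; otherwise use the last entry of References.
        parent = (msg.get("In-Reply-To") or "").strip()
        if not parent:
            refs = (msg.get("References") or "").split()
            parent = refs[-1] if refs else ""
        claimed[msg_id] = parent

    # Keep only links whose parent actually exists in the archive.
    known = set(ids)
    parents = {m: p for m, p in claimed.items() if p in known and p != m}
    roots = [m for m in ids if m not in parents]
    return parents, roots


# Invented archive: three linked messages and one reply to a missing parent.
archive = [
    "Message-ID: <a@x>\nSubject: configure\n\nFirst message.",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Re: configure\n\nReply.",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nSubject: Re: configure\n\nReply.",
    "Message-ID: <d@x>\nIn-Reply-To: <lost@x>\nSubject: Re: other\n\nOrphan.",
]
parents, roots = build_threads(archive)
```

Messages with erroneous or missing headers end up as roots here; recovering their true position would need additional heuristics such as subject matching, which this sketch does not attempt.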
Attachments
• MIME standard defines extensions to email
• Binary data encoded as text
• Around 10% of messages have attachments
• Extract attachments and store separately
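Separating attachments from the message text can be sketched with Python's standard `email` package. The example builds a small two-part message itself (the subject, filename, and payload are invented) and then splits it into body text and decoded attachment bytes. It relies on a `Content-Disposition: attachment` header; older messages that lack one would need an extra check on the content type, which is not shown.

```python
import gzip
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_message(raw_bytes):
    """Return (body_text, attachments) for one raw MIME message.

    attachments is a list of (filename, payload_bytes) tuples.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.is_attachment():
            # get_payload(decode=True) reverses base64 / quoted-printable.
            attachments.append((part.get_filename(), part.get_payload(decode=True)))
        elif part.get_content_type() == "text/plain":
            body_parts.append(part.get_content())
    return "".join(body_parts), attachments


# Build a small two-part message to exercise the function.
original = EmailMessage()
original["Subject"] = "Re: configure"
original.set_content("Here is a gzip'ed tar of the results.\n")
original.add_attachment(
    gzip.compress(b"build log"),
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",
)

body, files = split_message(original.as_bytes())
```

After the split, `body` can go to text mining while each entry in `files` can be written to separate storage under its filename.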
Quotes and Signatures
• Duplicate information
• Unrelated to actual message
• Removing signatures is challenging
• Quoted text may or may not be desirable
• Signatures impact text mining approaches
• No perfect method for signature removal
=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================
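As the slide notes, there is no perfect method for signature removal, so the following is only a rough heuristic sketch. It assumes a signature starts either at the conventional `-- ` delimiter line or at a long decorative rule such as a run of `=`, and it treats lines starting with `>` as quoted text. The patterns and the function name are assumptions made for illustration; the rule pattern in particular can misfire on patches or tables that contain similar separator lines.

```python
import re

QUOTED = re.compile(r"^\s*>")
# "-- " is the conventional signature delimiter; the second alternative is an
# assumed heuristic for decorative rules such as a long run of "=".
SIGNATURE_START = re.compile(r"^(-- ?|\s*[=_*~-]{10,}\s*)$")


def clean_body(text, keep_quotes=False):
    """Strip a trailing signature and, unless keep_quotes is set, quoted lines."""
    kept = []
    for line in text.splitlines():
        if SIGNATURE_START.match(line):
            break  # treat everything from here on as signature
        if not keep_quotes and QUOTED.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


message = """> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
"""
cleaned = clean_body(message)
```

The `keep_quotes` flag reflects that quoted text may or may not be desirable, depending on the analysis.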
More Risks Presented in the Paper
(1) Mailing lists contain valuable information on a project.
(2) Data needs pre-processing before applying traditional tools.
(3) Manual data processing is often not feasible or requires much effort.
(4) Off-the-shelf tools were not designed to prepare data for mining.