
Creating and Using a Correlated Corpora to Glean Communicative Commonalities

Jade Goldstein-Stewart (U.S. Dept. of Defense)
Kerri A. Goodwin (Loyola College)
Roberta E. Sabin (Loyola College)
Ransom K. Winder (MITRE Corporation)

May 30, 2008 LREC 2

Outline

• Motivation
• Corpora collection
• General Corpora Characteristics
  – Word count
  – Readability
• Future directions


Motivation

• How do computer-mediated communication genres differ from traditional genres?

  – computer-mediated: email, blog, chat
  – traditional: interview, essay, discussion

• How consistent are communicative features across genres for a single individual?

• If such commonalities exist, how can they be utilized for document classification?


Email sample (2E1S3)

I do not feel that gender discrimination is a problem in the United States at the moment. My supervisor at my current job is a woman, and everyone respects her the same as the owner of the company, who is a man. I think this issue was more prevalent earlier last century. In these modern times, it really is not an issue in my opinion.


Blog sample (2B1S2)

While gender discrimination is something that should always be avoided ideally, there are some problems I have with the issue in general.  As the discussion starter states, discrimination because of sex is defined as adverse action against another person, that would not have occurred had the person been of another sex. 


Chat sample (2C1S1)

– Are there a lot of issues like this in the news, because to me generder discrimination is a thing of the past 

– Aren't men found to be naturally more apt in certain fields, and women in others? 

– Did any of you experienece any personal discrimination at your jobs, or witness it or anything? 

– I definitely agree with that

– Unless one person decides another person is not right for a job solely based on gender, I don't believe it is discrimination


Aim: Collect a correlated corpora of text samples

• Including both computer-mediated and non-computer-mediated communication
• Including both individual and interactive, spoken and text
• Across 6 genres:
  – email, essay, interview (phone)
  – blog, chat, discussion
• From the same individuals
• On 6 distinct topics


Corpora Collection
September 2006 through November 2007

Participants

• All college students, aged 18-29
• 12 students in pilot study
• 21 participants completed both Phase 1 (email, essay, interview) & Phase 2 (blog, chat, discussion)
• 10M/11W
• 18 Caucasian/3 African-American
• all had English as the primary language spoken at home


Topics

• Piloted via individual interviews with a separate group
• Selected for
  – production of expression
  – comfort of participants with the topic
• Topics:
  1. Catholic Church
  2. Gay Marriage
  3. Iraq War
  4. Legalization of Marijuana
  5. Privacy as a U.S. Citizen
  6. Gender Discrimination
• Each introduced via a “starter” question


Other Design Issues

• Individual instructions standardized
• Environments controlled
  – In-house email system
  – Single discussion leader and phone interviewer
  – Relaxed discussion and interview setting
  – Chat sessions “gently” moderated
• Ordering of genres and topics controlled
• Group membership randomized
  – gender balance 2M/2W


All .txt files produced

• Interviews and Discussions transcribed
  – by trained psychology students
  – punctuation inserted
  – non-fluencies preserved
• Discussion and Chat separated into individual files
• Multiple blog entries combined into a single file


Resulting Corpora

                            Emails  Essays  Interviews  Blogs*  Chat  Discussion  All
Totals                      180     180     186         132     132   132         942
21 fully parallel corpora   126     126     126         126     126   126         756

* Blog entries were combined into single files.

The 21 fully parallel corpora were used in this paper.

Limitations: size, homogeneity of subjects, non-spontaneity of discourse

From Same Individuals
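
The parallel-corpora row of the table can be checked against the design described on the earlier slides: 21 participants, each contributing one sample per genre per topic. The sketch below only enumerates those combinations; the variable names and the participant numbering are illustrative assumptions, not the authors' file-naming scheme.

```python
from collections import Counter
from itertools import product

# Genres and topics as listed on the slides.
GENRES = ["email", "essay", "interview", "blog", "chat", "discussion"]
TOPICS = [
    "Catholic Church",
    "Gay Marriage",
    "Iraq War",
    "Legalization of Marijuana",
    "Privacy as a U.S. Citizen",
    "Gender Discrimination",
]
N_PARTICIPANTS = 21  # participants who completed both phases

# One (participant, genre, topic) triple per expected sample.
# Participant ids 1..21 are hypothetical placeholders.
samples = list(product(range(1, N_PARTICIPANTS + 1), GENRES, TOPICS))

# Number of samples per genre and overall.
per_genre = Counter(genre for _, genre, _ in samples)
total = len(samples)

print(per_genre["email"], total)
```

Each genre comes to 21 × 6 = 126 samples and the six genres together to 756, matching the table.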


General Corpora Characteristics

• Word Count
  – by topic
  – by genre
  – by gender of communicant
• Readability: Flesch reading ease & Flesch-Kincaid grade level
  – by topic
  – by genre
  – by gender of author
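
For reference, the two readability measures named above use the standard formulas: Flesch reading ease = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), and Flesch-Kincaid grade level = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The sketch below is a minimal illustration of those formulas. The sentence splitter and the vowel-group syllable counter are rough assumptions made here, not the tooling used in the study, so scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade level) for text."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level
```

Higher reading ease means easier text, while higher grade level means harder text, so the two measures move in opposite directions.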


Word Count

• No main effect for gender
• No main effect for topic
• Significant topic x gender interaction for Church and Discrimination

[Figure: mean word count (y-axis, 500 to 700) by topic (Church, Gay, Iraq, Marij, Privacy, Discrim) for Men, Women, and Combined]


Word Count (cont’d)

• Significant main effect for genre
• Discussion had highest word counts
• Direct communication produced higher word counts

[Figure: mean word count (y-axis, 100 to 1300) by genre (Email, Essay, Inter, Blog, Chat, Disc) for Men, Women, and Combined]


Readability

• No significant main effect for gender
• Significant main effect for genre
  – Discussion and interview had highest reading ease
  – Main effect for topic

[Figure: mean Flesch reading ease (y-axis, 65 to 81) by topic (Church, Gay, Iraq, Marij, Privacy, SexDis) for Men, Women, and Combined]


Readability (cont’d)

• reading ease of conversational genres high
• reading ease of non-conversational genres low

[Figure: grade level (y-axis, 0 to 12) by topic (Church, Gay, Iraq, Marijuana, Privacy, SexDisc) for each genre: Email, Essay, Interview, Blog, Chat, Disc]


Future Possibilities

• additional features for gender ID, authorship
• sentence complexity
• cohesion of text
• feature change across time within a topic
• classification by topic order
• classification by genre
• conversational dynamics in chat vs. discussion


Thank you.

Questions?

www.cs.loyola.edu/~res