


Improved Cyberbullying Detection through Personal Profiles

Maral Dadvar and Franciska de Jong

Human Media Interaction (HMI) group, Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente

Contact: Maral Dadvar, ZI2120, HMI, POBox 217, 7500 AE, University of Twente, the Netherlands

FP7-ICT-2007-3

Gender-based study

Materials: the MySpace dataset, a dictionary of profane words, and a Support Vector Machine (SVM) classifier trained with four features. The dataset was divided into two groups, Female and Male, based on the gender of the person who wrote the post.
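The four features named on this poster (profane words, second person pronouns, other personal pronouns, term weighting) can be sketched as a small feature extractor applied separately to each gender group. This is a minimal illustration, not the authors' implementation: the word sets below are hypothetical stand-ins for the profane words dictionary and pronoun lists, the function names are invented for this example, and a smoothed TF-IDF sum is assumed for the term weighting, since the poster does not specify the scheme. The resulting vectors are what an SVM classifier would be trained on for each group.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins: the poster does not list its dictionary or pronoun sets.
PROFANE_WORDS = {"dumb", "damn", "hell"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}
OTHER_PRONOUNS = {"i", "me", "my", "he", "him", "his", "she", "her",
                  "we", "us", "they", "them"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequencies(texts):
    """Count, for every term, the number of posts that contain it."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


def extract_features(text, df, n_posts):
    """Return [profane count, 2nd-person count, other-pronoun count, term weight]."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    profane = sum(counts[w] for w in PROFANE_WORDS)
    second = sum(counts[w] for w in SECOND_PERSON)
    other = sum(counts[w] for w in OTHER_PRONOUNS)
    # Assumed term weighting: smoothed TF-IDF summed over the post's terms.
    weight = 0.0
    for term, count in counts.items():
        tf = count / len(tokens)
        idf = math.log((1 + n_posts) / (1 + df[term])) + 1
        weight += tf * idf
    return [profane, second, other, weight]


def features_by_gender(posts):
    """posts: list of (text, gender) pairs. Returns {gender: [feature vectors]}."""
    groups = {}
    for text, gender in posts:
        groups.setdefault(gender, []).append(text)
    result = {}
    for gender, texts in groups.items():
        df = document_frequencies(texts)  # statistics computed per gender group
        result[gender] = [extract_features(t, df, len(texts)) for t in texts]
    return result
```

Computing the document frequencies inside each gender group, rather than over the whole dataset, is one way to let the term weights reflect gender-specific vocabulary; whether the original system did this is not stated on the poster.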

Cyberbullying is defined as an aggressive, intentional act carried out by a group or individual, using electronic forms of contact repeatedly or over time against a victim who cannot easily defend herself. (Espelage et al. 2003)

Technical Challenges in cyberbullying detection

Few technical studies address cyberbullying detection, mainly because of the following challenges:

In short

Several technical challenges in cyberbullying detection still need to be investigated properly. Given the nature of this social misbehaviour, we propose a socio-technical approach to address them. In this study we demonstrated that incorporating personal profile information improves the discrimination capacity of the system for cyberbullying detection. We are also evaluating a multi-system approach to overcome some of the shortcomings of current studies.

The MySpace dataset is available at http://caw2.barcelonamedia.org/

[Figure: most frequently used foul words by gender. Panels: Female actors, Male actors. Terms shown: hell, dumb, fuck, bitch, shit, ass, damn, gay, bullshit, pissed, f**k.]

Cross-systems approach

A feasibility study among 1000 random YouTube users shows that 6.2% link to all three, and 42.8% link to at least one, of their Facebook, Twitter, and Tumblr accounts. This calls for: post-harassing behaviour analysis; detection of whether an actor is a random harasser or a bullying stalker; user tracking.
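For a sample of 1000 users, the reported shares correspond to 62 users linking to all three platforms and 428 linking to at least one. The sketch below shows how such shares could be computed from per-user sets of linked accounts. The function name, the platform labels, and the input format are assumptions made for illustration; the poster does not describe how the feasibility study was carried out.

```python
PLATFORMS = frozenset({"facebook", "twitter", "tumblr"})


def link_shares(profiles):
    """Compute the share of users linking to all, and to at least one, platform.

    profiles: list of sets, each holding the platform names linked from one
    user's profile (hypothetical input format).
    Returns (share_all_three, share_at_least_one) as fractions of the sample.
    """
    if not profiles:
        raise ValueError("profiles must contain at least one user")
    n = len(profiles)
    all_three = sum(1 for linked in profiles if PLATFORMS <= linked)
    at_least_one = sum(1 for linked in profiles if PLATFORMS & linked)
    return all_three / n, at_least_one / n


# Toy sample, invented for illustration only (not data from the study).
sample = [
    {"facebook", "twitter", "tumblr"},
    {"facebook"},
    set(),
    set(),
    {"twitter", "tumblr"},
]
share_all, share_any = link_shares(sample)
```

Users who appear in the "all three" group are the ones for whom behaviour could, in principle, be followed across every linked system.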

Genders’ wordings

To support our hypothesis that more specific features based on users’ profile information lead to more accurate classification of bullying content, we analysed the use of foul words in a dataset from MySpace and compared the foul words most frequently used by each gender.
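A comparison of this kind can be sketched as a per-gender frequency count over the posts. The example below is a minimal illustration under stated assumptions: the foul-word set is a small hypothetical stand-in for the profane words dictionary, the function name is invented, and the sample posts are toy data rather than MySpace posts.

```python
import re
from collections import Counter

# Hypothetical stand-in for the profane words dictionary.
FOUL_WORDS = {"hell", "dumb", "damn", "pissed"}


def top_foul_words(posts, k=3):
    """Return the k most frequent foul words for each gender.

    posts: list of (text, gender) pairs.
    Returns {gender: [(word, count), ...]} sorted by descending count.
    """
    counts = {}
    for text, gender in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        foul = [t for t in tokens if t in FOUL_WORDS]
        counts.setdefault(gender, Counter()).update(foul)
    return {gender: c.most_common(k) for gender, c in counts.items()}


# Toy posts, invented for illustration only.
toy_posts = [
    ("damn that was dumb", "Male"),
    ("damn", "Male"),
    ("so pissed, what the hell", "Female"),
]
ranking = top_foul_words(toy_posts)
```

If the rankings differ noticeably between the two groups, that difference is the kind of evidence that motivates gender-specific features.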

[Diagram: harassment detection. The dataset is split by gender (Male, Female), and posts are classified as Harassing or Non Harassing. Features: profane words, second person pronouns, other personal pronouns, term weighting.]

[Diagram labels: Single-system, Multi-system.]

1. Harassment or Bullying? It is hard to differentiate harassment from bullying without complementary information. Foul words are sometimes used among teenagers as a sign of friendship and close relationships. Whether someone is bullied and becomes a victim of cyberbullying depends on that person’s personality. Bullying shows continuity and repetition over time, and perhaps across systems.

2. Data. There is no sufficiently large, standard labelled dataset for cyberbullying detection, and the available datasets are not appropriate for these studies, mainly for the following reasons: privacy issues; public effect; no dataset with users’ demographic information; inconsistency in the labelling process.

3. Features. Current studies use conventional sentiment analysis features, all of which are content based and single-system, while social studies show that actors’ characteristics and personal information matter, and that different actors may bully others differently. Such characteristics include age, gender, profession, and educational level.