

Page 1: Scalable and Timely Detection of Cyberbullying in Online Social Networks (rhan/Papers/SAC-2018-cyberbullying-277_1099.pdf)

Scalable and Timely Detection of Cyberbullying in Online Social Networks

Rahat Ibn Rafiq
Department of Computer Science, University of Colorado Boulder
[email protected]

Homa Hosseinmardi
Information Sciences Institute, University of Southern California
[email protected]

Richard Han
Department of Computer Science, University of Colorado Boulder
[email protected]

Qin Lv
Department of Computer Science, University of Colorado Boulder
[email protected]

Shivakant Mishra
Department of Computer Science, University of Colorado Boulder
[email protected]

ABSTRACT
Cyberbullying in Online Social Networks (OSNs) has grown to be a serious problem among teenagers. While a considerable amount of research has been conducted on designing highly accurate classifiers to automatically detect cyberbullying instances in OSNs, two key practical issues remain to be addressed: the scalability of a cyberbullying detection system and the timeliness of raising alerts whenever cyberbullying occurs. These two issues motivate our work. We propose a multi-stage cyberbullying detection solution that drastically reduces the classification time and the time to raise alerts. The proposed system is highly scalable without sacrificing accuracy and highly responsive in raising alerts. The design comprises two novel components: a dynamic priority scheduler and an incremental classification mechanism. We have implemented this solution and, using data obtained from Vine, conducted a thorough performance evaluation to demonstrate the utility and scalability of each of these components. We show that our complete solution is significantly more scalable and responsive than the current state of the art.

CCS CONCEPTS
• Networks → Online social networks; • Human-centered computing → Social networks; • Applied computing → Sociology;

KEYWORDS
Scalable Systems, Cyberbullying, Social Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SAC 2018, April 9–13, 2018, Pau, France
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5191-1/18/04 . . . $15.00
https://doi.org/10.1145/3167132.3167317

ACM Reference Format:
Rahat Ibn Rafiq, Homa Hosseinmardi, Richard Han, Qin Lv, and Shivakant Mishra. 2018. Scalable and Timely Detection of Cyberbullying in Online Social Networks. In SAC 2018: Symposium on Applied Computing, April 9–13, 2018, Pau, France. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3167132.3167317

1 INTRODUCTION
Unprecedented growth in the popularity of OSNs, especially among teenagers, has unfortunately resulted in a significant increase in cyberbullying. The main differences between bullying and cyberbullying are that the Internet can help the perpetrators of cyberbullying hide their identities, that cyberbullying can be incessant because of the availability and access of the Internet, and that cyberbullying can go viral and expose the victims to an entire virtual world [29]. Numerous instances [5] and devastating consequences of cyberbullying [1, 8, 27] have led researchers to explore detection of cyberbullying incidents in OSNs such as Ask.fm, Instagram, Vine, and Twitter [3, 12, 13, 19, 23, 28]. These works have mostly followed the methodology of collecting and labeling data from OSNs and designing cyberbullying classifiers. Past work has also investigated the issue of identifying the imbalance of power between perpetrators and victims, which is a key feature of bullying, and distinguishing between cyber-aggression and cyberbullying [16], thus paving the way for highly accurate classifiers. Figure 1 shows an example of cyberbullying on Vine.

While progress has been made on the accuracy of classifiers for cyberbullying detection, there are two key practical issues that have largely been ignored to date. The first concerns the scalability of cyberbullying detection solutions. OSNs involve an enormous amount of data, on the order of several hundred gigabytes per day. For example, it has been reported that around 39 million videos have been shared on Vine since it was introduced [26], while for Instagram the amount of shared media is 40 billion [18].

The second issue concerns the timeliness of raising alerts whenever cyberbullying incidents are suspected. Cyberbullying is different from traditional, face-to-face bullying because it can occur 24/7, and perpetrators can be anonymous and


Figure 1: An example of cyberbullying on Vine.

have easy access to sophisticated tools to launch cyberbullying attacks. Furthermore, the consequences can be disastrous, and it is extremely important to provide the necessary support to the victims as early as possible. Timely detection of cyberbullying is therefore of paramount importance, so that an alert can be raised as soon as possible.

In this paper, we propose a multi-stage cyberbullying detection solution designed to improve the scalability and responsiveness of cyberbullying detection. To the best of our knowledge, we are the first to propose a scalable and responsive solution to cyberbullying detection in OSNs. A key property of the solution is that it achieves sufficient classification accuracy while accomplishing these two goals. The solution consists of two key components: a dynamic, multilevel priority scheduler for improved responsiveness, and an incremental feature extraction and classification stage for scaling. Using online social networking data from Vine, we demonstrate the utility of both of these components, and show that our complete cyberbullying detection solution is significantly more scalable and responsive than the current state of the art. We make the following important contributions:

∙ We propose an incremental computational design for feature extraction and classification that reuses previous classification results to reduce overhead with minimal impact on accuracy.

∙ We propose a dynamic, multi-level priority scheduler that assigns high preference to potential cyberbullying media sessions, thereby improving the responsiveness of the solution.

∙ Using real-world data from Vine, we demonstrate that our integrated system substantially improves the scalability of cyberbullying detection, making it feasible for Vine-scale social networks.

∙ We further demonstrate how our system scales to monitor much larger, Instagram-scale networks.

2 RELATED WORK
The majority of research on cyberbullying detection has focused on improving the accuracy of cyberbullying detection

Figure 2: Scalable and responsive cyberbullying detection architecture.

classifiers [6, 15, 22, 30]. Several cyberbullying detection applications have been developed in recent years [7]. Most of these applications (e.g., Mobicip) introduce parental controls including category blocking, time limits, Internet activity reports, blocked phrases, and YouTube filtering, whereas others, e.g., iAnon and GoGoStat, search for specific profanity words. Keyword-based research on event detection has also been performed [20, 24], where crime and disaster events were extracted by querying the Twitter API for a list of specific keywords (e.g., tornado, earthquake) to get the tweets related to an event. Cyberbullying, in contrast, is much more subtle. As shown in [13], profanity alone cannot be an indicator of cyberbullying. So it is important to make use of the definition of cyberbullying and take into account the repetitive nature of perpetrated aggressiveness and the imbalance of power while building an accurate cyberbullying detection solution.

As far as we know, none of the prior research has addressed the issue of computational scalability and/or responsiveness in the context of cyberbullying detection. Scalability has, however, been an important factor in other research areas such as misbehavior detection in online video chat services [4, 32] and cyber-attack detection in communities [11].

3 DESIGN OVERVIEW
The goal of this research is to propose a cyberbullying detection system with two key characteristics: it must scale to handle large OSNs without sacrificing accuracy, and it must raise a timely alert when a cyberbullying instance takes place. To this aim, we face two key challenges. First, how to scale up the system while retaining reasonable accuracy. Second, how to design the system so that an alert is raised as soon as a cyberbullying instance takes place while monitoring a large number of media sessions (a media session is a media item and its associated comments in Instagram or Vine). In the following subsections, we describe the two components of our system that address these challenges.


X_n: saved feature vector values for the first n comments, for all features;
δn: new comments to be processed;
X^i_n: feature vector value of the i-th feature for n comments;
X^i_δn: feature vector value of the i-th feature for the δn comments;
|X_n|: total number of features;
forall i in 1, 2, ..., |X_n| do
    X^i_{n+δn} := X^i_n + Compute(X^i_δn);
end
Algorithm 1: IncrementalFeatureExtraction()

3.1 Incremental Classifier
Our first challenge is to build a cyberbullying detection classifier that is scalable in terms of time and computing resources while also retaining sufficient accuracy. While sophisticated deep learning classifiers have recently been introduced to solve complex problems with high accuracy [14, 17], they come with considerable computational baggage. For example, in [14], the authors used deep learning in real time to process one 1080p video frame in 644 ms using a Samsung S7 while leveraging high-performance GPUs (12 GPUs) and 4 GB of memory. While it is tempting to use deep learning for our system, we want our classifier to be able to run on lightweight computational resources (for example, an Amazon AWS free-tier instance with 1 GB of memory). In addition to being computationally lightweight, we also want our classifier to be faster than the current state-of-the-art AdaBoost [23] (as shown later in Table 1) at classifying when new comments for a media session come in, without sacrificing accuracy. Both of these approaches (deep learning and AdaBoost), while highly accurate, each fail one of these two key requirements: being lightweight in computational resources, and being efficient, respectively.

Our approach toward improving computational scalability and efficiency while retaining sufficient accuracy is to incorporate incremental computation [9, 10] into the design of the classifier. Incremental computation reuses data from previous stages within the current stage, thus reducing computational complexity. Traditional classifiers need to execute a full run as each new datum arrives, e.g., a new comment for a media session. Instead, our approach reuses previous stages' results and combines them with the new comments, thereby reducing computational cost and rendering the solution scalable. We seek to apply this incremental approach to both the feature extraction and classification stages of the classifier. To this aim, we employ a classifier that, during the feature extraction stage, uses features that are by nature incrementally linear, in the sense that once the values corresponding to these features have been computed for the first n comments, then when δn new comments arrive, we only have to compute the individual feature vector values for the new δn comments while reusing the values for the previous n comments to compute the overall feature vector for the n + δn comments. This dramatically reduces resource and computation cost because each run is driven by δn instead of n + δn. Algorithm 1 provides pseudo-code for the incremental feature extraction algorithm of our incremental detector. Similarly, the candidate classifier should also be able to leverage this incremental approach during the classification stage once the feature vectors are extracted from new data.
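The incrementally linear property can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the two feature functions (a negative-word count over a made-up word list, and a total comment length) are hypothetical stand-ins for the paper's feature set, chosen only because they are additive over batches of comments.

```python
# Sketch of Algorithm 1 (IncrementalFeatureExtraction), assuming each feature
# is "incrementally linear": its value over n + δn comments equals its value
# over the first n comments plus its value over the δn new comments.

NEGATIVE_WORDS = {"hate", "ugly", "stupid"}  # hypothetical word list

def count_negative_words(comments):
    # Count occurrences of negative words across a batch of comments.
    return sum(1 for c in comments for w in c.lower().split() if w in NEGATIVE_WORDS)

def total_length(comments):
    # Total character length of a batch of comments.
    return sum(len(c) for c in comments)

FEATURES = [count_negative_words, total_length]

def incremental_feature_extraction(saved, new_comments):
    """saved: X_n, the per-feature values for the first n comments.
    Returns X_{n+δn} by evaluating each feature over only the δn new comments."""
    return [x_i + f(new_comments) for x_i, f in zip(saved, FEATURES)]

# Process comments batch by batch, reusing the saved vector each time.
x = [0, 0]
x = incremental_feature_extraction(x, ["you are stupid", "nice video"])
x = incremental_feature_extraction(x, ["i hate this"])
```

Because each feature is additive, the result is identical to recomputing over all n + δn comments, but each update touches only the new batch.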

We found that Logistic Regression (LR) was the most promising classifier that met all the aforementioned criteria. LR works as follows: if we have n features a_i, i = 0, 1, 2, ..., n−1, then after training, LR assigns a weight w_i, i = 0, 1, ..., n−1 to each of those features, and computes the combined feature value c = Σ_{i=0}^{n−1} a_i w_i. This value is fed to a sigmoid function with output ranging from 0 to 1 [31]. We can leverage incremental computation in LR as follows: LR takes as input a set of features X, and during the training process the classifier generates a set of weights θ corresponding to those features. When a new media session comes in, the feature extraction step computes a matrix X for that particular media session and computes C = Xθ, which is then used to make the corresponding prediction. For the incremental feature extraction sub-component, we save the X_old value for the previous n comments, compute X_δn for the newly arrived set of δn comments, and compute the new X by combining X_old and X_δn instead of computing X all over again for all n + δn comments. For the incremental classification part, we only use those components of X that have changed to compute C = Xθ instead of doing the full Xθ computation. For this purpose, we save X_i × θ_i for all i, where X_i is the i-th feature value at time t. At time t + δt, we determine whether each feature vector value X_i has changed by comparing it to the previously saved X_i at time t. Only if it has changed do we take it into account, updating Σ_i X_i × θ_i by simple addition and subtraction instead of a full-scale matrix multiplication.
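The incremental classification step described above can be sketched as follows. The weights are made-up numbers rather than trained values, and the class name and structure are our own illustration of the cached-product idea, not the paper's code.

```python
import math

# Sketch of incremental LR classification: cache each product X_i * θ_i and
# the running sum C = Σ X_i θ_i. When only some features change, adjust C by
# addition/subtraction instead of redoing the full dot product.

class IncrementalLR:
    def __init__(self, theta):
        self.theta = theta
        self.x = [0.0] * len(theta)          # last seen feature vector
        self.products = [0.0] * len(theta)   # cached X_i * θ_i
        self.c = 0.0                         # cached Σ X_i θ_i

    def predict(self, x_new):
        for i, (old, new) in enumerate(zip(self.x, x_new)):
            if new != old:                   # update only changed features
                p = new * self.theta[i]
                self.c += p - self.products[i]
                self.products[i] = p
                self.x[i] = new
        return 1.0 / (1.0 + math.exp(-self.c))  # sigmoid output in (0, 1)

clf = IncrementalLR(theta=[0.8, -0.3, 0.5])   # hypothetical weights
p1 = clf.predict([1.0, 2.0, 0.0])
p2 = clf.predict([1.0, 2.0, 3.0])  # only the third feature changed
```

The second call recomputes only one product, yet yields the same prediction as a full dot product over the new feature vector.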

To be able to use incremental computation in the feature extraction stage, LR has to make use of features that are, by nature, amenable to incremental computation. In addition, our incremental logistic regression also has to show sufficient efficiency and scalability over the current state of the art. All these goals have to be met without sacrificing crucial classification qualities like precision and recall. In this section, we provided the design of the incremental techniques our LR classifier uses to scale up. From now onward, we will refer to this incremental LR classifier as the classifier. In Section 4.1, we first justify our choice of LR by comparing its execution time, precision, and recall with the current state of the art [23] while making use of features that accommodate incremental computation. We then justify using the incremental computation approach in LR by comparing its scalability performance with a standard LR that does not use the incremental approach.


3.2 Dynamic Priority Scheduler
While leveraging incremental computation helped our classifier gain reasonable scalability, it still did not address the other key issue that a cyberbullying detection system has to tackle: responsiveness. As the first step toward tackling this issue, we make use of two key observations. First, not all media sessions need to be monitored equally. The theme is to apply limited resources where they are needed most. As most media sessions are not bullying in nature [23], we should be able to apply our resources to media sessions that are most likely to result in cyberbullying. This observation makes it natural to build a scheduler that keeps monitoring only media sessions with high priority, discarding all the low-priority ones. We call this scheduler the Static Priority Scheduler (SPS). Our investigation of the performance of SPS on labeled Vine data [23] found its precision and recall to be 70 and 58 percent, respectively. The low recall value shows that by totally ignoring media sessions that were given low priority at some stage of their lifetimes in order to achieve responsiveness, we miss a significant portion of potential cyberbullying media sessions. This trade-off between performance and responsiveness led us to our second observation: a media session, with its incoming stream of comments, can slowly evolve into a cyberbullying instance even if it started out as a normal one in the early stages of its lifetime, and vice versa. As new comments arrive for a media session, it may become more or less indicative of cyberbullying, depending on the nature of the newly arrived comments. This means it is important to examine all sessions, including the ones with low priority, as some of them may evolve into cyberbullying sessions during later stages.

The investigation results after running SPS, together with the second observation, motivated the design of our Dynamic Priority Scheduler (DPS). We define two levels of priority, namely high and low, for all media sessions and assign a high priority to all newly created media sessions. Our challenge is then to learn from new comments so as to dynamically vary each media session's priority. After each invocation of our incremental classifier component (Section 3.1), a confidence value [2] of how likely the media session is to contain cyberbullying is generated. We make use of the history of these confidence values to dynamically change a media session's priority. The reason for using the history, as opposed to just the most recent confidence value, has to do with the definition of cyberbullying. Cyberbullying is defined as aggressive online behavior that is carried out repeatedly against a person who cannot easily defend himself or herself, creating a power imbalance [16]. To identify repeated aggressive behavior, or whether a victim can defend himself or herself, we need to consider a much longer history than just the most recent confidence value. We calculate the average of all past and current confidence values for the classifications of a particular media session and compare it with a threshold value. If the average confidence value is above the threshold, we assign a high priority to the session; if the average value is below the threshold, we assign a low priority. Algorithm 2 illustrates our priority-setting algorithm using an average confidence threshold (0.2 in this example). Upon prioritizing the media sessions, we run our classifier component on high-priority media sessions more frequently and postpone the classification and processing of low-priority media sessions until a later phase, thus achieving responsiveness without sacrificing recall performance.

In this section, we outlined the motivation and design of the dynamic priority scheduler introduced into our proposed cyberbullying detection system. In Section 4.2, we first determine what threshold is appropriate for our DPS by comparing it with our baseline round-robin scheduler, in which all media sessions always have the same priority and are thus monitored equally. We then demonstrate through thorough experiments that DPS introduces a significant responsiveness gain over the round-robin scheduler, justifying the utility of this component.

forall media sessions m do
    Conf^m_i: confidence value of the i-th comment-session prediction for this media session;
    n: total number of comment-session predictions in the confidence history;
    Avg^m_confidence = (Σ_{i=1}^{n} Conf^m_i) / n;
    if Avg^m_confidence ≥ 0.2 and current priority is LOW then
        set current priority to HIGH;
        continue;
    end
    if Avg^m_confidence < 0.2 and current priority is HIGH then
        set current priority to LOW;
        continue;
    end
end
Algorithm 2: SettingPriority()
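Algorithm 2 can be sketched directly in Python. This is a minimal sketch under the assumptions in the text: each session carries a history of classifier confidence values, and the 0.2 threshold is the example value from Algorithm 2.

```python
# Sketch of Algorithm 2 (SettingPriority): a session's priority flips based on
# the average of its whole confidence history, not just the latest value.

THRESHOLD = 0.2  # example threshold from the text

def set_priority(confidence_history, current_priority):
    avg = sum(confidence_history) / len(confidence_history)
    if avg >= THRESHOLD and current_priority == "LOW":
        return "HIGH"   # promote: repeated signs of cyberbullying
    if avg < THRESHOLD and current_priority == "HIGH":
        return "LOW"    # demote: history looks benign on average
    return current_priority
```

Averaging over the history means a single high-confidence spike does not immediately promote a long-benign session, matching the repetition requirement in the definition of cyberbullying.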

3.3 An Example
Consider the example shown in Figure 2, where M1 and M2 represent newly created media sessions from Vine in real time. Three separate queues, Q1, Q2, and Q3, are maintained. The scheduler schedules media sessions in queue Q1 (pointed to by the current head) for processing one by one in queue order. After a media session has been processed by the cyberbullying detector component, if no alert is raised it is placed at the end of either queue Q2 or queue Q3 depending on the new priority assigned to it (as discussed in Section 3.2). If the session's new priority is high, it is placed in queue Q2, and if the session's new priority is low, it is placed in queue Q3. When all media sessions in queue Q1 have been processed, queue Q2 becomes queue Q1, queue Q3 becomes queue Q2, and queue Q3 becomes empty. In the example shown in Figure 2, initially M1 and M2 are assigned high


priority and placed in queue Q1. M1 is scheduled first and is processed by the cyberbullying detector component. After this processing, it is assigned a low priority and so is added to the end of queue Q3. M2 is scheduled and processed next. After this processing, it is assigned a high priority and so is added to the end of queue Q2. At this point queue Q1 is empty, so queue Q2 becomes queue Q1 and queue Q3 becomes queue Q2. This process then continues. Since lower-priority sessions are eventually elevated into higher-level queues, our solution ensures that no media session will starve. In this example, the low-priority media session M1 is processed when queue Q3 becomes Q1, making sure that low-priority media sessions are also processed, albeit less frequently than their high-priority counterparts.
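The queue rotation in this example can be sketched as below. The `classify` callable is a stand-in for the incremental classifier (here stubbed with fixed priorities for M1 and M2 to reproduce the walkthrough above); the function names are our own.

```python
from collections import deque

# Sketch of the three-queue rotation: Q1 holds sessions processed this round;
# each processed session moves to Q2 (high priority) or Q3 (low priority);
# when Q1 drains, Q2 -> Q1 and Q3 -> Q2, so low-priority sessions are
# eventually processed and never starve.

def run_round(q1, q2, q3, classify):
    while q1:
        session = q1.popleft()
        priority = classify(session)  # stand-in for the incremental classifier
        (q2 if priority == "HIGH" else q3).append(session)
    return q2, q3, deque()  # rotate: Q2 -> Q1, Q3 -> Q2, fresh empty Q3

# Reproduce the walkthrough: M1 classified low, M2 classified high.
q1, q2, q3 = deque(["M1", "M2"]), deque(), deque()
priorities = {"M1": "LOW", "M2": "HIGH"}
q1, q2, q3 = run_round(q1, q2, q3, classify=lambda s: priorities[s])
```

After one round, M2 sits in the new Q1 (processed again soon) and M1 in the new Q2 (processed one round later), matching the example.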

4 PERFORMANCE EVALUATION
4.1 Incremental Classifier Evaluation
In [23], the AdaBoost classifier was reported to have the best performance based on accuracy, precision, and recall values. Table 1 compares AdaBoost with logistic regression in terms of precision, recall, F-1 score, and running time. The features the AdaBoost classifier used were the number of followers and followings, likes and views for media sessions, media caption polarity and subjectivity [21], total number of negative comments, summation of negative comment polarity and subjectivity, total individual comment polarity, total individual comment subjectivity, total negative words, and unigrams. For logistic regression, the features we used were the number of followers and followings, media caption polarity and subjectivity, total individual comment polarity, total individual comment subjectivity, total negative words, and total number of negative comments. We made sure that the features used by logistic regression were incrementally linear by nature, as noted in Section 3.1. The performance values shown in Table 1 were obtained using 10-fold cross-validation on the labeled Vine data available in [23]. We notice that although the AdaBoost classifier achieves a slightly higher precision, logistic regression achieves higher recall and F-1 score. Furthermore, the running time of the logistic regression classifier is significantly lower, more than five times faster than that of the AdaBoost classifier. The reason for this is twofold. First, AdaBoost needed unigram features to achieve high precision and recall, but unigram feature extraction is computationally expensive. In comparison, logistic regression is able to achieve effectively the same precision and better recall while using features that are much more lightweight to compute, yielding a much lower running time. Second, the AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original data set and then fits additional copies of the classifier on the same data set with the weights of incorrectly classified instances adjusted such that subsequent classifiers focus more on difficult cases [25], making it computationally expensive compared with much simpler logistic regression. Based on these analyses, we have chosen the logistic regression classifier for

Figure 3: Total time taken by standard and incremental classifiers as new comments come in per media session.

detecting cyberbullying in our solution. It is worth mentioning that we also evaluated other classifiers based on different combinations of features (Decision Tree, Random Forest, Naive Bayes, Perceptron, etc.) and present only the classifiers and feature combinations that yielded the best results.

Next, we show that leveraging the incremental approach in our logistic regression classifier significantly improves the scalability of the execution stage. To this aim, we defined a baseline solution consisting of non-incremental feature extraction and a non-incremental logistic regression classifier. As new comments arrive, the baseline solution needs to recompute all feature vectors from scratch and recompute the entire logistic regression from scratch. We compared the total running time of the baseline solution with an incremental solution that implemented both incremental feature extraction and incremental logistic regression, as described in Section 3.1. Our measurements showed that the fraction of time taken by the logistic regression compared with the feature extraction time was negligible, so the total running time was dominated by feature extraction.

Figure 3 shows the average time taken by the standard and the incremental classifiers as the number of comments in media sessions increases. To simplify the plot, we group the comments in sets of 10. The time taken by the standard classification solution goes up almost linearly with the number of comments in the media session, since the standard solution must recompute all features and regression weights. On the other hand, the time taken by the incremental classifier is essentially constant each time a set of 10 additional comments comes in, because it only has to compute the feature values for the 10 new comments. The justification for using sets of 10 comments is given in Section 4.2.
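The linear-versus-constant behavior can be illustrated by counting how many comments each approach touches per round, rather than measuring wall time. This is a hypothetical work-count model of the comparison, not the paper's benchmark code.

```python
# Model of the Figure 3 comparison: per round, the baseline recomputes
# features over all comments seen so far (n + δn), while the incremental
# approach touches only the δn new comments.

def baseline_work(batches):
    seen, touched = 0, 0
    for b in batches:
        seen += len(b)       # total comments accumulated so far
        touched += seen      # baseline re-extracts features over all of them
    return touched

def incremental_work(batches):
    return sum(len(b) for b in batches)  # only the new δn comments each round

batches = [["c"] * 10 for _ in range(40)]  # 400 comments, in sets of 10
baseline = baseline_work(batches)          # grows linearly per round
incremental = incremental_work(batches)    # constant work per round
```

Over 400 comments in batches of 10, the baseline touches 8200 comment-feature evaluations versus 400 for the incremental approach, mirroring the linear and flat curves in Figure 3.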

4.2 Dynamic Priority Scheduler Evaluation

In this section, we compare the performance of our DPS against a round-robin scheduler, i.e., a scheduler that assigns no priorities. The aim of this comparison is twofold. First, we want to show the responsiveness gain our scheduler achieves over the round-robin scheduler. Second, we want to investigate several design choices crucial to building our


Table 1: Comparison of different classifiers using the 983 labeled media sessions

  AdaBoost: Precision 0.7138, Recall 0.54, F-1 Score 0.61, Time 228 s.
  Features: number of followers, followings, likes, views, media caption
  polarity and subjectivity [21], total number of negative comments,
  summation of negative comment polarity and subjectivity, total
  individual comment polarity, total individual comment subjectivity,
  total negative words, and unigrams.

  Logistic Regression: Precision 0.71, Recall 0.66, F-1 Score 0.68, Time 44.42 s.
  Features: number of followers, followings, media caption polarity and
  subjectivity, total individual comment polarity, total individual
  comment subjectivity, total negative words, and total number of
  negative comments.


Figure 4: Left: Scheduler gain time ratio at different confidence thresholds for labeled cyberbullying media sessions. Right: Scheduler gain time ratio at different confidence thresholds for different comment increment sizes and different numbers of media sessions, using the labeled Vine data [23].

scheduler, choosing among them based on the performance gain each design choice achieves over the round-robin scheduler. We use the labeled Vine dataset [23] to perform these experiments.

First, we need to determine an appropriate confidence threshold for our solution, as noted in Section 3.2. We also need to determine the granularity at which our classifier ingests batches of new comments, because this affects the time to first alert. The scheduler chooses a high-priority media session to pass to the classifier. Between classification attempts, a media session may receive N new comments. If all N comments are input to the classifier at once and N is large, we may delay recognizing cyberbullying, i.e., a burst of negative comments may be swamped by the other, positive comments. Therefore, we need to consider comments in small enough batches, or intervals, so that the classifier can catch cyberbullying at a finer granularity and raise the alert early.

The left plot in Figure 4 assesses which combination of threshold and interval size produced the best improvement in response time using dynamic prioritization compared with a simple round-robin policy. The round-robin scheduler is defined as one in which media sessions are not assigned any priority; the scheduler simply rotates through all media sessions, paying no particular attention to likely cyberbullying sessions.

As the figure shows, a confidence threshold of 0.2 and a comment increment size of 10 yield the maximum responsiveness gain over the round-robin scheduler. We note that when fewer than 10 new comments are available for a media session, we simply take however many are available. We believe this is because, as the comment increment size grows to 20 or 30, a burst of cyberbullying comments can be nullified by the other, positive comments, which in turn dilutes the features (e.g., the summation of individual comment sentiments) used by our incremental classifier. An increment of 10 comments thus tends to be optimal: large enough to provide sufficient context of a comment thread for a knowledgeable decision about cyberbullying, yet not so large that a burst risks being nullified by positive comments. The figure also confirms that a confidence threshold of 0.2 offers the best speedup for our dynamic priority scheduler. For example, suppose a media session m has been classified 3 times at t1, t2, t3, with classification decisions not-bullying, not-bullying, not-bullying respectively


Table 2: When to Send Alert

  Number of predicted cyberbullying comment sessions in the history:
  ≥ 1: Precision 0.68, Recall 0.71, F-1 Score 0.69
  ≥ 2: Precision 0.71, Recall 0.71, F-1 Score 0.71
  ≥ 3: Precision 0.71, Recall 0.71, F-1 Score 0.71

with confidence values of 0.85, 0.85, and 0.55. Even though the session has been classified as not cyberbullying each time, the confidence in the cyberbullying decision is increasing (0.15, 0.15, 0.45), which makes it a potential candidate for a future cyberbullying session. We therefore take the average of the previous classification confidence values for the cyberbullying class (0.25 in this case), see that this average exceeds 0.2, change the priority of this media session to high, and insert it into the dynamic priority scheduler. We note that we tried all possible combinations of comment chunk sizes and confidence thresholds and present only the combinations that yielded the best results. The right plot in Figure 4 shows the gain time achieved by our scheduler for each media session in the Vine dataset labeled as cyberbullying. This figure further justifies the choice of a 0.2 confidence threshold in our scheduler. These experiments helped us not only to justify our choice of DPS but also to settle the crucial design choices of a 0.2 confidence threshold and a 10-comment increment size.
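The priority-bump rule worked through above can be sketched in a few lines. This is a minimal sketch under assumptions: the function name and the representation of confidences are illustrative, not the authors' code.

```python
# Sketch of the DPS priority-bump rule: a session whose average past
# confidence for the cyberbullying class exceeds the 0.2 threshold is
# promoted to high priority, even if every decision so far was
# "not-bullying".

CONFIDENCE_THRESHOLD = 0.2  # best-performing threshold per Figure 4

def should_promote(not_bullying_confidences, threshold=CONFIDENCE_THRESHOLD):
    """not_bullying_confidences: the classifier's confidence in the
    not-bullying class at each past classification of this session."""
    bullying_conf = [1.0 - c for c in not_bullying_confidences]
    return sum(bullying_conf) / len(bullying_conf) > threshold

# The worked example above: not-bullying confidences 0.85, 0.85, 0.55
# give cyberbullying confidences 0.15, 0.15, 0.45, averaging 0.25 > 0.2,
# so the session is promoted to high priority.
```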

4.3 Alert Performance

Since each media session is passed sporadically to the classifier by the scheduler, the classifier generates a sequence of cyberbullying detection decisions over time for each media session. It is therefore worth considering to what extent we should utilize the history of detection decisions when generating an alert. The default is to generate an alert immediately after the classifier decides that the current batch of 10 comments, in combination with earlier content, constitutes cyberbullying. However, we wish to be sure and avoid false positives. One option is to decouple the alert from the classification and delay the alert until N positive decisions have been seen recently. This design gives us some flexibility in trading off responsiveness and precision.

For each media session, we maintain an array storing each classification result for that session along with the time of that classification. We use this array to decide when to raise an alert. In particular, we set a threshold on the number of times a media session has been classified as cyberbullying since the last time an alert was raised for that session, or from the beginning if no alert has yet been raised. After experimenting with different threshold values (Table 2), we find that by raising an alert only when we have at least 2 cyberbullying decisions since the last alert, we achieve the best precision, recall, and F-1 score of 0.71, 0.71, and 0.71 respectively, thus

reducing the number of false alarms. This performance is a marked improvement over the Static Priority Scheduler (SPS) described in Section 3.2, which had a recall of only 58%. Moreover, the recall is an improvement over the standard classifier's 0.66 (see Table 1 for comparison). This marked improvement in recall over two baselines (SPS and the standard classifier) justifies using the incremental classifier component and the dynamic priority scheduler along with all the associated design choices: these components are efficient and responsive while retaining sufficient classifier performance compared to the current state of the art.
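The history-based alert rule can be sketched as follows; the class structure and the reset-after-alert behavior are illustrative assumptions based on the description above.

```python
# Sketch of the alert rule: fire an alert only after at least two
# cyberbullying decisions have accumulated since the last alert.

ALERT_THRESHOLD = 2  # best precision/recall/F-1 per Table 2

class AlertHistory:
    def __init__(self, threshold=ALERT_THRESHOLD):
        self.threshold = threshold
        self.positives_since_alert = 0

    def record(self, is_bullying):
        """Record one classification decision; return True if an alert fires."""
        if is_bullying:
            self.positives_since_alert += 1
        if self.positives_since_alert >= self.threshold:
            self.positives_since_alert = 0  # reset the count after alerting
            return True
        return False

h = AlertHistory()
decisions = [True, False, True, True]
fired = [h.record(d) for d in decisions]
# Only the second positive decision fires; the count then resets.
```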

4.4 Scalability Evaluation

In this section, we first demonstrate how our proposed system scales when it has to monitor a substantial number of media sessions. For this purpose, we deployed Amazon AWS free-tier 1 GB memory virtual machine instances to monitor media sessions, implementing both our proposed scheduler and the round-robin scheduler. Acceptable responsiveness is our primary goal, along with scalability. In these experiments, we decided that an average alert time under 2 hours is acceptable, meaning an alert will be raised within 2 hours of a cyberbullying instance. We acknowledge that this choice is driven purely by the lack of research in this particular area; in future work, we will conduct an elaborate survey to explore the acceptable responsiveness of a potential cyberbullying detection system. For the experiments presented in this section, we replicated the traffic of the 100,000 media sessions from the dataset in [23] up to a scale of 39 million. Those media sessions were gathered by snowball sampling from a randomly selected seed. We believe the randomness of the seed selection, the snowball sampling, and the large number (100,000) of media sessions in this dataset should enable the scaled-up traffic to reasonably approximate the behavior of the overall network.

The left graph in Figure 5 shows the number of media sessions that can be processed as the number of AWS instances increases, keeping the average alert time under 2 hours. The number of media sessions that can be processed increases linearly with the number of instances, and our system scales five times better than the round-robin scheduler. Given that the Vine social network has generated about 39 million media sessions since 2012 [26], our system is capable of monitoring Vine-scale social networks with only 8 AWS instances while keeping the average alert time below 2 hours. In contrast, a round-robin scheduler would require upwards of 40 instances.

The right graph in Figure 5 shows the average alert time for the round-robin and dynamic priority schedulers versus the number of media sessions. Our proposed system can process 5 million media sessions with an average alert time under 2 hours, whereas round robin can process only 1 million. To show the worst-case cost of using our dynamic priority scheduler, Figure 6 shows that around 10 percent of media



Figure 5: Left: Number of media sessions vs. number of instances needed to monitor them, keeping the average alert time under 2 hours. Right: Average alert time vs. number of media sessions for the round-robin and dynamic priority schedulers.


Figure 6: CCDF of alert time for 5 million media sessions on a 1 GB memory Amazon AWS instance using the dynamic priority scheduler.


Figure 7: Memory vs. number of media sessions in millions.

sessions get their alerts after 2 hours. This is the cost we pay for postponing the processing of low-priority media sessions in our scheduler.

Next, we investigate the resources our system would need to monitor a comparatively more popular social network like Instagram. Since its launch in 2010, Instagram has accumulated over 40 billion media sessions [18], or almost 6 billion media sessions per year, a number much larger than Vine's 39 million [26]. To accommodate

Table 3: Total Time Comparison for Different Approaches and Different Numbers of Media Sessions (seconds)

  Approach                10,000   50,000        100,000
  AdaBoost                5674     26784         -
  Logistic Regression     1110     5320 (44X)    10438
  Incremental Classifier  22       120 (223X)    206

such a high-volume social network, based on Figure 5, we would need 1,200 AWS instances to keep the average alert time under 2 hours. To scale up to such a load, we have two choices: spawn 1,200 such 1 GB instances, or increase the memory of our instances so that each can process more media sessions. To investigate memory performance further, we ran our system on Amazon AWS instances of different memory sizes. Figure 7 shows the number of media sessions processed by each instance of a particular memory size; the Y axis reports the highest number of media sessions an instance can process while keeping the average alert time under two hours. The figure demonstrates that, while increasing memory does increase the media-session monitoring capacity, at around 32 GB per instance additional memory no longer enables monitoring of additional media sessions. That is, the graph plateaus at around 50 million media sessions, so there is no additional benefit to 64 GB or 128 GB instances over 32 GB instances. We hypothesize that this behavior is due to computation, rather than memory, becoming the main bottleneck. Therefore, to monitor Instagram-scale social networks, we would need approximately 120 instances of 32 GB. Note that without our dynamically scheduled incremental classification system, approximately 600 such 32 GB instances, almost five times as many, would be required, as can be seen from Figure 7.
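The instance counts above follow from a back-of-envelope calculation; the sketch below redoes the arithmetic, taking the per-instance capacities from the measurements reported in this section.

```python
# Back-of-envelope capacity projection using the paper's measured numbers.
import math

VINE_SESSIONS = 39_000_000          # total Vine media sessions [26]
SESSIONS_PER_1GB = 5_000_000        # DPS capacity of a 1 GB instance (Fig. 5)
SESSIONS_PER_32GB = 50_000_000      # plateau capacity of a 32 GB instance (Fig. 7)
INSTAGRAM_PER_YEAR = 6_000_000_000  # ~6 billion media sessions/year [18]

# Round up to whole instances in each case.
vine_instances = math.ceil(VINE_SESSIONS / SESSIONS_PER_1GB)        # 8
instagram_1gb = math.ceil(INSTAGRAM_PER_YEAR / SESSIONS_PER_1GB)    # 1200
instagram_32gb = math.ceil(INSTAGRAM_PER_YEAR / SESSIONS_PER_32GB)  # 120
```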

To evaluate the incremental classifier's scaling performance, we compare three approaches: the best reported Vine cyberbullying classifier [23] (standard AdaBoost), logistic regression without incremental feature



Figure 8: Gain time ratio (round-robin scheduler / dynamic priority scheduler) for different numbers of media sessions.


Figure 9: CCDF of the time interval, in hours, until the first comment for bullying media sessions.

extraction or classification (standard logistic regression), and logistic regression with incremental feature extraction and classification (incremental classifier). Table 3 shows the time in seconds needed by these three approaches to process different numbers of media sessions. The table clearly demonstrates the benefit of the incremental classifier: for 50,000 media sessions, it is 223 times faster than AdaBoost and 44 times faster than standard logistic regression. For this evaluation, we used the Vine dataset provided in [23].

Next, we compare the responsiveness of our dynamic priority scheduler against the unprioritized round-robin scheduler. The metric we use is the responsiveness gain, i.e., the ratio of the time taken by the round-robin scheduler to the time taken by our dynamic priority scheduler to raise an alert. Figure 8 shows the responsiveness gain vs. the number of media sessions. The gain tends to increase as the number of media sessions grows, reaching almost 7 times faster responsiveness for 100,000 media sessions. This improvement arises because our scheduler tends to process cyberbullying media sessions first, whereas round robin processes all media sessions on each pass. For this reason, as the number of media sessions grows, so does the improvement from our dynamic priority scheduler, due to its priority processing of cyberbullying media sessions.


Figure 10: CCDF of activity for all and for bullying media sessions.

For further insight into resource scalability, we present activity graphs of the media sessions from Vine. We investigate the distribution of how long a bullying media session takes to receive its first comment. Figure 9 shows that very few bullying media sessions receive their first comment more than 500 hours after session creation. Hence, one way to improve scaling is to stop monitoring any session that has not received its first comment within 500 hours. Second, Figure 10 shows the CCDF of media-session activity in Vine. A fair percentage of media sessions receive comments even 10,000 hours after the initial media posting. In comparison, bullying media sessions received all their comments within 9,000 hours, i.e., within about a year of their creation. So another way to improve scaling would be to purge all media sessions older than one year.

5 CONCLUSION AND FUTURE WORK

In this work, we have developed a cyberbullying detection system for media-based social networks, consisting of a dynamic priority scheduler, a novel incremental classifier, and an initial predictor. The evaluation results show that our system substantially improves the scalability of cyberbullying detection compared to an unprioritized system. Moreover, we demonstrate that our system can fully monitor Vine-scale social networks for cyberbullying for a year using only eight 1 GB AWS VM instances. We identify the point (32 GB) at which adding memory no longer enables monitoring of more media sessions, and project that our system would need 120 instances of 32 GB to fully monitor Instagram-scale traffic for cyberbullying.

As part of future work, we propose to investigate the plateauing effect that limits the benefit of adding more memory, namely the computational bottleneck that likely needs to be addressed. We also plan to research developing a diverse, scalable, and responsive classifier with better precision and recall for Vine as well as for other diverse OSNs like Facebook.


REFERENCES

[1] Ryan Broderick. 2014. 9 Teenage Suicides In The Last Year Were Linked To Cyber-Bullying On Social Network Ask.fm. http://www.buzzfeed.com/ryanhatesthis/a-ninth-teenager-since-last-september-has-committed-suicide. [Online; accessed 14-January-2014].

[2] Probability Calibration. 2017. http://scikit-learn.org/stable/modules/calibration.html. [Online; accessed September 2017].

[3] Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and Athena Vakali. 2017. Detecting Aggressors and Bullies on Twitter. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 767–768. https://doi.org/10.1145/3041021.3054211

[4] Hanqiang Cheng, Yu-Li Liang, Xinyu Xing, Xue Liu, Richard Han, Qin Lv, and Shivakant Mishra. 2012. Efficient Misbehaving User Detection in Online Video Chat Services. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12). ACM, New York, NY, USA, 23–32. https://doi.org/10.1145/2124295.2124301

[5] National Crime Prevention Council. 2007. Teens and Cyberbullying. Executive summary of a report on research conducted for the National Crime Prevention Council.

[6] Karthik Dinakar, Roi Reichart, and Henry Lieberman. 2011. Modeling the Detection of Textual Cyberbullying.

[7] Linda DiProperzio. 2015. Cyberbullying Applications. http://www.parents.com/kids/safety/internet/best-apps-prevent-cyberbullying/. [Online; accessed 6-February-2015].

[8] Russel Goldman. 2014. Teens Indicted After Allegedly Taunting Girl Who Hanged Herself. http://abcnews.go.com/Technology/TheLaw/teens-charged-bullying-mass-girl-kill/story?id=10231357. [Online; accessed 14-January-2014].

[9] Matthew A. Hammer, Joshua Dunfield, Kyle Headley, Nicholas Labich, Jeffrey S. Foster, Michael Hicks, and David Van Horn. 2015. Incremental Computation with Names. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 748–766. https://doi.org/10.1145/2814270.2814305

[10] Matthew A. Hammer, Khoo Yit Phang, Michael Hicks, and Jeffrey S. Foster. 2014. Adapton: Composable, Demand-driven Incremental Computation. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 156–166. https://doi.org/10.1145/2594291.2594324

[11] Keith Harrison. 2012. Scalable Detection of Community Cyber Incidents Utilizing Distributed and Anonymous Security Information Sharing. Ph.D. Dissertation. UTSA. Advisor(s) White, Gregory. AAI3548643.

[12] Homa Hosseinmardi, Amir Ghasemianlangroodi, Richard Han, Qin Lv, and Shivakant Mishra. 2014. Towards Understanding Cyberbullying Behavior in a Semi-Anonymous Social Network. In ASONAM '14. IEEE, Beijing, China, 244–252.

[13] Homa Hosseinmardi, Sabrina Arredondo Mattson, Rahat Ibn Rafiq, Richard Han, Qin Lv, and Shivakant Mishra. 2015. Analyzing Labeled Cyberbullying Incidents on the Instagram Social Network. Springer, Cham, 49–66. https://doi.org/10.1007/978-3-319-27433-1_4

[14] Loc N. Huynh, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '17). ACM, New York, NY, USA, 82–95. https://doi.org/10.1145/3081333.3081360

[15] K. Reynolds, A. Kontostathis, and L. Edwards. 2011. Using Machine Learning to Detect Cyberbullying. ICMLA '11 2 (2011), 241–244. https://doi.org/10.1109/ICMLA.2011.152

[16] Robin M. Kowalski, Sue Limber, Susan P. Limber, and Patricia W. Agatston. 2012. Cyberbullying: Bullying in the Digital Age. John Wiley & Sons, Reading, MA.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 60, 6 (May 2017), 84–90. https://doi.org/10.1145/3065386

[18] Evan Lepage. 2015. Instagram Statistics. http://blog.hootsuite.com/instagram-statistics-for-business/. [Online; accessed 6-February-2015].

[19] H. Hosseinmardi, S. Li, Z. Yang, Q. Lv, R. I. Rafiq, R. Han, and S. Mishra. 2014. A Comparison of Common Users across Instagram and Ask.fm to Better Understand Cyberbullying. In IEEE BDCloud '14. IEEE, Sydney, Australia, 355–362. https://doi.org/10.1109/BDCloud.2014.87

[20] Rui Li, Kin Hou Lei, Ravi Khadiwala, and Kevin Chen-Chuan Chang. 2012. TEDAS: A Twitter-based Event Detection and Analysis System. In ICDE '12. IEEE, Washington, DC, USA, 1273–1276. https://doi.org/10.1109/ICDE.2012.125

[21] Steven Loria. 2016. Python Sentiment Library. https://github.com/sloria/textblob. [Online; accessed 30-May-2016].

[22] M. Ptaszynski, P. Dybala, T. Matsuba, F. Masui, R. Rzepka, K. Araki, and Y. Momouchi. 2010. In the Service of Online Order: Tackling Cyberbullying with Machine Learning and Affect Analysis. International Journal of Computational Linguistics Research 1, 3 (2010), 135–154.

[23] Rahat Ibn Rafiq, Homa Hosseinmardi, Richard Han, Qin Lv, Shivakant Mishra, and Sabrina Arredondo Mattson. 2015. Careful What You Share in Six Seconds: Detecting Cyberbullying Instances in Vine. In ASONAM. ACM, Paris, France, 617–622.

[24] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, NY, USA, 851–860. https://doi.org/10.1145/1772690.1772777

[25] Python scikit-learn. 2016. AdaBoost Classifier. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html.

[26] Craig Smith. 2015. Vine Statistics. http://expandedramblings.com/index.php/vine-statistics/. [Online; accessed 6-February-2015].

[27] Laura Smith-Spark. 2014. Hannah Smith Suicide Fuels Calls for Action on Ask.fm Cyberbullying. http://www.cnn.com/2013/08/07/world/europe/uk-social-media-bullying/. [Online; accessed 14-January-2014].

[28] Francesca Spezzano. 2016. Bad Actors in Social Media. In Proceedings of the First International Workshop on Computational Methods for CyberSafety (CyberSafety '16). ACM, New York, NY, USA, 1–1. https://doi.org/10.1145/3002137.3002138

[29] StopCyberBullying.org. 2017. 5 Differences between Cyber Bullying and Traditional Bullying. http://onlinesense.org/5-differences-cyber-bullying-traditional-bullying/. [Online; accessed 22-May-2017].

[30] Cynthia Van Hee, Els Lefever, Ben Verhoeven, Julie Mennes, Bart Desmet, Guy De Pauw, Walter Daelemans, and Veronique Hoste. 2015. Automatic Detection and Prevention of Cyberbullying. In International Conference on Human and Social Analytics, Proceedings, Pascal Lorenz and Christian Bourret (Eds.). IARIA, Saint Julians, Malta, 13–18.

[31] Wikipedia. 2016. Logistic Regression. https://en.wikipedia.org/wiki/Logistic_regression. [Online; accessed 30-May-2016].

[32] Xinyu Xing, Yu-li Liang, Sui Huang, Hanqiang Cheng, Richard Han, Qin Lv, Xue Liu, Shivakant Mishra, and Yi Zhu. 2012. Scalable Misbehavior Detection in Online Video Chat Services. In KDD. ACM, NY, USA, 552–560.
