mining social media: a brief introduction€¦ · mining social media: a brief introduction pritam...

17
INFORMS 2012 c 2012 INFORMS | isbn 978-0-9843378-3-5 http://dx.doi.org/10.1287/educ.1120.0105 Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 {[email protected], [email protected]} Abstract The pervasive use of social media has generated unprecedented amounts of social data. Social media provides easily an accessible platform for users to share informa- tion. Mining social media has its potential to extract actionable patterns that can be beneficial for business, users, and consumers. Social media data are vast, noisy, unstructured, and dynamic in nature, and thus novel challenges arise. This tutorial reviews the basics of data mining and social media, introduces representative research problems of mining social media, illustrates the application of data mining to social media using examples, and describes some projects of mining social media for human- itarian assistance and disaster relief for real-world applications. Keywords social media; data mining; social data; social media mining; social networking sites; blogging; microblogging; crowdsourcing; HADR; privacy; trust 1. Introduction Data mining research has successfully produced numerous methods, tools, and algorithms for handling large amounts of data to solve real-world problems. Traditional data mining has become an integral part of many application domains including bioinformatics, data warehousing, business intelligence, predictive analytics, and decision support systems. Pri- mary objectives of the data mining process are to effectively handle large-scale data, extract actionable patterns, and gain insightful knowledge. Because social media is widely used for various purposes, vast amounts of user-generated data exist and can be made available for data mining. Data mining of social media can expand researchers’ capability of under- standing new phenomena due to the use of social media and improve business intelligence to provide better services and develop innovative opportunities. For example, data mining techniques can help identify the influential people in the vast blogosphere, detect implicit or hidden groups in a social networking site, sense user sentiments for proactive planning, develop recommendation systems for tasks ranging from buying specific products to mak- ing new friends, understand network evolution and changing entity relationships, protect user privacy and security, or build and strengthen trust among users or between users and entities. Mining social media is a burgeoning multidisciplinary area where researchers of dif- ferent backgrounds can make important contributions that matter for social media research and development. The objective of this tutorial is to introduce social media, data mining, and their con- fluence—mining social media. We attempt to achieve the goal by presenting representative and interesting research issues and important social media tasks based on our experience and research. This tutorial first reviews data mining, social media and its types, and the importance of social media mining. In §2, we briefly introduce representative issues in social media mining. In §3, we highlight the impact of social media mining using three examples based on our current research. Section 4 illustrates how social media mining is applied in some real-world applications—two projects on humanitarian assistance and disaster relief (HADR) carried out in the Data Mining and Machine Learning Laboratory (DMML) at Arizona State University (ASU). We conclude this tutorial in §5. 1

Upload: others

Post on 13-Jul-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

INFORMS 2012 c© 2012 INFORMS | isbn 978-0-9843378-3-5

http://dx.doi.org/10.1287/educ.1120.0105

Mining Social Media: A Brief Introduction

Pritam Gundecha, Huan LiuArizona State University, Tempe, Arizona 85287 {[email protected], [email protected]}

Abstract The pervasive use of social media has generated unprecedented amounts of socialdata. Social media provides easily an accessible platform for users to share informa-tion. Mining social media has its potential to extract actionable patterns that canbe beneficial for business, users, and consumers. Social media data are vast, noisy,unstructured, and dynamic in nature, and thus novel challenges arise. This tutorialreviews the basics of data mining and social media, introduces representative researchproblems of mining social media, illustrates the application of data mining to socialmedia using examples, and describes some projects of mining social media for human-itarian assistance and disaster relief for real-world applications.

Keywords social media; data mining; social data; social media mining; social networking sites;blogging; microblogging; crowdsourcing; HADR; privacy; trust

1. Introduction

Data mining research has successfully produced numerous methods, tools, and algorithmsfor handling large amounts of data to solve real-world problems. Traditional data mininghas become an integral part of many application domains including bioinformatics, datawarehousing, business intelligence, predictive analytics, and decision support systems. Pri-mary objectives of the data mining process are to effectively handle large-scale data, extractactionable patterns, and gain insightful knowledge. Because social media is widely usedfor various purposes, vast amounts of user-generated data exist and can be made availablefor data mining. Data mining of social media can expand researchers’ capability of under-standing new phenomena due to the use of social media and improve business intelligenceto provide better services and develop innovative opportunities. For example, data miningtechniques can help identify the influential people in the vast blogosphere, detect implicitor hidden groups in a social networking site, sense user sentiments for proactive planning,develop recommendation systems for tasks ranging from buying specific products to mak-ing new friends, understand network evolution and changing entity relationships, protectuser privacy and security, or build and strengthen trust among users or between users andentities. Mining social media is a burgeoning multidisciplinary area where researchers of dif-ferent backgrounds can make important contributions that matter for social media researchand development.

The objective of this tutorial is to introduce social media, data mining, and their con-fluence—mining social media. We attempt to achieve the goal by presenting representativeand interesting research issues and important social media tasks based on our experienceand research. This tutorial first reviews data mining, social media and its types, and theimportance of social media mining. In §2, we briefly introduce representative issues in socialmedia mining. In §3, we highlight the impact of social media mining using three examplesbased on our current research. Section 4 illustrates how social media mining is applied insome real-world applications—two projects on humanitarian assistance and disaster relief(HADR) carried out in the Data Mining and Machine Learning Laboratory (DMML) atArizona State University (ASU). We conclude this tutorial in §5.

1

Page 2: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction2 Tutorials in Operations Research, c© 2012 INFORMS

1.1. Data Mining

Data mining (Tan et al. [55]) is a process of discovering useful or actionable knowledgein large-scale data. Data mining also means knowledge discovery from data (KDD) (Hanet al. [24]), which describes the typical process of extracting useful information from rawdata. The KDD process broadly consists of the following tasks: data preprocessing, datamining, and postprocessing. These steps need not be separate tasks and can be combinedtogether. Data mining is an integral part of many related fields including statistics, machinelearning, pattern recognition, database systems, visualization, data warehouse, and infor-mation retrieval (Han et al. [24]).

Data mining algorithms are broadly classified into supervised, unsupervised, and semi-supervised learning algorithms. Classification is a common example of supervised learningapproach. For supervised learning algorithms, a given data set is typically divided into twoparts: training and testing data sets with known class labels. Supervised algorithms buildclassification models from the training data and use the learned models for prediction. Toevaluate a classification model’s performance, the model is applied to the test data to obtainclassification accuracy. Typical supervised learning methods include decision tree induction,k-nearest neighbors, naive Bayes classification, and support vector machines.

Unsupervised learning algorithms are designed for data without class labels. Clusteringis a common example of unsupervised learning. For a given task, unsupervised learningalgorithms build the model based on the similarity or dissimilarity between data objects.Similarity or dissimilarity between the data objects can be measured using proximity mea-sures including Euclidean distance, Minkowski distance, and Mahalanobis distance. Otherproximity measures such as simple matching coefficient, Jaccard coefficient, cosine similarity,and Pearson’s correlation can also be used to calculate similarity or dissimilarity betweenthe data objects. K-means, hierarchical clustering (agglomerative or partitional methods),and density-based clustering are typical examples of unsupervised learning.

Semisupervised learning algorithms are most applicable where there exist small amountsof labeled data and large amounts of unlabeled data. Two typical types of semisupervisedlearning are semisupervised classification and semisupervised clustering. The former useslabeled data to make classification and unlabeled data to refine the classification boundariesfurther, and the latter uses labeled data to guide clustering. Cotraining is a representativesemisupervised learning algorithm. Active learning algorithms allow users to play an activerole in the learning process via labeling. Typically, users are domain experts and theirskills are employed to label some data instances for which a machine learning algorithm areconfident about its classification. Minimum marginal hyperplane and maximum curiosityare two popular active learning algorithms.

Data mining includes other techniques such as association rule mining, anomaly detection,feature selection, instance selection, and visual analytics. Additional details related to thesedata mining techniques can be found in Han et al. [24], Tan et al. [55], Witten et al. [59],Zhao and Liu [63], and Liu and Motoda [37].

1.2. Social Media

Social media (Kaplan and Haenlein [28]) is defined as a group of Internet-based applicationsthat build on the ideological and technological foundations of Web 2.0 and that allow thecreation and exchanges of user-generated content. Social media is conglomerate of differenttypes of social media sites including traditional media such as newspaper, radio, and televi-sion and nontraditional media such as Facebook, Twitter, etc. Table 1 shows characteristicsof different types of social media.

Social media gives users an easy-to-use way to communicate and network with each otheron an unprecedented scale and at rates unseen in traditional media. The popularity of socialmedia continues to grow exponentially, resulting in an evolution of social networks, blogs,

Page 3: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 3

Table 1. Characteristics of different types of social media.

Type Characteristics

Online social networking Online social networks are Web-based services that allowindividuals and communities to connect with real-world friendsand acquaintances online. Users interact with each otherthrough status updates, comments, media sharing, messages,etc. (e.g., Facebook, Myspace, LinkedIn).

Blogging A blog is a journal-like website for users, aka bloggers, tocontribute textual and multimedia content, arranged in reversechronological order. Blogs are generally maintained by anindividual or by a community (e.g., Huffington Post, BusinessInsider, Engadget).

Microblogging Microblogs can be considered same a blogs but with limitedcontent (e.g., Twitter, Tumblr, Plurk).

Wikis A wiki is a collaborative editing environment that allow multipleusers to develop Web pages (e.g., Wikipedia, Wikitravel,Wikihow).

Social news Social news refers to the sharing and selection of news stories andarticles by community of users (e.g., Digg, Slashdot, Reddit).

Social bookmarking Social bookmarking sites allow users to bookmark Web contentfor storage, organization, and sharing (e.g., Delicious,StumbleUpon).

Media sharing Media sharing is an umbrella term that refers to the sharing ofvariety of media on the Web including video, audio, and photo(e.g., YouTube, Flickr, UstreamTV).

Opinion, reviews, and ratings The primary function of such sites is to collect and publish user-submitted content in the form of subjective commentary onexisting products, services, entertainment, businesses, places,etc. Some of these sites also provide products reviews (e.g.,Epinions, Yelp, Cnet).

Answers These sites provide a platform for users seeking advice, guidance,or knowledge to ask questions. Other users from thecommunity can answer these questions based on previousexperiences, personal opinions, or relevent research. Answersare generally judged using ratings and comments (e.g., Yahoo!answers, WikiAnswers).

microblogs, location-based social networks (LBSNs), wikis, social bookmarking applications,social news, media (text, photo, audio, and video) sharing, product and business reviewsites, etc. Facebook,1 the social networking site, recorded more than 845 million active usersas of December 2011. This number suggests that China (approximately 1.3 billion) andIndia (approximately 1.1 billion) are the only two countries in the world that have largerpopulations than Facebook. Facebook and Twitter have accrued more than 1.2 billion users,2

more than thrice the population of the United States and more than the population of anycontinent except Asia.

1.3. Mining Social Media

Vast amounts of user-generated content are created on social media sites every day. Thistrend is likely to continue with exponentially more content in the future. Hence, it is criticalfor producers, consumers, and service providers to figure out management and utility ofmassive user-generated data. Social media growth is driven by these challenges: (1) How

1 http://newsroom.fb.com/content/default.aspx?NewsAreaId=22 (accessed March 2012).2 Facebook and Twitter have many overlapping users. Hence, 1.2 billion users are not unique.

Page 4: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction4 Tutorials in Operations Research, c© 2012 INFORMS

can a user be heard? (2) Which source of information should a user use? (3) How can userexperience be improved? Answers to these questions are hidden in the social media data.These challenges present ample opportunities for data miners to develop new algorithmsand methods for social media.

Data generated on social media sites are different from conventional attribute-value datafor classic data mining. Social media data are largely user-generated content on social mediasites. Social media data are vast, noisy, distributed, unstructured, and dynamic. These char-acteristics pose challenges to data mining tasks to invent new efficient techniques and algo-rithms. For example, Facebook3 and Twitter4 report Web traffic data from approximately149 million and 90 million unique U.S. visitors per month, respectively. According to thevideo sharing site YouTube,5 more than 4 billion videos are viewed per day, and 60 hours ofvideos are uploaded every minute. The picture sharing site Flickr,6 as of August 2011, hostsmore than 6 billion photo images. Web-based, collaborative, and multilingual Wikipedia7

hosts over 20 million articles attracting over 365 million readers.Depending on social media platforms, social media data can often be very noisy. Remov-

ing the noise from the data is essential before performing effective mining. Researchersnotice that spammers (Yardi et al. [61], Chu et al. [12]) generate more data than legiti-mate users. Social media data are distributed because there is no central authority thatmaintains data from all social media sites. Distributed social media data pose a dauntingtask for researchers to understand the information flows on the social media. Social mediadata are often unstructured. To make meaningful observations based on unstructured datafrom various data sources is a big challenge. For example, social media sites like LinkedIn,Facebook, and Flickr serve different purposes and meet different needs of users.

Social media sites are dynamic and continuously evolving. For example, Facebook recentlybrought about many concepts including a user’s timeline, the creation of in-groups for auser, and numerous user privacy policy changes. The dynamic nature of social media datais a significant challenge for continuously and speedily evolving social media sites. Thereare many additional interesting questions related to human behavior can be studied usingsocial media data. Social media can also help advertisers to find the influential people tomaximize the reach of their products within an advertising budget. Social media can helpsociologists to uncover the human behavior such as in-group and out-group behaviors ofusers. Recently, social media was reported to play an instrumental role in facilitating massmovements such as the Arab Spring8 and Occupy Wall Street.9

2. Issues in Mining Social Media

In this section, we introduce some representative research issues in mining social media.

2.1. Community Analysis

A community is formed by individuals such that those within a group interact with eachother more frequently than with those outside the group. Based on the context, a communityis also referred to as a group, cluster, cohesive subgroup, or module. Communities can beobserved via connections in social media because social media allows people to expand socialnetworks online (Tang and Liu [56]). Social media enables people to connect friends and

3 http://www.quantcast.com/facebook.com (accessed March 2012).4 http://www.quantcast.com/twitter.com (accessed March 2012).5 http://www.youtube.com/t/press statistics (accessed March 2012).6 http://news.softpedia.com/news/Flickr-Boasts-6-Billion-Photo-Uploads-215380.shtml (accessed March 2012).7 http://en.wikipedia.org/wiki/Wikipedia (accessed March 2012).8 http://en.wikipedia.org/wiki/Arab Spring (accessed March 2012).9 http://en.wikipedia.org/wiki/Occupy Wall Street (accessed March 2012).

Page 5: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 5

find new users of similar interests. Communities found in social media are broadly classifiedinto explicit and implicit groups. Explicit groups are formed by user subscriptions, whereasimplicit groups emerge naturally through interactions. Community analysts are generallyfaced with issues such as community detection, formation, and evolution.

Community detection often refers to the extraction of implicit groups in a network. Themain challenges of community detections are that (1) the definition of a community canbe subjective, and (2) the lack of ground truth makes community evaluation difficult. Tangand Liu [56] divided community detection methods into four categories: (1) node-centriccommunity detection, where each node satisfies certain properties such as complete mutual-ity, reachability, node degrees, frequency of within and outside ties, etc. (examples includecliques, k-cliques, and k-clubs); (2) group-centric community detection, where a group needsto satisfy certain properties (for example, minimum group densities); (3) network-centriccommunity detection, where groups are formed based on partition of network into disjointsets (examples are spectral clustering and modularity maximization); and (4) hierarchy-centric community detection, where the goal is to build a hierarchical structure of com-munities. This allows the analysis of a network with different resolutions. Representativemethods are divisive clustering and agglomerative clustering.

Social media networks are highly dynamic. Communities can expand, shrink, or dissolvein dynamic networks. Community evolution aims to discover the patterns of a communityover time with the presence of dynamic network interactions. Backstrom et al. [8] foundthat the more friends you have in a group, the more likely you are to join, and communitieswith cliques grow more slowly than those that are not tightly connected.

2.2. Sentiment Analysis and Opinion Mining

Sentiment analysis and opinion mining aim to automatically extract opinions expressed inthe user-generated content. Sentiment analysis and opinion mining tools allow businesses tounderstand product sentiments, brand perception, new product perception, and reputationmanagement. These tools help users to perceive product opinions or sentiments on a globalscale. There are many social media sites reporting user opinions of products in many differentformats. Monitoring these opinions related to a particular company or product on socialmedia sites is a new challenge.

Sentiment analysis is hard because languages used to create contents are ambiguous.Major steps of sentiment analysis are (1) finding relevant documents, (2) finding relevantsections, (3) finding the overall sentiment, (4) quantifying the sentiment, and (5) aggregatingall sentiments to form an overview. Basic components of an opinion are (1) an object onwhich opinion is expressed, (2) an opinion expressed on a object, and (3) the opinion holder.Objects are generally represented as a finite set of features, where each feature representsa finite set of synonymous words or phrases. Opinion mining tasks can be performed atthe document level (Turney [58], Pang et al. [49]), sentence level (Riloff and Wiebe [51],Yu and Hatzivassiloglou [62]), or feature level (Hu and Liu [25], Popescu and Etzioni [50],Liu and Maes [36]). Extracting opinions expressed in comparative sentences is a difficulttask, and some preliminary work can be found in Jindal and Liu [26], Jindal and Liu [27],Liu [35], and Pang and Lee [48]. Performance evaluation of sentiment analysis is anotherchallenge because of the lack of ground truth.

2.3. Social Recommendation

Traditional recommendation systems attempt to recommend items based on aggregatedratings of objects from users or past purchase histories of users. A social recommendationsystem makes use of user’s social network and related information in addition to the tra-ditional recommendation means. Social recommendation is based on the hypothesis thatpeople who are socially connected are more likely to share the same or similar interests

Page 6: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction6 Tutorials in Operations Research, c© 2012 INFORMS

(homophily), and users can be easily influenced by the friends they trust and prefer theirfriends’ recommendations to random recommendations. Objectives of social recommenda-tion systems are to improve the quality of recommendation and alleviate the problem ofinformation overload. Examples of social recommendation systems are book recommenda-tions based on friends’ reading lists on Amazon or friend recommendations on Twitterand Facebook. More details on social recommendation systems can be found in Konstaset al. [30], Ma et al. [38], and Backstrom and Leskovec [7].

2.4. Influence Modeling

Social scientists have been exploring influence and homophily in social networks for quitesome time (McPherson et al. [42], Lazarsfeld and Merton [34]). It is important to knowwhether the underlying social network is influence driven or homophily driven. For example,in the advertisement industry, if the social network is influence driven, then the influentialusers should be identified and incentivized to promote the product or services to the membersof the social network. However, if the social network is homophily (similarity) driven, thensome individual users should be directly targeted to promote sales. Most social networkshave a mixture of both homophily and influence. Hence, distinguishing them is a challenge.Aral et al. [5] and La Fond and Neville [33] gave details on distinguishing social influence andhomophily in social networks. Kempe et al. [29] studied the influence maximization problem.For a given information propagation model, influence maximization aims to identify theset of initial influential users from a given snapshot of a social network such that they caninfluence the maximum number of other users within given budget constraints. Agarwalet al. [3, 4] presented a preliminary model to identify influential bloggers in a community.The blogosphere obeys a power law distribution with a few blogs being extremely influentialand a huge number of blogs being largely unknown. They reported that active bloggers arenot necessarily influential and proposed effective influence measures to identify influentialbloggers. A more general discussion on modeling and data mining in the blogosphere is givenby Agarwal and Liu [2].

2.5. Information Diffusion and Provenance

Researchers study how information diffuses and explore different models of informationdiffusion, including the independent cascade model, the threshold model, the susceptible–infected model, and the susceptible–infected–recovered model. Many such models have beenstudied (Bailey [10], Granovetter [21], Mahajan [40], Macy [39], Berger [11]). Researchersapply these models to analyze the spread of rumors, computer viruses, and diseases duringoutbreaks. Two important problems from the social media viewpoint are (1) how informationspreads in a social media network and which factors affect the spread, and (2) what plausiblesources are, given some information from social media. The first problem of informationdiffusion has received good attention from researchers. The second problem is still an openresearch problem—information provenance in social media—and recognized as a key issueto differentiate rumors from truth. Because social media data are distributed and dynamic,the conventional techniques used in classical provenance research cannot be directly appliedin social media.

2.6. Privacy, Security, and Trust

The low barrier to and pervasive use of social media give rise to concerns on user privacy andsecurity issues. New challenges arise due to user’s opposing needs: on one hand, a user wouldlike to have as many friends and share as much as possible, and on the other hand, a userwould like to be as private as possible when needed. However, being gregarious requiresopenness and transparency, but being private constricts one’s sharing. In addition, a socialnetworking site has its business needs to encourage users to easily find each other and expand

Page 7: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 7

their friendship networks as widely as possible—in other words, to be open. Hence, socialmedia poses new security challenges to fend off security threats to users and organizations.With the variety of personal information disclosed in user profiles (e.g., information aboutother users and user networks may be indirectly accessible), individuals may put themselvesand members of their social networks at risk for a variety of attacks. Social media has beenthe target of numerous passive as well as active attacks including stalking, cyberbullying,malvertizing, phishing, social spamming, scamming, and clickjacking.

Gross and Acquisti [22] showed that only a few users change the default privacy prefer-ences on Facebook. In some cases, user profiles are completely public, making informationavailable and providing a communication mechanism to anyone who wants to access it. Itis no secret that when a profile is made public, malicious users including stalkers, spam-mers, and hackers can use sensitive information for their personal gain. Sometimes malev-olent users can even cause physical or emotional distress to other users (Rosenblum [52]).Narayanan and Shmatikov [45, 46] demonstrated how users’ privacy can be weakened if anattacker knows the presence of connections among users. Wondracek et al. [60] presented asuccessful scheme to breach privacy by exploiting only the group membership information ofusers.

Liu and Maes [36] pointed out a lack of privacy awareness and found a large numberof social network profiles in which people described themselves with a rich vocabulary interms of their passions and interests. Krishnamurthy and Wills [31] discussed the problemof leakage of personally identifiable information and how it can be misused by third parties(Narayanan and Shmatikov [46]). Squicciarini et al. [53] introduced a novel collective privacymechanism for better managing shared content between users. Fang and LeFevre [14] focusedon helping users to understand simple privacy settings, but did not consider additionalproblems such as attribute inference (Zheleva and Getoor [64]) or shared data ownership(Squicciarini et al. [53]). Zheleva and Getoor [64] showed how an adversary can exploit anonline social network with a mixture of public and private user profiles to predict the privateattributes of users. Baden et al. [9] presented a framework where users dictate who mayaccess their information based on public–private encryption–decryption algorithms.

Social trust depends on many factors that cannot be easily modeled in a computa-tional system. Many different versions of definition of trust are proposed in the litera-ture (Deutsch [13], Sztompka [54], Mui et al. [44], Olmedilla et al. [47], Grandison andSloman [20], Artz and Gil [6]). A highly cited thesis on trust computation (Marsh [41])provides theoretical perspectives of modeling trust, but its complex nature makes it verydifficult to apply, especially to social networks (Golbeck and Hendler [18]). Trust betweenany two people is observed to be affected by many factors including past experiences, opin-ions expressed and actions taken, contributions to spreading rumors, influence by others’opinions, and motives to gain something extra. Another important aspect of trust is thetrustworthiness of user-generated content. Moturu and Liu [43] provided an intuitive scoringmeasure to quantify the trustworthiness of health-related user-generated content in socialmedia.

3. Illustrative Examples of Mining Social Media

In this section, we present some examples in our research to illustrate how to mine socialmedia data to address novel research issues. The first example is about assessing user vulner-ability on a social networking sites to maintain user privacy. The second example exploresthe importance of social and historical ties on a location-based social network. The thirdexample introduces a method that can take advantage of multifaceted trust for predictiveanalytics.

Page 8: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction8 Tutorials in Operations Research, c© 2012 INFORMS

3.1. Assessing User Vulnerability on a Social Networking Site(Gundecha et al. [23])

Attribute-value data are a principal data form in social media. Attributes available for everyuser on a social networking site can be categorized into two major types: individual attributesand community attributes. Individual attributes are those attributes that contain individualuser information. Individual attributes include personal information such as gender, birthdate, phone number, home address, etc. Community attributes are those attributes thatcontain information about the friends of a user. Community attributes include friends thatare traceable from a user’s profile (i.e., a user’s friends list), tagged pictures, and wallinteractions. Using the privacy and security settings of a profile, a user can control thevisibility of most individual attributes but has limited control over the visibility of mostcommunity attributes. For example, Facebook users these days can control photo taggingand the sharing of their friend list with the public but still can not control friends sharingtheir friend lists or uploading photos of them from their profiles to the public.

Our earlier work (Gundecha et al. [23]) used a large-scale Facebook data set to assessattribute visibility, which can be used to obtain general behavior of Facebook users. Forexample, Facebook users do not usually disclose their mobile phone number. Hence, usersthat do disclose phone numbers have a propensity to vulnerability because they disclosemore sensitive information in their profiles. A large portion of users are either not careful ornot aware of consequences of their actions on the privacy information of their friends. Thus,protecting community attributes is important in protecting user privacy.

A novel mechanism was proposed by Gundecha et al. [23] to enable users to protect againstvulnerability. Often users on a social networking site are unaware that they could pose athreat to their friends because of their vulnerability. Gundecha et al. [23] showed that itis feasible to measure a user’s vulnerability based on three factors: (1) the user’s privacysettings that can reveal personal information, (2) the user’s action on a social networkingsite that can expose his or her friends’ personal information, and (3) friends’ actions ona social networking site that can reveal the user’s personal information. Based on thesefactors, Gundecha et al. [23] formally presented one of the earliest models for vulnerabilityreduction. They proposed a four-step procedure to estimate user vulnerability: (1) estimaterisk to privacy due to individual attributes (referred to as the I-index), (2) estimate risk toprivacy due to community attributes (referred to as the C-index), (3) estimate the visibilityof a user based on the I-index and C-index (referred to as the P -index), and (4) estimatethe user vulnerability based on the P -indexes of a user and his one-hop friends (referred toas the V -index).

Figures 1(a) and 1(b) show the relationship between the I-index and C-index as well asthat between the P -index and V -index for 100,000 randomly chosen Facebook users. Notethat users are sorted in ascending order of their I-indexes and P -indexes. The x-axis andy-axis show users and their index values, respectively.

A user’s vulnerable friend is defined as a friend whose unfriending will lower the V -indexscore of a user. This definition of a vulnerable friend can be generalized to multiple vulnerablefriends. Figure 2 shows a comparison of the V -index values of each user before and after theunfriending of the user’s k most vulnerable friends. For each graph in Figure 2, the x-axisand y-axis indicate users and their V -index values, respectively. Without loss of generality,we sorted all users, in the ascending order based on the existing V -index values, before weplotted the graphs in Figure 2. We ran the experiment on 300,000 randomly selected usersof the Facebook data set. Figure 2 shows the performance comparison of V -index values foreach user before (+) and after (◦) unfriending the k most vulnerable friends. We ran theexperiments for different values of k including 1, 2, 10, and 50. As expected, vulnerabilitydecreases consistently as the value of k increases, as seen in Figures 2(a)–2(d).

Page 9: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 9

Figure 1. Relationship among index values for each user.

(a) I -index and C­index for each user

1.0

0.8

C-index I-index

0.6

Inde

x va

lues

0.4

0.2

00 2.5

Users ×104

5.0 7.5 10

(b) P -index and V­index for each user

V -index P -index1.0

0.8

0.6

Inde

x va

lues

0.4

0.2

00 2.5

Users ×104

5.0 7.5 10

3.2. Exploring Social–Historical Ties on Location-BasedSocial Networks (Gao et al. [16])

LBSNs have been a popular form of social media in recent years. They provide location-related services that allow users to check in at geographical locations and share such expe-rience with their friends. Millions of check-in records in LBSNs contain rich social andgeographical information and provide a unique opportunity for researchers to study users’

Figure 2. Performance comparison of V -index values for each user before (+) and after (◦)unfriending the k most vulnerable friends from his or her social network.

(a) Most vulnerable

M1-index V-index M2-index V-index

(b) 2 most vulnerable

(c) 10 most vulnerable (d) 50 most vulnerable

Inde

x va

lues

0.6

0.4

0.2

0

Inde

x va

lues

0.6

0.4

0.2

0

Inde

x va

lues

0.6

0.4

0.2

0

Inde

x va

lues

0.6

0.4

0.2

0

0 1 2

Users ×105

3 0 1 2

Users ×105

3

0 1 2

Users ×105

30 1 2

Users ×105

3

M50-index V-indexM10-index V-index

Page 10: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction10 Tutorials in Operations Research, c© 2012 INFORMS

social behavior from a spatial–temporal aspect, which in turn enables a variety of servicesincluding place advertisement or recommendation, traffic forecasting, and disaster relief.

To understand a user’s check-in behavior, it is inevitable to perform a historical analysis ofusers. It is because the historical check-ins provide rich information about a user’s interestsand hints about when and where a particular user would like to go. In addition, socialcorrelation theory suggests to consider users’ social ties, because human movement is usuallyaffected by their social events, such as visiting friends, going out with colleagues, and so on.These two relationship ties can shape the user’s check-in experience on LBSNs, and eachtie gives rise to a different probability of check-in activity, which indicates that people indifferent spatial–temporal–social circles have different interactions.

The historical ties of a user’s check-in behavior have two properties on LBSNs. First,a user’s check-in history approximately follows a power-law distribution; i.e., a user goesto a few places many times and to many places a few times. Second, the historical tieshave a short-term effect. Taking advantage of the similarity between language modeling andlocation-based social network mining, the work of Gao et al. [16] introduced the Pitman–Yorprocess to location-based social networks to model the historical ties of a user i for his check-in behavior cn+1 = l at time (n+ 1) and location l, specifically, the power-law distributionand short-term effect of historical ties, denoted as historical model (HM) as shown below:

P iH(cn+1 = l) = P i, i

HPY (cn+1 = l | u, tul, d|u|, r|u|, tu),

where u, tul, d|u|, r|u|, and tu are parameters. A social–historical model (SHM) is proposedto explore user i’s check-in behavior integrating both of the social and historical effects:

P iSH(cn+1 = l) = ηP i

H(cn+1 = l) + (1− η)P iS(cn+1 = l),

where

P iS(cn+1 = l) =

∑uj∈N (ui)

sim(ui, uj)Pi, jHPY (cn+1 = l),

where N indicates user i’s set of neighbors.The experiments with location prediction on a real-world LBSN compare the proposed

methods (HM and SHM) with some baseline methods. The results are plotted in Figure 3,demonstrating that the proposed methods best model users’ check-ins in terms of locationprediction; in other words, social and historical ties can help location prediction.

This work finds that a user’s friends can influence his next location because users that haveshared friends tend to go to similar locations than those without. The power-law propertyand short-term effect are observed in historical ties; thus, a historical model is introduced tocapture these properties. The experimental results on location prediction demonstrate thatthe proposed approach is suitable in capturing a user’s check-in property and outperformscurrent models.

3.3. Discerning Multifaceted Trust in a Connected World(Tang et al. [57])

The issue of trust has attracted increasing attention from the community of social mediaresearch. Trust, as a social concept, naturally has multiple facets, indicating multiple andheterogeneous trust relationships between users. Here is a multifaceted trust example fromEpinions. Figure 4(a) shows single trust relationships between user 1 and his 20 friends.Here, we can see that user 7 is the more trustable for user 1. Figures 4(b) and 4(c) show theirmultifaceted trust relationships in the categories “home and garden” and “restaurants,”respectively. For the category “home and garden,” user 7 is not necessary the most trusted

Page 11: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 11

Figure 3. The performance comparison of prediction models demonstrates that by consideringsocial and historical ties, the proposed models can help location prediction.

10 20 30 40 50 60 70 80 90

0.15

0.20

0.25

0.30

0.35

0.40

Fraction of training set

Pre

dict

ion

accu

racy

MFCMFTOrder-1Order-2HMSHM

Note. MFC, most frequent check-in model; MFT, most frequent time model; Order-1, order-1 Markov model;Order-2, order-2 Markov model.

friend of user 1. This shows that trust relationships in different categories vary. Thus, peopletrust others differently in different facets.

There are two challenges to study in obtaining multifaceted trust between users: first,the representation of multiple and heterogeneous trust relationships between users, andsecond, estimating the strength of multifaceted trust. Traditionally, trust is represented byan adjacency matrix. However, this cannot capture the multifaceted trust relations. Tanget al. [57] developed a new algorithm, mTrust, that extends a matrix representation toa tensor representation, adding an extra dimension for facet description. Previous workobserved a strong correlation between trust and user similarity in the context of ratingsystems. Therefore, it is reasonable to embed trust strength inference in rating prediction.Thus, to evaluate the usefulness of multifaced trust, this work embeds the multifaceted trustinference in the framework of rating prediction.

Interesting findings from the experiments are that (1) more than 20% of reciprocal links areheterogeneous, (2) more than 14% transitive trust relations are heterogeneous, and (3) more

Figure 4. Single trust and multifaceted trust relationships of one user in Epinions.

(a) Single trust (b) Trust in home and garden (c) Trust in restaurants

1

2

3

45 6 7

8

9

10

11

12

13

14

151617

1819

20

21

Pajek

1

2

3

45

6 78

9

10

11

12

13

14

151617

1819

20

21

Pajek

1

2

3

4

5 6 78

9

10

11

12

13

14

151617

1819

20

21

Pajek

Note. The thickness of a line indicates the level of trust.

Page 12: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction12 Tutorials in Operations Research, c© 2012 INFORMS

than 11% of cocitation trust relations are heterogeneous. With these findings, mTrust canbe applied to many online tasks such as improving rating prediction, enabling facet-sensitiveranking, and making status theory applicable to reciprocal links.

4. Employing Social Media in Real-World Applications

Social media has been increasingly used in a wide range of domains; examples includepolitical campaigns (e.g., presidential elections), mass movements (e.g., organizing OccupyWall Street movements, Arab Spring), as well as disaster and crisis response and reliefcoordination. In this section, we show how social media mining is used for HADR.

Government and nongovernmental organizations (NGOs) have been faced with challengesto effectively respond to crises such as natural disasters (e.g., tsunamis, hurricanes, earth-quakes). Providing efficient relief to the victims is a primary focus in saving lives, minimizingfurther losses in the aftermath, and accelerating recovery. Crowdsourcing tools via socialmedia such as Twitter, Ushahidi, and Sahana have proven useful in gathering informationduring and after a crisis. However, they are designed to go beyond crowdsourcing. Additionalcapabilities are required for response coordination, secured collaboration, and trust build-ing among relief organizations. Goolsby [19] pointed out the need for such a social mediasystem and described the effort to build it to allow different organizations to share crowd-sourced and groupsourced information, and analyze and visualize the processed informationfor intelligent decision making. In this section, we describe two social media prototype sys-tems designed to facilitate efficient collaboration among disparate organizations for effectiveand coordinated responses to crises.

4.1. ASU Coordination Tracker (Gao et al. [17])

When natural disasters occur, the international community would join forces to provide dis-aster relief and humanitarian assistance. Some prominent examples include the Haiti earth-quake and the subsequent cholera outbreak, and the devastating earthquake and tsunami inJapan. Social media has revolutionized the use of traditional media and played an importantrole in these events as an information collector and disseminator, and as a communicationand collaboration tool.

As one of the most important functions of social media, crowdsourcing is a collaborativeinformation sharing mechanism based on the principle of collective wisdom. Wikipedia10 is aperfect example of crowdsourcing, where people collectively publish information on varioustopics. Crowdsourcing is capable of leveraging participatory social media services and toolsto collect information, and it allows crowds to participate in various HADR tasks. Its integra-tion with crisis maps has been a very effective crowdsourcing application in HADR efforts.One of the earliest such systems was Alive in Afghanistan,11 where people in Afghanistansubmitted their reports on accidents and terrorist attacks across Afghanistan.

Although crowdsourcing applications can provide accurate and timely information abouta crisis for decision making, current crowdsourcing applications still fall short in supportingdisaster relief efforts (Gao et al. [15]). Most importantly, current applications do not pro-vide a common mechanism specifically designed for collaboration and coordination betweendisparate relief organizations. For example, relief organizations that work independentlycan cause conflicts and complicate the relief efforts. If independent organizations duplicatea response, it will draw resources away from the other areas in need that could use theduplicated supplies. It can also result in delayed response to other disaster areas. Further-more, because of the noisy and chaotic nature of crowdsourced data, current crowdsourcingapplications cannot provide readily useful information for disaster relief efforts.

10 http://www.wikipedia.org/ (accessed March 2012).11 http://aliveinafghanistan.org/ (accessed March 2012).

Page 13: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 13

To address these shortfalls of crowdsourcing to a certain extent, an online coordinationsystem, the ASU Coordination Tracker (ACT), was devised. The ACT is an event responsecoordination system with the primary goal of facilitating multiorganization response (mili-tary, governments, NGOs, etc.) to an event such as a disaster and providing relief organi-zations the means for better collaboration and coordination during a crisis. It is a speedyand effective approach with easy, open communication and streamlined coordination. Themajor features and advantages of the ACT are summarized below:

• leveraging crowdsourcing information to provide the means for a groupsourcing responsefor organizations to assist persons affected by a crisis effectively,• developing novel strategies to analyze the requests and maximize the coordinate efforts

of crowdsourced data for disaster relief,• visualizing the requests to give a global view of the request distribution to facilitate

responders, and• increasing the coordination efficiency by optimally responding to requests.

The ACT consists of five functional modules: request collection, request analysis, response,coordination, and situation awareness. Raw requests from users are collected via crowd-sourcing and groupsourcing methods. The request analysis module takes advantage of bothdata mining technology and expert knowledge to iteratively capture the essential contentof raw requests. Based on the demand and disaster background, raw requests are analyzedand classified into various categories. Essential contents extracted from the raw data arethen stored in a requests pool. The system visualizes the requests pool on its crisis map,with its selective visibility bonded to the access levels of users (organizations). The responsemodule is designed to allow relief organizations to contribute, receive, and evaluate differ-ent response options and plan response actions. Relief organizations are able to respondto requests through the crisis map directly. The coordination model uses the interagencyconcept (Goolsby [19]) to avoid response conflicts while maintaining centralized control.The ACT displays every available request on the crisis map. Each request is in one of thefollowing four states: available, in process, in delivery, or delivered. To avoid conflicts, relieforganizations are not able to respond to requests that are being fulfilled by another organi-zation. A statistics module runs in the background to help track relief progress for situationawareness, relief strategy adjustments, and further decision making.

The ACT is a developing system that enables coordination among organizations duringdisaster relief. We are investing approaches to improve collaboration efficiency and providedifferential security to relief organizations and decision makers. The disaster relief ASU CrisisResponse Game (Abbasi et al. [1]) has demonstrated that by both leveraging crowdsourcinginformation and providing means for a groupsourcing response, organizations can effectivelyassist victims.

4.2. TweetTracker (Kumar et al. [32])

TweetTracker12 is a Twitter-based analytic and visualization tool. The focus of the tool is tohelp HADR relief organizations to acquire situational awareness during disasters and emer-gencies to aid disaster relief efforts (Kumar et al. [32]). New social media platforms, such asTwitter microblogs, demonstrate their value and cability to provide information that is noteasily attainable from traditional media. For example, during the Mumbai blasts of 2011,13

firsthand information from the affected region was available on Twitter moments after theblast. TweetTracker is designed to help to track, analyze, review, and monitor tweets. Thisis achieved through near-real-time tracking of tweets with specific keywords/hashtags andtweets generated from the region affected by the crisis. The tool supports monitoring and

12 http://tweettracker.fulton.asu.edu/ (accessed March 2012).13 http://ibnlive.in.com/news/mumbai-blasts-twitter-joins-hands-to-help/167345-3.html (accessed March 2012).

Page 14: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction14 Tutorials in Operations Research, c© 2012 INFORMS

analysis of the collected tweets via real-time trending, data reduction, historical review, andintegrated data mining techniques.

TweetTracker consists of three main components: (1) a Twitter stream reader, (2) a datastorage module, and (3) a data mining and visualization module. The Twitter stream readeris a data collection module that continually crawls tweets through the Twitter streamingAPI (Application Programming Interface).14 Tweets are filtered based on user-specifiedkeywords, hashtags, and geolocations. The data storage module is responsible for storing andindexing the collected tweets into a relational database for use by the visualization module.The data mining and visualization module is a Web-based user interface to the collectedtweets and a means to analyze the collected tweets. It provides geospatial visualization oftweets related to a particular event on a map, summarizes the tweets, and visualizes thetrending keywords in the form of a word cloud, and it can identify popular resources (URLs)and users mentioned in the tweets. The tool also includes built-in language translationsupport for monitoring of multilingual tweets.

TweetTracker has been used in tracking, visualizing, and analyzing activities including theArab Spring movement, the Occupy Wall Street movement, and various natural disasterssuch as earthquakes and cholera outbreaks.

5. Summary

Valuable information is hidden in vast amounts of social media data, presenting ampleopportunities social media mining to discover actionable knowledge that is otherwise difficultto find. Social media data are vast, noisy, distributed, unstructured, and dynamic, whichposes novel challenges for data mining. In this tutorial, we offer a brief introduction tomining social media, use illustrative examples to show that burgeoning social media miningis spearheading the social media research, and demonstrate its invaluable contributions toreal-world applications.

As a main type of “big data,” social media is finding its many innovative uses, such aspolitical campaigns, job applications, business promotion and networking, and customerservices, and using and mining social media is reshaping business models, accelerating viralmarketing, and enabling the rapid growth of various grassroots communities. It also helpsin trend analysis and sales prediction. Social media data will continue their rapid growthin the foreseeable future. We are faced with an increasing demand for new algorithms andsocial media mining tools. Existing preliminary success in social media mining researchefforts convincingly demonstrates the promising future of the emerging social media miningcommunity and will help to expand research and development and explore online and off-linehuman behavior and interaction patterns.

Acknowledgments

The authors thank Huiji Gao, Shamanth Kumar, Jiliang Tang, and DMML members for theirassistance and feedback in preparing this tutorial. Some projects described in this brief introductorysurvey were sponsored by the Office of Naval Research [ONR N000141010091]; the Army ResearchOffice [ARO 025071]; and the National Science Foundation [Grant 0812551].

References[1] M. A. Abbasi, S. Kumar, J. A. Andrade Filho, and H. Liu. Lessons learned in using social media

for disaster relief—ASU Crisis Response Game. Proceedings of the International Conferenceon Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer-Verlag, Berlin,282–289, 2012.

[2] N. Agarwal and H. Liu. Modeling and Data Mining in Blogosphere. Morgan & Claypool Pub-lishers, San Rafael, CA, 2009.

14 https://dev.twitter.com/docs/streaming-api (accessed March 2012).

Page 15: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 15

[3] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community.Proceedings of the International Conference on Web Search and Web Data Mining. Associationfor Computing Machinery, New York, 207–218, 2008.

[4] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Modeling blogger influence in a community. SocialNetwork Analysis and Mining 2(2):139–162, 2012.

[5] S. Aral, L. Muchnik, and A. Sundararajan. Distinguishing influence-based contagion fromhomophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sci-ences of the United States of America 106(51):21544, 2009.

[6] D. Artz and Y. Gil. A survey of trust in computer science and the Semantic Web. Web Seman-tics: Science, Services and Agents on the World Wide Web 5(2):58–71, 2007.

[7] L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommending linksin social networks. Proceedings of the Fourth ACM International Conference on Web Searchand Data Mining. Association for Computing Machinery, New York, 635–644, 2011.

[8] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large socialnetworks: Membership, growth, and evolution. Proceedings of the 12th ACM SIGKDD Inter-national Conference on Knowledge Discovery and Data Mining. Association for ComputingMachinery, New York, 44–54, 2006.

[9] R. Baden, A. Bender, N. Spring, B. Bhattacharjee, and D. Starin. Persona: An onlinesocial network with user-defined privacy. ACM SIGCOMM Computer Communication Review39(4):135–146, 2009.

[10] N. T. J. Bailey. The Mathematical Theory of Infectious Diseases and Its Applications. CharlesGriffin, High Wycombe, UK, 1975.

[11] E. Berger. Dynamic monopolies of constant size. Journal of Combinatorial Theory, Series B83(2):191–200, 2001.

[12] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia. Who is tweeting on Twitter: Human, bot,or cyborg? Proceedings of the 26th Annual Computer Security Applications Conference. Asso-ciation for Computing Machinery, New York, 21–30, 2010.

[13] M. Deutsch. Cooperation and trust: Some theoretical notes. Nebraska Symposium on Motiva-tion. University of Nebraska Press, Lincoln, 1962.

[14] L. Fang and K. LeFevre. Privacy wizards for social networking sites. Proceedings of the 19thInternational Conference on World Wide Web. Association for Computing Machinery, NewYork, 351–360, 2010.

[15] H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media fordisaster relief. IEEE Intelligent Systems 26(3):10–14, 2011.

[16] H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social net-works. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media.Association for the Advancement of Artificial Intelligence, Palo Alto, CA, 2012.

[17] H. Gao, X. Wang, G. Barbier, and H. Liu. Promoting coordination for disaster relief: Fromcrowdsourcing to coordination. Proceedings of the 4th Conference on Social Computing,Behavioral-Cultural Modeling and Prediction. Springer-Verlag, Berlin, 197–204, 2011.

[18] J. Golbeck and J. Hendler. Inferring binary trust relationships in web-based social networks.ACM Transactions on Internet Technology 6(4):497–529, 2006.

[19] R. Goolsby. Social media as crisis platform: The future of community maps/crisis maps. ACMTransactions on Intelligent Systems and Technology 1(1):1–11, 2010.

[20] T. Grandison and M. Sloman. A survey of trust in Internet applications. IEEE CommunicationsSurveys & Tutorials 3(4):2–16, 2009.

[21] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology83(6):1420–1443, 1978.

[22] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. Pro-ceedings of the 2005 ACM Workshop on Privacy in the Electronic Society. Association forComputing Machinery, New York, 71–80, 2005.

[23] P. Gundecha, G. Barbier, and H. Liu. Exploiting vulnerability to secure user privacy on a socialnetworking site. The 17th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining. Association for Computing Machinery, New York, 2011.

Page 16: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief Introduction16 Tutorials in Operations Research, c© 2012 INFORMS

[24] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann,San Francisco, 2011.

[25] M. Hu and B. Liu. Mining and summarizing customer reviews. Proceedings of the Tenth ACMSIGKDD International Conference on Knowledge Discovery and Data Mining. Association forComputing Machinery, New York, 168–177, 2004.

[26] N. Jindal and B. Liu. Identifying comparative sentences in text documents. Proceedings of the29th Annual International ACM SIGIR Conference on Research and Development in Informa-tion Retrieval. Association for Computing Machinery, New York, 244–251, 2006.

[27] N. Jindal and B. Liu. Opinion spam and analysis. Proceedings of the International Conferenceon Web Search and Web Data Mining. Association for Computing Machinery, New York,219–230, 2008.

[28] A. M. Kaplan and M. Haenlein. Users of the world, unite! The challenges and opportunitiesof social media. Business Horizons 53(1):59–68, 2010.

[29] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a socialnetwork. Proceedings of the Ninth ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining. Association for Computing Machinery, New York, 137–146, 2003.

[30] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and collaborative recom-mendation. Proceedings of the 32nd International ACM SIGIR Conference on Research andDevelopment in Information Retrieval. Association for Computing Machinery, New York,195–202, 2009.

[31] B. Krishnamurthy and C. E. Wills. On the leakage of personally identifiable information viaonline social networks. ACM SIGCOMM Computer Communication Review 40(1):112–117,2010.

[32] S. Kumar, R. Zafarani, and H. Liu. Understanding user migration patterns across social media.Twenty-Fifth International Conference on Artificial Intelligence. Association for the Advance-ment of Artificial Intelligence, Palo Alto, CA, 2011.

[33] T. La Fond and J. Neville. Randomization tests for distinguishing social influence andhomophily effects. Proceedings of the 19th International Conference on World Wide Web. Asso-ciation for Computing Machinery, New York, 601–610, 2010.

[34] P. F. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and method-ological analysis. Freedom and Control in Modern Society 18:18–66, 1954.

[35] B. Liu. Sentiment analysis and subjectivity. Handbook of Natural Language Processing. CRCPress, Boca Raton, FL, 627–666, 2010.

[36] H. Liu and P. Maes. InterestMap: Harvesting social network profiles for recommendations.Workshop: Beyond Personalization, San Diego, 2005.

[37] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, BocaRaton, FL, 2008.

[38] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regular-ization. Proceedings of the Fourth ACM International Conference on Web Search and DataMining. Association for Computing Machinery, New York, 287–296, 2011.

[39] M. W. Macy. Chains of cooperation: Threshold effects in collective action. American Sociolog-ical Review 56(6):730–747, 1991.

[40] V. Mahajan, E. Muller, and F. M. Bass. New product diffusion models in marketing: A reviewand directions for research. Journal of Marketing 54(1):1–26, 1990.

[41] S. P. Marsh. Formalising trust as a computational concept. Ph.D. thesis, Deptartment ofComputing Science and Mathematics, University of Stirling, Stirling, UK, 1994.

[42] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in socialnetworks. Annual Review of Sociology 27:415–444, 2001.

[43] S. T. Moturu and H. Liu. Quantifying the trustworthiness of social media content. Distributedand Parallel Databases 29(3):239–260, 2011.

[44] L. Mui, M. Mohtashemi, and A. Halberstadt. A computational model of trust and reputa-tion for E-businesses. Proceedings of the 35th Annual Hawaii Conference on System Sciences(HICSS’02). IEEE Computer Society, Washington, DC, 2431–2439, 2002.

Page 17: Mining Social Media: A Brief Introduction€¦ · Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 fpritam@asu.edu,huan.liu@asu.edug

Gundecha and Liu: Mining Social Media: A Brief IntroductionTutorials in Operations Research, c© 2012 INFORMS 17

[45] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. Proceed-ings of the 2008 IEEE Symposium on Security and Privacy. IEEE Computer Society, Wash-ington, DC, 111-125, 2008.

[46] A. Narayanan and V. Shmatikov. De-anonymizing social networks. Proceedings of the 2009IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC,173–187, 2009.

[47] D. Olmedilla, O. Rana, B. Matthews, and W. Nejdl. Security and trust issues in semanticgrids. Proceedings of the Dagsthul Seminar, Semantic Grid: The Convergence of Technologies5271:896–902, 2005.

[48] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trendsr inInformation Retrieval 2(1–2):1–135, 2008.

[49] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machinelearning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in NaturalLanguage Processing, Vol. 10. Association for Computational Linguistics, Stroudsburg, PA,79–86, 2002.

[50] A. M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. Pro-ceedings of the Conference on Human Language Technology and Empirical Methods in NaturalLanguage Processing. Association for Computational Linguistics, Stroudsburg, PA, 339–346,2005.

[51] E. Riloff and J. Wiebe. Learning extraction patterns for subjective expressions. Proceedings ofthe 2003 Conference on Empirical Methods in Natural Language Processing. Association forComputational Linguistics, Stroudsburg, PA, 105–112, 2003.

[52] D. Rosenblum. What anyone can know: The privacy risks of social networking sites. IEEESecurity and Privacy 5(3):40–49, 2007.

[53] A. C. Squicciarini, M. Shehab, and F. Paci. Collective privacy management in social networks.Proceedings of the 18th International Conference on World Wide Web. Association for Com-puting Machinery, New York, 521–530, 2009.

[54] P. Sztompka. Trust: A Sociological Theory. Cambridge University Press, Cambridge, UK, 1999.

[55] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison Wesley,Boston, 2006.

[56] L. Tang and H. Liu. Community Detection and Mining in Social Media, Vol. 2. Morgan &Claypool Publishers, San Rafael, CA, 2010.

[57] J. Tang, H. Gao, and H. Liu. mTrust: Discerning multi-faceted trust in a connected world.Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. Asso-ciation for Computing Machinery, New York, 93–102, 2012.

[58] P. D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised clas-sification of reviews. Proceedings of the 40th Annual Meeting on Association for ComputationalLinguistics. Association for Computational Linguistics, Stroudsburg, PA, 417–424, 2002.

[59] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools andTechniques. Morgan Kaufmann, San Francisco, 2011.

[60] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack to de-anonymize socialnetwork users. Proceedings of the 2010 IEEE Symposium on Security and Privacy. IEEE Com-puter Society, Washington, DC, 223–238, 2010.

[61] S. Yardi, D. Romero, G. Schoenebeck, and D. Boyd. Detecting spam in a Twitter network.First Monday 15(1):1–4, 2009.

[62] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts fromopinions and identifying the polarity of opinion sentences. Proceedings of the 2003 Confer-ence on Empirical Methods in Natural Language Processing. Association for ComputationalLinguistics, Stroudsburg, PA, 129–136, 2003.

[63] Z. A. Zhao and H. Liu. Spectral Feature Selection for Data Mining. Chapman & Hall/CRCPress, Virginia Beach, VA, 2012.

[64] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networkswith mixed public and private user profiles. Proceedings of the 18th International Conferenceon World Wide Web. Association for Computing Machinery, New York, 531–540, 2009.