the journey of my research - pennsylvania state university · ngot bui, john yen, and vasant...

The Journey of My Research

IST Brown Bag Seminar

John Yen

March 25, 2015

3/25/2015 @John Yen 1

Collaborators Cancer Informatics Initiative American Cancer Society

Prasenjit Mitra Kenneth Portier Vasant Honavar Greta Greer David Reitter Michelle Newman Cyber Security Heath Mackley Peng Liu Ngot Bui Cheng Zhong Yafei Wang Gaoyao Xiao Prakhar Biyani Ping Li Mo Yu Nanying Zhang Cornelia Caragea (U. N. Texas) Frank Ritter Kang Zhao (U. Iowa) Gerry Santoro Lior Rokach (BGU) Army Research Lab

Nir Ofek (BGU) Robert Erbacher Baojun Qiu (Straits) Steve Hutchinson Siddhartha Banerjee Christopher Garneau Yang Xu Hasan Cam William Glodek

@John Yen 2 3/25/2015

My childhood passion is to become an Electrical Engineer

• 1960’s-1970’s, Hsinchu, Taiwan

My Dad, an Electrical Engineer

@John Yen 3 3/25/2015

The Beginning of my Journey toward a “Scientist”

• 2010, American Cancer Society (ACS), Atlanta, GA • Collaborated on a proposal regarding network analysis of

the Cancer Survivors Network (CSN) operated by ACS • Visited ACS with Prasenjit • Met Greta Greer and Kenneth Portier regarding possible

collaborations

• Top questions ACS is interested in – How to better understand the benefits of CSN members? – How to identify influential users?

@John Yen 4 3/25/2015

These two questions turned out to be important for online health community (OHC)

80% use Internet for health-related purposes

Adult Internet Users in the U.S. (Pew)

34% reads about health-related experiences or comments from others

@John Yen 5 3/25/2015

Previous Research on Benefits of OHC

• Through surveys revealed that members

– Received more social support (Dunkel-Schetter 1984, Rodgers and Chen, 2005)

– Reduced the level of stress, depression, and psychological trauma (Beaudoin and Tao 2008, Winzelberg et al, 2003)

– Became more optimistic about disease (Rodgers and Chen, 2005)

• Q1-1: Can we use computational text analysis to capture the changes of their mood over time?

@John Yen 8 3/25/2015

Computational Analysis of Emotion • Bag-of-Word approach

– Occurrence of words in a word list associated with positive/negative emotion

– LIWC (Linguistic Inquiry and Word Count) (J. W. Pennebaker)

– Occurrence of words in a word list associated with strength of positive/negative emotion

• Part-of-Speech (POS) – Syntactic patterns of sentences – Useful also to extract what the sentiment is about

• Unsupervised Learning (Liu, 2010) • Supervised Learning facilitates “context” to be

incorporated into these models.

Michelle G. Newman, John Yen, Prasenjit Mitra, William Murphy, Nicholas Jacobson, Hanjoo Kim, "Supervised Machine Learning Classification of Journals Entries About Emotional Personal Experiences". in Proceedings of AMIA 2013.

@John Yen 9 3/25/2015

http://knowledge.amia.org/amia-55142-a2013e-1.580047/t-03-1.584514/f-003-1.584515/a-380-1.584588/a-382-1.584583?qr=1

http://knowledge.amia.org/amia-55142-a2013e-1.580047/t-03-1.584514/f-003-1.584515/a-380-1.584588/a-382-1.584583?qr=1

Sentiment Classifier Using Supervised Machine Learning

• Model training – Manually labeled 298 posts

(-5 to 5) – Use the labeled posts to train a

sentiment classifier – Feature extraction – If the output of the sentiment

classifier is greater than 0.5, the post is Positive; otherwise, the post is Negative

– Evaluate the model using un-trained labeled posts (cross validation)

• Model prediction – Apply the model to label the

sentiment of unlabeled posts of the entire community

CSN Forum Posts

Labeled Posts

Non-Labeled

Posts

Sentiment Model

Model Selection

Feature Extraction

(Post, Pr, Label)

@John Yen 10 3/25/2015

Features Features from modified sentiment word-lists

Novel features introduced through insights

@John Yen 12 3/25/2015

Which sentiment model to choose?

• 10-fold cross-validation

• AdaBoost-based sentiment model provided the best result

@John Yen 14 3/25/2015

What are the key features used in the model?

@John Yen 15 3/25/2015

Q1: How to quantify the benefits of CSN members?

Initial Post (P0)

Others’ Reply (P1)

Self-Reply (P0

1)

…

Self-Reply (P0

n0)

Others’ Reply (Pn1)

(P0, 0.3, Negative)

(P’1, 0.6, Positive)

(P1, 0.4, Negative)

(P’n1, 0.7, Positive)

(Pn0, 0.9, Positive)

A Discussion Thread

Sentiment Change Indicator: The difference between the average sentiment of the originators’ self replies and the initial sentiment of the thread originator

Avg

Difference

Sentiment Change Index

+

- A

A

A

Q1-2: How to quantify the sentiment changes of those who started a thread?

@John Yen 16 3/25/2015

75% of thread originators with negative initial posts became more

positive later on the thread

Baojun Qiu, Kang Zhao, Prasenjit Mitra, Dinghao Wu, Cornelia Caragea, John Yen, Greta E Greer, Kenneth Portier. "Get Online Support, Feel Better-Sentiment Analysis and Dynamics in an Online Cancer Survivor Community." In: Proceedings of the Third IEEE International Conference on Social Computing, Boston, Massachusetts, USA, 2011

@John Yen 17 3/25/2015

http://cani.ist.psu.edu/publication/CSN_SocialCom2011.pdf





Q1-3: How do those sentiment changes of thread initiator progress over time on a thread?

• The Hypothesis: The sentiment changes gradually (i.e., the more replies they received, the more likely they become positive).

Initial Post

1st Self-reply

2nd Self-reply

3rd Self-reply

@John Yen 18 3/25/2015

The “Aha” moment of a Scientist!

• The sentiment change mostly occurs at the first self-reply.

• After the first self-reply, the sentiment of the thread originator, on the average, does not change much.

@John Yen 20 3/25/2015

Q1: How to better understand the benefits of CSN users?

• Supervised learning-based sentiment classification of threaded discussions over 10 years indicated that – A1-1: 75% of those who started a thread with negative

sentiment became more positive later on the thread

– A1-2: Surprise! Most of the sentiment changes of thread originators occur early on the thread.

• Can we use this finding to help answer the 2nd question (finding influential users)?

@John Yen 22 3/25/2015

“Influential” Reply Posts are defined as

1. Reply posts that were “early” on the threads (i.e., before the first self-reply).

2. Reply posts whose sentiment is aligned with the direction of the sentiment change of the thread initiator

• Design a new “centrality” measure: The total number of influential replies posted by a user (IRR)

@John Yen 23 3/25/2015

IRR-based Influential User Ranking

Compare Top-K recalls

Metrics K=50 K=100 K=150

Total number of threads initiated 0.342 0.439 0.585

Total number of posts 0.415 0.707 0.781

In-degree centrality 0.317 0.512 0.610

Out-degree centrality 0.390 0.659 0.780

Betweenness 0.293 0.366 0.488

PageRank 0.390 0.561 0.732

IRR 0.511 0.732 0.805

• IRR is better than other centrality measures in predicting influential users

Kang Zhao, John Yen, Greta Greer, Baojun Qiu, Prasenjit Mitra, and Kenneth Portier, “Fidning influential users of online health communities: a new metric based on sentiment influence”, Journal of American Medical Informatics Association (JAMIA), Vol 21, 2014, e212-e218.

@John Yen 24 3/25/2015

http://faculty.ist.psu.edu/yen/publications/JAMIA2014ZhaoInfluentialUsersOHC.pdf





An Alternative Approach Using 60 Features

• Contribution features – The numbers of posts/threads – The length of posts – The time span of one’s activities – …

• Centrality features (not IRR) – A post-reply network among users – 27k+ nodes and 163k+ edges – In/out-degree, betweenness, PageRank

• Semantic features – Appearance of words with positive/negative

sentiment in a user’s posts – Use of slang and emoticons – …

@John Yen 26 3/25/2015

Compare the IRR-based approach with the alternative approach using 60+ features

• The IRR ranking (based on a single metric) outperforms the best classifier that uses 60+ metrics (ensemble classifier)

• Incorporating IRR into the classifier further improves the classifier’s performance. Top-K recalls and

precisions K=50 K=100 K=150

Recall

(max

=0.397)

Prec.

Recall

(max

=0.794)

Prec. Recall

Prec.

(max=

0.840)

The IRR Ranking 0.349 0.880 0.627 0.790 0.762 0.640

The ensemble classifier

without IRR 0.278 0.700 0.532 0.670 0.698 0.587

The new ensemble

classifier with IRR 0.373 0.940 0.579 0.730 1.000 0.840

@John Yen 30 3/25/2015

Apple Watch

Another game changer?

@John Yen 35 3/25/2015

Summary

• RQ1: How to better understand the benefits of CSN users?

• Supervised learning-based sentiment classification of threaded discussions over 10 years indicated that – 75% of those who started a thread with negative sentiment

became more positive later on the thread

– Surprise! Most of the sentiment changes occur early on the thread.

• RQ2: How to find influential users?

• Early replies whose sentiment (+/-) are the same as the sentiment change (+/-) of the thread initiator is an excellent predictor of influential users.

@John Yen 36 3/25/2015

Prima Facie Cause • A temporal causality analysis in probabilistic

computation tree logic (PCTL).

• The state C is a Prima Facie Cause of E if there is a p such that 1. C becomes true (eventually) with a non-zero

probability

2. There exists a probability p such that reaching E after C is more likely (>p) than reaching E (through any possible paths) (<p)

Ngot Bui, John Yen, and Vasant Honavar, "Temporal Causality of Social Support in an Online Community for Cancer Survivors", in Proceedings of 2015 International Social Computing, Behavioral Modeling and Prediction Conference (SBP15), 2015.

@John Yen 37 3/25/2015

http://faculty.ist.psu.edu/yen/publications/sbp2015.pdf


How to analyze temporal causality between replies and final sentiment?

Ngot Bui, John Yen, and Vasant Honavar, "Temporal Causality of Social Support in an Online Community for Cancer Survivors", in Proceedings of 2015 International Social Computing, Behavioral Modeling and Prediction Conference (SBP15), 2015.

o/ - o s / - s f / not f

r / not r r / not r

time

Sentiment of Initial Post

Sentiment of final Post

Avg Sentiment of reply posts

Avg Sentiment of reply posts

Sentiment of Self-reply Post

@John Yen 38 3/25/2015



Prima Facie Causes for Final Sentiment of Thread Originators

• The positive-sentiment-replies are prima facie causes of the final positive sentiment

• The negative-sentiment-replies are also prima facie causes of the final negative sentiment.

• For colorectal cancer community, the negative-sentiment-self-reply is also prima facie causes of the final negative sentiment.

@John Yen 39 3/25/2015

“I am a computational social scientist?”

• Focus:

– What causes the change of emotion state of a cancer patient in an OHC?

• Impacts:

– Leverage this understanding to improve the supports for cancer patients

• “Reward” to self:

– A higher degree of satisfaction

– A broader implications (to other health conditions)

@John Yen 40 3/25/2015

The critical point of another journey

• 2013 Summer, Adelphi, Maryland

• Goal: Conduct a study regarding cyber analysts for a MURI (PI: Peng Liu)

• Realized the threat of cyber attack is much more serious than I originally thought

• Impressed by smart people who dedicated their life and wisdom to defend the cyber space.

@John Yen 41 3/25/2015

A “Scientist” Passion for the Human Dimension of Cyber Science

• How can we better understand the fine-grained cognitive processes of analysts who

– need to process a large influx of network monitoring data/alerts at a rapid pace

– need to “connect the dots” with missing links using their experience, knowledge, and skills?

@John Yen 42 3/25/2015

An “Engineer” with a Passion for Human Dimension of Cyber Science

• Built a non-intrusive way to capture “cognitive traces”

C. Zhong, J. Yen, P. Liu, R. Erbacher, R. Etoty, and C. Garneau, “ARSCA: A Computer Tool for Tracing the Cognitive Processes of Cyber-Attack Analysis”, ”, in Proceedings of the 5th IEEE International Interdisciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), Orlando, March 9-12, 2015.

@John Yen 43 3/25/2015

The AOH Objects and Their Relationships in An Analyst’s Cognitive Trace

Root

Alternative Hypotheses

@John Yen 44 3/25/2015

Surprise: High-performance analysts performed more filtering

15

10

5

543

10

5

0

16

8

0

543

2

1

0

3.0

1.5

0.0

15

10

5

15

10

5

200

100

0

543

20

10

0

10

5

0

543

10

5

0

HOP_New

Performance Score

HOP_Add_Sibling HOP_SwitchContext HOP_Modify

HOP_Confirm/Deny AOP_Selecting OOP_Selected AOP_Searching

AOP_Filtering AOP_Inquring OOP_Linking

@John Yen 45 3/25/2015

How to better understand the cognitive behaviors of analysts?

• Represent each cognitive trace as a heterogeneous network

– Node: A filtering operation on an attribute (e.g., port)

– Link: Relationship between two filtering operations

• Subsumption: one filtering is more specific than another

(“port=4446 subsumes” “port=4446 and SourceIP= ….”)

• Complementary: one filtering complements the other

• Equal To: identical filtering

• Corresponding To: same condition, different data source

@John Yen 46 3/25/2015

Nodes (Filtering) Ordered by time around the circle.

Edges (Relationship from a filtering to its preceding activities) • Orange:

Complementary • Red: Equal to • Blue: Subsumed

by • Green:

Corresponding to

The Information Foraging Network of An Analyst

@John Yen 47 3/25/2015

Information Foraging Network of Another Analysts

Both analysts have high performance score.

Their filtering networks reveal different information foraging strategies.

@John Yen 48 3/25/2015

Figure 4. Role of working memory in explicit learning.

Yang J, Li P (2012) Brain Networks of Explicit and Implicit Learning. PLoS ONE 7(8): e42993. doi:10.1371/journal.pone.0042993

http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0042993

http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0042993

Am I an Engineer or a Scientist?

• Engineer Passion: Build better information value-added tools X

• Scientist Passion: Better understand the mechanism of Y

• The journey may lead to surprises (may be another “Aha” moment or …

Discover Y

Build X

Motivates

Enables

@John Yen 50 3/25/2015

The Next Frontier …

@John Yen 51 3/25/2015

A late celebration of National Puppy Day

@John Yen 52 3/25/2015

the journey of my research - pennsylvania state university · ngot bui, john yen, and vasant...

Documents