the journey of my research - pennsylvania state university · ngot bui, john yen, and vasant...
The Journey of My Research
IST Brown Bag Seminar
John Yen
March 25, 2015
3/25/2015 @John Yen 1
Collaborators
Cancer Informatics Initiative: Prasenjit Mitra, Vasant Honavar, David Reitter, Michelle Newman, Heath Mackley, Ngot Bui, Yafei Wang, Prakhar Biyani, Mo Yu, Cornelia Caragea (U. N. Texas), Kang Zhao (U. Iowa), Lior Rokach (BGU), Nir Ofek (BGU), Baojun Qiu (Straits), Siddhartha Banerjee, Yang Xu
American Cancer Society: Kenneth Portier, Greta Greer
Cyber Security: Peng Liu, Cheng Zhong, Gaoyao Xiao, Ping Li, Nanying Zhang, Frank Ritter, Gerry Santoro
Army Research Lab: Robert Erbacher, Steve Hutchinson, Christopher Garneau, Hasan Cam, William Glodek
My childhood passion was to become an Electrical Engineer
• 1960’s-1970’s, Hsinchu, Taiwan
My Dad, an Electrical Engineer
The Beginning of my Journey toward a “Scientist”
• 2010, American Cancer Society (ACS), Atlanta, GA
• Collaborated on a proposal regarding network analysis of the Cancer Survivors Network (CSN) operated by ACS
• Visited ACS with Prasenjit
• Met Greta Greer and Kenneth Portier regarding possible collaborations
• Top questions ACS is interested in
– How to better understand the benefits of CSN members?
– How to identify influential users?
These two questions turned out to be important for online health communities (OHCs)
Adult Internet users in the U.S. (Pew):
• 80% use the Internet for health-related purposes
• 34% read about health-related experiences or comments from others
Previous Research on Benefits of OHC
• Surveys revealed that members
– Received more social support (Dunkel-Schetter 1984, Rodgers and Chen, 2005)
– Experienced lower levels of stress, depression, and psychological trauma (Beaudoin and Tao 2008, Winzelberg et al, 2003)
– Became more optimistic about disease (Rodgers and Chen, 2005)
• Q1-1: Can we use computational text analysis to capture changes in members’ mood over time?
Computational Analysis of Emotion
• Bag-of-Words approach
– Occurrence of words in a word list associated with positive/negative emotion
– LIWC (Linguistic Inquiry and Word Count) (J. W. Pennebaker)
– Occurrence of words in a word list associated with strength of positive/negative emotion
• Part-of-Speech (POS)
– Syntactic patterns of sentences
– Also useful for extracting what the sentiment is about
• Unsupervised Learning (Liu, 2010)
• Supervised Learning allows “context” to be incorporated into these models.
Michelle G. Newman, John Yen, Prasenjit Mitra, William Murphy, Nicholas Jacobson, Hanjoo Kim, "Supervised Machine Learning Classification of Journals Entries About Emotional Personal Experiences", in Proceedings of AMIA 2013.
Sentiment Classifier Using Supervised Machine Learning
• Model training
– Manually labeled 298 posts (on a -5 to 5 scale)
– Use the labeled posts to train a sentiment classifier
– Feature extraction
– If the output of the sentiment classifier is greater than 0.5, the post is Positive; otherwise, the post is Negative
– Evaluate the model on labeled posts held out from training (cross-validation)
• Model prediction
– Apply the model to label the sentiment of unlabeled posts of the entire community
[Diagram labels: CSN Forum Posts; Labeled Posts; Non-Labeled Posts; Feature Extraction; Model Selection; Sentiment Model; output (Post, Pr, Label)]
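The labeling rule on this slide can be sketched as follows. This is a minimal illustration, assuming the trained classifier yields a score between 0 and 1 for each post; the function names and the sample scores are hypothetical, and only the 0.5 threshold and the (Post, Pr, Label) output shape come from the slide.

```python
from typing import List, Tuple

THRESHOLD = 0.5  # from the slide: an output greater than 0.5 means Positive


def label_post(post_id: str, score: float) -> Tuple[str, float, str]:
    """Turn a classifier score into a (Post, Pr, Label) triple."""
    label = "Positive" if score > THRESHOLD else "Negative"
    return (post_id, score, label)


def label_posts(scored: List[Tuple[str, float]]) -> List[Tuple[str, float, str]]:
    """Label every (post_id, score) pair in a list."""
    return [label_post(post_id, score) for post_id, score in scored]


# Hypothetical scores, for illustration only.
triples = label_posts([("P0", 0.3), ("P1", 0.6), ("P2", 0.5)])
```

Note that a score of exactly 0.5 falls on the Negative side, because the rule requires the output to be strictly greater than 0.5.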
Features
• Features from modified sentiment word-lists
• Novel features introduced through insights
Which sentiment model to choose?
• 10-fold cross-validation
• AdaBoost-based sentiment model provided the best result
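For readers unfamiliar with 10-fold cross-validation, the splitting step might look like the sketch below. It only shows how labeled posts could be partitioned so that each one is held out exactly once; the round-robin assignment and the function name are assumptions, since the talk does not say how the folds were built.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Items are assigned to folds round-robin; shuffle the data first if
    it has any meaningful ordering.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# 298 is the number of manually labeled posts mentioned in the talk.
folds = list(k_fold_indices(298, k=10))
```

A candidate model would be trained on each `train` list and scored on the matching `test` list, and the ten scores averaged to compare models.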
What are the key features used in the model?
Q1: How to quantify the benefits of CSN members?
Q1-2: How to quantify the sentiment changes of those who started a thread?
Sentiment Change Indicator: The difference between the average sentiment of the originator’s self-replies and the sentiment of the originator’s initial post.
[Diagram: A Discussion Thread. Posts in order: Initial Post (P0), Others’ Reply (P1), Self-Reply 1, …, Self-Reply n0, Others’ Reply (Pn1). The sentiment model scores each post as (post, probability, label): (P0, 0.3, Negative), (P’1, 0.6, Positive), (P1, 0.4, Negative), (P’n1, 0.7, Positive), (Pn0, 0.9, Positive). The self-reply scores are averaged (Avg) and compared with the initial post’s score (Difference) to give the Sentiment Change Index.]
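The Sentiment Change Indicator defined on this slide reduces to a one-line computation. The sketch below assumes each post already has a sentiment score between 0 and 1 from the classifier; the function name and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def sentiment_change(initial: float, self_replies: Sequence[float]) -> float:
    """Average self-reply sentiment minus the initial post's sentiment."""
    if not self_replies:
        raise ValueError("the originator never replied in this thread")
    return mean(self_replies) - initial


# Hypothetical scores: a negative opening post followed by two self-replies.
change = sentiment_change(0.3, [0.6, 0.9])
```

A positive value means the originator's later posts were more positive than the opening post; a negative value means the opposite.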
75% of thread originators with negative initial posts became more positive later in the thread
Baojun Qiu, Kang Zhao, Prasenjit Mitra, Dinghao Wu, Cornelia Caragea, John Yen, Greta E Greer, Kenneth Portier. "Get Online Support, Feel Better-Sentiment Analysis and Dynamics in an Online Cancer Survivor Community." In: Proceedings of the Third IEEE International Conference on Social Computing, Boston, Massachusetts, USA, 2011
Q1-3: How do the sentiment changes of a thread initiator progress over the course of a thread?
• The Hypothesis: The sentiment changes gradually (i.e., the more replies initiators receive, the more likely they are to become positive).
[Figure: originator sentiment at the Initial Post, 1st Self-reply, 2nd Self-reply, 3rd Self-reply]
The “Aha” moment of a Scientist!
• The sentiment change mostly occurs at the first self-reply.
• After the first self-reply, the sentiment of the thread originator does not, on average, change much.
Q1: How to better understand the benefits of CSN users?
• Supervised learning-based sentiment classification of threaded discussions over 10 years indicated that
– A1-1: 75% of those who started a thread with negative sentiment became more positive later in the thread
– A1-2: Surprise! Most of the sentiment changes of thread originators occur early in the thread.
• Can we use this finding to help answer the 2nd question (finding influential users)?
“Influential” Reply Posts are defined as
1. Reply posts that were “early” on the threads (i.e., before the first self-reply).
2. Reply posts whose sentiment is aligned with the direction of the sentiment change of the thread initiator
• Design a new “centrality” measure: The total number of influential replies posted by a user (IRR)
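The two criteria above can be turned into a counting procedure along these lines. This is a sketch under stated assumptions, not the published algorithm: a reply counts as positive when its score exceeds 0.5, the direction of the initiator's change is taken as the sign of (average self-reply score minus initial score), and the thread data and user names are made up.

```python
from collections import Counter
from statistics import mean
from typing import List, Tuple

Post = Tuple[str, float]  # (author, sentiment score between 0 and 1)


def influential_reply_counts(threads: List[List[Post]]) -> Counter:
    """Count, per user, early replies aligned with the initiator's change."""
    counts: Counter = Counter()
    for posts in threads:
        if not posts:
            continue
        originator, initial = posts[0]
        replies = posts[1:]
        self_scores = [score for author, score in replies if author == originator]
        if not self_scores:
            continue  # no self-reply, so no observable sentiment change
        change = mean(self_scores) - initial
        if change == 0:
            continue  # no direction to align with
        for author, score in replies:
            if author == originator:
                break  # only replies before the first self-reply are "early"
            reply_is_positive = score > 0.5  # assumed cutoff
            if reply_is_positive == (change > 0):
                counts[author] += 1
    return counts


# Hypothetical threads; each list is ordered by posting time.
threads = [
    [("ann", 0.2), ("bob", 0.8), ("cat", 0.3), ("ann", 0.7), ("dan", 0.9)],
    [("eve", 0.4), ("bob", 0.9), ("eve", 0.8)],
]
irr = influential_reply_counts(threads)
```

Here `bob` is credited twice, `cat` is not credited because the reply's sentiment runs against the initiator's change, and `dan` is not credited because the reply arrived after the first self-reply.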
IRR-based Influential User Ranking
Comparison of Top-K recall
Metrics                             K=50    K=100   K=150
Total number of threads initiated   0.342   0.439   0.585
Total number of posts               0.415   0.707   0.781
In-degree centrality                0.317   0.512   0.610
Out-degree centrality               0.390   0.659   0.780
Betweenness                         0.293   0.366   0.488
PageRank                            0.390   0.561   0.732
IRR                                 0.511   0.732   0.805
• IRR achieves higher Top-K recall than every other metric at K=50, 100, and 150
Kang Zhao, John Yen, Greta Greer, Baojun Qiu, Prasenjit Mitra, and Kenneth Portier, “Finding influential users of online health communities: a new metric based on sentiment influence”, Journal of the American Medical Informatics Association (JAMIA), Vol 21, 2014, e212-e218.
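Top-K recall, and the matching precision, can be computed as in the sketch below. It assumes each user has a single ranking score (IRR, PageRank, and so on) and that a ground-truth set of influential users is available; the function name and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def top_k_recall_precision(
    scores: Dict[str, float], influential: Set[str], k: int
) -> Tuple[float, float]:
    """Return (recall, precision) of the k highest-scoring users."""
    if k <= 0 or not influential:
        raise ValueError("k must be positive and influential must be non-empty")
    ranked = sorted(scores, key=lambda user: scores[user], reverse=True)
    hits = len(set(ranked[:k]) & influential)
    return hits / len(influential), hits / k


# Hypothetical metric values and ground truth, for illustration only.
scores = {"a": 5.0, "b": 3.0, "c": 2.0, "d": 1.0}
recall, precision = top_k_recall_precision(scores, {"a", "c", "d"}, k=2)
```

With N influential users, recall at K can never exceed K/N and precision can never exceed N/K, so both measures have a ceiling that depends on K.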
An Alternative Approach Using 60+ Features
• Contribution features
– The numbers of posts/threads
– The length of posts
– The time span of one’s activities
– …
• Centrality features (not IRR)
– A post-reply network among users
– 27k+ nodes and 163k+ edges
– In/out-degree, betweenness, PageRank
• Semantic features
– Appearance of words with positive/negative sentiment in a user’s posts
– Use of slang and emoticons
– …
Compare the IRR-based approach with the alternative approach using 60+ features
• The IRR ranking (based on a single metric) outperforms the best classifier that uses 60+ metrics (ensemble classifier)
• Incorporating IRR into the classifier further improves the classifier’s performance.
Top-K recalls and precisions
                                      K=50                      K=100                     K=150
                                      Recall       Prec.        Recall       Prec.        Recall       Prec.
                                      (max=0.397)               (max=0.794)                            (max=0.840)
The IRR ranking                       0.349        0.880        0.627        0.790        0.762        0.640
The ensemble classifier without IRR   0.278        0.700        0.532        0.670        0.698        0.587
The new ensemble classifier with IRR  0.373        0.940        0.579        0.730        1.000        0.840
Apple Watch
Another game changer?
Summary
• RQ1: How to better understand the benefits of CSN users?
• Supervised learning-based sentiment classification of threaded discussions over 10 years indicated that
– 75% of those who started a thread with negative sentiment became more positive later in the thread
– Surprise! Most of the sentiment changes occur early in the thread.
• RQ2: How to find influential users?
• The number of early replies whose sentiment (+/-) matches the direction of the thread initiator’s sentiment change (+/-) is an excellent predictor of influential users.
Prima Facie Cause
• A temporal causality analysis in probabilistic computation tree logic (PCTL).
• A state C is a prima facie cause of E if there is a probability p such that
1. C eventually becomes true with a non-zero probability
2. Reaching E after C is more likely (>p) than reaching E through any possible path (<p)
Ngot Bui, John Yen, and Vasant Honavar, "Temporal Causality of Social Support in an Online Community for Cancer Survivors", in Proceedings of 2015 International Social Computing, Behavioral Modeling and Prediction Conference (SBP15), 2015.
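The two conditions can be approximated from observed threads by comparing relative frequencies. The sketch below is a simplification and not the cited paper's method: it ignores PCTL time windows and model checking, and it treats each thread as a single observation of whether C occurred and whether E was reached. The function name and the sample data are hypothetical.

```python
from typing import List, Tuple


def is_prima_facie_cause(observations: List[Tuple[bool, bool]]) -> bool:
    """Frequency-based check of the two prima facie conditions.

    Each observation is (c_occurred, e_reached) for one thread; when C
    occurred, e_reached should mean that E was reached after C.
    """
    if not observations:
        return False
    e_after_c = [e for c, e in observations if c]
    if not e_after_c:
        return False  # condition 1 fails: C never becomes true
    p_e_given_c = sum(e_after_c) / len(e_after_c)
    p_e = sum(e for _, e in observations) / len(observations)
    # Any p strictly between p_e and p_e_given_c satisfies condition 2.
    return p_e_given_c > p_e


# Hypothetical threads: C = "positive replies", E = "final post is positive".
observations = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]
result = is_prima_facie_cause(observations)
```

In this toy data E follows C in 2 of 3 threads but occurs in only 3 of 6 threads overall, so C passes the check. Passing it does not establish causation; a prima facie cause is only a candidate.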
How to analyze temporal causality between replies and final sentiment?
[Diagram: thread timeline (time axis). Sentiment of Initial Post (o / not o); Avg Sentiment of reply posts (r / not r); Sentiment of Self-reply Post (s / not s); Avg Sentiment of reply posts (r / not r); Sentiment of final Post (f / not f).]
Prima Facie Causes for Final Sentiment of Thread Originators
• The positive-sentiment replies are prima facie causes of the final positive sentiment.
• The negative-sentiment replies are also prima facie causes of the final negative sentiment.
• For the colorectal cancer community, the negative-sentiment self-reply is also a prima facie cause of the final negative sentiment.
“I am a computational social scientist?”
• Focus:
– What causes the change in the emotional state of a cancer patient in an OHC?
• Impacts:
– Leverage this understanding to improve support for cancer patients
• “Reward” to self:
– A higher degree of satisfaction
– Broader implications (for other health conditions)
The critical point of another journey
• Summer 2013, Adelphi, Maryland
• Goal: Conduct a study regarding cyber analysts for a MURI (PI: Peng Liu)
• Realized the threat of cyber attacks is much more serious than I originally thought
• Impressed by smart people who dedicate their lives and wisdom to defending cyberspace.
A “Scientist” with a Passion for the Human Dimension of Cyber Science
• How can we better understand the fine-grained cognitive processes of analysts who
– need to process a large influx of network monitoring data/alerts at a rapid pace
– need to “connect the dots” with missing links using their experience, knowledge, and skills?
An “Engineer” with a Passion for the Human Dimension of Cyber Science
• Built a non-intrusive way to capture “cognitive traces”
C. Zhong, J. Yen, P. Liu, R. Erbacher, R. Etoty, and C. Garneau, “ARSCA: A Computer Tool for Tracing the Cognitive Processes of Cyber-Attack Analysis”, in Proceedings of the 5th IEEE International Interdisciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), Orlando, March 9-12, 2015.
The AOH Objects and Their Relationships in An Analyst’s Cognitive Trace
[Figure labels: Root; Alternative Hypotheses]
Surprise: High-performance analysts performed more filtering
[Figure: one panel per operation type, each plotted against Performance Score. Panels: HOP_New, HOP_Add_Sibling, HOP_SwitchContext, HOP_Modify, HOP_Confirm/Deny, AOP_Selecting, OOP_Selected, AOP_Searching, AOP_Filtering, AOP_Inquring, OOP_Linking]
How to better understand the cognitive behaviors of analysts?
• Represent each cognitive trace as a heterogeneous network
– Node: A filtering operation on an attribute (e.g., port)
– Link: Relationship between two filtering operations
• Subsumption: one filtering is more specific than another (“port=4446” subsumes “port=4446 and SourceIP= ….”)
• Complementary: one filtering complements the other
• Equal To: identical filtering
• Corresponding To: same condition, different data source
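One way to picture these link types is to treat each filtering operation as a set of conditions plus the data source it was applied to. The sketch below is an illustration under assumptions, not the representation used by the tool: the class, the field names, and the data-source names are hypothetical, subsumption is only checked within one data source, and “Complementary” is left out because the slide does not define it precisely.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Filtering:
    conditions: FrozenSet[str]  # e.g. {"port=4446"}
    source: str                 # hypothetical data-source name


def relationship(a: Filtering, b: Filtering) -> Optional[str]:
    """Describe how filtering operation `a` relates to `b`, if at all."""
    if a.conditions == b.conditions:
        return "equal to" if a.source == b.source else "corresponding to"
    if a.source == b.source and a.conditions < b.conditions:
        return "subsumes"  # b adds conditions, so b is more specific
    if a.source == b.source and b.conditions < a.conditions:
        return "subsumed by"
    return None  # "complementary" is not modeled in this sketch


# The IP address is a placeholder; the slide elides the actual value.
broad = Filtering(frozenset({"port=4446"}), "alerts")
narrow = Filtering(frozenset({"port=4446", "SourceIP=192.0.2.1"}), "alerts")
other_source = Filtering(frozenset({"port=4446"}), "firewall_log")
```

Calling `relationship` on each ordered pair of an analyst's filtering operations would give the labeled edges of the network.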
The Information Foraging Network of An Analyst
Nodes (Filtering): ordered by time around the circle.
Edges (Relationship from a filtering to its preceding activities):
• Orange: Complementary
• Red: Equal to
• Blue: Subsumed by
• Green: Corresponding to
Information Foraging Network of Another Analyst
Both analysts have high performance scores.
Their filtering networks reveal different information foraging strategies.
Figure 4. Role of working memory in explicit learning.
Yang J, Li P (2012) Brain Networks of Explicit and Implicit Learning. PLoS ONE 7(8): e42993. doi:10.1371/journal.pone.0042993
Am I an Engineer or a Scientist?
• Engineer Passion: Build better information value-added tools X
• Scientist Passion: Better understand the mechanism of Y
• The journey may lead to surprises (maybe another “Aha” moment or …)
[Diagram: “Discover Y” and “Build X”, connected by links labeled “Motivates” and “Enables”]
The Next Frontier …
A late celebration of National Puppy Day