Computational Techniques for Public Health Surveillance
Scott H. Burton
Ph.D. Dissertation Proposal
Department of Computer Science
Brigham Young University
April 26, 2012
Overview
• Problem overview
• Research area overview
▫ Health research in social media
▫ Data mining
    ▫ Social network analysis
    ▫ Collective classification
    ▫ Text mining
• Dissertation proposal
Health is Important
• U.S. 2010 total health expenditures:
▫ $2.6 trillion (17.9% of GDP)
• Millions of lives affected each year
National Health Expenditures 2010 Highlights.
http://www.cms.gov/NationalHealthExpendData/downloads/highlights.pdf
Image: http://health-ins.us/
Public Health Surveillance
“Public health surveillance is the continuous, systematic collection, analysis and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice.”
– World Health Organization
• Epidemiology
• Health promotion
• Substance abuse prevention
• Public policy
World Health Organization
http://www.who.int/topics/public_health_surveillance/en/
Traditional Methods
• Health Department Labs
• Focus Groups
• Questionnaires
• Clinical Trials
Limitations of Traditional Methods
Traditional Methods
• Cost
• Delay
• Isolated individuals
• Reported vs. actual behavior
• Often small samples
Social Media Opportunities
Traditional Methods
• Cost
• Delay
• Isolated individuals
• Reported vs. actual behavior
• Often small samples
Online Social Media
• Inexpensive
• Real-time posting
• Near real-time analysis
• Relational data / social structures
• True feelings and behaviors
• Large samples
• Geo-located
• Reach under-represented countries and groups
Computational Health Science
“Developing computational techniques to build systems or applications to understand and influence individual health and measure relevant outcomes.”
Computer Science
Sociology
Health Science
The CHS Difference
• Community identification
• Data set size
• Relational classification
• Inductive models
• Text mining and automated analysis
Search Query Monitoring
• Influenza outbreak detection
Polgreen, P., Chen, Y., Pennock, D., Nelson, F., and Weinstein, R.
Using Internet Searches for Influenza Surveillance
Clinical Infectious Diseases, 47(11):1443-1448, 2008.
More Outbreak Detection
• Influenza outbreak detection (Ginsberg, et al.)
• 2009 H1N1 Influenza (Brownstein, et al.)
• Listeriosis (Wilson and Brownstein)
• Gastroenteritis and Chickenpox (Pelat, et al.)
Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., and Brilliant, L.
Detecting Influenza Epidemics using Search Engine Query Data.
Nature, 457(7232):1012-1014, 2008.
Brownstein, J. S., et al.
Information Technology and Global Surveillance of Cases of 2009 H1N1 Influenza
New England Journal of Medicine, 362(18):1731-1735, 2010.
Wilson, K. and Brownstein, J.
Early Detection of Disease Outbreaks using the Internet.
Canadian Medical Association Journal, 180(8):829, 2009.
Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., and Valleron, A.
More Diseases Tracked by using Google Trends.
Emerging Infectious Diseases, 15(8):1327, 2009.
Health on YouTube
• Immunizations (N=153) (Keelan, et al.)
• Tanning Bed Use (N=72) (Hossler and Conroy)
• Tobacco (N=50) (Freeman and Chapman)
• Stop Smoking (N=191) (Backinger, et al.)
Keelan, J., Pavri-Garcia, V., Tomlinson, G., and Wilson, K.
YouTube as a Source of Information on Immunization: A Content Analysis.
Journal of the American Medical Association, 298(21):2482, 2007.
Hossler, E. and Conroy, M.
YouTube as a Source of Information on Tanning Bed Use.
Archives of Dermatology, 144(10):1395-1396, 2008.
Freeman, B. and Chapman, S.
Is “YouTube” Telling or Selling you Something? Tobacco Content on the YouTube Video-sharing Website.
Tobacco Control, 16(3):207, 2007.
Backinger, C. L., Pilsner, A. M., Augustson, E. M., Frydl, A., Phillips, T., and Rowden, J.
YouTube as a Source of Quitting Smoking Information.
Tobacco Control, 20(2):119-122, 2011.
Health on Facebook
• General Non-Communicable Disease Groups (N=757)
▫ Farmer, et al.
• Diabetes Groups (N=15)
▫ Greene, et al.
• Ethical Issues (N=202)
▫ Moubarak, et al.
Greene, J., Choudhry, N., Kilabuk, E., and Shrank, W.
Online Social Networking by Patients with Diabetes: A Qualitative Evaluation of Communication with Facebook.
Journal of General Internal Medicine, 26:287-292, 2011.
Moubarak, G., Guiot, A., Benhamou, Y., Benhamou, A., and Hariri, S.
Facebook Activity of Residents and Fellows and its Impact on the Doctor-Patient Relationship.
Journal of Medical Ethics, 37(2):101-104, 2011.
Farmer, A. D., Bruckner Holt, C. E. M., Cook, M. J., and Hearing, S. D.
Social Networking Sites: A Novel Portal for Communication.
Postgraduate Medical Journal, 85:455-459, 2009.
Health on Blogs
• Health-related Blogs (N=951)
▫ Miller and Pole
• Breastfeeding and Blogging (32 blogs, 354 posts, 881 comments)
▫ West et al.
Miller, E. and Pole, A.
Diagnosis Blog: Checking up on Health Blogs in the Blogosphere.
American Journal of Public Health, 100(8):1514-1519, 2010.
West, J., Hall, P., Hanson, C., Thackeray, R., Barnes, M., Neiger, B., and McIntyre, E.
Breastfeeding and Blogging: Exploring the Utility of Blogs to Promote Breastfeeding.
American Journal of Health Education, 42(2):106-115, 2011.
Health on Twitter
• Dental Pain (N=772)
▫ Heaivilin, et al.
• Tobacco (N=5.9 million tweets, 5,000 tobacco-related)
▫ Prier, et al.
• Problem Drinking (N=5.5 million tweets, 21,000 alcohol-related)
▫ West et al.
Heaivilin, N., Gerbert, B., Page, J., and Gibbs, J.
Public Health Surveillance of Dental Pain via Twitter.
Journal of Dental Research, 90(9):1047-1051, 2011.
Prier, K. W., Smith, M. S., Giraud-Carrier, C., and Hanson, C. L.
Identifying Health-Related Topics on Twitter: An Exploration of Tobacco-related Tweets as a Test Topic.
In Proceedings of the 4th International Conference on Social Computing,
Behavioral-Cultural Modeling, and Prediction, pages 18-25. 2011.
West, J., Hall, P., Prier, K., Hanson, C., Giraud-Carrier, C., Neeley, S., Barnes, M.
Temporal Variability of Problem Drinking on Twitter
Open Journal of Preventive Medicine, 2(1):43-48. 2012.
Geo-Location in Twitter
• Pew Institute reports:
▫ 14% of users said they used automatic GPS tagging
• In our study, the data said:
▫ 2.0% of Tweets
▫ 2.7% of unique users
K. Zickuhr and A. Smith.
28% of American Adults Use Mobile and Social Location-based Services.
http://pewinternet.org/~/media//Files/Reports/2011/PIP_Locationbased-services.pdf, 2011.
Burton, S. H., Tanner, K. W., Giraud-Carrier, C. G., West, J. H., and Barnes, M. D.
Right Time, Right Place Health Communication in Twitter: How Good Is Location Information?
In Submission.
Tweets Around the World
Burton, S. H., Tanner, K. W., Giraud-Carrier, C. G., West, J. H., and Barnes, M. D.
Right Time, Right Place Health Communication in Twitter: How Good Is Location Information?
In Submission.
Data Mining
• “the process of discovering interesting and useful patterns and relationships in large volumes of data” – Christopher Clifton
• Algorithms
▫ Supervised
▫ Unsupervised
• Types of data
▫ Tabular
▫ Relational
▫ Text
Clifton, C.
Encyclopedia Britannica: Data Mining
http://www.britannica.com/EBchecked/topic/1056150/data-mining
Social Network Analysis
• Relational data
• Not just networks of “people”
Wasserman, S. and Faust, K.
Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
Scott, J.
Social Network Analysis: A Handbook. Sage Publications, Second Edition, 2000.
Community Mining
• “Dense subnetwork within a larger network”
Newman, M. E. J.
Communities, Modules and Large-scale Structure in Networks.
Nature Physics, 8:25-31. 2012
Community Mining Techniques
• Label Propagation
▫ Cordasco and Gargano
• Random Walks
▫ Rosvall and Bergstrom
• Rolling k-Cliques
▫ Palla et al.
Cordasco, G. and Gargano, L.
Community Detection via Semi-Synchronous Label Propagation Algorithms
IEEE International Workshop on Business Applications of Social Network Analysis, 2010
Rosvall, M. and Bergstrom, C. T.
Maps of Random Walks on Complex Networks Reveal Community Structure
Proceedings of the National Academy of Sciences 105(4):1118-1123. 2008
Palla, G., Derényi, I., Farkas, I., and Vicsek, T.
Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society
Nature, 435(7043):814-818, 2005.
Modularity
• Actual edges minus expected
• Undirected
• Requires complete graph
Newman, M. E. J. and Girvan, M.
Finding and evaluating community structure in networks.
Physical Review E, 69(2):026113, Feb 2004.
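The "actual edges minus expected" idea can be made concrete with a short sketch. The Python function below is an illustrative implementation of Newman and Girvan's modularity for an undirected, unweighted graph. The function name, the edge-list input format, and the toy graph are assumptions made here for illustration and do not come from the slides.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in community c
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for node, k in degree.items():
        degree_sum[community[node]] += k
    # Per community: actual fraction of edges inside it, minus the fraction
    # expected if edges were placed at random with the same node degrees.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "all" for n in range(6)}
print(modularity(toy_edges, two_groups))  # 5/14, about 0.357
print(modularity(toy_edges, one_group))   # 0.0
```

Note that the computation needs the degree of every node and the total edge count, which is why the slide lists "Requires complete graph".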
Modularity Challenges
• Algorithm efficiency
• Varying sizes
• Overlapping
• Directed graphs
• Local discovery
Directed Community Mining
• Lost information by ignoring direction
• Directed Modularity
▫ Leicht and Newman
• Random Walks
▫ Kim, et al.
Leicht, E. A. and Newman, M. E. J.
Community Structure in Directed Networks.
Physical Review Letters, 100(11):118703, 2008.
Kim, Y., Son, S.-W., Jeong, H.
Finding Communities in Directed Networks
Physical Review E, 81(1):016103, 2010.
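As a rough illustration of how direction enters the calculation, the sketch below follows the directed modularity of Leicht and Newman, in which the expected number of edges between two nodes depends on the out-degree of the source and the in-degree of the target. The function name, input format, and toy graph are assumptions for illustration only.

```python
from collections import defaultdict


def directed_modularity(edges, community):
    """Directed modularity Q for an unweighted directed graph.

    edges: iterable of (source, target) pairs.
    community: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges starting and ending in community c
    out_sum = defaultdict(int)   # edges leaving nodes of community c
    in_sum = defaultdict(int)    # edges arriving at nodes of community c
    for src, dst in edges:
        out_sum[community[src]] += 1
        in_sum[community[dst]] += 1
        if community[src] == community[dst]:
            internal[community[src]] += 1
    labels = set(out_sum) | set(in_sum)
    # Actual fraction of internal edges minus the fraction expected from
    # out-degrees and in-degrees alone.
    return sum(internal[c] / m - out_sum[c] * in_sum[c] / m ** 2
               for c in labels)


# Toy graph (hypothetical): two directed 3-cycles joined by one edge 2 -> 3.
toy_arcs = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(directed_modularity(toy_arcs, groups))  # 18/49, about 0.367
```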
Clauset’s Local Modularity
• Steepness of boundary
• Greedily add nodes
Clauset, A.
Finding Local Community Structure in Networks.
Physical Review E, 72(2):026132, Aug 2005.
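A minimal sketch of the two bullets above, assuming an undirected graph stored as a dictionary of neighbour sets. Local modularity is computed as R = I / T, where T counts edges touching the community's boundary nodes and I counts those that stay inside the community. Stopping as soon as no candidate raises R is a simplification assumed here; the original paper describes other stopping criteria. All names and the toy graph are illustrative.

```python
def local_modularity(adj, community):
    """Clauset-style local modularity R = I / T.

    adj: dict mapping node -> set of neighbours (undirected graph).
    community: set of nodes currently in the community.
    """
    # Boundary: community nodes with at least one neighbour outside it.
    boundary = [v for v in community if adj[v] - community]
    # T counts edges with at least one endpoint in the boundary.
    boundary_edges = {frozenset((v, n)) for v in boundary for n in adj[v]}
    if not boundary_edges:
        return 1.0  # convention assumed here for a community with no boundary
    # I counts the boundary edges that lie entirely inside the community.
    inside = sum(1 for edge in boundary_edges if edge <= community)
    return inside / len(boundary_edges)


def grow_local_community(adj, seed):
    """Greedily add the neighbouring node that most increases R."""
    community = {seed}
    while True:
        frontier = set().union(*(adj[v] for v in community)) - community
        if not frontier:
            return community
        current = local_modularity(adj, community)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(frontier),
                   key=lambda n: local_modularity(adj, community | {n}))
        if local_modularity(adj, community | {best}) <= current:
            return community
        community.add(best)


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(grow_local_community(toy_adj, 0))  # {0, 1, 2}
```

Only the neighbourhood of the current community is inspected at each step, so the full graph never has to be loaded.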
Collective Classification
• “Typical” classification
▫ Internal attributes
• Relational classification
▫ Neighbor classes
• Collective classification
▫ Both
Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T.
Collective Classification in Network Data.
AI Magazine, 29(3):93, 2008.
Jensen, D., Neville, J., and Gallagher, B.
Why Collective Inference Improves Relational Classification.
In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2004.
Inferring Properties from Friends
• Location
▫ Backstrom, et al.
• Private information (politics, religion, etc.)
▫ Lindamood, et al.
Backstrom, L., Sun, E., and Marlow, C.
Find Me if You Can: Improving Geographical Prediction with Social and Spatial Proximity.
In Proceedings of the 19th International World Wide Web Conference, pages 61-70. 2010.
Lindamood, J., Heatherly, R., Kantarcioglu, M., and Thuraisingham, B.
Inferring private information using social network data.
In Proceedings of the 18th International World Wide Web Conference, pages 1145-1146. 2009.
Text Classification
• Different classes of documents
• Learn patterns from the words in each class
Sebastiani, F.
Machine Learning in Automated Text Categorization.
ACM Computing Surveys, 34(1):1-47, 2002.
Text Classification Algorithms
• Naïve Bayes
▫ McCallum and Nigam, 1998
• k-Nearest Neighbor
▫ Yang, 1999
• Support Vector Machines
▫ Joachims, 1998
• Rule-learning
▫ Cohen and Singer, 1996
• Maximum Entropy
▫ Nigam, et al., 1999
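To make "learn patterns from the words in each class" concrete, here is a small sketch of the first algorithm in the list, a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The training sentences and class labels are invented for illustration and are not data from any of the cited studies.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns the counts used to classify."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
training = [
    ("my tooth hurts need a dentist", "dental"),
    ("toothache all night", "dental"),
    ("great game tonight", "other"),
    ("watching the game with friends", "other"),
]
model = train_naive_bayes(training)
print(classify(model, "dentist for my tooth"))  # dental
print(classify(model, "game tonight"))          # other
```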
Topic Modeling
• Latent Dirichlet allocation (LDA)
▫ User chooses a topic (z)
▫ Given the topic, user chooses a word
Blei, D. M., Ng, A. Y., and Jordan, M. I.
Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3:993-1022, March 2003.
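The two-step story on this slide (choose a topic, then choose a word given the topic) is LDA's generative process. The sketch below simulates it for a single document using only the Python standard library. The two topics, their word probabilities, and the value of `alpha` are invented for illustration; fitting a real model to data is a separate inference problem that this sketch does not attempt.

```python
import random


def sample_dirichlet(rng, alpha, k):
    """Draw a k-dimensional symmetric Dirichlet sample via gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, topic_word, alpha, length):
    """Generate (topic, word) pairs following LDA's generative story.

    topic_word: list with one dict per topic, mapping word -> probability.
    """
    # Each document gets its own mixture over topics.
    theta = sample_dirichlet(rng, alpha, len(topic_word))
    tokens = []
    for _ in range(length):
        # Step 1: choose a topic z from the document's topic mixture.
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        # Step 2: given the topic, choose a word.
        vocab = list(topic_word[z])
        weights = [topic_word[z][w] for w in vocab]
        tokens.append((z, rng.choices(vocab, weights=weights)[0]))
    return theta, tokens


# Invented topics for illustration.
toy_topics = [
    {"tooth": 0.5, "dentist": 0.5},
    {"game": 0.7, "tonight": 0.3},
]
theta, tokens = generate_document(random.Random(0), toy_topics,
                                  alpha=0.5, length=20)
```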
Labeled LDA
• Supervised LDA
• Incorporates a document label
Ramage, D., Hall, D., Nallapati, R., and Manning, C.
Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248-256
“Author-LDA”
• Small document challenges for LDA
• One approach:
▫ Combine all of an author’s tweets
Hong, L. and Davison, B.
Empirical Study of Topic Modeling in Twitter.
In Proceedings of the First Workshop on Social Media Analytics, pages 80-88. 2010.
Zhao, W., Jiang, J., Weng, J., He, J., Lim, E., Yan, H., and Li, X.
Comparing Twitter and Traditional Media using Topic Models.
In Proceedings of the 33rd European Conference on Advances in Information Retrieval, pages 338-349. 2011.
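The pooling step described above can be sketched in a few lines: each author's tweets are concatenated into one longer pseudo-document before any topic model is fit. The function name, the input format of (author, text) pairs, and the sample tweets are assumptions for illustration.

```python
from collections import defaultdict


def pool_tweets_by_author(tweets):
    """Combine each author's tweets into a single document.

    tweets: iterable of (author, text) pairs.
    Returns a dict mapping author -> concatenated text.
    """
    pooled = defaultdict(list)
    for author, text in tweets:
        pooled[author].append(text)
    return {author: " ".join(texts) for author, texts in pooled.items()}


# Invented sample tweets.
sample = [
    ("user_a", "good day today"),
    ("user_b", "watching the game"),
    ("user_a", "happy time tonight"),
]
documents = pool_tweets_by_author(sample)
print(documents["user_a"])  # good day today happy time tonight
```

The resulting documents can then be passed to a topic model in place of the individual tweets.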
Ailment Topic Aspect Model (ATAM)
• Looking for specific health ailments in Twitter
• For each ailment:
▫ General words
▫ Symptoms
▫ Treatments
Paul, M. and Dredze, M.
You are what you Tweet: Analyzing Twitter for Public Health.
In International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
Identifying Questions in Micro-Text
• Survey (N=624), Questions characterization
▫ Morris, et al.
• I wonder, I’d like to know, etc.
▫ Efron and Winget
• Part of Speech Tagging
▫ Dent and Paul
Dent, K. and Paul, S.
Through the Twitter Glass: Detecting Questions in Micro-text.
In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
Efron, M. and Winget, M.
Questions are Content: A Taxonomy of Questions in a Microblogging Environment.
In Proceedings of the American Society for Information Science and Technology, 47(1):1-10, 2010.
Morris, M. R., Teevan, J., and Panovich, K.
What do People Ask their Social Networks, and Why?: A Survey Study of Status Message Q&A Behavior.
In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI), pages 1739-1748, 2010.
Questions in Twitter
• Finding questions
▫ Look for “?”
▫ Use Mechanical Turk service
• 1152 Questions
▫ 18% Response rate
Paul, S., Hong, L., and Chi, E.
Is Twitter a Good Place for Asking Questions? A Characterization Study.
In Proceedings of the 5th International Conference on Weblogs and Social Media, pages 578-581, 2011.
Research Area Overview
• Health research in social media
• Data mining
▫ Social network analysis
▫ Collective classification
▫ Text mining
Dissertation Proposal
• Develop and improve computational techniques to better enable public health surveillance in online social media
Public Health Surveillance
in Social Media
Observe
Predict
Discover
Social Media Space
Micro-blogs
Video-sharing
Full-length blogs
Mining Communities
• People in their social structures
• Complete graph not feasible
• Direction matters
Community Mining
• “Dense subnetwork within a larger network”
Does Direction Really Matter?
Implications of Discovery
Local, Directed Modularity
Edge type | Complete Graph | Local Discovery
Undirected | Modularity (Newman and Girvan, 2004) | Local Modularity (Clauset, 2005)
Directed | Directed Modularity (Leicht and Newman, 2008) | Local, Directed Modularity
Clauset’s Local Modularity
• Steepness of boundary
• Greedily add nodes
Degrees of Freedom
• Expanding new nodes
▫ Which outside nodes are considered?
• Calculation of local modularity
▫ Which edges to outside nodes count?
▫ Which edges to core nodes count?
Conclusions
• Edge direction is important
• Algorithm extension requires assumptions
• Different assumptions lead to different communities
Public Health Surveillance in YouTube
• What are people:
▫ Sharing?
▫ Seeing?
▫ Saying?
• Implications for communication
YouTube Communities
• Users
▫ Friends
▫ Author – Subscribers
▫ Author – Commenters
▫ Co-commenters
• Videos
▫ Similar titles/keywords
▫ YouTube’s “related videos”
▫ Videos commented on by common users
▫ Videos “in-response-to” others
Burton, S., et al.
Public Health Community Mining in YouTube
In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.
Anti-smoking Communities in YouTube
• “Tobacco Free Florida – Kid Tossing Ball”
http://www.youtube.com/watch?v=Ow-D9gCp-UA
Beam Search
• Quickly diverges to other topics
• Depth 4: as many sex-related videos as tobacco
Depth | Unique Videos | Smoking-related | Sex-related
0 | 1 | 1 | 0
1 | 5 | 4 | 1
2 | 19 | 9 | 5
3 | 70 | 18 | 17
4 | 268 | 41 | 42
Total | 363 | 73 | 65
Multiple Sub-Community Expansion (MSCE) Algorithm
1. Given initial start video
2. Build sub-community
a. Add video most increasing local modularity
b. Continue until no increase
3. Choose next start video based on:
a. Links to existing community
b. Keyword matching
4. Repeat 2-3, until sufficient community built
• Videos more related to the topic than Beam Search (70% vs. 20%)
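The four steps above can be sketched roughly as follows. This is an illustrative reading of the slide, not the published implementation: it assumes an undirected link graph, uses Clauset-style local modularity for step 2, and for step 3 keeps only candidates whose title contains a keyword and then prefers the one with the most links into the community built so far. The names `adj`, `titles`, `keywords`, and `target_size`, and the toy data, are hypothetical.

```python
def local_modularity(adj, community):
    """R = I / T over edges touching the community's boundary nodes."""
    boundary = [v for v in community if adj[v] - community]
    edges = {frozenset((v, n)) for v in boundary for n in adj[v]}
    if not edges:
        return 1.0  # convention assumed here for a community with no boundary
    return sum(1 for e in edges if e <= community) / len(edges)


def build_sub_community(adj, seed):
    """Step 2: add the video that most increases R until nothing does."""
    community = {seed}
    while True:
        frontier = set().union(*(adj[v] for v in community)) - community
        if not frontier:
            return community
        best = max(sorted(frontier),
                   key=lambda n: local_modularity(adj, community | {n}))
        if (local_modularity(adj, community | {best})
                <= local_modularity(adj, community)):
            return community
        community.add(best)


def msce(adj, titles, keywords, start, target_size):
    """Steps 1-4: grow sub-communities from successive start videos."""
    covered = set()
    seed = start
    while seed is not None and len(covered) < target_size:
        covered |= build_sub_community(adj, seed)
        candidates = set().union(*(adj[v] for v in covered)) - covered
        # Step 3b: keep candidates whose title matches a keyword.
        on_topic = [v for v in sorted(candidates)
                    if any(k in titles[v].lower() for k in keywords)]
        # Step 3a: prefer the candidate with most links to the community.
        seed = max(on_topic, key=lambda v: len(adj[v] & covered), default=None)
    return covered


# Toy data (hypothetical): two triangles of videos joined by one link.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_titles = {v: "Anti smoking ad" for v in toy_adj}
print(msce(toy_adj, toy_titles, {"smoking"}, start=0, target_size=6))
```

In this toy run the first sub-community is the triangle around video 0, and the keyword match on video 3 lets the expansion jump across the bridge to the second triangle.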
MSCE: Anti-Smoking Video Community
A. “Graphic Australian Anti-Smoking Ad”
▫ 2.5 million views
B. “How to quit smoking”
▫ Bridge between 3 sub-communities
C. Superhero sub-community
D. Superhero bridge videos
▫ “Star Wars Anti Smoking Ad”
▫ “Anti-Smoking : Superman versus Nick O’Teen (1981)”
Sampling on YouTube
• Current work:
▫ Search terms
▫ First N results
▫ YouTube limit of 1,000
• Typical users don’t page through search lists
iProspect.com. iProspect Search Engine User Behavior.
Technical report, iProspect.com, Inc., 2006.
Conclusions and
Public Health Implications
Conclusion | Implication
Users leave health topics within a few clicks | One chance to communicate the message
Influential authors are involved in the community | Simply posting a video is not sufficient
Users with affinities to the topic can be found | Surveillance and communication are possible
Communities can be used for sampling | Keyword-based approaches can be augmented
Horizontal Health Communication
Abroms, L. and Lefebvre, R. C.
Obama's Wired Campaign: Lessons for Public Health Communication.
Journal of Health Communication, 14(5):415-423, 2009
1. Dissemination
2. Feedback
Comparison of Communities
and Information Dissemination
• What health topics are discussed?
• How do they spread?
Public Health Surveillance in the Blogosphere
• Everyone is a publisher
• Link to other blogs
• Establish credibility
Image: http://datamining.typepad.com/gallery/blog-map-gallery.html
Mommy-Blogs
• Mothers are highly influential in health decisions (Daniel 2009)
• Blog communities influence social norms (Wei 2004)
Daniel, K.
The Power of Mom in Communicating Health.
American Journal of Public Health, 99(12):2119, 2009.
Wei, C.
Formation of Norms in a Blog Community.
Into the Blogosphere: Rhetoric, Community, and Culture in Weblogs. 2004.
Health Topics on Mommy-Blogs
• Community of 450 blogs
Topic | Count | Percent
Autism | 113 | 0.34
CMV | 1 | 0.00
Down Syndrome | 31 | 0.09
FAS | 2 | 0.01
SIDS | 17 | 0.05
Pregnancy | 1,008 | 3.01
All Entries | 33,527 | 100.00
Parallel Mommy-verses
• Build mommy-communities in Twitter and the Blogosphere
• Evaluate differences
▫ Network structure
▫ Health topics frequency
▫ Likelihood of reiterating
Image: http://www.psychedelicjunction.com/2011/04/what-are-parallel-universes.html
Implications for Health Communication
• Know what is being said
• Identify influential users
▫ Popular/respected
▫ Bridge nodes
• How to best get messages “passed along”
Surveillance of Health Advice
• Do people seek health advice?
• Are they receiving answers?
Is Health Data Too Private?
• Would you post that online?
• Our hypothesis:
▫ People are asking questions and receiving answers
▫ More social capital = Better leverage for advice
Burton, S. H., Tanner, K. W., and Giraud-Carrier, C. G.
Leveraging Social Networks for Anytime-Anyplace Health Information.
In Submission.
Benefits of Social Media
• No search result list
• Personalization
• Versatility
• Credibility
Our Study
• Platform: Twitter
▫ Public data
• Health topic: Dental advice
▫ Everyone manages dental health
▫ Not too private
▫ Easy vocabulary
Mining Dental Advice – Step 1
• Identify dental tweets
▫ Observe all tweets
▫ Filter by:
Tooth, teeth, dental, dentist, gums, molar, moler, floss, toothache
“Ugh I have the worst tooth ache every…#CantDeal” [sic.]
“I got a massive sweet tooth”
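A minimal sketch of the Step 1 filter, using the keyword list from the slide. Case-insensitive substring matching is an assumption made here; the slide does not say how matching was done. Both example tweets pass the filter, which shows why a later step is needed to separate real dental complaints from figures of speech such as "sweet tooth".

```python
import re

# Keyword list taken from the slide ("moler" is a common misspelling).
DENTAL_KEYWORDS = ["tooth", "teeth", "dental", "dentist", "gums",
                   "molar", "moler", "floss", "toothache"]
DENTAL_PATTERN = re.compile("|".join(DENTAL_KEYWORDS), re.IGNORECASE)


def is_dental_tweet(text):
    """Return True if the tweet contains any dental keyword."""
    return DENTAL_PATTERN.search(text) is not None


print(is_dental_tweet("Ugh I have the worst tooth ache every…#CantDeal"))  # True
print(is_dental_tweet("I got a massive sweet tooth"))                      # True
print(is_dental_tweet("Good morning everyone"))                            # False
```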
Mining Dental Advice – Step 2
• Identify advice-seeking questions
▫ Look for: “anybody”, “anyone”, “any1” and “?”
▫ Human raters fine-tune
“Can anyone suggest some home remedies for a #toothache?”
“does anyone know how long it takes for swelling on your mouth to go down after getting teeth out?”
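The automatic part of Step 2 might look like the sketch below. It assumes that a tweet must contain both a question mark and one of the three cue words, matched as whole words and case-insensitively; those details are assumptions. The human rating that follows is not modelled.

```python
import re

# Cue words from the slide, matched as whole words (an assumption).
ASKING_PATTERN = re.compile(r"\b(anybody|anyone|any1)\b", re.IGNORECASE)


def is_likely_advice_seeking(text):
    """Return True if the tweet has a cue word and a question mark."""
    return "?" in text and ASKING_PATTERN.search(text) is not None


print(is_likely_advice_seeking(
    "Can anyone suggest some home remedies for a #toothache?"))  # True
print(is_likely_advice_seeking("I got a massive sweet tooth"))  # False
```

Tweets flagged this way are only candidates; raters still have to decide which of them genuinely seek advice.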
Mining Dental Advice – Step 3
• Identify answers
▫ Search for: @user-name
▫ Within 48 hours
▫ Verify “in-reply-to” original tweet
“@Dray_Z try gurgling with warmm salt water or put a tea bag btween the ones that hurt” [sic.]
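The three checks in Step 3 can be combined as in the sketch below. Tweets are represented as dictionaries with hypothetical field names (`id`, `user`, `text`, `created_at`, `in_reply_to_id`); these are assumptions for illustration and not the schema of any particular Twitter API. The sample question, user names, and timestamps are invented.

```python
from datetime import datetime, timedelta


def find_answers(question, tweets, window=timedelta(hours=48)):
    """Return tweets that answer the question.

    A tweet counts as an answer when it mentions the asker, is posted
    within the time window, and is marked as a reply to the question.
    """
    mention = "@" + question["user"].lower()
    answers = []
    for tweet in tweets:
        delay = tweet["created_at"] - question["created_at"]
        if (mention in tweet["text"].lower()
                and timedelta(0) < delay <= window
                and tweet.get("in_reply_to_id") == question["id"]):
            answers.append(tweet)
    return answers


# Invented sample data.
question = {"id": 1, "user": "asker", "text": "Can anyone suggest a remedy?",
            "created_at": datetime(2012, 1, 1, 12, 0)}
stream = [
    {"id": 2, "user": "friend", "text": "@asker try warm salt water",
     "created_at": datetime(2012, 1, 1, 12, 5), "in_reply_to_id": 1},
    {"id": 3, "user": "late", "text": "@asker feeling better?",
     "created_at": datetime(2012, 1, 4, 12, 0), "in_reply_to_id": 1},
    {"id": 4, "user": "other", "text": "@asker nice photo",
     "created_at": datetime(2012, 1, 1, 13, 0), "in_reply_to_id": 99},
]
print([t["id"] for t in find_answers(question, stream)])  # [2]
```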
Results
• 2 weeks of tweets
▫ 1 million dental tweets (74,000 per day)
▫ 2,035 likely advice seeking (anyone … ?)
▫ 432 genuine advice-seeking
▫ 140 (32%) received at least one response
▫ 5.5 minutes to response (median)
Benefits of Social Capital
• More likely to receive a response
• Receive responses faster
Who is Answering?
• Answers come from people you know
Relationship | Percent
No relation | 6.6
Responder following asker | 93.0
Asker following responder | 70.0
Mutual following and follower | 69.5
Conclusions
• People are seeking dental advice in Twitter
• Answers come frequently and quickly
• Users with more social capital are more likely to receive answers
Predicting Substance Abuse
• Identifying Trends
▫ Content of tweets
▫ Social network
Do People Tweet About That?
“So my family knows I smoke weed. The only one that doesn't really care or seem to concern is my pops” [sic.]
“if u dont like that i smoke weed then u dont like me... Weed is BIG part of my laugh. now pass me the blunt” [sic.]
“No wonder I smoke weed. Stupid people stress me out.”
Mining Process
A. Collect Marijuana Users
B. Collect Non-Marijuana Users
C. Build User Profiles
D. Induce Predictive Model
E. Analyze Model
Intervention (Future Work)
F. Predict Likely Users
Collecting Users
• Keyword filters
• Pilot study: “I smoke weed” (50 users)
▫ 36% - Definitely marijuana users
▫ 25% - Explicitly said it, but possibly joking
▫ 19% - At least positive sentiment
▫ 78% - These three combined
• Non-marijuana users
Building User Profiles
• Complete tweet history (up to 3200)
• Follower List
• Following List
• User-supplied description
Feature Extraction
• Author-LDA
▫ “day today good time tonight happy”
▫ “real tho man gotta life twitter yo hit”
• Personal pronouns
▫ “My step-mom…”
▫ Bootstrap training set
• Traits from theoretical models
Hawkins, J., Catalano, R., and Miller, J.
Risk and Protective Factors for Alcohol and Other Drug Problems in Adolescence and
Early Adulthood: Implications for Substance Abuse Prevention.
Psychological Bulletin, 112(1):64, 1992.
The Predictive Model
• Comprehensibility
• Collective classification
▫ Predict personal traits
▫ Predict traits of friends
▫ Weighted, directed edges
Analysis and Validation
• Compare to theory
▫ “Risk and protective factors”
• Subjective validation
• Objective validation of easily-labeled traits
Future Work
• Personalized communication
• Intervention
• Communication with family/friends
Proposed Schedule
Sec. | Topic | Venue | Target
2 | Public Health Community Mining in YouTube | ACM International Health Informatics Symposium (IHI) | Published
4 | Leveraging Social Networks for Anytime-Anyplace Health Information | Network Modeling Analysis in Health Informatics and Bioinformatics (NetMAHIB) | In Submission
1 | Local Community Mining in Directed Graphs | Journal of Social Network Analysis and Mining (SNAM) | June 2012
3 | Mining the Spread of Health Content in Social Media | International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction (SBP) | August 2012
5 | Mining Social Media for Trends among Substance Abusers | ACM Transactions on Knowledge Discovery from Data (TKDD) | February 2013
Contributions
• Computational techniques
▫ Local, directed community mining
▫ Community mining for sampling
▫ Mining rare and meaningful traits in short text
▫ Combination of text mining and social network analysis for prediction
• Implications for Health Surveillance
▫ YouTube as a source of communities
▫ Health differences across platforms
▫ Health advice in social media
▫ Prediction of high risk individuals
Questions