Computational Techniques for Public Health Surveillance Scott H. Burton Ph.D. Dissertation Proposal Department of Computer Science Brigham Young University April 26, 2012

Page 1: Computational Techniques for Public Health …dml.cs.byu.edu/~sburton/presentations/2012-04_26...Topic Modeling •Latent Dirichlet allocation (LDA) User chooses a topic (z) Given

Computational Techniques for Public Health Surveillance

Scott H. Burton

Ph.D. Dissertation Proposal

Department of Computer Science

Brigham Young University

April 26, 2012

Page 2

Overview

• Problem overview

• Research area overview

▫ Health research in social media

▫ Data mining: social network analysis, collective classification, text mining

• Dissertation proposal

Page 3

Health is Important

• U.S. 2010 total health expenditures:

▫ $2.6 trillion (17.9% of GDP)

• Millions of lives affected each year

National Health Expenditures 2010 Highlights.

http://www.cms.gov/NationalHealthExpendData/downloads/highlights.pdf

Image: http://health-ins.us/

Page 4

Public Health Surveillance

“Public health surveillance is the continuous, systematic collection, analysis and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice.”

– World Health Organization

• Epidemiology

• Health promotion

• Substance abuse prevention

• Public policy

World Health Organization

http://www.who.int/topics/public_health_surveillance/en/

Page 5

Traditional Methods

• Health Department Labs

• Focus Groups

• Questionnaires

• Clinical Trials

Page 6

Limitations of Traditional Methods

Traditional Methods

• Cost

• Delay

• Isolated individuals

• Reported vs. actual behavior

• Often small samples

Page 7

Social Media Opportunities

Traditional Methods

• Cost

• Delay

• Isolated individuals

• Reported vs. actual behavior

• Often small samples

Online Social Media

• Inexpensive

• Real-time posting

• Near real-time analysis

• Relational data / social structures

• True feelings and behaviors

• Large samples

• Geo-located

• Reach under-represented countries and groups

Page 8

Computational Health Science

“Developing computational techniques to build systems or applications to understand and influence individual health and measure relevant outcomes.”

[Figure: Computer Science, Sociology, Health Science]

Page 9

The CHS Difference

• Community identification

• Data set size

• Relational classification

• Inductive models

• Text mining and automated analysis

Page 10

Search Query Monitoring

• Influenza outbreak detection

Polgreen, P., Chen, Y., Pennock, D., Nelson, F., and Weinstein, R.

Using Internet Searches for Influenza Surveillance

Clinical Infectious Diseases, 47(11):1443-1448, 2008.

Page 11

More Outbreak Detection

• Influenza outbreak detection (Ginsberg, et al.)

• 2009 H1N1 Influenza (Brownstein, et al.)

• Listeriosis (Wilson and Brownstein)

• Gastroenteritis and Chickenpox (Pelat, et al.)

Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M., and Brilliant, L.

Detecting Influenza Epidemics using Search Engine Query Data.

Nature, 457(7232):1012-1014, 2009.

Brownstein, J. S., et al.

Information Technology and Global Surveillance of Cases of 2009 H1N1 Influenza

New England Journal of Medicine, 362(18):1731-1735, 2010.

Wilson, K. and Brownstein, J.

Early Detection of Disease Outbreaks using the Internet.

Canadian Medical Association Journal, 180(8):829, 2009.

Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., and Valleron, A.

More Diseases Tracked by using Google Trends.

Emerging Infectious Diseases, 15(8):1327, 2009.

Page 12

Health on YouTube

• Immunizations (N=153) (Keelan, et al.)

• Tanning Bed Use (N=72) (Hossler and Conroy)

• Tobacco (N=50) (Freeman and Chapman)

• Stop Smoking (N=191) (Backinger, et al.)

Keelan, J., Pavri-Garcia, V., Tomlinson, G., and Wilson, K.

YouTube as a Source of Information on Immunization: A Content Analysis.

Journal of the American Medical Association, 298(21):2482, 2007.

Hossler, E. and Conroy, M.

YouTube as a Source of Information on Tanning Bed Use.

Archives of Dermatology, 144(10):1395-1396, 2008.

Freeman, B. and Chapman, S.

Is “YouTube” Telling or Selling you Something? Tobacco Content on the YouTube Video-sharing Website.

Tobacco Control, 16(3):207, 2007.

Backinger, C. L., Pilsner, A. M., Augustson, E. M., Frydl, A., Phillips, T., and Rowden, J.

YouTube as a Source of Quitting Smoking Information.

Tobacco Control, 20(2):119-122, 2011.

Page 13

Health on Facebook

• General Non-Communicable Disease Groups (N=757)

▫ Farmer, et al.

• Diabetes Groups (N=15)

▫ Greene, et al.

• Ethical Issues (N=202)

▫ Moubarak, et al.

Greene, J., Choudhry, N., Kilabuk, E., and Shrank, W.

Online Social Networking by Patients with Diabetes: A Qualitative Evaluation of Communication with Facebook.

Journal of General Internal Medicine, 26:287-292, 2011.

Moubarak, G., Guiot, A., Benhamou, Y., Benhamou, A., and Hariri, S.

Facebook Activity of Residents and Fellows and its Impact on the Doctor-Patient Relationship.

Journal of Medical Ethics, 37(2):101-104, 2011.

Farmer, A. D., Bruckner Holt, C. E. M., Cook, M. J., and Hearing, S. D.

Social Networking Sites: A Novel Portal for Communication.

Postgraduate Medical Journal, 85:455-459, 2009.

Page 14

Health on Blogs

• Health-related Blogs (N=951)

▫ Miller and Pole

• Breastfeeding and Blogging (32 blogs, 354 posts, 881 comments)

▫ West et al.

Miller, E. and Pole, A.

Diagnosis Blog: Checking up on Health Blogs in the Blogosphere.

American Journal of Public Health, 100(8):1514-1519, 2010.

West, J., Hall, P., Hanson, C., Thackeray, R., Barnes, M., Neiger, B., and McIntyre, E.

Breastfeeding and Blogging: Exploring the Utility of Blogs to Promote Breastfeeding.

American Journal of Health Education, 42(2):106-115, 2011.

Page 15

Health on Twitter

• Dental Pain (N=772)

▫ Heaivilin, et al.

• Tobacco (N=5.9 million tweets, 5,000 tobacco-related)

▫ Prier, et al.

• Problem Drinking (N=5.5 million tweets, 21,000 alcohol-related)

▫ West et al.

Heaivilin, N., Gerbert, B., Page, J., and Gibbs, J.

Public Health Surveillance of Dental Pain via Twitter.

Journal of Dental Research, 90(9):1047-1051, 2011.

Prier, K. W., Smith, M. S., Giraud-Carrier, C., and Hanson, C. L.

Identifying Health-Related Topics on Twitter: An Exploration of Tobacco-related Tweets as a Test Topic.

In Proceedings of the 4th International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 18-25. 2011.

West, J., Hall, P., Prier, K., Hanson, C., Giraud-Carrier, C., Neeley, S., Barnes, M.

Temporal Variability of Problem Drinking on Twitter

Open Journal of Preventive Medicine, 2(1):43-48. 2012.

Page 16

Geo-Location in Twitter

• Pew Institute reports:

▫ 14% of users said they used automatic GPS tagging

• In our study, the data said:

▫ 2.0% of Tweets

▫ 2.7% of unique users

K. Zickuhr and A. Smith.

28% of American Adults Use Mobile and Social Location-based Services.

http://pewinternet.org/~/media//Files/Reports/2011/PIP_Locationbased-services.pdf, 2011.

Burton, S. H., Tanner, K. W., Giraud-Carrier, C. G., West, J. H., and Barnes, M. D.

Right Time, Right Place Health Communication in Twitter: How Good Is Location Information?

In Submission.

Page 17

Tweets Around the World

Burton, S. H., Tanner, K. W., Giraud-Carrier, C. G., West, J. H., and Barnes, M. D.

Right Time, Right Place Health Communication in Twitter: How Good Is Location Information?

In Submission.

Page 18

Data Mining

• “the process of discovering interesting and useful patterns and relationships in large volumes of data” – Christopher Clifton

• Algorithms

▫ Supervised

▫ Unsupervised

• Types of data

▫ Tabular

▫ Relational

▫ Text

Clifton, C.

Encyclopedia Britannica: Data Mining

http://www.britannica.com/EBchecked/topic/1056150/data-mining

Page 19

Social Network Analysis

• Relational data

• Not just networks of “people”

Wasserman, S. and Faust, K.

Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

Scott, J.

Social Network Analysis: A Handbook. Sage Publications, Second Edition, 2000.

Page 20

Community Mining

• “Dense subnetwork within a larger network”

Newman, M. E. J.

Communities, Modules and Large-scale Structure in Networks.

Nature Physics, 8:25-31. 2012

Page 21

Community Mining Techniques

• Label Propagation

▫ Cordasco and Gargano

• Random Walks

▫ Rosvall and Bergstrom

• Rolling k-Cliques

▫ Palla et al.

Cordasco, G. and Gargano, L.

Community Detection via Semi-Synchronous Label Propagation Algorithms

IEEE International Workshop on Business Applications of Social Network Analysis, 2010

Rosvall, M. and Bergstrom, C. T.

Maps of Random Walks on Complex Networks Reveal Community Structure

Proceedings of the National Academy of Sciences 105(4):1118-1123. 2008

Palla, G., Derényi, I., Farkas, I., and Vicsek, T.

Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society

Nature, 435(7043):814-818, 2005.

Page 22

Modularity

• Actual edges minus expected

• Undirected

• Requires complete graph

Newman, M. E. J. and Girvan, M.

Finding and evaluating community structure in networks.

Physical Review E, 69(2):026113, Feb 2004.
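
The "actual edges minus expected" idea can be made concrete with a small sketch. It assumes the standard Newman-Girvan form for an undirected, unweighted graph, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), computed community by community. The function name, the edge-list input, and the toy graph are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    inside = defaultdict(int)      # edges with both ends in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += 1

    degree_sum = defaultdict(int)  # total degree per community
    for node, k in degree.items():
        degree_sum[community_of[node]] += k

    # Per community: actual fraction of edges minus the fraction expected
    # if edges were placed at random with the same node degrees.
    return sum(inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))
```

Note that the calculation needs the total edge count m and every node's degree, which is why the slide lists a complete graph as a requirement.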

Page 23

Modularity Challenges

• Algorithm efficiency

• Varying sizes

• Overlapping

• Directed graphs

• Local discovery

Page 24

Directed Community Mining

• Lost information by ignoring direction

• Directed Modularity

▫ Leicht and Newman

• Random Walks

▫ Kim, et al.

Leicht, E. A. and Newman, M. E. J.

Community Structure in Directed Networks.

Physical Review Letters, 100(11):118703, 2008.

Kim, Y., Son, S.-W., Jeong, H.

Finding Communities in Directed Networks

Physical Review E, 81(1):016103, 2010.
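
For reference, directed modularity can be written as below. This is the form commonly attributed to Leicht and Newman (2008), reproduced here from memory, so the indexing convention should be checked against the paper. It assumes A_ij is 1 when there is an edge from j to i, m is the number of directed edges, and c_i is the community of node i.

```latex
Q = \frac{1}{m} \sum_{i,j} \left[ A_{ij} - \frac{k_i^{\mathrm{in}} \, k_j^{\mathrm{out}}}{m} \right] \delta_{c_i, c_j}
```

The expected-edge term separates in-degree from out-degree, which is exactly the information that is lost when direction is ignored.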

Page 25

Clauset’s Local Modularity

• Steepness of boundary

• Greedily add nodes

Clauset, A.

Finding Local Community Structure in Networks.

Physical Review E, 72(2):026132, Aug 2005.
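
A minimal sketch of the two bullets above, assuming Clauset's definition R = I / T, where T counts edges with at least one endpoint on the community boundary and I counts those of them that stay inside the community. The adjacency-dict input, the function names, the size-based stopping rule, and the value returned when there is no boundary are assumptions made for illustration.

```python
def local_modularity(adj, community):
    """Clauset's R = I / T for an undirected graph.

    adj: dict mapping node -> set of neighbouring nodes.
    community: set of nodes currently in the community C.
    """
    # Boundary B: members of C with at least one neighbour outside C.
    boundary = {v for v in community if adj[v] - community}
    total = set()      # T: edges with at least one endpoint in B
    internal = set()   # I: those edges whose other endpoint is also in C
    for v in boundary:
        for w in adj[v]:
            edge = frozenset((v, w))
            total.add(edge)
            if w in community:
                internal.add(edge)
    # Assumption: with no boundary at all, treat the boundary as fully sharp.
    return len(internal) / len(total) if total else 1.0


def grow_community(adj, seed, max_size):
    """Greedily add the neighbouring node that yields the largest R."""
    community = {seed}
    while len(community) < max_size:
        frontier = set().union(*(adj[v] for v in community)) - community
        if not frontier:
            break
        best = max(sorted(frontier),
                   key=lambda n: local_modularity(adj, community | {n}))
        community.add(best)
    return community


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(grow_community(adj, seed=0, max_size=3))
```

Only the neighbourhoods of nodes already in the community are inspected, so the full graph never has to be loaded.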

Page 26

Collective Classification

• “Typical” classification

▫ Internal attributes

• Relational classification

▫ Neighbor classes

• Collective classification

▫ Both

Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T.

Collective Classification in Network Data.

AI Magazine, 29(3):93, 2008.

Jensen, D., Neville, J., and Gallagher, B.

Why Collective Inference Improves Relational Classification.

In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2004.
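
To illustrate how the "both" case can work, here is a simplified relaxation-style sketch: each unlabeled node blends a score from its own attributes with the average of its neighbours' current estimates, and the blend is repeated for a few rounds. This is not one of the specific algorithms surveyed in the cited papers. The function name, the `weight` parameter, and the toy data are hypothetical.

```python
def collective_classify(local_prob, neighbours, known, weight=0.5, rounds=10):
    """Blend attribute-based scores with neighbours' current estimates.

    local_prob: dict node -> P(positive) from the node's own attributes.
    neighbours: dict node -> list of adjacent nodes.
    known: dict node -> 0 or 1 for nodes whose class is already observed.
    weight: share given to the node's own attributes (hypothetical knob).
    """
    estimate = dict(local_prob)
    estimate.update(known)
    for _ in range(rounds):
        updated = dict(known)              # observed labels never change
        for node, own in local_prob.items():
            if node in known:
                continue
            nbrs = neighbours.get(node, [])
            if nbrs:
                relational = sum(estimate[n] for n in nbrs) / len(nbrs)
                updated[node] = weight * own + (1 - weight) * relational
            else:
                updated[node] = own        # no links: attributes only
        estimate = updated
    return {node: int(p >= 0.5) for node, p in estimate.items()}


# Toy example: "b" looks negative on its own but has a positive neighbour.
local_prob = {"b": 0.4, "c": 0.4}
neighbours = {"b": ["a", "c"], "c": ["b"]}
known = {"a": 1}
print(collective_classify(local_prob, neighbours, known))
```

In the toy example, attributes alone would label "b" negative, while the link to a known positive neighbour changes that decision.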

Page 27

Inferring Properties from Friends

• Location

▫ Backstrom, et al.

• Private information (politics, religion, etc.)

▫ Lindamood, et al.

Backstrom, L., Sun, E., and Marlow, C.

Find Me if You Can: Improving Geographical Prediction with Social and Spatial Proximity.

In Proceedings of the 19th International World Wide Web Conference, pages 61-70. 2010.

Lindamood, J., Heatherly, R., Kantarcioglu, M., and Thuraisingham, B.

Inferring private information using social network data.

In Proceedings of the 18th International World Wide Web Conference, pages 1145-1146. 2009.

Page 28

Text Classification

• Different classes of documents

• Learn patterns from the words in each class

Sebastiani, F.

Machine Learning in Automated Text Categorization.

ACM Computing Surveys, 34(1):1-47, 2002.


Page 29

Text Classification Algorithms

• Naïve Bayes

▫ McCallum and Nigam, 1998

• k-Nearest Neighbor

▫ Yang, 1999

• Support Vector Machines

▫ Joachims, 1998

• Rule-learning

▫ Cohen and Singer, 1996

• Maximum Entropy

▫ Nigam, et al., 1999
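
As one illustration of learning word patterns per class, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. It is a generic textbook version, not the exact formulation of any paper listed above. Whitespace tokenization, the function names, and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    doc_counts = Counter()               # label -> number of documents
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
docs = [("my tooth hurts so bad", "health"),
        ("need a dentist for this toothache", "health"),
        ("great game last night", "other"),
        ("watching the game with friends", "other")]
model = train_naive_bayes(docs)
print(classify("my tooth hurts", model))
```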

Page 30

Topic Modeling

• Latent Dirichlet allocation (LDA)

▫ User chooses a topic (z)

▫ Given the topic, user chooses a word

Blei, D. M., Ng, A. Y., and Jordan, M. I.

Latent Dirichlet Allocation.

Journal of Machine Learning Research, 3:993-1022, March 2003.
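
The two-step generative story above can be sketched as follows. This covers only the word-level sampling. Full LDA also draws each document's topic mixture and each topic's word distribution from Dirichlet priors, which are fixed by hand here as a simplification. All names and probabilities are illustrative assumptions.

```python
import random

def generate_document(topic_mix, topic_words, length, rng):
    """Sample (topic, word) pairs the way LDA assumes a document is written.

    topic_mix: dict topic -> probability for this document.
    topic_words: dict topic -> dict word -> probability.
    rng: a random.Random instance, passed in for reproducibility.
    """
    topics = list(topic_mix)
    sampled = []
    for _ in range(length):
        # Step 1: choose a topic z from the document's topic mixture.
        z = rng.choices(topics, weights=[topic_mix[t] for t in topics])[0]
        # Step 2: given z, choose a word from that topic's distribution.
        vocab = list(topic_words[z])
        w = rng.choices(vocab, weights=[topic_words[z][v] for v in vocab])[0]
        sampled.append((z, w))
    return sampled

# Hypothetical two-topic model.
mix = {"dental": 0.7, "sports": 0.3}
phi = {"dental": {"tooth": 0.5, "dentist": 0.5},
       "sports": {"game": 0.6, "team": 0.4}}
print(generate_document(mix, phi, length=8, rng=random.Random(0)))
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and word distributions most likely to have produced them.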

Page 31

Labeled LDA

• Supervised LDA

• Incorporates a document label

Ramage, D., Hall, D., Nallapati, R., and Manning, C.

Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora

In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248-256

Page 32

“Author-LDA”

• Small document challenges for LDA

• One approach:

▫ Combine all of an author’s tweets

Hong, L. and Davison, B.

Empirical Study of Topic Modeling in Twitter.

In Proceedings of the First Workshop on Social Media Analytics, pages 80-88. 2010.

Zhao, W., Jiang, J., Weng, J., He, J., Lim, E., Yan, H., and Li, X.

Comparing Twitter and Traditional Media using Topic Models.

In Proceedings of the 33rd European Conference on Advances in Information Retrieval, pages 338-349. 2011.
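
The pooling approach mentioned above amounts to a small preprocessing step before topic modeling. A minimal sketch, assuming tweets arrive as (author, text) pairs. The function name and input shape are hypothetical.

```python
from collections import defaultdict

def pool_by_author(tweets):
    """Merge each author's tweets into one longer pseudo-document.

    tweets: iterable of (author, text) pairs.
    Returns a dict mapping author -> concatenated text.
    """
    pooled = defaultdict(list)
    for author, text in tweets:
        pooled[author].append(text)
    return {author: " ".join(texts) for author, texts in pooled.items()}

tweets = [("ann", "my tooth hurts"),
          ("bob", "great game tonight"),
          ("ann", "finally saw the dentist")]
print(pool_by_author(tweets))
```

Each pooled document then plays the role of a single document in LDA, at the cost of mixing all of an author's interests together.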

Page 33

Ailment Topic Aspect Model (ATAM)

• Looking for specific health ailments in Twitter

• For each ailment:

▫ General words

▫ Symptoms

▫ Treatments

Paul, M. and Dredze, M.

You are what you Tweet: Analyzing Twitter for Public Health.

In International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.

Page 34

Identifying Questions in Micro-Text

• Survey (N=624), question characterization

▫ Morris, et al.

• I wonder, I’d like to know, etc.

▫ Efron and Winget

• Part of Speech Tagging

▫ Dent and Paul

Dent, K. and Paul, S.

Through the Twitter Glass: Detecting Questions in Micro-text.

In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

Efron, M. and Winget, M.

Questions are Content: A Taxonomy of Questions in a Microblogging Environment.

In Proceedings of the American Society for Information Science and Technology, 47(1):1-10, 2010.

Morris, M. R., Teevan, J., and Panovich, K.

What do People Ask their Social Networks, and Why?: A Survey Study of Status Message Q&A Behavior.

In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI), pages 1739-1748, 2010.

Page 35

Questions in Twitter

• Finding questions

▫ Look for “?”

▫ Use Mechanical Turk service

• 1152 Questions

▫ 18% Response rate

Paul, S., Hong, L., and Chi, E.

Is Twitter a Good Place for Asking Questions? A Characterization Study.

In Proceedings of the 5th International Conference on Weblogs and Social Media, pages 578-581, 2011.
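
The first step above, looking for "?", can be sketched as a simple filter that produces candidates for later human review. Skipping question marks that appear only inside URLs is an added assumption, not something the slide states. The function name is hypothetical.

```python
def candidate_questions(tweets):
    """Keep tweets that contain a question mark outside of any URL."""
    kept = []
    for text in tweets:
        # Assumption: tokens starting with "http" are links, whose query
        # strings often contain "?" without the tweet being a question.
        words = [w for w in text.split() if not w.startswith("http")]
        if any("?" in w for w in words):
            kept.append(text)
    return kept

sample = ["Does anyone know a good dentist?",
          "Check this out http://example.com/page?id=1",
          "My tooth hurts"]
print(candidate_questions(sample))
```

A filter like this over-selects (rhetorical questions, for example), which is presumably why a manual labeling step such as Mechanical Turk follows it.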

Page 36

Research Area Overview

• Health research in social media

• Data mining

▫ Social network analysis

▫ Collective classification

▫ Text mining

Page 37

Dissertation Proposal

• Develop and improve computational techniques to better enable public health surveillance in online social media

Page 38

Public Health Surveillance in Social Media

Observe

Predict

Discover

Page 39

Social Media Space

Micro-blogs

Video-sharing

Full-length blogs

Page 40

Mining Communities

• People in their social structures

• Complete graph not feasible

• Direction matters


Page 41

Community Mining

• “Dense subnetwork within a larger network”

Newman, M. E. J.

Communities, Modules and Large-scale Structure in Networks.

Nature Physics 8:25-31. 2012

Page 42

Does Direction Really Matter?

Page 43

Does Direction Really Matter?

Page 44

Implications of Discovery

Page 45

Local, Directed Modularity

| | Complete Graph | Local Discovery |
| Undirected | Modularity (Newman and Girvan, 2004) | Local Modularity (Clauset, 2005) |
| Directed | Directed Modularity (Leicht and Newman, 2008) | Local, Directed Modularity |

Page 46

Clauset’s Local Modularity

• Steepness of boundary

• Greedily add nodes

Clauset, A.

Finding Local Community Structure in Networks.

Physical Review E, 72(2):026132, Aug 2005.

Page 47

Degrees of Freedom

• Expanding new nodes

▫ Which outside nodes are considered?

• Calculation of local modularity

▫ Which edges to outside nodes count?

▫ Which edges to core nodes count?
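
The first of these choices can be made explicit in code. The sketch below shows how the set of outside nodes considered for expansion changes depending on whether out-links, in-links, or both are followed. The `mode` parameter and the function name are hypothetical and only illustrate the design space; they are not the proposal's chosen definition.

```python
def candidate_nodes(successors, community, mode):
    """Outside nodes considered for expansion in a directed graph.

    successors: dict node -> set of nodes it links to.
    community: set of nodes already in the community.
    mode: "out" (nodes the community links to), "in" (nodes that link
          into the community), or "both".
    """
    out_nbrs, in_nbrs = set(), set()
    for source, targets in successors.items():
        for target in targets:
            if source in community and target not in community:
                out_nbrs.add(target)
            if target in community and source not in community:
                in_nbrs.add(source)
    if mode == "out":
        return out_nbrs
    if mode == "in":
        return in_nbrs
    if mode == "both":
        return out_nbrs | in_nbrs
    raise ValueError(f"unknown mode: {mode}")

# Toy directed graph: a -> b and c -> a.
graph = {"a": {"b"}, "b": set(), "c": {"a"}}
for mode in ("out", "in", "both"):
    print(mode, sorted(candidate_nodes(graph, {"a"}, mode)))
```

The same kind of switch applies to which edges are counted when local modularity is calculated, so each combination of choices can yield a different community.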

Page 48

Conclusions

• Edge direction is important

• Algorithm extension requires assumptions

• Different assumptions lead to different communities

Page 49

Public Health Surveillance in YouTube

• What are people:

▫ Sharing?

▫ Seeing?

▫ Saying?

• Implications for communication


Page 50

YouTube Communities

• Users

▫ Friends

▫ Author – Subscribers

▫ Author – Commenters

▫ Co-commenters

• Videos

▫ Similar titles/keywords

▫ YouTube’s “related videos”

▫ Videos commented on by common users

▫ Videos “in-response-to” others

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.

Page 51

Anti-smoking Communities in YouTube

• “Tobacco Free Florida – Kid Tossing Ball”

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.

http://www.youtube.com/watch?v=Ow-D9gCp-UA

Page 52

Beam Search

• Quickly diverges to other topics

• Depth 4: as many sex-related videos as tobacco

| Depth | Unique Videos | Smoking-related | Sex-related |
| 0 | 1 | 1 | 0 |
| 1 | 5 | 4 | 1 |
| 2 | 19 | 9 | 5 |
| 3 | 70 | 18 | 17 |
| 4 | 268 | 41 | 42 |
| Total | 363 | 73 | 65 |

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.
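
The divergence can be quantified directly from the table. The short calculation below computes the share of smoking-related videos at each depth and overall. The numbers are copied from the table; the variable and function names are illustrative.

```python
# (depth, unique videos, smoking-related, sex-related), copied from the table
rows = [(0, 1, 1, 0), (1, 5, 4, 1), (2, 19, 9, 5),
        (3, 70, 18, 17), (4, 268, 41, 42)]

def smoking_share(rows):
    """Fraction of crawled videos that are smoking-related, per depth and overall."""
    per_depth = {depth: smoking / unique
                 for depth, unique, smoking, _ in rows}
    overall = sum(r[2] for r in rows) / sum(r[1] for r in rows)
    return per_depth, overall

per_depth, overall = smoking_share(rows)
for depth, share in per_depth.items():
    print(f"depth {depth}: {share:.0%} smoking-related")
print(f"overall: {overall:.0%}")
```

The overall share is 73 of 363 videos, which appears consistent with the roughly 20% relevance quoted for Beam Search in the MSCE comparison.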

Page 53

Multiple Sub-Community Expansion (MSCE) Algorithm

1. Given initial start video

2. Build sub-community

a. Add video most increasing local modularity

b. Continue until no increase

3. Choose next start video based on:

a. Links to existing community

b. Keyword matching

4. Repeat 2-3, until sufficient community built

• Videos more related to the topic than Beam Search (70% vs. 20%)

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.
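
The numbered steps above can be sketched roughly as follows. This is a loose illustration under several assumptions, not the published implementation: the video graph is treated as undirected, local modularity follows Clauset's R = I / T, the next start video must match at least one keyword, and candidates are ranked first by links to the existing community and then by keyword matches. All function names and the toy data are hypothetical.

```python
def local_modularity(adj, community):
    """Clauset-style R = I / T on an undirected adjacency dict."""
    boundary = {v for v in community if adj[v] - community}
    total, internal = set(), set()
    for v in boundary:
        for w in adj[v]:
            edge = frozenset((v, w))
            total.add(edge)
            if w in community:
                internal.add(edge)
    return len(internal) / len(total) if total else 1.0

def build_sub_community(adj, start):
    """Step 2: add the video that most increases R; stop when none does."""
    community = {start}
    score = local_modularity(adj, community)
    while True:
        frontier = set().union(*(adj[v] for v in community)) - community
        if not frontier:
            return community
        best = max(sorted(frontier),
                   key=lambda n: local_modularity(adj, community | {n}))
        best_score = local_modularity(adj, community | {best})
        if best_score <= score:
            return community
        community.add(best)
        score = best_score

def next_start(adj, titles, found, keywords):
    """Step 3: rank outside neighbours by (links to community, keyword hits)."""
    frontier = set().union(*(adj[v] for v in found)) - found

    def rank(video):
        links = len(adj[video] & found)
        matches = sum(k in titles[video].lower() for k in keywords)
        return (links, matches)

    # Assumption: a start video must match at least one keyword.
    candidates = [v for v in sorted(frontier) if rank(v)[1] > 0]
    return max(candidates, key=rank) if candidates else None

def msce(adj, titles, start, keywords, target_size):
    """Step 4: repeat steps 2-3 until the community is large enough."""
    found = set()
    while start is not None and len(found) < target_size:
        found |= build_sub_community(adj, start)
        start = next_start(adj, titles, found, keywords)
    return found

# Toy graph: two smoking-related triangles, then an off-topic triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5},
       5: {3, 4, 6}, 6: {5, 7, 8}, 7: {6, 8}, 8: {6, 7}}
titles = {n: "quit smoking ad" for n in range(6)}
titles.update({n: "music video" for n in range(6, 9)})
print(sorted(msce(adj, titles, 0, ["smoking"], target_size=9)))
```

In the toy graph, expansion stops at the off-topic triangle because none of its videos match the keyword, which mirrors how keyword matching keeps the crawl on topic where plain link-following drifts.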

Page 54

MSCE: Anti-Smoking Video Community

[Figure: anti-smoking video community, with regions labeled A through D]

A. “Graphic Australian Anti-Smoking Ad”

▫ 2.5 million views

B. “How to quit smoking”

▫ Bridge between 3 sub-communities

C. Superhero sub-community

D. Superhero bridge videos

▫ “Star Wars Anti Smoking Ad”

▫ “Anti-Smoking : Superman versus Nick O’Teen (1981)”

Page 55

Sampling on YouTube

• Current work:

▫ Search terms

▫ First N results

▫ YouTube limit of 1,000

• Typical users don’t page through search lists

iProspect.com. iProspect Search Engine User Behavior.

Technical report, iProspect.com, Inc., 2006.

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.

Page 56

Conclusions and Public Health Implications

| Conclusion | Implication |
| Users leave health topics within a few clicks | One chance to communicate message |
| Influential authors are involved in the community | Simply posting a video is not sufficient |
| Users with affinities to the topic can be found | Surveillance and communication are possible |
| Communities can be used for sampling | Keyword-based approaches can be augmented |

Burton, S., et al.

Public Health Community Mining in YouTube

In Proceedings of the ACM International Health Informatics Symposium, pages 81-90, 2012.

Page 57

Horizontal Health Communication

Abroms, L. and Lefebvre, R. C.

Obama's Wired Campaign: Lessons for Public Health Communication.

Journal of Health Communication, 14(5):415-423, 2009

1. Dissemination

2. Feedback

Page 58

Comparison of Communities and Information Dissemination

• What health topics are discussed?

• How do they spread?


Page 59

Public Health Surveillance in the Blogosphere

• Everyone is a publisher

• Link to other blogs

• Establish credibility

Image: http://datamining.typepad.com/gallery/blog-map-gallery.html

Page 60

Mommy-Blogs

• Mothers are highly influential in health decisions (Daniel 2009)

• Blog communities influence social norms (Wei 2004)

Daniel, K.

The Power of Mom in Communicating Health.

American Journal of Public Health, 99(12):2119, 2009.

Wei, C.

Formation of Norms in a Blog Community.

Into the Blogosphere: Rhetoric, Community, and Culture in Weblogs. 2004.

Page 61

Health Topics on Mommy-Blogs

• Community of 450 blogs

| Topic | Count | Percent |
| Autism | 113 | 0.34 |
| CMV | 1 | 0.00 |
| Down Syndrome | 31 | 0.09 |
| FAS | 2 | 0.01 |
| SIDS | 17 | 0.05 |
| Pregnancy | 1,008 | 3.01 |
| All Entries | 33,527 | 100.00 |

Page 62

Parallel Mommy-verses

• Build mommy-communities in Twitter and the Blogosphere

• Evaluate differences

▫ Network structure

▫ Health topics frequency

▫ Likelihood of reiterating

Image: http://www.psychedelicjunction.com/2011/04/what-are-parallel-universes.html

Page 63

Implications for Health Communication

• Know what is being said

• Identify influential users

▫ Popular/respected

▫ Bridge nodes

• How to best get messages “passed along”
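One way to make "popular" and "bridge" users concrete is sketched below. It assumes a link graph given as hypothetical `(source, target)` pairs, uses in-degree as a stand-in for popularity, and treats a bridge node as one whose removal disconnects the (undirected) network. The slides do not say which measures were actually used, so this is only an illustration.

```python
from collections import defaultdict


def in_degree(edges):
    """Count incoming links per node; edges are (source, target) pairs."""
    counts = defaultdict(int)
    for _src, dst in edges:
        counts[dst] += 1
    return dict(counts)


def count_components(nodes, adj):
    """Count connected components among `nodes`, ignoring edge direction."""
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr in nodes and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return components


def bridge_nodes(edges):
    """Return nodes whose removal splits the network into more components."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = set(adj)
    baseline = count_components(nodes, adj)
    # Brute force: fine for a sketch, too slow for very large networks.
    return {n for n in nodes if count_components(nodes - {n}, adj) > baseline}


# Toy network: two triangles joined through node "x".
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "x"),
             ("x", "d"), ("d", "e"), ("e", "f"), ("f", "d")]
print(in_degree(toy_edges))
print(bridge_nodes(toy_edges))
```

In the toy network, nodes `c`, `x`, and `d` are the ones a message must pass through to travel between the two groups.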


Surveillance of Health Advice

• Do people seek health advice?

• Are they receiving answers?

Observe

Predict

Discover


Is Health Data Too Private?

• Would you post that online?

• Our hypothesis:

▫ People are asking questions and receiving answers

▫ More social capital = Better leverage for advice

Burton, S. H., Tanner, K. W., and Giraud-Carrier, C. G. Leveraging Social Networks for Anytime-Anyplace Health Information. In Submission.


Benefits of Social Media

• No search result list

• Personalization

• Versatility

• Credibility


Our Study

• Platform: Twitter

▫ Public data

• Health topic: Dental advice

▫ Everyone manages dental health

▫ Not too private

▫ Easy vocabulary


Mining Dental Advice – Step 1

• Identify dental tweets

▫ Observe all tweets

▫ Filter by:

Tooth, teeth, dental, dentist, gums, molar, moler, floss, toothache


“Ugh I have the worst tooth ache every…#CantDeal” [sic.]

“I got a massive sweet tooth”
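The keyword filter in Step 1 can be sketched as follows. The keyword list is taken from the slide; case-insensitive substring matching is an assumption, since the slides do not say how tweets were matched, and `is_dental_tweet` is a hypothetical name.

```python
import re

# Keywords from the slide, including the deliberate misspelling "moler".
DENTAL_KEYWORDS = [
    "tooth", "teeth", "dental", "dentist", "gums",
    "molar", "moler", "floss", "toothache",
]

# Assumption: match keywords anywhere in the text, ignoring case.
DENTAL_PATTERN = re.compile(
    "|".join(re.escape(word) for word in DENTAL_KEYWORDS),
    re.IGNORECASE,
)


def is_dental_tweet(text):
    """Return True if the tweet text contains any dental keyword."""
    return DENTAL_PATTERN.search(text) is not None


examples = [
    "Ugh I have the worst tooth ache every…#CantDeal",
    "I got a massive sweet tooth",
    "Heading to the gym after work",
]
for tweet in examples:
    print(is_dental_tweet(tweet), "-", tweet)
```

Both example tweets from the slide pass the filter, which shows why a keyword match alone is noisy: "sweet tooth" is not a dental complaint, so later steps have to narrow the set.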


Mining Dental Advice – Step 2

• Identify advice-seeking questions

▫ Look for: “anybody”, “anyone”, “any1” and “?”

▫ Human raters fine-tune


“Can anyone suggest some home remedies for a #toothache?”

“does anyone know how long it takes for swelling on your mouth to go down after getting teeth out?”
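A minimal sketch of the Step 2 heuristic, assuming it means "one of the asking words followed somewhere later by a question mark" (the "anyone … ?" shorthand used elsewhere in the talk suggests this ordering, but the exact rule is not given). The function name is hypothetical.

```python
import re

# Asking words from the slide, followed later in the tweet by a "?".
# DOTALL lets the match span line breaks inside a tweet.
ADVICE_PATTERN = re.compile(
    r"\b(?:anybody|anyone|any1)\b.*\?",
    re.IGNORECASE | re.DOTALL,
)


def is_likely_advice_seeking(text):
    """Return True if the tweet looks like a question addressed to anyone."""
    return ADVICE_PATTERN.search(text) is not None


candidates = [
    "Can anyone suggest some home remedies for a #toothache?",
    "does anyone know how long it takes for swelling on your mouth "
    "to go down after getting teeth out?",
    "Anyone else think the tooth fairy is underpaid?",
    "I got a massive sweet tooth",
]
for tweet in candidates:
    print(is_likely_advice_seeking(tweet), "-", tweet)
```

The third candidate matches the pattern without being a request for advice, which is the kind of case the human raters on the slide are there to remove.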


Mining Dental Advice – Step 3

• Identify answers

▫ Search for: @user-name

▫ Within 48 hours

▫ Verify “in-reply-to” original tweet


“@Dray_Z try gurgling with warmm salt water or put a tea bag btween the ones that hurt” [sic.]
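The three answer criteria in Step 3 can be combined as in the sketch below. The `Tweet` record and its field names are hypothetical stand-ins for whatever the collected data looked like; only the three checks (mention of the asker, 48-hour window, in-reply-to the original tweet) come from the slide.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Tweet:
    # Hypothetical record layout, not the actual Twitter API schema.
    tweet_id: int
    author: str
    text: str
    created_at: datetime
    in_reply_to_id: Optional[int] = None


REPLY_WINDOW = timedelta(hours=48)


def find_answers(question: Tweet, candidates: List[Tweet]) -> List[Tweet]:
    """Return candidate tweets that satisfy all three answer criteria."""
    # Match "@author" as a whole handle, ignoring case.
    mention = re.compile(
        r"(?<!\w)@" + re.escape(question.author) + r"(?!\w)", re.IGNORECASE
    )
    answers = []
    for tweet in candidates:
        delay = tweet.created_at - question.created_at
        if (
            mention.search(tweet.text)
            and timedelta(0) <= delay <= REPLY_WINDOW
            and tweet.in_reply_to_id == question.tweet_id
        ):
            answers.append(tweet)
    # Earliest answer first, so response time is answers[0] minus the question.
    return sorted(answers, key=lambda t: t.created_at)


question = Tweet(1, "Dray_Z", "does anyone know a fix for a toothache?",
                 datetime(2012, 1, 1, 12, 0))
replies = [
    Tweet(3, "late_friend", "@Dray_Z feeling better?",
          datetime(2012, 1, 3, 13, 0), in_reply_to_id=1),   # 49 hours later
    Tweet(4, "other_friend", "@Dray_Z nice photo",
          datetime(2012, 1, 1, 12, 10), in_reply_to_id=99),  # different thread
    Tweet(2, "friend", "@Dray_Z try warm salt water",
          datetime(2012, 1, 1, 12, 5), in_reply_to_id=1),
]
answers = find_answers(question, replies)
print([t.tweet_id for t in answers])
```

Only the reply that mentions the asker, arrives within the window, and points back to the original tweet is kept.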


Results

• 2 weeks of tweets

▫ 1 million dental tweets (74,000 per day)

▫ 2,035 likely advice-seeking (anyone … ?)

▫ 432 genuine advice-seeking

▫ 140 (32%) received at least one response

▫ 5.5 minutes to response (median)
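The funnel above can be checked with simple arithmetic. The sketch below uses only the figures on the slide; the variable names are illustrative, and the "share judged genuine" ratio is derived here rather than reported in the talk.

```python
DAYS = 14                  # 2 weeks of tweets
DENTAL_PER_DAY = 74_000    # dental tweets per day
LIKELY_SEEKING = 2_035     # matched the "anyone ... ?" heuristic
GENUINE_SEEKING = 432      # confirmed by human raters
ANSWERED = 140             # received at least one response

# 74,000 per day over 14 days is 1,036,000, i.e. roughly 1 million.
total_dental = DENTAL_PER_DAY * DAYS

# Share of heuristic matches that raters judged genuine (about 21%).
genuine_share = GENUINE_SEEKING / LIKELY_SEEKING

# Share of genuine questions that were answered (about 32%, as on the slide).
response_rate = ANSWERED / GENUINE_SEEKING

print(f"dental tweets: {total_dental:,}")
print(f"genuine among likely: {genuine_share:.1%}")
print(f"answered among genuine: {response_rate:.1%}")
```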


Benefits of Social Capital

• More likely to receive a response

• Receive responses faster


Who is Answering?

• Answers come from people you know


Relationship | Percent
No relation | 6.6
Responder following asker | 93.0
Asker following responder | 70.0
Mutual following and follower | 69.5
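The relationship categories overlap (a mutual pair also counts in both one-directional rows), which is why the percentages do not sum to 100. A minimal sketch of how such a table could be computed, assuming a hypothetical `following` map from each user to the set of accounts they follow:

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def relationship_flags(asker: str, responder: str,
                       following: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Classify one (asker, responder) pair; categories are not exclusive."""
    responder_follows_asker = asker in following.get(responder, set())
    asker_follows_responder = responder in following.get(asker, set())
    return {
        "no_relation": not (responder_follows_asker or asker_follows_responder),
        "responder_following_asker": responder_follows_asker,
        "asker_following_responder": asker_follows_responder,
        "mutual": responder_follows_asker and asker_follows_responder,
    }


def summarize(pairs: List[Tuple[str, str]],
              following: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return the percentage of pairs falling in each category."""
    counts: Counter = Counter()
    for asker, responder in pairs:
        for name, flag in relationship_flags(asker, responder, following).items():
            counts[name] += int(flag)
    total = len(pairs)
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy data: who follows whom.
following = {
    "alice": {"bob"},
    "bob": {"alice"},
    "carol": {"alice"},
    "dave": set(),
}
pairs = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"), ("alice", "bob")]
summary = summarize(pairs, following)
print(summary)
```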


Conclusions

• People are seeking dental advice on Twitter

• Answers come frequently and quickly

• Users with more social capital are more likely to receive answers


Predicting Substance Abuse

• Identifying Trends

▫ Content of tweets

▫ Social network

Observe

Predict

Discover


Do People Tweet About That?

“So my family knows I smoke weed. The only one that doesn't really care or seem to concern is my pops” [sic.]

“if u dont like that i smoke weed then u dont like me... Weed is BIG part of my laugh. now pass me the blunt” [sic.]

“No wonder I smoke weed. Stupid people stress me out.”


Mining Process

A. Collect Marijuana Users

B. Collect Non-Marijuana Users

C. Build User Profiles

D. Induce Predictive Model

E. Analyze Model

Intervention (Future Work)

F. Predict Likely Users


Collecting Users

• Keyword filters

• Pilot study: “I smoke weed” (50 users)

▫ 36% - Definitely marijuana users

▫ 25% - Explicitly said it, but possibly joking

▫ 19% - At least positive sentiment

▫ 78% - These three combined

• Non-marijuana users


Building User Profiles

• Complete tweet history (up to 3200)

• Follower List

• Following List

• User-supplied description
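The four profile components listed above map naturally onto a small record type. The sketch below is illustrative only: the class, field, and function names are hypothetical, and it assumes tweets arrive newest-first so that truncation keeps the most recent 3,200.

```python
from dataclasses import dataclass, field
from typing import List, Set

MAX_TWEETS = 3200  # history cap stated on the slide


@dataclass
class UserProfile:
    screen_name: str
    description: str = ""                               # user-supplied description
    tweets: List[str] = field(default_factory=list)     # tweet history, newest first
    followers: Set[str] = field(default_factory=set)    # accounts following the user
    following: Set[str] = field(default_factory=set)    # accounts the user follows


def build_profile(screen_name, description, tweets, followers, following):
    """Assemble a profile, keeping at most MAX_TWEETS tweets."""
    return UserProfile(
        screen_name=screen_name,
        description=description,
        tweets=list(tweets)[:MAX_TWEETS],
        followers=set(followers),
        following=set(following),
    )


profile = build_profile(
    "example_user",
    "just here for the jokes",
    [f"tweet {i}" for i in range(4000)],
    followers=["a", "b"],
    following=["b", "c", "c"],
)
print(len(profile.tweets), len(profile.followers), len(profile.following))
```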


Feature Extraction

• Author-LDA

▫ “day today good time tonight happy”

▫ “real tho man gotta life twitter yo hit”

• Personal pronouns

▫ “My step-mom…”

▫ Bootstrap training set

• Traits from theoretical models
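As an illustration of the personal-pronoun idea, the sketch below counts first-person mentions such as "my step-mom" in a user's tweets. The relation word list is a hand-written, hypothetical seed; the slide mentions bootstrapping a training set, which this simple pattern match does not attempt to reproduce.

```python
import re
from collections import Counter

# Hypothetical seed list of relation terms; not taken from the original work.
RELATION_TERMS = ["mom", "step-mom", "dad", "step-dad", "pops",
                  "brother", "sister", "family"]

# Longest terms first so "step-mom" is tried before "mom".
_alternatives = "|".join(
    re.escape(term) for term in sorted(RELATION_TERMS, key=len, reverse=True)
)
MY_RELATION = re.compile(r"\bmy\s+(" + _alternatives + r")\b", re.IGNORECASE)


def first_person_mentions(tweets):
    """Count "my <relation>" mentions across a user's tweets."""
    counts = Counter()
    for text in tweets:
        for term in MY_RELATION.findall(text):
            counts[term.lower()] += 1
    return counts


sample = [
    "My step-mom is visiting this weekend",
    "So my family knows I smoke weed.",
    "lost my momentum today",
]
mentions = first_person_mentions(sample)
print(mentions)
```

Counts like these could feed a user profile as features about family context, one of the areas covered by the risk and protective factors cited on this slide.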


Hawkins, J., Catalano, R., and Miller, J. Risk and Protective Factors for Alcohol and Other Drug Problems in Adolescence and Early Adulthood: Implications for Substance Abuse Prevention. Psychological Bulletin, 112(1):64, 1992.


The Predictive Model

• Comprehensibility

• Collective classification

▫ Predict personal traits

▫ Predict traits of friends

▫ Weighted, directed edges
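The slides do not specify a collective classification algorithm. As one simple illustration, the sketch below uses an iterative weighted-neighbor scheme: each unlabeled user's score blends a content-only estimate with the weighted average of the scores of the users they point to. The function name, the `alpha` mixing weight, and the edge semantics (an edge `(u, v)` means `u` is informed by `v`) are all assumptions.

```python
from typing import Dict, List, Tuple


def collective_classify(
    local_scores: Dict[str, float],
    edges: Dict[Tuple[str, str], float],
    known_labels: Dict[str, float],
    alpha: float = 0.5,
    iterations: int = 20,
) -> Dict[str, float]:
    """Blend each user's own score with their neighbors' scores.

    local_scores: content-only probability per unlabeled user.
    edges: weight of each directed edge (u, v); u is informed by v.
    known_labels: users with fixed labels (1.0 or 0.0), never updated.
    """
    neighbors: Dict[str, List[Tuple[str, float]]] = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, []).append((v, weight))

    scores = dict(local_scores)
    scores.update(known_labels)  # labeled users start at, and keep, their label

    for _ in range(iterations):
        updated = {}
        for node in scores:
            if node in known_labels:
                updated[node] = known_labels[node]
                continue
            nbrs = [(v, w) for v, w in neighbors.get(node, []) if v in scores]
            total_weight = sum(w for _, w in nbrs)
            if total_weight == 0:
                updated[node] = local_scores[node]  # no usable neighbors
                continue
            nbr_avg = sum(w * scores[v] for v, w in nbrs) / total_weight
            updated[node] = alpha * local_scores[node] + (1 - alpha) * nbr_avg
        scores = updated
    return scores


result = collective_classify(
    local_scores={"a": 0.5, "b": 0.5},
    edges={("a", "user"): 3.0, ("a", "non_user"): 1.0, ("b", "non_user"): 1.0},
    known_labels={"user": 1.0, "non_user": 0.0},
)
print(result)
```

In the toy example, `a` and `b` have identical content-only scores, but `a` ends up higher because its heaviest edge points to a labeled user. This is the kind of signal that weighted, directed edges are meant to carry.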


Analysis and Validation

• Compare to theory

▫ “Risk and protective factors”

• Subjective validation

• Objective validation of easily-labeled traits


Future Work

• Personalized communication

• Intervention

• Communication with family/friends


Proposed Schedule

Sec. | Topic | Venue | Target
2 | Public Health Community Mining in YouTube | ACM International Health Informatics Symposium (IHI) | Published
4 | Leveraging Social Networks for Anytime-Anyplace Health Information | Network Modeling Analysis in Health Informatics and Bioinformatics (NetMAHIB) | In Submission
1 | Local Community Mining in Directed Graphs | Journal of Social Network Analysis and Mining (SNAM) | June 2012
3 | Mining the Spread of Health Content in Social Media | International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction (SBP) | August 2012
5 | Mining Social Media for Trends among Substance Abusers | ACM Transactions on Knowledge Discovery from Data (TKDD) | February 2013


Contributions

• Computational techniques

▫ Local, directed community mining

▫ Community mining for sampling

▫ Mining rare and meaningful traits in short text

▫ Combination of text mining and social network analysis for prediction

• Implications for Health Surveillance

▫ YouTube as a source of communities

▫ Health differences across platforms

▫ Health advice in social media

▫ Prediction of high-risk individuals

Observe

Predict

Discover


Questions