The Cure: Making a game of gene selection for breast cancer survival prediction


Upload: goodb

Post on 10-May-2015


DESCRIPTION

Background: Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility and biological interpretability. Methods that take advantage of structured prior knowledge (e.g. protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes previously unheard of.

Objective: The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the players’ prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game.

Methods: We developed and evaluated an online game called “The Cure” that captured information from players regarding genes for use in predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10-year survival.

Results: Between its launch in Sept. 2012 and Sept. 2013, The Cure attracted more than 1,000 registered players who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as Cancer, Disease Progression, and Recurrence (P < 1.1e-07). In terms of the accuracy of models trained using them, these gene sets provided comparable performance to gene sets generated using other methods, including those used in commercial tests. The Cure is available at http://genegames.org/cure/

TRANSCRIPT


Benjamin M. Good¹, Karthik Gangavarapu¹, Salvatore Loguercio¹, Obi L. Griffith², Max Nanis¹, Chunlei Wu¹, Andrew I. Su¹

¹The Scripps Research Institute, ²Washington University School of Medicine

Molecular survival prediction

How Gene Wiki?

REFERENCES

CONTACT

Benjamin Good: bgood@scripps.edu, @bgood

Andrew Su: asu@scripps.edu, @andrewsu


Cure2.0: Interactive, Collaborative, Genomic Decision Tree Construction, now live!

FUNDING

ACKNOWLEDGEMENTS

Thanks to all of the players of The Cure!

Crowdsourcing via scientific discovery games

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924).

The Cure game. Players alternate turns taking a gene card from the board and adding it to their hand. The tabbed display provides gene annotations (‘ontology’, ‘Rifs’) and views of decision trees constructed by the system using the selected genes. There are one hundred boards to choose from in a given round of the game (four rounds were completed). 

[Figure: find patterns that separate < 10 year from > 10 year survival; make predictions on new samples.]

• With tens of thousands of measurements but only hundreds of samples, many possible patterns are found.

• But which ones are real?

• Which genes should we use to build predictors?
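The concern in the bullets above can be made concrete with a small simulation. This is only an illustrative sketch, not part of the poster's analysis: the function name, the sample and feature counts, and the use of random binary features are all assumptions chosen for brevity. It shows that when there are many more candidate features than samples, some feature with no real relationship to the outcome will still appear to predict it well.

```python
import random


def best_chance_accuracy(n_samples=20, n_features=2000, seed=0):
    """Return the best training accuracy achieved by any purely random feature."""
    rng = random.Random(seed)
    # Balanced, arbitrary outcome labels: 1 = survived > 10 years, 0 = did not.
    labels = [i % 2 for i in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # A random binary "expression" feature with no relation to the labels.
        feature = [rng.randint(0, 1) for _ in range(n_samples)]
        agree = sum(f == y for f, y in zip(feature, labels))
        # A one-gene rule may read the feature in either direction.
        accuracy = max(agree, n_samples - agree) / n_samples
        best = max(best, accuracy)
    return best


if __name__ == "__main__":
    # Noise alone yields an apparently strong "signature" on the training data.
    print(best_chance_accuracy())
```

Because every feature here is noise, any high accuracy reported is a chance pattern, which is why a principled way of choosing genes matters.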

Online games are successfully tapping into the knowledge and reasoning abilities of thousands of people [4].

• Devise protein folding algorithms

• Design RNA molecules

The purpose

Prior knowledge encoded in protein-protein interaction databases [1,2] and pathway databases [3] has been used to improve prediction.

What about knowledge that is not recorded in structured databases?

1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology.

2. Winter et al. (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology.

3. Liu et al. (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics.

4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology.

5. Wang et al. (2013) WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Research.

• Goal: pick the best set of genes.

• Best: the gene set that produces the best decision tree classifier.

• Classifier: created using training data and selected genes, used to predict 10-year survival.

• Score: accuracy of the tree inferred using the selected genes.
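The scoring rule above can be sketched in a few lines. This is a deliberately simplified stand-in, not the game's actual classifier: instead of a full decision tree it uses a one-gene, one-threshold "stump", and it reports training accuracy. The function names, gene names (`GENE_A`, `GENE_B`) and expression values are hypothetical.

```python
def stump_accuracy(values, labels):
    """Best accuracy of a single-threshold rule on one gene's expression values."""
    n = len(labels)
    # Majority-class baseline, used when no threshold helps.
    best = max(sum(labels), n - sum(labels)) / n
    for threshold in sorted(set(values)):
        # Predict class 1 when expression is above the threshold, else class 0.
        correct = sum((v > threshold) == bool(y) for v, y in zip(values, labels))
        # The rule may also be read in the opposite direction.
        best = max(best, correct / n, (n - correct) / n)
    return best


def score_hand(hand, expression, labels):
    """Score a hand as the accuracy of the best one-gene stump among its genes."""
    return max(stump_accuracy(expression[gene], labels) for gene in hand)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = survived > 10 years, 0 = did not.
    labels = [0, 0, 1, 1]
    expression = {
        "GENE_A": [1.0, 1.2, 3.1, 3.3],
        "GENE_B": [2.0, 0.5, 2.1, 0.4],
    }
    print(score_hand(["GENE_A"], expression, labels))  # 1.0
    print(score_hand(["GENE_B"], expression, labels))  # 0.75
```

A real decision tree would combine several genes from the hand, and accuracy would normally be estimated on held-out samples. The sketch only shows how a hand of genes maps to a single score.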

The Cure is a game designed to focus the collective intelligence of a diverse community on the challenge of selecting genes for building prognostic classifiers.

The rules

The game

Results – recruitment and engagement

• One year, 1077 players, 9904 games played

Key result: When used to train machine learning models for survival prediction, genes selected at high frequency by the player community performed comparably to genes selected using statistical approaches and to genes used in commercial tests.

Results – knowledge captured

Workflow for Synthesizing Knowledge Regarding Gene Selection

1. Select a set of played games based on player information such as education.

2. Measure the frequency with which each gene was selected by these players across many different games and boards. Each time a gene is added to a hand a ‘vote’ is recorded for that gene.

3. Measure the likelihood of observing the number of votes a gene has received by chance and calculate a P value for that gene.

4. Rank genes by P value and select those with P <= 0.001.
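Steps 2 to 4 of the workflow might be sketched as follows. The poster does not state which null model was used to compute the P values, so the binomial model here is an assumption. The function names, the `opportunities` input (how often each gene was available to be picked) and the `p_chance` parameter are likewise hypothetical.

```python
from collections import Counter
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rank_genes(hands, opportunities, p_chance, alpha=0.001):
    """Rank genes by the P value of their vote counts; keep those with P <= alpha.

    hands: list of hands, each a list of gene symbols chosen in one game.
    opportunities: dict mapping gene -> number of times it could have been picked.
    p_chance: assumed probability that a random pick selects a given available gene.
    """
    # Step 2: one vote each time a gene is added to a hand.
    votes = Counter(gene for hand in hands for gene in hand)
    # Step 3: chance of seeing at least this many votes under the assumed null model.
    p_values = {
        gene: binomial_tail(votes[gene], n, p_chance)
        for gene, n in opportunities.items()
    }
    # Step 4: rank by P value and apply the cutoff.
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return [(gene, p) for gene, p in ranked if p <= alpha]


if __name__ == "__main__":
    # Hypothetical data: G1 is picked in all 10 games, G2 in only 2 of them.
    hands = [["G1", "G2"], ["G1", "G2"]] + [["G1"]] * 8
    print(rank_genes(hands, {"G1": 10, "G2": 10}, p_chance=0.2))
```

Step 1 (choosing which games to include, for example only those from expert players) corresponds to filtering `hands` before calling `rank_genes`.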

3 gene sets extracted from all games, games from experts, and games from novices

Overlap of ‘expert’ player selected gene set with known predictor gene sets

Disease terms associated with 61 genes preferentially selected by all players using WebGestalt [5] with adj. P < 10^-5

Overlap between genes selected by different player populations

61 genes preferentially selected by all players, P <= 0.001
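Overlap comparisons like those named in the captions above could be quantified as in the sketch below. The poster does not say how overlap significance was assessed, so the hypergeometric test is an assumption, and the function name, gene symbols and universe size are hypothetical.

```python
from math import comb


def overlap_p_value(set_a, set_b, universe_size):
    """Return (shared genes, P of at least that much overlap by chance).

    Treats set_b as a random draw of len(set_b) genes from a universe of
    universe_size genes, of which len(set_a) are "marked".
    """
    shared = set_a & set_b
    k, a, b = len(shared), len(set_a), len(set_b)
    # Hypergeometric upper tail: probability of k or more marked genes in the draw.
    favourable = sum(
        comb(a, i) * comb(universe_size - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    return shared, favourable / comb(universe_size, b)


if __name__ == "__main__":
    # Hypothetical gene sets drawn from a universe of 10 genes.
    players = {"G1", "G2", "G3"}
    known_signature = {"G2", "G3", "G4"}
    print(overlap_p_value(players, known_signature, universe_size=10))
```

The choice of gene universe (for example, all genes on the game boards versus all genes in the dataset) strongly affects the resulting P value, so it should be stated alongside any reported overlap.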

Changes in Cure 2.0

1. Adapted for advanced players / scientists.

2. Players choose from all genes in dataset.

3. Clinical features supported.

4. Players control structure of trees.

5. Scoring based on accuracy, complexity and novelty of trees.

6. Collaborative – players can build from other players’ trees.

7. Trees can also be kept private.

http://genegames.org/cure/

Try it Now!