Analysis of GEO datasets using GEO2R
Parthav Jailwala
CCR Collaborative Bioinformatics Resource
CCR/NCI/NIH
Outline
• Background on GEO datasets
• What is GEO2R and how can it help you
• How to use GEO2R
• Options and features
• Limitations and caveats
• Hands-on exercise
• An international public repository that archives and freely distributes high-throughput microarray & NGS data submitted by the scientific community
• About a billion individual gene expression measurements, derived from over 100 organisms and covering a wide range of biological questions
• Data can be explored, queried and visualized using user-friendly Web-based tools
GEO data organization
[ GPLxxx ] [ GSMxxx ] [ GSExxx ]
[ GDSxxx ]
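The four accession prefixes in the diagram correspond to GEO's record types: Platform (GPL), Sample (GSM), Series (GSE) and curated DataSet (GDS). As a small illustration, the hypothetical helper below (`classify_accession` is not part of any GEO tool) maps an accession string to its record type.

```python
import re

# Accession prefixes from the diagram and the record type each denotes.
GEO_RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}


def classify_accession(accession):
    """Return the GEO record type for an accession such as 'GSE18388'."""
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]


print(classify_accession("GSE18388"))  # Series
```

GEO2R takes a Series (GSE) accession as input, so a check like this can catch a Sample or Platform accession entered by mistake.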
What kinds of data does GEO host?
• GEO was designed around the common features of most of the high-throughput and parallel molecular abundance-measuring technologies in use today. These include:
– Gene expression profiling by microarray or next-generation sequencing
– Non-coding RNA profiling by microarray or next-generation sequencing
– Chromatin immunoprecipitation (ChIP) profiling by microarray or next-generation sequencing
– Genome methylation profiling by microarray or next-generation sequencing
– Genome variation profiling by array (arrayCGH)
– SNP arrays
– Serial Analysis of Gene Expression (SAGE)
– Protein arrays
What is GEO2R ?
• Interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions
• Uses the GEOquery and limma R packages from the Bioconductor project
• Simple interface that allows users to perform R statistical analysis without command line expertise
• Does not rely on curated ‘DataSets’ and interrogates the original Series Matrix data file directly
How to use GEO2R
• Enter a Series accession number
– Follow a link from a Series record, OR
– Enter a Series accession number
• Define Sample groups
– At least 2 and up to 10 groups can be defined
• Assign Samples to each group
– Not all samples in a series need to be selected
• Perform the test
– Assess sample value distributions
– Edit default test parameters
• Interpret the results
– Table of the top 250 genes ranked by p-value
– Select columns to be included in the output table
– Edit the test parameters, then click Recalculate to apply the edits
– Download the tab-delimited table and open in Excel
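The ranking step above can also be reproduced on the downloaded tab-delimited table outside Excel. The following is a minimal sketch, not GEO2R's own code: the column names `ID`, `P.Value` and `logFC` are assumptions about the download's header, and the three example rows are made up for illustration.

```python
import csv
import io


def top_genes(table_text, n=250, p_column="P.Value"):
    """Return the n rows with the smallest p-values from a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    # Skip rows whose p-value is missing or marked NA.
    rows = [row for row in reader if row.get(p_column) not in (None, "", "NA")]
    rows.sort(key=lambda row: float(row[p_column]))
    return rows[:n]


# Made-up example table (assumed column names, hypothetical probe IDs).
example = (
    "ID\tP.Value\tlogFC\n"
    "probe_a\t0.04\t1.2\n"
    "probe_b\t0.001\t-2.0\n"
    "probe_c\t0.2\t0.3\n"
)

for row in top_genes(example, n=2):
    print(row["ID"], row["P.Value"], row["logFC"])
```

With a real download, the file contents would be read from disk and passed in as `table_text`; the header should be checked first, since the actual column names may differ from the ones assumed here.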
Options and features
• Value distribution
– Number summary or boxplot
– Median-centered values indicate that the data are normalized and cross-comparable
• Options
– Apply adjustment of p-values
– Apply log transformation to the data
– Category of Platform annotation to display on results: NCBI generated (preferred) or Submitter supplied
• Profile graph
• R script
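To make the "adjustment of p-values" option more concrete, here is a minimal sketch of one widely used method, the Benjamini-Hochberg false discovery rate procedure. The slides do not say which adjustment is applied, so treating Benjamini-Hochberg as the method is an assumption, and the input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the adjusted value of any larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up raw p-values
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Each raw p-value is scaled by the number of tests divided by its rank, which is why adjusted values are never smaller than the raw ones.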
Limitations & caveats
• Check that Sample values are comparable
– Assess the value distribution boxplot
– Review the GEO Series experiment description
• Data type restriction
– Some GEO data do not have data tables (e.g. high-throughput sequencing or genome tiling arrays)
• Within-Series restriction
– No cross-series comparisons
• 255 Sample limit
• 10 minute timeout
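The first caveat above, checking that Sample values are comparable, can be approximated numerically as well as by eye. The sketch below computes a five-number summary per sample and flags sets of samples whose medians differ widely. The sample labels, the values and the `tolerance` threshold are all hypothetical; this is not how GEO2R itself assesses the data.

```python
import statistics


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def medians_look_centered(samples, tolerance=0.5):
    """True if the sample medians lie within `tolerance` of each other."""
    medians = [statistics.median(values) for values in samples.values()]
    return max(medians) - min(medians) <= tolerance


# Made-up values: two samples on a log-like scale, one on a raw scale.
samples = {
    "sample_a": [6.1, 7.0, 7.9, 8.4, 9.2],
    "sample_b": [6.0, 7.1, 8.0, 8.5, 9.0],
    "sample_c": [60, 130, 250, 400, 900],
}

for name, values in samples.items():
    print(name, five_number_summary(values))

print(medians_look_centered(samples))  # False: sample_c is on a different scale
```

A result of `False` here would suggest reviewing the Series description to see how the values were processed before comparing the groups.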
Summary statistics from Limma
Hands-on exercise
• Google: GSE18388
• Microarray Analysis of Space-flown Murine Thymus Tissue
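As an alternative to searching for the accession, the GEO2R page for a Series can usually be opened directly by URL. The sketch below builds such a link; the base URL and the `acc` query parameter are assumptions based on the NCBI site layout and may change, and `geo2r_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed GEO2R entry point; verify against the NCBI site before relying on it.
GEO2R_BASE = "https://www.ncbi.nlm.nih.gov/geo/geo2r/"


def geo2r_url(series_accession):
    """Build a GEO2R link for a Series accession such as 'GSE18388'."""
    accession = series_accession.strip().upper()
    if not accession.startswith("GSE"):
        raise ValueError("GEO2R expects a Series (GSE) accession")
    return GEO2R_BASE + "?" + urlencode({"acc": accession})


print(geo2r_url("GSE18388"))
# https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE18388
```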
Further learning resources on GEO2R
• Full description:
– http://www.ncbi.nlm.nih.gov/geo/info/geo2r.html
• YouTube video:
– https://www.youtube.com/watch?v=EUPmGWS8ik0
• Example walkthrough:
– http://www.bioinformatics.polimi.it/masseroli/bcbmm/material/practices/E2_GEO2R_Bioconductor_Tutorial.docx