Applied Bioinformatics
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
Course overview
What is bioinformatics
Data-driven science: the creation and advancement of databases, algorithms, and computational and statistical methods to solve theoretical and practical problems arising from the management and analysis of biological data.
Major research areas: sequence alignment, gene finding, genome assembly, protein structure prediction, gene expression and regulation, protein interaction, drug design, genome-wide association studies, computational evolutionary biology, etc.
Applied bioinformatics module
Not a comprehensive guide to all facets of bioinformatics
To equip you with the computational understanding and expertise needed to solve bioinformatics problems that you will likely encounter in your research.
Applied Bioinformatics, Spring 2011 2
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
http://www.bioinformatics.ca/links_directory/
Course content and grades
Date  Subject                                                    Instructor  Homework (HW)
2/14  Finding information about genes                            Zhang
2/16  Navigating sequenced genomes                               Zhang
2/18  Pairwise sequence alignment and database search            Zhao
2/21  Multiple sequence alignment                                Zhao
2/23  Inferring phylogenetic relationships from sequence data    Zhao        HW I distribution (20 pts Zhao + 10 pts Zhang)
2/25  Protein sequence annotation                                Tabb
2/28  Protein structure prediction and visualization             Tabb        HW I due
3/2   Protein identification by mass spectrometry                Tabb        HW II distribution (20 pts Tabb)
3/4   Gene prediction and annotation                             Bush
3/7   Finding regulatory and conserved elements in DNA sequence  Bush        HW II due
3/9   Assessing the impact of genetic variation                  Bush        HW III distribution (20 pts Bush)
3/11  Supervised analysis of gene expression data                Zhang
3/14  Unsupervised analysis of gene expression data              Zhang       HW III due
3/16  Functional interpretation of gene lists                    Zhang
3/18  Biological pathways                                        Zhang
3/21  Biological networks                                        Zhang       HW IV distribution (30 pts Zhang)
3/25                                                                         HW IV due by 5pm

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the HW scores (100 pts in total).
A: 85-100; B: 70-84; C: 55-69; D: 40-54; F: 0-39
Course materials and assignments
Lecture slides available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture
Homework assignments available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution date (2/23, 3/2, 3/9, 3/21)
Homework assignments are due at 5pm on the due date (2/28, 3/7, 3/14, 3/25). There will be a 10% per day deduction for late reports.
Email your reports in pdf, doc, or docx format to the corresponding instructor(s):
HW I: [email protected]; [email protected]
HW II: [email protected]
HW III: [email protected]
HW IV: [email protected]
Textbook (optional): Dear, Paul H. (2007) Methods Express: Bioinformatics. Scion, ISBN 978-1904842163.
Finding information about genes
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
When do we need gene information?
Case 1
From Prof. Randy Blakely (Pharmacology): “We have hit an uncharacterized gene in our hunt for SERT interacting proteins=****** that appears to be highly depleted when extracts are made from SERT KO mice. Can you help us come up with some ideas as to what this gene might be.”
Case 2
From Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”
Resources
Entrez Gene http://www.ncbi.nlm.nih.gov/gene
NCBI/NIH
All completely sequenced genomes
One gene per page
Ensembl BioMart http://www.ensembl.org/biomart/martview
EMBL-EBI and Sanger Institute
Vertebrates and other selected eukaryotic species
Batch information retrieval
Gene Cards http://www.genecards.org
Weizmann Institute of Science, Israel
Comprehensive information on human genes
WikiGenes http://www.wikigenes.org
MIT
Collaborative annotation in a wiki system
GLAD4U http://bioinfo.vanderbilt.edu/glad4u
Vanderbilt
Genes related to a specific topic
Learning objectives
To gain a basic understanding of the Entrez Gene system
To be able to retrieve information for individual genes using Entrez Gene
To gain a basic understanding of the Ensembl BioMart system
To be able to retrieve information for a list of genes using Ensembl BioMart
Entrez Gene: overview
Data source
Automated analyses and curation by NCBI staff
Data stored in flat files
Updated continuously
Unique gene identifier
Entrez Gene uses unique integers (GeneID) as stable identifiers for genes, e.g. the GeneID for human tumor protein p53 (TP53) is 7157
The GeneID assigned to each record is species specific, e.g. the GeneID for the mouse ortholog of TP53 (Trp53) is 22059
Statistics as of February 2011
7.2 million records distributed among 7039 taxa
45,227 records for human
Query system
Entrez
Entrez Gene: Entrez
An integrated search and retrieval system that provides access to many discrete databases at the NCBI website.
All databases indexed by Entrez can be searched via a single query string, including Entrez Gene
Supports Boolean operators AND, OR, NOT
Supports search term tags to limit the search to particular fields (title, organism, etc.)
Sample query
transporter[title] AND ("Homo sapiens"[organism] OR "Mus musculus"[organism])
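The sample query combines field tags with Boolean operators. A minimal Python sketch of assembling such a query string is given here; the helper names (tagged, any_of, all_of) are hypothetical and not part of any NCBI library.

```python
def tagged(term: str, field: str) -> str:
    """Attach an Entrez field tag, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


# Rebuild the sample query from the slide.
query = all_of(
    tagged("transporter", "title"),
    any_of(
        tagged("Homo sapiens", "organism"),
        tagged("Mus musculus", "organism"),
    ),
)
print(query)
```

The resulting string can be pasted into the Entrez Gene search box as is.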
Entrez Gene: search result
[Screenshot: Entrez Gene search results page, with callouts for Display Setting, Summary record, Advanced search, Filtering, Related data, and Help]
Entrez Gene: Gene record (I)
Each Gene record integrates multiple types of information
Gene type: tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown
Nomenclature, summary descriptions, accessions of gene specific and gene product-specific sequences, chromosomal location, reports of pathways and protein interactions, associated markers and phenotypes
Links to other databases at NCBI including literature citations, sequences, variations, and homologs
Links to databases outside of NCBI
Entrez Gene: Gene record (II)
[Screenshot: Gene record for human TP53 at http://www.ncbi.nlm.nih.gov/gene/7157, with callouts for Help, Expand, Export, and New search]
Entrez Gene: advanced ways of accessing
FTP download ftp://ftp.ncbi.nlm.nih.gov/gene/README
E-Utilities (Entrez Programming Utilities)
Server-side programs that provide a stable interface into the Entrez query and database system
Uses a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
Works with any computer language that can send a URL to the E-utilities server and interpret the XML response, e.g. Perl, Python, Java, and C++.
Combining E-utilities components to form customized data pipelines within these applications is a powerful approach to data manipulation.
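As a rough sketch of the URL-based access described above, the Python below builds an ESearch URL for the Entrez Gene database and parses an ESearch-style XML reply. The function names are hypothetical, and the sample reply is hand-written for illustration (it is not real server output). A real client would also need to fetch the URL and follow NCBI's usage guidelines.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address of the E-utilities server.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "gene", retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode escapes spaces and brackets."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text: str) -> tuple[int, list[str]]:
    """Return the total hit count and the list of IDs in the reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


# Hand-written reply in the ESearch layout, for illustration only.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>
  <IdList><Id>7157</Id><Id>22059</Id></IdList>
</eSearchResult>"""

url = esearch_url("kinase[title] AND mouse[Organism] AND 1[Chromosome]")
count, ids = parse_esearch(SAMPLE_RESPONSE)
print(url)
print(count, ids)
```

Keeping URL construction separate from parsing makes each half easy to check without contacting the server.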
Entrez Gene: documentation and publications
http://www.ncbi.nlm.nih.gov/books/NBK3841/
Maglott et al. NAR, 39:D52-D57, 2011
Entrez Gene: exercise
Questions
1. How many records can we get for a simple search of “kinase” in Entrez Gene?
2. Use Boolean operators and search term tags to search for mouse genes located on chromosome 1 and with kinase in the title. With the default display setting, what is the first hit?
3. Click on the first hit and identify how many publications in PubMed are associated with this gene.
4. Identify which proteins interact with the protein product of this gene.
Answers
1. 244,301 records
2. Query term: kinase[title] AND mouse[Organism] AND 1[Chromosome]; first hit: Epha4
3. Bibliography section: 220 citations in PubMed
4. Interactions section: 3 proteins, Epha4, Ngef, and Vav2
Ensembl
Genome databases for vertebrates and other selected eukaryotic species
Automated annotation system at EBI
Data stored in a relational database
Updated periodically with versions
Unique gene identifier
Ensembl uses unique strings (Ensembl gene ID) as stable identifiers for genes, e.g. the Ensembl gene stable ID for human tumor protein p53 (TP53) is ENSG00000141510
The Ensembl gene ID assigned to each record is species specific, e.g. the Ensembl gene stable ID for the mouse ortholog of TP53 (Trp53) is ENSMUSG00000059552
Clear gene, transcript, and protein relationship, e.g. ENSG00000141510 => 17 transcripts (e.g. ENST00000445888) => 13 proteins (e.g. ENSP00000391478)
Statistics as of February 2011 (version 61)
55 species
53,630 genes for human
Other species available in the recently expanded system Ensembl Genomes, http://www.ensemblgenomes.org
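The stable IDs quoted above follow a regular pattern: the letters ENS, an optional species prefix (none for human, MUS for mouse), a feature letter, and an 11-digit number. A minimal Python sketch of splitting an ID into these parts is shown here. The names parse_stable_id and EnsemblId are hypothetical, only four common feature letters are covered, and version suffixes such as ".2" are not handled.

```python
import re
from typing import NamedTuple

# Feature letters covered by this sketch (not an exhaustive list).
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

# ENS + optional species prefix + feature letter + 11 digits.
STABLE_ID = re.compile(r"^ENS([A-Z]*?)([GTPE])(\d{11})$")


class EnsemblId(NamedTuple):
    species_prefix: str  # empty string for human IDs
    feature: str
    number: int


def parse_stable_id(stable_id: str) -> EnsemblId:
    """Split an Ensembl stable ID into prefix, feature type, and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    prefix, letter, digits = match.groups()
    return EnsemblId(prefix, FEATURE_TYPES[letter], int(digits))


for example in ("ENSG00000141510", "ENSMUSG00000059552",
                "ENST00000445888", "ENSP00000391478"):
    print(example, parse_stable_id(example))
```

A check like this is handy for sorting a mixed ID list into genes, transcripts, and proteins before choosing a BioMart filter.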
BioMart: a batch information retrieval system
BioMart is a query-oriented data management system.
Batch information retrieval for complex queries
Particularly suited for providing 'data mining' like searches of complex descriptive data such as those related to genes and proteins
Open source and can be customized
Originally developed for the Ensembl genome databases
Adopted by many other projects including UniProt, InterPro, Reactome, and the Pancreatic Expression Database (see a complete list and get access to the tools from http://www.biomart.org/)
BioMart: basic concepts
Dataset
Filter
Attribute
From Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”
From all human genes (dataset), select those with the listed UniProt IDs (filter), and retrieve their GO annotations (attributes).
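The dataset, filter, and attribute choices can also be written down as a BioMart XML query. The sketch below assembles one in Python for this request. The function name is hypothetical, and the internal names used (hsapiens_gene_ensembl, uniprotswissprot, external_gene_name, go_id, name_1006) are assumptions that differ between Ensembl releases, so they should be checked against the filter and attribute lists of the release in use. The accession P04637 is only a placeholder for the IDs in column A of the spreadsheet.

```python
import xml.etree.ElementTree as ET
from typing import Mapping, Sequence


def build_biomart_query(dataset: str,
                        filters: Mapping[str, Sequence[str]],
                        attributes: Sequence[str]) -> str:
    """Return a BioMart XML query for one dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "1",        # include column names
        "uniqueRows": "1",    # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Filters restrict which genes are returned.
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # Attributes choose which columns are reported.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


accessions = ["P04637"]  # placeholder; substitute the real list
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",                # assumed dataset name
    filters={"uniprotswissprot": accessions},       # assumed filter name
    attributes=["uniprotswissprot", "external_gene_name",
                "go_id", "name_1006"],              # assumed attribute names
)
print(query_xml)
```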
Ensembl BioMart analysis
Choose dataset
Choose database: Ensembl Genes 61
Choose dataset: Homo sapiens genes (GRCh37)
Set filters
Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)
Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)
Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)
Region: filter for genes in a specific chromosome region (e.g. chr1 1:1000000 or 11q13)
Others
Select output attributes
Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
External data: Gene Ontology, IDs in other databases
Expression: anatomical system, development stage, cell type, pathology
Protein domains: SMART, PFAM, Interpro, etc.
Ensembl BioMart: query interface
[Screenshot: Ensembl BioMart query interface, with callouts for Choose dataset, Set filters, Select output attributes, Help, Results, Count, and Perl API]
Ensembl Biomart: sample output
[Screenshot: sample BioMart results table, with a callout for exporting all results to a file]
Ensembl Biomart: documentation and publications
http://www.ensembl.org/info/website/tutorials/index.html
Smedley et al. BMC Genomics, 10:22, 2009
Ensembl Biomart analysis: exercise 1
Question
I have two Ensembl gene IDs, ENSG00000162367 and ENSG00000187048. How do I get their gene names from HGNC, IDs from EntrezGene, and any probes that contain these gene sequences from the Affymetrix microarray platform HC G110?
Choose dataset
Database: Ensembl Genes 61
Dataset: Homo sapiens genes (GRCh37.p2)
Set filters
Under GENE: check the ID list limit box
Select the header Ensembl Gene IDs and enter the gene IDs into the box
Select output attributes
Select Features (default)
Under EXTERNAL: External References, select 'HGNC Symbol' and 'EntrezGene ID'
Under EXTERNAL: Microarray, select 'Affy HC G110'
Click on Count and then Results
Export all results to File, TSV
Ensembl Biomart analysis: exercise 2
Question
How can I get the 2 kb upstream sequences for all mouse genes on chromosome 1?
Choose dataset
Database: Ensembl Genes 61
Dataset: Mus musculus genes (NCBIM37)
Set filters
Under REGION: check Chromosome, select 1
Select output attributes
Select Sequences
Under SEQUENCES: select Flank (Gene)
Under Upstream flank: check and enter 2000 into the box
Under Header Information, Gene Information, check Description
Click on Count (1916/36817 genes) and then Results
Export all results to File, FASTA format
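Once the FASTA file has been exported, the upstream sequences can be loaded for further analysis. A minimal Python sketch of a FASTA reader is given here; the function name is hypothetical, and the two records are made-up placeholders (with shortened sequences) that only imitate the layout of an exported file. The exact header layout depends on the header attributes selected.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its joined sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)  # sequences may span several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Placeholder records, not real BioMart output.
SAMPLE_FASTA = """>GENE_A|placeholder description A
ACGTACGTAC
GGTTAACC
>GENE_B|placeholder description B
TTGACA
"""

records = read_fasta(SAMPLE_FASTA.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

With a real export, the same function can be applied to an open file handle, e.g. `read_fasta(open(path))`, where `path` is the location of the downloaded file.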
Summary
Entrez Gene http://www.ncbi.nlm.nih.gov/gene
NCBI/NIH
All completely sequenced genomes
Data stored in flat files
Updated continuously
Unique gene identifier: Entrez Gene ID
Query system: Entrez
Output: one-gene-at-a-time
Ensembl BioMart http://www.ensembl.org/biomart/martview
EMBL-EBI and Sanger Institute
Mainly vertebrates
Data stored in a relational database
Updated periodically with versions
Unique gene identifier: Ensembl Gene ID
Query system: BioMart
Output: multiple genes at the same time