Alastair Kerr, Ph.D., WTCCB Bioinformatics Core
An introduction to DNA and Protein Sequence Databases
Questions to address
What are the main sequence databases?
Which one to use for:
  Looking up a gene name/identifier from a paper
Identifiers
  What should I use and why?
Coordinate-based systems
Annotation
  Protein domains
  Gene Ontology
Database Varieties
Sequence Warehouses
  “everything under one roof”
Genome Databases
  Containing single genome dataset(s)
Reference Sets
  Often human curated, the 'standard' for a particular gene or protein from which variants are defined
Specialist
  Short reads from next generation sequencing (Short Read Archive)
  [EST] Expressed sequence tags and [GSS] Genome survey sequence
NCBI GenBank, EMBL, DDBJ
Sharing primary data
NCBI
Warehouse: GenBank <live demo>
  NR dataset: NR = non-redundant (but it is not...)
Reference Dataset: RefSeq
Genome Datasets: NCBI Genomes
EMBL
Warehouse: EMBL
  Historically the protein set was called translated EMBL (TrEMBL)
  The gold standard reference set was called SwissProt
Reference set: UniProt
  UniProtKB/Swiss-Prot: manually annotated and reviewed
  UniProtKB/TrEMBL: automatically annotated and not reviewed
Genome database: Ensembl <live demo>
Live Demo
Search GenBank for human adh4
  How many are there?
  How many should there be?
  Why are some different from those found in UniProt?
  Are there better databases to use?
  Which identifier should you use in your lab book?
We should now be able to answer these:
What are the main sequence databases?
Which one to use for:
  Looking up a gene identifier from a paper
  Searching for a gene name
  Searching for an orthologous gene from another species
Identifiers
Or what to write in your lab book
How to identify a feature
Gene/protein name
  Common name
  Standardised name
Database identifier
  Unique for each database
  Some have revision numbers
Position in genome
  Dependent on genome build
Position in a gene/protein
  Protein domains
Never use common names
Example of EPHB2
  EPH receptor B2; EPHT; MGC87492; DRT; ERK; EPTH3; Hek5
  Renal carcinoma antigen NY-REN-47; Tyro5; hEK5; CAPB; HEK5; PCBC; EK5
  EPHB2; TYRO5
  protein-tyrosine kinase HEK5
  EPH-like kinase 5
  EphB; ephrin type-B receptor 2; elk-related tyrosine kinase
  Tyrosine-protein kinase TYRO5; eph tyrosine kinase 3
  Tyrosine-protein kinase receptor EPH-3
Consortia identifiers
Most key species have a consortium / group / community that provides the key identifiers in the field
Humans
  Was HUGO (HUman Genome Organisation), now the HGNC (HUGO Gene Nomenclature Committee)
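As a rough illustration of why the consortium symbol is the thing to record, the sketch below maps a handful of the EPHB2 common names from the earlier slide onto the one standardised symbol. The table and the function name `approved_symbol` are made up for this example; a real lookup would query the HGNC data rather than a hand-typed dictionary. Note that some aliases are ambiguous in practice (ERK is far more widely used for MAP kinases), which is exactly the problem with common names.

```python
# Illustrative only: a hand-typed alias table built from the EPHB2 slide.
# In practice the mapping would come from the nomenclature committee's data.
ALIAS_TO_SYMBOL = {
    alias.upper(): "EPHB2"
    for alias in ("EPHB2", "DRT", "ERK", "Hek5", "Tyro5", "CAPB", "PCBC", "EK5")
}


def approved_symbol(name):
    """Return the standardised symbol for a common name, or None if unknown."""
    return ALIAS_TO_SYMBOL.get(name.strip().upper())


if __name__ == "__main__":
    for name in ("Tyro5", "hek5", "not-a-gene"):
        print(name, "->", approved_symbol(name))
```

Writing the approved symbol (plus a database identifier) in the lab book avoids having to repeat this kind of guesswork later.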
Database Identifiers
Every dataset has its own system of identifying genes/proteins
Example: Human ADH4
  Ensembl: ENSG00000198099, ENST00000423445, ENSP00000397939
  SwissProt: ADH4_HUMAN, P08319
  RefSeq: NM_000670.3, NP_000661.2
  GenBank: gi|71565152|ref|NP_000661.2|
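Because each database uses a recognisable identifier shape, an identifier copied from a paper can often be traced to its source by pattern alone. The following is a minimal sketch using the ADH4 examples above; the regular expressions are simplified assumptions (for instance, the SwissProt accession pattern covers only the older six-character form) and the function names are invented for illustration.

```python
import re

# Simplified, illustrative patterns. Order matters: the RefSeq patterns are
# tested before the SwissProt ID pattern, which would otherwise also match
# an unversioned RefSeq identifier such as NM_000670.
PATTERNS = [
    ("Ensembl gene", re.compile(r"^ENSG\d{11}(\.\d+)?$")),
    ("Ensembl transcript", re.compile(r"^ENST\d{11}(\.\d+)?$")),
    ("Ensembl protein", re.compile(r"^ENSP\d{11}(\.\d+)?$")),
    ("RefSeq mRNA", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("SwissProt ID", re.compile(r"^[A-Z0-9]+_[A-Z0-9]+$")),
    ("SwissProt accession", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
]


def classify(identifier):
    """Guess which database an identifier comes from, based on its shape."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "unknown"


def parse_gi_header(header):
    """Split a 'gi|number|db|accession|' style string into its fields."""
    fields = header.strip("|").split("|")
    return {"gi": fields[1], "db": fields[2], "accession": fields[3]}


if __name__ == "__main__":
    for ident in ("ENSG00000198099", "NP_000661.2", "ADH4_HUMAN", "P08319"):
        print(ident, "->", classify(ident))
    print(parse_gi_header("gi|71565152|ref|NP_000661.2|"))
```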
Keeping Track of Changes
Gene models can change
  Will the ID you used yesterday still get the same sequence today?
  Or: how do you get the latest version of a sequence?
Keeping Track of Changes
GenBank: GI or “GenBank identifier”
  GI number changes each time; the old one is often removed when it gets superseded
SwissProt: Accession and ID
  The accession (P08319) is the stable identifier; the ID (ADH4_HUMAN) is a readable entry name that can change
RefSeq and Ensembl: revision-based IDs
  NM_000670.3, ENSG00000198099.1
  General form: XXX.number
    XXX always retrieves the latest version
    XXX.number retrieves that specific version
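The XXX.number convention is easy to handle in a script. The sketch below, with the hypothetical helper name `split_version`, separates the stable part of a RefSeq or Ensembl identifier from its revision number, so that a lab-book entry can record both.

```python
def split_version(identifier):
    """Split 'XXX.number' into (XXX, number); the version is None if absent."""
    base, sep, version = identifier.rpartition(".")
    if sep and version.isdigit():
        return base, int(version)
    # No trailing '.number': treat the whole string as the unversioned ID.
    return identifier, None


if __name__ == "__main__":
    for ident in ("NM_000670.3", "ENSG00000198099.1", "ENSG00000198099"):
        print(ident, "->", split_version(ident))
```

An identifier whose version is None would fetch whatever is current, so the versioned form is the safer one to write down.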
Demo: Retrieving old data
Defining: Chromosome coordinates
Demo: Ensembl
Chromosome Positions
Features identified by chromosome & position
  File formats: BED, WIG, GFF ...
All major genome databases store features as coordinates
Ubiquitous in deep sequencing studies
Note: coordinates change depending on the assembly
  Always note the build number of the genome assembly if you are using coordinates
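One way to make the "always note the build" advice hard to forget is to keep the assembly name inside the feature record itself. The sketch below does that, and also shows a second pitfall of coordinate files: BED uses 0-based, half-open intervals whereas GFF uses 1-based, inclusive ones. The class and function names, the assembly label "GRCh37" and the example coordinates are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    assembly: str  # genome build the coordinates refer to (example label)
    chrom: str
    start: int     # 1-based, inclusive (as in GFF)
    end: int       # 1-based, inclusive


def to_bed(feature):
    """Format as the first three BED columns (0-based start, exclusive end)."""
    return f"{feature.chrom}\t{feature.start - 1}\t{feature.end}"


def from_bed(line, assembly):
    """Parse the first three BED columns; the build must be supplied,
    because a plain BED line does not carry it."""
    chrom, start, end = line.split("\t")[:3]
    return Feature(assembly, chrom, int(start) + 1, int(end))


if __name__ == "__main__":
    feature = Feature("GRCh37", "chr4", 1000, 2000)  # made-up coordinates
    print(to_bed(feature))
    print(from_bed(to_bed(feature), "GRCh37"))
```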
Coordinates
New concept of PATCH
  This is an assembly update without changing the primary sequence
  However additional 'improved' contigs map to the reference
  These will be in the next assembly: you may wish to use them
Genome assembly names can differ by institution but are the same underlying sequence: GenBank/UCSC
Demo: liftOver
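To illustrate the naming point, the sketch below treats a UCSC-style name and its Genome Reference Consortium equivalent as the same human assembly. The three name pairs are widely used equivalences but are not taken from the slides, and the helper `same_assembly` is hypothetical; it only compares names and does not convert coordinates, which is what liftOver is for.

```python
# Commonly quoted human assembly name equivalences (not from the slides).
UCSC_TO_GRC = {
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "hg38": "GRCh38",
}


def canonical_name(name):
    """Map a UCSC-style name to its GRC/NCBI name; other names pass through."""
    return UCSC_TO_GRC.get(name, name)


def same_assembly(name_a, name_b):
    """True if two assembly names refer to the same build."""
    return canonical_name(name_a) == canonical_name(name_b)


if __name__ == "__main__":
    print(same_assembly("hg19", "GRCh37"))
    print(same_assembly("hg19", "GRCh38"))
```

If two coordinate sets come from different builds, they need converting before they can be compared.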
Protein Domains: Protein Positions
Protein Domains
InterPro
  Site that stores information on known protein domains from different projects
Covered by InterPro
  Similarities between proteins
  Conserved region in an alignment
  Conserved protein folds
Not covered by InterPro
  Predicted features on primary protein sequence
    Trans-membrane regions
    Low complexity regions
    Phosphorylation sites
Domain Complexity
Many different types of domains
Vast amounts of domain based data
Many different projects identifying them
Old way of interacting with a database
Request information
Retrieve information
  From a single source
Distributed Annotation
DAS clients
Different types of software can have a built-in DAS client
  Genome browsers: Ensembl, IGB, IGV ...
  Multiple alignment editors: Jalview, STRAP
  3D structures: Spice
  3D electron microscopy data: PeppeR
Demo
Annotation
Problem: Many ways to name a gene
  Reductase = oxidase = dehydrogenase
Gene Ontology Consortium [GO]
  GO terms standardise naming
  Note that errors may still occur in the assignment of terms
  Found in RefSeq, UniProt and most genome databases
  GO browsers e.g. AmiGO
Gene Ontology
all [535063 gene products]
  GO:0008150 : biological_process [404412 gene products]
  GO:0005575 : cellular_component [372379 gene products]
  GO:0003674 : molecular_function [436597 gene products]
Gene Ontology: acyclic tree (strictly, a directed acyclic graph)
Evidence Codes
Experimental
  EXP: Inferred from Experiment
  IDA: Inferred from Direct Assay
  IPI: Inferred from Physical Interaction
  IMP: Inferred from Mutant Phenotype
  IGI: Inferred from Genetic Interaction
  IEP: Inferred from Expression Pattern
Computational
  ISS: Inferred from Sequence or Structural Similarity
  ISO: Inferred from Sequence Orthology
  ISA: Inferred from Sequence Alignment
  ISM: Inferred from Sequence Model
  IGC: Inferred from Genomic Context
  RCA: Inferred from Reviewed Computational Analysis
Author Statement
  TAS: Traceable Author Statement
  NAS: Non-traceable Author Statement
Curator Statement
  IC: Inferred by Curator
  ND: No biological Data available
Automatically-assigned
  IEA: Inferred from Electronic Annotation
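The evidence codes make it possible to filter annotations by how they were derived. The sketch below groups the codes into the categories listed above and keeps only annotations from chosen categories. The annotation tuples are invented placeholders (the gene names are not real and the GO terms are simply the three root terms), and `filter_by_category` is a hypothetical helper, not part of any GO tool.

```python
# Evidence code -> category, following the groupings listed on the slide.
EVIDENCE_CATEGORY = {
    "EXP": "Experimental", "IDA": "Experimental", "IPI": "Experimental",
    "IMP": "Experimental", "IGI": "Experimental", "IEP": "Experimental",
    "ISS": "Computational", "ISO": "Computational", "ISA": "Computational",
    "ISM": "Computational", "IGC": "Computational", "RCA": "Computational",
    "TAS": "Author Statement", "NAS": "Author Statement",
    "IC": "Curator Statement", "ND": "Curator Statement",
    "IEA": "Automatically-assigned",
}


def filter_by_category(annotations, allowed=("Experimental",)):
    """Keep (gene, go_term, evidence_code) tuples whose code is in an allowed category."""
    return [a for a in annotations if EVIDENCE_CATEGORY.get(a[2]) in allowed]


if __name__ == "__main__":
    # Made-up annotations purely for demonstration.
    annotations = [
        ("geneA", "GO:0003674", "IDA"),
        ("geneA", "GO:0008150", "IEA"),
        ("geneB", "GO:0005575", "TAS"),
    ]
    print(filter_by_category(annotations))
    print(filter_by_category(annotations, allowed=("Experimental", "Author Statement")))
```

How strict to be depends on the question: dropping electronically inferred annotations removes a large share of the data, so it is a trade-off rather than a default.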
Best annotation?
Use DAS clients to get more information on genomic, gene or protein features
  Protein domains are especially useful
The Gene Ontology is useful for general classification
  BUT be aware of where the annotation was derived from