
Upload: curtis-parks

Post on 08-Jan-2018


Page 1:

Bioinformatics Workshops 1 & 2

1. use of public database/search sites
- range of data and access methods
- interpretation of search results
- understanding the meaning & effect of search (e.g. BLAST) parameters

2. functional analysis of single sequences
- i.e. how to work out what your unknown protein might be doing
- complex searches for (e.g.) patterns of motifs & secondary structure elements

Page 2:

Workshop 1. overall survey of data

Mutation between species -> orthologs

Mutation between duplications -> domains

Search methods – 2D vs. 3D

Search methods – similarity vs. models vs. comparative

Main data axes

Main Portals

Database searches vs. genome browsers

Finding similar sequences

BLAST, et al

E-values!

Biological origin of sequences

Genes vs. loci

Random sequences

Page 3:

Using Public Data Resources

• There is (are!) data out there
• There are methods out there
• Quite often they are combined
– BLAST searches of sequence databases

Page 4:

Notes…

• Sequence databases
– Entrez queries…
• Genome browsers/databases
• Regulatory Elements
• SNPs
• Functional Sequence Models (Pfam domains, etc.)
• Expression Data
– Array data
– in situ data
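The "Entrez queries" note can be made concrete with NCBI's E-utilities, which expose Entrez searches over HTTP. The following is a minimal sketch, assuming the public esearch endpoint; the search term and the function name `build_esearch_url` are illustrative and not taken from the slides. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint (an assumption of this sketch;
# the slides only say "Entrez queries").
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for one Entrez query against one database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical query: nucleotide records annotated with gene name pax6
# from zebrafish, using Entrez field tags in square brackets.
url = build_esearch_url("nucleotide", "pax6[Gene Name] AND Danio rerio[Organism]")
print(url)
```

The same term string can be pasted into the Entrez search box on the web site, so a query can be tuned interactively before it is scripted.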

Page 5:

Notes II

• Blast parameters
– Low complexity: frameshifted cDNA
– miRNAs vs genome
– morpholinos for other genes
– -q -2 for EST vs EST alignments
– Entrez queries
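To see why relaxing the nucleotide mismatch penalty matters for noisy EST vs EST alignments, it helps to score an ungapped alignment by hand. This is a rough sketch, assuming a match reward of +1 and a default mismatch penalty of -3 (the slides give only the -2 value); the function names are made up for illustration.

```python
def ungapped_score(matches, mismatches, reward=1, penalty=-3):
    """Raw score of an ungapped nucleotide alignment."""
    return matches * reward + mismatches * penalty


def break_even_identity(reward, penalty):
    """Fraction of identical positions at which the raw score is exactly zero."""
    return -penalty / (reward - penalty)


# 100 aligned bases, 90 identical, 10 mismatched (e.g. single-pass read errors)
print(ungapped_score(90, 10, penalty=-3))   # 60
print(ungapped_score(90, 10, penalty=-2))   # 70

# Identity below which an alignment stops accumulating positive score
print(break_even_identity(1, -3))           # 0.75
print(break_even_identity(1, -2))
```

With the milder penalty, alignments between two error-prone single-pass reads keep a positive score at lower identity, which is the point of the setting.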

Page 6:

What have we got…

[Diagram labels: genome, locus (~ gene), gene model, primary transcript, mRNA, protein]

Page 7:

Derivative Sequences

mRNA

clone into cDNA library

5’ EST, 3’ EST: single pass sequence from each end of the clone

cDNA sequence: multiple pass sequencing over whole length of the clone
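The relationship between a clone and its two ESTs can be sketched in a few lines. This is an illustration only: the clone sequence and read length are made up, and it assumes the 3’ read is reported as the reverse complement of the clone's 3’ end (as it would be if sequenced inward from that end).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_ests(clone, read_len):
    """Return (5' EST, 3' EST) as single reads from each end of the clone."""
    five_prime = clone[:read_len]
    three_prime = reverse_complement(clone[-read_len:])
    return five_prime, three_prime


clone = "ATGGCTTCGATTCGCTAGCCTAACG"   # hypothetical cDNA insert
five, three = simulate_ests(clone, 8)
print(five)    # ATGGCTTC
print(three)   # CGTTAGGC
```

For a long clone the two reads do not overlap, which is why full-insert (multiple pass) sequencing is needed to obtain the complete cDNA sequence.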

Page 8:

Initial Growth of Databases

• Lots of ESTs were generated

• Some clones were selected for full-insert sequencing -> cDNAs

• cDNAs were translated to yield presumed protein sequences
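The translation step can be sketched as follows. It assumes the standard genetic code, an uppercase ACGT-only sequence, and the simplest possible rule (start at the first ATG, stop at the first in-frame stop codon); real annotation pipelines are more careful, and the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cdna(cdna):
    """Translate from the first ATG to the first in-frame stop codon."""
    start = cdna.find("ATG")
    if start < 0:
        return ""                      # no start codon, no presumed protein
    protein = []
    for i in range(start, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":                  # stop codon ends the reading frame
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cdna("CGTCGTATGGCTTCGATGTAGTACATC"))  # MASM
```

The result is only a presumed protein: nothing in this step confirms that the chosen ATG is the real start or that the cDNA is full length.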

Page 9:

Then Came Genomes

• With increasingly larger fragments of genomic sequence came the ability to align cDNAs to create gene models

• And then to apply our understanding of exon/intron structure to predict theoretical genes…

Page 10:

Introns and Exons

gene model: exon intron exon intron exon

genome:

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

exons:

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

mRNA:

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

splice sites: GTAAG… (donor), …TTTCAG (acceptor)
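The sequences on this slide can be checked against each other directly: removing the two introns from the genomic sequence should leave exactly the mRNA. A small sketch, using the slide's own sequences; the intron boundaries were read off by comparing the genomic and mRNA strings, and the function name is illustrative.

```python
GENOME = ("CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAG"
          "CATTTTATACTCATGCAACGGACCGTGTCAGTATTACAG"
          "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA")
MRNA = ("CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGT"
        "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA")
INTRONS = ["GTAAGTCATCTATATCAATATTATTTCAG", "GTCAGTATTACAG"]


def split_exons(genome, introns):
    """Cut the genome at each intron (in order) and return the exons."""
    exons, pos = [], 0
    for intron in introns:
        # every intron here follows the GT...AG rule
        assert intron.startswith("GT") and intron.endswith("AG")
        start = genome.index(intron, pos)
        exons.append(genome[pos:start])
        pos = start + len(intron)
    exons.append(genome[pos:])
    return exons


exons = split_exons(GENOME, INTRONS)
print(exons)
print("".join(exons) == MRNA)   # True
```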

Page 11:

Gene Predictions

Given:
- coding sequence must run from ATG – STOP codon in-frame
- introns GT. . . . . . AG can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some ‘possible’ splice sites are more likely than others

scan genomic sequence …

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .

candidate spliced sequences:

. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. . .

most likely gene model
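The rule-based part of gene prediction (ATG to in-frame STOP, GT…AG introns spliced out) can be sketched as a brute-force scan of the slide's genomic sequence. This is a toy illustration only: the minimum intron length is an arbitrary assumption, the function names are made up, and no statistical scoring of composition or splice-site strength is attempted.

```python
GENOMIC = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
           "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")
STOPS = {"TAA", "TAG", "TGA"}


def gt_ag_introns(seq, min_len=10):
    """Yield (start, end) of every GT...AG span of at least min_len bases."""
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "GT":
            for j in range(i + min_len, len(seq) + 1):
                if seq[j - 2:j] == "AG":
                    yield i, j


def first_orf(seq):
    """Return the ORF from the first ATG to the first in-frame stop, or None."""
    start = seq.find("ATG")
    if start < 0:
        return None
    for k in range(start, len(seq) - 2, 3):
        if seq[k:k + 3] in STOPS:
            return seq[start:k + 3]
    return None


# Each candidate model removes one possible intron from the genomic sequence
models = {GENOMIC[:i] + GENOMIC[j:] for i, j in gt_ag_introns(GENOMIC)}
print(len(models), "single-intron candidates")
print(first_orf(GENOMIC))   # ATGGCTTCGATGTAG (unspliced)
```

Splicing changes which stop codon is in frame, so each candidate yields a different presumed coding sequence; choosing between candidates is where the statistical evidence comes in.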

Page 12:

Supporting Evidence!

EST evidence

genome

gene model

We note that even though there is good evidence for the existence of all four exons, there is no evidence that all the exons would appear on a real transcript. An alternative transcript, skipping exon 3, would be plausible, if a little unlikely.

This gets less ambiguous as more ESTs are available, and clones are sequenced at both ends (which helps put distant exons into the same transcripts), and eventually full-length transcript sequences are available.

exons: 1 2 3 4

Page 13:

So What’s in the Databases Now?

• At NCBI
– 15,000,000 EST sequences
– 3,329,110 non-redundant DNA sequences (excluding ESTs, etc.)
– 2,693,904 non-redundant translated coding sequences
– 954,378 Protein Reference Sequences (RefSeq)
• But the majority of RefSeq may be translations of theoretical transcripts…

Page 14:

Main Data Axes

• Europe: EBI/EMBL
– Swiss-Prot/TrEMBL/Ensembl/UniProt
• US: NIH/NCBI
– GenBank/UniGene/RefSeq/Entrez
• Japan: DNA Data Bank of Japan
– National Institute of Genetics

Page 15:

Synchronisation…

You submit a sequence

ATCGATCGATCATAGTATGCTAGCTGCTA

GenBank: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA

DDBJ: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA

EMBL: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA

Page 16:

Sequences, Accession Numbers and Genes

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
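The three records on this slide illustrate that an accession identifies a record while the version suffix identifies one particular sequence for it. A short sketch using the slide's own sequences; the helper names are made up for illustration.

```python
RECORDS = {
    "NM_001015922.1": "GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA",
    "BC009638.1":     "GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA",
    "NM_001015922.2": "GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA",
}


def split_accession(acc_version):
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, _, version = acc_version.partition(".")
    return accession, int(version)


def differing_positions(a, b):
    """0-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


v1 = RECORDS["NM_001015922.1"]
v2 = RECORDS["NM_001015922.2"]
print(split_accession("NM_001015922.2"))   # ('NM_001015922', 2)
print(differing_positions(v1, v2))         # [2]
print(RECORDS["BC009638.1"] in v1)         # True
```

Same accession with a new version means the sequence itself changed (here a single base), so citing the version, or the gi number that goes with it, pins down exactly which sequence was used.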

Page 17:

Main Data Portals

• NCBI Entrez Databases
• ExPASy Proteomics Server
• DNA Data Bank of Japan DDBJ
• EBI Ensembl Genome Browser
• Santa Cruz Genome Browser