imgc2011 bioinformatics tutorial

IMGS 2011 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory

Upload: deanna-church

Post on 10-May-2015


Page 1: Imgc2011 bioinformatics tutorial

IMGS 2011 Bioinformatics Workshop

Deanna Church, NCBI

Carol Bult, The Jackson Laboratory

Page 2: Imgc2011 bioinformatics tutorial

Intro

• Sequencing Technology: life in the fast lane
• Alignments: things to consider
• File formats: everything you always wanted to know but were afraid to ask
• Tools: pick the right one for the job at hand

Page 3: Imgc2011 bioinformatics tutorial

[Figure: sequencing cost per Kb ($0 to $140) and throughput in gigabases (0 to 70,000) by year, 1990-2009]

Lucinda Fulton, The Genome Center at Washington University

Page 4: Imgc2011 bioinformatics tutorial

Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Page 5: Imgc2011 bioinformatics tutorial

Sequence “Space”

• Roche 454 – Flow space
– Measures pyrophosphate released when a nucleotide is added to a growing DNA chain
– Flow space describes the sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc

• AB SOLiD – Color space
– Sequencing by DNA ligation using synthetic DNA molecules that contain two nested known bases with a fluorescent dye
– Each base is sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

• Illumina/Solexa – Base space
– Single-base extensions of fluorescently labeled nucleotides with protected 3′ OH groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3′ OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
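To make the three "spaces" concrete, here is a minimal Python sketch that re-encodes a base-space sequence in color space and in flow space. It is an illustration only, not vendor code: the function names are hypothetical, the color code uses the usual two-bit XOR scheme for di-base encoding, and the flow order "TACG" is an assumption.

```python
# Two-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_color_space(seq):
    """Encode bases as a leading base plus one color per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(cs):
    """Decode a color-space read (leading base + colors) back to bases."""
    bases = [cs[0]]
    for color in cs[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


def to_flow_space(seq, flow_order="TACG"):
    """Return (nucleotide, homopolymer length) for each flow, cycling flow_order."""
    if not set(seq) <= set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")
    flows, i, k = [], 0, 0
    while i < len(seq):
        nuc = flow_order[k % len(flow_order)]
        run = 0
        while i < len(seq) and seq[i] == nuc:
            run += 1
            i += 1
        flows.append((nuc, run))  # run == 0 means no incorporation in this flow
        k += 1
    return flows


if __name__ == "__main__":
    print(to_color_space("ATGGC"))
    print(from_color_space(to_color_space("ATGGC")))
    print(to_flow_space("TTAG"))
```

Because each color depends on two adjacent bases, every base contributes to two colors, which is one way to read "each base is sequenced twice". In flow space a homopolymer run shows up as a single flow with a larger signal.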

Page 6: Imgc2011 bioinformatics tutorial

Global and local alignments

Optimal global alignment

Needleman-Wunsch

Sequences align essentially from end to end

Optimal local alignment

Smith-Waterman

Sequences align only in small, isolated regions

References

Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.

Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
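The difference between the two approaches shows up clearly in a small dynamic-programming sketch. The Python below computes only the optimal score (no traceback). The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    x, y = "TTTGATTACATTT", "CCCGATTACACCC"
    print(global_score(x, y), local_score(x, y))
```

With these two sequences the shared core dominates the local score, while the global score is pulled down by the mismatched flanks.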

Page 7: Imgc2011 bioinformatics tutorial
Page 8: Imgc2011 bioinformatics tutorial

Hashing methods

Query sequence: MVRRLPERTSTPACE

Word size = 3 (configurable)

Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE

References

Wilbur & Lipman (1983), PNAS 80, 726-30

Lipman & Pearson (1985), Science 227, 1435-1441

Pearson & Lipman (1988), PNAS 85, 2444-2448
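The word-based lookup can be sketched in a few lines of Python. This is a simplified illustration of the idea rather than the FASTA implementation: the function names are hypothetical, and real programs go on to chain and extend the seed hits, which is omitted here.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seq, k=3):
    """Hash table mapping each word to the positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words(seq, k)):
        index[word].append(pos)
    return index


def seed_hits(query, subject, k=3):
    """(query position, subject position) pairs that share an identical word."""
    index = build_index(subject, k)
    return [(qpos, spos)
            for qpos, word in enumerate(words(query, k))
            for spos in index.get(word, [])]


if __name__ == "__main__":
    print(words("MVRRLPERTSTPACE"))
    print(seed_hits("RLPER", "MVRRLPERTSTPACE"))
```

Hits that fall on the same diagonal (constant subject position minus query position) are candidates for extension into a longer alignment.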

Page 9: Imgc2011 bioinformatics tutorial
Page 10: Imgc2011 bioinformatics tutorial
Page 11: Imgc2011 bioinformatics tutorial
Page 12: Imgc2011 bioinformatics tutorial
Page 13: Imgc2011 bioinformatics tutorial

http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial

Page 14: Imgc2011 bioinformatics tutorial

http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial

Page 15: Imgc2011 bioinformatics tutorial

Sensitivity vs. Specificity

Sensitivity = fraction of actual positives that are correctly identified as positive
Specificity = fraction of actual negatives that are correctly identified as negative

                     Predicted positives    Predicted negatives
Actual positives     TP                     FN
Actual negatives     FP                     TN

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
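As a worked example of the two formulas, the Python sketch below computes both measures from confusion-matrix counts. The counts are invented for illustration and do not come from any real aligner benchmark.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were called positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were called negative: TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts: 100 actual positives, 1000 actual negatives.
    tp, fn, fp, tn = 90, 10, 50, 950
    print(f"sensitivity = {sensitivity(tp, fn):.2f}")
    print(f"specificity = {specificity(tn, fp):.2f}")
```

Both functions divide by the number of actual positives or negatives, so they raise ZeroDivisionError when a class is empty.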

Page 16: Imgc2011 bioinformatics tutorial

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 17: Imgc2011 bioinformatics tutorial

Tools

• Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
• Transcriptomics: TopHat, Cufflinks, …
• Variant calling: ssahaSNP, Mosaic, …
• Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq

Page 18: Imgc2011 bioinformatics tutorial

Genome Workbench
http://www.ncbi.nlm.nih.gov/projects/gbench/

Page 19: Imgc2011 bioinformatics tutorial

“Standard” File formats

• Sequence containers: FASTA, FASTQ, BAM/SAM
• Alignments: BAM/SAM, MAF
• Annotation: BED, GFF/GTF/GFF3, WIG
• Variation: VCF, GVF

Page 20: Imgc2011 bioinformatics tutorial

FASTQ: Data Format

• FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
  • Line 1: begins with @, followed by the sequence identifier and an optional description
  • Line 2: the sequence
  • Line 3: begins with + and is optionally followed by the sequence identifier and description
  • Line 4: encoding of quality scores for the sequence in line 2

• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
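The four-line layout described above can be read with a short Python generator. This is a minimal sketch that assumes strictly four lines per record (no line-wrapped sequence or quality strings). The record in the usage example is made up, and `read_fastq` is a hypothetical helper, not part of any library.

```python
import io


def read_fastq(handle):
    """Yield (identifier, description, sequence, quality) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        name, _, description = header[1:].partition(" ")
        yield name, description, seq, qual


if __name__ == "__main__":
    example = "@read1 example description\nGATTACA\n+\nIIIIIII\n"
    for record in read_fastq(io.StringIO(example)):
        print(record)
```

The length check matters because the quality line can itself begin with @, so line position rather than the leading character is what identifies each part of a record.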

Page 21: Imgc2011 bioinformatics tutorial

FASTQ Example

FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.

For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93.

Solexa quality scores have to be converted to PHRED quality scores.
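A rough Python sketch of these conversions follows, using the ASCII offsets and the Solexa-to-PHRED formula given in Cock et al.: Sanger uses offset 33, while Solexa and Illumina 1.3+ use offset 64. The function names are hypothetical, and rounding the converted Solexa score to the nearest integer is an assumption of this sketch.

```python
import math

SANGER_OFFSET = 33    # Sanger FASTQ: PHRED scores, ASCII offset 33
ILLUMINA_OFFSET = 64  # Solexa and Illumina 1.3+ FASTQ: ASCII offset 64


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality to the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def illumina13_to_sanger(qual):
    """Illumina 1.3+ already stores PHRED scores, so only the offset changes."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in qual)


def solexa_to_sanger(qual):
    """Solexa scores need both rescaling to PHRED and a new offset."""
    return "".join(
        chr(round(solexa_to_phred(ord(c) - ILLUMINA_OFFSET)) + SANGER_OFFSET)
        for c in qual
    )


if __name__ == "__main__":
    print(illumina13_to_sanger("hhhh"))
    print(round(solexa_to_phred(0), 2))
```

The two scales are nearly identical for high-quality bases and diverge only at the low end, which is why an unconverted file can go unnoticed until low-quality reads are filtered.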

Page 22: Imgc2011 bioinformatics tutorial

SAM (Sequence Alignment/Map)

• It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
  • Header lines begin with @ (the header section is optional)
  • Alignment section has 11 mandatory fields
– BAM is the binary format of SAM

http://samtools.sourceforge.net/

Page 23: Imgc2011 bioinformatics tutorial

http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields
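A minimal Python sketch for splitting an alignment line into the 11 mandatory fields is given here for illustration. The field names follow the current SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); older versions of the specification name three of them MRNM, MPOS and ISIZE. The example line is made up, and for real work a library such as pysam or the samtools command line is the better choice.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one tab-delimited alignment line into typed mandatory fields."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def read_sam(lines):
    """Yield alignment records, skipping header lines that begin with @."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # FLAG bit 0x4: read is unmapped


def is_reverse(rec):
    return bool(rec.flag & 0x10)  # FLAG bit 0x10: read maps to the reverse strand


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\tSO:coordinate\n",
        "r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\tNM:i:0\n",
    ]
    for rec in read_sam(example):
        print(rec.qname, rec.rname, rec.pos, rec.cigar, is_reverse(rec))
```

POS is 1-based in SAM, unlike the 0-based start coordinate used in BED.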

Page 24: Imgc2011 bioinformatics tutorial

http://samtools.sourceforge.net/SAM1.pdf

Alignment Examples

Alignments in SAM format

Page 25: Imgc2011 bioinformatics tutorial

chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +

Valid BED files
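Lines like these can be read with a few lines of Python. The sketch below is a simplified reader, not a validator for the full format: it handles the three required columns (chrom, start, end) plus the optional name, score and strand columns, and relies on the BED convention that coordinates are 0-based and half-open. The function names are hypothetical.

```python
def parse_bed_line(line):
    """Parse one BED line: three required columns, up to three optional ones."""
    f = line.split()
    if len(f) < 3:
        raise ValueError("BED needs at least chrom, start and end")
    rec = {"chrom": f[0], "start": int(f[1]), "end": int(f[2])}
    # Optional columns are kept as text; zip stops when the line runs out.
    for key, value in zip(("name", "score", "strand"), f[3:6]):
        rec[key] = value
    if rec["start"] > rec["end"]:
        raise ValueError("start must not exceed end")
    return rec


def feature_length(rec):
    """Half-open coordinates: length is simply end - start."""
    return rec["end"] - rec["start"]


if __name__ == "__main__":
    four_col = parse_bed_line("chr1 86114265 86116346 nsv433165")
    six_col = parse_bed_line("chr1 16759829 16778548 chr1:21667704 270866 -")
    print(four_col["name"], feature_length(four_col))
    print(six_col["name"], six_col["strand"], feature_length(six_col))
```

Within one file every line should carry the same number of columns, which is why the 4-column and 6-column examples belong to separate files.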

Page 26: Imgc2011 bioinformatics tutorial

Mouse chrX: 35,000,000-36,000,000

Page 27: Imgc2011 bioinformatics tutorial

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 Build 36

Page 28: Imgc2011 bioinformatics tutorial

NC_000086.6

Page 29: Imgc2011 bioinformatics tutorial

hg19    GRCh37

mm8    MGSCv37

NCBIM37

danRer5    Zv7

Page 30: Imgc2011 bioinformatics tutorial

Assemblies with the same name aren’t always the same

chr21:8,913,216-9,246,964

Page 31: Imgc2011 bioinformatics tutorial

Assemblies with the same name aren’t always the same

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

Page 32: Imgc2011 bioinformatics tutorial

hg19    GRCh37

GCA_000001405.1

Page 33: Imgc2011 bioinformatics tutorial

Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml

This site will be accessible after the meeting. Check back for updates and new tutorials.

Page 34: Imgc2011 bioinformatics tutorial
Page 35: Imgc2011 bioinformatics tutorial

RNA Seq Workflow

• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
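The data summarization step (associating alignments with annotations and producing counts) can be illustrated with a small Python sketch. It is a deliberately naive version that assumes 0-based, half-open intervals and invented gene coordinates. It ignores strand, spliced reads and reads overlapping several genes, all of which dedicated counting tools handle.

```python
from collections import Counter


def count_reads(alignments, genes):
    """Count alignments overlapping each gene.

    alignments: iterable of (chrom, start, end)
    genes: dict mapping gene name -> (chrom, start, end)
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in alignments:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == g_chrom and start < g_end and end > g_start:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical annotation and alignments, for illustration only.
    genes = {"geneA": ("chrX", 100, 200), "geneB": ("chrX", 500, 600)}
    reads = [("chrX", 90, 110), ("chrX", 150, 180), ("chrX", 200, 230),
             ("chr1", 150, 180), ("chrX", 590, 640)]
    for gene, n in sorted(count_reads(reads, genes).items()):
        print(gene, n)
```

The nested loop is quadratic in the worst case; sorted intervals or an interval tree would be the usual choice for real data volumes.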

Page 36: Imgc2011 bioinformatics tutorial

Typical RNA-Seq Project Work Flow

[Figure: Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization and Statistical Analysis]

JAX Computational Sciences Service

Page 37: Imgc2011 bioinformatics tutorial

TopHat

Trapnell et al. (2009). Bioinformatics 25:1105-1111.

http://tophat.cbcb.umd.edu/

Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.

TopHat is better suited to aligning RNA-Seq data than general-purpose short-read aligners (Maq, BWA) because it takes splicing into account during the alignment process.

Page 38: Imgc2011 bioinformatics tutorial

Trapnell C et al. Bioinformatics 2009;25:1105-1111

TopHat is built on the Bowtie alignment algorithm.

Page 39: Imgc2011 bioinformatics tutorial

Cufflinks

Trapnell et al. (2010). Nature Biotechnology 28:511-515.

http://cufflinks.cbcb.umd.edu/

• Assembles transcripts,
• Estimates their abundances, and
• Tests for differential expression and regulation in RNA-Seq samples

Page 40: Imgc2011 bioinformatics tutorial

Galaxy
http://main.g2.bx.psu.edu/

See Tutorial 1

• Build and share data and analysis workflows
• No programming experience required
• Strong and growing development and user community

Page 41: Imgc2011 bioinformatics tutorial

Short Read Archive
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?

Short Read Archive Handbook
http://www.ncbi.nlm.nih.gov/books/NBK47528/

Page 42: Imgc2011 bioinformatics tutorial

http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8

High performance file transfer for getting data from the Short Read Archive

Aspera Connect

Page 43: Imgc2011 bioinformatics tutorial

SRA Toolkit
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Page 44: Imgc2011 bioinformatics tutorial

Galaxy on the Cloud

• Create an Amazon Web Services (AWS) account
– Sign up for Amazon Elastic Compute Cloud (EC2) and
– Amazon Simple Storage Service (S3)
• Use the AWS Management Console to start a master EC2 instance
• Use the Galaxy Cloud web interface to manage the cluster
• Step by step instructions are here:
– https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
• Screencast to demonstrate the sign up process is here:
– https://bitbucket.org/galaxy/galaxy-central/wiki/cloud

Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11(Suppl 12):S4.

Page 45: Imgc2011 bioinformatics tutorial

Why Go to the Cloud?

• File storage and compute needs are much greater for next-gen sequence data
• Amazon cloud provides a scalable, cost-effective solution

Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11(Suppl 12):S4.

Page 46: Imgc2011 bioinformatics tutorial

Some Tips

• You’ll need a credit card to activate the service
• You’ll need to be near a phone so that you can verify your identity during the sign up process
• There is a time lag between signing up for AWS and getting access

Page 47: Imgc2011 bioinformatics tutorial

[Figure: Galaxy interface panels: Tools, Dialog/Parameter Selection, History]

Let’s Get Started!