imgc2011 bioinformatics tutorial

IMGS 2011 Bioinformatics Workshop

Deanna Church, NCBI

Carol Bult, The Jackson Laboratory

Intro

Sequencing Technology: life in the fast lane
Alignments: things to consider
File formats: everything you always wanted to know but were afraid to ask
Tools: pick the right one for the job at hand

[Figure: sequencing cost per Kb (axis from $0.00 to $140.00) and throughput in gigabases (axis from 0 to 70,000), plotted for 1990 through 2009. Source: Lucinda Fulton, The Genome Center at Washington University]

Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”

• Roche 454 – Flow space
– Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
– Flow space describes the sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc

• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
– Each base is sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related

• Illumina/Solexa – Base space
– Single-base extensions of fluorescently labeled nucleotides with protected 3' OH groups
– Sequencing via cycles of base addition and detection, followed by deprotection of the 3' OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related

• GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related

http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html


Global and local alignments

Optimal global alignment

Needleman-Wunsch

Sequences align essentially from end to end

Optimal local alignment

Smith-Waterman

Sequences align only in small, isolated regions

References

Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.

Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
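The contrast between global and local alignment can be sketched with a small dynamic-programming scorer. This is only an illustration, not the code of any tool named in this workshop: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for the example.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the optimal alignment score of strings a and b.

    local=False: Needleman-Wunsch style global score (end to end).
    local=True:  Smith-Waterman style local score (best matching region).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diagonal, up, left)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


global_score = alignment_score("ACGT", "AGT")
local_score = alignment_score("TTTACGTTTT", "GGACGGG", local=True)
```

With these assumed scores, the global call pays for the one-base gap across the full length of both sequences, while the local call only rewards the shared "ACG" region and ignores the unrelated flanks.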

Hashing methods

Query sequence: MVRRLPERTSTPACE

Word size = 3 (configurable)

Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
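The word decomposition shown on this slide can be reproduced in a few lines. This is a minimal sketch of the hashing idea (break the query into overlapping words, then look up shared words in a second sequence). The helper names `words`, `build_index`, and `seed_hits`, and the subject sequence, are made up for illustration; real tools add scoring and extension steps on top of this.

```python
from collections import defaultdict


def words(sequence, word_size=3):
    """Return every overlapping word of length word_size, in order."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def build_index(sequence, word_size=3):
    """Map each word to the list of positions (0-based) where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        index[word].append(position)
    return dict(index)


def seed_hits(query, subject, word_size=3):
    """Return (query_position, subject_position) pairs for shared words."""
    index = build_index(subject, word_size)
    return [(q_pos, s_pos)
            for q_pos, word in enumerate(words(query, word_size))
            for s_pos in index.get(word, [])]


query = "MVRRLPERTSTPACE"
query_words = words(query)
hits = seed_hits(query, "AAPERTSAA")  # hypothetical subject sequence
```

Here `query_words` holds the 13 three-letter words listed on the slide, and `hits` reports where the words PER, ERT, and RTS occur in both sequences.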

References

Wilbur & Lipman (1983), PNAS 80, 726-30

Lipman & Pearson (1985), Science 227, 1435-1441

Pearson & Lipman (1988), PNAS 85, 2444-2448

http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial

Sensitivity vs. Specificity

Sensitivity = fraction of actual positives that are identified as positive (true positives, TP)
Specificity = fraction of actual negatives that are identified as negative (true negatives, TN)

                    Predicted positives   Predicted negatives
Actual positives    TP                    FN
Actual negatives    FP                    TN

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
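As a worked example of the two formulas, the sketch below computes both measures from the four cells of the table. The counts are invented for illustration and do not come from any data set in this workshop.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


# Hypothetical counts: 100 actual positives and 1,000 actual negatives.
tp, fn, fp, tn = 90, 10, 50, 950
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

With these made-up counts, 90 of 100 actual positives are found and 950 of 1,000 actual negatives are correctly rejected.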

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Tools

Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …

Transcriptomics: TopHat, Cufflinks, …

Variant calling: ssahaSNP, Mosaik, …

Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq

Genome Workbench: http://www.ncbi.nlm.nih.gov/projects/gbench/

“Standard” File formats

Sequence containers: FASTA, FASTQ, BAM/SAM

Alignments: BAM/SAM, MAF

Annotation: BED, GFF/GTF/GFF3, WIG

Variation: VCF, GVF

FASTQ: Data Format

• FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
  • Line 1: begins with “@”, followed by a sequence identifier and an optional description
  • Line 2: the sequence
  • Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
  • Line 4: encoding of quality scores for the sequence in line 2

• References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
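The four-line layout described above can be read with a short parser. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequence or quality lines); the function name `read_fastq` and the example record are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+': %r" % plus)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        yield header[1:], sequence, quality


# Hypothetical record, not taken from a real run.
example = "@read1 example\nGATTACA\n+\nIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
```

Checking that lines 2 and 4 have the same length is a cheap way to catch truncated files before they reach an aligner.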

FASTQ Example

FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.

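For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example,

Illumina 1.3+ files store PHRED quality scores ranging from 0 to 62 (ASCII offset 64); Sanger files store PHRED quality scores ranging from 0 to 93 (ASCII offset 33).

Solexa quality scores (Solexa and early Illumina files) use a different scale and have to be converted to PHRED quality scores.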
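The conversions mentioned here can be sketched as follows. The ASCII offsets (33 for Sanger, 64 for Illumina 1.3+) and the Solexa-to-PHRED formula follow the description in Cock et al.; the function names are made up for this example, and the sketch does no validation of out-of-range characters.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled quality score to the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def illumina13_to_sanger(quality_string):
    """Re-encode an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33)."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality_string)


def sanger_scores(quality_string):
    """Decode a Sanger quality string into a list of PHRED scores."""
    return [ord(char) - 33 for char in quality_string]


converted = illumina13_to_sanger("hhhh")   # PHRED 40 in Illumina 1.3+ encoding
scores = sanger_scores(converted)
low_quality = solexa_to_phred(0)
```

The two scales differ most for low-quality bases and converge for high scores, which is why the Solexa conversion matters mainly at the noisy ends of reads.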

SAM (Sequence Alignment/Map)

• It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
  • Header lines begin with @ (the header section is optional)
  • Alignment section has 11 mandatory fields
– BAM is the binary format of SAM

http://samtools.sourceforge.net/

http://samtools.sourceforge.net/SAM1.pdf

Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf
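To make the 11 mandatory fields concrete, here is a minimal sketch that splits one alignment line into named fields. The field names follow the current SAM specification; older versions of the specification call RNEXT, PNEXT, and TLEN by the names MRNM, MPOS, and ISIZE. The function name `parse_sam_line` and the example line are made up for illustration.

```python
from collections import namedtuple

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
SamRecord = namedtuple("SamRecord", SAM_FIELDS + ["tags"])


def parse_sam_line(line):
    """Parse one SAM line; return None for header lines (starting with '@')."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-delimited fields")
    values = [int(value) if name in INTEGER_FIELDS else value
              for name, value in zip(SAM_FIELDS, columns)]
    # Anything after the 11th column is an optional TAG:TYPE:VALUE field.
    return SamRecord(*values, tags=columns[11:])


# Hypothetical alignment line, not taken from a real BAM file.
example_line = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
record = parse_sam_line(example_line)
```

For real projects, an existing SAM/BAM library is usually a better choice than hand parsing; the sketch only shows how the columns map to field names.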

Alignment Examples

Alignments in SAM format

Valid BED files

chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +
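A quick way to check lines like these is to parse the first columns and test a few basic rules. This is a minimal sketch, not a full validator: it handles only the first six BED columns (chrom, start, end, name, score, strand), treats coordinates as 0-based with an exclusive end, and does not enforce any score range. The function name `parse_bed_line` is made up for illustration.

```python
def parse_bed_line(line):
    """Parse the first six columns of a BED line into a dictionary."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start, and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("start must be >= 0 and end must be >= start")
    record = {"chrom": chrom, "start": start, "end": end}
    if len(fields) > 3:
        record["name"] = fields[3]
    if len(fields) > 4:
        record["score"] = int(fields[4])
    if len(fields) > 5:
        if fields[5] not in ("+", "-", "."):
            raise ValueError("strand must be '+', '-', or '.'")
        record["strand"] = fields[5]
    return record


four_column = parse_bed_line("chr1 86114265 86116346 nsv433165")
six_column = parse_bed_line("chr1 16759829 16778548 chr1:21667704 270866 -")
feature_length = four_column["end"] - four_column["start"]
```

Because the end coordinate is exclusive, the length of a feature is simply end minus start.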

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 Build 36

NC_000086.6

hg19, GRCh37

mm8, MGSCv37

NCBIM37

danRer5, Zv7

Assemblies with the same name aren’t always the same

chr21:8,913,216-9,246,964

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

hg19, GRCh37

GCA_000001405.1

Tutorial Web Site: http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml

This site will be accessible after the meeting. Check back for updates and new tutorials.

RNA Seq Workflow

• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
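The data summarization step (associating alignments with annotations and producing counts) can be sketched as a simple overlap count. Everything here is illustrative: the function name `count_reads_per_feature`, the gene names, and the coordinates are invented, intervals are assumed to be 0-based with an exclusive end, and the nested loop is far too slow for real data, where an indexed interval structure would be used instead.

```python
def count_reads_per_feature(alignments, features):
    """Count alignments overlapping each annotated feature.

    alignments: iterable of (chrom, start, end)
    features:   list of (name, chrom, start, end)
    """
    counts = {name: 0 for name, _chrom, _start, _end in features}
    for chrom, start, end in alignments:
        for name, f_chrom, f_start, f_end in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == f_chrom and start < f_end and f_start < end:
                counts[name] += 1
    return counts


# Hypothetical annotation and alignments.
features = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 300, 400)]
alignments = [("chr1", 150, 186), ("chr1", 190, 226),
              ("chr1", 250, 286), ("chrX", 150, 186)]
counts = count_reads_per_feature(alignments, features)
```

The resulting per-feature counts are the kind of table that feeds the visualization and statistical analysis steps.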

Typical RNA-Seq Project Work Flow

[Figure: workflow diagram with the steps Tissue Sample, Total RNA, mRNA, cDNA, Sequencing, FASTQ file, QC, TopHat, Cufflinks, Gene/Transcript/Exon Expression, Visualization, Statistical Analysis]

JAX Computational Sciences Service

TopHat

Trapnell et al. (2009). Bioinformatics 25:1105-1111.

http://tophat.cbcb.umd.edu/

Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.

TopHat is better suited to aligning RNA-Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process.

TopHat is built on the Bowtie alignment algorithm.
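In SAM output, a spliced alignment is typically recorded with an N operation in the CIGAR string, which marks a stretch of the reference (the intron) that the read skips. The sketch below recovers those skipped regions from an alignment position and CIGAR string. The function name `splice_junctions` and the example values are made up for illustration; this is not TopHat's own code.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based and inclusive.

    pos is the 1-based leftmost reference position of the alignment.
    """
    junctions = []
    reference_position = pos
    for length, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length)
        if operation == "N":
            junctions.append((reference_position, reference_position + length - 1))
        if operation in "MDN=X":
            # Only these operations advance along the reference.
            reference_position += length
    return junctions


# Hypothetical read: 50 aligned bases, a 1,000 bp intron, then 25 aligned bases.
junctions = splice_junctions(100, "50M1000N25M")
```

An aligner that does not model splicing would have to leave the last 25 bases of such a read unaligned or force them into a poor match.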

Cufflinks

Trapnell et al. (2010). Nature Biotechnology 28:511-515.

http://cufflinks.cbcb.umd.edu/

• Assembles transcripts
• Estimates their abundances
• Tests for differential expression and regulation in RNA-Seq samples

Galaxy: http://main.g2.bx.psu.edu/

See Tutorial 1

• Build and share data and analysis workflows
• No programming experience required
• Strong and growing development and user community

Short Read Archive: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?

Short Read Archive Handbook: http://www.ncbi.nlm.nih.gov/books/NBK47528/

Aspera Connect: http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8

High performance file transfer for getting data from the Short Read Archive

SRA Toolkit: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Galaxy on the Cloud

• Create an Amazon Web Services (AWS) account
– Sign up for Amazon Elastic Compute Cloud (EC2) and
– Amazon Simple Storage Service (S3)
• Use the AWS Management Console to start a master EC2 instance
• Use the Galaxy Cloud web interface to manage the cluster
• Step-by-step instructions are here:
– https://bitbucket.org/galaxy/galaxy-central/wiki/cloud
• A screencast demonstrating the sign-up process is here:
– https://bitbucket.org/galaxy/galaxy-central/wiki/cloud

Afgan E., Baker D., Coraor N., Chapman B., Nekrutenko A., Taylor J. (2010) BMC Bioinformatics. 11:2010.

Why Go to the Cloud?

• File and compute needs are much greater for next gen sequence data
• Amazon cloud provides a scalable, cost-effective solution

Some Tips

• You’ll need a credit card to activate the service
• You’ll need to be near a phone so that you can verify your identity during the sign-up process
• There is a time lag between signing up for AWS and getting access

[Figure: Galaxy interface showing the Tools, Dialog/Parameter Selection, and History panels]

Let’s Get Started!
