ngs data overview - unipr.itbiochimica.unipr.it/biocomp/corso_ngs.pdf · applications dna-seq:...

NGS data overview

Università degli Studi di Parma

Dipartimento di Bioscienze

Davide Carnevali

NGS

Next-generation sequencing refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that are often used in Sanger sequencing of genomes.

Thousands of $ -> cents

Millions of $ -> thousands of $

Technologies

Illumina

Library preparation:
•  Fragmentation
•  Adapter ligation: single/paired end, barcodes (multiplexing)
•  Amplification

Sequencing -> reads

Library preparation

Library adapters

•  Single/paired end (reads)
•  Multiplexing (barcodes): sequencing of multiple samples together

Library amplification

Library sequencing

Applications

DNA-seq: Sequencing of whole or targeted (e.g., exome) genomic DNA. Used to detect SNPs, indels, larger structural polymorphisms, and CNV (copy number variation). DNA-seq is also used for de novo assembly.

ChIP-seq: ChIP (chromatin immunoprecipitation) is used to enrich genomic DNA for regulatory elements, followed by sequencing and mapping of the enriched DNA to a reference genome. The initial statistical challenge is to identify regions where the mapped reads are enriched relative to a sample that did not undergo ChIP; a subsequent task is to identify differential binding across a designed experiment.

Metagenomics: Sequencing generates sequences from samples containing multiple species, typically microbial communities sampled from niches such as the human oral cavity. Goals include inference of species composition (in which case sequencing typically targets phylogenetically informative genes such as 16S) or of metabolic contribution.

RNA-seq: Differential expression studies or novel transcript discovery.

Bisulfite-seq: Bisulfite treatment of DNA to determine its pattern of methylation.

SNPs

A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation that occurs commonly within a population (e.g. at a frequency of at least 1%), in which a single nucleotide (A, T, C or G) in the genome differs between members of a biological species or between paired chromosomes.

Variant

A “variation” or “variant” refers to an allele sequence that is different from the reference at as little as a single base or for a longer (potentially much longer) interval.

In general the distinction between “variation” and “polymorphism” is that polymorphisms are by definition variable sites within or between populations. “Variation” makes no assumption about degree of polymorphism except by comparison between a sample and the reference.

Variant Calling

The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations.

http://samtools.github.io/hts-specs/VCFv4.2.pdf
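As a rough illustration of what a VCF data line looks like to a program, the sketch below splits one line into the eight fixed columns named in the VCF specification. The record itself is hypothetical (made-up values), and the helper name `parse_vcf_line` is an assumption for this example, not part of any library.

```python
# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical record: the values are invented for illustration only.
VCF_LINE = "chr1\t10177\trs0000001\tA\tAC,G\t50\tPASS\tDP=100;AF=0.5,0.1"


def parse_vcf_line(line):
    """Split one VCF data line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position on CHROM
    record["ALT"] = record["ALT"].split(",")  # one or more alternate alleles
    info = {}
    if record["INFO"] != ".":                 # "." means no INFO entries
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # keys without "=" are flags
    record["INFO"] = info
    return record


record = parse_vcf_line(VCF_LINE)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

Real VCF files also carry `##` meta-information lines, a `#CHROM` header and per-sample genotype columns, which this sketch ignores.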

ChIP-seq

ChIP-sequencing is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins.


RNA-seq

RNA-Seq is an approach to transcriptome profiling that uses deep-sequencing technologies. RNA-seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.

Stranded RNA-seq

Bisulfite-seq

•  Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; U's get converted to T's in reads

•  Mapping to converted versions of reference show where methylation is happening
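The two points above can be sketched in a few lines of Python. This is a toy model under stated assumptions: only the forward strand is handled, the read aligns without gaps, and the function names (`bisulfite_convert`, `methylated_sites`) and the short sequences are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment: unmethylated C reads as T, methylated C stays C."""
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def methylated_sites(reference, read, offset):
    """Reference positions where a C is still read as C (ungapped alignment at offset)."""
    return [offset + i for i, base in enumerate(read)
            if reference[offset + i] == "C" and base == "C"]


reference = "ACGTCCGATCGA"
# A read from a molecule whose cytosine at position 4 was methylated.
read = bisulfite_convert(reference, methylated_positions={4})

# Mapping is done against a fully converted reference: the read, once its own
# remaining C's are converted as well, matches that converted reference exactly.
converted_reference = bisulfite_convert(reference)
print(read, converted_reference, methylated_sites(reference, read, 0))
```

After mapping, comparing the original read with the unconverted reference reveals which cytosines were protected by methylation.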

Bioinformatic workflow

•  Quality Control
•  Alignment
•  Experiment-specific analyses (VC/PC/DE: variant calling, peak calling, differential expression)
•  Data visualization

Reads: fastq format

@PRESLEY_0005:2:1:1455:1033#0/1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+PRESLEY_0005:2:1:1455:1033#0/1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Line 1: read name
Line 2: sequence
Line 4: quality scores (Phred), one character per base
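A minimal sketch of reading such records in Python, assuming the common layout of exactly four lines per read (no wrapped sequence lines). The function name `read_fastq` is hypothetical; established libraries exist for production use.

```python
import io

# The read shown on the slide; the quality string is built programmatically
# (30 'I' characters followed by "9IG9IC") so that it matches the 36 bases.
FASTQ_TEXT = "\n".join([
    "@PRESLEY_0005:2:1:1455:1033#0/1",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+PRESLEY_0005:2:1:1455:1033#0/1",
    "I" * 30 + "9IG9IC",
]) + "\n"


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, optionally repeating the name
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


records = list(read_fastq(io.StringIO(FASTQ_TEXT)))
name, sequence, quality = records[0]
print(name, len(sequence), len(quality))
```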

Reads quality: Phred scale

Phred quality scores Q are logarithmically related to the base-calling error probabilities P:

$Q = -10 \log_{10} P$, or equivalently $P = 10^{-Q/10}$

Illumina: Phred+33 encoding, Q from 0 to 93 (ASCII 33-126)
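To make the encoding concrete, the following sketch decodes Phred+33 characters into quality scores and error probabilities using the formulas above. The function names are hypothetical; the four characters are taken from the quality string of the example read.

```python
def phred33_to_scores(quality):
    """Decode a Phred+33 quality string: Q = ASCII code - 33."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Convert a Phred score to a base-calling error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = phred33_to_scores("I9GC")
for ch, q in zip("I9GC", scores):
    print(ch, q, error_probability(q))
```

For instance, 'I' (ASCII 73) decodes to Q = 40, i.e. one expected error in 10,000 base calls, while '9' (ASCII 57) decodes to Q = 24.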

Reads Quality Control

•  Check reads quality and contamination
•  Remove poor quality bases (trimming)
•  Remove adapter contamination (clipping)

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)


Alignment (to a reference genome)

Alignment is the process of determining the most likely source within the genome sequence for an observed DNA sequencing read, given knowledge of which species the sequence came from.

Alignment (to a reference genome)

Short read alignment is tricky for several reasons:

1.  The reference genome is really big. Searching (in) big things is harder than searching (in) small things.

2.  You aren’t always looking for exact matches in the reference genome.

Alignment Basic algorithm

Seed and extend (e.g. BLAST)

Seeds: parts of the read with exact matches of a fixed size (e.g. size = 11)

Tradeoff between speed and accuracy


We gain speed by skipping directly to the stage when we have a viable proto-alignment: a perfect match of the seed. We lose accuracy because we might miss the best alignment; the best alignment of the read with the target sequence might involve a difference between the read and the target that is inside the seed. If we require an exact match at the seeding stage, we will miss optimal alignments that have this feature.


The seed size chosen affects the number of matches found during the seeding stage of alignment. In general, larger seeds yield fewer exact matches, while smaller seeds match at more locations in the target.
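The tradeoff can be seen in a toy seed-and-extend sketch. It is not how BLAST or any production aligner is implemented: the extension step here is ungapped and simply counts mismatches, and the sequences and function names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches):
    """Return (start, mismatches) of the best ungapped alignment reachable from an exact seed."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACCGTAGGCTAACGTTAGCATCGGA"
read = "GGCTAACGATAGC"  # reference[9:22] with one substitution at read position 8

short_seed = seed_and_extend(read, reference, build_seed_index(reference, 4), 4, 2)
long_seed = seed_and_extend(read, reference, build_seed_index(reference, 11), 11, 2)
print(short_seed, long_seed)
```

With seeds of size 4 the read is placed at position 9 with one mismatch. With seeds of size 11 every seed overlaps the substitution, so no exact seed match exists and the alignment is missed, which is the loss of accuracy described above.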


NGS Alignment

•  Hash table–based implementations, in which the hash may be created using either the reference genome or the set of sequencing reads (e.g. SOAP) (spaced seeds)

•  Burrows–Wheeler transform (BWT)-based methods, which first create an efficient index (FM-index) of the reference genome assembly in a way that facilitates rapid searching with a low memory footprint (e.g. Bowtie, BWA)

Alignment: Hash table–based

Hash table: data structure that is able to index complex and nonsequential data in a way that facilitates rapid searching

Alignment: Burrows Wheeler transform (BWT)-based

The Burrows–Wheeler transform rearranges a character string (e.g. reference genome) into runs of similar characters. The transformation is reversible, without needing to store any additional data
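A naive sketch of the transform and its inverse, assuming a `$` terminator that sorts before every other character. Sorting all rotations is fine for a toy string but far too slow and memory-hungry for a genome; real tools build the transform through suffix arrays.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations of text + '$'."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text from the transform alone (no extra data needed)."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort again.
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("ACAACG")
print(transformed, inverse_bwt(transformed))
```

On longer, repetitive input the output tends to group identical characters into runs, which is what makes it compressible.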

Alignment: Burrows Wheeler transform (BWT)-based

The FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as to locate the position of each occurrence.
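The backward search that an FM-index supports can be sketched as follows. This toy version stores the full suffix array and full occurrence tables, so it is not compressed at all; real implementations sample both structures to save memory. All names here are hypothetical.

```python
def build_fm_index(text):
    """Build an uncompressed toy FM-index: suffix array, first-column offsets, occurrence counts."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in suffix_array)  # the BWT of text
    alphabet = sorted(set(text))
    first = {}  # first[c] = number of characters in text that sort before c
    total = 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in alphabet}
    for i, ch in enumerate(last):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first, occ


def find(pattern, fm_index):
    """Return the sorted start positions of pattern, matching it from its last character backwards."""
    suffix_array, first, occ = fm_index
    lo, hi = 0, len(suffix_array)
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


fm_index = build_fm_index("ACAACG")
print(find("AC", fm_index), find("CA", fm_index), find("GG", fm_index))
```

The number of occurrences is simply `hi - lo` after the loop, and the suffix array entries in that range give the positions.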

Bowtie algorithm

Query: AATGATACGGCGACCACCGAGATCTA

Reference

BWT( Reference )

Bowtie algorithm

Query: AATGTTACGGCGACCACCGAGATCTA

Reference

BWT( Reference )

Reads from RNA-seq can span exon-exon junctions

RNA-seq reads alignment

Splice-aware aligner (TopHat)

RNA-seq reads alignment

TopHat (spliced aligner)

Alignment: SAM format

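Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which start with the '@' symbol, each alignment line consists of 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by TAG:TYPE:VALUE fields.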


Each bit in the FLAG field is defined as:

http://broadinstitute.github.io/picard/explain-flags.html
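As a small illustration of how the FLAG bits combine, the sketch below decodes a flag value in the same spirit as the Picard page linked above. The bit values follow the SAM specification; the descriptions are paraphrased, and `explain_flag` is a hypothetical helper name.

```python
# (bit, meaning) pairs for the SAM FLAG field.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """List the meaning of every bit that is set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


# 337 is the FLAG of the TopHat record used as the SAM example in these slides.
meanings = explain_flag(337)
print(meanings)
```

Here 337 = 256 + 64 + 16 + 1, so the record is a paired read, on the reverse strand, first in its pair, and not the primary alignment.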


@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
. . . . .
. . . . .
. . . . .

@PG ID:TopHat VN:2.0.11 CL:tophat -p 2 --b2-very-sensitive -o output_dir --library-type fr-firststrand Genome rep1_1.fastq rep1_2.fastq

TUPAC_0006:2:44:16964:12525#0 337 chr1 10017 0 76M chr5 11709 0 CCTAACCCTATCCCTAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA BBBBBBBBBBBBBBBBBBBBBBBdbbac_ffd`dZdd`^d\a`a]]ZWc_^dcfffde`ecfbfbdffffefffff AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10A9T55 YT:Z:UU NH:i:20 CC:Z:chr5 CP:i:10617 XS:A:+
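A minimal sketch of splitting an alignment line into its mandatory fields and optional tags. The record is an abbreviated version of the TopHat example: SEQ and QUAL are replaced by `*` (which SAM allows) and only some tags are kept. `parse_sam_line` is a hypothetical name; libraries such as pysam are the usual choice in practice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one tab-delimited SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for item in columns[11:]:
        tag, type_code, value = item.split(":", 2)  # TAG:TYPE:VALUE
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


# Abbreviated form of the example record (assumption: SEQ/QUAL shortened to "*").
line = "\t".join([
    "TUPAC_0006:2:44:16964:12525#0", "337", "chr1", "10017", "0", "76M",
    "chr5", "11709", "0", "*", "*",
    "AS:i:-10", "NM:i:2", "MD:Z:10A9T55", "NH:i:20", "XS:A:+",
])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"]["NH"])
```

The `NH:i:20` tag says the read aligns to 20 locations, consistent with the mapping quality of 0 and the "not primary alignment" flag bit.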

Alignment: BAM format

BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and indexable representation of nucleotide sequence alignments.

http://samtools.sourceforge.net/

NGS data analysis

From alignment to...
•  Peak Calling (MACS)
•  Variant Calling (GATK)
•  Differential Expression analysis (DESeq)

Other formats

Expression: bedGraph, (big)wig

Annotation: bed, gff/gtf

bed and bedGraph use a 0-based coordinate system in which the end position is not included (half-open intervals); the other formats are 1-based and include the end position.
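The conversion between the two conventions is a one-line shift of the start coordinate, sketched below with hypothetical function names. The numbers are the first exon of NM_001017416, which appears in both the BED and the GTF examples of these slides.

```python
def bed_to_one_based(start, end):
    """0-based, half-open [start, end) -> 1-based, closed [start + 1, end]."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based, closed [start, end] -> 0-based, half-open [start - 1, end)."""
    return start - 1, end


bed_interval = (62901974, 62902138)           # BED-style coordinates
gtf_interval = bed_to_one_based(*bed_interval)

# The feature length is the same in both systems, but computed differently.
bed_length = bed_interval[1] - bed_interval[0]
gtf_length = gtf_interval[1] - gtf_interval[0] + 1
print(gtf_interval, bed_length, gtf_length)
```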

Other formats

chrom start end value

chr19 163488 163538 1
chr19 527211 527261 2
chr19 527435 527485 2
chr19 583006 583007 1
chr19 583007 583056 2
chr19 583056 583057 1
chr19 626610 626657 1
chr19 845596 845646 1
chr19 845696 845746 1
chr19 871830 871880 1
chr19 978730 978776 1
chr19 1079712 1079762 1
chr19 1092579 1092629 1
chr19 1125251 1125301 1
chr19 1272443 1272493 4

The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data.

http://genome.ucsc.edu/goldenPath/help/bedgraph.html

Other formats

The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. The bigWig (compressed) format is the recommended format for almost all graphing track needs.

http://genome.ucsc.edu/goldenPath/help/wiggle.html

Other formats wiggle

variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format.

variableStep chrom=chr19 span=50
163488 1
variableStep chrom=chr19 span=50
527211 2
variableStep chrom=chr19 span=50
527435 2
variableStep chrom=chr19 span=1
583006 1
variableStep chrom=chr19 span=49
583007 2
variableStep chrom=chr19 span=1
583056 1
variableStep chrom=chr19 span=47
626610 1
variableStep chrom=chr19 span=50
845596 1
variableStep chrom=chr19 span=50

fixedStep is for data with regular intervals between new data values and is the more compact wiggle format.

fixedStep chrom=chr19 start=400601 step=100
11
22
33
21
18
9
5
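To show why fixedStep is compact, the sketch below expands the block above into explicit intervals. It assumes the UCSC conventions that wiggle positions are 1-based and that `span` defaults to 1 when omitted; `expand_fixed_step` is a hypothetical name.

```python
def expand_fixed_step(lines):
    """Yield (chrom, start, end, value) with 1-based inclusive coordinates."""
    chrom = None
    position = None
    step = None
    span = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Declaration line: key=value parameters after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))
        else:
            # Data line: a single value for the current position.
            yield chrom, position, position + span - 1, float(line)
            position += step


wiggle_lines = [
    "fixedStep chrom=chr19 start=400601 step=100",
    "11", "22", "33", "21", "18", "9", "5",
]
intervals = list(expand_fixed_step(wiggle_lines))
for interval in intervals:
    print(*interval)
```

Only the values are stored per data point; every position is implied by `start` and `step`.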


Other formats

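BED lines have three required fields (chrom, chromStart, chromEnd) and nine additional optional fields.

Other formats

The 9 additional optional BED fields are: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.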

Other formats

BED file example

chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 164,239,121,105,161,692,171,202,1559, 0,3495,5184,5891,6855,8434,11037,12161,13942,

chr1 62901974 62917475 NM_001017416 0 +

chrom start end name score strand
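The last three columns of the example describe the exon structure. The sketch below, using the field names from the UCSC BED description, turns them into explicit block coordinates; it splits on any whitespace so that it works on the line as printed on the slide (real files are normally tab-separated). `parse_bed12` is a hypothetical name.

```python
BED_FIELDS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb", "blockCount",
              "blockSizes", "blockStarts"]
INT_FIELDS = ("chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount")


def parse_bed12(line):
    """Parse a 12-column BED line and compute absolute block (exon) coordinates."""
    record = dict(zip(BED_FIELDS, line.split()))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    sizes = [int(x) for x in record["blockSizes"].rstrip(",").split(",")]
    starts = [int(x) for x in record["blockStarts"].rstrip(",").split(",")]
    # blockStarts are relative to chromStart; coordinates stay 0-based, half-open.
    record["blocks"] = [(record["chromStart"] + s, record["chromStart"] + s + size)
                        for s, size in zip(starts, sizes)]
    return record


bed_line = ("chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 "
            "164,239,121,105,161,692,171,202,1559, "
            "0,3495,5184,5891,6855,8434,11037,12161,13942,")
record = parse_bed12(bed_line)
print(record["blocks"][0], record["blocks"][-1])
```

The first block starts at chromStart and the last block ends exactly at chromEnd, as the format requires.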

Other formats

GFF (General Feature Format) lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly.

Other formats

GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.

Other formats

The attribute list must begin with the two mandatory attributes: gene_id and transcript_id.

Other formats

GTF file example

chr1 hg19_refGene exon 62901975 62902138 0.000000 + . gene_id "NM_001017416"; transcript_id "NM_001017416";
chr1 hg19_refGene exon 62901975 62902233 0.000000 + . gene_id "NM_001017415"; transcript_id "NM_001017415";
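A minimal sketch of reading one GTF line, assuming tab-separated columns as the format requires. The column names follow the UCSC GFF description, the attribute pairs are pulled out with a simple regular expression, and `parse_gtf_line` is a hypothetical name.

```python
import re

GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame"]


def parse_gtf_line(line):
    """Parse one GTF line into its eight fixed fields plus an attribute dictionary."""
    columns = line.rstrip("\n").split("\t", 8)
    record = dict(zip(GTF_FIELDS, columns[:8]))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    # Each attribute looks like: key "value";
    record["attributes"] = dict(re.findall(r'(\S+) "([^"]*)";', columns[8]))
    return record


gtf_line = "\t".join([
    "chr1", "hg19_refGene", "exon", "62901975", "62902138", "0.000000", "+", ".",
    'gene_id "NM_001017416"; transcript_id "NM_001017416";',
])
record = parse_gtf_line(gtf_line)
print(record["feature"], record["start"], record["end"], record["attributes"])
```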

Visualization

Use genome browsers to visualize SAM/BAM alignments, expression tracks and annotation tracks (files).

A genome browser is a graphical interface for displaying genomic data from a biological database.

http://genome.ucsc.edu/

http://www.broadinstitute.org/igv/

