NGS data overview Università degli studi di parma
Dipartimento di Bioscienze
Davide Carnevali
NGS
Next-generation sequencing refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more throughput and minimizing the need for the fragment-cloning methods that are often used in Sanger sequencing of genomes.
Thousands of $ -> cents of $
Millions of $ -> thousands of $
Technologies
Illumina
Library preparation:
• Fragmentation
• Adapter ligation: single/paired end, barcodes (multiplexing)
• Amplification
Sequencing -> reads
Library preparation
Library adapters
• Single/paired end (reads) • Multiplexing (barcodes): sequencing of multiple samples together
Library amplification
Library sequencing
Applications
DNA-Seq: Sequencing of whole or targeted (e.g., exome) genomic DNA. Used to detect SNPs, indels, larger structural polymorphisms, and CNV (copy number variation). DNA-seq is also used for de novo assembly.
ChIP-seq: ChIP (chromatin immunoprecipitation) is used to enrich genomic DNA for regulatory elements, followed by sequencing and mapping of the enriched DNA to a reference genome. The initial statistical challenge is to identify regions where the mapped reads are enriched relative to a sample that did not undergo ChIP; a subsequent task is to identify differential binding across a designed experiment.
Metagenomics: Sequencing generates sequences from samples containing multiple species, typically microbial communities sampled from niches such as the human oral cavity. Goals include inference of species composition (in which case sequencing typically targets phylogenetically informative genes such as 16S) or metabolic contribution.
RNA-seq: Differential expression studies or novel transcript discovery.
Bisulfite-seq: Bisulfite treatment of DNA to determine its pattern of methylation.
SNPs
A Single Nucleotide Polymorphism is a DNA sequence variation occurring commonly within a population (e.g. 1%) in which a single nucleotide — A, T, C or G — in the genome differs between members of a biological species or paired chromosomes.
Variant
A “variation” or “variant” refers to an allele sequence that is different from the reference at as little as a single base or for a longer (potentially much longer) interval.
In general the distinction between “variation” and “polymorphism” is that polymorphisms are by definition variable sites within or between populations. “Variation” makes no assumption about degree of polymorphism except by comparison between a sample and the reference.
Variant Calling
The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations.
http://samtools.github.io/hts-specs/VCFv4.2.pdf
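As a rough illustration of the format, the sketch below splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, as listed in the VCF 4.2 specification). The helper name `parse_vcf_line` and the example record are made up for illustration; real files are better read with a dedicated library.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into its 8 fixed columns."""
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flag entries carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position on the reference
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),      # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }

# Hypothetical record, not taken from a real dataset
record = parse_vcf_line("chr1\t10177\t.\tA\tC,G\t50\tPASS\tDP=14;DB")
# A site is a candidate SNP when REF and every ALT allele are one base long
is_snp = len(record["ref"]) == 1 and all(len(a) == 1 for a in record["alt"])
```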
ChIP-seq
ChIP-sequencing is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins.
RNA-seq
RNA-Seq is an approach to transcriptome profiling that uses deep-sequencing technologies. RNA-seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.
stranded RNA-seq
Bisulfite-seq
• Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; after amplification the U's are read as T's
• Mapping the reads to C-to-T converted versions of the reference shows where methylation is happening
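The two points above can be sketched in a few lines. This is a toy model of the forward strand only, with a hypothetical function name and made-up sequences; real bisulfite aligners also handle the reverse strand and sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate what reads show after bisulfite treatment (forward strand):
    unmethylated C -> T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

reference = "ACGTCCGA"
# Hypothetical read from a molecule whose C at index 5 was methylated
read = bisulfite_convert(reference, methylated_positions={5})
# Fully C-to-T converted reference, the kind of sequence reads are mapped to
converted_reference = bisulfite_convert(reference)

# A C that survives in the read at a reference C marks a methylated site
methylated_calls = [
    i for i, (r, g) in enumerate(zip(read, reference)) if r == "C" and g == "C"
]
```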
Bioinformatic workflow
• Quality Control • Alignment • Experiment-specific analyses (variant calling, peak calling, differential expression) • Data visualization
Reads: fastq format
@PRESLEY_0005:2:1:1455:1033#0/1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+PRESLEY_0005:2:1:1455:1033#0/1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Line 1: read name; line 2: sequence; line 4: quality scores (Phred)
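A minimal sketch of reading this four-line layout is shown below. The function name `parse_fastq` and the short record are illustrative assumptions, and the sketch assumes unwrapped records (exactly four lines per read).

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header.strip()[1:], seq, qual

# Made-up record for illustration
example = ["@read1\n", "GATTACA\n", "+\n", "IIII9IG\n"]
reads = list(parse_fastq(example))
```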
Reads quality: Phred scale
Phred quality scores Q are logarithmically related to the base-calling error probability P: $Q = -10 \log_{10} P$, or equivalently $P = 10^{-Q/10}$.
Illumina: Phred+33, Q: 0 to 93 (ASCII 33-126)
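The relation between the ASCII character, Q and P can be checked with a short sketch, assuming the Phred+33 encoding stated above. The function names are illustrative.

```python
import math

def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(char) - 33 for char in quality_string]

def error_probability(q):
    """P = 10^(-Q/10)"""
    return 10 ** (-q / 10)

def quality_from_probability(p):
    """Q = -10 * log10(P)"""
    return -10 * math.log10(p)

scores = phred33_to_scores("I9GC")        # characters taken from the read above
probabilities = [error_probability(q) for q in scores]
```

With this encoding `I` decodes to Q40, i.e. an estimated error probability of 1 in 10,000.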
Reads Quality Control
• Check reads quality and contamination • Remove poor quality bases (trimming) • Remove adapter contamination (clipping)
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
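FastQC reports on quality; trimming and clipping are done by separate tools. The sketch below only illustrates the two ideas with deliberately simple rules (trailing-base quality threshold, exact adapter match). The function names, the threshold of 20 and the example sequences are assumptions, and real trimmers use more tolerant algorithms.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trimming: drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def clip_adapter(seq, qual, adapter):
    """Clipping: cut the read at the first exact occurrence of the adapter."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual          # adapter not present, read unchanged
    return seq[:pos], qual[:pos]

# Made-up reads: '#' encodes Q2 in Phred+33, 'I' encodes Q40
trimmed = trim_3prime("ACGTACGTAA", "IIIIIIII##")
clipped = clip_adapter("ACGTAGATCGGAAG", "I" * 14, adapter="AGATCGGAAG")
```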
Alignment (to a reference genome)
Alignment itself is the process of determining the most likely source within the genome sequence for the observed DNA sequencing read, given the knowledge of which species the sequence has come from
Short read alignment is tricky for several reasons:
1. The reference genome is really big. Searching in big things is harder than searching in small things.
2. You aren’t always looking for exact matches in the reference genome.
Alignment Basic algorithm
Seed and extend (e.g. BLAST)
Seeds: parts of the read with exact matches of a fixed size (e.g. size = 11)
Tradeoff between speed and accuracy
We gain speed by skipping directly to the stage when we have a viable proto-alignment: a perfect match of the seed. We lose accuracy because we might miss the best alignment; the best alignment of the read with the target sequence might involve a difference between the read and the target that is inside the seed. If we require an exact match at the seeding stage, we will miss optimal alignments that have this feature.
The seed size affects the number of matches found during the seeding stage of alignment. In general, larger seeds yield fewer exact matches, while smaller seeds match at more locations in the target.
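The seed-and-extend idea and the effect of seed size can be sketched as follows. This is a simplified, ungapped illustration (not the BLAST implementation): seeds are exact k-mers looked up in a dictionary, and "extension" just counts mismatches over the full read. Function names and sequences are made up.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (ref_start, mismatches) of the best ungapped alignment, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):          # exact seed matches only
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best

reference = "TTGACCATGCAAGTCCGATAGG"
read = "CATGCTAGTC"   # differs from reference[5:15] at one position

short_seed = seed_and_extend(read, reference, build_kmer_index(reference, 4), k=4)
long_seed = seed_and_extend(read, reference, build_kmer_index(reference, 8), k=8)
```

With k=4 the read is placed at position 5 with one mismatch; with k=8 every seed overlaps the mismatch, no exact seed match exists, and the alignment is missed.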
Alignment
NGS Alignment
• Hash table-based implementations, in which the hash may be created using either the reference genome or the set of sequencing reads (e.g. SOAP) (spaced seeds)
• Burrows-Wheeler transform (BWT)-based methods, which first create an efficient index (FM-index) of the reference genome assembly in a way that facilitates rapid searching with a low memory footprint (e.g. Bowtie, BWA)
Alignment: Hash table–based
Hash table: data structure that is able to index complex and nonsequential data in a way that facilitates rapid searching
Alignment: Burrows Wheeler transform (BWT)-based
The Burrows–Wheeler transform rearranges a character string (e.g. reference genome) into runs of similar characters. The transformation is reversible, without needing to store any additional data
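A naive sketch of the transform and its inverse is given below, assuming a `$` end marker that sorts before the nucleotide letters. Sorting all rotations is only practical for toy strings; genome-scale indexes are built with suffix-array based methods.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text += "$"   # unique end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column):
    """Rebuild the text by repeatedly prepending the BWT column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]   # drop the end marker

transformed = bwt("ACAACG")
restored = inverse_bwt(transformed)
```

No extra data is stored: the transformed string alone is enough to recover the input.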
The FM-index is a compressed full-text substring index based on the Burrows-Wheeler transform. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence.
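Counting occurrences with an FM-index can be sketched with two tables built from the BWT string: C (number of characters smaller than a given one) and Occ (running counts per character). The version below stores Occ uncompressed, so it illustrates the backward search only, not the memory savings; names are illustrative.

```python
def fm_index(bwt_string):
    """Build the C table and the (uncompressed) Occ table from a BWT string."""
    alphabet = sorted(set(bwt_string))
    c_table, total = {}, 0
    for char in alphabet:
        c_table[char] = total              # characters strictly smaller than char
        total += bwt_string.count(char)
    occ = {char: [0] * (len(bwt_string) + 1) for char in alphabet}
    for i, seen in enumerate(bwt_string):
        for char in alphabet:
            occ[char][i + 1] = occ[char][i] + (1 if char == seen else 0)
    return c_table, occ

def count_occurrences(pattern, c_table, occ, n):
    """Backward search: read the pattern right to left, narrowing the range [lo, hi)."""
    lo, hi = 0, n
    for char in reversed(pattern):
        if char not in c_table:
            return 0
        lo = c_table[char] + occ[char][lo]
        hi = c_table[char] + occ[char][hi]
        if lo >= hi:
            return 0
    return hi - lo

bwt_string = "GC$AAAC"                     # BWT of "ACAACG" with end marker "$"
c_table, occ = fm_index(bwt_string)
hits_ac = count_occurrences("AC", c_table, occ, len(bwt_string))
hits_aac = count_occurrences("AAC", c_table, occ, len(bwt_string))
```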
Bowtie algorithm
Query: AATGATACGGCGACCACCGAGATCTA
[Figure sequence: step-by-step search of the query against Reference and BWT(Reference). The last steps use a query containing a mismatch: AATGTTACGGCGACCACCGAGATCTA]
RNA-seq reads alignment
Reads from RNA-seq can span exon-exon junctions, so they are aligned with a splicing-aware (spliced) aligner such as TopHat.
Alignment: SAM format
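Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which start with the '@' symbol, each alignment line consists of 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by TAG:TYPE:VALUE fields.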
Each bit in the FLAG field is defined as:
http://broadinstitute.github.io/picard/explain-flags.html
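The FLAG is a bitwise combination of the values defined in the SAM specification. A small decoder, with descriptions paraphrased from the specification and an illustrative function name, might look like this:

```python
# Bit values from the SAM specification; descriptions are paraphrased
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def explain_flag(flag):
    """List the properties whose bit is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

# 337 = 256 + 64 + 16 + 1
properties = explain_flag(337)
```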
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
. . . . .
@PG ID:TopHat VN:2.0.11 CL:tophat -p 2 --b2-very-sensitive -o output_dir --library-type fr-firststrand Genome rep1_1.fastq rep1_2.fastq

TUPAC_0006:2:44:16964:12525#0 337 chr1 10017 0 76M chr5 11709 0 CCTAACCCTATCCCTAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA BBBBBBBBBBBBBBBBBBBBBBBdbbac_ffd`dZdd`^d\a`a]]ZWc_^dcfffde`ecfbfbdffffefffff AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10A9T55 YT:Z:UU NH:i:20 CC:Z:chr5 CP:i:10617 XS:A:+
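An alignment line like the one above can be split into the 11 mandatory SAM fields plus optional tags. The sketch below is a hand-rolled illustration with a hypothetical function name; the example record reuses values from the line above, with SEQ and QUAL replaced by `*` (allowed by the specification when they are not stored) to keep it short.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Split one TAB-delimited alignment line into mandatory fields and tags."""
    columns = line.rstrip("\n").split("\t")
    record = {
        name: int(value) if name in INT_FIELDS else value
        for name, value in zip(SAM_FIELDS, columns)
    }
    tags = {}
    for tag in columns[11:]:
        key, value_type, value = tag.split(":", 2)   # TAG:TYPE:VALUE
        tags[key] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record

line = "\t".join([
    "TUPAC_0006:2:44:16964:12525#0", "337", "chr1", "10017", "0", "76M",
    "chr5", "11709", "0", "*", "*", "NM:i:2", "MD:Z:10A9T55", "XS:A:+",
])
alignment = parse_sam_line(line)
```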
Alignment: BAM format
BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.
http://samtools.sourceforge.net/
NGS data analysis
From alignment to... • Peak Calling (MACS) • Variant Calling (GATK) • Differential Expression analysis (DESeq)
Other formats
Expression: bedgraph, (big)wig Annotation: bed, gff/gtf
BED and bedGraph use a 0-based coordinate system in which the end position is not included (half-open intervals); the other formats listed here are 1-based and include the end position.
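The difference between the two conventions comes down to shifting the start by one. The helper names below are illustrative; the coordinates are a 164-base interval starting at BED position 62901974.

```python
def bed_to_one_based(start, end):
    """0-based, end-exclusive (BED/bedGraph) -> 1-based, end-inclusive (GFF/GTF)."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based, end-inclusive -> 0-based, end-exclusive."""
    return start - 1, end

def bed_length(start, end):
    """In BED coordinates the interval length is simply end - start."""
    return end - start

def one_based_length(start, end):
    """In 1-based inclusive coordinates both endpoints count."""
    return end - start + 1

bed_interval = (62901974, 62902138)
gtf_interval = bed_to_one_based(*bed_interval)
```

Both representations describe the same 164 bases; only the start coordinate changes.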
bedGraph file example
chrom start end value
chr19 163488 163538 1
chr19 527211 527261 2
chr19 527435 527485 2
chr19 583006 583007 1
chr19 583007 583056 2
chr19 583056 583057 1
chr19 626610 626657 1
chr19 845596 845646 1
chr19 845696 845746 1
chr19 871830 871880 1
chr19 978730 978776 1
chr19 1079712 1079762 1
chr19 1092579 1092629 1
chr19 1125251 1125301 1
chr19 1272443 1272493 4
The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data.
http://genome.ucsc.edu/goldenPath/help/bedgraph.html
The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. The bigWig (compressed) format is the recommended format for almost all graphing track needs.
http://genome.ucsc.edu/goldenPath/help/wiggle.html
Other formats: wiggle
variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format.
variableStep chrom=chr19 span=50
163488 1
variableStep chrom=chr19 span=50
527211 2
variableStep chrom=chr19 span=50
527435 2
variableStep chrom=chr19 span=1
583006 1
variableStep chrom=chr19 span=49
583007 2
variableStep chrom=chr19 span=1
583056 1
variableStep chrom=chr19 span=47
626610 1
variableStep chrom=chr19 span=50
845596 1
variableStep chrom=chr19 span=50
fixedStep is for data with regular intervals between new data values and is the more compact wiggle format.
fixedStep chrom=chr19 start=400601 step=100
11
22
33
21
18
9
5
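To show why fixedStep is compact, the sketch below expands a fixedStep block into explicit intervals. It assumes the UCSC conventions that wiggle positions are 1-based and that `span` defaults to 1 when omitted; the function name is illustrative.

```python
def parse_fixed_step(lines):
    """Expand fixedStep wiggle lines into (chrom, start, end, value) tuples,
    using 1-based inclusive coordinates."""
    chrom = start = step = None
    span = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword
            params = dict(field.split("=") for field in line.split()[1:])
            chrom = params["chrom"]
            start = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))
        else:
            # Data line: one value, position implied by start and step
            yield chrom, start, start + span - 1, float(line)
            start += step

wiggle_lines = ["fixedStep chrom=chr19 start=400601 step=100", "11", "22", "33"]
intervals = list(parse_fixed_step(wiggle_lines))
```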
BED lines have three required fields (chrom, chromStart, chromEnd) and nine additional optional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts.
BED file example
chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 164,239,121,105,161,692,171,202,1559, 0,3495,5184,5891,6855,8434,11037,12161,13942,
First six fields of the line above:
chrom start end name score strand
chr1 62901974 62917475 NM_001017416 0 +
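In a 12-column BED line, the last two fields (blockSizes and blockStarts in the UCSC description) give the exon layout relative to the start position. A sketch of turning them into absolute exon intervals, with an illustrative function name:

```python
def bed12_exons(line):
    """Return exon intervals (chrom, start, end) in BED coordinates
    (0-based, end-exclusive) from a 12-column BED line."""
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    # Both lists may carry a trailing comma, hence the `if x` filter
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + rel_start, chrom_start + rel_start + size)
        for rel_start, size in zip(block_starts, block_sizes)
    ]

bed_line = (
    "chr1 62901974 62917475 NM_001017416 0 + 62905538 62916652 0 9 "
    "164,239,121,105,161,692,171,202,1559, "
    "0,3495,5184,5891,6855,8434,11037,12161,13942,"
)
exons = bed12_exons(bed_line)
```

The first exon starts at the feature start and the last exon ends at the feature end, which is a quick sanity check on the block arithmetic.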
GFF (General Feature Format) lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly.
GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.
The attribute list must begin with the two mandatory attributes: gene_id and transcript_id.
GTF file example
chr1 hg19_refGene exon 62901975 62902138 0.000000 + . gene_id "NM_001017416"; transcript_id "NM_001017416";
chr1 hg19_refGene exon 62901975 62902233 0.000000 + . gene_id "NM_001017415"; transcript_id "NM_001017415";
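A GTF line can be read by splitting on tabs and then splitting the ninth field into its attribute pairs. This sketch assumes well-formed input with quoted values and uses an illustrative function name.

```python
def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dict with an 'attributes' sub-dict."""
    columns = line.rstrip("\n").split("\t")
    attributes = {}
    for item in columns[8].split(";"):
        item = item.strip()
        if not item:
            continue                       # skip the empty piece after the last ';'
        key, value = item.split(" ", 1)    # type/value pair separated by a space
        attributes[key] = value.strip('"')
    return {
        "seqname": columns[0],
        "source": columns[1],
        "feature": columns[2],
        "start": int(columns[3]),          # 1-based, inclusive
        "end": int(columns[4]),
        "score": columns[5],
        "strand": columns[6],
        "frame": columns[7],
        "attributes": attributes,
    }

gtf_line = "\t".join([
    "chr1", "hg19_refGene", "exon", "62901975", "62902138", "0.000000", "+", ".",
    'gene_id "NM_001017416"; transcript_id "NM_001017416";',
])
exon = parse_gtf_line(gtf_line)
```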
Visualization
Use genome browsers to visualize SAM/BAM alignments, expression and annotation tracks (files).
A genome browser is a graphical interface for displaying information from a biological database of genomic data.
http://genome.ucsc.edu/
http://www.broadinstitute.org/igv/