next-generation genomics: an integrative approach
TRANSCRIPT
Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors
Korea Center for Disease Control & Prevention
Next-generation genomics: an integrative approach
Chang Bum Hong
Division of Structural and Functional Genomics, Center for Genome Sciences, NIH
APPLICATIONS OF NEXT-GENERATION SEQUENCING
2011
• Genome structural variation discovery and genotyping
• RNA sequencing: advances, challenges and opportunities
• Charting histone modifications and the functional organization of mammalian genomes
2010
• Evaluating genome-scale approaches to eukaryotic DNA replication
• Advances in understanding cancer genomes through second-generation sequencing
• Genome-wide allele-specific analysis: insights into regulatory variation
• Next-generation genomics: an integrative approach
• Uncovering the roles of rare variants in common disease through whole-genome sequencing
• Principles and challenges of genome-wide DNA methylation analysis
• Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity
• Sequencing technologies - the next generation
• RNA processing and its regulation: global insights into biological networks
2009
• The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs
• ChIP-seq: advantages and challenges of a maturing technology
• Insights from genomic profiling of transcription factors
• RNA-Seq: a revolutionary tool for transcriptomics
DNA
RNA
Protein
Complete genome resequencing
Targeted genomic resequencing
De novo sequencing
Translated into proteins
DNA being transcribed into RNA
Phenotype
Disease
Chromatin immunoprecipitation sequencing
Sequencing of bisulfite-treated DNA
Epigenome
Transcriptome sequencing
Small RNA sequencing
Proteomics
Transcriptomics
Genomics
Genome-scale data
GWAS, ChIP-seq and RNA-seq
Next-generation sequencing
• We define this as the use of established sequencing platforms, including:
• Illumina/Solexa Genome Analyzer
• Roche/454 Genome Sequencer
• Applied Biosystems SOLiD
• Helicos and Pacific Biosciences
HiSeq 2000
5500xl SOLiD System
MiSeq
Ion Personal Genome Machine
Genome Sequencer FLX System
GS Junior
HeliScope Single Molecule Sequencer
PacBio RS
Jay Flatley Greg Lucier Stephen Quake
Jim Watson Craig Venter
John West
Former Illumina CEO
Founder of Helicos
Life Technologies CEO
Illumina CEO
?
BGI: 1 x 454, 27 x SOLiD3/4, 128 x Illumina HiSeq
Broad Institute: 94 x Illumina GA2, 10 x 454, 8 x SOLiD3/4, 1 x Heliscope, 1 x Polonator, 1 x PacBio
Next Generation Genomics: World Map of High-throughput Sequencers
http://pathogenomics.bham.ac.uk/hts/
GMI at Seoul National University College of Medicine: 10 x Illumina GA2
Macrogen: 10 x Illumina GA2, 1 x 454, 2 x SOLiD3/4
NICEM: Illumina GA2, 454
Gachon University of Medicine and Science: Illumina GA2, 2 x SOLiD 3/4
KRIBB: 1 x Illumina GA2
• Next-next....-generation: how many ‘next’s are there?
• First Generation: automated version of Sanger sequencing (DNA-sequencing method invented by Fred Sanger in the 1970s)
• Second Generation
• Roche/454 sequencing machine from 454 Life Sciences (2005)
• 450 bases per read / $0.02 per 1000 bases / 2 days per Gb
• Solexa from Illumina (2006)
• 75 bases per read / $0.01 per 1000 bases / 0.5 days per Gb
• SOLiD from Applied Biosystems (2006)
• 50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
• Next-Next-Gen - Third Generation?
• HiSeq 2000 from Illumina - 0.04 days per Gb
• Helicos HeliScope
• Pacific Biosciences SMRT
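To put the per-platform figures above in perspective, the following Python sketch scales them to a whole-genome run. The cost and throughput numbers are copied from the slide; the 3 Gb genome size and 30x coverage are illustrative assumptions, and the function names are hypothetical.

```python
# Per-platform figures copied from the slide:
# (read length in bases, cost in USD per 1000 bases, days per Gb).
PLATFORMS = {
    "Roche/454": (450, 0.02, 2.0),
    "Illumina/Solexa": (75, 0.01, 0.5),
    "SOLiD": (50, 0.001, 0.5),
}


def run_estimate(total_gb, cost_per_kb, days_per_gb):
    """Return (cost in USD, run time in days) for total_gb gigabases."""
    total_bases = total_gb * 1e9
    cost = total_bases / 1000 * cost_per_kb
    days = total_gb * days_per_gb
    return cost, days


def estimate_all(genome_gb=3.0, coverage=30):
    """Estimate cost and time per platform.

    genome_gb and coverage are assumptions for illustration,
    not values taken from the slide.
    """
    total_gb = genome_gb * coverage
    return {
        name: run_estimate(total_gb, cost_per_kb, days_per_gb)
        for name, (_read_len, cost_per_kb, days_per_gb) in PLATFORMS.items()
    }


if __name__ == "__main__":
    for name, (cost, days) in estimate_all().items():
        print(f"{name}: about ${cost:,.0f} and {days:.0f} days")
```

The estimate is linear in the number of bases and ignores library preparation, instrument parallelism and failed runs, so it is only a rough way to compare the quoted figures.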
Sequencing technologies
Shendure & Ji, 2008
Michael L. Metzker, 2010
Sequencing technologies: Feature generation
Sequencing technologies: Sequencing by synthesis
Michael L. Metzker, 2010
• Sequencing
• How deep?
• Single, Paired read or both
• Alignment
• Reference, assembly or both
• Experiment-specific analysis
• A ‘one-size-fits-all’ program does not exist
NGS typical procedure
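The "How deep?" question is usually answered with the mean-coverage relation C = N x L / G (number of reads times read length, divided by genome size). The sketch below is a minimal illustration of that arithmetic; the function names are hypothetical, and for paired reads each mate is assumed to count as one read.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Mean depth of coverage: C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Assumed example: 3 Gb genome, 75-base reads, 30x target depth.
    n = reads_needed(30, 75, 3_000_000_000)
    print(n, mean_coverage(n, 75, 3_000_000_000))
```

This gives an average only. Real coverage is uneven, so projects typically aim above the minimum depth their analysis requires.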
• Sequence assembly
• Whole Genome Assembly (Reference, De novo)
• Transcriptome Assembly
• Short Sequence Alignment
• Single read
• Paired read
• Genomic Variation Detection
• Detection of Single Nucleotide Polymorphisms (SNPs)
• Detection of Alternative Splicing Events
• Detection of major/minor transcript isoforms
Applications
Shendure & Ji, 2008
Applications
Bioinformatics tools
Shendure & Ji, 2008
• Sequence Reads
• fastq
• fasta
• Alignment
• Sequence Alignment Map (SAM)
• BAM (Binary Alignment Map)
• Variation
• VCF (Variant Call Format)
File Format
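As an illustration of the read formats listed above, here is a minimal FASTQ reader in Python. It is a sketch that assumes the common four-lines-per-record layout and Phred+33 quality encoding; some Illumina pipelines of this period used a +64 offset instead, so the offset is a parameter. The function name and the example record are made up for illustration.

```python
import io


def read_fastq(handle, phred_offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream.

    Assumes no line wrapping inside a record; phred_offset=33 is an assumption
    (Sanger-style encoding) and may need to be 64 for older Illumina data.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].strip(), seq, [ord(c) - phred_offset for c in qual]


EXAMPLE = "@read1\nACGT\n+\nII5!\n"

if __name__ == "__main__":
    for name, seq, quals in read_fastq(io.StringIO(EXAMPLE)):
        print(name, seq, quals)
```

For production work an established parser is a safer choice than a hand-written one; this version only shows how the four lines of a record relate to each other.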
Data: Sequence Reads
A challenge call for a new compression algorithm
Compression of genomic sequences in FASTQ format
Sebastian Deorowicz et al., 2011
Data: Sequence Reads
Compression tool   Compression time   Size
gzip               14s                28M
bzip2              9.75s              23M
dsrc               1.36s              21M
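A comparison like the one in the table can be reproduced in outline with Python's standard library. The sketch below times gzip and bzip2 on synthetic FASTQ data; it does not include DSRC, which is a separate tool, and the synthetic reads are an assumption that will not compress like real data, so the numbers will differ from those on the slide.

```python
import bz2
import gzip
import random
import time


def synthetic_fastq(num_reads=2000, read_length=50, seed=0):
    """Build random FASTQ text as bytes (illustrative data, not real reads)."""
    rng = random.Random(seed)
    records = []
    for i in range(num_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_length))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_length))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data):
    """Return {tool: (seconds, compressed size in bytes)} for gzip and bzip2."""
    results = {}
    for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        start = time.perf_counter()
        packed = compress(data)
        results[name] = (time.perf_counter() - start, len(packed))
    return results


if __name__ == "__main__":
    data = synthetic_fastq()
    print(f"raw: {len(data)} bytes")
    for name, (seconds, size) in benchmark(data).items():
        print(f"{name}: {seconds:.3f}s, {size} bytes")
```

General-purpose compressors treat the file as plain text. Format-aware tools can do better because they can model read names, bases and quality strings separately.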
• ChIP-Seq
• assays where, and how strongly, a protein binds to DNA, for example a transcription factor bound at the start site of a gene, or histones of a certain type
• RNA-Seq
• Mapping transcription start sites
• Characterization of alternative splicing patterns
• Gene fusion detection
• Estimation of the abundance of the transcripts from their depth of coverage in the mapping
Example of Applications
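Estimating transcript abundance from depth of coverage needs a correction for transcript length and sequencing depth. One widely used measure is RPKM (reads per kilobase of transcript per million mapped reads). The slides do not name a specific measure, so using RPKM here is an assumption, and the function names and example counts are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def rpkm_table(counts, lengths, total_mapped_reads):
    """Apply rpkm() to every transcript in counts (dict: name -> read count)."""
    return {
        name: rpkm(count, lengths[name], total_mapped_reads)
        for name, count in counts.items()
    }


if __name__ == "__main__":
    # Made-up example: 10 million mapped reads in total.
    counts = {"tx1": 500, "tx2": 500}
    lengths = {"tx1": 2000, "tx2": 4000}
    print(rpkm_table(counts, lengths, 10_000_000))
```

With equal read counts, the shorter transcript gets the higher value, which is the point of the length correction.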
ChIP-Seq
Barski A & Zhao K, 2009
Chromatin immunoprecipitation (ChIP)
Kharchenko et al., 2008
Shirley Pepke et al., 2009
ChIP-Seq
Shirley Pepke et al., 2009
ChIP-Seq Software packages
Shirley Pepke et al., 2009
RNA-Seq
Zhong Wang, 2009
RNA-Seq (De novo transcriptome assembly)
RNA-Seq(Transcriptome resequencing)
RNA-Seq
RNA-Seq mapping of short reads in exon-exon junctions
RNA-Seq mapping of short reads over exon-exon junctions: depending on where each end maps, the read can be classified as a trans or a cis event.
from wikipedia.org
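The classification in the caption can be sketched as a lookup of where each end of a junction-spanning read maps. The rule used below (both ends in the same gene is cis, ends in different genes is trans) is an assumption made for illustration, and the gene annotation and function names are hypothetical.

```python
# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 1000, 5000),
    "geneB": ("chr1", 8000, 12000),
    "geneC": ("chr2", 500, 3000),
}


def gene_at(chrom, pos, genes=GENES):
    """Return the name of the gene containing the position, or None."""
    for name, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            return name
    return None


def classify_junction(end1, end2, genes=GENES):
    """Classify a split read from the (chrom, pos) of its two mapped ends.

    Assumed rule: same gene -> "cis", different genes -> "trans".
    """
    gene1 = gene_at(end1[0], end1[1], genes)
    gene2 = gene_at(end2[0], end2[1], genes)
    if gene1 is None or gene2 is None:
        return "unannotated"
    return "cis" if gene1 == gene2 else "trans"


if __name__ == "__main__":
    print(classify_junction(("chr1", 1200), ("chr1", 4800)))
    print(classify_junction(("chr1", 1200), ("chr2", 600)))
```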
RNA-Seq Software packages
Shirley Pepke et al., 2009
• Genes in DNA being transcribed into RNA
• might be spliced
• transported to an appropriate cellular compartment
• translated into proteins
• Regulated at many levels
• DNA methylation
• chromatin modification
• binding of transcription factors to the DNA
• binding of splicing factors to the RNA and RNA transport
DNA encodes heritable traits
• What types of genomic data sets are available?
• Why perform integrative genomic analysis?
• Approaches to an integrative analysis
• Using large-scale data sets for integrative analysis
• Future perspectives
NGG (Next-generation genomics): an integrative approach
• Sequence variation data
• SNP genotyping arrays
• resequencing
• Transcriptomic data
• RNA-Seq
• identify transcripts arising from gene fusion events
• detect novel classes of non-coding RNAs
• Epigenomic data
• Bisulphite treatment
• Chromatin immunoprecipitation
• Interactome data
• RNA-protein interaction
• protein-protein interaction networks
• define genetic and signaling pathways
What types of genomic data sets are available?
• Annotating functional features of the genome
• Inferring the function of genetic variants
• Understanding mechanisms of gene regulation
Figure 1 | Annotating the genome through detecting transcription-factor binding sites and histone-modification states.
Why perform integrative genomic analysis?
Figure 2 | Identification of regulatory SNPs
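One simple way to look for candidate regulatory SNPs is to intersect SNP positions with transcription-factor binding regions from ChIP-seq. The Python sketch below illustrates that overlap step only. It assumes the peaks on each chromosome do not overlap one another, and the SNP identifiers, coordinates and function name are invented for the example.

```python
import bisect


def snps_in_peaks(snps, peaks):
    """Return SNPs that fall inside a binding peak.

    snps:  list of (snp_id, chrom, pos)
    peaks: list of (chrom, start, end), inclusive, assumed non-overlapping
    Result: list of (snp_id, chrom, peak_start, peak_end)
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    hits = []
    for snp_id, chrom, pos in snps:
        intervals = by_chrom.get(chrom, [])
        # Index of the last peak whose start is <= pos.
        i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            hits.append((snp_id, chrom, intervals[i][0], intervals[i][1]))
    return hits


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
    snps = [("snpA", "chr1", 150), ("snpB", "chr1", 300), ("snpC", "chr2", 80)]
    print(snps_in_peaks(snps, peaks))
```

An overlap alone does not show that a SNP is regulatory. It only narrows the list to variants worth testing, for example against allele-specific binding or expression data.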
Approaches to an integrative analysis
• Data complexity reduction
• summarize each experiment as a collection of genomic regions with strong enrichment of signal
• especially important to inspect at least some of the results by eye
• Unsupervised integration
• the goal is not to find some correct answer but to discover structure within the data set
• Clustering: partitioning a large data set into easily digestible, conceptual pieces
• Supervised integration
• a technique that learns how to make predictions from example inputs and outputs
• Bayesian network
Approaches to an integrative analysis
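The data complexity reduction step, summarizing an experiment as a set of strongly enriched regions, can be sketched as follows. This is a deliberately simple fold-enrichment threshold over fixed-size bins, not the statistical model that dedicated peak callers use. The bin size, threshold, pseudocount and function name are all assumptions.

```python
def enriched_regions(chip_counts, control_counts, bin_size=200,
                     min_fold=4.0, pseudocount=1.0):
    """Merge consecutive enriched bins into (start, end) regions.

    chip_counts / control_counts: read counts per bin along one chromosome.
    A bin is called enriched when (chip + pseudocount) / (control + pseudocount)
    is at least min_fold. Coordinates are 0-based, end-exclusive.
    """
    if len(chip_counts) != len(control_counts):
        raise ValueError("chip and control must have the same number of bins")

    regions = []
    start = None  # index of the first bin of the region being extended
    for i, (chip, ctrl) in enumerate(zip(chip_counts, control_counts)):
        fold = (chip + pseudocount) / (ctrl + pseudocount)
        if fold >= min_fold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:
        regions.append((start * bin_size, len(chip_counts) * bin_size))
    return regions


if __name__ == "__main__":
    chip = [1, 2, 30, 40, 3, 1, 50]
    control = [1, 2, 3, 4, 3, 1, 2]
    print(enriched_regions(chip, control))
```

The resulting region list is small enough to load into a genome browser, which makes it practical to inspect some of the calls by eye.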
an intronic H3K4me1 peak predicts an enhancer element
Promoter
Transcribed
UCSC browser with ENCODE data
Using large-scale data sets for integrative analysis
• For the bench scientist
• open-source web browser, such as Firefox
• add-ons: gatekeepers
Using large-scale data sets for integrative analysis
• For the bench scientist
• stand-alone analytical system: CisGenome
• genome browser: UCSC browser, Anno-J
Figure 4 | Flow chart for data analysis
Workflow for ChIP-seq analysis
Galaxy
UCSC browser
Online or stand-alone tools
Using large-scale data sets for integrative analysis
• Bioinformatics hurdles
• normalized data
Future perspectives
• Data integration itself is not an end
• designed to generate novel hypotheses and help to test them
• Community-wide effort, akin to Wikipedia
• Searchable with Google-like capabilities
Future perspectives
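Toby Segaran: designed algorithms at Genstruct for understanding drug mechanisms of action
Satnam Alag: vice president of engineering at NextBio, which develops a vertical search engine for the life sciences community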