Advances in Cancer Genomics (Science Webinar Series, 30 April, 2009)
TRANSCRIPT
Sponsored by:
Participating Experts:
Sean Grimmond, Ph.D., Institute for Molecular Bioscience, University of Queensland, Australia
David Wheeler, Ph.D., Baylor College of Medicine, Houston, Texas
John McPherson, Ph.D., Ontario Institute for Cancer Research, Toronto, Canada
Science Webinar Series: Advances in Cancer Genomics, 30 April, 2009
Brought to you by the Science/AAAS Business Office
Studying cancer transcriptomes at single nucleotide resolution
Expression Genomics Laboratory
http://www.expressiongenomics.org
Sean Grimmond, April 30th, 2009
Cancer Transcriptomics:
Over the last decade, transcriptomics has revolutionized our ability to capture the genes and pathways driving biological processes and pathological states.
Cancer transcriptomics is moving to massive-scale, sequence-based analyses for surveying (i) locus activity, (ii) transcript-specific expression, and (iii) sequence content.
Comparison of RNAseq & array-based gene expression profiling:
SQRL profiling is quantitative (SOLiD vs. Illumina array)
[Figure: microarray profiling and RNAseq profiling, EB vs. ES]
Red: >2x up regulated in EB; Green: <2x down regulated in ES; Grey: marginal detection (<0.95 detection score for Illumina or <50 tags for SQRL)
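The colour legend for the RNAseq panel can be read as a small decision rule on tag counts. The following is a minimal sketch, not the lab's actual pipeline: it assumes the tag counts are already normalized between the two libraries, that "marginal detection" means fewer than 50 tags in both samples, and that the 2x threshold applies to the EB/ES ratio. The function name and the pseudocount are hypothetical.

```python
def classify_gene(eb_tags, es_tags, min_tags=50, fold=2.0):
    """Assign a plot colour to one gene from its tag counts in EB and ES."""
    # Grey: marginal detection (assumed: below the tag cutoff in both samples).
    if eb_tags < min_tags and es_tags < min_tags:
        return "grey"
    # A pseudocount of 1 avoids division by zero for silent genes.
    ratio = (eb_tags + 1) / (es_tags + 1)
    if ratio > fold:
        return "red"        # more than 2x higher in EB
    if ratio < 1 / fold:
        return "green"      # more than 2x lower in EB
    return "unchanged"


# Hypothetical genes and counts, for illustration only.
for gene, eb, es in [("geneA", 400, 100), ("geneB", 100, 400), ("geneC", 10, 30)]:
    print(gene, classify_gene(eb, es))
```

Because the rule works directly on counts, the same function could be applied to exon-level or junction-level tags as well as to whole loci.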
Defining transcript-specific expression by “diagnostic” features:
Survey exon activity
Survey exon junction usage
Complex transcriptional output from VEGFR1
Canonical ORF & mRNA (black arrow)
Secreted decoy receptor 1 (common)
Secreted decoy receptor 2 (rare)
Surveying known and novel transcript expression:
Known complexity
• Alternative splicing
• Alternative promoter usage
• 3’UTR switching
Theoretical complexity
• Novel alternative splicing
• Detection of gene fusions
Transcriptome discovery
• Novel alternative splicing
• Detection of gene fusions
Screening for expressed SNPs, mutations, RNA editing
Align, ID and QC, call SNPs
Map to genome
Determine if SNP is in dbSNP
ORF, UTR, Syn/Non
Rank SNPs (PolyPhen, CanPredict)
Variants expressed relative to the reference genome
Reference      ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Aligned reads  ACGATATTACACGTACATTCAAATCGT
(all tags)     ACGATATTACACGTACATTCAACTCGT
               ACGATATTACACGCACATTCAAGTCGT
                CGATATTACACGTACATTCAAGTCGTT
                  ATATTTCACGTACATTCAAGTCGTTCG
                  ATATTAAACGTACATTCAAGTCGTTCG
                    ATTACACGTACATTCAAGTCGTTCGGA
                    ATTACACGTACATTCACGTCGTTCGGA
                        CACGTACATTCAAGTCGTTCGGAACCT
SNP call       -----------------T------------------
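The aligned-reads figure on this slide can be turned into a small pileup example. This is a minimal sketch of consensus-style SNP calling, not the pipeline used in the talk: the read offsets are inferred by matching each read to the reference, and the thresholds (`min_count`, `min_fraction`) are illustrative assumptions that filter out the isolated single-read mismatches.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (offset into the reference, read sequence); offsets are inferred, not stated on the slide.
READS = [
    (0, "ACGATATTACACGTACATTCAAATCGT"),
    (0, "ACGATATTACACGTACATTCAACTCGT"),
    (0, "ACGATATTACACGCACATTCAAGTCGT"),
    (1, "CGATATTACACGTACATTCAAGTCGTT"),
    (3, "ATATTTCACGTACATTCAAGTCGTTCG"),
    (3, "ATATTAAACGTACATTCAAGTCGTTCG"),
    (5, "ATTACACGTACATTCAAGTCGTTCGGA"),
    (5, "ATTACACGTACATTCACGTCGTTCGGA"),
    (9, "CACGTACATTCAAGTCGTTCGGAACCT"),
]


def call_snps(reference, reads, min_count=2, min_fraction=0.5):
    """Return (position, ref_base, alt_base, alt_reads, depth) for each called SNP."""
    # One base counter per reference position.
    pileup = [Counter() for _ in reference]
    for offset, seq in reads:
        for i, base in enumerate(seq):
            pileup[offset + i][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        for base, n in counts.items():
            # Require a non-reference base seen often enough to rule out single-read errors.
            if base != reference[pos] and n >= min_count and n / depth >= min_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls


print(call_snps(REFERENCE, READS))
```

With these thresholds only the C to T change at 0-based position 17 is reported, matching the SNP call line on the slide; the scattered mismatches seen in single reads are ignored.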
MPP6 (W-260-stop), p55 MAGUK family member: tumour suppressor
Profiling the small RNA Transcriptome:
[Figure: miR17-5p and the genes shown with it]
APBB2, APP, BCL2L11, CCND1, CCND2, CCNG2, CDKN1A, CRK, CUL3, DMTF1, E2F1, E2F3, E2F5, EREG, FOXO1A, GAB1, HAS2, HIF1A, IRF1, KHDRBS1, KPNA2, MAP3K8, MAPK9, MYCN, NCOA3, NR4A3, PCAF, PDGFRA, PKD1, PKD2, PPARA, RB1, RBBP7, RBL1, RBL2, STAT3, TP53INP1, TSG101, TXNIP, WEE1
Cancer Transcriptomics:
WTseq is a powerful tool for monitoring gene activity and transcript-specific expression, and for transcript discovery.
WTseq can also be used to study the sequence content of RNAs. This allows one to study expressed mutations, RNA editing events, and allele-specific expression.
Sequence-based transcriptomics can also be applied to the small RNA fraction to perform similar studies in microRNAs.
Nicole Cloonan, Gabe Kolle, Brooke Gardiner, Geoff Faulkner, Darrin Taylor, Eshan Nourbakhsh, Keerthana Krishna, Shivangi Wani, Alan Robertson, David Tang, Christina Xu, Yunshan Xiao, Megan Vardy [Al Forrest, Graham Bethel, Tina Maguire].
Kevin McKernan, Gina Costa, Catalin Barbacioru, Scott Kuersten, Jian Gu
Cancer Genomics: Impact of Next-Generation Sequencing Platforms
John D. McPherson, Ph.D.
Director, Cancer Genomics
Senior Principal Investigator
Ontario Institute for Cancer Research
April 30 2009
www.oicr.on.ca
Prevention Ontario Cancer Cohort
Early Diagnosis
One Millimetre Cancer Challenge
Cancer Stem Cells
International Cancer Genome Consortium
Selective Agents (Terry Fox Research Institute - Ontario Node)
Immuno- and Bio-therapeutics
New Therapeutics
Imaging and Interventions
Bio-repositories and Pathology
Genomics and High Throughput Screening
Medicinal Chemistry
Cancer Care and Services (including Health Promotion)
Informatics and Bio-computing
Innovation Platforms
Patents to Products
High Impact Clinical Trials
Themes Innovation Programs
Translation Programs
Cancer Targets
Ontario Institute for Cancer Research
www.oicr.on.ca
Advantages of Next-Gen Platforms
• No sub-cloning, no need for a bacterial host.
  – less cloning bias
  – bulk libraries
• Vast improvements in amounts of data generated.
  – quantification is possible through “counting” of “unique” reads
  – enhanced dynamic range
  – detection of rare variants
• Readily adapted to a variety of applications.
  – genome, transcriptome, epigenome
• Dramatic decrease in cost and increase in speed of data generation.
  – Huge amounts of data per run
Next (Now)-generation sequencers
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
• AB/SOLiDv3, Illumina/GAII, Helicos short-read sequencers (10+ Gb in 30, 50-100 bp reads, >100M reads, 7-10 days)
• 454 GS FLX pyrosequencer (100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours)
• ABI capillary sequencer (0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours)
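The per-run yields quoted for each platform follow from the number of reads multiplied by the read length. The sketch below only reproduces that arithmetic with the figures from the slide; the function name is hypothetical.

```python
def yield_mb(n_reads, read_length_bp):
    """Bases per machine run, in megabases: number of reads times read length."""
    return n_reads * read_length_bp / 1e6


# ABI capillary sequencer: 96 reads of 450-800 bp per run.
print(yield_mb(96, 450), yield_mb(96, 800))

# Short-read sequencer: 100 million reads of 100 bp, reported in gigabases.
print(yield_mb(100_000_000, 100) / 1000)
```

The capillary figures round to the 0.04-0.08 Mb range on the slide, and 100 million reads of 100 bp give the 10 Gb quoted for the short-read platforms.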
More reads or longer reads?
Increasing throughput
Increasing read length
$$$
$
$$
OICR Cancer Genomics Platform ~800 billion bases/month
ACGT…
1.2 PB storage
1600 cores
Matching applications to platforms: read mapping, variant detection, PE, MP …
21 flow cells
Next-Gen Applications at OICR
• Whole genome sequencing
• Targeted genomic sequencing
• Structural variation
  – Rearrangements, copy number
• SNP/indel discovery
• Copy number variation
  – Microarray and beadstation still excellent options
• Whole transcriptome sequencing
• Small RNA discovery/sequencing
• Epigenomics
  – Chromatin IP transcription factor binding (ChIP-seq)
  – Nucleosome positioning
Structural variants
• Mate-pair and paired-end reads can be used to detect structural variants
Mate-Pairs (genomic DNA, 1 - 20 kb):
  – Fragmentation & circularization to an internal adaptor
  – Shear
  – Isolate internal adaptors and fragment ends
  – Add amplification and sequencing adaptors
  – Sequence
Paired-Ends (genomic DNA, 200 - 500 bp):
  – Fragmentation
  – Add amplification and sequencing adaptors
  – Sequence
Mapping of read pairs to reference
Clusters of aberrantly aligned read pairs
• Spanning unexpected distance
• Unexpected orientation
[Figure: histogram of fragment number against fragment size; diagrams of sequenced (Seq) versus mapped (Map) read pairs relative to the reference for concordant pairs, insertion, deletion (del), inversion (inv), and translocation between ChrA and ChrB]
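The two signals named on this slide, unexpected distance and unexpected orientation, can be expressed as a simple classifier for one read pair. This is a minimal sketch under stated assumptions: a paired-end library in which concordant pairs map forward/reverse on the same chromosome, positions that mark the outer ends of the fragment, and an expected fragment size of 200 to 500 bp. The function name and labels are hypothetical, and real callers look for clusters of such pairs rather than single pairs.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  min_size=200, max_size=500):
    """Label one paired-end read pair by how it maps to the reference."""
    if chrom1 != chrom2:
        return "translocation"          # mates map to different chromosomes
    if strand1 == strand2:
        return "inversion"              # mates map in the same orientation
    # Order the mates by position so the leftmost read comes first.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if (left_strand, right_strand) != ("+", "-"):
        return "unexpected orientation"  # outward-facing mates
    span = right_pos - left_pos
    if span > max_size:
        return "deletion"               # mates map further apart than the library size
    if span < min_size:
        return "insertion"              # mates map closer together than the library size
    return "concordant"


print(classify_pair("chrA", 1000, "+", "chrA", 1350, "-"))
print(classify_pair("chrA", 1000, "+", "chrA", 5000, "-"))
print(classify_pair("chrA", 1000, "+", "chrB", 1350, "-"))
```

For mate-pair libraries the expected size range and orientation differ, so `min_size`, `max_size`, and the orientation check would need to be adjusted.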
Direct Selection (M. Lovett et al. 1991) “Direct selection: a method for the isolation of cDNAs encoded by large genomic regions”
• “Hybrid selection”; “Genome partitioning”
• Solid support capture
  – NimbleGen (Roche), Agilent
• In-solution capture oligos
  – Agilent
• Regional or entire exome
[Figure labels: Sheared DNA; microarray; Elute and sequence]
NimbleGen Sequence Capture of a 600kb region
[Figure tracks: read depth, targets, repeats, %GC, oligo (Tm)]
Agilent SureSelect Capture of exon targets
[Figure tracks: read depth, exons, repeats]
www.opengenomics.com
Epigenomics
• ChIP-seq
  – Histone modifications
  – DNA binding sites
• Methylation
  – Genome-wide analyses
  – Correlation with expression studies
[Figure labels: Modified histones in chromatin; DNA fragments linked to nucleosomes; Immunoprecipitation of modified histones; Isolation of DNA fragments and ligation of adaptors]
Epigenomics
[Figure: read depth over genes, State 1 vs. State 2]
International Cancer Genome Consortium
• To obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe.
• Every cancer genome project should state a clear rationale for its choice of sample size, in terms of the desired sensitivity to detect mutations. The target number of 500 samples per tumor type/subtype is set as a minimum, pending further information to be provided by ICGC members proposing to tackle specific cancer types/subtypes.
“50 different tumor types and/or subtypes”
“500 samples per tumor”
50,000 Human Genome Projects
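The figure of 50,000 can be reproduced from the two quoted targets. The sketch below assumes, and this is an assumption rather than something stated on the slide, that each sample contributes two genomes, a tumor and a matched normal. The function and parameter names are hypothetical.

```python
def total_genomes(tumor_types=50, samples_per_type=500, genomes_per_sample=2):
    """Genomes to sequence; genomes_per_sample=2 assumes tumor plus matched normal."""
    return tumor_types * samples_per_type * genomes_per_sample


print(total_genomes())
```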
www.icgc.org
International Cancer Genome Consortium: World Map of Comprehensive Cancer Genome Projects
Ontario/Canada: Pancreas
US: GBM, Ovary & Lung
Japan: Liver
Spain: CLL
India: Oral Cavity
UK: Breast
France: Liver & Breast
China: Stomach
Australia: Pancreas
EU: TBD
Legend: TCGA Pilot Projects; ICGC Cancer Genome Projects
www.icgc.org
Data analysis
People to thank
• Cancer Genomics
  – John McPherson
  – Tom Hudson
  – Kamran Shazand
  – Johar Ali
  – Vanya Peltekova
  – Philip Zuzarte
  – Michelle Sam
  – April Cockburn
  – Ada Wong
  – Lee Timms
  – Tanja Durbic
  – David D’Souza
  – Stacey Quinn
  – Melissa Bernard
• Informatics and Biocomputing
  – Lincoln Stein
  – Francis Ouellette
  – Arek Kasprzyk
  – Vincent Ferretti
  – Mathieu Lemire
  – Tim Beck
  – Quang Trinh
  – Michelle Chan‐Seng‐Yue
  – Richard De Borja
  – Dave Sutton
  – Greg Whynott
  – Tim Brown
  – Victor Gu
• DCC
  – Christina Yung
  – Jianxin Wang
  – Junjun Zhang
• OICR Faculty
  – Nizar Batada
  – Lakshmi Muthuswamy
• OICR Fellow
  – Paul Boutros
• ICGC
  – Jennifer Jennings
  – Vanessa Ballin
www.oicr.on.ca
The Cancer Genome using Next-generation Sequencing Technology
David A. Wheeler, Ph.D.
Director, Bioinformatics and Cancer Genomics
Human Genome Sequencing Center
Cost of a genome
Sequencer          Date  HGSC Sequencing Capacity (billions)  Human Genomes per Year  Cost per genome
First Generation   2003  16.2                                 0.04                    $3,000,000,000
                   2004  21                                   0.05                    $250,000,000
                   2005  30                                   0.07                    $100,000,000
                   2006  38                                   0.08                    $25,000,000
Second Generation  2007  240                                  5                       $2,000,000
                   2008  2,040                                45                      $350,000
                   2009  3,660                                81                      $100,000
Third Generation   2010  7,200                                160                     $10,000
                   ?     14,000                               311                     $1,000
DNA Sequencing in Cancer
• Somatic mutation in DNA
  – Scale of variation: single base to whole chromosome
  – Variety of next-generation instruments
• Epigenetic changes in DNA
  – ChIP-seq
  – reduced representation
  – whole genome
• Expression
  – RNA abundance
  – Splice variants
    • aberrant splicing
    • fusion transcripts
Short read mapping software
• Public-domain
  – MAQ
  – MOSAIK
  – SOAP
  – Bowtie
  – TopHat (RNA-seq: splice junctions)
• Corporate
  – Corona Lite (AB/SOLiD)
  – Mapper (454)
  – ELAND (Illumina)
see also: http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
Size Scale of Somatic Variation
[Figure: type of event (SNPs, insertion-deletion (indel), minisatellite, focal amplification, CNA, translocation, aneuploidy) against length in bases (1 to 10^8), with the sequence data used for each: reads, paired-end (p.e.) reads, or coverage (cov)]
Copy Number Alteration by Sequence Coverage
• count number of reads per unit length of DNA
• compare tumor and normal tissue from same patient
[Figure: Amplification of 7p11.2 (EGFR); Deletion at 9p21.3 (CDKN2A)]
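The two bullets on this slide, counting reads per unit length and comparing tumor with normal, can be sketched as a binned log2 ratio. This is an illustrative sketch rather than the HGSC method: the bin size, the pseudocount, the library-size scaling, and the synthetic read positions are all assumptions.

```python
import math
from collections import Counter


def bin_counts(read_starts, bin_size):
    """Count reads per fixed-width bin from their start coordinates."""
    return Counter(pos // bin_size for pos in read_starts)


def log2_ratios(tumor_starts, normal_starts, bin_size, pseudocount=0.5):
    """Per-bin log2(tumor / normal) after scaling tumor to the normal library size."""
    tumor = bin_counts(tumor_starts, bin_size)
    normal = bin_counts(normal_starts, bin_size)
    scale = len(normal_starts) / len(tumor_starts)
    return {
        b: math.log2((tumor[b] * scale + pseudocount) / (normal[b] + pseudocount))
        for b in sorted(set(tumor) | set(normal))
    }


# Synthetic positions: normal has 10 reads in each of three 100 bp bins;
# tumor has 10, 20 and 0 reads in the same bins.
normal_reads = list(range(0, 300, 10))
tumor_reads = list(range(0, 100, 10)) + list(range(100, 200, 5))

ratios = log2_ratios(tumor_reads, normal_reads, bin_size=100)
for b, ratio in ratios.items():
    print(b, round(ratio, 2))
```

A ratio near zero suggests no change, a clearly positive ratio suggests amplification, and a strongly negative ratio suggests deletion. In practice, bins would be much larger and adjacent bins would be segmented before calling events.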
Mutation Validation
Discovery (Tumor – Normal)
Putative Mutation List
Quality Control
• Sanger/PCR (Auto + visual)
• Biotage (Single base variants)
• 454 (Single base variants, indels)
• PCR Gel-Sizing (large rearrangements)
Quality Assurance
Released Mutation List
Multi-Platform Sequencing Strategy
WGS Sequencing: Sequence Twice
• SOLiD (20-30X coverage)
  – Align Reads: Corona Lite
  – SNP discovery: AB SNP Caller
• 454 (6-10X coverage)
  – Align Reads: BLAT/Crossmatch, Mosaik
  – SNP discovery: HGSC AtlasSNP
Validation
SOLiD: mapping, mutations and validation pipeline
[Figure: pipeline diagram. Labels: Tumor Reads; Normal Reads; Corona Lite mapping & variant detection; tumor variants; normal variants; probe list; e-Genotyping; 454 tumor reads; 454 normal reads; valid tumor var; valid normal var; Valid Mutations; 15X Cov; 12X Cov]
Single base variation (SOLiD platform)
• 4.4 million variants (low stringency)
• 2.2 million variants (high stringency)
  – Allele must be seen at least 2X
eGenotyping Validation
• 5,385 somatic mutations
• 105 missense mutations
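The high-stringency rule (allele seen at least 2X) and the tumor-minus-normal discovery step can be sketched as set operations. This is a simplified illustration, not the e-Genotyping implementation: variants are represented as hypothetical (chromosome, position, reference, alternate) tuples with the number of supporting reads, and any observation of a variant in the normal sample is taken to rule it out as somatic.

```python
def high_stringency(allele_counts, min_reads=2):
    """Keep variants whose allele was seen in at least min_reads reads."""
    return {variant for variant, n in allele_counts.items() if n >= min_reads}


def somatic_candidates(tumor_counts, normal_counts, min_reads=2):
    """Variants passing the stringency filter in tumor and never seen in normal."""
    return high_stringency(tumor_counts, min_reads) - set(normal_counts)


# Hypothetical variant calls, for illustration only.
tumor = {
    ("chr7", 100, "C", "T"): 5,   # well supported, tumor only
    ("chr9", 200, "G", "A"): 1,   # seen once: fails the 2X rule
    ("chr1", 300, "A", "G"): 4,   # also present in normal: germline
}
normal = {("chr1", 300, "A", "G"): 3}

print(somatic_candidates(tumor, normal))
```

Applying the stringency filter only to the tumor, while excluding anything seen in normal at any depth, is a deliberately conservative choice in this sketch.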
GBM missense mutations
• 7 with possible cancer connection
  – Growth factors, tumor suppressors, cell proliferation

Gene     Ref  Var  Codon  Gene Name
HDGF2    Arg  Trp  CGG    Hepatoma-derived growth factor-related protein 2
PALLD    Glu  Gln  GAA    Palladin, cytoskeletal associated protein
IL1B     Phe  Ser  TTT    Interleukin 1, beta
IL4I1    Ser  Ala  TCG    Interleukin 4 induced 1
SIPA1L1  Lys  Arg  AAA    Signal-induced proliferation associated 1 like 1
MUC16    Asn  Asp  AAT    Mucin 16, cell surface associated
DDX18    Ala  Thr  GCA    DEAD (Asp-Glu-Ala-Asp) box polypeptide 18
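Each row of the missense table can be sanity-checked against the standard genetic code: the listed codon should encode the reference amino acid, and a single base change should be able to produce the variant amino acid. The sketch below assumes the Codon column gives the reference codon; the helper name is hypothetical, and the three-letter lookup covers only the amino acids that appear in the table.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Only the residues used in the table above.
THREE_TO_ONE = {"Arg": "R", "Trp": "W", "Glu": "E", "Gln": "Q", "Phe": "F", "Ser": "S",
                "Ala": "A", "Lys": "K", "Asn": "N", "Asp": "D", "Thr": "T"}


def single_base_changes(codon, ref_aa, var_aa):
    """Mutant codons one base away from `codon` that encode `var_aa`."""
    if CODON_TABLE[codon] != THREE_TO_ONE[ref_aa]:
        raise ValueError(f"{codon} does not encode {ref_aa}")
    hits = []
    for i in range(3):
        for base in BASES:
            mutant = codon[:i] + base + codon[i + 1:]
            if mutant != codon and CODON_TABLE[mutant] == THREE_TO_ONE[var_aa]:
                hits.append(mutant)
    return hits


print(single_base_changes("CGG", "Arg", "Trp"))
print(single_base_changes("AAA", "Lys", "Arg"))
```

For every row in the table exactly one single-base substitution gives the listed variant residue, which is what one would expect for point mutations.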
Baylor HGSC Nimblegen Approach to Exome Sequencing
gDNA: Exon 1, Exon 2, Exon 3, Exon 4, Exon 5
Fragment and anneal to Nimblegen capture array
Elute
Sequencing
Analyze
Exon Sequences
Coverage Profile over Capture Target
[Figure: coverage (0 to 2,500,000) against position, running from -500 in the buffer, across the target (0 to 100), to 500 in the buffer on the other side]
Whole Exome Capture Chip
• Pancreatic Adenocarcinoma
  – SOLiD Single Slide Tumor and Normal
  – 3180 missense and nonsense mutations
  – 3 found in COSMIC
    • NF2, neurofibromin 2
    • PTCH1, patched homolog 1 (tumor suppressor)
    • HEY1, hairy/enhancer-of-split related with YRPW motif 1
Summary
• Deep sequence coverage by next-generation sequencing methods is accurately discovering mutations related to cancer
• Multi-platform approach yields rapid validation
• e-Genotyping rapidly and efficiently assesses raw sequencing data for known SNPs and mutations
Look out for more webinars in the series at:
www.sciencemag.org/webinar
For related information on this webinar topic, go to:
solid.appliedbiosystems.com
To provide feedback on this webinar, please e‐mail
your comments to [email protected]