TRANSCRIPT
Copyright © 2004 Synamatix sdn bhd (538481-U)
Please dial: +44 870 22 333 65
Pin: 444888 Please note that this is a UK number
Challenges of data management and analysis
from 2nd generation sequencing platforms
October 10 2006
Copyright © 2006 Synamatix sdn bhd (538481-U)
To listen to the webcast please dial: +44 870 22 333 65 Pin: 444888 Please note that this is a UK number
Presenters
Colin Hercus: 25 mer mapping
Zayed Albertyn: 100 mer & polony mapping
Introduction
Personal Genome and Personalised medicine
The Human Genome: 3 billion “pieces” in every cell
First Genome took 16 yrs
Cost US$3 billion
Late 2006-2007: New technologies emerging
Cost: US$1000, Time: 1 day!
Variety of approaches towards ULCS
Methods
Command line interface
CORE Database platform
SynaRex Bulk
SynaProbe Bulk
SynaSearch Bulk
SynaMer
SynaFrag
SXSequenceRefs
SXLRESearch
SXFuzzyPatternSearch
Sxpet
SXParse
Data analysis
Develop Tools
Another 20+ apps
Graphical Interface
How?
What do we know about data?
Similarity & association
Common PATTERNS and functionality
A T G C
A T G C A T G A A T……
AT TG GC CA AT TG GA AA AT
ATG TGC GCA CAT ATG TGA GAA AAT
ATGC TGCA GCAT CATG
ATGCA TGCAT
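The slide above breaks a sequence into overlapping patterns of increasing length. A minimal sketch of that decomposition follows; the function name `overlapping_patterns` is hypothetical and nothing here reflects how SynaBASE stores patterns internally.

```python
def overlapping_patterns(sequence, k):
    """Return every overlapping substring of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# The bases shown on the slide, without the trailing dots.
sequence = "ATGCATGAAT"

for k in range(2, 6):
    print(k, " ".join(overlapping_patterns(sequence, k)))
```

The slide lists only some of the longer patterns; the sketch prints all of them for each length.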
Q* logN base A
[Chart: Speed (milliseconds, 100 to 900) against Size of database (1, 10, 100, 1000) for Conventional and SynaBASE]
Case Study - Comparison of Human v Mouse genome
[Chart: run times for SynaBASE, BLAST and PatternHunter; values shown on the chart: 3 yrs, 6h, 22days]
Results
Read mapping
Variety of novel methods for genome sequencing
Shorter reads with higher coverage
25mers - Solexa
100-200mers - 454
Polony reads
Larger volumes of sequence data
Error rates much higher than Sanger method
Computationally intractable for conventional bioinformatics applications
Mapping 25 mers
SynaBASE API method SXSSASearch() can be used to rapidly map short oligos to a genome using un-gapped alignments
Suitable for finding substitution differences but not insert/delete differences
Gapped alignment of short oligos using a modified version of the SXSSASearch() method
SXSSASearch:
does not use heuristics and is guaranteed to find all matches to an oligo given the scoring matrix and a threshold
uses a weight matrix with position dependent scores for each base
Mapping
25 mers
# SXOligoSearch
# Thu Sep 14 16:22:07 2006
# $Id: SXOligoSearch.cpp,v 1.28 2006/07/17 07:31:57 Exp $
# SXOligoSearch chr22 dummy.txt
>Read-0:21200326
AAGTAGCCAAGAGCATGCCC
.........T.......... + chr22:21200327-21200346 20
>Read-1:21200835
GTCTCCACAAGAAAATACAA
.................... + chr22:21200836-21200855 20
>Read-2:21200982
TGTATTCTGCAGAACTGATA
...C..........G..... + chr22:21200983-21201002 20
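The idea of an exhaustive, un-gapped search with position-dependent scores and a threshold can be sketched as below. This is a brute-force illustration only: it is not SXSSASearch(), it simplifies the weight matrix to one mismatch penalty per read position, and the names `ungapped_matches`, `penalties` and `threshold` are hypothetical.

```python
def ungapped_matches(genome, oligo, penalties, threshold):
    """Scan every genome offset and return (start, penalty, mismatch_positions)
    for each un-gapped placement whose total penalty is <= threshold."""
    hits = []
    n = len(oligo)
    for start in range(len(genome) - n + 1):
        total = 0.0
        mismatches = []
        for i in range(n):
            if genome[start + i] != oligo[i]:
                total += penalties[i]
                mismatches.append(i)
                if total > threshold:
                    break  # this offset can no longer pass the threshold
        if total <= threshold:
            hits.append((start, total, mismatches))
    return hits


# Toy data (not from the slides): mismatches cost less towards the 3' end.
genome = "ACGTACGTTTGACCA"
oligo = "ACGTT"
penalties = [1.0, 1.0, 1.0, 0.5, 0.25]

for hit in ungapped_matches(genome, oligo, penalties, threshold=0.5):
    print(hit)
```

Because every offset is tested, no placement within the threshold is missed, which mirrors the "no heuristics" property claimed on the slide; a real index would reach the same answer without a linear scan.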
Very fast and flexible approach
Example: 350,000 reads can be mapped in 125 sec, about 3 per ms
Makes approach suitable for reads that have varying quality over their length
Mismatch penalty can be reduced towards the 3’ end of reads
[Chart: Quality or Probability of being correct (up to 1.0) against read position (0 to 25)]
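A reduced mismatch penalty towards the 3’ end can be expressed as a per-position penalty profile. The sketch below assumes a simple linear fall in quality from the 5’ end to the 3’ end; the end values 1.0 and 0.6 and the function name `penalty_profile` are illustrative assumptions, not figures from the presentation.

```python
def penalty_profile(read_length, start_quality=1.0, end_quality=0.6):
    """Return one mismatch penalty per read position, falling linearly from
    start_quality at the 5' end to end_quality at the 3' end."""
    if read_length == 1:
        return [start_quality]
    step = (start_quality - end_quality) / (read_length - 1)
    return [start_quality - step * i for i in range(read_length)]


profile = penalty_profile(25)
print(" ".join(f"{p:.2f}" for p in profile))
```

A mismatch at the last position then counts for less than one at the first, so a placement whose only mismatch lies in the low quality tail scores better.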
Mismatches and quality scores
If a read maps to 2 locations:
One with a mismatch in the low quality 3’ end and one with a mismatch near the 5’ end. The position of the mismatch and quality should be taken into account when selecting the best mapping and for SNP qualification
In the example above the first reported alignment would likely be taken as the correct one as the mismatch is in a low quality base
To optimize performance the search process starts by searching for an exact match and the threshold is increased until at least one match is found
If a read maps to multiple locations then it may be from a repeat and may be ignored when determining putative SNPs
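The search order described above (exact match first, then a progressively looser threshold until something is found) can be sketched as follows. This is a brute-force stand-in with hypothetical names (`map_with_relaxation`, `is_possible_repeat`) that counts plain mismatches rather than quality-weighted scores.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_with_relaxation(genome, read, max_mismatches=2):
    """Try exact matching first, then allow one more mismatch at a time.
    Returns (mismatches_allowed, start_positions) for the first level that
    yields at least one placement, or (None, []) if nothing is found."""
    n = len(read)
    for allowed in range(max_mismatches + 1):
        starts = [s for s in range(len(genome) - n + 1)
                  if hamming(genome[s:s + n], read) <= allowed]
        if starts:
            return allowed, starts
    return None, []


def is_possible_repeat(starts):
    """A read placed at more than one location may come from a repeat."""
    return len(starts) > 1


genome = "TTACGTAGGCCACGTAGTT"  # toy sequence, not from the slides
for read in ("ACGTAG", "GGCCAT"):
    allowed, starts = map_with_relaxation(genome, read)
    print(read, allowed, starts, is_possible_repeat(starts))
```

Stopping at the first level that produces a hit keeps the common case (an exact match) cheap, and the multi-location flag gives a simple way to set reads aside before calling putative SNPs.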
Finding SNPs I
SNP identification should take into account:
Known SNPs
Whether the species is Haploid, Diploid, etc.
Quality of reads by base position
Background SNP rate
If the SNP is within a documented exon, then translation neutral SNPs can be distinguished
Example 1: the reads all have a mismatch corresponding to the same position in the genome indicating a possible SNP
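Example 1 amounts to stacking the mapped reads on the reference and counting mismatches per genome position. A minimal sketch under that reading follows; the function names, the toy reference and the `min_reads` cut-off are assumptions for illustration, and a real caller would also weigh base quality, ploidy and known SNPs as listed above.

```python
from collections import Counter


def mismatch_pileup(reference, placements):
    """placements: list of (start, read) un-gapped placements on the reference.
    Returns {genome_position: Counter of observed non-reference bases}."""
    pileup = {}
    for start, read in placements:
        for offset, base in enumerate(read):
            pos = start + offset
            if reference[pos] != base:
                pileup.setdefault(pos, Counter())[base] += 1
    return pileup


def candidate_snps(reference, placements, min_reads=2):
    """Keep positions where at least min_reads reads disagree with the reference."""
    return {pos: counts
            for pos, counts in mismatch_pileup(reference, placements).items()
            if sum(counts.values()) >= min_reads}


reference = "ACGTACGTAC"  # toy reference, 0-based positions
placements = [(0, "ACGTAT"), (2, "GTATGT"), (4, "ATGTAC")]
print(candidate_snps(reference, placements))
```

All three toy reads disagree with the reference at position 5, so that position is reported as a possible SNP.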
Finding SNPs II
Example 2: One read has a mismatch and two reads match
The mismatch corresponds to a low quality base position in the read so the mismatch could be interpreted as insignificant and not reported. However, if the species is diploid and it is known from a SNP library that some individuals carry a SNP for a ‘C’ at this position, there is an increased probability of this individual carrying the SNP on one of the two chromosome copies. Some SNPs cause disease only if they exist in both copies of the chromosome while others can cause disease even if only one copy carries the SNP
Summary
Mapping of short reads achieved at very high throughput, less than 1 ms per read
Position specific scoring allows variable quality reads to be mapped
Statistical analysis of mismatches to qualify SNPs
Mapping 100 mers
MultiPass Strategy for Mapping Sequence data to Genomes using SynaBASE
Analysis Steps
1st Pass: Search 4% mutated reads against the Human genome SynaBASE using high stringency parameters. SynaSearch matches ~61% on first pass
2nd Pass: Reduce repeat filtering stringency
3rd Pass: Repeat the search by reducing filter score to identify shorter alignments e.g. score < 30
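The multi-pass idea is a loop: search with strict parameters, then re-search only the reads that found nothing using looser parameters. A minimal sketch follows. The `search` callable and its `min_score` parameter are placeholders for whatever aligner and settings are used; they are not SynaSearch options, and the toy scoring is invented purely so the example runs.

```python
def multipass_map(reads, search, passes):
    """reads: dict of name -> sequence.
    search(sequence, **params) returns a (possibly empty) list of hits.
    passes: list of parameter dicts, most stringent first.
    Returns (mapped, unmapped) where mapped is name -> (pass_number, hits)."""
    mapped = {}
    remaining = dict(reads)
    for pass_number, params in enumerate(passes, start=1):
        still_unmapped = {}
        for name, sequence in remaining.items():
            hits = search(sequence, **params)
            if hits:
                mapped[name] = (pass_number, hits)
            else:
                still_unmapped[name] = sequence
        remaining = still_unmapped  # only these go into the next pass
    return mapped, sorted(remaining)


def toy_search(sequence, min_score):
    """Stand-in aligner: the 'score' is just the number of non-N bases."""
    score = sum(1 for base in sequence if base != "N")
    return [("chrDemo", score)] if score >= min_score else []


reads = {"r1": "ACGTAC", "r2": "ACNNAC", "r3": "ANNNNC", "r4": "NNNNNN"}
passes = [{"min_score": 6}, {"min_score": 4}, {"min_score": 2}]
mapped, unmapped = multipass_map(reads, toy_search, passes)
print(mapped)
print(unmapped)
```

Each pass sees fewer reads than the one before, which is why the later, more permissive passes can stay comparatively cheap.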
Mapping
120 mers
Input Sequence Reads: ~1.7 million @ 6X coverage of Hs chr22
Analysis of Results
View read placement along chromosome
Calculate mapping efficiency
1.7m reads mapped to human genome in 53 min 22 seconds
Simulation Mapping Results
Dataset | % Reads Mapped | No. Queries Matched | Mean Aligned Length | Mean Percent ID | Hits Per Query | Minutes
Original | 100.00 | 1738713 | 119.84 | 100.00 | 1.30 | 66.85
Pass 1 - 4% Mutated | 61.38 | 1067225 | 119.17 | 97.37 | 1.15 | 53.53
Pass 2 - 4% Mutated | 25.56 | 444415 | 119.32 | 96.23 | 1.24 | 24.42
Pass 3 - 4% Mutated | 9.76 | 169698 | 119.28 | 97.12 | 1.17 | 7.01
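The cumulative effect of the three passes can be checked with a few lines of arithmetic on the figures in the table above. The variable names are arbitrary; the numbers are taken directly from the table.

```python
total_reads = 1738713
matched_per_pass = [1067225, 444415, 169698]
minutes_per_pass = [53.53, 24.42, 7.01]

cumulative = 0
for pass_number, matched in enumerate(matched_per_pass, start=1):
    cumulative += matched
    percent = 100 * cumulative / total_reads
    print(f"after pass {pass_number}: {cumulative} reads mapped ({percent:.2f}%)")

overall_percent = 100 * cumulative / total_reads
total_minutes = sum(minutes_per_pass)
print(f"total search time: {total_minutes:.2f} minutes")
```

Summing the three passes gives the overall mapping rate for the 4% mutated reads and the combined search time.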
Chr22 mapping overview
[Chart: Read Density Count against Chr22 sequence position. Red: forward, Green: reverse complement]
Human Chr2
[Chart: Read Density Count against Chr2 sequence position. Red: forward, Green: reverse complement]
Viewing Results
Gbrowse: Community-based system to view results
Numerous customisations to show sequence coverage
Analyze read mappings in the context of
Known genes
Repeats and variations (SNP)
Comparative genomics
RAB36 RAS Oncogene Family on chromosome 22
Areas of lower read coverage
Conclusions
Very significant performance improvements compared to MegaBLAST – <100ms per read
Very high coverage attained by using multi-pass strategy
Over 95% coverage
Remaining 5% are repeats
High specificity – fewer matches per read
Enables multiple human genomes to be processed per day
Mapping Polony reads
5 mers
Polony sequencing read mapping
Convert genomic sequences to spectra
Sample random probe sets from random chromosomal regions
Filter probe sets using probe intensity spectra
Query probe sets against genome database
5 mers polony reads
Reference Database Generation
Generate overlapping segments for Hs chr22 @ 5X coverage
Sequence to spectrum conversion using 512 bit translation
Build SynaBASE for querying with probe sets
Probe set Generation
Generate 10,000 random 200bp reads from Hs chr22
Simulate error rates at 1-7% in probe sequence
Sequence to spectrum conversion using 512 bit translation
Sample probe intensities from spectra using normal distribution (Mean 2000 / SD 250)
Method Verification
Filter probes based on intensity thresholds for each error rate
Alignment search remainder of probes against reference SynaBASE of Hs chr22
Analyze score and % identities for all probe sets at various intensity thresholds
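The slides do not define the "512 bit translation". One reading that yields exactly 512 bits, offered here purely as an assumption, is a presence/absence vector over 5-mers with each 5-mer merged with its reverse complement (4^5 = 1024 5-mers collapse to 512 strand-independent classes). The sketch below builds such a spectrum and samples an intensity for each probe present, using the mean 2000 / SD 250 normal distribution quoted on the slide. All function names are hypothetical.

```python
import random
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the alphabetically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# 1024 possible 5-mers collapse to 512 strand-independent classes.
CANONICAL_5MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=5)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_5MERS)}


def spectrum_bits(sequence):
    """Encode which canonical 5-mers occur in the sequence as a 512-bit integer."""
    bits = 0
    for i in range(len(sequence) - 4):
        kmer = sequence[i:i + 5]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            bits |= 1 << INDEX[canonical(kmer)]
    return bits


def sample_intensities(bits, rng, mean=2000.0, sd=250.0):
    """Draw one normally distributed intensity per probe present in the spectrum."""
    return {CANONICAL_5MERS[i]: rng.gauss(mean, sd)
            for i in range(len(CANONICAL_5MERS)) if (bits >> i) & 1}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
bits = spectrum_bits("AAAAACCCCCGGTTA")
print(bin(bits).count("1"), "probes present")
print(sample_intensities(bits, rng))
```

With this encoding a read and its reverse complement produce the same spectrum, so strand does not need to be known before querying.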
Overall
Error Rate | Threshold | Hits (%) | No of queries that found no result | No of queries that match within 128 bases (Accuracy %) | No of queries that match the incorrect target | Elapsed Time (s) | Avg. Query Time (ms)
0% | 0 | 100.00 | 0 | 10000 (100.00) | 0 | 58.793 | 5.79
1% | 2310 | 100.00 | 0 | 10000 (100.00) | 0 | 54.756 | 5.48
2% | 2420 | 100.00 | 0 | 10000 (100.00) | 0 | 61.547 | 6.15
3% | 2490 | 100.00 | 0 | 9995 (99.95) | 5 | 68.223 | 6.82
4% | 2550 | 99.99 | 1 | 9996 (99.96) | 3 | 66.968 | 6.70
5% | 2590 | 99.81 | 19 | 9968 (99.68) | 13 | 114.028 | 11.40
6% | 2630 | 99.41 | 59 | 9882 (98.82) | 59 | 160.873 | 16.09
7% | 2670 | 99.81 | 19 | 9815 (98.15) | 166 | 166.932 | 16.69
Advantages
Time taken to conduct 0% to 4% searches, around 6 ms per query
Enhanced performance to the SynaBASE engine & associated algorithms
100% hits matched for 1-3% error margin data
~15 million searches against a reference genome in 1 day
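The daily throughput figure follows from the average query time. A quick check, assuming queries run one after another on a single search stream at the quoted average of about 6 ms:

```python
avg_query_ms = 6                      # approximate average from the slide
ms_per_day = 24 * 60 * 60 * 1000      # milliseconds in one day

searches_per_day = ms_per_day // avg_query_ms
print(f"{searches_per_day:,} searches per day")
```

This gives 14.4 million searches per day, in line with the rounded ~15 million on the slide.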
Conclusion
Summary
SynaBASE used as database PLATFORM
Unique, leads to massive increases in speed and scalability
Applied to the 3 main classes of reads from 2nd generation sequencing platforms
Hundreds of times faster than conventional approaches
Specificity and accuracy enhanced due to exhaustive nature of SynaBASE
Thank you
Please email questions to: