![Page 2: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/2.jpg)
Production Centers
• Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow
• Toby Bloom, Broad: Quality, Integration, Standards, Sharing
• David Dooling, WUStL: Scale, Quality, Sharing, Versioning
![Page 4: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/4.jpg)
Moore’s Law
[Chart, 2000 to 2010, with series: Personnel, Moore’s Law, CPU, Storage, Sequence]
![Page 9: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/9.jpg)
FASTQ
@HWI-EAS404:5:1:6:180#0/1
GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
+HWI-EAS404:5:1:6:180#0/1
aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\
@HWI-EAS404:5:1:6:396#0/1
TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
+HWI-EAS404:5:1:6:396#0/1
Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ
@HWI-EAS404:5:1:6:1344#0/1
GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
+HWI-EAS404:5:1:6:1344#0/1
aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[
@HWI-EAS404:5:1:6:1814#0/1
AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
+HWI-EAS404:5:1:6:1814#0/1
aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X
7 TB/week
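The four-line record layout shown on this slide (header, bases, separator, per-base quality characters) can be read mechanically. The following is a minimal sketch, assuming the quality characters use an ASCII offset of 64; the slide does not state the offset, and the helper names `parse_fastq` and `phred64` are hypothetical. The sample record is the first read from the slide, truncated to its first 10 bases for brevity.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 64  # assumption: offset-64 quality encoding, not stated on the slide


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, bases, quality_string) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], bases, quals


def phred64(quals: str) -> List[int]:
    """Convert quality characters to integer scores under the assumed offset."""
    return [ord(c) - PHRED_OFFSET for c in quals]


# First read from the slide, truncated to 10 bases.
record = [
    "@HWI-EAS404:5:1:6:180#0/1",
    "GCTGGTTTAA",
    "+HWI-EAS404:5:1:6:180#0/1",
    "aaaa`]aaaa",
]
reads = list(parse_fastq(record))
scores = phred64(reads[0][2])
print(reads[0][0], scores)
```

Each base costs one byte for the call and one byte for the quality character in this format, before the repeated read identifiers are counted.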
![Page 10: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/10.jpg)
FASTQ
@HWI-EAS404:5:1:6:180#0/1
GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
+HWI-EAS404:5:1:6:180#0/1
aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\
@HWI-EAS404:5:1:6:396#0/1
TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
+HWI-EAS404:5:1:6:396#0/1
Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ
@HWI-EAS404:5:1:6:1344#0/1
GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
+HWI-EAS404:5:1:6:1344#0/1
aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[
@HWI-EAS404:5:1:6:1814#0/1
AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
+HWI-EAS404:5:1:6:1814#0/1
aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X
350 TB/year
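The yearly figure matches the 7 TB/week rate from the previous slide if that rate is sustained for about 50 weeks. That reading is an assumption, as is the use of decimal terabytes. A short sketch of the arithmetic, which also converts the weekly volume into a sustained transfer rate:

```python
TB = 10**12  # assumption: decimal terabytes

weekly_bytes = 7 * TB
seconds_per_week = 7 * 24 * 60 * 60

# Average rate needed just to move one week's output once.
sustained_mbit_per_s = weekly_bytes * 8 / seconds_per_week / 10**6

yearly_tb_52_weeks = 7 * 52
yearly_tb_50_weeks = 7 * 50  # assumption: roughly 50 production weeks

print(f"sustained rate: {sustained_mbit_per_s:.1f} Mbit/s")
print(f"52 weeks: {yearly_tb_52_weeks} TB, 50 weeks: {yearly_tb_50_weeks} TB")
```

The average rate is a lower bound: every additional copy (backup, replication, submission) multiplies the traffic.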
![Page 17: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/17.jpg)
The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps
![Page 18: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/18.jpg)
The balanced PS¹
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
¹ PS: Pipeline for Sequencing
![Page 19: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/19.jpg)
The unbalanced PS
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
60 goto 10
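One way to read the loop on this slide: every new sequencer forces the other resources to grow in turn, because the pipeline can go no faster than its slowest stage. A minimal sketch of that idea follows. The function name `bottleneck` and all capacity numbers are hypothetical and chosen only for illustration; none come from the slides.

```python
from typing import Dict, Tuple


def bottleneck(capacity_tb_per_week: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the lowest capacity and that capacity."""
    stage = min(capacity_tb_per_week, key=capacity_tb_per_week.get)
    return stage, capacity_tb_per_week[stage]


# Hypothetical capacities in TB/week, for illustration only.
capacities = {
    "sequencers": 7.0,
    "disk": 10.0,
    "backup_capacity": 5.0,
    "network_capacity": 12.0,
    "cluster_nodes": 8.0,
}

stage, limit = bottleneck(capacities)
print(f"pipeline limited to {limit} TB/week by {stage}")
```

Raising the capacity of the limiting stage simply moves the bottleneck to the next-lowest entry, which is the `goto 10` of the slide.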
![Page 34: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/34.jpg)
...must be more than just a slogan
![Page 35: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/35.jpg)
Quality missteps
Initial low fidelity between base quality values and actual quality
Tsonev, S. SEP 2007
![Page 36: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/36.jpg)
An aside
“basecall calibration predicted vs. observed”
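A calibration comparison of this kind can be sketched as follows, assuming Phred-scaled qualities (Q = -10 log10 p) and a set of base calls already labeled as correct or incorrect, for example by alignment to a trusted reference. The function names and the synthetic data are hypothetical, not taken from the talk.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expected_error_rate(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def observed_quality(calls: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each predicted Q to the empirical Q computed from observed errors.

    `calls` holds (predicted_q, is_error) pairs. Bins with zero errors are
    skipped because their empirical Q would be unbounded.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in calls:
        totals[q] += 1
        errors[q] += int(is_error)
    return {
        q: -10 * math.log10(errors[q] / totals[q])
        for q in sorted(totals)
        if errors[q] > 0
    }


# Synthetic data: 1000 calls predicted at Q20 and 1000 at Q30,
# each bin containing 10 errors (an observed error rate of 1%).
calls = [(20, i < 10) for i in range(1000)] + [(30, i < 10) for i in range(1000)]
result = observed_quality(calls)
for q, q_obs in result.items():
    print(f"predicted Q{q}: observed Q{q_obs:.1f}")
```

In this synthetic case the Q20 bin is well calibrated, while the Q30 bin is optimistic by ten units: its calls are wrong ten times more often than predicted.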
![Page 38: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/38.jpg)
Quality is the key
Need high fidelity between predicted and observed quality
3 bits per base
50 bytes per base
20 bytes per base
2 bytes per base
![Page 39: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/39.jpg)
The down side
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf
![Page 46: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/46.jpg)
Submitted to central repositories
![Page 47: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/47.jpg)
... and replicated across the pond
![Page 48: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/48.jpg)
The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
![Page 49: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/49.jpg)
Write-only databases
Search limited to sequence and values of specific XML entities submitted as metadata
![Page 51: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/51.jpg)
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?><STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145"> <DESCRIPTOR> <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE> <STUDY_TYPE existing_study_type="Metagenomics"/> <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT> <CENTER_NAME>SDSU</CENTER_NAME> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME> <PROJECT_ID>28373</PROJECT_ID> </DESCRIPTOR> <STUDY_ATTRIBUTES> <STUDY_ATTRIBUTE> <TAG>NCBI parent project ID</TAG> <VALUE>28725</VALUE> </STUDY_ATTRIBUTE> </STUDY_ATTRIBUTES> </STUDY></STUDY_SET>
<?xml version="1.0" encoding="UTF-8"?><SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <SAMPLE alias="28373" accession="SRS000373"> <SAMPLE_NAME> <TAXON_ID>496920</TAXON_ID> <COMMON_NAME>saltern metagenome</COMMON_NAME> </SAMPLE_NAME> <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION> <SAMPLE_ATTRIBUTES> <SAMPLE_ATTRIBUTE> <TAG>collection_date</TAG> <VALUE>11/10/05</VALUE> </SAMPLE_ATTRIBUTE> <SAMPLE_ATTRIBUTE> <TAG>lat_lon</TAG> <VALUE>32.599040, -117.107356</VALUE> </SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTES> </SAMPLE></SAMPLE_SET>
<?xml version="1.0" encoding="UTF-8"?><EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217"> <TITLE>454 sequencing of saltern metagenome fragment library</TITLE> <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/> <DESIGN> <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION> <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/> <LIBRARY_DESCRIPTOR> <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME> <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION> <LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT> <LIBRARY_CONSTRUCTION_PROTOCOL> none provided </LIBRARY_CONSTRUCTION_PROTOCOL> </LIBRARY_DESCRIPTOR> <SPOT_DESCRIPTOR> <SPOT_DECODE_SPEC> <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT> <READ_SPEC> <READ_INDEX>0</READ_INDEX> <READ_CLASS>Technical Read</READ_CLASS> <READ_TYPE>Adapter</READ_TYPE> <BASE_COORD>1</BASE_COORD> </READ_SPEC> <READ_SPEC> <READ_INDEX>1</READ_INDEX> <READ_CLASS>Application Read</READ_CLASS> <READ_TYPE>Forward</READ_TYPE> <BASE_COORD>5</BASE_COORD> </READ_SPEC> </SPOT_DECODE_SPEC> </SPOT_DESCRIPTOR> </DESIGN> <PLATFORM>
<LS454> <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL> <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE> <FLOW_COUNT>168</FLOW_COUNT> </LS454> </PLATFORM> <PROCESSING> <BASE_CALLS> <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE> <BASE_CALLER>454BaseCaller</BASE_CALLER> </BASE_CALLS> <QUALITY_SCORES qtype="phred"> <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <MULTIPLIER>1</MULTIPLIER> </QUALITY_SCORES> </PROCESSING> </EXPERIMENT></EXPERIMENT_SET>
<?xml version="1.0" encoding="UTF-8"?><RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D0IIGP301.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG>
<VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN> <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D1LDSHL01.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG> <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN></RUN_SET>
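Because search is limited to specific XML entities, it helps to see what pulling those entities out of a submission looks like. The sketch below uses Python's standard `xml.etree.ElementTree` on an abbreviated copy of the STUDY document from the slide. The abbreviation and the variable names are choices made for this example, not part of any repository tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the STUDY_SET document shown on the slide.
STUDY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <CENTER_NAME>SDSU</CENTER_NAME>
    </DESCRIPTOR>
    <STUDY_ATTRIBUTES>
      <STUDY_ATTRIBUTE>
        <TAG>NCBI parent project ID</TAG>
        <VALUE>28725</VALUE>
      </STUDY_ATTRIBUTE>
    </STUDY_ATTRIBUTES>
  </STUDY>
</STUDY_SET>
"""

# Parse from bytes so the encoding declaration is honored.
root = ET.fromstring(STUDY_XML.encode("utf-8"))
study = root.find("STUDY")

accession = study.get("accession")
center = study.findtext("DESCRIPTOR/CENTER_NAME")
study_type = study.find("DESCRIPTOR/STUDY_TYPE").get("existing_study_type")

# Free-form TAG/VALUE pairs collected into a dictionary.
attributes = {
    attr.findtext("TAG"): attr.findtext("VALUE")
    for attr in study.iter("STUDY_ATTRIBUTE")
}

print(accession, center, study_type, attributes)
```

Fixed elements such as `CENTER_NAME` are easy to query, while anything a submitter places in free-form `TAG`/`VALUE` pairs is only findable if the search side knows the tag names in advance.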
![Page 54: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/54.jpg)
The Cathedral and the Bazaar
Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.
![Page 55: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/55.jpg)
The Vatican and the Reformation
![Page 57: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/57.jpg)
GenBank genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
![Page 58: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/58.jpg)
git genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
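The slides do not say how a "git genome" would be implemented. As an illustration of the underlying idea only, the sketch below computes the content-derived identifier that git assigns to a file (a SHA-1 over a short header plus the bytes), so that any change to a sequence yields a new, verifiable version ID. The two example sequences are hypothetical: the first reuses the opening bases of the reference shown later in the deck, and the second differs by one base.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical FASTA snippets that differ in the final base only.
v1 = b">example\nAATAACTATATAAGTAAATAAG\n"
v2 = b">example\nAATAACTATATAAGTAAATAAC\n"

id_v1 = git_blob_id(v1)
id_v2 = git_blob_id(v2)
print(id_v1)
print(id_v2)
```

Because the identifier depends only on the content, two centers holding the same bytes derive the same ID without consulting a central authority, and a silent edit cannot keep the old ID.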
![Page 59: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/59.jpg)
The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...
AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAGGTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTTTTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCTGGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTATATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAAAATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACATAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAACTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTATTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAAAGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTTTAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTACAGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC
...
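The header of this FASTA record already carries version information: the assembly name `NCBI36` appears inside the colon-separated location field. A minimal sketch of pulling that out follows. It assumes the last header token follows the layout coordinate system, assembly, sequence name, start, end, strand; that layout is an interpretation of the header on the slide, and the names `Region` and `parse_header` are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    coord_system: str
    assembly: str
    name: str
    start: int
    end: int
    strand: int


def parse_header(line: str) -> Region:
    """Parse a header such as the one on the slide into its location fields."""
    fields = line.lstrip(">").split()
    # Assumed layout of the last token: system:assembly:name:start:end:strand
    coord_system, assembly, name, start, end, strand = fields[-1].split(":")
    return Region(coord_system, assembly, name, int(start), int(end), int(strand))


region = parse_header(">7 dna:chromosome chromosome:NCBI36:7:1:158821424:1")
print(region)
```

Any coordinate on this chromosome is only meaningful together with the assembly field, so results that quote positions without it cannot be tied to a specific version of the reference.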
![Page 61: Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing](https://reader034.vdocuments.us/reader034/viewer/2022042614/556ecec2d8b42adb678b4ffc/html5/thumbnails/61.jpg)
The Human Reference
D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7
[Figure: repeat graphs, panels (a) to (c), with nodes labeled A to H and edge multiplicities]