![Page 1: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/1.jpg)
Big Data Challenges in Biology and the Sheffield Bioinformatics Hub
Dr. Roy Chaudhuri, Department of Molecular Biology and Biotechnology
![Page 2: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/2.jpg)
Marta Milo, Biomedical Science
Roy Chaudhuri, Molecular Biology and Biotechnology
Eran Elhaik, Animal and Plant Sciences
James Bradford, Oncology
Winston Hide, SITraN (from August)
Ian Sudbery, Molecular Biology and Biotechnology (from December)
![Page 3: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/3.jpg)
![Page 4: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/4.jpg)
What is Big Data?
Small data: 1
Big data: 1
![Page 5: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/5.jpg)
Biological Big Data – Imaging Data
![Page 6: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/6.jpg)
Biological Big Data – Sequence Data
![Page 7: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/7.jpg)
Sanger Dideoxy Sequencing
![Page 8: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/8.jpg)
Dye-terminator Sequencing
Read lengths ~800bp
![Page 9: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/9.jpg)
phiX174 genome - 1977
![Page 10: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/10.jpg)
Escherichia coli K-12: 4.6 million base pairs
E. coli K-12 genome - 1997
Ordered sequencing approach
![Page 11: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/11.jpg)
Shotgun Sequencing
![Page 12: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/12.jpg)
![Page 13: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/13.jpg)
Human genome ~3 billion base pairs
![Page 14: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/14.jpg)
![Page 15: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/15.jpg)
119 volumes, 4.75pt Courier
![Page 16: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/16.jpg)
2007: Next Generation Sequencing
a.k.a. Massively parallel sequencing
![Page 17: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/17.jpg)
![Page 18: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/18.jpg)
![Page 19: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/19.jpg)
Read lengths 50-300bp (initially 37bp)
![Page 20: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/20.jpg)
![Page 21: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/21.jpg)
De novo Genome Assembly
Genomic DNA
Gap, Contig, Contig
![Page 22: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/22.jpg)
![Page 23: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/23.jpg)
![Page 24: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/24.jpg)
![Page 25: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/25.jpg)
![Page 26: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/26.jpg)
De Bruijn graph assembly
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness”
Break up into fixed-length chunks called k-mers
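As a minimal sketch of the chunking step, the following Python splits a string into overlapping k-mers. The function name `kmers` and the choice of k = 5 are illustrative assumptions, not taken from the slides.

```python
def kmers(text: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


# Part of the sentence used on the slide; k = 5 is an arbitrary choice.
phrase = "it was the best of times"
chunks = kmers(phrase, 5)

# Consecutive k-mers overlap by k - 1 characters.
print(chunks[:3])
```

A string of length n yields n - k + 1 k-mers, so the number of chunks grows with the amount of sequence rather than with the number of reads.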
![Page 27: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/27.jpg)
![Page 28: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/28.jpg)
• As read lengths increase, the de Bruijn graph becomes simpler.
• Resolving bubbles is one of the key functions of assembly software. The process uses additional information such as coverage levels and paired reads.
• If a bubble cannot be resolved, it results in a break in the assembly.
• Memory is the limiting factor. De novo assembly of large and complex genomes can require >1TB of RAM.
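The points above can be illustrated with a toy de Bruijn graph. This is a sketch under stated assumptions, not how any particular assembler works: nodes are (k-1)-mers, each k-mer contributes one edge, and the two example reads are invented so that they differ at a single base and form a bubble. The names `de_bruijn` and `branch_points` are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def branch_points(graph):
    """Nodes with more than one successor, i.e. where a bubble or repeat opens."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two invented reads that differ at one base (T vs A at the fifth position).
reads = ["AAGCTTGA", "AAGCATGA"]
graph = de_bruijn(reads, 4)
print(branch_points(graph))  # the path splits at one node and rejoins later
```

Every distinct k-mer has to be held in the graph at once, which is one way to see why memory, rather than CPU time, tends to be the limit for large genomes.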
![Page 29: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/29.jpg)
Resequencing: mapping to a reference genome
![Page 30: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/30.jpg)
Variant detection
• Efficient Burrows-Wheeler transformed genome indexes
• Memory is less of an issue than de novo assembly
• Embarrassingly parallelisable task – number of cores important
• Deep coverage required – issues with storage and disk I/O
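To give a feel for how a Burrows-Wheeler index supports read mapping, here is a toy sketch in Python. It builds the transform by sorting all rotations and counts exact matches with a backward search. Both function names are hypothetical, the `$` sentinel is an assumption, and real mappers use far more compact data structures and tolerate mismatches.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel: assumed absent from text, sorts before the letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search over the BWT."""
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(bwt_text)):
        first[c] = total
        total += bwt_text.count(c)
    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt_text[:lo].count(c)
        hi = first[c] + bwt_text[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


genome = "GATTACATTAG"  # invented example sequence
index = bwt(genome)
print(count_occurrences(index, "TTA"))
```

Each read is searched against the same read-only index independently of every other read, which is why the task parallelises so easily across cores.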
![Page 31: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/31.jpg)
Transcriptome sequencing
• RNA sequencing to understand gene expression
• Requires splice-aware mapping to reference genome
• It can be challenging to resolve alternative transcripts
![Page 32: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/32.jpg)
De novo transcriptome assembly
• De novo transcriptome assembly is a complex problem
• Many reads could belong to multiple transcripts
• Transcripts are present at different abundance levels, so coverage is used to distinguish overlapping transcripts
![Page 33: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/33.jpg)
Metagenome assembly of complex populations
• Sargasso Sea
• Soil metagenomes
• Human microbiota, e.g. gut, skin and oral cavity: “the second human genome”, linked with non-infectious conditions such as obesity and cancer
![Page 34: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/34.jpg)
Single Molecule Real Time (SMRT) Sequencing
Read lengths up to 30kb
![Page 35: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/35.jpg)
50kb reads “easily obtained”
Promise of direct DNA, RNA and protein sequencing, and detection of epigenetic factors such as methylation
MinION, GridION
![Page 36: Big Data Challenges in Biology and the Sheffield Bioinformatics Hub Dr. Roy Chaudhuri Department Of Molecular Biology and Biotechnology](https://reader035.vdocuments.us/reader035/viewer/2022062815/568168f3550346895ddffb61/html5/thumbnails/36.jpg)
The future
• Genome sequencing technologies are developing at a rate that exceeds Moore’s Law
• The limiting factor is our ability to analyse the data (this is known as “the Bioinformatics Gap”)
• This may be as bad as it gets: improved read lengths and sequence quality may mean that less coverage will be required for variant calling, and de novo assembly will become trivial or unnecessary
• In the long run, it may be simpler to store DNA and resequence, rather than store the data
• But there is no shortage of DNA to sequence, and there will be a need for real-time analysis software as sequencing becomes routine and ubiquitous
• Increased emphasis on understanding genome function rather than structure