TRANSCRIPT
A data intensive future: How can biology take full advantage of the coming data deluge?
C. Titus Brown
School of Veterinary Medicine; Genome Center & Data Science Initiative
11/13/15
Outline
0. Background
1. Research: what do we do with infinite data?
2. Development: software and infrastructure.
3. Open science & reproducibility.
4. Training
0. Background
In which I present the perspective that we face increasingly large data sets, from diverse samples, generated in real time, with many different data
types.
DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
Oxford Nanopore sequencing
Slide via Torsten Seemann
Nanopore technology
Slide via Torsten Seemann
Scaling up --
Slide via Torsten Seemann
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
Via Elizabeth Kujawinski
1. Research
In which I discuss advances made towards analyzing infinite amounts of genomic data, and the perspectives
engendered thereby: to wit, streaming and sketches.
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
De Bruijn graphs (sequencing graphs) scale with data size, not information size.
Why do sequence graphs scale badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
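A back-of-the-envelope calculation makes this concrete; all numbers below are illustrative assumptions, not figures from the talk:

    # Why graph memory tracks data size: erroneous k-mers swamp true ones.
    G = 5_000_000     # genome size in bp (bacterial scale; illustrative)
    C = 100           # sequencing coverage (illustrative)
    e = 0.01          # per-base error rate (illustrative)
    k = 31            # k-mer size (illustrative)

    true_kmers = G                 # roughly one true k-mer per genome position
    error_bases = G * C * e        # expected number of erroneous bases in the reads
    error_kmers = error_bases * k  # each error creates up to k novel k-mers

    print(f"true k-mers:      ~{true_kmers:.1e}")   # ~5e6
    print(f"erroneous k-mers: ~{error_kmers:.1e}")  # ~1.6e8, and it grows with coverage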
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
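For concreteness, average coverage is just total bases sequenced divided by genome size; a minimal sketch (numbers are illustrative):

    # Average coverage = total bases sequenced / genome size.
    def average_coverage(num_reads, read_len, genome_size):
        """Average number of reads overlapping each true base in the genome."""
        return num_reads * read_len / genome_size

    # e.g. one billion 100 bp reads against a ~3.2 Gbp human genome:
    print(average_coverage(1_000_000_000, 100, 3_200_000_000))  # ~31x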
Digital normalization
Graph size now scales with information content.
Most samples can be reconstructed via de novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of the information.
This changes the way analyses scale.
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three-line algorithm. Not kidding. It took four years to figure out which three lines, though…
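A minimal fleshed-out sketch of what those three lines rely on, assuming a plain Python dict stands in for khmer's probabilistic k-mer counting table (k-mer size and cutoff values are illustrative):

    K = 20        # k-mer size (illustrative)
    CUTOFF = 20   # coverage cutoff (illustrative)
    counts = {}   # plain dict standing in for khmer's probabilistic k-mer counter

    def kmers(read, k=K):
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    def estimated_coverage(read):
        # the median k-mer abundance is a robust per-read coverage estimate
        abunds = sorted(counts.get(km, 0) for km in kmers(read))
        return abunds[len(abunds) // 2] if abunds else 0

    def diginorm(dataset, cutoff=CUTOFF):
        for read in dataset:
            if estimated_coverage(read) < cutoff:
                for km in kmers(read):           # only kept reads update the table
                    counts[km] = counts.get(km, 0) + 1
                yield read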
Diginorm can detect information saturation in a stream.
Zhang et al., submitted.
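One way to operationalize saturation detection, as a sketch only (the window size, threshold, and function signature here are hypothetical, not khmer's API):

    def stream_until_saturated(reads, estimated_coverage, cutoff=20,
                               window=100_000, min_keep_fraction=0.05):
        """Keep only low-coverage reads (as in diginorm) and stop once a recent
        window of the stream keeps almost nothing, i.e. the sample is saturated.
        The window size and threshold are hypothetical, illustrative values."""
        kept_in_window = 0
        for i, read in enumerate(reads, 1):
            if estimated_coverage(read) < cutoff:
                kept_in_window += 1
                yield read                        # pass novel reads downstream
            if i % window == 0:
                if kept_in_window / window < min_keep_fraction:
                    return                        # saturated: stop consuming the stream
                kept_in_window = 0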
This generically permits semi-streaming analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear memory
Zhang et al., submitted.
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online variant calling.
Error detection => variant calling
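As a rough illustration of the underlying idea, that k-mers with very low abundance in an otherwise high-coverage stream are likely sequencing errors, here is a hypothetical helper (names and thresholds are illustrative, not the actual streaming caller):

    def error_positions(read, counts, k=20, min_abund=3):
        """Flag read positions covered only by low-abundance k-mers, i.e.
        likely sequencing errors. Thresholds are illustrative; the real
        streaming approach is coverage-aware and online."""
        n_kmers = len(read) - k + 1
        if n_kmers <= 0:
            return []
        low = [counts.get(read[i:i + k], 0) < min_abund for i in range(n_kmers)]
        flagged = []
        for pos in range(len(read)):
            # start indices of the k-mers that cover this base
            starts = range(max(0, pos - k + 1), min(pos + 1, n_kmers))
            if all(low[s] for s in starts):
                flagged.append(pos)
        return flagged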
Real time / streaming data analysis.
My real point --
• We need well-founded, flexible, algorithmically efficient, and high-performance components for sequence data manipulation in biology.
• We are building some of these on a streaming and low-memory paradigm.
• We are building out a scripting library for composing these operations (sketch below).
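In that spirit, streaming operations compose naturally as Python generators; the names below are illustrative, not khmer's actual scripting interface:

    def read_fastq(path):
        """Yield sequences from a FASTQ file (minimal parser; ignores qualities)."""
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 1:
                    yield line.strip()

    def drop_short(reads, min_len=50):
        """Discard reads shorter than min_len."""
        return (r for r in reads if len(r) >= min_len)

    def drop_ambiguous(reads):
        """Discard reads containing ambiguous (N) bases."""
        return (r for r in reads if "N" not in r)

    def write_fasta(reads, path):
        with open(path, "w") as out:
            for i, r in enumerate(reads):
                out.write(f">read{i}\n{r}\n")

    # Each stage consumes and yields a stream, so stages (including a diginorm
    # generator like the one sketched earlier) chain without holding data in memory:
    write_fasta(drop_ambiguous(drop_short(read_fastq("reads.fq"))), "filtered.fa")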
2. Software and infrastructure
Alas, practical data analysis depends on software and computers, which
leads to depressingly practical considerations for gentleperson
scientists.
Software
It’s all well and good to develop new data analysis approaches, but their utility is greater when they are implemented in usable software.
Writing, maintaining, and progressing research software is hard.
The khmer software package
• Demo implementation of research data structures & algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)
The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
khmer is developed as a true open source package
• github.com/dib-lab/khmer;
• BSD license;
• Code review, two-person sign-off on changes;
• Continuous integration (tests are run on each change request);
Challenges:
Research vs stability!
Stable software for users, & a platform for future research, vs research “culture” (funding and careers).
How is continued software development feasible?!
Representative half-arsed lab software development
A not-insane way to do software development
Infrastructure issues
Suppose that we have a nice ecosystem of bioinformatics & data analysis tools. Where and how do we run them?
Consider:
1. Biologists hate funding computational infrastructure.
2. Researchers are generally incompetent at building and maintaining usable infrastructure.
3. Centralized infrastructure fails in the face of infinite data.
Decentralized infrastructure for bioinformatics?
ivory.idyll.org/blog/2014-moore-ddd-award.html
3. Open science and reproducibility
In which I start from the point that most researchers* cannot replicate their own
computational analyses, much less reproduce those published by anyone else.
* This doesn’t apply to anyone in this
audience; you’re all outliers!
My lab & the diginorm paper.
• All our code was on github;
• Much of our data analysis was in the cloud (on Amazon EC2);
• Our figures were made in IPython Notebook;
• Our paper was in LaTeX.
Brown et al., 2012 (arXiv)
IPython Notebook: data + code =>
To reproduce our paper:
git clone <khmer> && python setup.py install    # install the khmer package
git clone <pipeline>                            # fetch the paper's analysis pipeline
cd pipeline
wget <data> && tar xzf <data>                   # download and unpack the raw data
make && cd ../notebook && make                  # run the analysis, then build the notebooks
cd ../ && make                                  # run the top-level build
This is the standard process in the lab --
Our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long-running data analysis => ‘make’;
• Graphing and data digestion => IPython Notebook (also in github).
Zhang et al. doi: 10.1371/journal.pone.0101271
Research process
Literate graphing & interactive exploration
Camille Scott
Why bother??
“There is no scientific knowledge of the individual.” (Aristotle)
More pragmatically, we are tired of struggling to reproduce other people’s results.
And, in the end, it’s not all that much extra work.
What does this have to do with open science?
This is a longer & larger conversation, but:
All of our processes enable easy and efficient pre-publication sharing. Source code, analyses, preprints…
When we share early, our ideas have a significant competitive advantage in the research marketplace of
ideas.
4. Training
In which I note that methods and tools do little without a trained hand
wielding them, and a trained eye examining the results.
Perspectives on training
• Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report).
• Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing.
• Training is systematically undervalued in academia (!?)
UC Davis and training
My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.
Summer NGS workshop (2010-2017)
General parameters:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including the outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
Thus far & near future
~12 workshops on bioinformatics in 2015.
Trying out soon:
• Half-day intro workshops;
• Week-long advanced workshops;
• Co-working hours.
dib-training.readthedocs.org/
The End.
• If you think 5-10 years out, we face significant practical issues for data analysis in biology.
• We need new algorithms/data structures, AND good implementations, AND better computational practice, AND training.
• This can be either viewed with despair… or seen as an opportunity to seize the competitive advantage!
(How I view it varies from day to day.)
Thanks for listening! Please contact me at [email protected]! Note: I work here now!