High Throughput Computational Sequence Analysis Rob Edwards redwards@salmonella.org Argonne National Laboratory San Diego State University

Post on 19-Dec-2015


Page 1: High Throughput Computational Sequence Analysis Rob Edwards redwards@salmonella.org Argonne National Laboratory San Diego State University

High Throughput Computational Sequence Analysis

Rob Edwards, redwards@salmonella.org

Argonne National Laboratory, San Diego State University

Page 2:

How much has been sequenced

[Figure: number of known sequences (y-axis) by year (x-axis). Marked milestones: first bacterial genome; 100 bacterial genomes; 1,000 bacterial genomes; environmental sequencing.]

Page 3:

How much will be sequenced

[Figure labels: everybody in San Diego; everybody in USA; all cultured Bacteria; 100 people; one genome from every species; most major microbial environments.]

Page 4:

High Performance Computing

Page 5:

TeraGrid

Page 6:

The TeraGrid National Resource

Page 7:

Life Sciences Gateway to TeraGrid

Page 8:

Subsystems

Page 9:

Subsystems make up metabolism

Wikipedia Metabolism portal: http://en.wikipedia.org/wiki/Portal:Metabolism

Page 10:

Subsystems are not just metabolism

http://aig.cs.man.ac.uk/gallery/Utopia/

Enzyme complex

http://webdeptos.uma.es/

Cell Machinery

http://www.brown.edu/

Cell Processes

Page 11:

http://www.theseed.org

Page 12:

http://www.theseed.org

Page 13:

Growth in generation of subsystems

Page 14:

Microbial Genomics Annotation Platform

• Goal 1: Automate the generation of high quality annotations by leveraging the information contained in SubSystems and FIGfams.

• Goal 2: Minimize turnaround time. Initial target: 48 hours.

Page 15:

Freely available annotation service

http://www.nmpdr.org/anno-server/index48.cgi

• Automated process consisting of:
  – Gene calling
  – Initial annotation of function
  – Initial metabolic reconstruction

• Process takes 1-7 hours depending on size and complexity of the genome

• ~20 genomes per day

• Password protected, secure, private

• Release to public databases if required
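As a rough check on the throughput figures on this slide (1-7 hours per genome, ~20 genomes per day), the sketch below estimates how many annotation jobs would have to run in parallel to reach a daily target. It assumes jobs are independent and run back to back on each worker. The function names and the worker-pool model are illustrative assumptions, not a description of how the annotation server is actually scheduled.

```python
import math


def genomes_per_day(workers: int, hours_per_genome: float) -> float:
    """Genomes finished per day if each worker runs jobs back to back."""
    return workers * 24.0 / hours_per_genome


def workers_needed(target_per_day: float, hours_per_genome: float) -> int:
    """Smallest number of parallel workers that meets the daily target."""
    return math.ceil(target_per_day * hours_per_genome / 24.0)


if __name__ == "__main__":
    # Best case (1 hour per genome) and worst case (7 hours per genome).
    for hours in (1, 7):
        print(f"{hours} h/genome: {workers_needed(20, hours)} worker(s) for 20 genomes/day")
```

Under these assumptions, 20 genomes per day needs one worker if every genome takes 1 hour and six workers if every genome takes 7 hours.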

Page 16:

Some estimate of annotation quality

[Figure: chart with a 0 to 50 percent axis, comparing four series (% in SS SEED, % in SS SP1Ke, % hypothetical SP1Ke, % hypothetical SEED) for five genomes:
Bacillus anthracis str. Sterne (260799)
Mycobacterium tuberculosis CDC1551 (83331)
Listeria monocytogenes EGD-e (169963)
Streptococcus pyogenes M1 GAS (160490)
Staphylococcus aureus subsp. aureus MW2 (196620)]
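The two measures named in the chart legend, percent of genes in a subsystem and percent of genes annotated as hypothetical, can be computed from a list of gene annotations. The sketch below is a minimal illustration: the input layout (a function string plus a subsystem flag per gene) and the rule that a gene counts as hypothetical when its function text contains the word "hypothetical" are assumptions, not the method used to produce the slide.

```python
from typing import Iterable, Tuple


def annotation_summary(genes: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (% of genes in a subsystem, % of genes called hypothetical).

    Each gene is a (function, in_subsystem) pair. Both the pair layout and
    the substring test for "hypothetical" are illustrative assumptions.
    """
    gene_list = list(genes)
    if not gene_list:
        return 0.0, 0.0
    total = len(gene_list)
    in_subsystem = sum(1 for _, flag in gene_list if flag)
    hypothetical = sum(1 for func, _ in gene_list if "hypothetical" in func.lower())
    return 100.0 * in_subsystem / total, 100.0 * hypothetical / total


if __name__ == "__main__":
    # Made-up example annotations, for illustration only.
    example = [
        ("DNA polymerase I", True),
        ("hypothetical protein", False),
        ("Conserved hypothetical protein", False),
        ("Alpha-amylase", True),
    ]
    print(annotation_summary(example))
```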

Page 17:

Evaluation / Viewing

Page 18:

Download results

• We provide a number of export formats:
  – GenBank, FASTA, GFF3, Excel
  – Can easily be extended to all formats supported by BioPerl

• Genomes can be deleted by the user at any time (we keep them for a maximum of 120 days)

• Genomes can be directly imported into the SEED if the user wishes

• All genomes are password protected
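To make two of the export formats listed above concrete, here is a dependency-free sketch that writes a sequence as FASTA and a list of gene calls as GFF3. The slide indicates that the service's exports are tied to BioPerl. This Python version is only an illustration, and the `Feature` record, the `source` value, and the `product` attribute name are hypothetical choices.

```python
from typing import List, NamedTuple
from urllib.parse import quote


class Feature(NamedTuple):
    """One called gene (hypothetical record layout, for illustration)."""
    contig: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    strand: str     # "+" or "-"
    gene_id: str
    function: str


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as FASTA, wrapped at `width` characters per line."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def to_gff3(features: List[Feature], source: str = "example") -> str:
    """Format features as GFF3 CDS rows (nine tab-separated columns)."""
    rows = ["##gff-version 3"]
    for f in features:
        # Percent-encode reserved characters such as ";" and "=" in attribute values.
        attrs = f"ID={quote(f.gene_id, safe=' ')};product={quote(f.function, safe=' ')}"
        rows.append("\t".join([
            f.contig, source, "CDS", str(f.start), str(f.end),
            ".", f.strand, "0", attrs,
        ]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    print(to_fasta("contig1", "ACGTACGTAC", width=4), end="")
    print(to_gff3([Feature("contig1", 1, 9, "+", "gene1", "Alpha-amylase; putative")]), end="")
```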

Page 19:

Metagenomics SEED

Page 20:

http://metagenomics.theseed.org

Page 21:

Metagenome Metabolic Reconstruction

Page 22:

Starch utilization in cow rumens

Page 23:

Metabolic potential in environments

Page 24:

Too much will be sequenced

[Figure labels: everybody in San Diego; everybody in USA; all cultured Bacteria; 100 people; one genome from every species; most major microbial environments.]

Page 25:

Acknowledgements

Argonne National Laboratory: Rick Stevens, Bob Olson, Folker Meyer

San Diego State University: Forest Rohwer

Fellowship for Interpretation of Genomes: Ross Overbeek, Veronika Vonstein, The Annotators