EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context Enzyme Function Initiative (EFI) Gordon Research Conference on Enzymes, Coenzymes, and Metabolic Pathways July 15, 2014

Page 1: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Enzyme Function Initiative (EFI)

Gordon Research Conference on Enzymes, Coenzymes, and Metabolic Pathways

July 15, 2014

Page 2: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

What is a Genome Neighborhood Network?

High sequence homology → Enzyme function

Low/Med. sequence homology + Genome context → Enzyme function

Page 3: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

What is a Genome Neighborhood Network?

Genes << Operon << Regulon

gene products forming a biological pathway

R A B C

Genome neighborhood information facilitates enzyme function discovery via contextual evidence

Page 4: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

What is a Genome Neighborhood Network?

The GNN organizes genome neighborhood information for thousands of query genes in a high throughput and rapid fashion.

The resulting network allows a user to quickly identify the protein families that are encoded by the genes within close proximity to the SSN dataset.

Page 5: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN Generation

The entire process is fast and computationally inexpensive

SSN Cluster Inventory

• SSN network file parsing
• Singletons excluded
• Clusters assigned number and unique color

Neighbor Annotation Gathering

• European Nucleotide Archive (ENA) is queried with each SSN sequence
• Protein-encoding genes are compared to Pfam
• Additional annotation information is gathered

Network Generation

• Network xgmml file written
• Query sequences and neighbor sequences = nodes
• Genome proximity = edge
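The network-generation step can be pictured with a minimal Python sketch. It assumes the neighbor-gathering step has already produced simple (cluster number, query accession, neighbor accession, Pfam ID) records; the record layout, the function name `build_gnn`, and the accessions are hypothetical and are not taken from the EFI-GNT code. The grouping follows the hub-and-spoke layout described later in the deck: one hub per Pfam family, one spoke per SSN cluster with a neighbor in that family.

```python
from collections import defaultdict

def build_gnn(neighbor_records):
    """Group neighbor records into hubs (Pfam families) and spokes (SSN clusters).

    Returns {pfam: {cluster_number: {"queries": set, "neighbors": set}}}.
    """
    hubs = defaultdict(
        lambda: defaultdict(lambda: {"queries": set(), "neighbors": set()})
    )
    for cluster, query, neighbor, pfam in neighbor_records:
        spoke = hubs[pfam][cluster]      # spoke node: one SSN cluster under this hub
        spoke["queries"].add(query)      # query sequences that retrieved a neighbor
        spoke["neighbors"].add(neighbor) # neighbor sequences in this Pfam family
    return hubs

# Hypothetical records: (cluster number, query, neighbor, Pfam family)
records = [
    (1, "Q1", "N1", "PF00701"),
    (1, "Q2", "N2", "PF00701"),
    (2, "Q3", "N3", "PF00701"),
    (1, "Q1", "N4", "PF00480"),
]
gnn = build_gnn(records)
for pfam, spokes in gnn.items():
    print(pfam, {cluster: len(s["neighbors"]) for cluster, s in spokes.items()})
```

Each hub-to-spoke pair in the result corresponds to one edge (genome proximity) in the written network.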

Page 6: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Query families

GNNs: query families

Page 7: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Genome neighbors

GNNs: bacterial proteins in gene clusters

Query families

Page 8: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Genome neighbors

GNNs: collect neighbors

Query families

Page 9: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Genome neighbors

Network for neighbors

GNNs: cluster neighbors

Query families

Page 10: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Genome neighbors

Network for neighbors

shared context, same pathway, same function

unique context, unique pathway, unique function

GNNs: deduce function

Query families

Page 11: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Example: proline racemase superfamily

< 10^-120

> 60% ID

Zhao et al. 2014 eLife: http://dx.doi.org/10.7554/eLife.03275

Page 12: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN: “BLAST” network

Page 13: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN: Pfam network

Full GNN Pfam GNN

Page 14: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN: pathway “parts”

ALDH DAO

DHDPS

OCD

LDH/MDH

Page 15: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

From GNN: complete pathways

DAO

DHDPS

ALDH

OCD LDH/MDH

Page 16: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN Format

The GNN visually organizes genome neighborhood information into multiple hub-and-spoke clusters.

Page 17: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Hub Nodes

Hub node = Pfam family in neighborhood

Node attribute Neighbor_Accessions = list of all Pfam members found in the genome context of the SSN, with the following additional information:

• EC number
• PDB code
• PDB-hit
• Swiss-Prot status (reviewed/unreviewed)

Additional Node Attributes:

• Num_neighbors = the number of neighbor sequences belonging to this Pfam family
• pfam = Pfam number, e.g., PF13365
• Pfam description = a short description of the family, e.g., Trypsin-like peptidase domain

Page 18: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

PDB-Hit

PDB-hit: the sequence shares significant homology (e-value < 1e-15) with a protein that has an X-ray crystal structure in the RCSB Protein Data Bank.

The format of this information is “PDB code:e-value”
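A small Python sketch of reading such an entry, under stated assumptions: the e-value is written in a form Python's `float` accepts (e.g. `1e-30`), and the PDB code `1ABC` below is a made-up placeholder. The function names are hypothetical; the 1e-15 cutoff is the one quoted on this slide.

```python
def parse_pdb_hit(value):
    """Split a 'PDB code:e-value' string into (code, e-value as float)."""
    code, evalue = value.split(":", 1)
    return code.strip(), float(evalue)

def is_significant(value, cutoff=1e-15):
    """True when the e-value is below the cutoff quoted on the slide."""
    _, evalue = parse_pdb_hit(value)
    return evalue < cutoff

print(parse_pdb_hit("1ABC:1e-30"))   # hypothetical entry
print(is_significant("1ABC:1e-30"))
print(is_significant("1ABC:1e-5"))
```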

Related structure → homology model → docking

For users that are new to homology modeling, see resources by Sali lab at the University of California at San Francisco.

[Diagram: BLASTp between PDB (284k) and UniProt (48M) gives the PDB-Hit Database (22M)]

Page 19: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Spoke Nodes

Spoke nodes = single cluster from SSN with ≥1 neighbor in hub

The Node Attributes:

• Cluster Number = # assigned to SSN-cluster
• Query_Accessions = a list of UniProt accessions for query sequences
• Distance = a list of distances between query and neighbor. This is formatted “UniprotID-query:UniprotID-neighbor: (-)N”, where query = 0, next gene = 1, etc., and a negative N value indicates an upstream position.
• SSN Cluster Size = the size of SSN-cluster
• Num_neighbors = # of neighbor sequences retrieved by spoke node
• Num_queries = # of query sequences in spoke node
• Num_ratio = % co-occurrence as a ratio
• ClusterFraction = % co-occurrence as fraction, 0-3
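The Distance format above can be unpacked with a few lines of Python. This is only a sketch: the function name and the two accessions are hypothetical, and it assumes each entry holds exactly three colon-separated fields with an integer offset.

```python
def parse_distance(entry):
    """Split 'query:neighbor:N' into (query, neighbor, signed gene offset)."""
    query, neighbor, offset = (part.strip() for part in entry.split(":"))
    return query, neighbor, int(offset)

query, neighbor, offset = parse_distance("Q12345:P67890: -3")  # hypothetical IDs
position = "upstream" if offset < 0 else "downstream"
print(query, neighbor, abs(offset), position)
```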

Page 20: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Spoke Nodes

Spoke node size is dependent on the % co-occurrence of that Pfam in the neighborhood of that SSN cluster.

% co-occurrence = # neighbors retrieved / SSN cluster size * 100

% Co-Occurrence    Indicative Situation

< 100%    The neighbor gene is not well-conserved and potentially unimportant to the physiological pathway of the query gene.

< 100%    This particular SSN-cluster is not isofunctional, containing multiple neighborhood contexts.

≈ 100%    The neighbor gene is a well-conserved member of the genome neighborhood.

> 100%    Two or more instances of neighbors from this particular Pfam family exist in the genome neighborhood.
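The % co-occurrence formula is simple enough to check by hand; the Python sketch below just restates it. The function name is hypothetical and the neighbor counts are illustrative inputs, not values read from an EFI-GNT output file.

```python
def percent_cooccurrence(num_neighbors, ssn_cluster_size):
    """% co-occurrence = # neighbors retrieved / SSN cluster size * 100."""
    if ssn_cluster_size <= 0:
        raise ValueError("SSN cluster size must be positive")
    return num_neighbors / ssn_cluster_size * 100

# Illustrative counts for a 15-member SSN cluster
for neighbors in (1, 14, 15, 18):
    print(neighbors, round(percent_cooccurrence(neighbors, 15)))
```

A value above 100 arises whenever more neighbors than cluster members are retrieved, which is the "two or more instances" case in the table.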

Page 21: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Pfam and the GNN

www.pfam.xfam.org

More universal

Unique

Highly represented in SSN cluster

Lowly represented in SSN cluster

Page 22: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Pfam and the GNN

www.pfam.xfam.org

Identify the general classes of enzymes present in the genome context of an SSN cluster.

E.g., the presence of a kinase Pfam family and an isomerase Pfam family may indicate that the proteins of this particular SSN-cluster carry out an aldolase-type reaction in a catabolic pathway.

Kinase Pfam

Isomerase Pfam

Page 23: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Neighborhood Size

EFI-GNT default neighborhood size = +10 and -10 genes

Users may lower this to +/- 3 to 9 genes
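As a rough illustration of what the neighborhood size controls, the Python sketch below selects the genes within +/- N positions of a query from an ordered gene list. It is not the EFI-GNT implementation: the function name is hypothetical, and it assumes that a lower list index means an upstream position, which in practice depends on strand.

```python
def neighborhood(genes, query_index, size=10):
    """Return (offset, gene) pairs within +/- size genes of the query.

    The window is truncated at the ends of the gene list, so a query near
    either end returns fewer than 2 * size neighbors.
    """
    if not 3 <= size <= 10:
        raise ValueError("neighborhood size must be between 3 and 10")
    start = max(0, query_index - size)
    stop = min(len(genes), query_index + size + 1)
    return [(i - query_index, genes[i]) for i in range(start, stop) if i != query_index]

# Toy gene order from the slide's diagram, with A as the query
print(neighborhood(["R", "A", "B", "C"], query_index=1, size=3))
```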

R A B C

Zheng et al. 2002, Genome Research 12, 1221

Page 24: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN Signal-to-Noise: Added Noise

The utility of the GNN is limited primarily by its signal-to-noise

Signal = proximal and functionally related genes

Noise = proximal and irrelevant genes

Source of Noise    Remedy

Distant genes    Decrease neighborhood size

Uncommonly co-occurring genes    Increase co-occurrence threshold

SSN over-fractionation    Return SSN to less stringent e-value

Page 25: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

GNN Signal-to-Noise: Lost Signal

Why did my query sequence return fewer than 20 neighbors?

• Query sequence does not match the ENA sub-databases
• Non-coding RNA
• Query sequence is located near the beginning or end of the ENA file
• The neighbor entry does not have an associated EMBL accession number
• The neighbor entry has not been incorporated into a current Pfam family

R A B C X X

Page 26: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-GNT Web tool

www.enzymefunction.org

Page 27: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-GNT Input

1. Upload xgmml network, full or rep-node

2. Pick neighborhood size: +/- 3-10 genes

3. Enter co-occurrence cutoff (1-100)

4. Enter email address

5. Hit “go”

www.efi.igb.illinois.edu/efi-gnt

Upload status bar

Page 28: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-GNT Output

The EFI-GNT output is a pair of .xgmml files:

• genome neighborhood network (GNN)

• Colored version of the original SSN

Page 29: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-GNT Output

A download link will be sent to the e-mail address provided.

Data stored on server for 7 days.

Page 30: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI-GNT Output

NOTE – depending on your browser, the files may download with an additional file extension, such as: .xgmml.txt or .xgmml.xml

You must delete the .txt or .xml extension in order to open these files in Cytoscape!
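If many files need fixing, the extra extension can be stripped with a short Python helper. This is a convenience sketch only; the function name and file names are hypothetical, and it computes the corrected name without touching the file on disk.

```python
from pathlib import Path

def cytoscape_name(filename):
    """Drop a trailing .txt or .xml that a browser appended after .xgmml."""
    path = Path(filename)
    extra = path.suffix.lower() in {".txt", ".xml"}
    if extra and path.stem.lower().endswith(".xgmml"):
        return path.with_suffix("").name
    return path.name

for name in ("gnn.xgmml.txt", "colored_ssn.xgmml.xml", "gnn.xgmml"):
    print(name, "->", cytoscape_name(name))
```

To rename an actual download, the result can be passed to `Path.rename` on the original path.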

Cytoscape opens .xgmml

Page 31: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Network Visualization

GNN files must be viewed in Cytoscape 3.0 (or more recent)

Best layouts: Organic or Prefuse Force Directed

Opening both the GNN and colored SSN in a single instance of Cytoscape allows fast comparison between the two networks (see above).

www.cytoscape.org

Version 3.1.0

Page 32: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Network Visualization

NOTE – in Cytoscape the automatic rendering and coloring of the colorized SSN is size dependent. Cytoscape settings include a “Threshold View” that needs to be adjusted in the following manner in order to automatically view your colored SSN:

• In any version 3.X, go to Edit -> Preferences -> Properties

• With “cytoscape 3” selected in the pull-down menu at the top, scroll to the bottom of the Property list and select “viewThreshold”

• Click “Modify” and append 5 zeros to the end of the displayed number

• Click “OK”

• Restart Cytoscape (this should only need to be done once per version of Cytoscape installed on your machine)

Page 33: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Generally, the full +/-10 neighbor GNN presents an overwhelming amount of information.

Filter GNN networks by SSN Cluster Number in order to assign enzyme function to subgroups of homologous sequences.
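The filtering itself is done in Cytoscape, but the same selection can be sketched outside it with Python's standard XML parser. The toy XGMML below is hand-written for illustration, and the sketch assumes each spoke node stores its cluster in an `att` element named “Cluster Number”, as listed under Spoke Nodes; the exact layout of a real EFI-GNT file may differ.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]

def nodes_in_cluster(xgmml_text, cluster_number, attribute="Cluster Number"):
    """Return labels of nodes whose cluster attribute equals cluster_number."""
    root = ET.fromstring(xgmml_text)
    labels = []
    for element in root.iter():
        if local(element.tag) != "node":
            continue
        for att in element:
            if (local(att.tag) == "att"
                    and att.get("name") == attribute
                    and att.get("value") == str(cluster_number)):
                labels.append(element.get("label"))
                break
    return labels

# Hand-written toy network, not real EFI-GNT output
sample = """<graph label="toy GNN" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="spoke-a"><att name="Cluster Number" type="integer" value="5"/></node>
  <node id="2" label="spoke-b"><att name="Cluster Number" type="integer" value="7"/></node>
  <node id="3" label="hub-x"><att name="pfam" type="string" value="PF13365"/></node>
</graph>"""
print(nodes_in_cluster(sample, 5))
```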

Network Manipulation

Page 34: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Only hubs connected to the designated SSN cluster are shown (e.g., the cyan cluster 5).

Analyze the genome neighborhood Pfams specific to this SSN-cluster.

Network Manipulation

Page 35: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Network Manipulation

Spoke length is arbitrary.

Click, drag, and drop overlapping spoke nodes until all are visible.

Page 36: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Tutorial Pages

Tutorial pages containing content similar to this presentation

Page 37: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Test Case: Predicted Novelties of the Sialic Acid Degradation Pathway

Page 38: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Protein SSN

Bacterial extracellular solute-binding protein family 1 (SBP_bac_1, PF01547)

100% rep node net

BLAST E-value 10^-80

40% identical

21833 sequences

11073 nodes

Cluster 164

15 members

EFI ID 510644

ThermoFluor hit on N-acetylneuraminate

J. Bouvier, UIUC

Page 39: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Genome Neighborhood Network for Cluster 164

Permease

ABC transporter

Kinase

Epimerase

DUF

DHDPS

Regulator

J. Bouvier, UIUC

Page 40: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

EFI ID 510644 gene neighborhood

[Gene diagram, positions relative to the query: +6 +5 +4 +3 +2 +1 query -1 -2 -3 -4]

Position    Pfam Family ID      Pfam Description        Predicted Role            % Occurrence
+6          Unassigned          None                    none                      unavailable
+5          PF01380, PF01418    SIS, HTH_6              transcription regulator   93
+4          PF05448             Acetyl xylan esterase   deacetylase               7
+3          PF00480             ROK                     kinase                    93
+2          PF00701             DHDPS                   lyase                     93
+1          PF04074             DUF386                  isomerase/deaminase       67
query       PF01547             SBP_bac_1               solute-binding            120
-1          PF00528             BPD_transp_1            permease                  120
-2          PF00528             BPD_transp_1            permease                  120
-3          PF04131             NanE                    epimerase                 107
-4          PF00468             Ribosomal_L34           ribosome subunit          67

Streptococcus uberis Diernhofer (strain 0140J, ATCC BAA-854)

J. Bouvier, UIUC

Page 41: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

N-acetylneuraminate degradation pathway

[Pathway scheme; chemical structures omitted. Steps recoverable from the labels:]

1. N-acetylneuraminate → N-acetyl-D-mannosamine + pyruvate (PF00701)

2. N-acetyl-D-mannosamine + ATP → N-acetyl-D-mannosamine 6-phosphate + ADP + H+ (PF00480)

3. N-acetyl-D-mannosamine 6-phosphate → N-acetyl-D-glucosamine 6-phosphate (PF04131)

4. N-acetyl-D-glucosamine 6-phosphate + H2O → D-glucosamine 6-phosphate + acetate (PF01979)

5. D-glucosamine 6-phosphate + H2O → β-D-fructofuranose 6-phosphate + NH4+ (PF01182), feeding glycolysis

Enzymes are labeled by Pfam family ID. Legend: Found in GNN; Found alternative Pfam; Orphan EC

J. Bouvier, UIUC

Page 42: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Three sources of unknown enzymes

1. Orphan enzyme activity (EC number with no enzyme) - In vivo evidence suggests an enzyme from PF04131 converts N-acetyl-D-mannosamine 6-phosphate to N-acetyl-D-glucosamine 6-phosphate in the third step of the pathway, but no biochemical work has been done on this putative epimerase.

2. Non-orthologous gene replacement - The deacetylase from PF01979 known to convert N-acetyl-D-glucosamine 6-phosphate to D-glucosamine 6-phosphate in the fourth step of this pathway is located elsewhere in the genome (locus tag Sub1443). However, Sub1651, which is located four genes downstream, is a member of PF05448, and other members of PF05448 have known deacetylase activity. Is this a non-orthologous gene replacement, and does its low occurrence (7%) in the neighborhoods of the queries suggest it to be a relic?

3. Domain of unknown function - The deaminase/isomerase from PF01182 known to convert α-D-glucosamine 6-phosphate to β-D-fructofuranose 6-phosphate in the fifth step of the pathway is located elsewhere in the genome (locus tag Sub1239). However, Sub1654, which is located one gene downstream, has been suggested to be a sugar isomerase. Sub1654 is a member of PF04074 (DUF386) and is a good candidate for docking.

J. Bouvier, UIUC

Page 43: EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Hands-on Portion of Workshop

Feel free now to download Cytoscape 3.1, run EFI-EST, and run EFI-GNT for your protein (family) of interest.

Please see posters by Katie Whalen (#55) and Daniel Wichelecki (#56) for further examples of EFI-EST/EFI-GNT use.

Tutorials for using Cytoscape: http://enzymefunction.org/resources/tutorials/efi-and-cytoscape3

Feel free to contact us throughout the conference with questions/comments.

Acknowledgements

GNN Development: Suwen Zhao (UCSF), Alan Barber (Pythoscape, UCSF), Shoshana Brown (Pythoscape, UCSF), Eyal Akiva (Pythoscape, UCSF), Jason Bouvier (UIUC)

Website Build: Daniel Davidson (UIUC), David Slater (UIUC)

Documentation: Katie Whalen (UIUC)

Principal Investigators: Matthew Jacobson (UCSF), Patricia Babbitt (Pythoscape, UCSF), John Gerlt (UIUC)