VISUALIZING THE PROTEIN SEQUENCE UNIVERSE

L. Stanberry (1), R. Higdon (1), W. Haynes (1), N. Kolker (1), W. Broomall (1), S. Ekanayake (2), A. Hughes (2), Y. Ruan (2), J. Qiu (2), E. Kolker (1), G. Fox (2)
(1) Seattle Children's, (2) Indiana University
ECMLS 2012, 18 June 2012

Grand Challenge of Functional Genomics
- New technologies produce peta- and exabytes of data.
- The Protein Sequence Universe (PSU), the protein sequence space, expands exponentially: EMP, i5K, iPlant, NEON.
- 30% of existing sequenced proteins are unannotated even before the drastic expansion.
- Existing resources are overwhelmed and many are unsupported: COG, Systers, ClusTr, eggNOG.

Ultimate Goal: Annotate All Proteins
Our approach:
- Revitalize, expand, and enhance protein annotation resources.
- Develop a sustainable software framework.
- Use HPC and the most powerful cyberinfrastructure.
- Provide rigorous and reliable tools to annotate protein sequences.

COG: Clusters of Orthologous Groups
- The COG database was developed by NCBI.
- Proteins are classified into groups with common function encoded in complete genomes.
- Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters.
- Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters.
- Valuable scientific resource: 5K citations.
- Last updated:

Protein Sequence Universe
- PSU goal: enhance annotation resources with analytic and visualization (browser) tools.
- One component of PSU is to project sequence data into 3D using multidimensional scaling (MDS).
- MDS interpolation allows expanding the universe without a time-consuming all-vs-all O(N^2) computation.
- The 3D map allows much faster interpolation.
- Uses a set of pairwise dissimilarities; we don't do MSA, so we don't have vectors in some space.

Multi-Dimensional Scaling (MDS)
[Equation: Sammon's objective function]
- δ_ij is the dissimilarity measure between sequences i and j.
- d is the Euclidean distance (here in 3D for visualization) between the projections x_i and x_j.
- The denominator is chosen to get a larger contribution to the objective function from smaller dissimilarities.
- f is a monotone transformation of the dissimilarity measure, chosen artistically.

Typical Metagenomics MDS
[Figure]

MDS Details
- f is chosen heuristically to increase the ratio of standard deviation to mean for the transformed dissimilarities and to increase the range of the dissimilarity measures.
- O(n^2) complexity to map n sequences into 3D.
- MDS can be solved using EM (SMACOF: fastest but limited) or directly by Newton's method (it's just χ^2).
- Used a robust implementation of nonlinear χ^2 minimization with Levenberg-Marquardt.
- 3D projections are visualized in PlotViz.

MDS Details
- Input data: 100K sequences from well-characterized prokaryotic COGs.
- Proximity measure: sequence alignment % scores.
- Scores are calculated using Needleman-Wunsch.
- Scores are "sqrt 4D" transformed and fed into MDS.
- There is an analytic form for the transformation to 4D.
- δ_ij^n decreases the dimension for n > 1 and increases it for n < 1.
- "sqrt 4D" reduced the dimension of the distance data from 244 for δ_ij to 14 for f(δ_ij).
- Hence more uniform coverage of Euclidean space.

3D View of 100K COG Sequences
[Figure]

Implementation
- NW computed in parallel on a 100-node 8-core system.
- Used Twister (IU) in the Reduce phase of MapReduce.
- MDS calculations performed on a 768-core MS HPC cluster (32 nodes).
- Scaling, parallel MPI with threading intranode.
- Parallel efficiency of the code is approximately 70%.
- Efficiency is lost due to memory bandwidth saturation.
- NW required 1 day; the MDS job required 3 days.

Cluster Annotation

COG      Annotation                                                           Uniref100
COG1131  ABC-type multidrug transport system, ATPase component                14406
COG1136  ABC-type antimicrobial peptide transport system, ATPase component    7306
COG1126  ABC-type polar amino acid transport system, ATPase component         4061
COG3839  ABC-type sugar transport systems, ATPase component                   4121
COG0444  ABC-type dipeptide/oligopeptide/nickel transport system ATPase comp  3520
COG4608  ABC-type oligopeptide transport system, ATPase component             3074
COG3842  ABC-type spermidine/putrescine transport systems, ATPase comp        3665
COG0333  Ribosomal protein L
COG0454  Histone acetyltransferase HPA2 and Related acetyltransferases        14085
COG0477  Permeases of the major facilitator superfamily                       48590
COG1028  Dehydrogenases with different specificities                          37461

Selected Clusters
[Figure]

Heatmap of NW vs Euclidean Distances
[Figure]

Heatmap for Selected Clusters
[Figure]

Future Steps
- Comparison: Needleman-Wunsch vs. Blast vs. PSIBlast. NW is easier as it is complete; Blast has missing distances.
- Different transformations, distance -> monotonic function(distance), to reduce the formal starting dimension (increase sigma/mean).
- Automate cluster consensus finding, taking the consensus as the sequence that minimizes the maximum distance to the other sequences.
- Improve O(N^2) to O(N) complexity by interpolating new sequences to the original set and only doing small regions with O(N^2). Successful in metagenomics. Can use the Oct-tree from the 3D mapping or a set of consensus vectors.
- Some clusters diffuse?
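Three of the ideas on the MDS and Future Steps slides are small enough to sketch in code: the sigma/mean ratio of the dissimilarities, a Sammon-type objective weighted by 1/δ_ij, and the consensus sequence that minimizes the maximum distance to the other sequences. What follows is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the power transform `delta**p` (a stand-in for the "sqrt 4D" transformation, whose analytic form is not given in this transcript), the unnormalized form of the stress, and the toy data are all hypothetical.

```python
import math
from itertools import combinations


def power_transform(delta, p):
    """Hypothetical monotone transform f(delta) = delta**p (NOT the slides' 'sqrt 4D' form)."""
    return [[v ** p for v in row] for row in delta]


def spread_ratio(delta):
    """Standard deviation divided by mean over all pairs i < j."""
    vals = [delta[i][j] for i, j in combinations(range(len(delta)), 2)]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return math.sqrt(var) / mean


def sammon_stress(delta, points):
    """Unnormalized Sammon-type stress: sum over i < j of (d_ij - delta_ij)^2 / delta_ij.

    d_ij is the Euclidean distance between points[i] and points[j]; dividing by
    delta_ij gives small dissimilarities a larger weight. Assumes delta_ij > 0
    for every pair i != j.
    """
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        total += (d - delta[i][j]) ** 2 / delta[i][j]
    return total


def minimax_consensus(delta, members):
    """Index of the member whose maximum dissimilarity to the other members is smallest."""
    return min(members, key=lambda i: max(delta[i][j] for j in members if j != i))


# Toy data (an assumption): four "sequences" placed on a line.
positions = [0.0, 1.0, 2.0, 5.0]
delta = [[abs(a - b) for b in positions] for a in positions]
points = [(p, 0.0, 0.0) for p in positions]  # 3D embedding that reproduces delta

print(sammon_stress(delta, points))                      # exact embedding, so the stress is 0 up to rounding
print(minimax_consensus(delta, range(len(positions))))   # prints 2
print(spread_ratio(delta), spread_ratio(power_transform(delta, 2)))
```

In this toy case squaring the dissimilarities raises sigma/mean, which agrees with the slides' remark that a power n > 1 lowers the formal dimension. A real run would minimize the stress over the 3D coordinates (the slides mention Levenberg-Marquardt) rather than only evaluate it.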
Blast 6

Full Data, Blast 6
- Original run has 0.96 cut.

Cluster Data, Blast 6
- Original run has 0.96 cut.

Use Barnes-Hut OctTree
- Originally developed to make O(N^2) astrophysics O(N log N).

OctTree for 100K sample of Fungi
- We use OctTree for logarithmic interpolation.
- 440K interpolated.

Conclusions
Data -> Knowledge: protein annotation
- Overwhelming influx of new sequences.
- Annotation is an immense challenge.
- HPC and advanced analytics are needed.
PSU as a tool to facilitate annotation:
- Interactive visualization and exploration.
- Integrates info on function, pathways, structure, and environment.
- MDS preserves the grouping structure of protein space.
- MDS can use different proximities and biological data.
- Parallel MDS handles large-scale data.
- MDS interpolation quickly maps new sequences into the existing space.

DELSA: Data -> Knowledge -> Action
Data-Enabled Life Sciences Alliance International
- Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.
- Harness expertise and resources across disciplines.
- Promote accurate, sustainable, scalable approaches.
- Facilitate translation of the data influx into tangible innovations and groundbreaking discoveries.

DW2 Workshop, May 2012, D.C.
Who was there:
- ~90 participants (by invitation only)
- Academia, Government, Industry, Media, NFP
- 9 countries (Belgium, Canada, China, Germany, Israel, India, Russia, U.K., and U.S.A.)
What were the goals:
- Help identify Transformational Business Models
- Help identify Top (high impact/potential) Projects
- Stay engaged with DELSA and support its mission
- Get the word out about DELSA: tweet, blog, talk, FB, LI, present, connect, etc.
- Identify people's optimal role(s) in DELSA and endorsed Projects

DELSA Endorsed Projects

Project 1: Social Networking Platform for Tool Brokering/Community Building *
Goal: Open a dialog and an organizational effort to build a social networking platform to broker bioinformatics tools. This project would encourage community engagement by crowdsourcing, accelerate discovery by making tools more accessible and, through community ranking, more trustworthy. It would also build community and connect people and resources.
Deliverable: Social networking platform for idea exchange, resource sharing, tool ranking, and brokering.

Project 2: Data Set Accessibility
Project Lead: Corinna Gries
Goal: Make high-quality life sciences data broadly available, traceable, and usable.
Deliverable: Follow-on workshop to define issues such as: curation, sustainability, rapid growth in data volume, data provider incentives, non-trivial processing on the data in the repository, limited bandwidth from the open Internet to clouds, and security.

Project 3: Training Data Scientists
Lead: Geoffrey Fox
Goal: Train new and established scientists to enable more effective use of big data and its cyberinfrastructure.
Deliverable: Courses in data-enabled science culminating in a certification similar to Microsoft or Cisco certification or to existing scientific computing or computational science certificates/curricula. Need to evaluate existing resources such as UW eScience classes, the OGF Grid Computing certificate, and XSEDE HPC University. A possible approach is to focus on particular life science subdomains.

Project 4: Global Protein Atlas
Lead: Jack Gilbert
Goal: For all the meta-genomes and genomes that are available, cluster at the protein level and annotate (MG-RAST, CAMERA, MOPED, PSU, etc.). The goal is to characterize all the proteins and answer the question: what protein is expressed in what organism, what disease, what tissue, what condition, what environment, and in what concentration?
Deliverable: Based on current large-scale projects such as the Earth Microbiome Project and the Human Microbiome Project, we will analyze samples from diverse communities using meta-genomics and meta-proteomics to produce a Global Protein Atlas.

DELSA Endorsed Projects, Cont.

Project 5: Internet2 Application
Lead: Michael Sullivan
Background: Internet2 is an advanced not-for-profit networking consortium developing revolutionary Internet technologies and leveraging a high-performance network (http://www.internet2.edu/). It is currently being adopted by NLM. It has three components: 1) connect the pilot place to Internet2; 2) deploy a Science DMZ at the pilot place; and 3) perform routine exchange of BigData. It is a dedicated data transfer mode that enables fast data transfer.
Goal: Create a scalable process to connect entities (research institutes, universities, and global governments) to Internet2.

Project 6: DELSA Matchmaking Website *
Goal: Help scientists connect to each other, tools, publications, and industry as a way to facilitate more effective science. Possible examples are VIVO and LinkedIn. Could develop a matchmaking "20 questions" to determine individuals' skills, interests, tools, and review favorites. Could point to publications, tools, or resources.
Deliverable: A web-based platform for connecting scientists to other scientists as well as to research resources.

Project 7: Pregnancy Atlas Use Case
Lead: Joseph Kemnitz
Goal: Utilize DELSA and its members and connections for resources that would help the Pregnancy Atlas. The Pregnancy Atlas Consortium has an Integrative Discovery Platform that could be expanded. Help with metrics to assess Platform success.
Deliverable: Additional information for the Pregnancy Atlas, such as potential collaborators, CI tools, data formats, and funding opportunities. Provide files in a format that could be integrated by the Platform.
Project 8: ParaMEDIC Use Case
Lead: Wu Feng
Background: Frequent pain points experienced by DELSA members include ease-of-use issues with analysis tools and compute resources, as well as performance issues, which may be due to compute problems, data management problems, or data representation problems.
Goal: Use the automated, easy-to-use, and integrated high-performance biocomputing system (including ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing) on a suggested BigData challenge to show what could be done if the system were widely available.
Deliverable: A challenging BigData life sciences project successfully accomplished.

References and Resources
- COG data is available at the NCBI site: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/
- MDS results are available at
- All software used to analyze and visualize the data is open source.
- DELSA: Protein Global Atlas and Data Accessibility Projects

Acknowledgements
Grant support
- NSF: under DBI: (EK) and (GF)
- NIH: 5 RC2 HG (GF); NIGMS grant R01 GM (EK); NIDDK grants U01-DK and U01-DK (EK)