Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk I. July 17 2006. Geoffrey Fox, Computer Science, Informatics, Physics, Pervasive Technology Laboratories, Indiana University, Bloomington IN 47401. [email protected] http://www.infomall.org http://www.chembiogrid.org With apologies for my credentials: I have written a few papers on Biology, Chemistry and Crystallography while at Cambridge, Caltech and Syracuse, mostly on applications of parallel computing.


Page 1: July 17 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories

Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk I

July 17 2006

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

[email protected]

http://www.infomall.org

http://www.chembiogrid.org

With apologies for my credentials: I have written a few papers on Biology, Chemistry and Crystallography while at Cambridge, Caltech and Syracuse, mostly on applications of parallel computing.

Page 2

Start-up and Organization

Local Teams, successful Prototypes and International Collaboration set up in 3 major focus areas
• “Tool and Data” Cyberinfrastructure
• “Archival Database and Simulation” Cyberinfrastructure
• Education

Wiki chosen to support project as a shared editable web space

Web site http://www.chembiogrid.org

Building Collaboratory involving PubChem – Global Information System accessible anywhere and at any time – enhance PubChem with distributed tools (clustering, simulation, annotation etc.) and data

Initial results discussed at conferences/workshops/papers
• Gordon Conferences, ACS, SDSC tutorial

First new Cheminformatics courses offered

Advisory board set up and met

Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks

Good interactions with NIH DTP, Lilly and Michigan ECCR

Page 3

http://www.chembiogrid.org

Page 4

CICC Senior Personnel

Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu

Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, Kevin E. Gilbert, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams

From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)

Page 5

CICC Advisory Board

• Alan D. Palkowitz (Eli Lilly)
• Andrew Martin (Kalypsys)
• David Spellmeyer (IBM)
• Dimitris K. Agrafiotis (Johnson & Johnson)
• Horst Hemmerle (Eli Lilly)
• James M. Caruthers (Purdue University)
• Jeremy G. Frey (University of Southampton)
• Joel Saltz (Ohio State University/University of Maryland/Johns Hopkins University)
• John M. Barnard (Digital Chemistry)
• John Reynders (Eli Lilly)
• Peter Murray-Rust (University of Cambridge)
• Peter Willett (University of Sheffield)
• Thompson Doman (Eli Lilly)
• Val Gillet (University of Sheffield)

Industry and Academia

Met October 2005; will meet this fall

Page 6

Publications

Baik says he is especially productive due to Cyberinfrastructure

Page 7

Our Meetings are on the Web

Page 8

Varuna environment for molecular modeling (Baik, IU)

[Figure: architecture diagram. Labels: Researcher; Chemical Concepts; Experiments; Papers etc.; QM Database; QM/MM Database; Reaction DB; PubChem, PDB, NCI, etc.; DB Service (Queries, Clustering, Curation, etc.); Simulation Service (FORTRAN Code, Scripts); ChemBioGrid; Condor “Flocks”; TeraGrid Supercomputers]

Page 9

Cyberinfrastructure and Grids

These support eScience or distributed Computers, Databases, Instruments, Sensors and People

Grids use large scale managed Web services – the current major technology building on modern Industry enterprise and Internet systems
• W3C, OASIS, OGF or Open Grid Forum (Fox VP for eScience) develop standards ensuring distributed resources interoperate

Cheminformatics benefits from 2 styles of Grids
• TeraGrid typifies Grid support of large scale computation of parallel simulations
• Bioinformatics (BIRN, caBIG, MyGrid …), Earth Science and Astronomy Grids illustrate integration of real-time and archival data(bases) and computation

Well designed Grids run faster than older approaches

Page 10

Cheminformatics Grids Need

Broad System standards such as WSDL, SOAP, WSRM, JSDL, BPEL

Domain specific data structures
• CML Cheminformatics
• GML Earth Science
• CellML, SBML Biology
• VOQL Astronomy

Use of specific Grid/Web service technologies such as
• Web services directly for tools
• Web service proxies for large simulation codes – ANYTHING can be made a Web service efficiently if execution/network access time ≥ 20ms
• Portals/Portlets for user interfaces
• Workflow for composition

Access to data and compute resources
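
As a rough illustration of the 20 ms rule of thumb on this slide, the sketch below computes what fraction of a call's wall-clock time is useful work when a fixed wrapper overhead is added to each call. The function name and the 2 ms overhead figure are illustrative assumptions, not values from the talk.

```python
def service_efficiency(exec_time_ms: float, overhead_ms: float) -> float:
    """Return the fraction of total call time spent on useful execution."""
    if exec_time_ms < 0 or overhead_ms < 0:
        raise ValueError("times must be non-negative")
    total = exec_time_ms + overhead_ms
    if total == 0:
        return 0.0
    return exec_time_ms / total


# Hypothetical per-call cost of wrapping a code as a Web service (assumption).
ASSUMED_OVERHEAD_MS = 2.0

if __name__ == "__main__":
    for exec_ms in (1.0, 20.0, 1000.0):
        eff = service_efficiency(exec_ms, ASSUMED_OVERHEAD_MS)
        print(f"{exec_ms:7.1f} ms of work: efficiency {eff:.3f}")
```

Under this assumed model, longer-running calls amortize the fixed wrapper cost better, which is the intuition behind stating a minimum execution/network access time.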

Page 11

TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

[Figure: map of TeraGrid sites. Labels: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, Wisc]

Page 12

Top500 Supercomputers in the world

Indiana University has Highest Performance U.S. Academic Computer System: 20 Teraflops peak

Page 13

Products and Demonstrations

www.chembiogrid.org

Note mixture of In-house, Out of House, Commercial, Academic

Page 14

CICC Prototype Web Services

Basic cheminformatics
• Molecular weights
• Molecular formulae
• Tanimoto similarity
• 2D Structure diagrams
• Molecular descriptors
• 3D structures
• InChI generation/search
• CMLRSS

Application based services
• Compare (NIH)
• Toxicity predictions (ToxTree)
• Literature extraction (OSCAR3)
• Clustering (BCI Toolkit)
• Docking, filtering, ... (OpenEye)
• Varuna simulation

Next steps?
• Define WSDL interfaces to enable global production of compatible Web services; refine CML
• Ready to try “Prototype Production”
• Develop more training material
• Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes
• In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies

Key Ideas
• Add value to PubChem with additional distributed services and databases
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
• Provide access to key databases via a web service interface
• Provide access to major Compute Grids
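
To make one of the basic services listed on this slide concrete, here is a minimal sketch of Tanimoto similarity between two molecular fingerprints. It assumes each fingerprint is represented as the set of its "on" bit positions; this is a standalone illustration and not the CDK or CICC implementation.

```python
from typing import AbstractSet


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention chosen here.
        return 0.0
    return common / union


if __name__ == "__main__":
    # Hypothetical bit positions, chosen only for illustration.
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

A value of 1.0 means the two fingerprints have identical on-bits; 0.0 means they share none.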

Page 15

Web Service Locations

Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services

Penn State University: CDK based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors

Cambridge University
• InChi generation / search
• CMLRSS
• OpenBabel

InfoChem
• SPRESI database

SDSC
• Typical TeraGrid Site

NIH
• PubChem …..
• Compare …..

Page 16

Usage of Open Source Projects

A number of open source projects are used in our infrastructure
• CDK provides the underlying cheminformatics toolkit
• R provides the back-end modeling capabilities
• OSCAR is used for literature mining
• ToxTree is used to provide toxicity classification

Open data and standards as promoted by the Blue Obelisk project

Page 17

Contributions to Open Source Projects

We also contribute functionality to these projects
• Molecular descriptor development to the CDK
• Modifications of various CDK functionality to make it suitable for web service usage
• Infrastructure for accessing R from the CDK
• Packages to use the CDK from within R
• Quality control, testing and documentation

Steinbeck, C. et al.; Curr. Pharm. Des., 2006, 12(17), 2110-2120

Guha, R.; CDK News, 2005, 2(1), 7-13

Page 18

Workflows Using Chemical Literature

[Figure: workflow diagram. Labels: Bulk download of Pubmed abstracts; OSCAR3 program; OSCAR3 Service; Extract chemical structures; Searchable (structure/similarity) Grid database; Find similar molecules; Local DTP database; PubChem; PDBBind; Find similar documents]

All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red

SMILES   NAME      Pubmed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .......   .......

Clustering of documents linked to clustering of chemicals
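
The SMILES / name / PubMed ID table on this slide suggests a simple record layout for extracted structures. The sketch below parses rows of that shape and indexes PubMed IDs by SMILES so that documents mentioning the same structure can be looked up. The whitespace-separated row format and all names in the code are assumptions for illustration; the actual OSCAR3 output format is not shown in the talk.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    smiles: str
    name: str
    pubmed_id: int


def parse_mentions(text: str) -> List[Mention]:
    """Parse rows of 'SMILES name... pubmed_id'; skip anything else."""
    mentions = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 or not parts[-1].isdigit():
            continue  # header, filler dots, or malformed row
        smiles, *name_parts, pmid = parts
        mentions.append(Mention(smiles, " ".join(name_parts), int(pmid)))
    return mentions


def index_by_smiles(mentions: List[Mention]) -> Dict[str, List[int]]:
    """Map each SMILES string to the PubMed IDs that mention it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for m in mentions:
        index[m.smiles].append(m.pubmed_id)
    return dict(index)


if __name__ == "__main__":
    rows = "SMILES NAME PubmedID\nCCC propane 1425356\nCC ethane 3546453\n"
    print(index_by_smiles(parse_mentions(rows)))
```

Keying on the SMILES string treats two mentions as the same molecule only when the strings match exactly; a real service would presumably canonicalize structures first.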

Page 19

Document-enhanced Cyberinfrastructure

[Figure: architecture diagram. Labels: Existing User Interface; Existing Document-based Research Tools; Google Scholar; Manuscript Central; Science.gov; Windows Live Academic Search; Citeseer; CMT Conference Management; etc.; Web service Wrappers; New Document-enhanced Research Tools including Web2.0, Mashups, Annotation; Integration/Enhancement User Interface; Community Tools; Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; CiteULike; Connotea; Del.icio.us; Bibsonomy; Biolicious; PubChem; PubMed; Traditional Cyberinfrastructure]

Page 20

Products and Demonstrations II

Page 21

David Wild – Research Overview July 2006. Page 21 Indiana University School of

Example HTS workflow: organization & flagging

A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using the DTP Tumor Cell Line database)

The compounds are clustered on chemical structure similarity, to group similar compounds together

The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT

OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

Taverna Workflow
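
The clustering step of this workflow can be sketched with a simple leader-style algorithm: each compound joins the first cluster whose leader it resembles closely enough, otherwise it starts a new cluster. This is a stand-in chosen for brevity, not the BCI Toolkit method the project uses; the fingerprint representation (sets of on-bit positions), the compound identifiers, and the threshold are all assumptions.

```python
from typing import AbstractSet, Dict, List, Tuple


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Shared on-bits divided by on-bits present in either fingerprint."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


def leader_cluster(
    fingerprints: Dict[str, AbstractSet[int]], threshold: float
) -> List[List[str]]:
    """Group compound ids; the first member of each group is its leader."""
    clusters: List[Tuple[AbstractSet[int], List[str]]] = []
    for compound_id, fp in fingerprints.items():
        for leader_fp, members in clusters:
            if tanimoto(fp, leader_fp) >= threshold:
                members.append(compound_id)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            clusters.append((fp, [compound_id]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Hypothetical compounds and bit positions, for illustration only.
    fps = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {7, 8}}
    print(leader_cluster(fps, threshold=0.7))
```

The result depends on input order, which is a known property of leader-style clustering and one reason production tools use other methods.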

Page 22

[Figure: screenshot of the workflow user interface. Labels: Load Workflow; Run Workflow; Current Process; Result Output; Result Output URL]

Page 23

Lilly very interested in our new educational programs

Page 24

Total Grad Enrollment: Chem-, Lab, Bio-, Health Informatics, Fall 2005

Red = Expected, Chem, Fall 2006

MS      Chem   Lab   Bio   Health
IUB     3/3    0     38    0
IUPUI   6/3    15    34    36
TOTAL   9/6    15    72    36

PhD     Chem   Lab   Bio   Health
IUB     1/3    0     3     0
IUPUI   0/1    0     4     3
TOTAL   1/4    0     7     3

Page 25

Formal Cheminformatics Courses
• I571 Chemical Information Technology (3 cr.)
  – Distance Ed section had 10 students in Fall 2005, from California to Connecticut
• I572 Computational Chemistry and Molecular Modeling (3 cr.)
• I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
• I553 Independent Study in Chemical Informatics (3 cr.)
• Above courses required for the new Graduate Certificate Program in Chemical Informatics
• Also I533 (Cheminformatics seminar)

Page 26

More detailed Slides not used

Page 27

TeraGrid Hardware and Software

TeraGrid is coordinated at the University of Chicago and includes 8 partner facilities
• NCSA, SDSC, PSC, ORNL, IU, PU, TACC, UC/ANL

TeraGrid hardware totals > 102 teraflops of computing power.
• Comprehensive information available from http://www.teragrid.org/userinfo/hardware/overview.php.
• Systems are primarily Linux clusters.

Grid software and services (Globus, MyProxy, etc) provide a uniform means for accessing TeraGrid resources.
• Scheduling, running and monitoring jobs
• Monitoring resources
• Moving and managing remote files.
• Common service APIs simplify the process for building remote tools.

Page 28

Prototype CICC Project: Controlling the TGF pathway

Collaboration between Baik & Zhang at IU

[Figure: project diagram. Labels: PDB; 1IAS; Inactive TGF; Active TGF With inhibitor; Experiments in the Zhang Lab; PubChem; in-house Molecules in Varuna; VARUNA; Simulations; AutoGeFF; Web Service to generate custom force fields; Conceptual Understanding of TGF Inhibition]

Questions:

- What molecular feature controls inhibitor binding?

- How do mutations impact binding?

Page 29

MLSCN Data - How services and workflows are used

MLSCN submits HTS data to PubChem and/or sends it directly to a workflow for real-time feedback

Data is stored in PubChem

Workflows perform different kinds of analysis on the MLSCN data - the variety of workflows is limitless

End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis

PubChem interfaces to workflows via SOAP
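
To show what a SOAP call into a workflow might look like on the wire, the sketch below builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name `submitAssayData`, and the `assayId` parameter are hypothetical and do not describe the actual PubChem or CICC interfaces.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1
WORKFLOW_NS = "urn:example:workflow"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Serialize a SOAP 1.1 envelope calling `operation` with `params`."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{WORKFLOW_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{WORKFLOW_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical operation and parameter, for illustration only.
    print(build_soap_request("submitAssayData", {"assayId": 1234}).decode("utf-8"))
```

In practice the operation names and message schema would come from the service's WSDL, so a client would normally be generated from that description instead of built by hand.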